Copying Execute-Only Binaries on Linux – Broken Pipe's Tech Blog

I recently completed a C programming course at university. It was fun :). I guess the most interesting part was stealing the assignment solutions. Okay, just to be clear, I didn’t cheat.

Basically, we had programming assignments, and the lecturers provided compiled assignment solutions as demos for us to test our implementations against. When I heard that compiled solution demos would be given along with the assignments, I got pretty excited, as I had a few years of experience reverse engineering C code.

I asked one of the lecturers for permission to “steal” those demos and reverse engineer them, and I was given the go-ahead, which surprised me a bit. His original words were “we’ve made sure that you can’t just copy the executables and feed them into a reverse compiler, and if you are still able to do it, then there’s nothing much we can teach you in this course, but sure, you can try”. What could they have possibly done to protect their binaries?

Execute-Only Permissions

Binaries can be made execute-only, so that you only have execute permissions but not read or write permissions. Here’s an example of what an execute-only binary looks like:

-rwx--x--x 1 root  root  15776 Aug 21 11:13 test

The permission bits can be divided up into 4 chunks and in this case, they are -, rwx, --x and --x. The first chunk, - is the file type, d means directory and - means regular file. The second chunk, rwx is the owner permissions, and in the case of my example, it’s the permission that root has (root always has full access anyway). The third chunk, --x is the group permissions, and in the above example, it’s the permissions that people in the group root have. The last chunk, --x is the other permissions, the permissions that everyone else not root and not in the group root have. If I’m not root, then I only have permission to execute the binary.

This extends to almost every aspect of dealing with the binary, for example, you cannot load it in debuggers like gdb or attach a debugger to it after it is executed. While it’s running, you also don’t have read access to the virtual /proc/$pid/mem file, to prevent you from dumping the binary while it’s running.

Say I’m running the execute-only binary test, I can find its process ID:

epipe@HP-Laptop:~$ ps
    PID TTY          TIME CMD
 317245 pts/0    00:00:00 bash
 317329 pts/0    00:00:01 test
 317344 pts/0    00:00:00 ps

Then I could list its files:

epipe@HP-Laptop:~$ ls -l /proc/317329/maps /proc/317329/mem
-r--r--r-- 1 root root 0 Aug 21 11:28 /proc/317329/maps
-rw------- 1 root root 0 Aug 21 11:28 /proc/317329/mem

The mem file can only be read by its owner (root), which prevents you from dumping the process that way. You don’t even get to access the memory map, maps, despite its read permission bits being set for everyone:

epipe@HP-Laptop:~$ cat /proc/317329/maps
cat: /proc/317329/maps: Permission denied

I eventually gave up trying to copy/dump the binary and did the assignment myself. But, just a week before it was due, we had a lecture on file permissions and one of the lecturers talked about how Linux permissions were used to prevent students from stealing and decompiling the demo binaries, and that if anyone could bypass that, they’d get 100% on the assignment. The “get 100% on the assignment” part sparked my interest in stealing their demos again.

Existing Methods

`ptrace` `PEEKTEXT` and `PEEKDATA`

The most commonly documented way is to use ptrace. Basically, start the executable as a child process and then use PTRACE_PEEKTEXT and PTRACE_PEEKDATA to read its memory. This Stack Exchange answer gives a few examples, and there is even a tool called XOCopy written just for this.

There are 2 problems with this approach:

To use ptrace, you need to know the starting address and length of the content you’d like to dump. There are 3 ways suggested by people on the internet:
- Read /proc/$pid/maps, but as seen earlier, you don’t have permission to read it
- Use strace to figure out where it starts executing, but you cannot guarantee that the binary is mapped at the same place each time, especially with ASLR
- Walk the entire address space scanning for ELF headers, but it’s next to impossible on 64-bit machines

Now, pretending that you somehow know where to start dumping and how much to dump, the biggest problem is that PTRACE_PEEKTEXT and PTRACE_PEEKDATA don’t work on execute-only binaries: they always return a permission error!

Now judging by what people said ~10 years ago, this existing approach worked, but my guess is things have changed in the last decade and every technique used by this method no longer works.
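For reference, the core of the ptrace approach is just a loop of word-sized reads. The sketch below is my own illustration, not code from XOCopy or the Stack Exchange answer; peek_word and peek_range are hypothetical helper names, and it assumes the tracee has already been started under ptrace and is stopped.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Read one word from the tracee at addr. PTRACE_PEEKTEXT returns the word
 * itself, so -1 is a legal value and errno must be checked to spot errors.
 * Returns 0 on success and -1 on failure (errno is set by ptrace). */
int peek_word(pid_t pid, unsigned long addr, long *word)
{
	errno = 0;
	long value = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, NULL);
	if (value == -1 && errno != 0)
		return -1;
	*word = value;
	return 0;
}

/* Copy len bytes starting at addr out of a stopped tracee, one word at a
 * time. Returns 0 on success, -1 as soon as a read fails. */
int peek_range(pid_t pid, unsigned long addr, void *out, size_t len)
{
	unsigned char *dst = out;
	while (len > 0) {
		long word;
		if (peek_word(pid, addr, &word) != 0)
			return -1;
		size_t n = len < sizeof(word) ? len : sizeof(word);
		memcpy(dst, &word, n);
		dst += n;
		addr += n;
		len -= n;
	}
	return 0;
}
```

Against an execute-only tracee, the very first peek_word call is where this falls over, so peek_range returns -1 without copying anything.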

`LD_PRELOAD` Abuse

Another method is to abuse LD_PRELOAD to inject some code to have the program dump itself to disk as a readable file, as described in this Stack Exchange answer by Stéphane Chazelas.

This method works! It works because it can inject arbitrary code into a process at startup. As we know, the process itself can read its own memory and memory map, so there’s nobody stopping the process from dumping itself.
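Here's a rough sketch of the idea, assuming Linux with /proc mounted and GCC or Clang. It is not the code from the Stack Exchange answer or from my own tool; dump_self, dump_on_load and the SELFDUMP_OUT environment variable are names made up for this example. A constructor in the preloaded library runs inside the target process and copies every readable mapping backed by the executable to a new file.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write every readable mapping backed by this process's own executable to
 * outpath, each at its original file offset. Returns the number of mappings
 * written, or -1 on error. */
int dump_self(const char *outpath)
{
	char self[4096];
	ssize_t n = readlink("/proc/self/exe", self, sizeof(self) - 1);
	if (n < 0)
		return -1;
	self[n] = '\0';

	FILE *maps = fopen("/proc/self/maps", "r");
	if (!maps)
		return -1;
	int out = open(outpath, O_WRONLY | O_CREAT | O_TRUNC, 0700);
	if (out < 0) {
		fclose(maps);
		return -1;
	}

	char line[4096 + 128];
	int written = 0;
	while (fgets(line, sizeof(line), maps)) {
		unsigned long start, end, offset;
		char perms[5], path[4096];
		/* Format: start-end perms offset dev inode path */
		if (sscanf(line, "%lx-%lx %4s %lx %*s %*s %4095s",
		    &start, &end, perms, &offset, path) != 5)
			continue;
		/* Skip unreadable mappings and anything not backed by us. */
		if (perms[0] != 'r' || strcmp(path, self) != 0)
			continue;
		if (pwrite(out, (const void *)start, end - start,
		    (off_t)offset) < 0)
			continue;
		written++;
	}
	close(out);
	fclose(maps);
	return written;
}

/* Runs when the library is loaded, before main(). SELFDUMP_OUT is a
 * made-up variable name for this sketch; nothing happens if it is unset. */
__attribute__((constructor))
static void dump_on_load(void)
{
	const char *outpath = getenv("SELFDUMP_OUT");
	if (outpath)
		dump_self(outpath);
}
```

Built as a shared object (gcc -shared -fPIC), it would be used along the lines of SELFDUMP_OUT=./copy LD_PRELOAD=./selfdump.so ./test. Paths containing spaces are not handled by this simple parser.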

While it works, there are a few downsides:

- While the code runs before hitting main(), the runtime has already made changes to global data
- It doesn’t work on statically linked executables
- It doesn’t work on SUID or SGID binaries, as injecting arbitrary code into processes with higher privileges is a security meltdown
- The dumped binary cannot be executed, as various variables have already been initialised

Despite these downsides, I cobbled together a tool called dynadump based on this method and used it to dump the assignment 1 demo executable, which I then decompiled. In fact, I made the decompiled code compile back to the exact same binary. I told the course staff about it and they said they’d start doing more static linking.

At the same time, a second assignment was released, and it was in the form of a client/server, where students would invoke a client given to them and the client would send the student’s attempts to the server via a pipe. The server then checked if the answers were correct and logged the attempts. Because of this, the log files which contained the students’ attempts and results must not be accessible by students, so they made the server SGID.

I wanted to dump the server to figure out how the backend worked, however the LD_PRELOAD trick doesn’t work on SGID binaries. They also said they’d start doing static linking, so I thought I soon wouldn’t be able to dump assignment solution demos that way either. For these reasons, I started thinking of another way to dump execute-only binaries.

Exploiting `printf(3)`

The main thing about dumping execute-only binaries is to get the process to dump itself, and everything revolves around that. A process dumps itself via syscalls – at the very bottom layer, it creates a dump file using creat(2), writes itself to that file using write(2), and closes the file with close(2).

A lot of these syscalls existed in the binaries that I wanted to dump, for example a printf(3) probably ends up somewhere with a write(2) call. So, if I could catch that write(2) call and extend the length of the write, it would print more than just the string it wanted to print and leak whatever is after the string in memory. If I then changed the pointer as well, I could start printing at an arbitrary address, for an arbitrary length, and this would effectively give me the ability to dump anything to the standard output.

The real challenge lies in changing the parameters buf and count of a write(2) call. Luckily, while ptrace(2) blocks accessing the memory of an execute-only tracee, it does not block reading and modifying the tracee’s registers. In Linux, syscall parameters are passed via registers for most architectures, so stopping at a write(2) call, modifying the registers which hold buf and count, and continuing is sufficient.
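The register tweak itself is tiny. Here's a sketch for x86-64 Linux, where the syscall number is in orig_rax and the buf and count arguments of write(2) are in rsi and rdx; redirect_write is a hypothetical helper name, not a function from writedump.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/user.h>

/* Given the registers of a tracee stopped at syscall entry on x86-64,
 * repoint a pending write(2) at [addr, addr + len). Returns false and
 * leaves regs untouched if the pending syscall is not write(2). */
bool redirect_write(struct user_regs_struct *regs, unsigned long addr,
    size_t len)
{
	if (regs->orig_rax != (unsigned long)SYS_write)
		return false;
	regs->rsi = addr; /* second argument: buf */
	regs->rdx = len;  /* third argument: count */
	return true;
}
```

In a tracer, this would sit between a PTRACE_GETREGS and a PTRACE_SETREGS at the syscall-entry stop, before letting the tracee continue.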

This method is not perfect, however. First, finding the starting address to dump (the ELF header) is not trivial. There are some heuristics for finding it, such as scanning from the original buf pointer minus the size of the executable, but it’s really just guesswork without a proper memory map. Finding the right write(2) call to modify is also a challenge, and an executable may not have any write(2) calls at all. Even if a suitable write(2) call is found, by the time execution gets there the program is already mid-run, so the initialised global variables may have been modified and lost their original values.

Despite the limitations, I put together writedump which changed the buf and count parameters to write(2) and implemented some basic guessing of ELF header location, and used it to dump the SGID assignment 2 server backend.

Arbitrary Syscall Injection

A thought that later came up was: if the registers can be modified, why not just modify the syscall number and all the argument registers to execute an arbitrary syscall? This way, we don’t have to rely on a particular syscall like write(2), and we can catch the very first syscall, before the program has had a chance to mess with the global variables.

I quickly cobbled together a function to perform arbitrary syscall injection:

bool inject_syscall(pid_t pid,
	unsigned long rax,
	unsigned long rdi,
	unsigned long rsi,
	unsigned long rdx,
	unsigned long r10,
	unsigned long r8,
	unsigned long r9,
	unsigned long *result);

The assumption is that when this function is called, the tracee has already been stopped at the beginning of a syscall, with PTRACE_SYSCALL. The logic (sans error handling) is like:

1. Get the original registers:

struct user_regs_struct orig_regs;
ptrace(PTRACE_GETREGS, pid, NULL, &orig_regs);

2. Create new register set with manipulated registers:

struct user_regs_struct temp_regs = orig_regs;
temp_regs.orig_rax = rax;
temp_regs.rdi = rdi;
temp_regs.rsi = rsi;
temp_regs.rdx = rdx;
temp_regs.r10 = r10;
temp_regs.r8 = r8;
temp_regs.r9 = r9;

3. Replace original registers with manipulated ones:

ptrace(PTRACE_SETREGS, pid, NULL, &temp_regs);

4. Execute the syscall and wait for it to finish:

ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
wait_child(pid);

5. Get the return value of the syscall:

ptrace(PTRACE_GETREGS, pid, NULL, &temp_regs);
*result = temp_regs.rax;

6. Restore original registers and syscall number:

orig_regs.rip -= 2; // Step back to before syscall instruction.
orig_regs.rax = orig_regs.orig_rax;
ptrace(PTRACE_SETREGS, pid, NULL, &orig_regs);

7. Break at the current syscall again:

ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
wait_child(pid);

This function then serves as the basis for syscall wrappers. I created wrappers for each syscall, like this:

ssize_t inject_write(pid_t pid, int fd, const void *buf, size_t count)
{
	unsigned long result = 0;
	if (inject_syscall(pid, 1, fd, (unsigned long)buf, (unsigned long)count,
	    0, 0, 0, &result)) {
		if (result > -4096ULL) {
			errno = -(int)result;
			return -1;
		}
		return (ssize_t)result;
	} else {
		errno = 800;
		return -1;
	}
}
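The result > -4096ULL check in the wrapper follows the Linux convention that a failing syscall returns -errno, with error numbers in the range 1 to 4095. Pulled out into a standalone helper (decode_syscall_result is a name I'm making up here), the decoding looks like this:

```c
#include <errno.h>

/* The kernel reports failure by returning -errno, which as an unsigned
 * value lands in the top 4095 values of the range. Anything else is a
 * normal result (including large addresses returned by mmap). */
long decode_syscall_result(unsigned long raw)
{
	if (raw > -4096UL) {
		errno = (int)-(long)raw;
		return -1;
	}
	return (long)raw;
}
```

Comparing as unsigned is what keeps high addresses from being mistaken for errors, which matters for wrappers such as mmap(2).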

Given that syscall injection fully restores the execution context to before the injection, multiple injections can be done in succession. This allows us to inject syscalls to open and read the memory map, so we know exactly what and where to dump. We can also create a proper dump file instead of writing to stdout or stderr and then redirecting to a file.

A new challenge that arose was how to share information between the tracer and tracee – for example, we can read the memory map by injecting an open(2) and a read(2) call, but what’s the path to the memory map and where to read it to? The tracer will have to figure out the path, and put it into the tracee’s address space for the tracee to execute the open(2) syscall. The solution I came up with was to use pipe(2) and mmap(2) – have a pipe between the tracer and tracee for communication, and allocate memory in the tracee by injecting mmap(2) calls. For example, if the tracer wants to open the tracee’s /proc/self/maps, it would:

1. Map a page in the tracee’s address space by injecting mmap(2)
2. Write the string "/proc/self/maps" into the pipe
3. Read the string from the pipe into the mapped page by injecting read(2)
4. Pass the address of the page into open(2) as the path parameter
5. Unmap the page from the tracee’s address space with munmap(2)

Mapping (allocating) memory into the tracee is done with:

void *child_alloc(pid_t pid, size_t size)
{
	void *result = inject_mmap(pid, NULL, size, PROT_READ | PROT_WRITE |
	    PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
	return (result == (void *)-1) ? NULL : result;
}

Copying memory from tracer to tracee (sans error handling):

void *parent_to_child(pid_t pid, int *fds, const void *buff, size_t length)
{
	char *chbuf = (char *)child_alloc(pid, length);
	char *dst = chbuf; // Write cursor; chbuf keeps pointing at the start.
	while (length) {
		size_t chunk = PIPE_SIZE < length ? PIPE_SIZE : length;
		write(fds[1], buff, chunk);
		buff = (const char *)buff + chunk;
		inject_read(pid, fds[0], dst, chunk);
		dst += chunk;
		length -= chunk;
	}
	return chbuf; // Address of the copy inside the tracee.
}

And the full code for calling open(2) from the tracer as the tracee:

int child_open(pid_t pid, int *fds, const char *pathname, int flags, mode_t mode)
{
	size_t pathlen = strlen(pathname) + 1;
	char *chbuf = parent_to_child(pid, fds, pathname, pathlen);
	int result = inject_open(pid, chbuf, flags, mode);
	child_free(pid, chbuf, pathlen);
	return result;
}

Other syscalls which require sharing of memory, like read(2), are done in the same way. With them, the memory map of the tracee can be read:

static char *get_mem_map(pid_t pid, int *fds)
{
	int fd_map = child_open(pid, fds, "/proc/self/maps", O_RDONLY, 0);
	char *mapbuf = (char *)malloc(MAP_SIZE);
	// Leave room for the terminating NUL.
	ssize_t bytesrd = child_read(pid, fds, fd_map, mapbuf, MAP_SIZE - 1);
	inject_close(pid, fd_map);
	mapbuf[bytesrd] = '\0';
	return mapbuf;
}

Once we have the memory map, dumping the entire binary is just a matter of parsing the memory map and injecting write(2) for each section, like this:

char *memmap = get_mem_map(pid, fds);
int outfd = child_open(pid, fds, outfile, O_WRONLY, 00777);
FILE *fpmap = fmemopen(memmap, strlen(memmap), "r");
while((n = fscanf(fpmap, "%p-%p%*s%lx%*s%*s%*[ ]%[^\n ]\n", (void **)&start,
    (void **)&end, (long unsigned int *)&offset, segpath)) != EOF) {
	if (n == 4 && !strcmp(segpath, selfpath)) {
		inject_lseek(pid, outfd, offset, SEEK_SET);
		inject_write(pid, outfd, start, (size_t)(end - start));
	}
}
fclose(fpmap);
inject_close(pid, outfd);
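The fscanf format string above is fairly dense, so here is a more spelled-out sketch of parsing a single line of the memory map. It is an illustration rather than the code ExeOnlyDump uses; struct map_entry and parse_map_line are hypothetical names, and paths containing spaces or longer than 255 characters are not handled.

```c
#include <stdbool.h>
#include <stdio.h>

struct map_entry {
	unsigned long start;   /* first address of the mapping */
	unsigned long end;     /* one past the last address */
	unsigned long offset;  /* offset into the backing file */
	char perms[5];         /* e.g. "r-xp" */
	char path[256];        /* backing file, or "" for anonymous maps */
};

/* Parse one line of /proc/<pid>/maps. Returns false if the line does not
 * have at least the address range, permissions and offset fields. */
bool parse_map_line(const char *line, struct map_entry *e)
{
	e->path[0] = '\0';
	/* Fields: start-end perms offset dev inode [path] */
	int n = sscanf(line, "%lx-%lx %4s %lx %*s %*s %255s",
	    &e->start, &e->end, e->perms, &e->offset, e->path);
	return n >= 4;
}
```

Each entry whose path matches the executable then tells you where the segment lives in memory (start to end) and where it belongs in the output file (offset).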

This way, a binary can be dumped as long as ptrace(2) is enabled, and the dumped binary can be executed just like the original. The full program – ExeOnlyDump, can be found here on GitHub: https://github.com/TheBrokenPipe/ExeOnlyDump.
