In Linux >=4.4, when the CONFIG_BPF_SYSCALL config option is set and the kernel.unprivileged_bpf_disabled sysctl is not explicitly set to 1 at runtime, unprivileged code can use the bpf() syscall to load eBPF socket filter programs. These conditions are fulfilled in Ubuntu 16.04.
When an eBPF program is loaded using bpf(BPF_PROG_LOAD, ...), the first function that touches the supplied eBPF instructions is replace_map_fd_with_map_ptr(), which looks for instructions that reference eBPF map file descriptors and looks up pointers for the corresponding map files. This is done as follows:
/* look for pseudo eBPF instructions that access map FDs and * replace them with actual map pointers */ static int replace_map_fd_with_map_ptr(struct verifier_env *env) { struct bpf_insn *insn = env->prog->insnsi; int insn_cnt = env->prog->len; int i, j;
for (i = 0; i < insn_cnt; i++, insn++) { [checks for bad instructions]
f = fdget(insn->imm); map = __bpf_map_get(f); if (IS_ERR(map)) { verbose("fd %d is not pointing to valid bpf_map\n", insn->imm); fdput(f); return PTR_ERR(map); }
[...] } } [...] }
__bpf_map_get contains the following code:
/* if error is returned, fd is released. * On success caller should complete fd access with matching fdput() */ struct bpf_map *__bpf_map_get(struct fd f) { if (!f.file) return ERR_PTR(-EBADF); if (f.file->f_op != &bpf_map_fops) { fdput(f); return ERR_PTR(-EINVAL); }
return f.file->private_data; }
The problem is that when the caller supplies a file descriptor number referring to a struct file that is not an eBPF map, both __bpf_map_get() and replace_map_fd_with_map_ptr() will call fdput() on the struct fd. If __fget_light() detected that the file descriptor table is shared with another task and therefore the FDPUT_FPUT flag is setin the struct fd, this will cause the reference count of the struct file to be over-decremented, allowing an attacker to create a use-after-free situation where a struct file is freed although there are still references to it.
A simple proof of concept that causes oopses/crashes on a kernel compiled with memory debugging options is attached as crasher.tar.
One way to exploit this issue is to create a writable file descriptor, start a write operation on it, waitfor the kernel to verify the file's writability, then free the writable file and open a readonly file that is allocated in the same place before the kernel writes into the freed file, allowing an attacker to write data to a readonly file. By e.g. writing to /etc/crontab, root privileges can then be obtained. There are two problems with this approach: The attacker should ideally be able to determine whether a newly allocated struct file is located at the same address as the previously freed one. Linux provides a syscall that performs exactly this comparison for the caller: kcmp(getpid(), getpid(), KCMP_FILE, uaf_fd, new_fd). In order to make exploitation more reliable, the attacker should be able to pause code execution in the kernel between the writability check of the target file and the actual write operation. This can be done by abusing the writev() syscall and FUSE: The attacker mounts a FUSE filesystem that artificially delays read accesses, then mmap()s a file containing a struct iovec from that FUSE filesystem and passes the result of mmap() to writev(). (Another way to do this would be to use the userfaultfd() syscall.) writev() calls do_writev(), which looks up the struct file * corresponding to the file descriptor number and then calls vfs_writev(). vfs_writev() verifies that the target file is writable, then calls do_readv_writev(), which first copies the struct iovec from userspace using import_iovec(), then performs the rest of the write operation. Because import_iovec() performs a userspace memory access, it may have to wait for pages to be faulted in - and in this case, it has to wait for the attacker-owned FUSE filesystem to resolve the pagefault, allowing the attacker to suspend code execution in the kernel at that point arbitrarily. An exploit that puts all this together is in exploit.tar. Usage: user@host:~/ebpf_mapfd_doubleput$ ./compile.sh user@host:~/ebpf_mapfd_doubleput$ ./doubleput starting writev woohoo, got pointer reuse writev returned successfully. if this worked, you'll have a root shell in <=60 seconds. suid file detected, launching rootshell... we have root privs now... root@host:~/ebpf_mapfd_doubleput# id uid=0(root) gid=0(root) groups=0(root),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),113(lpadmin),128(sambashare),999(vboxsf),1000(user)
This exploit was tested on a Ubuntu 16.04 Desktop system.