On Fri, Feb 6, 2015 at 11:32 AM, Andy Lutomirski <email@example.com> wrote:
> On Fri, Feb 6, 2015 at 11:23 AM, Kees Cook <firstname.lastname@example.org> wrote:
>> On Thu, Feb 5, 2015 at 6:38 PM, Andy Lutomirski <email@example.com> wrote:
>>> On Thu, Feb 5, 2015 at 6:32 PM, Dmitry V. Levin <firstname.lastname@example.org> wrote:
>>>> On Thu, Feb 05, 2015 at 04:09:06PM -0800, Andy Lutomirski wrote:
>>>>> On Thu, Feb 5, 2015 at 3:49 PM, Kees Cook <email@example.com> wrote:
>>>>> > On Thu, Feb 5, 2015 at 3:39 PM, Dmitry V. Levin <firstname.lastname@example.org>
>>>>> > wrote:
>>>>> >> There is a clear difference: before these changes, SECCOMP_RET_ERRNO
>>>>> >> used
>>>>> >> to keep the syscall number unchanged and suppress syscall-exit-stop
>>>>> >> event,
>>>>> >> which was awful because userspace cannot distinguish syscall-enter-stop
>>>>> >> from syscall-exit-stop and therefore relies on the kernel that
>>>>> >> syscall-enter-stop is followed by syscall-exit-stop (or tracee's
>>>>> >> death, etc.).
>>>>> >> After these changes, SECCOMP_RET_ERRNO no longer causes
>>>>> >> syscall-exit-stop
>>>>> >> events to be suppressed, but now the syscall number is lost.
>>>>> > Ah-ha! Okay, thanks, I understand now. I think this means seccomp
>>>>> > phase1 should not treat RET_ERRNO as a "skip" event. Andy, what do you
>>>>> > think here?
>>>>> I still don't quite see how this change caused this.
>>>> I have a test for this at
>>>>> I can play with
>>>>> it a bit more. But RET_ERRNO *has* to be some kind of skip event,
>>>>> because it needs to skip the syscall.
>>>>> We could change this by treating RET_ERRNO as an instruction to enter
>>>>> phase 2 and then asking for a skip in phase 2 without changing
>>>>> orig_ax, but IMO this is pretty ugly.
>>>>> I think this all kind of sucks. We're trying to run ptrace after
>>>>> seccomp, so ptrace is seeing the syscalls as transformed by seccomp.
>>>>> That means that if we use RET_TRAP, then ptrace will see the
>>>>> possibly-modified syscall, if we use RET_ERRNO, then ptrace is (IMO
>>>>> correctly given the current design) showing syscall -1, and if we use
>>>>> RET_KILL, then ptrace just sees the process mysteriously die.
>>>> Userspace is usually not prepared to see syscall -1.
>>>> For example, strace had to be patched, otherwise it just skipped such
>>>> syscalls as "not a syscall" events or did other improper things:
>>> The x32 thing is a legit ABI bug, I'd argue. I'd be happy to submit a
>>> patch to fix that (clear the x32 bit if we're not x32).
>>>> A slightly different but related story: userspace is also not prepared
>>>> to handle large errno values produced by seccomp filters like this:
>>>> BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | SECCOMP_RET_DATA)
>>>> For example, glibc assumes that syscalls do not return errno values
>>>> greater than 0xfff:
>> To save others the link reading: "Linus said he will make sure the no
>> syscall returns a value in -1 .. -4095 as a valid result so we can
>> savely test with -4095."
>> Strictly speaking (ISO C, "man 3 errno"), errno is supposed to be a
>> full int, though digging around I find this in include/linux/err.h:
>> * Kernel pointers have redundant information, so we can use a
>> * scheme where we can return either an error code or a normal
>> * pointer with the same return value.
>> * This should be a per-architecture thing, to allow different
>> * error and pointer decisions.
>> #define MAX_ERRNO 4095
>> #ifndef __ASSEMBLY__
>> #define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO)
>> But no architecture overrides this.
>>>> If it isn't too late, I'd recommend changing SECCOMP_RET_DATA mask
>>>> applied in SECCOMP_RET_ERRNO case from current 0xffff to 0xfff.
>> I'm not opposed to this. I would want to more explicitly document the
>> 4095 max value in man pages, though.
>>> I think this is solidly in the "don't do that" category. Seccomp lets
>>> you tamper with syscalls. If you tamper wrong, then you lose.
>>> Kees, what do you think about reversing the whole thing so that
>>> ptracers always see the original syscall?
>> What do you mean by "reversing"? The interactions I see here are:
>> Both ptrace and seccomp will trigger via _TIF_WORK_SYSCALL_ENTRY. Only
>> ptrace will trigger via _TIF_WORK_SYSCALL_EXIT.
>> For SECCOMP_RET_ERRNO to work, we must skip the syscall, as mentioned
>> arch/x86/kernel/entry_32.S ...
>> movl $-ENOSYS,PT_EAX(%esp)
>> movl %esp, %eax
>> call syscall_trace_enter
>> /* What it returned is what we'll actually use. */
>> cmpl $(NR_syscalls), %eax
>> jnae syscall_call
>> jmp syscall_exit
>> Both before and after the 2-phase change, syscall_trace_enter would
>> return -1 if it hit SECCOMP_RET_ERRNO, before calling
>> tracehook_report_syscall_entry. On exit, if PTRACE_SYSCALL, we'd hit
>> tracehook_report_syscall_exit during syscall_trace_leave, which means
>> a ptracer will see a syscall-exit-stop without a matching
>> Using SECCOMP_RET_TRACE with PTRACE_SYSCALL in place seems totally
>> crazy, as the ptracer would need to be the same program, and if it
>> chose to skip a syscall, it would be in the same place: it would see
>> PTRACE_EVENT_SECCOMP, then no syscall-enter-stop, then a
>> syscall-exit-stop. I think we can ignore this pathological case.
>> Using SECCOMP_RET_TRAP with PTRACE_SYSCALL also results in a skip,
>> which produces the same "only syscall-exit-stop seen" problem.
>> In the SECCOMP_RET_ERRNO case, the syscall nr doesn't change (and
>> isn't executed). In the SECCOMP_RET_TRAP case, the syscall nr doesn't
>> change (and isn't executed). In the SECCOMP_RET_TRACE, the syscall nr
>> _could_ change, but the ptracer would be doing it, so the crazy
>> situation around PTRACE_SYSCALL is probably safe to ignore (as long as
>> we document what is expected to happen).
>> So, the question is: should PTRACE_SYSCALL see a syscall that is _not_
>> being executed (due to seccomp)? Audit doesn't see it currently, and
>> neither does ptrace. I would argue that it should continue to not see
>> the syscall. That said, if it shouldn't be shown, we also shouldn't
>> trigger syscall-exit-stop. If you can convince me it should see
>> syscall-enter-stop, then I have two questions:
> I think PTRACE_SYSCALL should see syscalls that are skipped due to
> seccomp. I think that the exit event should see the modified errno,
> if any, so that strace will show whatever the traced process thinks is
>> 1) Do we accept that a ptracer can interfere with SECCOMP_RET_ERRNO? I
>> think we probably must, since it can already interfere via
>> syscall-exit-stop and change the errno.
> I think this is fine.
>> And especially since a ptracer
>> can change syscalls during syscall-enter-stop to any syscall it wants,
>> bypassing seccomp. This condition is already documented.
> If a ptracer (using PTRACE_SYSCALL) were to get the entry callback
> before seccomp, then this oddity would go away, which might be a good
> thing. A ptracer could change the syscall, but seccomp would based on
> what the ptracer changed the syscall to.
I want kill events to trigger immediately. I don't want to leave the
ptrace surface available on a SECCOMP_RET_KILL. So maybe it can be
seccomp phase 1, then ptrace, then seccomp phase 2? And pass more
information between phases to determine how things should behave
beyond just "skip"?
>> 2) What do we do with audit? Suddenly we have ptrace seeing a syscall
>> that audit doesn't?
> Is this a problem? I'd be amazed if program uses both ptrace and
> audit -- after all, audit is a global thing, and it only has one
> implementation (AFAIK): auditd. auditd doesn't ptrace the world.
>> And an unrelated thought:
>> 3) Can't we find some way to fix the inability of a ptracer to
>> distinguish between syscall-enter-stop and syscall-exit-stop?
> Couldn't we add PTRACE_O_TRACESYSENTRY and PTRACE_O_TRACESYSEXIT along
> the lines of PTRACE_O_TRACESYSGOOD?
That might be a nice idea. I haven't written a test to see, but what
does PTRACE_GETEVENTMSG return on syscall-enter/exit-stop? If we can't
add something there, then yeah, adding PTRACE_O_TRACESYSENTRY and
PTRACE_O_TRACESYSEXIT with their own event msgs would be nice. Could
even add the syscall nr to the event msg so ptracers don't have to dig
around in per-arch registers, too.
Chrome OS Security