PaX and Clang sanitizers, or some random notes on the crossbreeding of adders and hedgehogs

TL;DR: patches for running Clang sanitizers under PaX-enabled kernels are at the bottom of the post.

Epigraph

- What happens if one crossbreeds an adder and a hedgehog?

- Two meters of barbed wire.

Old Russian joke.

Disclaimer: I’m an expert neither in Clang / sanitizers nor in PaX. I’ve learned the things explained below over the course of last Sunday plus several evenings, so some inaccuracies (or even errors) are expected.

What are Clang sanitizers?

Clang sanitizers are detectors for some common types of errors, such as memory leaks, accesses to invalid memory addresses, data races, or other stupid things your code might do.

What sanitizers are available?

As of Clang 3.5.1, the directly usable sanitizers include:

  • AddressSanitizer
  • LeakSanitizer
  • MemorySanitizer
  • ThreadSanitizer
  • UndefinedBehaviorSanitizer

Because UndefinedBehaviorSanitizer works with PaX out of the box, I won’t be talking about it below and will concentrate on the other sanitizers instead.

Why bother?

There are other projects with a similar scope; one of the best known is Valgrind.

Given that, why implement yet another tool for the same purpose?

The main reason is speed.

Given that the instrumentation happens at the compiler stage, the necessary code can be directly injected in the output binary, resulting in much better performance than rewriting (or even interpreting) the code at runtime.

In theory, another advantage is that having all the information that is available to the compiler can help to catch more errors. Whether that’s the case in practice is another question, though.

There’s also one major disadvantage: the only code that is instrumented is the one that is being compiled as part of your project.

For some sanitizers this doesn’t matter too much - but for MemorySanitizer, for example, it leads to the requirement that all libraries your project uses (including the standard libraries, like glibc / libc) have to be compiled with the sanitizer enabled, which is not practical for anything but large projects that can tolerate high levels of build complexity.

How do they work?

I won’t go into the details of the algorithms of the various sanitizers; there are descriptions elsewhere for those who are interested.

There are only two key points that are common to most of these sanitizers and that are important to understand for the rest of the article:

  • Code generation vs runtime.
  • Shadow memory.

Code generation vs runtime

The operation of a sanitizer can be split into two phases:

  • Instrumentation - happens when your program code is being compiled and consists of inserting extra instructions into the code that either inform the runtime component of the sanitizer about some events or perform some checks (e.g. whether a given memory access is allowed).
  • Actual run - there’s a runtime component of the sanitizer that is linked into your program. When the program is executed, this component performs various checks and bookkeeping activities, such as allocating the extra memory needed to store the sanitizer’s state.

These parts are quite heavily tied to each other, so whenever one part is changed, the other has to be adjusted as well.

Shadow memory

Most of the sanitizers above need to store some information about each byte of the process memory that is being checked.

For example, for each byte of the process memory, AddressSanitizer stores whether that byte is valid to access or not (e.g. because the memory was freed).

The memory region where this extra information is stored is called “shadow memory”.

What should the data structure and allocation process for that shadow memory look like?

Obviously, storing this information in something like std::unordered_map<void*, State> would be too slow and would require too much additional memory.

Using something like std::vector<State> and mapping process memory bytes to vector elements also wouldn’t work, because the process memory can be allocated in non-contiguous chunks.

The solution that Clang sanitizers take is as follows:

  • Reserve a virtual memory range that is large enough to cover, in a 1:1 mapping, all potential process memory - that is, essentially take the std::vector<State> approach above, but without actually allocating the memory.
  • When process memory is allocated, simultaneously allocate extra pages and map them to the appropriate shadow memory ranges.

This whole approach relies on the concept of virtual memory, specifically “paged virtual memory”.

As an example, let’s assume we know that all process memory is always allocated in the 0x7cf000000000 - 0x800000000000 virtual memory range. Then, if our state requires 1 byte per 1 byte of process memory, we can reserve (but not allocate!) the memory range 0x6cf000000000 - 0x700000000000.

When the process actually allocates a page of memory at, say, 0x7cf800000000 address, the sanitizer will also allocate a page of memory at 0x6cf800000000 address and will store all the corresponding state there for the bytes of the newly allocated process memory page.
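
To make the address arithmetic concrete, here’s a tiny standalone sketch (using the made-up constants from the example above - not actual sanitizer code) of how a process address is translated to its shadow address, and which shadow page would have to be allocated alongside a newly allocated program page:

#include <cstdint>
#include <cstdio>

// Made-up constants from the example above (not real sanitizer values).
constexpr uint64_t kShadowOffset = 0x100000000000ull;   // Shadow = Mem - kShadowOffset
constexpr uint64_t kPageSize     = 0x1000;

// Given a process memory address, compute the address of its 1:1 shadow byte.
constexpr uint64_t MemToShadow(uint64_t mem) { return mem - kShadowOffset; }

// Given a freshly allocated process page, compute the shadow page that the
// runtime would have to allocate alongside it.
constexpr uint64_t ShadowPageFor(uint64_t page) { return MemToShadow(page) & ~(kPageSize - 1); }

int main() {
    const uint64_t app_page = 0x7cf800000000ull;   // page just allocated by the program
    std::printf("app page    : 0x%llx\n", (unsigned long long)app_page);
    std::printf("shadow page : 0x%llx\n", (unsigned long long)ShadowPageFor(app_page));   // 0x6cf800000000
    std::printf("shadow byte for 0x7cf800000123: 0x%llx\n",
                (unsigned long long)MemToShadow(0x7cf800000123ull));   // 0x6cf800000123
    return 0;
}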

What are the properties of Clang shadow memory scheme?

The advantages of this scheme are as follows:

  • Given a process memory address, calculating the corresponding shadow memory address is trivial: in the example above it’s just 'Mem - 0x100000000000'. Therefore, getting the state for a particular process memory address is as easy as doing one subtraction and one memory access.

    This is very important as for some sanitizers this state fetching has to happen for every single memory access of the checked program, therefore the faster it is - the better.

  • The sanitizer doesn’t have to use more shadow memory than it needs: even though it reserves a huge amount of virtual address space, it actually allocates only those pages that are really needed.

Of course, there are some disadvantages as well:

  • In order for this to work, there must be enough unused virtual memory space to map the whole process memory space - and additionally, that space has to be contiguous. This is trivial in the example above, but imagine we need to store not 1 byte of state per 1 byte of process space, but 64 bytes (the arithmetic is spelled out in the small check after this list). Given the size of the process memory space above is 0x800000000000 - 0x7cf000000000 = 0x31000000000, we’d need 0x31000000000 * 64 = 0xc40000000000 of virtual memory space for the shadow memory - which is more than the whole virtual memory space given by the standard Linux kernel to applications (1 << 47 = 0x800000000000).

    Current Clang sanitizers don’t require that much state - but as we’ll see below, there can be other reasons leading to this issue.

  • For the highest possible speed, it’s desirable to know the location of this shadow memory space at instrumentation time - so that the compiler can generate the subtraction instructions explained above using a fixed constant instead of fetching the value from a variable, which would cost one more memory access per process memory access. This, in turn, implies that the location of the process memory space has to be known at compilation time.

    While this location is (mostly) known for normal Linux distributions, things can be quite different for PaX, as we’ll see below.
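
Here’s the arithmetic from the first disadvantage above, spelled out as a small compile-time check (purely illustrative):

#include <cstdint>

// Example process memory range from the text above.
constexpr uint64_t kMemStart = 0x7cf000000000ull;
constexpr uint64_t kMemEnd   = 0x800000000000ull;
constexpr uint64_t kMemSize  = kMemEnd - kMemStart;

// Hypothetical 64 bytes of state per byte of process memory.
constexpr uint64_t kShadowSize = kMemSize * 64;

// The whole user address space of a standard Linux kernel (47 bits).
constexpr uint64_t kUserSpace = 1ull << 47;

static_assert(kMemSize == 0x31000000000ull, "size of the process memory range");
static_assert(kShadowSize == 0xc40000000000ull, "size of a 64:1 shadow for it");
static_assert(kShadowSize > kUserSpace, "the shadow alone would not fit into the 47-bit user space");

int main() { return 0; }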

Ok, so how is this all related to PaX?

What is PaX?

PaX is a patch for the Linux kernel that implements certain security hardening measures that should make life harder for people trying to compromise your system.

There are several key properties of PaX that are relevant to our discussion:

  • The “Non-executable kernel pages” option (PAX_KERNEXEC) + the “prevent invalid userland pointer dereference” option (PAX_MEMORY_UDEREF): either of these options forces the PAX_PER_CPU_PGD option, which reduces the application virtual address space from 47 bits to 42 bits.
  • The PAX_RANDMMAP option, which causes page mappings to be randomized and, as a special case, leads to the process memory space not having a known location, instead being randomly placed within quite a large memory range.
  • The PAX_MPROTECT option, which disables the creation of memory pages that are simultaneously writable and executable.

So, what is the problem?

Here’s the approximate memory layout of a program executed on a PaX-patched kernel with the options mentioned above turned on:

  • Start of the main program executable and the heap: anywhere between 0x400000 and 0x8000000000.
  • Extra modules (e.g. shared libraries): anywhere between 0x20000000000 and ~0x3fe00000000.
  • Stack: 0x37ffff00000 - 0x3fffffff000.

Given the whole application virtual address space is 1 << 42 = 0x40000000000 bytes, we cannot map anything above 0x20000000000 (as that could potentially interfere with extra modules) and below 0x8000000000 + WhateverMaxSizeForYourMainProgramPlusHeapYouCanThinkOf (as that could interfere with main program + heap).

Let’s see how these restrictions play out with Clang sanitizers.

Porting LeakSanitizer: “I’m too young to die”

This sanitizer is the easiest to adapt to PaX, as it doesn’t use shadow memory. The only reason it doesn’t work with PaX is that it uses a custom allocator that tries to mmap pages in the range 0x600000000000 - 0x640000000000. Obviously, given the PaX memory ranges discussed above, this is not going to work. However, just moving it to the range 0x18000000000 - 0x1c000000000 solves the problem. This reduces the amount of memory available to the allocator from 4T to 256G, but the PaX memory layout wouldn’t let you allocate 4T of memory anyway, and honestly - who needs that much memory? “256G ought to be enough for anybody” (even though this quote is disputed).
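
For illustration, the gist of that change is just moving and shrinking a pair of allocator placement constants - something like the sketch below (the kAllocatorSpace / kAllocatorSize names are my placeholders, not necessarily the exact identifiers used in compiler-rt):

#include <cstdint>

// Stock LeakSanitizer allocator placement on x86_64: 0x600000000000 - 0x640000000000 (4T).
constexpr uint64_t kOldAllocatorSpace = 0x600000000000ull;
constexpr uint64_t kOldAllocatorSize  = 0x40000000000ull;

// PaX-friendly placement: 0x18000000000 - 0x1c000000000 (256G).
constexpr uint64_t kAllocatorSpace = 0x18000000000ull;
constexpr uint64_t kAllocatorSize  = 0x4000000000ull;

// The new range must not collide with the main program + heap (which starts at most
// at 0x8000000000) nor with the 'extra modules' region starting at 0x20000000000.
static_assert(kAllocatorSpace >= 0x8000000000ull, "leaves room for the main program + heap below");
static_assert(kAllocatorSpace + kAllocatorSize <= 0x20000000000ull, "stays below the extra modules region");

int main() { return 0; }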

Let’s see how it works in practice.

Compile the code:

int main()
{
    int* i = new int;
    return 0;
}

as follows:

clang++ -std=c++14 -fsanitize=leak -g test_lsan.cpp -o test_lsan

and run it like this:

./test_lsan

to produce the output:

=================================================================
==3953==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 4 byte(s) in 1 object(s) allocated from:
    #0 0x41ddfb in operator new(unsigned long) (/var/tmp/test_lsan+0x41ddfb)
    #1 0x42fc8d in main /var/tmp/test_lsan.cpp:3:5
    #2 0x28f08075133 in __libc_start_main (/lib64/libc.so.6+0x20133)
    #3 0x42fb7c in _start (/var/tmp/test_lsan+0x42fb7c)

SUMMARY: LeakSanitizer: 4 byte(s) leaked in 1 allocation(s).

Porting AddressSanitizer: “Hey, not too rough”

This sanitizer does use shadow memory, so things ought to get more interesting here.

What makes things easier, though, is that it needs only 1 byte of state per 8 bytes of the process memory - essentially, just 1 bit for each byte indicating whether the byte is valid to address or not.

Quoting the AddressSanitizerAlgorithm page, this is what its shadow mapping looks like:

0000 0000 0000 - 0000 7fff 8000: LowMem
0000 7fff 8000 - 0000 8fff 7000: LowShadow
0000 8fff 7000 - 0200 8fff 7000: ShadowGap
0200 8fff 7000 - 1000 7fff 8000: HighShadow
1000 7fff 8000 - 7fff ffff ffff: HighMem

… and this is the mapping formula: Shadow = (Mem >> 3) + 0x7fff8000

Note that because of this shift by 3 bits, there’s not one but two memory ranges (LowMem and HighMem) that can be shadowed without the shadow ranges overlapping either of them or each other.

Essentially, what this means is that the process memory can reside anywhere in LowMem or HighMem memory ranges, and there will be a valid shadow memory range that can be used to store the state of that memory.

Of course, this doesn’t work for PaX - the HighMem range is too high, and the LowMem range is not large enough to cover the whole possible range of the main process memory.

Let’s consider the alternative layout:

0000 0000 0000 - 0100 0000 0000: LowMem
0100 0000 0000 - 0120 0000 0000: LowShadow
0120 0000 0000 - 0130 0000 0000: ShadowGap
0130 0000 0000 - 0180 0000 0000: HighShadow
0180 0000 0000 - 03ff ffff ffff: HighMem

… and the following mapping formula: Shadow = (Mem >> 3) + 0x10000000000

Note that in this case LowMem covers the entire main program memory range for PaX (which is 0x400000 - 0x8000000000) and in the worst case still leaves 512G for the main program and the heap.

The HighMem range does cover the range used by PaX for extra modules + stack (0x20000000000 - 0x3ffffffffff).

The only remaining question is where to put the allocator memory range.

The extra difficulty with placing the allocator memory range is that the allocator has the following restrictions:

  • AllocatorStart = N * AllocatorSize
  • AllocatorSize >= 256G
  • Its memory has to belong to the “process memory space”, that is - have the corresponding shadow memory.

I think that at least the second restriction can be lifted, but I didn’t analyze its code in enough detail to say that for sure - so let’s assume it’s in place and try to find a suitable location.

Fortunately, for AddressSanitizer it’s easy - just put it at 0x018000000000 with the size 0x004000000000, then all the conditions above are satisfied - and there’s still no overlap between the allocator memory range and the process memory range.
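
Here’s a small compile-time sanity check (illustrative only, not part of the patch) that the remapped formula really sends LowMem and HighMem into their shadow ranges and that this allocator placement satisfies the restrictions above:

#include <cstdint>

// The alternative AddressSanitizer layout for PaX described above.
constexpr uint64_t kLowMemEnd     = 0x010000000000ull;
constexpr uint64_t kLowShadowBeg  = 0x010000000000ull;
constexpr uint64_t kLowShadowEnd  = 0x012000000000ull;
constexpr uint64_t kHighShadowBeg = 0x013000000000ull;
constexpr uint64_t kHighShadowEnd = 0x018000000000ull;
constexpr uint64_t kHighMemBeg    = 0x018000000000ull;
constexpr uint64_t kHighMemEnd    = 0x040000000000ull;   // 1 << 42

// Shadow = (Mem >> 3) + 0x10000000000
constexpr uint64_t kShadowOffset = 0x010000000000ull;
constexpr uint64_t MemToShadow(uint64_t mem) { return (mem >> 3) + kShadowOffset; }

// LowMem shadows land inside LowShadow, HighMem shadows inside HighShadow.
static_assert(MemToShadow(0) == kLowShadowBeg, "start of LowMem maps to start of LowShadow");
static_assert(MemToShadow(kLowMemEnd - 1) < kLowShadowEnd, "end of LowMem stays inside LowShadow");
static_assert(MemToShadow(kHighMemBeg) == kHighShadowBeg, "start of HighMem maps to start of HighShadow");
static_assert(MemToShadow(kHighMemEnd - 1) < kHighShadowEnd, "end of HighMem stays inside HighShadow");

// Allocator placement: 256G at 0x18000000000.
constexpr uint64_t kAllocatorSize  = 0x004000000000ull;
constexpr uint64_t kAllocatorStart = 0x018000000000ull;
static_assert(kAllocatorStart % kAllocatorSize == 0, "AllocatorStart = N * AllocatorSize");
static_assert(kAllocatorStart >= kHighMemBeg && kAllocatorStart + kAllocatorSize <= kHighMemEnd,
              "the allocator memory lies inside HighMem, so it has corresponding shadow memory");

int main() { return 0; }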

Example follows.

Compile the code:

int main()
{
    int* i = new int;
    delete i;
    *i = 0;
    return 0;
}

as follows:

clang++ -std=c++14 -fsanitize=address -g test_asan.cpp -o test_asan

and run it like this:

./test_asan

to produce the output:

=================================================================
==4070==ERROR: AddressSanitizer: heap-use-after-free on address
    0x01820000eff0 at pc 0x0000004ba9bb bp 0x03dd0d63a370 sp 0x03dd0d63a368
WRITE of size 4 at 0x01820000eff0 thread T0
    #0 0x4ba9ba in main /var/tmp/test_asan.cpp:5:5
    #1 0x2dd8626c133 in __libc_start_main (/lib64/libc.so.6+0x20133)
    #2 0x4ba6ac in _start (/var/tmp/test_asan+0x4ba6ac)

0x01820000eff0 is located 0 bytes inside of 4-byte region [0x01820000eff0,0x01820000eff4)
freed by thread T0 here:
    #0 0x434ffb in operator delete(void*) (/var/tmp/test_asan+0x434ffb)
    #1 0x4bab83 in operator delete(void*, unsigned long) (/var/tmp/test_asan+0x4bab83)
    #2 0x4ba935 in main /var/tmp/test_asan.cpp:4:5
    #3 0x2dd8626c133 in __libc_start_main (/lib64/libc.so.6+0x20133)
    #4 0x4ba6ac in _start (/var/tmp/test_asan+0x4ba6ac)

previously allocated by thread T0 here:
    #0 0x434abb in operator new(unsigned long) (/var/tmp/test_asan+0x434abb)
    #1 0x4ba8ce in main /var/tmp/test_asan.cpp:3:5
    #2 0x2dd8626c133 in __libc_start_main (/lib64/libc.so.6+0x20133)
    #3 0x4ba6ac in _start (/var/tmp/test_asan+0x4ba6ac)

SUMMARY: AddressSanitizer: heap-use-after-free /var/tmp/test_asan.cpp:5 main
Shadow bytes around the buggy address:
  0x013040001da0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x013040001db0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x013040001dc0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x013040001dd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x013040001de0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x013040001df0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fd]fa
  0x013040001e00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x013040001e10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x013040001e20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x013040001e30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x013040001e40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  ASan internal:           fe
==4070==ABORTING

Porting MemorySanitizer: “Hurt me plenty”

Here things start becoming even more interesting.

I haven’t seen the memory layout explicitly mentioned anywhere in the MemorySanitizer documentation, but judging from the code, it should be as follows:

0000 0000 0000 - 2000 0000 0000: -
2000 0000 0000 - 4000 0000 0000: shadow memory
4000 0000 0000 - 6000 0000 0000: origin tracking memory
6000 0000 0000 - 7fff ffff ffff: main application, heap, extra modules and stack

… with the following mapping formulas:

  • Shadow = Mem & ~0x400000000000 (= just stripping the highest bit)
  • Origin = Shadow(Mem) + 0x200000000000

This sanitizer uses two shadow memory ranges, one to track process memory state and another one to track the origins of the allocated memory.

Both these states are 1:1 mapping, i.e. 1 byte of state per 1 byte of the process memory - thus, the overall shadow memory range has to be twice as large as the process memory.

Let’s see how that plays out with the PaX restrictions. Summing the sizes of both memory regions that PaX uses, it looks like we need >= 2 * (0x28000000000 + ThatProgramExtraSize) bytes, which is >= 0x50000000000 - more than the whole 0x40000000000 address space that PaX provides to applications!

That does seem like a show-stopper. How are we going to solve this?

Disabling PAX_RANDMMAP

It’s important to realize that the issue above is largely caused by the randomization performed by PaX / the kernel - that is, if we could somehow stop the process memory range from moving around the address space, we’d be able to use a smaller shadow memory region.

But is it OK to disable randomization, given that the whole purpose of PaX is to introduce it? The answer is ‘hell no’ if it meant disabling it system-wide. However, if we can do it on a per-process basis, this should be acceptable: a program built with sanitizers enabled is unlikely to be the one used in production anyhow, and if the build is for testing purposes, the decrease in security due to disabled randomization should not be an issue.

The more subtle point is that disabling randomization means testing the executable in a different configuration than the one it’s supposed to run in, so if there are bugs that depend on the exact process memory layout, the sanitizer might miss them. I think this is an acceptable risk, plus it’s not like there are that many choices anyhow (but look at the “alternative approach” section at the end of the post).

Fortunately, there’s a way to disable PAX_RANDMMAP randomization on a per-process basis: paxctl-ng.

This is not enough to fully suppress the randomization, though: once PAX_RANDMMAP is disabled, the default Linux ASLR kicks in, and the process memory range becomes something like 0x2fe00000000 - 0x3fffffff000.

Would this work for us?

The size of the process memory range itself is manageable: 0x3fffffff000 - 0x2fe00000000 = 0x101fffff000, and given the 2:1 ratio the shadow requirement becomes 0x203ffffe000 - still within our address space, even though we’re getting quite close.

However, there’s also allocator space that we should take care of - and we cannot place it anywhere in the program memory range, as it is randomized.

If we try to place the allocator at 0x28000000000 (the closest we can get to the process memory range while respecting the allocator alignment requirements and minimum size), the memory range we need to shadow now has size 0x3fffffff000 - 0x28000000000 = 0x17ffffff000; given the 2:1 ratio we need to fit 0x2ffffffe000 bytes of shadow, and there’s no space for that in our memory layout!
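
The same numbers, restated as a small compile-time check:

#include <cstdint>

constexpr uint64_t kAddressSpace = 1ull << 42;            // 0x40000000000, PaX user space

// With PAX_RANDMMAP disabled but Linux ASLR still on:
constexpr uint64_t kMemBeg = 0x2fe00000000ull;
constexpr uint64_t kMemEnd = 0x3fffffff000ull;
static_assert(2 * (kMemEnd - kMemBeg) == 0x203ffffe000ull, "2:1 shadow for the range itself");
static_assert(2 * (kMemEnd - kMemBeg) < kAddressSpace, "which still fits into the address space");

// ...but once the allocator is pinned at 0x28000000000, the range to shadow grows:
constexpr uint64_t kAllocatorBeg = 0x28000000000ull;
static_assert(2 * (kMemEnd - kAllocatorBeg) == 0x2ffffffe000ull, "2:1 shadow for allocator + process range");
static_assert(2 * (kMemEnd - kAllocatorBeg) > kAllocatorBeg,
              "the shadow no longer fits below the allocator - no valid layout");

int main() { return 0; }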

What shall we do?

Disabling Linux ASLR

One thing we can do is disable Linux ASLR as well, which can also be done on a per-process basis: run the program via ‘setarch x86_64 -R your_program_name’.
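
For reference, ‘setarch -R’ achieves this (as far as I understand) by setting the ADDR_NO_RANDOMIZE personality flag before exec’ing the target, so the same per-process effect can be had from a tiny wrapper like this sketch:

// Minimal ASLR-disabling wrapper sketch: ./norandom <program> [args...]
#include <sys/personality.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 1;
    }
    // Keep the current persona, just add the "no address space randomization" flag.
    if (personality(personality(0xffffffffUL) | ADDR_NO_RANDOMIZE) == -1) {
        std::perror("personality");
        return 1;
    }
    execvp(argv[1], &argv[1]);
    std::perror("execvp");   // only reached if exec failed
    return 1;
}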

The layout then becomes as follows:

  • 0x2aaaaaaa000: start address of main program + heap.
  • 0x3fffffff000: end address of extra modules + stack.

At first sight it might look like the situation is now worse: we had the 0x2fe00000000 - 0x3fffffff000 memory range, and now it’s the even larger 0x2aaaaaaa000 - 0x3fffffff000.

The key observation is that before, the program could have been located anywhere in that range, whereas now it’s pinned - so, making some modest assumptions about the sizes of the main program, heap, extra modules and stack, it’s possible to come up with a layout that does work within our constraints.

Here’s one possible layout:

0000 0000 0000 - 0000 1000 0000: protected (256M)
0000 1000 0000 - 0080 1000 0000: shadow for main application and heap (512G)
0080 1000 0000 - 0140 1000 0000: shadow for allocator, modules and main thread stack (768G)
0140 1000 0000 - 01c0 1000 0000: origin for main application and heap (512G)
01c0 1000 0000 - 0280 1000 0000: origin for allocator, modules and main thread stack (768G)
0280 1000 0000 - 02aa aaaa a000: - (~170G)
02aa aaaa a000 - 032a aaaa a000: main application and heap (512G)
032a aaaa a000 - 0340 0000 0000: - (~85G)
0340 0000 0000 - 03ff ffff ffff: allocator, modules and main thread stack (768G)

A couple of comments for this layout:

  • The initial ‘protected’ range is there because, even when PAX_RANDMMAP is disabled, the kernel still refuses to map some low addresses, like 0x400000 - I’m not entirely sure why. But we can just skip that range.
  • The allocator space is put into the ‘extra modules & main thread stack’ region, so that out of the 768G, 256G goes to the allocator and 512G to everything else.

One elephant in the room is that the original mapping formulas for MemorySanitizer are not applicable anymore: the mapping for main application memory and extra modules memory is now different.

This requires generating slightly different code at the instrumentation stage: instead of just masking the memory address with an ‘AND’ operation to get the corresponding shadow address, the code needs to perform a comparison first and then, based on whether the address is in the ‘main application’ or the ‘extra modules’ range, subtract a different offset.

This might result in lower performance of the instrumented binary, although I haven’t measured this effect and am not sure how significant the slowdown would be (I suspect it shouldn’t be too bad).
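
To illustrate the shape of the new mapping, here’s a sketch with offsets derived from the layout above (the actual patch may use different constants and a different instruction sequence - this is just to show the compare-then-subtract idea):

#include <cstdint>

// Region boundaries from the layout above.
constexpr uint64_t kMainAppBeg = 0x2aaaaaaa000ull;    // main application and heap
constexpr uint64_t kHighBeg    = 0x34000000000ull;    // allocator, modules, main thread stack

// Offsets derived from the layout: each application range is shifted down onto its shadow range.
constexpr uint64_t kMainAppShadowOffset = kMainAppBeg - 0x10000000ull;
constexpr uint64_t kHighShadowOffset    = kHighBeg - 0x8010000000ull;
constexpr uint64_t kOriginDelta         = 0x14000000000ull;   // shadow -> origin

// Instead of a single 'Mem & ~0x400000000000', the instrumented code now has to branch first.
constexpr uint64_t MemToShadow(uint64_t mem) {
    return mem - (mem >= kHighBeg ? kHighShadowOffset : kMainAppShadowOffset);
}
constexpr uint64_t MemToOrigin(uint64_t mem) { return MemToShadow(mem) + kOriginDelta; }

// Spot checks against the layout table above.
static_assert(MemToShadow(kMainAppBeg) == 0x10000000ull, "main app shadow starts at 0x10000000");
static_assert(MemToShadow(kHighBeg)    == 0x8010000000ull, "high region shadow starts at 0x8010000000");
static_assert(MemToOrigin(kMainAppBeg) == 0x14010000000ull, "main app origin starts at 0x14010000000");
static_assert(MemToOrigin(kHighBeg)    == 0x1c010000000ull, "high region origin starts at 0x1c010000000");

int main() { return 0; }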

Here’s an example of how it works.

Compile the code:

int main()
{
    int i;
    return i != 0;
}

as follows:

clang++ -std=c++14 -fsanitize=memory -fsanitize-memory-track-origins -g test_msan.cpp -o test_msan

and run it like this:

/usr/sbin/paxctl-ng -lr ./test_msan
setarch x86_64 -R ./test_msan

to produce the output:

==4125== WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x2aaaab4ec52 in main /var/tmp/test_msan.cpp:4:5
    #1 0x3fff712b133 in __libc_start_main (/lib64/libc.so.6+0x20133)
    #2 0x2aaaab4e99c in _start (/var/tmp/test_msan+0xa499c)

  Uninitialized value was created by an allocation of 'i' in the stack frame of function 'main'
    #0 0x2aaaab4ead0 in main /var/tmp/test_msan.cpp:2

SUMMARY: MemorySanitizer: use-of-uninitialized-value /var/tmp/test_msan.cpp:4 main
Exiting

Porting ThreadSanitizer: “Ultra-Violence”

Judging from the fact that ThreadSanitizer stores 4 bytes of shadow memory per every byte of the process memory + needs separate ‘traces’ storage, one might guess things are going to be even more interesting here - and would be almost mistaken.

These requirements do complicate the layout, but all the key elements that were explained above for the MemorySanitizer are still applicable.

Without further ado, here’s the layout:

0000 0000 0000 - 0000 1000 0000: protected (256M)
0000 1000 0000 - 00a0 1000 0000: shadow for main application and heap (640G)
00a0 1000 0000 - 02a0 1000 0000: shadow for allocator, modules and main thread stack (2T)
02a0 1000 0000 - 02aa aaaa a000: - (~42G)
02aa aaaa a000 - 02d2 aaaa a000: main application and heap (160G)
02d2 aaaa a000 - 0326 aaaa a000: metainfo (336G)
0326 aaaa a000 - 0330 0000 0000: - (~37G)
0330 0000 0000 - 0380 0000 0000: traces (320G)
0380 0000 0000 - 03ff ffff ffff: allocator, modules and main thread stack (512G)

I’ll leave understanding this layout as an exercise for the reader.

I do recommend reading the patch, though, as it contains some COMPILER_CHECK assertions that further explain the assumptions and decisions behind this layout.

In fact, one thing that makes implementing it easier than MemorySanitizer is that there’s no shadow addresses calculation code in the instrumented binary - all calculations are performed in the ThreadSanitizer runtime library, so there was no need to change any code generation to accommodate this layout.

One more observation: due to the increased size of the extra memory we need, the memory ranges for the main application and the extra modules were reduced to 160G and 256G respectively (the latter after subtracting 256G of allocator memory).
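
As a quick consistency check on this layout (nothing more than the 4:1 shadow ratio mentioned above):

#include <cstdint>

// Region sizes from the ThreadSanitizer layout above.
constexpr uint64_t kMainApp       = 0x2800000000ull;     // 160G: main application and heap
constexpr uint64_t kMainAppShadow = 0xa000000000ull;     // 640G: shadow for it
constexpr uint64_t kHighRegion    = 0x8000000000ull;     // 512G: allocator, modules, main thread stack
constexpr uint64_t kHighShadow    = 0x20000000000ull;    // 2T: shadow for it

static_assert(kMainAppShadow == 4 * kMainApp, "640G is 4 x 160G");
static_assert(kHighShadow == 4 * kHighRegion, "2T is 4 x 512G");

int main() { return 0; }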

Example, as usual, follows.

Compile the code:

#include <thread>
 
int counter = 0;
 
void fun()
{
    ++counter;
}
 
int main()
{
    std::thread thr1(&fun);
    std::thread thr2(&fun);
    thr1.join();
    thr2.join();
    return 0;
}

as follows:

clang++ -std=c++14 -fsanitize=thread -g test_tsan.cpp -o test_tsan

and run it like this:

/usr/sbin/paxctl-ng -lr ./test_tsan
setarch x86_64 -R ./test_tsan

to produce the output:

==================
WARNING: ThreadSanitizer: data race (pid=4145)
  Write of size 4 at 0x02aaab8ba174 by thread T2:
    #0 fun() /var/tmp/test_tsan.cpp:7:5 (test_tsan+0x0000000c63fd)
    #1 void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>)
        /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.2/include/g++-v4/
        functional:1699:18 (test_tsan+0x0000000c9bba)
    #2 std::_Bind_simple<void (*())()>::operator()()
        /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.2/include/g++-v4/
        functional:1688:16 (test_tsan+0x0000000c9b50)
    #3 std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run()
        /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.2/include/g++-v4/
        thread:115:13 (test_tsan+0x0000000c9af9)
    #4 <null> <null>:0 (libstdc++.so.6+0x0000000d9cd2)

  Previous write of size 4 at 0x02aaab8ba174 by thread T1:
    #0 fun() /var/tmp/test_tsan.cpp:7:5 (test_tsan+0x0000000c63fd)
    #1 void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>)
        /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.2/include/g++-v4/
        functional:1699:18 (test_tsan+0x0000000c9bba)
    #2 std::_Bind_simple<void (*())()>::operator()()
        /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.2/include/g++-v4/
        functional:1688:16 (test_tsan+0x0000000c9b50)
    #3 std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run()
        /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.2/include/g++-v4/
        thread:115:13 (test_tsan+0x0000000c9af9)
    #4 <null> <null>:0 (libstdc++.so.6+0x0000000d9cd2)

  Location is global 'counter' of size 4 at 0x02aaab8ba174 (test_tsan+0x000000e10174)

  Thread T2 (tid=4148, running) created by main thread at:
    #0 pthread_create <null>:0 (test_tsan+0x000000062521)
    #1 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) <null>:0
        (libstdc++.so.6+0x0000000d9e48)
    #2 main /var/tmp/test_tsan.cpp:13:17 (test_tsan+0x0000000c6494)

  Thread T1 (tid=4147, running) created by main thread at:
    #0 pthread_create <null>:0 (test_tsan+0x000000062521)
    #1 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) <null>:0
        (libstdc++.so.6+0x0000000d9e48)
    #2 main /var/tmp/test_tsan.cpp:12:17 (test_tsan+0x0000000c646f)

SUMMARY: ThreadSanitizer: data race /var/tmp/test_tsan.cpp:7 fun()
==================
ThreadSanitizer: reported 1 warnings

The alternative approach: “Nightmare!”

Instead of messing around with all these fixed address layouts that are specific to kernel options / patches and require disabling randomization, wouldn’t it be easier to just determine suitable shadow memory addresses dynamically, once the program starts?

As mentioned above, this can be problematic for some sanitizers due to performance concerns (e.g. AddressSanitizer), but for others, like ThreadSanitizer, that do all calculations at runtime anyway, this is not an issue. Moreover, ThreadSanitizer is the most demanding in terms of virtual address space for shadow memory, so it would benefit from a dynamic scheme the most.

I do not plan to work on this, as the current solution is “good enough” for my purposes - but this does seem to be the right approach to take in the long run.

Patches

All these theoretical discussions are nice - but does it actually work?

Here are the patches for LLVM / Clang 3.5.1:

Both of these need to be used together, applying just one would break things spectacularly.

I’ve tested these on Hardened Gentoo distribution with LLVM versions 3.5.0 & 3.5.1 and Linux kernel versions 3.17.7 & 3.18.3.

Limitations

Here are some of the limitations and things to remember when using these patches:

  • They are applicable only when running PaX-patched kernel and only with PAX_PER_CPU_PGD option turned on (in other words - application space has to be 42 bit wide). Trying these with any other kernels (including the standard Linux kernels) will result in binaries that will just crash or exit with an error.
  • While LeakSanitizer & AddressSanitizer work both with randomization enabled and disabled, MemorySanitizer & ThreadSanitizer do require disabling both PAX_RANDMMAP randomization and Linux kernel ASLR randomization on a per-process basis.

Note that naively disabling PAX_RANDMMAP via paxctl (e.g. paxctl -cr your_binary) will lead to executables that almost work, but are unable to create any threads if the kernel was built with the PAX_MPROTECT option.

The issue here is that:

  • PAX_MPROTECT disables the creation of pages that have both ‘writable’ and ‘executable’ bits set.
  • paxctl re-uses the stack ELF program header to store its flags. Therefore, once paxctl is used to disable the PAX_RANDMMAP flag for the binary, the stack ELF header effectively disappears.
  • Unless the ELF header of the binary has an explicit stack entry that says it’s OK to create the stack without the ‘executable’ bit, it seems that in some situations (e.g. in a pthread_create call) the runtime will attempt to create stacks with the ‘executable’ bit set - which is denied by PaX, leading to the inability to create any threads.

The solution is to use paxctl-ng and make sure to use the XATTR method for storing the flags, which stores them in the file’s filesystem extended attributes and doesn’t touch the ELF header. If, for some reason, it’s not possible to store the flags via the XATTR method, one can disable both RANDMMAP & MPROTECT via paxctl to make things work, although I’d strongly encourage using the XATTR approach instead.

Conclusions

I do not plan to work on these patches any further - they do work for my purposes, but I don’t think they will be accepted in their current form into LLVM:

  • They’re specific to a particular combination of PaX flags. Changing PaX kernel options (e.g. disabling PAX_PER_CPU_PGD) would require different address layouts; I haven’t experimented with these and don’t have the corresponding layouts ready.
  • They would require some extra steps in LLVM build configuration to figure out whether they should be enabled or not - this is currently not implemented.
  • Some of the changes are relatively intrusive - like the instrumentation change required for MemorySanitizer - and I’m not sure whether the LLVM developers would be happy to accept & maintain changes like this.

If anyone is interested in upstreaming these and has some relevant experience - let me know and we can discuss things further, though.
