bionic/docs/native_allocator.md

# Native Memory Allocator Verification
This document describes how to verify the native memory allocator on Android.
This procedure should be followed when upgrading or moving to a new allocator.
A small minor upgrade might not need to run all of the benchmarks, however,
at least the
[SQL Allocation Trace Benchmark](#sql-allocation-trace-benchmark),
[Memory Replay Benchmarks](#memory-replay-benchmarks) and
[Performance Trace Benchmarks](#performance-trace-benchmarks) should be run.

It is important to note that there are two modes for a native allocator
to run in on Android. The first is the normal allocator, the second is
called the low memory config, which is designed to run on memory constrained
systems and be a bit slower, but take less RSS. To enable the low memory
config, add this line to the `BoardConfig.mk` for the given target:

    MALLOC_LOW_MEMORY := true

This is valid starting with Android V (API level 35), before that the
way to enable the low memory config is:

    MALLOC_SVELTE := true

The `BoardConfig.mk` file is usually found in the directory
`device/<DEVICE_NAME>/` or in a sub directory.

When evaluating a native allocator, make sure that you benchmark both
versions.

## Android Extensions
Android supports a few non-standard functions and mallopt controls that
a native allocator needs to implement.

### Iterator Functions
These are functions that are used to implement a memory leak detector
called `libmemunreachable`.

#### malloc\_disable
This function, when called, should pause all threads that are making a
call to an allocation function (malloc/free/etc). When a call
is made to `malloc_enable`, the paused threads should start running again.

#### malloc\_enable
This function, when called, does nothing unless there was a previous call
to `malloc_disable`. This call will unpause any thread which is making
a call to an allocation function (malloc/free/etc) when `malloc_disable`
was called previously.

#### malloc\_iterate
This function enumerates all of the allocations currently live in the
system. It is meant to be called after a call to `malloc_disable` to
prevent further allocations while this call is being executed. To
see what is expected for this function, the best description is the
tests for this funcion in `bionic/tests/malloc_itearte_test.cpp`.

### Mallopt Extensions
These are mallopt options that Android requires for a native allocator
to work efficiently.

#### M\_DECAY\_TIME
When set to zero, `mallopt(M_DECAY_TIME, 0)`, it is expected that an
allocator will attempt to purge and release any unused memory back to the
kernel on free calls. This is important in Android to avoid consuming extra
RSS.

When set to non-zero, `mallopt(M_DECAY_TIME, 1)`, an allocator can delay the
purge and release action. The amount of delay is up to the allocator
implementation, but it should be a reasonable amount of time. The jemalloc
allocator was implemented to have a one second delay.

The drawback to this option is that most allocators do not have a separate
thread to handle the purge, so the decay is only handled when an
allocation operation occurs. For server processes, this can mean that
RSS is slightly higher when the server is waiting for the next connection
and no other allocation calls are made. The `M_PURGE` option is used to
force a purge in this case.

For all applications on Android, the call `mallopt(M_DECAY_TIME, 1)` is
made by default. The idea is that it allows application frees to run a
bit faster, while only increasing RSS a bit.

#### M\_PURGE
When called, `mallopt(M_PURGE, 0)`, an allocator should purge and release
any unused memory immediately. The argument for this call is ignored. If
possible, this call should clear thread cached memory if it exists. The
idea is that this can be called to purge memory that has not been
purged when `M_DECAY_TIME` is set to one. This is useful if you have a
server application that does a lot of native allocations and the
application wants to purge that memory before waiting for the next connection.

## Correctness Tests
These are the tests that should be run to verify an allocator is
working properly according to Android.

### Bionic Unit Tests
The bionic unit tests contain a small number of allocator tests. These
tests are primarily verifying Android extensions and non-standard behavior
of allocation routines such as what happens when a non-power of two alignment
is passed to memalign.

To run all of the compliance tests:

    adb shell /data/nativetest64/bionic-unit-tests/bionic-unit-tests --gtest_filter="malloc*"
    adb shell /data/nativetest/bionic-unit-tests/bionic-unit-tests --gtest_filter="malloc*"

The allocation tests are not meant to be complete, so it is expected
that a native allocator will have its own set of tests that can be run.

### Libmemunreachable Tests
The libmemunreachable tests verify that the iterator functions are working
properly.

To run all of the tests:

    adb shell /data/nativetest64/memunreachable_binder_test/memunreachable_binder_test
    adb shell /data/nativetest/memunreachable_binder_test/memunreachable_binder_test
    adb shell /data/nativetest64/memunreachable_test/memunreachable_test
    adb shell /data/nativetest/memunreachable_test/memunreachable_test
    adb shell /data/nativetest64/memunreachable_unit_test/memunreachable_unit_test
    adb shell /data/nativetest/memunreachable_unit_test/memunreachable_unit_test

### CTS Entropy Test
In addition to the bionic tests, there is also a CTS test that is designed
to verify that the addresses returned by malloc are sufficiently randomized
to help defeat potential security bugs.

Run this test thusly:

    atest AslrMallocTest

If there are multiple devices connected to the system, use `-s <SERIAL>`
to specify a device.

## Performance
There are multiple different ways to evaluate the performance of a native
allocator on Android. One is allocation speed in various different scenarios,
another is total RSS taken by the allocator.

The last is virtual address space consumed in 32 bit applications. There is
a limited amount of address space available in 32 bit apps, and there have
been allocator bugs that cause memory failures when too much virtual
address space is consumed. For 64 bit executables, this can be ignored.

NOTE: The default native allocator operates differently in an application
versus command-line tools running in the shell. In order to run the same
as an application, follow these instructions:

    > adb shell
    # export MALLOC_USE_APP_DEFAULTS=1
    # <Run command-line benchmarks>

Running without setting this environment variable can result in different
performance and even different RSS usage for the benchmarks mentioned below.
The environment variable has only been available since API level 36.
Applications using different native allocator defaults than command-line
tools has been present since API level 26 (Android O).

### Bionic Benchmarks
These are the microbenchmarks that are part of the bionic benchmarks suite of
benchmarks. These benchmarks can be built using this command:

    mmma -j bionic/benchmarks

These benchmarks are only used to verify the speed of the allocator and
ignore anything related to RSS and virtual address space consumed.

For all of these benchmark runs, it can be useful to add these two options:

    --benchmark_repetitions=XX
    --benchmark_report_aggregates_only=true

This will run the benchmark XX times and then give a mean, median, and stddev
and helps to get a number that can be compared to the new allocator.

In addition, there is another option:

    --bionic_cpu=XX

Which will lock the benchmark to only run on core XX. This also avoids
any issue related to the code migrating from one core to another
with different characteristics. For example, on a big-little cpu, if the
benchmark moves from big to little or vice-versa, this can cause scores
to fluctuate in indeterminate ways.

For most runs, the best set of options to add is:

    --benchmark_repetitions=10 --benchmark_report_aggregates_only=true --bionic_cpu=3

On most phones with a big-little cpu, the third core is the little core.
Choosing to run on the little core can tend to highlight any performance
differences.

#### Allocate/Free Benchmarks
These are the benchmarks to verify the allocation speed of a loop doing a
single allocation, touching every page in the allocation to make it resident
and then freeing the allocation.

To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_default
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_free_default

To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_decay1
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_free_decay1

The last value in the output is the size of the allocation in bytes. It is
useful to look at these kinds of benchmarks to make sure that there are
no outliers, but these numbers should not be used to make a final decision.
If these numbers are slightly worse than the current allocator, the
single thread numbers from trace data is a better representative of
real world situations.

#### Multiple Allocations Retained Benchmarks
These are the benchmarks that examine how the allocator handles multiple
allocations of the same size at the same time.

The first set of these benchmarks does a set number of 8192 byte allocations
in one loop, and then frees all of the allocations at the end of the loop.
Only the time it takes to do the allocations is recorded, the frees are not
counted. The value of 8192 was chosen since the jemalloc native allocator
had issues with this size. It is possible other sizes might show different
results, but, as mentioned before, these microbenchmark numbers should
not be used as absolutes for determining if an allocator is worth using.

This benchmark is designed to verify that there is no performance issue
related to having multiple allocations alive at the same time.

To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_default
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_default

To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_decay1
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_decay1

For these benchmarks, the last parameter is the total number of allocations to
do in each loop.

The other variation of this benchmark is to always do forty allocations in
each loop, but vary the size of the forty allocations. As with the other
benchmark, only the time it takes to do the allocations is tracked, the
frees are not counted. Forty allocations is an arbitrary number that could
be modified in the future. It was chosen because a version of the native
allocator, jemalloc, showed a problem at forty allocations.

To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_default
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_default

To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these command:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_decay1
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_decay1

For these benchmarks, the last parameter in the output is the size of the
allocation in bytes.

As with the other microbenchmarks, an allocator with numbers in the same
proximity of the current values is usually sufficient to consider making
a switch. The trace benchmarks are more important than these benchmarks
since they simulate real world allocation profiles.

#### SQL Allocation Trace Benchmark
This benchmark is a trace of the allocations performed when running
the SQLite BenchMark app.

This benchmark is designed to verify that the allocator will be performant
in a real world allocation scenario. SQL operations were chosen as a
benchmark because these operations tend to do lots of malloc/realloc/free
calls, and they tend to be on the critical path of applications.

To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_default
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_default

To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_decay1
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_decay1

These numbers should be as performant as the current allocator.

#### mallinfo Benchmark
This benchmark only verifies that mallinfo is still close to the performance
of the current allocator.

To run the benchmark, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_mallinfo
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_mallinfo

Calls to mallinfo are used in ART so a new allocator is required to be
nearly as performant as the current allocator.

#### mallopt M\_PURGE Benchmark
This benchmark tracks the cost of calling `mallopt(M_PURGE, 0)`. As with the
mallinfo benchmark, it's not necessary for this to be better than the previous
allocator, only that the performance be in the same order of magnitude.

To run the benchmark, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_mallopt_purge
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_mallopt_purge

These calls are used to free unused memory pages back to the kernel.

### Memory Trace Benchmarks
These benchmarks measure all three axes of a native allocator, RSS, virtual
address space consumed, speed of allocation. They are designed to
run on a trace of the allocations from a real world application or system
process.

To build this benchmark:

    mmma -j system/extras/memory_replay

This will build two executables:

    /system/bin/memory_replay32
    /system/bin/memory_replay64

And these two benchmark executables:

    /data/benchmarktest64/trace_benchmark/trace_benchmark
    /data/benchmarktest/trace_benchmark/trace_benchmark

#### Memory Replay Benchmarks
These benchmarks display RSS, virtual memory consumed (VA space), and do a
bit of performance testing on actual traces taken from running applications.

The trace data includes what thread does each operation, so the replay
mechanism will simulate this by creating threads and replaying the operations
on a thread as if it was rerunning the real trace. The only issue is that
this is a worst case scenario for allocations happening at the same time
in all threads since it collapses all of the allocation operations to occur
one after another. This will cause a lot of threads allocating at the same
time. The trace data does not include timestamps,
so it is not possible to create a completely accurate replay.

To generate these traces, see the [Malloc Debug documentation](https://android.googlesource.com/platform/bionic/+/main/libc/malloc_debug/README.md),
the option [record\_allocs](https://android.googlesource.com/platform/bionic/+/main/libc/malloc_debug/README.md#record_allocs_total_entries).

To run these benchmarks, first copy the trace files to the target using
these commands:

    adb push system/extras/memory_replay/traces /data/local/tmp

Since all of the traces come from applications, the `memory_replay` program
will always call `mallopt(M_DECAY_TIME, 1)' before running the trace.

Run the benchmark thusly:

    adb shell memory_replay64 /data/local/tmp/traces/XXX.zip
    adb shell memory_replay32 /data/local/tmp/traces/XXX.zip

Where XXX.zip is the name of a zipped trace file. The `memory_replay`
program also can process text files, but all trace files are currently
checked in as zip files.

Every 100000 allocation operations, a dump of the RSS and VA space will be
performed. At the end, a final RSS and VA space number will be printed.
For the most part, the intermediate data can be ignored, but it is always
a good idea to look over the data to verify that no strange spikes are
occurring.

The performance number is a measure of the time it takes to perform all of
the allocation calls (malloc/memalign/posix_memalign/realloc/free/etc).
For any call that allocates a pointer, the time for the call and the time
it takes to make the pointer completely resident in memory is included.

The performance numbers for these runs tend to have a wide variability so
they should not be used as absolute value for comparison against the
current allocator. But, they should be in the same range as the current
values.

When evaluating an allocator, one of the most important traces is the
camera.txt trace. The camera application does very large allocations,
and some allocators might leave large virtual address maps around
rather than delete them. When that happens, it can lead to allocation
failures and would cause the camera app to abort/crash. It is
important to verify that when running this trace using the 32 bit replay
executable, the virtual address space consumed is not much larger than the
current allocator. A small increase (on the order of a few MBs) would be okay.

There is no specific benchmark for memory fragmentation, instead, the RSS
when running the memory traces acts as a proxy for this. An allocator that
is fragmenting badly will show an increase in RSS. The best trace for
tracking fragmentation is system\_server.txt which is an extremely long
trace (~13 million operations). The total number of live allocations goes
up and down a bit, but stays mostly the same so an allocator that fragments
badly would likely show an abnormal increase in RSS on this trace.

NOTE: When a native allocator calls mmap, it is expected that the allocator
will name the map using the call:

    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, <PTR>, <SIZE>, "libc_malloc");

If the native allocator creates a different name, then it necessary to
modify the file:

    system/extras/memory_replay/NativeInfo.cpp

The `GetNativeInfo` function needs to be modified to include the name
of the maps that this allocator includes.

In addition, in order for the frameworks code to keep track of the memory
of a process, any named maps must be added to the file:

    frameworks/base/core/jni/android_os_Debug.cpp

Modify the `load_maps` function and add a check of the new expected name.

#### Performance Trace Benchmarks
This is a benchmark that treats the trace data as if all allocations
occurred in a single thread. This is the scenario that could
happen if all of the allocations are spaced out in time so no thread
every does an allocation at the same time as another thread.

Run these benchmarks thusly:

    adb shell /data/benchmarktest64/trace_benchmark/trace_benchmark
    adb shell /data/benchmarktest/trace_benchmark/trace_benchmark

When run without any arguments, the benchmark will run over all of the
traces and display data. It takes many minutes to complete these runs in
order to get as accurate a number as possible.