# Native Memory Allocator Verification
This document describes how to verify the native memory allocator on Android.
This procedure should be followed when upgrading or moving to a new allocator.
A minor upgrade might not need to run all of the benchmarks; however,
at least the
[SQL Allocation Trace Benchmark](#sql-allocation-trace-benchmark),
[Memory Replay Benchmarks](#memory-replay-benchmarks) and
[Performance Trace Benchmarks](#performance-trace-benchmarks) should be run.

It is important to note that there are two modes for a native allocator
to run in on Android. The first is the normal allocator; the second is
called the low memory config, which is designed to run on memory constrained
systems and is a bit slower, but takes less RSS. To enable the low memory
config, add this line to the `BoardConfig.mk` for the given target:

    MALLOC_LOW_MEMORY := true

This is valid starting with Android V (API level 35). Before that, the
way to enable the low memory config is:

    MALLOC_SVELTE := true

The `BoardConfig.mk` file is usually found in the directory
`device/<DEVICE_NAME>/` or in a sub directory.

When evaluating a native allocator, make sure that you benchmark both
modes.

## Android Extensions
Android supports a few non-standard functions and mallopt controls that
a native allocator needs to implement.

### Iterator Functions
These are functions that are used to implement a memory leak detector
called `libmemunreachable`.

#### malloc\_disable
This function, when called, should pause all threads that are making a
call to an allocation function (malloc/free/etc). When a call
is made to `malloc_enable`, the paused threads should start running again.

#### malloc\_enable
This function, when called, does nothing unless there was a previous call
to `malloc_disable`.
This call will unpause any thread that was paused while making
a call to an allocation function (malloc/free/etc) after `malloc_disable`
was called.

#### malloc\_iterate
This function enumerates all of the allocations currently live in the
system. It is meant to be called after a call to `malloc_disable` to
prevent further allocations while this call is being executed. The best
description of what is expected from this function is the set of
tests for it in `bionic/tests/malloc_iterate_test.cpp`.

### Mallopt Extensions
These are mallopt options that Android requires for a native allocator
to work efficiently.

#### M\_DECAY\_TIME
When set to zero, `mallopt(M_DECAY_TIME, 0)`, it is expected that an
allocator will attempt to purge and release any unused memory back to the
kernel on free calls. This is important in Android to avoid consuming extra
RSS.

When set to non-zero, `mallopt(M_DECAY_TIME, 1)`, an allocator can delay the
purge and release action. The amount of delay is up to the allocator
implementation, but it should be a reasonable amount of time. The jemalloc
allocator was implemented to have a one second delay.

The drawback to this option is that most allocators do not have a separate
thread to handle the purge, so the decay is only handled when an
allocation operation occurs. For server processes, this can mean that
RSS is slightly higher when the server is waiting for the next connection
and no other allocation calls are made. The `M_PURGE` option is used to
force a purge in this case.

For all applications on Android, the call `mallopt(M_DECAY_TIME, 1)` is
made by default. The idea is that it allows application frees to run a
bit faster, while only increasing RSS a bit.

#### M\_PURGE
When called, `mallopt(M_PURGE, 0)`, an allocator should purge and release
any unused memory immediately. The argument for this call is ignored.
If possible, this call should also clear any thread cached memory. The
idea is that this can be called to purge memory that has not been
purged when `M_DECAY_TIME` is set to one. This is useful if you have a
server application that does a lot of native allocations and the
application wants to purge that memory before waiting for the next connection.

## Correctness Tests
These are the tests that should be run to verify an allocator is
working properly according to Android.

### Bionic Unit Tests
The bionic unit tests contain a small number of allocator tests. These
tests primarily verify Android extensions and non-standard behavior
of allocation routines, such as what happens when a non-power of two alignment
is passed to memalign.

To run all of the compliance tests:

    adb shell /data/nativetest64/bionic-unit-tests/bionic-unit-tests --gtest_filter="malloc*"
    adb shell /data/nativetest/bionic-unit-tests/bionic-unit-tests --gtest_filter="malloc*"

The allocation tests are not meant to be complete, so it is expected
that a native allocator will have its own set of tests that can be run.

### Libmemunreachable Tests
The libmemunreachable tests verify that the iterator functions are working
properly.

To run all of the tests:

    adb shell /data/nativetest64/memunreachable_binder_test/memunreachable_binder_test
    adb shell /data/nativetest/memunreachable_binder_test/memunreachable_binder_test
    adb shell /data/nativetest64/memunreachable_test/memunreachable_test
    adb shell /data/nativetest/memunreachable_test/memunreachable_test
    adb shell /data/nativetest64/memunreachable_unit_test/memunreachable_unit_test
    adb shell /data/nativetest/memunreachable_unit_test/memunreachable_unit_test

### CTS Entropy Test
In addition to the bionic tests, there is also a CTS test that is designed
to verify that the addresses returned by malloc are sufficiently randomized
to help defeat potential security bugs.

Run this test thusly:

    atest AslrMallocTest

If there are multiple devices connected to the system, use `-s <SERIAL>`
to specify a device.

## Performance
There are several ways to evaluate the performance of a native
allocator on Android. One is allocation speed in various scenarios,
another is total RSS taken by the allocator.

The last is virtual address space consumed in 32 bit applications. There is
a limited amount of address space available in 32 bit apps, and there have
been allocator bugs that cause memory failures when too much virtual
address space is consumed. For 64 bit executables, this can be ignored.

NOTE: The default native allocator operates differently in an application
versus command-line tools running in the shell. In order to run the same
as an application, follow these instructions:

    > adb shell
    # export MALLOC_USE_APP_DEFAULTS=1
    # <Run command-line benchmarks>

Running without setting this environment variable can result in different
performance and even different RSS usage for the benchmarks mentioned below.
The environment variable has only been available since API level 36.
Applications have used different native allocator defaults than command-line
tools since API level 26 (Android O).

### Bionic Benchmarks
These are the microbenchmarks that are part of the bionic benchmarks suite.
These benchmarks can be built using this command:

    mmma -j bionic/benchmarks

These benchmarks are only used to verify the speed of the allocator and
ignore anything related to RSS and virtual address space consumed.

For all of these benchmark runs, it can be useful to add these two options:

    --benchmark_repetitions=XX
    --benchmark_report_aggregates_only=true

This will run the benchmark XX times and then give a mean, median, and stddev,
which helps to get a number that can be compared to the new allocator.

In addition, there is another option:

    --bionic_cpu=XX

This option locks the benchmark to only run on core XX. This also avoids
any issue related to the code migrating from one core to another
with different characteristics. For example, on a big-little cpu, if the
benchmark moves from big to little or vice-versa, this can cause scores
to fluctuate in indeterminate ways.

For most runs, the best set of options to add is:

    --benchmark_repetitions=10 --benchmark_report_aggregates_only=true --bionic_cpu=3

On most phones with a big-little cpu, the third core is the little core.
Running on the little core tends to highlight any performance
differences.

#### Allocate/Free Benchmarks
These are the benchmarks to verify the allocation speed of a loop doing a
single allocation, touching every page in the allocation to make it resident
and then freeing the allocation.
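
The recommended options above apply to every bionic benchmark command in the
sections that follow, so it can be convenient to generate the command lines
rather than type them. The following is a minimal sketch, not part of bionic:
the helper name `bench_cmds` and the variable `BENCH_OPTS` are hypothetical,
and the function only prints the commands so they can be reviewed first.

```shell
# Recommended options quoted from the text above.
BENCH_OPTS="--benchmark_repetitions=10 --benchmark_report_aggregates_only=true --bionic_cpu=3"

# Print the 64 bit and 32 bit adb command lines for one benchmark filter.
# bench_cmds is a hypothetical helper name used only for illustration.
bench_cmds() {
  filter="$1"
  for dir in benchmarktest64 benchmarktest; do
    echo "adb shell /data/$dir/bionic-benchmarks/bionic-benchmarks --benchmark_filter=$filter $BENCH_OPTS"
  done
}
```

For example, `bench_cmds stdlib_malloc_free_default` prints two command lines;
piping that output to `sh` on the host would execute them.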

To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_default
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_free_default

To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_decay1
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_free_decay1

The last value in the output is the size of the allocation in bytes. It is
useful to look at these kinds of benchmarks to make sure that there are
no outliers, but these numbers should not be used to make a final decision.
If these numbers are slightly worse than the current allocator, the
single thread numbers from trace data are more representative of
real world situations.

#### Multiple Allocations Retained Benchmarks
These are the benchmarks that examine how the allocator handles multiple
allocations of the same size at the same time.

The first set of these benchmarks does a set number of 8192 byte allocations
in one loop, and then frees all of the allocations at the end of the loop.
Only the time it takes to do the allocations is recorded; the frees are not
counted. The value of 8192 was chosen since the jemalloc native allocator
had issues with this size. It is possible other sizes might show different
results, but, as mentioned before, these microbenchmark numbers should
not be used as absolutes for determining if an allocator is worth using.

This benchmark is designed to verify that there is no performance issue
related to having multiple allocations alive at the same time.

To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_default
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_default

To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_decay1
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_decay1

For these benchmarks, the last parameter is the total number of allocations to
do in each loop.

The other variation of this benchmark is to always do forty allocations in
each loop, but vary the size of the forty allocations. As with the other
benchmark, only the time it takes to do the allocations is tracked; the
frees are not counted. Forty allocations is an arbitrary number that could
be modified in the future. It was chosen because a version of the native
allocator, jemalloc, showed a problem at forty allocations.

To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_default
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_default

To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_decay1
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_decay1

For these benchmarks, the last parameter in the output is the size of the
allocation in bytes.

As with the other microbenchmarks, an allocator with numbers close to
the current values is usually sufficient to consider making
a switch. The trace benchmarks are more important than these benchmarks
since they simulate real world allocation profiles.

#### SQL Allocation Trace Benchmark
This benchmark is a trace of the allocations performed when running
the SQLite BenchMark app.

This benchmark is designed to verify that the allocator will be performant
in a real world allocation scenario. SQL operations were chosen as a
benchmark because these operations tend to do lots of malloc/realloc/free
calls, and they tend to be on the critical path of applications.

To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_default
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_default

To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_decay1
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_decay1

These numbers should be at least as good as those of the current allocator.

#### mallinfo Benchmark
This benchmark only verifies that mallinfo is still close to the performance
of the current allocator.

To run the benchmark, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_mallinfo
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_mallinfo

Calls to mallinfo are used in ART so a new allocator is required to be
nearly as performant as the current allocator.

#### mallopt M\_PURGE Benchmark
This benchmark tracks the cost of calling `mallopt(M_PURGE, 0)`. As with the
mallinfo benchmark, it's not necessary for this to be better than the previous
allocator, only that the performance be in the same order of magnitude.

To run the benchmark, use these commands:

    adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_mallopt_purge
    adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_mallopt_purge

These calls are used to free unused memory pages back to the kernel.
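
Since every bionic benchmark above is judged against the current allocator,
it can help to compare two saved runs side by side. The following is only a
sketch under stated assumptions: it assumes the benchmark output was saved to
text files, that aggregate rows end in `_median` with the time in the second
column (the usual console layout when `--benchmark_report_aggregates_only=true`
is used), and that both runs report the same time unit. The helper name
`compare_medians` is hypothetical.

```shell
# Print "name current candidate ratio" for each median row present in both
# files. $1 is the saved output for the current allocator, $2 the saved
# output for the candidate allocator. A ratio above 1.00 means the
# candidate was slower on that benchmark.
compare_medians() {
  awk '
    # First file: remember the median time for each benchmark name.
    NR == FNR { if ($1 ~ /_median$/) base[$1] = $2; next }
    # Second file: report rows that also exist in the first file.
    $1 ~ /_median$/ && ($1 in base) && base[$1] > 0 {
      printf "%s %s %s %.2f\n", $1, base[$1], $2, $2 / base[$1]
    }
  ' "$1" "$2"
}
```

As the text notes, small differences in these microbenchmarks should not
decide the outcome on their own; the ratios are mainly useful for spotting
outliers.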

### Memory Trace Benchmarks
These benchmarks measure all three axes of a native allocator: RSS, virtual
address space consumed, and speed of allocation. They are designed to
run on a trace of the allocations from a real world application or system
process.

To build this benchmark:

    mmma -j system/extras/memory_replay

This will build two executables:

    /system/bin/memory_replay32
    /system/bin/memory_replay64

And these two benchmark executables:

    /data/benchmarktest64/trace_benchmark/trace_benchmark
    /data/benchmarktest/trace_benchmark/trace_benchmark

#### Memory Replay Benchmarks
These benchmarks display RSS, virtual memory consumed (VA space), and do a
bit of performance testing on actual traces taken from running applications.

The trace data includes what thread does each operation, so the replay
mechanism will simulate this by creating threads and replaying the operations
on a thread as if it was rerunning the real trace. The only issue is that
this is a worst case scenario for allocations happening at the same time
in all threads since it collapses all of the allocation operations to occur
one after another. This causes many threads to allocate at the same
time. The trace data does not include timestamps,
so it is not possible to create a completely accurate replay.

To generate these traces, see the [Malloc Debug documentation](https://android.googlesource.com/platform/bionic/+/main/libc/malloc_debug/README.md),
the option [record\_allocs](https://android.googlesource.com/platform/bionic/+/main/libc/malloc_debug/README.md#record_allocs_total_entries).

To run these benchmarks, first copy the trace files to the target using
this command:

    adb push system/extras/memory_replay/traces /data/local/tmp

Since all of the traces come from applications, the `memory_replay` program
will always call `mallopt(M_DECAY_TIME, 1)` before running the trace.

Run the benchmark thusly:

    adb shell memory_replay64 /data/local/tmp/traces/XXX.zip
    adb shell memory_replay32 /data/local/tmp/traces/XXX.zip

Where XXX.zip is the name of a zipped trace file. The `memory_replay`
program also can process text files, but all trace files are currently
checked in as zip files.

Every 100000 allocation operations, a dump of the RSS and VA space will be
performed. At the end, a final RSS and VA space number will be printed.
For the most part, the intermediate data can be ignored, but it is always
a good idea to look over the data to verify that no strange spikes are
occurring.

The performance number is a measure of the time it takes to perform all of
the allocation calls (malloc/memalign/posix_memalign/realloc/free/etc).
For any call that allocates a pointer, the time for the call and the time
it takes to make the pointer completely resident in memory is included.

The performance numbers for these runs tend to have a wide variability so
they should not be used as absolute values for comparison against the
current allocator. But, they should be in the same range as the current
values.

When evaluating an allocator, one of the most important traces is the
camera.txt trace. The camera application does very large allocations,
and some allocators might leave large virtual address maps around
rather than delete them. When that happens, it can lead to allocation
failures and would cause the camera app to abort/crash.
It is important to verify that when running this trace using the 32 bit replay
executable, the virtual address space consumed is not much larger than the
current allocator. A small increase (on the order of a few MBs) would be okay.

There is no specific benchmark for memory fragmentation; instead, the RSS
when running the memory traces acts as a proxy for this. An allocator that
is fragmenting badly will show an increase in RSS. The best trace for
tracking fragmentation is system\_server.txt which is an extremely long
trace (~13 million operations). The total number of live allocations goes
up and down a bit, but stays mostly the same so an allocator that fragments
badly would likely show an abnormal increase in RSS on this trace.

NOTE: When a native allocator calls mmap, it is expected that the allocator
will name the map using the call:

    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, <PTR>, <SIZE>, "libc_malloc");

If the native allocator creates a different name, then it is necessary to
modify the file:

    system/extras/memory_replay/NativeInfo.cpp

The `GetNativeInfo` function needs to be modified to include the name
of the maps that this allocator includes.

In addition, in order for the frameworks code to keep track of the memory
of a process, any named maps must be added to the file:

    frameworks/base/core/jni/android_os_Debug.cpp

Modify the `load_maps` function and add a check of the new expected name.

#### Performance Trace Benchmarks
This is a benchmark that treats the trace data as if all allocations
occurred in a single thread. This is the scenario that could
happen if all of the allocations are spaced out in time so no thread
ever does an allocation at the same time as another thread.

Run these benchmarks thusly:

    adb shell /data/benchmarktest64/trace_benchmark/trace_benchmark
    adb shell /data/benchmarktest/trace_benchmark/trace_benchmark

When run without any arguments, the benchmark will run over all of the
traces and display data. It takes many minutes to complete these runs in
order to get as accurate a number as possible.
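
Because the traces come from applications, it may be worth running these
benchmarks with the `MALLOC_USE_APP_DEFAULTS=1` setting described in the
Performance section. The following is a minimal sketch under two assumptions:
that the device is at API level 36 or later, and that the device shell accepts
the usual `VAR=value command` prefix form. The helper name `trace_bench_cmds`
is hypothetical, and it only prints the commands.

```shell
# Print the 64 bit and 32 bit trace_benchmark command lines with the
# application allocator defaults requested through the environment.
# trace_bench_cmds is a hypothetical helper name used only for illustration.
trace_bench_cmds() {
  for dir in benchmarktest64 benchmarktest; do
    echo "adb shell MALLOC_USE_APP_DEFAULTS=1 /data/$dir/trace_benchmark/trace_benchmark"
  done
}
```

Running `trace_bench_cmds | sh` on the host would execute both runs in turn.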