This document summarizes the common approaches for performance fine tuning with
jemalloc (as of 5.1.0).  The default configuration of jemalloc tends to work
reasonably well in practice, and most applications should not have to tune any
options.  However, in order to cover a wide range of applications and avoid
pathological cases, the default setting is sometimes kept conservative and
suboptimal, even for many common workloads.  When jemalloc is properly tuned for
a specific application / workload, it is common to improve system level metrics
by a few percent, or make favorable trade-offs.


## Notable runtime options for performance tuning

Runtime options can be set via
[malloc_conf](http://jemalloc.net/jemalloc.3.html#tuning).

* [background_thread](http://jemalloc.net/jemalloc.3.html#background_thread)

    Enabling jemalloc background threads generally improves the tail latency for
    application threads, since unused memory purging is shifted to the dedicated
    background threads.  In addition, unintended purging delay caused by
    application inactivity is avoided with background threads.

    Suggested: `background_thread:true` when jemalloc managed threads can be
    allowed.

* [metadata_thp](http://jemalloc.net/jemalloc.3.html#opt.metadata_thp)

    Allowing jemalloc to utilize transparent huge pages for its internal
    metadata usually reduces TLB misses significantly, especially for programs
    with large memory footprint and frequent allocation / deallocation
    activities.  Metadata memory usage may increase due to the use of huge
    pages.

    Suggested for allocation intensive programs: `metadata_thp:auto` or
    `metadata_thp:always`, which is expected to improve CPU utilization at a
    small memory cost.

* [dirty_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms) and
  [muzzy_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.muzzy_decay_ms)

    Decay time determines how fast jemalloc returns unused pages back to the
    operating system, and therefore provides a fairly straightforward trade-off
    between CPU and memory usage.  A shorter decay time purges unused pages
    faster to reduce memory usage (usually at the cost of more CPU cycles spent
    on purging), and vice versa.

    Suggested: tune the values based on the desired trade-offs.

* [narenas](http://jemalloc.net/jemalloc.3.html#opt.narenas)

    By default jemalloc uses multiple arenas to reduce internal lock contention.
    However, a high arena count may also increase overall memory fragmentation,
    since arenas manage memory independently.  When a high degree of parallelism
    is not expected at the allocator level, a lower number of arenas often
    improves memory usage.

    Suggested: if low parallelism is expected, try a lower arena count while
    monitoring CPU and memory usage.

* [percpu_arena](http://jemalloc.net/jemalloc.3.html#opt.percpu_arena)

    Enable dynamic thread-to-arena association based on the running CPU.  This
    has the potential to improve locality, e.g. when thread-to-CPU affinity is
    present.

    Suggested: try `percpu_arena:percpu` or `percpu_arena:phycpu` if
    thread migration between processors is expected to be infrequent.

Examples:

* High resource consumption application, prioritizing CPU utilization:

    `background_thread:true,metadata_thp:auto` combined with relaxed decay time
    (increased `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:30000,muzzy_decay_ms:30000`).

* High resource consumption application, prioritizing memory usage:

    `background_thread:true` combined with shorter decay time (decreased
    `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:5000,muzzy_decay_ms:5000`), and lower arena count
    (e.g. number of CPUs).

* Low resource consumption application:

    `narenas:1,lg_tcache_max:13` combined with shorter decay time (decreased
    `dirty_decay_ms` and / or `muzzy_decay_ms`, e.g.
    `dirty_decay_ms:1000,muzzy_decay_ms:0`).

* Extremely conservative -- minimize memory usage at all costs, only suitable when
allocation activity is very rare:

    `narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0`

Note that it is recommended to combine the options with `abort_conf:true`, which
aborts immediately on illegal options.

## Beyond runtime options

In addition to the runtime options, there are a number of programmatic ways to
improve application performance with jemalloc.

* [Explicit arenas](http://jemalloc.net/jemalloc.3.html#arenas.create)

    Manually created arenas can help performance in various ways, e.g. by
    managing locality and contention for specific usages.  For example,
    applications can explicitly allocate frequently accessed objects from a
    dedicated arena with
    [mallocx()](http://jemalloc.net/jemalloc.3.html#MALLOCX_ARENA) to improve
    locality.  In addition, explicit arenas often benefit from individually
    tuned options, e.g. relaxed [decay
    time](http://jemalloc.net/jemalloc.3.html#arena.i.dirty_decay_ms) if
    frequent reuse is expected.

* [Extent hooks](http://jemalloc.net/jemalloc.3.html#arena.i.extent_hooks)

    Extent hooks allow customization for managing underlying memory.  One
    performance-oriented use case is to utilize huge pages -- for example,
    [HHVM](https://github.com/facebook/hhvm/blob/master/hphp/util/alloc.cpp)
    uses explicit arenas with customized extent hooks to manage 1GB huge pages
    for frequently accessed data, which reduces TLB misses significantly.

* [Explicit thread-to-arena
  binding](http://jemalloc.net/jemalloc.3.html#thread.arena)

    It is common for some threads in an application to have different memory
    access / allocation patterns.  Threads with heavy workloads often benefit
    from explicit binding, e.g. binding very active threads to dedicated arenas
    may reduce contention at the allocator level.
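
As a closing illustration, the example configurations listed under the runtime options can be applied to an existing binary through the `MALLOC_CONF` environment variable, one of the sources jemalloc reads its options from. The following is a minimal sketch, not part of jemalloc: the profile names (`cpu`, `memory`, `small`, `minimal`) and the helpers `profile_conf()` and `exec_with_profile()` are hypothetical, and it assumes a POSIX system and a target program that actually uses jemalloc. The option strings are taken from the examples, with `abort_conf:true` appended as recommended.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>   /* setenv (POSIX) */
#include <string.h>   /* strcmp */
#include <unistd.h>   /* execvp (POSIX) */

/* Hypothetical profile names mapped to the option strings from the examples. */
struct tuning_profile {
    const char *name;
    const char *conf;
};

static const struct tuning_profile PROFILES[] = {
    /* High resource consumption, prioritizing CPU utilization. */
    {"cpu", "background_thread:true,metadata_thp:auto,"
            "dirty_decay_ms:30000,muzzy_decay_ms:30000,abort_conf:true"},
    /* High resource consumption, prioritizing memory usage. A lower
     * narenas value is machine-specific, so it is left out here. */
    {"memory", "background_thread:true,"
               "dirty_decay_ms:5000,muzzy_decay_ms:5000,abort_conf:true"},
    /* Low resource consumption. */
    {"small", "narenas:1,lg_tcache_max:13,"
              "dirty_decay_ms:1000,muzzy_decay_ms:0,abort_conf:true"},
    /* Extremely conservative; only for very rare allocation activity. */
    {"minimal", "narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0,"
                "abort_conf:true"},
};

/* Returns the option string for a profile name, or NULL if it is unknown. */
const char *profile_conf(const char *name) {
    assert(name != NULL);
    for (size_t i = 0; i < sizeof PROFILES / sizeof PROFILES[0]; i++) {
        if (strcmp(PROFILES[i].name, name) == 0) {
            return PROFILES[i].conf;
        }
    }
    return NULL;
}

/* Exports MALLOC_CONF for the chosen profile, then replaces the current
 * process with the program named by argv[0]. Returns -1 only on failure. */
int exec_with_profile(const char *name, char *const argv[]) {
    const char *conf = profile_conf(name);
    if (conf == NULL || argv == NULL || argv[0] == NULL) {
        return -1;
    }
    if (setenv("MALLOC_CONF", conf, 1) != 0) {
        return -1;
    }
    execvp(argv[0], argv);
    return -1; /* reached only if execvp failed */
}
```

A launcher's `main` could, for instance, treat its first argument as the profile name and pass the remaining arguments to `exec_with_profile()`. Because the options are read when jemalloc initializes, they have to be in the environment before the target program starts, which is why the sketch sets the variable and then executes the program rather than changing it in a running process.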