When the kernel unmaps or modifies the attributes of a range of
memory, it has two choices:
 1. Flush the entire TLB with a two-instruction sequence.  This is
    a quick operation, but it causes collateral damage: TLB entries
    from areas other than the one we are trying to flush will be
    destroyed and must be refilled later, at some cost.
 2. Use the invlpg instruction to invalidate a single page at a
    time.  This could potentially cost many more instructions, but
    it is a much more precise operation, causing no collateral
    damage to other TLB entries.

Which method to use depends on a few things:
 1. The size of the flush being performed.  A flush of the entire
    address space is obviously better performed by flushing the
    entire TLB than doing 2^48/PAGE_SIZE individual flushes.
 2. The contents of the TLB.  If the TLB is empty, then there will
    be no collateral damage caused by doing the global flush, and
    all of the individual flushes will have ended up being wasted
    work.
 3. The size of the TLB.  The larger the TLB, the more collateral
    damage we do with a full flush.  So, the larger the TLB, the
    more attractive an individual flush looks.  Data and
    instructions have separate TLBs, as do different page sizes.
 4. The microarchitecture.  The TLB has become a multi-level
    cache on modern CPUs, and the global flushes have become more
    expensive relative to single-page flushes.

There is obviously no way the kernel can know all these things,
especially the contents of the TLB during a given flush.  The
sizes of the flush will vary greatly depending on the workload as
well.  There is essentially no "right" point to choose.

You may be doing too many individual invalidations if you see the
invlpg instruction (or instructions _near_ it) show up high in
profiles.
If you believe that individual invalidations are being called too
often, you can lower the tunable:

	/sys/kernel/debug/x86/tlb_single_page_flush_ceiling

This will cause us to do the global flush for more cases.
Lowering it to 0 will disable the use of the individual flushes.
Setting it to 1 is a very conservative setting and it should
never need to be 0 under normal circumstances.

Despite the fact that a single individual flush on x86 is
guaranteed to flush a full 2MB [1], hugetlbfs always uses the full
flushes.  THP is treated exactly the same as normal memory.

You might see invlpg inside of flush_tlb_mm_range() show up in
profiles, or you can use the trace_tlb_flush() tracepoints to
determine how long the flush operations are taking.

Essentially, you are balancing the cycles you spend doing invlpg
with the cycles that you spend refilling the TLB later.

You can measure how expensive TLB refills are by using
performance counters and 'perf stat', like this:

perf stat \
	-e cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/ \
	-e cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/ \
	-e cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/ \
	-e cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/ \
	-e cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/ \
	-e cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/

Append the command you want to measure to the end of that line,
or add -a to count system-wide.

That works on an IvyBridge-era CPU (i5-3320M).  Different CPUs
may have differently-named counters, but they should at least
be there in some form.  You can use pmu-tools 'ocperf list'
(https://github.com/andikleen/pmu-tools) to find the right
counters for a given CPU.

[1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
    says: "One execution of INVLPG is sufficient even for a page
    with size greater than 4 KBytes."
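
The effect of tlb_single_page_flush_ceiling described above can be
pictured as a simple threshold test.  The Python below is only an
illustrative sketch, not the kernel's code: the function name
choose_flush(), the 4 KiB PAGE_SIZE and the exact comparison are
assumptions, chosen so that the behavior matches the text (a
ceiling of 0 disables individual flushes, and a ceiling of 1 allows
them only for a single page).

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def choose_flush(start, end, ceiling):
    """Pick a flush strategy for the byte range [start, end).

    Returns ("full", 0) for a whole-TLB flush, or ("single", n)
    for n per-page invalidations.  Hypothetical helper that models
    the ceiling tunable; it is not a kernel interface.
    """
    if end <= start:
        raise ValueError("empty or inverted range")
    # Count every page the range touches, including partial ones.
    first_page = start // PAGE_SIZE
    last_page = (end - 1) // PAGE_SIZE
    n_pages = last_page - first_page + 1
    # More pages than the ceiling allows: fall back to a full flush.
    if n_pages > ceiling:
        return ("full", 0)
    return ("single", n_pages)


# Example ceiling values are arbitrary, picked for illustration.
print(choose_flush(0, 4 * PAGE_SIZE, ceiling=8))   # four single-page flushes
print(choose_flush(0, 64 * PAGE_SIZE, ceiling=8))  # too large: full flush
print(choose_flush(0, PAGE_SIZE, ceiling=0))       # ceiling 0: always full
```

Under this model, lowering the ceiling only moves the point at
which a range stops being flushed page by page.  It does not change
the cost of either method, which is why the profiles and counters
described above are still needed to pick a value.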