================================
PSI - Pressure Stall Information
================================

:Date: April, 2018
:Author: Johannes Weiner <hannes@cmpxchg.org>

When CPU, memory or IO devices are contended, workloads experience
latency spikes, throughput losses, and run the risk of OOM kills.

Without an accurate measure of such contention, users are forced to
either play it safe and under-utilize their hardware resources, or
roll the dice and frequently suffer the disruptions resulting from
excessive overcommit.

The psi feature identifies and quantifies the disruptions caused by
such resource crunches and the time impact they have on complex
workloads or even entire systems.

Having an accurate measure of productivity losses caused by resource
scarcity aids users in sizing workloads to hardware--or provisioning
hardware according to workload demand.

As psi aggregates this information in realtime, systems can be managed
dynamically using techniques such as load shedding, migrating jobs to
other systems or data centers, or strategically pausing or killing low
priority or restartable batch jobs.

This allows maximizing hardware utilization without sacrificing
workload health or risking major disruptions such as OOM kills.

Pressure interface
==================

Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.

The format for CPU is as such::

	some avg10=0.00 avg60=0.00 avg300=0.00 total=0

and for memory and IO::

	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
	full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" line indicates the share of time in which at least some
tasks are stalled on a given resource.

The "full" line indicates the share of time in which all non-idle
tasks are stalled on a given resource simultaneously.
In this state actual CPU cycles are going to waste, and a workload that
spends extended time in this state is considered to be thrashing. This has
a severe impact on performance, and it's useful to distinguish this
situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.

The ratios (in %) are tracked as recent trends over ten, sixty, and
three hundred second windows, which gives insight into short term events
as well as medium and long term trends. The total absolute stall time
(in us) is tracked and exported as well, to allow detection of latency
spikes which wouldn't necessarily make a dent in the time averages,
or to average trends over custom time frames.

Monitoring for pressure thresholds
==================================

Users can register triggers and use poll() to be woken up when resource
pressure exceeds certain thresholds.

A trigger describes the maximum cumulative stall time over a specific
time window, e.g. 100ms of total stall time within any 500ms window to
generate a wakeup event.

To register a trigger, the user has to open the psi interface file under
/proc/pressure/ representing the resource to be monitored and write the
desired threshold and time window. The open file descriptor should be
used to wait for trigger events using select(), poll() or epoll().
The following format is used::

	<some|full> <stall amount in us> <time window in us>

For example, writing "some 150000 1000000" into /proc/pressure/memory
would add a 150ms threshold for partial memory stall measured within a
1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
would add a 50ms threshold for full io stall measured within a 1sec time
window.
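
Because the trigger format takes both values in microseconds, it is easy
to get the units wrong when thresholds are configured in milliseconds. The
following is a minimal sketch of building a trigger string; the helper name
``psi_format_trigger`` and its parameters are hypothetical and not part of
any kernel or libc API, and the sanity check on the arguments is an
assumption of this sketch.

```c
#include <stdio.h>

/*
 * Hypothetical helper (illustration only): format a psi trigger string
 * "<some|full> <stall amount in us> <time window in us>" from values
 * given in milliseconds.
 *
 * Returns 0 on success, -1 on invalid arguments or if buf is too small.
 */
static int psi_format_trigger(char *buf, size_t size, int full,
			      unsigned int stall_ms, unsigned int window_ms)
{
	int n;

	/* Assumed sanity check: a threshold must fit inside its window. */
	if (stall_ms == 0 || window_ms == 0 || stall_ms > window_ms)
		return -1;

	/* Convert ms to us in 64-bit arithmetic to avoid overflow. */
	n = snprintf(buf, size, "%s %llu %llu", full ? "full" : "some",
		     (unsigned long long)stall_ms * 1000ULL,
		     (unsigned long long)window_ms * 1000ULL);

	/* snprintf reports truncation by returning n >= size. */
	if (n < 0 || (size_t)n >= size)
		return -1;
	return 0;
}
```

Calling the helper with ``full = 0``, ``stall_ms = 150`` and
``window_ms = 1000`` produces the string "some 150000 1000000" from the
example above, ready to be written to the opened pressure file.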

Triggers can be set on more than one psi metric, and more than one trigger
can be specified for the same psi metric. However, each trigger requires a
separate file descriptor so that it can be polled independently of the
others. A separate open() syscall should therefore be made for each
trigger, even when opening the same psi interface file. Write operations to
a file descriptor with an already existing psi trigger will fail with EBUSY.

Monitors activate only when the system enters the stall state for the
monitored psi metric and deactivate upon exit from the stall state. While
the system is in the stall state, psi signal growth is monitored at a rate
of 10 times per tracking window.

The kernel accepts window sizes ranging from 500ms to 10s, so the minimum
monitoring update interval is 50ms and the maximum is 1s. The minimum limit
is set to prevent overly frequent polling. The maximum limit is chosen as a
value high enough that, beyond it, monitors are most likely not needed and
psi averages can be used instead.

When activated, a psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when the system
is bouncing in and out of the stall state.

Notifications to userspace are rate-limited to one per tracking window.

The trigger will de-register when the file descriptor used to define the
trigger is closed.

Userspace monitor usage example
===============================

::

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <poll.h>
	#include <string.h>
	#include <unistd.h>

	/*
	 * Monitor memory partial stall with 1s tracking window size
	 * and 150ms threshold.
	 */
	int main(void) {
		const char trig[] = "some 150000 1000000";
		struct pollfd fds;
		int n;

		/* Each trigger needs its own file descriptor. */
		fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
		if (fds.fd < 0) {
			printf("/proc/pressure/memory open error: %s\n",
				strerror(errno));
			return 1;
		}
		fds.events = POLLPRI;

		/* Register the trigger by writing it to the open file. */
		if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
			printf("/proc/pressure/memory write error: %s\n",
				strerror(errno));
			return 1;
		}

		printf("waiting for events...\n");
		while (1) {
			/* Block until the trigger fires or the source goes away. */
			n = poll(&fds, 1, -1);
			if (n < 0) {
				printf("poll error: %s\n", strerror(errno));
				return 1;
			}
			if (fds.revents & POLLERR) {
				printf("got POLLERR, event source is gone\n");
				return 0;
			}
			if (fds.revents & POLLPRI) {
				printf("event triggered!\n");
			} else {
				printf("unknown event received: 0x%x\n", fds.revents);
				return 1;
			}
		}

		return 0;
	}

Cgroup2 interface
=================

In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.

Per-cgroup psi monitors can be specified and used the same way as
system-wide ones.
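
Since the per-cgroup files share the /proc/pressure/ format, a single
parser can serve both. The following is a rough sketch under that
assumption; the names ``struct psi_line``, ``psi_parse_line`` and
``psi_stall_pct`` are hypothetical and exist only for illustration. The
second helper shows one way to average a trend over a custom time frame
from two samples of the "total" field.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of a pressure file (hypothetical type for this sketch). */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* stall share in percent */
	unsigned long long total_us;	/* cumulative stall time in us */
};

/* Parse one line of a pressure file. Returns 0 on success, -1 otherwise. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total_us);

	/* All five fields must have been converted. */
	if (n != 5)
		return -1;
	if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
		return -1;
	return 0;
}

/*
 * Stall share (in %) over a custom interval, computed from two samples
 * of the "total" field taken elapsed_us microseconds apart.
 */
static double psi_stall_pct(unsigned long long total_prev,
			    unsigned long long total_now,
			    unsigned long long elapsed_us)
{
	if (elapsed_us == 0 || total_now < total_prev)
		return 0.0;
	return 100.0 * (double)(total_now - total_prev) / (double)elapsed_us;
}
```

A caller would read each line from /proc/pressure/memory, or from the
memory.pressure file of a cgroup directory, with fgets() and pass it to
``psi_parse_line()``. Sampling ``total_us`` twice and feeding both values to
``psi_stall_pct()`` gives the stall percentage for the interval in between.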