Started by Paul Jackson <pj@sgi.com>

The robust futex ABI
--------------------

Robust_futexes provide a mechanism that is used in addition to normal
futexes, for kernel assist of cleanup of held locks on task exit.

The interesting data as to what futexes a thread is holding is kept on a
linked list in user space, where it can be updated efficiently as locks
are taken and dropped, without kernel intervention. The only additional
kernel intervention required for robust_futexes above and beyond what is
required for futexes is:

 1) a one time call, per thread, to tell the kernel where its list of
    held robust_futexes begins, and
 2) internal kernel code at exit, to handle any listed locks held
    by the exiting thread.

The existing normal futexes already provide a "Fast Userspace Locking"
mechanism, which handles uncontested locking without needing a system
call, and handles contested locking by maintaining a list of waiting
threads in the kernel. Options on the sys_futex(2) system call support
waiting on a particular futex, and waking up the next waiter on a
particular futex.

For robust_futexes to work, the user code (typically in a library such
as glibc linked with the application) has to manage and place the
necessary list elements exactly as the kernel expects them. If it fails
to do so, then improperly listed locks will not be cleaned up on exit,
probably causing deadlock or other such failure of the other threads
waiting on the same locks.

A thread that anticipates possibly using robust_futexes should first
issue the system call:

    asmlinkage long
    sys_set_robust_list(struct robust_list_head __user *head, size_t len);

The pointer 'head' points to a structure in the thread's address space
consisting of three words. Each word is 32 bits on 32 bit arch's, or 64
bits on 64 bit arch's, and local byte order. Each thread should have
its own thread private 'head'.
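
As an illustration of the one time registration call, here is a minimal
sketch for Linux user space. It assumes the uapi header <linux/futex.h>
is installed and that the raw syscall(2) wrapper is used; the helper
names register_robust_list() and query_robust_list() are hypothetical
and exist only for this example. Note that the header names the second
word 'futex_offset'. A C library such as glibc may already have
registered its own list for the thread, and this call replaces that
registration, so treat this as an experiment rather than something to
do in a program that also uses the library's robust mutexes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>   /* struct robust_list_head, struct robust_list */
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_set_robust_list, SYS_get_robust_list */
#include <unistd.h>        /* syscall() */

/*
 * Hypothetical helper: initialise 'head' as an empty list and tell the
 * kernel where it is for the calling thread. Returns 0 on success.
 */
static long register_robust_list(struct robust_list_head *head)
{
	head->list.next = &head->list;  /* empty list points at itself */
	head->futex_offset = 0;         /* no locks yet, offset unused */
	head->list_op_pending = NULL;   /* no operation in progress */
	return syscall(SYS_set_robust_list, head, sizeof(*head));
}

/*
 * Hypothetical helper: ask the kernel which 'head' it has stored for
 * the calling thread (pid 0). Returns NULL if the call fails.
 */
static struct robust_list_head *query_robust_list(void)
{
	struct robust_list_head *head = NULL;
	size_t len = 0;

	if (syscall(SYS_get_robust_list, 0, &head, &len) != 0)
		return NULL;
	/* For a native task the kernel reports the size of the head. */
	assert(len == sizeof(*head));
	return head;
}
```

A caller would typically keep one such 'head' per thread, for example
in thread local storage, and register it once when the thread starts.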

If a thread is running in 32 bit compatibility mode on a 64 bit native
arch kernel, then it can actually have two such structures - one using
32 bit words for 32 bit compatibility mode, and one using 64 bit words
for 64 bit native mode. The kernel, if it is a 64 bit kernel supporting
32 bit compatibility mode, will attempt to process both lists on each
task exit, if the corresponding sys_set_robust_list() call has been made
to set up that list.

    The first word in the memory structure at 'head' contains a
    pointer to a singly linked list of 'lock entries', one per lock,
    as described below. If the list is empty, the pointer will point
    to itself, 'head'. The last 'lock entry' points back to the 'head'.

    The second word, called 'offset', specifies the signed offset,
    from the address of the associated 'lock entry', of what will be
    called the 'lock word'. The 'lock word' is always a 32 bit word,
    unlike the other words above. The 'lock word' holds 3 flag bits in
    the upper 3 bits, and the thread id (TID) of the thread holding the
    lock in the bottom 29 bits. See further below for a description of
    the flag bits.

    The third word, called 'list_op_pending', contains a transient copy
    of the address of the 'lock entry', during list insertion and
    removal, and is needed to correctly resolve races should a thread
    exit while in the middle of a locking or unlocking operation.

Each 'lock entry' on the singly linked list starting at 'head' consists
of just a single word, pointing to the next 'lock entry', or back to
'head' if there are no more entries. In addition, near each 'lock
entry', at an offset from the 'lock entry' specified by the 'offset'
word, is one 'lock word'.

The 'lock word' is always 32 bits, and is intended to be the same 32 bit
lock variable used by the futex mechanism, in conjunction with
robust_futexes.
The kernel will only be able to wake up the next thread
waiting for a lock on a thread's exit if that next thread used the futex
mechanism to register the address of that 'lock word' with the kernel.

For each futex lock currently held by a thread, if it wants this
robust_futex support for exit cleanup of that lock, it should have one
'lock entry' on this list, with its associated 'lock word' at the
specified 'offset'. Should a thread die while holding any such locks,
the kernel will walk this list, mark any such locks with a bit
indicating their holder died, and wake up the next thread waiting for
that lock using the futex mechanism.

When a thread has invoked the above system call to indicate it
anticipates using robust_futexes, the kernel stores the passed in 'head'
pointer for that task. The task may retrieve that value later on by
using the system call:

    asmlinkage long
    sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
                        size_t __user *len_ptr);

It is anticipated that threads will use robust_futexes embedded in
larger, user level locking structures, one per lock. The kernel
robust_futex mechanism doesn't care what else is in that structure, so
long as the 'offset' to the 'lock word' is the same for all
robust_futexes used by that thread. The thread should link those locks
it currently holds using the 'lock entry' pointers. It may also have
other links between the locks, such as the reverse side of a doubly
linked list, but that doesn't matter to the kernel.

By keeping its locks linked this way, on a list starting with a 'head'
pointer known to the kernel, the kernel can provide to a thread the
essential service available for robust_futexes, which is to help clean
up locks held at the time of a (perhaps unexpected) exit.
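
The layout described above can be sketched in plain C. This is only an
illustration under stated assumptions: the type names lock_entry,
list_head3 and my_lock, the macro MY_LOCK_OFFSET and both helper
functions are invented for this example and are not kernel or library
definitions. It shows a 'lock entry' embedded in a larger user level
lock, a single 'offset' shared by all locks of that type, and a walk
that ends when the list points back at 'head'.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof, size_t */
#include <stdint.h>

/* One word: pointer to the next 'lock entry', or back to the head. */
struct lock_entry {
	struct lock_entry *next;
};

/* The three word 'head' (illustrative names, not the kernel's). */
struct list_head3 {
	struct lock_entry list;              /* first word: start of list */
	long offset;                         /* second word: entry to lock word */
	struct lock_entry *list_op_pending;  /* third word: entry in transit */
};

/* A hypothetical user level lock embedding an entry and a lock word. */
struct my_lock {
	int app_data;             /* anything else the application wants */
	struct lock_entry entry;  /* links the locks held by one thread */
	uint32_t lock_word;       /* the 32 bit futex word */
};

/* The same value for every my_lock, as the mechanism requires. */
#define MY_LOCK_OFFSET \
	((long)offsetof(struct my_lock, lock_word) - \
	 (long)offsetof(struct my_lock, entry))

/* Locate the 'lock word' that belongs to a given 'lock entry'. */
static uint32_t *lock_word_of(struct lock_entry *e, long offset)
{
	assert(e != NULL);
	return (uint32_t *)((char *)e + offset);
}

/* Count entries by following 'next' until the list returns to head. */
static size_t count_held(struct list_head3 *head)
{
	size_t n = 0;

	for (struct lock_entry *e = head->list.next; e != &head->list;
	     e = e->next)
		n++;
	return n;
}
```

Because the offset is computed once from the structure type, the
application is free to add or reorder other members of my_lock without
touching the list handling code.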

Actual locking and unlocking, during normal operations, is handled
entirely by user level code in the contending threads, and by the
existing futex mechanism to wait for, and wake up, locks. The kernel's
only essential involvement in robust_futexes is to remember where the
list 'head' is, and to walk the list on thread exit, handling locks
still held by the departing thread, as described below.

There may exist thousands of futex lock structures in a thread's shared
memory, on various data structures, at a given point in time. Only those
lock structures for locks currently held by that thread should be on
that thread's robust_futex linked lock list at a given time.

A given futex lock structure in a user shared memory region may be held
at different times by any of the threads with access to that region. The
thread currently holding such a lock, if any, is marked with the thread's
TID in the lower 29 bits of the 'lock word'.

When adding or removing a lock from its list of held locks, in order for
the kernel to correctly handle lock cleanup regardless of when the task
exits (perhaps it gets an unexpected signal 9 in the middle of
manipulating this list), the user code must observe the following
protocol on 'lock entry' insertion and removal:

On insertion:
 1) set the 'list_op_pending' word to the address of the 'lock entry'
    to be inserted,
 2) acquire the futex lock,
 3) add the lock entry, with its thread id (TID) in the bottom 29 bits
    of the 'lock word', to the linked list starting at 'head', and
 4) clear the 'list_op_pending' word.

On removal:
 1) set the 'list_op_pending' word to the address of the 'lock entry'
    to be removed,
 2) remove the lock entry for this lock from the 'head' list,
 3) release the futex lock, and
 4) clear the 'list_op_pending' word.
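
The ordering of the insertion and removal steps can be sketched with
C11 atomics. This is a simplified user space model, not a usable lock:
it only tries the lock once instead of waiting with the futex call, it
does not wake waiters on release, and it omits the compiler and memory
barriers a real implementation needs around the list updates. The
'list_op_pending' word is given the address of the 'lock entry', as in
the description of the third word of 'head'. All names here
(robust_head, my_lock, robust_trylock, robust_unlock, TID_MASK) are
invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TID_MASK 0x1fffffffu  /* bottom 29 bits, as described in the text */

struct lock_entry { struct lock_entry *next; };

struct robust_head {
	struct lock_entry list;              /* start of the held-lock list */
	long offset;                         /* entry to lock word */
	struct lock_entry *list_op_pending;  /* entry being added or removed */
};

struct my_lock {
	struct lock_entry entry;
	_Atomic uint32_t lock_word;
};

/* Try to take 'l' for thread 'tid'; false if it is already held. */
static bool robust_trylock(struct robust_head *h, struct my_lock *l,
			   uint32_t tid)
{
	uint32_t expected = 0;

	h->list_op_pending = &l->entry;              /* 1) announce intent */
	if (!atomic_compare_exchange_strong(&l->lock_word, &expected,
					    tid & TID_MASK)) { /* 2) acquire */
		h->list_op_pending = NULL;           /* lock was not taken */
		return false;
	}
	l->entry.next = h->list.next;                /* 3) link into list */
	h->list.next = &l->entry;
	h->list_op_pending = NULL;                   /* 4) operation done */
	return true;
}

/* Release 'l', which must currently be on the list of 'h'. */
static void robust_unlock(struct robust_head *h, struct my_lock *l)
{
	struct lock_entry *prev = &h->list;

	h->list_op_pending = &l->entry;              /* 1) announce intent */
	while (prev->next != &l->entry) {            /* 2) find and unlink */
		prev = prev->next;
		assert(prev != &h->list);            /* entry must be listed */
	}
	prev->next = l->entry.next;
	atomic_store(&l->lock_word, 0);              /* 3) release the lock */
	h->list_op_pending = NULL;                   /* 4) operation done */
}
```

In both directions the pending word covers the window in which the
lock word and the list disagree, which is the window an unexpected
exit could otherwise hit unnoticed.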

On exit, the kernel will consider the address stored in
'list_op_pending' and the address of each 'lock entry' found by walking
the list starting at 'head'. For each such address, if the bottom 29
bits of the 'lock word' at offset 'offset' from that address equal the
exiting thread's TID, then the kernel will do two things:

 1) if bit 31 (0x80000000) is set in that word, then attempt a futex
    wakeup on that address, which will wake the next thread that has
    used the futex mechanism to wait on that address, and
 2) atomically set bit 30 (0x40000000) in the 'lock word'.

In the above, bit 31 was set by futex waiters on that lock to indicate
they were waiting, and bit 30 is set by the kernel to indicate that the
lock owner died holding the lock.

The kernel exit code will silently stop scanning the list further if at
any point:

 1) the 'head' pointer or a subsequent linked list pointer
    is not a valid address of a user space word,
 2) the calculated location of the 'lock word' (address plus
    'offset') is not the valid address of a 32 bit user space
    word, or
 3) the list contains more than 1 million (subject to
    future kernel configuration changes) elements.

When the kernel sees a list entry whose 'lock word' doesn't have the
current thread's TID in the lower 29 bits, it does nothing with that
entry, and goes on to the next entry.

Bit 29 (0x20000000) of the 'lock word' is reserved for future use.
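
The exit time handling can be modelled in user space to make the bit
tests concrete. This is a single threaded sketch of the rules stated
above, not kernel code: a wakeup is only counted rather than issued,
the owner-died bit is set with a plain store instead of an atomic
operation, and address validation is reduced to a NULL check. The
constant and function names are invented for this example, and the
29 bit TID mask follows the description in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAITERS_BIT    0x80000000u  /* bit 31: set by waiters */
#define OWNER_DIED_BIT 0x40000000u  /* bit 30: set at owner exit */
#define TID_MASK       0x1fffffffu  /* bottom 29 bits, per the text */
#define WALK_LIMIT     1000000      /* the "1 million" element limit */

struct lock_entry { struct lock_entry *next; };

struct robust_head {
	struct lock_entry list;
	long offset;
	struct lock_entry *list_op_pending;
};

/* Handle one entry; returns 1 if a futex wakeup would be attempted. */
static int handle_entry(struct lock_entry *e, long offset, uint32_t tid)
{
	uint32_t *word = (uint32_t *)((char *)e + offset);
	int wake = 0;

	if ((*word & TID_MASK) != tid)
		return 0;                /* held by someone else: skip */
	if (*word & WAITERS_BIT)
		wake = 1;                /* 1) a waiter would be woken */
	*word |= OWNER_DIED_BIT;         /* 2) record that the owner died */
	return wake;
}

/* Model of the walk for exiting thread 'tid'; returns wakeup count. */
static int model_exit_walk(struct robust_head *h, uint32_t tid)
{
	int wakes = 0;
	long seen = 0;

	for (struct lock_entry *e = h->list.next; e != &h->list;
	     e = e->next) {
		if (e == NULL || ++seen > WALK_LIMIT)
			break;           /* stop scanning silently */
		if (e != h->list_op_pending)  /* pending is handled below */
			wakes += handle_entry(e, h->offset, tid);
	}
	if (h->list_op_pending != NULL)
		wakes += handle_entry(h->list_op_pending, h->offset, tid);
	return wakes;
}
```

A thread that later finds the owner-died bit set in a 'lock word' knows
that the data protected by the lock may be inconsistent and can decide
how to recover before continuing.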