• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1Mechanisms for Coordination Between Garbage Collector and Mutator
2-----------------------------------------------------------------
3
4Most garbage collection work can proceed concurrently with the client or
5mutator Java threads. But in certain places, for example while tracing from
6thread stacks, the garbage collector needs to ensure that Java data processed
7by the collector is consistent and complete. At these points, the mutators
8should not hold references to the heap that are invisible to the garbage
9collector. And they should not be modifying the data that is visible to the
10collector.
11
12Logically, the collector and mutator share a reader-writer lock on the Java
13heap and associated data structures. Mutators hold the lock in reader or shared mode
14while running Java code or touching heap-related data structures. The collector
15holds the lock in writer or exclusive mode while it needs the heap data
16structures to be stable. However, this reader-writer lock has a very customized
17implementation that also provides additional facilities, such as the ability
18to exclude only a single thread, so that we can specifically examine its heap
19references.
20
21In order to ensure consistency of the Java data, the compiler inserts "suspend
22points", sometimes also called "safe points" into the code. These allow a thread
23to respond to external requests.
24
25Whenever a thread is runnable, i.e. whenever a thread logically holds the
26mutator lock in shared mode, it is expected to regularly execute such a suspend
27point, and check for pending requests. They are currently implemented by
28setting a flag in the thread structure[^1], which is then explicitly tested by the
29compiler-generated code.
30
31A thread responds to suspend requests only when it is "runnable", i.e. logically
32running Java code. When it runs native code, or is blocked in a kernel call, it
33logically releases the mutator lock. When the garbage collector needs mutator
34cooperation, and the thread is not runnable, it is assured that the mutator is
35not touching Java data, and hence the collector can safely perform the required
36action itself, on the mutator thread's behalf.
37
38Normally, when a thread makes a JNI call, it is not considered runnable while
39executing native code. This makes the transitions to and from running native JNI
40code somewhat expensive (see below). But these transitions are necessary to
41ensure that such code, which does not execute "suspend points", and can thus not
42cooperate with the GC, doesn't delay GC completion. `@FastNative` and
43`@CriticalNative` calls avoid these transitions, instead allowing the thread to
44remain "runnable", at the expense of potentially delaying GC operations for the
45duration of the call.
46
47Although we say that a thread is "suspended" when it is not running Java code,
48it may in fact still be running native code and touching data structures that
49are not considered "Java data". This distinction can be a fine line. For
50example, a Java thread blocked on a Java monitor will normally be "suspended"
51and blocked on a mutex contained in the monitor data structure. But it may wake
52up for reasons beyond ARTs control, which will normally result in touching the
53mutex. The monitor code must be quite careful to ensure that this does not cause
54problems, especially if the ART runtime was shut down in the interim and the
55monitor data structure has been reclaimed.
56
57Calls to change thread state
58----------------------------
59
60When a thread changes between running Java and native code, it has to
61correspondingly change its state between "runnable" and one of several
62other states, all of which are considered to be "suspended" for our purposes.
63When a Java thread starts to execute native code, and may thus not respond
64promptly to suspend requests, it will normally create an object of type
65`ScopedThreadSuspension`. `ScopedThreadSuspension`'s constructor changes state to
66the "suspended" state given as an argument, logically releasing the mutator lock
67and promising to no longer touch Java data structures. It also handles any
68pending suspension requests that slid in just before it changed state.
69
70Conversely, `ScopedThreadSuspension`'s destructor waits until the GC has finished
71any actions it is currently performing on the thread's behalf and effectively
72released the mutator exclusive lock, and then returns to runnable state,
73re-acquiring the mutator lock.
74
75Occasionally a thread running native code needs to temporarily again access Java
76data structures, performing the above transitions in the opposite order.
77`ScopedObjectAccess` is a similar RAII object whose constructor and destructor
78perform those transitions in the reverse order from `ScopedThreadSuspension`.
79
80Mutator lock implementation
81---------------------------
82
83The mutator lock is not implemented as a conventional mutex. But it plays by the
84rules of our normal static thread-safety analysis. Thus a function that is
85expected to be called in runnable state, with the ability to access Java data,
86should be annotated with `REQUIRES_SHARED(Locks::mutator_lock_)`.
87
88There is an explicit `mutator_lock_` object, of type `MutatorMutex`. `MutatorMutex` is
89seemingly a minor refinement of `ReaderWriterMutex`, but it is used entirely
90differently. It is acquired explicitly by clients that need to hold it
91exclusively, and in a small number of cases, it is acquired in shared mode, e.g.
92via `SharedTryLock()`, or by the GC itself. However, more commonly
93`MutatorMutex::TransitionFromSuspendedToRunnable()`, is used to logically acquire
94the mutator mutex, e.g. as part of `ScopedObjectAccess` construction.
95
96`TransitionFromSuspendedToRunnable()` does not physically acquire the
97`ReaderWriterMutex` in shared mode. Thus any thread acquiring the lock in exclusive mode
98must, in addition, explicitly arrange for mutator threads to be suspended via the
99thread suspension mechanism, and then make them runnable again on release.
100
101Logically the mutator lock is held in shared/reader mode if ***either*** the
102underlying reader-writer lock is held in shared mode, ***or*** if a mutator is in
103runnable state.
104
105Suspension and checkpoint API
106-----------------------------
107
108Suspend point checks enable three kinds of communication with mutator threads:
109
110**Checkpoints**
111: Checkpoint requests are used to get a thread to perform an action
112on our behalf. `RequestCheckpoint()` asks a specific thread to execute the closure
113supplied as an argument at its leisure. `RequestSynchronousCheckpoint()` in
114addition waits for the thread to complete running the closure, and handles
115suspended threads by running the closure on their behalf. In addition to these
116functions provided by `Thread`, `ThreadList` provides the `RunCheckpoint()` function
117that runs a checkpoint function on behalf of each thread, either by using
118`RequestCheckpoint()` to run it inside a running thread, or by ensuring that a
119suspended thread stays suspended, and then running the function on its behalf.
120`RunCheckpoint()` does not wait for completion of the function calls triggered by
121the resulting `RequestCheckpoint()` invocations.
122
123**Empty checkpoints**
124: ThreadList provides `RunEmptyCheckpoint()`, which waits until
125all threads have either passed a suspend point, or have been suspended. This
126ensures that no thread is still executing Java code inside the same
127suspend-point-delimited code interval it was executing before the call. For
128example, a read-barrier started before a `RunEmptyCheckpoint()` call will have
129finished before the call returns.
130
131**Thread suspension**
132: ThreadList provides a number of `SuspendThread...()` calls and
133a `SuspendAll()` call to suspend one or all threads until they are resumed by
134`Resume()` or `ResumeAll()`. The `Suspend...` calls guarantee that the target
135thread(s) are suspended (again, only in the sense of not running Java code)
136when the call returns.
137
138Deadlock freedom
139----------------
140
141It is easy to deadlock while attempting to run checkpoints, or suspending
142threads. In particular, we need to avoid situations in which we cannot suspend
143a thread because it is blocked, directly, or indirectly, on the GC completing
144its task. Deadlocks are avoided as follows:
145
146**Mutator lock ordering**
147The mutator lock participates in the normal ART lock ordering hierarchy, as though it
148were a regular lock. See `base/locks.h` for the hierarchy. In particular, only
149locks at or below level `kPostMutatorTopLockLevel` may be acquired after
150acquiring the mutator lock, e.g. inside the scope of a `ScopedObjectAccess`.
151Similarly only locks at level strictly above `kMutatatorLock` may be held while
152acquiring the mutator lock, e.g. either by starting a `ScopedObjectAccess`, or
153ending a `ScopedThreadSuspension`.
154
155This ensures that code that uses purely mutexes and threads state changes cannot
156deadlock: Since we always wait on a lower-level lock, the holder of the
157lowest-level lock can always progress. An attempt to initiate a checkpoint or to
158suspend another thread must also be treated as an acquisition of the mutator
159lock: A thread that is waiting for a lock before it can respond to the request
160is itself holding the mutator lock, and can only be blocked on lower-level
161locks. And acquisition of those can never depend on acquiring the mutator
162lock.
163
164**Checkpoints**
165Running a checkpoint in a thread requires suspending that thread for the
166duration of the checkpoint, or running the checkpoint on the threads behalf
167while that thread is blocked from executing Java code. In the former case, the
168checkpoint code is run from `CheckSuspend`, which requires the mutator lock,
169so checkpoint code may only acquire mutexes at or below level
170`kPostMutatorTopLockLevel`. But that is not sufficient.
171
172No matter whether the checkpoint is run in the target thread, or on its behalf,
173the target thread is effectively suspended and prevented from running Java code.
174However the target may hold arbitrary Java monitors, which it can no longer
175release. This may also prevent higher level mutexes from getting released.  Thus
176checkpoint code should only acquire mutexes at level `kPostMonitorLock` or
177below.
178
179
180**Waiting**
181This becomes much more problematic when we wait for something other than a lock.
182Waiting for something that may depend on the GC, while holding the mutator lock,
183can potentially lead to deadlock, since it will prevent the waiting thread from
184participating in GC checkpoints. Waiting while holding a lower-level lock like
185`thread_list_lock_` is similarly unsafe in general, since a runnable thread may
186not respond to checkpoints until it acquires `thread_list_lock_`. In general,
187waiting for a condition variable while holding an unrelated lock is problematic,
188and these are specific instances of that general problem.
189
190We do currently provide `WaitHoldingLocks`, and it is sometimes used with
191low-level locks held. But such code must somehow ensure that such waits
192eventually terminate without deadlock.
193
194One common use of WaitHoldingLocks is to wait for weak reference processing.
195Special rules apply to avoid deadlocks in this case: Such waits must start after
196weak reference processing is disabled; the GC may not issue further nonempty
197checkpoints or suspend requests until weak reference processing has been
198reenabled, and threads have been notified. Thus the waiting thread's inability
199to respond to nonempty checkpoints and suspend requests cannot directly block
200the GC. Non-GC checkpoint or suspend requests that target a thread waiting on
201reference processing will block until reference processing completes.
202
203Consider a case in which thread W1 waits on reference processing, while holding
204a low-level mutex M. Thread W2 holds the mutator lock and waits on M. We avoid a
205situation in which the GC needs to suspend or checkpoint W2 by briefly stopping
206the world to disable weak reference access. During the stop-the-world phase, W1
207cannot yet be waiting for weak-reference access.  Thus there is no danger of
208deadlock while entering this phase. After this phase, there is no need for W2 to
209suspend or execute a nonempty checkpoint. If we replaced the stop-the-world
210phase by a checkpoint, W2 could receive the checkpoint request too late, and be
211unable to respond.
212
213Empty checkpoints can continue to occur during reference processing.  Reference
214processing wait loops explicitly handle empty checkpoints, and an empty
215checkpoint request notifies the condition variable used to wait for reference
216processing, after acquiring `reference_processor_lock_`.  This means that empty
217checkpoints do not preclude client threads from being in the middle of an
218operation that involves a weak reference access, while nonempty checkpoints do.
219
220
221[^1]: Some comments in the code refer to a not-yet-really-implemented scheme in
222which the compiler-generated code would load through the address at
223`tlsPtr_.suspend_trigger`. A thread suspension is requested by setting this to
224null, triggering a `SIGSEGV`, causing that thread to check for GC cooperation
225requests. The real mechanism instead sets an appropriate `ThreadFlag` entry to
226request suspension or a checkpoint. Note that the actual checkpoint function
227value is set, along with the flag, while holding `suspend_count_lock_`. If the
228target thread notices that a checkpoint is requested, it then acquires
229the `suspend_count_lock_` to read the checkpoint function.
230