• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1# SCC finder core algorithm
2
3When comparing types recursively, we explore a graph whose nodes are pairs of
4types. Nodes can be visited more than once, and there can even be cycles.
5
6The standard tool for exploring directed graphs is Depth First Search. Dealing
7with cycles in such a graph requires the determination of its Strongly-Connected
8Components.
9
10DFS code for graph traversal is trivial. SCC determination is not and we would
11like it to be provided by a reliable and efficient library component.
12
13## Core algorithm
14
15There are three commonly-studied asymptotically-optimal approaches to
16determining the Strongly-Connected Components of a directed graph. Each of these
17admits various optimisations and specialisations for different purposes.
18
19* [Kosaraju's algorithm](https://en.wikipedia.org/wiki/Kosaraju%27s_algorithm)
20* [Tarjan's
21  algorithm](https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm)
22* [The path-based
23  algorithm](https://en.wikipedia.org/wiki/Path-based_strong_component_algorithm)
24
25Kosaraju's algorithm is unsuited to DFS-generated graphs (such as type
26comparison graphs) as it requires both forwards and reverse edges to be known
27ahead of time.
28
29Tarjan's algorithm can be massaged into the form where it can be separated into
30a plain DFS traversal and SCC-specific pieces but the resulting code is a bit
31messy and responsibility for SCC state management is rather scattered.
32
33The path-based algorithm is the best fit and can be put in a form where the DFS
34traversal and SCC state management are cleanly separated. The concept of "open"
35nodes carries directly over to the implementation used here and SCC state
36management occurs in two well-defined places.
37
38*   node visit starts; repeat visits to open nodes are detected
39*   node visit end; completed SCCs are detected
40
41The remainder of this document discusses the finer details of what the finder
42API should look like.
43
44## Choice of primitives
45
46The choice of primitives should be determined by the interface priorities:
47
48*   simple - hard to misuse
49*   efficient - does not preclude optimal implementation
50*   powerful - no need to code around deficiencies
51
52Roughly speaking, the SCC finder "opens" and "closes" nodes and the user code
53will try to open and close nodes at the beginning and end of each node visit.
54
55### Close
56
57The logic to follow when the SCC finder reports a complete SCC (nodes closed) is
58simple: the user code must ensure it considers each node as definitively visited
59from this point and avoid asking the finder to consider it again. This condition
60means the SCC does not have to independently track "ever visited" status that
61may be duplicated elsewhere.
62
63### Open
64
65When reaching a node via DFS, the user of the SCC finder has to perform a 3-way
66classification:
67
681.  never visited before - the node should immediately transition to open and
69    known to the SCC finder
701.  open - the link just followed would create a cycle and the SCC finder
71    algorithm needs to do some state maintenance; the user code must not
72    recursively process the node
731.  closed - the link just followed reaches a node already fully processed and
74    assigned to a SCC; the user code must not recursively process the node
75
76There are at least 3 different ways of structuring program logic to distinguish
77these paths.
78
79#### Populate user visited state on open
80
81Node lifecycle:
82
831. unvisited + not open
841. visited + open
851. visited + not open
86
87If a node has never been visited, it can be unconditionally opened. If it has
88been visited, we must still check if it's open. This is a bit odd in the context
89of the SCC algorithm as `really_open` and `is_open` become separate operations
90(-simplicity). It's possible that a user could know, for other reasons, whether
91a visited node is open or not and then omitting to call `is_open` could upset
92the SCC finder (which needs to update its internal state in the already-open
93case). This risk could be eliminated by duplicating the state update logic in
94both methods (-efficiency).
95
96```c++
97if (visited) {
98  if (is_open(node)) {
99    // cycle-breaking back link
100    return;
101  } else {
102    // work-saving link to shared node
103    return;
104  }
105}
106mark_visited();
107token = really_open(node);
108...
109// do work
110...
111nodes = close(token);
112if (!nodes.empty()) {
113  ...
114}
115```
116
117#### Check SCC open / closed first
118
119Node lifecycle:
120
1211. not open + unvisited (never visited)
1221. open (being visited)
1231. not open + visited (closed)
124
125This scheme also requires separate `is_open` and `really_open` operations as
126nodes musn't be reopened (-simplicity, -efficiency). It does allow the user to
127mark nodes as visited any time between open and close (-simplicity, +power).
128
129```c++
130if (is_open(node)) {
131  // cycle-breaking back link
132  return;
133}
134if (visited) {
135  // work-saving link to shared node
136  return;
137}
138token = really_open(node);
139...
140// do work
141...
142nodes = close(token);
143if (!nodes.empty()) {
144  // last chance to call mark_visited
145}
146```
147
148#### Test user visited state before open, populate on close
149
150NOTE: This is the currently implemented approach.
151
152Node lifecycle:
153
1541. unvisited + not open
1551. unvisited + open
1561. visited + not open
157
158This is the purest form of the algorithm with the `open` and `close` operations
159clearly bracketing "real" work. `really_open` and `is_open` operations are
160folded together into `open` which will fail to open an already open node
161(+simplicity, +efficiency).
162
163The user value store cannot be something that gets initialised and updated
164between open and close or it will also need to duplicate the open/closed state
165and this correspondence will need to be maintained as an invariant (-power).
166
167```c++
168if (visited) {
169  // work-saving link to shared node
170  return;
171}
172token = open(node);
173if (!token) {
174  // cycle-breaking back link
175  return;
176}
177...
178// do work
179...
180nodes = close(token.value());
181if (!nodes.empty()) {
182  ...
183  mark_visited();
184}
185```
186
187Evaluating equality can be made to fit nicely into this category, as can using
188equality along with the return stack to build a diff tree with stubs for
189sharing- and cycle-breaking links.
190
191However, building a graph (say a copy of the traversal, or a diff graph)
192requires open node state to be squirrelled away somewhere.
193
194##### Enhancement
195
196The SCC finder data structure can be made to carry values associated with open
197nodes and hand them to the user on failure-to-open and closure. This allows us
198to retain purity and regain the ability to maintain simple state for open nodes
199separately from that for closed nodes, at the expense of a slightly
200heavier-touch interface (+power).
201
202In the simplest case, we'd want nothing stored at all (beyond the node identity)
203and actually supplying a second empty type would be an annoyance and an
204inefficiency (-simplicity, -power, -efficiency)). So the best thing to supply is
205the user's container's `value_type` and associated `value_compare` comparator.
206
207However, in this variation, it's painful to set up the SCC structures for
208efficient `open` as nodes need to exist in a map or set, independently of any
209payload. The approach could be revisited if there's a solution to this.
210
211```c++
212if (visited) {
213  // work-saving link to shared node
214  return;
215}
216[&node_state, token] = open(node_state);
217if (!token) {
218  // cycle-breaking back link
219  return;
220}
221...
222// do work, update node_state if you like
223...
224node_states = close(token.value())
225if (!node_states.empty()) {
226  ...
227  mark_visited();
228}
229```
230