1# SCC finder core algorithm 2 3When comparing types recursively, we explore a graph whose nodes are pairs of 4types. Nodes can be visited more than once, and there can even be cycles. 5 6The standard tool for exploring directed graphs is Depth First Search. Dealing 7with cycles in such a graph requires the determination of its Strongly-Connected 8Components. 9 10DFS code for graph traversal is trivial. SCC determination is not and we would 11like it to be provided by a reliable and efficient library component. 12 13## Core algorithm 14 15There are three commonly-studied asymptotically-optimal approaches to 16determining the Strongly-Connected Components of a directed graph. Each of these 17admits various optimisations and specialisations for different purposes. 18 19* [Kosaraju's algorithm](https://en.wikipedia.org/wiki/Kosaraju%27s_algorithm) 20* [Tarjan's 21 algorithm](https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm) 22* [The path-based 23 algorithm](https://en.wikipedia.org/wiki/Path-based_strong_component_algorithm) 24 25Kosaraju's algorithm is unsuited to DFS-generated graphs (such as type 26comparison graphs) as it requires both forwards and reverse edges to be known 27ahead of time. 28 29Tarjan's algorithm can be massaged into the form where it can be separated into 30a plain DFS traversal and SCC-specific pieces but the resulting code is a bit 31messy and responsibility for SCC state management is rather scattered. 32 33The path-based algorithm is the best fit and can be put in a form where the DFS 34traversal and SCC state management are cleanly separated. The concept of "open" 35nodes carries directly over to the implementation used here and SCC state 36management occurs in two well-defined places. 37 38* node visit starts; repeat visits to open nodes are detected 39* node visit end; completed SCCs are detected 40 41The remainder of this document discusses the finer details of what the finder 42API should look like. 43 44## Choice of primitives 45 46The choice of primitives should be determined by the interface priorities: 47 48* simple - hard to misuse 49* efficient - does not preclude optimal implementation 50* powerful - no need to code around deficiencies 51 52Roughly speaking, the SCC finder "opens" and "closes" nodes and the user code 53will try to open and close nodes at the beginning and end of each node visit. 54 55### Close 56 57The logic to follow when the SCC finder reports a complete SCC (nodes closed) is 58simple: the user code must ensure it considers each node as definitively visited 59from this point and avoid asking the finder to consider it again. This condition 60means the SCC does not have to independently track "ever visited" status that 61may be duplicated elsewhere. 62 63### Open 64 65When reaching a node via DFS, the user of the SCC finder has to perform a 3-way 66classification: 67 681. never visited before - the node should immediately transition to open and 69 known to the SCC finder 701. open - the link just followed would create a cycle and the SCC finder 71 algorithm needs to do some state maintenance; the user code must not 72 recursively process the node 731. closed - the link just followed reaches a node already fully processed and 74 assigned to a SCC; the user code must not recursively process the node 75 76There are at least 3 different ways of structuring program logic to distinguish 77these paths. 78 79#### Populate user visited state on open 80 81Node lifecycle: 82 831. unvisited + not open 841. visited + open 851. visited + not open 86 87If a node has never been visited, it can be unconditionally opened. If it has 88been visited, we must still check if it's open. This is a bit odd in the context 89of the SCC algorithm as `really_open` and `is_open` become separate operations 90(-simplicity). It's possible that a user could know, for other reasons, whether 91a visited node is open or not and then omitting to call `is_open` could upset 92the SCC finder (which needs to update its internal state in the already-open 93case). This risk could be eliminated by duplicating the state update logic in 94both methods (-efficiency). 95 96```c++ 97if (visited) { 98 if (is_open(node)) { 99 // cycle-breaking back link 100 return; 101 } else { 102 // work-saving link to shared node 103 return; 104 } 105} 106mark_visited(); 107token = really_open(node); 108... 109// do work 110... 111nodes = close(token); 112if (!nodes.empty()) { 113 ... 114} 115``` 116 117#### Check SCC open / closed first 118 119Node lifecycle: 120 1211. not open + unvisited (never visited) 1221. open (being visited) 1231. not open + visited (closed) 124 125This scheme also requires separate `is_open` and `really_open` operations as 126nodes musn't be reopened (-simplicity, -efficiency). It does allow the user to 127mark nodes as visited any time between open and close (-simplicity, +power). 128 129```c++ 130if (is_open(node)) { 131 // cycle-breaking back link 132 return; 133} 134if (visited) { 135 // work-saving link to shared node 136 return; 137} 138token = really_open(node); 139... 140// do work 141... 142nodes = close(token); 143if (!nodes.empty()) { 144 // last chance to call mark_visited 145} 146``` 147 148#### Test user visited state before open, populate on close 149 150NOTE: This is the currently implemented approach. 151 152Node lifecycle: 153 1541. unvisited + not open 1551. unvisited + open 1561. visited + not open 157 158This is the purest form of the algorithm with the `open` and `close` operations 159clearly bracketing "real" work. `really_open` and `is_open` operations are 160folded together into `open` which will fail to open an already open node 161(+simplicity, +efficiency). 162 163The user value store cannot be something that gets initialised and updated 164between open and close or it will also need to duplicate the open/closed state 165and this correspondence will need to be maintained as an invariant (-power). 166 167```c++ 168if (visited) { 169 // work-saving link to shared node 170 return; 171} 172token = open(node); 173if (!token) { 174 // cycle-breaking back link 175 return; 176} 177... 178// do work 179... 180nodes = close(token.value()); 181if (!nodes.empty()) { 182 ... 183 mark_visited(); 184} 185``` 186 187Evaluating equality can be made to fit nicely into this category, as can using 188equality along with the return stack to build a diff tree with stubs for 189sharing- and cycle-breaking links. 190 191However, building a graph (say a copy of the traversal, or a diff graph) 192requires open node state to be squirrelled away somewhere. 193 194##### Enhancement 195 196The SCC finder data structure can be made to carry values associated with open 197nodes and hand them to the user on failure-to-open and closure. This allows us 198to retain purity and regain the ability to maintain simple state for open nodes 199separately from that for closed nodes, at the expense of a slightly 200heavier-touch interface (+power). 201 202In the simplest case, we'd want nothing stored at all (beyond the node identity) 203and actually supplying a second empty type would be an annoyance and an 204inefficiency (-simplicity, -power, -efficiency)). So the best thing to supply is 205the user's container's `value_type` and associated `value_compare` comparator. 206 207However, in this variation, it's painful to set up the SCC structures for 208efficient `open` as nodes need to exist in a map or set, independently of any 209payload. The approach could be revisited if there's a solution to this. 210 211```c++ 212if (visited) { 213 // work-saving link to shared node 214 return; 215} 216[&node_state, token] = open(node_state); 217if (!token) { 218 // cycle-breaking back link 219 return; 220} 221... 222// do work, update node_state if you like 223... 224node_states = close(token.value()) 225if (!node_states.empty()) { 226 ... 227 mark_visited(); 228} 229``` 230