<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
  <meta http-equiv="Content-Language" content="en-us">
  <meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
  <meta name="GENERATOR" content="Microsoft FrontPage 6.0">
  <meta name="ProgId" content="FrontPage.Editor.Document">
  <link rel="stylesheet" type="text/css" href="../../../boost.css">

  <title>The Boost Statechart Library - Performance</title>
</head>

<body link="#0000FF" vlink="#800080">
  <table border="0" cellpadding="7" cellspacing="0" width="100%" summary=
  "header">
    <tr>
      <td valign="top" width="300">
        <h3><a href="../../../index.htm"><img alt="C++ Boost" src=
        "../../../boost.png" border="0" width="277" height="86"></a></h3>
      </td>

      <td valign="top">
        <h1 align="center">The Boost Statechart Library</h1>

        <h2 align="center">Performance</h2>
      </td>
    </tr>
  </table>
  <hr>

  <dl class="index">
    <dt><a href="#SpeedVersusScalabilityTradeoffs">Speed versus scalability
    tradeoffs</a></dt>

    <dt><a href="#MemoryManagementCustomization">Memory management
    customization</a></dt>

    <dt><a href="#RttiCustomization">RTTI customization</a></dt>

    <dt><a href="#ResourceUsage">Resource usage</a></dt>
  </dl>

  <h2><a name="SpeedVersusScalabilityTradeoffs" id=
  "SpeedVersusScalabilityTradeoffs">Speed versus scalability
  tradeoffs</a></h2>

  <p>Quite a bit of effort has gone into making the library fast for small,
  simple machines <b>and</b> scalable at the same time (this applies only to
  <code>state_machine&lt;&gt;</code>; there still is some room for optimizing
  <code>fifo_scheduler&lt;&gt;</code>, especially for multi-threaded builds).
  While I believe it should perform reasonably in most applications, the
  scalability does not come for free. Small, carefully handcrafted state
  machines will thus easily outperform equivalent Boost.Statechart machines.
To get a picture of how big the gap is, I implemented a simple benchmark in
  the BitMachine example. The Handcrafted example is a handcrafted variant of
  the 1-bit-BitMachine implementing the same benchmark.</p>

  <p>I tried to create a fair but somewhat unrealistic <b>worst-case</b>
  scenario:</p>

  <ul>
    <li>For both machines, exactly one object of the only event is allocated
    before starting the test. This same object is then sent to the machines
    over and over</li>

    <li>The Handcrafted machine employs GOF-visitor double dispatch. The
    states are preallocated so that event dispatch &amp; transition amounts
    to nothing more than two virtual calls and one pointer assignment</li>
  </ul>

  <p>The benchmarks, running on a 3.2GHz Intel Pentium 4, produced the
  following dispatch and transition times per event:</p>

  <ul>
    <li>Handcrafted:

      <ul>
        <li>MSVC7.1: 10ns</li>

        <li>GCC3.4.2: 20ns</li>

        <li>Intel9.0: 20ns</li>
      </ul>
    </li>

    <li>1-bit-BitMachine with customized memory management:

      <ul>
        <li>MSVC7.1: 150ns</li>

        <li>GCC3.4.2: 320ns</li>

        <li>Intel9.0: 170ns</li>
      </ul>
    </li>
  </ul>

  <p>Although this is a big difference, I still think it will not be
  noticeable in most real-world applications. No matter whether an
  application uses handcrafted or Boost.Statechart machines, it will...</p>

  <ul>
    <li>almost never run into a situation where a state machine is swamped
    with as many events as in the benchmarks. A state machine will almost
    always spend a good deal of time waiting for events (which typically come
    from a human operator, from machinery or from electronic devices over
    often comparatively slow I/O channels). Parsers are just about the only
    application of FSMs where this is not the case.
However, parser FSMs are
    usually not directly specified at the state machine level but at a higher
    one that is better suited for the task. Examples of such higher levels
    are: Boost.Regex, Boost.Spirit, XML Schemas, etc. Moreover, the nature of
    parsers allows for a number of optimizations that are not possible in a
    general-purpose FSM framework.<br>
    Bottom line: While it is possible to implement a parser with this
    library, it is almost never advisable to do so because other approaches
    lead to better-performing and more expressive code</li>

    <li>often run state machines in their own threads. This adds considerable
    locking and thread-switching overhead. Performance tests with the
    PingPong example, where two asynchronous state machines exchange events,
    gave the following times to process one event and perform the resulting
    in-state reaction (using the library with
    <code>boost::fast_pool_allocator&lt;&gt;</code>):

      <ul>
        <li>Single-threaded (no locking and waiting): 840ns / 840ns</li>

        <li>Multi-threaded with one thread (the scheduler uses mutex locking
        but never has to wait for events): 6500ns / 4800ns</li>

        <li>Multi-threaded with two threads (both schedulers use mutex
        locking and exactly one always waits for an event): 14000ns /
        7000ns</li>
      </ul>

      <p>As mentioned above, there definitely is some room to improve the
      timings for the asynchronous machines. Moreover, these are very crude
      benchmarks, designed to show the overhead of locking and thread context
      switching. The overhead in a real-world application will typically be
      smaller, and other operating systems can certainly do better in this
      area. However, I strongly believe that on most platforms the threading
      overhead is usually larger than the time that Boost.Statechart spends
      on event dispatch and transition.
Handcrafted machines will inevitably
      have the same overhead, making raw single-threaded dispatch and
      transition speed much less important</p>
    </li>

    <li>almost always allocate events with <code>new</code> and destroy them
    after consumption. This will add a few cycles, even if event memory
    management is customized</li>

    <li>often use state machines that employ orthogonal states and other
    advanced features. This forces the handcrafted machines to use a more
    adequate and more time-consuming book-keeping</li>
  </ul>

  <p>Therefore, in real-world applications event dispatch and transition
  does not normally constitute a bottleneck, and the relative gap between
  handcrafted and Boost.Statechart machines also becomes much smaller than
  in the worst-case scenario.</p>

  <h3>Detailed performance data</h3>

  <p>In an effort to identify the main performance bottleneck, the example
  program "Performance" has been written. It measures the time that is spent
  to process one event in different BitMachine variants. In contrast to the
  BitMachine example, which uses only transitions, Performance uses a varying
  number of in-state reactions together with transitions. The only difference
  between in-state reactions and transitions is that the former neither enter
  nor exit any states. Apart from that, the same amount of code needs to be
  run to dispatch an event and execute the resulting action.</p>

  <p>The following diagrams show the average time the library spends to
  process one event, depending on the percentage of in-state reactions
  employed. 0% means that all triggered reactions are transitions. 100% means
  that all triggered reactions are in-state reactions.
I draw the following
  conclusions from these measurements:</p>

  <ul>
    <li>The fairly linear course of the curves suggests that the measurements
    with a 100% in-state reaction ratio are accurate and not merely a product
    of optimizations in the compiler. Such optimizations might have been
    possible due to the fact that in the 100% case it is known at
    compile time that the current state will never change</li>

    <li>The data points with 100% in-state reaction ratio and speed-optimized
    RTTI show that modern compilers seem to inline the complex-looking
    dispatch code so aggressively that dispatch is reduced to little more
    than what it actually is: one virtual function call followed by a linear
    search for a suitable reaction. For instance, in the case of the 1-bit
    BitMachine, Intel9.0 seems to produce dispatch code that is as
    efficient as the two virtual function calls in the Handcrafted
    machine</li>

    <li>On all compilers and in all variants the time spent in event dispatch
    is dwarfed by the time spent to exit the current state and enter the
    target state. It is worth noting that BitMachine is a flat and
    non-orthogonal state machine, representing a close-to-worst case.
    Real-world machines will often exit and enter multiple states during a
    transition, which further dwarfs pure dispatch time. This makes the
    implementation of constant-time dispatch (as requested by a few people
    during formal review) an undertaking with little merit. Instead, the
    future optimization effort will concentrate on state entry and
    state exit</li>

    <li>Intel9.0 seems to have problems optimizing/inlining code as soon as
    the amount of code grows beyond a certain threshold. Unlike with the
    other two compilers, I needed to compile the tests for the 1, 2, 3 and
    4-bit BitMachine into separate executables to get good performance.
Even then,
    the performance of the 4-bit BitMachine was still disproportionately
    bad. It was much worse when I compiled all four tests into the same
    executable. This surely looks like a bug in the compiler</li>
  </ul>

  <h4>Out of the box</h4>

  <p>The library is used as is, without any optimizations/modifications.</p>

  <p><img alt="PerformanceNormal1" src="PerformanceNormal1.gif" border="0"
  width="371" height="284"><img alt="PerformanceNormal2" src=
  "PerformanceNormal2.gif" border="0" width="371" height="284"><img alt=
  "PerformanceNormal3" src="PerformanceNormal3.gif" border="0" width="371"
  height="284"><img alt="PerformanceNormal4" src="PerformanceNormal4.gif"
  border="0" width="371" height="284"></p>

  <h4>Native RTTI</h4>

  <p>The library is used with <code><a href=
  "configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code>
  defined.</p>

  <p><img alt="PerformanceNative1" src="PerformanceNative1.gif" border="0"
  width="371" height="284"><img alt="PerformanceNative2" src=
  "PerformanceNative2.gif" border="0" width="371" height="284"><img alt=
  "PerformanceNative3" src="PerformanceNative3.gif" border="0" width="371"
  height="284"><img alt="PerformanceNative4" src="PerformanceNative4.gif"
  border="0" width="371" height="284"></p>

  <h4>Customized memory management</h4>

  <p>The library is used with customized memory management
  (<code>boost::fast_pool_allocator</code>).</p>

  <p><img alt="PerformanceCustom1" src="PerformanceCustom1.gif" border="0"
  width="371" height="284"><img alt="PerformanceCustom2" src=
  "PerformanceCustom2.gif" border="0" width="371" height="284"><img alt=
  "PerformanceCustom3" src="PerformanceCustom3.gif" border="0" width="371"
  height="284"><img alt="PerformanceCustom4" src="PerformanceCustom4.gif"
  border="0" width="371" height="284"></p>

  <h3><a name="DoubleDispatch" id="DoubleDispatch">Double
dispatch</a></h3>

  <p>At the heart of every state machine lies an implementation of double
  dispatch. This is due to the fact that the incoming event <b>and</b> the
  active state define exactly which <a href=
  "definitions.html#Reaction">reaction</a> the state machine will produce.
  For each event dispatch, one virtual call is followed by a linear search
  for the appropriate reaction, using one RTTI comparison per reaction. The
  following alternatives were considered but rejected:</p>

  <ul>
    <li><a href=
    "http://www.objectmentor.com/resources/articles/acv.pdf">Acyclic
    visitor</a>: This double-dispatch variant satisfies all scalability
    requirements but performs badly due to costly inheritance tree
    cross-casts. Moreover, a state must store one v-pointer for <b>each</b>
    reaction, which slows down construction and makes memory management
    customization inefficient. In addition, C++ RTTI must inevitably be
    turned on, with negative effects on executable size. Boost.Statechart
    originally employed acyclic visitor and was about 4 times slower than it
    is now (MSVC7.1 on an Intel Pentium M). The dispatch speed might be
    better on other platforms but the other negative effects will remain</li>

    <li><a href="http://en.wikipedia.org/wiki/Visitor_pattern">GOF
    Visitor</a>: The GOF Visitor pattern inevitably makes the whole machine
    depend upon all events. That is, whenever a new event is added there is
    no way around recompiling the whole state machine. This is contrary to
    the scalability requirements</li>

    <li>Single two-dimensional array of function pointers: To satisfy
    requirement 6, it should be possible to spread a single state machine
    over several translation units.
This, however, means that the dispatch
    table must be filled at runtime and the different translation units must
    somehow make themselves "known", so that their part of the state machine
    can be added to the table. There simply is no way to do this
    automatically <b>and</b> portably. The only portable way that a state
    machine distributed over several translation units could employ
    table-based double dispatch relies on the user: the programmer(s) would
    somehow have to <b>manually</b> tie together the various pieces of the
    state machine. Not only does this scale badly, it is also quite
    error-prone</li>
  </ul>

  <h2><a name="MemoryManagementCustomization" id=
  "MemoryManagementCustomization">Memory management customization</a></h2>

  <p>Out of the box, everything (event objects, state objects, internal data,
  etc.) is allocated through <code>std::allocator&lt; void &gt;</code> (the
  default for the Allocator template parameter). This should be satisfactory
  for applications meeting the following prerequisites:</p>

  <ul>
    <li>There are no deterministic reaction time (hard real-time)
    requirements</li>

    <li>The application will never run long enough for heap fragmentation to
    become a problem. This is of course an issue for all long-running
    programs, not only the ones employing this library. However, it should be
    noted that fragmentation problems could show up earlier than with
    traditional FSM frameworks</li>
  </ul>

  <p>Should an application not meet these prerequisites, Boost.Statechart's
  memory management can be customized as follows:</p>

  <ul>
    <li>By passing a model of the standard Allocator concept to the class
    templates that support a corresponding parameter
    (<code>event&lt;&gt;</code>, <code>state_machine&lt;&gt;</code>,
    <code>asynchronous_state_machine&lt;&gt;</code>,
    <code>fifo_scheduler&lt;&gt;</code>, <code>fifo_worker&lt;&gt;</code>).
    This redirects all allocations to the passed custom allocator and should
    satisfy the needs of just about any project</li>

    <li>Additionally, it is possible to <b>separately</b> customize
    <b>state</b> memory management by overloading <code>operator new()</code>
    and <code>operator delete()</code> for all state classes, but this is
    probably only useful under rare circumstances</li>
  </ul>

  <h2><a name="RttiCustomization" id="RttiCustomization">RTTI
  customization</a></h2>

  <p>RTTI is used for event dispatch and
  <code>state_downcast&lt;&gt;()</code>. Currently, there are exactly two
  options:</p>

  <ol>
    <li>By default, a speed-optimized internal implementation is
    employed</li>

    <li>The library can be instructed to use native C++ RTTI instead by
    defining <code><a href=
    "configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code></li>
  </ol>

  <p>There are two reasons to favor option 2:</p>

  <ul>
    <li>When a state machine (or parts of it) are compiled into a DLL,
    problems could arise from the use of the internal RTTI mechanism (see the
    FAQ item "<a href="faq.html#Dll">How can I compile a state machine into a
    dynamic link library (DLL)?</a>"). Using option 2 is one way to work
    around such problems (on some platforms, it seems to be the only
    way)</li>

    <li>State and event objects need to store one pointer less, meaning that
    in the best case the memory footprint of a state machine object could
    shrink by 15% (an empty event is typically 30% smaller, which can be an
    advantage when there are bursts of events rather than a steady flow).
    However, on most platforms executable size grows when C++ RTTI is turned
    on.
So, given the small per-machine-object savings, this only makes sense
    in applications where both of the following conditions hold:

      <ul>
        <li>Event dispatch will never become a bottleneck</li>

        <li>There is a need to reduce the memory allocated at runtime (at the
        cost of a larger executable)</li>
      </ul>

      <p>Obvious candidates are embedded systems where the executable resides
      in ROM. Other candidates are applications running a large number of
      identical state machines, where this measure could even reduce the
      <b>overall</b> memory footprint</p>
    </li>
  </ul>

  <h2><a name="ResourceUsage" id="ResourceUsage">Resource usage</a></h2>

  <h3>Memory</h3>

  <p>On a 32-bit box, one empty active state typically needs less than 50
  bytes of memory. Even <b>very</b> complex machines will usually have fewer
  than 20 simultaneously active states, so just about every machine should
  run with less than one kilobyte of memory (not counting event queues).
  Obviously, the per-machine memory footprint is offset by whatever
  state-local members the user adds.</p>

  <h3>Processor cycles</h3>

  <p>The following ranking should give a rough picture of which feature will
  consume how many cycles:</p>

  <ol>
    <li><code>state_cast&lt;&gt;()</code>: By far the most cycle-consuming
    feature. Searches linearly for a suitable state, using one
    <code>dynamic_cast</code> per visited state</li>

    <li>State entry and exit: Profiling of the fully optimized
    1-bit-BitMachine suggested that roughly three quarters of the total event
    processing time is spent destructing the exited state and constructing
    the entered state.
Obviously, transitions where the <a href=
    "definitions.html#InnermostCommonContext">innermost common context</a> is
    "far" from the leaf states and/or with lots of orthogonal states can
    easily cause the destruction and construction of quite a few states,
    leading to a significant amount of time spent on a transition</li>

    <li><code>state_downcast&lt;&gt;()</code>: Searches linearly for the
    requested state, using one virtual call and one RTTI comparison per
    visited state</li>

    <li>Deep history: For all innermost states inside a state passing either
    <code>has_deep_history</code> or <code>has_full_history</code> to its
    state base class, a binary search through the (usually small) history map
    must be performed on each exit. History slot allocation is performed
    exactly once, at first exit</li>

    <li>Shallow history: For all direct inner states of a state passing
    either <code>has_shallow_history</code> or <code>has_full_history</code>
    to its state base class, a binary search through the (usually small)
    history map must be performed on each exit. History slot allocation is
    performed exactly once, at first exit</li>

    <li>Event dispatch: One virtual call followed by a linear search for a
    suitable <a href="definitions.html#Reaction">reaction</a>, using one RTTI
    comparison per visited reaction</li>

    <li>Orthogonal states: One additional virtual call for each exited state
    <b>if</b> there is more than one active leaf state before a transition.
    It should also be noted that the worst-case event dispatch time is
    multiplied in the presence of orthogonal states.
For example, if two
    orthogonal leaf states are added to a given state configuration, the
    worst-case time is tripled</li>
  </ol>
  <hr>

  <p><a href="http://validator.w3.org/check?uri=referer"><img border="0" src=
  "../../../doc/images/valid-html401.png" alt="Valid HTML 4.01 Transitional"
  height="31" width="88"></a></p>

  <p>Revised
  <!--webbot bot="Timestamp" s-type="EDITED" s-format="%d %B, %Y" startspan -->03 December, 2006<!--webbot bot="Timestamp" endspan i-checksum="38512" --></p>

  <p><i>Copyright &copy; 2003-<!--webbot bot="Timestamp" s-type="EDITED" s-format="%Y" startspan -->2006<!--webbot bot="Timestamp" endspan i-checksum="770" -->
  <a href="contact.html">Andreas Huber D&ouml;nni</a></i></p>

  <p><i>Distributed under the Boost Software License, Version 1.0. (See
  accompanying file <a href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</a> or
  copy at <a href=
  "http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</a>)</i></p>
</body>
</html>