# `lws_fi` Fault Injection

Most effort during development goes towards making the system do what it is
supposed to do during normal operation.

But to provide reliable quality, it is not enough to test the code paths for
normal operation: you also need to be able to easily confirm that the code acts
correctly under various fault conditions that may be difficult to arrange at
test-time.  Otherwise it's very easy for low-probability error paths to be
overlooked and turn out to do the wrong thing, eg, try to clean up things they
had not actually initialized, or forget to free things.

Code handling the operational failures we want to check may be anywhere,
including during early initialization or in user code before lws initialization.

To help with this, lws has a `LWS_WITH_SYS_FAULT_INJECTION` build option that
provides a simple but powerful api for targeted fault injection in any lws or
user code, and provides a wide range of well-known internal faults inside lws
that you can trigger from outside.

## Fault contexts and faults

The basic idea is that user code can choose to initialize "fault contexts"
inside objects, which list the named, well-known "faults" that the code
supports and that the user wants to inject.

Although these "fault contexts" can be embedded in objects directly at object
creation time, eg, for lws in the lws_context creation info struct, or the
client connection info struct, or Secure Stream info struct, it's usually
inconvenient to pass the desired faults directly deep into the code and attach
them at creation time.  Eg, if you want to cause a fault in a wsi instantiated
by a Secure Stream, that is internal lws code one step removed from the Secure
Stream object creation, making it difficult to arrange.

For that reason, faults have a targeted inheritance scheme using namespace
paths.  It's usually enough to just list the faults you want at context creation
time, and they will filter down to the internal objects you want to target
when those are created later.

![Fault Injection Overview](../doc-assets/fault-injection.png)

A fault injection request is made in an `lws_fi_t` object, specifying the
fault name and whether, and how often, to inject the fault.

The "fault context" objects `lws_fi_ctx_t` embedded in the creation info
structs are linked-lists of `lws_fi_t` objects.  When Fault Injection is enabled
at build-time, the key system objects like the `lws_context`, `lws_vhost`, `wsi`
and Secure Stream handles / SSPC handles contain their own `lws_fi_ctx_t` lists
that may have any number of `lws_fi_t` added to them.

When downstream objects are created, eg, when an lws_context creates a Secure
Stream, in addition to using any faults provided directly in the SS info,
the lws_context faults are consulted to see if any relate to that streamtype
and should be applied.

Although faults can be added to objects at creation, it is far more convenient
to just pass a list of faults you want into the lws_context and have the
objects later match them using namespacing, described later.

## Integrating fault injection conditionals into code in private lws code

A simple query api `lws_fi(fi_ctx, "name")` is provided that returns 0 if no
fault is to be injected, or 1 if the fault should be synthesized.  If there is
no rule matching "name", the answer is always to not inject a fault, ie, it
returns 0.

Similarly, for convenience, if `LWS_WITH_SYS_FAULT_INJECTION` is disabled at
build, the `lws_fi()` call always returns the constant `0`.

So by default, just enabling Fault Injection at build does not have any impact
on code operation, since the user must also add the fault injection rules they
want to the object's Fault Injection context.

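The pattern this enables can be sketched as follows.  This is a standalone
illustration, not lws code: `toy_fi()` and `struct toy_fi_ctx` are hypothetical
stand-ins for `lws_fi()` and `lws_fi_ctx_t` so that it builds without lws, and
`toy_conn` and the fault name `txbuf_oom` are invented for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a fault context: at most one armed rule name */
struct toy_fi_ctx {
	const char *active;	/* NULL means no rule has been added */
};

/* Hypothetical stand-in for lws_fi(): 1 = synthesize the fault, 0 = don't */
int
toy_fi(const struct toy_fi_ctx *fic, const char *name)
{
	return fic && fic->active && !strcmp(fic->active, name);
}

struct toy_conn {
	char *rxbuf;
	char *txbuf;
};

/* Returns 0 on success, or 1 after undoing any partial initialization */
int
toy_conn_init(struct toy_conn *c, const struct toy_fi_ctx *fic)
{
	assert(c);
	memset(c, 0, sizeof(*c));

	c->rxbuf = malloc(256);
	if (!c->rxbuf)
		return 1;

	/* the rare second-allocation failure can now be forced from outside */
	if (toy_fi(fic, "txbuf_oom"))
		goto bail;

	c->txbuf = malloc(256);
	if (!c->txbuf)
		goto bail;

	return 0;

bail:
	/* the cleanup path under test: free only what was actually set up */
	free(c->rxbuf);
	c->rxbuf = NULL;

	return 1;
}

void
toy_conn_destroy(struct toy_conn *c)
{
	free(c->rxbuf);
	free(c->txbuf);
	c->rxbuf = c->txbuf = NULL;
}
```

With no rule armed, `toy_conn_init()` behaves normally; arming `txbuf_oom`
drives execution through `bail`, which is otherwise very hard to reach in a
test run.
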
## Integrating fault injection conditionals into user code with public apis

These public apis query the fault context in a wsi, lws_context, ss handle, or
sspc handle (client side of proxy) to find any matching rule.  If one is found,
they return 1 if the conditions (eg, probability) are met and the fault should
be injected.

These allow user code to use the whole Fault Injection system without having to
understand anything except the common object, like a wsi, they want to query and
the name of the fault rule they are checking.

|FI context owner|Public API|
|---|---|
|lws_context|`int lws_fi_user_context_fi(struct lws_context *ctx, const char *rule)`|
|wsi|`int lws_fi_user_wsi_fi(struct lws *wsi, const char *rule)`|
|ss handle|`int lws_fi_user_ss_fi(struct lws_ss_handle *h, const char *rule)`|
|sspc handle|`int lws_fi_user_sspc_fi(struct lws_sspc_handle *h, const char *rule)`|

For example, the minimal-http-client user code example contains this in its
ESTABLISHED callback

```
		if (lws_fi_user_wsi_fi(wsi, "user_reject_at_est"))
			return -1;
```

which can be triggered by running it with

`lws-minimal-http-client --fault-injection 'wsi/user_reject_at_est'`, causing

```
...
[2021/03/11 13:41:05:2769] U: Connected to 46.105.127.147, http response: 200
[2021/03/11 13:41:05:2776] W: lws_fi: Injecting fault unk->user_reject_at_est
[2021/03/11 13:41:05:2789] E: CLIENT_CONNECTION_ERROR: HS: disallowed at ESTABLISHED
...
```

When `LWS_WITH_SYS_FAULT_INJECTION` is disabled, these public apis become
preprocessor defines to `(0)`, so the related code is removed by the compiler.

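The compile-out behaviour can be illustrated with a standalone sketch.  All
names here are hypothetical: `TOY_WITH_FAULT_INJECTION` stands in for the real
build option, and `toy_fi_user_wsi_fi()` and `struct toy_wsi` are toy
stand-ins rather than the lws definitions.

```c
#include <assert.h>
#include <string.h>

struct toy_wsi {
	const char *active_rule;	/* rule name armed on this wsi, or NULL */
	int established;
};

#if defined(TOY_WITH_FAULT_INJECTION)
/* enabled build: a real query against the object's rule */
int
toy_fi_user_wsi_fi(const struct toy_wsi *wsi, const char *rule)
{
	return wsi->active_rule && !strcmp(wsi->active_rule, rule);
}
#else
/* disabled build: the query is the constant 0, so guarded code is dead */
#define toy_fi_user_wsi_fi(_wsi, _rule) (0)
#endif

int
toy_on_established(struct toy_wsi *wsi)
{
	assert(wsi);

	if (toy_fi_user_wsi_fi(wsi, "user_reject_at_est"))
		return -1;	/* only reachable in fault injection builds */

	wsi->established = 1;

	return 0;
}
```

The call site needs no `#if` of its own: in a build without the switch the
condition is a constant and the compiler can discard the `return -1` branch.
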
## Types of fault injection "when" strategy

The api keeps track of each time the context was asked and uses this information
to drive the decision about when to say yes, according to the type of rule

|Injection rule type|Description|
|---|---|
|`LWSFI_ALWAYS`|Unconditionally inject the fault|
|`LWSFI_DETERMINISTIC`|After `pre` times without the fault, the next `count` times exhibit the fault|
|`LWSFI_PROBABILISTIC`|Exhibit a fault `pre` percentage of the time|
|`LWSFI_PATTERN`|Reference `pre` bits pointed to by `pattern` and fault if the bit is set, pointing to a static array|
|`LWSFI_PATTERN_ALLOC`|Reference `pre` bits pointed to by `pattern` and fault if the bit is set, pointing to an allocated array, freed when the fault goes out of scope|

Probabilistic choices are sourced from a PRNG with a seed set in the context
creation info Fault Injection Context.  By default the lws helper
`lws_cmdline_option_handle_builtin()` sets this to the time in us, but it can
be overridden using `--fault-seed <decimal>`, and the effective PRNG seed is
logged when the commandline options are initially parsed.

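To make the rule types concrete, here is a minimal standalone model of the
decision logic as the table describes it.  It is a sketch, not the lws
implementation: the `toy_` names, the xorshift PRNG and the LSB-first bit
order used for patterns are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_fi_type {
	TOY_FI_ALWAYS,
	TOY_FI_DETERMINISTIC,	/* pre clean queries, then count faulting ones */
	TOY_FI_PROBABILISTIC,	/* fault pre percent of the time */
	TOY_FI_PATTERN		/* pre bits at pattern, replayed cyclically */
};

struct toy_fi {
	const uint8_t *pattern;
	uint64_t pre;
	uint64_t count;
	uint64_t times;		/* how many times this rule was queried */
	enum toy_fi_type type;
};

/* xorshift64: a given nonzero seed always replays the same decisions */
uint64_t
toy_prng(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;

	return *state = x;
}

/* returns 1 if this query should exhibit the fault, else 0 */
int
toy_fi_decide(struct toy_fi *fi, uint64_t *prng_state)
{
	uint64_t n;

	assert(fi && prng_state);
	n = fi->times++;	/* 0-based index of this query */

	switch (fi->type) {
	case TOY_FI_ALWAYS:
		return 1;
	case TOY_FI_DETERMINISTIC:
		return n >= fi->pre && n - fi->pre < fi->count;
	case TOY_FI_PROBABILISTIC:
		return toy_prng(prng_state) % 100 < fi->pre;
	case TOY_FI_PATTERN:
		if (!fi->pre || !fi->pattern)
			return 0;
		n %= fi->pre;	/* wrap so the pattern repeats */
		return (fi->pattern[n >> 3] >> (n & 7)) & 1;
	}

	return 0;
}
```

Because the only randomness comes from the seeded PRNG state, a failing
probabilistic run in this model can be replayed exactly by reusing its seed,
which is the reason for logging the effective seed.
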
## Adding Fault Injection Rules to `lws_fi_ctx_t`

Typically the lws_context is used as the central, toplevel place to define
faults.  This is done by preparing `lws_fi_t` objects on the stack and adding
them one by one to the context creation info struct's `.fic` member, using
`lws_fi_add(lws_fi_ctx_t *fic, const lws_fi_t *fi);`.  This allocates a copy of
the provided `fi` and attaches it to the `lws_fi_ctx_t` list.

When the context (or other object using the same scheme) is created, it imports
all the faults from the info structure `.fic` and takes ownership of them,
leaving the info `.fic` empty and ready to go out of scope.

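The copy-on-add and ownership handover described above can be modelled with a
small standalone list.  This is only a sketch of the lifetime rules, with
hypothetical `toy_` types that carry just a name; it does not reproduce the
real `lws_fi_t` layout or the lws list implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_fi {
	struct toy_fi *next;
	char name[64];
};

struct toy_fi_ctx {
	struct toy_fi *head;
	int count;
};

/* copy the caller's rule into its own allocation and link it into fic */
int
toy_fi_add(struct toy_fi_ctx *fic, const char *name)
{
	struct toy_fi *fi;

	assert(fic && name);
	if (strlen(name) >= sizeof(fi->name))
		return 1;

	fi = calloc(1, sizeof(*fi));
	if (!fi)
		return 1;

	strcpy(fi->name, name);	/* caller's storage may now go away */
	fi->next = fic->head;
	fic->head = fi;
	fic->count++;

	return 0;
}

/* the created object takes over every rule, leaving src empty */
void
toy_fi_import(struct toy_fi_ctx *dest, struct toy_fi_ctx *src)
{
	while (src->head) {
		struct toy_fi *fi = src->head;

		src->head = fi->next;
		src->count--;
		fi->next = dest->head;
		dest->head = fi;
		dest->count++;
	}
}

/* free all rules owned by fic */
void
toy_fi_destroy(struct toy_fi_ctx *fic)
{
	while (fic->head) {
		struct toy_fi *fi = fic->head;

		fic->head = fi->next;
		free(fi);
	}
	fic->count = 0;
}
```

In this model the creation info's list can safely go out of scope after
`toy_fi_import()`, since nothing it allocated is left attached to it.
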
## Passing in fault injection rules

A key requirement is that Fault Injection rules must be available to the code
creating an object before the object has been created.  This is why the user
code prepares a Fault Injection context listing its rules in the creation info
struct, rather than waiting for the object to be created and then attaching
Fault Injection rules: by then it's too late to test faults during the creation.

## Directly applying fault contexts

You can pass in a Fault Injection context prepared with `lws_fi_t` added to it
when creating the following kinds of objects

|Object being created|info struct|Fault injection Context member|
|---|---|---|
|lws context|`struct lws_context_creation_info`|`fic`|
|vhost|`struct lws_context_creation_info`|`fic`|
|Secure Stream|`struct lws_ss_info`|`fic`|
|client wsi|`struct lws_client_connect_info`|`fic`|

However, typically the approach is to just provide a list of faults at context
creation time, and let the objects match and inherit using namespacing,
described next.

## Using the namespace to target specific instances

Lws objects created by the user can directly have a Fault Injection context
attached to them at creation time, so the fault injection objects directly
relate to the object.

But in other common scenarios, there is no direct visibility of the object that
we want to trigger faults in, and it may not exist until some time later.  Eg,
we want to trigger faults in the listen socket of a vhost.  To allow this, the
fault names can be structured with a /path/ type namespace so objects created
later can inherit faults.

Notice that if you are directly creating the vhost, Secure Stream or wsi, you
can directly attach the subrule yourself without needing the namespacing.  The
namespacing is used when you only have access to a higher level object at
creation-time, like the lws_context, and it will itself create the object you
want to target without your having any direct access to it.

|namespace form|effect|
|---|---|
|**vh=myvhost/**subrule|subrule is inherited by the vhost named "myvhost" when it is created|
|**vh/**subrule|subrule is inherited by any vhost when it is created|
|**ss=mystream/**subrule|subrule is inherited by SS of streamtype "mystream" (also covers SSPC / proxy client)|
|**ss/**subrule|subrule is inherited by all SS of any streamtype (also covers SSPC / proxy client)|
|**wsi=myname/**subrule|subrule is inherited by client wsi created with `info->fi_wsi_name` "myname"|
|**wsi/**subrule|subrule is inherited by any wsi|

Namespaces can be combined, for example `vh=myvhost/wsi/listenskt` will set the
`listenskt` fault on wsi created by the server vhost "myvhost", ie, it will
cause the listen socket for the vhost to error out on creation.

In the case of wsi migration, when it's the network connection wsi on an h2
connection that is migrated to be SID 1, the attached faults also migrate.

Here is which Fault Injection Contexts each type of object inherits matching
Fault Injection rules from:

|Object type|Initialized with|Inherit matching faults from|
|---|---|---|
|context|`struct lws_context_creation_info` `.fic`|-|
|vhost|`struct lws_context_creation_info` `.fic`|context FIC|
|client wsi|`struct lws_client_connect_info` `.fic`|context FIC, vhost FIC|
|ss / sspc|`lws_ss_info_t` `.fic`|context FIC|
|ss / sspc wsi|-|context FIC, vhost FIC, ss / sspc `.fic`|

Since everything can be reached from the lws_context fault context, directly or
by additional inheritance, and that's the most convenient to set from the
outside, that's typically the original source of all injected faults.

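One way to picture the inheritance step is as peeling one namespace component
off the front of a rule whenever a matching object is created.  The helper
below is a hypothetical, standalone sketch of that matching (it is not the lws
matcher): it returns the remaining subrule the new object would inherit, or
`NULL` if the rule does not target it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * rule:    eg "vh=myvhost/wsi/listenskt"
 * kind:    namespace of the object being created: "vh", "ss" or "wsi"
 * objname: that object's name, or NULL if it has none
 */
const char *
toy_fi_ns_match(const char *rule, const char *kind, const char *objname)
{
	size_t kl, nl;
	const char *p;

	assert(rule && kind);
	kl = strlen(kind);

	if (strncmp(rule, kind, kl))
		return NULL;

	p = rule + kl;
	if (*p == '/')		/* "kind/subrule": applies to any instance */
		return p + 1;

	if (*p != '=' || !objname)
		return NULL;

	p++;			/* "kind=name/subrule": name must match fully */
	nl = strlen(objname);
	if (strncmp(p, objname, nl) || p[nl] != '/')
		return NULL;

	return p + nl + 1;
}
```

Applied twice, this models a combined namespace: creating vhost "myvhost"
reduces `vh=myvhost/wsi/listenskt` to `wsi/listenskt`, and a wsi created under
that vhost then reduces it to `listenskt`.
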
## Integration with minimal examples

All the minimal examples that use the `lws_cmdline_option_handle_builtin()` api
can take an additional `--fault-injection "...,..."` switch, which automatically
parses the comma-separated list in the argument to add faults with the given
names to the lws_context.  For example,

`lws-minimal-http-client --fault-injection "wsi/dnsfail"`

will force all wsi dns lookups to fail for that run of the example.

### Specifying when to inject the fault

By default, if you just give the name part, the fault will be injected every
time, provided the namespace is absent or matches an object.  It's also possible
to make the fault inject itself at a random probability, or in a cyclic pattern,
by giving additional information in brackets, eg

|Syntax|Used with|Meaning|
|---|---|---|
|`wsi/thefault`|lws_fi()|Inject the fault every time|
|`wsi/thefault(10%)`|lws_fi()|Randomly inject the fault at 10% probability|
|`wsi/thefault(.............X.X)`|lws_fi()|Inject the fault on the 14th and 16th try, every 16 tries|
|`wsi/thefault2(123..456)`|lws_fi_range()|Pick a number between 123 and 456|

You must quote the strings containing these symbols, since they may otherwise be
interpreted by your shell.

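As an illustration of the four forms in the table, the following standalone
sketch parses a single rule string into a small struct.  It is not the lws
parser: `toy_rule_parse()`, `struct toy_rule`, the 64-try pattern limit and the
LSB-first bit packing are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum toy_when { TOY_ALWAYS, TOY_PERCENT, TOY_PATTERN, TOY_RANGE };

struct toy_rule {
	char name[64];		/* namespace path and fault name */
	enum toy_when when;
	uint64_t a;		/* percent, pattern length, or range low */
	uint64_t b;		/* range high */
	uint8_t pattern[8];	/* up to 64 tries, 'X' sets the bit */
};

/* parse one rule like "wsi/thefault(10%)"; returns 0 if OK, 1 if malformed */
int
toy_rule_parse(const char *s, struct toy_rule *r)
{
	const char *open, *close;
	size_t nl, al, i;
	char *end;

	assert(s && r);
	memset(r, 0, sizeof(*r));

	open = strchr(s, '(');
	nl = open ? (size_t)(open - s) : strlen(s);
	if (!nl || nl >= sizeof(r->name))
		return 1;
	memcpy(r->name, s, nl);
	if (!open)
		return 0;	/* bare name: inject every time */

	close = strchr(open, ')');
	if (!close || close[1] || close == open + 1)
		return 1;	/* unterminated, trailing junk, or "()" */
	open++;
	al = (size_t)(close - open);

	if (*open >= '0' && *open <= '9') {
		r->a = strtoull(open, &end, 10);
		if (end == close - 1 && *end == '%') {
			r->when = TOY_PERCENT;
			return r->a > 100;
		}
		if (end[0] != '.' || end[1] != '.' ||
		    end[2] < '0' || end[2] > '9')
			return 1;
		r->when = TOY_RANGE;
		r->b = strtoull(end + 2, &end, 10);
		return end != close || r->b < r->a;
	}

	if (al > 64)
		return 1;
	r->when = TOY_PATTERN;
	r->a = al;
	for (i = 0; i < al; i++) {
		if (open[i] == 'X')
			r->pattern[i >> 3] |= (uint8_t)(1 << (i & 7));
		else if (open[i] != '.')
			return 1;
	}

	return 0;
}
```

A leading digit selects the numeric forms, which keeps `123..456` from being
confused with a pattern made of `.` characters.
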
The last example above does not decide whether to inject the fault via `lws_fi()`
like the others.  Instead you can use it via `lws_fi_range()` as part of the
fault processing, on a secondary fault injection name.  For example, you may have
a fault `myfault` you use with `lws_fi()` to decide when to inject the fault,
and then a second, related fault name `myfault_delay` that lets you add code
to delay the fault action by some random number of ms within an externally-given
range.  You can get a pseudo-random number within the externally-given
range by calling `lws_fi_range()` on `myfault_delay`, and control the whole
thing by giving, eg, `"myfault(10%),myfault_delay(123..456)"`.

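The two-name scheme can be sketched in standalone form as below.  None of this
is the lws api: `toy_fi_range()` only imitates the idea of `lws_fi_range()`,
treating the range as inclusive (an assumption), and `toy_plan_myfault()` is a
hypothetical helper combining the "whether" and "how long" decisions.

```c
#include <assert.h>
#include <stdint.h>

/* xorshift64: the same nonzero seed replays the same sequence */
uint64_t
toy_prng(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;

	return *state = x;
}

/* pseudo-random value in [lo, hi], both ends included */
uint64_t
toy_fi_range(uint64_t lo, uint64_t hi, uint64_t *prng_state)
{
	uint64_t span;

	assert(prng_state && lo <= hi);
	span = hi - lo + 1;
	if (!span)		/* full 64-bit range wrapped to 0 */
		return toy_prng(prng_state);

	return lo + toy_prng(prng_state) % span;
}

struct toy_fault_plan {
	int inject;		/* 1 if the fault should happen this time */
	uint64_t delay_ms;	/* how long to wait before acting on it */
};

struct toy_fault_plan
toy_plan_myfault(unsigned int percent, uint64_t lo, uint64_t hi,
		 uint64_t *prng_state)
{
	struct toy_fault_plan p = { 0, 0 };

	/* first name: whether to inject, as "myfault(10%)" would decide */
	if (toy_prng(prng_state) % 100 >= percent)
		return p;

	p.inject = 1;
	/* second name: the delay, as "myfault_delay(123..456)" would supply */
	p.delay_ms = toy_fi_range(lo, hi, prng_state);

	return p;
}
```

Keeping the delay under its own name means the probability and the delay range
can be tuned independently from the commandline.
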
## Well-known fault names in lws

|Scope|Namespace|Name|Fault effect|
|---|---|---|---|
|context||`ctx_createfail1`|Fail context creation immediately at entry|
|context||`ctx_createfail_plugin_init`|Fail context creation as if a plugin init failed (if plugins enabled)|
|context||`ctx_createfail_evlib_plugin`|Fail context creation due to event lib plugin failed init (if evlib plugins enabled)|
|context||`ctx_createfail_evlib_sel`|Fail context creation due to unable to select event lib|
|context||`ctx_createfail_oom_ctx`|Fail context creation due to OOM on context object|
|context||`ctx_createfail_privdrop`|Fail context creation due to failure dropping privileges|
|context||`ctx_createfail_maxfds`|Fail context creation due to unable to determine process fd limit|
|context||`ctx_createfail_oom_fds`|Fail context creation due to OOM on fds table|
|context||`ctx_createfail_plat_init`|Fail context creation due to platform init failed|
|context||`ctx_createfail_evlib_init`|Fail context creation due to event lib init failed|
|context||`ctx_createfail_evlib_pt`|Fail context creation due to event lib pt init failed|
|context||`ctx_createfail_sys_vh`|Fail context creation due to system vhost creation failed|
|context||`ctx_createfail_sys_vh_init`|Fail context creation due to system vhost init failed|
|context||`ctx_createfail_def_vh`|Fail context creation due to default vhost creation failed|
|context||`ctx_createfail_ss_pol1`|Fail context creation due to ss policy parse start failed (if policy enabled)|
|context||`ctx_createfail_ss_pol2`|Fail context creation due to ss policy parse failed (if policy enabled)|
|context||`ctx_createfail_ss_pol3`|Fail context creation due to ss policy set failed (if policy enabled)|
|context||`cache_createfail`|Fail `lws_cache` creation due to OOM|
|context||`cache_lookup_oom`|Fail `lws_cache` lookup due to OOM|
|vhost|`vh`|`vh_create_oom`|Fail vh creation on vh object alloc OOM|
|vhost|`vh`|`vh_create_pcols_oom`|Fail vh creation at protocols alloc OOM|
|vhost|`vh`|`vh_create_access_log_open_fail`|Fail vh creation due to unable to open access log (LWS_WITH_ACCESS_LOG)|
|vhost|`vh`|`vh_create_ssl_srv`|Fail server ssl_ctx init|
|vhost|`vh`|`vh_create_ssl_cli`|Fail client ssl_ctx init|
|vhost|`vh`|`vh_create_srv_init`|Fail server init|
|vhost|`vh`|`vh_create_protocol_init`|Fail late protocol init (for late vhost creation)|
|srv vhost|`vh=xxx/wsi`|`listenskt`|Causes `socket()` allocation for vhost listen socket to fail|
|cli wsi|`wsi`|`dnsfail`|Sync: `getaddrinfo()` is not called and an `EAI_FAIL` return is synthesized, Async: request not started and immediate fail synthesized|
|cli wsi|`wsi`|`sendfail`|Attempts to send data on the wsi socket fail|
|cli wsi|`wsi`|`connfail`|Attempts to connect on the wsi socket fail|
|cli wsi|`wsi`|`createfail`|Creating the client wsi itself fails|
|udp wsi|`wsi`|`udp_rx_loss`|Drop UDP RX that was actually received, useful with probabilistic mode|
|udp wsi|`wsi`|`udp_tx_loss`|Drop UDP TX so that it's not actually sent, useful with probabilistic mode|
|srv ss|`ss`|`ss_srv_vh_fail`|Secure Streams Server vhost creation forced to fail|
|cli ss|`ss`|`ss_no_streamtype_policy`|The policy for the streamtype is made to seem as if it is missing|
|sspc|`ss`|`sspc_fail_on_linkup`|Reject the connection to the proxy when we hear it has succeeded, it will provoke endless retries|
|sspc|`ss`|`sspc_fake_rxparse_disconnect_me`|Force client-proxy link parse to seem to ask to be disconnected, it will provoke endless retries|
|sspc|`ss`|`sspc_fake_rxparse_destroy_me`|Force client-proxy link parse to seem to ask to destroy the SS, it will destroy the SS cleanly|
|sspc|`ss`|`sspc_link_write_fail`|Force write on the link to fail, it will provoke endless retries|
|sspc|`ss`|`sspc_create_oom`|Cause the sspc handle allocation to fail as if OOM at creation time|
|sspc|`ss`|`sspc_fail_metadata_set`|Cause the metadata allocation to fail|
|sspc|`ss`|`sspc_rx_fake_destroy_me`|Make it seem that client's user code *rx() returned DESTROY_ME|
|sspc|`ss`|`sspc_rx_metadata_oom`|Cause metadata from proxy allocation to fail|
|ssproxy|`ss`|`ssproxy_dsh_create_oom`|Cause proxy's creation of DSH to fail|
|ssproxy|`ss`|`ssproxy_dsh_rx_queue_oom`|Cause proxy's allocation in the onward SS->P[->C] DSH rx direction to fail as if OOM, this causes the onward connection to disconnect|
|ssproxy|`wsi`|`ssproxy_client_adopt_oom`|Cause proxy to be unable to allocate for new client - proxy link connection object|
|ssproxy|`wsi`|`ssproxy_client_write_fail`|Cause proxy write to client to fail|
|ssproxy|`wsi`|`sspc_dsh_ss2p_oom`|Cause ss->proxy dsh allocation to fail|
|ssproxy|`ss`|`ssproxy_onward_conn_fail`|Act as if proxy onward client connection failed immediately|
|ssproxy|`ss`|`ssproxy_dsh_c2p_pay_oom`|Cause proxy's DSH alloc for C->P payload to fail|
|ss|`ss`|`ss_create_smd`|SMD: ss creation smd registration fail|
|ss|`ss`|`ss_create_vhost`|Server: ss creation acts like no vhost matching typename (only for `!vhost`)|
|ss|`ss`|`ss_create_pcol`|Server: ss creation acts like no protocol given in policy|
|ss|`ss`|`ss_srv_vh_fail`|Server: ss creation acts like unable to create vhost|
|ss|`ss`|`ss_create_destroy_me`|ss creation acts like CREATING state returned DESTROY_ME|
|ss|`ss`|`ss_create_no_ts`|Static Policy: ss creation acts like no trust store|
|ss|`ss`|`ss_create_smd_1`|SMD: ss creation acts like CONNECTING said DESTROY_ME|
|ss|`ss`|`ss_create_smd_2`|SMD: ss creation acts like CONNECTED said DESTROY_ME|
|ss|`ss`|`ss_create_conn`|Nailed up: ss creation client connection fails with DESTROY_ME|
|wsi|`wsi`|`timedclose`|(see next) Cause wsi to close after some time|
|wsi|`wsi`|`timedclose_ms`|Range of ms for timedclose (eg, "timedclose_ms(10..250)")|

## Well-known namespace targets

Namespaces can be used to target these more precisely.  For example, even though
we are only passing the faults we want to inject at the lws_context, we can use
the namespace "paths" to target only the wsis created by other things.

To target wsis from SS-based connections, you can use `ss=stream_type_name/`,
eg for captive portal detection, to have it unable to find its policy entry:

`ss=captive_portal_detect/ss_no_streamtype_policy` (disables CPD from operating)

...to force it to fail to resolve the server DNS:

`ss=captive_portal_detect/wsi/dnsfail` (this makes CPD feel there is no internet)

...to target the connection part of the captive portal testing instead:

`ss=captive_portal_detect/wsi/connfail` (this also makes CPD feel there is no internet)

### Well-known internal wsi type names

Wsi created for internal features like Async DNS processing can also be targeted

|wsi target|Meaning|
|---|---|
|`wsi=asyncdns/`|UDP wsi used by lws Async DNS support to talk to DNS servers|
|`wsi=dhcpc/`|UDP wsi used by lws DHCP Client|
|`wsi=ntpclient/`|UDP wsi used by lws NTP Client|

For example, passing in at lws_context level `wsi=asyncdns/udp_tx_loss`
will force async dns to be unable to resolve anything, since its UDP tx is
being suppressed.

At client connection creation time, user code can also specify its own names
to match on these `wsi=xxx/` namespace parts, so the faults only apply to
specific wsi it creates itself later.  This is done by setting the
client creation info struct `.fi_wsi_name` to the string "xxx".
354
355At client connection creation time, user code can also specify their own names
356to match on these `wsi=xxx/` namespace parts, so the faults only apply to
357specific wsi they are creating themselves later.  This is done by setting the
358client creation info struct `.fi_wsi_name` to the string "xxx".
359