## `lws_metrics`

### Introduction

`lws_metrics` records and aggregates **events** at all lws layers.

There are three distinct parts:

 - the architecture inside lws for collecting and aggregating / decimating the
   events and maintaining statistics about them; these are lws_metric objects

 - an external handler for forwarding aggregated metrics: an `lws_system` ops
   interface that passes the aggregated metrics on to an external backend.  lws
   presents its own public metrics objects and leaves it to the external
   code to provide a shim that marries the lws metrics up to whatever is needed
   in the metrics backend

 - a policy for when to emit each type of aggregated information to the external
   handler.  This can be specified in the generic Secure Streams policy, or as
   a linked-list of `lws_metric_policy_t` objects passed in at context creation
   in `info.metrics_policies`.

The external backend interface code may itself make use of lws connectivity apis
including Secure Streams itself, and lws metrics are available on that too.

### `lws_metrics` policy-based reporting

Normally metrics implementations are fixed at build-time and cannot change
without a coordinated reflash of devices along with a change of backend schema.

`lws_metrics` separates the objects and code necessary to collect and
aggregate the data cheaply from the reporting policy that controls if, or how
often, the results are reported to the external handler.

![policy based metrics](/doc-assets/lws_metrics-policy.png)

Metrics are created with a name in a namespace, and each policy lists the
names it applies to, with wildcards allowed.  For example, if specified in the
Secure Streams JSON policy:

```
	...
	"metrics": [
		{
			"name":		"tensecs",
			"us_schedule":	10000000,
			"report":	"cpu.*"
		}, {
			"name":		"30secs",
			"us_schedule":	30000000,
			"report":	"n.cn.*, n.http.*, n.ss.*, vh.*"
		}
	],
	...
```

Metrics that do not have a reporting policy do not report, but continue to
aggregate measurements in case they are bound to a policy dynamically later.

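To make the name-to-policy binding concrete, here is a minimal sketch in plain C of how a metric name could be checked against each policy's `report` list. It is not the lws implementation: `struct demo_policy`, `demo_pattern_matches()` and `demo_policy_for()` are hypothetical names, and the sketch assumes only a trailing `*` wildcard and a comma-separated list, as in the JSON above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one reporting policy entry; not the lws type */
struct demo_policy {
	const struct demo_policy *next;
	const char *name;	/* eg, "tensecs" */
	uint64_t us_schedule;	/* reporting interval in microseconds */
	const char *report;	/* comma-separated name patterns */
};

/* Match one pattern of length plen; a trailing '*' matches any suffix */
static int
demo_pattern_matches(const char *pat, size_t plen, const char *name)
{
	if (plen && pat[plen - 1] == '*')
		return !strncmp(pat, name, plen - 1);

	return strlen(name) == plen && !strncmp(pat, name, plen);
}

/* Return the first policy whose report list matches name, or NULL */
static const struct demo_policy *
demo_policy_for(const struct demo_policy *head, const char *name)
{
	for (; head; head = head->next) {
		const char *p = head->report;

		while (*p) {
			size_t n;

			/* skip separators between patterns */
			while (*p == ' ' || *p == ',')
				p++;
			n = strcspn(p, ", ");
			if (n && demo_pattern_matches(p, n, name))
				return head;
			p += n;
		}
	}

	return NULL; /* unbound: keep aggregating, report nothing */
}
```

With the two policies from the JSON example, `cpu.svc` would bind to "tensecs" and `n.http.txn` to "30secs", while a name that matches no pattern stays unbound and simply keeps aggregating.
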
### Freeform metrics naming

There is no predefined metrics schema.  Metrics objects, including those created
by applications, can independently choose their own name in a namespace like
"cpu.srv" or "n.cn.dns", and a prefix can be set for all metrics names created
in a context (by setting `info.metrics_prefix` at context creation time).

This allows multiple processes in a single device to expose copies of the same
metrics in an individually addressable way, eg, if the UI process specifies the
prefix "ui", then its lws metrics like "cpu.srv" will actually be created as
"ui.cpu.srv".

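The effect of the prefix can be sketched in a few lines of plain C. This only illustrates the naming rule described above and is not how lws composes names internally; `demo_metric_full_name()` is a hypothetical helper.

```c
#include <stdio.h>

/*
 * Compose the effective metric name from an optional per-context prefix.
 * Returns 0 on success, 1 if the result would not fit in buf.
 */
static int
demo_metric_full_name(char *buf, size_t len, const char *prefix,
		      const char *name)
{
	int n;

	if (prefix && *prefix)
		n = snprintf(buf, len, "%s.%s", prefix, name);
	else
		n = snprintf(buf, len, "%s", name);

	return n < 0 || (size_t)n >= len;
}
```

For example, the prefix "ui" and the name "cpu.srv" give "ui.cpu.srv", and a NULL or empty prefix leaves the name unchanged.
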
Applications can freely define their own `lws_metrics` measurements with their
own names in the namespace too, without central registration, and refer to those
names in the reporting policy the same as any other metric names.

If the metrics backend requires a fixed schema, the mapping between the
`lws_metrics` names and the backend schema indexes is done in the
`lws_system` external reporting api implementation alone.  Metrics objects
contain a `void *backend_opaque` that is ignored by lws and can be set and
read by the external reporting handler implementation to facilitate that.

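One way a reporting handler might use the opaque pointer is to resolve the backend schema entry on the first report and cache it for later ones. The sketch below assumes a made-up backend schema; `struct demo_metric` is a hypothetical stand-in carrying only a name and the opaque pointer (it is not `lws_metric_pub_t`), and the schema table and indexes are invented for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a published metric; only the fields used here */
struct demo_metric {
	const char *name;
	void *backend_opaque;	/* untouched by the library, owned by handler */
};

/* Hypothetical fixed backend schema: metric name -> schema index */
struct demo_schema_entry {
	const char *name;
	int index;
};

static const struct demo_schema_entry demo_schema[] = {
	{ "ui.cpu.srv",  100 },
	{ "ui.n.cn.dns", 101 },
};

/* Resolve the schema entry once, then reuse it through backend_opaque */
static const struct demo_schema_entry *
demo_schema_for(struct demo_metric *m)
{
	size_t i;

	if (m->backend_opaque)	/* already resolved on an earlier report */
		return (const struct demo_schema_entry *)m->backend_opaque;

	for (i = 0; i < sizeof(demo_schema) / sizeof(demo_schema[0]); i++)
		if (!strcmp(demo_schema[i].name, m->name)) {
			m->backend_opaque = (void *)&demo_schema[i];
			return &demo_schema[i];
		}

	return NULL;	/* not part of the backend schema */
}
```

Keeping the lookup in the handler means the metric names stay freeform on the lws side, and only the shim needs to change if the backend schema does.
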
### Histogram metrics tagging

Histogram metrics track differently-qualified results in the same metric, for
example the metric `n.cn.failures` maintains separate result counts for all
variations and kinds of failure.

```
[2021/03/01 06:34:05:6570] U: my_metric_report: ssproxy.n.cn.failures{ss="badcert_selfsigned",hostname="invalidca.badcert.warmcat.com",peer="46.105.127.147",tls="invalidca"} 2
[2021/03/01 06:34:05:6573] U: my_metric_report: ssproxy.n.cn.failures{hostname="invalidca.badcert.warmcat.com",peer="46.105.127.147",tls="invalidca"} 1
[2021/03/01 06:34:05:6576] U: my_metric_report: ssproxy.n.cn.failures{ss="badcert_expired",hostname="warmcat.com",peer="46.105.127.147",tls="expired"} 2
[2021/03/01 06:34:05:6578] U: my_metric_report: ssproxy.n.cn.failures{hostname="warmcat.com",peer="46.105.127.147",tls="expired"} 1
[2021/03/01 06:34:05:6580] U: my_metric_report: ssproxy.n.cn.failures{ss="badcert_hostname",hostname="hostname.badcert.warmcat.com",peer="46.105.127.147",tls="hostname"} 2
[2021/03/01 06:34:05:6583] U: my_metric_report: ssproxy.n.cn.failures{hostname="hostname.badcert.warmcat.com",peer="46.105.127.147",tls="hostname"} 1
[2021/03/01 06:34:05:6585] U: my_metric_report: ssproxy.n.cn.failures{dns="nores -2"} 8
```

The user handler for metrics is expected to iterate these, as in the provided
examples (eg, minimal-secure-streams-testsfail):

```
#include <libwebsockets.h>

#if defined(LWS_WITH_SYS_METRICS)
static int
my_metric_report(lws_metric_pub_t *mp)
{
	lws_metric_bucket_t *sub = mp->u.hist.head;
	char buf[192];

	/*
	 * Format and log one line per call; for histogram metrics, keep
	 * going until there are no more buckets left to format
	 */
	do {
		if (lws_metrics_format(mp, &sub, buf, sizeof(buf)))
			lwsl_user("%s: %s\n", __func__, buf);
	} while ((mp->flags & LWSMTFL_REPORT_HIST) && sub);

	/* 0 = leave metric to accumulate, 1 = reset the metric */

	return 1;
}

/* register the handler in the system ops given at context creation */
static const lws_system_ops_t system_ops = {
	.metric_report = my_metric_report,
};

#endif
```

### `lws_metrics` decimation

Event information can easily be produced faster than it can be transmitted, or
than it is useful to record when everything is working.  When things are not
working, the number of events that cannot be forwarded to the backend would
eventually overwhelm the local storage.

For that reason, the metrics objects are designed to absorb and summarize a
potentially large number of events cheaply by aggregating them, so even extreme
situations can be tracked meaningfully in between dumps to the backend.

There are two approaches:

 - "aggregation": decimate keeping a uint64 mean + sum, along with a max and min

 - "histogram": keep a linked-list of different named buckets, with a 64-bit
   counter for the number of times an event in each bucket was observed

A single metric aggregation object has separate "go / no-go" counters, since
most operations can fail, and failing operations act differently.

`lws_metrics` 'aggregation' supports decimation by

 - a mean of a 64-bit event metric, separate for go and no-go events
 - counters of go and no-go events
 - a min and max of the metric
 - keeping track of when the sample period started

![metrics decimation](/doc-assets/lws_metrics-decimation.png)

In addition, the policy defines a percentage variance from the mean that
optionally qualifies events to be reported individually.

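The aggregation approach can be sketched as a small fixed-size accumulator that folds each event in without storing it. This is an illustration only: `struct demo_agg` and its helpers are hypothetical and do not reflect the lws structure layout, and the outlier test assumes the variance is checked against the mean of go events.

```c
#include <stdint.h>

/* Hypothetical aggregator; field names are illustrative, not the lws layout */
struct demo_agg {
	uint64_t sum[2];	/* [0] = no-go, [1] = go */
	uint64_t count[2];	/* [0] = no-go, [1] = go */
	uint64_t min;
	uint64_t max;
	uint64_t us_first;	/* when the sample period started */
};

/* Fold one event into the aggregate in constant time and space */
static void
demo_agg_event(struct demo_agg *a, uint64_t now_us, int go, uint64_t val)
{
	if (!a->count[0] && !a->count[1]) {
		/* first event of this sample period */
		a->us_first = now_us;
		a->min = a->max = val;
	}
	if (val < a->min)
		a->min = val;
	if (val > a->max)
		a->max = val;

	a->sum[!!go] += val;
	a->count[!!go]++;
}

/* Mean for go (1) or no-go (0) events, 0 if none were seen */
static uint64_t
demo_agg_mean(const struct demo_agg *a, int go)
{
	return a->count[!!go] ? a->sum[!!go] / a->count[!!go] : 0;
}

/* Is val further than pct percent away from the mean of go events? */
static int
demo_agg_is_outlier(const struct demo_agg *a, uint64_t val, unsigned int pct)
{
	uint64_t mean = demo_agg_mean(a, 1);
	uint64_t diff = val > mean ? val - mean : mean - val;

	return diff * 100 > mean * pct;
}
```

However many events arrive between reports, the object stays the same size, which is what lets it keep tracking while the backend is unreachable.
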
The `lws_metrics` 'histogram' allows monitoring of different outcomes, producing
a count of each outcome in its own "bucket".

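A histogram of this kind can be sketched as a linked list of named buckets, each with a 64-bit counter. The types and functions below are hypothetical (they are not `lws_metric_bucket_t` or any lws api), and the 32-byte bucket name limit is an arbitrary choice for the example.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical histogram bucket; not the lws_metric_bucket_t layout */
struct demo_bucket {
	struct demo_bucket *next;
	uint64_t count;
	char name[32];
};

struct demo_hist {
	struct demo_bucket *head;
	uint64_t total;
};

/* Count one event in the named bucket, creating the bucket on first use */
static int
demo_hist_event(struct demo_hist *h, const char *name)
{
	struct demo_bucket *b;

	for (b = h->head; b; b = b->next)
		if (!strcmp(b->name, name))
			break;

	if (!b) {
		if (strlen(name) >= sizeof(b->name))
			return 1;	/* name too long for a bucket */
		b = calloc(1, sizeof(*b));
		if (!b)
			return 1;	/* out of memory */
		strcpy(b->name, name);
		b->next = h->head;
		h->head = b;
	}

	b->count++;
	h->total++;

	return 0;
}

/* Number of events seen in the named bucket, 0 if it does not exist */
static uint64_t
demo_hist_count(const struct demo_hist *h, const char *name)
{
	const struct demo_bucket *b;

	for (b = h->head; b; b = b->next)
		if (!strcmp(b->name, name))
			return b->count;

	return 0;
}

/* Free all buckets, eg, after a report that asks for the metric to be reset */
static void
demo_hist_reset(struct demo_hist *h)
{
	while (h->head) {
		struct demo_bucket *b = h->head;

		h->head = b->next;
		free(b);
	}
	h->total = 0;
}
```

Storage grows with the number of distinct outcomes rather than the number of events, so repeated failures of the same kind only bump a counter.
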
### `lws_metrics` flags

When the metrics object is created, flags are used to control how it will be
used and consumed.

For example, to create a histogram metrics object rather than the default
aggregation type, you would give the flag `LWSMTFL_REPORT_HIST` at creation
time.

|Flag|Meaning|
|---|---|
|`LWSMTFL_REPORT_OUTLIERS`|track outliers and report them internally|
|`LWSMTFL_REPORT_OUTLIERS_OOB`|report each outlier externally as they happen|
|`LWSMTFL_REPORT_INACTIVITY_AT_PERIODIC`|explicitly report no activity externally at the periodic cb; by default a period with no events is just not reported|
|`LWSMTFL_REPORT_MEAN`|the mean is interesting for this metric|
|`LWSMTFL_REPORT_ONLY_GO`|no-go pieces invalid and should be ignored, used for simple counters|
|`LWSMTFL_REPORT_DUTY_WALLCLOCK_US`|the aggregated sum or mean can be compared to wallclock time|
|`LWSMTFL_REPORT_HIST`|object is a histogram (else aggregator)|

### Built-in lws-layer metrics

lws creates and maintains various well-known metrics when you build
with cmake `-DLWS_WITH_SYS_METRICS=1`:

#### Aggregation metrics

|metric name|scope|type|meaning|
|---|---|---|---|
|`cpu.svc`|context|monotonic over time|time spent servicing, outside of event loop wait|
|`n.cn.dns`|context|go/no-go mean|duration of blocking libc DNS lookup|
|`n.cn.adns`|context|go/no-go mean|duration of SYS_ASYNC_DNS lws DNS lookup|
|`n.cn.tcp`|context|go/no-go mean|duration of tcp connection until accept|
|`n.cn.tls`|context|go/no-go mean|duration of tls connection until accept|
|`n.http.txn`|context|go (2xx)/no-go mean|duration of lws http transaction|
|`n.ss.conn`|context|go/no-go mean|duration of Secure Stream transaction|
|`n.ss.cliprox.conn`|context|go/no-go mean|time taken for client -> proxy connection|
|`vh.[vh-name].rx`|vhost|go/no-go sum|received data on the vhost|
|`vh.[vh-name].tx`|vhost|go/no-go sum|transmitted data on the vhost|

#### Histogram metrics

|metric name|scope|type|meaning|
|---|---|---|---|
|`n.cn.failures`|context|histogram|Histogram of connection attempt failure reasons|

#### Connection failure histogram buckets

|Bucket name|Meaning|
|---|---|
|`tls/invalidca`|Peer certificate CA signature missing or not trusted|
|`tls/hostname`|Peer certificate CN or SAN doesn't match the endpoint we asked for|
|`tls/notyetvalid`|Peer certificate start date is in the future (time wrong?)|
|`tls/expired`|Peer certificate expiry date is in the past|
|`dns/badsrv`|No DNS result because couldn't talk to the server|
|`dns/nxdomain`|No DNS result because server says no result|

The `lws-minimal-secure-streams` example is able to report the aggregated
metrics at the end of execution, eg:

```
[2021/01/13 11:47:19:9145] U: my_metric_report: cpu.svc: 137.045ms / 884.563ms (15%)
[2021/01/13 11:47:19:9145] U: my_metric_report: n.cn.dns: Go: 4, mean: 3.792ms, min: 2.470ms, max: 5.426ms
[2021/01/13 11:47:19:9145] U: my_metric_report: n.cn.tcp: Go: 4, mean: 40.633ms, min: 17.107ms, max: 94.560ms
[2021/01/13 11:47:19:9145] U: my_metric_report: n.cn.tls: Go: 3, mean: 91.232ms, min: 30.322ms, max: 204.635ms
[2021/01/13 11:47:19:9145] U: my_metric_report: n.http.txn: Go: 4, mean: 63.089ms, min: 20.184ms, max: 125.474ms
[2021/01/13 11:47:19:9145] U: my_metric_report: n.ss.conn: Go: 4, mean: 161.740ms, min: 42.937ms, max: 429.510ms
[2021/01/13 11:47:19:9145] U: my_metric_report: vh._ss_default.rx: Go: (1) 102, NoGo: (1) 0
[2021/01/13 11:47:19:9145] U: my_metric_report: vh.le_via_dst.rx: Go: (22) 28.165Ki
[2021/01/13 11:47:19:9145] U: my_metric_report: vh.le_via_dst.tx: Go: (1) 267
[2021/01/13 11:47:19:9145] U: my_metric_report: vh.api_amazon_com.rx: Go: (1) 1.611Ki, NoGo: (1) 0
[2021/01/13 11:47:19:9145] U: my_metric_report: vh.api_amazon_com.tx: Go: (3) 1.505Ki
```

`lws-minimal-secure-streams-testsfail`, which tests various kinds of connection
failure, reports histogram results like this:

```
[2021/01/15 13:10:16:0933] U: my_metric_report: n.cn.failures: tot: 36, [ tls/invalidca: 5, tls/expired: 5, tls/hostname: 5, dns/nxdomain: 21 ]
```

## Support for openmetrics

OpenMetrics (https://tools.ietf.org/html/draft-richih-opsawg-openmetrics-00)
defines a textual metrics export format compatible with Prometheus.  lws
provides a protocol plugin in `./plugins/protocol_lws_openmetrics_export`
that enables direct export for Prometheus scraping, and also protocols to
proxy openmetrics export for unreachable servers.

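As a rough illustration of what a textual export line involves, the sketch below writes a single sample line for one counter. It is not taken from the lws plugin: `demo_om_sample()` is a hypothetical helper, and replacing the dots in lws metric names with `_` is an assumption made here because Prometheus metric names only allow letters, digits, `_` and `:` (and may not start with a digit).

```c
#include <ctype.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Write "<sanitized name> <value>\n" into buf.
 * Returns 0 on success, 1 if the line would not fit.
 */
static int
demo_om_sample(char *buf, size_t len, const char *name, uint64_t value)
{
	size_t n = 0;
	int r;

	/* copy the name, replacing characters not legal in a metric name */
	for (; *name && n + 1 < len; name++, n++) {
		unsigned char c = (unsigned char)*name;
		int ok = isalpha(c) || c == '_' || c == ':' ||
			 (n && isdigit(c));

		buf[n] = ok ? (char)c : '_';
	}
	if (*name || n >= len)
		return 1;	/* name did not fit */

	r = snprintf(buf + n, len - n, " %" PRIu64 "\n", value);

	return r < 0 || (size_t)r >= len - n;
}
```

A complete exposition also carries metadata lines and labels for each histogram bucket, which this sketch leaves out.
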