MCE Stress Test HOWTO
=====================

Oct 10th, 2009

Haicheng Li


Abstract
--------

This document explains the design and structure of the MCE stress test
suite, the kernel configuration and user space tools required for automated
stress testing, and how to run the tests.


0. Quick Shortcut
-----------------

- Install a Linux kernel (2.6.32 or newer) with full MCA recovery support.
  Make sure the following configuration options are enabled:

  CONFIG_X86_MCE=y
  CONFIG_MEMORY_FAILURE=y

  With these two options enabled, you can do stress testing through the
  madvise syscall (sec 4.1).

- Install the page-types tool (sec 3.3), which ships with the Linux kernel
  source (2.6.32 or newer).

  # cd $KERNEL_SRC/Documentation/vm/
  # gcc -o page-types page-types.c
  # cp page-types /usr/bin/

- Get the latest LTP (Linux Test Project) package from http://ltp.sf.net.
  Refer to the LTP INSTALL file to install LTP on your machine.

- Build and run the stress test:

  # make
  # cd stress
  # ./hwpoison.sh -d $YOUR_PARTITION -M -o $YOUR_LTP_DIR -N

  Note that '-d $YOUR_PARTITION' is mandatory. The test creates all of its
  temporary files on $YOUR_PARTITION, and error injection only affects the
  pages associated with $YOUR_PARTITION. So you must provide a free disk
  partition to the stress test driver!

  This runs the stress test through the madvise syscall (sec 4.1). More
  advanced test methods are also provided (sec 4.2, 4.3).
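Since the test driver needs a free partition, a small pre-flight check can
save a lost file system. The sketch below is not part of the suite; the
function name is hypothetical, and it assumes that "free" means "a block
device that currently has no entry in /proc/mounts".

```shell
#!/bin/sh
# Pre-flight check for the partition passed to 'hwpoison.sh -d'.
# Assumption: "free" is taken to mean "a block device that is not mounted".

# is_free_partition DEV [MOUNTS_FILE]
# Returns 0 if DEV is a block device with no entry in MOUNTS_FILE
# (default: /proc/mounts), 1 if it is not a block device, 2 if mounted.
is_free_partition() {
    dev=$1
    mounts=${2:-/proc/mounts}

    if [ ! -b "$dev" ]; then
        echo "$dev is not a block device" >&2
        return 1
    fi
    # The first field of each /proc/mounts line is the mounted device.
    if awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$mounts"; then
        echo "$dev is currently mounted" >&2
        return 2
    fi
    return 0
}
```

It could be combined with the command above, for example:
is_free_partition /dev/sda1 && ./hwpoison.sh -d /dev/sda1 -M -o /ltp -N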

Note: all examples in the rest of this document assume that $PWD is the
stress subdir.

1. Overview
-----------

The MCE stress test suite is a collection of tools and test scripts for
stress testing the Linux kernel's high-level MCA handlers, including
HWPoison page recovery, soft page offlining, and so on.

The suite can drive stress testing through several test interfaces:
the madvise syscall, the HWPoison page injector, and the APEI injector
(see the ACPI 4.0 spec). It supports most popular Linux file systems (FS),
and provides an option to specify which FS type the test should run on.

If you just want to start testing as quickly as possible, you can skip
sections 2 and 3 and go directly to section 4.


2. Design Details
-----------------

The MCE stress test suite consists of four parts: test driver, workload
controller, customized workloads, and background workloads.

The main test idea is as follows:
- The test driver launches various customized workloads that continuously
  generate lots of pages in the expected page states. Each workload knows
  its own expected results, which must not be affected by the Linux MCE
  high-level handlers.
- The test driver then injects MCE errors into these pages through the
  madvise syscall, the HWPoison injector, or the APEI injector. While the
  Linux kernel handles these MCE errors, all workloads keep running
  normally.
- After a long run, the test driver collects the result of each workload
  to see whether any unexpected failure happened, and so decides whether a
  bug has been found.
- Any system panic or FS corruption means there must be a bug. This is the
  bottom line for deciding whether the test passes.

2.1 Test Driver

The test driver (a.k.a. hwpoison.sh) drives the whole test procedure. It is
responsible for managing the test environment, setting up the error
injection interface, controlling test progress, launching workloads,
injecting page errors, recording test logs, and reporting the test result.

For detailed usage of the hwpoison.sh test driver, run:
# ./hwpoison.sh -h

2.2 Workload Controller

The workload controller has to keep various test workloads running in
parallel and continuously for a required duration. We selected the ltp-pan
program of the Linux Test Project (LTP) as the workload controller of this
stress test suite.

The test driver (hwpoison.sh) interacts with ltp-pan in the following ways:
- hwpoison.sh generates a test config file that lists the workload types
  to be launched by ltp-pan.
- hwpoison.sh also passes the test duration and other workload-specific
  parameters to ltp-pan via the test config file.
- ltp-pan makes each workload run and finish in time; the test driver can
  then get the result of each workload from the corresponding result file.
- Finally, hwpoison.sh decides the overall test result based on each
  workload's result, and reports the final result.
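The config file handed to ltp-pan can be pictured as a plain command file.
The sketch below assumes ltp-pan's usual command-file layout of one
"tag command" pair per line. The function name and the ./run-<workload>.sh
command lines are hypothetical placeholders, not what hwpoison.sh actually
writes.

```shell
#!/bin/sh
# gen_pan_cmdfile OUTFILE DURATION WORKLOAD...
# Writes one "tag command" line per workload into OUTFILE.
# "./run-<workload>.sh DURATION" is a made-up placeholder command; the
# real workload command lines are generated by hwpoison.sh.
gen_pan_cmdfile() {
    out=$1
    duration=$2
    shift 2
    : > "$out"      # start from an empty file
    for w in "$@"; do
        printf '%s ./run-%s.sh %s\n' "$w" "$w" "$duration" >> "$out"
    done
}
```

For example, 'gen_pan_cmdfile pan.cmd 3600 page-poisoning fs-metadata'
would produce a two-line file, one line per workload.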

2.3 Customized Workloads

There are three types of customized workloads, which are intended to
generate pages in various page states.

* Type0: page-poisoning workload, meant to cover:
  - anonymous page operations.
  - file data operations.

* Type1: fs-metadata workload, meant to cover:
  - inode operations.

* Type2: fs_type specific workload, meant to cover:
  - extended functions of some special FS.

2.4 Background Workloads

LTP is selected as the background workload to simulate normal system
operations in the background while stress testing is running.

Besides LTP, there are alternatives such as AIM. We might add more
background workloads in the future.

2.5 Test Result

How to determine that the stress test passes?
- At the very least, no kernel panic happens during stress testing.
- fsck on the target disk passes at the end of stress testing.
- The customized workloads find no failures, especially the
  page-poisoning workload.
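The fsck criterion above can be checked from fsck's exit status. The
sketch below relies on the exit codes documented in fsck(8), where 1 means
errors were corrected and 4 means errors were left uncorrected. The function
name is hypothetical, and treating corrected errors as a failure is an
assumption about how strict the check should be.

```shell
#!/bin/sh
# fsck_verdict STATUS
# Maps an fsck exit status to a verdict for the stress test.
# Assumption: only a clean status (0) counts as a pass; corrected (1) or
# uncorrected (4) errors are both treated as file system corruption.
fsck_verdict() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "PASS"
    elif [ $((status & 5)) -ne 0 ]; then
        echo "FAIL: file system errors found"
    else
        echo "UNKNOWN: fsck could not complete (status $status)"
    fi
}
```

A possible use on the unmounted test partition is
'fsck -n /dev/sda1; fsck_verdict $?', where -n asks checkers that support
it not to repair anything.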

Where to get detailed test results?
- When stress testing is done, the general test result is recorded in
  result/hwpoison.result, and the general test log is in result/hwpoison.log.
  You can specify other locations as follows:
  # ./hwpoison.sh -r $YOUR_RESULT -l $YOUR_LOG
- The test result and test log of each workload are recorded as
  log/$workload/$workload.result and log/$workload/$workload.log.
  For example, the test result and test log of the page-poisoning workload
  are log/page-poisoning/page-poisoning.result and
  log/page-poisoning/page-poisoning.log.
- In addition, under each workload result dir you can find extra logs such
  as pan_log and pan_output. These logs are generated by the ltp-pan
  workload controller, and usually help you understand what ltp-pan was
  doing while the workload was running. Please refer to the ltp-pan
  documentation for details.
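The per-workload naming scheme above is regular enough to script. As a
minimal sketch, the hypothetical helper below prints the result file and
log file of a workload, following the log/$workload/$workload.{result,log}
layout described above.

```shell
#!/bin/sh
# workload_files WORKLOAD [LOG_ROOT]
# Prints the result file and the log file of one workload, one per line.
# LOG_ROOT defaults to "log", relative to the stress subdir.
workload_files() {
    w=$1
    root=${2:-log}
    printf '%s/%s/%s.result\n' "$root" "$w" "$w"
    printf '%s/%s/%s.log\n' "$root" "$w" "$w"
}
```

For instance, 'workload_files fs-metadata' prints the two fs-metadata
paths listed in sec 4.2.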


3. Tools
--------

3.1 page-poisoning

This is the page-poisoning workload. It is an extension of the tinjpage
test program with a multi-process model. It spawns thousands of processes
that simultaneously inject HWPoison errors into various pages through the
madvise syscall. It then checks whether these errors get handled correctly,
i.e. whether each test process receives or doesn't receive a SIGBUS signal
as expected.

For more info about the page-poisoning workload, please read the README
file under stress/tools/page-poisoning/.

3.2 fs-metadata

This is the fs-metadata workload. It is designed to test inode operations
under heavy load and make sure every inode operation gets the expected
result. In detail, it first generates a huge directory hierarchy on the
target disk, then performs unlink operations on this hierarchy and
duplicates the directory, and finally checks that the two directories are
the same, as expected.

For more info about the fs-metadata workload, please read the README file
under stress/tools/fs-metadata/.
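The generate / unlink / duplicate / compare cycle can be illustrated with
standard tools. This is only a toy sketch of the idea, not the fs-metadata
tool itself; the function name, tree shape, and file names are made up.

```shell
#!/bin/sh
# metadata_roundtrip WORKDIR
# Builds a small tree under WORKDIR, unlinks part of it, copies it, and
# compares the copy with the original.
# Returns 0 when the copy matches the original tree.
metadata_roundtrip() {
    work=$1
    src=$work/src
    dup=$work/dup

    # 1. Generate a directory hierarchy (tiny here; the real one is huge).
    for d in a b c; do
        mkdir -p "$src/$d/sub" || return 1
        for f in 1 2 3; do
            echo "$d-$f" > "$src/$d/sub/file$f" || return 1
        done
    done

    # 2. Unlink some of the files.
    rm -f "$src"/*/sub/file2

    # 3. Duplicate the directory.
    cp -R "$src" "$dup" || return 1

    # 4. Check that both directories are the same.
    diff -r "$src" "$dup" > /dev/null
}
```

Running it on a scratch directory located on the partition under test,
e.g. 'metadata_roundtrip /mnt/test/scratch', exercises inode creation,
unlink, and lookup on that file system (the mount point is hypothetical).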

3.3 page-types

page-types is a tool to query the page type of every memory page in the
system. We use it to pick out pages with the required page types. The test
then injects errors into these pages via the error injector, although the
page filter of the HWPoison handler in the Linux kernel filters them a
second time. The only reason for doing this first round of filtering with
page-types is performance.

To install page-types on your test machine:

  # cd $KERNEL_SRC/Documentation/vm/
  # gcc -o page-types page-types.c
  # cp page-types /usr/bin/
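The path above matches the 2.6.32-era kernel tree. Later kernel trees
keep page-types.c under tools/vm/ instead, so a build script may want to
look in both places. The helper below is a hypothetical sketch of that
lookup, not part of the suite.

```shell
#!/bin/sh
# find_page_types_src KERNEL_SRC
# Prints the path of page-types.c inside a kernel tree and returns 0,
# or prints a message on stderr and returns 1 if it cannot be found.
find_page_types_src() {
    ksrc=$1
    # Documentation/vm is the location used in this HOWTO;
    # tools/vm is where later trees carry the file.
    for sub in Documentation/vm tools/vm; do
        if [ -f "$ksrc/$sub/page-types.c" ]; then
            echo "$ksrc/$sub/page-types.c"
            return 0
        fi
    done
    echo "page-types.c not found under $ksrc" >&2
    return 1
}
```

A possible use is:
src=$(find_page_types_src "$KERNEL_SRC") && gcc -o page-types "$src"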

3.4 ltp-pan

This is the workload controller of this stress test suite. ltp-pan is the
test harness of LTP (Linux Test Project) and is included in the LTP
package. For more information, please refer to the ltp-pan documentation
in LTP.


4. Usage Guide
--------------

This section shows how to conduct the stress testing through the various
test interfaces.

As an example, we run the stress test on partition /dev/sda1 for 1 hour.
Note that LTP is installed in /ltp.

4.1 Stress Test through the madvise Syscall

To run this stress test, strictly follow the test instructions below.

* Test instructions:

- Make sure the following kernel options are enabled:
  CONFIG_X86_MCE=y
  CONFIG_MEMORY_FAILURE=y

- Build and run the stress test:
  # make
  # ./hwpoison.sh -d $YOUR_PARTITION -M -o $YOUR_LTP_DIR
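Whether the required kernel options are enabled can be checked against the
kernel's config file. The helper below is a hypothetical sketch; it assumes
you have a plain-text config file, such as /boot/config-$(uname -r) on many
distributions, and it only accepts options that are built in (=y).

```shell
#!/bin/sh
# missing_kconfig CONFIG_FILE OPTION...
# Prints every OPTION that is not set to "y" in CONFIG_FILE.
# Returns 0 if nothing is missing, 1 otherwise.
missing_kconfig() {
    cfg=$1
    shift
    rc=0
    for opt in "$@"; do
        # Match the whole line, so CONFIG_X86_MCE does not match
        # CONFIG_X86_MCE_INTEL=y.
        if ! grep -q "^${opt}=y\$" "$cfg"; then
            echo "$opt"
            rc=1
        fi
    done
    return $rc
}
```

For the madvise method this could be called as:
missing_kconfig "/boot/config-$(uname -r)" CONFIG_X86_MCE CONFIG_MEMORY_FAILURE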

* Example:

- Launch testing:
  # ./hwpoison.sh -d /dev/sda1 -M -t 3600

- General test results:
  result: result/hwpoison.result
  log: result/hwpoison.log

- Detailed workload results:
  result: log/page-poisoning/page-poisoning.result
  log: log/page-poisoning/page-poisoning.log

4.2 Stress Test through the HWPoison Page Injector

This is the default test method of this stress test suite.

To run this stress test, strictly follow the test instructions below.

* Test instructions:

- Make sure the following kernel options are enabled:
  CONFIG_X86_MCE=y
  CONFIG_MEMORY_FAILURE=y
  CONFIG_DEBUG_KERNEL=y
  CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
  CONFIG_HWPOISON_INJECT=y

- Build and run the stress test:
  # make
  # ./hwpoison.sh -d $YOUR_PARTITION -o $YOUR_LTP_DIR -L
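With CONFIG_HWPOISON_INJECT, the kernel exposes the injector through
debugfs as hwpoison/corrupt-pfn. The hypothetical helper below checks that
this file is present and writable before a long run is started; it assumes
debugfs is mounted at /sys/kernel/debug unless told otherwise.

```shell
#!/bin/sh
# hwpoison_injector_ready [DEBUGFS_ROOT]
# Returns 0 if the injector's corrupt-pfn file is present and writable,
# 1 otherwise. DEBUGFS_ROOT defaults to /sys/kernel/debug (an assumption
# about where debugfs is mounted).
hwpoison_injector_ready() {
    root=${1:-/sys/kernel/debug}
    f=$root/hwpoison/corrupt-pfn
    if [ -w "$f" ]; then
        return 0
    fi
    echo "$f not writable: is debugfs mounted and the injector loaded?" >&2
    return 1
}
```

If the check fails, mounting debugfs and, when the injector was built as a
module, loading hwpoison-inject are the usual things to try.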

* Example:

- Launch testing:
  # ./hwpoison.sh -d /dev/sda1 -t 3600 -L

- General test results:
  result: result/hwpoison.result
  log: result/hwpoison.log

- Detailed workload results:
  fs-metadata result: log/fs-metadata/fs-metadata.result
  fs-metadata log: log/fs-metadata/fs-metadata.log
  ltp result: log/ltp/ltp.result
  ltp log: log/ltp/ltp.log
  fs-specific result: log/fs-specific/fs-specific.result
  fs-specific log: log/fs-specific/fs-specific.log

4.3 Stress Test through the APEI Injector

To run this stress test, follow the test instructions below.

* Test instructions:

- Make sure the following kernel options are enabled:
  CONFIG_X86_MCE=y
  CONFIG_X86_MCE_INTEL=y
  CONFIG_MEMORY_FAILURE=y
  CONFIG_ACPI_APEI=y
  CONFIG_ACPI_APEI_EINJ=y

- Build and run the stress test:
  # make
  # ./hwpoison.sh -d $YOUR_PARTITION -o $YOUR_LTP_DIR -L -A
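Not every platform's firmware supports injecting memory errors through
EINJ. The kernel's EINJ driver lists the supported types in the debugfs
file apei/einj/available_error_type. The hypothetical helper below prints
the memory-related entries of that file; the default path assumes debugfs
is mounted at /sys/kernel/debug.

```shell
#!/bin/sh
# einj_memory_types [FILE]
# Prints the memory-related lines of EINJ's available_error_type file.
# Returns non-zero if the file is unreadable or lists no memory error type.
einj_memory_types() {
    f=${1:-/sys/kernel/debug/apei/einj/available_error_type}
    [ -r "$f" ] || return 1
    # Each line is "<hex type><TAB><description>"; keep the memory ones.
    grep -i 'memory' "$f"
}
```

If the function prints nothing, the APEI injection method is unlikely to
work on that machine, and the methods of sec 4.1 or 4.2 remain available.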

* Example:

- Launch testing:
  # ./hwpoison.sh -d /dev/sda1 -t 3600 -L -A

- General test results:
  result: result/hwpoison.result
  log: result/hwpoison.log

- Detailed workload results:
  fs-metadata result: log/fs-metadata/fs-metadata.result
  fs-metadata log: log/fs-metadata/fs-metadata.log
  ltp result: log/ltp/ltp.result
  ltp log: log/ltp/ltp.log
  fs-specific result: log/fs-specific/fs-specific.result
  fs-specific log: log/fs-specific/fs-specific.log


5. FAQs
-------

Here is a collection of frequently asked questions:

Q: How do I tell the test driver not to format my disk partition?
A: Use the option '-N'.

Q: Can the three types of tests run on the same system simultaneously?
A: No. There are limitations in the Linux kernel's HWPoison page filtering.

Q: Can I run this stress test on multiple disks in parallel?
A: Yes, but it requires updated kernel patches for HWPoison page filtering.
   For now, it only supports running the same test with the same pagetype
   flags specified.