MCE Stress Test HOWTO
====================

Oct 10th, 2009

Haicheng Li


Abstract
--------

This document explains the design and structure of the MCE stress test
suite, the kernel configurations and user space tools required for
automated stress testing, and how to use the suite.


0. Quick Shortcut
-----------------

- Install a Linux kernel (2.6.32 or newer) with full MCA recovery support.
  Make sure the following configuration options are enabled:

        CONFIG_X86_MCE=y
        CONFIG_MEMORY_FAILURE=y

  With these two options enabled, you can do stress testing through the
  madvise syscall (sec 4.1).

- Install the page-types tool (sec 3.3), which ships with the Linux kernel
  source (2.6.32 or newer).

        # cd $KERNEL_SRC/Documentation/vm/
        # gcc -o page-types page-types.c
        # cp page-types /usr/bin/

- Get the latest LTP (Linux Test Project) image from http://ltp.sf.net.
  Refer to the INSTALL file of LTP to install LTP on your machine.

- Build and run stress testing

        # make
        # cd stress
        # ./hwpoison.sh -d $YOUR_PARTITION -M -o $YOUR_LTP_DIR -N

  Note that '-d $YOUR_PARTITION' is a mandatory option. The test creates
  all temporary files on $YOUR_PARTITION, and error injection only
  affects the pages associated with $YOUR_PARTITION. So you must provide a
  free disk partition to the stress test driver!

  This runs the stress test through the madvise syscall (sec 4.1). However,
  more advanced test methods are also provided (sec 4.2, 4.3).

Note: all examples in the rest of this document assume that $PWD is
the stress subdir.

1. Overview
-----------

The MCE stress test suite is a collection of tools and test scripts that
stress tests the Linux kernel MCA high level handlers, including
HWPoison page recovery, soft page offline, and so on.

In general, this test suite is designed to do stress testing through
various test interfaces, i.e. the madvise syscall, the HWPoison page
injector, and the APEI injector (see the ACPI 4.0 spec). It supports most
popular Linux file systems (FS); that is, there is an option for the user
to specify which FS type the test should run on.

If you just want to start testing as quickly as possible, you can skip
sections 2 and 3 and go directly to section 4.


2. Design Details
-----------------

The MCE stress test suite consists of four parts: test driver, workload
controller, customized workloads, and background workloads.

The main test idea is as follows:
- The test driver launches various customized workloads to continuously
  generate lots of pages with expected page states. Note that all of these
  workloads know their expected results, which should not be affected by
  the Linux MCE high level handlers.
- Then the test driver injects MCE errors into these pages through either
  the madvise syscall, the HWPoison injector, or the APEI injector. While
  the Linux kernel handles these MCE errors, all the workloads continue
  running normally.
- After a long run, the test driver collects the test result of each
  workload to see if any unexpected failures happened. In this way, it can
  decide whether any bug was found.
- If any system panic or FS corruption happens, there must be a bug. This
  is the bottom line for deciding whether the test passes.

2.1 Test Driver

The test driver (a.k.a. hwpoison.sh) drives the whole test procedure.
It's responsible for managing the test environment, setting up the error
injection interface, controlling test progress, launching workloads,
injecting page errors, as well as recording test logs and reporting the
test result.

For detailed usage of the hwpoison.sh test driver, please refer to:
# ./hwpoison.sh -h

2.2 Workload Controller

The workload controller needs to keep various test workloads running in
parallel and continuously for a required duration. We selected the ltp-pan
program of the Linux Test Project (LTP) as the workload controller of this
stress test suite.

The test driver (hwpoison.sh) interacts with ltp-pan in the following ways:
- hwpoison.sh generates a test config file that lists the workload types
  to be launched by ltp-pan.
- hwpoison.sh also passes the test duration and other workload specific
  parameters to ltp-pan via the test config file.
- ltp-pan makes each workload run and finish in time; the test driver
  then gets the result of each workload via the corresponding result files.
- Finally, hwpoison.sh decides the overall test result based on each
  workload result, and reports the final result.

2.3 Customized Workloads

There are three types of customized workloads, which are intended to
generate pages with various page states.

* Type0: page-poisoning workload, meant to cover:
  - anonymous page operations.
  - file data operations.

* Type1: fs-metadata workload, meant to cover:
  - inode operations.

* Type2: fs_type specific workload, meant to cover:
  - extended functions of some special FS.

2.4 Background Workloads

LTP is selected as the background workload to simulate normal system
operations in the background while stress testing is running.

Besides LTP, there are also some alternatives, like AIM. We might add more
background workloads in the future.

2.5 Test Result

How to determine that stress testing passes?
- at least, no kernel panic happens during stress testing.
- fsck on the target disk at the end of stress testing passes.
- no failure is found by the customized workloads, especially by the
  page-poisoning workload.

Where to get detailed test results?
- When stress testing is done, the general test result is recorded in
  result/hwpoison.result, and the general test log is in
  result/hwpoison.log. However, you can specify them in the following way:
        # hwpoison.sh -r $YOUR_RESULT -l $YOUR_LOG
- The test result and test log of each workload are recorded as
  log/$workload/$workload.result and log/$workload/$workload.log.
  For example, for the page-poisoning workload, the test result and test
  log are log/page-poisoning/page-poisoning.result and
  log/page-poisoning/page-poisoning.log.
- Besides, under each workload result dir, you can find extra logs such
  as pan_log, pan_output, etc. These logs are generated by the ltp-pan
  workload controller. Usually they can help you understand what has been
  going on with ltp-pan while the workload is running. Please refer to the
  ltp-pan documentation for details.


3. Tools
--------

3.1 page-poisoning

This is the page-poisoning workload. It is an extension of the tinjpage
test program with a multi-process model. It spawns thousands of processes
that inject HWPoison errors into various pages simultaneously through the
madvise syscall. Then it checks if these errors get handled correctly,
i.e. whether each test process receives or doesn't receive a SIGBUS signal
as expected.

For more info about the page-poisoning workload, please read the README
file under stress/tools/page-poisoning/.
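The SIGBUS check described above can be illustrated at the shell level. The
sketch below is not part of the suite and does not reproduce the workload's
own checking logic, which lives in its C sources. It only shows how a
parent can tell from a wait status that a child was terminated by SIGBUS.
The helper name killed_by_sigbus is hypothetical, and the child that sends
itself SIGBUS merely stands in for a process that touched a poisoned page.

```shell
#!/bin/sh
# Hypothetical helper (not part of the test suite): succeed if the given
# wait status means "terminated by SIGBUS".
killed_by_sigbus() {
    status=$1
    # Shells report "killed by signal N" as a status above 128.
    [ "$status" -gt 128 ] || return 1
    # 'kill -l <status>' maps such a status back to a signal name.
    [ "$(kill -l "$status" 2>/dev/null)" = "BUS" ]
}

# Demo: the child sends SIGBUS to itself, standing in for a process that
# accessed a poisoned page.
sh -c 'kill -BUS $$'
child_status=$?

if killed_by_sigbus "$child_status"; then
    echo "child got SIGBUS"
else
    echo "child ended with status $child_status"
fi
```

Decoding the status with 'kill -l' avoids hard-coding the SIGBUS number,
which differs between architectures.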

3.2 fs-metadata

This is the fs-metadata workload. fs-metadata is designed to test inode
operations under heavy load and make sure every inode operation gets the
expected result. In detail, it first generates a huge directory hierarchy
on the target disk, then performs unlink operations on this directory
hierarchy and duplicates the directory, and finally checks whether the two
directories are the same, as expected.

For more info about the fs-metadata workload, please read the README file
under stress/tools/fs-metadata/.

3.3 page-types

page-types is a tool to query the page type of every memory page in the
system. We use it to select pages with the required page types. The test
injects errors into these pages via the error injector, although the page
filter of the HWPoison handler in the Linux kernel filters them a second
time. Note that the only reason for doing this first round of filtering
with page-types is performance.

To install page-types on your test machine:

        # cd $KERNEL_SRC/Documentation/vm/
        # gcc -o page-types page-types.c
        # cp page-types /usr/bin/

3.4 ltp-pan

It's the workload controller of this stress test suite.
In fact, ltp-pan is the test harness of LTP (Linux Test Project), and is
included in the LTP package. For more information, please refer to the
ltp-pan documentation of LTP.


4. Usage Guide
--------------

This section shows how to conduct the stress testing through various test
interfaces.

As an example, we choose to run stress testing based on partition /dev/sda1
for 1 hour. Note, we've installed LTP to /ltp.

4.1 Stress Test through Madvise Syscall

To run this stress testing, you need to strictly follow the test
instructions below.

* Test instructions:

- make sure the following kernel options are enabled:
        CONFIG_X86_MCE=y
        CONFIG_MEMORY_FAILURE=y

- build and run stress testing
        # make
        # ./hwpoison.sh -d $YOUR_PARTITION -M -o $YOUR_LTP_DIR

* Example:

- launch testing
        # ./hwpoison.sh -d /dev/sda1 -M -t 3600

- general test results
        result: result/hwpoison.result
        logs: result/hwpoison.log

- detailed workload results
        result: log/page-poisoning/page-poisoning.result
        log: log/page-poisoning/page-poisoning.log

4.2 Stress Test through HWPoison Page Injector

This is the default test method of this stress test suite.

To run this stress testing, you need to strictly follow the test
instructions below.

* Test instructions:

- make sure the following kernel options are enabled:
        CONFIG_X86_MCE=y
        CONFIG_MEMORY_FAILURE=y
        CONFIG_DEBUG_KERNEL=y
        CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
        CONFIG_HWPOISON_INJECT=y

- build and run stress testing
        # make
        # ./hwpoison.sh -d $YOUR_PARTITION -o $YOUR_LTP_DIR -L

* Example:

- launch testing
        # ./hwpoison.sh -d /dev/sda1 -t 3600 -L

- general test results
        result: result/hwpoison.result
        logs: result/hwpoison.log

- detailed workload results
        fs-metadata result: log/fs-metadata/fs-metadata.result
        fs-metadata log: log/fs-metadata/fs-metadata.log
        ltp result: log/ltp/ltp.result
        ltp log: log/ltp/ltp.log
        fs-specific result: log/fs-specific/fs-specific.result
        fs-specific log: log/fs-specific/fs-specific.log

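After a run, it can be handy to confirm that the per-workload files listed
above were actually written. The following is a minimal sketch, not part of
hwpoison.sh. It relies only on the log/$workload/$workload.result and
log/$workload/$workload.log naming described in sec 2.5. The function name
check_workload_files is hypothetical, and the sketch checks only that the
files exist, not what they contain.

```shell
#!/bin/sh
# Hypothetical helper (not part of the test suite).
# Usage: check_workload_files LOG_DIR WORKLOAD...
# Reports, for each workload, whether its .result and .log files exist,
# and returns non-zero if any of them is missing.
check_workload_files() {
    log_dir=$1
    shift
    rc=0
    for workload in "$@"; do
        for suffix in result log; do
            f="$log_dir/$workload/$workload.$suffix"
            if [ -f "$f" ]; then
                echo "found: $f"
            else
                echo "missing: $f"
                rc=1
            fi
        done
    done
    return "$rc"
}

# Demo: run from the stress subdir after a test through the HWPoison
# page injector.
check_workload_files log fs-metadata ltp fs-specific ||
    echo "some workload files are missing"
```

For a run through the madvise syscall (sec 4.1), the workload name to pass
would be page-poisoning instead.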
4.3 Stress Test through APEI Injector

To run this stress testing, you need to follow the test instructions below.

* Test instructions:

- make sure the following kernel options are enabled:
        CONFIG_X86_MCE=y
        CONFIG_X86_MCE_INTEL=y
        CONFIG_MEMORY_FAILURE=y
        CONFIG_ACPI_APEI=y
        CONFIG_ACPI_APEI_EINJ=y

- build and run stress testing
        # make
        # ./hwpoison.sh -d $YOUR_PARTITION -o $YOUR_LTP_DIR -L -A

* Example:

- launch testing
        # ./hwpoison.sh -d /dev/sda1 -t 3600 -L -A

- general test results
        result: result/hwpoison.result
        logs: result/hwpoison.log

- detailed workload results
        fs-metadata result: log/fs-metadata/fs-metadata.result
        fs-metadata log: log/fs-metadata/fs-metadata.log
        ltp result: log/ltp/ltp.result
        ltp log: log/ltp/ltp.log
        fs-specific result: log/fs-specific/fs-specific.result
        fs-specific log: log/fs-specific/fs-specific.log


5. FAQs
-------

Here is a collection of frequently asked questions:

Q: How to tell the test driver not to format my disk partition?
A: Use the option '-N'.

Q: Can the three types of tests run on the same system simultaneously?
A: No. There are limitations in Linux kernel HWPoison page filtering.

Q: Can I run this stress testing on multiple disks in parallel?
A: Yes. But it requires updated kernel patches for HWPoison page filtering.
   For now, it only supports running the same test with the same pagetype
   flags specified.
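
Before launching any of the test methods in section 4, the kernel options
they require can be checked against a kernel config file. The following is
a minimal sketch, not part of hwpoison.sh. The function name check_kconfig
is hypothetical, and the location of the config file on a real system
(for example /boot/config-$(uname -r)) is an assumption that varies
between distributions. The demo therefore runs against a small sample
file.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (not part of the test suite).
# Usage: check_kconfig CONFIG_FILE OPTION...
# Prints every OPTION that is not set to 'y' in CONFIG_FILE and returns
# non-zero if at least one is missing. Options built as modules (=m) are
# reported as missing, since this HOWTO asks for =y.
check_kconfig() {
    config_file=$1
    shift
    missing=0
    for opt in "$@"; do
        if ! grep -q "^${opt}=y\$" "$config_file"; then
            echo "missing: $opt"
            missing=1
        fi
    done
    return "$missing"
}

# Demo on a sample config so that the sketch runs anywhere.
sample=$(mktemp)
cat > "$sample" <<'EOF'
CONFIG_X86_MCE=y
CONFIG_MEMORY_FAILURE=y
# CONFIG_HWPOISON_INJECT is not set
EOF

check_kconfig "$sample" CONFIG_X86_MCE CONFIG_MEMORY_FAILURE \
    CONFIG_HWPOISON_INJECT || echo "enable the options above and rebuild"
rm -f "$sample"
```

The option names passed on the command line are the ones listed in the
test instructions of sec 4.1 to 4.3; pass the set that matches the test
method you intend to run.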