KVM RAS Test Suite ================== The KVM RAS Test Suite is a collection of test scripts for testing the Linux kernel MCE processing features in KVM guest system. Jan 26th, 2010 Jiajia Zheng In the Package ---------------- Here is a short description of what is included in the package host/* Contains host test scripts, which drive test procedure on host system. guest/* Contains guest test scripts, which drive test procedure on guest system. Dependencies ---------------- KVM RAS Test Suite has following dependencies on kernel and other tools: * Linux Kernel: Version 2.6.32 or newer, with MCA high level handlers enabled. * mce-inject A tool to inject mce error into kernel * page-types: A tool to query page types, which is accompanied with Linux kernel source (2.6.32 or newer, $KERNEL_SRC/Documentation/vm/page-types.c). * simple_process: A process constantly access the allocated memeory. (../tools/simple_process) * kpartx: A tool to list partition mappings from partition tables * (optionally) losetup A tool to set up and control loop devices * (optionally) pvdisplay/vgchange: A tool to display/change physical volume attribute * (optionally) ssh-keygen A tool to provide the authentication key generation, management and conversion Test method --------------- - Start a process in the guest OS, get a virtual address from guest OS - Translate this address untill we get the physical address on the host OS - Software injects an SRAO MCE at that physical address from host OS - (optionally) Write to the address from the guest, i.e attempt to write to the poisoned page from the guest. the expected result: HOST system dmesg: ... Machine check injector initialized Triggering MCE exception on CPU 0 Disabling lock debugging due to kernel taint [Hardware Error]: Machine check events logged MCE exception done on CPU 0 MCE 0x806324: Killing qemu-system-x86:8829 early due to hardware memory corruption MCE 0x806324: dirty LRU page recovery: Recovered ... GUEST system dmesg: ... [Hardware Error]: Machine check events logged MCE 0x75925: Killing simple_process:2273 early due to hardware memory corruption MCE 0x75925: dirty LRU page recovery : Recovered ... Installation --------------- 1. Build *host* kernel with CONFIG_KVM=y CONFIG_KVM_INTEL=y CONFIG_X86_MCE=y CONFIG_X86_MCE_INTEL=y CONFIG_X86_MCE_INJECT=y or CONFIG_X86_MCE_INJECT=m and following config both on *host* and *guest* CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y CONFIG_MEMORY_FAILURE=y NOTE: if the host machine doesn't support software error recovery (MCG_SER_P in IA32_MCG_CAP[24]), please apply the patch fake_ser_p.patch under ./patches/ 2. Use ssh-keygen to generate public and privite keys on the host OS, and copy id_rsa and id_rsa.pub from ~/.ssh/ into the testing directory on the host system. 3. compile and install qemu-kvm the qemu-kvm source can be got from git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git. Before compile qemu-kvm, a patch p2v.patch should be applied. This patch is located under ./patches/ Please ensure *SDL-devel* library is installed on the host machine, otherwise only VNC can be used substituing local graphic output. 4. install the guest OS on the qemu e.g. step 1: qemu-img create -f qcow2 test.img 10G step 2: qemu-system-x86_64 -hda ./test.img -m 2048 -cdrom rhel6.iso -boot d after the installation, please be sure to execute the following check: a) add necessary command line parameters under your boot item. This is to enable console output redirection to the serial. e.g. before: title Red Hat Enterprise Linux Server (2.6.32kvm) root (hd0,1) kernel /boot/vmlinuz-2.6.32kvm ro root=/dev/sda1 initrd /boot/initramfs-2.6.32kvm.img after: title Red Hat Enterprise Linux Server (2.6.32kvm) root (hd0,1) kernel /boot/vmlinuz-2.6.32kvm ro root=/dev/sda1 console=tty0 console=ttyS0,115200n8 initrd /boot/initramfs-2.6.32kvm.img b) DHCP guest ethernet card. This operation is to ensure the network connection is OK. e.g. bash> dhclient eth0 c) enable SSH public/private key authorization, otherwise, the SSH connection password is necessary to be provided in the test progress. please check /etc/ssh/ssd_config and ensure related options are opened. e.g. RSAAuthentication yes PubkeyAuthentication yes after the related options are opened, please restart ssh service e.g. bash> service sshd restart Start Testing --------------- Run testing by ./host_run.sh