The health mechanism is targeted for Real Time Alerting, in order to know when
something bad has happened to a PCI device:
- Provide alert debug information
- Self healing
- If a problem needs vendor support, provide a way to gather all needed
  debugging information.

The main idea is to unify and centralize driver health reports in the
generic devlink instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The devlink health reporter:
The device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. pci error, fw error,
rx/tx error) or unknown (driver specific).
For each registered health reporter a driver can issue error/health reports
asynchronously. All health report handling is done by devlink.
The device driver can provide specific callbacks for each "health reporter",
e.g.
 - Recovery procedures
 - Diagnostics and object dump procedures
 - OOB initial parameters
Different parts of the driver can register different types of health reporters
with different handlers.

Once an error is reported, devlink health will take the following actions:
  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    there is no other dump already stored)
  * An auto recovery attempt is made. This depends on:
    - Auto-recovery configuration
    - Grace period vs. time passed since last recovery

The user interface:
The user can access/change each reporter's parameters and driver specific
callbacks via devlink, e.g. per error type (per health reporter):
 - Configure the reporter's generic parameters (like: disable/enable auto
   recovery)
 - Invoke the recovery procedure
 - Run diagnostics
 - Object dump

The devlink health interface (via netlink):
DEVLINK_CMD_HEALTH_REPORTER_GET
  Retrieves status and configuration info per DEV and reporter.
DEVLINK_CMD_HEALTH_REPORTER_SET
  Allows reporter-related configuration setting.
DEVLINK_CMD_HEALTH_REPORTER_RECOVER
  Triggers a reporter's recovery procedure.
DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
  Retrieves diagnostics data from a reporter on a device.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
  Retrieves the last stored dump. Devlink health
  saves a single dump. If a dump is not already stored by devlink
  for this reporter, devlink generates a new dump.
  The dump output is defined by the reporter.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
  Clears the last saved dump file for the specified reporter.


                                            netlink
                                  +--------------------------+
                                  |                          |
                                  |            +             |
                                  |            |             |
                                  +--------------------------+
                                               |request for ops
                                               |(diagnose,
 mlx5_core                             devlink |recover,
                                               |dump)
+--------+                        +--------------------------+
|        |                        |    reporter|             |
|        |                        |  +---------v----------+  |
|        |   ops execution        |  |                    |  |
|     <----------------------------------+                |  |
|        |                        |  |                    |  |
|        |                        |  + ^------------------+  |
|        |                        |    | request for ops     |
|        |                        |    | (recover, dump)     |
|        |                        |    |                     |
|        |                        |  +-+------------------+  |
|        |     health report      |  | health handler     |  |
|        +------------------------------->                |  |
|        |                        |  +--------------------+  |
|        | health reporter create |                          |
|        +---------------------------->                      |
+--------+                        +--------------------------+
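
The actions devlink health takes once an error is reported (update status and
statistics, keep a single dump, attempt auto recovery subject to the
auto-recovery configuration and the grace period) can be illustrated with a
small userspace model. This is only a sketch under stated assumptions, not the
kernel implementation: every name in it (model_reporter, model_health_report,
the MODEL_* results, the millisecond timestamps) is hypothetical and is not
part of the devlink API, and the trace event step is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model of the report flow described in this document.
 * All names are hypothetical; this is NOT the kernel's devlink API.
 */
enum model_result {
    MODEL_RECOVERED,       /* auto recovery ran and succeeded */
    MODEL_NO_AUTO_RECOVER, /* auto recovery disabled or no callback */
    MODEL_IN_GRACE_PERIOD, /* too soon after the previous recovery */
    MODEL_RECOVER_FAILED,  /* recovery callback returned an error */
};

struct model_reporter {
    bool auto_recover;          /* auto-recovery configuration */
    uint64_t grace_period_ms;   /* minimum gap between auto recoveries */
    uint64_t last_recovery_ms;  /* time of the last successful recovery */
    bool recovered_before;      /* false until the first recovery */
    bool dump_stored;           /* at most one dump is kept */
    bool healthy;               /* health status */
    unsigned int error_count;   /* statistics */
    unsigned int recover_count;
    int (*recover)(void *priv); /* driver callbacks, 0 on success */
    int (*dump)(void *priv);
    void *priv;
};

/* Trivial stand-in callbacks so the model can be exercised. */
int model_recover_ok(void *priv) { (void)priv; return 0; }
int model_dump_ok(void *priv) { (void)priv; return 0; }

enum model_result model_health_report(struct model_reporter *r,
                                      uint64_t now_ms)
{
    assert(r != NULL);

    /* Update health status and statistics. */
    r->healthy = false;
    r->error_count++;

    /* Take a dump only if none is already stored. */
    if (!r->dump_stored && r->dump && r->dump(r->priv) == 0)
        r->dump_stored = true;

    /* Auto recovery depends on the configuration ... */
    if (!r->auto_recover || !r->recover)
        return MODEL_NO_AUTO_RECOVER;

    /* ... and on the grace period vs. time since the last recovery. */
    if (r->recovered_before &&
        now_ms - r->last_recovery_ms < r->grace_period_ms)
        return MODEL_IN_GRACE_PERIOD;

    if (r->recover(r->priv) != 0)
        return MODEL_RECOVER_FAILED;

    r->healthy = true;
    r->recover_count++;
    r->last_recovery_ms = now_ms;
    r->recovered_before = true;
    return MODEL_RECOVERED;
}
```

In this model, with grace_period_ms set to 1000, a report at t=5000 triggers a
recovery, a second report at t=5500 is counted but not recovered, and a report
at t=6000 is recovered again. The stored dump from the first report is kept
throughout, since only a single dump is saved per reporter.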