// Copyright 2021-2024 The Khronos Group Inc.
//
// SPDX-License-Identifier: CC-BY-4.0

= VK_EXT_device_fault
:toc: left
:refpage: https://registry.khronos.org/vulkan/specs/1.2-extensions/man/html/
:sectnums:

This document outlines functionality to allow applications to query for
additional diagnostic information following device-loss.

== Problem Statement

Device-loss errors can be challenging to diagnose. They can be triggered by a
number of issues, including invalid application behaviour, driver bugs, and
physical failure or removal of hardware. Whilst the Vulkan Validation layers are
recommended as a first step in diagnosing the majority of API usage issues, they
are unable to address all possible causes of device-loss.

This proposal aims to provide application developers with additional information
that may aid in diagnosing such errors.

== Solution Space

Several options have been considered:

- Provide foundational extensions to enable the development of crash postmortem
  tooling
- Develop extensions or tools that aim to attribute faults to individual Vulkan
  objects
- Rely on individual vendor tools and extensions

This proposal focuses on the first option. It represents a partial solution,
with further extensions required in order to fully enable crash postmortem
tooling.

== Proposal

=== API Features

The following features are exposed by the `VK_EXT_device_fault` extension:

[source,c]
----
typedef struct VkPhysicalDeviceFaultFeaturesEXT {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           deviceFault;
    VkBool32           deviceFaultVendorBinary;
} VkPhysicalDeviceFaultFeaturesEXT;
----

`deviceFault` is the main feature enabling this extension’s functionality and
must be supported if this extension is supported.

`deviceFaultVendorBinary` is an optional feature that enables support for
vendor-specific binary crash dumps, which may be interpreted via external vendor
tools.

=== Querying for Fault Information

Following device-loss, applications may query for additional diagnostic
information by calling `vkGetDeviceFaultInfoEXT`.

[source,c]
----
typedef struct VkDeviceFaultCountsEXT {
    VkStructureType    sType;
    void*              pNext;
    uint32_t           addressInfoCount;
    uint32_t           vendorInfoCount;
    VkDeviceSize       vendorBinarySize;
} VkDeviceFaultCountsEXT;

typedef struct VkDeviceFaultInfoEXT {
    VkStructureType                 sType;
    void*                           pNext;
    char                            description[VK_MAX_DESCRIPTION_SIZE];
    VkDeviceFaultAddressInfoEXT*    pAddressInfos;
    VkDeviceFaultVendorInfoEXT*     pVendorInfos;
    void*                           pVendorBinaryData;
} VkDeviceFaultInfoEXT;

VKAPI_ATTR VkResult VKAPI_CALL vkGetDeviceFaultInfoEXT(
    VkDevice                                    device,
    VkDeviceFaultCountsEXT*                     pFaultCounts,
    VkDeviceFaultInfoEXT*                       pFaultInfo);
----

The signature of `vkGetDeviceFaultInfoEXT` is intended to mirror the design of
existing query functions, where the second parameter (`pFaultCounts`) indicates
the size of the output arrays, or the number of results written. However, device fault
information requires multiple output arrays. Therefore, a
`VkDeviceFaultCountsEXT` structure is used to specify the sizes of multiple
arrays at once.

[source,c]
----
// Query number of available results
VkDeviceFaultCountsEXT faultCounts{};
faultCounts.sType = VK_STRUCTURE_TYPE_DEVICE_FAULT_COUNTS_EXT;

vkGetDeviceFaultInfoEXT(device, &faultCounts, NULL);

// Allocate output arrays and query fault data
VkDeviceFaultInfoEXT faultInfo{};
faultInfo.sType             = VK_STRUCTURE_TYPE_DEVICE_FAULT_INFO_EXT;
faultInfo.pAddressInfos     = (VkDeviceFaultAddressInfoEXT*) malloc(sizeof(VkDeviceFaultAddressInfoEXT) *
                                                                    faultCounts.addressInfoCount);
faultInfo.pVendorInfos      = (VkDeviceFaultVendorInfoEXT*)  malloc(sizeof(VkDeviceFaultVendorInfoEXT)  *
                                                                    faultCounts.vendorInfoCount);
faultInfo.pVendorBinaryData = malloc(faultCounts.vendorBinarySize);

vkGetDeviceFaultInfoEXT(device, &faultCounts, &faultInfo);
----

=== Interpreting GPU Virtual Addresses

Implementations may return information on both page faults generated by invalid
memory accesses, and instruction pointers indicating the instructions executing
at the time of the fault.

[source,c]
----
typedef enum VkDeviceFaultAddressTypeEXT {
    VK_DEVICE_FAULT_ADDRESS_TYPE_NONE_EXT = 0,
    VK_DEVICE_FAULT_ADDRESS_TYPE_READ_INVALID_EXT = 1,
    VK_DEVICE_FAULT_ADDRESS_TYPE_WRITE_INVALID_EXT = 2,
    VK_DEVICE_FAULT_ADDRESS_TYPE_EXECUTE_INVALID_EXT = 3,
    VK_DEVICE_FAULT_ADDRESS_TYPE_INSTRUCTION_POINTER_UNKNOWN_EXT = 4,
    VK_DEVICE_FAULT_ADDRESS_TYPE_INSTRUCTION_POINTER_INVALID_EXT = 5,
    VK_DEVICE_FAULT_ADDRESS_TYPE_INSTRUCTION_POINTER_FAULT_EXT = 6,
    VK_DEVICE_FAULT_ADDRESS_TYPE_MAX_ENUM_EXT = 0x7FFFFFFF
} VkDeviceFaultAddressTypeEXT;

typedef struct VkDeviceFaultAddressInfoEXT {
    VkDeviceFaultAddressTypeEXT    addressType;
    VkDeviceAddress                reportedAddress;
    VkDeviceSize                   addressPrecision;
} VkDeviceFaultAddressInfoEXT;
----
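
As a rough illustration of how a tool might group the address types listed
above, the following sketch sorts them into page faults and instruction
pointers. The `ExampleFaultAddressKind` enum and the
`example_classify_address_type` function are hypothetical names introduced
here and are not part of the extension. The numeric values are copied from the
enum listing above so that the sketch builds without the Vulkan headers.

```c
#include <stdint.h>

/* Hypothetical grouping of address types; not part of the extension. */
typedef enum ExampleFaultAddressKind {
    EXAMPLE_FAULT_KIND_NONE,
    EXAMPLE_FAULT_KIND_PAGE_FAULT,
    EXAMPLE_FAULT_KIND_INSTRUCTION_POINTER,
    EXAMPLE_FAULT_KIND_UNRECOGNIZED
} ExampleFaultAddressKind;

/* Takes the raw value of a VkDeviceFaultAddressTypeEXT.
 * The case labels use the numeric values from the listing above. */
static ExampleFaultAddressKind example_classify_address_type(uint32_t addressType)
{
    switch (addressType) {
    case 0: /* NONE */
        return EXAMPLE_FAULT_KIND_NONE;
    case 1: /* READ_INVALID */
    case 2: /* WRITE_INVALID */
    case 3: /* EXECUTE_INVALID */
        return EXAMPLE_FAULT_KIND_PAGE_FAULT;
    case 4: /* INSTRUCTION_POINTER_UNKNOWN */
    case 5: /* INSTRUCTION_POINTER_INVALID */
    case 6: /* INSTRUCTION_POINTER_FAULT */
        return EXAMPLE_FAULT_KIND_INSTRUCTION_POINTER;
    default:
        /* Values added by later revisions land here. */
        return EXAMPLE_FAULT_KIND_UNRECOGNIZED;
    }
}
```

Code that includes the Vulkan headers would more naturally switch on the
`VK_DEVICE_FAULT_ADDRESS_TYPE_*` names directly.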

Page addresses and instruction pointers are reported as GPU virtual addresses,
and additional extensions or vendor tools may be required in order to correlate
these addresses with individual Vulkan objects.

Implementations may only be able to report these addresses with limited
precision. The combination of `reportedAddress` and `addressPrecision`
allows the possible range of addresses to be calculated, such that:

[source,c++]
---------------------------------------------------
lower_address = (pInfo->reportedAddress & ~(pInfo->addressPrecision-1))
upper_address = (pInfo->reportedAddress |  (pInfo->addressPrecision-1))
---------------------------------------------------

[NOTE]
.Note
====
It is valid for the `reportedAddress` to contain a more precise address
than indicated by `addressPrecision`.
In this case, the value of `reportedAddress` should be
treated as an additional hint as to the value of the address that triggered the
page fault, or to the value of an instruction pointer.
====
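
The range calculation above can be sketched as a small helper. This is a
minimal illustration, not a definitive implementation: `ExampleAddressRange`
and `example_fault_address_range` are hypothetical names, and plain `uint64_t`
values stand in for `VkDeviceAddress` and `VkDeviceSize` so that the sketch
builds without the Vulkan headers. It assumes `addressPrecision` is a non-zero
power of two, which the masking arithmetic requires, and rejects other values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: inclusive range of candidate addresses. */
typedef struct ExampleAddressRange {
    uint64_t lower;
    uint64_t upper;
} ExampleAddressRange;

/* Computes the inclusive address range implied by a reported address and
 * its precision. Returns false if the precision is not a non-zero power
 * of two (an assumption of this sketch), leaving *out untouched. */
static bool example_fault_address_range(uint64_t reportedAddress,
                                        uint64_t addressPrecision,
                                        ExampleAddressRange *out)
{
    if (addressPrecision == 0 ||
        (addressPrecision & (addressPrecision - 1)) != 0) {
        return false;
    }

    const uint64_t mask = addressPrecision - 1;
    out->lower = reportedAddress & ~mask; /* clear the imprecise low bits */
    out->upper = reportedAddress |  mask; /* set the imprecise low bits   */
    return true;
}
```

For example, a reported address of `0x12345678` with a precision of `0x1000`
yields the range `0x12345000` to `0x12345FFF`, while a precision of `1` yields
a range containing only the reported address.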


=== Vendor Binary Crash Dumps

Optionally, implementations may also support the generation of vendor-specific
binary blobs containing additional diagnostic information. All vendor-specific
binaries will begin with a common header. The contents of the remainder of the
binary blob are vendor-specific, and will require vendor-specific documentation
or tools to interpret.

[source,c]
----
typedef struct VkDeviceFaultVendorBinaryHeaderVersionOneEXT {
    uint32_t                                     headerSize;
    VkDeviceFaultVendorBinaryHeaderVersionEXT    headerVersion;
    uint32_t                                     vendorID;
    uint32_t                                     deviceID;
    uint32_t                                     driverVersion;
    uint8_t                                      pipelineCacheUUID[VK_UUID_SIZE];
    uint32_t                                     applicationNameOffset;
    uint32_t                                     applicationVersion;
    uint32_t                                     engineNameOffset;
} VkDeviceFaultVendorBinaryHeaderVersionOneEXT;
----
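
A tool that receives a vendor binary might begin by copying out the common
header and sanity-checking it before handing the blob to vendor tooling. The
sketch below is illustrative only. `ExampleVendorBinaryHeaderV1` is a
hypothetical stand-in that mirrors the fields exactly as listed above, using
fixed-width types so that it builds without the Vulkan headers, and
`example_read_vendor_binary_header` is likewise a hypothetical helper. Real
code should use the structure definition from the Vulkan headers, which should
be treated as authoritative for the field list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_UUID_SIZE 16 /* same value as VK_UUID_SIZE */

/* Hypothetical mirror of the header fields listed above. */
typedef struct ExampleVendorBinaryHeaderV1 {
    uint32_t headerSize;
    uint32_t headerVersion;
    uint32_t vendorID;
    uint32_t deviceID;
    uint32_t driverVersion;
    uint8_t  pipelineCacheUUID[EXAMPLE_UUID_SIZE];
    uint32_t applicationNameOffset;
    uint32_t applicationVersion;
    uint32_t engineNameOffset;
} ExampleVendorBinaryHeaderV1;

/* Copies the common header out of a vendor binary blob.
 * Returns false if the blob is too small to hold the header, or if the
 * header reports a size that is smaller than the mirrored structure or
 * larger than the blob itself. */
static bool example_read_vendor_binary_header(const void *blob,
                                              size_t blobSize,
                                              ExampleVendorBinaryHeaderV1 *out)
{
    if (blob == NULL || out == NULL || blobSize < sizeof(*out)) {
        return false;
    }

    /* memcpy avoids alignment assumptions about the blob pointer. */
    memcpy(out, blob, sizeof(*out));

    if (out->headerSize < sizeof(*out) || out->headerSize > blobSize) {
        return false;
    }
    return true;
}
```

Once the header has been validated, fields such as `vendorID` and `deviceID`
can be used to decide which vendor tool, if any, should interpret the rest of
the blob.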

== Issues

1) Should `vkGetDeviceFaultInfoEXT` return multiple faults?

*RESOLVED*: No. This extension only seeks to identify a single fault as a
possible cause of device loss and not to maintain a log of multiple faults.
We anticipate that in cases where a GPU does encounter multiple faults, there
is a high probability that the faults would be duplicates, such as those caused
by parallel execution of the same defective code.

2) Can `vkGetDeviceFaultInfoEXT` be called prior to device loss?

*RESOLVED*: No. `VK_KHR_fault_handling` in Vulkan SC does support an equivalent
to this, but `VK_KHR_fault_handling` aims to address a different use case, where
a fault log is polled prior to device loss to enable remedial action to be taken.

3) Do page faults need to report the actual address that was accessed, or
should we allow reporting of the page address?

*RESOLVED*: Some IHVs' hardware reports page faults at page alignment, or
at some other hardware-unit dependent granularity, rather than the precise
address that triggered the fault. All addresses are reported at hardware-unit
dependent granularity, along with an associated precision indicator. This information
can be used to compute an address range that contains the original address that
triggered the fault.

4) How should we report cases where one of multiple pipelines may have caused a
fault?

*RESOLVED*: In cases where a fault cannot be attributed to a single unique
pipeline, reporting the set of possible candidates is desirable.

5) The page fault and instruction address information structures have similar
structure. Should they be combined?

*RESOLVED*: Yes. These have been combined as `VkDeviceFaultAddressInfoEXT`
to reduce API surface area.

6) How should implementors approach extensibility for vendor-specific faults?
Should they rely on pname:pNext chains, or should the extension introduce a
generic structure to return vendor error codes and human-readable descriptions
in the base structure?

*RESOLVED*: Implementors should utilize the generic
`VkDeviceFaultVendorInfoEXT` structures where applicable, and fall back to
extending pname:pNext chains where this is insufficient. Where a pname:pNext
chain is required, vendors should tailor their human-readable error
descriptions to advise developers that additional information may be available.