// Copyright (c) 2018-2020 NVIDIA Corporation
//
// SPDX-License-Identifier: CC-BY-4.0

include::{generated}/meta/{refprefix}VK_NV_cuda_kernel_launch.adoc[]

=== Other Extension Metadata

*Last Modified Date*::
    2020-09-30
*Contributors*::
  - Eric Werness, NVIDIA

=== Description

Interoperability between APIs can sometimes create additional overhead
depending on the platform used.
This extension targets deployment of existing CUDA kernels via Vulkan, with
a way to directly upload PTX kernels and dispatch the kernels from Vulkan's
command buffer without the need to use interoperability between the Vulkan
and CUDA contexts.
However, we do encourage using the native CUDA runtime during development,
for the purpose of debugging and profiling.

The application will first have to create a CUDA module using
flink:vkCreateCudaModuleNV, then create the CUDA function entry point with
flink:vkCreateCudaFunctionNV.

Then, in order to dispatch this function, the application will record a
command buffer in which it launches the kernel with
flink:vkCmdCudaLaunchKernelNV.

When done, the application will destroy the function handle, as well as the
CUDA module handle, with flink:vkDestroyCudaFunctionNV and
flink:vkDestroyCudaModuleNV.
This flow is illustrated by the first sketch at the end of this section.

To reduce the impact of compilation time, this extension offers the
capability to return a binary cache from the PTX that was provided.
For this, the application first calls flink:vkGetCudaModuleCacheNV with a
`NULL` data pointer and a valid pointer receiving the required size, then
calls the same function again with a valid pointer to a buffer of that size
to retrieve the data (see the second sketch at the end of this section).
The resulting cache can then be used in later runs of the application by
passing this cache instead of the PTX code to the same
flink:vkCreateCudaModuleNV, thus significantly speeding up the
initialization of the CUDA module.

As with slink:VkPipelineCache, the binary cache depends on the hardware
architecture.
The application must therefore assume that creating a module from the cache
can fail, and must handle falling back to the original PTX code as
necessary.
Most often, using the cache will succeed when the same GPU driver and
architecture are used between the cache generation from PTX and the reuse
of this cache.
In the event of a new driver version, or if using a different GPU
architecture, the cache is likely to become invalid.

include::{generated}/interfaces/VK_NV_cuda_kernel_launch.adoc[]

=== Issues

None.

=== Version History

  * Revision 1, 2020-03-01 (Tristan Lorach)
  * Revision 2, 2020-09-30 (Tristan Lorach)
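
=== Examples

The sketches below are non-normative and only illustrate the calling
sequences described above.
They assume the extension entry points have been loaded (for example through
flink:vkGetDeviceProcAddr); the kernel name, grid and block dimensions, PTX
blob, and kernel parameters are hypothetical application-provided values;
error checking and synchronization are omitted.
The structure and enumerant names follow the interfaces included above.

The first sketch records a CUDA kernel dispatch: it creates the module from
PTX, creates the function entry point, and launches the kernel from a
command buffer.

[source,c]
----
#include <vulkan/vulkan.h>

// Non-normative sketch of the flow described above. The PTX blob
// (ptxCode/ptxSize), the kernel name and the kernel parameters are
// hypothetical application data; error checking is omitted.
void recordCudaKernelLaunch(VkDevice device, VkCommandBuffer cmdBuf,
                            const void* ptxCode, size_t ptxSize,
                            const void* const* kernelParams, size_t paramCount)
{
    // Create the CUDA module from the PTX code (or from a previously
    // retrieved binary cache, see the next sketch)
    VkCudaModuleCreateInfoNV moduleInfo = {
        .sType    = VK_STRUCTURE_TYPE_CUDA_MODULE_CREATE_INFO_NV,
        .dataSize = ptxSize,
        .pData    = ptxCode,
    };
    VkCudaModuleNV cudaModule;
    vkCreateCudaModuleNV(device, &moduleInfo, NULL, &cudaModule);

    // Create the function entry point from the module
    VkCudaFunctionCreateInfoNV functionInfo = {
        .sType  = VK_STRUCTURE_TYPE_CUDA_FUNCTION_CREATE_INFO_NV,
        .module = cudaModule,
        .pName  = "myKernel",               // hypothetical kernel name
    };
    VkCudaFunctionNV cudaFunction;
    vkCreateCudaFunctionNV(device, &functionInfo, NULL, &cudaFunction);

    // Record the kernel dispatch into the command buffer, which is assumed
    // to be in the recording state
    VkCudaLaunchInfoNV launchInfo = {
        .sType          = VK_STRUCTURE_TYPE_CUDA_LAUNCH_INFO_NV,
        .function       = cudaFunction,
        .gridDimX       = 64,  .gridDimY  = 1, .gridDimZ  = 1,
        .blockDimX      = 256, .blockDimY = 1, .blockDimZ = 1,
        .sharedMemBytes = 0,
        .paramCount     = paramCount,
        .pParams        = kernelParams,
    };
    vkCmdCudaLaunchKernelNV(cmdBuf, &launchInfo);

    // Teardown is shown here for brevity only: in a real application the
    // function and module must only be destroyed once the submitted work
    // using them has completed
    vkDestroyCudaFunctionNV(device, cudaFunction, NULL);
    vkDestroyCudaModuleNV(device, cudaModule, NULL);
}
----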
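
The second sketch illustrates the two-call pattern for retrieving the binary
cache with flink:vkGetCudaModuleCacheNV, and is likewise non-normative: the
helper name, the use of `malloc`, and the omission of error checking are
choices made purely for illustration.

[source,c]
----
#include <stdlib.h>
#include <vulkan/vulkan.h>

// Non-normative sketch of the two-call cache query described above.
// The caller owns the returned buffer; error checking is omitted.
void* retrieveCudaModuleCache(VkDevice device, VkCudaModuleNV cudaModule,
                              size_t* pCacheSize)
{
    // First call: pass a NULL data pointer to query the required size
    vkGetCudaModuleCacheNV(device, cudaModule, pCacheSize, NULL);

    // Second call: pass a buffer of that size to retrieve the cache data
    void* cacheData = malloc(*pCacheSize);
    vkGetCudaModuleCacheNV(device, cudaModule, pCacheSize, cacheData);

    // The application can now store cacheData (e.g. on disk) and pass it to
    // vkCreateCudaModuleNV on a later run instead of the PTX code. If that
    // later creation fails - for example after a driver update or on a
    // different GPU architecture - the application must fall back to
    // recreating the module from the original PTX.
    return cacheData;
}
----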