VisionWorks Toolkit Reference

December 18, 2015 | 1.2 Release


This section describes the implementation of user custom nodes based on CUDA.

A user custom kernel may use CUDA directly or use a CUDA library such as NPP. In these cases, there are some important rules to follow when implementing the custom kernel:

7.1. Declare the target as GPU

In order to declare that a kernel uses the GPU, the custom kernel must be registered with a name that is prefixed with gpu:

vx_char cuda_kernel_name[] = "gpu:user.kernel.cuda_kernel";
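The name is then used when registering the kernel with the context. The following is a minimal sketch of such a registration, assuming the OpenVX 1.0-style vxAddKernel entry point; the kernel enumeration USER_KERNEL_CUDA and the validator callbacks input_validator and output_validator are hypothetical names chosen for illustration:

```cuda
// Registration sketch (hypothetical enum and validator names).
vx_kernel kernel = vxAddKernel(context, cuda_kernel_name,
                               USER_KERNEL_CUDA,  // user-defined kernel enum
                               cudaNode_kernel,   // processing callback
                               3,                 // number of parameters
                               input_validator,   // hypothetical validators
                               output_validator,
                               NULL, NULL);

vxAddParameterToKernel(kernel, 0, VX_INPUT,  VX_TYPE_IMAGE,  VX_PARAMETER_STATE_REQUIRED);
vxAddParameterToKernel(kernel, 1, VX_OUTPUT, VX_TYPE_IMAGE,  VX_PARAMETER_STATE_REQUIRED);
vxAddParameterToKernel(kernel, 2, VX_INPUT,  VX_TYPE_SCALAR, VX_PARAMETER_STATE_REQUIRED);
vxFinalizeKernel(kernel);
```

Because the name starts with "gpu:", VisionWorks schedules the node on the GPU target.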

7.2. Use the CUDA stream given by VisionWorks

In order for VisionWorks to ensure correct execution of any graph that uses a custom node, it is necessary for the CUDA workload generated by a user node (in the processing callback function) to be synchronized with the CUDA stream provided by VisionWorks. This CUDA stream can be retrieved by querying the NVX_NODE_ATTRIBUTE_CUDA_STREAM node attribute.

cudaStream_t stream = NULL;
vxQueryNode(node, NVX_NODE_ATTRIBUTE_CUDA_STREAM, &stream, sizeof(stream));

Once this stream is known, there are two possible situations:

If the CUDA workload can be fully enqueued on the CUDA stream given by VisionWorks, there is no need for the processing callback function (which is executed on the CPU) to synchronize with any CUDA stream before returning (with cudaStreamSynchronize, cudaDeviceSynchronize, or cudaEventSynchronize, for instance). The GPU workload generated by a node can execute asynchronously beyond the node boundaries; the synchronization between GPU and CPU is handled by the VisionWorks graph manager.

If the CUDA workload cannot be issued on this stream (for instance, because it relies on a library that does not accept a stream parameter), the processing callback must explicitly synchronize with the device (with cudaDeviceSynchronize, for instance) before returning, so that the node's outputs are complete when VisionWorks schedules downstream nodes.
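As a concrete illustration of the first situation, the following sketch shows a processing callback enqueuing a custom CUDA kernel on the VisionWorks-provided stream; my_kernel, the device pointers d_in and d_out, and the width/height variables are hypothetical placeholders standing in for data obtained from the node's mapped parameters:

```cuda
// Sketch: launch a custom CUDA kernel on the stream given by VisionWorks.
cudaStream_t stream = NULL;
vxQueryNode(node, NVX_NODE_ATTRIBUTE_CUDA_STREAM, &stream, sizeof(stream));

dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);

// Enqueue asynchronously on the VisionWorks stream.
// No cudaStreamSynchronize before returning: the graph manager
// handles GPU/CPU synchronization at node boundaries.
my_kernel<<<grid, block, 0, stream>>>(d_in, d_out, width, height);

return VX_SUCCESS;
```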

Full Code for a CUDA User Custom Kernel

vx_status cudaNode_kernel(vx_node node, const vx_reference *parameters, vx_uint32 num)
{
    if (num != 3)
        return VX_FAILURE;

    vx_image input = (vx_image)parameters[0];
    vx_image output = (vx_image)parameters[1];
    vx_scalar scalar = (vx_scalar)parameters[2];

    // Gets the CUDA stream used for the current node.
    cudaStream_t stream = NULL;
    vxQueryNode(node, NVX_NODE_ATTRIBUTE_CUDA_STREAM, &stream, sizeof(stream));

    // Maps the objects to get CUDA-accessible pointers and addressing info.
    vx_imagepatch_addressing_t in_addr, out_addr;
    void *in_ptr = NULL, *out_ptr = NULL;
    nvxMapImagePatch(input, nullptr, 0, &in_addr, &in_ptr,
                     VX_READ_ONLY, NVX_IMPORT_TYPE_CUDA);
    nvxMapImagePatch(output, nullptr, 0, &out_addr, &out_ptr,
                     VX_WRITE_ONLY, NVX_IMPORT_TYPE_CUDA);

    // Reads the scalar constant to add.
    vx_uint8 value = 0;
    vxReadScalarValue(scalar, &value);

    NppiSize size = {
        int(out_addr.dim_x),
        int(out_addr.dim_y)
    };

    // Uses this stream for the CUDA code.
    nppSetStream(stream);

    // Calls the CUDA routines.
    nppiAddC_8u_C1RSfs((Npp8u *)in_ptr, in_addr.stride_y, value,
                       (Npp8u *)out_ptr, out_addr.stride_y,
                       size, 1);

    // Unmaps the objects.
    nvxUnmapImagePatch(input, nullptr, 0, in_ptr, NVX_IMPORT_TYPE_CUDA);
    nvxUnmapImagePatch(output, nullptr, 0, out_ptr, NVX_IMPORT_TYPE_CUDA);

    return VX_SUCCESS;
}
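Once registered, the kernel can be instantiated in a graph like any other node. The following usage sketch assumes a valid context and graph, and hypothetical input, output, and scalar objects matching the three parameters declared at registration:

```cuda
// Usage sketch: look up the registered kernel by name and run it.
vx_kernel kernel = vxGetKernelByName(context, "gpu:user.kernel.cuda_kernel");
vx_node   cnode  = vxCreateGenericNode(graph, kernel);

vxSetParameterByIndex(cnode, 0, (vx_reference)input);
vxSetParameterByIndex(cnode, 1, (vx_reference)output);
vxSetParameterByIndex(cnode, 2, (vx_reference)scalar);

vxVerifyGraph(graph);
vxProcessGraph(graph);
```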