VisionWorks Toolkit Reference

December 18, 2015 | 1.2 Release

 All Data Structures Namespaces Files Functions Variables Typedefs Enumerations Enumerator Macros Groups Pages
Process Array with CUBLAS Library

This tutorial demonstrates VisionWorks array processing with user CUDA code.

This tutorial demonstrates implementation of the AXPY (generalized vector addition) using the CUBLAS Library.

VisionWorks array objects provide CUDA pointers with a simple 1D memory layout, similar to memory allocated by the cudaMalloc function. Array elements are located in contiguous memory segments without any gaps between them.

  1. Determine the number of elements in the input array:

    vx_size num_items_x = 0;
    vxQueryArray(arr_x, VX_ARRAY_ATTRIBUTE_NUMITEMS, &num_items_x, sizeof(num_items_x));
    vx_size num_items_y = 0;
    vxQueryArray(arr_y, VX_ARRAY_ATTRIBUTE_NUMITEMS, &num_items_y, sizeof(num_items_y));
    assert( num_items_y == num_items_x );
  2. Map the array object into the CUDA address space. The vxAccessArrayRange requires a range for access; in this sample, the whole array range [0, num_items) is used.

    Note
    Output pointers should be equal to NULL before calling the vxAccessArrayRange function; otherwise, the function will work in COPY mode, assuming that the pointer refers to a pre-allocated buffer.
    The vxAccessArrayRange function also returns stride in bytes between arrays elements. In VisionWorks, this stride is always equal to the size of the element.
    vx_size stride_x = 0;
    void *ptr_x = NULL; // should be NULL to work in MAP mode
    vxAccessArrayRange(arr_x, 0, num_items_x, &stride_x, &ptr_x, NVX_READ_ONLY_CUDA);
    vx_size stride_y = 0;
    void *ptr_y = NULL; // should be NULL to work in MAP mode
    vxAccessArrayRange(arr_y, 0, num_items_y, &stride_y, &ptr_y, NVX_READ_AND_WRITE_CUDA);
  3. After you get the mapped pointer, you can use it in CUDA kernels and CUDA libraries in the same way as plain CUDA pointers allocated by cudaMalloc function.

    cublasHandle_t handle = NULL;
    cublasCreate(&handle);
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
    cublasSaxpy(handle, static_cast<int>(num_items_x), &alpha,
    static_cast<const float*>(ptr_x), 1,
    static_cast<float*>(ptr_y), 1);
    cublasDestroy(handle);
  4. Unmap array object.

    Note
    All CUDA processing must be finished before the array memory is unmapped (i.e., you must explicitly call synchronization functions, such as cudaStreamSynchronize or cudaDeviceSynchronize to be sure that all custom CUDA kernels finish processing.
    // Need to be sure that all CUDA processing is finished before commit
    cudaDeviceSynchronize();
    vxCommitArrayRange(arr_x, 0, num_items_x, ptr_x);
    vxCommitArrayRange(arr_y, 0, num_items_y, ptr_y);

The Full Code for This Tutorial

void processArrayWithCUDA(vx_array arr_x, vx_array arr_y, float alpha)
{
vx_enum item_type_x = 0;
vxQueryArray(arr_x, VX_ARRAY_ATTRIBUTE_ITEMTYPE, &item_type_x, sizeof(item_type_x));
assert( item_type_x == VX_TYPE_FLOAT32 );
vx_enum item_type_y = 0;
vxQueryArray(arr_y, VX_ARRAY_ATTRIBUTE_ITEMTYPE, &item_type_y, sizeof(item_type_y));
assert( item_type_y == VX_TYPE_FLOAT32 );
vx_size num_items_x = 0;
vxQueryArray(arr_x, VX_ARRAY_ATTRIBUTE_NUMITEMS, &num_items_x, sizeof(num_items_x));
vx_size num_items_y = 0;
vxQueryArray(arr_y, VX_ARRAY_ATTRIBUTE_NUMITEMS, &num_items_y, sizeof(num_items_y));
assert( num_items_y == num_items_x );
if (num_items_x > 0)
{
vx_size stride_x = 0;
void *ptr_x = NULL; // should be NULL to work in MAP mode
vxAccessArrayRange(arr_x, 0, num_items_x, &stride_x, &ptr_x, NVX_READ_ONLY_CUDA);
vx_size stride_y = 0;
void *ptr_y = NULL; // should be NULL to work in MAP mode
vxAccessArrayRange(arr_y, 0, num_items_y, &stride_y, &ptr_y, NVX_READ_AND_WRITE_CUDA);
cublasHandle_t handle = NULL;
cublasCreate(&handle);
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
cublasSaxpy(handle, static_cast<int>(num_items_x), &alpha,
static_cast<const float*>(ptr_x), 1,
static_cast<float*>(ptr_y), 1);
cublasDestroy(handle);
// Need to be sure that all CUDA processing is finished before commit
cudaDeviceSynchronize();
vxCommitArrayRange(arr_x, 0, num_items_x, ptr_x);
vxCommitArrayRange(arr_y, 0, num_items_y, ptr_y);
}
}