The Most Classic CUDA Image Kernel in Autonomous Driving

2026-03-17 18:51:51

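One of the most classic CUDA image kernels in autonomous driving fuses NV12 decoding, resizing, normalization, and the HWC-to-CHW layout change into a single kernel. Many TensorRT preprocessing stages are written this way, because the fused version needs no intermediate buffers and only one launch.

NV12 → RGB → Resize → Normalize → CHW

Fusing these steps reduces:

multiple GPU kernel launches
repeated global-memory reads and writes
CPU-GPU copies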

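I. NV12 Data Format

NV12 memory layout (a full-resolution Y plane followed by an interleaved UV plane that is half as large):

YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
UVUVUVUVUVUVUVUVUVUVUVUV

Properties:

Y: one sample per pixel
UV: one U/V pair shared by each 2×2 block of pixels

Access formulas (assuming the row pitch equals width):

Y = y_plane[y * width + x]
UV = uv_plane[(y/2) * width + (x/2)*2]

U is the byte at that offset and V is the byte right after it.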
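The access formulas above can be checked on the host before they go into a kernel. The following is a minimal sketch, assuming a tightly packed buffer (row pitch equal to width) with even width and height; `Nv12Offsets`, `nv12_offsets`, and `nv12_size_bytes` are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>

// Hypothetical helper type: byte offsets into one tightly packed NV12 buffer.
struct Nv12Offsets {
    std::size_t y;  // offset of the Y sample
    std::size_t u;  // offset of the U sample
    std::size_t v;  // offset of the V sample (always u + 1)
};

// Assumes even width/height and row pitch == width (no padding between rows).
inline Nv12Offsets nv12_offsets(int x, int y, int width, int height) {
    const std::size_t w = static_cast<std::size_t>(width);
    // The UV plane starts right after the width * height bytes of the Y plane.
    const std::size_t uv_base = w * static_cast<std::size_t>(height);
    const std::size_t uv = uv_base
                         + static_cast<std::size_t>(y / 2) * w
                         + static_cast<std::size_t>(x / 2) * 2;
    return {static_cast<std::size_t>(y) * w + static_cast<std::size_t>(x), uv, uv + 1};
}

// Total buffer size: Y plane plus a half-size UV plane.
inline std::size_t nv12_size_bytes(int width, int height) {
    return static_cast<std::size_t>(width) * height * 3 / 2;
}
```

For a 4 x 4 image the Y plane occupies bytes 0 to 15 and the UV plane bytes 16 to 23, so all four pixels of the bottom-right 2×2 block map to the same U/V pair at offsets 22 and 23.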

II. Kernel Design Goals

Input:

NV12 (uint8)

Output:

float tensor (CHW)

For example:

3 x 512 x 512

Steps completed in a single pass:

NV12 → RGB
Resize
Normalize
HWC → CHW
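III. CUDA Kernel

#include <cstdint>
#include <cuda_runtime.h>

// Clamp a color value to the 8-bit range [0, 255].
__device__ inline float clip(float x)
{
    return fminf(fmaxf(x, 0.0f), 255.0f);
}

// One thread produces one output pixel: nearest-neighbor sample from NV12,
// YUV -> RGB, normalize, then write planar CHW floats.
// Assumes a tightly packed NV12 buffer (row pitch == in_w) with even in_w and in_h.
__global__ void nv12_resize_norm_chw_kernel(
        const uint8_t* nv12,
        float* output,
        int in_w,
        int in_h,
        int out_w,
        int out_h,
        float mean0,
        float mean1,
        float mean2,
        float std0,
        float std1,
        float std2)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // The grid is rounded up, so threads past the image edge do nothing.
    if (x >= out_w || y >= out_h)
        return;

    // Nearest-neighbor resize: map the output pixel back to a source pixel.
    float scale_x = (float)in_w / out_w;
    float scale_y = (float)in_h / out_h;

    int src_x = min((int)(x * scale_x), in_w - 1);
    int src_y = min((int)(y * scale_y), in_h - 1);

    // The UV plane follows the in_w * in_h bytes of the Y plane.
    const uint8_t* y_plane = nv12;
    const uint8_t* uv_plane = nv12 + in_w * in_h;

    int y_index = src_y * in_w + src_x;
    int uv_index = (src_y / 2) * in_w + (src_x / 2) * 2;

    int Y = y_plane[y_index];
    int U = uv_plane[uv_index] - 128;
    int V = uv_plane[uv_index + 1] - 128;

    // Full-range BT.601 coefficients; limited-range video needs a different matrix.
    float R = clip(Y + 1.402f * V);
    float G = clip(Y - 0.344136f * U - 0.714136f * V);
    float B = clip(Y + 1.772f * U);

    // Scale to [0, 1], then apply per-channel (value - mean) / std.
    R /= 255.0f;
    G /= 255.0f;
    B /= 255.0f;

    R = (R - mean0) / std0;
    G = (G - mean1) / std1;
    B = (B - mean2) / std2;

    // CHW output: three consecutive planes of out_w * out_h floats.
    int out_index = y * out_w + x;

    output[out_index] = R;
    output[out_w * out_h + out_index] = G;
    output[2 * out_w * out_h + out_index] = B;
}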
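A host-side reference for the per-pixel math makes it easier to validate the kernel's output on a few known inputs. The sketch below mirrors the color conversion and normalization steps of the kernel above in plain C++; `Rgb`, `clip255`, `yuv_to_rgb`, and `normalize` are hypothetical names for this illustration, and the coefficients are assumed to be the same full-range BT.601 values the kernel uses.

```cpp
#include <algorithm>
#include <cstdint>

struct Rgb { float r, g, b; };  // values in [0, 255]

inline float clip255(float v) { return std::min(std::max(v, 0.0f), 255.0f); }

// Full-range BT.601 YUV -> RGB, using the same coefficients as the kernel.
inline Rgb yuv_to_rgb(std::uint8_t y, std::uint8_t u, std::uint8_t v) {
    const float Y = static_cast<float>(y);
    const float U = static_cast<float>(u) - 128.0f;
    const float V = static_cast<float>(v) - 128.0f;
    return {clip255(Y + 1.402f * V),
            clip255(Y - 0.344136f * U - 0.714136f * V),
            clip255(Y + 1.772f * U)};
}

// Scale a [0, 255] value to [0, 1], then apply (value - mean) / std.
inline float normalize(float value_0_255, float mean, float stddev) {
    return (value_0_255 / 255.0f - mean) / stddev;
}
```

A neutral chroma pair (U = V = 128) leaves the pixel gray, so R, G, and B all equal Y; that is a convenient first sanity check when comparing GPU output against this reference.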
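IV. Kernel Launch

// 16 x 16 threads per block; round the grid up so every output pixel is covered.
dim3 block(16, 16);
dim3 grid((out_w + 15) / 16, (out_h + 15) / 16);

nv12_resize_norm_chw_kernel<<<grid, block>>>(
    d_nv12,
    d_output,
    in_w,
    in_h,
    out_w,
    out_h,
    0.485f, 0.456f, 0.406f,   // per-channel mean (R, G, B)
    0.229f, 0.224f, 0.225f);  // per-channel std (R, G, B)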
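The `(out_w+15)/16` expression in the launch is a ceiling division: it rounds the block count up so partial tiles at the right and bottom edges still get a block. A small sketch of that arithmetic follows, with `div_up` and `idle_threads` as hypothetical helper names; it also shows how many launched threads fall outside the image and rely on the kernel's bounds check.

```cpp
// Hypothetical helper: smallest number of blocks of size `block` that covers `n` items.
constexpr int div_up(int n, int block) {
    return (n + block - 1) / block;
}

// Threads launched beyond the image edge for a square block x block tile;
// the bounds check at the top of the kernel must discard them.
constexpr int idle_threads(int out_w, int out_h, int block) {
    return div_up(out_w, block) * block * div_up(out_h, block) * block
         - out_w * out_h;
}
```

With a 512 x 512 output and 16 x 16 blocks the grid is exactly 32 x 32 and no thread is idle. A 500 x 500 output still needs a 32 x 32 grid, and the extra threads simply return early.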
The result is the tensor that is fed to the network, with shape (H = out_h, W = out_w):
3 x H x W
V. Memory Layout Diagram

Input:

NV12
[Y plane]  YYYYYYYYYYYYYYYY
[UV plane] UVUVUVUV

Output:

Tensor CHW
R channel
G channel
B channel

| RRRRR |
| RRRRR |
| GGGGG |
| GGGGG |
| BBBBB |
| BBBBB |
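The planar layout in the diagram corresponds to a simple index formula. The sketch below contrasts the CHW offset the kernel writes to with the interleaved HWC offset a typical RGB image would use; `chw_index` and `hwc_index` are hypothetical helper names for this illustration.

```cpp
#include <cstddef>

// Offset of channel c at pixel (x, y) in a planar CHW tensor:
// each channel is a full height x width plane stored back to back.
inline std::size_t chw_index(int c, int y, int x, int height, int width) {
    return (static_cast<std::size_t>(c) * height + y) * width + x;
}

// Offset of the same element in an interleaved HWC image:
// the channels of one pixel sit next to each other.
inline std::size_t hwc_index(int c, int y, int x, int width, int channels) {
    return (static_cast<std::size_t>(y) * width + x) * channels + c;
}
```

For the 2 x 5 image in the diagram, the first G value sits at offset 10 in CHW (right after the ten R values) but at offset 1 in HWC.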

Production code usually goes further:

1 memory coalescing

Adjacent threads access adjacent addresses:

thread0 → pixel0
thread1 → pixel1
thread2 → pixel2
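To see why the thread-to-pixel mapping matters, one can count how many memory sectors a warp touches for a given access stride. This is a simplified host-side model, assuming 32 threads per warp and 32-byte sectors; `sectors_touched` is a hypothetical helper, and real hardware behavior depends on the GPU generation and cache configuration.

```cpp
#include <cstddef>
#include <set>

// Count distinct memory sectors touched when `threads` threads each access one
// element of `elem_bytes` bytes, with thread t at element index t * stride_elems.
// Simplified model: the sector size is an assumption, not a hardware query.
inline int sectors_touched(int threads, std::size_t elem_bytes,
                           std::size_t stride_elems, std::size_t sector_bytes = 32) {
    std::set<std::size_t> sectors;
    for (int t = 0; t < threads; ++t) {
        const std::size_t first = static_cast<std::size_t>(t) * stride_elems * elem_bytes;
        const std::size_t last = first + elem_bytes - 1;
        for (std::size_t s = first / sector_bytes; s <= last / sector_bytes; ++s) {
            sectors.insert(s);
        }
    }
    return static_cast<int>(sectors.size());
}
```

Under this model a warp writing consecutive floats (stride 1, as in a planar CHW store) touches 4 sectors, while writing one channel of an interleaved 3-channel float image (stride 3) touches 12 for the same amount of useful data.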

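2 vectorized load

Use

uchar2
uchar4

to read

UVUV

e.g.:

const uchar2 uv = *reinterpret_cast<const uchar2*>(uv_plane + uv_index);

The address must be 2-byte aligned for uchar2, which holds when the buffer base is aligned and in_w is even.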
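A vector load through a reinterpreted pointer is only valid when the address meets the vector type's alignment (2 bytes for `uchar2`, 4 bytes for `uchar4`). A small host-side check like the following sketch can guard the fast path before launch; `is_aligned` is a hypothetical helper name.

```cpp
#include <cstdint>

// True if pointer p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::uintptr_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

If the check fails for the UV plane start, falling back to two single-byte loads keeps the kernel correct at some cost in throughput.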

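3 texture memory (image sampling)

The resize step can use:

cudaTextureObject_t

which provides:

hardware bilinear filtering
a dedicated read cache

This usually performs better than interpolating by hand from global memory.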
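For reference, bilinear filtering blends the four nearest source pixels by their distance to the sample point. The host-side sketch below shows that computation for a single-channel 8-bit image, assuming integer-centered coordinates and clamp-to-edge addressing; `bilinear_sample` is a hypothetical name. Hardware texture filtering follows its own coordinate convention and uses lower-precision weights, so its results can differ slightly from this float version.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Bilinear sample of a single-channel 8-bit image at continuous coordinates
// (fx, fy), where integer coordinates land exactly on pixels.
// Coordinates and neighbors are clamped to the image edge.
inline float bilinear_sample(const std::uint8_t* img, int width, int height,
                             float fx, float fy) {
    const float x = std::min(std::max(fx, 0.0f), static_cast<float>(width - 1));
    const float y = std::min(std::max(fy, 0.0f), static_cast<float>(height - 1));
    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const int x1 = std::min(x0 + 1, width - 1);
    const int y1 = std::min(y0 + 1, height - 1);
    const float ax = x - static_cast<float>(x0);  // horizontal blend weight
    const float ay = y - static_cast<float>(y0);  // vertical blend weight
    const float top    = (1.0f - ax) * img[y0 * width + x0] + ax * img[y0 * width + x1];
    const float bottom = (1.0f - ax) * img[y1 * width + x0] + ax * img[y1 * width + x1];
    return (1.0f - ay) * top + ay * bottom;
}
```

Sampling the exact center of a 2 x 2 image returns the average of its four pixels, while sampling at an integer coordinate returns that pixel unchanged.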

4 half precision

Write the output as:

__half

which halves the memory bandwidth needed for the output tensor compared with float.
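The saving is easy to quantify from the tensor shape. A minimal sketch, with `tensor_bytes` as a hypothetical helper name:

```cpp
#include <cstddef>

// Bytes occupied by a dense channels x height x width tensor.
constexpr std::size_t tensor_bytes(int channels, int height, int width,
                                   std::size_t bytes_per_element) {
    return static_cast<std::size_t>(channels) * height * width * bytes_per_element;
}
```

For the 3 x 512 x 512 example, a float output is 3,145,728 bytes per frame and a half-precision output is 1,572,864 bytes, so each frame writes 1.5 MiB less.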
