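One of the most classic CUDA image kernels in autonomous driving does NV12 → Resize → Normalize → CHW in a single kernel. Many TensorRT preprocessing stages are written this way, because one fused kernel is much faster than running each step separately.
Fusing the whole pipeline
NV12 → RGB → Resize → Normalize → CHW
into one kernel reduces:
the number of GPU kernel launches
the number of round trips to GPU memory (intermediate images are never written out and read back)
CPU-GPU copies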
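I. NV12 data format
NV12 memory layout:
Y plane:  YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
UV plane: UVUVUVUVUVUVUVUVUVUVUVUV
Properties:
The Y plane is stored at full resolution. U and V are interleaved in a second plane that is subsampled by 2 in both directions, so each 2×2 block of pixels shares one UV pair.
Y = y_plane[y * width + x]
UV = uv_plane[(y/2) * width + (x/2)*2]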
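The two indexing formulas above can be checked on the host with a few lines of C++. This is only a sketch: the helper names (nv12_y_offset, nv12_uv_offset, nv12_size_bytes) are made up for illustration, and it assumes a tightly packed buffer whose row stride equals width. Frames coming from a hardware decoder often have a larger pitch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers; assume row stride == width (no padding).

// Offset of the Y sample for pixel (x, y), relative to the start of the Y plane.
inline std::size_t nv12_y_offset(int x, int y, int width) {
    return static_cast<std::size_t>(y) * width + x;
}

// Offset of the U byte for pixel (x, y), relative to the start of the UV plane.
// The matching V byte is stored at this offset + 1.
inline std::size_t nv12_uv_offset(int x, int y, int width) {
    return static_cast<std::size_t>(y / 2) * width
         + static_cast<std::size_t>(x / 2) * 2;
}

// Total buffer size: full-resolution Y plane plus a half-size UV plane.
inline std::size_t nv12_size_bytes(int width, int height) {
    assert(width % 2 == 0 && height % 2 == 0);
    return static_cast<std::size_t>(width) * height * 3 / 2;
}
```

Pixels (2, 2), (3, 2), (2, 3) and (3, 3) all map to the same UV offset, which is the 2×2 sharing described above.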
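Input: an NV12 image (uint8), in_w × in_h
Output: a normalized float tensor in CHW layout, 3 × out_h × out_w
Processing steps:
NV12 → RGB
Resize
Normalize
HWC → CHW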
#include <cstdint>

// Clamp a color value to the valid 8-bit range [0, 255].
__device__ inline float clip(float x)
{
    return fminf(fmaxf(x, 0.0f), 255.0f);
}

// One thread per output pixel: nearest-neighbor resize, NV12 → RGB,
// normalization, and a planar CHW write.
// Assumes a tightly packed NV12 buffer (row stride == in_w).
__global__ void nv12_resize_norm_chw_kernel(
    const uint8_t* nv12,
    float* output,
    int in_w, int in_h,
    int out_w, int out_h,
    float mean0, float mean1, float mean2,
    float std0, float std1, float std2)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= out_w || y >= out_h) return;

    // Map the output pixel back to a source pixel (nearest neighbor).
    float scale_x = (float)in_w / out_w;
    float scale_y = (float)in_h / out_h;
    int src_x = min((int)(x * scale_x), in_w - 1);
    int src_y = min((int)(y * scale_y), in_h - 1);

    // The UV plane starts right after the full-resolution Y plane.
    const uint8_t* y_plane = nv12;
    const uint8_t* uv_plane = nv12 + in_w * in_h;

    int y_index = src_y * in_w + src_x;
    int uv_index = (src_y / 2) * in_w + (src_x / 2) * 2;

    int Y = y_plane[y_index];
    int U = uv_plane[uv_index] - 128;
    int V = uv_plane[uv_index + 1] - 128;

    // YUV → RGB with BT.601 full-range coefficients.
    float R = clip(Y + 1.402f * V);
    float G = clip(Y - 0.344136f * U - 0.714136f * V);
    float B = clip(Y + 1.772f * U);

    // Scale to [0, 1], then normalize per channel.
    R /= 255.0f;
    G /= 255.0f;
    B /= 255.0f;
    R = (R - mean0) / std0;
    G = (G - mean1) / std1;
    B = (B - mean2) / std2;

    // Planar CHW output: R plane, then G plane, then B plane.
    int out_index = y * out_w + x;
    output[out_index] = R;
    output[out_w * out_h + out_index] = G;
    output[2 * out_w * out_h + out_index] = B;
}
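A fused kernel like this is easiest to trust when its output is compared against a plain CPU version of the same math. The following is a minimal sketch of such a reference, not part of the original pipeline: the function name nv12_resize_norm_chw_ref and the clip255 helper are hypothetical, and it makes the same tightly-packed assumption as the kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp to [0, 255], mirroring clip() in the kernel.
inline float clip255(float v) { return std::min(std::max(v, 0.0f), 255.0f); }

// Hypothetical CPU reference that follows the kernel step by step.
std::vector<float> nv12_resize_norm_chw_ref(
    const std::vector<uint8_t>& nv12, int in_w, int in_h,
    int out_w, int out_h, const float mean[3], const float stdv[3])
{
    assert(nv12.size() >= static_cast<std::size_t>(in_w) * in_h * 3 / 2);
    const std::size_t plane = static_cast<std::size_t>(out_w) * out_h;
    std::vector<float> out(3 * plane);

    const uint8_t* y_plane = nv12.data();
    const uint8_t* uv_plane = nv12.data() + static_cast<std::size_t>(in_w) * in_h;
    const float scale_x = static_cast<float>(in_w) / out_w;
    const float scale_y = static_cast<float>(in_h) / out_h;

    for (int y = 0; y < out_h; ++y) {
        for (int x = 0; x < out_w; ++x) {
            // Nearest-neighbor source pixel.
            const int src_x = std::min(static_cast<int>(x * scale_x), in_w - 1);
            const int src_y = std::min(static_cast<int>(y * scale_y), in_h - 1);
            const int uv_index = (src_y / 2) * in_w + (src_x / 2) * 2;

            const int Y = y_plane[src_y * in_w + src_x];
            const int U = uv_plane[uv_index] - 128;
            const int V = uv_plane[uv_index + 1] - 128;

            const float rgb[3] = {
                clip255(Y + 1.402f * V),
                clip255(Y - 0.344136f * U - 0.714136f * V),
                clip255(Y + 1.772f * U)};

            // Scale to [0, 1], normalize, and write planar CHW.
            for (int c = 0; c < 3; ++c) {
                out[c * plane + static_cast<std::size_t>(y) * out_w + x] =
                    (rgb[c] / 255.0f - mean[c]) / stdv[c];
            }
        }
    }
    return out;
}
```

When comparing against the GPU result, use a small tolerance rather than exact equality, since the compiler may fuse multiplies and adds differently on the device.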
IV. Kernel launch
dim3 block(16, 16);
dim3 grid((out_w + 15) / 16, (out_h + 15) / 16);

nv12_resize_norm_chw_kernel<<<grid, block>>>(
    d_nv12, d_output,
    in_w, in_h,
    out_w, out_h,
    0.485f, 0.456f, 0.406f,   // ImageNet mean (R, G, B)
    0.229f, 0.224f, 0.225f);  // ImageNet std (R, G, B)
After the kernel finishes, d_output holds the network's input tensor.
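The grid size in the launch is a ceiling division: enough 16×16 blocks to cover every output pixel, with the surplus threads at the right and bottom edges exiting at the bounds check. As a small worked example, here is the same arithmetic on the host; GridDim is a hypothetical stand-in for dim3, and ceil_div and grid_for are illustrative names.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3, so the arithmetic runs without CUDA.
struct GridDim { int x; int y; };

// Smallest number of blocks of size d that covers n elements.
inline int ceil_div(int n, int d) {
    assert(d > 0);
    return (n + d - 1) / d;
}

// Grid dimensions for an out_w x out_h output with block_w x block_h threads per block.
inline GridDim grid_for(int out_w, int out_h, int block_w = 16, int block_h = 16) {
    return GridDim{ceil_div(out_w, block_w), ceil_div(out_h, block_h)};
}
```

For a 300×200 output this gives a 19×13 grid, which launches 304×208 threads. The extra ones return immediately.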
V. Memory layout diagram
Input:
NV12
[Y plane]
YYYYYYYYYYYYYYYY
[UV plane]
UVUVUVUVUVUVUVUV
Output:
Tensor CHW
R channel
| RRRRR |
| RRRRR |
G channel
| GGGGG |
| GGGGG |
B channel
| BBBBB |
| BBBBB |
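To make the output layout concrete, here is a small host-side sketch of the index arithmetic for both layouts, plus a standalone HWC → CHW conversion. The function names are hypothetical. The fused kernel never materializes the HWC image, so this conversion is shown only for comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear index of element (c, y, x) in a planar CHW tensor of height h and width w.
inline std::size_t chw_index(int c, int y, int x, int h, int w) {
    return (static_cast<std::size_t>(c) * h + y) * w + x;
}

// Linear index of element (y, x, c) in an interleaved HWC image.
inline std::size_t hwc_index(int y, int x, int c, int w, int channels) {
    return (static_cast<std::size_t>(y) * w + x) * channels + c;
}

// Standalone HWC → CHW conversion (the step the fused kernel folds into its writes).
std::vector<float> hwc_to_chw(const std::vector<float>& hwc, int h, int w, int channels) {
    assert(hwc.size() == static_cast<std::size_t>(h) * w * channels);
    std::vector<float> chw(hwc.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            for (int c = 0; c < channels; ++c)
                chw[chw_index(c, y, x, h, w)] = hwc[hwc_index(y, x, c, w, channels)];
    return chw;
}
```

In the kernel, chw_index(c, y, x, out_h, out_w) is exactly c * out_w * out_h + out_index.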
1. Memory coalescing
Adjacent threads access consecutive addresses:
thread0 → pixel0
thread1 → pixel1
thread2 → pixel2
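Whether a warp's writes land in one contiguous run depends on the block shape, because the 32 lanes of a warp are taken in order of threadIdx.x + threadIdx.y * blockDim.x. The sketch below simulates, on the host, the output offsets written by the first warp of block (0, 0). The function names are hypothetical, and it assumes the block holds at least 32 threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Output offsets (in elements) written by the 32 lanes of warp 0 in block (0, 0),
// for a block that is block_w threads wide and an output row of out_w elements.
std::vector<long> warp0_offsets(int block_w, int out_w) {
    assert(block_w > 0);
    std::vector<long> offsets;
    for (int lane = 0; lane < 32; ++lane) {
        const int tx = lane % block_w;  // threadIdx.x of this lane
        const int ty = lane / block_w;  // threadIdx.y of this lane
        offsets.push_back(static_cast<long>(ty) * out_w + tx);
    }
    return offsets;
}

// Number of contiguous runs in a list of offsets (1 means fully contiguous).
int count_runs(const std::vector<long>& offsets) {
    int runs = offsets.empty() ? 0 : 1;
    for (std::size_t i = 1; i < offsets.size(); ++i)
        if (offsets[i] != offsets[i - 1] + 1) ++runs;
    return runs;
}
```

With the 16×16 block used in the launch, each warp covers two 16-pixel row segments. A 32-wide block such as 32×8 would make each warp write one contiguous 32-element run per channel. This only describes the writes: the reads follow the resize mapping and are not necessarily contiguous.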
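2. Vectorized load
Use uchar2 to read the interleaved U and V bytes with one 2-byte load instead of two 1-byte loads, e.g.:
uchar2 uv = *reinterpret_cast<const uchar2*>(uv_plane + uv_index);  // uv.x = U, uv.y = V
The address must be 2-byte aligned, which holds when the buffer base is aligned and in_w is even.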
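3. Texture memory (image sampling)
Resize can read the source image through texture memory, for example a CUDA texture object sampled with tex2D.
It supports hardware bilinear filtering and clamping of out-of-range coordinates.
This usually performs better than interpolating by hand in the kernel.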
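For reference, this is what bilinear filtering computes for one plane, written as a host-side sketch. The function name bilinear_sample is hypothetical, and coordinates are in pixel units where an integer coordinate hits a texel exactly. CUDA's own filtering mode uses a half-texel offset and low-precision interpolation weights, so hardware results will differ slightly from this version.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bilinear sample of a single-channel 8-bit plane at fractional position (fx, fy).
// Coordinates are clamped to the image, similar to a clamp address mode.
float bilinear_sample(const std::vector<uint8_t>& plane, int w, int h, float fx, float fy) {
    assert(plane.size() >= static_cast<std::size_t>(w) * h);
    fx = std::min(std::max(fx, 0.0f), static_cast<float>(w - 1));
    fy = std::min(std::max(fy, 0.0f), static_cast<float>(h - 1));

    const int x0 = static_cast<int>(fx);
    const int y0 = static_cast<int>(fy);
    const int x1 = std::min(x0 + 1, w - 1);
    const int y1 = std::min(y0 + 1, h - 1);
    const float ax = fx - x0;  // horizontal weight of the right neighbor
    const float ay = fy - y0;  // vertical weight of the lower neighbor

    // Interpolate along x on both rows, then along y between the rows.
    const float top = plane[y0 * w + x0] * (1.0f - ax) + plane[y0 * w + x1] * ax;
    const float bot = plane[y1 * w + x0] * (1.0f - ax) + plane[y1 * w + x1] * ax;
    return top * (1.0f - ay) + bot * ay;
}
```

The kernel shown earlier uses nearest-neighbor sampling, so switching to bilinear filtering changes the output values, not just the speed.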
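4. Half precision
Output: FP16 (__half) instead of float.
This halves the memory traffic of the output writes and reduces overall memory bandwidth use.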
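As a quick worked example of the saving, the sketch below computes the size of a 3-channel CHW output for a given element width. The helper name chw_bytes and the 640×640 size are only illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Size in bytes of a 3 x h x w CHW tensor with the given element size.
inline std::size_t chw_bytes(int h, int w, std::size_t bytes_per_element) {
    assert(h > 0 && w > 0);
    return static_cast<std::size_t>(3) * h * w * bytes_per_element;
}
```

For a 3×640×640 tensor, float output is 4,915,200 bytes per frame and FP16 output is 2,457,600 bytes. The network that consumes the tensor must accept FP16 input for this to apply.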