# How to copy a 2D array in CUDA?

Problem Description:

I am new to CUDA and still trying to figure things out, so this question may be dumb, but I can’t seem to find the problem, so bear with me.

I am trying to copy a 2D array to the GPU. The array is square, of size N*N. I’m trying to copy it using `cudaMallocPitch()` and `cudaMemcpy2D()`. The problem is that I seem to be copying only the first row of the array and nothing else. I can’t find what exactly I’m doing wrong.

My code:

``````
void function(){
    double A[N][N];
    //code to fill out the array.

    double* d_A;
    size_t pitch;
    cudaMallocPitch(&d_A, &pitch, N * sizeof(double), N);
    cudaMemcpy2D(d_A, pitch, A, N * sizeof(double), N * sizeof(double), N, cudaMemcpyHostToDevice);

    int blocksnum = 1;
    kernal_print<<<blocksnum, 1>>>(d_A, N);

    //copying back to host & freeing up memory

}

__global__ void kernal_print(double* d_A, size_t n){
    int xIdx = threadIdx.x + blockDim.x * blockIdx.x;
    int yIdx = threadIdx.y + blockDim.y * blockIdx.y;

    printf("\n");
    for(size_t i = 0; i < n*n; i++){
        printf("%f, ", d_A[i]);
    }
    printf("\n");
}
``````

The code above will only print the first row of whatever matrix I have. So for example a 3×3 matrix that looks like this:

1 2 3
4 5 6
7 8 9

the code will print (1 2 3 0 0 0 0 0 0)

Any idea what I am doing wrong?

## Solution – 1

This question may be useful for background.

Perhaps you don’t know what a pitched allocation is. A pitched allocation looks like this:

``````
X  X  X  P  P  P
X  X  X  P  P  P
X  X  X  P  P  P
``````

The above could represent storage for a 3×3 array (elements represented by `X`) that is pitched: each row is padded out to a pitch of 6 elements, with the padding represented by `P`. Note that `cudaMallocPitch` reports the pitch in bytes, not in elements.

You’ll have no luck accessing such a storage arrangement if you don’t follow the guidelines given in the reference manual for `cudaMallocPitch`. In-kernel access to such a pitched allocation should be done as follows:

``````
T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;
``````

You’ll note that the above formula depends on the `pitch` value that was provided at the point of `cudaMallocPitch`. If you don’t pass that value to your kernel, you won’t have any luck with this.
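The formula above is plain pointer arithmetic, so it can be tried on the host without a GPU. Below is a minimal sketch in ordinary C++; the names `pitched_element` and `demo_fill` are hypothetical helpers, not part of the CUDA API, and the buffer is a host-side stand-in for a device allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Address of element (row, col) in a pitched allocation.
// `pitch` is the row stride in BYTES, as reported by cudaMallocPitch.
// Hypothetical helper: the name is an assumption, only the formula is from the manual.
template <typename T>
T* pitched_element(void* base, std::size_t pitch, std::size_t row, std::size_t col) {
    assert((col + 1) * sizeof(T) <= pitch);                    // element must lie inside its row
    char* row_start = static_cast<char*>(base) + row * pitch;  // step rows in bytes
    return reinterpret_cast<T*>(row_start) + col;              // step columns in elements
}

// Demo: emulate an n x n pitched buffer on the host and fill it with 1..n*n
// through the helper. Padding stays at zero.
std::vector<double> demo_fill(std::size_t n, std::size_t pitch_bytes) {
    assert(pitch_bytes % sizeof(double) == 0);
    std::vector<double> storage(n * pitch_bytes / sizeof(double), 0.0);
    for (std::size_t row = 0; row < n; ++row)
        for (std::size_t col = 0; col < n; ++col)
            *pitched_element<double>(storage.data(), pitch_bytes, row, col) =
                static_cast<double>(row * n + col + 1);
    return storage;
}
```

For device code, the same helper body should work inside a kernel if the function is marked `__host__ __device__`; that qualifier is left out here so the sketch compiles with a plain C++ compiler.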

Because you are not doing that, the proximal reason for your observation:

> the code will print (1 2 3 0 0 0 0 0 0)

is that your indexing reads only the first "row" of the pitched allocation, and the `P` padding elements happen to show up as zero (although that’s not guaranteed).
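This effect can be reproduced on the host. The sketch below is an illustration only: it assumes a pitch of 512 bytes (the real value is whatever `cudaMallocPitch` returns on your device), it zero-fills the padding (which a real device does not promise), and `make_pitched` and `read_flat` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for a pitched device buffer: n rows of n doubles,
// each row padded out to pitch_bytes. Padding is zero here for illustration
// only; on a real device its contents are unspecified.
std::vector<double> make_pitched(std::size_t n, std::size_t pitch_bytes) {
    assert(pitch_bytes % sizeof(double) == 0);
    assert(pitch_bytes >= n * sizeof(double));
    const std::size_t stride = pitch_bytes / sizeof(double);  // row stride in elements
    std::vector<double> buf(n * stride, 0.0);
    for (std::size_t row = 0; row < n; ++row)
        for (std::size_t col = 0; col < n; ++col)
            buf[row * stride + col] = static_cast<double>(row * n + col + 1);
    return buf;
}

// What the question's kernel does: treat the buffer as n*n contiguous values,
// ignoring the pitch entirely.
std::vector<double> read_flat(const std::vector<double>& buf, std::size_t n) {
    assert(buf.size() >= n * n);
    return std::vector<double>(buf.begin(),
                               buf.begin() + static_cast<std::ptrdiff_t>(n * n));
}
```

With `n = 3` and the assumed 512-byte pitch, each row spans 64 doubles, so the first nine values read flat are the three elements of row 0 followed by six padding values. Row 1 only starts at flat index 64.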

We can fix your code simply by implementing the suggestions given in the reference manual:

``````
$ cat t2153.cu
#include <cstdio>
const size_t N = 3;
__global__ void kernal_print(double* d_A, size_t my_N, size_t pitch){
//   int xIdx = threadIdx.x + blockDim.x * blockIdx.x;
//   int yIdx = threadIdx.y + blockDim.y * blockIdx.y;

  printf("\n");
  for (size_t row = 0; row < my_N; row++)
    for (size_t col = 0; col < my_N; col++){
      // pitch is in bytes, so step rows through a char* before indexing the column
      double* pElement = (double *)((char*)d_A + row * pitch) + col;
      printf("%f, ", *pElement);
    }
  printf("\n");
}

void function(){
  double A[N][N];
  for (size_t row = 0; row < N; row++)
    for (size_t col = 0; col < N; col++)
      A[row][col] = row*N+col+1;
  double* d_A;
  size_t pitch;
  cudaMallocPitch(&d_A, &pitch, N * sizeof(double), N);
  cudaMemcpy2D(d_A, pitch, A, N * sizeof(double), N * sizeof(double), N, cudaMemcpyHostToDevice);

  int blocknum = 1;
  kernal_print<<<blocknum, 1>>>(d_A, N, pitch);  // pass the pitch to the kernel
  cudaDeviceSynchronize();                       // wait for the kernel so its printf output is flushed
}

int main(){

  function();
}
$ nvcc -o t2153 t2153.cu
$ compute-sanitizer ./t2153
========= COMPUTE-SANITIZER

1.000000, 2.000000, 3.000000, 4.000000, 5.000000, 6.000000, 7.000000, 8.000000, 9.000000,
========= ERROR SUMMARY: 0 errors
$
``````

• This sort of allocation: `double A[N][N];` may give you trouble for large `N`, because it is a stack-based allocation. Instead, use a dynamic allocation (which may affect a number of the methods you use to handle it.) There are various questions covering this, such as this one.
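One possible shape for such a dynamic allocation is a single contiguous heap buffer indexed as `row * n + col`. The sketch below is an assumption about how you might structure it, not something taken from the linked question; `HostMatrix` and `make_sequential` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: an n x n host matrix stored contiguously on the
// heap instead of the stack, so large n does not overflow the stack.
struct HostMatrix {
    std::size_t n;
    std::vector<double> data;

    explicit HostMatrix(std::size_t size) : n(size), data(size * size, 0.0) {}

    // Row-major element access with a bounds check.
    double& at(std::size_t row, std::size_t col) {
        assert(row < n && col < n);
        return data[row * n + col];
    }

    // Row stride in bytes; rows are tightly packed, so there is no padding.
    std::size_t pitch_bytes() const { return n * sizeof(double); }
};

// Fill the matrix with 1..n*n, matching the example in the answer.
HostMatrix make_sequential(std::size_t n) {
    HostMatrix m(n);
    for (std::size_t row = 0; row < n; ++row)
        for (std::size_t col = 0; col < n; ++col)
            m.at(row, col) = static_cast<double>(row * n + col + 1);
    return m;
}
```

With a buffer like this, the `cudaMemcpy2D` call shown earlier would take `m.data.data()` as the source pointer and `m.pitch_bytes()` as the source pitch. The device-side pitch still comes from `cudaMallocPitch`.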