# Parallel Processing Quiz 2

10 Questions

• 1.
Assume that a kernel is launched with 1000 thread blocks each of which has 512 threads. If a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?
• A.

1

• B.

1000

• C.

512

• D.

512000
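
As a quick illustration of the concept behind the question above: a variable declared with CUDA's `__shared__` qualifier gets one instance per thread block, not per thread and not per grid. A minimal sketch, assuming a hypothetical kernel launched with 512-thread blocks:

```
__global__ void SharedPerBlockSketch(float* d_out)
{
    // One copy of this array exists for each thread block in the grid;
    // all 512 threads of a block share that single copy.
    __shared__ float blockBuffer[512];

    blockBuffer[threadIdx.x] = (float)threadIdx.x;  // each thread fills its own slot
    __syncthreads();                                // make the writes visible block-wide

    // With 1000 blocks launched, 1000 separate instances of blockBuffer are
    // created over the lifetime of the kernel.
    d_out[blockIdx.x * blockDim.x + threadIdx.x] = blockBuffer[threadIdx.x];
}
```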

• 2.
For the tiled single-precision matrix multiplication kernel, assume that the tile size is 32X32 and the system has a DRAM burst size of 128 bytes. How many DRAM bursts will be delivered to the processor as a result of loading one A-matrix tile by a thread block?
• A.

16

• B.

32

• C.

64

• D.

128
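
A short worked sketch of the burst arithmetic behind the question above, assuming the 32X32 tile holds 4-byte floats, is stored row by row, and each tile row starts on a burst boundary:

```
#include <stdio.h>

int main(void) {
    // Illustrative host-side arithmetic for one 32x32 tile of 4-byte floats:
    int tile_dim        = 32;
    int bytes_per_row   = tile_dim * 4;              // 32 * 4 = 128 bytes per tile row
    int bursts_per_row  = bytes_per_row / 128;       // one 128-byte burst covers a row
    int bursts_per_tile = tile_dim * bursts_per_row; // 32 rows in the tile
    printf("bursts per tile: %d\n", bursts_per_tile);
    return 0;
}
```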

• 3.
We want to use each thread to calculate two (adjacent) output elements of a vector addition. Assume that variable i should be the index for the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to data index of the first element?
• A.

i = blockIdx.x*blockDim.x + threadIdx.x + 2;

• B.

• C.

i = (blockIdx.x*blockDim.x + threadIdx.x)*2;

• D.

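A minimal sketch of the two-adjacent-elements-per-thread pattern described above; the kernel name and bounds checks are illustrative assumptions, not part of the quiz:

```
__global__ void VecAddTwoPerThread(const float* A, const float* B, float* C, int n)
{
    // The data index of a thread's first element advances by 2 for every
    // increase of 1 in the flattened thread index.
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;

    if (i < n)     C[i]     = A[i]     + B[i];      // first element
    if (i + 1 < n) C[i + 1] = A[i + 1] + B[i + 1];  // adjacent second element
}
```
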
• 4.
We are to process a 600X800 (800 pixels in the x or horizontal direction, 600 pixels in the y or vertical direction) picture with the PictureKernel(). That is, m's value is 600 and n's value is 800.

```
__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m)
{
    // Calculate the row # of the d_Pin and d_Pout element to process
    int Row = blockIdx.y*blockDim.y + threadIdx.y;

    // Calculate the column # of the d_Pin and d_Pout element to process
    int Col = blockIdx.x*blockDim.x + threadIdx.x;

    // each thread computes one element of d_Pout if in range
    if ((Row < m) && (Col < n)) {
        d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
    }
}
```

Assume that we decided to use a grid of 16X16 blocks. That is, each block is organized as a 2D 16X16 array of threads. How many warps will be generated during the execution of the kernel?
• A.

37*16

• B.

38*50

• C.

38*8*50

• D.

38*50*2
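
A short worked sketch of the grid and warp arithmetic for the question above, assuming the grid is sized with ceiling division so the whole 600X800 picture is covered:

```
#include <stdio.h>

int main(void) {
    int n = 800, m = 600;                         // picture width (x) and height (y)
    int blockX = 16, blockY = 16;
    int gridX = (n + blockX - 1) / blockX;        // ceil(800/16) = 50 blocks in x
    int gridY = (m + blockY - 1) / blockY;        // ceil(600/16) = 38 blocks in y
    int warpsPerBlock = (blockX * blockY) / 32;   // 256 threads / 32 = 8 warps per block
    printf("total warps: %d * %d * %d = %d\n",
           gridY, gridX, warpsPerBlock, gridX * gridY * warpsPerBlock);
    return 0;
}
```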

• 5.
A CUDA device's SM (streaming multiprocessor) can take up to 1,536 threads and up to 8 thread blocks. Which of the following block configurations would result in the most threads in each SM?
• A.

64 threads per block

• B.

128 threads per block

• C.

512 threads per block

• D.

1,024 threads per block
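
A small sketch of the occupancy arithmetic the question above is exercising; only the two stated per-SM limits (1,536 threads and 8 blocks) are modeled, and real devices have further limits (registers, shared memory) that are ignored here:

```
#include <stdio.h>

int resident_threads(int threads_per_block) {
    int blocks = 1536 / threads_per_block;   // blocks that fit under the thread cap
    if (blocks > 8) blocks = 8;              // but never more than 8 blocks per SM
    return blocks * threads_per_block;
}

int main(void) {
    int sizes[] = {64, 128, 512, 1024};
    for (int i = 0; i < 4; ++i)
        printf("%4d threads/block -> %4d resident threads per SM\n",
               sizes[i], resident_threads(sizes[i]));
    return 0;
}
```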

• 6.
Assume the following simple matrix multiplication kernel:

```
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}
```

Which of the following is true?
• A.

M[Row*Width+k] and N[k*Width+Col] are coalesced but P[Row*Width+Col] is not

• B.

M[Row*Width+k], N[k*Width+Col] and P[Row*Width+Col] are all coalesced

• C.

M[Row*Width+k] is not coalesced but N[k*Width+Col] and P[Row*Width+Col] both are

• D.

M[Row*Width+k] is coalesced but N[k*Width+Col] and P[Row*Width+Col] are not
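
One way to reason about coalescing here is to look at which of the three index expressions actually changes across the threads of a warp. The sketch below simply replays the index arithmetic on the host for the first few threads of one warp; the Width, Row, and k values are arbitrary illustrative assumptions:

```
#include <stdio.h>

int main(void) {
    int Width = 1024, Row = 0, k = 0;   // arbitrary fixed values for one warp step
    for (int tx = 0; tx < 4; ++tx) {    // four consecutive threads of the same warp
        int Col = tx;                   // Col = blockIdx.x*blockDim.x + threadIdx.x
        printf("thread %d: M index %d, N index %d, P index %d\n",
               tx, Row * Width + k, k * Width + Col, Row * Width + Col);
        // M's index is identical for every thread (no dependence on threadIdx.x),
        // while N's and P's indices increase by 1 per thread.
    }
    return 0;
}
```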

• 7.
For the simple reduction kernel, if the block size is 1,024 and the warp size is 32, how many warps in a block will have divergence during the 5th iteration?
• A.

0

• B.

1

• C.

16

• D.

32

• 8.
For the following basic reduction kernel code fragment, if the block size is 1,024 and the warp size is 32, how many warps in a block will have divergence during the iteration where stride is equal to 1?

```
unsigned int t = threadIdx.x;
unsigned int start = 2*blockIdx.x*blockDim.x;
partialSum[t] = input[start + t];
partialSum[blockDim.x + t] = input[start + blockDim.x + t];
for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % stride == 0) {
        partialSum[2*t] += partialSum[2*t + stride];
    }
}
```
• A.

0

• B.

1

• C.

16

• D.

32
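
A warp diverges in a given iteration only if its 32 threads do not all take the same path through the `if (t % stride == 0)` branch. The host-side sketch below simply replays that condition for the iteration the question asks about; it is illustrative, not part of the kernel:

```
#include <stdio.h>

int main(void) {
    unsigned int stride = 1;                     // the iteration asked about
    int divergent_warps = 0;
    for (int warp = 0; warp < 1024 / 32; ++warp) {
        int active = 0;
        for (int lane = 0; lane < 32; ++lane) {
            unsigned int t = warp * 32 + lane;
            if (t % stride == 0) ++active;       // same condition as in the kernel
        }
        if (active != 0 && active != 32)         // mixed outcomes within one warp
            ++divergent_warps;
    }
    printf("divergent warps when stride == 1: %d\n", divergent_warps);
    return 0;
}
```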

• 9.
The SM implements zero-overhead scheduling because:
• A.

A warp whose next instruction has its operands ready for consumption is eligible for execution.

• B.

All threads in a warp execute the same instruction when selected.

• C.

Both are correct

• D.

None

• 10.
__device__ __constant__ int mask = 10; will have its memory, scope and lifetime defined as
• A.

• B.

Global, grid and application

• C.

Shared, grid and application

• D.

Constant, grid and application
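
For reference, a small declaration sketch of the common CUDA variable qualifiers, with the memory space, scope, and lifetime each one implies noted in comments (the variable and kernel names are illustrative):

```
__device__ __constant__ int mask = 10;   // constant memory, grid scope, application lifetime
__device__ int globalVal;                // global memory,   grid scope, application lifetime

__global__ void QualifierSketch(int* d_out)
{
    __shared__ int tileBuffer[32];        // shared memory, block scope, kernel lifetime
    int lane = threadIdx.x % 32;          // automatic (register), thread scope, kernel lifetime
    tileBuffer[lane] = mask + globalVal;
    d_out[blockIdx.x * blockDim.x + threadIdx.x] = tileBuffer[lane];
}
```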