# Parallel Processing Quiz 2

Approved & Edited by ProProfs Editorial Team | Written by Rahulsaxena08 (Community Contributor)
Questions: 10 | Attempts: 718

• 1.

### Assume that a kernel is launched with 1000 thread blocks each of which has 512 threads. If a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?

• A.

1

• B.

1000

• C.

512

• D.

512000

B. 1000
Explanation
In this scenario, since the variable is declared as a shared memory variable, there will be one version of the variable created for each thread block. Since there are 1000 thread blocks, there will be 1000 versions of the variable created throughout the execution of the kernel.
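The per-qualifier counts can be tallied with a small host-side sketch (plain Python, since the counting is simple arithmetic; the `instances` helper is illustrative, not a CUDA API):

```python
# Number of instances of a variable created for a kernel launch,
# by CUDA variable qualifier: __shared__ is per-block, automatic
# (local) variables are per-thread, __device__ globals are a single copy.
def instances(qualifier, num_blocks, threads_per_block):
    if qualifier == "shared":    # one copy per thread block
        return num_blocks
    if qualifier == "local":     # one copy per thread
        return num_blocks * threads_per_block
    if qualifier == "global":    # one copy for the whole grid
        return 1
    raise ValueError(qualifier)

print(instances("shared", 1000, 512))  # 1000: one version per block
```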


• 2.

### For the tiled single-precision matrix multiplication kernel, assume that the tile size is 32x32 and the system has a DRAM burst size of 128 bytes. How many DRAM bursts will be delivered to the processor as a result of loading one A-matrix tile by a thread block?

• A.

16

• B.

32

• C.

64

• D.

128

B. 32
Explanation
When loading one A-matrix tile by a thread block, each element in the tile is a single-precision value, which requires 4 bytes of memory. The tile size is 32x32, so there are a total of 32*32 = 1024 elements in the tile. Multiplying the number of elements by the size of each element gives 1024 * 4 = 4096 bytes. Since the DRAM burst size is 128 bytes, dividing the total size of the tile (4096 bytes) by the burst size (128 bytes) gives 4096 / 128 = 32 bursts. Therefore, 32 DRAM bursts will be delivered to the processor as a result of loading one A-matrix tile by a thread block.
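The same arithmetic as a quick check (plain Python; the constant names are just for readability):

```python
# DRAM bursts needed to load one 32x32 tile of single-precision floats.
TILE = 32
BYTES_PER_FLOAT = 4          # single precision
BURST = 128                  # bytes delivered per DRAM burst

tile_bytes = TILE * TILE * BYTES_PER_FLOAT   # 1024 elements * 4 B = 4096 B
bursts = tile_bytes // BURST                 # 4096 / 128
print(bursts)  # 32
```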


• 3.

### We want to use each thread to calculate two (adjacent) output elements of a vector addition. Assume that variable i should be the index for the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to data index of the first element?

• A.

• B.

• C.

• D.

Explanation
The correct expression is i = (blockIdx.x*blockDim.x + threadIdx.x)*2. The term blockIdx.x*blockDim.x + threadIdx.x gives each thread a unique global thread index; multiplying that index by 2 spaces consecutive threads two elements apart, so each thread can process elements i and i+1. This ensures that the threads jointly cover every output element of the vector addition exactly once, two adjacent elements per thread.
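A toy simulation of the mapping (Python, with small made-up launch dimensions) shows that the threads cover every output index exactly once:

```python
# Simulate i = (blockIdx * blockDim + threadIdx) * 2 and check that
# the threads jointly cover indices 0 .. 2*N-1 with no gaps or overlaps.
block_dim, grid_dim = 4, 3           # toy sizes for illustration
covered = []
for block in range(grid_dim):
    for thread in range(block_dim):
        i = (block * block_dim + thread) * 2
        covered += [i, i + 1]        # each thread handles i and i+1
print(sorted(covered) == list(range(2 * grid_dim * block_dim)))  # True
```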


• 4.

### We are to process a 600X800 (800 pixels in the x or horizontal direction, 600 pixels in the y or vertical direction) picture with PictureKernel(). That is, m’s value is 600 and n’s value is 800.

    __global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m)
    {
        // Calculate the row # of the d_Pin and d_Pout element to process
        int Row = blockIdx.y*blockDim.y + threadIdx.y;
        // Calculate the column # of the d_Pin and d_Pout element to process
        int Col = blockIdx.x*blockDim.x + threadIdx.x;
        // each thread computes one element of d_Pout if in range
        if ((Row < m) && (Col < n)) {
            d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
        }
    }

Assume that we decided to use 16X16 blocks; that is, each block is organized as a 2D 16X16 array of threads. How many warps will be generated during the execution of the kernel?

• A.

37*16

• B.

38*50

• C.

38*8*50

• D.

38*50*2

C. 38*8*50
Explanation
Each block covers a 16x16 tile of pixels. The picture is 800 pixels wide and 600 pixels tall, so the grid needs ceil(800/16) = 50 blocks in the x direction and ceil(600/16) = 38 blocks in the y direction (the last row of blocks hangs partly off the bottom of the picture, but it is still launched). Each block has 16x16 = 256 threads, which form 256/32 = 8 warps. The total number of warps generated is therefore 38*8*50 = 15,200.
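The count can be checked with a few lines of host-side arithmetic (Python; the variable names are descriptive only):

```python
import math

# Warp count for PictureKernel: a 600 (y) x 800 (x) image processed
# by 16x16 thread blocks, with 32 threads per warp.
m, n = 600, 800                      # image height, width
bx = by = 16                         # block dimensions
blocks_x = math.ceil(n / bx)         # 50 blocks across
blocks_y = math.ceil(m / by)         # 38 blocks down (last row partial)
warps_per_block = (bx * by) // 32    # 256 threads -> 8 warps
print(blocks_x * blocks_y * warps_per_block)  # 15200 = 38*8*50
```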


• 5.

### If a CUDA device’s SM (streaming multiprocessor) can take up to 1,536 threads and up to 8 thread blocks. Which of the following block configuration would result in the most number of threads in each SM?

• A.

• B.

• C.

• D.

Explanation
A CUDA device's SM can host at most 1,536 threads and at most 8 thread blocks, and only whole blocks can be assigned to an SM, so both limits must be checked for each configuration.

With 64 threads per block, the 8-block limit binds first: 8 * 64 = 512 threads per SM.

With 128 threads per block, again only 8 blocks fit: 8 * 128 = 1,024 threads per SM.

With 512 threads per block, the thread limit allows 1,536 / 512 = 3 blocks: 3 * 512 = 1,536 threads per SM.

With 1,024 threads per block, only one whole block fits (1,536 / 1,024 rounds down to 1): 1 * 1,024 = 1,024 threads per SM.

Therefore, the block configuration that results in the most threads in each SM is 512 threads per block, which fills the SM completely with 1,536 threads.
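The same reasoning as a host-side sketch (Python; `resident_threads` is an illustrative helper, not a CUDA API):

```python
# Threads resident per SM for each candidate block size, under two
# hardware limits: at most 1536 threads and at most 8 blocks per SM.
MAX_THREADS, MAX_BLOCKS = 1536, 8

def resident_threads(block_size):
    # Only whole blocks are assigned, and both limits apply.
    blocks = min(MAX_THREADS // block_size, MAX_BLOCKS)
    return blocks * block_size

for bs in (64, 128, 512, 1024):
    print(bs, resident_threads(bs))
# 512 threads/block wins: 3 blocks x 512 = 1536 resident threads
```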


• 6.

### Assume the following simple matrix multiplication kernel:

    __global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
    {
        int Row = blockIdx.y*blockDim.y + threadIdx.y;
        int Col = blockIdx.x*blockDim.x + threadIdx.x;
        if ((Row < Width) && (Col < Width)) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k) {
                Pvalue += M[Row*Width+k] * N[k*Width+Col];
            }
            P[Row*Width+Col] = Pvalue;
        }
    }

Which of the following is true?

• A.

M[Row*Width+k] and N[k*Width+Col] are coalesced but P[Row*Width+Col] is not

• B.

M[Row*Width+k], N[k*Width+Col] and P[Row*Width+Col] are all coalesced

• C.

M[Row*Width+k] is not coalesced but N[k*Width+Col] and P[Row*Width+Col] both are

• D.

M[Row*Width+k] is coalesced but N[k*Width+Col] and P[Row*Width+Col] are not

C. M[Row*Width+k] is not coalesced but N[k*Width+Col] and P[Row*Width+Col] both are
Explanation
Within a warp, threads have consecutive threadIdx.x values, so Row is the same for all of them while Col takes consecutive values. The indices of N[k*Width+Col] and P[Row*Width+Col] therefore increase by 1 from one thread to the next, so those accesses fall into consecutive addresses and are coalesced. The index of M[Row*Width+k] does not depend on threadIdx.x at all: in a given iteration every thread in the warp reads the same element, and the warp steps along a row only through the k loop, so this access pattern is not coalesced. Therefore M[Row*Width+k] is not coalesced, while N[k*Width+Col] and P[Row*Width+Col] both are.
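A quick way to see the pattern is to list the element indices that one warp touches in a single iteration (Python sketch; `WIDTH`, `row` and `k` are arbitrary example values):

```python
# Element indices touched by one warp in one iteration k, for a warp
# of threads with consecutive threadIdx.x (and hence Col = 0..31).
WIDTH, WARP = 1024, 32
row, k = 5, 7                                       # arbitrary examples

m_idx = [row * WIDTH + k for col in range(WARP)]    # same address for all
n_idx = [k * WIDTH + col for col in range(WARP)]    # consecutive addresses
p_idx = [row * WIDTH + col for col in range(WARP)]  # consecutive addresses

def consecutive(idx):
    return all(b - a == 1 for a, b in zip(idx, idx[1:]))

print(consecutive(m_idx), consecutive(n_idx), consecutive(p_idx))
# False True True: only the N and P accesses are coalesced
```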


• 7.

### For the simple reduction kernel, if the block size is 1,024 and the warp size is 32, how many warps in a block will have divergence during the 5th iteration?

• A.

0

• B.

1

• C.

16

• D.

32

D. 32
Explanation
In this reduction kernel the stride doubles every iteration (1, 2, 4, 8, 16, ...), so during the 5th iteration the stride is 16 and only threads whose index is a multiple of 16 perform the addition. Every 32-thread warp therefore contains both active and inactive threads, which is exactly the condition for divergence. With a block size of 1,024 there are 1,024 / 32 = 32 warps, and all 32 of them diverge.
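The divergence count can be simulated on the host (Python; `divergent_warps` is an illustrative helper that applies the kernel's `t % stride == 0` test warp by warp):

```python
# Count divergent warps in the t % stride == 0 reduction at a given
# iteration: a warp diverges when it has both active and inactive threads.
BLOCK, WARP_SIZE = 1024, 32

def divergent_warps(iteration):
    stride = 2 ** (iteration - 1)    # stride doubles: 1, 2, 4, 8, 16, ...
    count = 0
    for w in range(BLOCK // WARP_SIZE):
        active = [t % stride == 0
                  for t in range(w * WARP_SIZE, (w + 1) * WARP_SIZE)]
        if any(active) and not all(active):
            count += 1
    return count

print(divergent_warps(5))  # 32: every warp diverges when stride is 16
print(divergent_warps(1))  # 0: with stride 1 all threads are active
```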


• 8.

### For the following basic reduction kernel code fragment, if the block size is 1024 and the warp size is 32, how many warps in a block will have divergence during the iteration where stride is equal to 1?

    unsigned int t = threadIdx.x;
    unsigned int start = 2*blockIdx.x*blockDim.x;
    partialSum[t] = input[start + t];
    partialSum[blockDim.x + t] = input[start + blockDim.x + t];
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % stride == 0) {
            partialSum[2*t] += partialSum[2*t + stride];
        }
    }

• A.

0

• B.

1

• C.

16

• D.

32

A. 0
Explanation
During the iteration where the stride is equal to 1, there will be no warps in a block that have divergence. This is because the condition "t % stride == 0" ensures that only threads with thread index divisible by the stride will execute the code inside the if statement. Since the stride is 1, all threads will satisfy this condition and there will be no divergence. Therefore, the answer is 0.
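A host-side rewrite of the loop (a toy sequential simulation, not device code) confirms both that at stride == 1 every thread passes the `t % stride` test and that the reduction produces the full sum:

```python
# Sequential simulation of the reduction fragment for one block.
block_dim = 1024
partial = list(range(2 * block_dim))     # the 2048 loaded elements
expected = sum(partial)

stride = 1
while stride <= block_dim:
    for t in range(block_dim):           # every thread runs the same code
        if t % stride == 0:              # at stride 1 this is true for all t
            partial[2 * t] += partial[2 * t + stride]
    stride *= 2

print(all(t % 1 == 0 for t in range(block_dim)))  # True: no divergence
print(partial[0] == expected)                     # True: full sum reduced
```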


• 9.

### SM implements zero overhead scheduling because –

• A.

A warp whose next instruction has its operands ready for consumption is eligible for execution.

• B.

All threads in a warp execute the same instruction when selected.

• C.

Both are correct

• D.

None

C. Both are correct
Explanation
Zero-overhead scheduling means the SM can switch between warps without wasting cycles. The first statement is true: the SM tracks which warps have the operands for their next instruction ready, and any such warp is immediately eligible for execution, so a stall in one warp is hidden by issuing from another. The second statement is also true: all threads in a warp execute the same instruction when the warp is selected, so issuing a warp requires no per-thread scheduling work. Together these properties are what make warp scheduling effectively free of overhead, so both statements are correct.


• 10.

### __device__ __constant__ int mask = 10; will have memory, scope and lifetime defined as

• A.

• B.

Global, grid and application

• C.

Shared, grid and application

• D.

Constant, grid and application

D. Constant, grid and application
Explanation
A variable declared with __device__ __constant__ resides in constant memory, is visible to all threads of the grid, and persists for the lifetime of the application. Its memory, scope and lifetime are therefore constant, grid and application.