Parallel Processing Quiz 2

Questions and Answers
  • 1. 

    Assume that a kernel is launched with 1000 thread blocks each of which has 512 threads. If a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?

    • A.

      1

    • B.

      1000

    • C.

      512

    • D.

      512000

    Correct Answer
    B. 1000
    Explanation
    A variable declared in shared memory is allocated once per thread block and shared by all threads of that block. With 1000 thread blocks, 1000 instances of the variable are therefore created over the lifetime of the kernel.
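
    A minimal sketch (the kernel and variable names are illustrative, not from the quiz) of how a __shared__ declaration yields one copy per block:

        // Launched as SharedCounterKernel<<<1000, 512>>>(d_out):
        // 1000 blocks means 1000 instances of "counter" over the kernel's lifetime.
        __global__ void SharedCounterKernel(int* out) {
            __shared__ int counter;                 // one copy per block, shared by its 512 threads
            if (threadIdx.x == 0) counter = 0;      // initialized once per block
            __syncthreads();
            atomicAdd(&counter, 1);                 // all threads of the block update the same copy
            __syncthreads();
            if (threadIdx.x == 0) out[blockIdx.x] = counter;  // each block reports its own value
        }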


  • 2. 

    For the tiled single-precision matrix multiplication kernel, assume that the tile size is 32x32 and the system has a DRAM burst size of 128 bytes. How many DRAM bursts will be delivered to the processor as a result of loading one A-matrix tile by a thread block?

    • A.

      16

    • B.

      32

    • C.

      64

    • D.

      128

    Correct Answer
    B. 32
    Explanation
    When loading one A-matrix tile by a thread block, each element in the tile is a single-precision value, which requires 4 bytes of memory. The tile size is 32x32, so there are a total of 32*32 = 1024 elements in the tile. Multiplying the number of elements by the size of each element gives 1024 * 4 = 4096 bytes. Since the DRAM burst size is 128 bytes, dividing the total size of the tile (4096 bytes) by the burst size (128 bytes) gives 4096 / 128 = 32 bursts. Therefore, 32 DRAM bursts will be delivered to the processor as a result of loading one A-matrix tile by a thread block.
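
    A quick check of the arithmetic, assuming each tile row is aligned so that one 32-float row fills exactly one 128-byte burst:

        #include <cstdio>
        int main() {
            const int tileDim = 32;
            const int elemBytes = sizeof(float);                  // 4 bytes, single precision
            const int tileBytes = tileDim * tileDim * elemBytes;  // 1024 elements * 4 = 4096 bytes
            const int burstBytes = 128;
            printf("bursts per tile = %d\n", tileBytes / burstBytes);  // 4096 / 128 = 32
            return 0;
        }

    Equivalently, each of the 32 tile rows is 32 floats = 128 bytes = exactly one burst.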


  • 3. 

    We want to use each thread to calculate two (adjacent) output elements of a vector addition. Assume that variable i should be the index for the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to data index of the first element?

    • A.

      i = blockIdx.x*blockDim.x + threadIdx.x + 2;

    • B.

      i = blockIdx.x*threadIdx.x*2;

    • C.

      i = (blockIdx.x*blockDim.x + threadIdx.x)*2;

    • D.

      i = blockIdx.x*blockDim.x*2 + threadIdx.x;

    Correct Answer
    C. i = (blockIdx.x*blockDim.x + threadIdx.x)*2;
    Explanation
    The expression i = (blockIdx.x*blockDim.x + threadIdx.x)*2 first computes the global thread index, blockIdx.x*blockDim.x + threadIdx.x, and then multiplies it by 2. Thread 0 therefore starts at element 0, thread 1 at element 2, thread 2 at element 4, and so on: each thread's pair of adjacent elements begins at an even index and no two threads overlap.
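
    A minimal sketch (array and kernel names are illustrative) of a vector addition in which each thread produces two adjacent outputs:

        __global__ void VecAddTwoPerThread(const float* A, const float* B, float* C, int n) {
            int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;  // first element for this thread
            if (i < n)     C[i]     = A[i]     + B[i];            // first of the adjacent pair
            if (i + 1 < n) C[i + 1] = A[i + 1] + B[i + 1];        // second of the adjacent pair
        }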


  • 4. 

    We are to process a 600x800 picture (800 pixels in the x, or horizontal, direction and 600 pixels in the y, or vertical, direction) with PictureKernel(). That is, m's value is 600 and n's value is 800.

        __global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m)
        {
            // Calculate the row # of the d_Pin and d_Pout element to process
            int Row = blockIdx.y*blockDim.y + threadIdx.y;
            // Calculate the column # of the d_Pin and d_Pout element to process
            int Col = blockIdx.x*blockDim.x + threadIdx.x;
            // Each thread computes one element of d_Pout if in range
            if ((Row < m) && (Col < n)) {
                d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
            }
        }

    Assume that we decided to use 16x16 thread blocks; that is, each block is organized as a 2D 16x16 array of threads. How many warps will be generated during the execution of the kernel?

    • A.

      37*16

    • B.

      38*50

    • C.

      38*8*50

    • D.

      38*50*2

    Correct Answer
    C. 38*8*50
    Explanation
    Each 16x16 block contains 256 threads, which is 256 / 32 = 8 warps. Covering 800 pixels in x takes 800/16 = 50 blocks, and covering 600 pixels in y takes ceil(600/16) = 38 blocks (the 38th row of blocks is only partially used, but its warps are still generated). The grid thus has 38 * 50 = 1,900 blocks, each generating 8 warps, for a total of 38*8*50 = 15,200 warps.
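
    A small host-side sketch of the launch arithmetic (names are illustrative):

        #include <cstdio>
        int main() {
            const int m = 600, n = 800;            // picture height and width
            const int bx = 16, by = 16;            // 16x16 threads per block
            int gridX = (n + bx - 1) / bx;         // 50 blocks across
            int gridY = (m + by - 1) / by;         // 38 blocks down: ceil(600/16)
            int warpsPerBlock = (bx * by) / 32;    // 256 threads -> 8 warps
            printf("warps = %d\n", gridY * warpsPerBlock * gridX);  // 38*8*50 = 15200
            return 0;
        }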


  • 5. 

    A CUDA device's SM (streaming multiprocessor) can take up to 1,536 threads and up to 8 thread blocks. Which of the following block configurations would result in the most threads in each SM?

    • A.

      64 threads per block

    • B.

      128 threads per block

    • C.

      512 threads per block

    • D.

      1,024 threads per block

    Correct Answer
    C. 512 threads per block
    Explanation
    An SM is limited both to 1,536 resident threads and to 8 resident blocks, so each configuration yields min(1,536 / threads per block, 8) blocks per SM:

    With 64 threads per block, the 8-block limit applies first: 8 * 64 = 512 threads.

    With 128 threads per block, the 8-block limit again applies first: 8 * 128 = 1,024 threads.

    With 512 threads per block, 1,536 / 512 = 3 blocks fit, giving 3 * 512 = 1,536 threads, the SM's full thread capacity.

    With 1,024 threads per block, only one block fits (a second would exceed 1,536 threads), giving 1,024 threads.

    Therefore, the block configuration that results in the most threads in each SM is 512 threads per block.
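
    The same arithmetic as a small sketch (limits taken from the question):

        #include <cstdio>
        int main() {
            const int maxThreads = 1536, maxBlocks = 8;
            const int sizes[] = {64, 128, 512, 1024};
            for (int i = 0; i < 4; ++i) {
                int blocks = maxThreads / sizes[i];           // blocks that fit by thread count
                if (blocks > maxBlocks) blocks = maxBlocks;   // ...capped by the 8-block limit
                printf("%4d threads/block -> %d blocks -> %4d threads\n",
                       sizes[i], blocks, blocks * sizes[i]);
            }
            return 0;
        }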


  • 6. 

    Assume the following simple matrix multiplication kernel:

        __global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
        {
            int Row = blockIdx.y*blockDim.y + threadIdx.y;
            int Col = blockIdx.x*blockDim.x + threadIdx.x;
            if ((Row < Width) && (Col < Width)) {
                float Pvalue = 0;
                for (int k = 0; k < Width; ++k) {
                    Pvalue += M[Row*Width+k] * N[k*Width+Col];
                }
                P[Row*Width+Col] = Pvalue;
            }
        }

    Which of the following is true?

    • A.

      M[Row*Width+k] and N[k*Width+Col] are coalesced but P[Row*Width+Col] is not

    • B.

      M[Row*Width+k], N[k*Width+Col] and P[Row*Width+Col] are all coalesced

    • C.

      M[Row*Width+k] is not coalesced but N[k*Width+Col] and P[Row*Width+Col] both are

    • D.

      M[Row*Width+k] is coalesced but N[k*Width+Col] and P[Row*Width+Col] are not

    Correct Answer
    C. M[Row*Width+k] is not coalesced but N[k*Width+Col] and P[Row*Width+Col] both are
    Explanation
    Adjacent threads in a warp differ in threadIdx.x and therefore in Col, while Row and k are identical across the warp in any given loop iteration. For N[k*Width+Col] and P[Row*Width+Col], consecutive threads thus access consecutive memory addresses, so those accesses are coalesced. The index of M[Row*Width+k] does not depend on Col at all, so the threads of a warp do not access consecutive M elements; those accesses are not coalesced.
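
    An illustrative sketch (hypothetical kernel) annotating the three patterns for a warp in which threadIdx.x runs 0..31:

        __global__ void AccessPatterns(const float* M, const float* N, float* P, int Width, int k) {
            int Row = blockIdx.y*blockDim.y + threadIdx.y;   // same value across a warp
            int Col = blockIdx.x*blockDim.x + threadIdx.x;   // consecutive values across a warp
            if ((Row < Width) && (Col < Width)) {
                float a = M[Row*Width + k];    // index ignores Col: not coalesced
                float b = N[k*Width + Col];    // consecutive addresses: coalesced
                P[Row*Width + Col] = a + b;    // consecutive addresses: coalesced
            }
        }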


  • 7. 

    For the simple reduction kernel, if the block size is 1,024 and the warp size is 32, how many warps in a block will have divergence during the 5th iteration? 

    • A.

      0

    • B.

      1

    • C.

      16

    • D.

      32

    Correct Answer
    D. 32
    Explanation
    In the simple reduction kernel the stride doubles every iteration, so in the 5th iteration the stride is 16 and only threads whose index is a multiple of 16 perform an addition. Every 32-thread warp then contains exactly two active threads (e.g., threads 0 and 16) and 30 inactive ones, so every warp takes both paths of the branch and diverges. A 1,024-thread block has 1,024 / 32 = 32 warps, so all 32 warps have divergence.
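
    A sketch of the reduction step in question (it mirrors the code fragment in question 8 below; partialSum is assumed to hold 2*blockDim.x elements):

        __global__ void SimpleReduce(float* partialSum) {
            unsigned int t = threadIdx.x;  // block size assumed to be 1,024
            for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
                __syncthreads();
                // 5th iteration: stride == 16, so only t = 0, 16, 32, ... take the branch.
                // Two active threads per 32-thread warp means every warp diverges.
                if (t % stride == 0)
                    partialSum[2*t] += partialSum[2*t + stride];
            }
        }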


  • 8. 

    For the following basic reduction kernel code fragment, if the block size is 1024 and the warp size is 32, how many warps in a block will have divergence during the iteration where stride is equal to 1?

        unsigned int t = threadIdx.x;
        unsigned int start = 2*blockIdx.x*blockDim.x;
        partialSum[t] = input[start + t];
        partialSum[blockDim.x+t] = input[start + blockDim.x+t];
        for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2)
        {
            __syncthreads();
            if (t % stride == 0) {
                partialSum[2*t] += partialSum[2*t+stride];
            }
        }

    • A.

      0

    • B.

      1

    • C.

      16

    • D.

      32

    Correct Answer
    A. 0
    Explanation
    When the stride is 1, the condition t % stride == 0 is true for every thread (every integer is divisible by 1), so all 1024 threads in the block take the same path through the if statement. No warp contains a mix of active and inactive threads, so no warp diverges, and the answer is 0.


  • 9. 

    An SM implements zero-overhead scheduling because –

    • A.

      A warp whose next instruction has its operands ready for consumption is eligible for execution.

    • B.

      All threads in a warp execute the same instruction when selected.

    • C.

      Both are correct

    • D.

      None

    Correct Answer
    C. Both are correct
    Explanation
    Both statements are correct. An SM keeps many warps resident and, on each cycle, its scheduler can select any warp whose next instruction has its operands ready for consumption, so no cycles are wasted waiting on stalled warps or swapping contexts in and out. And because all threads in a warp execute the same instruction when the warp is selected, a single instruction issue serves the whole warp. Together these properties let the SM hide latency with zero scheduling overhead.


  • 10. 

    __device__ __constant__ int mask = 10; will have its memory type, scope, and lifetime defined as

    • A.

      Register, thread and thread

    • B.

      Global, grid and application

    • C.

      Shared, grid and application

    • D.

      Constant, grid and application

    Correct Answer
    D. Constant, grid and application
    Explanation
    The __constant__ qualifier places mask in the device's constant memory, a read-only region that is cached and optimized for the case where all threads of a warp read the same address; it does not reside in registers, shared memory, or ordinary global memory. The variable is visible to every thread of every kernel launch (grid scope), and it persists for the entire run of the program (application lifetime). Therefore, the correct answer is constant, grid, and application.
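
    A minimal sketch (kernel name and host code are illustrative) of declaring and using a constant-memory variable:

        __device__ __constant__ int mask = 10;  // constant memory, grid scope, application lifetime
        __global__ void ApplyMask(const int* in, int* out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = in[i] & mask;    // every thread reads the same cached value
        }
        // The host can overwrite the value before a launch with cudaMemcpyToSymbol:
        //   int h_mask = 0xFF;
        //   cudaMemcpyToSymbol(mask, &h_mask, sizeof(int));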

