Parallel Processing Quiz 2

10 Questions



Questions and Answers
  • 1. 
    Assume that a kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a shared memory variable, how many versions of the variable will be created over the lifetime of the kernel's execution?
    • A. 

      1

    • B. 

      1000

    • C. 

      512

    • D. 

      512000
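
    A note on the concept tested here: a __shared__ variable has one instance per thread block, not per thread and not one for the whole kernel. Below is a minimal sketch (the kernel name and launch are hypothetical) showing the per-block instantiation:

        __global__ void sharedDemo(float* out) {
            __shared__ float s_val;       // one copy of s_val exists per thread block
            if (threadIdx.x == 0) s_val = (float)blockIdx.x;
            __syncthreads();              // make the write visible to the whole block
            out[blockIdx.x * blockDim.x + threadIdx.x] = s_val;
        }

        // A launch matching the question, creating one s_val per block over the
        // kernel's lifetime:
        // sharedDemo<<<1000, 512>>>(d_out);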

  • 2. 
    For the tiled single-precision matrix multiplication kernel, assume that the tile size is 32×32 and the system has a DRAM burst size of 128 bytes. How many DRAM bursts will be delivered to the processor as a result of loading one A-matrix tile by a thread block?
    • A. 

      16

    • B. 

      32

    • C. 

      64

    • D. 

      128
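
    The underlying arithmetic: one row of a 32×32 single-precision tile holds 32 × 4 B = 128 B, exactly one burst, and the tile has 32 such rows. Below is a hedged sketch of the tile-loading phase (the kernel and its parameters are hypothetical; it assumes a row-major A and the usual coalesced load pattern):

        #define TILE 32
        __global__ void loadATile(const float* A, float* out, int Width, int ph) {
            __shared__ float As[TILE][TILE];
            int Row = blockIdx.y * TILE + threadIdx.y;
            // The 32 threads with consecutive threadIdx.x read 32 consecutive
            // floats: 32 x 4 B = 128 B = one DRAM burst per tile row, so the
            // 32 rows of the tile take 32 bursts.
            As[threadIdx.y][threadIdx.x] = A[Row * Width + ph * TILE + threadIdx.x];
            __syncthreads();
            out[Row * Width + ph * TILE + threadIdx.x] = As[threadIdx.y][threadIdx.x];
        }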

  • 3. 
    We want to use each thread to calculate two (adjacent) output elements of a vector addition. Assume that variable i should be the index of the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index of the first element?
    • A. 

      i = blockIdx.x*blockDim.x + threadIdx.x + 2;

    • B. 

      i = blockIdx.x*threadIdx.x*2;

    • C. 

      i = (blockIdx.x*blockDim.x + threadIdx.x)*2;

    • D. 

      i = blockIdx.x*blockDim.x*2 + threadIdx.x;
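
    A minimal sketch of the intended access pattern (hypothetical kernel name): each thread owns elements i and i + 1, so consecutive threads must start two elements apart:

        __global__ void vecAddTwoAdjacent(const float* A, const float* B,
                                          float* C, int n) {
            // Index of the first of the two adjacent elements for this thread:
            int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
            if (i < n)     C[i]     = A[i]     + B[i];
            if (i + 1 < n) C[i + 1] = A[i + 1] + B[i + 1];
        }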

  • 4. 
    We are to process a 600×800 picture (800 pixels in the x, or horizontal, direction; 600 pixels in the y, or vertical, direction) with PictureKernel(). That is, m's value is 600 and n's value is 800.

        __global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m)
        {
            // Calculate the row # of the d_Pin and d_Pout element to process
            int Row = blockIdx.y*blockDim.y + threadIdx.y;
            // Calculate the column # of the d_Pin and d_Pout element to process
            int Col = blockIdx.x*blockDim.x + threadIdx.x;
            // Each thread computes one element of d_Pout if in range
            if ((Row < m) && (Col < n)) {
                d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
            }
        }

    Assume that we decided to use a grid of 16×16 blocks; that is, each block is organized as a 2D 16×16 array of threads. How many warps will be generated during the execution of the kernel?
    • A. 

      37*16

    • B. 

      38*50

    • C. 

      38*8*50

    • D. 

      38*50*2
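
    A worked sketch of the launch arithmetic (the host-side code is hypothetical; the ceiling division is the standard idiom for covering the whole picture):

        dim3 dimBlock(16, 16);               // 256 threads = 8 warps per block
        dim3 dimGrid((800 + 15) / 16,        // ceil(800/16) = 50 blocks in x
                     (600 + 15) / 16);       // ceil(600/16) = 38 blocks in y
        // Every launched block contributes all 8 of its warps (threads that fall
        // outside the picture still occupy warp slots), so the launch below
        // generates 38 * 50 * 8 warps in total.
        // PictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, 800, 600);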

  • 5. 
    A CUDA device's SM (streaming multiprocessor) can take up to 1,536 threads and up to 8 thread blocks. Which of the following block configurations would result in the largest number of threads in each SM?
    • A. 

      64 threads per block

    • B. 

      128 threads per block

    • C. 

      512 threads per block

    • D. 

      1,024 threads per block
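
    A worked check against both limits; whichever limit binds first caps the thread count per SM:

        64 threads/block:     8 blocks × 64    =   512 threads  (8-block limit binds)
        128 threads/block:    8 blocks × 128   = 1,024 threads  (8-block limit binds)
        512 threads/block:    3 blocks × 512   = 1,536 threads  (thread limit reached)
        1,024 threads/block:  1 block  × 1,024 = 1,024 threads  (a 2nd block would exceed 1,536)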

  • 6. 
    Assume the following simple matrix multiplication kernel:

        __global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
        {
            int Row = blockIdx.y*blockDim.y + threadIdx.y;
            int Col = blockIdx.x*blockDim.x + threadIdx.x;
            if ((Row < Width) && (Col < Width)) {
                float Pvalue = 0;
                for (int k = 0; k < Width; ++k) {
                    Pvalue += M[Row*Width+k] * N[k*Width+Col];
                }
                P[Row*Width+Col] = Pvalue;
            }
        }

    Which of the following is true?
    • A. 

      M[Row*Width+k] and N[k*Width+Col] are coalesced but P[Row*Width+Col] is not

    • B. 

      M[Row*Width+k], N[k*Width+Col] and P[Row*Width+Col] are all coalesced

    • C. 

      M[Row*Width+k] is not coalesced but N[k*Width+Col] and P[Row*Width+Col] both are

    • D. 

      M[Row*Width+k] is coalesced but N[k*Width+Col] and P[Row*Width+Col] are not
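
    A hedged way to check coalescing here (assuming blockDim.x is a multiple of the warp size, so a warp spans consecutive Col values while Row and the loop index k are uniform within it):

        // Addresses seen by adjacent threads of one warp in a given iteration:
        // M[Row*Width + k]   -> the same address for every thread (a broadcast,
        //                       not a run of consecutive words)
        // N[k*Width + Col]   -> consecutive addresses (Col differs by 1): coalesced
        // P[Row*Width + Col] -> consecutive addresses (Col differs by 1): coalesced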

  • 7. 
    For the simple reduction kernel, if the block size is 1,024 and the warp size is 32, how many warps in a block will have divergence during the 5th iteration? 
    • A. 

      0

    • B. 

      1

    • C. 

      16

    • D. 

      32
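
    A hedged walk-through, assuming the basic kernel shown in Question 8 (stride starts at 1 and doubles each iteration, guarded by t % stride == 0):

        Iteration:                               1st   2nd   3rd   4th   5th
        Stride:                                    1     2     4     8    16
        Active threads per 32-thread warp:        32    16     8     4     2

    A warp diverges when some but not all of its 32 threads satisfy the predicate; in the 5th iteration every warp of a 1,024-thread block contains both active (t % 16 == 0) and inactive threads.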

  • 8. 
    For the following basic reduction kernel code fragment, if the block size is 1,024 and the warp size is 32, how many warps in a block will have divergence during the iteration where stride is equal to 1?

        unsigned int t = threadIdx.x;
        unsigned int start = 2*blockIdx.x*blockDim.x;
        partialSum[t] = input[start + t];
        partialSum[blockDim.x + t] = input[start + blockDim.x + t];
        for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
            __syncthreads();
            if (t % stride == 0) {
                partialSum[2*t] += partialSum[2*t + stride];
            }
        }
    • A. 

      0

    • B. 

      1

    • C. 

      16

    • D. 

      32
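
    A short worked note on the fragment above: in the iteration where stride equals 1, the guard t % 1 == 0 is true for every thread, so all 32 threads of each warp take the same path and the branch causes no divergence in that iteration.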

  • 9. 
    An SM implements zero-overhead scheduling because:
    • A. 

      Warp whose next instruction has its operands ready for consumption is eligible for execution.

    • B. 

      All threads in a warp execute the same instruction when selected.

    • C. 

      Both are correct

    • D. 

      None
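
    Background for this question: warp execution contexts stay resident in SM hardware, so the scheduler can switch on any cycle to a warp whose next instruction has its operands ready, with no state save/restore cost; and when a warp is selected, all of its threads execute the same instruction.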

  • 10. 
    __device__ __constant__ int mask = 10; will have its memory type, scope, and lifetime defined as
    • A. 

      Register, thread and thread

    • B. 

      Global, grid and application

    • C. 

      Shared, grid and application

    • D. 

      Constant, grid and application
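
    A minimal sketch of the declaration in context (the kernel and its arguments are hypothetical; __constant__ may optionally be combined with __device__):

        // File-scope declaration: the variable resides in constant memory, is
        // readable by all threads of a grid (grid scope), and lives for the
        // entire application (application lifetime).
        __device__ __constant__ int mask = 10;

        __global__ void useMask(const int* in, int* out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = in[i] & mask;   // every thread reads the same mask
        }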
