Parallel Programming Exam Quiz!

Written by Dps158
Questions: 10 | Attempts: 576



Questions and Answers
  • 1. 

    We want to use each thread to calculate two (adjacent) elements of vector addition. Assume that a variable I should be the index for the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index?

    • A. 

      I = blockIdx.x*blockDim.x + threadIdx.x + 2

    • B. 

      I = blockIdx.x*threadIdx.x*2

    • C. 

      I = (blockIdx.x*blockDim.x + threadIdx.x)*2

    • D. 

      I = blockIdx.x*threadIdx.x*2 + threadIdx.x

    Correct Answer
    C. I = (blockIdx.x*blockDim.x + threadIdx.x)*2
    Explanation
    The global thread index is blockIdx.x*blockDim.x + threadIdx.x, where blockIdx.x is the index of the current block, blockDim.x is the number of threads per block, and threadIdx.x is the thread's index within its block. Because each thread handles two adjacent elements, consecutive threads must start two elements apart, so the index of the first element is I = (blockIdx.x*blockDim.x + threadIdx.x)*2.


  • 2. 

    For a vector addition, assume that the vector length is 4000, each thread calculates one output element, and the thread block size is 1024 threads. How many threads will be in the grid?

    • A. 

      2000

    • B. 

      3000

    • C. 

      1024

    • D. 

      4096

    Correct Answer
    D. 4096
    Explanation
    The vector length is 4000 and each thread produces one output element, so at least 4000 threads are needed. With a block size of 1024, 4000/1024 ≈ 3.9 blocks; blocks cannot be fractional, so we round up to 4 blocks. The grid therefore contains 4 × 1024 = 4096 threads, and the 96 surplus threads are disabled by a bounds check in the kernel.


  • 3. 

    A CUDA device’s SM (streaming multiprocessor) can take up to 1536 threads and up to 6 thread blocks. Which of the following block configurations would result in the largest number of threads in the SM?

    • A. 

      128 threads per block

    • B. 

      276 threads per block

    • C. 

      512 threads per block

    • D. 

      1024 threads per block

    Correct Answer
    C. 512 threads per block
    Explanation
    The number of resident blocks is limited both by the 1536-thread cap and by the 6-block cap, so the resident thread count is min(6, floor(1536/blockSize)) × blockSize.

    With 128 threads per block, 1536/128 = 12 blocks would fit by thread count, but the 6-block limit applies: 6 × 128 = 768 threads.

    With 276 threads per block, only floor(1536/276) = 5 blocks fit: 5 × 276 = 1380 threads.

    With 512 threads per block, exactly 3 blocks fit: 3 × 512 = 1536 threads, filling the SM completely.

    With 1024 threads per block, only 1 block fits (2 × 1024 = 2048 > 1536): 1024 threads.

    Therefore 512 threads per block results in the largest number of resident threads in the SM.


  • 4. 

    __syncthreads() function is applicable to?

    • A. 

      Thread level

    • B. 

      Block level

    • C. 

      Grid level

    • D. 

      All of the options

    Correct Answer
    B. Block level
    Explanation
    The __syncthreads() function in CUDA is applicable at the block level. It is used to synchronize the threads within a block, ensuring that all threads have completed their execution before proceeding to the next set of instructions. This function is commonly used to coordinate shared memory access and avoid race conditions between threads within a block. It does not synchronize threads across different blocks or the entire grid. Therefore, the correct answer is block level.


  • 5. 

    For a tiled matrix-matrix multiplication kernel, if we use a 64x64 tile, what is the reduction in memory bandwidth usage for input matrices M and N?

    • A. 

      1/8 of the original usage

    • B. 

      1/16 of the original usage

    • C. 

      1/32 of the original usage

    • D. 

      1/64 of the original usage

    Correct Answer
    D. 1/64 of the original usage
    Explanation
    With a 64x64 tile, each thread block loads a tile of M and a tile of N into shared memory once, and every loaded element is then reused 64 times, once for each of the 64 partial products it contributes to within that tile phase. Because each element is fetched from global memory once instead of 64 times, the global-memory bandwidth used for M and N drops to 1/64 of the original usage.


  • 6. 

    Assume that a kernel is launched with 1000 thread blocks each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?

    • A. 

      1

    • B. 

      1000

    • C. 

      51200

    • D. 

      512000

    Correct Answer
    D. 512000
    Explanation
    In this scenario, a kernel is launched with 1000 thread blocks, and each block has 512 threads. If a variable is declared as a local variable in the kernel, a separate version of the variable will be created for each thread. Since there are 1000 blocks and 512 threads per block, the total number of threads is 1000 * 512 = 512000. Therefore, 512000 versions of the variable will be created throughout the execution of the kernel.


  • 7. 

    Consider performing a matrix multiplication of two input matrices with dimensions NxN. How many times is each element in the input matrices requested from global memory when tiles of size TxT are used?

    • A. 

      T/N

    • B. 

      N/T

    • C. 

      T*N

    • D. 

      (N*N)/(T*T)

    Correct Answer
    B. N/T
    Explanation
    Without tiling, each element of an input matrix is read from global memory N times, once for every output element that uses it. With TxT tiles, a block loads the element into shared memory once and reuses it for T partial products, so the element only needs to be fetched once per tile phase; across the whole computation it is requested by the N/T blocks that need it, i.e. N/T times.


  • 8. 

    For the shared memory based tiled matrix multiplication (MxN) based on a row-major layout, which input matrix will have coalesced access?

    • A. 

      M

    • B. 

      N

    • C. 

      Both

    • D. 

      None

    Correct Answer
    C. Both
    Explanation
    Both input matrices have coalesced access. Coalesced access means that consecutive threads in a warp read consecutive memory locations, which the hardware merges into a single transaction. In the tiled kernel, consecutive threadIdx.x values load consecutive elements of a row of the M tile and of the N tile, and in a row-major layout the elements of a row are adjacent in memory, so the global-memory loads of both M and N coalesce.


  • 9. 

    Which of the following is a valid function qualifier keyword in CUDA?

    • A. 

      __graphic__

    • B. 

      __global__

    • C. 

      __Kernel__

    • D. 

      All of the options

    Correct Answer
    B. __global__
    Explanation
    __global__ is the qualifier that marks a CUDA kernel: a function that is called from host (CPU) code and executed on the GPU. CUDA also defines the __device__ and __host__ qualifiers, but neither appears among the options, and __graphic__ and __Kernel__ are not CUDA keywords, so __global__ is the only correct choice.


  • 10. 

    How many configuration parameters are required in a CUDA kernel function call?

    • A. 

      2

    • B. 

      1

    • C. 

      3

    • D. 

      5

    Correct Answer
    A. 2
    Explanation
    A CUDA kernel launch takes its configuration parameters between the <<< and >>> brackets. Two of them are required: the grid dimension (the number of blocks) and the block dimension (the number of threads per block). Two further parameters, the dynamic shared-memory size and the stream, are optional. Therefore the answer is 2.

