Parallel Programming Exam Quiz!

By Dps158, Community Contributor
  • 1. 

    We want each thread to calculate two adjacent elements of a vector addition. Assume that the variable I is the index of the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index?

    • I = blockIdx.x*blockDim.x + threadIdx.x + 2
    • I = blockIdx.x*threadIdx.x*2
    • I = (blockIdx.x*blockDim.x + threadIdx.x)*2
    • I = blockIdx.x*threadIdx.x*2 + threadIdx.x

    Correct Answer
    A. I = (blockIdx.x*blockDim.x + threadIdx.x)*2
    Explanation
    blockIdx.x*blockDim.x + threadIdx.x is the standard flattened thread index, which enumerates threads 0, 1, 2, ... across the grid. Because each thread covers two adjacent elements, consecutive threads must start two elements apart, so the flattened index is multiplied by 2. Thread 0 then handles elements 0 and 1, thread 1 handles elements 2 and 3, and so on.
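    A minimal kernel sketch of this mapping (the kernel name and signature are illustrative, not part of the quiz):

    __global__ void vecAddTwoPerThread(const float *a, const float *b,
                                       float *c, int n) {
        // Flattened thread index, scaled by 2 because each thread
        // covers two adjacent elements.
        int I = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
        if (I < n)     c[I]     = a[I]     + b[I];
        if (I + 1 < n) c[I + 1] = a[I + 1] + b[I + 1];
    }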
About This Quiz

The 'Parallel Programming Exam Quiz!' assesses knowledge in parallel computing using CUDA. It covers thread indexing, memory management, and synchronization, targeting skills crucial for optimizing performance on parallel architectures. Essential for learners in computer science focusing on high-performance computing.


Quiz Preview

  • 2. 

    For a vector addition, assume that the vector length is 4000, each thread calculates one output element, and the thread block size is 1024 threads. How many threads will be in the grid?

    • 2000

    • 3000

    • 1024

    • 4096

    Correct Answer
    A. 4096
    Explanation
    In this scenario, the vector length is 4000, each thread computes one output element, and the block size is 1024 threads. The number of blocks is the vector length divided by the block size, rounded up: 4000/1024 ≈ 3.91, so 4 blocks are needed. The grid therefore contains 4 × 1024 = 4096 threads; the last 96 threads are surplus and must be disabled by a bounds check inside the kernel.
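    A self-contained sketch of this calculation (kernel and variable names are illustrative):

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];   // the surplus 96 threads fail this check
    }

    // Host side: ceiling division gives the block count.
    int n = 4000;
    int blockSize = 1024;
    int gridSize = (n + blockSize - 1) / blockSize;   // (4000 + 1023) / 1024 = 4
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); // 4 * 1024 = 4096 threads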


  • 3. 

    If a CUDA device's SM (streaming multiprocessor) can take up to 1536 threads and up to 6 thread blocks, which of the following block configurations would result in the largest number of threads in the SM?

    • 128 threads per block

    • 276 threads per block

    • 512 threads per block

    • 1024 threads per block

    Correct Answer
    A. 512 threads per block
    Explanation
    A CUDA device's SM can host at most 1536 threads and at most 6 thread blocks at a time. For each configuration, the number of resident blocks is limited by whichever cap is hit first:

    With 128 threads per block, the thread limit would allow 12 blocks, but the block limit caps it at 6, giving 6 * 128 = 768 threads.

    With 276 threads per block, only 5 blocks fit under the thread limit (5 * 276 = 1380; a sixth would exceed 1536), giving 1380 threads.

    With 512 threads per block, exactly 3 blocks fit, giving 3 * 512 = 1536 threads and fully utilizing the SM.

    With 1024 threads per block, only 1 block fits (2 * 1024 = 2048 > 1536), giving 1024 threads.

    Therefore, the block configuration that results in the largest number of threads in the SM is 512 threads per block.
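    The arithmetic above can be captured in a small helper (a sketch; the limits are the ones given in the question, not queried from hardware):

    int residentThreads(int blockSize) {
        const int maxThreads = 1536;  // SM thread limit from the question
        const int maxBlocks  = 6;     // SM block limit from the question
        int blocks = maxThreads / blockSize;        // blocks the thread limit allows
        if (blocks > maxBlocks) blocks = maxBlocks; // then apply the block limit
        return blocks * blockSize;
    }
    // residentThreads(128)  == 768
    // residentThreads(276)  == 1380
    // residentThreads(512)  == 1536  <- full occupancy
    // residentThreads(1024) == 1024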


  • 4. 

    At what level is the __syncthreads() function applicable?

    • Thread level

    • Block level

    • Grid level

    • All of the options

    Correct Answer
    A. Block level
    Explanation
    The __syncthreads() function in CUDA is applicable at the block level. It is a barrier: no thread in the block proceeds past the call until every thread in the block has reached it, and shared-memory writes made before the barrier are visible to all threads in the block afterwards. It is commonly used to coordinate shared memory access and avoid race conditions between threads within a block. It does not synchronize threads across different blocks or the entire grid. Therefore, the correct answer is block level.
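    A minimal sketch of a block-level barrier in use, assuming the kernel is launched with 256 threads per block (all names illustrative):

    __global__ void reverseInBlock(float *data) {
        __shared__ float tile[256];                // visible to one block only
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        tile[t] = data[base + t];
        __syncthreads();                           // wait until every thread in
                                                   // the block has written tile
        data[base + t] = tile[blockDim.x - 1 - t];
    }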


  • 5. 

    For a tiled matrix-matrix multiplication kernel, if we use a 64x64 tile, what is the reduction in memory bandwidth usage for the input matrices M and N?

    • 1/8 of the original usage

    • 1/16 of the original usage

    • 1/32 of the original usage

    • 1/64 of the original usage

    Correct Answer
    A. 1/64 of the original usage
    Explanation
    With a 64x64 tile, each element of M and N is loaded from global memory into shared memory once per tile phase and is then reused from shared memory by 64 threads, once for each dot product that crosses the tile. Global memory accesses to each input matrix therefore drop by a factor equal to the tile width, reducing bandwidth usage to 1/64 of what an untiled kernel would need.
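    A sketch of the tile-loading pattern, shown here with a 16x16 tile so that one thread per tile element fits under the usual 1024-threads-per-block limit (the reuse argument is identical; the reduction factor equals the tile width). Matrix dimensions are assumed to be a multiple of TILE:

    #define TILE 16

    __global__ void tiledMatMul(const float *M, const float *N, float *P, int width) {
        __shared__ float Ms[TILE][TILE], Ns[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int ph = 0; ph < width / TILE; ++ph) {
            // Each element is fetched from global memory once per phase...
            Ms[threadIdx.y][threadIdx.x] = M[row * width + ph * TILE + threadIdx.x];
            Ns[threadIdx.y][threadIdx.x] = N[(ph * TILE + threadIdx.y) * width + col];
            __syncthreads();
            // ...and then reused TILE times from shared memory.
            for (int k = 0; k < TILE; ++k)
                acc += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
            __syncthreads();
        }
        P[row * width + col] = acc;
    }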


  • 6. 

    Assume that a kernel is launched with 1000 thread blocks each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?

    • 1

    • 1000

    • 51200

    • 512000

    Correct Answer
    A. 512000
    Explanation
    In this scenario, a kernel is launched with 1000 thread blocks, and each block has 512 threads. If a variable is declared as a local variable in the kernel, a separate version of the variable will be created for each thread. Since there are 1000 blocks and 512 threads per block, the total number of threads is 1000 * 512 = 512000. Therefore, 512000 versions of the variable will be created throughout the execution of the kernel.
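    A sketch illustrating per-thread local variables (names illustrative):

    __global__ void scale(float *data, float factor) {
        // idx is an automatic (local) variable: every thread gets its own
        // private copy, typically held in a register.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] *= factor;
    }
    // Launched as scale<<<1000, 512>>>(d_data, 2.0f), this creates
    // 1000 * 512 = 512000 instances of idx over the kernel's lifetime.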


  • 7. 

    Consider performing a matrix multiplication of two input matrices with dimensions NxN. How many times is each element of the input matrices requested from global memory when tiles of size TxT are used?

    • T/N

    • N/T

    • T*N

    • (N*N)/(T*T)

    Correct Answer
    A. N/T
    Explanation
    Without tiling, each element of an input matrix is fetched from global memory N times: M[i][k], for example, contributes to all N outputs in row i, and each of those N dot products reloads it. With TxT tiling, a loaded element serves T of those dot products from shared memory, so it only needs to be fetched once per tile phase, that is, N/T times. Therefore, the correct answer is N/T.
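    The count as a worked example (a hypothetical helper, with untiled access corresponding to a tile width of 1):

    int loadsPerElement(int n, int tile) {
        return n / tile;   // global fetches of one input element
    }
    // loadsPerElement(1024, 1)  == 1024  (untiled: fetched for every output)
    // loadsPerElement(1024, 32) == 32    (32x32 tiles: a 32x traffic saving)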


  • 8. 

    For shared-memory-based tiled matrix multiplication of input matrices M and N stored in a row-major layout, which input matrix will have coalesced access?

    • M

    • N

    • Both

    • None

    Correct Answer
    A. Both
    Explanation
    Both input matrices M and N have coalesced access in the shared-memory tiled kernel. Coalesced access means that consecutive threads of a warp read consecutive memory locations, which in a row-major layout means adjacent elements of the same row. In the tiled kernel, both tile loads are indexed so that threadIdx.x, the fastest-varying thread coordinate, walks along a row: threads of a warp read adjacent elements of a row of M and adjacent elements of a row of N. Both loads are therefore coalesced.
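    A sketch isolating the two tile loads (a hypothetical loadTiles kernel covering a single phase ph, with the same indexing as the tiled kernel in question 5) to show why both are coalesced:

    __global__ void loadTiles(const float *M, const float *N, int width, int ph) {
        __shared__ float Ms[16][16], Ns[16][16];
        int row = blockIdx.y * 16 + threadIdx.y;
        int col = blockIdx.x * 16 + threadIdx.x;  // threadIdx.x varies fastest
        // Threads of a warp differ only in threadIdx.x, so both loads below
        // touch consecutive row-major addresses: coalesced for M and for N.
        Ms[threadIdx.y][threadIdx.x] = M[row * width + ph * 16 + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(ph * 16 + threadIdx.y) * width + col];
    }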


  • 9. 

    Which of the following are qualifier keywords in function declarations in CUDA?

    • __graphic__

    • __global__

    • __Kernel__

    • All of the options

    Correct Answer
    A. __global__
    Explanation
    Of the options listed, only "__global__" is a real CUDA qualifier keyword. It marks a kernel function: one that is launched from host code and executed on the GPU. CUDA's other function qualifiers are __device__ and __host__; "__graphic__" and "__Kernel__" do not exist.
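    A sketch showing the three real qualifiers side by side (all function names illustrative):

    __device__ float square(float x) { return x * x; }            // GPU-only helper
    __host__ __device__ float twice(float x) { return 2.0f * x; } // compiled for both

    __global__ void transform(float *out, const float *in) {      // kernel: called
        int i = blockIdx.x * blockDim.x + threadIdx.x;             // from the host,
        out[i] = twice(square(in[i]));                             // runs on the GPU
    }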


  • 10. 

    How many configuration parameters are required in a CUDA kernel function call?

    • 2

    • 1

    • 3

    • 5

    Correct Answer
    A. 2
    Explanation
    A CUDA kernel launch takes an execution configuration between <<< and >>>, of which two parameters are required: the grid dimensions (number of blocks) and the block dimensions (threads per block). Two further parameters, the dynamic shared-memory size in bytes and the stream, are optional and default to 0. Therefore, the correct answer is 2.
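    A sketch of both launch forms, reusing the illustrative vecAdd kernel and device pointers from question 2 (stream is an assumed cudaStream_t created elsewhere):

    // Minimal call: the two required parameters, grid size and block size.
    vecAdd<<<4, 1024>>>(d_a, d_b, d_c, n);

    // Full form: dynamic shared-memory bytes and a stream are optional.
    vecAdd<<<4, 1024, 0, stream>>>(d_a, d_b, d_c, n);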

