Parallel Programming Exam Quiz!

By Dps158, Community Contributor
  • 1. 

    We want each thread to calculate two adjacent elements of a vector addition. Assume that the variable I is the index of the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index?

    • I = blockIdx.x*blockDim.x + threadIdx.x + 2
    • I = blockIdx.x*threadIdx.x*2
    • I = (blockIdx.x*blockDim.x + threadIdx.x)*2
    • I = blockIdx.x*threadIdx.x*2 + threadIdx.x

    Correct Answer
    A. I = (blockIdx.x*blockDim.x + threadIdx.x)*2
    Explanation
    blockIdx.x*blockDim.x + threadIdx.x is the standard flattened thread index, which enumerates threads 0, 1, 2, ... across the grid. Because each thread covers two adjacent elements, consecutive threads must start two elements apart, so the flattened index is multiplied by 2. Thread 0 then handles elements 0 and 1, thread 1 handles elements 2 and 3, and so on.
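    A minimal kernel sketch of this mapping (the kernel name and signature are illustrative, not part of the quiz):

    __global__ void vecAddTwoPerThread(const float *a, const float *b,
                                       float *c, int n) {
        // Flattened thread index, scaled by 2 because each thread
        // covers two adjacent elements.
        int I = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
        if (I < n)     c[I]     = a[I]     + b[I];
        if (I + 1 < n) c[I + 1] = a[I + 1] + b[I + 1];
    }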
About This Quiz

The 'Parallel Programming Exam Quiz!' assesses knowledge in parallel computing using CUDA. It covers thread indexing, memory management, and synchronization, targeting skills crucial for optimizing performance on parallel architectures. Essential for learners in computer science focusing on high-performance computing.


Quiz Preview

  • 2. 

    For a vector addition, assume that the vector length is 4000, each thread calculates one output element, and the thread block size is 1024 threads. How many threads will be in the grid?

    • 2000

    • 3000

    • 1024

    • 4096

    Correct Answer
    A. 4096
    Explanation
    In this scenario, the vector length is 4000, each thread computes one output element, and the block size is 1024 threads. The number of blocks is the vector length divided by the block size, rounded up: 4000/1024 ≈ 3.91, so 4 blocks are needed. The grid therefore contains 4 × 1024 = 4096 threads; the last 96 threads are surplus and must be disabled by a bounds check inside the kernel.
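    A self-contained sketch of this calculation (kernel and variable names are illustrative):

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];   // the surplus 96 threads fail this check
    }

    // Host side: ceiling division gives the block count.
    int n = 4000;
    int blockSize = 1024;
    int gridSize = (n + blockSize - 1) / blockSize;   // (4000 + 1023) / 1024 = 4
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); // 4 * 1024 = 4096 threads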


  • 3. 

    If a CUDA device's SM (streaming multiprocessor) can take up to 1536 threads and up to 6 thread blocks, which of the following block configurations would result in the largest number of threads in the SM?

    • 128 threads per block

    • 276 threads per block

    • 512 threads per block

    • 1024 threads per block

    Correct Answer
    A. 512 threads per block
    Explanation
    A CUDA device's SM can host at most 1536 threads and at most 6 thread blocks at a time. For each configuration, the number of resident blocks is limited by whichever cap is hit first:

    With 128 threads per block, the thread limit would allow 12 blocks, but the block limit caps it at 6, giving 6 * 128 = 768 threads.

    With 276 threads per block, only 5 blocks fit under the thread limit (5 * 276 = 1380; a sixth would exceed 1536), giving 1380 threads.

    With 512 threads per block, exactly 3 blocks fit, giving 3 * 512 = 1536 threads and fully utilizing the SM.

    With 1024 threads per block, only 1 block fits (2 * 1024 = 2048 > 1536), giving 1024 threads.

    Therefore, the block configuration that results in the largest number of threads in the SM is 512 threads per block.
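    The arithmetic above can be captured in a small helper (a sketch; the limits are the ones given in the question, not queried from hardware):

    int residentThreads(int blockSize) {
        const int maxThreads = 1536;  // SM thread limit from the question
        const int maxBlocks  = 6;     // SM block limit from the question
        int blocks = maxThreads / blockSize;        // blocks the thread limit allows
        if (blocks > maxBlocks) blocks = maxBlocks; // then apply the block limit
        return blocks * blockSize;
    }
    // residentThreads(128)  == 768
    // residentThreads(276)  == 1380
    // residentThreads(512)  == 1536  <- full occupancy
    // residentThreads(1024) == 1024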


  • 4. 

    At what level is the __syncthreads() function applicable?

    • Thread level

    • Block level

    • Grid level

    • All of the options

    Correct Answer
    A. Block level
    Explanation
    The __syncthreads() function in CUDA is applicable at the block level. It is a barrier: no thread in the block proceeds past the call until every thread in the block has reached it, and shared-memory writes made before the barrier are visible to all threads in the block afterwards. It is commonly used to coordinate shared memory access and avoid race conditions between threads within a block. It does not synchronize threads across different blocks or the entire grid. Therefore, the correct answer is block level.
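    A minimal sketch of a block-level barrier in use, assuming the kernel is launched with 256 threads per block (all names illustrative):

    __global__ void reverseInBlock(float *data) {
        __shared__ float tile[256];                // visible to one block only
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        tile[t] = data[base + t];
        __syncthreads();                           // wait until every thread in
                                                   // the block has written tile
        data[base + t] = tile[blockDim.x - 1 - t];
    }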


  • 5. 

    For a tiled matrix-matrix multiplication kernel, if we use a 64x64 tile, what is the reduction in memory bandwidth usage for the input matrices M and N?

    • 1/8 of the original usage

    • 1/16 of the original usage

    • 1/32 of the original usage

    • 1/64 of the original usage

    Correct Answer
    A. 1/64 of the original usage
    Explanation
    With a 64x64 tile, each element of M and N is loaded from global memory into shared memory once per tile phase and is then reused from shared memory by 64 threads, once for each dot product that crosses the tile. Global memory accesses to each input matrix therefore drop by a factor equal to the tile width, reducing bandwidth usage to 1/64 of what an untiled kernel would need.
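    A sketch of the tile-loading pattern, shown here with a 16x16 tile so that one thread per tile element fits under the usual 1024-threads-per-block limit (the reuse argument is identical; the reduction factor equals the tile width). Matrix dimensions are assumed to be a multiple of TILE:

    #define TILE 16

    __global__ void tiledMatMul(const float *M, const float *N, float *P, int width) {
        __shared__ float Ms[TILE][TILE], Ns[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int ph = 0; ph < width / TILE; ++ph) {
            // Each element is fetched from global memory once per phase...
            Ms[threadIdx.y][threadIdx.x] = M[row * width + ph * TILE + threadIdx.x];
            Ns[threadIdx.y][threadIdx.x] = N[(ph * TILE + threadIdx.y) * width + col];
            __syncthreads();
            // ...and then reused TILE times from shared memory.
            for (int k = 0; k < TILE; ++k)
                acc += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
            __syncthreads();
        }
        P[row * width + col] = acc;
    }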


  • 6. 

    Assume that a kernel is launched with 1000 thread blocks each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?

    • 1

    • 1000

    • 51200

    • 512000

    Correct Answer
    A. 512000
    Explanation
    In this scenario, a kernel is launched with 1000 thread blocks, and each block has 512 threads. If a variable is declared as a local variable in the kernel, a separate version of the variable will be created for each thread. Since there are 1000 blocks and 512 threads per block, the total number of threads is 1000 * 512 = 512000. Therefore, 512000 versions of the variable will be created throughout the execution of the kernel.
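    A sketch illustrating per-thread local variables (names illustrative):

    __global__ void scale(float *data, float factor) {
        // idx is an automatic (local) variable: every thread gets its own
        // private copy, typically held in a register.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] *= factor;
    }
    // Launched as scale<<<1000, 512>>>(d_data, 2.0f), this creates
    // 1000 * 512 = 512000 instances of idx over the kernel's lifetime.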


  • 7. 

    Consider performing a matrix multiplication of two input matrices with dimensions NxN. How many times is each element of the input matrices requested from global memory when tiles of size TxT are used?

    • T/N

    • N/T

    • T*N

    • (N*N)/(T*T)

    Correct Answer
    A. N/T
    Explanation
    Without tiling, each element of an input matrix is fetched from global memory N times: M[i][k], for example, contributes to all N outputs in row i, and each of those N dot products reloads it. With TxT tiling, a loaded element serves T of those dot products from shared memory, so it only needs to be fetched once per tile phase, that is, N/T times. Therefore, the correct answer is N/T.
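    The count as a worked example (a hypothetical helper, with untiled access corresponding to a tile width of 1):

    int loadsPerElement(int n, int tile) {
        return n / tile;   // global fetches of one input element
    }
    // loadsPerElement(1024, 1)  == 1024  (untiled: fetched for every output)
    // loadsPerElement(1024, 32) == 32    (32x32 tiles: a 32x traffic saving)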


  • 8. 

    For shared-memory-based tiled matrix multiplication of input matrices M and N stored in a row-major layout, which input matrix will have coalesced access?

    • M

    • N

    • Both

    • None

    Correct Answer
    A. Both
    Explanation
    Both input matrices M and N have coalesced access in the shared-memory tiled kernel. Coalesced access means that consecutive threads of a warp read consecutive memory locations, which in a row-major layout means adjacent elements of the same row. In the tiled kernel, both tile loads are indexed so that threadIdx.x, the fastest-varying thread coordinate, walks along a row: threads of a warp read adjacent elements of a row of M and adjacent elements of a row of N. Both loads are therefore coalesced.
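    A sketch isolating the two tile loads (a hypothetical loadTiles kernel covering a single phase ph, with the same indexing as the tiled kernel in question 5) to show why both are coalesced:

    __global__ void loadTiles(const float *M, const float *N, int width, int ph) {
        __shared__ float Ms[16][16], Ns[16][16];
        int row = blockIdx.y * 16 + threadIdx.y;
        int col = blockIdx.x * 16 + threadIdx.x;  // threadIdx.x varies fastest
        // Threads of a warp differ only in threadIdx.x, so both loads below
        // touch consecutive row-major addresses: coalesced for M and for N.
        Ms[threadIdx.y][threadIdx.x] = M[row * width + ph * 16 + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(ph * 16 + threadIdx.y) * width + col];
    }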


  • 9. 

    Which of the following are qualifier keywords in function declarations in CUDA?

    • __graphic__

    • __global__

    • __Kernel__

    • All of the options

    Correct Answer
    A. __global__
    Explanation
    Of the options listed, only "__global__" is a real CUDA qualifier keyword. It marks a kernel function: one that is launched from host code and executed on the GPU. CUDA's other function qualifiers are __device__ and __host__; "__graphic__" and "__Kernel__" do not exist.
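    A sketch showing the three real qualifiers side by side (all function names illustrative):

    __device__ float square(float x) { return x * x; }            // GPU-only helper
    __host__ __device__ float twice(float x) { return 2.0f * x; } // compiled for both

    __global__ void transform(float *out, const float *in) {      // kernel: called
        int i = blockIdx.x * blockDim.x + threadIdx.x;             // from the host,
        out[i] = twice(square(in[i]));                             // runs on the GPU
    }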


  • 10. 

    How many configuration parameters are required in a CUDA kernel function call?

    • 2

    • 1

    • 3

    • 5

    Correct Answer
    A. 2
    Explanation
    A CUDA kernel launch takes an execution configuration between <<< and >>>, of which two parameters are required: the grid dimensions (number of blocks) and the block dimensions (threads per block). Two further parameters, the dynamic shared-memory size in bytes and the stream, are optional and default to 0. Therefore, the correct answer is 2.
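    A sketch of both launch forms, reusing the illustrative vecAdd kernel and device pointers from question 2 (stream is an assumed cudaStream_t created elsewhere):

    // Minimal call: the two required parameters, grid size and block size.
    vecAdd<<<4, 1024>>>(d_a, d_b, d_c, n);

    // Full form: dynamic shared-memory bytes and a stream are optional.
    vecAdd<<<4, 1024, 0, stream>>>(d_a, d_b, d_c, n);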

