1.
We want to use each thread to calculate two (adjacent) elements of a vector addition. Assume that variable I is the index of the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index?
Correct Answer
C. I = (blockIdx.x*blockDim.x + threadIdx.x)*2
Explanation
The expression for mapping the thread/block indices to the data index is I = (blockIdx.x*blockDim.x + threadIdx.x)*2. Here blockIdx.x*blockDim.x + threadIdx.x is the global thread index: blockIdx.x is the index of the current block, blockDim.x is the number of threads per block, and threadIdx.x is the index of the thread within its block. Because each thread handles two adjacent elements, multiplying the global thread index by 2 gives the index of the first element of the pair; the thread then also processes element I + 1.
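A minimal kernel sketch of this mapping (the name vecAddPairs and the launch shape are illustrative, not part of the question):

    #include <cuda_runtime.h>

    // Each thread adds the pair of adjacent elements at I and I + 1.
    __global__ void vecAddPairs(const float *A, const float *B, float *C, int n) {
        int I = (blockIdx.x * blockDim.x + threadIdx.x) * 2;  // first element of the pair
        if (I < n)     C[I]     = A[I]     + B[I];
        if (I + 1 < n) C[I + 1] = A[I + 1] + B[I + 1];
    }

    // Launch with enough threads for ceil(n/2) pairs, e.g.:
    // vecAddPairs<<<((n + 1) / 2 + 255) / 256, 256>>>(d_A, d_B, d_C, n);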
2.
For a vector addition, assume that the vector length is 4000, each thread calculates one output element, and the thread block size is 1024 threads. How many threads will be in the grid?
Correct Answer
D. 4096
Explanation
In this scenario, the vector length is 4000, each thread calculates one output element, and the thread block size is 1024 threads. The number of blocks is the ceiling of 4000/1024: since 4000/1024 ≈ 3.906 and a grid cannot contain a fraction of a block, we round up to 4 blocks. Four blocks of 1024 threads give 4 * 1024 = 4096 threads in the grid; the extra 96 threads are disabled by the kernel's boundary check.
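The same ceiling division in host code (a sketch; the kernel and variable names are assumptions for illustration):

    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *A, const float *B, float *C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C[i] = A[i] + B[i];  // guard: the 96 surplus threads do nothing
    }

    void launchVecAdd(const float *d_A, const float *d_B, float *d_C) {
        int n = 4000, blockSize = 1024;
        int gridSize = (n + blockSize - 1) / blockSize;      // ceiling division: 4 blocks
        vecAdd<<<gridSize, blockSize>>>(d_A, d_B, d_C, n);   // 4 * 1024 = 4096 threads
    }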
3.
A CUDA device's SM (streaming multiprocessor) can take up to 1536 threads and up to 6 thread blocks. Which of the following block configurations would result in the largest number of threads in the SM?
Correct Answer
C. 512 threads per block
Explanation
A CUDA device's SM can hold at most 1536 threads and at most 6 thread blocks, so the number of resident blocks for a given block size is limited by whichever cap is reached first.
With 128 threads per block, the 6-block limit binds: 6 * 128 = 768 threads.
With 276 threads per block, only 5 blocks fit under the thread cap (6 * 276 = 1656 > 1536): 5 * 276 = 1380 threads.
With 512 threads per block, 3 blocks fit and exactly fill the thread cap: 3 * 512 = 1536 threads.
With 1024 threads per block, only 1 block fits (2 * 1024 = 2048 > 1536): 1024 threads.
Therefore, 512 threads per block yields the largest number of resident threads, 1536, the full capacity of the SM.
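The same reasoning as a small host-side calculation (illustrative only; it just replays the arithmetic above):

    #include <stdio.h>

    int main(void) {
        int maxThreads = 1536, maxBlocks = 6;
        int configs[] = {128, 276, 512, 1024};
        for (int i = 0; i < 4; i++) {
            int bs = configs[i];
            int blocks = maxThreads / bs;                 // blocks allowed by the thread cap
            if (blocks > maxBlocks) blocks = maxBlocks;   // the block cap binds for small blocks
            printf("%4d threads/block -> %d blocks, %4d resident threads\n",
                   bs, blocks, blocks * bs);
        }
        return 0;  // 512 threads/block wins: 3 * 512 = 1536
    }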
4.
The __syncthreads() function is applicable at which level?
Correct Answer
B. Block level
Explanation
The __syncthreads() function in CUDA is applicable at the block level. It acts as a barrier for the threads within a block: no thread proceeds past the call until every thread in the block has reached it. It is commonly used to coordinate shared memory access and avoid race conditions between threads in the same block. It does not synchronize threads across different blocks or the entire grid, so the correct answer is block level.
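A minimal sketch of the usual pattern (the block-wise reversal example is an assumption for illustration; it expects the input length to be a multiple of BLOCK):

    #include <cuda_runtime.h>

    #define BLOCK 256

    // Reverse each BLOCK-sized chunk of d_in through shared memory.
    __global__ void reverseChunk(const int *d_in, int *d_out) {
        __shared__ int tile[BLOCK];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = d_in[i];
        __syncthreads();  // barrier: every thread in THIS block has written its element
        d_out[i] = tile[blockDim.x - 1 - threadIdx.x];
        // There is no equivalent barrier spanning different blocks of the grid;
        // __syncthreads() is strictly block-scoped.
    }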
5.
For a tiled matrix-matrix multiplication kernel, if we use a 64x64 tile, what is the reduction of memory bandwidth usage for input matrices M and N?
Correct Answer
D. 1/64 of the original usage
Explanation
Using a 64x64 tile reduces the global-memory bandwidth usage for the input matrices M and N to 1/64 of the original. Each tile is loaded into shared memory once, and every loaded element is then reused by 64 threads of the block, once for each of the 64 partial products it contributes to within the tile. In general, a TxT tile cuts the number of global memory accesses per input element by a factor of T, so a 64x64 tile gives the stated 1/64 reduction.
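A minimal tiled kernel sketch in the spirit of the question, using TILE = 16 (a 64x64 tile with one thread per element would need 4096 threads, above the 1024-threads-per-block limit, so real 64x64 tiles require thread coarsening; the 1/T argument is unchanged). Width is assumed to be a multiple of TILE:

    #include <cuda_runtime.h>

    #define TILE 16

    __global__ void tiledMatMul(const float *M, const float *N, float *P, int Width) {
        __shared__ float Ms[TILE][TILE];
        __shared__ float Ns[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int ph = 0; ph < Width / TILE; ++ph) {
            // Each element fetched here is reused by TILE threads below,
            // cutting global-memory traffic to 1/TILE of the untiled kernel.
            Ms[threadIdx.y][threadIdx.x] = M[row * Width + ph * TILE + threadIdx.x];
            Ns[threadIdx.y][threadIdx.x] = N[(ph * TILE + threadIdx.y) * Width + col];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
            __syncthreads();
        }
        P[row * Width + col] = acc;
    }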
6.
Assume that a kernel is launched with 1000 thread blocks each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?
Correct Answer
D. 512000
Explanation
In this scenario, a kernel is launched with 1000 thread blocks, and each block has 512 threads. If a variable is declared as a local variable in the kernel, a separate version of the variable will be created for each thread. Since there are 1000 blocks and 512 threads per block, the total number of threads is 1000 * 512 = 512000. Therefore, 512000 versions of the variable will be created throughout the execution of the kernel.
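A sketch: acc below is an automatic (local) variable, so each thread gets its own private copy, normally held in a register (the kernel itself is an assumption for illustration):

    #include <cuda_runtime.h>

    __global__ void scale(const float *in, float *out, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;  // one private copy per thread
        if (i < n) {
            acc = in[i] * s;
            out[i] = acc;
        }
    }

    // Launched as scale<<<1000, 512>>>(d_in, d_out, 2.0f, n), the grid has
    // 1000 * 512 = 512000 threads, hence 512000 versions of acc.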
7.
Consider performing a matrix multiplication of two input matrices with dimensions NxN. How many times is each element of the input matrices requested from global memory when tiles of size TxT are used?
Correct Answer
B. N/T
Explanation
When tiles of size TxT are used, each element of the input matrices is requested from global memory N/T times. In the untiled kernel, an element of M is read N times, once by each of the N threads that compute its row of the output. With tiling, each of the N/T blocks that needs the element loads it into shared memory once, and T threads then reuse it from there, so the number of global-memory requests per element drops from N to N/T. Therefore, the correct answer is N/T.
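A quick check of the count for assumed sizes (N = 1024, T = 16, chosen only for illustration), matching the tiled kernel sketched under question 5:

    #include <stdio.h>

    int main(void) {
        int N = 1024, T = 16;   /* assumed sizes */
        int naive = N;          /* untiled: each element is read N times    */
        int tiled = N / T;      /* tiled: read once by each of N/T blocks   */
        printf("per element: %d reads untiled, %d reads tiled (a factor-%d saving)\n",
               naive, tiled, naive / tiled);
        return 0;
    }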
8.
For shared-memory-based tiled matrix multiplication (matrices M and N) with a row-major layout, which input matrix will have coalesced access?
Correct Answer
C. Both
Explanation
Both input matrices M and N have coalesced access in the shared-memory tiled kernel. Coalesced access means that threads with consecutive indices touch consecutive memory locations, which the hardware combines into one transaction. In the tile-loading code, consecutive threadIdx.x values read consecutive elements within one row of M and, because the output column index varies with threadIdx.x, consecutive elements within one row of N as well. Since both matrices use a row-major layout, consecutive elements of a row are adjacent in memory, so both loads are coalesced.
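The two tile-load lines from the sketch under question 5, isolated here with the addressing spelled out (illustrative; the compute phase is omitted):

    #include <cuda_runtime.h>

    #define TILE 16

    __global__ void loadTiles(const float *M, const float *N, int Width, int ph) {
        __shared__ float Ms[TILE][TILE];
        __shared__ float Ns[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        // M: for fixed threadIdx.y, consecutive threadIdx.x values read the
        // consecutive addresses row*Width + ph*TILE + 0, 1, 2, ... -> coalesced.
        Ms[threadIdx.y][threadIdx.x] = M[row * Width + ph * TILE + threadIdx.x];
        // N: col varies with threadIdx.x, so consecutive threads again read
        // consecutive addresses within one row of N -> coalesced.
        Ns[threadIdx.y][threadIdx.x] = N[(ph * TILE + threadIdx.y) * Width + col];
        __syncthreads();
    }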
9.
What are the qualifier keywords in function declarations in CUDA?
Correct Answer
B. __global__
Explanation
The qualifier keyword asked for here is "__global__". It declares a CUDA kernel function: a function that is called from host (CPU) code and executed on the device (GPU). CUDA also defines "__device__" and "__host__" as function qualifiers, but "__global__" is the one that marks a kernel, making it the essential qualifier for CUDA programming and the listed answer.
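A sketch of __global__ in context (the kernel name and sizes are illustrative):

    #include <cuda_runtime.h>

    // __global__ declares a kernel: called from host code, executed on the device.
    __global__ void addOne(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    int main(void) {
        float *d_x;
        int n = 1 << 10;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMemset(d_x, 0, n * sizeof(float));
        addOne<<<(n + 255) / 256, 256>>>(d_x, n);  // host-side call of a __global__ function
        cudaDeviceSynchronize();
        cudaFree(d_x);
        return 0;
    }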
10.
How many configuration parameters are required in a CUDA kernel function call?
Correct Answer
A. 2
Explanation
A CUDA kernel launch takes its configuration parameters inside the <<<...>>> execution configuration. Two of them are required: the grid dimensions (the number of blocks) and the block dimensions (the number of threads per block). A dynamic shared-memory size and a stream can be supplied as optional third and fourth parameters, but only two are required, so the correct answer is 2.
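A sketch of the execution configuration (the kernel is the illustrative vecAdd from question 2):

    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *A, const float *B, float *C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C[i] = A[i] + B[i];
    }

    void launch(const float *d_A, const float *d_B, float *d_C, int n) {
        dim3 grid((n + 1023) / 1024);  // required parameter 1: grid dimensions
        dim3 block(1024);              // required parameter 2: block dimensions
        vecAdd<<<grid, block>>>(d_A, d_B, d_C, n);
        // Full form: <<<grid, block, sharedBytes, stream>>>, where the dynamic
        // shared-memory size and the stream are optional third and fourth parameters.
    }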