Parallel Processing Quiz 3

Approved & Edited by ProProfs Editorial Team
By Rahul08
Questions: 10 | Attempts: 566



Questions and Answers
  • 1. 

    For a tiled 1D convolution, if the output tile width is 250 elements and the mask width is 7 elements, what is the input tile width loaded into shared memory?

    • A.

      250

    • B.

      254

    • C.

      256

    • D.

      7

    Correct Answer
    C. 256
    Explanation
    In a tiled 1D convolution, each block loads an input tile that covers its output tile plus the halo elements needed at both ends. With a mask width of 7, every output element needs (7-1)/2 = 3 neighbours on each side, so the input tile width is 250 + 7 - 1 = 256 elements.


  • 2. 

    For the work-inefficient scan kernel based on reduction trees, assume that we have 1024 elements. Which of the following gives the closest approximation of the number of add operations performed?

    • A.

      (1024-1) *2

    • B.

      (512-1) *2

    • C.

      1024*1024

    • D.

      1024*10

    Correct Answer
    D. 1024*10
    Explanation
    The work-inefficient scan kernel performs roughly n*log2(n) additions: it runs log2(1024) = 10 stride-doubling steps, and in each step up to 1024 elements each perform one addition. The exact count is the sum of (1024 - 2^d) for d = 0..9, which is 9217, and among the options 1024*10 = 10240 is the closest approximation. By contrast, (1024-1)*2 is roughly the operation count of the work-efficient scan.


  • 3. 

    Barrier synchronizations should be used whenever we want to ensure all threads have completed a common phase of their execution_____________

    • A.

      Before any of them start the next phase

    • B.

      After any of them start the next phase

    • C.

      Before any of them start the previous phase

    • D.

      After any of them start the previous phase

    Correct Answer
    A. Before any of them start the next phase
    Explanation
    Barrier synchronizations should be used whenever we want to ensure all threads have completed a common phase of their execution before any of them start the next phase. This means that the barrier synchronization will block the threads until all of them have reached the barrier, ensuring that they all finish the current phase before moving on to the next one. This helps in coordinating the execution of multiple threads and ensures that they all reach a specific point before proceeding further.


  • 4. 

    Each time a DRAM location is accessed, __________

    • A.

      Many consecutive locations that include the requested location are actually accessed

    • B.

      Only the requested location is actually accessed

    • C.

      All the locations that include the requested location are actually accessed

    • D.

      Many consecutive locations excluding the requested location are actually accessed

    Correct Answer
    A. Many consecutive locations that include the requested location are actually accessed
    Explanation
    DRAM is accessed in bursts. When one location is requested, the chip senses an entire row into its row buffer and delivers a burst of consecutive locations that includes the requested one. It is these many consecutive locations, not every location in the device, that are actually accessed; this is why coalesced access patterns matter so much for memory bandwidth.


  • 5. 

    Consider performing a 1D convolution on array N = {4,1,3,2,3} with mask M = {2,1,4}, treating out-of-range (ghost) elements as zero. What is the resulting output array?

    • A.

      {8,21,13,21,8}

    • B.

      {8,21,13,20,7}

    • C.

      {9,21,14,20,7}

    • D.

      {9,21,14,21,7}

    Correct Answer
    B. {8,21,13,20,7}
  • 6. 

    The correct syntax to copy data to constant memory is:

    • A.

      cudaMemcpyToSymbol(dest, src, size)

    • B.

      cudaMemcpy(dest, src, size)

    • C.

      cudaMemcpyToSymbol(src, dest, size)

    • D.

      cudaMemcpySymbol(dest, src, size)

    Correct Answer
    A. cudaMemcpyToSymbol(dest, src, size)
    Explanation
    Constant memory itself is declared with the __constant__ qualifier; cudaMemcpyToSymbol(dest, src, size) is the runtime call that copies data from host memory into that constant-memory symbol on the device. The "dest" parameter specifies the destination symbol in constant memory, "src" specifies the source data in host memory, and "size" specifies the number of bytes to copy.


  • 7. 

    Consider performing a 1D convolution on an array of size n with a mask of size m. How many halo cells are there in total?

    • A.

      m+n-1

    • B.

      m-1

    • C.

      n-1

    • D.

      m+n

    Correct Answer
    B. m-1
    Explanation
    When performing a 1D convolution on an array of size n with a mask of size m, the halo (ghost) cells are the zero-valued cells assumed beyond the edges of the array so that the convolution is defined at the boundaries. The mask extends (m-1)/2 cells beyond each end of the array, giving m-1 halo cells in total.


  • 8. 

    How many multiplications are performed if halo cells are treated as multiplications (by 0) for an array of size n and mask of size m in case of 1-D convolution?

    • A.

      m*n+1

    • B.

      m*n-1

    • C.

      m*n

    • D.

      n*n

    Correct Answer
    C. m*n
    Explanation
    In 1-D convolution, each output element is computed by multiplying each of the m mask elements with the corresponding input element and summing the products. If out-of-range halo (ghost) cells are still counted as multiplications by 0, every one of the n output elements costs exactly m multiplications, giving m*n in total.


  • 9. 

    Which memory is referred to as “scratchpad memory”?

    • A.

      Constant memory

    • B.

      Global memory

    • C.

      Shared memory

    • D.

      Registers

    Correct Answer
    C. Shared memory
    Explanation
    Shared memory refers to a type of memory that is shared among multiple threads within a block in a GPU. It is referred to as "scratchpad memory" because it can be used as a temporary storage space for threads to quickly exchange data and communicate with each other. This type of memory is faster to access compared to global memory, making it ideal for frequently accessed data that needs to be shared among threads. Constant memory, global memory, and registers are different types of memory in a GPU, but they do not specifically serve the purpose of scratchpad memory.


  • 10. 

    Does the CUDA memory architecture have caches and cache levels?

    • A.

      True

    • B.

      False

    Correct Answer
    A. True
    Explanation
    CUDA memory architecture does have cache and cache levels. The GPU's memory hierarchy includes multiple levels of cache, such as L1 and L2 caches, which are used to store frequently accessed data and improve memory access latency. These caches help to reduce the time it takes to fetch data from the main memory, thereby improving overall performance.


Quiz Review Timeline


  • Current Version
  • Mar 21, 2023
    Quiz Edited by
    ProProfs Editorial Team
  • Mar 07, 2017
    Quiz Created by
    Rahul08