Certified Developer Quiz: How Much Do You Know About Apache Hadoop?

Reviewed by Editorial Team
By Ssashi, Community Contributor | Attempts: 519 | Questions: 10
1. How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?

Explanation

Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper over HTTP across the network.
2. Sort
The framework merge-sorts Reducer inputs by key (since different Mappers may have output the same key).
The shuffle and sort phases occur simultaneously; while outputs are being fetched, they are merged.
Secondary Sort
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
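The behavior described above can be sketched in plain Python. This is a toy simulation of the sort-and-shuffle contract, not Hadoop code: keys reach reduce() exactly once, in sorted order, each with an iterable of all its values (whose order is not guaranteed by the framework).

```python
from collections import defaultdict

def shuffle_and_sort(map_outputs):
    """Simulate sort-and-shuffle: merge (key, value) pairs from all
    mappers, then present each key once, in sorted key order, with the
    list of all values emitted for that key."""
    groups = defaultdict(list)
    for mapper in map_outputs:
        for key, value in mapper:
            groups[key].append(value)
    # Keys are delivered to reduce() in sorted order; value order
    # within a key is an artifact of merge order, not a guarantee.
    return [(key, groups[key]) for key in sorted(groups)]

# Two mappers emitted overlapping keys, in no particular order.
mapper1 = [("b", 2), ("a", 1)]
mapper2 = [("a", 3), ("c", 4)]
print(shuffle_and_sort([mapper1, mapper2]))
# → [('a', [1, 3]), ('b', [2]), ('c', [4])]
```

Note how the duplicate key "a" from two different mappers ends up in a single reduce call, which is exactly why the merge sort by key is needed.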

About This Quiz

Are you a developer? Why not take this super fun and informative Certified Developer quiz? All the questions are designed to test your knowledge and make you learn new things about the concept. Please make sure to read all the questions carefully before answering. You can also use this quiz to refresh your knowledge before an upcoming exam. The quiz has no time bar, so feel free to take it as many times as you like. Keep learning and have fun!

2. In the execution of a MapReduce job, where does the Mapper place the intermediate data of each Map task?

Explanation

The Mapper stores the intermediate data on the underlying filesystem of the local disk of the machine which ran the Map task. This is because the intermediate data is generated by the Mapper and needs to be stored temporarily before being transferred to the Reducers. Storing the data on the local disk allows for efficient access and retrieval when it is needed for the Reducers to process and combine the data.
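As a rough illustration of that local spill, here is a Python sketch (not Hadoop code) of a Mapper partitioning its intermediate pairs by hash of key modulo the reducer count and writing each partition to a file on the local filesystem. The directory name is a stand-in for Hadoop's configured local directories, not an actual Hadoop path.

```python
import json
import os
import tempfile

def spill_map_output(pairs, num_reducers):
    """Partition intermediate (key, value) pairs the way a Mapper does
    (hash of key modulo reducer count) and spill each partition to a
    file on the local disk -- note: NOT to HDFS."""
    # Stand-in for the node-local scratch space Hadoop configures.
    spill_dir = tempfile.mkdtemp(prefix="map-spill-")
    partitions = {p: [] for p in range(num_reducers)}
    for key, value in pairs:
        partitions[hash(key) % num_reducers].append([key, value])
    paths = []
    for p, records in partitions.items():
        path = os.path.join(spill_dir, "part-%d" % p)
        with open(path, "w") as f:
            json.dump(sorted(records), f)  # spills are sorted by key
        paths.append(path)
    return paths
```

Each Reducer later fetches exactly one partition file from every Mapper, which is the shuffle transfer the next questions describe.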

3. Assuming default settings, which best describes the order of data provided to a reducer's reduce method:

Explanation

Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper over HTTP across the network.
2. Sort
The framework merge-sorts Reducer inputs by key (since different Mappers may have output the same key).
The shuffle and sort phases occur simultaneously; while outputs are being fetched, they are merged.
Secondary Sort
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
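The secondary-sort trick mentioned above can be simulated in a few lines of Python (again, a sketch of the idea, not the Hadoop API): extend the key with the value to form a composite key, sort on the full composite, then group on the natural key only, which is the role the grouping comparator plays.

```python
from itertools import groupby

def secondary_sort(pairs):
    """Simulate a secondary sort: sort on the composite (key, value),
    then group on the natural key alone -- the job of the grouping
    comparator.  Each reduce() call then sees its values sorted."""
    composite = sorted((key, value) for key, value in pairs)
    return [(key, [v for _, v in group])
            for key, group in groupby(composite, key=lambda kv: kv[0])]

print(secondary_sort([("a", 3), ("b", 2), ("a", 1)]))
# → [('a', [1, 3]), ('b', [2])]
```

Without this trick, the values [3, 1] for key "a" could arrive in any order; with it, they arrive sorted.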

4. Task scheduling is handled by

Explanation

The correct answer is the JobTracker. In Hadoop (MRv1), task scheduling is handled by the JobTracker, which assigns tasks to the available TaskTrackers in the cluster. It tracks the overall progress of the job and coordinates the execution of Map and Reduce tasks. The JobTracker also handles task failures and reschedules tasks when necessary.

5. When is the earliest point at which the reduce method of a given Reducer can be called?

Explanation

In a MapReduce job, Reducers do not start executing the reduce method until all Map tasks have completed. Reducers start copying intermediate key-value pairs from the Mappers as soon as they are available, but the programmer-defined reduce method is called only after all the Mappers have finished.
Note: The reduce phase has 3 steps: shuffle, sort, and reduce. Shuffle is where the data is collected by the Reducer from each Mapper. This can happen while Mappers are still generating data, since it is only a data transfer.

6. Data locality is considered when scheduling

Explanation

The JobTracker and TaskTrackers are daemons, not tasks, so they are not scheduled. Data locality cannot be considered for Reduce tasks, because they depend on the output of the Map tasks; it is therefore only taken into account when scheduling Map tasks.

7. Input splits are created by

Explanation

The correct answer is the Driver program, because it is responsible for dividing the input data into smaller chunks called input splits (via the job's InputFormat and its getSplits() method). These input splits are then assigned to the Map tasks for processing. The Driver program determines the number and size of the input splits based on the input data size and the configuration settings.

8. You've built a MapReduce job that denormalizes a very large table, resulting in an extremely large amount of output data. Which two cluster resources will your job stress? (Choose two).

Explanation

The MapReduce job denormalizes a large table, which means it combines data from multiple tables into one. This process requires a lot of data transfer over the network, as well as reading and writing data to the disk. Therefore, the job will stress both the Network I/O and Disk I/O resources of the cluster.

9. Which describes how a client reads a file from HDFS?

Explanation

Client communication with HDFS happens through the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives. Once the NameNode has provided the location of the data, client applications read the data directly from the DataNodes.
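The metadata-then-data flow can be illustrated with a toy Python model (all class names, paths, and block IDs here are made up for illustration, not real HDFS identifiers): the client asks the NameNode only *where* the blocks are, then fetches the bytes directly from the DataNodes.

```python
class NameNode:
    """Toy NameNode: holds only metadata -- for each file, the list of
    (block_id, datanode_name) pairs.  It never serves file contents."""
    def __init__(self, block_map):
        self.block_map = block_map

    def get_block_locations(self, filename):
        return self.block_map[filename]

class DataNode:
    """Toy DataNode: holds actual block contents, keyed by block id."""
    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, block_id):
        return self.blocks[block_id]

def client_read(namenode, datanodes, filename):
    """1. Ask the NameNode for block locations (metadata only).
       2. Read each block directly from a DataNode that stores it."""
    data = b""
    for block_id, dn_name in namenode.get_block_locations(filename):
        data += datanodes[dn_name].read_block(block_id)
    return data

nn = NameNode({"/logs/a.txt": [("blk_1", "dn1"), ("blk_2", "dn2")]})
dns = {"dn1": DataNode({"blk_1": b"hello "}),
       "dn2": DataNode({"blk_2": b"world"})}
print(client_read(nn, dns, "/logs/a.txt"))
# → b'hello world'
```

The key design point the question is testing: the NameNode is never on the data path, which keeps it from becoming a throughput bottleneck.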

10. You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?

Explanation

A combiner must implement the Reducer interface (here, Reducer<Text, IntWritable, Text, IntWritable>): a combiner is a local "mini-reduce" that runs on each Mapper's output before it is sent over the network, so its input and output key/value types must match the Map output types.
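The shape of a combiner can be sketched in Python (a simulation of the contract, not the Hadoop interface): it takes a key and an iterable of values and emits pairs of the same types it consumed, so the framework may apply it zero or more times without changing the result.

```python
def combiner(key, values):
    """Same shape as a reducer: takes a key and an iterable of values,
    emits (key, value) pairs of the SAME types as its input -- here a
    string key and integer counts, mirroring Text/IntWritable."""
    yield key, sum(values)

# Running the combiner on one mapper's local output shrinks the data
# sent over the network without changing the final answer.
local = [("hadoop", [1, 1, 1]), ("hdfs", [1])]
combined = [pair for key, values in local for pair in combiner(key, values)]
print(combined)
# → [('hadoop', 3), ('hdfs', 1)]
```

Because summation is associative and commutative, applying this combiner locally and then again at the Reducer yields the same totals as reducing the raw pairs directly.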


Quiz Review Timeline (Updated): Aug 21, 2023


  • Current Version
  • Aug 21, 2023
    Quiz Edited by
    ProProfs Editorial Team
  • Feb 25, 2014
    Quiz Created by
    Ssashi