1.
Which statements are false regarding MapReduce?
Correct Answer(s)
A. Is the core component for data ingestion in Hadoop framework.
B. Is the parent project of Apache Hadoop.
C. Helps to combine the input data set into a number of parts and run a program on all data parts parallel at once.
Explanation
MapReduce is not the core component for data ingestion in the Hadoop framework; storing and serving the ingested data is the job of HDFS (Hadoop Distributed File System). Nor is MapReduce the parent project of Apache Hadoop; it is a sub-component of Hadoop. MapReduce itself is a programming model and processing framework for parallel processing of large datasets: the framework splits (rather than combines) the input data set into a number of parts and runs the program on all parts in parallel. The term MapReduce refers to two separate and distinct tasks, the map task and the reduce task; the map tasks run in parallel over the input splits, and their intermediate output is then processed by the reduce tasks.
2.
Takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs)
Correct Answer
A. Mapper
Explanation
A mapper is a component in the MapReduce framework that takes a set of data and converts it into another set of data. It breaks down individual elements into tuples, which are key/value pairs. The mapper processes each input record independently and generates intermediate key/value pairs as output. These intermediate key/value pairs are then passed to the reducer for further processing.
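For illustration, here is a minimal WordCount-style mapper sketch using the newer org.apache.hadoop.mapreduce API (the class and field names are our own, not part of the question): it tokenizes each input line and emits one (word, 1) tuple per token through the context object.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch only: converts each input line into (word, 1) key/value pairs.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit one intermediate pair per word
            }
        }
    }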
3.
Combines Key-value pairs based on the key and accordingly modifies the value of the key.
Correct Answer
B. Reducer
Explanation
In the MapReduce programming model, the Reducer is responsible for combining the key-value pairs generated by the Mapper and modifying the values according to the key. It takes the Mapper's output as input, groups the values associated with the same key, and applies the required computation to those values to produce the final result. The Reducer therefore plays the crucial role of aggregating and summarizing the data generated by the Mapper.
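A matching minimal reducer sketch (again with a class name of our own, and assuming a mapper like the one sketched above that emits (word, 1) pairs): it sums all values for a key and writes the combined value back out.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch only: combines all values of one key into a single total.
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();                     // aggregate the values of this key
            }
            context.write(key, new IntWritable(sum));   // emit (key, combined value)
        }
    }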
4.
The reducer receives the key-value pair from _________ map job(s)
Correct Answer
B. Multiple
Explanation
The reducer receives the key-value pair from multiple map jobs. In the MapReduce framework, the input data is divided into multiple chunks and processed in parallel by multiple map tasks. Each map task processes a portion of the input data and generates intermediate key-value pairs. These intermediate key-value pairs are then grouped by their keys and sent to the reducer. The reducer receives the key-value pairs from all the map tasks and performs the final aggregation and computation on the data. Therefore, the reducer receives input from multiple map jobs.
5.
The splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line (‘\n’).
Correct Answer
A. True
Explanation
The statement is true because the splitting parameter in a program can be any character or sequence of characters that is used to divide a string into separate parts. This can include common delimiters like space, comma, semicolon, or even a new line character. The choice of splitting parameter depends on the specific requirements of the program and the structure of the input data.
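As a small plain-Java illustration (the strings are invented for the example), the same splitting idea works with different parameters:

    // Illustrative only: the same split operation with different parameters.
    public class SplitDemo {
        public static void main(String[] args) {
            String[] byComma     = "one,two,three".split(",");     // split by comma
            String[] bySemicolon = "a;b;c".split(";");             // split by semicolon
            String[] bySpace     = "to be or not".split(" ");      // split by space
            String[] byNewline   = "line1\nline2".split("\n");     // split by new line ('\n')
            System.out.println(byComma.length + " " + bySemicolon.length + " "
                    + bySpace.length + " " + byNewline.length);    // prints: 3 3 4 2
        }
    }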
6.
This stage is the combination of the Shuffle stage and itself.
Correct Answer
B. Reducer
Explanation
The given correct answer is "Reducer". In the MapReduce framework, the Reducer stage is responsible for combining the intermediate key-value pairs generated by the Mapper stage. It takes the output from the Shuffle stage, where the data is sorted and grouped by keys, and performs the required operations to produce the final output. Therefore, the Reducer stage can be seen as a combination of the Shuffle stage and itself, as it takes the sorted and grouped data and further processes it to obtain the desired result.
7.
__________is used for reading files in sequence. It is a specific compressed binary file format that is optimized for passing data between the output of one MapReduce job to the input of some other MapReduce job.
Correct Answer
A. SequenceFileInputFormat
Explanation
SequenceFileInputFormat is used for reading files in sequence. Sequence files are a compressed binary container of key-value pairs, optimized for passing data from the output of one MapReduce job to the input of another: the downstream job can read the keys and values directly instead of re-parsing text.
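A hedged sketch of how two jobs might be chained through sequence files with the newer API (the method name and the Text/IntWritable key and value types are assumptions for the example):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class SequenceFileChainSketch {
        // Sketch only: the first job writes a SequenceFile, the second reads it back.
        static void configure(Job firstJob, Job secondJob) {
            firstJob.setOutputFormatClass(SequenceFileOutputFormat.class); // binary key/value output
            firstJob.setOutputKeyClass(Text.class);
            firstJob.setOutputValueClass(IntWritable.class);

            secondJob.setInputFormatClass(SequenceFileInputFormat.class);  // reads keys/values directly,
                                                                           // no text re-parsing needed
        }
    }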
8.
Sets the mapper class and all the stuff related to map jobs such as reading data and generating a key-value pair out of the mapper.
Correct Answer
B. conf.setMapperClass
Explanation
The correct answer is "Conf.setMapperclass". This answer suggests that the Conf object is used to set the mapper class for a map job. In Hadoop, the Conf object is used to configure various aspects of a job, including setting the mapper class. By calling the setMapperclass method on the Conf object, the user can specify the mapper class to be used for a particular map job. This allows the user to customize the behavior of the map job by providing their own implementation of the mapper class.
9.
Loads the data from its source and converts it into a key, value pairs suitable for reading by the Mapper.
Correct Answer
C. RecordReader
Explanation
The RecordReader is responsible for loading data from its source and converting it into key-value pairs that can be processed by the Mapper. Each InputFormat (for example TextInputFormat or SequenceFileInputFormat) supplies its own RecordReader: the InputFormat splits the input into InputSplits, and the RecordReader iterates over one split, producing one key-value pair per record, which the framework then passes to the Mapper's map() method. A conceptual sketch of this hand-off is shown below.
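Purely to illustrate the hand-off (the real loop lives inside Hadoop's MapTask, so this is a conceptual sketch, not framework code; the method and class names are our own):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class RecordReaderSketch {
        // Conceptual sketch: how a RecordReader turns one input split into
        // the (key, value) pairs that feed a mapper.
        static void readSplit(InputFormat<LongWritable, Text> format,
                              InputSplit split,
                              TaskAttemptContext context) throws IOException, InterruptedException {
            RecordReader<LongWritable, Text> reader = format.createRecordReader(split, context);
            reader.initialize(split, context);
            while (reader.nextKeyValue()) {
                LongWritable key = reader.getCurrentKey();  // e.g. byte offset for TextInputFormat
                Text value = reader.getCurrentValue();      // e.g. one line of the file
                // the framework would now call the Mapper's map(key, value, context)
            }
            reader.close();
        }
    }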
10.
Which interface needs to be implemented to create Mapper and Reducer for the Hadoop?
Correct Answer
A. org.apache.hadoop.mapreduce.Mapper
Explanation
To create a Mapper and Reducer for Hadoop, the type to use is org.apache.hadoop.mapreduce.Mapper (and, for the reduce side, org.apache.hadoop.mapreduce.Reducer). In the newer mapreduce API these are base classes whose map() and reduce() methods are overridden, while the older org.apache.hadoop.mapred API defines Mapper and Reducer as interfaces to be implemented. The map function takes input key-value pairs and produces intermediate key-value pairs; the reduce function takes those intermediate pairs and produces the final output. Since the question asks about creating the Mapper, the correct answer is org.apache.hadoop.mapreduce.Mapper.
11.
What are the main configuration parameters that the user needs to specify to run a MapReduce job?
Correct Answer(s)
A. Job’s input and output locations in the distributed file system
C. Input and output format
E. Class containing the map and reduce function
G. JAR file containing the mapper, reducer and driver classes
Explanation
To run a MapReduce job, the user needs to specify the job's input and output locations in the distributed file system, as well as the input and output format. Additionally, the user needs to specify the class containing the map and reduce function, as well as the JAR file containing the mapper, reducer, and driver classes. These parameters are essential for the MapReduce framework to correctly process the data and execute the job.
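A minimal new-API driver sketch that sets each of these parameters; the class names follow the mapper and reducer sketches above and the job name is invented, so treat them as assumptions rather than part of the question.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);             // JAR containing mapper, reducer and driver
            job.setMapperClass(WordCountMapper.class);            // class containing the map function
            job.setReducerClass(WordCountReducer.class);          // class containing the reduce function
            job.setInputFormatClass(TextInputFormat.class);       // input format
            job.setOutputFormatClass(TextOutputFormat.class);     // output format
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // input location in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output location in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }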
12.
Which of the following statements are true about key/value pairs in Hadoop?
Correct Answer(s)
B. A map() function can emit anything between zero and an unlimited number of key/value pairs.
D. A call to reduce() is guaranteed to receive key/value pairs from only one key.
Explanation
Statement B is true: a single call to map() may emit anything between zero and an unlimited number of key/value pairs, depending on what the input record contains. Statement D is also true: each call to reduce() is guaranteed to receive the values of exactly one key, because the shuffle and sort phase groups all intermediate key/value pairs by key before handing them to the reducers.
13.
Consider the pseudo-code for MapReduce's WordCount example (not shown here). Let's now assume that you want to determine the frequency of phrases consisting of 3 words each instead of determining the frequency of single words. Which part of the (pseudo-)code do you need to adapt?
Correct Answer
A. Only map()
Explanation
In the WordCount example, the map() function is responsible for splitting the input into individual words and emitting each word with a count of 1. To determine the frequency of phrases consisting of 3 words each, we need to modify the map() function to split the input into phrases instead of individual words. The reduce() function, on the other hand, is used to aggregate the counts of the same word, so it does not need to be changed in this case. Therefore, the correct answer is "Only map()".
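A hedged sketch of the adapted map(): it emits every run of three consecutive words in a line as one key, with a count of 1 (the class name is our own, and splitting on whitespace is an assumption about the input).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch only: emits every 3-word phrase (trigram) in a line with a count of 1.
    public class PhraseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text phrase = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] words = value.toString().split("\\s+");
            for (int i = 0; i + 2 < words.length; i++) {
                phrase.set(words[i] + " " + words[i + 1] + " " + words[i + 2]);
                context.write(phrase, ONE);   // the reduce() side stays the same as in WordCount
            }
        }
    }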
14.
Consider the pseudo-code for MapReduce's WordCount example (not shown here). Let's now assume that you want to determine the average amount of words per sentence. Which part of the (pseudo-)code do you need to adapt?
Correct Answer
C. Map() and reduce()
Explanation
To determine the average number of words per sentence, both the map() and reduce() functions must be adapted. The map() function no longer emits one pair per word; instead, for each sentence it emits a single pair whose value is the number of words in that sentence, using one constant key so that all counts reach the same reducer. The reduce() function then sums the word counts, counts the number of sentences it receives, and divides the total number of words by the total number of sentences to obtain the average. Therefore both map() and reduce() need to be adapted.
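One way to sketch this, assuming one sentence per input line and a single constant key so the average is computed globally (all class and key names are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch only: average number of words per sentence, one sentence per line assumed.
    public class AvgWordsPerSentence {
        public static class SentenceMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final Text ALL = new Text("all");   // single key: one reducer sees every count
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                int words = value.toString().trim().split("\\s+").length;
                context.write(ALL, new IntWritable(words));     // (constant key, words in this sentence)
            }
        }

        public static class AvgReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                long totalWords = 0, sentences = 0;
                for (IntWritable v : values) {
                    totalWords += v.get();
                    sentences++;
                }
                context.write(key, new DoubleWritable((double) totalWords / sentences));
            }
        }
    }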
15.
Bob has a Hadoop cluster with 20 machines under default setup (replication 3, 128MB input split size). Each machine has 500GB of HDFS disk space. The cluster is currently empty (no job, no data). Bob intends to upload 5 Terabyte of plain text (in 10 files of approximately 500GB each), followed by running Hadoop’s standard WordCount1 job. What is going to happen?
Correct Answer
B. The data upload fails at a later stage: the disks are full
Explanation
The data upload fails at a later stage because the disks fill up. The cluster's raw capacity is 20 machines × 500 GB = 10 TB of HDFS space. With the default replication factor of 3, storing 5 TB of plain text requires 3 × 5 TB = 15 TB of raw disk space, which exceeds the 10 TB available. The upload therefore succeeds for roughly the first two thirds of the data (about 3.3 TB of logical data, i.e. about 10 TB of replicated blocks) and then fails, so the WordCount job never gets to run over the full dataset.
16.
Basic Input Parameters of a Mapper.
Correct Answer
A. LongWritable and Text
Explanation
The correct answer is LongWritable and Text. In Hadoop MapReduce, the input parameters of a Mapper define the types of the input key and value that the mapper receives. With the default TextInputFormat, the input key is a LongWritable holding the byte offset of the line within the file, and the input value is a Text holding the line itself. LongWritable is Hadoop's writable wrapper for a 64-bit integer and Text is its writable wrapper for a string, so the mapper processes the input as (offset, line) key-value pairs.
17.
Basic intermediate output parameters of a Mapper.
Correct Answer
B. Text and IntWritable
Explanation
The basic intermediate output parameters of a Mapper are Text and IntWritable: the mapper emits intermediate key-value pairs whose key is of type Text and whose value is of type IntWritable. In the WordCount example, the key is the word that was read and the value is the count 1 attached to it. Text is used for textual data and IntWritable for integer data; these intermediate pairs are what the shuffle phase groups by key and delivers to the reducer.
18.
You can write MapReduce jobs in any desired programming language like Ruby, Perl, Python, R, Awk, etc. through the Hadoop ______________________ API.
Correct Answer
streaming
Explanation
The Hadoop streaming API allows users to write MapReduce jobs in any desired programming language, such as Ruby, Perl, Python, R, or Awk. Developers are therefore not limited to Java and can use their existing skills in their preferred language. The streaming API acts as a bridge between the Hadoop framework and the user's chosen language: the mapper and reducer are ordinary executables or scripts that read input records from standard input and write key-value pairs to standard output, while Hadoop takes care of splitting, shuffling, and scheduling.
19.
Which are true statements regarding MapReduce?
Correct Answer(s)
A. Is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters.
B. Is a processing technique and a program model for distributed computing based on Java.
Explanation
MapReduce is a framework that allows developers to write applications to process large amounts of data in parallel on large clusters. It is a processing technique and program model for distributed computing based on Java. The MapReduce algorithm includes the important task of mapping the input data into key-value pairs and then reducing the pairs into a smaller set of key-value pairs.
20.
Intermediate splitting – the entire process runs in parallel on different clusters. In order to group them in the “Reduce Phase”, the similar KEY data should be on the same _________.
Correct Answer
A. Cluster
Explanation
In order to group similar KEY data in the "Reduce Phase", it is necessary for the data to be on the same cluster. This means that the intermediate splitting process, which is performed in parallel on different clusters, needs to ensure that data with similar keys is distributed within the same cluster. This allows for efficient processing and grouping of the data during the reduce phase.
21.
Combining – The last phase, where all the data (the individual result set from each ________) is combined together to form a result.
Correct Answer
A. Cluster
Explanation
In the given question, the correct answer is "Cluster". In the last phase of combining, all the data from each individual result set is brought together to form a final result. A cluster refers to a group of interconnected computers or servers that work together to process and analyze large amounts of data. Therefore, it is logical to conclude that in this context, the data from different sources is combined in a cluster to form the final result.
22.
The input file is passed to the mapper function ________________
Correct Answer
A. Line by Line
Explanation
Passing the input file to the mapper function "line by line" means that each line of the input file is handed to the mapper as a separate record. This allows efficient processing of large input files, because the whole file never has to be loaded into memory at once; each line is treated as an independent input and can be filtered, transformed, or aggregated on its own.
23.
A ______________ comes into action which carries out shuffling so that all the tuples with same key are sent to same node.
Correct Answer
partitioner
Explanation
A partitioner is the component that decides which reducer, and therefore which node, each intermediate key-value pair is sent to. During the shuffle, the partitioner computes a partition number for every tuple from its key, so all tuples with the same key end up on the same node. This is what makes it possible for a single reduce() call to see every value belonging to its key; a minimal partitioner sketch follows.
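A minimal partitioner sketch for Text keys and IntWritable values (Hadoop's default HashPartitioner works essentially the same way; the class name here is our own). Registering it with job.setPartitionerClass(...) makes the shuffle route equal keys to the same reduce node.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Sketch only: tuples with the same key always get the same partition number,
    // and therefore go to the same reducer/node.
    public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }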
24.
So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that very key. For example,
Correct Answer
B. Bear, [1,1]; Car, [1,1,1]
Explanation
After the sorting and shuffling phase, the intermediate data is grouped by key, and each reduce() call receives one unique key together with the list of all values emitted for that key. In this example, the reducer handling the key "Bear" receives the value list [1,1], and the reducer handling the key "Car" receives the value list [1,1,1].
25.
Under the MapReduce model, the data processing ____________ are called mappers and reducers.
Correct Answer
primitives
Explanation
In the MapReduce model, the data processing operations are divided into two stages: mapping and reducing. The mapping stage is responsible for processing the input data and transforming it into intermediate key-value pairs. The reducing stage takes these intermediate results and combines them to produce the final output. These two stages, mapping and reducing, are the fundamental building blocks or primitives of the MapReduce model. They are the basic operations that are used to perform data processing in a distributed and parallel manner.
26.
In Java the ___________ are used for emitting key-value pairs, and they are parameterized by the output.
Correct Answer
context objects
Explanation
In the Hadoop Java API, context objects (Mapper.Context and Reducer.Context) are used for emitting key-value pairs, and they are parameterized by the output key and value types declared for the Mapper or Reducer. Inside map() or reduce(), calling context.write(key, value) hands a pair to the framework. The context also gives access to the job configuration, counters, and status reporting, so it is the mapper's and reducer's channel for passing data and metadata back to the MapReduce system.
27.
The MapReduce framework provides a _________ instance; the __________ object is used to communicate with the MapReduce system.
Correct Answer
context
Explanation
The MapReduce framework provides a "context" instance. This context object is used to communicate with the MapReduce system. It allows the mapper or reducer functions to interact with the framework and access various features and functionalities provided by the system. The context object provides methods and attributes that enable the mapper or reducer to read input data, write output data, and perform other necessary operations within the MapReduce framework.
28.
In Java, Tokenizing Input & Shuffle and Sort are associated with what Class.
Correct Answer
A. Mapper Class
Explanation
Tokenizing input and shuffle and sort are associated with the Mapper class in Java. The Mapper class is responsible for processing the input data and converting it into key-value pairs, which are then passed to the shuffle and sort phase. During the shuffle and sort phase, the key-value pairs are sorted and grouped based on their keys before being sent to the Reducer class for further processing. Therefore, the correct answer is Mapper Class.
29.
In Java, Searching is associated with what Class?
Correct Answer
B. Reducer Class
Explanation
In Hadoop's classic MapReduce algorithm examples, searching is associated with the Reducer class (optionally assisted by a Combiner). Each mapper emits candidate records as intermediate key-value pairs, for example its local maximum, and the reducer then compares the values it receives to select the final match, such as the employee with the highest salary. The Mapper class, by contrast, is associated with tokenizing and filtering the raw input. Hence the answer is the Reducer class; a small sketch of such a "search for the maximum" reducer follows.
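A hedged sketch of such a reducer, assuming the mappers send (constant key, "name<TAB>salary") pairs; the class name, key choice, and record layout are all assumptions made for the example.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch only: among all "name<TAB>salary" values sent to one key,
    // keep the record with the highest salary.
    public class MaxSalaryReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String bestName = null;
            int bestSalary = Integer.MIN_VALUE;
            for (Text v : values) {
                String[] parts = v.toString().split("\t");   // assumed record layout
                int salary = Integer.parseInt(parts[1]);
                if (salary > bestSalary) {
                    bestSalary = salary;
                    bestName = parts[0];
                }
            }
            context.write(new Text(bestName), new IntWritable(bestSalary));
        }
    }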
30.
Which are good use cases for MapReduce?
Correct Answer(s)
A. Log Analysis: Trouble shooting, Audit and Security checks
C. Breadth-First Search
D. Votes Casting
Explanation
MapReduce is a programming model and software framework for processing large amounts of data in a distributed computing environment, and it suits batch-style jobs over large inputs. Log analysis for troubleshooting, audit, and security checks is a classic use case, since it means scanning and aggregating huge volumes of log records in parallel. Breadth-First Search over a large graph can be implemented as a series of MapReduce iterations, each pass expanding the frontier by one level. Votes casting (tallying votes) is essentially a counting and aggregation job, which is exactly what the map and reduce primitives do well. Analyzing many small files, by contrast, is a poor fit: HDFS and MapReduce are optimized for a small number of large files, and a large number of small files creates excessive NameNode metadata and task-startup overhead (the "small files problem").