Spark Training- Post Test

By Ravisoftsource (Community Contributor) | Attempts: 1,811 | Questions: 71
1. Spark is 100x faster than MapReduce due to

Explanation

In-memory computing is the reason why Spark is 100x faster than MapReduce. By keeping the data in memory, Spark eliminates the need to read and write data from disk, which significantly speeds up data processing. This allows Spark to perform operations much faster than MapReduce, which relies heavily on disk I/O. By leveraging the power of in-memory computing, Spark is able to achieve impressive performance gains and process large datasets more efficiently.

About This Quiz

The 'Spark Training- Post Test' assesses understanding of Apache Spark, focusing on core concepts like Spark SQL, DataFrame schemas, and in-memory computing. It evaluates entry-level skills necessary for efficient data processing and analytics, making it crucial for learners aiming to excel in data-intensive environments.

2. Which of the following statements are correct?

Explanation

All of the statements are correct. Spark is designed to run on top of Hadoop and can process data stored in HDFS. It can also use Yarn as a resource management layer, which allows for efficient allocation of resources and scheduling of tasks in a Hadoop cluster. Therefore, all three statements are true.

3. Caching is an optimizing technique?

Explanation

Caching is indeed an optimizing technique. It involves storing frequently accessed data or resources in a cache, which is a high-speed memory or storage system. By doing so, the system can retrieve the data or resources more quickly, reducing the need to access slower or more resource-intensive components. This can greatly improve the performance and efficiency of a system, making caching an effective optimization technique.
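As a minimal sketch (assuming an existing SparkContext `sc`), marking an RDD for caching looks like this:

```scala
// Minimal sketch, assuming an existing SparkContext `sc`.
// cache() marks the RDD for MEMORY_ONLY storage; nothing is stored
// until the first action materializes it.
val doubled = sc.parallelize(1 to 1000).map(_ * 2).cache()
doubled.count() // first action: computes the RDD and keeps it in memory
doubled.sum()   // later actions reuse the cached partitions
```

Without the `cache()` call, each action would recompute the `map` from scratch.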

4. What are the features of Spark RDD?

Explanation

The features of Spark RDD include in-memory computation, lazy evaluations, and fault tolerance. In-memory computation allows Spark to store data in memory, which significantly speeds up data processing. Lazy evaluations enable Spark to optimize the execution of transformations on RDDs by postponing their execution until an action is called. Fault tolerance ensures that if a node fails, Spark can recover the lost data and continue processing without any disruption. Therefore, the correct answer is "All of the above."

5. SparkContext guides how to access the Spark cluster?

Explanation

The SparkContext is the entry point for accessing the Spark cluster. It is responsible for coordinating the execution of tasks and distributing data across the cluster. It provides methods for creating RDDs (Resilient Distributed Datasets) and performing operations on them. Therefore, it guides how to access the Spark cluster, making the answer TRUE.

6. What does Spark Engine do?

Explanation

The Spark Engine performs multiple tasks including scheduling, distributing data across a cluster, and monitoring data across the cluster. It is responsible for managing the execution of Spark applications, allocating resources, and coordinating tasks across the cluster. By handling these tasks, the Spark Engine enables efficient and parallel processing of large datasets, making it a powerful tool for big data analytics and processing.

7. What does the following code print? val lyrics = List("all", "that", "i", "know") println(lyrics.size)

Explanation

The code creates a list called "lyrics" with 4 elements: "all", "that", "i", and "know". The "println" statement prints the size of the list, which is 4.

8. Which types of processing can Apache Spark handle?

Explanation

Apache Spark is a powerful data processing framework that can handle various types of processing tasks. It supports batch processing, which involves processing large volumes of data in a scheduled manner. It also supports stream processing, which involves processing real-time data as it arrives. Additionally, Apache Spark can handle graph processing, which involves analyzing and processing graph-based data structures. Lastly, it supports interactive processing, which involves querying and analyzing data interactively in real-time. Therefore, the correct answer is "All of the above" as Apache Spark is capable of handling all these types of processing.

9. Apache Spark has APIs in

Explanation

Apache Spark has APIs in Java, Scala, and Python. This means that developers can use any of these programming languages to interact with and manipulate data in Apache Spark. The availability of multiple APIs allows developers to choose the language they are most comfortable with, making it easier for them to work with Spark and perform tasks such as data analysis, machine learning, and distributed processing.

10. Which of the following are DataFrame actions?

Explanation

The given answer "All the above" is correct because all the mentioned options - count, first, take(n), and collect - are actions that can be performed on a DataFrame. These actions are used to retrieve or manipulate data from the DataFrame. The count action returns the number of rows in the DataFrame, the first action returns the first row, the take(n) action returns the first n rows, and the collect action retrieves all the rows from the DataFrame. Therefore, all the mentioned options are valid DataFrame actions.

11. How do you print the schema of a DataFrame?

Explanation

The correct answer is df.printSchema(). This is because the printSchema() function is a method in Spark DataFrame that prints the schema of the DataFrame in a tree format. It displays the column names and their corresponding data types, providing a concise overview of the structure of the DataFrame.
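A small sketch of the call, assuming a SparkSession `spark` is in scope:

```scala
// Sketch, assuming an existing SparkSession `spark`.
import spark.implicits._
val df = Seq((1, "alpha"), (2, "beta")).toDF("id", "name")
df.printSchema()
// Prints a tree such as:
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
```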

12. Identify the correct transformation

Explanation

The correct answer is "All of the above" because the question is asking to identify the correct transformation, and all three options - Map, Filter, and Join - are valid transformations in data processing. Map is used to transform each element in a dataset, Filter is used to select specific elements based on a condition, and Join is used to combine two datasets based on a common key. Therefore, all three transformations can be used depending on the specific requirements of the data processing task.

13. Choose the correct statement about RDD

Explanation

RDD stands for Resilient Distributed Dataset, which is a fundamental data structure in Apache Spark. It is not a database or a programming paradigm. RDD is a distributed data structure that allows data to be processed in parallel across a cluster of computers. RDDs are fault-tolerant and can be cached in memory, which enables faster processing. They provide a high-level abstraction for distributed data processing and are a key component in Spark's computational model.

14. How much faster can Apache Spark potentially run batch-processing programs when processed in memory than MapReduce can?

Explanation

Apache Spark can potentially run batch-processing programs 100 times faster than MapReduce when processed in memory. This is because Spark is designed to store data in memory, which allows for faster data processing and eliminates the need to read and write data from disk, as in the case of MapReduce. Additionally, Spark utilizes a directed acyclic graph (DAG) execution engine, which optimizes the execution plan and minimizes the overhead of data shuffling. These factors contribute to the significant speed improvement of Spark over MapReduce.

15. On which cluster are tasks mostly launched in the production world?

Explanation

In the production world, tasks are primarily launched on the Yarn cluster. Yarn is a distributed processing framework that allows for efficient resource management and job scheduling in Hadoop. It provides a flexible and scalable platform for running various types of applications, including MapReduce, Spark, and Hive. Yarn's ability to handle large workloads and optimize resource utilization makes it the preferred choice for launching tasks in the production environment.

16. What does the following code print? val numbers = List(11, 22, 33) var total = 0 for (i <- numbers) {   total += i } println(total)

Explanation

The given code initializes a list of numbers [11, 22, 33] and a variable total with the value 0. It then iterates over each element in the list using a for loop and adds each element to the total. Finally, it prints the value of total, which is 66.

17. Which cluster managers does Spark support?

Explanation

Spark supports all of the above cluster managers, which include Standalone Cluster Manager, Mesos, and YARN. This means that Spark can be deployed and run on any of these cluster managers, providing flexibility and compatibility with different environments and infrastructures.

18. What does the following code print: var min = (a: Int, b: Int) => { if (a > b) b else a } println(min(78, 44))

Explanation

The given code defines a function called "min" which takes two parameters (a and b) and returns the smaller value between them. In this case, the function is called with arguments 78 and 44, so it will return 44. The "println" statement then prints the returned value, which is 44.

19. What is the default block size in Hadoop 2?

Explanation

The default block size in Hadoop 2 is 128MB. This means that when data is stored in Hadoop, it is divided into blocks of this size. Each block is then distributed across the cluster for processing. The default block size of 128MB is chosen to strike a balance between efficient storage utilization and parallel processing. It allows for optimal performance by ensuring that each block can be processed independently by a single node in the cluster.

20. How many SparkContexts can be active per job?

Explanation

The correct answer is "only one" because in Apache Spark, there can only be one active Spark Context per job. A Spark Context represents the entry point to the Spark cluster and coordinates the execution of tasks. Having multiple active Spark Contexts can lead to conflicts and inconsistencies in the execution environment. Therefore, it is recommended to have only one active Spark Context at a time.

21. Dataframes are _____________

Explanation

Dataframes are immutable, meaning that once they are created, their contents cannot be changed. This ensures data integrity and prevents accidental modifications to the dataframe. If any changes need to be made to a dataframe, a new dataframe must be created with the desired modifications. This immutability property also allows for easier debugging and reproducibility, as the original dataframe remains unchanged throughout the data processing pipeline.

22. The default storage level of cache() is?

Explanation

The default storage level of cache() is MEMORY_ONLY. This means that the RDD will be stored in memory as deserialized Java objects. This storage level provides fast access to the data but does not persist it on disk. If the memory is not sufficient to store the entire RDD, some partitions may be evicted and recomputed on the fly when needed.

23. RDD is

Explanation

RDD (Resilient Distributed Dataset) is a fundamental data structure in Apache Spark. It is immutable, meaning that once created, its data cannot be modified. RDDs are also recomputable, which means that if a node fails, the RDD can be reconstructed from the lineage information. Finally, RDDs are fault-tolerant, as they automatically recover from failures. Therefore, the correct answer is "All of the above" as RDDs possess all these characteristics.

24. What does the following code print? val numbers = List("one", "two") val letters = List("a", "b") val numbersRdd = sc.parallelize(numbers) val lettersRdd = sc.parallelize(letters) val both = numbersRdd.union(lettersRdd) println(both)

Explanation

The given code creates two lists, "numbers" and "letters", parallelizes each with the SparkContext "sc", and combines them with the "union" transformation, producing an RDD containing the elements one, two, a, and b. Note that calling println on an RDD prints the RDD's string representation (something like "UnionRDD[2] at union at ..."), not its elements; to print the contents you would first call both.collect().

25. What does the following code print? val simple = Map("r" -> "red", "g" -> "green") println(simple("g"))

Explanation

The code creates a Map called "simple" with two key-value pairs: "r" -> "red" and "g" -> "green". The code then prints the value associated with the key "g" in the map, which is "green". Therefore, the code will print "green".

26. For resource management, Spark can use

Explanation

Spark can use Yarn, Mesos, and Standalone cluster manager for resource management. Yarn is a popular choice for managing resources in Hadoop clusters, while Mesos is a distributed systems kernel that can also handle resource allocation. Additionally, Spark can run in a standalone cluster manager mode where it manages its own resources. Therefore, the correct answer is "All of the above" as Spark provides the flexibility to use any of these options for resource management based on the specific requirements and infrastructure of the system.

27. What are the Scala variable types?

Explanation

Both A and B are correct because in Scala, there are two types of variables: var and val. The var keyword is used to declare mutable variables, which means their values can be changed. On the other hand, the val keyword is used to declare immutable variables, whose values cannot be changed once assigned. In the given code, myVar is a var variable and myVal is a val variable, so both types of variables are present.
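A minimal illustration of the two keywords:

```scala
var myVar: Int = 10          // var: mutable, may be reassigned
val myVal: String = "spark"  // val: immutable, reassignment is a compile-time error
myVar = 11                   // allowed
// myVal = "other"           // would not compile: reassignment to val
println(myVar)               // 11
```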

28. Data transformations are executed 

Explanation

Data transformations are executed lazily. This means that the transformations are not immediately performed when the code is executed, but rather when the result is needed or requested. Laziness allows for more efficient execution as only the necessary transformations are performed, reducing unnecessary computation. It also enables the use of lazy evaluation strategies, such as memoization, which can further optimize the execution of data transformations.
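Lazy execution can be sketched like this (assuming a SparkContext `sc`; the file name is hypothetical):

```scala
// Sketch, assuming a SparkContext `sc` and a hypothetical file "logs.txt".
val lines  = sc.textFile("logs.txt")            // transformation: nothing runs yet
val errors = lines.filter(_.contains("ERROR"))  // transformation: still nothing runs
val n      = errors.count()                     // action: triggers reading + filtering
```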

29. Spark session variable was introduced in which Spark release?

Explanation

The Spark session variable was introduced in Spark 2.0. This release of Spark introduced the concept of a Spark session, which is the entry point for interacting with Spark functionality and allows for managing various Spark configurations and settings. Prior to Spark 2.0, users had to create a SparkContext object to interact with Spark, but the introduction of the Spark session simplified the process and provided a more user-friendly interface.

30. Which file format provides optimized binary storage of structured data?

Explanation

Parquet is a file format that provides optimized binary storage of structured data. It is designed to efficiently store and process large amounts of data. Parquet uses columnar storage, which allows for efficient compression and encoding techniques to be applied to individual columns, resulting in reduced storage space and improved query performance. This makes Parquet an ideal choice for big data processing frameworks like Apache Hadoop and Apache Spark.

31. Spark is developed in

Explanation

Spark is developed in Scala. Scala is a programming language that runs on the Java Virtual Machine (JVM) and combines object-oriented and functional programming concepts. Spark was originally written in Scala because Scala provides concise syntax and strong support for functional programming, making it well-suited for building distributed data processing systems like Spark. However, Spark also provides APIs in other languages like Java, Python, and R, allowing developers to use Spark with their preferred programming language.

32. What is the default replication factor?

Explanation

The default replication factor refers to the number of copies of data that are automatically created and stored across different nodes in a distributed system. In this case, the correct answer is 3, which means that by default, data is replicated three times to ensure fault tolerance and high availability. This replication factor helps in maintaining data integrity and durability, as it allows for data recovery in case of node failures or data corruption.

33. What does the following code print? var aa: String = "hello" aa = "pretty" println(aa)

Explanation

The code initializes a variable "aa" with the value "hello". Then it assigns the value "pretty" to the variable "aa". Finally, it prints the value of "aa", which is "pretty".

34. Common DataFrame transformations include

Explanation

The correct answer is "both a and b" because both "select" and "filter" are common DataFrame transformations. The "select" transformation is used to select specific columns from a DataFrame, while the "filter" transformation is used to filter rows based on a condition. Therefore, both a and b are valid options for common DataFrame transformations.

35. What does the following code print? println(5 < 6 && 10 == 10)

Explanation

The code will print "true" because it is using the logical AND operator (&&) to check if both conditions are true. The first condition, 5

36. How do you get the count of distinct records in a DataFrame?

Explanation

The correct answer is mydf.distinct.count() because the distinct() function is used to remove duplicate records from a dataframe, and the count() function is used to get the total number of records in the dataframe after removing duplicates. This combination of distinct() and count() will give the count of distinct records in the dataframe.

37. Spark's core is a batch engine

Explanation

Spark's core is a batch engine. This means that Spark is designed to process large amounts of data in batches rather than in real-time. It allows for efficient and parallel processing of data by dividing it into smaller chunks called batches. This batch processing approach is suitable for tasks such as data analytics, machine learning, and data transformations where processing large volumes of data at once is more efficient than processing individual records in real-time. Therefore, the statement "Spark's core is a batch engine" is true.

38.  _________ is the default Partitioner for partitioning key space

Explanation

The HashPartitioner is the default Partitioner for partitioning key space. This means that when data is being distributed across partitions, the HashPartitioner is used to determine which partition a specific key should be assigned to. The HashPartitioner calculates a hash value for each key and then uses this value to determine the partition. This ensures an even distribution of keys across partitions, making it an efficient and balanced way to partition the key space.

39. Kafka maintains feeds of messages in categories called

Explanation

Kafka maintains feeds of messages in categories called "topics". Topics in Kafka are used to organize and categorize messages, allowing for efficient and scalable message processing. Producers write messages to specific topics, and consumers can subscribe to one or more topics to consume the messages. Topics enable Kafka to handle large amounts of data and distribute it across multiple brokers in a fault-tolerant manner.

40. Which DataFrame method will display the first few rows in tabular format?

Explanation

The show() method in a dataframe will display the first few rows in tabular format.

41. Which of the following is not true for Mapreduce and Spark?

Explanation

Both MapReduce and Spark do not have their own file system. They rely on external file systems such as Hadoop Distributed File System (HDFS) or any other compatible file system for storing and accessing data. MapReduce uses HDFS for data storage and retrieval, while Spark can work with various file systems including HDFS, Amazon S3, and local file systems.

42. What is transformation in Spark RDD?

Explanation

Transformation in Spark RDD refers to the operations that are performed on an RDD to create a new RDD. These operations are lazily evaluated, meaning they are not executed immediately but rather when an action is called. The transformation takes an RDD as input and produces one or more RDDs as output. Examples of transformations include map, filter, and reduceByKey. These transformations allow for the transformation of data in a distributed and parallel manner, enabling efficient data processing in Spark.

43. HBase is a distributed ________ database built on top of the Hadoop file system.

Explanation

HBase is a distributed database built on top of the Hadoop file system, and it is specifically designed to be column-oriented. This means that data is stored and retrieved based on columns rather than rows. This design allows for efficient querying and processing of large datasets, making it suitable for big data applications.

44. Spark Core Abstraction

Explanation

RDD stands for Resilient Distributed Dataset. It is a fundamental data structure in Spark that represents an immutable distributed collection of objects. RDDs are fault-tolerant and can be processed in parallel across a cluster of machines. They provide a high-level abstraction for performing distributed data processing tasks in Spark. RDDs are resilient, meaning they can recover from failures, and distributed, meaning they can be processed in parallel across multiple nodes. RDDs are the building blocks of Spark applications and provide a way to perform efficient and scalable data processing.

45. How would you convert "mydf" dataframe to rdd?

Explanation

The correct answer is "mydf.rdd" because the ".rdd" method is used to convert a DataFrame to a Resilient Distributed Dataset (RDD) in Apache Spark. RDD is the fundamental data structure in Spark, and converting a DataFrame to RDD allows for lower-level operations and more flexibility in data processing.

46. Which of the following is the entry point of Spark SQL in Spark 2.0?

Explanation

The correct answer is SparkSession (spark). In Spark 2.0, SparkSession is the entry point of Spark SQL. SparkSession provides a single point of entry for interacting with Spark SQL and encapsulates the functionality of SparkContext, SQLContext, and HiveContext. It allows users to easily create DataFrames, execute SQL queries, and access various Spark SQL features.

47. What does the following code print? var number = {val x = 2 * 2; x + 40} println(number)

Explanation

The given code defines a variable called "number" and assigns it the value of a block expression. The block computes "x" as 2 multiplied by 2, which is 4, and then evaluates to x + 40, giving a final value of 44. The "println" statement then prints the value of "number", which is 44.

48. Choose the correct statement

Explanation

The correct answer is "Execution starts with the call of Action." In Spark, transformations are lazily evaluated, meaning they are not executed immediately when called. Instead, they create a plan of execution that is only triggered when an action is called. Actions are operations that trigger the execution of the transformations and produce a result or output. Therefore, the execution of a Spark program begins when an action is called, not when a transformation is called.

49. Which of the following is true about Scala type inference?

Explanation

Scala has a powerful type inference system that allows the type of a variable to be determined by looking at its value. This means that in many cases, the data type of a variable does not need to be explicitly mentioned. The compiler analyzes the value assigned to the variable and infers its type based on that. This feature of Scala makes the code more concise and reduces the need for explicit type declarations, leading to cleaner and more expressive code.

50. DataFrames and _____________ are abstractions for representing structured data

Explanation

Datasets are abstractions for representing structured data, along with DataFrames. Both DataFrames and Datasets are used in Apache Spark to handle structured data. While DataFrames provide a high-level API and are optimized for performance, Datasets provide a type-safe, object-oriented programming interface. Datasets combine the benefits of both DataFrames and RDDs, allowing for strong typing and providing a more efficient execution engine. Therefore, the correct answer is Datasets.
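The typed/untyped contrast can be sketched as follows (assuming a SparkSession `spark`; the case class and values are illustrative):

```scala
// Sketch, assuming an existing SparkSession `spark`.
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

case class Person(name: String, age: Int)

val ds: Dataset[Person] = Seq(Person("Ann", 30), Person("Bo", 25)).toDS()
val df: DataFrame       = ds.toDF()   // a DataFrame is simply Dataset[Row]

ds.filter(_.age > 26)    // typed lambda: field access checked at compile time
df.filter($"age" > 26)   // untyped Column expression: checked at analysis time
```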

51. Sqoop uses _________ to fetch data from RDBMS and stores that on HDFS.

Explanation

Sqoop uses MapReduce to fetch data from RDBMS and store it on HDFS. MapReduce is a programming model and software framework used for processing large amounts of data in parallel across a distributed cluster. Sqoop leverages MapReduce to efficiently import data from relational databases into Hadoop by dividing the import process into multiple tasks that can be executed in parallel across multiple nodes in the cluster. This allows for faster and more efficient data transfer from RDBMS to HDFS.

52. Which is not a component on the top of Spark Core?

Explanation

The correct answer is Spark RDD. Spark RDD is not a component on the top of Spark Core. RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark, and it is the main component of Spark Core. Spark Streaming, MLlib, and graphX are all built on top of Spark Core and provide additional functionalities for real-time streaming processing, machine learning, and graph processing respectively.

53. Identify the correct action

Explanation

The correct answer is "Reduce." In programming, the reduce function is used to combine all the elements in a collection into a single value. It applies a specified operation to each element and accumulates the result. This is useful when you want to perform calculations on a list of values and obtain a single output. The reduce function is commonly used for tasks such as calculating the sum or product of a list, finding the maximum or minimum value, or concatenating strings.

54. Datasets are only defined in Scala and ______

Explanation

Datasets are a feature in Apache Spark that provide the benefits of both RDDs and DataFrames. While they are primarily defined in Scala, they can also be used in Java. Therefore, the correct answer is Java.

55. Fault Tolerance in RDD is achieved using

Explanation

Fault tolerance in RDD is achieved using DAG (Directed Acyclic Graph) or Data Lineage. RDDs are fault-tolerant by design because they are immutable, meaning they cannot be modified once created. If a partition of an RDD is lost, it can be recomputed using the lineage information stored in the DAG. The DAG represents the logical execution plan of transformations applied to the RDD, and it allows RDDs to be reconstructed from their original input data. This ensures fault tolerance by allowing RDDs to recover from failures and continue processing. Lazy evaluation is a concept related to RDDs but not directly responsible for fault tolerance.

56. What is action in Spark RDD?

Explanation

The correct answer is "The ways to send result from executors to the driver." In Spark RDD, an action is an operation that triggers the execution of transformations and returns the result to the driver program. Actions are used to bring the data from RDDs back to the driver program or to perform some computation on the RDDs. They are responsible for executing the DAG (Directed Acyclic Graph) of computations that are created by transformations.

57. Which of the following is not a function of SparkContext?

Explanation

Spark Context is the entry point for any Spark functionality and it provides access to various services, allows setting configurations, and enables checking the status of Spark applications. However, it is not responsible for serving as the entry point to Spark SQL. Spark SQL has its own entry point called SparkSession, which is used for working with structured data using SQL queries, DataFrame, and Dataset APIs.

58. Which of the following is not the feature of Spark?

Explanation

Spark is known for its features like supporting in-memory computation, fault-tolerance, and compatibility with other file storage systems. However, it is not specifically known for being cost efficient. While Spark does offer high performance and scalability, the cost of running Spark can vary depending on factors such as cluster size and resource requirements. Therefore, the statement "it is cost efficient" is not a feature commonly associated with Spark.

59. What does the following code print? var bb: Int = 10 bb = "funny" println(bb)

Explanation

The code will print an error. This is because the variable "bb" is declared as an Int, but then it is assigned a string value "funny". This is a type mismatch and the code will not compile.

60. RDD cannot be created from data stored on

Explanation

RDD (Resilient Distributed Dataset) is a fundamental data structure in Apache Spark that allows for distributed processing of large datasets. In this context, the given correct answer states that an RDD cannot be created from data stored on an Oracle database. This is because RDDs are typically created from data sources that are supported by Spark, such as HDFS (Hadoop Distributed File System), S3 (Amazon Simple Storage Service), or LocalFS (local file system). Oracle is not listed among the supported data sources, hence an RDD cannot be directly created from data stored on an Oracle database.

61. DataFrame schemas are determined

Explanation

DataFrame schemas are determined eagerly, meaning that they are evaluated and determined immediately when the DataFrame is created. This allows for faster processing and optimization during the execution of operations on the DataFrame. In contrast, lazy schema determination would delay the evaluation of the schema until it is actually needed, which could potentially slow down the overall performance of the DataFrame operations.

62. How would you get the number of partitions of a dataframe "mydf" ?

Explanation

The correct answer is mydf.rdd.getNumPartitions. This is because the rdd.getNumPartitions method is used to get the number of partitions of a dataframe in Spark. RDD stands for Resilient Distributed Dataset, which is the fundamental data structure in Spark. By calling the getNumPartitions method on the RDD representation of the dataframe, we can obtain the number of partitions.

63. There is a table in Hive named "products". What is the correct syntax to load this table into a Spark DataFrame using Scala?

Explanation

The correct syntax to load the "products" table into a Spark DataFrame using Scala is var tbl = spark.table("products"). This syntax uses the "table" method of the SparkSession object to load the table into a DataFrame named "tbl".

64. Mydf " is a dataframe having thousands of records. You need to look only 10 records .How would you get it done?

Explanation

The correct answer is "mydf.take(10)". This method will return the first 10 records from the dataframe "mydf". It is a commonly used method to retrieve a specific number of records from a dataframe.

Submit
65. What does the following code print? val dd: Double = 9.99; dd = 10.01; println(dd)

Explanation

The code will produce an error because the variable "dd" is declared as a "val", which means it is immutable and cannot be reassigned a new value. Therefore, the attempt to assign a new value to "dd" will result in a compilation error.
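A minimal illustration of the val/var distinction; `mutableDd` is a name introduced here for the example:

```scala
object ValVarDemo extends App {
  val dd: Double = 9.99
  // dd = 10.01        // does not compile: reassignment to val

  var mutableDd: Double = 9.99 // var allows reassignment
  mutableDd = 10.01
  println(mutableDd)           // prints 10.01
}
```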

Submit
66. You cannot load a Dataset directly from a structured source

Explanation

The statement is marked true: a Dataset is a typed abstraction, so data from a structured source such as a database or spreadsheet is first read as a DataFrame and then converted to a Dataset (for example with .as[T]), rather than being loaded into a Dataset directly.

Submit
67. Datasets are saved as DataFrames using

Explanation

The correct answer is "Dataset.write" because it is the method used to save datasets as DataFrames. The "write" method allows users to write the contents of a DataFrame to a variety of data sources, such as Parquet, CSV, or JSON files. It provides flexibility in specifying the format, mode, partitioning, and other options for writing the dataset.
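A sketch of the writer API, assuming a DataFrame `mydf`; the output path and partition column are illustrative:

```scala
// DataFrameWriter: format, save mode, and partitioning are all configurable.
mydf.write
  .mode("overwrite")           // overwrite | append | ignore | errorifexists
  .partitionBy("year")         // optional partition column (illustrative)
  .parquet("/tmp/output/mydf")

// Other supported formats include:
//   mydf.write.json("/tmp/output/json")
//   mydf.write.csv("/tmp/output/csv")
```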

Submit
68. Which of the following is true about DataFrame?

Explanation

The correct answer is "DataFrames provide a more user friendly API than RDDs." DataFrames offer a higher-level, more structured abstraction than RDDs: data can be manipulated with SQL-like queries, and Spark can optimize the execution for performance. DataFrames also carry a schema describing the structure and types of the data, although compile-time type safety is provided by Datasets rather than DataFrames.

Submit
69. What will be the output: val rawData = spark.read.textFile("PATH").rdd val result = rawData.filter...

Explanation

The code snippet reads a text file with Spark and converts it into an RDD (Resilient Distributed Dataset). However, the snippet is incomplete: the filter predicate is not specified, so the code will not compile.

Submit
70. Spark caches the RDD automatically in the memory on its own

Explanation

Spark does not automatically cache the RDD in memory. Caching is an optional operation in Spark, and the user needs to explicitly instruct Spark to cache an RDD using the `cache()` or `persist()` methods. Caching an RDD allows for faster access to the data, as it is stored in memory and can be reused across multiple actions or transformations. However, if the user does not explicitly cache the RDD, Spark will not automatically cache it. Therefore, the given statement is false.
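A sketch of explicit caching, assuming a SparkContext `sc` and an illustrative file path:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("file:///tmp/data.txt")

// Caching must be requested explicitly; nothing is materialized until an action runs.
rdd.cache()                 // shorthand for persist(StorageLevel.MEMORY_ONLY)
// For a different storage level, call persist instead of cache (choose one):
//   rdd.persist(StorageLevel.MEMORY_AND_DISK)

println(rdd.count())        // first action populates the cache
rdd.unpersist()             // release the cached data when no longer needed
```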

Submit
71. Which of the following data types is supported by Hive?

Explanation

Hive supports the enum data type. The enum data type in Hive is used to represent a fixed set of values. It is similar to an enumeration in other programming languages. Enum data type allows users to define a set of named values, and each value can be assigned an ordinal number. This data type is useful when there is a need to restrict the values that can be assigned to a column or variable in Hive.

Submit
Quiz Review Timeline (Updated): Sep 2, 2023 +

Our quizzes are rigorously reviewed, monitored and continuously updated by our expert board to maintain accuracy, relevance, and timeliness.

  • Current Version
  • Sep 02, 2023
    Quiz Edited by
    ProProfs Editorial Team
  • Oct 17, 2019
    Quiz Created by
    Ravisoftsource