How Well Do You Know Data Science? Data Science Quiz

1. Tableau can create worksheet-specific filters.

True

False

Explanation

True. Tableau can create filters that apply to a single worksheet, so a filter on one worksheet does not affect the data displayed in the others. This lets users analyze and visualize each view against its own criteria.

2. Who is a data scientist?

Mathematician

Statistician

Software programmer

All of the above

Explanation

A data scientist is someone who possesses a combination of skills in mathematics, statistics, and software programming. They use these skills to analyze and interpret complex data sets, identify patterns and trends, and develop algorithms and models to solve problems and make data-driven decisions. By having expertise in all three areas, data scientists are able to handle the entire process of data analysis, from collecting and cleaning data to implementing and deploying analytical solutions. Therefore, the correct answer is "All of the above" as all three roles (mathematician, statistician, and software programmer) are encompassed within the field of data science.

3. Which correlation coefficient indicates a strong positive correlation?

Above -0.8

Below -0.8

Above 0.8

Below 0.65

Explanation

The correct answer is "Above 0.8". A positive correlation means that as one variable increases, the other also tends to increase, so the correlation coefficient is greater than 0. Coefficients close to 1, such as those above 0.8, are commonly read as a strong positive linear relationship. Of the options listed, only "Above 0.8" is guaranteed to be positive: "Above -0.8" and "Below 0.65" both include negative values.
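
As a minimal sketch of how such a coefficient is obtained, the Pearson correlation can be computed directly from its definition. The function name `pearson_r` and the study-hours data below are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: scores rise as hours studied rise.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 70, 74]
r = pearson_r(hours_studied, exam_score)
is_strong_positive = r > 0.8
```

With this data the coefficient comes out well above 0.8, so the relationship would count as strongly positive under the quiz's threshold.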

4. What are the 3 V's of Big Data?

Velocity, Victory, Volume

Volume, Velocity, Variety

Volume, Viscous, Velocity

None of the above

Explanation

The correct answer is Volume, Velocity, Variety. These are the three main characteristics of big data. Volume refers to the large amount of data being generated and collected. Velocity refers to the speed at which data is being generated and needs to be processed in real-time. Variety refers to the different types and formats of data, including structured, unstructured, and semi-structured data. These three V's are essential for understanding and analyzing big data effectively.

5. Raw data should be processed only one time.

True

False

Explanation

False. Raw data often needs to be processed more than once. For example, if new information or updates are received, the raw data may need to be processed again to incorporate the changes. Different analyses may also require different processing methods, which leads to multiple processing passes over the same raw data.

6. Which of the following can be considered a random variable?

The outcome from the roll of a die

The outcome of the flip of a coin

The outcome of an exam

All of the Mentioned

Explanation

All of the mentioned options can be considered as random variables. A random variable is a variable whose value is determined by the outcome of a random event. In this case, the outcome from the roll of a die, the outcome of a flip of a coin, and the outcome of an exam are all determined by random events. Therefore, all of these options can be considered as random variables.
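
To make the idea concrete, here is a small sketch that treats a fair die and a fair coin as random variables by assigning a number to each outcome and computing the expected value. Coding heads as 1 and tails as 0 is an assumption chosen for illustration.

```python
from fractions import Fraction

# A fair die: each face is a numeric outcome with probability 1/6.
die_faces = [1, 2, 3, 4, 5, 6]
expected_die = sum(Fraction(face, 6) for face in die_faces)

# A fair coin: map each outcome to a number (an illustrative coding).
coin_value = {"heads": 1, "tails": 0}
expected_coin = sum(Fraction(1, 2) * value for value in coin_value.values())

print(expected_die)   # 7/2
print(expected_coin)  # 1/2
```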

7. Point out the correct statement:

Machine learning focuses on prediction, based on known properties learned from the training data

Data Cleaning focuses on prediction, based on known properties learned from the training data.

None of the Mentioned

Explanation

The correct answer is "Machine learning focuses on prediction, based on known properties learned from the training data." This statement accurately describes the main objective of machine learning, which is to make predictions or decisions based on patterns and relationships learned from a set of training data. Machine learning algorithms analyze the training data to identify these patterns and use them to make predictions on new, unseen data.

8. Which of the following are "Measures of Central Tendency"?

Mean, Range, Mode

Mean, Standard Deviation, Range

Mode, Mean, Median

Range, Standard Deviation, Variance

Explanation

The measures of central tendency are statistical measures used to describe the center or average of a data set. The mode is the most frequently occurring value, the mean is the average of all values, and the median is the middle value when the data set is arranged in ascending or descending order. Therefore, the correct answer is mode, mean, and median as they are all measures of central tendency.
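
The three measures can be computed with Python's standard `statistics` module. The sample values are made up for illustration.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)      # sum of values divided by their count
median_value = statistics.median(values)  # average of the two middle values here
mode_value = statistics.mode(values)      # most frequently occurring value

print(mean_value, median_value, mode_value)  # 5 4.0 3
```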

9. Will filters work when we do data blending?

True

False

Explanation

True. Filters still work when data blending is used. Data blending combines data from multiple sources into a single view, and filters narrow the data down by specific criteria. Filters can therefore still be applied to a blended view to limit the data being displayed or analyzed.

10. Which of the following types of testing is concerned with making decisions using data?

Probability

Hypothesis

Causal

None of the mentioned

Explanation

Hypothesis testing is concerned with making decisions using data. In hypothesis testing, a researcher formulates a hypothesis about a population parameter and collects sample data to determine whether the evidence supports or contradicts it. The goal is to make an inference about the population from the sample. The decision is whether to reject the null hypothesis or fail to reject it, based on the evidence provided by the data. Therefore, hypothesis testing is the correct answer.
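
As a minimal sketch of such a decision, the following runs an exact two-sided binomial test of whether a coin is fair. The observed count (60 heads in 100 flips), the 0.05 significance level, and the helper name `binomial_two_sided_p` are assumptions chosen for illustration.

```python
from math import comb

def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided p-value: total probability, under the null hypothesis,
    of every outcome that is no more likely than the observed one."""
    def pmf(k):
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = pmf(successes)
    # Small relative tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))

# Null hypothesis: the coin is fair (probability of heads = 0.5).
p_value = binomial_two_sided_p(successes=60, trials=100)
alpha = 0.05
reject_null = p_value < alpha
```

Here the p-value is slightly above 0.05, so at that level the data do not give enough evidence to reject the hypothesis that the coin is fair.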

11. Why Machine Learning in Data Science?

For Visualization

For Prediction

For Cleaning

All the above

Explanation

Machine learning is used in data science for prediction because it allows the development of models that can analyze patterns and make accurate predictions based on historical data. By training these models with known data, they can learn to recognize patterns and relationships, and then apply that knowledge to make predictions on new, unseen data. This prediction capability is valuable in various fields, such as finance, healthcare, and marketing, where accurate predictions can help in decision-making and improving outcomes.

12. ____________ is a multidisciplinary field which involves extraction of knowledge from large volumes of data that are structured or unstructured.

Data Science

Data Analysis

Descriptive Analysis

None of the mentioned

Explanation

Data Science is the correct answer because it is a multidisciplinary field that involves the extraction of knowledge from large volumes of data, whether it is structured or unstructured. Data scientists use various techniques and tools to analyze and interpret data in order to gain insights and make informed decisions. This field combines elements of statistics, mathematics, computer science, and domain knowledge to extract valuable information from data.

13. Which of the following diagrams is used to view correlation?

Triangle

Boxplot

Corrgram

Histogram

Explanation

A corrgram is a diagram used to view correlation. It displays a matrix of correlation coefficients between variables, usually represented by a grid of squares. Each square represents the correlation between two variables, with the color or shading indicating the strength and direction of the correlation. This diagram is useful for visually understanding the relationships between variables and identifying patterns or trends in the data.

14. Which of the following techniques comes under practical machine learning?

Decision Tree

Data Visualisation

Forecasting

None of the mentioned

Explanation

Decision Tree is a technique that falls under practical machine learning. It is a supervised learning algorithm that is used for both classification and regression tasks. It is practical because it is easy to understand and interpret, and it can handle both categorical and numerical data. Decision Tree builds a model by learning simple decision rules inferred from the data features, making it a widely used technique in various industries and applications. Data visualization and forecasting, though related to machine learning, are not specific techniques but rather tools or methods that can be used in conjunction with different machine learning algorithms.
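
The "simple decision rules" mentioned above can be sketched with a one-level tree, often called a decision stump, which learns a single threshold on one numeric feature. This is a simplified illustration rather than a full decision tree algorithm. The helper names and the age/purchase data are hypothetical.

```python
def fit_stump(values, labels):
    """Pick the threshold and leaf labels that misclassify the fewest points."""
    best = None
    for threshold in sorted(set(values)):
        # Try both assignments of class labels to the two sides of the split.
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if v <= threshold else right) != y
                for v, y in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    return best[1:]  # (threshold, label if value <= threshold, label otherwise)

def predict_stump(stump, value):
    threshold, left, right = stump
    return left if value <= threshold else right

# Hypothetical data: customer age and whether they bought (1) or not (0).
ages = [22, 25, 31, 38, 45, 52]
bought = [0, 0, 0, 1, 1, 1]

stump = fit_stump(ages, bought)
print(stump)                     # (31, 0, 1): "age <= 31 -> 0, otherwise 1"
print(predict_stump(stump, 40))  # 1
```

A full decision tree applies the same idea recursively, splitting each resulting group again on whichever feature and threshold separates it best.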

15. Which of the following is the definition of raw data?

Set of Measurement on Recorded Values

Processed Data

Easy to use for data analysis

None of the Mentioned

Explanation

Raw data refers to unprocessed and unorganized data that is collected directly from various sources. It consists of measurements or recorded values in their original form, without any manipulation or analysis. Raw data serves as the foundation for data analysis and is typically transformed and processed to extract meaningful insights and patterns. Therefore, the definition "Set of Measurement on Recorded Values" accurately describes raw data.

16. __________ is the standard deviation of a sampling distribution.

Sample error

Sampling error

Simple error

Standard error

Explanation

Standard error is the correct answer because it represents the standard deviation of a sampling distribution. A sampling distribution is a distribution of statistics obtained from multiple samples of the same population. The standard error measures the variability or spread of these statistics, indicating how much they differ from the true population parameter. It is an important measure in inferential statistics as it helps estimate the precision of sample statistics and make inferences about the population.
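
For the most common case, the standard error of the sample mean, the estimate is the sample standard deviation divided by the square root of the sample size. A short sketch with made-up sample values:

```python
import statistics
from math import sqrt

sample = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = statistics.stdev(sample)          # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / sqrt(len(sample))  # estimated SD of the sample mean

print(round(standard_error, 4))  # 0.7559
```

Larger samples shrink the standard error, which is why means from bigger samples vary less from sample to sample.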

17. Which of the following is a characteristic of processed data?

Data is not ready for analysis

All steps should be noted

Hard to use for data analysis

None of the mentioned

Explanation

Processed data refers to information that has been organized, structured, or transformed to make it ready for analysis. It is the opposite of raw data, which is unprocessed and typically not ready for analysis, so "Data is not ready for analysis" and "Hard to use for data analysis" do not describe it. Because processing can involve merging, subsetting, and transforming the data, every step should be recorded so that the result can be reproduced. The option that best characterizes processed data is therefore "All steps should be noted" rather than "None of the mentioned".

18. Pick the lazy learning algorithm.

K-Means

CNN

KNN

RNN

Explanation

KNN stands for K-Nearest Neighbors, a lazy algorithm used for classification and regression tasks. It is called lazy because it has no real training phase: it simply stores the training data and defers all computation until a prediction is requested. It works by finding the k nearest neighbors to a given data point in the feature space and predicting the majority class or the average value of those neighbors. KNN is a non-parametric algorithm, meaning it makes no assumptions about the underlying data distribution. It is simple to implement and can be effective for small to medium-sized datasets. However, it can be computationally expensive for large datasets and may not perform well in the presence of irrelevant or noisy features.
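
A minimal KNN classifier can be sketched in a few lines. Note that there is no fitting step: all work happens when a prediction is requested. The function name `knn_predict` and the toy points are illustrative assumptions.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the stored training examples by Euclidean distance to the query.
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two made-up clusters of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2)))  # a
print(knn_predict(points, labels, (6, 5)))  # b
```

Every prediction scans the whole training set, which is the cost of laziness that the explanation refers to for large datasets.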

19. Sequential modelling is done with:

CNN

KNN

RNN

ANN

Explanation

Sequential modeling is a technique used to analyze and predict sequential data, such as time series or natural language. Recurrent Neural Networks (RNN) are particularly suitable for sequential modeling as they have a feedback loop that allows information to persist and be processed over time. Therefore, RNN is the correct answer as it is specifically designed for sequential modeling tasks. CNN (Convolutional Neural Networks) are mainly used for image and video analysis, KNN (K-Nearest Neighbors) is a non-parametric algorithm for classification and regression, and ANN (Artificial Neural Networks) is a general term that can refer to any type of neural network model.
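
The feedback loop can be illustrated with a single-unit recurrent cell, where each new hidden state depends on the current input and the previous hidden state. The weights below are arbitrary illustrative values rather than trained parameters, and `run_rnn` is a hypothetical helper.

```python
from math import tanh

def run_rnn(inputs, w_x=1.0, w_h=0.5, bias=0.0):
    """Run a one-unit recurrent cell over a sequence and return the final hidden state."""
    hidden = 0.0
    for x in inputs:
        # The previous hidden state feeds back into the next update.
        hidden = tanh(w_x * x + w_h * hidden + bias)
    return hidden

# The same values in a different order give different final states,
# because the hidden state carries information forward through time.
early_signal = run_rnn([1.0, 0.0, 0.0])
late_signal = run_rnn([0.0, 0.0, 1.0])
print(early_signal < late_signal)  # True
```

A model that ignored order, such as one that only summed the inputs, would treat both sequences identically.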

20. Which of the following properties of a random variable is a measure of spread?

Variance

Standard deviation

Empirical mean

All of the mentioned

Explanation

Standard deviation is a measure of spread for a random variable. It quantifies how far values typically fall from the mean: a higher standard deviation indicates a greater spread, while a lower standard deviation indicates a narrower spread. Variance, which is the square of the standard deviation, also measures spread, but it is expressed in squared units, so standard deviation is the more directly interpretable choice and is the answer this quiz marks as correct. The empirical mean measures the center rather than the spread, so "All of the mentioned" is not correct.
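
Both quantities can be computed with the standard `statistics` module. The sketch below uses the population versions and made-up values.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(values)  # mean squared deviation from the mean
std_dev = statistics.pstdev(values)      # square root of the variance

print(variance, std_dev)  # 4 2.0
```

The standard deviation is in the same units as the data, while the variance is in squared units.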

21. What is the order of execution of filters in Tableau? 1) Context 2) Traditional 3) Custom 4) Show Me

1,2,3,4

2,3,4,1

3,1,2,4

4,3,2,1

Explanation

The order of execution of filters in Tableau is 3) Custom, 1) Context, 2) Traditional, and 4) Show Me. This means that custom filters are applied first, followed by context filters, then traditional filters, and finally the Show Me filters.

22. Which of the following models is usually considered the gold standard for data analysis?

Inferential

Descriptive

Causal

All of the mentioned

Explanation

The causal model is usually considered the gold standard for data analysis because it establishes what happens to one variable when another is changed, which typically requires a randomized study. Inferential analysis uses statistical techniques on a sample to draw conclusions about a larger population, but it does not by itself show cause and effect. Descriptive analysis focuses on summarizing and describing the data without making predictions or inferences. Therefore, the correct answer is causal.

23. Weighted Average is used in:

Classification

Regression

Forecasting

All of the above

Explanation

Weighted average is commonly used in forecasting to calculate a weighted average of historical data. This allows for the consideration of different weights or importance assigned to each data point, based on factors such as recency or reliability. By using a weighted average, the forecast can reflect the significance of each data point and provide a more accurate prediction of future trends or values. Therefore, forecasting is a specific application where weighted average is utilized.
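
A weighted-average forecast can be sketched as follows, with heavier weights on more recent observations. The sales figures and the weights 1, 2, 3 are assumptions chosen for illustration.

```python
def weighted_average(values, weights):
    """Sum of value * weight, divided by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical sales for the last three months, oldest first.
recent_sales = [100, 110, 130]
weights = [1, 2, 3]  # the most recent month counts the most

forecast = weighted_average(recent_sales, weights)
print(round(forecast, 2))  # 118.33
```

The forecast sits closer to the latest value than the plain average of about 113.33 does, reflecting the extra weight given to recency.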

24. Which of the following is one of the key data science skills?

Statistics

Machine learning

Data visualization

All of the mentioned

Explanation

All of the mentioned options are key data science skills. Statistics provides the foundation for drawing valid conclusions from data. Machine learning uses algorithms and statistical models to let computers learn from data and make predictions or decisions, and it is widely used for tasks such as fraud detection, recommendation systems, image recognition, and natural language processing. Data visualization makes patterns in the data visible and helps communicate results. Therefore, the correct answer is "All of the mentioned".

25. Which of the following is performed by a data scientist?

Define the question

Create reproducible code

Challenge results

All of the Mentioned

Explanation

Data scientists perform all of the mentioned tasks. They define the question that the analysis is meant to answer, create reproducible code so that the analysis can be rerun and checked, and challenge results by critically evaluating the outcomes of data analysis and machine learning models. Challenging results means assessing reliability and accuracy, identifying limitations or biases, and checking whether the findings align with the initial question or hypothesis. Therefore, the correct answer is "All of the Mentioned".