Statistical Business Analysis Quiz! Hardest Trivia Questions

By Wendicai, Community Contributor
  • 1. 

    In order to perform an honest assessment on a predictive model, what is an acceptable division between training, validation, and testing data?

    • Training: 50% Validation: 0% Testing: 50%
    • Training: 100% Validation: 0% Testing: 0%
    • Training: 0% Validation: 100% Testing: 0%
    • Training: 50% Validation: 50% Testing: 0%
About This Quiz

Dive into the 'Statistical Business Analysis Quiz! Hardest Trivia Questions' to test and enhance your knowledge on ROC curves, data partitioning, logistic regression, and more. Essential for aspiring business analysts and data scientists aiming to sharpen their analytical skills.


Quiz Preview

  • 2. 

    The plots represent two models, A and B, being fit to the same two data sets, training and validation. Model A is 90.5% accurate at distinguishing blue from red on the training data and 75.5% accurate at doing the same on validation data. Model B is 83% accurate at distinguishing blue from red on the training data and 78.3% accurate at doing the same on the validation data. Which of the two models should be selected and why?

    • Model A. It is more complex with a higher accuracy than model B on training data.

    • Model A. It performs better on the boundary for the training data.

    • Model B. It is more complex with a higher accuracy than model A on validation data.

    • Model B. It is simpler with a higher accuracy than model A on validation data.

    Correct Answer
    A. Model B. It is simpler with a higher accuracy than model A on validation data.
    Explanation
    Model B should be selected because it has a higher accuracy than model A on the validation data. Additionally, model B is simpler, which suggests that it may be more robust and less prone to overfitting.


  • 3. 

    This question will ask you to provide a missing option. Complete the following syntax to test the homogeneity of variance assumption in the GLM procedure: Means Region / <insert option here> =levene;

    Correct Answer
    hovtest
    Explanation
    The missing option is "hovtest". On the MEANS statement of the GLM procedure, HOVTEST=LEVENE requests Levene's test of the homogeneity of variance assumption: it computes a test statistic for whether the group variances differ significantly. Completing the syntax as means Region / hovtest=levene; runs this test as part of the GLM analysis.
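
    For reference, a minimal sketch of the full syntax (data set and variable names hypothetical):

        proc glm data=sales;
           class Region;
           model Amount = Region;              /* one-way ANOVA on Region */
           means Region / hovtest=levene;      /* Levene's test of equal variances */
        run;
        quit;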


  • 4. 

    Based on the control plot, which conclusion is justified regarding the means of the response?

    • All groups are significantly different from each other.

    • 2XL is significantly different from all other groups.

    • Only XL and 2XL are not significantly different from each other.

    • No groups are significantly different from each other.

    Correct Answer
    A. Only XL and 2XL are not significantly different from each other.
    Explanation
    The comparison plot indicates a significant difference for every pairwise comparison except XL versus 2XL. Since that is the only comparison consistent with equal means, the justified conclusion is that only XL and 2XL are not significantly different from each other.


  • 5. 

    An analyst has selected this model as a champion because it shows better model fit than a competing model with more predictors. Which statistic justifies this rationale?

    • R-Square

    • Coeff Var

    • Adj R-Sq

    • Error DF

    Correct Answer
    A. Adj R-Sq
    Explanation
    The Adjusted R-Squared statistic justifies this rationale because it takes into account the number of predictors in the model. It adjusts the R-Squared value by penalizing for the inclusion of additional predictors that may not significantly contribute to the model fit. Therefore, a higher Adjusted R-Squared value indicates a better model fit, even when compared to a model with more predictors.
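
    For reference, the adjusted R-square applies its penalty through the standard formula, where n is the sample size and p is the number of predictors:

        Adj R-Sq = 1 - (1 - R-Sq) * (n - 1) / (n - p - 1)

    Adding a predictor always raises R-Sq, but it raises Adj R-Sq only if the gain outweighs the penalty for the extra parameter.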


  • 6. 

    What is the total number of the sample size? 

    Correct Answer
    100
    Explanation
    The total number of the sample size is 100, as stated in the answer. This means that there were 100 participants or observations in the sample.


  • 7. 

    Which SAS program will detect collinearity in a multiple regression application?

    • A

    • B

    • C

    • D

    Correct Answer
    A. B
    Explanation
    The program listings are not reproduced here, but collinearity in a multiple regression is detected in the REG procedure by requesting diagnostics such as VIF (variance inflation factors) and COLLIN (condition indices) on the MODEL statement; program B is the one that requests these options. Collinearity arises when two or more predictor variables are highly correlated, which inflates the variance of the coefficient estimates and makes them unstable and unreliable.
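
    A sketch of how such diagnostics are typically requested (data set and variable names hypothetical):

        proc reg data=fitness;
           /* VIF prints variance inflation factors; COLLIN prints
              eigenvalues and condition indices */
           model Oxygen = Age Weight RunTime RunPulse / vif collin;
        run;
        quit;

    Variance inflation factors above about 10 are a common rule of thumb for problematic collinearity.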


  • 8. 

    An analyst fits a logistic regression model to predict whether or not a client will default on a loan. One of the predictors in the model is agent, and each agent serves 15-20 clients each. The model fails to converge. The analyst prints the summarized data, showing the number of defaulted loans per agent. See the partial output below: What is the most likely reason that the model fails to converge?

    • There is quasi-complete separation in the data.

    • There is collinearity among the predictors.

    • There are missing values in the data.

    • There are too many observations in the data.

    Correct Answer
    A. There is quasi-complete separation in the data.
    Explanation
    The most likely reason is quasi-complete separation. With only 15-20 clients per agent, the summarized output likely shows agents whose clients either all defaulted or none defaulted; those levels of the agent variable (almost) perfectly predict the outcome. Under separation, the maximum likelihood estimates do not exist (the parameter estimates drift toward infinity), so the model fails to converge.


  • 9. 

    An analyst investigates Region (A, B, or C) as an input variable in a logistic regression model. The analyst discovers that the probability of purchasing a certain item when Region = A is 1. What problem does this illustrate?

    • Collinearity

    • Influential observations

    • Quasi-complete separation

    • Problems that arise due to missing values

    Correct Answer
    A. Quasi-complete separation
    Explanation
    Quasi-complete separation occurs when a predictor variable perfectly predicts the outcome variable, resulting in extreme coefficients and standard errors in logistic regression. In this case, the probability of purchasing the item is 1 when Region = A, indicating a perfect separation between the predictor and the outcome variable. This can lead to convergence issues in the logistic regression model and make it difficult to estimate accurate coefficients and standard errors.


  • 10. 

    The Intercept estimate is interpreted as:

    • The predicted value of the response when all the predictors are at their current values.

    • The predicted value of the response when all predictors are at their means.

    • The predicted value of the response when all predictors = 0.

    • The predicted value of the response when all predictors are at their minimum values.

    Correct Answer
    A. The predicted value of the response when all predictors = 0.
    Explanation
    The intercept estimate is the predicted value of the response when every predictor equals 0. It is the model's baseline value; whether it is practically meaningful depends on whether 0 is a plausible value for the predictors.


  • 11. 

    Refer to the ROC curve: as you move along the curve, what changes?

    • The priors in the population

    • The true negative rate in the population

    • The proportion of events in the training data

    • The probability cutoff for scoring

    Correct Answer
    A. The probability cutoff for scoring
    Explanation
    As you move along the ROC curve, the probability cutoff for scoring changes. The ROC curve is a graphical representation of the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) for different probability cutoffs. As the cutoff changes, the sensitivity and specificity values also change, causing the points on the ROC curve to shift. The probability cutoff determines the threshold for classifying an observation as positive or negative, and adjusting it can affect the balance between correctly identifying true positives and minimizing false positives.
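
    A sketch of how the curve can be generated, with each row of the OUTROC= data set corresponding to one probability cutoff (data set and variable names hypothetical):

        proc logistic data=train plots(only)=roc;
           model Purchase(event='1') = Income Age / outroc=rocdata;
        run;
        /* rocdata contains _PROB_ (the cutoff), _SENSIT_ (sensitivity),
           and _1MSPEC_ (1 - specificity) for each point on the curve */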


  • 12. 

    In partitioning data for model assessment, which sampling methods are acceptable?

    • Simple random sampling without replacement

    • Simple random sampling with replacement

    • Stratified random sampling without replacement

    • Sequential random sampling with replacement

    Correct Answer(s)
    A. Simple random sampling without replacement
    A. Stratified random sampling without replacement
    Explanation
    In partitioning data for model assessment, both simple random sampling without replacement and stratified random sampling without replacement are acceptable methods. Simple random sampling without replacement involves randomly selecting a subset of data without replacement, ensuring that each sample is unique. Stratified random sampling without replacement involves dividing the data into distinct groups or strata based on certain characteristics and then randomly selecting samples from each stratum without replacement. Sequential random sampling with replacement and simple random sampling with replacement are not acceptable methods for partitioning data for model assessment.


  • 13. 

    A confusion matrix is created for data that were oversampled due to a rare target. What values are not affected by this oversampling?

    • Sensitivity and PV+

    • Specificity and PV-

    • PV+ and PV-

    • Sensitivity and Specificity

    Correct Answer
    A. Sensitivity and Specificity
    Explanation
    Sensitivity and specificity are conditioned on the actual outcome: sensitivity is computed only among true events and specificity only among true non-events, so changing the event/non-event mix through oversampling leaves them unchanged. PV+ and PV-, in contrast, depend on the prevalence of events and are therefore distorted by oversampling. For example, a sensitivity of 80% stays 80% whether events make up 1% or 50% of the sample, but PV+ rises sharply as prevalence rises.


  • 14. 

    Which statement is correct at an alpha level of 0.05?

    • School*Gender should be removed because it is non-significant.

    • Gender should be removed because it is non-significant.

    • School should be removed because it is significant.

    • Gender should not be removed due to its involvement in the significant interaction.

    Correct Answer
    A. Gender should not be removed due to its involvement in the significant interaction.
    Explanation
    Because the School*Gender interaction is significant at the 0.05 level, Gender must be retained even if its own main-effect test is non-significant: removing a variable that participates in a significant interaction violates model hierarchy and makes the interaction term uninterpretable.


  • 15. 

    An analyst knows that the categorical predictor store_id is an important predictor of the target. However, store_id has too many levels to be a feasible predictor in the model. The analyst wants to combine stores and treat them as members of the same class level. What are the two most effective ways to address the problem? (Choose two.)

    • Eliminate store_id as a predictor in the model because it has too many levels to be feasible.

    • Cluster by using Greenacre's method to combine stores that are similar.

    • Use subject matter expertise to combine stores that are similar.

    • Randomly combine the stores into five groups to keep the stochastic variation among the observations intact.

    Correct Answer(s)
    A. Cluster by using Greenacre's method to combine stores that are similar.
    A. Use subject matter expertise to combine stores that are similar.
    Explanation
    The two most effective ways to address the problem of having too many levels for the storeId predictor are to cluster the stores using Greenacre's method to combine similar stores and to use subject matter expertise to combine similar stores. Clustering allows for grouping of stores based on similarities, while subject matter expertise allows for a more nuanced understanding of the stores and their similarities. Both approaches help to reduce the number of levels for the storeId predictor, making it more feasible for the model.


  • 16. 

    Including redundant input variables in a regression model can:

    • Stabilize parameter estimates and increase the risk of overfitting.

    • Destabilize parameter estimates and increase the risk of overfitting.

    • Stabilize parameter estimates and decrease the risk of overfitting.

    • Destabilize parameter estimates and decrease the risk of overfitting.

    Correct Answer
    A. Destabilize parameter estimates and increase the risk of overfitting.
    Explanation
    Including redundant input variables in a regression model can destabilize parameter estimates because these variables do not contribute any additional information to the model. This can lead to unstable and unreliable estimates of the parameters. Additionally, including redundant variables increases the risk of overfitting, where the model becomes too complex and fits the noise in the data rather than the underlying relationship. Overfitting can result in poor generalization to new data and decreased model performance.


  • 17. 

    An analyst examined logistic regression models for predicting whether a customer would make a purchase. The ROC curve displayed summarizes the models. Using the selected model and the analyst's decision rule, 25% of the customers who did not make a purchase are incorrectly classified as purchasers. What can be concluded from the graph?

    • About 25% of the customers who did make a purchase are correctly classified as making a purchase.

    • About 50% of the customers who did make a purchase are correctly classified as making a purchase.

    • About 85% of the customers who did make a purchase are correctly classified as making a purchase.

    • About 95% of the customers who did make a purchase are correctly classified as making a purchase.

    Correct Answer
    A. About 85% of the customers who did make a purchase are correctly classified as making a purchase.
    Explanation
    Misclassifying 25% of non-purchasers as purchasers means the decision rule operates at 1 - specificity = 0.25 on the horizontal axis of the ROC plot. Reading the selected model's curve at that point gives a sensitivity of about 0.85, so about 85% of the customers who did make a purchase are correctly classified as making a purchase.


  • 18. 

    A marketing manager attempts to determine those customers most likely to purchase additional products as the result of a nation-wide marketing campaign. The manager possesses a historical dataset (CAMPAIGN) of a similar campaign from last year with the following characteristics: target variable Respond (0,1), continuous predictor Income, and categorical predictor Homeowner (Y,N). Which SAS program performs this analysis?

    • A

    • B

    • C

    • D

    Correct Answer
    A. A
    Explanation
    The correct answer is A because the marketing manager wants to determine the customers most likely to purchase additional products, which suggests a predictive modeling problem. The manager has a historical dataset with a target variable (Respond) and predictor variables (Income and Homeowner). To perform this analysis, the manager would need to use a predictive modeling technique, such as logistic regression or decision tree, which can be implemented using SAS programming.


  • 19. 

    Which of the following describes a concordant pair of observations in the LOGISTIC procedure?

    • An observation with the event has an equal probability as another observation with the event.

    • An observation with the event has a lower predicted probability than the observation without the event.

    • An observation with the event has an equal predicted probability as the observation without the event.

    • An observation with the event has a higher predicted probability than the observation without the event

    Correct Answer
    A. An observation with the event has a higher predicted probability than the observation without the event
    Explanation
    In the LOGISTIC procedure, a concordant pair of observations refers to a situation where an observation with the event (outcome of interest) has a higher predicted probability than an observation without the event. This implies that the model is correctly predicting the occurrence of the event, as the observation with the event has a higher likelihood of experiencing it compared to the observation without the event.


  • 20. 

      What is the default method in the LOGISTIC procedure to handle observations with missing data?

    • Missing values are imputed.

    • Parameters are estimated accounting for the missing values.

    • Parameter estimates are made on all available data.

    • Only cases with variables that are fully populated are used.

    Correct Answer
    A. Only cases with variables that are fully populated are used.
    Explanation
    The default method in the LOGISTIC procedure to handle observations with missing data is to only use cases with variables that are fully populated. This means that any observation with missing data will be excluded from the analysis.


  • 21. 

    A company has branch offices in eight regions. Customers within each region are classified as either "High Value" or "Medium Value" and are coded using the variable name VALUE. In the last year, the total amount of purchases per customer is used as the response variable. Suppose there is a significant interaction between REGION and VALUE. What can you conclude?

    • More high value customers are found in some regions than others.

    • The difference between average purchases for medium and high value customers depends on the region.

    • Regions with higher average purchases have more high value customers.

    • Regions with higher average purchases have more medium value customers.

    Correct Answer
    A. The difference between average purchases for medium and high value customers depends on the region.
    Explanation
    The significant interaction between REGION and VALUE indicates that the relationship between the average purchases for medium and high value customers varies depending on the region. This suggests that the impact of customer value on purchases differs across different regions, indicating that the regional factor plays a role in influencing customer behavior and purchase patterns.


  • 22. 

    Given alpha=0.02, which conclusion is justified regarding percentage of body fat, comparing small (S), medium (M), and large (L) wrist sizes?

    • Medium wrist size is significantly different than small wrist size.

    • Large wrist size is significantly different than medium wrist size.

    • Large wrist size is significantly different than small wrist size.

    • There is no significant difference due to wrist size.

    Correct Answer
    A. Large wrist size is significantly different than small wrist size.
    Explanation
    At alpha = 0.02, the pairwise comparison output shows a p-value below 0.02 only for the large versus small comparison; the p-values for the other pairs exceed 0.02. The justified conclusion is therefore that large wrist size is significantly different from small wrist size.


  • 23. 

    The standard form of a linear regression is: Y = beta0 + beta1*X + error. Which statement best summarizes the assumptions placed on the errors?

    • The errors are correlated, normally distributed with constant mean and zero variance.

    • The errors are correlated, normally distributed with zero mean and constant variance.

    • The errors are independent, normally distributed with constant mean and zero variance.

    • The errors are independent, normally distributed with zero mean and constant variance.

    Correct Answer
    A. The errors are independent, normally distributed with zero mean and constant variance.
    Explanation
    The assumption placed on the errors in linear regression is that they are independent, normally distributed with zero mean and constant variance. This means that the errors are not correlated with each other, they follow a normal distribution, their average value is zero, and their variance remains constant across all levels of the predictor variable.


  • 24. 

    A linear model has the following characteristics: *A dependent variable (y) *One continuous predictor variable (x1), including a quadratic term (x1**2) *One categorical predictor variable (d, with 3 levels) and an interaction term (d by x1) How many parameters, including the intercept, are associated with this model? Enter your numeric answer in the space below. Do not add leading or trailing spaces to your answer.

    Correct Answer
    7
    Explanation
    Count the parameters: the intercept (1), x1 (1), the quadratic term x1**2 (1), the three-level categorical predictor d (2 dummy parameters), and the d-by-x1 interaction (2 parameters), giving 1 + 1 + 1 + 2 + 2 = 7.
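
    A sketch of the corresponding MODEL statement and the count (data set name hypothetical):

        proc glm data=example;
           class d;
           model y = x1 x1*x1 d x1*d / solution;
        run;
        quit;
        /* intercept (1) + x1 (1) + x1*x1 (1)
           + d with 3 levels (2) + d*x1 (2)  =  7 parameters */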


  • 25. 

    An analyst has a sufficient volume of data to perform a 3-way partition of the data into training, validation, and test sets to perform honest assessment during the model building process. What is the purpose of the test data set?

    • To provide an unbiased measure of assessment for the final model.

    • To compare models and select and fine-tune the final model.

    • To reduce total sample size to make computations more efficient.

    • To build the predictive models.

    Correct Answer
    A. To provide an unbiased measure of assessment for the final model.
    Explanation
    The purpose of the test data set is to provide an unbiased measure of assessment for the final model. This means that the test data set is used to evaluate the performance of the model that has been built using the training and validation data sets. By using a separate test data set, it ensures that the evaluation of the model is not influenced by the data that was used to build and fine-tune the model. This helps to provide a more accurate assessment of how the model will perform on new, unseen data.


  • 26. 

    What is a drawback to performing data cleansing (imputation, transformations, etc.) on raw data prior to partitioning the data for honest assessment as opposed to performing the data cleansing after partitioning the data?

    • It violates assumptions of the model.

    • It requires extra computational effort and time.

    • It omits the training (and test) data sets from the benefits of the cleansing methods.

    • There is no ability to compare the effectiveness of different cleansing methods.

    Correct Answer
    A. There is no ability to compare the effectiveness of different cleansing methods.
    Explanation
    Performing data cleansing on raw data prior to partitioning the data for honest assessment means that the cleansing methods cannot be compared for their effectiveness. This is because the data is already cleaned before it is divided into training and test sets, so there is no way to evaluate how well each cleansing method performs on different subsets of the data. This drawback limits the ability to make informed decisions about which cleansing methods are most effective for improving the quality of the data.


  • 27. 

    The box plot was used to analyze daily sales data following three different ad campaigns. The business analyst concludes that one of the assumptions of ANOVA was violated. Which assumption has been violated and why?

    • Normality, because Prob > F < .0001.

    • Normality, because the interquartile ranges are different in different ad campaigns.

    • Constant variance, because Prob > F < .0001.

    • Constant variance, because the interquartile ranges are different in different ad campaigns.

    Correct Answer
    A. Constant variance, because the interquartile ranges are different in different ad campaigns.
    Explanation
    The correct answer is constant variance because the interquartile ranges are different in different ad campaigns. In ANOVA, one of the assumptions is that the variances of the groups being compared are equal. The interquartile range is a measure of dispersion, and if the interquartile ranges are different between the ad campaigns, it suggests that the variances are not equal. Therefore, the assumption of constant variance is violated.


  • 28. 

    A non-contributing predictor variable (Pr > |t| =0.658) is added to an existing multiple linear regression model. What will be the result?

    • An increase in R-Square

    • A decrease in R-Square

    • A decrease in Mean Square Error

    • No change in R-Square

    Correct Answer
    A. An increase in R-Square
    Explanation
    Adding a predictor to a linear regression model can never decrease R-Square: even a non-contributing predictor (Pr > |t| = 0.658) absorbs some in-sample variation, so R-Square increases slightly. Adjusted R-Square, which penalizes the extra parameter, would typically decrease, which is why it is preferred for comparing models of different sizes.


  • 29. 

     Which SAS program will correctly use backward elimination selection criterion within the REG procedure?

    • A

    • B

    • C

    • D

    Correct Answer
    A. B
    Explanation
    The program listings are not reproduced here, but backward elimination in the REG procedure is requested with the SELECTION=BACKWARD option on the MODEL statement (optionally with SLSTAY= to set the significance level a variable must meet to stay); program B is the one that specifies this correctly. Backward elimination starts from the full model and removes the least significant variable one at a time until every remaining variable meets the stay criterion.
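
    A sketch of the relevant syntax (data set and variable names hypothetical):

        proc reg data=fitness;
           model Oxygen = Age Weight RunTime RunPulse RestPulse MaxPulse
                 / selection=backward slstay=0.05;
        run;
        quit;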


  • 30. 

    When mean imputation is performed on data after the data is partitioned for an honest assessment, what is the most appropriate method for handling the mean imputation?

    • The sample means from the validation data set are applied to the training and test data sets.

    • The sample means from the training data set are applied to the validation and test data sets.

    • The sample means from the test data set are applied to the training and validation data sets.

    • The sample means from each partition of the data are applied to their own partition.

    Correct Answer
    A. The sample means from the training data set are applied to the validation and test data sets.
    Explanation
    When mean imputation is performed on data after the data is partitioned for an honest assessment, the most appropriate method for handling the mean imputation is to apply the sample means from the training data set to the validation and test data sets. This ensures that the imputation is based on the patterns and characteristics of the training data, which is the most accurate representation of the overall dataset. Applying the sample means from the validation or test data sets to the other sets would introduce bias and potentially distort the assessment of the model's performance.
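
    One way to carry this out is with PROC STDIZE, computing the means on the training data and then reusing them; a sketch (data set and variable names hypothetical):

        /* compute training means and replace missing values in training data */
        proc stdize data=train reponly method=mean
                    out=train_imp outstat=train_means;
           var income age;
        run;

        /* apply the stored training means to the validation data */
        proc stdize data=valid reponly method=in(train_means) out=valid_imp;
           var income age;
        run;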


  • 31. 

    What does the reference line at lift = 1 correspond to?

    • The predicted lift for the best 50% of validation data cases

    • The predicted lift if the entire population is scored as event cases

    • The predicted lift if none of the population are scored as event cases

    • The predicted lift if 50% of the population are randomly scored as event cases

    Correct Answer
    A. The predicted lift if the entire population is scored as event cases
    Explanation
    The reference line at lift = 1 corresponds to the predicted lift if the entire population is scored as event cases. This means that if all individuals in the population are classified as event cases, the lift will be equal to 1. Lift is a measure of the effectiveness of a predictive model in identifying the target variable, and a lift value of 1 indicates that the model is not performing better than random chance.


  • 32. 

    Based upon the comparative ROC plot for two competing models, which is the champion model and why?

    • Candidate 1, because the area outside the curve is greater

    • Candidate 2, because the area outside the curve is greater

    • Candidate 1, because it is closer to the diagonal reference curve

    • Candidate 2, because it shows less over fit than Candidate 1

    Correct Answer
    A. Candidate 2, because the area outside the curve is greater
    Explanation
    Candidate 2 is the champion model because its ROC curve bows further away from the diagonal reference line, giving a greater area under the curve. A larger area means a higher true positive rate at any given false positive rate, i.e., better discrimination between events and non-events, so Candidate 2 is the better model on the comparative ROC plot.


  • 33. 

    Customers were surveyed to assess their intent to purchase a product. An analyst divided the customers into groups defined by the company's pre-assigned market segments and tested for difference in the customers' average intent to purchase. The following is the output from the GLM procedure: What percentage of customers' intent to purchase is explained by market segment?

    • 35%

    • 65%

    • 76%

    Correct Answer
    A. 76%
    Explanation
    The 76% is the R-Square in the GLM output: the proportion of the total variation in customers' intent to purchase that is explained by the market segment grouping. An R-Square of 0.76 indicates that market segment accounts for most of the variation in intent to purchase.


  • 34. 

    Spearman statistics in the CORR procedure are useful for screening for irrelevant variables by investigating the association between which function of the input variables?

    • Concordant and discordant pairs of ranked observations

    • Logit link (log(p/1-p))

    • Rank-ordered values of the variables

    • Weighted sum of chi-square statistics for 2x2 tables

    Correct Answer
    A. Rank-ordered values of the variables
    Explanation
    The Spearman statistics in the CORR procedure are useful for screening for irrelevant variables by investigating the association between rank-ordered values of the variables. This means that the Spearman statistics measure the strength and direction of the monotonic relationship between variables, which can help identify variables that are not relevant or do not have a significant impact on the outcome. By ranking the values of the variables and calculating the Spearman correlation coefficient, researchers can determine if there is a consistent pattern or trend between the variables, which can inform further analysis and variable selection.


  • 35. 

    Which of the following describes a discordant pair of observations in the LOGISTIC procedure?

    • An observation with the event has an equal probability as another observation with the event.

    • An observation with the event has a lower predicted probability than the observation without the event.

    • An observation with the event has an equal predicted probability as the observation without the event.

    • An observation with the event has a higher predicted probability than the observation without the event

    Correct Answer
    A. An observation with the event has a lower predicted probability than the observation without the event.
    Explanation
    A discordant pair is one that the model rank-orders incorrectly: the observation with the event receives a lower predicted probability than the observation without the event. The percentage of discordant pairs reported by the LOGISTIC procedure therefore measures how often the model mis-ranks events relative to non-events.


  • 36. 

    Screening for non-linearity in binary logistic regression can be achieved by visualizing:

    • A scatter plot of binary response versus a predictor variable.

    • A trend plot of empirical logit versus a predictor variable.

    • A logistic regression plot of predicted probability values versus a predictor variable.

    • A box plot of the odds ratio values versus a predictor variable.

    Correct Answer
    A. A trend plot of empirical logit versus a predictor variable.
    Explanation
    A trend plot of empirical logit versus a predictor variable can be used to screen for non-linearity in binary logistic regression. This plot helps to visualize the relationship between the predictor variable and the log odds of the binary response. If the trend plot shows a non-linear pattern, it suggests that there may be a non-linear relationship between the predictor variable and the log odds, indicating the need for non-linear modeling techniques or transformations of the predictor variable.
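
    A sketch of one common way to build such a plot: bin the predictor into deciles, compute the smoothed empirical logit per bin, and plot it (data set and variable names hypothetical):

        proc rank data=train groups=10 out=ranked;
           var income;
           ranks bin;                         /* decile bin of the predictor */
        run;

        proc means data=ranked noprint nway;
           class bin;
           var default income;
           output out=bins sum(default)=events n(default)=n mean(income)=mid;
        run;

        data bins;
           set bins;
           elogit = log((events + 0.5) / (n - events + 0.5));  /* smoothed logit */
        run;

        proc sgplot data=bins;
           series x=mid y=elogit / markers;   /* curvature suggests non-linearity */
        run;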


  • 37. 

    A predictive model uses a data set that has several variables with missing values. What two problems can arise with this model? (Choose two.)

    • The model will likely be overfit.

    • There will be a high rate of collinearity among input variables.

    • Complete case analysis means that fewer observations will be used in the model building process.

    • New cases with missing values on input variables cannot be scored without extra data processing.

    Correct Answer(s)
    A. Complete case analysis means that fewer observations will be used in the model building process.
    A. New cases with missing values on input variables cannot be scored without extra data processing.
    Explanation
    A predictive model that uses a dataset with missing values can lead to two problems. First, complete case analysis means that fewer observations will be used in the model building process. This can result in a loss of valuable information and potential bias in the model. Second, new cases with missing values on input variables cannot be scored without extra data processing. This means that the model may not be able to make accurate predictions for new data points that have missing values, requiring additional steps to handle missing data before scoring.


  • 38. 

    Which statistic indicates a better model when it gets larger? 

    • Adjusted R Square

    • Mallows' Cp

    • Chi Square

    • Average Squared Error

    Correct Answer
    A. Adjusted R Square
    Explanation
    Adjusted R Square is a statistic that indicates the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. As Adjusted R Square increases, it suggests that the model is better at explaining the variation in the dependent variable. Therefore, a larger value of Adjusted R Square indicates a better model.


  • 39. 

    SAS output from the RSQUARE selection method, within the REG procedure, is shown. The top two models in each subset are given. Based on the AIC statistic, which model is the champion model?

    • Age Weight RunTime RunPulse MaxPulse

    • Age Weight RunTime RunPulse RestPulse MaxPulse

    • RestPulse

    • RunTime

    Correct Answer
    A. Age Weight RunTime RunPulse MaxPulse
    Explanation
    The champion model is the one with the smallest AIC. Among the candidates listed in the RSQUARE output, the five-variable model Age Weight RunTime RunPulse MaxPulse has the lowest AIC, so it is selected; adding RestPulse produces a larger AIC rather than a smaller one.


  • 40. 

    Suppose training data are oversampled in the event group to make the number of events and nonevents roughly equal. Logistic regression is run and the probabilities are output to a data set NEW and given the variable name PE. A decision rule considered is, "Classify data as an event if the probability is greater than 0.5." Also, the data set NEW contains a variable TG that indicates whether there is an event (1=Event, 0= No event). The following SAS program was used. What does this program calculate?

    • Depth

    • Sensitivity

    • Specificity

    • Positive predictive value

    Correct Answer
    A. Sensitivity
    Explanation
    The given SAS program calculates the sensitivity. Sensitivity is a measure of the proportion of actual positive cases that are correctly identified as positive by the model. In this case, the program uses logistic regression to calculate probabilities of events and nonevents, and then applies a decision rule to classify data as an event if the probability is greater than 0.5. Sensitivity measures how well the model identifies the actual positive cases correctly.
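
    The program itself is not reproduced here, but a sketch of a program that computes sensitivity from NEW (variable names as described in the question):

        data check;
           set new;
           where tg = 1;                 /* restrict to actual events */
           event_flag = (pe > 0.5);      /* decision rule: classify as event */
        run;

        proc means data=check mean;
           var event_flag;               /* mean = proportion of actual events
                                            classified as events = sensitivity */
        run;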


  • 41. 

    There are variable clusters among the input variables in a regression application. Which SAS procedure provides a viable solution?

    Correct Answer
    varclus
    Explanation
    The varclus procedure in SAS provides a viable solution for regression applications with variable clusters in the input variables. This procedure helps in identifying groups of variables that have high correlation with each other, allowing for the creation of clusters. By using varclus, researchers can reduce the dimensionality of their data and select representative variables from each cluster, which can then be used in regression analysis.
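
    A sketch of the call (data set and variable names hypothetical):

        proc varclus data=train maxeigen=0.7 short;
           var x1-x20;        /* candidate inputs to be clustered */
        run;

    A representative variable from each cluster (for example, the one with the lowest 1 - R**2 ratio) can then be carried into the regression.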


  • 42. 

    Excluding redundant input variables in a regression model can:

    • Stabilize parameter estimates and increase the risk of overfitting.

    • Destabilize parameter estimates and increase the risk of overfitting.

    • Stabilize parameter estimates and decrease the risk of overfitting.

    • Destabilize parameter estimates and decrease the risk of overfitting.

    Correct Answer
    A. Stabilize parameter estimates and decrease the risk of overfitting.
    Explanation
    Excluding redundant input variables in a regression model can stabilize parameter estimates by removing unnecessary variables that do not contribute significantly to the model's prediction. This helps to reduce the variability in the estimated coefficients, making them more reliable. Additionally, removing redundant variables can decrease the risk of overfitting, which occurs when a model fits the training data too closely and performs poorly on new, unseen data. By simplifying the model and focusing on the most relevant variables, the risk of overfitting is reduced, leading to better generalization and predictive accuracy.


  • 43. 

    Which SAS program will divide the original data set into 60% training and 40% validation data sets, stratified by county?

    • Proc surveryselect data=SASUSER.DATABASE samprate=0.6 out=sample; strata country; run;

    • Proc sort data=SASUSER.DATABASE; by county; run; proc surveyselect data=SASUSER.DATABASE samprate=0.6 out=sample outall; run;

    • Proc sort data=SASUSER.DATABASE; by county; run; proc surveyselect data=SASUSER.DATABASE samprate=0.6 out=sample outall; strata county; run;

    • Proc sort data=SASUSER.DATABASE; by county; run; proc surveyselect data=SASUSER.DATABASE samprate=0.6 out=sample; strata county; eun;

    Correct Answer
    A. Proc sort data=SASUSER.DATABASE; by county; run; proc surveyselect data=SASUSER.DATABASE samprate=0.6 out=sample outall; strata county; run;
    Explanation
    The correct answer is the last option. This is because it first sorts the data set by county using the PROC SORT step. Then, it uses the PROC SURVEYSELECT step to divide the data set into a training set and a validation set, with a sampling rate of 0.6 (60%). The stratification is done by county, ensuring that the two sets have a proportional representation of each county in the original data set. The OUT= option is used to create a new data set named "sample" that contains the selected observations. The OUTALL option is used to include all observations in the output data set, not just the selected ones.
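
    The correct program, cleaned up as a runnable sketch:

        proc sort data=sasuser.database;
           by county;
        run;

        proc surveyselect data=sasuser.database samprate=0.6
                          out=sample outall;
           strata county;
        run;
        /* OUTALL keeps every observation with a Selected flag:
           Selected=1 rows form the 60% training set,
           Selected=0 rows form the 40% validation set */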


  • 44. 

    The total modeling data has been split into training, validation, and test data. What is the best data to use for model assessment?

    • Training data

    • Total data

    • Test data

    • Validation data

    Correct Answer
    A. Validation data
    Explanation
    The best data to use for model assessment is the validation data. This is because the validation data is not used during the training process and is therefore unseen by the model. It serves as an unbiased measure of the model's performance and can be used to fine-tune the model's hyperparameters or compare different models. The test data, on the other hand, should be kept separate and only used at the very end to provide a final evaluation of the model's performance.


  • 45. 

    • The association between the continuous predictor and the binary response is quadratic.

    • The association between the continuous predictor and the log-odds is quadratic.

    • The association between the continuous predictor and the continuous response is quadratic.

    • The association between the binary predictor and the log-odds is quadratic.

    Correct Answer
    A. The association between the continuous predictor and the log-odds is quadratic.
    Explanation
    A quadratic association means the log-odds are modeled as a function of both the continuous predictor and its square, so the relationship between the predictor and the log-odds is curved rather than linear. The plot referenced by the question shows such curvature in the empirical logits, which is what justifies the quadratic term.


  • 46. 

    At a depth of 0.1, Lift=3.14. What does this mean?

    • Selecting the top 10% of the population scored by the model should result in 3.14 times more events than a random draw of 10%.

    • Selecting the observations with a response probability of at least 10% should result in 3.14 times more events than a random draw of 10%.

    • Selecting the top 10% of the population scored by the model should result in 3.14 times greater accuracy than a random draw of 10%.

    • Selecting the observations with a response probability of at least 10% should result in 3.14 times greater accuracy than a random draw of 10%.

    Correct Answer
    A. Selecting the top 10% of the population scored by the model should result in 3.14 times more events than a random draw of 10%.
    Explanation
    Lift at depth 0.1 compares the event rate in the top 10% of cases, ranked by model score, with the overall event rate. A lift of 3.14 means the top decile contains 3.14 times more events than a random 10% draw; for example, if the overall response rate is 5%, the top decile's response rate would be about 15.7%.


  • 47. 

    Given the following SAS dataset TEST, in which the variable Inc_Group takes the values 1, 2, 3, 4, 5: which SAS program is NOT a correct way to create dummy variables?

    • Option A

    • Option B

    • Option C

    • Option D

    Correct Answer
    A. Option D
    Explanation
    Option D is the program that fails to create the dummy variables correctly. The listings are not reproduced here, but dummy coding a five-level variable such as Inc_Group requires one indicator variable per non-reference level (four indicators for five levels); options A, B, and C accomplish this, while option D does not.


  • 48. 

    Identify the correct SAS program for fitting a multiple linear regression model with dependent variable (y) and four predictor variables (x1-x4).

    • A

    • B

    • C

    • D

    Correct Answer
    A. B
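
    The program listings are not shown here, but a standard PROC REG fit for this model looks like the following sketch (data set name hypothetical):

        proc reg data=mydata;
           model y = x1 x2 x3 x4;   /* equivalently: model y = x1-x4; */
        run;
        quit;
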
  • 49. 

    The selection criterion used in the forward selection method in the REG procedure is:

    Correct Answer
    SLE
    Explanation
    The selection criterion used in the forward selection method in the REG procedure is SLE, which stands for significance level of entry. This criterion is used to determine the level of significance at which a predictor variable should be included in the regression model. The forward selection method starts with an empty model and iteratively adds variables that have the lowest p-values (significance levels) until no more variables meet the predetermined significance level for entry. This helps to identify the most significant predictor variables for the regression model.
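
    A sketch showing where the criterion is specified (data set and variable names hypothetical):

        proc reg data=mydata;
           /* SLENTRY= (alias SLE=) sets the significance level a variable
              must meet to enter the model during forward selection */
           model y = x1-x10 / selection=forward slentry=0.05;
        run;
        quit;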


Quiz Review Timeline (Updated): Mar 21, 2023


  • Current Version
  • Mar 21, 2023
    Quiz Edited by
    ProProfs Editorial Team
  • Sep 15, 2014
    Quiz Created by
    Wendicai