# Statistical Business Analysis Quiz! Hardest Trivia Questions

Approved & Edited by ProProfs Editorial Team
By Wendicai, Community Contributor
Questions: 72 | Attempts: 1,291

• 1.

### Refer to the ROC curve: As you move along the curve, what changes?

• A.

The priors in the population

• B.

The true negative rate in the population

• C.

The proportion of events in the training data

• D.

The probability cutoff for scoring

D. The probability cutoff for scoring
Explanation
As you move along the ROC curve, the probability cutoff for scoring changes. The ROC curve is a graphical representation of the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) for different probability cutoffs. As the cutoff changes, the sensitivity and specificity values also change, causing the points on the ROC curve to shift. The probability cutoff determines the threshold for classifying an observation as positive or negative, and adjusting it can affect the balance between correctly identifying true positives and minimizing false positives.
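This trade-off can be sketched in plain Python (rather than SAS) with hypothetical scores and labels; each cutoff yields one (1 - specificity, sensitivity) point, so moving the cutoff moves you along the curve:

```python
# Toy example: each probability cutoff produces one point on the ROC curve.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # model probabilities
labels = [1,   1,   0,   1,   0,   1,   0,   0]    # 1 = event

def roc_point(cutoff):
    """Return (1 - specificity, sensitivity) at the given cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos

# Lowering the cutoff raises both sensitivity and the false positive rate.
curve = [roc_point(c) for c in (0.85, 0.5, 0.15)]
print(curve)
```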

• 2.

### When mean imputation is performed on data after the data is partitioned for an honest assessment, what is the most appropriate method for handling the mean imputation?

• A.

The sample means from the validation data set are applied to the training and test data sets.

• B.

The sample means from the training data set are applied to the validation and test data sets.

• C.

The sample means from the test data set are applied to the training and validation data sets.

• D.

The sample means from each partition of the data are applied to their own partition.

B. The sample means from the training data set are applied to the validation and test data sets.
Explanation
When mean imputation is performed after the data are partitioned for an honest assessment, the sample means should be computed on the training data and applied to the validation and test data sets. Imputation is part of the model-building process, so its parameters must be estimated from the training data alone; computing means on (or applying means from) the validation or test partitions would leak information from the holdout data into model building and bias the assessment of the model's performance.
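A minimal sketch of the same idea in Python (not SAS), with made-up values: the imputation mean comes from the training partition only and is then applied unchanged to the other partitions:

```python
# Toy partitions; None marks a missing value.
train = [10.0, None, 14.0]
valid = [None, 8.0]
test = [None]

# Estimate the imputation value from the training data alone.
observed = [x for x in train if x is not None]
train_mean = sum(observed) / len(observed)  # 12.0

def impute(rows):
    # Apply the training mean to any partition, never the partition's own mean.
    return [train_mean if x is None else x for x in rows]

train_f, valid_f, test_f = impute(train), impute(valid), impute(test)
print(train_f, valid_f, test_f)
```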

• 3.

### An analyst generates a model using the LOGISTIC procedure. They are now interested in getting the sensitivity and specificity statistics on a validation data set for a variety of cutoff values. Which statement and option combination will generate these statistics?

• A.

Score data=valid1 out=roc;

• B.

Score data=valid1 outroc=roc;

• C.

Model resp(event='1') = gender region / outroc=roc;

• D.

Model resp(event='1') = gender region / out=roc;

B. Score data=valid1 outroc=roc;
Explanation
The correct combination of statement and option to generate sensitivity and specificity statistics on a validation data set for a variety of cutoff values is "Score data=valid1 outroc=roc". This statement will score the validation data set and output the receiver operating characteristic (ROC) curve, which can be used to calculate sensitivity and specificity at different cutoff values.

• 4.

### In partitioning data for model assessment, which sampling methods are acceptable?

• A.

Simple random sampling without replacement

• B.

Simple random sampling with replacement

• C.

Stratified random sampling without replacement

• D.

Sequential random sampling with replacement

A. Simple random sampling without replacement
C. Stratified random sampling without replacement
Explanation
In partitioning data for model assessment, both simple random sampling without replacement and stratified random sampling without replacement are acceptable methods. Simple random sampling without replacement involves randomly selecting a subset of data without replacement, ensuring that each sample is unique. Stratified random sampling without replacement involves dividing the data into distinct groups or strata based on certain characteristics and then randomly selecting samples from each stratum without replacement. Sequential random sampling with replacement and simple random sampling with replacement are not acceptable methods for partitioning data for model assessment.

• 5.

### Which SAS program will divide the original data set into 60% training and 40% validation data sets, stratified by county?

• A.

Proc surveryselect data=SASUSER.DATABASE samprate=0.6 out=sample; strata country; run;

• B.

Proc sort data=SASUSER.DATABASE; by county; run; proc surveyselect data=SASUSER.DATABASE samprate=0.6 out=sample outall; run;

• C.

Proc sort data=SASUSER.DATABASE; by county; run; proc surveyselect data=SASUSER.DATABASE samprate=0.6 out=sample outall; strata county; run;

• D.

Proc sort data=SASUSER.DATABASE; by county; run; proc surveyselect data=SASUSER.DATABASE samprate=0.6 out=sample; strata county; eun;

C. Proc sort data=SASUSER.DATABASE; by county; run; proc surveyselect data=SASUSER.DATABASE samprate=0.6 out=sample outall; strata county; run;
Explanation
The correct answer is the third option. It first sorts the data set by county with PROC SORT, which is required for the STRATA statement. PROC SURVEYSELECT then samples at a rate of 0.6 (60%), stratified by county, so each county is proportionally represented in both sets. The OUT= option writes the result to a data set named "sample", and the OUTALL option keeps all observations, flagging selected (training, 60%) versus unselected (validation, 40%) rows rather than outputting only the selected ones.
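What the stratified sample does can be sketched in Python (not SAS) with hypothetical records: sampling happens within each county, so every county contributes 60% of its rows to the selected (training) set:

```python
import random

# Hypothetical data: two counties, ten rows each.
rows = [{"county": c, "id": i} for c in ("A", "B") for i in range(10)]

def stratified_split(rows, rate, seed=1):
    """Sample `rate` of each county's rows (analogous to STRATA county)."""
    rng = random.Random(seed)
    train, valid = [], []
    for county in sorted({r["county"] for r in rows}):
        stratum = [r for r in rows if r["county"] == county]
        k = round(rate * len(stratum))
        chosen = set(map(id, rng.sample(stratum, k)))
        for r in stratum:
            (train if id(r) in chosen else valid).append(r)
    return train, valid

train, valid = stratified_split(rows, 0.6)
per_county = {c: sum(r["county"] == c for r in train) for c in "AB"}
print(per_county)  # each county contributes 6 of its 10 rows
```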

• 6.

### At a depth of 0.1, Lift=3.14. What does this mean?

• A.

Selecting the top 10% of the population scored by the model should result in 3.14 times more events than a random draw of 10%.

• B.

Selecting the observations with a response probability of at least 10% should result in 3.14 times more events than a random draw of 10%.

• C.

Selecting the top 10% of the population scored by the model should result in 3.14 times greater accuracy than a random draw of 10%.

• D.

Selecting the observations with a response probability of at least 10% should result in 3.14 times greater accuracy than a random draw of 10%.

A. Selecting the top 10% of the population scored by the model should result in 3.14 times more events than a random draw of 10%.
Explanation
This means that if we select the top 10% of the population based on the model's score, we can expect to see 3.14 times more events compared to randomly selecting 10% of the population. In other words, the model is able to accurately identify a higher proportion of events when selecting from the top 10% based on its score.
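The calculation behind a lift value can be sketched in Python (not SAS) with toy data: lift at a given depth is the event rate among the top fraction of cases ranked by model score, divided by the overall event rate:

```python
# Hypothetical scored data: model probabilities and actual events.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
events = [1,    1,   0,   1,   0,   0,   1,   0,   0,   0]

def lift_at_depth(scores, events, depth):
    """Event rate in the top `depth` fraction / overall event rate."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_top = max(1, int(depth * len(scores)))
    top_rate = sum(events[i] for i in order[:n_top]) / n_top
    overall_rate = sum(events) / len(events)
    return top_rate / overall_rate

# Top 10% is one case here (an event), versus an overall rate of 0.4.
print(lift_at_depth(scores, events, 0.1))
```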

• 7.

### What does the reference line at lift = 1 correspond to?

• A.

The predicted lift for the best 50% of validation data cases

• B.

The predicted lift if the entire population is scored as event cases

• C.

The predicted lift if none of the population are scored as event cases

• D.

The predicted lift if 50% of the population are randomly scored as event cases

B. The predicted lift if the entire population is scored as event cases
Explanation
The reference line at lift = 1 corresponds to the predicted lift if the entire population is scored as event cases. When everyone is targeted, the event rate among those selected equals the overall event rate, so the lift is exactly 1. Lift measures how much better the model concentrates events than chance, and a lift of 1 indicates no improvement over random selection.

• 8.

### What is the use of the reference line in gains chart?

If we contact X% of customers then we will receive X% of the total positive responses.
Explanation
The reference line in the gains chart represents random targeting: if we contact X% of customers chosen at random, we expect to receive X% of the total positive responses. Comparing the model's cumulative gains curve against this diagonal line shows how much better the model is than random selection at concentrating responders among the customers contacted, which is how the effectiveness of a targeting campaign is judged.

• 9.

### Suppose training data are oversampled in the event group to make the number of events and nonevents roughly equal. Logistic regression is run and the probabilities are output to a data set NEW and given the variable name PE. A decision rule considered is, "Classify data as an event if the probability is greater than 0.5." Also, the data set NEW contains a variable TG that indicates whether there is an event (1=Event, 0= No event). The following SAS program was used. What does this program calculate?

• A.

Depth

• B.

Sensitivity

• C.

Specificity

• D.

Positive predictive value

B. Sensitivity
Explanation
The given SAS program calculates the sensitivity. Sensitivity is a measure of the proportion of actual positive cases that are correctly identified as positive by the model. In this case, the program uses logistic regression to calculate probabilities of events and nonevents, and then applies a decision rule to classify data as an event if the probability is greater than 0.5. Sensitivity measures how well the model identifies the actual positive cases correctly.
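Since the referenced SAS program is not reproduced here, the sensitivity it computes can be sketched in Python with hypothetical PE and TG values: among the actual events (TG=1), count the fraction classified as events (PE > 0.5):

```python
# Hypothetical contents of data set NEW.
pe = [0.9, 0.6, 0.4, 0.8, 0.3, 0.2]  # predicted event probabilities (PE)
tg = [1,   1,   1,   0,   0,   1]    # actual outcome (TG), 1 = event

# Sensitivity = correctly classified events / all actual events.
events = [(p, t) for p, t in zip(pe, tg) if t == 1]
sensitivity = sum(1 for p, _ in events if p > 0.5) / len(events)
print(sensitivity)
```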

• 10.

### The plots represent two models, A and B, being fit to the same two data sets, training and validation. Model A is 90.5% accurate at distinguishing blue from red on the training data and 75.5% accurate at doing the same on validation data. Model B is 83% accurate at distinguishing blue from red on the training data and 78.3% accurate at doing the same on the validation data. Which of the two models should be selected and why?

• A.

Model A. It is more complex with a higher accuracy than model B on training data.

• B.

Model A. It performs better on the boundary for the training data.

• C.

Model B. It is more complex with a higher accuracy than model A on validation data.

• D.

Model B. It is simpler with a higher accuracy than model A on validation data.

D. Model B. It is simpler with a higher accuracy than model A on validation data.
Explanation
Model B should be selected because it has a higher accuracy than model A on the validation data. Additionally, model B is simpler, which suggests that it may be more robust and less prone to overfitting.

• 11.

### Assume a \$10 cost for soliciting a non-responder and a \$200 profit for soliciting a responder. The logistic regression model gives a probability score named P_R on a SAS data set called VALID. The VALID data set contains the responder variable Purch, a 1/0 variable coded as 1 for a responder. Customers will be solicited when their probability score is more than 0.05. Which SAS program computes the profit for each customer in the data set VALID?

• A.

Profit=(P_R>0.05)*Purch*200-(P_R>.05)*(1-Purch)*10;

• B.

Profit=(P_R.05)*(1-Purch)*10;

• C.

If P_R> 0.05; profit=(P_R>0.05)*Purch*200-(P_R>.05)*(1-Purch)*10;

• D.

If P_R> 0.05; profit=(P_R>0.05)*Purch*200+(P_R

A. Profit=(P_R>0.05)*Purch*200-(P_R>.05)*(1-Purch)*10;
Explanation
This DATA step expression computes the profit for each customer in VALID. The indicator (P_R>0.05) equals 1 when the customer is solicited and 0 otherwise, so only solicited customers contribute. A solicited responder (Purch=1) contributes \$200; a solicited non-responder (Purch=0) contributes -\$10, the cost of the wasted solicitation; customers at or below the cutoff contribute \$0. The options beginning with a subsetting IF are incorrect because they would drop the non-solicited customers instead of assigning them zero profit.
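The same profit rule can be sketched in Python (not SAS) with hypothetical customers:

```python
# Toy customers: probability score and responder flag.
customers = [
    {"p_r": 0.20, "purch": 1},  # solicited responder      -> +200
    {"p_r": 0.10, "purch": 0},  # solicited non-responder  -> -10
    {"p_r": 0.02, "purch": 1},  # not solicited            -> 0
]

for c in customers:
    solicit = c["p_r"] > 0.05  # True/False acts as 1/0, as in the SAS code
    c["profit"] = solicit * c["purch"] * 200 - solicit * (1 - c["purch"]) * 10

print([c["profit"] for c in customers])
```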

• 12.

### In order to perform an honest assessment on a predictive model, what is an acceptable division between training, validation, and testing data?

• A.

Training: 50% Validation: 0% Testing: 50%

• B.

Training: 100% Validation: 0% Testing: 0%

• C.

Training: 0% Validation: 100% Testing: 0%

• D.

Training: 50% Validation: 50% Testing: 0%

D. Training: 50% Validation: 50% Testing: 0%
Explanation
With no separate test set, a 50% training / 50% validation split is still an acceptable division: the training data fits the model and the validation data is used to compare candidate models and tune them. The other options devote 100% of the data to a single role or omit validation entirely, leaving no independent data for honest assessment. A three-way split with a test set is preferable when the data volume permits, but it is not required for an acceptable division.

• 13.

### Based upon the comparative ROC plot for two competing models, which is the champion model and why?

• A.

Candidate 1, because the area under the curve is greater

• B.

Candidate 2, because the area under the curve is greater

• C.

Candidate 1, because it is closer to the diagonal reference curve

• D.

Candidate 2, because it shows less over fit than Candidate 1

B. Candidate 2, because the area under the curve is greater
Explanation
Candidate 2 is the champion model because its area under the ROC curve is greater. A larger area under the curve means a better trade-off between the true positive rate and the false positive rate across cutoffs, and therefore better overall ability to distinguish positive from negative cases. Based on the comparative ROC plot, Candidate 2 is the better model.

• 14.

### A confusion matrix is created for data that were oversampled due to a rare target. What values are not affected by this oversampling?

• A.

Sensitivity and PV+

• B.

Specificity and PV-

• C.

PV+ and PV-

• D.

Sensitivity and Specificity

D. Sensitivity and Specificity
Explanation
When data are oversampled because the target is rare, event observations are duplicated (or non-events discarded) so the class mix in the sample no longer matches the population. Sensitivity and specificity are computed within the actual event and non-event groups respectively, so they do not depend on that mix and are unaffected by oversampling. The predictive values PV+ and PV-, by contrast, combine counts across both groups and shift with the event proportion.
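A quick Python sketch with hypothetical confusion-matrix counts shows the invariance: multiplying the event column (TP and FN) by an oversampling factor leaves sensitivity and specificity unchanged while PV+ shifts:

```python
# Hypothetical original confusion-matrix counts.
tp, fn, fp, tn = 30, 10, 20, 140

def rates(tp, fn, fp, tn):
    sens = tp / (tp + fn)  # within the event group only
    spec = tn / (tn + fp)  # within the non-event group only
    ppv = tp / (tp + fp)   # mixes both groups -> depends on class mix
    return sens, spec, ppv

before = rates(tp, fn, fp, tn)
after = rates(tp * 5, fn * 5, fp, tn)  # events oversampled 5x
print(before, after)
```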

• 15.

### This question will ask you to provide missing code segments. A logistic regression model was fit on a data set where 40% of the outcomes were events (TARGET=1) and 60% were non-events (TARGET=0). The analyst knows that the population where the model will be deployed has 5% events and 95% non-events. The analyst also knows that the company's profit margin for correctly targeted events is nine times higher than the company's loss for incorrectly targeted non-event. Given the following SAS program:What X and Y values should be added to the program to correctly score the data?

• A.

X=40, Y=10

• B.

X=.05, Y=10

• C.

X=.05, Y=.40

• D.

X=.10,Y=.05

B. X=.05, Y=10
Explanation
The X value is the proportion of events in the population where the model will be deployed, so X=.05. The Y value presumably encodes the decision weights: with a profit for a correctly targeted event nine times the loss for an incorrectly targeted non-event, the combined weight is 9 + 1 = 10, so Y=10. Together these values adjust the oversampled model's scores so that classification reflects the true population priors and the company's profit/loss ratio.

• 16.

### An analyst has a sufficient volume of data to perform a 3-way partition of the data into training, validation, and test sets to perform honest assessment during the model building process. What is the purpose of the test data set?

• A.

To provide an unbiased measure of assessment for the final model.

• B.

To compare models and select and fine-tune the final model.

• C.

To reduce total sample size to make computations more efficient.

• D.

To build the predictive models.

A. To provide an unbiased measure of assessment for the final model.
Explanation
The purpose of the test data set is to provide an unbiased measure of assessment for the final model. This means that the test data set is used to evaluate the performance of the model that has been built using the training and validation data sets. By using a separate test data set, it ensures that the evaluation of the model is not influenced by the data that was used to build and fine-tune the model. This helps to provide a more accurate assessment of how the model will perform on new, unseen data.

• 17.

### Calculate the sensitivity, accuracy, error rate.

25/48
83/150
67/150
Explanation
As listed, sensitivity = 25/48, accuracy = 83/150, and error rate = 67/150. The confusion matrix the question refers to is not reproduced here, so the counts behind these ratios cannot be verified, but the answer is at least internally consistent: accuracy and error rate sum to 150/150 = 1.
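The internal consistency of the listed ratios can be checked with exact arithmetic (assuming, as the denominators suggest, a 150-case confusion matrix):

```python
from fractions import Fraction

sensitivity = Fraction(25, 48)
accuracy = Fraction(83, 150)
error_rate = Fraction(67, 150)

# Accuracy and error rate partition the same 150 cases, so they sum to 1.
print(accuracy + error_rate == 1)
```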

• 18.

### The total modeling data has been split into training, validation, and test data. What is the best data to use for model assessment?

• A.

Training data

• B.

Total data

• C.

Test data

• D.

Validation data

D. Validation data
Explanation
The best data to use for model assessment is the validation data. This is because the validation data is not used during the training process and is therefore unseen by the model. It serves as an unbiased measure of the model's performance and can be used to fine-tune the model's hyperparameters or compare different models. The test data, on the other hand, should be kept separate and only used at the very end to provide a final evaluation of the model's performance.

• 19.

### What is a drawback to performing data cleansing (imputation, transformations, etc.) on raw data prior to partitioning the data for honest assessment as opposed to performing the data cleansing after partitioning the data?

• A.

It violates assumptions of the model.

• B.

It requires extra computational effort and time.

• C.

It omits the training (and test) data sets from the benefits of the cleansing methods.

• D.

There is no ability to compare the effectiveness of different cleansing methods.

D. There is no ability to compare the effectiveness of different cleansing methods.
Explanation
Cleansing the raw data before partitioning means every partition receives the same treatment before any assessment is possible, so there is no held-out data on which to compare how different cleansing methods (imputations, transformations, and so on) affect model performance. Cleansing after partitioning, with the rules derived from the training data only, preserves the ability to evaluate competing cleansing methods on the validation data.

• 20.

### A company has branch offices in eight regions. Customers within each region are classified as either "High Value" or "Medium Value" and are coded using the variable name VALUE. In the last year, the total amount of purchases per customer is used as the response variable. Suppose there is a significant interaction between REGION and VALUE. What can you conclude?

• A.

More high value customers are found in some regions than others.

• B.

The difference between average purchases for medium and high value customers depends on the region.

• C.

Regions with higher average purchases have more high value customers.

• D.

Regions with higher average purchases have more medium value customers.

B. The difference between average purchases for medium and high value customers depends on the region.
Explanation
The significant interaction between REGION and VALUE indicates that the relationship between the average purchases for medium and high value customers varies depending on the region. This suggests that the impact of customer value on purchases differs across different regions, indicating that the regional factor plays a role in influencing customer behavior and purchase patterns.

• 21.

### This question will ask you to provide a missing option. Complete the following syntax to test the homogeneity of variance assumption in the GLM procedure: Means Region / <insert option here> =levene;

hovtest
Explanation
The missing option is HOVTEST. Specifying hovtest=levene on the MEANS statement requests Levene's test of homogeneity of variance in the GLM procedure; the resulting test statistic indicates whether the group variances differ significantly, which checks the equal-variance assumption of ANOVA.

• 22.

### Based on the control plot, which conclusion is justified regarding the means of the response?

• A.

All groups are significantly different from each other.

• B.

2XL is significantly different from all other groups.

• C.

Only XL and 2XL are not significantly different from each other.

• D.

No groups are significantly different from each other.

C. Only XL and 2XL are not significantly different from each other.
Explanation
The correct answer is "Only XL and 2XL are not significantly different from each other." This conclusion is justified because the control plot shows that all other groups are significantly different from each other, indicating that XL and 2XL are the only groups that are not significantly different.

• 23.

### Customers were surveyed to assess their intent to purchase a product. An analyst divided the customers into groups defined by the company's pre-assigned market segments and tested for difference in the customers' average intent to purchase. The following is the output from the GLM procedure: What percentage of customers' intent to purchase is explained by market segment?

• A.

• B.

35%

• C.

65%

• D.

76%

D. 76%
Explanation
The R-square of 0.76 in the GLM output indicates that 76% of the variation in customers' intent to purchase is explained by market segment. In other words, membership in the company's pre-assigned market segments accounts for most of the variability in purchase intent across the surveyed customers.

• 24.

### The box plot was used to analyze daily sales data following three different ad campaigns. The business analyst concludes that one of the assumptions of ANOVA was violated. Which assumption has been violated and why?

• A.

Normality, because Prob > F < .0001.

• B.

Normality, because the interquartile ranges are different in different ad campaigns.

• C.

Constant variance, because Prob > F < .0001.

• D.

Constant variance, because the interquartile ranges are different in different ad campaigns.

D. Constant variance, because the interquartile ranges are different in different ad campaigns.
Explanation
The correct answer is constant variance because the interquartile ranges are different in different ad campaigns. In ANOVA, one of the assumptions is that the variances of the groups being compared are equal. The interquartile range is a measure of dispersion, and if the interquartile ranges are different between the ad campaigns, it suggests that the variances are not equal. Therefore, the assumption of constant variance is violated.

• 25.

### Given alpha=0.02, which conclusion is justified regarding percentage of body fat, comparing small (S), medium (M), and large (L) wrist sizes?

• A.

Medium wrist size is significantly different than small wrist size.

• B.

Large wrist size is significantly different than medium wrist size.

• C.

Large wrist size is significantly different than small wrist size.

• D.

There is no significant difference due to wrist size.

C. Large wrist size is significantly different than small wrist size.
Explanation
At alpha = 0.02, only the large-versus-small comparison in the referenced output has a p-value below the significance threshold, so the justified conclusion is that percentage of body fat differs significantly between large and small wrist sizes. The medium-versus-small and large-versus-medium comparisons do not reach significance at this stricter alpha level.

• 26.

### An analyst compares the mean salaries of men and women working at a company. The SAS data set SALARY contains variables: Gender (M or F) Pay (dollars per year) Which SAS programs can be used to find the p-value for comparing men's salaries with women's salaries? (Choose two.)

• A.

Proc glm data=salary; class gender; model pay=gender; run;

• B.

Proc ttest data=salary; class gender; var pay; run;

• C.

Proc glm data=salary; class pay; model pay=gender; run;

• D.

Proc ttest data=salary; class gender; model pay=gender; run;

A. Proc glm data=salary; class gender; model pay=gender; run;
B. Proc ttest data=salary; class gender; var pay; run;
Explanation
The first program, "proc glm data=salary; class gender; model pay=gender; run;", can be used to compare men's and women's salaries by fitting a linear regression model with gender as the predictor variable and pay as the response variable. The p-value for the gender coefficient in the model can be used to determine if there is a significant difference in salaries between men and women.

The second program, "proc ttest data=salary; class gender; var pay; run;", can be used to compare men's and women's salaries by conducting an independent samples t-test. The p-value from the t-test can be used to determine if there is a significant difference in means between the two groups.

• 27.

### Which statement is correct at an alpha level of 0.05?

• A.

School*Gender should be removed because it is non-significant.

• B.

Gender should be removed because it is non-significant.

• C.

School should be removed because it is significant.

• D.

Gender should not be removed due to its involvement in the significant interaction.

D. Gender should not be removed due to its involvement in the significant interaction.
Explanation
The correct answer is "Gender should not be removed due to its involvement in the significant interaction." This statement suggests that gender should not be removed from the analysis because it plays a significant role in the interaction between variables. This implies that gender has an impact on the outcome being studied and should therefore be considered in the analysis.

• 28.

### There are missing values in the input variables for a regression application. Which SAS procedure provides a viable solution?

STDIZE
Explanation
PROC STDIZE standardizes variables by subtracting a location measure and dividing by a scale measure, but it also supports imputation: with the REPONLY option it only replaces missing values with the chosen location measure (for example, METHOD=MEAN replaces them with the variable's mean) and leaves the non-missing values unchanged. This makes it a viable solution for handling missing values in the inputs of a regression application.

• 29.

### There are clusters of correlated input variables in a regression application. Which SAS procedure provides a viable solution?

varclus
Explanation
The varclus procedure in SAS provides a viable solution for regression applications with variable clusters in the input variables. This procedure helps in identifying groups of variables that have high correlation with each other, allowing for the creation of clusters. By using varclus, researchers can reduce the dimensionality of their data and select representative variables from each cluster, which can then be used in regression analysis.

• 30.

### Screening for non-linearity in binary logistic regression can be achieved by visualizing:

• A.

A scatter plot of binary response versus a predictor variable.

• B.

A trend plot of empirical logit versus a predictor variable.

• C.

A logistic regression plot of predicted probability values versus a predictor variable.

• D.

A box plot of the odds ratio values versus a predictor variable.

B. A trend plot of empirical logit versus a predictor variable.
Explanation
A trend plot of empirical logit versus a predictor variable can be used to screen for non-linearity in binary logistic regression. This plot helps to visualize the relationship between the predictor variable and the log odds of the binary response. If the trend plot shows a non-linear pattern, it suggests that there may be a non-linear relationship between the predictor variable and the log odds, indicating the need for non-linear modeling techniques or transformations of the predictor variable.
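A minimal Python sketch (toy data, equal-size bins) of the empirical logit used in such a plot: within each bin of the predictor, compute log((events + 0.5) / (nonevents + 0.5)) and pair it with the bin's mean predictor value; a non-linear trend in these points flags non-linearity:

```python
import math

# Hypothetical (predictor, binary response) pairs.
data = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 1), (6, 1), (7, 1), (8, 1)]

def empirical_logits(data, n_bins=2):
    """One (mean predictor, empirical logit) point per bin."""
    data = sorted(data)
    size = len(data) // n_bins
    points = []
    for b in range(n_bins):
        chunk = data[b * size:(b + 1) * size]
        events = sum(y for _, y in chunk)
        nonevents = len(chunk) - events
        logit = math.log((events + 0.5) / (nonevents + 0.5))  # smoothed logit
        mean_x = sum(x for x, _ in chunk) / len(chunk)
        points.append((mean_x, logit))
    return points

print(empirical_logits(data))
```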

• 31.

### Given the following SAS dataset TEST: Inc_Group 1 2 3 4 5 Which SAS program is NOT a correct way to create dummy variables?

• A.

Option A

• B.

Option B

• C.

Option C

• D.

Option D

D. Option D
Explanation
The answer key identifies Option D as the program that does not correctly create dummy variables for the five levels of Inc_Group. Because the code for the four options is not reproduced in this text, the specific flaw in Option D cannot be shown here.

• 32.

### An analyst fits a logistic regression model to predict whether or not a client will default on a loan. One of the predictors in the model is agent, and each agent serves 15-20 clients each. The model fails to converge. The analyst prints the summarized data, showing the number of defaulted loans per agent. See the partial output below: What is the most likely reason that the model fails to converge?

• A.

There is quasi-complete separation in the data.

• B.

There is collinearity among the predictors.

• C.

There are missing values in the data.

• D.

There are too many observations in the data.

A. There is quasi-complete separation in the data.
Explanation
The most likely reason that the model fails to converge is that there is quasi-complete separation in the data. Quasi-complete separation occurs when there is a predictor that perfectly predicts the outcome variable, resulting in a division of the data into distinct groups. This can cause issues in logistic regression because it leads to infinite parameter estimates. In this case, the number of defaulted loans per agent may be perfectly predicting whether or not a client will default on a loan, causing the model to fail to converge.

• 33.

### An analyst knows that the categorical predictor, storeId, is an important predictor of the target. However, store_Id has too many levels to be a feasible predictor in the model. The analyst wants to combine stores and treat them as members of the same class level. What are the two most effective ways to address the problem? (Choose two.)

• A.

Eliminate store_id as a predictor in the model because it has too many levels to be feasible.

• B.

Cluster by using Greenacre's method to combine stores that are similar.

• C.

Use subject matter expertise to combine stores that are similar.

• D.

Randomly combine the stores into five groups to keep the stochastic variation among the observations intact.

B. Cluster by using Greenacre's method to combine stores that are similar.
C. Use subject matter expertise to combine stores that are similar.
Explanation
The two most effective ways to address the problem of having too many levels for the storeId predictor are to cluster the stores using Greenacre's method to combine similar stores and to use subject matter expertise to combine similar stores. Clustering allows for grouping of stores based on similarities, while subject matter expertise allows for a more nuanced understanding of the stores and their similarities. Both approaches help to reduce the number of levels for the storeId predictor, making it more feasible for the model.

• 34.

### Including redundant input variables in a regression model can:

• A.

Stabilize parameter estimates and increase the risk of overfitting.

• B.

Destabilize parameter estimates and increase the risk of overfitting.

• C.

Stabilize parameter estimates and decrease the risk of overfitting.

• D.

Destabilize parameter estimates and decrease the risk of overfitting.

B. Destabilize parameter estimates and increase the risk of overfitting.
Explanation
Including redundant input variables introduces collinearity: because the variables carry overlapping information, the model cannot cleanly attribute the response variation among them, so parameter estimates become unstable, with inflated standard errors and signs that can flip from sample to sample. The extra parameters also let the model fit noise in the training data, increasing the risk of overfitting and of poor generalization to new data.

• 35.

### Excluding redundant input variables in a regression model can:

• A.

Stabilize parameter estimates and increase the risk of overfitting.

• B.

Destabilize parameter estimates and increase the risk of overfitting.

• C.

Stabilize parameter estimates and decrease the risk of overfitting.

• D.

Destabilize parameter estimates and decrease the risk of overfitting.

C. Stabilize parameter estimates and decrease the risk of overfitting.
Explanation
Excluding redundant input variables in a regression model can stabilize parameter estimates by removing unnecessary variables that do not contribute significantly to the model's prediction. This helps to reduce the variability in the estimated coefficients, making them more reliable. Additionally, removing redundant variables can decrease the risk of overfitting, which occurs when a model fits the training data too closely and performs poorly on new, unseen data. By simplifying the model and focusing on the most relevant variables, the risk of overfitting is reduced, leading to better generalization and predictive accuracy.
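A quick Python simulation (illustrative; the data-generating process is an assumption for the demo) shows the destabilizing effect described in these two questions: when a nearly identical copy of a predictor is included, the individual coefficient estimates swing wildly from sample to sample, while the model without the redundant copy stays stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 200
b1_estimates, b1_single = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly redundant copy of x1
    y = 2 * x1 + rng.normal(size=n)
    X_red = np.column_stack([x1, x2])          # model with the redundant copy
    X_ok = x1[:, None]                         # model without it
    b_red, *_ = np.linalg.lstsq(X_red, y, rcond=None)
    b_ok, *_ = np.linalg.lstsq(X_ok, y, rcond=None)
    b1_estimates.append(b_red[0])
    b1_single.append(b_ok[0])

# The x1 coefficient is far more variable when the redundant copy is included.
sd_redundant = np.std(b1_estimates)
sd_clean = np.std(b1_single)
```

The fit to the training data is essentially identical in both models; only the stability of the individual estimates differs, which is the point of these two questions.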

• 36.

### An analyst investigates Region (A, B, or C) as an input variable in a logistic regression model. The analyst discovers that the probability of purchasing a certain item when Region = A is 1. What problem does this illustrate?

• A.

Collinearity

• B.

Influential observations

• C.

Quasi-complete separation

• D.

Problems that arise due to missing values

C. Quasi-complete separation
Explanation
Quasi-complete separation occurs when a level of a predictor perfectly predicts the outcome for a subset of the data while the remaining observations overlap. Here the purchase probability is exactly 1 when Region = A, so that level perfectly separates the response: the corresponding coefficient is driven toward infinity, its standard error becomes extreme, and the logistic regression model has convergence problems and unreliable estimates.

• 37.
• A.

The association between the continuous predictor and the binary response is quadratic.

• B.

The association between the continuous predictor and the log-odds is quadratic.

• C.

The association between the continuous predictor and the continuous response is quadratic.

• D.

The association between the binary predictor and the log-odds is quadratic.

B. The association between the continuous predictor and the log-odds is quadratic.
Explanation
The correct answer is that the association between the continuous predictor and the log-odds is quadratic. Logistic regression models the log-odds (logit) of the binary response, so linearity and non-linearity are assessed on the logit scale, not on the probability scale. A quadratic association means the log-odds follow a parabolic curve in the predictor, which is handled by adding a squared term for that predictor to the model.

• 38.

### This question will ask you to provide a missing option. Given the following SAS program:

• A.

OUTPUT=estimates

• B.

OUTP=estimates

• C.

OUTSTAT=estimates

• D.

OUTCORR=estimates

B. OUTP=estimates
Explanation
Assuming the (unshown) program is a PROC CORR step, the missing option is OUTP=estimates. In the CORR procedure, OUTP= names an output data set containing Pearson correlation statistics (means, standard deviations, and correlations); OUTS= plays the same role for Spearman statistics. OUTPUT=, OUTSTAT=, and OUTCORR= are not valid PROC CORR options for this purpose.

• 39.

### A predictive model uses a data set that has several variables with missing values. What two problems can arise with this model? (Choose two.)

• A.

The model will likely be overfit.

• B.

There will be a high rate of collinearity among input variables.

• C.

Complete case analysis means that fewer observations will be used in the model building process.

• D.

New cases with missing values on input variables cannot be scored without extra data processing.

C. Complete case analysis means that fewer observations will be used in the model building process.
D. New cases with missing values on input variables cannot be scored without extra data processing.
Explanation
A predictive model that uses a dataset with missing values can lead to two problems. First, complete case analysis means that fewer observations will be used in the model building process. This can result in a loss of valuable information and potential bias in the model. Second, new cases with missing values on input variables cannot be scored without extra data processing. This means that the model may not be able to make accurate predictions for new data points that have missing values, requiring additional steps to handle missing data before scoring.
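The first problem compounds quickly. A short Python sketch (illustrative; the 5% missingness rate and variable count are assumptions) shows how complete-case analysis discards rows: with 10 variables each missing independently 5% of the time, only about 60% of rows survive.

```python
import numpy as np

rng = np.random.default_rng(7)
n_rows, n_vars, miss_rate = 10_000, 10, 0.05

# Each variable is independently missing 5% of the time.
missing = rng.uniform(size=(n_rows, n_vars)) < miss_rate

# A row is "complete" only if no variable is missing in it.
complete = ~missing.any(axis=1)
observed_fraction = complete.mean()   # close to 0.95 ** 10, about 0.60
```

This is why imputation is usually preferred over complete-case analysis when several inputs have missing values.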

• 40.

### Spearman statistics in the CORR procedure are useful for screening for irrelevant variables by investigating the association between which function of the input variables?

• A.

Concordant and discordant pairs of ranked observations

• B.

• C.

Rank-ordered values of the variables

• D.

Weighted sum of chi-square statistics for 2x2 tables

C. Rank-ordered values of the variables
Explanation
The Spearman statistics in the CORR procedure are useful for screening for irrelevant variables by investigating the association between rank-ordered values of the variables. This means that the Spearman statistics measure the strength and direction of the monotonic relationship between variables, which can help identify variables that are not relevant or do not have a significant impact on the outcome. By ranking the values of the variables and calculating the Spearman correlation coefficient, researchers can determine if there is a consistent pattern or trend between the variables, which can inform further analysis and variable selection.
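A minimal Python sketch (illustrative; the toy data are assumed, and ties are ignored for simplicity) shows the definition directly: the Spearman coefficient is the Pearson correlation computed on rank-ordered values, so it detects any monotonic association even when the relationship is strongly non-linear.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the
    rank-ordered values (no tie handling in this sketch)."""
    rx = np.argsort(np.argsort(x))   # ranks of x
    ry = np.argsort(np.argsort(y))   # ranks of y
    return np.corrcoef(rx, ry)[0, 1]

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)                 # monotonic but strongly non-linear in x

rho = spearman(x, y)          # 1.0: a perfect monotonic association
r = np.corrcoef(x, y)[0, 1]   # the Pearson coefficient is well below 1 here
```

Because the ranks fully capture a monotonic trend, a variable with a near-zero Spearman statistic against the target is a candidate irrelevant input.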

• 41.

### A non-contributing predictor variable (Pr > |t| =0.658) is added to an existing multiple linear regression model. What will be the result?

• A.

An increase in R-Square

• B.

A decrease in R-Square

• C.

A decrease in Mean Square Error

• D.

No change in R-Square

A. An increase in R-Square
Explanation
R-square never decreases when a predictor is added to a linear regression model, even one that does not contribute (Pr > |t| = 0.658). The added variable absorbs at least a small amount of the residual variation by chance, so R-square increases slightly. It is the adjusted R-square, which penalizes additional parameters, that can decrease in this situation.
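This is easy to verify numerically. The Python sketch below (illustrative; the data are simulated) fits a model with and without a pure-noise predictor and confirms that R-square can only go up, if only slightly, when a term is added.

```python
import numpy as np

def r_squared(X, y):
    """R-square from an OLS fit with an intercept column."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
y = 3 + 2 * x1 + rng.normal(size=n)
noise = rng.normal(size=n)          # a predictor unrelated to y

r2_base = r_squared(x1[:, None], y)
r2_plus = r_squared(np.column_stack([x1, noise]), y)
# r2_plus >= r2_base always holds: R-square cannot decrease when a term
# is added, and it decreases (or stays equal) when a term is dropped.
```

The same run read in reverse illustrates the next question: dropping the noise predictor takes R-square back down from `r2_plus` to `r2_base`.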

• 42.

### A non-contributing predictor variable (Pr > |t| =0.658) is eliminated from an existing multiple linear regression model. What will be the result?

• A.

An increase in R-Square

• B.

A decrease in R-Square

• C.

A decrease in Mean Square Error

• D.

No change in R-Square

B. A decrease in R-Square
Explanation
When any predictor is removed from a multiple linear regression model, R-square decreases (or at best stays the same), because the dropped variable explained at least some of the variation in the response, if only by chance. For a non-contributing predictor (Pr > |t| = 0.658) the decrease will be very small, and the adjusted R-square, which penalizes model size, may actually improve. Therefore, the correct answer is a decrease in R-Square.

• 43.

### The standard form of a linear regression is : Y= beta0+beta1*X+ error Which statement best summarizes the assumptions placed on the errors?

• A.

The errors are correlated, normally distributed with constant mean and zero variance.

• B.

The errors are correlated, normally distributed with zero mean and constant variance.

• C.

The errors are independent, normally distributed with constant mean and zero variance.

• D.

The errors are independent, normally distributed with zero mean and constant variance.

D. The errors are independent, normally distributed with zero mean and constant variance.
Explanation
The assumption placed on the errors in linear regression is that they are independent, normally distributed with zero mean and constant variance. This means that the errors are not correlated with each other, they follow a normal distribution, their average value is zero, and their variance remains constant across all levels of the predictor variable.

• 44.

### What is the value of R-squared?

0.4115
Explanation
The value of R-squared is a statistical measure that represents the proportion of the variance in the dependent variable that can be explained by the independent variables in a regression model. In this case, the value of R-squared is 0.4115, indicating that approximately 41.15% of the variability in the dependent variable can be explained by the independent variables in the model.
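For reference, R-square is computed as 1 - SSE/SST from the analysis-of-variance table. The numbers below are hypothetical values chosen only to reproduce the quoted 0.4115; the quiz's actual output is not shown here.

```python
# R-square is the model sum of squares over the corrected total sum of
# squares, equivalently 1 - SSE/SST. Hypothetical ANOVA-table values:
sse = 58.85    # error (residual) sum of squares (assumed for illustration)
sst = 100.0    # corrected total sum of squares (assumed)

r_squared = 1 - sse / sst   # 0.4115, matching the quiz answer
```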

• 45.

• A.

A

• B.

B

• C.

C

• D.

D

B. B

• 46.

### An analyst has selected this model as a champion because it shows better model fit than a competing model with more predictors. Which statistic justifies this rationale?

• A.

R-Square

• B.

Coeff Var

• C.

Adj R-Sq

• D.

Error DF

C. Adj R-Sq
Explanation
The Adjusted R-Squared statistic justifies this rationale because it takes into account the number of predictors in the model. It adjusts the R-Squared value by penalizing for the inclusion of additional predictors that may not significantly contribute to the model fit. Therefore, a higher Adjusted R-Squared value indicates a better model fit, even when compared to a model with more predictors.
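The adjustment uses the formula Adj R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the sample size and p the number of predictors (excluding the intercept). The Python sketch below uses made-up numbers to show how a smaller model can win on adjusted R-square despite a slightly lower raw R-square.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-square: penalizes R-square for the number of
    predictors p (excluding the intercept) given sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative numbers (not from the quiz output): the 2-predictor model
# beats the 10-predictor model on adjusted R-square even though its raw
# R-square is lower.
small = adjusted_r_squared(0.40, n=100, p=2)    # about 0.3876
large = adjusted_r_squared(0.41, n=100, p=10)   # about 0.3437
```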

• 47.

### What is the total sample size?

100
Explanation
The total sample size is 100, meaning the analysis was based on 100 participants or observations.

• 48.

### The selection criterion used in the forward selection method in the REG procedure is:

SLE
Explanation
The selection criterion used in the forward selection method in the REG procedure is SLE, which stands for significance level of entry. This criterion is used to determine the level of significance at which a predictor variable should be included in the regression model. The forward selection method starts with an empty model and iteratively adds variables that have the lowest p-values (significance levels) until no more variables meet the predetermined significance level for entry. This helps to identify the most significant predictor variables for the regression model.
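The mechanics can be sketched in Python (illustrative; PROC REG compares each candidate's entry p-value to SLE/SLENTRY=, whereas this demo uses a fixed partial-F cutoff, F_ENTER, as a stand-in for that test, and the data are simulated):

```python
import numpy as np

def sse(cols, y, data):
    """Error sum of squares for an intercept + selected-columns OLS fit."""
    X = np.column_stack([np.ones(len(y))] + [data[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

rng = np.random.default_rng(3)
n = 200
data = rng.normal(size=(n, 4))
# Only columns 0 and 2 truly drive the response.
y = 1 + 3 * data[:, 0] + 1.5 * data[:, 2] + rng.normal(size=n)

# Forward selection: at each step add the candidate with the largest
# partial F statistic, stopping when none clears the entry threshold.
F_ENTER = 4.0
selected, remaining = [], [0, 1, 2, 3]
while remaining:
    base = sse(selected, y, data)
    fstats = {}
    for j in remaining:
        new = sse(selected + [j], y, data)
        df = n - len(selected) - 2          # residual df after adding j
        fstats[j] = (base - new) / (new / df)
    best = max(fstats, key=fstats.get)
    if fstats[best] < F_ENTER:
        break
    selected.append(best)
    remaining.remove(best)
```

The informative columns enter first, strongest predictor leading; the pure-noise columns rarely clear the threshold.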

• 49.

### Which SAS program will correctly use backward elimination selection criterion within the REG procedure?

• A.

A

• B.

B

• C.

C

• D.

D

B. B
Explanation
Option B is identified as correct. As with other code questions in this quiz, the options appear as images that are not reproduced here. In PROC REG, backward elimination is requested with SELECTION=BACKWARD on the MODEL statement, optionally with SLSTAY= to set the significance level a variable must meet to remain in the model; the correct option is the one using this syntax. Backward elimination starts from the full model and removes variables one at a time based on their significance levels.

• 50.

### The Intercept estimate is interpreted as:

• A.

The predicted value of the response when all the predictors are at their current values.

• B.

The predicted value of the response when all predictors are at their means.

• C.

The predicted value of the response when all predictors = 0.

• D.

The predicted value of the response when all predictors are at their minimum values.

C. The predicted value of the response when all predictors = 0.
Explanation
The intercept estimate is the predicted value of the response when every predictor equals 0. That point may lie outside the range of the observed data, in which case the intercept has no direct practical interpretation, but mathematically it is always the fitted value at predictors = 0.
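A short check in Python (illustrative; toy data) makes the point: the fitted intercept is exactly the model's prediction at x = 0.

```python
import numpy as np

# Fit y = b0 + b1*x by least squares; the intercept b0 is the model's
# prediction at x = 0 (which may lie outside the observed data range).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.0, 6.9, 9.2])

b1, b0 = np.polyfit(x, y, 1)         # returns [slope, intercept]
prediction_at_zero = b0 + b1 * 0.0   # identical to the intercept estimate
```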

Quiz Review Timeline

• Current Version
• Mar 21, 2023
Quiz Edited by
ProProfs Editorial Team
• Sep 15, 2014
Quiz Created by
Wendicai
