Statistical Business Analysis Quiz! Hardest Trivia Questions

By Wendicai (Community Contributor) | Attempts: 1,335 | Questions: 72
1. In order to perform an honest assessment on a predictive model, what is an acceptable division between training, validation, and testing data?

Explanation

An acceptable division is 50% of the data for training, 50% for validation, and 0% for testing. Honest assessment requires at least one holdout set that plays no part in fitting the model: the training data is used to fit the model, and the validation data is used to tune it, compare candidate models, and estimate performance on unseen data. A separate test set gives an unbiased assessment of the final model when enough data is available, but it is optional, so a 50/50/0 split is acceptable.

About This Quiz

Dive into the 'Statistical Business Analysis Quiz! Hardest Trivia Questions' to test and enhance your knowledge of ROC curves, data partitioning, logistic regression, and more. Essential for aspiring business analysts and data scientists aiming to sharpen their analytical skills.

2. This question will ask you to provide a missing option. Complete the following syntax to test the homogeneity of variance assumption in the GLM procedure: Means Region / <insert option here> =levene;

Explanation

The missing option is "hovtest". In the GLM procedure, the HOVTEST= option on the MEANS statement requests a test of homogeneity of variance across the groups being compared, and HOVTEST=LEVENE specifies Levene's test. The test evaluates whether the group variances differ significantly. The completed statement is: Means Region / hovtest=levene;

3. Based on the control plot, which conclusion is justified regarding the means of the response?

Explanation

The correct answer is "Only XL and 2XL are not significantly different from each other." This conclusion is justified because the control plot shows that all other groups are significantly different from each other, indicating that XL and 2XL are the only groups that are not significantly different.

4. An analyst has selected this model as a champion because it shows better model fit than a competing model with more predictors. Which statistic justifies this rationale?  

Explanation

The Adjusted R-Squared statistic justifies this rationale because it takes into account the number of predictors in the model. It adjusts the R-Squared value by penalizing for the inclusion of additional predictors that may not significantly contribute to the model fit. Therefore, a higher Adjusted R-Squared value indicates a better model fit, even when compared to a model with more predictors.
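The penalty for extra predictors can be made concrete with a short Python sketch of the usual adjusted R-squared formula. The R-squared values, sample size, and predictor counts below are hypothetical numbers chosen for illustration; they are not taken from the quiz output.

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for a model with p predictors (intercept excluded)
    fit to n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical comparison: the larger model has a slightly higher R-squared
small_model = adjusted_r_squared(r2=0.750, n=50, p=3)
large_model = adjusted_r_squared(r2=0.755, n=50, p=6)

print(round(small_model, 4), round(large_model, 4))
```

In this made-up case the model with fewer predictors has the higher adjusted R-squared even though its raw R-squared is lower, which is the pattern the question describes.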

5. What is the total number of the sample size? 

Explanation

The total number of the sample size is 100, as stated in the answer. This means that there were 100 participants or observations in the sample.

6. Which SAS program will detect collinearity in a multiple regression application?

Explanation

Program B is the one that detects collinearity. The programs themselves are not reproduced here, but in the REG procedure collinearity diagnostics are requested with MODEL statement options such as VIF, COLLIN, and COLLINOINT. Collinearity is the situation in which two or more predictor variables in a regression model are highly correlated, which leads to unstable and unreliable estimates of the regression coefficients.

7. The plots represent two models, A and B, being fit to the same two data sets, training and validation. Model A is 90.5% accurate at distinguishing blue from red on the training data and 75.5% accurate at doing the same on validation data. Model B is 83% accurate at distinguishing blue from red on the training data and 78.3% accurate at doing the same on the validation data. Which of the two models should be selected and why?

Explanation

Model B should be selected because it has a higher accuracy than model A on the validation data. Additionally, model B is simpler, which suggests that it may be more robust and less prone to overfitting.

8. An analyst fits a logistic regression model to predict whether or not a client will default on a loan. One of the predictors in the model is agent, and each agent serves 15-20 clients. The model fails to converge. The analyst prints the summarized data, showing the number of defaulted loans per agent. See the partial output below: What is the most likely reason that the model fails to converge?

Explanation

The most likely reason that the model fails to converge is quasi-complete separation. Quasi-complete separation occurs when the outcome is perfectly predicted for some, but not all, values of a predictor. With only 15-20 clients per agent, some agents are likely to have no defaulted loans at all (or only defaulted loans), so the outcome is perfectly determined for those levels of agent. The maximum likelihood estimates for those levels tend toward infinity, and the model fails to converge.

9. An analyst investigates Region (A, B, or C) as an input variable in a logistic regression model. The analyst discovers that the probability of purchasing a certain item when Region = A is 1. What problem does this illustrate?

Explanation

Quasi-complete separation occurs when one or more levels of a predictor perfectly predict the outcome while the remaining levels do not. Here the probability of purchasing the item is 1 when Region = A, so the outcome is fully determined for that level, whereas Regions B and C presumably contain both purchasers and non-purchasers. The maximum likelihood estimate for Region = A tends toward infinity, which produces extreme coefficients and standard errors and can prevent the logistic regression model from converging.
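One simple way to screen for this problem before fitting is to cross-tabulate each categorical predictor against the target and look for levels with an empty cell. The Python sketch below assumes a 0/1 target; the function name separated_levels and the sample data are hypothetical.

```python
from collections import defaultdict

def separated_levels(levels, outcomes):
    """Return the levels of a categorical predictor whose observed
    outcomes are all 0 or all 1 (a warning sign for separation)."""
    counts = defaultdict(lambda: [0, 0])  # level -> [non-events, events]
    for level, y in zip(levels, outcomes):
        counts[level][y] += 1
    return sorted(
        level for level, (n0, n1) in counts.items() if n0 == 0 or n1 == 0
    )

# Hypothetical data: every customer in Region A purchased
region = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
purchase = [1, 1, 1, 0, 1, 0, 1, 0, 0]

flagged = separated_levels(region, purchase)
print(flagged)
```

Levels flagged this way are candidates for collapsing with other levels before the model is fit.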

10. The Intercept estimate is interpreted as:

Explanation

The intercept estimate is interpreted as the predicted value of the response when all predictors equal 0. Whether that value is meaningful depends on whether 0 is a plausible value for every predictor in the model.

11. In partitioning data for model assessment, which sampling methods are acceptable? 

Explanation

In partitioning data for model assessment, both simple random sampling without replacement and stratified random sampling without replacement are acceptable methods. Simple random sampling without replacement involves randomly selecting a subset of data without replacement, ensuring that each sample is unique. Stratified random sampling without replacement involves dividing the data into distinct groups or strata based on certain characteristics and then randomly selecting samples from each stratum without replacement. Sequential random sampling with replacement and simple random sampling with replacement are not acceptable methods for partitioning data for model assessment.

12. An analyst knows that the categorical predictor, storeId, is an important predictor of the target. However, storeId has too many levels to be a feasible predictor in the model. The analyst wants to combine stores and treat them as members of the same class level. What are the two most effective ways to address the problem? (Choose two.)

Explanation

The two most effective ways to address the problem of having too many levels for the storeId predictor are to cluster the stores using Greenacre's method to combine similar stores and to use subject matter expertise to combine similar stores. Clustering allows for grouping of stores based on similarities, while subject matter expertise allows for a more nuanced understanding of the stores and their similarities. Both approaches help to reduce the number of levels for the storeId predictor, making it more feasible for the model.

13. Including redundant input variables in a regression model can:

Explanation

Including redundant input variables in a regression model can destabilize parameter estimates because these variables do not contribute any additional information to the model. This can lead to unstable and unreliable estimates of the parameters. Additionally, including redundant variables increases the risk of overfitting, where the model becomes too complex and fits the noise in the data rather than the underlying relationship. Overfitting can result in poor generalization to new data and decreased model performance.

14. A confusion matrix is created for data that were oversampled due to a rare target. What values are not affected by this oversampling?

Explanation

When data is oversampled because the target is rare, event cases are deliberately over-represented in the sample relative to the population. Sensitivity is the proportion of actual events that are classified as events, and specificity is the proportion of actual non-events that are classified as non-events. Each is computed within a single actual class, so changing the ratio of events to non-events does not change either one. Measures that mix the two classes, such as overall accuracy and the positive and negative predicted values, depend on the class proportions and are affected by oversampling.
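The effect of oversampling on these measures can be checked with a small Python sketch. It uses the simplest possible form of oversampling, replicating every event row a fixed number of times, which is an assumption made for illustration; the counts are hypothetical.

```python
def sens_spec_ppv(actual, predicted):
    """Return (sensitivity, specificity, positive predicted value)
    for 0/1 actual labels and 0/1 predicted labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp), tp / (tp + fp)

# Hypothetical sample with a rare target: 10 events, 90 non-events
event_pred = [1] * 8 + [0] * 2       # 8 of the 10 events are caught
nonevent_pred = [1] * 9 + [0] * 81   # 9 of the 90 non-events are flagged

actual = [1] * 10 + [0] * 90
predicted = event_pred + nonevent_pred
original = sens_spec_ppv(actual, predicted)

# Oversampled version: each event row replicated 9 times (90 vs. 90)
over_actual = [1] * 90 + [0] * 90
over_predicted = event_pred * 9 + nonevent_pred
oversampled = sens_spec_ppv(over_actual, over_predicted)

print(original)
print(oversampled)
```

Sensitivity and specificity come out the same in both samples, while the positive predicted value changes with the class balance.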

15. Which statement is correct at an alpha level of 0.05?

Explanation

The correct answer is "Gender should not be removed due to its involvement in the significant interaction." This statement suggests that gender should not be removed from the analysis because it plays a significant role in the interaction between variables. This implies that gender has an impact on the outcome being studied and should therefore be considered in the analysis.

16. Refer to the ROC curve: As you move along the curve, what changes?

Explanation

As you move along the ROC curve, the probability cutoff for scoring changes. The ROC curve is a graphical representation of the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) for different probability cutoffs. As the cutoff changes, the sensitivity and specificity values also change, causing the points on the ROC curve to shift. The probability cutoff determines the threshold for classifying an observation as positive or negative, and adjusting it can affect the balance between correctly identifying true positives and minimizing false positives.
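A brief Python sketch shows how each probability cutoff yields one point on the ROC curve. The labels and predicted probabilities are made up for illustration, and an observation is treated as an event when its probability is at or above the cutoff, which is an assumed convention.

```python
def roc_point(actual, prob, cutoff):
    """Return (false positive rate, true positive rate) when observations
    with prob >= cutoff are classified as events."""
    tp = sum(1 for a, p in zip(actual, prob) if a == 1 and p >= cutoff)
    fp = sum(1 for a, p in zip(actual, prob) if a == 0 and p >= cutoff)
    positives = sum(actual)
    negatives = len(actual) - positives
    return fp / negatives, tp / positives

# Hypothetical scored data: 4 events and 4 non-events
actual = [1, 1, 1, 1, 0, 0, 0, 0]
prob = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

# Lowering the cutoff moves the point up and to the right along the curve
for cutoff in (0.75, 0.5, 0.25):
    print(cutoff, roc_point(actual, prob, cutoff))
```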

17. A marketing manager attempts to determine those customers most likely to purchase additional products as the result of a nation-wide marketing campaign. The manager possesses a historical dataset (CAMPAIGN) of a similar campaign from last year. It has the following characteristics:
* Target variable: Respond (0,1)
* Continuous predictor: Income
* Categorical predictor: Homeowner (Y,N)
Which SAS program performs this analysis?

Explanation

The correct answer is A because the marketing manager wants to determine the customers most likely to purchase additional products, which suggests a predictive modeling problem. The manager has a historical dataset with a target variable (Respond) and predictor variables (Income and Homeowner). To perform this analysis, the manager would need to use a predictive modeling technique, such as logistic regression or decision tree, which can be implemented using SAS programming.

18. An analyst examined logistic regression models for predicting whether a customer would make a purchase. The ROC curve displayed summarizes the models. Using the selected model and the analyst's decision rule, 25% of the customers who did not make a purchase are incorrectly classified as purchasers. What can be concluded from the graph?

Explanation

The ROC curve summarizes the performance of the logistic regression models in predicting whether a customer would make a purchase. The decision rule misclassifies 25% of the customers who did not make a purchase, which corresponds to a false positive rate (1 - specificity) of 0.25. Reading the selected model's curve at that point gives a sensitivity of about 0.85, so about 85% of the customers who did make a purchase are correctly classified as purchasers.

19. Which of the following describes a concordant pair of observations in the LOGISTIC procedure?

Explanation

In the LOGISTIC procedure, a concordant pair of observations refers to a situation where an observation with the event (outcome of interest) has a higher predicted probability than an observation without the event. This implies that the model is correctly predicting the occurrence of the event, as the observation with the event has a higher likelihood of experiencing it compared to the observation without the event.
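The pair-counting idea can be sketched in Python as follows. Every event observation is compared with every non-event observation, and the c statistic is computed as (concordant + 0.5 * tied) / total pairs. This sketch compares exact predicted probabilities; PROC LOGISTIC may group probabilities into bins before comparing them, so its reported counts can differ slightly. The data is hypothetical.

```python
def pair_counts(actual, prob):
    """Count concordant, discordant, and tied event/non-event pairs."""
    events = [p for a, p in zip(actual, prob) if a == 1]
    nonevents = [p for a, p in zip(actual, prob) if a == 0]
    concordant = discordant = tied = 0
    for p_event in events:
        for p_nonevent in nonevents:
            if p_event > p_nonevent:
                concordant += 1   # event ranked above non-event
            elif p_event < p_nonevent:
                discordant += 1   # event ranked below non-event
            else:
                tied += 1
    return concordant, discordant, tied

# Hypothetical scored data: 3 events and 3 non-events (9 pairs)
actual = [1, 1, 1, 0, 0, 0]
prob = [0.9, 0.6, 0.4, 0.7, 0.4, 0.2]

concordant, discordant, tied = pair_counts(actual, prob)
total = concordant + discordant + tied
c_statistic = (concordant + 0.5 * tied) / total

print(concordant, discordant, tied, round(c_statistic, 4))
```

The c statistic computed this way equals the area under the ROC curve, so a model with more concordant pairs also has a larger area under its curve.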

20. What is the default method in the LOGISTIC procedure to handle observations with missing data?

Explanation

The default method in the LOGISTIC procedure to handle observations with missing data is to only use cases with variables that are fully populated. This means that any observation with missing data will be excluded from the analysis.

21. A linear model has the following characteristics:
* A dependent variable (y)
* One continuous variable (x1), including a quadratic term (x1^2)
* One categorical (d with 3 levels) predictor variable and an interaction term (d by x1)
How many parameters, including the intercept, are associated with this model? Enter your numeric answer in the space below. Do not add leading or trailing spaces to your answer.

Explanation

This model has 7 parameters: the intercept (1), the coefficients for x1 and its quadratic term (2), the coefficients for the categorical variable d, which needs two dummy variables for its three levels (2), and the coefficients for the d by x1 interaction, one for each of those two dummy variables (2).
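The count can be tallied explicitly in a few lines of Python. The tally assumes full-rank (reference-cell) coding, in which a categorical variable with k levels contributes k - 1 non-redundant parameters; the variable names are illustrative only.

```python
# Parameter tally for y = f(x1, x1^2, d, d*x1), assuming reference-cell coding
intercept = 1
continuous_terms = 2                 # x1 and its quadratic term
levels_of_d = 3
dummies_for_d = levels_of_d - 1      # one level serves as the reference
interaction_terms = dummies_for_d    # each dummy multiplied by x1

total_parameters = (
    intercept + continuous_terms + dummies_for_d + interaction_terms
)
print(total_parameters)
```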

22. A company has branch offices in eight regions. Customers within each region are classified as either "High Value" or "Medium Value" and are coded using the variable name VALUE. In the last year, the total amount of purchases per customer is used as the response variable. Suppose there is a significant interaction between REGION and VALUE. What can you conclude?

Explanation

The significant interaction between REGION and VALUE indicates that the relationship between the average purchases for medium and high value customers varies depending on the region. This suggests that the impact of customer value on purchases differs across different regions, indicating that the regional factor plays a role in influencing customer behavior and purchase patterns.

23. Given alpha=0.02, which conclusion is justified regarding percentage of body fat, comparing small (S), medium (M), and large (L) wrist sizes?

Explanation

The given answer suggests that there is a significant difference in percentage of body fat between individuals with large wrist size and individuals with small wrist size. This conclusion is based on the information provided in the question, which states that the percentage of body fat varies significantly depending on wrist size. Therefore, it can be inferred that individuals with larger wrist sizes have a significantly different percentage of body fat compared to individuals with smaller wrist sizes.

24. The standard form of a linear regression is: Y = beta0 + beta1*X + error. Which statement best summarizes the assumptions placed on the errors?

Explanation

The assumption placed on the errors in linear regression is that they are independent, normally distributed with zero mean and constant variance. This means that the errors are not correlated with each other, they follow a normal distribution, their average value is zero, and their variance remains constant across all levels of the predictor variable.

25. What is a drawback to performing data cleansing (imputation, transformations, etc.) on raw data prior to partitioning the data for honest assessment as opposed to performing the data cleansing after partitioning the data?

Explanation

Performing data cleansing on raw data prior to partitioning the data for honest assessment means that the cleansing methods cannot be compared for their effectiveness. This is because the data is already cleaned before it is divided into training and test sets, so there is no way to evaluate how well each cleansing method performs on different subsets of the data. This drawback limits the ability to make informed decisions about which cleansing methods are most effective for improving the quality of the data.

26. The box plot was used to analyze daily sales data following three different ad campaigns. The business analyst concludes that one of the assumptions of ANOVA was violated. Which assumption has been violated and why?

Explanation

The correct answer is constant variance because the interquartile ranges are different in different ad campaigns. In ANOVA, one of the assumptions is that the variances of the groups being compared are equal. The interquartile range is a measure of dispersion, and if the interquartile ranges are different between the ad campaigns, it suggests that the variances are not equal. Therefore, the assumption of constant variance is violated.

27. Which SAS program will correctly use backward elimination selection criterion within the REG procedure?

Explanation

Option B is the correct answer. The programs are not reproduced here, but in the REG procedure backward elimination is requested with the SELECTION=BACKWARD option on the MODEL statement. Backward elimination starts with all candidate variables in the model and removes them one at a time, dropping the least significant variable at each step until every remaining variable meets the significance level to stay (SLSTAY=).

28. An analyst has a sufficient volume of data to perform a 3-way partition of the data into training, validation, and test sets to perform honest assessment during the model building process. What is the purpose of the test data set?

Explanation

The purpose of the test data set is to provide an unbiased measure of assessment for the final model. This means that the test data set is used to evaluate the performance of the model that has been built using the training and validation data sets. By using a separate test data set, it ensures that the evaluation of the model is not influenced by the data that was used to build and fine-tune the model. This helps to provide a more accurate assessment of how the model will perform on new, unseen data.

29. A non-contributing predictor variable (Pr > |t| =0.658) is added to an existing multiple linear regression model. What will be the result?

Explanation

A non-contributing predictor is one that does not significantly help predict the dependent variable, as the large p-value (0.658) indicates. Adding it still increases R-Square, because R-Square can never decrease when a predictor is added to a least-squares model; the increase is simply very small. Adjusted R-Square, which penalizes the extra parameter, can decrease in this situation.

30. Customers were surveyed to assess their intent to purchase a product. An analyst divided the customers into groups defined by the company's pre-assigned market segments and tested for a difference in the customers' average intent to purchase. The following is the output from the GLM procedure: What percentage of customers' intent to purchase is explained by market segment?

Explanation

The GLM procedure output indicates that 76% of the customers' intent to purchase is explained by the market segment. This means that the customers' likelihood or willingness to purchase the product is significantly influenced by the market segment they belong to. The analyst divided the customers into different groups based on the company's pre-assigned market segments and found that these segments have a strong impact on the customers' intent to purchase.

31. What does the reference line at lift = 1 correspond to?

Explanation

The reference line at lift = 1 corresponds to the predicted lift if the entire population is scored as event cases. This means that if all individuals in the population are classified as event cases, the lift will be equal to 1. Lift is a measure of the effectiveness of a predictive model in identifying the target variable, and a lift value of 1 indicates that the model is not performing better than random chance.

32. Which of the following describes a discordant pair of observations in the LOGISTIC procedure?

Explanation

A discordant pair is a pair of observations, one with the event and one without, in which the observation with the event has a lower predicted probability than the observation without the event. The model ranks such a pair in the wrong order, so a high percentage of discordant pairs indicates poor discrimination between events and non-events.

33. Spearman statistics in the CORR procedure are useful for screening for irrelevant variables by investigating the association between which function of the input variables?

Explanation

The Spearman statistics in the CORR procedure are useful for screening for irrelevant variables by investigating the association between rank-ordered values of the variables. This means that the Spearman statistics measure the strength and direction of the monotonic relationship between variables, which can help identify variables that are not relevant or do not have a significant impact on the outcome. By ranking the values of the variables and calculating the Spearman correlation coefficient, researchers can determine if there is a consistent pattern or trend between the variables, which can inform further analysis and variable selection.
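As a rough illustration, the Spearman correlation is the Pearson correlation computed on ranks rather than raw values. The Python sketch below uses average ranks for ties; the helper names and the sample data are hypothetical and are not taken from the CORR procedure.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical input with one extreme value; the relationship is monotonic
income = [10, 20, 30, 40, 1000]
target = [1, 2, 3, 4, 5]

print(round(pearson(income, target), 3), round(spearman(income, target), 3))
```

Because only the ordering matters, the extreme income value lowers the Pearson correlation but leaves the Spearman correlation at 1.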

34. Based upon the comparative ROC plot for two competing models, which is the champion model and why?

Explanation

Candidate 2 is the champion model because the area under the curve is greater. A larger area under the ROC curve indicates better overall ability to distinguish between positive and negative cases across the range of cutoffs. Therefore, Candidate 2 is considered the better model based on the comparative ROC plot.

35. When mean imputation is performed on data after the data is partitioned for an honest assessment, what is the most appropriate method for handling the mean imputation?

Explanation

When mean imputation is performed on data after the data is partitioned for an honest assessment, the most appropriate method is to apply the sample means from the training data set to the validation and test data sets. The validation and test sets stand in for new data that the model has not seen, so every quantity used to prepare them, including imputation values, should be learned from the training data alone. Computing means from the validation or test sets would let information from the holdout data leak into the modeling process and bias the assessment of the model's performance.
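A minimal Python sketch of this workflow follows. Missing values are represented by None, and the function names fit_means and impute, along with the sample rows, are hypothetical.

```python
def fit_means(rows, columns):
    """Compute column means from the non-missing values in rows."""
    means = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        means[col] = sum(observed) / len(observed)
    return means

def impute(rows, means):
    """Return copies of rows with missing values replaced by the given means."""
    return [
        {col: (means[col] if value is None and col in means else value)
         for col, value in row.items()}
        for row in rows
    ]

# Hypothetical partitions
train = [{"income": 40.0}, {"income": None}, {"income": 60.0}]
valid = [{"income": None}, {"income": 90.0}]

# Means are learned from the training partition only ...
training_means = fit_means(train, ["income"])

# ... and then applied to every partition
train_imputed = impute(train, training_means)
valid_imputed = impute(valid, training_means)

print(training_means, valid_imputed)
```

The missing validation value is filled with the training mean (50.0) rather than with a mean computed from the validation rows.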

36. Screening for non-linearity in binary logistic regression can be achieved by visualizing:

Explanation

A trend plot of empirical logit versus a predictor variable can be used to screen for non-linearity in binary logistic regression. This plot helps to visualize the relationship between the predictor variable and the log odds of the binary response. If the trend plot shows a non-linear pattern, it suggests that there may be a non-linear relationship between the predictor variable and the log odds, indicating the need for non-linear modeling techniques or transformations of the predictor variable.

37. Which statistic indicates a better model when it gets larger? 

Explanation

Adjusted R Square is a statistic that indicates the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. As Adjusted R Square increases, it suggests that the model is better at explaining the variation in the dependent variable. Therefore, a larger value of Adjusted R Square indicates a better model.

38. A predictive model uses a data set that has several variables with missing values. What two problems can arise with this model? (Choose two.)

Explanation

A predictive model that uses a dataset with missing values can lead to two problems. First, complete case analysis means that fewer observations will be used in the model building process. This can result in a loss of valuable information and potential bias in the model. Second, new cases with missing values on input variables cannot be scored without extra data processing. This means that the model may not be able to make accurate predictions for new data points that have missing values, requiring additional steps to handle missing data before scoring.

39. SAS output from the RSQUARE selection method, within the REG procedure, is shown. The top two models in each subset are given. Based on the AIC statistic, which model is the champion model?

Explanation

The champion model is the one with the lowest AIC statistic. Since the AIC statistic is not provided in the given information, it is not possible to determine which model is the champion model.

40. Suppose training data are oversampled in the event group to make the number of events and nonevents roughly equal. Logistic regression is run and the probabilities are output to a data set NEW and given the variable name PE. A decision rule considered is, "Classify data as an event if the probability is greater than 0.5." Also, the data set NEW contains a variable TG that indicates whether there is an event (1=Event, 0= No event). The following SAS program was used. What does this program calculate?

Explanation

The given SAS program calculates the sensitivity. Sensitivity is a measure of the proportion of actual positive cases that are correctly identified as positive by the model. In this case, the program uses logistic regression to calculate probabilities of events and nonevents, and then applies a decision rule to classify data as an event if the probability is greater than 0.5. Sensitivity measures how well the model identifies the actual positive cases correctly.

41. There are variable clusters among the input variables for a regression application. Which SAS procedure provides a viable solution?

Explanation

The varclus procedure in SAS provides a viable solution for regression applications with variable clusters in the input variables. This procedure helps in identifying groups of variables that have high correlation with each other, allowing for the creation of clusters. By using varclus, researchers can reduce the dimensionality of their data and select representative variables from each cluster, which can then be used in regression analysis.

42. Excluding redundant input variables in a regression model can:

Explanation

Excluding redundant input variables in a regression model can stabilize parameter estimates by removing unnecessary variables that do not contribute significantly to the model's prediction. This helps to reduce the variability in the estimated coefficients, making them more reliable. Additionally, removing redundant variables can decrease the risk of overfitting, which occurs when a model fits the training data too closely and performs poorly on new, unseen data. By simplifying the model and focusing on the most relevant variables, the risk of overfitting is reduced, leading to better generalization and predictive accuracy.

43. Which SAS program will divide the original data set into 60% training and 40% validation data sets, stratified by county?

Explanation

The correct answer is the last option. This is because it first sorts the data set by county using the PROC SORT step. Then, it uses the PROC SURVEYSELECT step to divide the data set into a training set and a validation set, with a sampling rate of 0.6 (60%). The stratification is done by county, ensuring that the two sets have a proportional representation of each county in the original data set. The OUT= option is used to create a new data set named "sample" that contains the selected observations. The OUTALL option is used to include all observations in the output data set, not just the selected ones.
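The same idea, a 60/40 split drawn separately within each stratum, can be sketched in Python. This is not a reproduction of PROC SURVEYSELECT: the function name, the seed, the rounding of per-stratum sample sizes, and the sample rows are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(rows, stratum_key, train_rate=0.6, seed=12345):
    """Split rows into training and validation lists, sampling without
    replacement inside each stratum so both lists mirror the strata mix."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[stratum_key]].append(row)

    train, valid = [], []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum][:]   # copy before shuffling
        rng.shuffle(members)
        n_train = round(train_rate * len(members))
        train.extend(members[:n_train])
        valid.extend(members[n_train:])
    return train, valid

# Hypothetical data: 10 rows in one county and 20 in another
rows = [{"id": i, "county": "North" if i < 10 else "South"} for i in range(30)]
train, valid = stratified_split(rows, "county")

print(len(train), len(valid))
```

Each county contributes 60% of its rows to the training list, and no row appears in both lists.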

44. The total modeling data has been split into training, validation, and test data. What is the best data to use for model assessment?

Explanation

The best data to use for model assessment is the validation data. This is because the validation data is not used during the training process and is therefore unseen by the model. It serves as an unbiased measure of the model's performance and can be used to fine-tune the model's hyperparameters or compare different models. The test data, on the other hand, should be kept separate and only used at the very end to provide a final evaluation of the model's performance.

45.

Explanation

The correct answer is that the association between the continuous predictor and the log-odds is quadratic. This means that as the continuous predictor increases, the log-odds of the binary response also increase, but at a decreasing rate. In other words, the relationship between the continuous predictor and the log-odds is not linear, but instead follows a quadratic pattern. This suggests that there may be a curvilinear relationship between the continuous predictor and the binary response.

46. At a depth of 0.1, Lift=3.14. What does this mean?

Explanation

This means that if we select the top 10% of the population based on the model's score, we can expect to find 3.14 times as many events as in a randomly selected 10% of the population. In other words, the event rate among the top-scored 10% is 3.14 times the overall event rate.
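Lift at a given depth can be computed as the event rate among the top-scored fraction of cases divided by the overall event rate. The Python sketch below uses made-up scores and outcomes, so its lift value differs from the 3.14 in the question.

```python
def lift_at_depth(actual, score, depth):
    """Event rate in the top `depth` fraction of cases (ranked by score,
    highest first) divided by the overall event rate."""
    n_top = max(1, round(depth * len(actual)))
    ranked = sorted(zip(score, actual), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top
    overall_rate = sum(actual) / len(actual)
    return top_rate / overall_rate

# Hypothetical scored data: 20 cases, 4 events (20% overall event rate)
score = [(20 - i) / 20 for i in range(20)]   # 1.00, 0.95, ..., 0.05
actual = [1, 1, 0, 1, 0, 0, 0, 1] + [0] * 12

print(lift_at_depth(actual, score, 0.1))   # top 2 cases
print(lift_at_depth(actual, score, 1.0))   # whole population
```

At a depth of 1.0 every case is selected, so the lift equals 1, which is what the reference line on a lift chart represents.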

47. Identify the correct SAS program for fitting a multiple linear regression model with dependent variable (y) and four predictor variables (x1-x4).

Explanation

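No explanation is available for this question, and the answer options are not reproduced here. In general, a multiple linear regression of y on x1 through x4 is fit in the REG procedure with a MODEL statement that lists the response on the left of the equals sign and the predictors, separated by spaces, on the right.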

48. The selection criterion used in the forward selection method in the REG procedure is:

Explanation

The selection criterion used in the forward selection method in the REG procedure is SLE, which stands for significance level of entry. This criterion is used to determine the level of significance at which a predictor variable should be included in the regression model. The forward selection method starts with an empty model and iteratively adds variables that have the lowest p-values (significance levels) until no more variables meet the predetermined significance level for entry. This helps to identify the most significant predictor variables for the regression model.

49. Given the following SAS dataset TEST:
Inc_Group
1
2
3
4
5
Which SAS program is NOT a correct way to create dummy variables?

Explanation

The marked answer is option D. The code for options A through D is not reproduced here, so the specific defect in option D cannot be shown. In general, a correct program creates one 0/1 indicator variable for each level of "Inc_Group" (or one fewer when a reference level is left out), so that each observation has a 1 only in the indicator that matches its level.
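As a rough illustration of what dummy coding produces, the sketch below builds one 0/1 indicator per level of Inc_Group. It is written in Python rather than SAS, and the function name `make_dummies` and the `Inc_` prefix are assumptions, not part of the quiz.

```python
def make_dummies(values, levels, prefix):
    """Return one dict per observation with a 0/1 indicator for each level."""
    rows = []
    for value in values:
        rows.append({f"{prefix}{level}": int(value == level) for level in levels})
    return rows


# The five Inc_Group values listed in the question.
inc_group = [1, 2, 3, 4, 5]
dummies = make_dummies(inc_group, levels=[1, 2, 3, 4, 5], prefix="Inc_")

for group, row in zip(inc_group, dummies):
    print(group, row)
```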

50. An analyst generates a model using the LOGISTIC procedure. They are now interested in getting the sensitivity and specificity statistics on a validation data set for a variety of cutoff values. Which statement and option combination will generate these statistics?

Explanation

The correct combination of statement and option is "Score data=valid1 outroc=roc". The SCORE statement scores the validation data set valid1, and its OUTROC= option writes a data set (here named roc) with one row per cutoff value. Each row holds the sensitivity and 1 - specificity at that cutoff, which are the coordinates of the receiver operating characteristic (ROC) curve.
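The kind of table that results can be sketched as follows. This is an illustration in Python with hypothetical probabilities and targets, not a reproduction of the SAS output; `roc_table` is an assumed name.

```python
def roc_table(probs, targets, cutoffs):
    """Return (cutoff, sensitivity, specificity) for each cutoff value."""
    n_events = sum(targets)
    n_nonevents = len(targets) - n_events
    rows = []
    for cutoff in cutoffs:
        # A case is classified as an event when its probability reaches the cutoff.
        true_pos = sum(1 for p, t in zip(probs, targets) if p >= cutoff and t == 1)
        true_neg = sum(1 for p, t in zip(probs, targets) if p < cutoff and t == 0)
        rows.append((cutoff, true_pos / n_events, true_neg / n_nonevents))
    return rows


# Hypothetical validation scores: 3 events and 3 non-events.
valid_probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
valid_targets = [1, 1, 0, 1, 0, 0]

for row in roc_table(valid_probs, valid_targets, [0.35, 0.5, 0.85]):
    print(row)
```

Lowering the cutoff raises sensitivity and lowers specificity, which is the trade-off the ROC curve displays.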

51. Given the following LOGISTIC procedure: What is the difference between the datasets OUTFILE_1 and OUTFILE_2?

Explanation

The correct answer states that OUTFILE_1 contains the final parameter estimates while OUTFILE_2 contains the newly scored probabilities. This means that OUTFILE_1 provides the estimated values for the parameters of the logistic regression model, which are used to predict the probabilities of the outcome variable. On the other hand, OUTFILE_2 provides the probabilities that are calculated using the newly scored data.

52. An analyst compares the mean salaries of men and women working at a company. The SAS data set SALARY contains variables: Gender (M or F) Pay (dollars per year) Which SAS programs can be used to find the p-value for comparing men's salaries with women's salaries? (Choose two.)

Explanation

The first program, "proc glm data=salary; class gender; model pay=gender; run;", compares men's and women's salaries by fitting a one-way ANOVA model with gender as the classification variable and pay as the response variable. With only two groups, the p-value of the F test for gender is the same as the p-value of the pooled-variance t-test, so it can be used to determine whether salaries differ significantly between men and women.

The second program, "proc ttest data=salary; class gender; var pay; run;", can be used to compare men's and women's salaries by conducting an independent samples t-test. The p-value from the t-test can be used to determine if there is a significant difference in means between the two groups.

53. One common approach for predicting rare events in the LOGISTIC procedure is to build a model that disproportionately over-represents those cases with an event occurring (e.g. a 50-50 event/non-event split). What problem does this present?

Explanation

When a model is built on a sample that over-represents cases with an event occurring, bias is introduced in the intercept estimate only. The intercept is too large for the true population, so predicted probabilities are inflated unless the intercept is adjusted for the oversampling. The non-intercept parameter estimates and the sensitivity estimates are not affected by this approach and remain unbiased.
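One common way to adjust for this is to subtract an offset from the intercept. The sketch below assumes the standard offset, the log of (sample event odds divided by population event odds); the 5% population event rate and the function names are hypothetical.

```python
import math


def oversampling_offset(sample_event_rate, population_event_rate):
    """Log of the ratio of sample event odds to population event odds."""
    rho1, pi1 = sample_event_rate, population_event_rate
    return math.log((rho1 * (1 - pi1)) / ((1 - rho1) * pi1))


def corrected_intercept(sample_intercept, sample_event_rate, population_event_rate):
    """Shift the intercept fitted on the oversampled data back to the population scale."""
    return sample_intercept - oversampling_offset(sample_event_rate, population_event_rate)


# Hypothetical case: a 50-50 modeling sample drawn from a population with 5% events.
offset = oversampling_offset(0.50, 0.05)
print(offset)                                  # log(19)
print(corrected_intercept(0.0, 0.50, 0.05))    # intercept on the population scale
```

In this example an intercept-only model fitted on the 50-50 sample predicts 0.5; after the correction it predicts 1 / (1 + 19) = 0.05, the assumed population rate.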

54. Select the equivalent LOGISTIC procedure model statements. (Choose two.)

Explanation

The correct answer includes two equivalent LOGISTIC procedure model statements. The first statement, "Model Purchase = Gender Age Region;", specifies that "Purchase" is the dependent variable and "Gender", "Age", and "Region" are the independent variables. The second statement, "Model Purchase = Gender|Age|Region @1;", uses the bar operator, which would normally expand to all main effects and their interactions, but the @1 limits the expansion to effects of order one. The second statement therefore contains the same three main effects and no interactions, so both statements specify the same model.

55. There are missing values in the input variables for a regression application. Which SAS procedure provides a viable solution?

Explanation

STDIZE is a SAS procedure that standardizes variables, for example by subtracting the mean and dividing by the standard deviation. It can also impute: with the REPONLY option, it replaces missing values with the location measure chosen by METHOD= (such as the mean or median) and leaves the non-missing values unchanged. Because regression procedures drop observations that have missing inputs, this replacement provides a viable solution for handling missing values in regression applications.

56. A non-contributing predictor variable (Pr > |t| =0.658) is eliminated from an existing multiple linear regression model. What will be the result?

Explanation

A non-contributing predictor variable is one that does not have a significant effect on the response variable. Even so, removing any predictor can never reduce the error sum of squares, so R-Square cannot increase when a variable is eliminated. For a non-contributing variable the drop is small, and the adjusted R-Square may even rise, but the correct answer is a decrease in R-Square.

57. Which method is NOT an appropriate way to score new observations with a known target in a logistic regression model?

Explanation

The correct answer is "Augment the training data set with new observations and rerun the LOGISTIC procedure." This is not an appropriate way to score new observations with a known target in a logistic regression model because it involves retraining the model with the augmented data set. Scoring new observations should be done using the saved parameter estimates from the original logistic regression procedure.

58. What is the value of R-squared?

Explanation

R-squared is the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. In this case, R-squared is 0.4115, indicating that approximately 41.15% of the variability in the dependent variable is explained by the independent variables in the model.

59. The following LOGISTIC procedure output analyzes the relationship between a binary response and an ordinal predictor variable, wrist_size. Using reference cell coding, the analyst selects Large (L) as the reference level. What is the estimated logit for a person with large wrist size? Click the calculator button to display a calculator if needed.

60. This question will ask you to provide a missing option. Given the following SAS program:

Explanation

The missing option in the given SAS program is "OUTP=estimates". OUTP= is an option of the CORR procedure that writes the Pearson correlation statistics to an output data set, here named estimates. The other choices, OUTPUT=, OUTSTAT=, and OUTCORR=, do not create that data set.

61. Assume a $10 cost for soliciting a non-responder and a $200 profit for soliciting a responder. The logistic regression model gives a probability score named P_R on a SAS data set called VALID. The VALID data set contains the responder variable Purch, a 1/0 variable coded as 1 for the responder. Customers will be solicited when their probability score is more than 0.05. Which SAS program computes the profit for each customer in the data set VALID?

Explanation

The given SAS program computes the profit for each customer in the data set VALID. The expression (P_R>0.05) evaluates to 1 when the customer's probability score is greater than 0.05, meaning the customer is solicited, and to 0 otherwise. For a solicited responder (Purch=1), the term (P_R>0.05)*Purch*200 contributes the $200 profit. For a solicited non-responder (Purch=0), the term (P_R>.05)*(1-Purch)*10 contributes the $10 cost. The program subtracts the second term from the first, so a customer who is not solicited has a profit of 0.
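The same arithmetic can be checked with a short sketch. It mirrors the two terms quoted in the explanation; the function name and the example customers are hypothetical.

```python
def customer_profit(p_r, purch, cutoff=0.05, profit=200, cost=10):
    """Profit for one customer: +200 for a solicited responder, -10 for a solicited non-responder."""
    solicited = int(p_r > cutoff)  # 1 when the score is more than the cutoff, else 0
    return solicited * purch * profit - solicited * (1 - purch) * cost


# Hypothetical customers: (probability score, responder flag).
print(customer_profit(0.10, 1))  # solicited responder
print(customer_profit(0.10, 0))  # solicited non-responder
print(customer_profit(0.03, 1))  # not solicited
```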

62. A linear model has the following characteristics: a dependent variable (y), three continuous predictor variables (x1-x3), and one categorical predictor variable (c1 with 3 levels). Which SAS program fits this model?

Explanation

The correct answer is "proc glm data=sasuser.mlr; class c1; model y=c1 x1-x3 /solution; run;". This program fits the given linear model because it includes the dependent variable (y), the three continuous predictor variables (x1-x3), and the categorical predictor variable (c1 with 3 levels) using the GLM procedure in SAS. The "class" statement is used to specify the categorical variable, and the "model" statement is used to define the relationship between the dependent and predictor variables. The "/solution" option is used to request parameter estimates and other statistics.

63. What is a correct interpretation of the estimate?

Explanation

The correct interpretation is that for every one-thousand-dollar increase in salary, the odds of the event are multiplied by 1.142, an increase of about 14.2%. Because this odds ratio is greater than 1, there is a positive relationship between salary and the likelihood of the event occurring.
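An odds ratio of this kind is the exponential of the logistic regression coefficient. The sketch below assumes a coefficient of log(1.142) per one thousand dollars purely for illustration; the quiz output itself is not reproduced here.

```python
import math


def odds_ratio(coefficient, units=1):
    """Multiplicative change in the odds for a `units` increase in the predictor."""
    return math.exp(coefficient * units)


# Hypothetical coefficient chosen so that a one-unit increase gives 1.142.
salary_coefficient = math.log(1.142)

print(odds_ratio(salary_coefficient))      # one unit (one thousand dollars)
print(odds_ratio(salary_coefficient, 2))   # two units: the ratios multiply
```

The effect compounds multiplicatively, so a two-unit increase multiplies the odds by 1.142 squared rather than by 2 times 1.142.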

64. Which statistic, calculated from a validation sample, can help decide which model to use for prediction of a binary target variable?

Explanation

The statistic that can help decide which model to use for prediction of a binary target variable is the Average Squared Error. This statistic measures the average of the squared differences between the predicted and actual values of the binary target variable. A lower average squared error indicates a better fit of the model to the data, making it a useful metric for comparing different models and selecting the one with the lowest error.
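A minimal sketch of the comparison, assuming two hypothetical models scored on the same four validation cases:

```python
def average_squared_error(targets, probs):
    """Mean of squared differences between the 0/1 target and the predicted probability."""
    return sum((t - p) ** 2 for t, p in zip(targets, probs)) / len(targets)


# Hypothetical validation targets and predicted probabilities from two models.
ase_targets = [1, 0, 1, 0]
model_a_probs = [0.8, 0.2, 0.6, 0.4]
model_b_probs = [0.6, 0.5, 0.5, 0.5]

print(average_squared_error(ase_targets, model_a_probs))
print(average_squared_error(ase_targets, model_b_probs))
```

Under these assumed values, model A has the lower error and would be preferred.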

65. The question will ask you to provide a missing statement. Given the following SAS program: Which SAS statement will complete the program to correctly score the data set NEW_DATA?

66. This question will ask you to provide missing code segments. A logistic regression model was fit on a data set where 40% of the outcomes were events (TARGET=1) and 60% were non-events (TARGET=0). The analyst knows that the population where the model will be deployed has 5% events and 95% non-events. The analyst also knows that the company's profit margin for correctly targeted events is nine times higher than the company's loss for incorrectly targeted non-event. Given the following SAS program: What X and Y values should be added to the program to correctly score the data?

Explanation

The X value represents the percentage of events in the population where the model will be deployed, which is 5%. The Y value represents the company's profit margin for correctly targeted events, which is 10. Therefore, the correct values to be added to the program are X=.05 and Y=10. This ensures that the logistic regression model is appropriately calibrated to the population and takes into account the company's profit margin for correctly targeted events.

67. Given the following output from the LOGISTIC procedure: Which variables, among those that are statistically significant at an alpha of 0.05, have the greatest and least relative importance on the fitted model?

Explanation

The given answer states that the variable with the greatest relative importance on the fitted model is "DOWN_AMT" and the variable with the least relative importance is "CASH". This means that "DOWN_AMT" has the strongest impact on the model's outcome, while "CASH" has the weakest impact.

68. Consider scoring new observations in the SCORE procedure versus the SCORE statement in the LOGISTIC procedure. Which statement is true?

Explanation

The correct answer is that the SCORE statement in the LOGISTIC procedure returns only predicted probabilities, whereas the SCORE procedure returns only predicted logits. This means that the SCORE statement in the LOGISTIC procedure will provide the probability of an event occurring, while the SCORE procedure will provide the log-odds (logit) of the event occurring.

69. A financial services manager wants to assess the probability that certain clients will default on their Home Equity Line of Credit (HELOC). A former employee left the code listed below. The training data set is named HELOC, while a similar data set of more recent clients is named RECENT_HELOC. Which SAS data steps will calculate the predicted probability of default on recent clients? (Choose two.) data new_prob; set scored_heloc; <insert here>; run;

Explanation

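The code "p=1/(1+exp(-default));" calculates the predicted probability of default by applying the logistic function to the predicted logit stored in default. The code "odds=exp(default); p=odds/(1+odds);" first converts the logit to the odds of default and then converts the odds to the predicted probability. The parentheses around 1+odds are required; without them the expression would evaluate to odds+odds.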
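The equivalence of the two conversions can be checked with a short sketch. It is written in Python rather than as a SAS data step, and the logit values are hypothetical.

```python
import math


def prob_from_logit(logit):
    """Logistic function: p = 1 / (1 + exp(-logit))."""
    return 1 / (1 + math.exp(-logit))


def prob_from_odds(logit):
    """Convert the logit to odds first, then odds to probability."""
    odds = math.exp(logit)
    return odds / (1 + odds)


# Hypothetical predicted logits.
for logit in (-2.0, 0.0, 1.3):
    print(logit, prob_from_logit(logit), prob_from_odds(logit))
```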

70. Calculate the sensitivity, accuracy, error rate. 

Explanation

The given answer provides the sensitivity, accuracy, and error rate for a certain calculation. However, without any context or additional information, it is not possible to determine what calculation or scenario these values represent. Therefore, an explanation cannot be provided.
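For reference, the three statistics have standard definitions in terms of confusion-matrix counts. The sketch below uses hypothetical counts, since the table from the question is not reproduced here.

```python
def classification_stats(true_pos, false_neg, false_pos, true_neg):
    """Return (sensitivity, accuracy, error rate) from confusion-matrix counts."""
    total = true_pos + false_neg + false_pos + true_neg
    sensitivity = true_pos / (true_pos + false_neg)   # share of actual events caught
    accuracy = (true_pos + true_neg) / total          # share of all cases classified correctly
    error_rate = (false_pos + false_neg) / total      # equals 1 - accuracy
    return sensitivity, accuracy, error_rate


# Hypothetical counts: 50 actual events and 150 actual non-events.
sens, acc, err = classification_stats(true_pos=40, false_neg=10, false_pos=20, true_neg=130)
print(sens, acc, err)
```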

71. proc surveyselect data=frame out=sample <insert here> outall;run;

Explanation

The correct answer is "sampsize=800" because it specifies the desired sample size for the SURVEYSELECT procedure. With sampsize set to 800, the procedure randomly selects 800 observations from the input data set "frame". Because the statement also specifies the "outall" option, the output data set "sample" contains every observation from "frame", with the variable Selected set to 1 for the 800 sampled observations and 0 for the rest.

72. What is the use of the reference line in a gains chart?

Explanation

The reference line in the gains chart is the baseline that would be expected without a model, that is, if customers were contacted at random. It shows the relationship between the percentage of customers contacted and the percentage of positive responses received under random selection. For example, if we contact 50% of customers at random, we can expect to receive 50% of the total positive responses. The further a model's gains curve rises above this line, the better the model is at concentrating responders among the customers it ranks highest.
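The comparison between a model's gains and the reference line can be sketched as follows, using ten hypothetical scored customers. The function names are assumptions made for illustration.

```python
def cumulative_gain(scores, events, depth):
    """Share of all events captured in the top `depth` fraction ranked by score."""
    ranked = sorted(zip(scores, events), key=lambda pair: pair[0], reverse=True)
    n_top = round(len(ranked) * depth)
    return sum(event for _, event in ranked[:n_top]) / sum(events)


def baseline_gain(depth):
    """Reference line: random selection captures events in proportion to depth."""
    return depth


# Hypothetical data: 10 customers, 4 responders.
gain_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
gain_events = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

print(cumulative_gain(gain_scores, gain_events, 0.5), baseline_gain(0.5))
```

With these assumed values, contacting the top 50% captures 75% of responders, compared with 50% on the reference line.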


Quiz Review Timeline (Updated): Mar 21, 2023

Our quizzes are rigorously reviewed, monitored and continuously updated by our expert board to maintain accuracy, relevance, and timeliness.

  • Current Version
  • Mar 21, 2023
    Quiz Edited by
    ProProfs Editorial Team
  • Sep 15, 2014
    Quiz Created by
    Wendicai