1.
Refer to the ROC curve:
As you move along the curve, what changes?
A.
The priors in the population
B.
The true negative rate in the population
C.
The proportion of events in the training data
D.
The probability cutoff for scoring
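As a rough illustration of how points on an ROC curve arise, the Python sketch below (not SAS) recomputes the false positive rate and true positive rate for a few cutoffs on a fixed set of scored cases. The scores, labels, and the helper name `roc_point` are made up for the example.

```python
# Hypothetical scored cases: model probability and actual outcome (1 = event).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

def roc_point(cutoff):
    """Return (false positive rate, true positive rate) at one cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos

# The data never change; only the cutoff does, and each cutoff gives one point.
roc = {c: roc_point(c) for c in (0.9, 0.5, 0.25, 0.0)}
for cutoff, (fpr, tpr) in roc.items():
    print(f"cutoff={cutoff:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```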
2.
When mean imputation is performed on data after the data is partitioned for an honest assessment, what is the most appropriate method for handling the mean imputation?
A.
The sample means from the validation data set are applied to the training and test data sets.
B.
The sample means from the training data set are applied to the validation and test data sets.
C.
The sample means from the test data set are applied to the training and validation data sets.
D.
The sample means from each partition of the data are applied to their own partition.
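One common convention for honest assessment is to estimate imputation values on the training partition only and reuse them everywhere else, so that the validation and test partitions contribute nothing to the fitted pipeline. The Python sketch below (not SAS) shows that convention on made-up data; the variable names and the `impute` helper are illustrative assumptions.

```python
from statistics import mean

# Hypothetical partitions of one input variable; None marks a missing value.
train = [10.0, None, 14.0, 12.0]
valid = [None, 20.0]
test = [8.0, None]

# The imputation value is estimated from the training partition only.
train_mean = mean(x for x in train if x is not None)

def impute(values, fill):
    """Replace missing entries with a fill value computed elsewhere."""
    return [fill if v is None else v for v in values]

train_i = impute(train, train_mean)
valid_i = impute(valid, train_mean)  # reuses the training mean
test_i = impute(test, train_mean)    # reuses the training mean
```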
3.
An analyst generates a model using the LOGISTIC procedure. They are now interested in getting
the sensitivity and specificity statistics on a validation data set for a variety of cutoff values.
Which statement and option combination will generate these statistics?
A.
Score data=valid1 out=roc;
B.
Score data=valid1 outroc=roc;
C.
Model resp(event='1') = gender region / outroc=roc;
D.
Model resp(event='1') = gender region / out=roc;
4.
In partitioning data for model assessment, which sampling methods are acceptable?
A.
Simple random sampling without replacement
B.
Simple random sampling with replacement
C.
Stratified random sampling without replacement
D.
Sequential random sampling with replacement
5.
Which SAS program will divide the original data set into 60% training and 40% validation data
sets, stratified by county?
A.
Proc surveyselect data=SASUSER.DATABASE samprate=0.6 out=sample;
strata county;
run;
B.
Proc sort data=SASUSER.DATABASE;
by county;
run;
proc surveyselect data=SASUSER.DATABASE samprate=0.6 out=sample outall;
run;
C.
Proc sort data=SASUSER.DATABASE;
by county;
run;
proc surveyselect data=SASUSER.DATABASE samprate=0.6 out=sample outall;
strata county;
run;
D.
Proc sort data=SASUSER.DATABASE;
by county;
run;
proc surveyselect data=SASUSER.DATABASE samprate=0.6 out=sample;
strata county;
run;
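To make the idea of a stratified 60/40 partition concrete, here is a Python sketch (not SAS) that samples 60% within each county and keeps every row with a 0/1 `selected` flag, so both partitions can be recovered from one output. The function name, the flag name, and the toy rows are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, rate, seed=0):
    """Flag about `rate` of the rows within each stratum as selected."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for i, row in enumerate(rows):
        by_stratum[row[key]].append(i)
    selected = set()
    for indices in by_stratum.values():
        k = round(rate * len(indices))  # per-stratum sample size
        selected.update(rng.sample(indices, k))
    # Keep every row; selected=1 is training, selected=0 is validation.
    return [dict(row, selected=int(i in selected)) for i, row in enumerate(rows)]

# Hypothetical data: 10 rows in county A and 5 rows in county B.
rows = [{"county": c, "id": i} for i, c in enumerate(["A"] * 10 + ["B"] * 5)]
flagged = stratified_split(rows, "county", 0.6)
training = [r for r in flagged if r["selected"] == 1]
validation = [r for r in flagged if r["selected"] == 0]
```

Because the sampling is done inside each county, both partitions keep roughly the same county mix as the original data.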
6.
At a depth of 0.1, Lift=3.14. What does this mean?
A.
Selecting the top 10% of the population scored by the model should result in 3.14 times more events than a random draw of 10%.
B.
Selecting the observations with a response probability of at least 10% should result in 3.14
times more events than a random draw of 10%.
C.
Selecting the top 10% of the population scored by the model should result in 3.14 times greater
accuracy than a random draw of 10%.
D.
Selecting the observations with a response probability of at least 10% should result in 3.14 times
greater accuracy than a random draw of 10%.
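Lift at a given depth is usually computed as the event rate among the top-ranked fraction of cases divided by the overall event rate. The Python sketch below (not SAS) follows that definition on made-up data; `lift_at_depth` and the toy scores are assumptions, not part of the question.

```python
def lift_at_depth(scores, labels, depth):
    """Event rate in the top `depth` fraction divided by the overall event rate."""
    n_top = max(1, round(depth * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Hypothetical data: 20 cases, 4 events, two of them ranked at the very top.
scores = [i / 20 for i in range(20, 0, -1)]
labels = [1 if i in (0, 1, 5, 12) else 0 for i in range(20)]

lift_top_decile = lift_at_depth(scores, labels, 0.1)  # top 2 of 20 cases
lift_everyone = lift_at_depth(scores, labels, 1.0)    # whole population
```

With these toy values the top 10% has an event rate of 100% against a base rate of 20%, and selecting everyone reproduces the base rate.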
7.
What does the reference line at lift = 1 correspond to?
A.
The predicted lift for the best 50% of validation data cases
B.
The predicted lift if the entire population is scored as event cases
C.
The predicted lift if none of the population are scored as event cases
D.
The predicted lift if 50% of the population are randomly scored as event cases
8.
What is the use of the reference line in a gains chart?
9.
Suppose training data are oversampled in the event group to make the number of events and nonevents
roughly equal. Logistic regression is run and the probabilities are output to a data set
NEW and given the variable name PE. A decision rule considered is, "Classify data as an event if
the probability is greater than 0.5." Also, the data set NEW contains a variable TG that indicates
whether there is an event (1=Event, 0=No event).
The following SAS program was used.
What does this program calculate?
A.
B.
C.
D.
Positive predictive value
10.
The plots represent two models, A and B, being fit to the same two data sets, training and
validation.
Model A is 90.5% accurate at distinguishing blue from red on the training data and 75.5% accurate
at doing the same on validation data. Model B is 83% accurate at distinguishing blue from red on
the training data and 78.3% accurate at doing the same on the validation data.
Which of the two models should be selected and why?
A.
Model A. It is more complex with a higher accuracy than model B on training data.
B.
Model A. It performs better on the boundary for the training data.
C.
Model B. It is more complex with a higher accuracy than model A on validation data.
D.
Model B. It is simpler with a higher accuracy than model A on validation data.
11.
Assume a $10 cost for soliciting a non-responder and a $200 profit for soliciting a responder. The
logistic regression model gives a probability score named P_R on a SAS data set called VALID.
The VALID data set contains the responder variable Purch, a 1/0 variable coded as 1 for
the responder. Customers will be solicited when their probability score is more than 0.05.
Which SAS program computes the profit for each customer in the data set VALID?
A.
Profit=(P_R>0.05)*Purch*200-(P_R>.05)*(1-Purch)*10;
B.
Profit=(P_R.05)*(1-Purch)*10;
C.
If P_R> 0.05; profit=(P_R>0.05)*Purch*200-(P_R>.05)*(1-Purch)*10;
D.
If P_R> 0.05;
profit=(P_R>0.05)*Purch*200+(P_R
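The decision rule described in the question (solicit when the score exceeds 0.05, gain $200 from a solicited responder, lose $10 on a solicited non-responder) can be sketched in Python (not SAS) as follows. The function name and default arguments are assumptions; the sketch assigns zero profit to customers who are not solicited.

```python
def profit(p_r, purch, cutoff=0.05, gain=200, cost=10):
    """Profit for one customer under the solicitation rule in the question."""
    solicited = p_r > cutoff  # True/False behaves as 1/0 in arithmetic
    return solicited * purch * gain - solicited * (1 - purch) * cost

# Hypothetical customers: (probability score, responder flag).
customers = [(0.10, 1), (0.10, 0), (0.01, 1), (0.01, 0)]
profits = [profit(p, y) for p, y in customers]
```

Every customer gets a profit value here, including those below the cutoff, which matters when the per-customer profit is later summed or averaged over the whole data set.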
12.
In order to perform an honest assessment on a predictive model, what is an acceptable division
between training, validation, and testing data?
A.
Training: 50% Validation: 0% Testing: 50%
B.
Training: 100% Validation: 0% Testing: 0%
C.
Training: 0% Validation: 100% Testing: 0%
D.
Training: 50% Validation: 50% Testing: 0%
13.
Based upon the comparative ROC plot for two competing models, which is the champion model
and why?
A.
Candidate 1, because the area outside the curve is greater
B.
Candidate 2, because the area outside the curve is greater
C.
Candidate 1, because it is closer to the diagonal reference curve
D.
Candidate 2, because it shows less overfitting than Candidate 1
14.
A confusion matrix is created for data that were oversampled due to a rare target.
What values are not affected by this oversampling?
A.
B.
C.
D.
Sensitivity and Specificity
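A small numeric check can make the effect of oversampling on confusion-matrix statistics visible. The Python sketch below (not SAS) compares a made-up population confusion matrix with a version in which every event is replicated 19 times, which turns a 5% event rate into 50%. All counts and the `rates` helper are illustrative assumptions.

```python
def rates(tp, fn, fp, tn):
    """Common confusion-matrix statistics from the four cell counts."""
    return {
        "sensitivity": tp / (tp + fn),   # computed within actual events
        "specificity": tn / (tn + fp),   # computed within actual non-events
        "ppv": tp / (tp + fp),           # mixes events and non-events
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# Hypothetical population: 50 events and 950 non-events (5% events).
population = rates(tp=40, fn=10, fp=95, tn=855)
# Same classifier, events oversampled 19x: 950 events and 950 non-events.
oversampled = rates(tp=40 * 19, fn=10 * 19, fp=95, tn=855)

for name in population:
    print(f"{name:12s} population={population[name]:.3f} oversampled={oversampled[name]:.3f}")
```

Statistics computed within a single actual class are unchanged by the replication, while those that combine both classes shift with the event proportion.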
15.
This question will ask you to provide missing code segments.
A logistic regression model was fit on a data set where 40% of the outcomes were events
(TARGET=1) and 60% were non-events (TARGET=0). The analyst knows that the population
where the model will be deployed has 5% events and 95% non-events. The analyst also knows
that the company's profit margin for correctly targeted events is nine times higher than the
company's loss for incorrectly targeted non-events.
Given the following SAS program:
What X and Y values should be added to the program to correctly score the data?
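The SAS program for this question is not reproduced here, so the sketch below does not try to fill in X and Y. It only illustrates, in Python (not SAS), two ideas the question combines: rescaling a probability from the sample event rate to the population event rate, and deriving a profit-based cutoff. Both function names are assumptions, and the cutoff formula assumes zero payoff for customers who are not targeted.

```python
def adjust_for_priors(p, sample_event_rate, population_event_rate):
    """Rescale a probability estimated on a sample to the population priors."""
    rho1, pi1 = sample_event_rate, population_event_rate
    event_part = p * pi1 / rho1
    nonevent_part = (1 - p) * (1 - pi1) / (1 - rho1)
    return event_part / (event_part + nonevent_part)

def profit_cutoff(profit_tp, loss_fp):
    """Target when p * profit_tp > (1 - p) * loss_fp, i.e. p above this value."""
    return loss_fp / (loss_fp + profit_tp)

# Values taken from the question: 40% events in the sample, 5% in the
# population, profit nine times the loss.
p_population = adjust_for_priors(0.40, 0.40, 0.05)
cutoff = profit_cutoff(profit_tp=9, loss_fp=1)
```

A score equal to the sample event rate maps back to the population event rate, which is a quick sanity check on the adjustment.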
16.
An analyst has a sufficient volume of data to perform a 3-way partition of the data into training,
validation, and test sets to perform honest assessment during the model building process.
What is the purpose of the test data set?
A.
To provide an unbiased measure of assessment for the final model.
B.
To compare models and select and fine-tune the final model.
C.
To reduce total sample size to make computations more efficient.
D.
To build the predictive models.
17.
Calculate the sensitivity, accuracy, and error rate.
18.
The total modeling data has been split into training, validation, and test data. What is the best data
to use for model assessment?
19.
What is a drawback to performing data cleansing (imputation, transformations, etc.) on raw data
prior to partitioning the data for honest assessment as opposed to performing the data cleansing
after partitioning the data?
A.
It violates assumptions of the model.
B.
It requires extra computational effort and time.
C.
It omits the training (and test) data sets from the benefits of the cleansing methods.
D.
There is no ability to compare the effectiveness of different cleansing methods.
20.
A company has branch offices in eight regions. Customers within each region are classified as
either "High Value" or "Medium Value" and are coded using the variable name VALUE. In the last
year, the total amount of purchases per customer is used as the response variable.
Suppose there is a significant interaction between REGION and VALUE. What can you conclude?
A.
More high value customers are found in some regions than others.
B.
The difference between average purchases for medium and high value customers depends on
the region.
C.
Regions with higher average purchases have more high value customers.
D.
Regions with higher average purchases have more medium value customers.
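To see what an interaction between two factors looks like in cell means, the Python sketch below (not SAS) computes the gap between high value and medium value customers separately for two regions. The purchase amounts and region names are invented for the example.

```python
from statistics import mean

# Hypothetical purchases per customer, keyed by (region, value class).
purchases = {
    ("North", "High"): [500, 540],
    ("North", "Medium"): [300, 320],
    ("South", "High"): [410, 430],
    ("South", "Medium"): [390, 410],
}

# Gap between the two value classes, computed region by region.
gap = {
    region: mean(purchases[(region, "High")]) - mean(purchases[(region, "Medium")])
    for region in ("North", "South")
}
```

With no interaction the two gaps would be roughly equal; in these toy numbers they are 210 and 20, so the effect of VALUE differs by REGION.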
21.
This question will ask you to provide a missing option.
Complete the following syntax to test the homogeneity of variance assumption in the GLM
procedure:
Means Region / <insert option here> =levene;
22.
Based on the control plot, which conclusion is justified regarding the means of the response?
A.
All groups are significantly different from each other.
B.
2XL is significantly different from all other groups.
C.
Only XL and 2XL are not significantly different from each other.
D.
No groups are significantly different from each other.
23.
Customers were surveyed to assess their intent to purchase a product. An analyst divided the
customers into groups defined by the company's pre-assigned market segments and tested for
difference in the customers' average intent to purchase. The following is the output from the GLM
procedure:
What percentage of customers' intent to purchase is explained by market segment?
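The GLM output for this question is not reproduced here. As a general reminder, the share of variation explained by a grouping factor in a one-way model is the model sum of squares divided by the corrected total sum of squares (R-square). The Python sketch below (not SAS) computes it on made-up data; `r_squared` and the segment values are assumptions.

```python
from statistics import mean

def r_squared(groups):
    """Model sum of squares over corrected total sum of squares, one-way layout."""
    all_values = [v for values in groups.values() for v in values]
    grand = mean(all_values)
    ss_total = sum((v - grand) ** 2 for v in all_values)
    ss_model = sum(
        len(values) * (mean(values) - grand) ** 2 for values in groups.values()
    )
    return ss_model / ss_total

# Hypothetical intent-to-purchase scores for three market segments.
segments = {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0], "C": [7.0, 8.0, 9.0]}
share = r_squared(segments)  # model SS = 54, total SS = 60
```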
24.
The box plot was used to analyze daily sales data following three different ad campaigns.
The business analyst concludes that one of the assumptions of ANOVA was violated.
Which assumption has been violated and why?
A.
Normality, because Prob > F < .0001.
B.
Normality, because the interquartile ranges are different in different ad campaigns.
C.
Constant variance, because Prob > F < .0001.
D.
Constant variance, because the interquartile ranges are different in different ad campaigns.
25.
Given alpha=0.02, which conclusion is justified regarding percentage of body fat, comparing small
(S), medium (M), and large (L) wrist sizes?
A.
Medium wrist size is significantly different than small wrist size.
B.
Large wrist size is significantly different than medium wrist size.
C.
Large wrist size is significantly different than small wrist size.
D.
There is no significant difference due to wrist size.