AWS Machine Learning - Specialty (MLS-C01)

1. A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data. Which solution requires the LEAST effort to be able to query this data?

Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.

Use AWS Glue to catalogue the data and Amazon Athena to run queries.

Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries.

Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.

Explanation

The correct answer is to use AWS Glue to catalogue the data and Amazon Athena to run queries. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. It can automatically discover and catalog data stored in Amazon S3, making it easier to query the data using SQL. Amazon Athena is an interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. This combination of AWS Glue and Amazon Athena requires the least effort as it eliminates the need for manual data transformation and provides a simple and efficient way to query the data.

2. A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data. Which solution requires the LEAST effort to be able to query this data?

Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.

Use AWS Glue to catalogue the data and Amazon Athena to run queries.

Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries.

Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.

Explanation

The solution that requires the least effort to be able to query the data is to use AWS Glue to catalogue the data and Amazon Athena to run queries. AWS Glue can automatically discover and catalog the structured and unstructured data stored in the Amazon S3 bucket, making it easier to query. Amazon Athena is a serverless query service that allows the Machine Learning Specialist to run SQL queries directly on the data without the need for any infrastructure setup or management. This solution eliminates the need for additional data transformation or ETL processes, reducing the overall effort required.

3. A Data Scientist for a credit card company is creating a solution to predict credit card fraud at the time of transaction. To that end, the Data Scientist is looking to create an ML model to predict fraud and will do so by training that model on an existing dataset of credit card transactions. That dataset contains 1,000 examples of transactions in total, only 50 of which are labeled as fraud. How should the Data Scientist deal with this class imbalance?

Use the Synthetic Minority Oversampling Technique (SMOTE) to oversample the fraud records

Undersample the non-fraudulent records to improve the class imbalance

Use K-fold cross validation when training the model

Drop all the fraud examples, and use a One-Class SVM to classify

Explanation

The Data Scientist should use the Synthetic Minority Oversampling Technique (SMOTE) to oversample the fraud records. This technique generates synthetic samples of the minority class (fraudulent records) in order to balance the class distribution. By creating additional synthetic fraud examples, the ML model will have more data to learn from and improve its ability to predict fraud accurately. This approach is suitable for dealing with class imbalance and can help improve the performance of the model in detecting credit card fraud.
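
The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `smote`, the parameter `k`, and the four two-feature fraud points are all hypothetical. In the scenario above, roughly 900 synthetic fraud records would be needed to match the 950 non-fraud records.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base inside the minority class (excluding base)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

# Hypothetical fraud examples with two numeric features each
fraud = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(fraud, n_new=6)
```

Each synthetic point lies on a line segment between two real fraud examples, which is why the technique adds plausible new minority samples rather than exact copies.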

4. A Machine Learning Specialist must build out a process to query a dataset on Amazon S3 using Amazon Athena. The dataset contains more than 800,000 records stored as plaintext CSV files. Each record contains 200 columns and is approximately 1.5 MB in size. Most queries will span 5 to 10 columns only. How should the Machine Learning Specialist transform the dataset to minimize query runtime?

Convert the records to Apache Parquet format

Convert the records to JSON format

Convert the records to GZIP CSV format

Convert the records to XML format

Explanation

By converting the records to Apache Parquet format, the Machine Learning Specialist can minimize query runtime. Apache Parquet is a columnar storage file format that is designed for efficient query performance and compression. It allows for column-wise compression and predicate pushdown, which means that only the relevant columns and rows are read during a query. This format is well-suited for analytical workloads and can significantly reduce the amount of data that needs to be processed, resulting in faster query execution times.
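
A rough back-of-the-envelope calculation shows why reading only the needed columns matters. The sketch below uses the figures from the question and assumes, purely for illustration, that all 200 columns are the same size, that 1 GB = 1,000 MB, and that compression is ignored.

```python
records = 800_000
record_mb = 1.5
total_columns = 200
queried_columns = 10  # upper end of the "5 to 10 columns" in the question

# Row-based CSV: every query has to read every full record.
csv_scan_gb = records * record_mb / 1000

# Columnar format: only the queried columns are read
# (assumes equally sized columns, no compression).
columnar_scan_gb = csv_scan_gb * queried_columns / total_columns

print(f"CSV scan:      {csv_scan_gb:.0f} GB")
print(f"Columnar scan: {columnar_scan_gb:.0f} GB")
```

Under these assumptions a query touches about 60 GB instead of about 1,200 GB, before any benefit from compression or predicate pushdown.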

5. A Machine Learning Specialist is working with a media company to perform classification on popular articles from the company's website. The company is using random forests to classify how popular an article will be before it is published. A sample of the data being used is below. Given the dataset, the Specialist wants to convert the Day-Of_Week column to binary values. What technique should be used to convert this column to binary values?

Binarization

One-hot encoding

Tokenization

Normalization transformation

Explanation

The technique that should be used to convert the Day-Of_Week column to binary values is one-hot encoding. One-hot encoding is a technique used to represent categorical variables as binary vectors. Each category is converted into a binary column, where a value of 1 represents the presence of that category and a value of 0 represents the absence. This is commonly used in machine learning algorithms to handle categorical data and allow them to be used in mathematical calculations.
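
As a small illustration, one-hot encoding a day-of-week column could look like the sketch below. The data sample is not included here, so the day names and the helper `one_hot` are assumptions made for the example.

```python
# Assumed category values; the actual sample data is not shown in this document.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def one_hot(day, categories=DAYS):
    """Return a binary vector with a single 1 at the position of `day`."""
    return [1 if day == category else 0 for category in categories]

day_of_week_column = ["Monday", "Wednesday", "Sunday"]
encoded = [one_hot(day) for day in day_of_week_column]

for day, vector in zip(day_of_week_column, encoded):
    print(day, vector)
```

Each row ends up with seven binary columns, exactly one of which is set to 1.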

6. A navigation and transportation company is using satellite images to model weather around the world in order to create optimal routes for its ships and planes. The company is using Amazon SageMaker training jobs to build and train its models. However, during training, it takes too long to download the company's 100 GB data from Amazon S3 to the training instance before the training starts. What should the company do to speed up its training jobs while keeping the costs low?

Increase the instance size for training

Increase the batch size in the model

Change the input mode to Pipe

Create an Amazon EBS volume with the data on it and attach it to the training job

Explanation

Changing the input mode to Pipe would speed up the training jobs while keeping the costs low. By using Pipe mode, the company can stream the data directly from Amazon S3 to the training instance without the need to download the entire 100 GB data before training starts. This eliminates the time-consuming download process and allows for faster training. Additionally, it helps in reducing storage costs as there is no need to store the data on the training instance.

7. A multi-national banking organization provides loan services to customers worldwide. Many of its customers still submit loan applications in paper form in one of the bank's branch locations. The bank wants to speed up the loan approval process for this set of customers by using machine learning. More specifically, it wants to create a process in which customers submit the application to the clerk, who scans and uploads it to the system. The system then reads and provides an approval or denial of the application in a matter of minutes. What can the bank use to read and extract the necessary data from the loan applications without needing to manage the process?

A custom CNN model

An LSTM model

Amazon Textract

Amazon Personalize

Explanation

The bank can use Amazon Textract to read and extract the necessary data from the loan applications without needing to manage the process. Amazon Textract is a machine learning service that automatically extracts text and data from scanned documents. It can analyze the loan applications and extract relevant information such as personal details, financial data, and loan request information. This would significantly speed up the loan approval process as the system can quickly analyze and provide an approval or denial of the application in a matter of minutes.

8. This graph shows the training and validation loss against the epochs for a neural network. The network being trained is as follows:

Two dense layers, one output neuron
100 neurons in each layer
100 epochs
Random initialization of weights

Which technique can be used to improve model performance in terms of accuracy in the validation set?

Early stopping

Random initialization of weights with appropriate seed

Increasing the number of epochs

Adding another layer with the 100 neurons

Explanation

Early stopping is a technique that can be used to improve model performance in terms of accuracy in the validation set. It involves stopping the training process before all epochs are completed, based on a predefined condition. This helps prevent overfitting and allows the model to generalize better to unseen data. By monitoring the validation loss during training, early stopping can determine when the model starts to overfit and stop training at that point, resulting in improved accuracy on the validation set.
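
The monitoring logic can be sketched as a simple patience rule. The graph from the question is not reproduced here, so the validation-loss values below are invented for illustration, and `early_stopping_epoch` and `patience` are hypothetical names.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Scan validation losses epoch by epoch and stop once the loss has not
    improved for `patience` consecutive epochs. Returns the best epoch
    (0-based) seen before stopping and its loss."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss keeps rising: stop training here
    return best_epoch, best_loss

# Invented losses: they fall, then rise as the model starts to overfit.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.48]
best_epoch, best_loss = early_stopping_epoch(val_losses, patience=3)
```

With these values, training stops after epoch 6 and the weights from epoch 3 would be kept. The later value of 0.48 is never reached, which shows the trade-off in choosing the patience setting.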

9. A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?

Oversampling using bootstrapping

Undersampling

Oversampling using SMOTE

Class weight adjustment

Explanation

With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario.

10. A Machine Learning Specialist is configuring Amazon SageMaker so multiple Data Scientists can access notebooks, train models, and deploy endpoints. To ensure the best operational performance, the Specialist needs to be able to track how often the Scientists are deploying models, GPU and CPU utilization on the deployed SageMaker endpoints, and all errors that are generated when an endpoint is invoked. Which services are integrated with Amazon SageMaker to track this information? (Select TWO.)

AWS CloudTrail

AWS Health

AWS Trusted Advisor

Amazon CloudWatch

AWS Config

Explanation

The correct answer is AWS CloudTrail and Amazon CloudWatch. AWS CloudTrail records the SageMaker API calls made by users, such as the calls that create and update endpoints, so it shows how often the Scientists are deploying models. Amazon CloudWatch collects endpoint metrics, including GPU and CPU utilization and invocation errors on the deployed SageMaker endpoints. AWS Health, AWS Trusted Advisor, and AWS Config do not track this information for Amazon SageMaker.

11. A video streaming company is looking to create a personalized experience for its customers on its platform. The company wants to provide recommended videos to stream based on what other similar users watched previously. To this end, it is collecting its platform's clickstream data using an ETL pipeline and storing the logs and syslogs in Amazon S3. What kind of algorithm should the company use to create the simplest solution in this situation?

Regression

Classification

Recommender system

Reinforcement learning

Explanation

The company should use a recommender system algorithm to create a personalized experience for its customers. A recommender system analyzes clickstream data and user behavior to provide recommendations based on what other similar users watched previously. This algorithm would be the simplest solution for the company to implement in order to provide recommended videos to stream on its platform.

12. IT leadership wants to transition a company's existing machine learning data storage environment to AWS as a temporary ad hoc solution. The company currently uses a custom software process that heavily leverages SQL as a query language and exclusively stores generated csv documents for machine learning. The ideal state for the company would be a solution that allows it to continue to use the current workforce of SQL experts. The solution must also support the storage of csv and JSON files, and be able to query over semi-structured data. The following are high priorities for the company:

Solution simplicity
Fast development time
Low cost
High flexibility

What technologies meet the company's requirements?

Amazon S3 and Amazon Athena

Amazon Redshift and AWS Glue

Amazon DynamoDB and DynamoDB Accelerator (DAX)

Amazon RDS and Amazon ES

Explanation

The company's ideal state is to continue using SQL experts and support the storage of csv and JSON files, as well as querying over semi-structured data. Amazon S3 allows for the storage of csv and JSON files, while Amazon Athena is a serverless query service that supports SQL queries on data stored in Amazon S3. This solution is simple, has fast development time, low cost, and high flexibility, making it the best fit for the company's requirements.

13. An employee found a video clip with audio on a company's social media feed. The language used in the video is Spanish. English is the employee's first language, and they do not understand Spanish. The employee wants to do a sentiment analysis. What combination of services is the MOST efficient to accomplish the task?

Amazon Transcribe, Amazon Translate, and Amazon Comprehend

Amazon Transcribe, Amazon Comprehend, and Amazon SageMaker seq2seq

Amazon Transcribe, Amazon Translate, and Amazon SageMaker Neural Topic Model (NTM)

Amazon Transcribe, Amazon Translate, and Amazon SageMaker BlazingText

Explanation

The most efficient combination of services to accomplish the task is Amazon Transcribe, Amazon Translate, and Amazon Comprehend. Amazon Transcribe can be used to transcribe the audio in the video clip from Spanish to text. Amazon Translate can then be used to translate the transcribed text from Spanish to English, allowing the employee to understand the content. Finally, Amazon Comprehend can be used to perform sentiment analysis on the translated text, providing insights into the sentiment expressed in the video clip.

14. A Machine Learning Specialist is required to build a supervised image-recognition model to identify a cat. The ML Specialist performs some tests and records the following results for a neural network-based image classifier: total number of images available = 1,000; test set images = 100 (constant test set). The ML Specialist notices that, in over 75% of the misclassified images, the cats were held upside down by their owners. Which techniques can be used by the ML Specialist to improve this specific test error?

Increase the training data by adding variation in rotation for training images.

Increase the number of epochs for model training.

Increase the number of layers for the neural network.

Increase the dropout rate for the second-to-last layer.

Explanation

The ML Specialist can improve the test error by increasing the training data and adding variation in rotation for training images. This can help the model to learn and recognize cats even when they are held upside down by their owners. By exposing the model to a wider range of cat orientations during training, it can become more robust and accurate in identifying cats in different positions. Increasing the number of epochs, the number of layers, or the dropout rate may not specifically address the issue of misclassifying upside-down cats.
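
Rotation-based augmentation can be illustrated with a tiny image stored as a list of pixel rows. This is only a sketch: real pipelines typically use an image library and arbitrary angles, and the helper names `rotate90` and `augment_with_rotations` are hypothetical.

```python
def rotate90(image):
    """Rotate a 2D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment_with_rotations(image):
    """Return the original image plus its 90, 180 and 270 degree rotations."""
    variants = [image]
    for _ in range(3):
        variants.append(rotate90(variants[-1]))
    return variants

# A 2x2 "image" whose pixel values make the orientation easy to follow
image = [[1, 2],
         [3, 4]]
variants = augment_with_rotations(image)
upside_down = variants[2]  # the 180 degree rotation
```

Adding the rotated copies to the training set (with the same "cat" label) gives the model examples of upside-down cats without collecting new photos.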

15. A Data Scientist wants to implement a near-real-time anomaly detection solution for routine machine maintenance. The data is currently streamed from connected devices by AWS IoT to an Amazon S3 bucket and then sent downstream for further processing in a real-time dashboard. What service can the Data Scientist use to achieve the desired outcome with minimal change to the pipeline?

Amazon CloudWatch

Amazon SageMaker

Amazon EMR with Spark

Amazon Kinesis Data Analytics

Explanation

Amazon Kinesis Data Analytics is the correct answer because it is a service specifically designed for real-time data processing and analytics. It can process the streamed device data as it arrives and includes a built-in RANDOM_CUT_FOREST function for near-real-time anomaly detection. Using Amazon Kinesis Data Analytics would require minimal changes to the existing pipeline, making it the ideal choice for the Data Scientist.

16. The Chief Editor for a product catalog wants the Research and Development team to build a machine learning system that can be used to detect whether or not individuals in a collection of images are wearing the company's retail brand. The team has a set of training data. Which machine learning algorithm should the researchers use that BEST meets their requirements?

Latent Dirichlet Allocation (LDA)

Recurrent neural network (RNN)

K-means

Convolutional neural network (CNN)

Explanation

A Convolutional Neural Network (CNN) is the best machine learning algorithm for the given requirements. CNNs are specifically designed for image recognition tasks and have been proven to be highly effective in detecting patterns and features in images. They are able to automatically learn and extract relevant features from the training data, making them well-suited for identifying whether individuals in a collection of images are wearing the company's retail brand. LDA, RNN, and K-means are not specifically designed for image recognition tasks and may not perform as well in this scenario.

17. A Machine Learning Specialist deployed a model that provides product recommendations on a company's website. Initially, the model was performing very well and resulted in customers buying more products on average. However, within the past few months, the Specialist has noticed that the effect of product recommendations has diminished and customers are starting to return to their original habits of spending less. The Specialist is unsure of what happened, as the model has not changed from its initial deployment over a year ago. Which method should the Specialist try to improve model performance?

The model needs to be completely re-engineered because it is unable to handle product inventory changes

The model's hyperparameters should be periodically updated to prevent drift

The model should be periodically retrained using the original training data plus new data as product inventory changes

Explanation

The correct answer is that the model should be periodically retrained using the original training data plus new data as product inventory changes. This is because the Specialist has noticed a decrease in the effectiveness of the model over time, indicating that the model needs to be updated. By periodically retraining the model with both the original training data and new data that reflects the changes in product inventory, the model can adapt to the evolving patterns and preferences of customers, leading to improved performance.

18. A Machine Learning Specialist is building a convolutional neural network (CNN) that will classify 10 types of animals. The Specialist has built a series of layers in a neural network that will take an input image of an animal, pass it through a series of convolutional and pooling layers, and then finally pass it through a dense and fully connected layer with 10 nodes. The Specialist would like to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes. Which function will produce the desired output?

Dropout

Smooth L1 loss

Softmax

Rectified linear units (ReLU)

Explanation

The Softmax function is used to produce a probability distribution of the likelihood that the input image belongs to each of the 10 classes. It converts the output of the neural network into a probability distribution by normalizing the values and ensuring they sum up to 1. This is suitable for multi-class classification tasks, such as classifying different types of animals in this case. Dropout is a regularization technique used to prevent overfitting, Smooth L1 loss is a loss function used in object detection, and Rectified linear units (ReLU) is an activation function commonly used in neural networks.
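
The normalization step can be made concrete with a short sketch. The ten raw scores below are made-up outputs of the final 10-node layer, one per animal class.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that are positive and sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of the 10-node dense layer
logits = [2.0, 1.0, 0.1, -1.0, 0.5, 0.0, 1.5, -0.5, 0.3, 0.8]
probs = softmax(logits)
predicted_class = probs.index(max(probs))
```

The largest raw score always receives the largest probability, so the predicted class is unchanged; softmax only rescales the scores into a distribution.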

19. A Machine Learning Specialist is designing a system for improving sales for a company. The objective is to use the large amount of information the company has on users' behavior and product preferences to predict which products users would like based on the users' similarity to other users. What should the Specialist do to meet this objective?

Build a content-based filtering recommendation engine with Apache Spark ML on Amazon EMR.

Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR.

Build a model-based filtering recommendation engine with Apache Spark ML on Amazon EMR.

Build a combinative filtering recommendation engine with Apache Spark ML on Amazon EMR.

Explanation

To meet the objective of predicting which products users would like based on their similarity to other users, the Machine Learning Specialist should build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR. Collaborative filtering is a technique that analyzes the behavior of multiple users and identifies patterns and similarities between them to make recommendations. This approach is suitable for leveraging the large amount of information on users' behavior and product preferences that the company has. Apache Spark ML is a powerful machine learning library that can handle large-scale data processing, and Amazon EMR provides a scalable and cost-effective platform for running Spark applications.
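
The core idea of recommending from similar users can be shown with a toy user-based example in plain Python. This is not Spark ML and not the approach a production system on Amazon EMR would necessarily take; the users, items, ratings, and function names are all invented for illustration.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob":   {"A": 4, "B": 5, "D": 5},
    "carol": {"C": 5, "D": 1, "E": 4},
}

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

top = recommend("alice", ratings)
```

Because alice's ratings resemble bob's far more than carol's, bob's favourite unseen item "D" is ranked ahead of carol's item "E".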

20. Which of the following metrics should a Machine Learning Specialist generally use to compare/evaluate machine learning classification models against each other?

Recall

Misclassification rate

Mean absolute percentage error (MAPE)

Area Under the ROC Curve (AUC)

Explanation

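The Area Under the ROC Curve (AUC) is a commonly used metric to compare and evaluate machine learning classification models. It measures the ability of the model to distinguish between positive and negative classes by calculating the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate across all classification thresholds. A higher AUC indicates that the model ranks positive examples above negative ones more reliably, regardless of which threshold is eventually chosen. Recall and misclassification rate depend on a single threshold, and MAPE is a regression metric. Therefore, a Machine Learning Specialist should generally use AUC to compare and evaluate different classification models.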
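
As a rough illustration, AUC can be computed as the fraction of (positive, negative) pairs that a model ranks in the right order. The sketch below uses that pairwise definition on hypothetical labels and scores for two made-up models; it is not how any particular library computes the metric.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical ground truth and predicted scores from two candidate models.
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
model_b = [0.6, 0.3, 0.2, 0.7, 0.5, 0.1]

auc_a = auc(labels, model_a)
auc_b = auc(labels, model_b)
print(round(auc_a, 3), round(auc_b, 3))
```

Model A orders 8 of the 9 positive/negative pairs correctly and model B only 4 of 9, so model A would be preferred on this data.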

21. An ML scientist has built a decision tree model using scikit-learn with 1,000 trees. The training accuracy for the model was 99.2% and the test accuracy was 70.3%. Should the Scientist use this model in production?

Yes, because it is generalizing well on the training set

No, because it is generalizing well on the training set

No, because it is not generalizing well on the test set

Yes, because it is not generalizing well on the test set

Explanation

The model's training accuracy of 99.2% indicates that it is performing well on the training data. However, the test accuracy of only 70.3% shows that the model is not generalizing well to new, unseen data. A gap of nearly 29 percentage points between training and test accuracy is a typical sign of overfitting. Therefore, it is not advisable to use this model in production, as it may not perform well on real-world data.

22. A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. As Data Scientists may create an arbitrary number of new datasets every day, the solution has to scale automatically and be cost-effective. Also, it must be possible to explore the data using SQL. Which storage scheme is MOST adapted to this scenario?

Store datasets as files in Amazon S3.

Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance.

Store datasets as tables in a multi-node Amazon Redshift cluster.

Store datasets as global tables in Amazon DynamoDB.

Explanation

Storing datasets as files in Amazon S3 is the most adapted storage scheme for this scenario because it allows for scalability and cost-effectiveness. With S3, the Data Science team can easily store and retrieve large amounts of training data without worrying about capacity limitations. Additionally, S3 supports SQL-based querying using services like Amazon Athena, allowing for easy exploration of the data using SQL. This solution also aligns with the requirement of being able to create an arbitrary number of new datasets every day, as S3 can handle the storage of a large number of files.

23. A Machine Learning Specialist is building a model that will perform time series forecasting using Amazon SageMaker. The Specialist has finished training the model and is now planning to perform load testing on the endpoint so they can configure Auto Scaling for the model variant. Which approach will allow the Specialist to review the latency, memory utilization, and CPU utilization during the load test?

Explanation

Generating an Amazon CloudWatch dashboard will allow the Specialist to create a single view for the latency, memory utilization, and CPU utilization metrics that are outputted by Amazon SageMaker. This will provide a convenient and comprehensive way to review and monitor these metrics during the load test.

24. A Machine Learning Specialist is building a prediction model for a large number of features using linear models, such as linear regression and logistic regression. During exploratory data analysis the Specialist observes that many features are highly correlated with each other. This may make the model unstable. What should be done to reduce the impact of having such a large number of features?

Perform one-hot encoding on highly correlated features

Use matrix multiplication on highly correlated features.

Create a new feature space using principal component analysis (PCA)

Apply the Pearson correlation coefficient

Explanation

When many features in a prediction model are highly correlated, it can lead to instability in the model. To reduce the impact of this issue, the Machine Learning Specialist should create a new feature space using principal component analysis (PCA). PCA is a dimensionality reduction technique that transforms the original features into a new set of uncorrelated variables called principal components. By creating a new feature space using PCA, the Specialist can eliminate the problem of high correlation among features and improve the stability of the prediction model.
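
A small worked example may help show that principal components are uncorrelated. The sketch below assumes just two hypothetical, highly correlated features, for which the principal axis of the 2x2 covariance matrix has a closed-form angle; real datasets with many features would use a library implementation instead.

```python
from math import atan2, cos, sin

# Hypothetical data: two highly correlated features.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.0]


def center(values):
    """Subtract the mean so the feature is centered at zero."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def cov(a, b):
    """Sample covariance of two already-centered sequences."""
    return sum(p * q for p, q in zip(a, b)) / (len(a) - 1)


c1, c2 = center(x1), center(x2)
s11, s22, s12 = cov(c1, c1), cov(c2, c2), cov(c1, c2)

# Rotation angle of the direction of maximum variance (2x2 case only).
theta = 0.5 * atan2(2 * s12, s11 - s22)

# Project the centered data onto the two principal axes.
pc1 = [cos(theta) * a + sin(theta) * b for a, b in zip(c1, c2)]
pc2 = [-sin(theta) * a + cos(theta) * b for a, b in zip(c1, c2)]

cov_pcs = cov(pc1, pc2)
explained = cov(pc1, pc1) / (cov(pc1, pc1) + cov(pc2, pc2))
print(cov_pcs, explained)
```

The covariance between the two components is zero up to rounding, and the first component carries almost all of the variance, so the second could be dropped with little loss of information.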

25. A web-based company wants to improve its conversion rate on its landing page. Using a large historical dataset of customer visits, the company has repeatedly trained a multi-class deep learning network algorithm on Amazon SageMaker. However, there is an overfitting problem: training data shows 90% accuracy in predictions, while test data shows only 70% accuracy. The company needs to boost the generalization of its model before deploying it into production to maximize conversions of visits to purchases. Which action is recommended to provide the HIGHEST accuracy model for the company's test and validation data?

Increase the randomization of training data in the mini-batches used in training.

Allocate a higher proportion of the overall data to the training dataset

Apply L1 or L2 regularization and dropouts to the training.

Reduce the number of layers and units (or neurons) from the deep learning network.

Explanation

Applying L1 or L2 regularization and dropout during training is recommended to provide the highest accuracy model for the company's test and validation data. L1 and L2 regularization help to prevent overfitting by adding a penalty on the size of the weights to the loss function, which discourages the model from relying too heavily on any single feature. Dropout, on the other hand, randomly deactivates a proportion of neurons at each training step, forcing the network to learn more robust and generalizable representations. These techniques improve the model's ability to generalize to unseen data, thus boosting the accuracy on the test and validation datasets.
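
The two ideas can be sketched independently of any deep learning framework. The snippet below is a minimal illustration with hypothetical weights, a made-up unregularized loss value, and an assumed penalty strength `lam`; the dropout function uses the common "inverted dropout" convention of rescaling the surviving activations.

```python
import random


def l1_penalty(weights, lam):
    """L1 term added to the loss: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 term added to the loss: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


weights = [0.5, -2.0, 3.0]  # hypothetical layer weights
data_loss = 0.40            # hypothetical unregularized loss

total_loss = data_loss + l2_penalty(weights, lam=0.01)

rng = random.Random(0)      # seeded so the run is repeatable
dropped = dropout([1.0] * 10, rate=0.5, rng=rng)
print(total_loss, dropped)
```

Large weights raise the total loss, so the optimizer is pushed toward smaller weights, while each dropout call silences a different random subset of units.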

26. A Machine Learning Specialist is building a model to predict future employment rates based on a wide range of economic factors. While exploring the data, the Specialist notices that the magnitude of the input features varies greatly. The Specialist does not want variables with a larger magnitude to dominate the model. What should the Specialist do to prepare the data for model training?

Apply the Cartesian product transformation to create new combinations of fields that are independent of the magnitude.

Explanation

The Specialist should apply normalization so that each field has a mean of 0 and a variance of 1 (this particular rescaling is also known as standardization or z-score scaling). This puts all input features on a comparable scale and prevents variables with larger magnitudes from dominating the model. Rescaling of this kind is a common preprocessing step in machine learning that makes it easier for the model to learn patterns and make accurate predictions.
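
A minimal sketch of this rescaling, using only the Python standard library. The two feature columns are hypothetical values chosen to have very different magnitudes.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and variance 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


# Hypothetical economic indicators on very different scales.
gdp = [18000.0, 19500.0, 21000.0, 20500.0]
interest_rate = [2.5, 1.75, 0.25, 0.5]

z_gdp = standardize(gdp)
z_rate = standardize(interest_rate)
print(z_gdp)
print(z_rate)
```

After the transformation both columns have the same mean and spread, so neither dominates purely because of its units.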

27. A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided. Based on this information which model would have the HIGHEST accuracy?

Long short-term memory (LSTM) model with scaled exponential linear unit (SELU)

Logistic regression

Support vector machine (SVM) with non-linear kernel

Single perceptron with tanh activation function

Explanation

Support vector machines (SVM) with a non-linear kernel are known for their ability to handle complex decision boundaries and capture non-linear relationships in the data. In this case, the data is not linearly separable, as there is overlap between the fraudulent and normal classes in both features. Therefore, using a non-linear kernel in the SVM model would likely result in the highest accuracy by effectively capturing the patterns and relationships in the data.

28. You work for the information security department of a major corporation. You have been asked to build a solution that detects web application log anomalies to protect your organization from fraudulent activity. The system needs to have near-real-time updates to the model where log entry data points dynamically change the underlying model as the log files are updated. Which AWS service component do you use to implement the best algorithm based on these requirements?

SageMaker Random Cut Forest

Kinesis Data Streams Naive Bayes Classifier

Kinesis Data Analytics Random Cut Forest

Kinesis Data Analytics Nearest Neighbor

Explanation

The best algorithm to implement the given requirements is the Random Cut Forest algorithm provided by Kinesis Data Analytics. This algorithm is specifically designed for anomaly detection in streaming data, making it suitable for detecting web application log anomalies in near-real-time. It dynamically updates the underlying model as log files are updated, ensuring that the system can adapt to changing data patterns. Therefore, Kinesis Data Analytics Random Cut Forest is the most appropriate AWS service component for this scenario.

29. A Machine Learning Specialist was given a dataset consisting of unlabeled data. The Specialist must create a model that can help the team classify the data into different buckets. What model should be used to complete this work?

K-means clustering

Random Cut Forest (RCF)

XGBoost

BlazingText

Explanation

K-means clustering should be used to complete this work because it is a popular unsupervised learning algorithm that is used for clustering data. It is suitable for this task because the dataset consists of unlabeled data and the goal is to classify the data into different buckets. K-means clustering works by partitioning the data into k clusters based on their similarity. It iteratively assigns data points to the nearest cluster centroid and updates the centroids until convergence. This algorithm is widely used for data clustering and can help the Machine Learning Specialist in this task.
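
The assign-then-update loop described above can be sketched as follows. The points and the two starting centroids are hypothetical, and a fixed number of iterations stands in for a proper convergence check.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points: assign to nearest centroid, then re-average."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for x, y in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

No labels are supplied; the two buckets emerge purely from the distances between points.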

30. A Machine Learning Specialist is working with a large cybersecurity company that manages security events in real time for companies around the world. The cybersecurity company wants to design a solution that will allow it to use machine learning to score malicious events as anomalies on the data as it is being ingested. The company also wants to be able to save the results in its data lake for later processing and analysis. What is the MOST efficient way to accomplish these tasks?

Explanation

The most efficient way to accomplish the tasks of ingesting and analyzing the data in real-time is by using Amazon Kinesis Data Firehose to ingest the data and Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. After detecting the anomalies, the results can be streamed to Amazon S3 using Kinesis Data Firehose. This approach allows for real-time analysis and storage of the results in a scalable and efficient manner.

31. A Machine Learning Engineer is creating a regression model for forecasting company revenue based on an internal dataset made up of past sales and other related data. What metric should the Engineer use to evaluate the ML model?

Cross-entropy log loss

Sigmoid

Root Mean squared error (RMSE)

Precision

Explanation

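The Machine Learning Engineer should use Root Mean Squared Error (RMSE) to evaluate the regression model. RMSE is a commonly used metric for regression models: it is the square root of the average squared difference between the predicted and actual values, so large errors are penalized more heavily than small ones and the result is expressed in the same units as the target. It provides a single value that summarizes the overall accuracy of the model's predictions. The lower the RMSE, the better the model's performance in forecasting the company's revenue from the internal dataset. The other options do not apply: cross-entropy log loss and precision are classification metrics, and sigmoid is an activation function.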
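
A short worked example of the calculation, with hypothetical actual and forecast revenue figures:

```python
from math import sqrt


def rmse(actual, predicted):
    """Square root of the mean squared difference between actual and predicted."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical quarterly revenue (actual) and model forecasts.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 190.0, 270.0]

error = rmse(actual, predicted)
print(round(error, 2))
```

The individual errors are 10, 10, 10, and 20, giving a mean squared error of 175 and an RMSE of about 13.23. This is higher than the mean absolute error of 12.5 because the single larger miss is weighted more heavily.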

32. Amazon Connect has recently been rolled out across a company as a contact call center. The solution has been configured to store voice call recordings on Amazon S3. The content of the voice calls is being analyzed for the incidents being discussed by the call operators. Amazon Transcribe is being used to convert the audio to text, and the output is stored on Amazon S3. Which approach will provide the information required for further analysis?

Use Amazon Comprehend with the transcribed files to build the key topics.

Use Amazon Translate with the transcribed files to train and build a model for the key topics.

Explanation

Amazon Comprehend is a natural language processing (NLP) service that can be used to analyze text and extract key topics. In this scenario, the voice calls are being converted to text using Amazon Transcribe, and the transcribed files are stored on Amazon S3. By using Amazon Comprehend with the transcribed files, the company can extract the key topics discussed in the calls, providing the required information for further analysis. This approach leverages the capabilities of Amazon Comprehend to analyze the text data and identify important topics, making it the appropriate choice for this situation.

33. You wish to use a SageMaker notebook within a VPC. SageMaker notebook instances are Internet-enabled, creating a potential security hole in your VPC. How would you use SageMaker within a VPC without opening up Internet access?

Uncheck the option for Internet access when creating your notebook instance, and it will handle the rest automatically.

No action is required, the VPC will block the notebook instances from accessing the Internet.

Use IAM to restrict Internet access from the notebook instance.

Explanation

To use SageMaker within a VPC without opening up Internet access, you need to disable direct Internet access when specifying the VPC for your notebook instance. Additionally, you should use VPC interface endpoints (PrivateLink) to allow the connections needed for training and hosting your model. To ensure outbound connections for training and hosting, you should modify your instance's security group. This approach ensures that the SageMaker notebook instance remains within the VPC and does not create a potential security hole by accessing the Internet.

34. During mini-batch training of a neural network for a classification problem, a Data Scientist notices that training accuracy oscillates. What is the MOST likely cause of this issue?

The class distribution in the dataset is imbalanced

Dataset shuffling is disabled

The batch size is too big

The learning rate is very high

Explanation

During mini-batch training, the learning rate determines the step size at which the model's parameters are updated. If the learning rate is set too high, the model may overshoot the optimal parameters, causing the training accuracy to oscillate. This is because the model is unable to converge to the optimal solution and keeps bouncing back and forth. Therefore, the most likely cause of the issue is that the learning rate is very high.
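
The overshooting behavior is easy to reproduce on a toy problem. The sketch below runs plain gradient descent on the one-dimensional function f(w) = w**2, which is only a stand-in for a real network loss; the starting point and both learning rates are arbitrary illustrative choices.

```python
def descend(w, learning_rate, steps):
    """Gradient descent on f(w) = w**2, returning every visited value of w."""
    path = [w]
    for _ in range(steps):
        gradient = 2.0 * w  # derivative of w**2
        w = w - learning_rate * gradient
        path.append(w)
    return path


small = descend(1.0, learning_rate=0.1, steps=4)
large = descend(1.0, learning_rate=1.0, steps=4)
print(small)
print(large)
```

With the small learning rate, w shrinks steadily toward the minimum at 0. With the large one, every step jumps to the opposite side of the minimum, so w bounces between 1.0 and -1.0 and never converges.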

35. The displayed graph is from a forecasting model for testing a time series. Considering the graph only, which conclusion should a Machine Learning Specialist make about the behavior of the model?

The model predicts both the trend and the seasonality well.

The model predicts the trend well, but not the seasonality.

The model predicts the seasonality well, but not the trend.

The model does not predict the trend or the seasonality well.

Explanation

The graph shows that the predicted values closely follow the actual values for both the trend and the seasonality. This indicates that the model is accurately capturing the patterns and fluctuations in the time series data, suggesting that it predicts both the trend and the seasonality well.

36. An online reseller has a large, multi-column dataset with one column missing 30% of its data. A Machine Learning Specialist believes that certain columns in the dataset could be used to reconstruct the missing data. Which reconstruction approach should the Specialist use to preserve the integrity of the dataset?

Listwise deletion

Last observation carried forward

Multiple imputations

Mean substitution

Explanation

The Specialist should use the multiple imputations approach to reconstruct the missing data. This approach involves creating multiple plausible values for the missing data based on the observed data and then analyzing the dataset multiple times with different imputed values. It helps to preserve the integrity of the dataset by accounting for the uncertainty associated with the missing data and providing more accurate estimates.
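
The procedure can be illustrated with a deliberately simplified sketch. It assumes a single hypothetical predictor column `x`, a linear relationship to the incomplete column `y`, and normally distributed noise; a full multiple-imputation procedure would also draw the regression parameters themselves and pool variances, which is omitted here.

```python
import random
from statistics import mean, stdev

# Hypothetical data: y has missing entries (None) that x can help reconstruct.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]

# Fit a simple linear regression y ~ x on the fully observed rows.
observed = [(a, b) for a, b in zip(x, y) if b is not None]
mx = mean([a for a, _ in observed])
my = mean([b for _, b in observed])
slope = sum((a - mx) * (b - my) for a, b in observed) / sum((a - mx) ** 2 for a, _ in observed)
intercept = my - slope * mx
resid_sd = stdev([b - (intercept + slope * a) for a, b in observed])

rng = random.Random(42)  # seeded so the run is repeatable
m = 5                    # number of imputed datasets
estimates = []
for _ in range(m):
    # Each completed dataset fills the gaps with prediction plus random noise.
    completed = [
        b if b is not None else intercept + slope * a + rng.gauss(0.0, resid_sd)
        for a, b in zip(x, y)
    ]
    estimates.append(mean(completed))  # the "analysis" step: estimate mean of y

pooled_estimate = mean(estimates)      # pool the per-dataset estimates
print(estimates, pooled_estimate)
```

The spread among the per-dataset estimates reflects the uncertainty introduced by the missing values, which single-value approaches such as mean substitution hide.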

37. A Machine Learning Specialist receives customer data for an online shopping website. The data includes demographics, past visits, and locality information. The Specialist must develop a machine learning approach to identify the customer shopping patterns, preferences and trends to enhance the website for better service and smart recommendations. Which solution should the Specialist recommend?

Latent Dirichlet Allocation (LDA) for the given collection of discrete data to identify patterns in the customer database.

A neural network with a minimum of three layers and random initial weights to identify patterns in the customer database

Collaborative filtering based on user interactions and correlations to identify patterns in the customer database

Random Cut Forest (RCF) over random subsamples to identify patterns in the customer database

Explanation

The Specialist should recommend collaborative filtering based on user interactions and correlations to identify patterns in the customer database. This approach is suitable for analyzing customer behavior and preferences based on their past interactions with the website. By analyzing the correlations between users and their interactions, the Specialist can identify patterns and make personalized recommendations to enhance the website's service.

38. A Machine Learning Specialist needs to create a data repository to hold a large amount of time-based training data for a new model. In the source system, new files are added every hour. Throughout a single 24-hour period, the volume of hourly updates will change significantly. The Specialist always wants to train on the last 24 hours of the data. Which type of data repository is the MOST cost-effective solution?

An Amazon EBS-backed Amazon EC2 instance with hourly directories

An Amazon RDS database with hourly table partitions

An Amazon S3 data lake with hourly object prefixes

An Amazon EMR cluster with hourly hive partitions on Amazon EBS volumes

Explanation

An Amazon S3 data lake with hourly object prefixes is the most cost-effective solution because it can store a large amount of time-based training data and accept new files every hour without any capacity to provision. Because S3 charges for the storage and requests actually used, the hourly volume can change significantly over a 24-hour period without paying for idle capacity, unlike an always-on EC2 instance, RDS database, or EMR cluster. Additionally, hourly object prefixes allow efficient organization and retrieval of exactly the time range required for training, such as the last 24 hours.
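
One possible way to organize the hourly prefixes is sketched below. The base prefix `training-data` and the `YYYY/MM/DD/HH/` layout are assumptions made for illustration, and the function only builds the prefix strings; it makes no AWS calls.

```python
from datetime import datetime, timedelta, timezone


def last_24_hour_prefixes(now, base="training-data"):
    """Return the 24 hourly prefixes ending with the hour that contains `now`."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return [
        base + "/" + (top_of_hour - timedelta(hours=h)).strftime("%Y/%m/%d/%H/")
        for h in range(23, -1, -1)
    ]


prefixes = last_24_hour_prefixes(datetime(2024, 3, 1, 5, 30, tzinfo=timezone.utc))
print(prefixes[0], prefixes[-1])
```

A training job could then read only the objects under these 24 prefixes, regardless of how many files each hour contains.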

39. An ad tech company is using an XGBoost model to classify its clickstream data. The company's Data Scientist is asked to explain how the model works to a group of non-technical colleagues. What is a simple explanation the Data Scientist can provide?

XGBoost is an Extreme Gradient Boosting algorithm that is optimized for boosted decision trees

XGBoost is a state-of-the-art algorithm that uses logistic regression to split each feature of the data based on certain conditions

XGBoost is a robust, flexible, scalable algorithm that uses logistic regression to classify data into buckets

XGBoost is an efficient and scalable neural network architecture.

Explanation

XGBoost is an Extreme Gradient Boosting algorithm that is optimized for boosted decision trees. It is a powerful and popular algorithm used for classification tasks. It works by iteratively combining multiple weak decision trees to create a strong predictive model. This algorithm is specifically designed to handle large datasets and provides high accuracy and efficiency.
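
The core idea of combining weak trees can be sketched with one-split trees ("stumps"), where each new stump is fitted to the errors left by the previous ones. This is a bare-bones gradient boosting illustration on hypothetical data, not XGBoost itself, which adds regularization, second-order gradient information, and many performance optimizations.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, rounds, learning_rate=0.5):
    """Add stumps one at a time, each trained on the current residuals."""
    predictions = [0.0] * len(xs)
    errors = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return predictions, errors


# Hypothetical training data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.0, 3.1, 6.0, 6.2]
predictions, errors = boost(xs, ys, rounds=5)
print(errors)
```

Each stump on its own is a crude model, but the mean squared error falls as more of them are added, which is the intuition behind "many weak trees make one strong model".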

40. A healthcare company using the AWS Cloud has access to a variety of data types, including raw and preprocessed data. The company wants to start using this data for its ML pipeline, but also wants to make sure the data is highly available and located in a centralized repository. What approach should the company take to achieve the desired outcome?

Create a data lake using Amazon S3 as the data storage layer

Store unstructured data in Amazon DynamoDB and structured data in Amazon RDS

Use Amazon FSx to host the data for training

Use Amazon Elastic Block Store (Amazon EBS) volumes to store the data with data backup

Explanation

The company should create a data lake using Amazon S3 as the data storage layer. This approach allows the company to store both raw and preprocessed data in a centralized repository. Amazon S3 provides highly available and scalable storage for various data types, making it an ideal choice for the company's ML pipeline.

41. A Machine Learning Specialist has completed a proof of concept for a company using a small data sample and now the Specialist is ready to implement an end-to-end solution in AWS using Amazon SageMaker. The historical training data is stored in Amazon RDS. Which approach should the Specialist use for training a model using that data?

Write a direct connection to the SQL database within the notebook and pull data in.

Move the data to Amazon DynamoDB and set up a connection to DynamoDB within the notebook to pull data in.

Explanation

The Specialist should push the data from the Amazon RDS database (Microsoft SQL Server, as named in the answer option) to Amazon S3 using AWS Data Pipeline and provide the S3 location within the notebook. Amazon SageMaker training jobs are designed to read their input from Amazon S3, so staging the historical training data there gives efficient, scalable storage that is easy to use for training. It also keeps the data securely stored and easy to share with other services or users within the AWS environment.

42. A Data Scientist created a correlation matrix between nine variables and the target variable. The correlation coefficient between two of the numerical variables, variable 1 and variable 5, is -0.95. How should the Data Scientist interpret the correlation coefficient?

As variable 1 increases, variable 5 increases

As variable 1 increases, variable 5 decreases

Variable 1 does not have any influence on variable 5

The data is not sufficient to make a well-informed interpretation

Explanation

The correlation coefficient of -0.95 indicates a strong negative correlation between variable 1 and variable 5. This means that as variable 1 increases, variable 5 tends to decrease. The closer the correlation coefficient is to -1, the stronger the negative correlation. Therefore, the Data Scientist can interpret that there is a strong inverse relationship between variable 1 and variable 5.
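
A minimal sketch of how such a coefficient is computed, assuming two small made-up samples (the names variable_1 and variable_5 and their values are illustrative only and are not the Data Scientist's data). It shows that when one variable falls as the other rises, the Pearson coefficient lands close to -1.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: variable_5 falls as variable_1 rises.
variable_1 = [1, 2, 3, 4, 5, 6]
variable_5 = [12, 10, 9, 6, 5, 1]

r = pearson(variable_1, variable_5)
print(f"correlation = {r:.2f}")  # a value close to -1
```

A coefficient near -1 describes the direction and strength of a linear association; on its own it does not establish that one variable causes the other.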

43. A manufacturing company has a large set of labeled historical sales data. The manufacturer would like to predict how many units of a particular part should be produced each quarter. Which machine learning approach should be used to solve this problem?

Logistic regression

Random Cut Forest (RCF)

Principal component analysis (PCA)

Linear regression

Explanation

Linear regression is the appropriate machine learning approach to solve the problem of predicting the number of units of a particular part that should be produced each quarter. Linear regression is used for predicting a continuous numerical value, which aligns with the problem of predicting the quantity of units to be produced. Logistic regression, Random Cut Forest (RCF), and Principal Component Analysis (PCA) are not suitable in this case because they are used for different types of problems such as classification, anomaly detection, and dimensionality reduction, respectively.

44. A real estate company wants to provide its customers with a more accurate prediction of the final sale price for houses they are considering in various cities. To do this, the company wants to use a fully connected neural network trained on data from the previous ten years of home sales, as well as other features. What kind of machine learning problem does this situation represent?

Regression

Classification

Recommender system

Reinforcement learning

Explanation

This situation represents a regression problem. Regression is a type of machine learning problem where the goal is to predict a continuous numerical value. In this case, the real estate company wants to predict the final sale price of houses, which is a continuous variable. By using a fully connected neural network trained on previous home sales data, the company can make more accurate predictions for their customers.

45. A log analytics company wants to provide a history of Amazon SageMaker API calls made on its client's account for security analysis and operational troubleshooting purposes. What must be done in the client's account to ensure that the company can analyze the API calls?

Use IAM roles and add “logs:*” permissions to those IAM roles.

Enable AWS CloudTrail.

Enable CloudWatch logs.

Use the Amazon SageMaker SDK to call the ‘sagemaker_history()’ function.

Explanation

Enabling AWS CloudTrail is necessary to ensure that the log analytics company can analyze the API calls made on the client's account. AWS CloudTrail records a history of API calls made within an AWS account, including Amazon SageMaker API calls, which supports the security analysis and operational troubleshooting the company needs. Adding "logs:*" permissions to IAM roles and enabling CloudWatch Logs deal with log data rather than with an audit trail of API calls, and calling the 'sagemaker_history()' function named in the last option is not how an account's API call history is obtained.

46. A Machine Learning Specialist is building a logistic regression model that will predict whether or not a person will order a pizza. The Specialist is trying to build the optimal model with an ideal classification threshold. What model evaluation technique should the Specialist use to understand how different classification thresholds will impact the model's performance?

Receiver operating characteristic (ROC) curve

Misclassification rate

Root Mean Square Error (RMSE)

L1 norm

Explanation

The Specialist should use the Receiver Operating Characteristic (ROC) curve to understand how different classification thresholds will impact the model's performance. The ROC curve is a graphical representation that shows the trade-off between the true positive rate and the false positive rate at various classification thresholds. By analyzing the curve, the Specialist can determine the optimal threshold that balances the true positive rate and the false positive rate, and therefore maximize the model's performance.
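
The trade-off described above can be sketched in plain Python. The labels, scores, and thresholds below are made up for illustration (1 = orders a pizza), and roc_points is a hypothetical helper, not part of any library. Each threshold yields one (false positive rate, true positive rate) point on the ROC curve.

```python
def roc_points(labels, scores, thresholds):
    """Return (threshold, false positive rate, true positive rate) tuples."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        # Predict the positive class when the score reaches the threshold.
        predicted = [s >= t for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
        fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
        points.append((t, fp / negatives, tp / positives))
    return points


# Hypothetical model outputs for eight customers.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10]

points = roc_points(labels, scores, [0.9, 0.65, 0.5, 0.2])
for t, fpr, tpr in points:
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold raises the true positive rate, but eventually the false positive rate rises with it; the curve makes that trade-off visible across all thresholds.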

47. A Data Scientist wants to include "month" as a categorical column in a training dataset for an ML model that is being built. However, the ML algorithm gives an error when the column is added to the training data. What should the Data Scientist do to add this column?

Convert the “month” column to 12 different columns, one for each month, by using one-hot encoding.

Map the “month” column data to the numbers 1 to 12 and use this new numerical mapped column.

Scale the months using StandardScaler.

Use pandas fillna() to convert the column to numerical data.

Explanation

The Data Scientist should convert the "month" column to 12 different columns, one for each month, by using one-hot encoding. The error raised when the column is added indicates that the algorithm cannot handle categorical data directly. One-hot encoding creates a binary column for each month, allowing the ML model to interpret and use the information effectively. Mapping the months to the numbers 1 to 12 would also make the column numeric, but it imposes an ordering and distance between months that the model may treat as meaningful.
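
A minimal sketch of one-hot encoding for the "month" column, assuming months are stored as three-letter abbreviations (the MONTHS list and the one_hot_month helper are illustrative names, not from the question).

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def one_hot_month(month):
    """Encode a month abbreviation as 12 binary values, one per month."""
    if month not in MONTHS:
        raise ValueError(f"unknown month: {month!r}")
    return [1 if m == month else 0 for m in MONTHS]


# Each row of the training data gets 12 new columns in place of "month".
column = ["Jan", "Mar", "Dec"]
encoded = [one_hot_month(m) for m in column]
for month, row in zip(column, encoded):
    print(month, row)
```

Exactly one of the 12 columns is set to 1 in each row, so no month is treated as larger or smaller than another.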

48. A Data Scientist at a waste recycling company trained a CNN model to classify waste at the company's sites. Incoming waste was classified as either trash, compost, or recyclable to make it easier for the machines to split the incoming waste into the appropriate bins. During model testing, the F1 score was 0.918. The company's senior leadership originally asked the Data Scientist to reach an F1 score of at least 0.95. What should the Data Scientist do to improve this score without spending too much time optimizing the model?

Use Amazon SageMaker tuning jobs to tune the hyperparameters used

Increase the batch size to improve the score in the Amazon SageMaker training job

Use momentum to improve the training in the Amazon SageMaker training job

Run the Amazon SageMaker training job for more epochs

Explanation

The Data Scientist should use Amazon SageMaker tuning jobs to tune the hyperparameters used. This technique allows for automatic hyperparameter optimization, which can help improve the model's performance without spending too much time manually optimizing the model. By tuning the hyperparameters, the Data Scientist can potentially find the best combination of settings that will result in a higher F1 score.

49. A video streaming company wants to analyze its VPC flow logs to build a real-time anomaly detection pipeline. The pipeline must be minimally managed and enable the business to build a near real-time dashboard. What combination of AWS service and algorithm can the company use for this pipeline?

Amazon SageMaker with RandomCutForest

Kinesis Data Analytics with RandomCutForest

Amazon QuickSight with ML Insights

Apache Spark on Amazon EMR with MLLib

Explanation

The company can use Kinesis Data Analytics with RandomCutForest for this pipeline. Kinesis Data Analytics is a fully managed service that can process and analyze streaming data in real time. Random Cut Forest is an anomaly detection algorithm that Kinesis Data Analytics exposes as the built-in RANDOM_CUT_FOREST SQL function, so no separate model has to be trained or hosted. Together, these give the company a minimally managed pipeline to analyze VPC flow logs and generate near real-time insights for a dashboard.

50. A manufacturing company asks its Machine Learning Specialist to develop a model that classifies defective parts into one of eight defect types. The company has provided roughly 100,000 images per defect type for training. During the initial training of the image classification model, the Specialist notices that the validation accuracy is 80%, while the training accuracy is 90%. It is known that human-level performance for this type of image classification is around 90%. What should the Specialist consider to fix this issue?

A longer training time

Making the network larger

Using a different optimizer

Using some form of regularization

Explanation

The Specialist should consider using some form of regularization to fix the issue. Regularization techniques such as L1 or L2 regularization can help prevent overfitting, which is indicated by the higher training accuracy compared to the validation accuracy. By adding regularization, the model's complexity is reduced, preventing it from memorizing the training data and improving its generalization ability. This can help bring the validation accuracy closer to the human-level performance of 90%.

51. A Machine Learning Specialist prepared the following graph displaying the results of k-means for k = [1:10]. Considering the graph, what is a reasonable selection for the optimal choice of k?

1

4

7

10

Explanation

Based on the graph, we can observe that the decrease in the sum of squared errors (SSE) is significant until k=4, after which the decrease becomes much less pronounced. This "elbow" suggests that k=4 is a reasonable selection for the optimal choice of k, as it balances a low SSE against adding further clusters that yield little additional improvement.
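
The elbow reading can be sketched as a simple heuristic. The SSE values below are hypothetical (the original graph is not reproduced here) and elbow_k is an illustrative helper: it picks the k where the drop in SSE before that point most exceeds the drop after it.

```python
def elbow_k(sse):
    """Pick the elbow of an SSE curve, where sse[i] is the SSE for k = i + 1."""
    best_k, best_bend = None, float("-inf")
    # The first and last k have no neighbour on one side, so skip them.
    for i in range(1, len(sse) - 1):
        drop_before = sse[i - 1] - sse[i]
        drop_after = sse[i] - sse[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k


# Hypothetical SSE values for k = 1..10, flattening after k = 4.
sse_by_k = [1000, 700, 420, 150, 130, 115, 105, 98, 93, 90]
print(elbow_k(sse_by_k))
```

This is only one of several ways to locate an elbow; in practice the curve is usually inspected visually as well.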

52. A Data Engineer needs to build a model using a dataset containing customer credit card information. How can the Data Engineer ensure the data remains encrypted and the credit card information is secure?

Explanation

The correct answer is to use AWS KMS to encrypt the data on Amazon S3 and Amazon SageMaker, and redact the credit card numbers from the customer data with AWS Glue. Encrypting with AWS KMS keys keeps the data encrypted at rest in both services. Additionally, AWS Glue can be used to remove or redact the credit card numbers from the dataset before it is used for modeling, further enhancing the security of the data.

53. A team of Data Scientists wants to use Amazon SageMaker training jobs to run two different versions of the same model in parallel to compare the long-term effectiveness of the different versions in reaching the related business outcome. How should the team deploy these two model versions with minimum management?

Create an endpoint configuration with production variants for the two models with equal weights.

Create an endpoint configuration with production variants for the two models with a weight ratio of 90:10.

Explanation

The team should create an endpoint configuration with production variants for the two models with equal weights. This means that both models will be deployed and used simultaneously, with an equal amount of traffic being directed to each model. By doing this, the team can compare the long-term effectiveness of the two versions in reaching the related business outcome. This approach requires minimum management as the models are deployed and managed automatically by Amazon SageMaker.
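
A hedged sketch of what an equal-weight configuration might look like. The model names, variant names, and instance type are assumptions made for illustration; the dictionary keys follow the ProductionVariants structure accepted by the SageMaker CreateEndpointConfig API, but no AWS call is made here.

```python
def build_production_variants(model_names, weights, instance_type="ml.m5.large"):
    """Build a ProductionVariants list for an endpoint configuration."""
    return [
        {
            "VariantName": f"variant-{i + 1}",
            "ModelName": name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
            "InitialVariantWeight": weight,
        }
        for i, (name, weight) in enumerate(zip(model_names, weights))
    ]


def traffic_share(variants):
    """Fraction of requests each variant receives (its weight over the total)."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Hypothetical model names; equal weights split traffic evenly.
variants = build_production_variants(["model-version-a", "model-version-b"], [1.0, 1.0])
shares = traffic_share(variants)
print(shares)
```

A list like this would be passed as the ProductionVariants argument when creating the endpoint configuration, for example through boto3's create_endpoint_config.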

54. A large JSON dataset for a project has been uploaded to a private Amazon S3 bucket. The Machine Learning Specialist wants to securely access and explore the data from an Amazon SageMaker notebook instance. A new VPC was created and assigned to the Specialist. How can the privacy and integrity of the data stored in Amazon S3 be maintained while granting access to the Specialist for analysis?

Explanation

The correct answer is to launch the SageMaker notebook instance within the VPC and create an S3 VPC endpoint for the notebook to access the data, and define a custom S3 bucket policy to only allow requests from the VPC to access the S3 bucket. This ensures that the data stored in Amazon S3 remains private and can only be accessed by the Specialist through the VPC and the designated notebook instance. The S3 VPC endpoint establishes a private connection between the VPC and S3, eliminating the need for internet access. The custom S3 bucket policy further restricts access to the bucket, ensuring the integrity of the data.
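
One possible shape for such a bucket policy, sketched as a Python dictionary. The bucket name and VPC ID are placeholders, and vpc_only_bucket_policy is an illustrative helper; the policy uses the aws:SourceVpc condition key to deny any request that does not arrive from the given VPC through its S3 VPC endpoint.

```python
import json


def vpc_only_bucket_policy(bucket_name, vpc_id):
    """Deny all S3 actions on the bucket unless the request comes from vpc_id."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyRequestsFromOutsideVpc",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                "Condition": {"StringNotEquals": {"aws:SourceVpc": vpc_id}},
            }
        ],
    }


# Placeholder identifiers, not real resources.
policy = vpc_only_bucket_policy("example-ml-dataset-bucket", "vpc-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

A Deny statement of this kind also blocks requests made from outside the VPC by administrators, including through the console, so it should be tested carefully before being applied.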

55. A city wants to monitor its air quality to address the consequences of air pollution. A Machine Learning Specialist needs to forecast the air quality in parts per million of contaminants for the next 2 days in the city. As this is a prototype, only daily data from the last year is available. Which model is MOST likely to provide the best results in Amazon SageMaker?

Use Amazon SageMaker Random Cut Forest (RCF) on the single time series consisting of the full year of data.

Explanation

The Amazon SageMaker Linear Learner algorithm is most likely to provide the best results in this scenario. The task is to forecast air quality based on historical data, which is a regression problem. The Linear Learner algorithm is designed for regression tasks and can effectively learn patterns and make predictions based on the given time series data. Using a regressor predictor_type will allow the algorithm to accurately forecast the air quality in parts per million of contaminants for the next 2 days. The other options, such as k-Nearest-Neighbors and Random Cut Forest, may not be as suitable for this specific task.

56. A Machine Learning Specialist wants to determine the appropriate SageMaker Variant Invocations Per Instance setting for an endpoint automatic scaling configuration. The Specialist has performed a load test on a single instance and determined that peak requests per second (RPS) without service degradation is about 20 RPS. As this is the first deployment, the Specialist intends to set the invocation safety factor to 0.5. Based on the stated parameters and given that the invocations per instance setting is measured on a per-minute basis, what should the Specialist set as the SageMaker Variant Invocations Per Instance setting?

10

30

600

2400

Explanation

The Specialist's load test on a single instance found that peak throughput without service degradation is about 20 requests per second (RPS), and the chosen invocation safety factor is 0.5. Because the SageMaker Variant Invocations Per Instance setting is measured on a per-minute basis, the value is the peak RPS multiplied by the safety factor and then by 60 seconds: 20 × 0.5 × 60 = 600. Therefore, the Specialist should set the SageMaker Variant Invocations Per Instance setting to 600.
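
The arithmetic can be checked with a short sketch. The function name invocations_per_instance is illustrative; the calculation is peak RPS times safety factor times 60 seconds, using the figures stated in the question.

```python
def invocations_per_instance(max_rps, safety_factor):
    """Per-minute invocation target for one instance."""
    # Scale peak throughput down by the safety factor, then convert to per minute.
    return max_rps * safety_factor * 60


setting = invocations_per_instance(max_rps=20, safety_factor=0.5)
print(setting)  # 600.0
```

Dividing by the safety factor instead would give 2400, which is above what a single instance sustained in the load test.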

57. A video streaming company wants to create a searchable video library that provides a personalized searching experience and automated content moderation for its users, so that when the users search for a keyword, they get all the videos that map to that keyword. The company wants to do this with minimal cost and limited need for management. What approach should the company take to building this solution?

Use Amazon SageMaker to create an ML model that extracts metadata from the videos

Use Amazon Rekognition Video to extract metadata from the videos

Use Amazon Kinesis Video Streams to stream the videos to Amazon EMR in order to create an ML model

Use AWS Batch to transform a batch of video files into metadata

Explanation

The company should use Amazon Rekognition Video to extract metadata from the videos. Amazon Rekognition Video is a service that can analyze and extract meaningful metadata from videos, such as objects, scenes, and faces. By using this service, the company can automatically extract relevant information from the videos, which can then be used for creating a searchable video library. This approach would require minimal cost and limited management, as the extraction of metadata can be automated using Amazon Rekognition Video.

58. A Machine Learning Specialist is training a model to identify the make and model of vehicles in images. The Specialist wants to use transfer learning and an existing model trained on images of general objects. The Specialist collated a large custom dataset of pictures containing different vehicle makes and models. What should the Specialist do to initialize the model to re-train it with the custom data?

Initialize the model with random weights in all layers including the last fully connected layer

Initialize the model with pre-trained weights in all layers and replace the last fully connected layer.

Initialize the model with random weights in all layers and replace the last fully connected layer.

Initialize the model with pre-trained weights in all layers including the last fully connected layer.

Explanation

To initialize the model and re-train it with the custom data, the Machine Learning Specialist should use the option of initializing the model with pre-trained weights in all layers and replacing the last fully connected layer. This approach allows the Specialist to leverage the knowledge and features learned by the existing model trained on general objects. By replacing the last fully connected layer, the model can be adapted to the specific task of identifying the make and model of vehicles using the custom dataset.

59. A Machine Learning Specialist is building a supervised model that will evaluate customers' satisfaction with their mobile phone service based on recent usage. The model's output should infer whether or not a customer is likely to switch to a competitor in the next 30 days. Which of the following modeling techniques should the Specialist use?

Time-series prediction

Anomaly detection

Binary classification

Regression

Explanation

The Machine Learning Specialist should use binary classification for this task because the goal is to predict whether or not a customer is likely to switch to a competitor in the next 30 days. Binary classification is the appropriate modeling technique for predicting a binary outcome, in this case, whether a customer will switch or not. Time-series prediction would be more suitable if the goal was to predict future usage patterns, while anomaly detection is used to identify unusual or abnormal behavior. Regression is used for predicting continuous numerical values, which is not the case here.

60. Given the following confusion matrix for a movie classification model, what is the true class frequency for Romance and the predicted class frequency for Adventure?

The true class frequency for Romance is 77.56% and the predicted class frequency for Adventure is 20.85%

The true class frequency for Romance is 57.92% and the predicted class frequency for Adventure is 13.12%

61. A Machine Learning Specialist built an image classification deep learning model. However, the Specialist ran into an overfitting problem in which the training and testing accuracies were 99% and 75%, respectively. How should the Specialist address this issue and what is the reason behind it?

The learning rate should be increased because the optimization process was trapped at a local minimum.

The dropout rate at the flatten layer should be increased because the model is not generalized enough.

The dimensionality of dense layer next to the flatten layer should be increased because the model is not complex enough.

The epoch number should be increased because the optimization process was terminated before it reached the global minimum.

Explanation

The overfitting problem occurs when a model performs well on the training data but poorly on the testing data, indicating that the model is not generalized enough. Increasing the dropout rate at the flatten layer helps to regularize the model by randomly dropping out some of the neurons during training, preventing over-reliance on specific features and improving generalization. Therefore, increasing the dropout rate is the appropriate solution to address the overfitting issue in this scenario.
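
A minimal sketch of what dropout does to a layer's activations during training, written in plain Python rather than a deep learning framework. It uses the common "inverted dropout" variant, which is an assumption here: surviving activations are scaled by 1 / (1 - rate) so their expected value is unchanged, and nothing is dropped at inference time.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of activations during training."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At inference time every neuron is kept unchanged.
        return list(activations)
    keep = 1.0 - rate
    # Keep each activation with probability `keep` and rescale the survivors.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # seeded only to make the sketch repeatable
flattened = [1.0] * 1000
dropped = dropout(flattened, rate=0.5, rng=rng)
zeros = sum(1 for v in dropped if v == 0.0)
print(f"{zeros} of {len(dropped)} activations were dropped")
```

Raising the rate drops more activations on each training step, which is the stronger regularization the answer calls for.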

62. A monitoring service generates 1 TB of scale metrics record data every minute. A Research team performs queries on this data using Amazon Athena. The queries run slowly due to the large volume of data, and the team requires better performance. How should the records be stored in Amazon S3 to improve query performance?

CSV files

Parquet files

Compressed JSON

RecordIO

Explanation

Storing the records in Amazon S3 in Parquet file format can improve query performance. Parquet is a columnar storage file format that is optimized for query performance. It uses compression techniques and column-level encoding to reduce the amount of data that needs to be read from disk during queries. This can significantly improve the query execution time, especially for large volumes of data like in this scenario where 1 TB of data is generated every minute.

63. Which probability distribution would describe the likelihood of flipping a coin "heads"?

Bernoulli Distribution

Normal Distribution

Poisson Distribution

Binomial Distribution

Explanation

The likelihood of flipping a coin "heads" can be described by the Binomial Distribution. This distribution applies when each trial has two possible outcomes (here, heads or tails) and the flips are independent with the same probability of heads. It gives the probability of a certain number of successes (heads) in a fixed number of trials (the number of times the coin is flipped); for a single flip it reduces to the Bernoulli Distribution.
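
As a minimal sketch of the formula behind this answer, the binomial probability of exactly k heads in n flips can be computed with the standard library alone. The function name `binomial_pmf` and the example values are illustrative assumptions, not part of the question.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# One fair flip: the binomial with n = 1 is the Bernoulli case.
single_flip = binomial_pmf(1, 1, 0.5)

# Exactly 3 heads in 10 fair flips.
three_of_ten = binomial_pmf(3, 10, 0.5)
```

With n = 1 the function returns p itself, which is why the single-flip case and the Bernoulli distribution coincide.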

64. While working on a neural network project, a Machine Learning Specialist discovers that some features in the data have very high magnitude resulting in this data being weighted more in the cost function. What should the Specialist do to ensure better convergence during backpropagation?

Dimensionality reduction

Data normalization

Model regularization

Data augmentation for the minority class

Explanation

Data normalization is the process of scaling the data to a standard range. In this case, the high magnitude of some features can cause the neural network to give more importance to those features, leading to slower convergence during backpropagation. By normalizing the data, the features will be on a similar scale, allowing the neural network to learn more effectively and converge faster. This helps to prevent any one feature from dominating the cost function and ensures better convergence during backpropagation.
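
One common form of normalization is standardization (z-scoring), sketched below with the standard library. The feature names and values are hypothetical; min-max scaling would serve the same purpose.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a feature column to zero mean and unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Two hypothetical features on very different scales.
income = [30_000.0, 60_000.0, 90_000.0]
age = [25.0, 40.0, 55.0]

scaled_income = standardize(income)
scaled_age = standardize(age)
```

After scaling, both columns span the same range, so neither dominates the cost function merely because of its units.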

65. A Machine Learning Specialist is assigned a TensorFlow project using Amazon SageMaker for training, and needs to continue working for an extended period with no Wi-Fi access. Which approach should the Specialist use to continue working?

Install Python 3 and boto3 on their laptop and continue the code development using that environment.

Download TensorFlow from tensorflow.org to emulate the TensorFlow kernel in the SageMaker environment.

Explanation

The Specialist should download the TensorFlow Docker container used in Amazon SageMaker from GitHub to their local environment and use the Amazon SageMaker Python SDK to test the code. This approach allows the Specialist to have a local copy of the SageMaker environment, including the necessary dependencies and libraries, without requiring an internet connection. They can continue working on the project and test the code using the SageMaker Python SDK, ensuring compatibility and consistency with the SageMaker environment.

66. A Machine Learning Specialist needs to be able to ingest streaming data and store it in Apache Parquet files for exploration and analysis. Which of the following services would both ingest and store this data in the correct format?

AWS DMS

Amazon Kinesis Data Streams

Amazon Kinesis Data Firehose

Amazon Kinesis Data Analytics

Explanation

Amazon Kinesis Data Firehose is the correct answer because it is a fully managed service that can ingest streaming data and store it in Apache Parquet files. It can automatically convert the incoming data to the desired format and deliver it to various destinations, including Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service. This makes it suitable for exploration and analysis of the data.

67. A Machine Learning Specialist is evaluating an ML model using a custom Deep Learning Amazon Machine Image (AMI) with Anaconda installed to run workloads through the terminal. Unfortunately, the ML Specialist does not have any experience with the Deep Learning AMI and wants to log into the instance and create an ipython notebook (*.ipynb), but cannot access the notebook interface. After creating the AMI instance, what steps should the ML Specialist take to create a notebook?

SSH into the Deep Learning AMI instance, start a new Flask interface application, and create a new ipython notebook

Explanation

To create a notebook, the ML Specialist should SSH into the Deep Learning AMI instance with port forwarding at port 8888. They should then start a Jupyter notebook application, which will provide them with the notebook interface. From there, they can create a new ipython notebook.

68. A data and analytics company is expanding its platform on AWS. The company wants to build a serverless product that preprocesses large structured data while minimizing the cost for data storage and compute. The company also wants to integrate the new product with an existing ML product that uses Amazon EMR with Spark. What solution should the company use to build this new product?

Use AWS Lambda for data preprocessing. Save the data in Amazon S3 in CSV format.

Use AWS Glue for data preprocessing. Save the data in Amazon S3 in CSV format.

Use AWS Glue for data preprocessing. Save the data in Amazon S3 in Parquet format.

Use AWS Lambda for data preprocessing. Save the data in Amazon S3 in Parquet format.

Explanation

The company should use AWS Glue for data preprocessing because it is a serverless data integration service that can efficiently process large structured data. Saving the data in Amazon S3 in Parquet format is recommended because Parquet is a columnar storage format that provides better compression and query performance compared to CSV format. This solution will help minimize the cost for data storage and compute while integrating with the existing ML product that uses Amazon EMR with Spark.

69. A manufacturing company wants to increase the longevity of its factory machines by predicting when a machine part is about to stop working, jeopardizing the health of the machine. The company's team of Data Scientists will build an ML model to accomplish this goal. The model will be trained on data made up of consumption metrics from similar factory machines, and will span a time frame from one hour before a machine part broke down to five minutes after the part degraded. What kind of machine learning algorithm should the company use to build this model?

Amazon SageMaker DeepAR

SciKit Learn Regression

Convolutional neural network (CNN)

Scikit Learn Random Forest

Explanation

The company should use Amazon SageMaker DeepAR algorithm to build the model. DeepAR is a time series forecasting algorithm that is specifically designed for predicting future values based on historical data. In this case, the algorithm can be trained on the consumption metrics of similar factory machines to predict when a machine part is about to stop working. The algorithm's ability to handle time series data and capture temporal dependencies makes it suitable for this task.

70. A manufacturer of car engines collects data from cars as they are being driven. The data collected includes timestamp, engine temperature, rotations per minute (RPM), and other sensor readings. The company wants to predict when an engine is going to have a problem so it can notify drivers in advance to get engine maintenance. The engine data is loaded into a data lake for training. Which is the MOST suitable predictive model that can be deployed into production?

This data requires an unsupervised learning algorithm. Use Amazon SageMaker k-means to cluster the data.

This data is already formulated as a time series. Use Amazon SageMaker seq2seq to model the time series.

Explanation

The most suitable predictive model that can be deployed into production is to add labels over time to indicate which engine faults occur at what time in the future, turning this into a supervised learning problem. Then, a recurrent neural network (RNN) can be used to train the model to recognize when an engine might need maintenance for a certain fault. This approach allows the model to learn patterns and correlations in the data over time, making it effective for predicting engine problems based on the collected sensor readings.

71. A Machine Learning Specialist is creating a new natural language processing application that processes a dataset comprised of 1 million sentences. The aim is to then run Word2Vec to generate embeddings of the sentences and enable different types of predictions. Here is an example from the dataset: "The quck BROWN FOX jumps over the lazy dog." Which of the following are the operations the Specialist needs to perform to correctly sanitize and prepare the data in a repeatable manner? (Select THREE)

Perform part-of-speech tagging and keep the action verb and the nouns only

Normalize all words by making the sentence lowercase

Remove stop words using an English stopword dictionary.

Correct the typography on "quck" to "quick."

One-hot encode all words in the sentence

Tokenize the sentence into words.

Explanation

The Specialist needs to perform three operations to correctly sanitize and prepare the data in a repeatable manner. Firstly, normalizing all words by making the sentence lowercase ensures that the capitalization of words does not affect the analysis. Secondly, removing stop words using an English stopword dictionary helps to eliminate commonly used words that do not carry much meaning. Lastly, tokenizing the sentence into words breaks it down into individual units, which is necessary for further processing and analysis.
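
The three operations can be sketched as a small, repeatable function. The stop-word set below is a deliberately tiny, hypothetical stand-in for a full English stopword dictionary, and the regular expression is one simple tokenization choice among many.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "over", "a", "an", "of", "and"}


def prepare(sentence):
    """Lowercase the sentence, tokenize it into words, and drop stop words."""
    lowered = sentence.lower()
    tokens = re.findall(r"[a-z0-9']+", lowered)
    return [t for t in tokens if t not in STOP_WORDS]


tokens = prepare("The quck BROWN FOX jumps over the lazy dog ")
# tokens == ['quck', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

The misspelling "quck" passes through unchanged, consistent with typo correction not being among the selected operations.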

72. A large consumer goods manufacturer has the following products on sale:

34 different toothpaste variants
48 different toothbrush variants
43 different mouthwash variants

The entire sales history of all these products is available in Amazon S3. Currently, the company is using custom-built autoregressive integrated moving average (ARIMA) models to forecast demand for these products. The company wants to predict the demand for a new product that will soon be launched. Which solution should a Machine Learning Specialist apply?

Train a custom ARIMA model to forecast demand for the new product.

Train an Amazon SageMaker DeepAR algorithm to forecast demand for the new product

Train an Amazon SageMaker k-means clustering algorithm to forecast demand for the new product.

Train a custom XGBoost model to forecast demand for the new product

Explanation

The correct solution is to train an Amazon SageMaker DeepAR algorithm to forecast demand for the new product. The DeepAR algorithm is specifically designed for time series forecasting, which is the case for predicting demand for consumer goods. It can handle multiple related time series and capture complex patterns and dependencies in the data. Therefore, it is the most suitable solution for forecasting demand for the new product based on the sales history of the existing products.

73. A security and networking company wants to use ML to flag certain IP addresses that have been known to send spam and phishing information. The company wants to build an ML model based on previous user feedback indicating whether specific IP addresses have been connected to a website designed for spam and phishing. What is the simplest solution that the company can implement?

Regression

Classification

Natural language processing (NLP)

A rule-based solution should be used instead of ML

Explanation

A rule-based solution should be used instead of ML because the company already has specific criteria (previous user feedback) to identify IP addresses connected to spam and phishing websites. ML models require training data, which may not be readily available in this case. By using a rule-based solution, the company can set predefined rules based on the feedback to flag the IP addresses without the need for ML algorithms.

74. A Machine Learning specialist is examining the root cause for underperformance of a regression model and has a hunch that it is consistently overestimating the outcome. Which metrics should he track on a chart to help identify any pattern of model overestimation?

Residuals

Mean Absolute Error

RMSE

AUC

Explanation

A residual is the difference between the actual value and the predicted value (actual minus predicted) in a regression model. By tracking the residuals on a chart, the Machine Learning specialist can spot any consistent pattern of overestimation: if the residuals are consistently negative, the predictions sit consistently above the actual values, so the model is overestimating the outcome. Mean Absolute Error and RMSE measure only the size of the error and discard its sign, and AUC applies to classification, so only the residuals reveal the direction of the error.

75. A Machine Learning Specialist trained a regression model, but the first iteration needs optimizing. The Specialist needs to understand whether the model is more frequently overestimating or underestimating the target. What option can the Specialist use to determine whether it is overestimating or underestimating the target value?

Root Mean Square Error (RMSE)

Residual plots

Area under the curve

Confusion matrix

Explanation

Residual plots can be used to determine whether a regression model is overestimating or underestimating the target value. A residual is the actual value of the target minus the predicted value. If the majority of the residuals are negative, the predictions are larger than the actual values and the model is overestimating the target. Conversely, if the majority of the residuals are positive, the model is underestimating the target. Therefore, by analyzing the pattern of residuals in the plot, the Machine Learning Specialist can understand whether the model is more frequently overestimating or underestimating the target.
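
The sign check can be sketched in a few lines. This assumes the convention residual = actual minus predicted; under the opposite convention (predicted minus actual) the interpretation of the signs flips. The data values are made up for illustration.

```python
def residuals(actual, predicted):
    """Residual per observation, using actual minus predicted."""
    return [a - p for a, p in zip(actual, predicted)]


# Hypothetical evaluation data.
actual = [10.0, 12.0, 9.0, 15.0]
predicted = [12.0, 13.5, 9.5, 14.0]

res = residuals(actual, predicted)

# Negative residual: prediction above the actual value (overestimate).
n_over = sum(r < 0 for r in res)
# Positive residual: prediction below the actual value (underestimate).
n_under = sum(r > 0 for r in res)

mean_residual = sum(res) / len(res)
```

Here three of the four residuals are negative and the mean residual is below zero, which is the pattern a residual plot would show for a model that mostly overestimates.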

76. In AWS SageMaker, what feature allows you to distribute machine learning model training across multiple instances and is designed for large-scale distributed training?

SageMaker Data Wrangler

SageMaker Model Monitor

SageMaker Multi-Model Endpoints

SageMaker Distributed Training

Explanation

SageMaker Distributed Training is a feature within Amazon SageMaker that enables large-scale distributed training of machine learning models across multiple instances. This advanced capability is particularly useful for handling large datasets and complex model training scenarios, making it an essential tool for scaling machine learning workflows in AWS.

77. A healthcare organization has an application that takes in sensitive user data. This data is encrypted at rest and stored in an Amazon S3 bucket using customer-managed encryption with AWS Key Management Service (AWS KMS). A Data Scientist in the organization wants to use this encrypted data as features in an Amazon SageMaker training job. However, the following error continues to occur: "Data download failed." What should the Data Scientist do to fix this issue?

Add “S3:*” to the IAM role that is attached to the Amazon SageMaker training job.

Specify the “VolumeKmsKeyId” in the Amazon SageMaker training job.

Add “EnableKMS” to the Amazon SageMaker training job. Then, specify the Amazon S3 bucket that includes the data.

Explanation

The Data Scientist should make sure that the AWS Identity and Access Management (IAM) role used for Amazon S3 access has the necessary permissions to encrypt and decrypt the data with the AWS KMS key. This is because the data is encrypted at rest and stored in an Amazon S3 bucket using customer-managed encryption with AWS KMS. Without the proper permissions, the Data Scientist will not be able to access and download the encrypted data for use in the Amazon SageMaker training job, resulting in the "Data download failed" error.
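
A rough sketch of what such a permission might look like as an IAM policy document, built here as a Python dictionary. The key ARN is a placeholder, and the exact set of KMS actions is an assumption that mirrors the "encrypt and decrypt" wording above; a real policy should be scoped to the specific key and the actions the job needs.

```python
import json

# Placeholder ARN: REGION, ACCOUNT_ID and KEY_ID are not real values.
KMS_KEY_ARN = "arn:aws:kms:REGION:ACCOUNT_ID:key/KEY_ID"

kms_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # Assumed action set, following the explanation's wording.
            "Action": ["kms:Encrypt", "kms:Decrypt"],
            "Resource": KMS_KEY_ARN,
        }
    ],
}

policy_json = json.dumps(kms_policy, indent=2)
```

The resulting JSON would be attached to the IAM role that the training job assumes, alongside its existing Amazon S3 permissions.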

78. A Machine Learning Specialist is implementing a full Bayesian network on a dataset that describes public transit in New York City. One of the random variables is discrete, and represents the number of minutes New Yorkers wait for a bus given that the buses cycle every 10 minutes, with a mean of 3 minutes. Which prior probability distribution should the ML Specialist use for this variable?

Poisson distribution

Uniform distribution

Normal distribution

Binomial distribution

Explanation

The ML Specialist should use a Poisson distribution for this variable because it represents the number of events (minutes of waiting) occurring in a fixed interval (bus cycle every 10 minutes) with a known average rate (mean of 3 minutes). The Poisson distribution is commonly used for modeling the number of events in a fixed interval, making it the appropriate choice in this scenario.
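
As a small illustration, the Poisson probability mass function with the stated mean of 3 can be evaluated over the 0 to 10 minute range using only the standard library. The function and variable names are illustrative.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k events when the mean is lam."""
    return lam**k * exp(-lam) / factorial(k)


# Probability of each whole-minute wait from 0 to 10, with a mean of 3.
wait_probs = [poisson_pmf(k, 3.0) for k in range(11)]

# Share of the total probability that falls inside the 10-minute cycle.
covered = sum(wait_probs)
```

Nearly all of the probability mass lies within the 10-minute cycle, so the unbounded tail of the Poisson distribution matters little for this variable.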

79. You work for a manufacturing company that produces retail apparel, such as shoes, dresses, blouses, etc. Your head of manufacturing has asked you to use your data science skills to determine which product, among a list of potential next products, your company should invest its resources to produce. You decide you need to predict the sales levels of each of the potential next products and select the one with the highest predicted purchase rate. Which type of machine learning approach should you use?

Explanation

The goal is to predict the sales levels of each potential next product and select the one with the highest predicted purchase rate. This is a regression problem because the target variable (sales levels) is a continuous variable. Linear regression is an appropriate machine learning approach for regression problems as it aims to model the relationship between the independent variables (such as product features, marketing efforts, etc.) and the dependent variable (sales levels) through a linear equation. By using linear regression, the company can make predictions on the sales levels of each potential product and choose the one with the highest predicted sales.

80. A Machine Learning Specialist is packaging a custom ResNet model into a Docker container so the company can leverage Amazon SageMaker for training. The Specialist is using Amazon EC2 P3 instances to train the model and needs to properly configure the Docker container to leverage the NVIDIA GPUs. What does the Specialist need to do?

Bundle the NVIDIA drivers with the Docker image.

Build the Docker container to be NVIDIA-Docker compatible.

Organize the Docker container's file structure to execute on GPU instances.

Set the GPU flag in the Amazon SageMaker Create Training Job request body.

Explanation

The Specialist needs to build the Docker container to be NVIDIA-Docker compatible. This is because the Specialist is using Amazon EC2 P3 instances, which have NVIDIA GPUs. By making the Docker container NVIDIA-Docker compatible, the Specialist ensures that the container can properly leverage the NVIDIA GPUs for training the model. This compatibility allows the container to access the necessary NVIDIA drivers and libraries for GPU acceleration.

81. A Machine Learning Specialist is working with a large company to leverage machine learning within its products. The company wants to group its customers into categories based on which customers will and will not churn within the next 6 months. The company has labeled the data available to the Specialist. Which machine learning model type should the Specialist use to accomplish this task?

Linear regression

Classification

Clustering

Reinforcement learning

Explanation

The Specialist should use a Classification model to accomplish this task. Classification models are used to predict the category or class that an observation belongs to based on its features. In this case, the Specialist wants to group customers into categories of churn or no churn, which is a binary classification problem. By training a classification model on the labeled data, the Specialist can predict whether a customer will churn or not based on their features.

82. A retail company intends to use machine learning to categorize new products. A labeled dataset of current products was provided to the Data Science team. The dataset includes 1,200 products. The labeled dataset has 15 features for each product such as title, dimensions, weight, and price. Each product is labeled as belonging to one of six categories such as books, games, electronics, and movies. Which model should be used for categorizing new products using the provided dataset for training?

An XGBoost model where the objective parameter is set to multi:softmax

A deep convolutional neural network (CNN) with a softmax activation function for the last layer

A regression forest where the number of trees is set equal to the number of product categories

A DeepAR forecasting model based on a recurrent neural network (RNN)

Explanation

An XGBoost model with the objective parameter set to multi:softmax should be used for categorizing new products using the provided dataset for training. XGBoost handles multi-class classification on tabular data effectively. With the objective set to multi:softmax (and num_class set to 6), the model applies a softmax across the classes and outputs the predicted category for each product; the related multi:softprob objective returns the per-class probabilities instead. This model suits the given dataset, which is small (1,200 products), has 15 tabular features, and has six categories to classify the products into.
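
In XGBoost, the multi:softmax objective returns the predicted class, while multi:softprob returns the per-class probabilities. The plain-Python sketch below mirrors that relationship; it does not call XGBoost, and the raw scores are hypothetical.

```python
from math import exp


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for one product across six categories.
scores = [0.2, 2.5, -1.0, 0.7, 0.0, 1.1]

# What a softprob-style output corresponds to: one probability per class.
probs = softmax(scores)

# What a softmax-style output corresponds to: the most probable class index.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

The predicted class is simply the index of the largest probability, so the two outputs carry the same ranking and differ only in how much detail they expose.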

83. A Machine Learning Specialist has built a model using Amazon SageMaker built-in algorithms and is not getting expected accurate results. The Specialist wants to use hyperparameter optimization to increase the model's accuracy. Which method is the MOST repeatable and requires the LEAST amount of effort to achieve this?

Launch multiple training jobs in parallel with different hyperparameters

Create a hyperparameter tuning job and set the accuracy as an objective metric.

Explanation

Creating a hyperparameter tuning job and setting the accuracy as an objective metric is the most repeatable and requires the least amount of effort to achieve the desired results. This approach automates the process of searching for the best hyperparameters by iteratively training and evaluating models with different hyperparameter configurations. By setting the accuracy as the objective metric, the tuning job will optimize the hyperparameters to maximize the model's accuracy. This method eliminates the need for manual trial and error and ensures a systematic and efficient approach to improving the model's accuracy.

84. A company's Machine Learning Specialist needs to improve the training speed of a time-series forecasting model using TensorFlow. The training is currently implemented on a single-GPU machine and takes approximately 23 hours to complete. The training needs to be run daily. The model accuracy is acceptable, but the company anticipates a continuous increase in the size of the training data and a need to update the model on an hourly, rather than a daily, basis. The company also wants to minimize coding effort and infrastructure changes. What should the Machine Learning Specialist do to the training solution to allow it to scale for future demand?

Explanation

To scale the training solution for future demand, the Machine Learning Specialist should change the TensorFlow code to implement a Horovod distributed framework supported by Amazon SageMaker. This will allow the training to be parallelized across multiple machines, enabling faster training and accommodating the anticipated increase in training data size. By using Horovod and Amazon SageMaker, the Specialist can achieve the business goals with minimal coding effort and few infrastructure changes.

85. A Machine Learning Specialist is working for a credit card processing company and receives an unbalanced dataset containing credit card transactions. It contains 99,000 valid transactions and 1,000 fraudulent transactions. The Specialist is asked to score a model that was run against the dataset. The Specialist has been advised that identifying valid transactions is equally as important as identifying fraudulent transactions. What metric is BEST suited to score the model?

Precision

Recall

Area Under the ROC Curve (AUC)

Root Mean Square Error (RMSE)

Explanation

The best metric to score the model in this scenario is the Area Under the ROC Curve (AUC). AUC is a suitable metric for evaluating models on unbalanced datasets because it measures the model's ability to distinguish between the positive and negative classes. In this case, the positive class represents the fraudulent transactions, and the negative class represents the valid transactions. AUC takes into account both the true positive rate (sensitivity) and the false positive rate (1 - specificity), so it rewards correct identification of fraudulent and valid transactions alike, which matches the requirement that both are equally important.
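As a rough illustration of why AUC suits an unbalanced dataset, the sketch below computes AUC as the fraction of (fraudulent, valid) pairs in which the fraudulent transaction receives the higher score. The labels and scores are made-up values for illustration only, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = fraudulent, 0 = valid (8 valid, 2 fraudulent).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.7, 0.25, 0.6, 0.9]

auc = roc_auc(labels, scores)

# A model that gives every transaction the same score cannot rank at all.
constant_auc = roc_auc(labels, [0.5] * len(labels))

print(auc, constant_auc)
```

A constant scorer that labels everything "valid" would reach 80% accuracy on this toy data, yet its AUC stays at 0.5, which is why accuracy alone is misleading here.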

86. An online news organization wants to expand its reach globally by translating some of its most commonly read articles into different languages using ML. The organization's data science team is gathering all the news articles that they have published in both English and at least one other language. They want to use this data to create one machine learning model for each non-English language that the organization is targeting. The models should only require minimum management. What approach should the team use to build these models?

Use Amazon SageMaker Object2Vec to create a vector. Use the Amazon SageMaker built-in Sequence to Sequence model (Seq2Seq)

Explanation

The team should use Amazon SageMaker Object2Vec to create a vector and then use the Amazon SageMaker built-in Sequence to Sequence model (Seq2Seq). This approach allows them to create a vector representation of the news articles using Object2Vec, which can capture the semantic meaning of the text. Then, they can use the Seq2Seq model to train a machine learning model for each non-English language. The built-in Seq2Seq model in Amazon SageMaker simplifies the management of the models as it provides pre-built functionality for sequence-to-sequence tasks, reducing the need for additional management efforts.

87. A transportation company currently uses Amazon EMR with Apache Spark for some of its data transformation workloads. It transforms columns of geographical data (like latitudes and longitudes) and adds columns to segment the data into different clusters per city to attain additional features for the k-nearest neighbors algorithm being used. The company wants less operational overhead for their transformation pipeline. They want a new solution that does not make significant changes to the current pipeline and only requires minimal management. What AWS services should the company use to build this new pipeline?

Use Amazon EMR to transform files. Use Amazon S3 as the destination

Use Lambda to transform files. Use Amazon EMR HDFS as the destination.

Use AWS Glue to transform files. Use Amazon S3 as the destination

Use AWS Glue to transform files. Use Amazon EMR HDFS as the destination.

Explanation

The company should use AWS Glue to transform files because it is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. AWS Glue supports Apache Spark and can be used to transform the geographical data columns and add additional features for the k-nearest neighbors algorithm. Using Amazon S3 as the destination for the transformed data is a good choice because it is a scalable and cost-effective storage solution. This solution meets the company's requirements of minimal changes to the current pipeline and minimal management.

88. A Machine Learning Specialist works for a credit card processing company and needs to predict which transactions may be fraudulent in near-real time. Specifically, the Specialist must train a model that returns the probability that a given transaction may be fraudulent. How should the Specialist frame this business problem?

Streaming classification

Binary classification

Multi-category classification

Regression classification

Explanation

The Specialist should frame this business problem as binary classification because the goal is to predict whether a transaction is fraudulent or not. Binary classification is used when there are only two possible outcomes, in this case, fraudulent or not fraudulent.

89. Your marketing department wishes to understand how their products are being represented in the various social media services in which they have active content streams. They would like insights into the reception of a current product line so they can plan for the rollout of a new product in the line in the near future. You have been tasked with creating a service that organizes the social media content by sentiment across all languages so that your marketing department can determine how best to introduce the new product. How would you quickly and most efficiently design and build a service for your marketing team that gives insight into the social media sentiment?

Use Amazon Translate, Amazon Comprehend, Amazon Kinesis, Amazon Athena, and Amazon QuickSight to build a natural-language-processing (NLP)-powered social media dashboard

Explanation

The best way to quickly and efficiently design and build a service that gives insight into social media sentiment for the marketing team is to use Amazon Translate, Amazon Comprehend, Amazon Kinesis, Amazon Athena, and Amazon QuickSight. These services provide a comprehensive solution for natural language processing (NLP) and data analysis. Amazon Translate can be used to translate social media content from different languages into one common language, Amazon Comprehend can be used for sentiment analysis, Amazon Kinesis can be used for real-time data streaming, Amazon Athena can be used for querying and analyzing the data, and Amazon QuickSight can be used for visualizing the insights on a dashboard. This combination of services enables the marketing team to understand the sentiment toward their products across different languages and make informed decisions for the rollout of a new product.

90. A social networking organization wants to analyze all the comments and likes from its users to flag offensive language on the site. The organization's data science team wants to use a Long Short-term Memory (LSTM) architecture to classify the raw sentences from the comments into one of two categories: offensive and nonoffensive. What should the team do to prepare the data for the LSTM?

Convert the individual sentences into sequences of words. Use those as the input.

Vectorize the sentences. Transform them into numerical sequences. Use the sentences as the input.

Vectorize the sentences. Transform them into numerical sequences with a padding. Use the sentences as the input.

Explanation

To prepare the data for the LSTM, the team should vectorize the sentences by transforming them into numerical sequences. Padding should then be applied so that all sequences have the same length. This matters because the sequences in a training batch must share one length to be stacked into a single tensor. Once vectorized and padded, the data can be processed by the LSTM model for classification.
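The vectorize-then-pad step can be sketched in plain Python. This is a minimal illustration, assuming whitespace tokenization and reserving index 0 for padding; the helper names (`build_vocab`, `vectorize`, `pad`) and the sample comments are hypothetical, and a real pipeline would typically use a framework tokenizer instead.

```python
def build_vocab(sentences):
    """Map each distinct word to an integer ID, keeping 0 free for padding."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def vectorize(sentence, vocab):
    """Turn a sentence into a sequence of integer word IDs."""
    return [vocab[w] for w in sentence.lower().split() if w in vocab]


def pad(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest sequence length."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]


# Made-up comments for illustration.
comments = ["you are great", "this is a terrible awful post"]
vocab = build_vocab(comments)
sequences = [vectorize(c, vocab) for c in comments]
padded = pad(sequences)

print(padded)
```

After padding, every row has the same length, so the rows can be stacked into one batch for the LSTM.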

91. An analytics company wants to use a fully managed service that automatically scales to handle the transfer of its Apache web logs, syslogs, text and videos on their webserver to Amazon S3 with minimum transformation. What service can be used for this process?

Kinesis Data Streams

Kinesis Firehose

Kinesis Data Analytics

Amazon Kinesis Video Streams

Explanation

Kinesis Firehose is the correct answer for this question. Kinesis Firehose is a fully managed service that automatically scales to handle the transfer of data, such as Apache web logs, syslogs, text, and videos, from various sources to Amazon S3. It requires minimum transformation, making it suitable for the given scenario where the analytics company wants to transfer their web logs, syslogs, text, and videos to Amazon S3 without extensive data manipulation.

92. A financial services company is building a robust serverless data lake on Amazon S3. The data lake should be flexible and meet the following requirements:

Support querying old and new data on Amazon S3 through Amazon Athena and Amazon Redshift Spectrum.
Support event-driven ETL pipelines.
Provide a quick and easy way to understand metadata.

Which approach meets these requirements?

Explanation

This approach meets the requirements because it utilizes AWS Glue, which is a fully managed extract, transform, and load (ETL) service. The AWS Glue crawler is used to automatically discover and catalog metadata about the data in the S3 data lake. An AWS Lambda function is used to trigger the AWS Glue ETL job, which allows for event-driven ETL pipelines. The AWS Glue Data Catalog is used to search and discover metadata, providing a quick and easy way to understand the data lake's metadata. This approach also aligns with the requirement of supporting querying old and new data through Amazon Athena and Amazon Redshift Spectrum.

93. You would like to draw inferences from a model deployed to Amazon SageMaker Hosting Services. If you are using the AWS SDK for Python (Boto3) library, which method would you call for inferences?

create_endpoint_config

predict

create_endpoint

invoke_endpoint

Explanation

When using the AWS SDK for Python (Boto3) library, the method to call for inferences from a model deployed to Amazon SageMaker Hosting Services is invoke_endpoint, which belongs to the SageMaker Runtime client. This method sends a request to the endpoint and returns the inference response. It allows you to pass input data to the model and obtain the corresponding predictions or inferences.
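A minimal sketch of such a call is shown below. The wrapper `get_prediction`, the endpoint name, the CSV payload format, and the `FakeRuntimeClient` stand-in are all assumptions made for illustration; the stand-in only exists so the sketch runs without AWS credentials.

```python
import io


def get_prediction(runtime_client, endpoint_name, features):
    """Send one CSV row to a SageMaker endpoint and return the decoded response."""
    payload = ",".join(str(x) for x in features)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
    )
    # The response body is a stream that must be read and decoded.
    return response["Body"].read().decode("utf-8")


class FakeRuntimeClient:
    """Hypothetical local stand-in for the SageMaker Runtime client."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        self.last_request = (EndpointName, ContentType, Body)
        return {"Body": io.BytesIO(b"0.87")}


fake = FakeRuntimeClient()
result = get_prediction(fake, "demo-endpoint", [1, 2.5])
print(result)
```

Against a real endpoint, the same function would be given `boto3.client("sagemaker-runtime")` in place of the stand-in, and the content type would need to match what the deployed model expects.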

94. An insurance company is developing a new device for vehicles that uses a camera to observe drivers' behavior and alert them when they appear distracted. The company created approximately 10,000 training images in a controlled environment that a Machine Learning Specialist will use to train and evaluate machine learning models. During the model evaluation the Specialist notices that the training error rate diminishes faster as the number of epochs increases and the model is not accurately inferring on the unseen test images. Which of the following should be used to resolve this issue? (Select TWO)

Add vanishing gradient to the model.

Perform data augmentation on the training data.

Make the neural network architecture complex.

Use gradient checking in the model.

Add L2 regularization to the model.

Explanation

Performing data augmentation on the training data can help resolve the issue of the model not accurately inferring on unseen test images. Data augmentation involves creating new training data by applying various transformations to the existing training images, such as rotating, scaling, or flipping them. This can help the model generalize better and improve its performance on unseen data.

Adding L2 regularization to the model can also be helpful in resolving this issue. L2 regularization adds a penalty term to the loss function during training, which discourages the model from overfitting the training data. By reducing overfitting, the model is more likely to perform well on unseen test images.

95. You work for a retail firm that wishes to conduct a direct mail campaign to attract new customers. Your marketing manager wishes to get answers to questions that can be put into discrete categories, such as "using historical customer email campaign responses, should this customer receive an email from our current campaign?" You decide to use the SageMaker Linear Learner algorithm to build your model. Which hyperparameter setting would you use to get the algorithm to produce discrete results?

Set the objective hyperparameter to reg:logistic.

Set the predictor_type hyperparameter to regressor.

Set the objective hyperparameter to reg:linear.

Set the predictor_type hyperparameter to binary_classifier.

Explanation

The correct answer is to set the predictor_type hyperparameter to binary_classifier. This is because a binary classifier is designed to produce discrete results by classifying data into two categories, such as yes or no, true or false. In this case, the goal is to determine whether a customer should receive an email from the current campaign or not, which is a binary classification problem. By setting the predictor_type hyperparameter to binary_classifier, the SageMaker Linear Learner algorithm will be trained to produce discrete results for this specific task.

96. For the given confusion matrix, what is the recall and precision of the model?

Recall = 0.92 Precision = 0.84

Recall = 0.84 Precision = 0.8

Recall = 0.92 Precision = 0.8

Recall = 0.8 Precision = 0.92

Explanation

The recall of a model is the ratio of the true positive predictions to the sum of true positive and false negative predictions. In this case, the recall is 0.92, which means that the model has a high ability to correctly identify positive instances.

The precision of a model is the ratio of the true positive predictions to the sum of true positive and false positive predictions. In this case, the precision is 0.8, which means that the model has a moderate ability to avoid false positive predictions.

Therefore, the correct answer is Recall = 0.92 Precision = 0.8.
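The two ratios defined above can be written as short functions. The counts below are hypothetical values chosen only to illustrate the arithmetic; they are not taken from the confusion matrix in the question.

```python
def recall(tp, fn):
    """True positives divided by all actual positives."""
    return tp / (tp + fn)


def precision(tp, fp):
    """True positives divided by all predicted positives."""
    return tp / (tp + fp)


# Hypothetical counts for illustration only (not the question's matrix).
tp, fn, fp = 92, 8, 23

r = recall(tp, fn)      # 92 / (92 + 8)
p = precision(tp, fp)   # 92 / (92 + 23)

print(round(r, 2), round(p, 2))
```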

97. A Machine Learning Specialist is using Amazon SageMaker to host a model for a highly available customer-facing application. The Specialist has trained a new version of the model, validated it with historical data, and now wants to deploy it to production. To limit any risk of a negative customer experience, the Specialist wants to be able to monitor the model and roll it back, if needed. What is the SIMPLEST approach with the LEAST risk to deploy the model and roll it back, if needed?

Explanation

The simplest approach with the least risk to deploy the model and roll it back, if needed, is to update the existing SageMaker endpoint to use a new configuration that is weighted to send 5% of the traffic to the new variant. This approach allows for gradual testing and monitoring of the new model's performance with a small percentage of traffic, minimizing the impact on customers. If the model does not perform as expected, traffic can be reverted to the last version by resetting the weights.
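The weighting can be illustrated with a small calculation: each variant receives traffic in proportion to its weight divided by the sum of all weights. The variant and model names below are hypothetical, and the variant dictionaries omit the instance settings that a real endpoint configuration would also need.

```python
def traffic_shares(variants):
    """Return each variant's share of traffic as weight / sum of weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Hypothetical variants: 95% of traffic stays on the current model.
canary = [
    {"VariantName": "current-model", "ModelName": "model-v1", "InitialVariantWeight": 95},
    {"VariantName": "new-model", "ModelName": "model-v2", "InitialVariantWeight": 5},
]

# Rolling back means resetting the weights so the new variant gets no traffic.
rolled_back = [
    {"VariantName": "current-model", "ModelName": "model-v1", "InitialVariantWeight": 100},
    {"VariantName": "new-model", "ModelName": "model-v2", "InitialVariantWeight": 0},
]

canary_shares = traffic_shares(canary)
rollback_shares = traffic_shares(rolled_back)

print(canary_shares, rollback_shares)
```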

98. An interactive online dictionary wants to add a widget that displays words used in similar contexts. A Machine Learning Specialist is asked to provide word features for the downstream nearest neighbor model powering the widget. What should the Specialist do to meet these requirements?

Create one-hot word encoding vectors.

Produce a set of synonyms for every word using Amazon Mechanical Turk.

Create word embedding factors that store edit distance with every other word.

Download word embeddings pre-trained on a large corpus.

Explanation

The Specialist should download word embeddings pre-trained on a large corpus to meet the requirements. Word embeddings are dense vector representations of words that capture semantic relationships between words. By using pre-trained word embeddings, which have been trained on a large corpus of text data, the Specialist can leverage the learned semantic relationships to provide word features for the downstream nearest neighbor model. This will enable the widget to display words used in similar contexts based on the semantic similarities captured by the word embeddings.

99. A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences: 1. Please call the number below. 2. Please do not call us. What are the dimensions of the tf–idf matrix?

(8, 10)

(2, 8)

(2, 10)

(2, 16)

Explanation

There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result would be (2, 16). The phrases are “Please call the number below” and “Please do not call us.” Each word individually (unigram) is “Please,” “call,” “the,” “number,” “below,” “do,” “not,” and “us.” The unique bigrams are “Please call,” “call the,” “the number,” “number below,” “Please do,” “do not,” “not call,” and “call us.”
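The count can be checked with a few lines of plain Python. This sketch assumes lowercasing, stripping the final period, and whitespace tokenization; the helper names `tokenize` and `ngrams` are hypothetical.

```python
def tokenize(sentence):
    """Lowercase, drop periods, and split on whitespace."""
    return sentence.lower().replace(".", "").split()


def ngrams(tokens, n):
    """Return all contiguous n-word sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


corpus = ["Please call the number below.", "Please do not call us."]

unigrams, bigrams = set(), set()
for sentence in corpus:
    tokens = tokenize(sentence)
    unigrams.update(ngrams(tokens, 1))
    bigrams.update(ngrams(tokens, 2))

# One row per sentence, one column per distinct unigram or bigram.
shape = (len(corpus), len(unigrams) + len(bigrams))

print(len(unigrams), len(bigrams), shape)
```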

100. A data engineer needs to create a cost-effective data pipeline solution that ingests unstructured data from various sources and stores it for downstream analytics applications and ML. The solution should include a data store where the processed data is highly available for at least one year so that data analysts and data scientists can run analytics and ML workloads on the most recent data. For compliance reasons, the solution should include both processed and raw data. The raw data does not need to be accessed regularly, but when needed, should be accessible within 24 hours. What solution should the data engineer deploy?

Use Amazon S3 Standard for all raw data. Use Amazon S3 Glacier Deep Archive for all processed data.

Explanation

The data engineer should deploy Amazon S3 Standard for the processed data that is within one year of processing. After one year, they should use Amazon S3 Glacier for the processed data. Additionally, they should use Amazon S3 Glacier Deep Archive for all raw data. This solution ensures that the processed data is highly available for at least one year, allowing data analysts and data scientists to run analytics and ML workloads on the most recent data. The use of Amazon S3 Glacier Deep Archive for raw data ensures compliance and accessibility within 24 hours when needed.
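One way to express this tiering is an S3 lifecycle configuration, sketched below as a Python dictionary in the shape accepted by Boto3's `put_bucket_lifecycle_configuration`. The `processed/` and `raw/` prefixes, the rule IDs, and the one-day delay for raw data are assumptions for illustration. The helper `storage_class_after` is a hypothetical, simplified reading of the rules that ignores S3's exact transition timing.

```python
# Assumed layout: processed data under "processed/", raw data under "raw/".
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "processed-to-glacier-after-one-year",
            "Filter": {"Prefix": "processed/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
        },
        {
            "ID": "raw-to-deep-archive",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 1, "StorageClass": "DEEP_ARCHIVE"}],
        },
    ]
}


def storage_class_after(config, prefix, age_days):
    """Simplified lookup of the storage class for an object of a given age."""
    current = "STANDARD"
    for rule in config["Rules"]:
        if rule["Status"] != "Enabled" or rule["Filter"]["Prefix"] != prefix:
            continue
        for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
            if age_days >= transition["Days"]:
                current = transition["StorageClass"]
    return current


print(storage_class_after(lifecycle_configuration, "processed/", 30))
print(storage_class_after(lifecycle_configuration, "processed/", 400))
print(storage_class_after(lifecycle_configuration, "raw/", 2))
```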

101. A Data Scientist at a retail company is using Amazon SageMaker to classify social media posts that mention the company into one of two categories: posts that require a response from the company, and posts that do not. The Data Scientist is using a training dataset of 10,000 posts, which contains the timestamp, author, and full text of each post. However, the Data Scientist is missing the target labels that are required for training. Which approach can the Data Scientist take to create valid target label data? (Select TWO.)

Ask the social media handling team to review each post using Amazon SageMaker Ground Truth and provide the label

Use the sentiment analysis natural language processing library to determine whether a post requires a response

Use Amazon Mechanical Turk to publish Human Intelligence Tasks that ask Turk workers to label the posts

Use the a priori probability distribution of the two classes. Then, use Monte-Carlo simulation to generate the labels

Use K-Means to cluster posts into various groups, and pick the most frequent word in each group as its label

Explanation

The Data Scientist can ask the social media handling team to review each post using Amazon SageMaker Ground Truth and provide the label. This approach involves manual review and labeling of each post by the team, ensuring accurate target labels for training. Additionally, the Data Scientist can use Amazon Mechanical Turk to publish Human Intelligence Tasks that ask Turk workers to label the posts. This crowdsourcing approach draws on a larger pool of workers to label the posts, increasing efficiency and scalability in generating valid target label data.

102. A Data Scientist is working on an application that performs sentiment analysis. The validation accuracy is poor and the Data Scientist thinks that the cause may be a rich vocabulary and a low average frequency of words in the dataset. Which tool should be used to improve the validation accuracy?

Amazon Comprehend syntax analysis and entity detection

Amazon SageMaker BlazingText cbow mode

Natural Language Toolkit (NLTK) stemming and stop word removal

Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizers

103. A Machine Learning Specialist working for an online fashion company wants to build a data ingestion solution for the company's Amazon S3-based data lake. The Specialist wants to create a set of ingestion mechanisms that will enable future capabilities comprised of:

Real-time analytics
Interactive analytics of historical data
Clickstream analytics
Product recommendations

Which services should the Specialist use?

The Specialist should use AWS Glue as the data catalog to manage the metadata of the data lake. They should use Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for real-time data insights, allowing them to process and analyze streaming data in real-time. They should also use Amazon Kinesis Data Firehose to deliver the clickstream data to Amazon ES for clickstream analytics. Lastly, they should use Amazon EMR to generate personalized product recommendations by processing and analyzing the data in the data lake.

104. A Machine Learning Specialist observes several performance problems with the training portion of a machine learning solution on Amazon SageMaker. The solution uses a large training dataset 2 TB in size and is using the SageMaker k-means algorithm. The observed issues include the unacceptable length of time it takes before the training job launches and poor I/O throughput while training the model. What should the Specialist do to address the performance issues with the current solution?

Use the SageMaker batch transform feature.

Compress the training data into Apache Parquet format.

Ensure that the input mode for the training job is set to Pipe.

Copy the training dataset to an Amazon EFS volume mounted on the SageMaker instance.

Setting the input mode for the training job to Pipe streams the data directly from Amazon S3 to the training instances instead of downloading the full 2 TB dataset before training starts, which can significantly reduce the time it takes for the training job to launch. Streaming also tends to improve I/O throughput during training, which addresses the second observed issue.
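
As a rough illustration, the sketch below builds a plain dictionary shaped like part of a SageMaker CreateTrainingJob request, where the input mode is selected through `AlgorithmSpecification.TrainingInputMode`. It is deliberately partial (resource and stopping settings are omitted), it does not call AWS, and the job name, image URI, role ARN, and S3 paths are hypothetical placeholders.

```python
def build_training_job_request(job_name, image_uri, role_arn, train_s3_uri, output_s3_uri):
    """Return a partial CreateTrainingJob-style request that uses Pipe input mode."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            # "Pipe" streams data from S3; "File" downloads it before training.
            "TrainingInputMode": "Pipe",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
    }


# All argument values below are made-up placeholders.
request = build_training_job_request(
    job_name="kmeans-pipe-example",
    image_uri="<kmeans-training-image-uri>",
    role_arn="<execution-role-arn>",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
)
```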

105. An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code. How should the company set up this new pipeline?

Use Amazon SageMaker built-in algorithms to run their training and inference jobs.

The company should use the Build Your Own Container (BYOC) Amazon SageMaker option. This option allows the company to create a new Docker container with the existing code and register it in Amazon Elastic Container Registry (ECR). By using this option, the company can port their training and inference code written in R to Amazon SageMaker with minimal changes. They can then run the training and inference jobs using this container, ensuring a seamless transition to the new ML pipeline.

106. A company is using Amazon Polly to translate plaintext documents to speech for automated company announcements. However, company acronyms are being mispronounced in the current documents. How should a Machine Learning Specialist address this issue for future documents?

Convert current documents to SSML with pronunciation tags.

Create an appropriate pronunciation lexicon.

Output speech marks to guide in pronunciation.

Use Amazon Lex to preprocess the text files for pronunciation.

To address the issue of mispronounced company acronyms in future documents, a Machine Learning Specialist should create an appropriate pronunciation lexicon. This lexicon will contain the correct pronunciation of the acronyms, allowing Amazon Polly to accurately pronounce them in the speech output. By creating a pronunciation lexicon, the specialist can ensure that the acronyms are pronounced correctly in all future automated company announcements.
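
Amazon Polly lexicons are written in the W3C Pronunciation Lexicon Specification (PLS) XML format. The sketch below generates such a file from a dictionary of acronyms and their spoken expansions; the acronyms and the helper name `build_lexicon` are hypothetical, and uploading the result to Polly is not shown.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases, lang="en-US"):
    """Build a PLS lexicon that maps each acronym (grapheme) to a spoken alias."""
    lexemes = "\n".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(acronym)}</grapheme>\n"
        f"    <alias>{escape(spoken)}</alias>\n"
        "  </lexeme>"
        for acronym, spoken in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NS}"\n'
        f'      alphabet="ipa" xml:lang="{lang}">\n'
        f"{lexemes}\n"
        "</lexicon>\n"
    )


# Hypothetical company acronyms.
lexicon_xml = build_lexicon({
    "ETA": "estimated time of arrival",
    "R&D": "research and development",
})

# Sanity check: the generated document parses as well-formed XML.
root = ET.fromstring(lexicon_xml.encode("utf-8"))
graphemes = [g.text for g in root.iter(f"{{{PLS_NS}}}grapheme")]
```

Because the lexicon is stored separately from the documents, future plaintext documents need no per-document markup.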

107. A financial organization uses multiple ML models to detect irregular patterns in its data to combat fraudulent activity such as money laundering. They use a TensorFlow-based Docker container on GPU-enabled Amazon EC2 instances to concurrently train the multiple models for this workload. However, they want to automate the batch data preprocessing and ML training aspects of this pipeline, scheduling them to take place automatically every 24 hours. What AWS service can they use to do this?

AWS Glue

AWS Batch

Amazon EMR

Kinesis Data Analytics

The financial organization can use AWS Batch to automate the batch data preprocessing and ML training aspects of their pipeline. AWS Batch is a fully managed service that allows users to run batch computing workloads on the AWS Cloud. It takes care of the underlying infrastructure and automatically scales resources to meet the demands of the workload. With AWS Batch, the organization can schedule their data preprocessing and ML training tasks to run automatically every 24 hours, ensuring a streamlined and automated process.

108. A bank's Machine Learning team is developing an approach for credit card fraud detection. The company has a large dataset of historical data labeled as fraudulent. The goal is to build a model to take the information from new transactions and predict whether each transaction is fraudulent or not. Which built-in Amazon SageMaker machine learning algorithm should be used for modeling this problem?

Seq2seq

XGBoost

K-means

Random Cut Forest (RCF)

XGBoost is a popular machine learning algorithm that is commonly used for classification problems, such as fraud detection. It is known for its ability to handle large datasets and its effectiveness in capturing complex relationships between variables. Therefore, it would be a suitable choice for the bank's Machine Learning team to use XGBoost for modeling the credit card fraud detection problem.

109. A Machine Learning Engineer wants to use Amazon SageMaker and the built-in XGBoost algorithm for model training. The training data is currently stored in CSV format, with the first 10 columns representing features and the 11th column representing the target label. What should the ML Engineer do to prepare the data for use in an Amazon SageMaker training job?

The data should be split into training, validation, and test sets. The datasets should then be uploaded to Amazon S3.

To prepare the data for use in an Amazon SageMaker training job, the ML Engineer should first move the target label to the first column, because the built-in XGBoost algorithm expects the label in the first column of CSV input. Then, the data should be split into training, validation, and test sets. Finally, the datasets should be uploaded to Amazon S3.
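
A minimal sketch of the first two steps, using only the standard library. The synthetic rows, the 70/20/10 split, and the helper names are assumptions for illustration; the upload to Amazon S3 is not shown.

```python
import csv
import io
import random


def label_first(rows, label_index=10):
    """Move the label column (the 11th column, index 10) to the front of each row."""
    return [[row[label_index]] + row[:label_index] + row[label_index + 1:] for row in rows]


def split_rows(rows, train_pct=70, validation_pct=20, seed=0):
    """Shuffle the rows and split them into train, validation, and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * validation_pct // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Synthetic data: 10 feature columns followed by a binary label column.
rows = [[f"{i}.{j}" for j in range(10)] + [str(i % 2)] for i in range(20)]
reordered = label_first(rows)
train_rows, val_rows, test_rows = split_rows(reordered)

# Write the training split as CSV without a header row.
buffer = io.StringIO()
csv.writer(buffer).writerows(train_rows)
train_csv = buffer.getvalue()
```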

110. A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process?

Increase the learning rate. Keep the batch size the same.

Reduce the batch size. Decrease the learning rate.

Keep the batch size the same. Decrease the learning rate.

Do not change the learning rate. Increase the batch size.

It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size adds noise to each gradient step, which helps the training stochastically escape local minima and saddle points. Decreasing the learning rate helps prevent overshooting the global minimum of the loss function.

111. A company is running an Amazon SageMaker training job that will access data stored in its Amazon S3 bucket. A compliance policy requires that the data never be transmitted across the internet. How should the company set up the job?

Launch the notebook instances in a public subnet and access the data through the public S3 endpoint.

Launch the notebook instances in a private subnet and access the data through a NAT gateway.

Launch the notebook instances in a public subnet and access the data through a NAT gateway.

Launch the notebook instances in a private subnet and access the data through an S3 VPC endpoint.

The company should launch the notebook instances in a private subnet and access the data through an S3 VPC endpoint. This setup ensures that the data is not transmitted across the internet, as required by the compliance policy. By using a private subnet, the instances are not accessible from the public internet. The S3 VPC endpoint allows the instances to securely access the S3 bucket within the VPC, without the need for internet connectivity. This ensures that the data remains within the company's network and complies with the compliance policy.

112. A Data Scientist wants to tune the hyperparameters of a machine learning model to improve the model's F1 score. Which techniques can be used to achieve this desired outcome on Amazon SageMaker? (Select TWO)

Grid Search

Random Search

Breadth First Search

Bayesian optimization

Depth first search

Random Search and Bayesian optimization are two techniques that can be used to tune the hyperparameters of a machine learning model on Amazon SageMaker to improve the model's F1 score. Random Search involves randomly selecting combinations of hyperparameters from a predefined search space and evaluating their performance. Bayesian optimization, on the other hand, uses a probabilistic model to find the optimal set of hyperparameters by iteratively exploring the search space based on previous evaluations. Both techniques can help identify the best hyperparameter values that maximize the F1 score.
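
Random search is simple enough to sketch in a few lines. The example below is a generic illustration, not SageMaker's implementation: `toy_f1` is a hypothetical stand-in for "train a model and return its validation F1 score", and the hyperparameter names and ranges are made up.

```python
import random


def random_search(objective, search_space, n_trials=50, seed=0):
    """Sample hyperparameters uniformly at random and keep the best-scoring set."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in search_space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_f1(params):
    # Hypothetical objective that peaks at learning_rate=0.1, subsample=0.8.
    return 1.0 - (params["learning_rate"] - 0.1) ** 2 - (params["subsample"] - 0.8) ** 2


space = {"learning_rate": (0.01, 0.3), "subsample": (0.5, 1.0)}
best_params, best_score = random_search(toy_f1, space)
```

Bayesian optimization differs in that each new trial is chosen using a model fitted to the results of earlier trials, rather than being drawn independently.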

113. A Data Scientist wants to use the Amazon SageMaker hyperparameter tuning job to automatically tune a random forest model. What API does the Amazon SageMaker SDK use to create and interact with the Amazon SageMaker hyperparameter tuning jobs?

HyperparameterTunerJob()

HyperparameterTuner()

HyperparameterTuningJobs()

Hyperparameter()

The Amazon SageMaker SDK uses the HyperparameterTuner() API to create and interact with the Amazon SageMaker hyperparameter tuning jobs. This API allows the data scientist to automate the tuning process for their random forest model, optimizing the hyperparameters to improve the model's performance.

114. A Machine Learning Specialist is developing a custom video recommendation model for an application. The dataset used to train this model is very large with millions of data points and is hosted in an Amazon S3 bucket. The Specialist wants to avoid loading all of this data onto an Amazon SageMaker notebook instance because it would take hours to move and will exceed the attached 5 GB Amazon EBS volume on the notebook instance. Which approach allows the Specialist to use all the data to train the model?

To avoid loading the entire large dataset onto the limited storage of the SageMaker notebook instance, the Machine Learning Specialist should load a smaller subset of the data into the notebook and train locally. This allows them to confirm that the training code is executing correctly and the model parameters are reasonable. Once this is verified, they can initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode. This approach allows the Specialist to use all the data for training without exceeding the storage limitations of the notebook instance.

115. A Machine Learning Specialist is developing a daily ETL workflow containing multiple ETL jobs. The workflow consists of the following processes:

Start the workflow as soon as data is uploaded to Amazon S3
When all the datasets are available in Amazon S3, start an ETL job to join the uploaded datasets with multiple terabyte-sized datasets already stored in Amazon S3
Store the results of joining datasets in Amazon S3
If one of the jobs fails, send a notification to the Administrator.

Which configuration will meet these requirements?

The correct answer is to use AWS Lambda to trigger an AWS Step Functions workflow to wait for dataset uploads to complete in Amazon S3. This configuration meets the requirements as it ensures that the workflow starts as soon as data is uploaded to S3. AWS Glue is used to join the datasets, and the results are stored in S3. An Amazon CloudWatch alarm is set up to send an SNS notification to the Administrator in case of a failure. This configuration provides an efficient and scalable solution for the daily ETL workflow.

116. A company is using its genomic data to classify how different human DNA affects cell growth, so that they can predict a person's chances of getting cancer. Before creating and preparing the training and validation datasets for the model, the company wants to reduce the high dimensionality of the data. What technique should the company use to achieve this goal? (Select TWO.)

Use seaborn distribution plot (distplot) to visualize the correlated data. Remove the unrelated features.

Use t-SNE to reduce the dimensionality of the data. Visualize the data using matplotlib.

Use Principal Component Analysis (PCA) to reduce the dimensionality of the data. Visualize the data using matplotlib.

Calculate the eigenvectors. Use a scatter matrix to choose the best features.

Use L2 regularization to reduce the features used in the data. Visualize the data using matplotlib.

The company should use t-SNE and PCA to reduce the dimensionality of the data. t-SNE is a technique used for visualizing high-dimensional data by reducing it to a lower-dimensional space while preserving the local structure of the data. It is particularly useful for visualizing clusters or patterns in the data. PCA, on the other hand, is a technique used to reduce the dimensionality of data by finding a new set of variables (principal components) that capture the most variance in the original data. By visualizing the data using matplotlib, the company can gain insights into the distribution and relationships between variables.
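
To make the PCA idea concrete, the sketch below finds the first principal component of a tiny made-up dataset and projects three-dimensional points onto it. It uses power iteration on the covariance matrix as a simple stand-in; library implementations typically use an eigendecomposition or SVD instead. The data values are hypothetical.

```python
import math


def first_principal_component(points, iterations=100):
    """Return (column means, unit vector of greatest variance) via power iteration."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    vector = [1.0] * dims
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    return means, vector


def project(points, means, component):
    """Reduce each point to a single coordinate along the given component."""
    return [sum((p[d] - means[d]) * component[d] for d in range(len(component)))
            for p in points]


# Hypothetical data: the second feature is roughly twice the first, the third is noise.
points = [(1.0, 2.1, 0.5), (2.0, 3.9, 0.4), (3.0, 6.2, 0.6), (4.0, 7.8, 0.5), (5.0, 10.1, 0.4)]
means, pc1 = first_principal_component(points)
reduced = project(points, means, pc1)
```

Here three features collapse to one coordinate per point while keeping most of the variance, because the first two features are strongly correlated.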

117. A Machine Learning Specialist needs to move and transform data in preparation for training. Some of the data needs to be processed in near-real time and other data can be moved hourly. There are existing Amazon EMR MapReduce jobs to clean the data and perform feature engineering on it. Which of the following services can feed data to the MapReduce jobs? (Select TWO)

AWS DMS

Amazon Kinesis

AWS Data Pipeline

Amazon Athena

Amazon ES

Amazon Kinesis and AWS Data Pipeline are both services that can feed data to MapReduce jobs. Amazon Kinesis is a real-time streaming service that can continuously collect and process large streams of data in near-real time. AWS Data Pipeline is a web service for orchestrating and automating the movement and transformation of data between different AWS services and on-premises data sources. Both of these services can be used to feed data to the MapReduce jobs for processing and analysis.

118. A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria:

1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs

After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?

TN = 91, FP = 9, FN = 22, TP = 78

TN = 99, FP = 1, FN = 21, TP = 79

TN = 96, FP = 4, FN = 10, TP = 90

TN = 98, FP = 2, FN = 18, TP = 82

The following calculations are required:

TP = True Positive
FP = False Positive
FN = False Negative
TN = True Negative

Recall = TP / (TP + FN)

False Positive Rate (FPR) = FP / (FP + TN)

Cost = 5 * FP + FN

Only the third and fourth options (C and D) have a recall of at least 80% and an FPR of 10% or less; the first two fall short on recall (78% and 79%). Of the two, D is the most cost effective: 5 * 2 + 18 = 28, versus 5 * 4 + 10 = 30 for C.
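
The arithmetic can be checked with a short script. The letters A to D are assumed to follow the order in which the confusion matrices are listed in the question.

```python
# Confusion matrices from the question, labeled in listed order (an assumption).
matrices = {
    "A": {"TN": 91, "FP": 9, "FN": 22, "TP": 78},
    "B": {"TN": 99, "FP": 1, "FN": 21, "TP": 79},
    "C": {"TN": 96, "FP": 4, "FN": 10, "TP": 90},
    "D": {"TN": 98, "FP": 2, "FN": 18, "TP": 82},
}


def evaluate(m):
    """Return (recall, false positive rate, business cost) for one matrix."""
    recall = m["TP"] / (m["TP"] + m["FN"])
    fpr = m["FP"] / (m["FP"] + m["TN"])
    cost = 5 * m["FP"] + m["FN"]  # a false positive costs 5x a false negative
    return recall, fpr, cost


# Keep only the models that meet the recall and FPR thresholds.
qualifying = {}
for name, matrix in matrices.items():
    recall, fpr, cost = evaluate(matrix)
    if recall >= 0.80 and fpr <= 0.10:
        qualifying[name] = cost

# Among the qualifying models, pick the one with the lowest cost.
best = min(qualifying, key=qualifying.get)
```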

119. An ML Engineer at a real estate startup wants to use a new quantitative feature for an existing ML model that predicts housing prices. Before adding the feature to the cleaned dataset, the Engineer wants to visualize the feature in order to check for outliers and overall distribution and skewness of the feature. What visualization technique should the ML Engineer use? (Select TWO.)

Box Plot

Histogram

Scatterplot

Heatmap

T-SNE

The ML Engineer should use a box plot to visualize the feature in order to check for outliers. A box plot provides information about the minimum, first quartile, median, third quartile, and maximum values of a dataset, allowing the Engineer to identify any extreme values that may be considered outliers. Additionally, the Engineer should use a histogram to visualize the overall distribution and skewness of the feature. A histogram provides a visual representation of the frequency distribution of the feature, allowing the Engineer to observe the shape of the distribution and identify any skewness.
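
The quantities behind both plots can be computed without any plotting library. The sketch below derives the five-number summary that a box plot draws, flags outliers with the conventional 1.5 × IQR whisker rule, and bins values the way a histogram does. The house sizes are made-up sample data, and the helper names are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def histogram(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical feature values, for example house sizes in square meters.
sizes = [70, 75, 80, 82, 85, 88, 90, 95, 100, 105, 110, 400]
outliers = iqr_outliers(sizes)
bins = histogram(sizes, bin_width=50)
# A mean well above the median suggests a right-skewed distribution.
right_skewed = statistics.mean(sizes) > statistics.median(sizes)
```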

120. A Machine Learning Specialist has created a deep learning neural network model that performs well on the training data but performs poorly on the test data. Which of the following methods should the Specialist consider using to correct this? (Select THREE)

Decrease regularization

Increase regularization

Increase dropout

Decrease dropout

Increase feature combinations

Decrease feature combinations

The Specialist should consider increasing regularization to prevent overfitting and improve generalization on the test data. Increasing dropout can also help reduce overfitting by randomly dropping out neurons during training. Decreasing feature combinations can help simplify the model and reduce the risk of overfitting.
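
Two of these remedies can be sketched in plain Python. The example shows inverted dropout (surviving activations are rescaled so their expected value is unchanged) and an L2 penalty term added to the loss; the function names and numbers are illustrative only, and deep learning frameworks provide their own implementations.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate` (0 <= rate < 1) during training."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Scale the survivors by 1 / keep so the expected activation stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def l2_penalty(weights, strength):
    """Regularization term added to the loss: strength * sum of squared weights."""
    return strength * sum(w * w for w in weights)


activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, rate=0.5, rng=random.Random(0))
unchanged = dropout(activations, rate=0.5, training=False)
penalty = l2_penalty([3.0, 4.0], strength=0.01)
```

Raising `rate` or `strength` constrains the model more strongly, which is the direction to move when training accuracy is high but test accuracy is poor.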

121. A Machine Learning Specialist is using an Amazon SageMaker notebook instance in a private subnet of a corporate VPC. The ML Specialist has important data stored on the Amazon SageMaker notebook instance's Amazon EBS volume, and needs to take a snapshot of that EBS volume. However, the ML Specialist cannot find the Amazon SageMaker notebook instance's EBS volume or Amazon EC2 instance within the VPC. Why is the instance not visible to the ML Specialist in the VPC?

Amazon SageMaker notebook instances are based on the EC2 instances within the customer account but they run outside of VPCs.

Amazon SageMaker notebook instances are based on the Amazon ECS service within customer accounts.

Amazon SageMaker notebook instances are based on EC2 instances running within AWS service accounts.

Amazon SageMaker notebook instances are based on AWS ECS instances running within AWS service accounts.

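Amazon SageMaker notebook instances are based on EC2 instances running within AWS service accounts, so neither the underlying EC2 instance nor its EBS volume appears among the resources in the customer's own account and VPC.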

122. A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features?

Drop the test samples with missing full review text fields, and then run through the test set.

Use an algorithm that handles missing data better than decision trees.

Generate synthetic data to fill in the fields that are missing data, and then run through the test set.

In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field.

123. A financial planning company is using the Amazon SageMaker endpoint with an Auto Scaling policy to serve its forecasting model to the company's customers to help them plan for retirement. The team wants to update the endpoint with its latest forecasting model, which has been trained using Amazon SageMaker training jobs. The team wants to do this without any downtime and with minimal change to the code. What steps should the team take to update this endpoint?

Use a new endpoint configuration with the latest model Amazon S3 path in the UpdateEndpoint API.

Create a new endpoint using a new configuration with the latest model. Then, register the endpoint as a scalable target.

To update the endpoint without any downtime and minimal code changes, the team should first de-register the endpoint as a scalable target. Then, they should update the endpoint using a new endpoint configuration that includes the latest model Amazon S3 path. Finally, they should register the endpoint as a scalable target again. This process ensures that the endpoint is temporarily taken out of the scaling policy, allowing the team to update it with the latest model while minimizing disruption to the customers.

124. A ride-share company wants to create intelligent conversational chatbots that will serve as first responders to customers who call to report an issue with their ride. The company wants these chatbot-customer calls to mimic natural conversations that provide personalized experiences for the customers. What combination of AWS services can the company use to create this workflow without a lot of ongoing management?

Amazon Lex to parse the utterances and intent of customer comments, Amazon Polly to reply to the customers

Amazon Polly to parse the utterances and intent of customer comments, Amazon Lex to reply to the customers

Amazon Transcribe to parse the utterances and intent of customer comments, Amazon Lex to reply to the customers

Amazon Transcribe to parse the utterances and intent of customer comments, Amazon Polly to reply to the customers

The company can use Amazon Lex to parse the utterances and intent of customer comments, and Amazon Polly to generate personalized replies to the customers. Amazon Lex is a service for building conversational interfaces using voice and text, which can understand and interpret customer inputs. Amazon Polly is a service that turns text into lifelike speech, allowing the chatbot to respond to customers in a natural and personalized manner. This combination of services allows the company to create intelligent chatbots that can provide personalized experiences without requiring a lot of ongoing management.

125. What factors led to the wide adoption of neural networks in the last decade? (Select THREE.)

Efficient algorithms

Cheaper GPUs

An orders of magnitude increase in data collected

Cheaper CPUs

Wide adoption of cloud-based services

Explanation

Neural networks have been widely adopted in the last decade due to several factors. First, more efficient algorithms have made neural networks easier to train and optimize. Second, cheaper GPUs (graphics processing units) have made the heavily parallel computation behind training and inference far more affordable. Third, the amount of data collected has grown by orders of magnitude, allowing neural networks to learn from larger and more diverse datasets and to generalize better.

126. A Machine Learning Specialist is preparing data for training on Amazon SageMaker. The Specialist is using one of the SageMaker built-in algorithms for the training. The dataset is stored in .CSV format and is transformed into a numpy.array, which appears to be negatively affecting the speed of the training. What should the Specialist do to optimize the data for training on SageMaker?

Use the SageMaker batch transform feature to transform the training data into a DataFrame

Use AWS Glue to compress the data into the Apache Parquet format

Transform the dataset into the RecordIO-protobuf format

Use the SageMaker hyperparameter optimization feature to automatically optimize the data

Explanation

The Specialist should transform the dataset into the RecordIO-protobuf format. The SageMaker built-in algorithms are optimized for this format, and it allows the use of Pipe mode, in which training data is streamed from Amazon S3 instead of being copied to the training instance first. Both effects can improve the speed of training on SageMaker.

127. A Machine Learning Specialist is configuring automatic model tuning in Amazon SageMaker. When using the hyperparameter optimization feature, which of the following guidelines should be followed to improve optimization?

Choose the maximum number of hyperparameters supported by Amazon SageMaker to search the largest number of combinations possible.

Specify a very large hyperparameter range to allow Amazon SageMaker to cover every possible value.

Use log-scaled hyperparameters to allow the hyperparameter space to be searched as quickly as possible.

Execute only one hyperparameter tuning job at a time and improve tuning through successive rounds of experiments.

Explanation

Using log-scaled hyperparameters allows for a more efficient search of the hyperparameter space. When a hyperparameter such as a learning rate spans several orders of magnitude, a linear scale spends almost all of its samples in the top decade of the range. A logarithmic scale does not shrink the range; it spreads the search evenly across orders of magnitude, so small and large values are both explored and a good combination of hyperparameters is typically found in fewer training jobs.
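
The effect of the scale choice can be illustrated with a small sampling experiment. This is only a sketch in plain Python, not the SageMaker tuning API: the range 1e-4 to 1.0, the sample count, and the helper names are assumptions made for the illustration.

```python
import math
import random


def sample_linear(low, high, rng):
    """Draw one value uniformly on a linear scale."""
    return rng.uniform(low, high)


def sample_log(low, high, rng):
    """Draw one value uniformly on a logarithmic scale (requires low > 0)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def share_below(samples, threshold):
    """Fraction of samples that fall below the threshold."""
    return sum(s < threshold for s in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
low, high, n = 1e-4, 1.0, 10_000  # hypothetical learning-rate range

linear = [sample_linear(low, high, rng) for _ in range(n)]
logarithmic = [sample_log(low, high, rng) for _ in range(n)]

# In log terms, [1e-4, 1e-2) is the lower half of the range.
print("linear scale, share below 0.01:", share_below(linear, 1e-2))
print("log scale, share below 0.01:   ", share_below(logarithmic, 1e-2))
```

On a linear scale only about 1% of the samples land below 0.01, while on a log scale roughly half of them do, which is why small values are rarely tried unless the range is log-scaled.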

128. A company is running a machine learning prediction service that generates 100 TB of predictions every day. A Machine Learning Specialist must generate a visualization of the daily precision-recall curve from the predictions, and forward a read-only version to the Business team. Which solution requires the LEAST coding effort?

Generate daily precision-recall data in Amazon QuickSight, and publish the results in a dashboard shared with the Business team.

Generate daily precision-recall data in Amazon ES, and publish the results in a dashboard shared with the Business team.

Explanation

The solution that requires the least coding effort is to run a daily Amazon EMR workflow to generate precision-recall data and save the results in Amazon S3. Then, the arrays can be visualized in Amazon QuickSight and published in a dashboard shared with the Business team. This solution minimizes coding effort by utilizing existing services and tools provided by Amazon, rather than building a custom solution from scratch.

129. If you wanted to build your own Alexa-type device that converses with customers using speech, which Amazon services might you use?

Amazon Transcribe -> Amazon Comprehend -> Amazon Polly

Amazon Comprehend -> Amazon Lex -> Amazon Polly

Amazon Transcribe -> Amazon Lex -> Amazon Polly

Amazon Polly -> Amazon Lex -> Amazon Transcribe

Explanation

To build an Alexa-type device that converses with customers using speech, the appropriate Amazon services to use would be Amazon Transcribe, Amazon Lex, and Amazon Polly. Amazon Transcribe is used to convert speech to text, Amazon Lex is used for natural language understanding and conversation management, and Amazon Polly is used for text-to-speech conversion. This combination of services allows for the device to understand and respond to customer queries using speech.

130. A Data Scientist at a credit card company trained a classification model to predict fraud at the time of a transaction. The Data Scientist used a confusion matrix to evaluate the performance of the model. Using the confusion matrix below, determine the percent of positive records that were classified correctly.

Choose the answer that also labels this evaluation metric correctly.

80%; Recall

52.6%; Recall

80%; Precision

52.6%; Precision

Explanation

A confusion matrix summarizes the performance of a classification model using four counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The question asks for the percent of positive records that were classified correctly, which is the definition of recall: TP / (TP + FN). Precision, by contrast, is TP / (TP + FP), the share of predicted positives that are actually positive. Therefore, the correct answer is 80%; Recall.
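
The two metrics named in the answer options can be computed as follows. The confusion matrix from the question is not reproduced in this text, so the counts below are hypothetical values chosen only because they are consistent with the percentages in the options.

```python
def recall(tp, fn):
    """Share of actual positives that the model classified as positive."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


# Hypothetical counts, NOT taken from the original confusion matrix.
tp, fn, fp = 40, 10, 36

print(f"recall    = {recall(tp, fn):.1%}")     # 80.0%
print(f"precision = {precision(tp, fp):.1%}")  # 52.6%
```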

131. A Machine Learning team uses Amazon SageMaker to train an Apache MXNet handwritten digit classifier model using a research dataset. The team wants to receive a notification when the model is overfitting. Auditors want to view the Amazon SageMaker log activity report to ensure there are no unauthorized API calls. What should the Machine Learning team do to address the requirements with the least amount of code and fewest steps?

Explanation

The Machine Learning team should use AWS CloudTrail to log Amazon SageMaker API calls to Amazon S3. This will allow the auditors to view the log activity report and ensure there are no unauthorized API calls. Additionally, the team should add code to push a custom metric to Amazon CloudWatch to monitor the model's performance. Finally, they should create an alarm in CloudWatch with Amazon SNS to receive a notification when the model is overfitting. This solution addresses the requirements with the least amount of code and fewest steps.

132. In AWS SageMaker, which of the following statements accurately describes the role of a "SageMaker Endpoint" in the context of deploying machine learning models?

A SageMaker Endpoint is used to store training data during model training and is required for model deployment.

A SageMaker Endpoint allows for batch data processing and is used for offline predictions.

A SageMaker Endpoint is utilized for model hyperparameter tuning and does not handle actual predictions.

Explanation

A SageMaker Endpoint in AWS is used for real-time inference, enabling predictions on new data by deploying an already trained machine learning model. It is not used for storing training data, batch processing, or hyperparameter tuning. Training data storage and batch processing are handled by other services, while hyperparameter tuning is managed through Hyperparameter Tuning Jobs. The Endpoint focuses on providing real-time predictions once a model is trained.

133. A retail chain has been ingesting purchasing records from its network of 20,000 stores to Amazon S3 using Amazon Kinesis Data Firehose. To support training an improved machine learning model, training records will require new but simple transformations, and some attributes will be combined. The model needs to be retrained daily. Given the large number of stores and the legacy data ingestion, which change will require the LEAST amount of development effort?

Explanation

The option to insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL will require the least amount of development effort. This option leverages the existing data ingestion pipeline and adds a simple transformation step using SQL, which can be easily implemented. It eliminates the need for additional infrastructure setup like deploying EMR clusters or spinning up EC2 instances, making it a more efficient and cost-effective solution.

134. A Machine Learning Specialist uploads a dataset to an Amazon S3 bucket protected with server-side encryption using AWS KMS. How should the ML Specialist define the Amazon SageMaker notebook instance so it can read the same dataset from Amazon S3?

Define security group(s) to allow all HTTP inbound/outbound traffic and assign those security group(s) to the Amazon SageMaker notebook instance.

Assign the same KMS key used to encrypt data in Amazon S3 to the Amazon SageMaker notebook instance.

Explanation

https://docs.aws.amazon.com/kms/latest/developerguide/control-access-overview.html#managing-access

135. Example Corp has an annual sale event from October to December. The company has sequential sales data from the past 15 years and wants to use Amazon ML to predict the sales for this year's upcoming event. Which method should Example Corp use to split the data into a training dataset and evaluation dataset?

Pre-split the data before uploading to Amazon S3

Have Amazon ML split the data randomly.

Have Amazon ML split the data sequentially.

Perform custom cross-validation on the data

Explanation

Since Example Corp has sequential sales data from the past 15 years, it would be appropriate to use a sequential split for the training and evaluation datasets. This means that the data should be split in a way that maintains the order of the sales data. This will allow the model to learn from the past patterns and trends in the data, which is crucial for predicting future sales accurately.
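
A sequential split can be sketched in a few lines. The function name, the 70/30 ratio, and the year labels are assumptions for illustration; the point is only that every evaluation record comes after every training record.

```python
def sequential_split(records, train_fraction=0.7):
    """Split ordered records into an earlier training part and a later evaluation part."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


# Hypothetical stand-in for 15 years of ordered sales data.
years = list(range(2005, 2020))
train, evaluation = sequential_split(years)

print("train:     ", train)
print("evaluation:", evaluation)
```

A random split would mix later years into the training set, so the evaluation would no longer measure how well the model predicts sales it has not yet seen.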

136. A Data Scientist is using stochastic gradient descent (SGD) as the gradient optimizer to train a machine learning model. However, the model training error is taking longer to converge to the optimal solution than desired. What optimizer can the Data Scientist use to improve training performance? (Select THREE)

Adam

Adagrad

Gradient Descent

RMSProp

Mini-batch gradient descent

Xavier

Explanation

The Data Scientist can use Adam, Adagrad, and RMSProp as optimizers to improve training performance. These optimizers are known for their ability to converge faster to the optimal solution compared to stochastic gradient descent (SGD). Adam combines the advantages of both AdaGrad and RMSProp, making it a popular choice. Adagrad adapts the learning rate for each parameter individually, which can be beneficial in dealing with sparse data. RMSProp also adapts the learning rate, but it uses a moving average of squared gradients to adjust the learning rate.
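
The three update rules can be compared on a toy problem. The sketch below applies the textbook forms of Adagrad, RMSProp, and Adam to the one-dimensional objective f(x) = x**2; the objective, starting point, learning rates, and step counts are assumptions chosen for illustration, not tuned recommendations.

```python
import math


def grad(x):
    """Gradient of the toy objective f(x) = x**2."""
    return 2.0 * x


def run_adagrad(x, lr=0.5, steps=300, eps=1e-8):
    g_sum = 0.0  # running sum of all squared gradients
    for _ in range(steps):
        g = grad(x)
        g_sum += g * g
        x -= lr * g / (math.sqrt(g_sum) + eps)
    return x


def run_rmsprop(x, lr=0.05, decay=0.9, steps=300, eps=1e-8):
    v = 0.0  # moving average of squared gradients
    for _ in range(steps):
        g = grad(x)
        v = decay * v + (1.0 - decay) * g * g
        x -= lr * g / (math.sqrt(v) + eps)
    return x


def run_adam(x, lr=0.1, beta1=0.9, beta2=0.999, steps=500, eps=1e-8):
    m = v = 0.0  # moving averages of the gradient and the squared gradient
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the early steps
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


start = 5.0
for name, run in [("Adagrad", run_adagrad), ("RMSProp", run_rmsprop), ("Adam", run_adam)]:
    print(f"{name}: x moved from {start} to {run(start):.6f}")
```

All three divide the step by a running measure of gradient magnitude, which is the per-parameter learning-rate adaptation the explanation describes; Adam additionally keeps a moving average of the gradient itself.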

137. A Data Scientist needs to migrate an existing on-premises ETL process to the cloud. The current process runs at regular time intervals and uses PySpark to combine and format multiple large data sources into a single consolidated output for downstream processing. The Data Scientist has been given the following requirements for the cloud solution:

Combine multiple data sources.
Reuse existing PySpark logic.
Run the solution on the existing schedule.
Minimize the number of servers that will need to be managed.

Which architecture should the Data Scientist use to build this solution?

Explanation

The Data Scientist should use the architecture described in option 2. This option suggests writing the raw data to Amazon S3 and using AWS Glue ETL job to perform the ETL processing. By writing the ETL job in PySpark, the existing logic can be leveraged. A new AWS Glue trigger can be created to trigger the ETL job based on the existing schedule. The output target of the ETL job can be configured to write to a "processed" location in Amazon S3, which is accessible for downstream use. This architecture meets all the given requirements, including combining multiple data sources, reusing existing PySpark logic, running on the existing schedule, and minimizing the number of managed servers.

138. A Data Scientist working for an autonomous vehicle company is building an ML model to detect and label people and various objects (for instance, cars and traffic signs) that may be encountered on a street. The Data Scientist has a dataset made up of labeled images, which will be used to train their machine learning model. What kind of ML algorithm should be used?

Image classification

Instance segmentation

Image localization

Semantic segmentation

Explanation

Instance segmentation should be used in this scenario. Instance segmentation not only classifies objects in an image but also provides a pixel-level mask for each individual object. This is important in the context of autonomous vehicles as it allows for accurate detection and labeling of people and various objects on the street. Image classification would only classify the entire image, while image localization would only provide bounding boxes around objects. Semantic segmentation would classify pixels into different categories but would not differentiate between individual objects.

139. A Data Scientist wants to create a linear regression model to train on a housing dataset to predict home prices. As part of that process, the Data Scientist created a correlation matrix between the dataset's features and the target variable. The correlations between the target and two of the features, feature 3 and feature 7, are 0.64 and -0.85, respectively. Which feature has a stronger correlation with the target variable?

Feature 3

Feature 7

There is not sufficient data to determine which variable has a stronger correlation to the target

Feature 7 and feature 3 both have weak correlations to the target

Explanation

The strength of a correlation is determined by its absolute value. In this case, the correlation coefficient between feature 3 and the target variable is 0.64, and the correlation coefficient between feature 7 and the target variable is -0.85. The correlation coefficient of -0.85 is stronger than the correlation coefficient of 0.64 in terms of magnitude. Therefore, feature 7 has a stronger correlation with the target variable compared to feature 3.
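
The comparison by absolute value can be written out directly, using the two coefficients given in the question (the dictionary and variable names are illustrative):

```python
# Correlation of each feature with the target, as stated in the question.
correlations = {"feature 3": 0.64, "feature 7": -0.85}

# Strength is the magnitude of the coefficient; the sign only gives the direction.
strongest = max(correlations, key=lambda name: abs(correlations[name]))

print(strongest)  # feature 7
```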

140. A Machine Learning Specialist is working with multiple data sources containing billions of records that need to be joined. What feature engineering and model development approach should the Specialist take with a dataset this large?

Use an Amazon SageMaker notebook for both feature engineering and model development.

Use an Amazon SageMaker notebook for feature engineering and Amazon ML for model development.

Use Amazon EMR for feature engineering and Amazon SageMaker SDK for model development.

Use Amazon ML for both feature engineering and model development.

141. A Machine Learning Engineer is creating and preparing data for a linear regression model. However, while preparing the data, the Engineer notices that about 20% of the numerical data contains missing values in the same two columns. The shape of the data is 500 rows by 4 columns, including the target column. How could the Engineer handle the missing values in the data? (Select TWO.)

Remove the rows containing the missing values

Remove the columns containing the missing values

Fill the missing values with zeros

Impute the missing values using regression

Add regularization to the model

Explanation

The Engineer can handle the missing values in two ways. Firstly, they can fill the missing values with zeros, which means replacing the missing values with the value of zero. Secondly, they can impute the missing values using regression, which involves using the other available data to predict and fill in the missing values based on a regression model. These two approaches help to ensure that the missing values are accounted for and do not negatively impact the linear regression model's performance.
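
Regression-based imputation can be sketched with a single predictor column. The data below is a made-up toy example and the helper names are hypothetical; a real dataset would usually rely on a library imputer and more than one predictor.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_with_regression(rows):
    """Fill a missing second value (None) using a line fitted on the complete rows."""
    complete = [(a, b) for a, b in rows if b is not None]
    slope, intercept = fit_line([a for a, _ in complete], [b for _, b in complete])
    return [(a, b if b is not None else slope * a + intercept) for a, b in rows]


# Toy rows of (complete column, column with gaps); values are illustrative only.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, None), (4.0, 8.0), (5.0, None)]
filled = impute_with_regression(rows)

for original, result in zip(rows, filled):
    print(original, "->", result)
```

Unlike a constant fill, the imputed values follow the relationship between the columns that the complete rows exhibit.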

142. A company is observing low accuracy while training on the default built-in image classification algorithm in Amazon SageMaker. The Data Science team wants to use an Inception neural network architecture instead of a ResNet architecture. Which of the following will accomplish this? (Select TWO)

Customize the built-in image classification algorithm to use Inception and use this for model training.

Create a support case with the SageMaker team to change the default image classification algorithm to Inception.

Bundle a Docker container with TensorFlow Estimator loaded with an Inception network and use this for model training.

Explanation

The correct answer is to bundle a Docker container with TensorFlow Estimator loaded with an Inception network and use this for model training, and to use custom code in Amazon SageMaker with TensorFlow Estimator to load the model with an Inception network and use this for model training. These options allow the Data Science team to use the Inception neural network architecture instead of the default ResNet architecture by customizing the image classification algorithm and loading the Inception network model for training.

143. A news organization wants to extract metadata from its articles and blogs and index that metadata in Amazon Elasticsearch Service (Amazon ES) to enable faster searches. What AWS service can the organization use to achieve this goal?

Amazon Comprehend

Amazon Personalize

Amazon Textract

Amazon Rekognition Image

Explanation

The organization can use Amazon Comprehend to extract metadata from its articles and blogs. Amazon Comprehend is a natural language processing (NLP) service that can analyze text and extract information such as key phrases, entities, sentiment, and language. By using Amazon Comprehend, the organization can extract the desired metadata from the articles and blogs, and then index that metadata in Amazon Elasticsearch Service for faster searches.

144. A Data Scientist needs to create a serverless ingestion and analytics solution for high-velocity, real-time streaming data. The ingestion process must buffer and convert incoming records from JSON to a query-optimized, columnar format without data loss. The output datastore must be highly available, and Analysts must be able to run SQL queries against the data and connect to existing business intelligence dashboards. Which solution should the Data Scientist build to satisfy the requirements?

Explanation

The Data Scientist should create a schema in the AWS Glue Data Catalog to define the structure of the incoming data. Then, they should use an Amazon Kinesis Data Firehose delivery stream to stream the data and transform it into Apache Parquet or ORC format using the AWS Glue Data Catalog. The transformed data should be delivered to Amazon S3, which provides a highly available output datastore. Analysts can then query the data directly from Amazon S3 using Amazon Athena, and connect to business intelligence tools using the Athena JDBC connector. This solution satisfies the requirements of buffering and converting the data, providing a highly available datastore, and enabling SQL queries and connection to existing dashboards.

145. An e-commerce company needs a customized training model to classify images of its shirts and pants products. The company needs a proof of concept in 2 to 3 days with good accuracy. Which compute choice should the Machine Learning Specialist select to train and achieve good accuracy on the model quickly?

m5.4xlarge (general purpose)

r5.2xlarge (memory optimized)

p3.2xlarge (GPU accelerated computing)

p3.8xlarge (GPU accelerated computing)

Explanation

The Machine Learning Specialist should select the p3.2xlarge compute choice because it is a GPU accelerated computing option. GPU acceleration can significantly speed up the training process for machine learning models, especially when working with image classification tasks. The use of GPUs allows for parallel processing and optimization of computations, leading to faster training times and ultimately achieving good accuracy on the model quickly.

146. A Data Scientist at an ad-tech startup wants to update an ML model that uses an Amazon SageMaker endpoint using the canary deployment methodology, in which the production variant 1 is the production model and the production variant 2 is the updated model. How can the Data Scientist efficiently configure this endpoint configuration to deploy the two different versions of the model while monitoring the Amazon CloudWatch invocations?

Create an endpoint configuration with production variants for the two models with equal weights.

Create two Amazon SageMaker endpoints and change the endpoint URL after testing the new endpoint.

Create an endpoint configuration with production variants for the two models with a weight ratio of 10:90.

Explanation

The Data Scientist should create an endpoint configuration with production variants for the two models with a weight ratio of 0:1. This means that the production variant 1 will have a weight of 0, indicating that it is not being used for inference, while the production variant 2 will have a weight of 1, indicating that it is the updated model being used for inference. By periodically updating the weights, the Data Scientist can gradually shift the traffic from the old model to the updated model and monitor the invocations using Amazon CloudWatch.

147. A real estate startup wants to use ML to predict the value of homes in various cities. To do so, the startup's data science team is joining real estate price data with other variables such as weather, demographic, and standard of living data. However, the team is having problems with slow model convergence. Additionally, the model includes large weights for some features, which is causing degradation in model performance. What kind of data preprocessing technique should the team use to more effectively prepare this data?

Standard scaler

Normalizer

Max absolute scaler

One hot encoder

Explanation

The team should use the standard scaler. It transforms each feature to have a mean of 0 and a standard deviation of 1, which brings features measured in very different units (prices, temperatures, population counts) onto a comparable scale. Gradient-based training converges faster on standardized inputs, and the learned weights no longer need to be very large or very small just to compensate for differences in feature magnitude. Note that standardization does not remove outliers; it only rescales them along with the rest of the feature.
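A minimal sketch of standardization follows, using only the Python standard library. The `prices` values are made up for illustration; a real pipeline would typically compute the mean and standard deviation on the training split only and reuse them for validation and test data.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical home prices, far larger in magnitude than, say, temperatures.
prices = [100_000.0, 250_000.0, 400_000.0]
scaled = standardize(prices)
```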

148. A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values?

Replace each missing value by the mean or median across non-missing values in the same row.

Delete observations that contain missing values because these represent less than 5% of the data

Replace each missing value by the mean or median across non-missing values in the same column.

For each feature, approximate the missing values using supervised learning based on other features.

Explanation

Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might perform differently, but any properly implemented supervised learning approach should provide the same or a better approximation than the mean or median approximation proposed in the row-wise and column-wise options. Supervised learning applied to the imputation of missing values is an active field of research.
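As a rough sketch of the idea, the snippet below imputes a missing feature with a one-variable least-squares fit on another feature. The column names `x` and `y` and the sample rows are hypothetical, and a real imputer would normally use several predictor features and a stronger model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_with_regression(rows, target, predictor):
    """Fill missing `target` values using a line fitted on complete rows."""
    known = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line(
        [r[predictor] for r in known], [r[target] for r in known]
    )
    filled = []
    for r in rows:
        row = dict(r)  # copy so the input rows are left untouched
        if row[target] is None:
            row[target] = slope * row[predictor] + intercept
        filled.append(row)
    return filled


# Hypothetical data in which y is roughly twice x; one y value is missing.
rows = [
    {"x": 1.0, "y": 2.0},
    {"x": 2.0, "y": 4.0},
    {"x": 3.0, "y": None},
    {"x": 4.0, "y": 8.0},
]
imputed = impute_with_regression(rows, target="y", predictor="x")
```

Here the fitted line recovers a value close to 6 for the missing entry, whereas the column mean of the known values would have filled in roughly 4.67.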

149. An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task?

Explanation

Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs to be used instead of Word2Vec.

150. Which of the following best describes the purpose of Amazon SageMaker Ground Truth in the context of machine learning?

It automates the training of machine learning models.

It provides tools for model hyperparameter tuning.

It helps generate labeled datasets for training machine learning models.

It monitors and manages deployed machine learning models.

Explanation

Amazon SageMaker Ground Truth is a service that helps in generating high-quality labeled datasets by using a combination of human labeling and automated data labeling techniques. This is essential for training supervised machine learning models, as labeled data serves as the foundation for teaching models to make accurate predictions. The service significantly reduces the time and cost associated with data labeling.

151. A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age. Initial models have performed poorly. While reviewing the underlying data, the Data Scientist noticed that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features of these observations appear normal compared to the rest of the sample population. How should the Data Scientist correct this issue?

Drop all records from the dataset where age has been set to 0.

Drop the age feature from the dataset and train the model using the rest of the features.

Use k-means clustering to handle missing features.

Explanation

The Data Scientist should replace the age values of 0 with the mean or median age of the remaining records. This is a widely accepted method for handling such anomalies. Since an age of 0 is clearly incorrect in a study of individuals over 65, imputing these values with the mean or median age of the other patients keeps the 450 records usable without introducing significant bias. The mean or median would likely provide a reasonable estimate of what the age should have been.
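The imputation described above can be sketched as follows. The sample ages are invented for illustration, and the sentinel value 0 is taken from the question.

```python
from statistics import median


def impute_invalid_age(ages, invalid=0):
    """Replace sentinel ages with the median of the valid ages."""
    valid = [a for a in ages if a != invalid]
    fill = median(valid)
    return [fill if a == invalid else a for a in ages]


# Hypothetical sample from the over-65 study; 0 marks a data-entry error.
ages = [70, 0, 82, 68, 0, 75]
cleaned = impute_invalid_age(ages)
```

The median is computed from the valid ages only, so the erroneous zeros do not drag the fill value down.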

152. A Marketing Manager at a pet insurance company plans to launch a targeted marketing campaign on social media to acquire new customers. Currently, the company has the following data in Amazon Aurora.

Profiles for all past and existing customers
Profiles for all past and existing insured pets
Policy-level information
Premiums received
Claims paid

What steps should be taken to implement a machine learning model to identify potential new customers on social media?

Use regression on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media.

Use clustering on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media.

Explanation

To implement a machine learning model to identify potential new customers on social media, the marketing manager should use clustering on customer profile data to understand key characteristics of consumer segments. This approach will help in grouping customers with similar attributes together, enabling the company to target specific segments effectively. By finding similar profiles on social media, the marketing manager can tailor the marketing campaign to reach potential customers who share similar characteristics with existing and past customers.

153. While reviewing the histogram for residuals on regression evaluation data, a Machine Learning Specialist notices that the residuals do not form a zero-centered bell shape as shown. What does this mean?

The model might have prediction errors over a range of target values.

The dataset cannot be accurately represented using the regression model.

There are too many variables in the model.

The model is predicting its target values perfectly.

Explanation

A residual is the difference between the true target value and the predicted value. When the residual histogram is bell-shaped and centered at zero, the model's errors are random and it does not systematically over- or under-predict. When the histogram is not zero-centered, there is structure in the prediction error: the model consistently overestimates or underestimates over some range of target values.
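A quick numerical check that complements the histogram is the mean of the residuals. The sketch below uses made-up targets and predictions in which the model over-predicts every example.

```python
from statistics import mean


def residuals(y_true, y_pred):
    """Residual = true target minus predicted target."""
    return [t - p for t, p in zip(y_true, y_pred)]


def mean_residual(y_true, y_pred):
    """A value far from zero suggests systematic over- or under-prediction."""
    return mean(residuals(y_true, y_pred))


# Hypothetical evaluation data: every prediction is too high.
y_true = [10, 20, 30, 40]
y_pred = [12, 23, 33, 44]
bias = mean_residual(y_true, y_pred)
```

A negative mean residual such as this one means the predictions sit above the true values on average, which is the kind of shift a non-zero-centered histogram reveals.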

154. You work in the security department of your company's IT division. Your company has decided to try to use facial recognition to improve security on their campus. You have been asked to design a system that augments your company's building access security by scanning the faces of people entering their buildings and recognizing the person as either an employee/contractor/consultant, who is in the company's database, or visitor, who is not in their database. Across their many campus locations worldwide your company has over 750,000 employees and over 250,000 contractors and consultants. These workers are all registered in their HR database. Each of these workers has an image of their face stored in the HR database. You have decided to use Amazon Rekognition for your facial recognition solution. On occasion, the Rekognition model fails to recognize visitors to the buildings. What could be the source of the problem?

Face landmarks filters set to a max sharpness

Bounding box and confidence score for face comparison threshold tolerances set to max values

Confidence threshold tolerance set to the default

Face collection contents

Explanation

The source of the problem could be the face collection contents. Since the Rekognition model is failing to recognize visitors, it is possible that the faces of the visitors are not included in the face collection that the system is comparing against. The face collection should ideally contain images of both employees/contractors/consultants and visitors in order to accurately identify and differentiate between them.

155. A Data Scientist is training a convolutional neural network model to detect incoming employees at the company's front gate using a camera so that the system opens for them automatically. However, the model is taking too long to converge and the error oscillates for more than 10 epochs. What should the Data Scientist do to improve upon this situation? (Select TWO.)

Normalize the images before training

Add batch normalization

Add more epochs

Increase batch size

Decrease weight decay

Explanation

The Data Scientist should normalize the images before training to improve the convergence of the model. Normalization helps to scale the pixel values of the images to a standard range, which can prevent the model from getting stuck in oscillating errors. Additionally, adding batch normalization can also help in improving the convergence by normalizing the activations of the neural network layers. These two techniques can help stabilize the training process and improve the performance of the model.

156. An agency collects census information within a country to determine healthcare and social program needs by province and city. The census form collects responses for approximately 500 questions from each citizen. Which combination of algorithms would provide the appropriate insights? (Select TWO)

The factorization machines (FM) algorithm

The Latent Dirichlet Allocation (LDA) algorithm

The principal component analysis (PCA) algorithm

The k-means algorithm

The Random Cut Forest (RCF) algorithm

Explanation

The principal component analysis (PCA) algorithm is suitable for this task as it can reduce the dimensionality of the data and identify the most important variables that contribute to the variance in the dataset. This can help in identifying patterns and relationships within the census information. The k-means algorithm can be used to cluster the data based on similarities, which can be useful in grouping provinces and cities with similar healthcare and social program needs. These algorithms together can provide valuable insights for determining healthcare and social program needs by province and city based on the census information.

157. A machine translation company is deploying its language translation models behind an Amazon SageMaker endpoint. The company wants to deploy a solution directly on its website so that users can input text in one language and have it translated into a second language. The company wants to reach a solution with minimal maintenance and latency for spiky traffic times. How should the company architect this solution?

Use Amazon SageMaker InvokeEndpoint with API Gateway

Use Lambda to call InvokeEndpoint. Use the Amazon API Gateway URL to call the AWS Lambda function.

Install the sagemaker-runtime library on the web server. Call InvokeEndpoint from the web server.

Explanation

The company should use Lambda to call InvokeEndpoint and use the Amazon API Gateway URL to call the AWS Lambda function. This architecture allows for minimal maintenance and latency for spiky traffic times. By using Lambda, the company can easily scale the solution to handle increased traffic. The Amazon API Gateway acts as a proxy, providing a secure and scalable entry point for the website to call the Lambda function. This solution is efficient and provides a seamless experience for users on the website.
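The Lambda side of this architecture might look roughly like the sketch below. It assumes an API Gateway proxy integration (the request arrives as a JSON string in `event["body"]`) and a hypothetical JSON payload with a `text` field; the endpoint name is also made up. In a deployed function the client would be `boto3.client("sagemaker-runtime")`; a stand-in client is used here so the example runs without AWS access.

```python
import io
import json


def make_handler(runtime_client, endpoint_name):
    """Build a Lambda handler that forwards requests to a SageMaker endpoint."""

    def handler(event, context):
        payload = json.loads(event["body"])
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"text": payload["text"]}),
        )
        # The response body is a stream that must be read and decoded.
        result = response["Body"].read().decode("utf-8")
        return {"statusCode": 200, "body": result}

    return handler


class FakeRuntimeClient:
    """Stand-in for the sagemaker-runtime client, for local testing only."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        text = json.loads(Body)["text"]
        fake = json.dumps({"translation": text.upper()}).encode("utf-8")
        return {"Body": io.BytesIO(fake)}


handler = make_handler(FakeRuntimeClient(), "translation-endpoint")
reply = handler({"body": json.dumps({"text": "hola"})}, None)
```

Passing the client in as an argument keeps the handler testable; the fake client only uppercases the input and is not a translation model.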

158. A Machine Learning Specialist is developing a recommendation engine for a photography blog. Given a picture, the recommendation engine should show a picture that captures similar objects. The Specialist would like to create a numerical representation feature to perform nearest-neighbor searches. What actions would allow the Specialist to get relevant numerical representations?

Reduce image resolution and use reduced resolution pixel values as features.

Use Amazon Mechanical Turk to label image content and create a one-hot representation indicating the presence of specific labels.

Run images through a neural network pre-trained on ImageNet, and collect the feature vectors from the penultimate layer.

Average colors by channel to obtain three-dimensional representations of images.

Explanation

The Specialist can obtain relevant numerical representations by running images through a neural network pre-trained on ImageNet and collecting the feature vectors from the penultimate layer. This approach allows the recommendation engine to capture high-level visual features and patterns in the images, which can be used for nearest-neighbor searches. The penultimate layer is chosen because it typically contains more abstract and discriminative features compared to earlier layers.
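Once feature vectors are available, the nearest-neighbor search itself is simple. The sketch below uses cosine similarity over tiny hand-written vectors; the image names and three-dimensional vectors are placeholders, since real penultimate-layer features usually have hundreds or thousands of dimensions and come from the pre-trained network.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, catalog):
    """Return the catalog key whose vector is most similar to the query."""
    return max(catalog, key=lambda name: cosine_similarity(query, catalog[name]))


# Placeholder feature vectors standing in for network embeddings.
catalog = {
    "beach": [0.9, 0.1, 0.0],
    "forest": [0.1, 0.9, 0.2],
    "city": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
best_match = nearest(query, catalog)
```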

159. When submitting Amazon SageMaker training jobs using one of the built-in algorithms, which common parameters MUST be specified? (Select THREE.)

The training channel identifying the location of training data on an Amazon S3 bucket.

The validation channel identifying the location of validation data on an Amazon S3 bucket.

The IAM role that Amazon SageMaker can assume to perform tasks on behalf of the users.

Hyperparameters in a JSON array as documented for the algorithm used.

The Amazon EC2 instance class specifying whether training will be run using CPU or GPU.

The output path specifying where on an Amazon S3 bucket the trained model will persist.

Explanation

When submitting Amazon SageMaker training jobs using one of the built-in algorithms, three common parameters that must be specified are the IAM role, the EC2 instance class, and the output path. The IAM role is necessary for SageMaker to perform tasks on behalf of the users. The EC2 instance class determines whether the training will be run using CPU or GPU. The output path specifies where the trained model will be stored in an Amazon S3 bucket.

160. A Machine Learning Specialist at a company sensitive to security is preparing a dataset for model training. The dataset is stored in Amazon S3 and contains Personally Identifiable Information (PII). The dataset:

Must be accessible from a VPC only.
Must not traverse the public internet.

How can these requirements be satisfied?

Explanation

To satisfy the requirements of making the dataset accessible from a VPC only and not traversing the public internet, a VPC endpoint needs to be created. Additionally, a bucket access policy should be applied to restrict access to the specific VPC endpoint and the VPC itself. This ensures that the dataset can only be accessed within the designated VPC, providing the necessary security measures for sensitive data.
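A bucket policy of the kind described above can be sketched as a Python dictionary and serialized to JSON. The bucket name and VPC endpoint ID are hypothetical placeholders; the policy denies every S3 action unless the request arrives through the named VPC endpoint, using the `aws:SourceVpce` condition key.

```python
import json


def vpce_only_bucket_policy(bucket_name, vpc_endpoint_id):
    """Deny all S3 access that does not come through the given VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                "Condition": {
                    "StringNotEquals": {"aws:SourceVpce": vpc_endpoint_id}
                },
            }
        ],
    }


# Placeholder identifiers, not real resources.
policy = vpce_only_bucket_policy("example-pii-bucket", "vpce-0123456789abcdef0")
policy_json = json.dumps(policy, indent=2)
```

A deny statement like this also blocks console and administrative access from outside the endpoint, so it is worth testing on a non-critical bucket first.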

161. A Data Science team within a large company uses Amazon SageMaker notebooks to access data stored in Amazon S3 buckets. The IT Security team is concerned that internet-enabled notebook instances create a security vulnerability where malicious code running on the instances could compromise data privacy. The company mandates that all instances stay within a secured VPC with no internet access, and data communication traffic must stay within the AWS network. How should the Data Science team configure the notebook instance placement to meet these requirements?

Explanation

The correct answer is to associate the Amazon SageMaker notebook with a private subnet in a VPC and ensure that the VPC has S3 VPC endpoints and Amazon SageMaker VPC endpoints attached to it. This configuration allows the Data Science team to access data stored in Amazon S3 buckets and use Amazon SageMaker while keeping the instances within a secured VPC with no internet access. The use of VPC endpoints ensures that data communication traffic stays within the AWS network, addressing the IT Security team's concerns about data privacy and security vulnerabilities.

162. You work for a web retailer where you need to analyze data produced for your company by an outside market data provider. You need to produce recommendations based on patterns in user preferences by demographic found in the supplied data. You have stored the data in one of your company's S3 buckets. You have created a Glue crawler that you have configured to crawl the data on S3 and you have written a custom classifier. Unfortunately, the crawler failed to create a schema. Why might the Glue crawler have failed in this way?

You did not add an exclude pattern when you configured the data store

All the classifiers returned a certainty of 0.0

You chose to create a single schema for each S3 path

Explanation

The Glue crawler most likely failed to create a schema because every classifier, including the custom one, returned a certainty of 0.0. Each classifier reports a certainty between 0.0 and 1.0 that indicates how sure it is that it recognizes the data format. When none of them recognizes the format of the data in the S3 bucket, the crawler has nothing from which to infer a schema.

163. An office security agency conducted a successful pilot using 100 cameras installed at key locations within the main office. Images from the cameras were uploaded to Amazon S3 and tagged using Amazon Rekognition, and the results were stored in Amazon ES. The agency is now looking to expand the pilot into a full production system using thousands of video cameras in its office locations globally. The goal is to identify activities performed by non-employees in real time. Which solution should the agency consider?

Explanation

The agency should use a proxy server at each local office to stream each camera's RTSP feed to its own Amazon Kinesis Video Streams video stream. On each stream, the agency should use Amazon Rekognition Video to create a stream processor that detects faces from a collection of known employees. This solution identifies activities performed by non-employees in real time and alerts the agency when non-employees are detected.

164. A Machine Learning Specialist kicks off a hyperparameter tuning job for a tree-based ensemble model using Amazon SageMaker with Area Under the ROC Curve (AUC) as the objective metric. This workflow will eventually be deployed in a pipeline that retrains and tunes hyperparameters each night to model click-through on data that goes stale every 24 hours. With the goal of decreasing the amount of time it takes to train these models, and ultimately to decrease costs, the Specialist wants to reconfigure the input hyperparameter range(s). Which visualization will accomplish this?

A histogram showing whether the most important input feature is Gaussian.

A scatter plot showing the performance of the objective metric over each training iteration.

A scatter plot showing the correlation between maximum tree depth and the objective metric.

Explanation

The scatter plot showing the correlation between maximum tree depth and the objective metric will accomplish the goal of decreasing the training time and costs. By analyzing the relationship between the maximum tree depth and the objective metric (AUC), the Machine Learning Specialist can identify the optimal value for the maximum tree depth that results in the best performance. This allows for fine-tuning the hyperparameters to improve the model's efficiency and reduce the time and cost required for training.

165. A Data Scientist wants to gain real-time insights into a data stream of GZIP files. Which solution would allow the use of SQL to query the stream with the LEAST latency?

Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data.

AWS Glue with a custom ETL script to transform the data.

An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster.

Amazon Kinesis Data Firehose to transform the data and put it into an Amazon S3 bucket.

Explanation

The correct answer is Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data. This solution allows the use of SQL to query the data stream with the least latency because Kinesis Data Analytics can process real-time streaming data using SQL queries, and AWS Lambda can be used to transform the data in real-time. This combination ensures that the insights can be gained in real-time with minimal delay.

166. A Machine Learning Specialist discovers the following statistics while experimenting on a model. What can the Specialist conclude from the experiments?

Explanation

The experiments suggest that the model in Experiment 1 initially had a high variance error, which means it was overfitting the training data. However, in Experiment 3, by applying regularization techniques, the variance error was reduced, indicating that the model became more generalizable. Additionally, Experiment 2 shows that there is minimal bias error in Experiment 1, meaning that the model was able to capture the underlying patterns in the data well.

167. A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided. Based on this information, which model would have the HIGHEST recall with respect to the fraudulent class?

Decision tree

Linear support vector machine (SVM)

Naive Bayesian classifier

Single Perceptron with sigmoid activation function

Explanation

The decision tree model would have the highest recall with respect to the fraudulent class. Decision trees are known for their ability to capture complex relationships between features and target variables. In this case, the decision tree can effectively learn the patterns in the age of account and transaction month features that are indicative of fraudulent behavior. This allows the model to correctly identify a higher proportion of fraudulent instances, resulting in a higher recall for the fraudulent class.

168. You work in the data analytics department of a ride sharing software company. You need to use the K-means machine learning algorithm to separate your company's optimized ride data into clusters based on ride coordinates. How would you best use AWS Glue to build the data tables needed to classify the ride data?

Use Glue crawlers together with a K-means classifier to classify the ride data based on coordinates

Use Glue FindMatches to find and remove duplicate records in your data

Use Glue to automatically generate code to classify the ride data based on coordinates

Use Glue to transform and flatten your data so you can classify the ride data based on coordinates

Explanation

AWS Glue crawlers can be used to automatically discover and catalog the ride data, including its structure and schema. This will create the necessary data tables that can be used for classification. By using a K-means classifier, the ride data can be separated into clusters based on the coordinates. This approach leverages the capabilities of AWS Glue to automate the process of preparing and organizing the data for classification.
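
The clustering step itself can be sketched independently of Glue. The example below is a minimal pure-Python version of K-means (Lloyd's algorithm) on coordinate pairs; it assumes the ride data has already been catalogued and loaded into memory, and the coordinates and function name are hypothetical.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, labels


# Hypothetical (latitude, longitude) pickup points around two areas.
rides = [(40.71, -74.00), (40.72, -74.01), (40.70, -73.99),
         (34.05, -118.24), (34.06, -118.25), (34.04, -118.23)]
centroids, labels = kmeans(rides, k=2)
print(labels)
```

Plain Euclidean distance on latitude and longitude is a simplification that is reasonable only for illustration; a production job would more likely run a library implementation inside a Spark ETL script.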

169. Which AWS service provides a managed environment for training and deploying machine learning models with built-in support for distributed training, automatic model tuning, and integration with other AWS services?

AWS Glue

Amazon Comprehend

Amazon SageMaker

Amazon Lex

Explanation

Amazon SageMaker is the service that provides a managed environment for training and deploying machine learning models. It supports distributed training, automatic model tuning, and integration with other AWS services. Unlike AWS Glue, which handles data integration and ETL, and Amazon Comprehend, which focuses on text analysis, SageMaker is designed for end-to-end machine learning workflows, including model training, tuning, and deployment. Amazon Lex is for building conversational interfaces such as chatbots, not for training or deploying custom models.

170. A sports and betting company uses machine learning to predict the odds of winning during sporting events. It uses the Amazon SageMaker endpoint to serve its production model. The endpoint is on an m5.8xlarge instance. What can the company do to ensure that this endpoint is highly available while using the most cost-effective and easily managed solution?

Create another endpoint. Put the two endpoints behind an Application Load Balancer.

Increase the number of instances associated with the endpoint to more than one.

Increase the instance size to m5.16xlarge.

Add Elastic Inference to the endpoint.

Explanation

Increasing the number of instances associated with the endpoint to more than one makes the endpoint highly available: if one instance fails or experiences issues, the remaining instances continue to serve the production model. SageMaker attempts to distribute the instances of a multi-instance endpoint across Availability Zones and balances requests among them, so no separate load balancer is needed. Upgrading to a larger instance size still leaves a single point of failure, and Elastic Inference adds acceleration rather than redundancy. Running a second endpoint behind an Application Load Balancer would also work, but it is harder to manage than changing the instance count in the endpoint configuration.
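
As a rough sketch of what "more than one instance" looks like in practice, the helper below builds the arguments for SageMaker's CreateEndpointConfig call with an instance count of at least two. The helper name, configuration name, and model name are hypothetical, and the instance type string simply mirrors the question; the block only builds the request and does not call AWS.

```python
def endpoint_config_request(config_name, model_name, instance_type,
                            instance_count=2):
    """Build keyword arguments for SageMaker's CreateEndpointConfig call."""
    if instance_count < 2:
        raise ValueError("use at least two instances for high availability")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
            }
        ],
    }


# Hypothetical names. The instance type mirrors the m5.8xlarge in the
# question; confirm it is offered for hosting in your region.
# With boto3 the dict would be passed as
# boto3.client("sagemaker").create_endpoint_config(**request)
request = endpoint_config_request("odds-config-ha", "odds-model", "ml.m5.8xlarge")
print(request["ProductionVariants"][0]["InitialInstanceCount"])
```

An existing endpoint would then be pointed at the new configuration, which keeps a single endpoint name for clients to call.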

171. A gaming company has launched an online game where people can start playing for free but they need to pay if they choose to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year. The company has gathered a labeled dataset from 1 million users. The training dataset consists of 1,000 positive samples (from users who ended up paying within 1 year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features including user age, device, location, and play patterns. Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory. Which of the following approaches should the Data Science team take to mitigate this issue? (Select TWO)

Add more deep trees to the random forest to enable the model to learn more features.

Include a copy of the samples in the test dataset in the training dataset.

Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data.

Change the cost function so that false negatives have a higher impact on the cost value than false positives.

Change the cost function so that false positives have a higher impact on the cost value than false negatives.

Explanation

The Data Science team should generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data. This will help to balance the dataset and provide more examples of paid users for the model to learn from. Additionally, the team should change the cost function so that false negatives have a higher impact on the cost value than false positives. This will prioritize correctly identifying users who will become paid users, even if it results in more false positives.
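
Both remedies can be sketched in a few lines. The example below is a minimal illustration, not the team's actual pipeline: the function names, the noise level, the weights, and the feature values are all hypothetical, and a weighted log loss stands in for whatever cost function the chosen model uses (for a random forest the same idea is usually expressed through class weights).

```python
import math
import random


def oversample_with_noise(positives, copies, noise_std=0.01, seed=0):
    """Return jittered copies of each positive feature vector."""
    rng = random.Random(seed)
    return [
        [x + rng.gauss(0.0, noise_std) for x in sample]
        for sample in positives
        for _ in range(copies)
    ]


def weighted_log_loss(y_true, y_prob, fn_weight=10.0, fp_weight=1.0):
    """Binary cross-entropy where errors on positives cost fn_weight times more."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, 1e-12), 1 - 1e-12)  # avoid log(0)
        if y == 1:
            total += -fn_weight * math.log(p)      # penalizes missed positives
        else:
            total += -fp_weight * math.log(1 - p)  # penalizes false alarms
    return total / len(y_true)


# Hypothetical data: two positive samples with three features each.
positives = [[0.2, 1.0, 3.5], [0.4, 0.9, 3.1]]
synthetic = oversample_with_noise(positives, copies=5)
print(len(synthetic))

# An equally confident mistake costs more on a true positive than on a true negative.
miss = weighted_log_loss([1], [0.1])
false_alarm = weighted_log_loss([0], [0.9])
print(miss > false_alarm)
```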

172. You are a machine learning expert working for a marketing firm. You are supporting a team of data scientists and marketing managers who are running a marketing campaign. Your data scientists and marketing managers need to answer the question "Will this user subscribe to my campaign?" You have been given a dataset in the form of a CSV file which is formatted as follows: UserId, jobId, jobDescription, educationLevel, campaign, duration, willRespondToCampaign. When you build your schema for this dataset, which of the following data descriptors would you use to define the willRespondToCampaign attribute? (Select TWO).

Categorical

TargetAttributeName

TEXT

Binary

Numeric

RowId

Explanation

The "willRespondToCampaign" attribute is a binary variable that indicates whether a user will respond to the campaign or not. Therefore, the data descriptor "Binary" is appropriate to define this attribute. Additionally, the "targetAttributeName" is also a valid data descriptor as it specifies the attribute that is the target variable for the prediction task, which in this case is "willRespondToCampaign".
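
The descriptors in this question match the data schema format used by Amazon Machine Learning. A schema for this CSV might look like the sketch below, written as a Python dictionary and printed as JSON. Only the target's BINARY type and the targetAttributeName entry come from the answer; the types chosen for the other columns are assumptions made for illustration.

```python
import json

# Attribute types other than the target's are assumptions for illustration.
schema = {
    "version": "1.0",
    "targetAttributeName": "willRespondToCampaign",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "UserId", "attributeType": "CATEGORICAL"},
        {"attributeName": "jobId", "attributeType": "CATEGORICAL"},
        {"attributeName": "jobDescription", "attributeType": "TEXT"},
        {"attributeName": "educationLevel", "attributeType": "CATEGORICAL"},
        {"attributeName": "campaign", "attributeType": "CATEGORICAL"},
        {"attributeName": "duration", "attributeType": "NUMERIC"},
        {"attributeName": "willRespondToCampaign", "attributeType": "BINARY"},
    ],
}

# Look up the declared type of the target attribute.
target_type = next(
    a["attributeType"]
    for a in schema["attributes"]
    if a["attributeName"] == schema["targetAttributeName"]
)
print(json.dumps(schema, indent=2))
print(target_type)
```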

173. A Machine Learning Engineer created a pipeline for training an ML model using an Amazon SageMaker training job. The training job began successfully but then failed after running for five minutes. How should the Engineer begin to debug this issue? (Select TWO.)

Log into the Amazon SageMaker training job instance and check the job history

Call the DescribeJob API to check the FailureReason option

Go to Amazon CloudWatch logs and check the logs for the given training job

Check the error in the given training job directly in the Amazon SageMaker console

Check AWS CloudTrail logs to check the error that caused the training to fail

Explanation

To begin debugging, the Machine Learning Engineer should call the DescribeJob API (DescribeTrainingJob for a training job) and check the FailureReason field, which states why the job failed. The Engineer should also go to Amazon CloudWatch Logs and check the logs for the given training job; these contain the output of the training container and usually show the error in more detail. AWS CloudTrail records API calls rather than errors raised inside the training code, so it is not the place to start. Logging into the training instance to check job history and checking the error in the SageMaker console are not among the two expected answers.
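
The two debugging steps can be combined in a small helper that reads the fields of interest from a DescribeTrainingJob response and notes where the matching CloudWatch logs live. This is a sketch under stated assumptions: the helper name and the sample response are hypothetical, and the block works on a plain dictionary rather than calling AWS.

```python
def summarize_failure(description):
    """Pull the fields most useful for debugging from a DescribeTrainingJob response."""
    job_name = description["TrainingJobName"]
    return {
        "status": description["TrainingJobStatus"],
        "reason": description.get("FailureReason", "not reported"),
        # Training containers write to this CloudWatch Logs group; stream
        # names begin with the training job name.
        "log_group": "/aws/sagemaker/TrainingJobs",
        "log_stream_prefix": job_name,
    }


# Hypothetical response, trimmed to the relevant keys. With boto3 it would
# come from boto3.client("sagemaker").describe_training_job(TrainingJobName=name)
response = {
    "TrainingJobName": "demo-training-job",
    "TrainingJobStatus": "Failed",
    "FailureReason": "AlgorithmError: example message",
}
summary = summarize_failure(response)
print(summary)
```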

174. A company has raw user and transaction data stored in Amazon S3, a MySQL database, and Amazon Redshift. A Data Scientist needs to perform an analysis by joining the three datasets from Amazon S3, MySQL, and Amazon Redshift, and then calculating the average of a few selected columns from the joined data. Which AWS service should the Data Scientist use?

Amazon Athena

Amazon Redshift Spectrum

AWS Glue

Amazon QuickSight

Explanation

Amazon Redshift Spectrum should be used by the Data Scientist. Redshift Spectrum lets users run SQL queries against data stored in Amazon S3 without loading it into Redshift first, and join that data with tables already in Redshift. Spectrum itself reads only from S3; reaching the MySQL data from the same query relies on Redshift federated queries, which support Amazon RDS for MySQL and Amazon Aurora MySQL. With all three sources visible to Redshift, the Data Scientist can join them and calculate the averages in a single SQL statement.

175. You are a data scientist working for a cancer screening center. The center has gathered data on many patients that have been screened over the years. The data is obviously skewed toward true negative results, as most screened patients don't have cancer. You are evaluating several machine learning models to decide which model best predicts true positives when using your cancer screening data. You have split your data into a 70/30 ratio of training set to test set. You now need to decide which metric to use to evaluate your models. Which metric will most accurately determine the model best suited to solve your classification problem?

ROC Curve

Precision

Recall

PR Curve

Explanation

The PR Curve is the most suitable metric to determine the model best suited for the classification problem in this scenario. Since the data is skewed towards true negative results, precision and recall are more appropriate metrics than the ROC curve. Precision measures the proportion of true positive predictions out of all positive predictions, while recall measures the proportion of true positive predictions out of all actual positive cases. The PR Curve combines both precision and recall, providing a more accurate evaluation of the model's performance in identifying true positives.
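
A PR curve is simply the set of (precision, recall) pairs obtained by sweeping the decision threshold. The sketch below computes those pairs in plain Python; the function name, labels, and scores are hypothetical and only illustrate the definitions given above.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p)
        fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p)
        fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and not p)
        precision = tp / (tp + fp)  # tp + fp >= 1 because threshold is a score
        recall = tp / (tp + fn)     # assumes at least one positive label
        points.append((threshold, precision, recall))
    return points


# Hypothetical screening results: 2 positives among 10 patients.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05, 0.6, 0.15]
points = precision_recall_points(y_true, scores)
for threshold, precision, recall in points:
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Because neither precision nor recall counts true negatives, the large negative class does not inflate these numbers the way it can flatter a ROC curve.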

176. You work for a real estate company where you are building a machine learning model to predict the prices of houses. You are using a regression decision tree. As you train your model you see that it is overfitted to your training data and that it doesn't generalize well to unseen data. How can you improve your situation and get better training results in the most efficient way?

Gather additional training data that gives a more diverse representation of the housing price data.

Use the “dropout” technique to penalize large weights and prevent overfitting.

Use feature selection to eliminate irrelevant features and iteratively train your model until you eliminate the overfitting.

Use a random forest by building multiple randomized decision trees and averaging their outputs.

Explanation

Using a random forest by building multiple randomized decision trees and averaging their outputs can improve the situation and provide better training results. Random forests help to reduce overfitting by introducing randomness into the model. By building multiple decision trees with different subsets of the data and features, the model can learn from different perspectives and make more accurate predictions. Averaging the outputs of these trees helps to reduce the impact of individual overfitted trees and provides a more generalized prediction for unseen data.

177. A company is setting up an Amazon SageMaker environment. The corporate data security policy does not allow communication over the internet. How can the company enable the Amazon SageMaker service without enabling direct internet access to Amazon SageMaker notebook instances?

Create a NAT gateway within the corporate VPC.

Route Amazon SageMaker traffic through an on-premises network.

Create Amazon SageMaker VPC interface endpoints within the corporate VPC.

Create VPC peering with Amazon VPC hosting Amazon SageMaker.

Explanation

If the corporate data security policy does not allow communication over the internet, the company can create Amazon SageMaker VPC interface endpoints within the corporate VPC. Interface endpoints, powered by AWS PrivateLink, connect the VPC privately to the SageMaker API and runtime, so traffic between the VPC and SageMaker stays on the AWS network and never crosses the public internet. In practice, the company places its SageMaker resources in private subnets of the VPC, disables direct internet access on the notebook instances, and creates VPC endpoints for the other AWS services that SageMaker requires, such as Amazon S3.
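
As a sketch of the endpoint setup, the helper below builds the arguments for the EC2 CreateVpcEndpoint call for the SageMaker API and runtime services. The helper name and all resource IDs are hypothetical, and the block only constructs the requests; it does not call AWS.

```python
def sagemaker_endpoint_requests(region, vpc_id, subnet_ids, security_group_ids):
    """Build CreateVpcEndpoint arguments for the SageMaker API and runtime."""
    services = [
        f"com.amazonaws.{region}.sagemaker.api",      # control-plane calls
        f"com.amazonaws.{region}.sagemaker.runtime",  # InvokeEndpoint calls
    ]
    return [
        {
            "VpcEndpointType": "Interface",
            "VpcId": vpc_id,
            "ServiceName": service,
            "SubnetIds": subnet_ids,
            "SecurityGroupIds": security_group_ids,
            "PrivateDnsEnabled": True,
        }
        for service in services
    ]


# Hypothetical IDs. With boto3 each dict would be passed as
# boto3.client("ec2").create_vpc_endpoint(**request)
requests = sagemaker_endpoint_requests(
    "us-east-1",
    "vpc-0123456789abcdef0",
    ["subnet-0123456789abcdef0"],
    ["sg-0123456789abcdef0"],
)
for request in requests:
    print(request["ServiceName"])
```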

178. An oil and natural gas company is utilizing machine learning to identify optimal drilling sites. They have selected Amazon SageMaker for developing their models. Their data scientists, working from notebook instances, face lengthy delays during model training. The company aims to reduce these delays to enhance the efficiency of model iterations, allowing data scientists to rapidly test and modify their models without significant code adjustments or extended waits for data loading and model training. How can a team of data scientists reduce the waiting time associated with training machine learning models for discovering drilling locations, using Amazon SageMaker, while minimizing code changes for faster experimentation?

Use Amazon SageMaker built-in algorithms.

Use Amazon SageMaker Estimators in local mode to train the models.

Change the training job to use Pipe Mode to improve the time it takes to train the model.

Create the models on local laptops. Then, port the code over to use Amazon SageMaker.

Explanation

Using Amazon SageMaker Estimators in local mode allows data scientists to quickly test and iterate on their machine learning models directly within their SageMaker notebook instances without having to launch separate training jobs on SageMaker's managed training infrastructure. This approach dramatically reduces the waiting time typically associated with starting up training jobs and loading data in the cloud. By training models locally, data scientists can receive immediate feedback and make rapid adjustments to their models, thus accelerating the experimental cycle. This method also minimizes the changes needed in the code, as the same SageMaker Estimator can be used for both local training and full-scale cloud training simply by changing the instance type from local to a specific ML instance type provided by SageMaker.
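
The "one-line switch" described above can be sketched as a helper that returns the Estimator arguments which differ between local and managed training. The helper name and the default cloud instance type are hypothetical; the block does not import the SageMaker SDK, so it shows only the settings, not a training run.

```python
def training_settings(local, cloud_instance_type="ml.m5.xlarge"):
    """Return Estimator arguments that differ between local and managed training."""
    return {
        "instance_count": 1,
        # "local" runs the training container on the notebook instance itself.
        "instance_type": "local" if local else cloud_instance_type,
    }


# With the SageMaker Python SDK these would be unpacked into an Estimator,
# for example Estimator(image_uri=my_image, role=my_role, **settings),
# where my_image and my_role are placeholders for your own values.
fast_iteration = training_settings(local=True)
full_training = training_settings(local=False)
print(fast_iteration["instance_type"], full_training["instance_type"])
```

Keeping the rest of the Estimator arguments identical is what lets the same training script move from quick local iterations to a managed training job.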

179. A Machine Learning Specialist is using Apache Spark for pre-processing training data. As part of the Spark pipeline, the Specialist wants to use Amazon SageMaker for training a model and hosting it. Which of the following would the Specialist do to integrate the Spark application with SageMaker? (Select THREE)

Download the AWS SDK for the Spark environment.

Install the SageMaker Spark library in the Spark environment.

Use the appropriate estimator from the SageMaker Spark Library to train a model.

Compress the training data into a ZIP file and upload it to a pre-defined Amazon S3 bucket.

Use the SageMaker Model transform method to get inferences from the model hosted in SageMaker.

Convert the DataFrame object to a CSV file, and use the CSV file as input for obtaining inferences from SageMaker.

Explanation

To integrate the Spark application with SageMaker, the Machine Learning Specialist would need to perform the following steps: 1) Install the SageMaker Spark library in the Spark environment, which allows for seamless integration between Spark and SageMaker. 2) Use the appropriate estimator from the SageMaker Spark Library to train a model, which provides a high-level API for training models on SageMaker using Spark. 3) Use the SageMaker Model transform method to get inferences from the model hosted in SageMaker, which allows for real-time inference on new data using the trained model.

180. Which of the following AWS services is NOT directly related to machine learning?

Amazon SageMaker

Amazon Rekognition

AWS Lambda

Amazon Comprehend

Explanation

AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers. While it can be used to support machine learning applications, it's not a dedicated machine learning service.

Amazon SageMaker is a comprehensive platform for building, training, and deploying machine learning models.

Amazon Rekognition is a service that uses deep learning to analyze images and videos.

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to extract insights from text.

AWS Machine Learning - Specialty (MLS-C01 – 2020)