AWS Machine Learning - Specialty (MLS-C01 - 2020)

1. A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data.   Which solution requires the LEAST effort to be able to query this data?

Explanation

The correct answer is to use AWS Glue to catalogue the data and Amazon Athena to run queries. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. It can automatically discover and catalog data stored in Amazon S3, making it easier to query the data using SQL. Amazon Athena is an interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. This combination of AWS Glue and Amazon Athena requires the least effort as it eliminates the need for manual data transformation and provides a simple and efficient way to query the data.

About This Quiz
AWS Machine Learning - Specialty (MLS-C01, 2020) - Quiz

The AWS Certified Machine Learning - Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate's ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.

2. A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data.   Which solution requires the LEAST effort to be able to query this data?

Explanation

The solution that requires the least effort to be able to query the data is to use AWS Glue to catalogue the data and Amazon Athena to run queries. AWS Glue can automatically discover and catalog the structured and unstructured data stored in the Amazon S3 bucket, making it easier to query. Amazon Athena is a serverless query service that allows the Machine Learning Specialist to run SQL queries directly on the data without the need for any infrastructure setup or management. This solution eliminates the need for additional data transformation or ETL processes, reducing the overall effort required.

3. A Data Scientist for a credit card company is creating a solution to predict credit card fraud at the time of transaction. To that end, the Data Scientist is looking to create an ML model to predict fraud and will do so by training that model on an existing dataset of credit card transactions. That dataset contains 1,000 examples of transactions in total, only 50 of which are labeled as fraud. How should the Data Scientist deal with this class imbalance?

Explanation

The Data Scientist should use the Synthetic Minority Oversampling Technique (SMOTE) to oversample the fraud records. This technique generates synthetic samples of the minority class (fraudulent records) in order to balance the class distribution. By creating additional synthetic fraud examples, the ML model will have more data to learn from and improve its ability to predict fraud accurately. This approach is suitable for dealing with class imbalance and can help improve the performance of the model in detecting credit card fraud.
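A minimal sketch of SMOTE oversampling with the imbalanced-learn library; the generated dataset below is only a stand-in for the 1,000-transaction dataset described in the question:

```python
# Minimal SMOTE sketch using imbalanced-learn (the generated data is illustrative).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Stand-in for 1,000 transactions with roughly 50 fraud examples (5% minority class).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.95, 0.05],
                           random_state=42)
print("Before SMOTE:", Counter(y))

# Generate synthetic minority-class (fraud) samples until the classes are balanced.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE:", Counter(y_resampled))
```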

4. A Machine Learning Specialist must build out a process to query a dataset on Amazon S3 using Amazon Athena. The dataset contains more than 800,000 records stored as plaintext CSV files. Each record contains 200 columns and is approximately 1.5 MB in size. Most queries will span only 5 to 10 columns. How should the Machine Learning Specialist transform the dataset to minimize query runtime?

Explanation

By converting the records to Apache Parquet format, the Machine Learning Specialist can minimize query runtime. Apache Parquet is a columnar storage file format that is designed for efficient query performance and compression. It allows for column-wise compression and predicate pushdown, which means that only the relevant columns and rows are read during a query. This format is well-suited for analytical workloads and can significantly reduce the amount of data that needs to be processed, resulting in faster query execution times.
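One way to perform this conversion, sketched with pandas and its pyarrow backend; the file names are placeholders:

```python
# Convert a plaintext CSV file to Apache Parquet (file names are placeholders).
import pandas as pd

df = pd.read_csv("records.csv")                      # wide table, roughly 200 columns
df.to_parquet("records.parquet", engine="pyarrow",   # columnar, compressed output
              compression="snappy")

# Athena then scans only the 5 to 10 columns a query actually references.
```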

5. A Machine Learning Specialist is working with a media company to perform classification on popular articles from the company's website. The company is using random forests to classify how popular an article will be before it is published. A sample of the data being used is below. Given the dataset, the Specialist wants to convert the Day-Of_Week column to binary values. What technique should be used to convert this column to binary values?

Explanation

The technique that should be used to convert the Day-Of_Week column to binary values is one-hot encoding. One-hot encoding is a technique used to represent categorical variables as binary vectors. Each category is converted into a binary column, where a value of 1 represents the presence of that category and a value of 0 represents the absence. This is commonly used in machine learning algorithms to handle categorical data and allow them to be used in mathematical calculations.
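For illustration, a one-hot encoding of a day-of-week column with pandas; the column name and values are assumed:

```python
# One-hot encode a categorical day-of-week column with pandas (names are assumed).
import pandas as pd

df = pd.DataFrame({"Day_Of_Week": ["Mon", "Tue", "Sun", "Mon"]})
encoded = pd.get_dummies(df, columns=["Day_Of_Week"])
print(encoded)
# Each day becomes its own 0/1 column, e.g. Day_Of_Week_Mon, Day_Of_Week_Sun, ...
```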

6. A navigation and transportation company is using satellite images to model weather around the world in order to create optimal routes for its ships and planes. The company is using Amazon SageMaker training jobs to build and train its models. However, during training, it takes too long to download the company's 100 GB data from Amazon S3 to the training instance before the training starts. What should the company do to speed up its training jobs while keeping the costs low?

Explanation

Changing the input mode to Pipe would speed up the training jobs while keeping the costs low. By using Pipe mode, the company can stream the data directly from Amazon S3 to the training instance without the need to download the entire 100 GB data before training starts. This eliminates the time-consuming download process and allows for faster training. Additionally, it helps in reducing storage costs as there is no need to store the data on the training instance.
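A sketch of how the input mode might be set with the SageMaker Python SDK; the image URI, role, bucket, and instance type are placeholders:

```python
# Sketch: stream training data from S3 with Pipe mode (values are placeholders).
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    input_mode="Pipe",                      # stream from S3 instead of downloading 100 GB first
    sagemaker_session=sagemaker.Session(),
)
estimator.fit({"train": TrainingInput("s3://<bucket>/satellite-images/")})
```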

7. A multi-national banking organization provides loan services to customers worldwide. Many of its customers still submit loan applications in paper form in one of the bank's branch locations. The bank wants to speed up the loan approval process for this set of customers by using machine learning. More specifically, it wants to create a process in which customers submit the application to the clerk, who scans and uploads it to the system. The system then reads and provides an approval or denial of the application in a matter of minutes. What can the bank use to read and extract the necessary data from the loan applications without needing to manage the process?

Explanation

The bank can use Amazon Textract to read and extract the necessary data from the loan applications without needing to manage the process. Amazon Textract is a machine learning service that automatically extracts text and data from scanned documents. It can analyze the loan applications and extract relevant information such as personal details, financial data, and loan request information. This would significantly speed up the loan approval process as the system can quickly analyze and provide an approval or denial of the application in a matter of minutes.

8. This graph shows the training and validation loss against the epochs for a neural network. The network being trained is as follows:
  • Two dense layers, one output neuron
  • 100 neurons in each layer
  • 100 epochs
  • Random initialization of weights
  Which technique can be used to improve model performance in terms of accuracy in the validation set?

Explanation

Early stopping is a technique that can be used to improve model performance in terms of accuracy in the validation set. It involves stopping the training process before all epochs are completed, based on a predefined condition. This helps prevent overfitting and allows the model to generalize better to unseen data. By monitoring the validation loss during training, early stopping can determine when the model starts to overfit and stop training at that point, resulting in improved accuracy on the validation set.
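A minimal Keras sketch of early stopping on validation loss; the layer sizes follow the description above, and the training and validation arrays are assumed to exist:

```python
# Early stopping on validation loss with Keras (x_train, y_train, x_val, y_val assumed).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # one output neuron
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# Training halts once validation loss stops improving instead of running all 100 epochs.
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, callbacks=[early_stop])
```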

9. A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?

Explanation

With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by creating synthetic data points for the minority class. This technique would be the most effective in this scenario.

10. A Machine Learning Specialist is configuring Amazon SageMaker so multiple Data Scientists can access notebooks, train models, and deploy endpoints. To ensure the best operational performance, the Specialist needs to be able to track how often the Scientists are deploying models, GPU and CPU utilization on the deployed SageMaker endpoints, and all errors that are generated when an endpoint is invoked.   Which services are integrated with Amazon SageMaker to track this information? (Select TWO.)

Explanation

The correct answer is AWS CloudTrail and Amazon CloudWatch. AWS CloudTrail is used to track API activity and monitor actions taken by users, including model deployments and endpoint invocations. Amazon CloudWatch is used to monitor resource utilization, such as GPU and CPU utilization on the deployed SageMaker endpoints. AWS Health, AWS Trusted Advisor, and AWS Config are not directly integrated with Amazon SageMaker for tracking this information.

11. IT leadership wants to transition a company's existing machine learning data storage environment to AWS as a temporary ad hoc solution. The company currently uses a custom software process that heavily leverages SQL as a query language and exclusively stores generated CSV documents for machine learning. The ideal state for the company would be a solution that allows it to continue to use the current workforce of SQL experts. The solution must also support the storage of CSV and JSON files, and be able to query over semi-structured data. The following are high priorities for the company:
  • Solution simplicity
  • Fast development time
  • Low cost
  • High flexibility
  What technologies meet the company's requirements?

Explanation

The company's ideal state is to continue using SQL experts and support the storage of CSV and JSON files, as well as querying over semi-structured data. Amazon S3 allows for the storage of CSV and JSON files, while Amazon Athena is a serverless query service that supports SQL queries on data stored in Amazon S3. This solution is simple, has fast development time, low cost, and high flexibility, making it the best fit for the company's requirements.

12. An employee found a video clip with audio on a company's social media feed. The language used in the video is Spanish. English is the employee's first language, and they do not understand Spanish. The employee wants to do a sentiment analysis. What combination of services is the MOST efficient to accomplish the task?

Explanation

The most efficient combination of services to accomplish the task is Amazon Transcribe, Amazon Translate, and Amazon Comprehend. Amazon Transcribe can be used to transcribe the audio in the video clip from Spanish to text. Amazon Translate can then be used to translate the transcribed text from Spanish to English, allowing the employee to understand the content. Finally, Amazon Comprehend can be used to perform sentiment analysis on the translated text, providing insights into the sentiment expressed in the video clip.

13. A Machine Learning Specialist is required to build a supervised image-recognition model to identify a cat. The ML Specialist performs some tests and records the following results for a neural network-based image classifier: total number of images available = 1,000; test set images = 100 (constant test set). The ML Specialist notices that, in over 75% of the misclassified images, the cats were held upside down by their owners. Which techniques can be used by the ML Specialist to improve this specific test error?

Explanation

The ML Specialist can improve the test error by increasing the training data and adding variation in rotation for training images. This can help the model to learn and recognize cats even when they are held upside down by their owners. By exposing the model to a wider range of cat orientations during training, it can become more robust and accurate in identifying cats in different positions. Increasing the number of epochs, the number of layers, or the dropout rate may not specifically address the issue of misclassifying upside-down cats.
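A hedged sketch of adding rotation and flips to the training images with Keras's ImageDataGenerator; the directory layout and image size are assumptions:

```python
# Augment training images with rotation so upside-down cats appear during training.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=180,        # rotate up to 180 degrees in either direction
    horizontal_flip=True,
    vertical_flip=True,        # explicitly covers upside-down orientations
)

# A directory layout such as train/cat and train/not_cat is assumed for illustration.
train_data = train_gen.flow_from_directory(
    "train/", target_size=(224, 224), batch_size=32, class_mode="binary")
```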

14. A Data Scientist wants to implement a near-real-time anomaly detection solution for routine machine maintenance. The data is currently streamed from connected devices by AWS IoT to an Amazon S3 bucket and then sent downstream for further processing in a real-time dashboard. What service can the Data Scientist use to achieve the desired outcome with minimal change to the pipeline?

Explanation

Amazon Kinesis Data Analytics is the correct answer because it is a service specifically designed for real-time data processing and analytics. It can easily integrate with AWS IoT and Amazon S3 to process the streamed data and provide near-real-time anomaly detection. Using Amazon Kinesis Data Analytics would require minimal changes to the existing pipeline, making it the ideal choice for the Data Scientist.

15. A video streaming company is looking to create a personalized experience for its customers on its platform. The company wants to provide recommended videos to stream based on what other similar users watched previously. To this end, it is collecting its platform's clickstream data using an ETL pipeline and storing the logs and syslogs in Amazon S3. What kind of algorithm should the company use to create the simplest solution in this situation?

Explanation

The company should use a recommender system algorithm to create a personalized experience for its customers. A recommender system analyzes clickstream data and user behavior to provide recommendations based on what other similar users watched previously. This algorithm would be the simplest solution for the company to implement in order to provide recommended videos to stream on its platform.

16. The Chief Editor for a product catalog wants the Research and Development team to build a machine learning system that can be used to detect whether or not individuals in a collection of images are wearing the company's retail brand. The team has a set of training data.   Which machine learning algorithm should the researchers use that BEST meets their requirements?

Explanation

A Convolutional Neural Network (CNN) is the best machine learning algorithm for the given requirements. CNNs are specifically designed for image recognition tasks and have been proven to be highly effective in detecting patterns and features in images. They are able to automatically learn and extract relevant features from the training data, making them well-suited for identifying whether individuals in a collection of images are wearing the company's retail brand. LDA, RNN, and K-means are not specifically designed for image recognition tasks and may not perform as well in this scenario.

17. A Machine Learning Specialist deployed a model that provides product recommendations on a company's website. Initially, the model was performing very well and resulted in customers buying more products on average. However, within the past few months the Specialist has noticed that the effect of product recommendations has diminished and customers are starting to return to their original habits of spending less. The Specialist is unsure of what happened, as the model has not changed from its initial deployment over a year ago. Which method should the Specialist try to improve model performance?

Explanation

The correct answer is that the model should be periodically retrained using the original training data plus new data as product inventory changes. This is because the Specialist has noticed a decrease in the effectiveness of the model over time, indicating that the model needs to be updated. By periodically retraining the model with both the original training data and new data that reflects the changes in product inventory, the model can adapt to the evolving patterns and preferences of customers, leading to improved performance.

18. A Machine Learning Specialist is building a convolutional neural network (CNN) that will classify 10 types of animals. The Specialist has built a series of layers in a neural network that will take an input image of an animal, pass it through a series of convolutional and pooling layers, and then finally pass it through a dense and fully connected layer with 10 nodes. The Specialist would like to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes. Which function will produce the desired output?

Explanation

The Softmax function is used to produce a probability distribution of the likelihood that the input image belongs to each of the 10 classes. It converts the output of the neural network into a probability distribution by normalizing the values and ensuring they sum up to 1. This is suitable for multi-class classification tasks, such as classifying different types of animals in this case. Dropout is a regularization technique used to prevent overfitting, Smooth L1 loss is a loss function used in object detection, and Rectified linear units (ReLU) is an activation function commonly used in neural networks.
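The softmax function itself is simple enough to show directly; a NumPy sketch for a 10-class output with made-up logits:

```python
# Softmax: turn 10 raw output scores (logits) into a probability distribution.
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)     # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

logits = np.array([2.0, 1.0, 0.1, -1.2, 0.5, 0.0, 3.3, -0.7, 1.8, 0.2])
probs = softmax(logits)
print(probs.sum())     # 1.0, i.e. probabilities over the 10 animal classes
```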

19. A Machine Learning Specialist is designing a system for improving sales for a company. The objective is to use the large amount of information the company has on users' behavior and product preferences to predict which products users would like based on the users' similarity to other users.   What should the Specialist do to meet this objective?  

Explanation

To meet the objective of predicting which products users would like based on their similarity to other users, the Machine Learning Specialist should build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR. Collaborative filtering is a technique that analyzes the behavior of multiple users and identifies patterns and similarities between them to make recommendations. This approach is suitable for leveraging the large amount of information on users' behavior and product preferences that the company has. Apache Spark ML is a powerful machine learning library that can handle large-scale data processing, and Amazon EMR provides a scalable and cost-effective platform for running Spark applications.

20. Which of the following metrics should a Machine Learning Specialist generally use to compare/evaluate machine learning classification models against each other?

Explanation

The Area Under the ROC Curve (AUC) is a commonly used metric to compare and evaluate machine learning classification models. It measures the ability of the model to distinguish between positive and negative classes by calculating the area under the Receiver Operating Characteristic (ROC) curve. A higher AUC value indicates a better model performance in terms of classification accuracy. Therefore, a Machine Learning Specialist should generally use AUC to compare and evaluate different classification models.
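Computing AUC from predicted scores with scikit-learn; the labels and scores below are illustrative:

```python
# Compare two classifiers by AUC (labels and predicted scores are illustrative).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
scores_model_a = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]
scores_model_b = [0.5, 0.6, 0.4, 0.55, 0.45, 0.65, 0.5, 0.6]

print("Model A AUC:", roc_auc_score(y_true, scores_model_a))
print("Model B AUC:", roc_auc_score(y_true, scores_model_b))
# The model with the higher AUC separates the two classes better.
```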

21. An ML scientist has built a decision tree model using scikit-learn with 1,000 trees. The training accuracy for the model was 99.2% and the test accuracy was 70.3%. Should the Scientist use this model in production?

Explanation

The model's training accuracy of 99.2% indicates that it is performing well on the training data. However, the test accuracy of only 70.3% suggests that the model is not generalizing well to new, unseen data. Therefore, it is not advisable to use this model in production as it may not perform well on real-world data.

22. A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. As Data Scientists may create an arbitrary number of new datasets every day, the solution has to scale automatically and be cost-effective. Also, it must be possible to explore the data using SQL. Which storage scheme is MOST adapted to this scenario?

Explanation

Storing datasets as files in Amazon S3 is the most adapted storage scheme for this scenario because it allows for scalability and cost-effectiveness. With S3, the Data Science team can easily store and retrieve large amounts of training data without worrying about capacity limitations. Additionally, S3 supports SQL-based querying using services like Amazon Athena, allowing for easy exploration of the data using SQL. This solution also aligns with the requirement of being able to create an arbitrary number of new datasets every day, as S3 can handle the storage of a large number of files.

23. A Machine Learning Specialist is building a model that will perform time series forecasting using Amazon SageMaker. The Specialist has finished training the model and is now planning to perform load testing on the endpoint so they can configure Auto Scaling for the model variant.   Which approach will allow the Specialist to review the latency, memory utilization, and CPU utilization during the load test?

Explanation

Generating an Amazon CloudWatch dashboard will allow the Specialist to create a single view for the latency, memory utilization, and CPU utilization metrics that are outputted by Amazon SageMaker. This will provide a convenient and comprehensive way to review and monitor these metrics during the load test.

24. A Machine Learning Specialist is building a prediction model for a large number of features using linear models, such as linear regression and logistic regression. During exploratory data analysis, the Specialist observes that many features are highly correlated with each other. This may make the model unstable. What should be done to reduce the impact of having such a large number of features?

Explanation

When many features in a prediction model are highly correlated, it can lead to instability in the model. To reduce the impact of this issue, the Machine Learning Specialist should create a new feature space using principal component analysis (PCA). PCA is a dimensionality reduction technique that transforms the original features into a new set of uncorrelated variables called principal components. By creating a new feature space using PCA, the Specialist can eliminate the problem of high correlation among features and improve the stability of the prediction model.
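A scikit-learn sketch of projecting correlated features onto principal components before fitting a linear model; the feature matrix X and labels y are assumed to exist:

```python
# Reduce a correlated feature space with PCA before fitting a linear model.
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X (features) and y (labels) are assumed to exist.
model = make_pipeline(
    StandardScaler(),                  # PCA is sensitive to feature scale
    PCA(n_components=0.95),            # keep components explaining 95% of the variance
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)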

25. A web-based company wants to improve its conversion rate on its landing page. Using a large historical dataset of customer visits, the company has repeatedly trained a multi-class deep learning network algorithm on Amazon SageMaker. However, there is an overfitting problem: training data shows 90% accuracy in predictions, while test data shows only 70% accuracy. The company needs to boost the generalization of its model before deploying it into production to maximize conversions of visits to purchases. Which action is recommended to provide the HIGHEST accuracy model for the company's test and validation data?

Explanation

Applying L1 or L2 regularization and dropouts to the training is recommended to provide the highest accuracy model for the company's test and validation data. Regularization techniques such as L1 or L2 help to prevent overfitting by adding a penalty term to the loss function, which discourages the model from relying too heavily on any single feature. Dropouts, on the other hand, randomly deactivate a certain proportion of neurons during training, forcing the model to learn more robust and generalizable representations. These techniques can help improve the model's ability to generalize and perform well on unseen data, thus boosting the accuracy on the test and validation datasets.
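A Keras sketch showing both ideas together, L2 weight penalties plus dropout layers; the layer sizes and the class count variable are assumptions:

```python
# Add L2 weight penalties and dropout to reduce overfitting (sizes are illustrative).
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),                    # randomly drop half the units during training
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(n_classes, activation="softmax"),   # n_classes is assumed to exist
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```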

26. A Machine Learning Specialist is building a model to predict future employment rates based on a wide range of economic factors. While exploring the data, the Specialist notices that the magnitudes of the input features vary greatly. The Specialist does not want variables with a larger magnitude to dominate the model. What should the Specialist do to prepare the data for model training?

Explanation

The Specialist should apply normalization to ensure each field has a mean of 0 and a variance of 1. This will standardize the magnitude of the input features and prevent variables with larger magnitudes from dominating the model. Normalization is a common preprocessing step in machine learning to scale the data and make it easier for the model to learn patterns and make accurate predictions.
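A scikit-learn sketch of that scaling step; the training and test feature matrices are assumed to exist:

```python
# Standardize each feature to mean 0 and variance 1 (X_train and X_test are assumed).
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit statistics on the training data only
X_test_scaled = scaler.transform(X_test)         # reuse the same mean and variance
```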

27. A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided. Based on this information, which model would have the HIGHEST accuracy?

Explanation

Support vector machines (SVM) with a non-linear kernel are known for their ability to handle complex decision boundaries and capture non-linear relationships in the data. In this case, the data is not linearly separable, as there is overlap between the fraudulent and normal classes in both features. Therefore, using a non-linear kernel in the SVM model would likely result in the highest accuracy by effectively capturing the patterns and relationships in the data.

28. You work for the information security department of a major corporation. You have been asked to build a solution that detects web application log anomalies to protect your organization from fraudulent activity. The system needs to have near-real-time updates to the model where log entry data points dynamically change the underlying model as the log files are updated.   Which AWS service component do you use to implement the best algorithm based on these requirements?

Explanation

The best algorithm to implement the given requirements is the Random Cut Forest algorithm provided by Kinesis Data Analytics. This algorithm is specifically designed for anomaly detection in streaming data, making it suitable for detecting web application log anomalies in near-real-time. It dynamically updates the underlying model as log files are updated, ensuring that the system can adapt to changing data patterns. Therefore, Kinesis Data Analytics Random Cut Forest is the most appropriate AWS service component for this scenario.

29. A Machine Learning Specialist was given a dataset consisting of unlabeled data. The Specialist must create a model that can help the team classify the data into different buckets. What model should be used to complete this work?

Explanation

K-means clustering should be used to complete this work because it is a popular unsupervised learning algorithm that is used for clustering data. It is suitable for this task because the dataset consists of unlabeled data and the goal is to classify the data into different buckets. K-means clustering works by partitioning the data into k clusters based on their similarity. It iteratively assigns data points to the nearest cluster centroid and updates the centroids until convergence. This algorithm is widely used for data clustering and can help the Machine Learning Specialist in this task.
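A minimal scikit-learn sketch of clustering unlabeled records into buckets; the feature matrix X and the choice of five clusters are assumptions:

```python
# Cluster unlabeled records into k buckets with K-means (X and k are assumptions).
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
bucket_ids = kmeans.fit_predict(X)      # one cluster label per record
print(kmeans.cluster_centers_.shape)    # (5, number_of_features)
```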

30. A Machine Learning Specialist is working with a large cybersecurity company that manages security events in real time for companies around the world. The cybersecurity company wants to design a solution that will allow it to use machine learning to score malicious events as anomalies on the data as it is being ingested. The company also wants to be able to save the results in its data lake for later processing and analysis. What is the MOST efficient way to accomplish these tasks?

Explanation

The most efficient way to accomplish the tasks of ingesting and analyzing the data in real-time is by using Amazon Kinesis Data Firehose to ingest the data and Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. After detecting the anomalies, the results can be streamed to Amazon S3 using Kinesis Data Firehose. This approach allows for real-time analysis and storage of the results in a scalable and efficient manner.

31. A Machine Learning Engineer is creating a regression model for forecasting company revenue based on an internal dataset made up of past sales and other related data. What metric should the Engineer use to evaluate the ML model?

Explanation

The Machine Learning Engineer should use Root Mean Squared Error (RMSE) to evaluate the regression model. RMSE is a commonly used metric for regression models as it measures the average difference between the predicted and actual values. It provides a single value that represents the overall accuracy of the model's predictions. The lower the RMSE, the better the model's performance in accurately forecasting the company's revenue based on the internal dataset.
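RMSE is easy to compute directly; a NumPy example with made-up revenue figures:

```python
# Root Mean Squared Error for a revenue forecast (the numbers are illustrative).
import numpy as np

actual = np.array([120.0, 135.0, 150.0, 160.0])      # observed revenue
predicted = np.array([118.0, 140.0, 149.0, 155.0])   # model forecast

rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(rmse)   # about 3.71; lower is better
```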

32. Amazon Connect has recently been rolled out across a company as a contact call center. The solution has been configured to store voice call recordings on Amazon S3. The content of the voice calls is being analyzed for the incidents being discussed by the call operators. Amazon Transcribe is being used to convert the audio to text, and the output is stored on Amazon S3. Which approach will provide the information required for further analysis?

Explanation

Amazon Comprehend is a natural language processing (NLP) service that can be used to analyze text and extract key topics. In this scenario, the voice calls are being converted to text using Amazon Transcribe, and the transcribed files are stored on Amazon S3. By using Amazon Comprehend with the transcribed files, the company can extract the key topics discussed in the calls, providing the required information for further analysis. This approach leverages the capabilities of Amazon Comprehend to analyze the text data and identify important topics, making it the appropriate choice for this situation.

33. You wish to use a SageMaker notebook within a VPC. SageMaker notebook instances are Internet-enabled, creating a potential security hole in your VPC.   How would you use SageMaker within a VPC without opening up Internet access?

Explanation

To use SageMaker within a VPC without opening up Internet access, you need to disable direct Internet access when specifying the VPC for your notebook instance. Additionally, you should use VPC interface endpoints (PrivateLink) to allow the connections needed for training and hosting your model. To ensure outbound connections for training and hosting, you should modify your instance's security group. This approach ensures that the SageMaker notebook instance remains within the VPC and does not create a potential security hole by accessing the Internet.

34. During mini-batch training of a neural network for a classification problem, a Data Scientist notices that training accuracy oscillates. What is the MOST likely cause of this issue?

Explanation

During mini-batch training, the learning rate determines the step size at which the model's parameters are updated. If the learning rate is set too high, the model may overshoot the optimal parameters, causing the training accuracy to oscillate. This is because the model is unable to converge to the optimal solution and keeps bouncing back and forth. Therefore, the most likely cause of the issue is that the learning rate is very high.

35. The displayed graph is from a forecasting model for testing a time series. Considering the graph only, which conclusion should a Machine Learning Specialist make about the behavior of the model?

Explanation

The graph shows that the predicted values closely follow the actual values for both the trend and the seasonality. This indicates that the model is accurately capturing the patterns and fluctuations in the time series data, suggesting that it predicts both the trend and the seasonality well.

36. An online reseller has a large, multi-column dataset with one column missing 30% of its data. A Machine Learning Specialist believes that certain columns in the dataset could be used to reconstruct the missing data. Which reconstruction approach should the Specialist use to preserve the integrity of the dataset?

Explanation

The Specialist should use the multiple imputations approach to reconstruct the missing data. This approach involves creating multiple plausible values for the missing data based on the observed data and then analyzing the dataset multiple times with different imputed values. It helps to preserve the integrity of the dataset by accounting for the uncertainty associated with the missing data and providing more accurate estimates.
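One way to approximate this in Python is scikit-learn's IterativeImputer with posterior sampling, run several times; a sketch assuming a numeric feature matrix X with the missing column:

```python
# Sketch of multiple imputation: draw several plausible fills for the missing column.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# X is a numeric feature matrix with roughly 30% of one column missing (assumed to exist).
imputed_copies = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed_copies.append(imputer.fit_transform(X))

# Downstream analysis is repeated on each completed copy and the results are pooled;
# np.mean(imputed_copies, axis=0) gives one simple pooled view of the fills.
```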

37. A Machine Learning Specialist receives customer data for an online shopping website. The data includes demographics, past visits, and locality information. The Specialist must develop a machine learning approach to identify the customer shopping patterns, preferences and trends to enhance the website for better service and smart recommendations.   Which solution should the Specialist recommend?

Explanation

The Specialist should recommend collaborative filtering based on user interactions and correlations to identify patterns in the customer database. This approach is suitable for analyzing customer behavior and preferences based on their past interactions with the website. By analyzing the correlations between users and their interactions, the Specialist can identify patterns and make personalized recommendations to enhance the website's service.

38. A Machine Learning Specialist needs to create a data repository to hold a large amount of time-based training data for a new model. In the source system, new files are added every hour. Throughout a single 24-hour period, the volume of hourly updates will change significantly. The Specialist always wants to train on the last 24 hours of the data. Which type of data repository is the MOST cost-effective solution?

Explanation

An Amazon S3 data lake with hourly object prefixes is the most cost-effective solution because it allows for storing a large amount of time-based training data and easily adding new files every hour. With S3, the volume of updates can change significantly throughout a 24-hour period without incurring additional costs. Additionally, using object prefixes allows for efficient organization and retrieval of the data based on the time intervals required for training.

39. An ad tech company is using an XGBoost model to classify its clickstream data. The company's Data Scientist is asked to explain how the model works to a group of non-technical colleagues. What is a simple explanation the Data Scientist can provide?

Explanation

XGBoost is an Extreme Gradient Boosting algorithm that is optimized for boosted decision trees. It is a powerful and popular algorithm used for classification tasks. It works by iteratively combining multiple weak decision trees to create a strong predictive model. This algorithm is specifically designed to handle large datasets and provides high accuracy and efficiency.
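A short sketch of training a gradient-boosted tree classifier with the xgboost library; the clickstream feature arrays are assumed to exist:

```python
# Train a boosted ensemble of shallow decision trees on clickstream features.
from xgboost import XGBClassifier

# X_train, y_train, and X_test are assumed to exist (numeric clickstream features).
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)       # each new tree corrects the errors of the previous trees
click_probability = model.predict_proba(X_test)[:, 1]
```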

40. A healthcare company using the AWS Cloud has access to a variety of data types, including raw and preprocessed data. The company wants to start using this data for its ML pipeline, but also wants to make sure the data is highly available and located in a centralized repository. What approach should the company take to achieve the desired outcome?

Explanation

The company should create a data lake using Amazon S3 as the data storage layer. This approach allows the company to store both raw and preprocessed data in a centralized repository. Amazon S3 provides highly available and scalable storage for various data types, making it an ideal choice for the company's ML pipeline.

41. A Machine Learning Specialist has completed a proof of concept for a company using a small data sample and now the Specialist is ready to implement an end-to-end solution in AWS using Amazon SageMaker. The historical training data is stored in Amazon RDS.   Which approach should the Specialist use for training a model using that data?

Explanation

The Specialist should push the data from Microsoft SQL Server to Amazon S3 using an AWS Data Pipeline and provide the S3 location within the notebook. This approach allows for efficient and scalable storage of the historical training data in Amazon S3, which can then be easily accessed and used for training the model in Amazon SageMaker. It also ensures that the data is securely stored and can be easily shared and accessed by other services or users within the AWS environment.

42. A Data Scientist created a correlation matrix between nine variables and the target variable. The correlation coefficient between two of the numerical variables, variable 1 and variable 5, is -0.95. How should the Data Scientist interpret the correlation coefficient?

Explanation

The correlation coefficient of -0.95 indicates a strong negative correlation between variable 1 and variable 5. This means that as variable 1 increases, variable 5 tends to decrease. The closer the correlation coefficient is to -1, the stronger the negative correlation. Therefore, the Data Scientist can interpret that there is a strong inverse relationship between variable 1 and variable 5.
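A tiny NumPy illustration of a strongly negative correlation; the values are made up:

```python
# A correlation near -1: as one variable rises, the other falls (values are made up).
import numpy as np

variable_1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
variable_5 = np.array([9.8, 8.1, 6.2, 3.9, 2.1])

print(np.corrcoef(variable_1, variable_5)[0, 1])   # close to -1, a strong negative correlation
```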

43. A manufacturing company has a large set of labeled historical sales data. The manufacturer would like to predict how many units of a particular part should be produced each quarter. Which machine learning approach should be used to solve this problem?

Explanation

Linear regression is the appropriate machine learning approach to solve the problem of predicting the number of units of a particular part that should be produced each quarter. Linear regression is used for predicting a continuous numerical value, which aligns with the problem of predicting the quantity of units to be produced. Logistic regression, Random Cut Forest (RCF), and Principal Component Analysis (PCA) are not suitable in this case because they are used for different types of problems such as classification, anomaly detection, and dimensionality reduction, respectively.

44. A real estate company wants to provide its customers with a more accurate prediction of the final sale price for houses they are considering in various cities. To do this, the company wants to use a fully connected neural network trained on data from the previous ten years of home sales, as well as other features. What kind of machine learning problem does this situation represent?

Explanation

This situation represents a regression problem. Regression is a type of machine learning problem where the goal is to predict a continuous numerical value. In this case, the real estate company wants to predict the final sale price of houses, which is a continuous variable. By using a fully connected neural network trained on previous home sales data, the company can make more accurate predictions for their customers.

45. A log analytics company wants to provide a history of Amazon SageMaker API calls made on its client's account for security analysis and operational troubleshooting purposes. What must be done in the client's account to ensure that the company can analyze the API calls?

Explanation

Enabling AWS CloudTrail is necessary to ensure that the log analytics company can analyze the API calls made on the client's account. AWS CloudTrail provides a comprehensive history of API calls made within an AWS account, including Amazon SageMaker API calls. This allows for security analysis and operational troubleshooting purposes, as requested by the log analytics company. IAM roles and permissions, as well as CloudWatch logs, may be relevant for other purposes but are not specifically mentioned in relation to analyzing API calls. Using the Amazon SageMaker SDK and calling the 'sagemaker_history()' function is not mentioned as a requirement for this specific task.

46. A Machine Learning Specialist is building a logistic regression model that will predict whether or not a person will order a pizza. The Specialist is trying to build the optimal model with an ideal classification threshold.   What model evaluation technique should the Specialist use to understand how different classification thresholds will impact the model's performance?

Explanation

The Specialist should use the Receiver Operating Characteristic (ROC) curve to understand how different classification thresholds will impact the model's performance. The ROC curve is a graphical representation that shows the trade-off between the true positive rate and the false positive rate at various classification thresholds. By analyzing the curve, the Specialist can determine the optimal threshold that balances the true positive rate and the false positive rate, and therefore maximize the model's performance.
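A scikit-learn sketch of sweeping thresholds with roc_curve; the labels and predicted probabilities are illustrative:

```python
# Inspect true and false positive rates across classification thresholds.
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_prob = [0.2, 0.4, 0.65, 0.8, 0.1, 0.9, 0.55, 0.3]   # predicted P(orders pizza)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  TPR={t:.2f}  FPR={f:.2f}")
# Pick the threshold that gives the desired balance of true vs. false positives.
```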

47. A Data Scientist wants to include "month" as a categorical column in a training dataset for an ML model that is being built. However, the ML algorithm gives an error when the column is added to the training data. What should the Data Scientist do to add this column?

Explanation

The Data Scientist should convert the "month" column to 12 different columns, one for each month, by using one-hot encoding. This is because the ML algorithm gives an error when the column is added to the training data, indicating that the algorithm cannot handle categorical data directly. One-hot encoding will create binary columns for each month, allowing the ML model to interpret and utilize the information effectively.

48. A Data Scientist at a waste recycling company trained a CNN model to classify waste at the company's sites. Incoming waste was classified as either trash, compost, or recyclable to make it easier for the machines to split the incoming waste into the appropriate bins. During model testing, the F1 score was 0.918. The company's senior leadership originally asked the Data Scientist to reach an F1 score of at least 0.95. What should the Data Scientist do to improve this score without spending too much time optimizing the model?

Explanation

The Data Scientist should use Amazon SageMaker tuning jobs to tune the hyperparameters used. This technique allows for automatic hyperparameter optimization, which can help improve the model's performance without spending too much time manually optimizing the model. By tuning the hyperparameters, the Data Scientist can potentially find the best combination of settings that will result in a higher F1 score.

49. A video streaming company wants to analyze its VPC flow logs to build a real-time anomaly detection pipeline. The pipeline must be minimally managed and enable the business to build a near real-time dashboard. What combination of AWS service and algorithm can the company use for this pipeline?

Explanation

The company can use Kinesis Data Analytics with RandomCutForest for this pipeline. Kinesis Data Analytics is a fully managed service that can process and analyze streaming data in real-time. RandomCutForest is an algorithm provided by Amazon SageMaker that can be used for anomaly detection. By combining these two services, the company can build a minimally managed pipeline to analyze VPC flow logs and generate real-time insights for a dashboard.

50. A manufacturing company asks its Machine Learning Specialist to develop a model that classifies defective parts into one of eight defect types. The company has provided roughly 100,000 images per defect type for training. During the initial training of the image classification model, the Specialist notices that the validation accuracy is 80%, while the training accuracy is 90%. It is known that human-level performance for this type of image classification is around 90%. What should the Specialist consider to fix this issue?

Explanation

The Specialist should consider using some form of regularization to fix the issue. Regularization techniques such as L1 or L2 regularization can help prevent overfitting, which is indicated by the higher training accuracy compared to the validation accuracy. By adding regularization, the model's complexity is reduced, preventing it from memorizing the training data and improving its generalization ability. This can help bring the validation accuracy closer to the human-level performance of 90%.

51. A Machine Learning Specialist prepared the following graph displaying the results of k-means for k = [1:10]. Considering the graph, what is a reasonable selection for the optimal choice of k?

Explanation

Based on the graph, we can observe that the decrease in the sum of squared errors (SSE) is significant until k=4, after which the decrease becomes less pronounced. This suggests that k=4 is a reasonable selection for the optimal choice of k, as it provides a good balance between minimizing the SSE and avoiding overfitting.

52. A Data Engineer needs to build a model using a dataset containing customer credit card information.   How can the Data Engineer ensure the data remains encrypted and the credit card information is secure?

Explanation

The correct answer is to use AWS KMS to encrypt the data on Amazon S3 and Amazon SageMaker, and redact the credit card numbers from the customer data with AWS Glue. This solution ensures that the data is encrypted both at rest and in transit, using AWS KMS for encryption. Additionally, AWS Glue can be used to remove or redact the credit card numbers from the dataset, further enhancing the security of the data.

53. A team of Data Scientists wants to use Amazon SageMaker training jobs to run two different versions of the same model in parallel to compare the long-term effectiveness of the different versions in reaching the related business outcome. How should the team deploy these two model versions with minimum management?

Explanation

The team should create an endpoint configuration with production variants for the two models with equal weights. This means that both models will be deployed and used simultaneously, with an equal amount of traffic being directed to each model. By doing this, the team can compare the long-term effectiveness of the two versions in reaching the related business outcome. This approach requires minimum management as the models are deployed and managed automatically by Amazon SageMaker.

54. A large JSON dataset for a project has been uploaded to a private Amazon S3 bucket. The Machine Learning Specialist wants to securely access and explore the data from an Amazon SageMaker notebook instance. A new VPC was created and assigned to the Specialist.   How can the privacy and integrity of the data stored in Amazon S3 be maintained while granting access to the Specialist for analysis?

Explanation

The correct answer is to launch the SageMaker notebook instance within the VPC and create an S3 VPC endpoint for the notebook to access the data, and define a custom S3 bucket policy to only allow requests from the VPC to access the S3 bucket. This ensures that the data stored in Amazon S3 remains private and can only be accessed by the Specialist through the VPC and the designated notebook instance. The S3 VPC endpoint establishes a private connection between the VPC and S3, eliminating the need for internet access. The custom S3 bucket policy further restricts access to the bucket, ensuring the integrity of the data.

55. A city wants to monitor its air quality to address the consequences of air pollution. A Machine Learning Specialist needs to forecast the air quality in parts per million of contaminants for the next 2 days in the city. As this is a prototype, only daily data from the last year is available. Which model is MOST likely to provide the best results in Amazon SageMaker?

Explanation

The Amazon SageMaker Linear Learner algorithm is most likely to provide the best results in this scenario. The task is to forecast air quality based on historical data, which is a regression problem. The Linear Learner algorithm is designed for regression tasks and can effectively learn patterns and make predictions based on the given time series data. Using a regressor predictor_type will allow the algorithm to accurately forecast the air quality in parts per million of contaminants for the next 2 days. The other options, such as k-Nearest-Neighbors and Random Cut Forest, may not be as suitable for this specific task.

56. A Machine Learning Specialist wants to determine the appropriate SageMaker Variant Invocations Per Instance setting for an endpoint automatic scaling configuration. The Specialist has performed a load test on a single instance and determined that peak requests per second (RPS) without service degradation is about 20 RPS. As this is the first deployment, the Specialist intends to set the invocation safety factor to 0.5.   Based on the stated parameters and given that the invocations per instance setting is measured on a per-minute basis, what should the Specialist set as the SageMaker Variant Invocations Per Instance setting?

Explanation

The Specialist wants to determine the appropriate SageMaker Variant Invocations Per Instance setting for endpoint automatic scaling. The load test showed that the peak requests per second (RPS) without service degradation is 20 RPS. Because the invocations-per-instance setting is measured on a per-minute basis, the peak RPS is multiplied by 60 to convert it to requests per minute, and then multiplied by the invocation safety factor (0.5). Therefore, the Specialist should set the SageMaker Variant Invocations Per Instance setting to 20 * 60 * 0.5 = 600.
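The arithmetic, spelled out with the values from the question:

```python
# SageMakerVariantInvocationsPerInstance = peak RPS * 60 seconds * safety factor
peak_rps = 20
safety_factor = 0.5

invocations_per_instance = peak_rps * 60 * safety_factor
print(invocations_per_instance)   # 600.0 invocations per minute per instance
```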

57. A video streaming company wants to create a searchable video library that provides a personalized searching experience and automated content moderation for its users, so that when the users search for a keyword, they get all the videos that map to that keyword. The company wants to do this with minimal cost and limited need for management. What approach should the company take to building this solution?

Explanation

The company should use Amazon Rekognition Video to extract metadata from the videos. Amazon Rekognition Video is a service that can analyze and extract meaningful metadata from videos, such as objects, scenes, and faces. By using this service, the company can automatically extract relevant information from the videos, which can then be used for creating a searchable video library. This approach would require minimal cost and limited management, as the extraction of metadata can be automated using Amazon Rekognition Video.

58. A Machine Learning Specialist is training a model to identify the make and model of vehicles in images. The Specialist wants to use transfer learning and an existing model trained on images of general objects. The Specialist collated a large custom dataset of pictures containing different vehicle makes and models.   What should the Specialist do to initialize the model to re-train it with the custom data?

Explanation

To initialize the model and re-train it with the custom data, the Machine Learning Specialist should use the option of initializing the model with pre-trained weights in all layers and replacing the last fully connected layer. This approach allows the Specialist to leverage the knowledge and features learned by the existing model trained on general objects. By replacing the last fully connected layer, the model can be adapted to the specific task of identifying the make and model of vehicles using the custom dataset.
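A hedged Keras sketch of that initialization: load a pretrained backbone and replace only the final classification layer. The choice of ResNet50, the input size, and the class count are assumptions for illustration:

```python
# Transfer learning: keep pretrained weights, swap the last fully connected layer.
import tensorflow as tf

num_vehicle_classes = 196          # assumption: number of make/model classes

base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3))     # pretrained on images of general objects

new_head = tf.keras.layers.Dense(num_vehicle_classes, activation="softmax")
model = tf.keras.Sequential([base, new_head])

# Optionally start with the backbone frozen, then unfreeze layers for fine-tuning later.
base.trainable = False
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```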

59. A Machine Learning Specialist is building a supervised model that will evaluate customers' satisfaction with their mobile phone service based on recent usage. The model's output should infer whether or not a customer is likely to switch to a competitor in the next 30 days.   Which of the following modeling techniques should the Specialist use?

Explanation

The Machine Learning Specialist should use binary classification for this task because the goal is to predict whether or not a customer is likely to switch to a competitor in the next 30 days. Binary classification is the appropriate modeling technique for predicting a binary outcome, in this case, whether a customer will switch or not. Time-series prediction would be more suitable if the goal was to predict future usage patterns, while anomaly detection is used to identify unusual or abnormal behavior. Regression is used for predicting continuous numerical values, which is not the case here.

60. Given the following confusion matrix for a movie classification model, what is the true class frequency for Romance and the predicted class frequency for Adventure?

61. A Machine Learning Specialist built an image classification deep learning model. However, the Specialist ran into an overfitting problem in which the training and testing accuracies were 99% and 75%, respectively. How should the Specialist address this issue, and what is the reason behind it?

Explanation

The overfitting problem occurs when a model performs well on the training data but poorly on the testing data, indicating that the model is not generalized enough. Increasing the dropout rate at the flatten layer helps to regularize the model by randomly dropping out some of the neurons during training, preventing over-reliance on specific features and improving generalization. Therefore, increasing the dropout rate is the appropriate solution to address the overfitting issue in this scenario.

62. A monitoring service generates 1 TB of scale metrics record data every minute. A Research team performs queries on this data using Amazon Athena. The queries run slowly due to the large volume of data, and the team requires better performance.   How should the records be stored in Amazon S3 to improve query performance?

Explanation

Storing the records in Amazon S3 in Parquet file format can improve query performance. Parquet is a columnar storage file format that is optimized for query performance. It uses compression techniques and column-level encoding to reduce the amount of data that needs to be read from disk during queries. This can significantly improve the query execution time, especially for large volumes of data like in this scenario where 1 TB of data is generated every minute.

63. Which probability distribution would describe the likelihood of flipping a coin "heads"?  

Explanation

The likelihood of flipping a coin "heads" can be described by the Binomial Distribution. This distribution is used when there are two possible outcomes (in this case, heads or tails) and each flip is independent. The Binomial Distribution calculates the probability of a certain number of successes (in this case, heads) in a fixed number of trials (the number of times the coin is flipped).
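A quick SciPy illustration of the binomial distribution for a fair coin; the number of trials is chosen arbitrarily for the example:

```python
# Probability of k heads in 10 fair coin flips, modeled with a Binomial(n=10, p=0.5).
from scipy.stats import binom

n, p = 10, 0.5
for k in (0, 5, 10):
    print(f"P({k} heads out of {n}) = {binom.pmf(k, n, p):.4f}")
# A single flip is the special case n = 1 (a Bernoulli trial).
```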

64. While working on a neural network project, a Machine Learning Specialist discovers that some features in the data have very high magnitude resulting in this data being weighted more in the cost function.   What should the Specialist do to ensure better convergence during backpropagation?

Explanation

Data normalization is the process of scaling the data to a standard range. In this case, the high magnitude of some features can cause the neural network to give more importance to those features, leading to slower convergence during backpropagation. By normalizing the data, the features will be on a similar scale, allowing the neural network to learn more effectively and converge faster. This helps to prevent any one feature from dominating the cost function and ensures better convergence during backpropagation.
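
A minimal sketch of this idea with scikit-learn, using made-up features on very different scales; the scaler is fit on the training data, and the same fitted scaler would then be applied to validation and test data.

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X_train = np.array([[50_000.0, 25.0],    # e.g. income vs. age (hypothetical values)
                        [82_000.0, 47.0],
                        [61_000.0, 33.0]])

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)   # each column now has mean ~0 and unit variance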

65. A Machine Learning Specialist is assigned a TensorFlow project using Amazon SageMaker for training, and needs to continue working for an extended period with no Wi-Fi access.   Which approach should the Specialist use to continue working?

Explanation

The Specialist should download the TensorFlow Docker container used in Amazon SageMaker from GitHub to their local environment and use the Amazon SageMaker Python SDK to test the code. This approach allows the Specialist to have a local copy of the SageMaker environment, including the necessary dependencies and libraries, without requiring an internet connection. They can continue working on the project and test the code using the SageMaker Python SDK, ensuring compatibility and consistency with the SageMaker environment.

66. A Machine Learning Specialist needs to be able to ingest streaming data and store it in Apache Parquet files for exploration and analysis.   Which of the following services would both ingest and store this data in the correct format?

Explanation

Amazon Kinesis Data Firehose is the correct answer because it is a fully managed service that can ingest streaming data and store it in Apache Parquet files. It can automatically convert the incoming data to the desired format and deliver it to various destinations, including Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service. This makes it suitable for exploration and analysis of the data.

67. A Machine Learning Specialist is evaluating an ML model using a custom Deep Learning Amazon Machine Image (AMI) with Anaconda installed to run workloads through the terminal. Unfortunately, the ML Specialist does not have any experience with the Deep Learning AMI and wants to log into the instance and create an ipython notebook (*.ipynb), but cannot access the notebook interface. After creating the AMI instance, what steps should the ML Specialist take to create a notebook?

Explanation

To create a notebook, the ML Specialist should SSH into the Deep Learning AMI instance with port forwarding at port 8888. They should then start a Jupyter notebook application, which will provide them with the notebook interface. From there, they can create a new ipython notebook.

68. A data and analytics company is expanding its platform on AWS. The company wants to build a serverless product that preprocesses large structured data while minimizing the cost for data storage and compute. The company also wants to integrate the new product with an existing ML product that uses Amazon EMR with Spark. What solution should the company use to build this new product?

Explanation

The company should use AWS Glue for data preprocessing because it is a serverless data integration service that can efficiently process large structured data. Saving the data in Amazon S3 in Parquet format is recommended because Parquet is a columnar storage format that provides better compression and query performance compared to CSV format. This solution will help minimize the cost for data storage and compute while integrating with the existing ML product that uses Amazon EMR with Spark.

69. A manufacturing company wants to increase the longevity of its factory machines by predicting when a machine part is about to stop working, jeopardizing the health of the machine. The company's team of Data Scientists will build an ML model to accomplish this goal. The model will be trained on data made up of consumption metrics from similar factory machines, and will span a time frame from one hour before a machine part broke down to five minutes after the part degraded. What kind of machine learning algorithm should the company use to build this model?

Explanation

The company should use Amazon SageMaker DeepAR algorithm to build the model. DeepAR is a time series forecasting algorithm that is specifically designed for predicting future values based on historical data. In this case, the algorithm can be trained on the consumption metrics of similar factory machines to predict when a machine part is about to stop working. The algorithm's ability to handle time series data and capture temporal dependencies makes it suitable for this task.

70. A manufacturer of car engines collects data from cars as they are being driven. The data collected includes timestamp, engine temperature, rotations per minute (RPM), and other sensor readings. The company wants to predict when an engine is going to have a problem so it can notify drivers in advance to get engine maintenance. The engine data is loaded into a data lake for training.   Which is the MOST suitable predictive model that can be deployed into production?

Explanation

The most suitable predictive model that can be deployed into production is to add labels over time to indicate which engine faults occur at what time in the future, turning this into a supervised learning problem. Then, a recurrent neural network (RNN) can be used to train the model to recognize when an engine might need maintenance for a certain fault. This approach allows the model to learn patterns and correlations in the data over time, making it effective for predicting engine problems based on the collected sensor readings.

71. A Machine Learning Specialist is creating a new natural language processing application that processes a dataset comprised of 1 million sentences. The aim is to then run Word2Vec to generate embeddings of the sentences and enable different types of predictions - Here is an example from the dataset;   "The quck BROWN FOX jumps over the lazy dog "   Which of the following are the operations the Specialist needs to perform to correctly sanitize and prepare the data in a repeatable manner? (Select THREE)

Explanation

The Specialist needs to perform three operations to correctly sanitize and prepare the data in a repeatable manner. Firstly, normalizing all words by making the sentence lowercase ensures that the capitalization of words does not affect the analysis. Secondly, removing stop words using an English stopword dictionary helps to eliminate commonly used words that do not carry much meaning. Lastly, tokenizing the sentence into words breaks it down into individual units, which is necessary for further processing and analysis.
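
A minimal sketch of the three steps in Python, using a small illustrative stop-word set rather than a full English dictionary (in practice a library such as NLTK would supply the stop words):

    STOP_WORDS = {"the", "over", "a", "an", "and", "of"}   # illustrative subset only

    def sanitize(sentence):
        tokens = sentence.lower().split()          # normalize case, then tokenize on whitespace
        return [t for t in tokens if t not in STOP_WORDS]

    print(sanitize("The quck BROWN FOX jumps over the lazy dog"))
    # ['quck', 'brown', 'fox', 'jumps', 'lazy', 'dog']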

72. A large consumer goods manufacturer has the following products on sale;  
  • 34 different toothpaste variants
  • 48 different toothbrush variants
  • 43 different mouthwash variants
  The entire sales history of all these products is available in Amazon S3. Currently, the company is using custom-built autoregressive integrated moving average (ARIMA) models to forecast demand for these products. The company wants to predict the demand for a new product that will soon be launched.   Which solution should a Machine Learning Specialist apply?

Explanation

The correct solution is to train an Amazon SageMaker DeepAR algorithm to forecast demand for the new product. The DeepAR algorithm is specifically designed for time series forecasting, which is the case for predicting demand for consumer goods. It can handle multiple related time series and capture complex patterns and dependencies in the data. Therefore, it is the most suitable solution for forecasting demand for the new product based on the sales history of the existing products.

73. A security and networking company wants to use ML to flag certain IP addresses that have been known to send spam and phishing information. The company wants to build an ML model based on previous user feedback indicating whether specific IP addresses have been connected to a website designed for spam and phishing. What is the simplest solution that the company can implement?

Explanation

A rule-based solution should be used instead of ML because the company already has specific criteria (previous user feedback) to identify IP addresses connected to spam and phishing websites. ML models require training data, which may not be readily available in this case. By using a rule-based solution, the company can set predefined rules based on the feedback to flag the IP addresses without the need for ML algorithms.

74. A Machine Learning specialist is examining the root cause for underperformance of a regression model and has a hunch that it is consistently overestimating the outcome.   Which metrics should he track on a chart to help identify any pattern of model overestimation?

Explanation

The residuals are the differences between the predicted values and the actual values in a regression model. By tracking the residuals on a chart, the Machine Learning specialist can identify any consistent pattern of overestimation. If the residuals consistently have positive values, it indicates that the model is consistently overestimating the outcome. Therefore, tracking the residuals can help in identifying the root cause of the underperformance of the regression model.

75. A Machine Learning Specialist trained a regression model, but the first iteration needs optimizing. The Specialist needs to understand whether the model is more frequently overestimating or underestimating the target.   What option can the Specialist use to determine whether it is overestimating or underestimating the target value?

Explanation

Residual plots can be used to determine whether a regression model is overestimating or underestimating the target value. Residual plots show the difference between the predicted and actual values of the target variable. If the majority of the residuals are positive, it indicates that the model is overestimating the target. Conversely, if the majority of the residuals are negative, it indicates that the model is underestimating the target. Therefore, by analyzing the pattern of residuals in the plot, the Machine Learning Specialist can understand whether the model is more frequently overestimating or underestimating the target.
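
As an illustration with made-up numbers, the check can be as simple as plotting the prediction errors; here the error is defined as predicted minus actual, so values consistently above zero indicate overestimation, matching the description above.

    import numpy as np
    import matplotlib.pyplot as plt

    y_true = np.array([10.0, 12.5, 9.0, 14.0, 11.0])   # hypothetical actual values
    y_pred = np.array([11.2, 13.0, 9.8, 15.1, 11.9])   # hypothetical predictions

    errors = y_pred - y_true                 # positive = overestimate, negative = underestimate
    plt.scatter(y_pred, errors)
    plt.axhline(0, color="grey", linestyle="--")
    plt.xlabel("Predicted value")
    plt.ylabel("Prediction error (predicted - actual)")
    plt.show()

    print("Mean error:", errors.mean())      # positive here, i.e. overestimation on average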

76. In AWS SageMaker, what feature allows you to distribute machine learning model training across multiple instances and is designed for large-scale distributed training?

Explanation

SageMaker Distributed Training is a feature within Amazon SageMaker that enables large-scale distributed training of machine learning models across multiple instances. This advanced capability is particularly useful for handling large datasets and complex model training scenarios, making it an essential tool for scaling machine learning workflows in AWS.

77. A healthcare organization has an application that takes in sensitive user data. This data is encrypted at rest and stored in an Amazon S3 bucket using customer-managed encryption with AWS Key Management Service (AWS KMS). A Data Scientist in the organization wants to use this encrypted data as features in an Amazon SageMaker training job. However, the following error continues to occur: "Data download failed." What should the Data Scientist do to fix this issue?

Explanation

The Data Scientist should make sure that the AWS Identity and Access Management (IAM) role used for Amazon S3 access has the necessary permissions to encrypt and decrypt the data with the AWS KMS key. This is because the data is encrypted at rest and stored in an Amazon S3 bucket using customer-managed encryption with AWS KMS. Without the proper permissions, the Data Scientist will not be able to access and download the encrypted data for use in the Amazon SageMaker training job, resulting in the "Data download failed" error.

78. A Machine Learning Specialist is implementing a full Bayesian network on a dataset that describes public transit in New York City. One of the random variables is discrete, and represents the number of minutes New Yorkers wait for a bus given that the buses cycle every 10 minutes, with a mean of 3 minutes.   Which prior probability distribution should the ML Specialist use for this variable?  

Explanation

The ML Specialist should use a Poisson distribution for this variable because it represents the number of events (minutes of waiting) occurring in a fixed interval (bus cycle every 10 minutes) with a known average rate (mean of 3 minutes). The Poisson distribution is commonly used for modeling the number of events in a fixed interval, making it the appropriate choice in this scenario.
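
A short worked sketch with scipy, treating the waiting time as a whole-minute count with a mean of 3:

    from scipy.stats import poisson

    p_exactly_2 = poisson.pmf(2, mu=3)    # P(wait is exactly 2 minutes)
    p_at_most_5 = poisson.cdf(5, mu=3)    # P(wait is 5 minutes or less)
    print(round(p_exactly_2, 3), round(p_at_most_5, 3))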

79. You work for a manufacturing company that produces retail apparel, such as shoes, dresses, blouses, etc. Your head of manufacturing has asked you to use your data science skills to determine which product, among a list of potential next products, your company should invest its resources to produce. You decide you need to predict the sales levels of each of the potential next products and select the one with the highest predicted purchase rate.   Which type of machine learning approach should you use?

Explanation

The goal is to predict the sales levels of each potential next product and select the one with the highest predicted purchase rate. This is a regression problem because the target variable (sales levels) is a continuous variable. Linear regression is an appropriate machine learning approach for regression problems as it aims to model the relationship between the independent variables (such as product features, marketing efforts, etc.) and the dependent variable (sales levels) through a linear equation. By using linear regression, the company can make predictions on the sales levels of each potential product and choose the one with the highest predicted sales.

80. A Machine Learning Specialist is packaging a custom ResNet model into a Docker container so the company can leverage Amazon SageMaker for training. The Specialist is using Amazon EC2 P3 instances to train the model and needs to properly configure the Docker container to leverage the NVIDIA GPUs.   What does the Specialist need to do?

Explanation

The Specialist needs to build the Docker container to be NVIDIA-Docker compatible. This is because the Specialist is using Amazon EC2 P3 instances, which have NVIDIA GPUs. By making the Docker container NVIDIA-Docker compatible, the Specialist ensures that the container can properly leverage the NVIDIA GPUs for training the model. This compatibility allows the container to access the necessary NVIDIA drivers and libraries for GPU acceleration.

81. A Machine Learning Specialist is working with a large company to leverage machine learning within its products. The company wants to group its customers into categories based on which customers will and will not churn within the next 6 months. The company has labeled the data available to the Specialist.   Which machine learning model type should the Specialist use to accomplish this task?

Explanation

The Specialist should use a Classification model to accomplish this task. Classification models are used to predict the category or class that an observation belongs to based on its features. In this case, the Specialist wants to group customers into categories of churn or no churn, which is a binary classification problem. By training a classification model on the labeled data, the Specialist can predict whether a customer will churn or not based on their features.

82. A retail company intends to use machine learning to categorize new products. A labeled dataset of current products was provided to the Data Science team. The dataset includes 1,200 products. The labeled dataset has 15 features for each product such as title, dimensions, weight, and price. Each product is labeled as belonging to one of six categories such as books, games, electronics, and movies.   Which model should be used for categorizing new products using the provided dataset for training?

Explanation

An XGBoost model with the objective parameter set to multi:softmax should be used for categorizing new products using the provided dataset for training. XGBoost is a powerful machine learning algorithm that can handle multi-class classification problems effectively. By setting the objective parameter to multi:softmax, the model is trained as a multi-class classifier and outputs the predicted category for each product, allowing it to classify new products accurately. This model is suitable for the given dataset as it has multiple features and six different categories to classify the products into.
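
A hedged sketch of this setup with the open-source xgboost package and synthetic data standing in for the 1,200-product dataset; the SageMaker built-in algorithm accepts the same objective and num_class hyperparameters.

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1200, 15))        # 15 numeric features (synthetic)
    y = rng.integers(0, 6, size=1200)      # 6 product categories (synthetic labels)

    dtrain = xgb.DMatrix(X, label=y)
    params = {"objective": "multi:softmax", "num_class": 6, "max_depth": 4, "eta": 0.2}
    booster = xgb.train(params, dtrain, num_boost_round=50)

    predicted_category = booster.predict(xgb.DMatrix(X[:5]))   # one class index per row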

83. A Machine Learning Specialist has built a model using Amazon SageMaker built-in algorithms and is not getting expected accurate results. The Specialist wants to use hyperparameter optimization to increase the model's accuracy.   Which method is the MOST repeatable and requires the LEAST amount of effort to achieve this?

Explanation

Creating a hyperparameter tuning job and setting the accuracy as an objective metric is the most repeatable and requires the least amount of effort to achieve the desired results. This approach automates the process of searching for the best hyperparameters by iteratively training and evaluating models with different hyperparameter configurations. By setting the accuracy as the objective metric, the tuning job will optimize the hyperparameters to maximize the model's accuracy. This method eliminates the need for manual trial and error and ensures a systematic and efficient approach to improving the model's accuracy.

84. A company's Machine Learning Specialist needs to improve the training speed of a time-series forecasting model using TensorFlow. The training is currently implemented on a single-GPU machine and takes approximately 23 hours to complete. The training needs to be run daily.   The model accuracy is acceptable, but the company anticipates a continuous increase in the size of the training data and a need to update the model on an hourly, rather than a daily, basis. The company also wants to minimize coding effort and infrastructure changes.   What should the Machine Learning Specialist do to the training solution to allow it to scale for future demand?

Explanation

To scale the training solution for future demand, the Machine Learning Specialist should change the TensorFlow code to implement a Horovod distributed framework supported by Amazon SageMaker. This will allow the training to be parallelized across multiple machines, enabling faster training and accommodating the anticipated increase in training data size. By using Horovod and Amazon SageMaker, the Specialist can achieve the business goals without making significant changes to the coding effort and infrastructure.

85. A Machine Learning Specialist is working for a credit card processing company and receives an unbalanced dataset containing credit card transactions. It contains 99,000 valid transactions and 1,000 fraudulent transactions. The Specialist is asked to score a model that was run against the dataset. The Specialist has been advised that identifying valid transactions is equally as important as identifying fraudulent transactions.   What metric is BEST suited to score the model?

Explanation

The best metric to score the model in this scenario is the Area Under the ROC Curve (AUC). AUC is a suitable metric for evaluating models on unbalanced datasets because it measures the model's ability to distinguish between the positive and negative classes. In this case, the positive class represents the fraudulent transactions, and the negative class represents the valid transactions. AUC takes into account both true positive rate (sensitivity) and false positive rate (1-specificity), which is important when identifying both fraudulent and valid transactions is equally important.
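
For illustration, scikit-learn computes AUC from the true labels and the model's scores; the values below are hypothetical.

    from sklearn.metrics import roc_auc_score

    y_true   = [0, 0, 0, 0, 1, 1, 0, 1]                       # 1 = fraudulent, 0 = valid
    y_scores = [0.1, 0.3, 0.2, 0.05, 0.9, 0.65, 0.4, 0.8]     # model probabilities

    print(roc_auc_score(y_true, y_scores))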

86. An online news organization wants to expand its reach globally by translating some of its most commonly read articles into different languages using ML. The organization's data science team is gathering all the news articles that they have published in both English and at least one other language. They want to use this data to create one machine learning model for each non-English language that the organization is targeting. The models should only require minimum management. What approach should the team use to build these models?

Explanation

The team should use Amazon SageMaker Object2Vec to create a vector and then use the Amazon SageMaker built-in Sequence to Sequence model (Seq2Seq). This approach allows them to create a vector representation of the news articles using Object2Vec, which can capture the semantic meaning of the text. Then, they can use the Seq2Seq model to train a machine learning model for each non-English language. The built-in Seq2Seq model in Amazon SageMaker simplifies the management of the models as it provides pre-built functionality for sequence-to-sequence tasks, reducing the need for additional management efforts.

87. A transportation company currently uses Amazon EMR with Apache Spark for some of its data transformation workloads. It transforms columns of geographical data (like latitudes and longitudes) and adds columns to segment the data into different clusters per city to attain additional features for the k-nearest neighbors algorithm being used. The company wants less operational overhead for their transformation pipeline. They want a new solution that does not make significant changes to the current pipeline and only requires minimal management. What AWS services should the company use to build this new pipeline?

Explanation

The company should use AWS Glue to transform files because it is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. AWS Glue supports Apache Spark and can be used to transform the geographical data columns and add additional features for the k-nearest neighbors algorithm. Using Amazon S3 as the destination for the transformed data is a good choice because it is a scalable and cost-effective storage solution. This solution meets the company's requirements of minimal changes to the current pipeline and minimal management.

88. A Machine Learning Specialist works for a credit card processing company and needs to predict which transactions may be fraudulent in near-real time. Specifically, the Specialist must train a model that returns the probability that a given transaction may be fraudulent.   How should the Specialist frame this business problem?

Explanation

The Specialist should frame this business problem as binary classification because the goal is to predict whether a transaction is fraudulent or not. Binary classification is used when there are only two possible outcomes, in this case, fraudulent or not fraudulent.

89. Your marketing department wishes to understand how their products are being represented in the various social media services in which they have active content streams. They would like insights into the reception of a current product line so they can plan for the rollout of a new product in the line in the near future. You have been tasked with creating a service that organizes the social media content by sentiment across all languages so that your marketing department can determine how best to introduce the new product.   How would you quickly and most efficiently design and build a service for your marketing team that gives insight into the social media sentiment?

Explanation

The best way to quickly and efficiently design and build a service that gives insight into social media sentiment for the marketing team is to use Amazon Translate, Amazon Comprehend, Amazon Kinesis, Amazon Athena, and Amazon QuickSight. These services provide a comprehensive solution for natural language processing (NLP) and data analysis. Amazon Translate can be used to translate social media content into different languages, Amazon Comprehend can be used for sentiment analysis, Amazon Kinesis can be used for real-time data streaming, Amazon Athena can be used for querying and analyzing the data, and Amazon QuickSight can be used for visualizing the insights on a dashboard. This combination of services enables the marketing team to understand the sentiment of their products across different languages and make informed decisions for the roll out of a new product.
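
A hedged sketch of just the language-handling portion with boto3 (the Kinesis ingestion, Athena queries, and QuickSight dashboard are omitted); the region and post text are illustrative.

    import boto3

    translate = boto3.client("translate", region_name="us-east-1")
    comprehend = boto3.client("comprehend", region_name="us-east-1")

    post = "Ce produit est fantastique !"    # hypothetical social media post

    english = translate.translate_text(
        Text=post, SourceLanguageCode="auto", TargetLanguageCode="en"
    )["TranslatedText"]

    sentiment = comprehend.detect_sentiment(Text=english, LanguageCode="en")
    print(sentiment["Sentiment"], sentiment["SentimentScore"])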

90. A social networking organization wants to analyze all the comments and likes from its users to flag offensive language on the site. The organization's data science team wants to use a Long Short-term Memory (LSTM) architecture to classify the raw sentences from the comments into one of two categories: offensive and nonoffensive. What should the team do to prepare the data for the LSTM?

Explanation

To prepare the data for the LSTM, the team should vectorize the sentences by transforming them into numerical sequences. Additionally, padding should be applied to ensure that all sequences have the same length. This is important because LSTMs require fixed-length input. By vectorizing and padding the sentences, the data can be effectively processed by the LSTM model for classification.
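
A minimal sketch with the Keras preprocessing utilities (the comments and sequence length are made up): each word is mapped to an integer index, and every sequence is padded to the fixed length the LSTM expects.

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    comments = ["you are great", "this comment is offensive and rude"]   # hypothetical comments

    tokenizer = Tokenizer(num_words=10000)
    tokenizer.fit_on_texts(comments)
    sequences = tokenizer.texts_to_sequences(comments)    # lists of word indices

    padded = pad_sequences(sequences, maxlen=20, padding="post")
    print(padded.shape)    # (2, 20) -- fixed-length input for the LSTM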

91. An analytics company wants to use a fully managed service that automatically scales to handle the transfer of its Apache web logs, syslogs, text and videos on their webserver to Amazon S3 with minimum transformation. What service can be used for this process?

Explanation

Kinesis Firehose is the correct answer for this question. Kinesis Firehose is a fully managed service that automatically scales to handle the transfer of data, such as Apache web logs, syslogs, text, and videos, from various sources to Amazon S3. It requires minimum transformation, making it suitable for the given scenario where the analytics company wants to transfer their web logs, syslogs, text, and videos to Amazon S3 without extensive data manipulation.

92. A financial services company is building a robust serverless data lake on Amazon S3. The data lake should be flexible and meet the following requirements:
  • Support querying old and new data on Amazon S3 through Amazon Athena and Amazon Redshift Spectrum.
  • Support event-driven ETL pipelines.
  • Provide a quick and easy way to understand metadata.
Which approach meets these requirements?

Explanation

This approach meets the requirements because it utilizes AWS Glue, which is a fully managed extract, transform, and load (ETL) service. The AWS Glue crawler is used to automatically discover and catalog metadata about the data in the S3 data lake. An AWS Lambda function is used to trigger the AWS Glue ETL job, which allows for event-driven ETL pipelines. The AWS Glue Data Catalog is used to search and discover metadata, providing a quick and easy way to understand the data lake's metadata. This approach also aligns with the requirement of supporting querying old and new data through Amazon Athena and Amazon Redshift Spectrum.

93. You would like to draw inferences from a model deployed to Amazon SageMaker Hosting Services. If you are using AWS SDK for Python (Boto 3) library, which method would you call for inferences?

Explanation

When using the AWS SDK for Python (Boto 3) library, the method to call for inferences from a model deployed to Amazon SageMaker Hosting Services is "invoke_endpoint". This method is used to send a request to the endpoint and receive the inference response. It allows you to pass input data to the model and obtain the corresponding predictions or inferences.
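
For illustration, a minimal call with the low-level runtime client; the endpoint name and payload format are placeholders (CSV in this sketch).

    import boto3

    runtime = boto3.client("sagemaker-runtime")

    response = runtime.invoke_endpoint(
        EndpointName="my-example-endpoint",    # hypothetical endpoint name
        ContentType="text/csv",
        Body=b"4.9,3.0,1.4,0.2",
    )
    prediction = response["Body"].read().decode("utf-8")
    print(prediction)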

94. An insurance company is developing a new device for vehicles that uses a camera to observe drivers' behavior and alert them when they appear distracted. The company created approximately 10,000 training images in a controlled environment that a Machine Learning Specialist will use to train and evaluate machine learning models.   During the model evaluation the Specialist notices that the training error rate diminishes faster as the number of epochs increases and the model is not accurately inferring on the unseen test images.   Which of the following should be used to resolve this issue? (Select TWO)

Explanation

Performing data augmentation on the training data can help resolve the issue of the model not accurately inferring on unseen test images. Data augmentation involves creating new training data by applying various transformations to the existing training images, such as rotating, scaling, or flipping them. This can help the model generalize better and improve its performance on unseen data.

Adding L2 regularization to the model can also be helpful in resolving this issue. L2 regularization adds a penalty term to the loss function during training, which discourages the model from overfitting the training data. By reducing overfitting, the model is more likely to perform well on unseen test images.
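
A hedged Keras sketch showing both remedies together; the image size, augmentation ranges, and layer sizes are illustrative only, not taken from the question.

    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = layers.RandomFlip("horizontal")(inputs)        # data augmentation, active only during training
    x = layers.RandomRotation(0.1)(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(x)   # L2 penalty on the weights
    outputs = layers.Dense(2, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")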

95. You work for a retail firm that wishes to conduct a direct mail campaign to attract new customers. Your marketing manager wishes to get answers to questions that can be put into discrete categories, such as "using historical customer email campaign responses, should this customer receive an email from our current campaign?" You decide to use the SageMaker Linear Learner algorithm to build your model.   Which hyperparameter setting would you use to get the algorithm to produce discrete results?

Explanation

The correct answer is to set the predictor_type hyperparameter to binary_classifier. This is because a binary classifier is designed to produce discrete results by classifying data into two categories, such as yes or no, true or false. In this case, the goal is to determine whether a customer should receive an email from the current campaign or not, which is a binary classification problem. By setting the predictor_type hyperparameter to binary_classifier, the SageMaker Linear Learner algorithm will be trained to produce discrete results for this specific task.
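
A hedged sketch with the SageMaker Python SDK (v2-style calls); the IAM role, bucket, and instance type are placeholders.

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    container = image_uris.retrieve("linear-learner", session.boto_region_name)

    estimator = Estimator(
        image_uri=container,
        role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",   # placeholder role
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://example-bucket/output/",                    # placeholder bucket
        sagemaker_session=session,
    )
    # predictor_type=binary_classifier yields a discrete respond / do-not-respond label.
    estimator.set_hyperparameters(predictor_type="binary_classifier", epochs=10)
    # estimator.fit({"train": "s3://example-bucket/train/"})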

96. For the given confusion matrix, what is the recall and precision of the model?

Explanation

The recall of a model is the ratio of the true positive predictions to the sum of true positive and false negative predictions. In this case, the recall is 0.92, which means that the model has a high ability to correctly identify positive instances.

The precision of a model is the ratio of the true positive predictions to the sum of true positive and false positive predictions. In this case, the precision is 0.8, which means that the model has a moderate ability to avoid false positive predictions.

Therefore, the correct answer is Recall = 0.92 Precision = 0.8.

97. A Machine Learning Specialist is using Amazon SageMaker to host a model for a highly available customer-facing application. The Specialist has trained a new version of the model, validated it with historical data, and now wants to deploy it to production. To limit any risk of a negative customer experience, the Specialist wants to be able to monitor the model and roll it back, if needed.   What is the SIMPLEST approach with the LEAST risk to deploy the model and roll it back, if needed?

Explanation

The simplest approach with the least risk to deploy the model and roll it back, if needed, is to update the existing SageMaker endpoint to use a new configuration that is weighted to send 5% of the traffic to the new variant. This approach allows for gradual testing and monitoring of the new model's performance with a small percentage of traffic, minimizing the impact on customers. If the model does not perform as expected, traffic can be reverted to the last version by resetting the weights.

98. An interactive online dictionary wants to add a widget that displays words used in similar contexts. A Machine Learning Specialist is asked to provide word features for the downstream nearest neighbor model powering the widget.   What should the Specialist do to meet these requirements?

Explanation

The Specialist should download word embeddings pre-trained on a large corpus to meet the requirements. Word embeddings are dense vector representations of words that capture semantic relationships between words. By using pre-trained word embeddings, which have been trained on a large corpus of text data, the Specialist can leverage the learned semantic relationships to provide word features for the downstream nearest neighbor model. This will enable the widget to display words used in similar contexts based on the semantic similarities captured by the word embeddings.

99. A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences: 1. Please call the number below. 2. Please do not call us. What are the dimensions of the tf–idf matrix?

Explanation

There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result would be (2, 16). The phrases are "Please call the number below" and "Please do not call us." The unique unigrams are "Please," "call," "the," "number," "below," "do," "not," and "us." The unique bigrams are "Please call," "call the," "the number," "number below," "Please do," "do not," "not call," and "call us."
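
The shape can be verified with scikit-learn's TfidfVectorizer configured for unigrams and bigrams:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["Please call the number below.", "Please do not call us."]
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    matrix = vectorizer.fit_transform(corpus)
    print(matrix.shape)    # (2, 16): 2 documents x (8 unigrams + 8 bigrams)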

100. A data engineer needs to create a cost-effective data pipeline solution that ingests unstructured data from various sources and stores it for downstream analytics applications and ML. The solution should include a data store where the processed data is highly available for at least one year so that data analysts and data scientists can run analytics and ML workloads on the most recent data. For compliance reasons, the solution should include both processed and raw data. The raw data does not need to be accessed regularly, but when needed, should be accessible within 24 hours.    What solution should the data engineer deploy?

Explanation

The data engineer should deploy Amazon S3 Standard for the processed data that is within one year of processing. After one year, they should use Amazon S3 Glacier for the processed data. Additionally, they should use Amazon S3 Glacier Deep Archive for all raw data. This solution ensures that the processed data is highly available for at least one year, allowing data analysts and data scientists to run analytics and ML workloads on the most recent data. The use of Amazon S3 Glacier Deep Archive for raw data ensures compliance and accessibility within 24 hours when needed.

101. A Data Scientist at a retail company is using Amazon SageMaker to classify social media posts that mention the company into one of two categories: posts that require a response from the company, and posts that do not. The Data Scientist is using a training dataset of 10,000 posts, which contains the timestamp, author, and full text of each post. However, the Data Scientist is missing the target labels that are required for training. Which approach can the Data Scientist take to create valid target label data? (Select TWO.)

Explanation

The Data Scientist can ask the social media handling team to review each post using Amazon SageMaker GroundTruth and provide the label. This approach involves manual review and labeling of each post by the team, ensuring accurate target labels for training. Additionally, the Data Scientist can use Amazon Mechanical Turk to publish Human Intelligence Tasks that ask Turk workers to label the posts. This crowdsourcing approach allows for a larger pool of workers to label the posts, increasing efficiency and scalability in generating valid target label data.

102. A Data Scientist is working on an application that performs sentiment analysis. The validation accuracy is poor and the Data Scientist thinks that the cause may be a rich vocabulary and a low average frequency of words in the dataset.   Which tool should be used to improve the validation accuracy?

Explanation

The Data Scientist believes that the poor validation accuracy may be due to a rich vocabulary and low average frequency of words in the dataset. In order to improve the accuracy, they should use Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizers. TF-IDF is a technique that assigns weights to words based on their frequency in a document and their rarity in the entire dataset. By using TF-IDF vectorizers, the Data Scientist can give more importance to the words that are both frequent in a document and rare in the dataset, which can help improve the accuracy of the sentiment analysis application.

103. A Machine Learning Specialist working for an online fashion company wants to build a data ingestion solution for the company's Amazon S3-based data lake.   The Specialist wants to create a set of ingestion mechanisms that will enable future capabilities comprised of:  
  • Real-time analytics
  • Interactive analytics of historical data
  • Clickstream analytics
  • Product recommendations
  Which services should the Specialist use?

Explanation

The Specialist should use AWS Glue as the data catalog to manage the metadata of the data lake. They should use Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for real-time data insights, allowing them to process and analyze streaming data in real-time. They should also use Amazon Kinesis Data Firehose to deliver the clickstream data to Amazon ES for clickstream analytics. Lastly, they should use Amazon EMR to generate personalized product recommendations by processing and analyzing the data in the data lake.

104. A Machine Learning Specialist observes several performance problems with the training portion of a machine learning solution on Amazon SageMaker. The solution uses a large training dataset 2 TB in size and is using the SageMaker k-means algorithm. The observed issues include the unacceptable length of time it takes before the training job launches and poor I/O throughput while training the model.   What should the Specialist do to address the performance issues with the current solution?

Explanation

Setting the input mode for the training job to Pipe allows the data to be streamed directly from Amazon S3 to the training instances, which can significantly reduce the time it takes for the training job to launch. This can help address the observed issue of the unacceptable length of time before the training job launches.
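
A hedged sketch of requesting Pipe mode with the SageMaker Python SDK; the role, bucket, and k-means hyperparameter values are placeholders.

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    session = sagemaker.Session()
    container = image_uris.retrieve("kmeans", session.boto_region_name)

    estimator = Estimator(
        image_uri=container,
        role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",   # placeholder role
        instance_count=1,
        instance_type="ml.m5.2xlarge",
        input_mode="Pipe",                     # stream training data instead of downloading it first
        sagemaker_session=session,
    )
    estimator.set_hyperparameters(k=10, feature_dim=50)                # illustrative values
    train_input = TrainingInput("s3://example-bucket/train/", input_mode="Pipe")
    # estimator.fit({"train": train_input})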

105. An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code. How should the company set up this new pipeline?

Explanation

The company should use the Build Your Own Container (BYOC) Amazon SageMaker option. This option allows the company to create a new Docker container with the existing code and register it in Amazon Elastic Container Registry (ECR). By using this option, the company can port their training and inference code written in R to Amazon SageMaker with minimal changes. They can then run the training and inference jobs using this container, ensuring a seamless transition to the new ML pipeline.

106. A company is using Amazon Polly to translate plaintext documents to speech for automated company announcements. However, company acronyms are being mispronounced in the current documents.   How should a Machine Learning Specialist address this issue for future documents?

Explanation

To address the issue of mispronounced company acronyms in future documents, a Machine Learning Specialist should create an appropriate pronunciation lexicon. This lexicon will contain the correct pronunciation of the acronyms, allowing Amazon Polly to accurately pronounce them in the speech output. By creating a pronunciation lexicon, the specialist can ensure that the acronyms are pronounced correctly in all future automated company announcements.

107. A financial organization uses multiple ML models to detect irregular patterns in its data to combat fraudulent activity such as money laundering. They use a TensorFlow-based Docker container on GPU-enabled Amazon EC2 instances to concurrently train the multiple models for this workload. However, they want to automate the batch data preprocessing and ML training aspects of this pipeline, scheduling them to take place automatically every 24 hours. What AWS service can they use to do this?

Explanation

The financial organization can use AWS Batch to automate the batch data preprocessing and ML training aspects of their pipeline. AWS Batch is a fully managed service that allows users to run batch computing workloads on the AWS Cloud. It takes care of the underlying infrastructure and automatically scales resources to meet the demands of the workload. With AWS Batch, the organization can schedule their data preprocessing and ML training tasks to run automatically every 24 hours, ensuring a streamlined and automated process.

108. A bank's Machine Learning team is developing an approach for credit card fraud detection. The company has a large dataset of historical data labeled as fraudulent. The goal is to build a model to take the information from new transactions and predict whether each transaction is fraudulent or not.   Which built-in Amazon SageMaker machine learning algorithm should be used for modeling this problem?

Explanation

XGBoost is a popular machine learning algorithm that is commonly used for classification problems, such as fraud detection. It is known for its ability to handle large datasets and its effectiveness in capturing complex relationships between variables. Therefore, it would be a suitable choice for the bank's Machine Learning team to use XGBoost for modeling the credit card fraud detection problem.

109. A Machine Learning Engineer wants to use Amazon SageMaker and the built-in XGBoost algorithm for model training. The training data is currently stored in CSV format, with the first 10 columns representing features and the 11th column representing the target label. What should the ML Engineer do to prepare the data for use in an Amazon SageMaker training job?

Explanation

To prepare the data for use in an Amazon SageMaker training job, the ML Engineer should first change the target label to the first column. Then, the data should be split into training, validation, and test sets. Finally, the datasets should be uploaded to Amazon S3.
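
A minimal pandas sketch of this preparation (column names, split ratios, and bucket are hypothetical); SageMaker's built-in XGBoost expects CSV input with no header row and the label in the first column.

    import pandas as pd

    df = pd.read_csv("products.csv")                     # 10 feature columns plus a "label" column
    cols = ["label"] + [c for c in df.columns if c != "label"]
    df = df[cols]                                        # move the target to the first column

    train = df.sample(frac=0.7, random_state=42)
    remainder = df.drop(train.index)
    validation = remainder.sample(frac=0.5, random_state=42)
    test = remainder.drop(validation.index)

    for name, split in [("train", train), ("validation", validation), ("test", test)]:
        split.to_csv(f"{name}.csv", header=False, index=False)
        # then upload each file to S3, e.g.
        # boto3.client("s3").upload_file(f"{name}.csv", "example-bucket", f"{name}/{name}.csv")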

110. A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process?

Explanation

It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size would help the data scientist stochastically get out of the local minima saddles. Decreasing the learning rate would prevent overshooting the global loss function minimum.

111. A company is running an Amazon SageMaker training job that will access data stored in its Amazon S3 bucket. A compliance policy requires that the data never be transmitted across the internet. How should the company set up the job?

Explanation

The company should launch the notebook instances in a private subnet and access the data through an S3 VPC endpoint. This setup ensures that the data is not transmitted across the internet, as required by the compliance policy. By using a private subnet, the instances are not accessible from the public internet. The S3 VPC endpoint allows the instances to securely access the S3 bucket within the VPC, without the need for internet connectivity. This ensures that the data remains within the company's network and complies with the compliance policy.

112. A Data Scientist wants to tune the hyperparameters of a machine learning model to improve the model's F1 score. What technique can be used to achieve this desired outcome on Amazon SageMaker? (Select TWO)

Explanation

Random Search and Bayesian optimization are two techniques that can be used to tune the hyperparameters of a machine learning model on Amazon SageMaker to improve the model's F1 score. Random Search involves randomly selecting combinations of hyperparameters from a predefined search space and evaluating their performance. Bayesian optimization, on the other hand, uses a probabilistic model to find the optimal set of hyperparameters by iteratively exploring the search space based on previous evaluations. Both techniques can help identify the best hyperparameter values that maximize the F1 score.

113. A Data Scientist wants to use the Amazon SageMaker hyperparameter tuning job to automatically tune a random forest model. What API does the Amazon SageMaker SDK use to create and interact with the Amazon SageMaker hyperparameter tuning jobs?

Explanation

The Amazon SageMaker SDK uses the HyperparameterTuner() API to create and interact with the Amazon SageMaker hyperparameter tuning jobs. This API allows the data scientist to automate the tuning process for their random forest model, optimizing the hyperparameters to improve the model's performance.
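
A hedged sketch of this API with the SageMaker Python SDK; the estimator, IAM role, objective metric name, and ranges are placeholders, and the objective metric must be one the training job actually emits.

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

    session = sagemaker.Session()
    estimator = Estimator(
        image_uri=image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
        role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",   # placeholder role
        instance_count=1,
        instance_type="ml.m5.xlarge",
        sagemaker_session=session,
    )

    tuner = HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:auc",            # placeholder objective metric
        objective_type="Maximize",
        hyperparameter_ranges={
            "max_depth": IntegerParameter(3, 10),
            "eta": ContinuousParameter(0.01, 0.3, scaling_type="Logarithmic"),
        },
        strategy="Bayesian",                                # "Random" is the other built-in option
        max_jobs=20,
        max_parallel_jobs=2,
    )
    # tuner.fit({"train": "s3://example-bucket/train/", "validation": "s3://example-bucket/validation/"})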

114. A Machine Learning Specialist is developing a custom video recommendation model for an application. The dataset used to train this model is very large with millions of data points and is hosted in an Amazon S3 bucket. The Specialist wants to avoid loading all of this data onto an Amazon SageMaker notebook instance because it would take hours to move and will exceed the attached 5 GB Amazon EBS volume on the notebook instance.   Which approach allows the Specialist to use all the data to train the model?

Explanation

To avoid loading the entire large dataset onto the limited storage of the SageMaker notebook instance, the Machine Learning Specialist should load a smaller subset of the data into the notebook and train locally. This allows them to confirm that the training code is executing correctly and the model parameters are reasonable. Once this is verified, they can initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode. This approach allows the Specialist to use all the data for training without exceeding the storage limitations of the notebook instance.

115. A Machine Learning Specialist is developing a daily ETL workflow containing multiple ETL jobs. The workflow consists of the following processes;  
  • Start the workflow as soon as data is uploaded to Amazon S3
  • When all the datasets are available in Amazon S3, start an ETL job to join the uploaded datasets with multiple terabyte-sized datasets already stored in Amazon S3
  • Store the results of joining datasets in Amazon S3
  • If one of the jobs fails, send a notification to the Administrator.
Which configuration will meet these requirements?

Explanation

The correct answer is to use AWS Lambda to trigger an AWS Step Functions workflow to wait for dataset uploads to complete in Amazon S3. This configuration meets the requirements as it ensures that the workflow starts as soon as data is uploaded to S3. AWS Glue is used to join the datasets, and the results are stored in S3. An Amazon CloudWatch alarm is set up to send an SNS notification to the Administrator in case of a failure. This configuration provides an efficient and scalable solution for the daily ETL workflow.

116. A company is using its genomic data to classify how different human DNA affects cell growth, so that they can predict a person's chances of getting cancer. Before creating and preparing the training and validation datasets for the model, the company wants to reduce the high dimensionality of the data. What technique should the company use to achieve this goal? (Select TWO.)

Explanation

The company should use T-SNE and PCA to reduce the dimensionality of the data. T-SNE is a technique used for visualizing high-dimensional data by reducing it to a lower-dimensional space while preserving the structure of the data. It is particularly useful for visualizing clusters or patterns in the data. PCA, on the other hand, is a technique used to reduce the dimensionality of data by finding a new set of variables (principal components) that capture the most important information in the original data. By visualizing the data using matplotlib, the company can gain insights into the distribution and relationships between variables.
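
A hedged scikit-learn sketch with synthetic data standing in for the genomic features: PCA reduces the dimensionality first, then t-SNE embeds the result in two dimensions for plotting with matplotlib.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2000))       # 500 samples, 2,000 features (synthetic)

    X_pca = PCA(n_components=50).fit_transform(X)        # keep 50 principal components
    X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

    plt.scatter(X_2d[:, 0], X_2d[:, 1], s=5)
    plt.title("t-SNE embedding after PCA")
    plt.show()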

117. A Machine Learning Specialist needs to move and transform data in preparation for training. Some of the data needs to be processed in near-real time and other data can be moved hourly. There are existing Amazon EMR MapReduce jobs to clean the data and perform feature engineering.   Which of the following services can feed data to the MapReduce jobs? (Select TWO)

Explanation

Amazon Kinesis and AWS Data Pipeline are both services that can feed data to MapReduce jobs. Amazon Kinesis is a real-time streaming service that can continuously collect and process large streams of data in near-real time. AWS Data Pipeline is a web service for orchestrating and automating the movement and transformation of data between different AWS services and on-premises data sources. Both of these services can be used to feed data to the MapReduce jobs for processing and analysis.

118. A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria: 1) Must have a recall rate of at least 80% 2) Must have a false positive rate of 10% or less 3) Must minimize business costs After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?

Explanation

The following calculations are required:

TP = True Positive
FP = False Positive
FN = False Negative
TN = True Negative

Recall = TP / (TP + FN)

False Positive Rate (FPR) = FP / (FP + TN)

Cost = 5 * FP + FN

Options C and D have a recall greater than 80% and an FPR less than 10%, but D is the most cost effective.
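
As a worked illustration of applying the three criteria (the counts below are made up and are not the answer-option matrices):

    tp, fn, fp, tn = 850, 150, 90, 910    # illustrative confusion-matrix counts

    recall = tp / (tp + fn)               # 0.85, must be >= 0.80
    fpr = fp / (fp + tn)                  # 0.09, must be <= 0.10
    cost = 5 * fp + fn                    # 600, since false positives are five times as expensive

    print(f"recall={recall:.2f}, FPR={fpr:.2f}, cost={cost}")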

119. An ML Engineer at a real estate startup wants to use a new quantitative feature for an existing ML model that predicts housing prices. Before adding the feature to the cleaned dataset, the Engineer wants to visualize the feature in order to check for outliers and overall distribution and skewness of the feature. What visualization technique should the ML Engineer use? (Select TWO.)

Explanation

The ML Engineer should use a box plot to visualize the feature in order to check for outliers. A box plot provides information about the minimum, first quartile, median, third quartile, and maximum values of a dataset, allowing the Engineer to identify any extreme values that may be considered outliers. Additionally, the Engineer should use a histogram to visualize the overall distribution and skewness of the feature. A histogram provides a visual representation of the frequency distribution of the feature, allowing the Engineer to observe the shape of the distribution and identify any skewness.
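
A minimal matplotlib sketch with synthetic values standing in for the new feature; the box plot surfaces outliers while the histogram shows the overall distribution and skew.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    feature = np.append(rng.lognormal(mean=12, sigma=0.4, size=1000), [5e6, 7e6])   # add a few outliers

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.boxplot(feature)
    ax1.set_title("Box plot: outliers beyond the whiskers")
    ax2.hist(feature, bins=50)
    ax2.set_title("Histogram: distribution and skewness")
    plt.show()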

120. A Machine Learning Specialist has created a deep learning neural network model that performs well on the training data but performs poorly on the test data.   Which of the following methods should the Specialist consider using to correct this? (Select THREE)

Explanation

The Specialist should consider increasing regularization to prevent overfitting and improve generalization on the test data. Increasing dropout can also help reduce overfitting by randomly dropping out neurons during training. Decreasing feature combinations can help simplify the model and reduce the risk of overfitting.

121. A Machine Learning Specialist is using an Amazon SageMaker notebook instance in a private subnet of a corporate VPC. The ML Specialist has important data stored on the Amazon SageMaker notebook instance's Amazon EBS volume, and needs to take a snapshot of that EBS volume. However the ML Specialist cannot find the Amazon SageMaker notebook instance's EBS volume or Amazon EC2 instance within the VPC.   Why is the ML Specialist not seeing the instance visible in the VPC?

Explanation

No explanation is available for this question.

122. A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features?

Explanation

In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field.

123. A financial planning company is using the Amazon SageMaker endpoint with an Auto Scaling policy to serve its forecasting model to the company's customers to help them plan for retirement. The team wants to update the endpoint with its latest forecasting model, which has been trained using Amazon SageMaker training jobs. The team wants to do this without any downtime and with minimal change to the code. What steps should the team take to update this endpoint?

Explanation

To update the endpoint without any downtime and minimal code changes, the team should first de-register the endpoint as a scalable target. Then, they should update the endpoint using a new endpoint configuration that includes the latest model Amazon S3 path. Finally, they should register the endpoint as a scalable target again. This process ensures that the endpoint is temporarily taken out of the scaling policy, allowing the team to update it with the latest model while minimizing disruption to the customers.

124. A ride-share company wants to create intelligent conversational chatbots that will serve as first responders to customers who call to report an issue with their ride. The company wants these chatbot-customer calls to mimic natural conversations that provide personalized experiences for the customers. What combination of AWS services can the company use to create this workflow without a lot of ongoing management?

Explanation

The company can use Amazon Lex to parse the utterances and intent of customer comments, and Amazon Polly to generate personalized replies to the customers. Amazon Lex is a service for building conversational interfaces using voice and text, which can understand and interpret customer inputs. Amazon Polly is a service that turns text into lifelike speech, allowing the chatbot to respond to customers in a natural and personalized manner. This combination of services allows the company to create intelligent chatbots that can provide personalized experiences without requiring a lot of ongoing management.

125. What factors lead to the wide adoption of neural networks in the last decade? (Select THREE.)

Explanation

Neural networks have been widely adopted in the last decade due to several factors. Firstly, the development of efficient algorithms has made it easier to train and optimize neural networks, leading to improved performance. Additionally, the availability of cheaper GPUs (Graphics Processing Units) has significantly accelerated the computational power required for neural network training and inference. Lastly, there has been an exponential increase in the amount of data collected in various fields, allowing neural networks to learn from larger and more diverse datasets, leading to better performance and generalization.

126. A Machine Learning Specialist is preparing data for training on Amazon SageMaker. The Specialist is using one of the SageMaker built-in algorithms for the training. The dataset is stored in .CSV format and is transformed into a numpy.array, which appears to be negatively affecting the speed of the training.   What should the Specialist do to optimize the data for training on SageMaker?

Explanation

The Specialist should transform the dataset into the RecordIO protobuf format. This format is optimized for high-performance, efficient data storage and retrieval, which can improve the speed of training on SageMaker.
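As a rough sketch, the SageMaker Python SDK provides a helper for this conversion. The example below assumes the features and labels have already been loaded from the .CSV file into numpy arrays; the array shapes, bucket, and key names are placeholders.

    import io
    import boto3
    import numpy as np
    from sagemaker.amazon.common import write_numpy_to_dense_tensor

    # Placeholder arrays standing in for the data loaded from the .CSV file
    train_X = np.random.rand(1000, 10).astype("float32")
    train_y = np.random.randint(0, 2, size=1000).astype("float32")

    # Serialize the arrays into the RecordIO protobuf format expected by the
    # SageMaker built-in algorithms
    buffer = io.BytesIO()
    write_numpy_to_dense_tensor(buffer, train_X, train_y)
    buffer.seek(0)

    # Upload the serialized data to S3 so the training job can read it
    boto3.resource("s3").Bucket("my-training-bucket").Object(
        "train/recordio-pb-data"
    ).upload_fileobj(buffer)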

Submit
127. A Machine Learning Specialist is configuring automatic model tuning in Amazon SageMaker. When using the hyperparameter optimization feature, which of the following guidelines should be followed to improve optimization?

Explanation

Using log-scaled hyperparameters allows for a more efficient search of the hyperparameter space. Log scaling reduces the range of values that need to be searched, making the optimization process faster. By taking the logarithm of the hyperparameters, the search space is compressed, allowing the algorithm to explore a wider range of values in a shorter amount of time. This helps in finding the optimal combination of hyperparameters more quickly and improving the overall optimization process.
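A minimal sketch of what this looks like with the SageMaker Python SDK; the estimator variable is assumed to be a previously configured Estimator, and the objective metric name is illustrative (it would match the algorithm being tuned, for example built-in XGBoost).

    from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

    # A log scale searches a learning rate spanning several orders of magnitude
    # far more efficiently than a linear scale would
    hyperparameter_ranges = {
        "learning_rate": ContinuousParameter(1e-5, 1e-1, scaling_type="Logarithmic"),
    }

    tuner = HyperparameterTuner(
        estimator=estimator,                     # previously configured Estimator
        objective_metric_name="validation:auc",  # illustrative metric name
        hyperparameter_ranges=hyperparameter_ranges,
        max_jobs=20,
        max_parallel_jobs=2,
    )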

Submit
128. A company is running a machine learning prediction service that generates 100 TB of predictions every day. A Machine Learning Specialist must generate a visualization of the daily precision-recall curve from the predictions, and forward a read-only version to the Business team.   Which solution requires the LEAST coding effort?

Explanation

The solution that requires the least coding effort is to run a daily Amazon EMR workflow to generate precision-recall data and save the results in Amazon S3. Then, the arrays can be visualized in Amazon QuickSight and published in a dashboard shared with the Business team. This solution minimizes coding effort by utilizing existing services and tools provided by Amazon, rather than building a custom solution from scratch.

Submit
129. If you wanted to build your own Alexa-type device that converses with customers using speech, which Amazon services might you use?

Explanation

To build an Alexa-type device that converses with customers using speech, the appropriate Amazon services to use would be Amazon Transcribe, Amazon Lex, and Amazon Polly. Amazon Transcribe is used to convert speech to text, Amazon Lex is used for natural language understanding and conversation management, and Amazon Polly is used for text-to-speech conversion. This combination of services allows for the device to understand and respond to customer queries using speech.

Submit
130. A Data Scientist at a credit card company trained a classification model to predict fraud at the time of a transaction. The Data Scientist used a confusion matrix to evaluate the performance of the model. Using the confusion matrix below, determine the percent of positive records that were classified correctly. Choose the answer that also labels this evaluation metric correctly.

Explanation

The confusion matrix is used to evaluate the performance of a classification model. It consists of four metrics: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In this case, the question asks for the percent of positive records that were classified correctly, which is the definition of recall. Recall measures the ability of a model to correctly identify positive records out of all actual positive records. Therefore, the correct answer is 80%; Recall.
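The confusion-matrix values themselves are not reproduced here, so the counts below are placeholders chosen to match the 80% answer; the point is the recall formula itself.

    # Recall = TP / (TP + FN): the share of actual positives the model caught.
    # The counts are placeholders, not the values from the question's matrix.
    tp, fn = 80, 20
    recall = tp / (tp + fn)
    print(f"Recall: {recall:.0%}")   # -> Recall: 80%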

Submit
131. A Machine Learning team uses Amazon SageMaker to train an Apache MXNet handwritten digit classifier model using a research dataset. The team wants to receive a notification when the model is overfitting. Auditors want to view the Amazon SageMaker log activity report to ensure there are no unauthorized API calls.   What should the Machine Learning team do to address the requirements with the least amount of code and fewest steps?

Explanation

The Machine Learning team should use AWS CloudTrail to log Amazon SageMaker API calls to Amazon S3. This will allow the auditors to view the log activity report and ensure there are no unauthorized API calls. Additionally, the team should add code to push a custom metric to Amazon CloudWatch to monitor the model's performance. Finally, they should create an alarm in CloudWatch with Amazon SNS to receive a notification when the model is overfitting. This solution addresses the requirements with the least amount of code and fewest steps.
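One possible sketch of the custom-metric and alarm pieces using boto3 is shown below; the namespace, metric name, threshold, and SNS topic ARN are all assumptions for illustration.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Push a custom metric from the training code, e.g. the gap between training
    # and validation accuracy as a rough overfitting signal
    cloudwatch.put_metric_data(
        Namespace="MXNetDigitClassifier",
        MetricData=[{"MetricName": "TrainValidationAccuracyGap", "Value": 0.12}],
    )

    # Alarm that notifies an SNS topic when the gap exceeds a threshold
    cloudwatch.put_metric_alarm(
        AlarmName="model-overfitting",
        Namespace="MXNetDigitClassifier",
        MetricName="TrainValidationAccuracyGap",
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,
        Threshold=0.1,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:overfitting-alerts"],
    )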

Submit
132. In AWS SageMaker, which of the following statements accurately describes the role of a "SageMaker Endpoint" in the context of deploying machine learning models?

Explanation

A SageMaker Endpoint in AWS is used for real-time inference, enabling predictions on new data by deploying an already trained machine learning model. It is not used for storing training data, batch processing, or hyperparameter tuning. Training data storage and batch processing are handled by other services, while hyperparameter tuning is managed through Hyperparameter Tuning Jobs. The Endpoint focuses on providing real-time predictions once a model is trained.

Submit
133. A retail chain has been ingesting purchasing records from its network of 20,000 stores to Amazon S3 using Amazon Kinesis Data Firehose. To support training an improved machine learning model, training records will require new but simple transformations, and some attributes will be combined. The model needs to be retrained daily. Given the large number of stores and the legacy data ingestion, which change will require the LEAST amount of development effort?

Explanation

The option to insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL will require the least amount of development effort. This option leverages the existing data ingestion pipeline and adds a simple transformation step using SQL, which can be easily implemented. It eliminates the need for additional infrastructure setup like deploying EMR clusters or spinning up EC2 instances, making it a more efficient and cost-effective solution.

Submit
134. A Machine Learning Specialist uploads a dataset to an Amazon S3 bucket protected with server-side encryption using AWS KMS. How should the ML Specialist define the Amazon SageMaker notebook instance so it can read the same dataset from Amazon S3?

Explanation

Reading an object encrypted with SSE-KMS requires both read access to the S3 bucket and permission to use the KMS key (kms:Decrypt), so the notebook instance should be defined with an IAM role that has access to the dataset and that is granted use of the key through the KMS key policy. See: https://docs.aws.amazon.com/kms/latest/developerguide/control-access-overview.html#managing-access

Submit
135. Example Corp has an annual sale event from October to December. The company has sequential sales data from the past 15 years and wants to use Amazon ML to predict the sales for this year's upcoming event.   Which method should Example Corp use to split the data into a training dataset and evaluation dataset?

Explanation

Since Example Corp has sequential sales data from the past 15 years, it would be appropriate to use a sequential split for the training and evaluation datasets. This means that the data should be split in a way that maintains the order of the sales data. This will allow the model to learn from the past patterns and trends in the data, which is crucial for predicting future sales accurately.
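A minimal sketch of a sequential split, assuming the sales data has already been loaded into a date-ordered DataFrame; the file and column names are placeholders.

    import pandas as pd

    sales = pd.read_csv("sales.csv", parse_dates=["date"]).sort_values("date")

    # Keep the chronological order: train on the older 80%, evaluate on the
    # most recent 20% instead of shuffling the rows
    split_point = int(len(sales) * 0.8)
    train_df = sales.iloc[:split_point]
    eval_df = sales.iloc[split_point:]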

Submit
136. A Data Scientist is using stochastic gradient descent (SGD) as the gradient optimizer to train a machine learning model. However, the model training error is taking longer to converge to the optimal solution than desired. What optimizer can the Data Scientist use to improve training performance? (Select THREE)

Explanation

The Data Scientist can use Adam, Adagrad, and RMSProp as optimizers to improve training performance. These optimizers are known for their ability to converge faster to the optimal solution compared to stochastic gradient descent (SGD). Adam combines the advantages of both AdaGrad and RMSProp, making it a popular choice. Adagrad adapts the learning rate for each parameter individually, which can be beneficial in dealing with sparse data. RMSProp also adapts the learning rate, but it uses a moving average of squared gradients to adjust the learning rate.
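In most frameworks, swapping SGD for one of these adaptive optimizers is a one-line change. A Keras sketch, assuming a model object has already been defined:

    import tensorflow as tf

    # Adam here; tf.keras.optimizers.Adagrad or tf.keras.optimizers.RMSprop
    # are equally valid drop-in choices
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )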

Submit
137. A Data Scientist needs to migrate an existing on-premises ETL process to the cloud. The current process runs at regular time intervals and uses PySpark to combine and format multiple large data sources into a single consolidated output for downstream processing. The Data Scientist has been given the following requirements to the cloud solution:
  • Combine multiple data sources.
  • Reuse existing PySpark logic.
  • Run the solution on the existing schedule.
  • Minimize the number of servers that will need to be managed.
Which architecture should the Data Scientist use to build this solution?

Explanation

The Data Scientist should write the raw data to Amazon S3 and use an AWS Glue ETL job to perform the ETL processing. By writing the ETL job in PySpark, the existing logic can be reused. A new AWS Glue trigger can run the job on the existing schedule, and the job's output target can be configured to write to a "processed" location in Amazon S3 that is accessible for downstream use. This architecture meets all the given requirements: combining multiple data sources, reusing existing PySpark logic, running on the existing schedule, and minimizing the number of managed servers.

Submit
138. A Data Scientist working for an autonomous vehicle company is building an ML model to detect and label people and various objects (for instance, cars and traffic signs) that may be encountered on a street. The Data Scientist has a dataset made up of labeled images, which will be used to train their machine learning model. What kind of ML algorithm should be used?

Explanation

Instance segmentation should be used in this scenario. Instance segmentation not only classifies objects in an image but also provides a pixel-level mask for each individual object. This is important in the context of autonomous vehicles as it allows for accurate detection and labeling of people and various objects on the street. Image classification would only classify the entire image, while image localization would only provide bounding boxes around objects. Semantic segmentation would classify pixels into different categories but would not differentiate between individual objects.

Submit
139. A Data Scientist wants to create a linear regression model to train on a housing dataset to predict home prices. As part of that process, the Data Scientist created a correlation matrix between the dataset's features and the target variable. The correlations between the target and two of the features, feature 3 and feature 7, are 0.64 and -0.85, respectively. Which feature has a stronger correlation with the target variable?

Explanation

The strength of a correlation is determined by its absolute value. In this case, the correlation coefficient between feature 3 and the target variable is 0.64, and the correlation coefficient between feature 7 and the target variable is -0.85. The correlation coefficient of -0.85 is stronger than the correlation coefficient of 0.64 in terms of magnitude. Therefore, feature 7 has a stronger correlation with the target variable compared to feature 3.

Submit
140. A Machine Learning Specialist is working with multiple data sources containing billions of records that need to be joined.   What feature engineering and model development approach should the Specialist take with a dataset this large?

Explanation

not-available-via-ai

Submit
141. A Machine Learning Engineer is creating and preparing data for a linear regression model. However, while preparing the data, the Engineer notices that about 20% of the numerical data contains missing values in the same two columns. The shape of the data is 500 rows by 4 columns, including the target column. How could the Engineer handle the missing values in the data? (Select TWO.)

Explanation

The Engineer can handle the missing values in two ways. Firstly, they can fill the missing values with zeros, which means replacing the missing values with the value of zero. Secondly, they can impute the missing values using regression, which involves using the other available data to predict and fill in the missing values based on a regression model. These two approaches help to ensure that the missing values are accounted for and do not negatively impact the linear regression model's performance.
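A quick sketch of both options with pandas and scikit-learn, assuming the 500 x 4 frame has been loaded from a placeholder file name.

    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    df = pd.read_csv("training_data.csv")

    # Option 1: fill the missing numeric values with zeros
    df_zero_filled = df.fillna(0)

    # Option 2: impute each missing value with a regression over the other columns
    imputer = IterativeImputer(random_state=0)
    df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)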

Submit
142. A company is observing low accuracy while training on the default built-in image classification algorithm in Amazon SageMaker. The Data Science team wants to use an Inception neural network architecture instead of a ResNet architecture.   Which of the following will accomplish this? (Select TWO)

Explanation

The correct answer is to bundle a Docker container with TensorFlow Estimator loaded with an Inception network and use this for model training, and to use custom code in Amazon SageMaker with TensorFlow Estimator to load the model with an Inception network and use this for model training. These options allow the Data Science team to use the Inception neural network architecture instead of the default ResNet architecture by customizing the image classification algorithm and loading the Inception network model for training.

Submit
143. A Data Scientist needs to create a serverless ingestion and analytics solution for high-velocity, real-time streaming data.   The ingestion process must buffer and convert incoming records from JSON to a query-optimized, columnar format without data loss. The output datastore must be highly available, and Analysts must be able to run SQL queries against the data and connect to existing business intelligence dashboards.   Which solution should the Data Scientist build to satisfy the requirements?

Explanation

The Data Scientist should create a schema in the AWS Glue Data Catalog to define the structure of the incoming data. Then, they should use an Amazon Kinesis Data Firehose delivery stream to stream the data and transform it into Apache Parquet or ORC format using the AWS Glue Data Catalog. The transformed data should be delivered to Amazon S3, which provides a highly available output datastore. Analysts can then query the data directly from Amazon S3 using Amazon Athena, and connect to business intelligence tools using the Athena JDBC connector. This solution satisfies the requirements of buffering and converting the data, providing a highly available datastore, and enabling SQL queries and connection to existing dashboards.

Submit
144. An e-commerce company needs a customized training model to classify images of its shirts and pants products. The company needs a proof of concept in 2 to 3 days with good accuracy.   Which compute choice should the Machine Learning Specialist select to train and achieve good accuracy on the model quickly?

Explanation

The Machine Learning Specialist should select the p3.2xlarge compute choice because it is a GPU accelerated computing option. GPU acceleration can significantly speed up the training process for machine learning models, especially when working with image classification tasks. The use of GPUs allows for parallel processing and optimization of computations, leading to faster training times and ultimately achieving good accuracy on the model quickly.

Submit
145. A Data Scientist at an ad-tech startup wants to update an ML model that uses an Amazon SageMaker endpoint using the canary deployment methodology, in which the production variant 1 is the production model and the production variant 2 is the updated model. How can the Data Scientist efficiently configure this endpoint configuration to deploy the two different versions of the model while monitoring the Amazon CloudWatch invocations?

Explanation

The Data Scientist should create an endpoint configuration with production variants for the two models with a weight ratio of 0:1. This means that the production variant 1 will have a weight of 0, indicating that it is not being used for inference, while the production variant 2 will have a weight of 1, indicating that it is the updated model being used for inference. By periodically updating the weights, the Data Scientist can gradually shift the traffic from the old model to the updated model and monitor the invocations using Amazon CloudWatch.
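Weight updates between the two production variants can be made on the live endpoint without redeploying it; a sketch with boto3, where the endpoint and variant names are placeholders:

    import boto3

    sm = boto3.client("sagemaker")

    # Shift more traffic to the updated variant once its CloudWatch invocation
    # and error metrics look healthy
    sm.update_endpoint_weights_and_capacities(
        EndpointName="ad-ranking-endpoint",
        DesiredWeightsAndCapacities=[
            {"VariantName": "production-variant-1", "DesiredWeight": 0.5},
            {"VariantName": "production-variant-2", "DesiredWeight": 0.5},
        ],
    )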

Submit
146. Real estate startup wants to use ML to predict the value of homes in various cities. To do so, the startup's data science team is joining real estate price data with other variables such as weather, demographic, and standard of living data. However, the team is having problems with slow model convergence. Additionally, the model includes large weights for some features, which is causing degradation in model performance. What kind of data preprocessing technique should the team use to more effectively prepare this data?

Explanation

The team should use the Standard Scaler data preprocessing technique to effectively prepare the data. Standard Scaler transforms the data to have a mean of 0 and a standard deviation of 1, which helps in reducing the impact of outliers and brings all features to a similar scale. This can help in improving the model's convergence and performance by reducing the influence of large weights and ensuring that all features contribute equally to the predictions.
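With scikit-learn this is a two-line step; X stands in for the joined feature matrix and is an assumption here.

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and std 1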

Submit
147. A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values?

Explanation

Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performance, but any properly implemented supervised learning approach should provide the same or better approximation than simple mean or median imputation. Supervised learning applied to the imputation of missing values is an active field of research.

Submit
148. An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task?

Explanation

Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs to be used instead of Word2Vec.

Submit
149. Which of the following best describes the purpose of Amazon SageMaker Ground Truth in the context of machine learning?

Explanation

Amazon SageMaker Ground Truth is a service that helps in generating high-quality labeled datasets by using a combination of human labeling and automated data labeling techniques. This is essential for training supervised machine learning models, as labeled data serves as the foundation for teaching models to make accurate predictions. The service significantly reduces the time and cost associated with data labeling.

Submit
150. A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age.   Initial models have performed poorly. While reviewing the underlying data, the Data Scientist noticed that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features of these observations appear normal compared to the rest of the sample population.   How should the Data Scientist correct this issue?

Explanation

This is a widely accepted method for handling such anomalies. Since the age of 0 is clearly incorrect in a study of individuals over 65, imputing these values with the mean or median age of the other patients ensures that these records are still usable without introducing significant bias. The mean or median imputation would likely provide a reasonable estimate of what the age should have been.
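A small pandas sketch of this correction, with a placeholder file name and an assumed "age" column:

    import numpy as np
    import pandas as pd

    patients = pd.read_csv("patients.csv")

    # Treat the 450 zero ages as missing, then impute with the median valid age
    patients["age"] = patients["age"].replace(0, np.nan)
    patients["age"] = patients["age"].fillna(patients["age"].median())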

Submit
151. A Marketing Manager at a pet insurance company plans to launch a targeted marketing campaign on social media to acquire new customers. Currently, the company has the following data in Amazon Aurora.  
  • Profiles for all past and existing customers
  • Profiles for all past and existing insured pets
  • Policy-level information
  • Premiums received
  • Claims paid
  What steps should be taken to implement a machine learning model to identify potential new customers on social media?

Explanation

To implement a machine learning model to identify potential new customers on social media, the marketing manager should use clustering on customer profile data to understand key characteristics of consumer segments. This approach will help in grouping customers with similar attributes together, enabling the company to target specific segments effectively. By finding similar profiles on social media, the marketing manager can tailor the marketing campaign to reach potential customers who share similar characteristics with existing and past customers.

Submit
152. A news organization wants to extract metadata from its articles and blogs and index that metadata in Amazon Elasticsearch Service (Amazon ES) to enable faster searches. What AWS service can the organization use to achieve this goal?

Explanation

The organization can use Amazon Comprehend to extract metadata from its articles and blogs. Amazon Comprehend is a natural language processing (NLP) service that can analyze text and extract information such as key phrases, entities, sentiment, and language. By using Amazon Comprehend, the organization can extract the desired metadata from the articles and blogs, and then index that metadata in Amazon Elasticsearch Service for faster searches.

Submit
153. While reviewing the histogram for residuals on regression evaluation data, a Machine Learning Specialist notices that the residuals do not form a zero-centered bell shape as shown.   What does this mean?

Explanation

The histogram for residuals not forming a zero-centered bell shape indicates that the regression model is not accurately representing the dataset. A zero-centered bell shape suggests that the model's predictions have minimal errors and are distributed evenly around the true values. However, if the residuals do not form this shape, it implies that the model is consistently overestimating or underestimating the target values, leading to inaccurate representation of the dataset.

Submit
154. You work in the security department of your company's IT division. Your company has decided to try to use facial recognition to improve security on their campus. You have been asked to design a system that augments your company's building access security by scanning the faces of people entering their buildings and recognizing the person as either an employee/contractor/consultant, who is in the company's database, or visitor, who is not in their database.   Across their many campus locations worldwide your company has over 750,000 employees and over 250,000 contractors and consultants. These workers are all registered in their HR database. Each of these workers has an image of their face stored in the HR database. You have decided to use Amazon Rekognition for your facial recognition solution. On occasion, the Rekognition model fails to recognize visitors to the buildings.   What could be the source of the problem?

Explanation

The source of the problem could be the face collection contents. Since the Rekognition model is failing to recognize visitors, it is possible that the faces of the visitors are not included in the face collection that the system is comparing against. The face collection should ideally contain images of both employees/contractors/consultants and visitors in order to accurately identify and differentiate between them.

Submit
155. A Data Scientist is training a convolutional neural network model to detect incoming employees at the company's front gate using a camera so that the system opens for them automatically. However, the model is taking too long to converge and the error oscillates for more than 10 epochs. What should the Data scientists do to improve upon this situation? (Select TWO.)

Explanation

The Data Scientist should normalize the images before training to improve the convergence of the model. Normalization helps to scale the pixel values of the images to a standard range, which can prevent the model from getting stuck in oscillating errors. Additionally, adding batch normalization can also help in improving the convergence by normalizing the activations of the neural network layers. These two techniques can help stabilize the training process and improve the performance of the model.
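A minimal Keras sketch of both fixes, assuming the training images are already loaded as an array; the input shape and layer sizes are illustrative.

    import tensorflow as tf

    # Scale pixel values into [0, 1] before training
    train_images = train_images.astype("float32") / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
        tf.keras.layers.BatchNormalization(),   # normalize the layer activations
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])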

Submit
156. An agency collects census information within a country to determine healthcare and social program needs by province and city. The census form collects responses for approximately 500 questions from each citizen.   Which combination of algorithms would provide the appropriate insights? (Select TWO)

Explanation

The principal component analysis (PCA) algorithm is suitable for this task as it can reduce the dimensionality of the data and identify the most important variables that contribute to the variance in the dataset. This can help in identifying patterns and relationships within the census information. The k-means algorithm can be used to cluster the data based on similarities, which can be useful in grouping provinces and cities with similar healthcare and social program needs. These algorithms together can provide valuable insights for determining healthcare and social program needs by province and city based on the census information.
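A compact scikit-learn sketch of this PCA-then-cluster pattern; the responses variable and the component and cluster counts are assumptions.

    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # responses: one row per citizen, ~500 numeric answer columns
    reduced = PCA(n_components=20).fit_transform(responses)
    clusters = KMeans(n_clusters=8, random_state=0).fit_predict(reduced)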

Submit
157. A machine translation company is deploying its language translation models behind an Amazon SageMaker endpoint. The company wants to deploy a solution directly on its website so that users can input text in one language and have it translated into a second language. The company wants to reach a solution with minimal maintenance and latency for spiky traffic times. How should the company architect this solution?

Explanation

The company should use Lambda to call InvokeEndpoint and use the Amazon API Gateway URL to call the AWS Lambda function. This architecture allows for minimal maintenance and latency for spiky traffic times. By using Lambda, the company can easily scale the solution to handle increased traffic. The Amazon API Gateway acts as a proxy, providing a secure and scalable entry point for the website to call the Lambda function. This solution is efficient and provides a seamless experience for users on the website.
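A sketch of the Lambda handler sitting behind API Gateway; the endpoint name and request fields are assumptions.

    import json
    import boto3

    runtime = boto3.client("sagemaker-runtime")

    def lambda_handler(event, context):
        body = json.loads(event["body"])     # request body forwarded by API Gateway
        response = runtime.invoke_endpoint(
            EndpointName="translation-endpoint",
            ContentType="application/json",
            Body=json.dumps({"text": body["text"], "target": body["target"]}),
        )
        translation = response["Body"].read().decode("utf-8")
        return {"statusCode": 200, "body": translation}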

Submit
158. A Machine Learning Specialist is developing a recommendation engine for a photography blog. Given a picture, the recommendation engine should show a picture that captures similar objects. The Specialist would like to create a numerical representation feature to perform nearest-neighbor searches.   What actions would allow the Specialist to get relevant numerical representations?

Explanation

The Specialist can obtain relevant numerical representations by running images through a neural network pre-trained on ImageNet and collecting the feature vectors from the penultimate layer. This approach allows the recommendation engine to capture high-level visual features and patterns in the images, which can be used for nearest-neighbor searches. The penultimate layer is chosen because it typically contains more abstract and discriminative features compared to earlier layers.
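One way to sketch this with a pre-trained Keras ResNet50; the 224x224 input size and the choice of ResNet50 are assumptions, and any ImageNet-pre-trained network works similarly.

    import tensorflow as tf

    # include_top=False with average pooling returns the penultimate-layer
    # feature vector (2048 values per image for ResNet50)
    extractor = tf.keras.applications.ResNet50(
        weights="imagenet", include_top=False, pooling="avg"
    )

    def embed(image_batch):
        # image_batch: float array of shape (n, 224, 224, 3)
        preprocessed = tf.keras.applications.resnet50.preprocess_input(image_batch)
        return extractor.predict(preprocessed)   # shape (n, 2048), ready for k-NN search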

Submit
159. When submitting Amazon SageMaker training jobs using one of the built-in algorithms, which common parameters MUST be specified? (Select THREE.)

Explanation

When submitting Amazon SageMaker training jobs using one of the built-in algorithms, three common parameters that must be specified are the IAM role, the EC2 instance class, and the output path. The IAM role is necessary for SageMaker to perform tasks on behalf of the users. The EC2 instance class determines whether the training will be run using CPU or GPU. The output path specifies where the trained model will be stored in an Amazon S3 bucket.
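These three parameters map directly onto the Estimator arguments in the SageMaker Python SDK; the role ARN, bucket, and algorithm version below are placeholders.

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"   # placeholder

    estimator = Estimator(
        image_uri=image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
        role=role,                                      # IAM role SageMaker assumes
        instance_count=1,
        instance_type="ml.m5.xlarge",                   # CPU vs GPU choice lives here
        output_path="s3://my-bucket/model-artifacts/",  # where the model artifact is stored
        sagemaker_session=session,
    )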

Submit
160. A Machine Learning Specialist at a company sensitive to security is preparing a dataset for model training. The dataset is stored in Amazon S3 and contains Personally Identifiable Information (PII). The dataset:
  • Must be accessible from a VPC only.
  • Must not traverse the public internet.
  How can these requirements be satisfied?

Explanation

To satisfy the requirements of making the dataset accessible from a VPC only and not traversing the public internet, a VPC endpoint needs to be created. Additionally, a bucket access policy should be applied to restrict access to the specific VPC endpoint and the VPC itself. This ensures that the dataset can only be accessed within the designated VPC, providing the necessary security measures for sensitive data.
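A sketch of such a bucket policy applied with boto3; the bucket name and VPC endpoint ID are placeholders.

    import json
    import boto3

    bucket = "pii-training-data"
    vpce_id = "vpce-0123456789abcdef0"

    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyAccessOutsideVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            # Deny every request that does not arrive through the VPC endpoint
            "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
        }],
    }

    boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))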

Submit
161. A Data Science team within a large company uses Amazon SageMaker notebooks to access data stored in Amazon S3 buckets. The IT Security team is concerned that internet-enabled notebook instances create security vulnerability where malicious code running on the instances could compromise data privacy.   The company mandates that all instances stay within a secured VPC with no internet access, and data communication traffic must stay within the AWS network.   How should the Data Science team configure the notebook instance placement to meet these requirements?

Explanation

The correct answer is to associate the Amazon SageMaker notebook with a private subnet in a VPC and ensure that the VPC has S3 VPC endpoints and Amazon SageMaker VPC endpoints attached to it. This configuration allows the Data Science team to access data stored in Amazon S3 buckets and use Amazon SageMaker while keeping the instances within a secured VPC with no internet access. The use of VPC endpoints ensures that data communication traffic stays within the AWS network, addressing the IT Security team's concerns about data privacy and security vulnerabilities.

Submit
162. You work for a web retailer where you need to analyze data produced for your company by an outside market data provider. You need to produce recommendations based on patterns in user preferences by demographic found in the supplied data. You have stored the data in one of your company's S3 buckets. You have created a Glue crawler that you have configured to crawl the data on S3 and you have written a custom classifier. Unfortunately, the crawler failed to create a schema.   Why might the Glue crawler have failed in this way?

Explanation

The Glue crawler might have failed to create a schema because all the classifiers returned a certainty of 0.0. This means that the classifiers were not able to accurately identify the data types and structure of the data in the S3 bucket. As a result, the crawler was unable to generate a schema based on the patterns in the user preferences by demographic found in the data.

Submit
163. An office security agency conducted a successful pilot using 100 cameras installed at key locations within the main office. Images from the cameras were uploaded to Amazon S3 and tagged using Amazon Rekognition, and the results were stored in Amazon ES.   The agency is now looking to expand the pilot into a full production system using thousands of video cameras in its office locations globally. The goal is to identify activities performed by non-employees in real time.   Which solution should the agency consider?

Explanation

The agency should consider using a proxy server at each local office and for each camera to stream the RTSP feed to a unique Amazon Kinesis Video Streams video stream. On each stream, they should use Amazon Rekognition Video to create a stream processor that detects faces from a collection of known employees. This solution allows for real-time identification of activities performed by non-employees and alerts the agency when non-employees are detected.

Submit
164. A Machine Learning Specialist kicks off a hyperparameter tuning job for a tree-based ensemble model using Amazon SageMaker with Area Under the ROC Curve (AUC) as the objective metric. This workflow will eventually be deployed in a pipeline that retrains and tunes hyperparameters each night to model click-through on data that goes stale every 24 hours. With the goal of decreasing the amount of time it takes to train these models, and ultimately to decrease costs, the Specialist wants to reconfigure the input hyperparameter range(s).   Which visualization will accomplish this?

Explanation

The scatter plot showing the correlation between maximum tree depth and the objective metric will accomplish the goal of decreasing the training time and costs. By analyzing the relationship between the maximum tree depth and the objective metric (AUC), the Machine Learning Specialist can identify the optimal value for the maximum tree depth that results in the best performance. This allows for fine-tuning the hyperparameters to improve the model's efficiency and reduce the time and cost required for training.

Submit
165. A Data Scientist wants to gain real-time insights into a data stream of GZIP files. Which solution would allow the use of SQL to query the stream with the LEAST latency?

Explanation

The correct answer is Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data. This solution allows the use of SQL to query the data stream with the least latency because Kinesis Data Analytics can process real-time streaming data using SQL queries, and AWS Lambda can be used to transform the data in real-time. This combination ensures that the insights can be gained in real-time with minimal delay.

Submit
166. A Machine Learning Specialist discovers the following statistics while experimenting on a model. What can the Specialist conclude from the experiments?

Explanation

The experiments suggest that the model in Experiment 1 initially had a high variance error, which means it was overfitting the training data. However, in Experiment 3, by applying regularization techniques, the variance error was reduced, indicating that the model became more generalizable. Additionally, Experiment 2 shows that there is minimal bias error in Experiment 1, meaning that the model was able to capture the underlying patterns in the data well.

Submit
167. You work in the data analytics department of a ride sharing software company. You need to use the K-means machine learning algorithm to separate your company's optimized ride data into clusters based on ride coordinates.   How would you best use AWS Glue to build the data tables needed to classify the ride data?

Explanation

AWS Glue crawlers can be used to automatically discover and catalog the ride data, including its structure and schema. This will create the necessary data tables that can be used for classification. By using a K-means classifier, the ride data can be separated into clusters based on the coordinates. This approach leverages the capabilities of AWS Glue to automate the process of preparing and organizing the data for classification.

Submit
168. Which AWS service provides a managed environment for training and deploying machine learning models with built-in support for distributed training, automatic model tuning, and integration with other AWS services?

Explanation

AWS SageMaker is the service that provides a managed environment for training and deploying machine learning models. It supports distributed training, automatic model tuning, and integrates with other AWS services. Unlike AWS Glue, which handles data integration and ETL, and Amazon Comprehend, which focuses on text analysis, SageMaker is specifically designed for end-to-end machine learning workflows, including model training, tuning, and deployment. Amazon Lex is for building chatbots, not model deployment.

Submit
169. A sports and betting company uses machine learning to predict the odds of winning during sporting events. It uses the Amazon SageMaker endpoint to serve its production model. The endpoint is on an m5.8xlarge instance. What can the company do to ensure that this endpoint is highly available while using the most cost-effective and easily managed solution?

Explanation

By increasing the number of instances associated with the endpoint to more than one, the company can ensure high availability of the endpoint. This means that even if one instance fails or experiences issues, the other instances can continue serving the production model, minimizing downtime. This solution is cost-effective because it does not require upgrading to a larger instance size or adding Elastic Inference, and it is easily managed because the company simply raises the instance count in the endpoint configuration. SageMaker automatically spreads the instances behind a single endpoint across Availability Zones and balances incoming requests among them.

Submit
170. A gaming company has launched an online game where people can start playing for free but they need to pay if they choose to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year. The company has gathered a labeled dataset from 1 million users.   The training dataset consists of 1,000 positive samples (from users who ended up paying within 1 year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features including user age, device, location, and play patterns.   Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory. Which of the following approaches should the Data Science team take to mitigate this issue? (Select TWO)

Explanation

The Data Science team should generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data. This will help to balance the dataset and provide more examples of paid users for the model to learn from. Additionally, the team should change the cost function so that false negatives have a higher impact on the cost value than false positives. This will prioritize correctly identifying users who will become paid users, even if it results in more false positives.
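A minimal numpy sketch of the noisy-duplication idea, where X_pos stands in for the 1,000 positive samples' feature rows and the noise scale is an assumption to tune.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    # Duplicate the positive samples and perturb the copies with small Gaussian noise
    noise = rng.normal(loc=0.0, scale=0.01, size=X_pos.shape)
    X_pos_augmented = np.vstack([X_pos, X_pos + noise])
    y_pos_augmented = np.ones(len(X_pos_augmented))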

Submit
171. You are a machine learning expert working for a marketing firm. You are supporting a team of data scientists and marketing managers who are running a marketing campaign. Your data scientists and marketing managers need to answer the question "Will this user subscribe to my campaign?" You have been given a dataset in the form of a CSV file which is formatted as such:   UserId, jobId, jobDescription, educationLevel, campaign, duration, willRespondToCampaign   When you build your schema for this dataset, which of the following data descriptors would you use to define the willRespondToCampaign attribute? (Select TWO).

Explanation

The "willRespondToCampaign" attribute is a binary variable that indicates whether a user will respond to the campaign or not. Therefore, the data descriptor "Binary" is appropriate to define this attribute. Additionally, the "targetAttributeName" is also a valid data descriptor as it specifies the attribute that is the target variable for the prediction task, which in this case is "willRespondToCampaign".

Submit
172. A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided. Based on this information, which model would have the HIGHEST recall with respect to the fraudulent class?

Explanation

The decision tree model would have the highest recall with respect to the fraudulent class. Decision trees are known for their ability to capture complex relationships between features and target variables. In this case, the decision tree can effectively learn the patterns in the age of account and transaction month features that are indicative of fraudulent behavior. This allows the model to correctly identify a higher proportion of fraudulent instances, resulting in a higher recall for the fraudulent class.

Submit
173. A Machine Learning Engineer created a pipeline for training an ML model using an Amazon SageMaker training job. The training job began successfully but then failed after running for five minutes. How should the Engineer begin to debug this issue? (Select TWO.)

Explanation

To debug the issue, the Machine Learning Engineer should call the DescribeTrainingJob API and check the FailureReason field, which reports why the job failed. Additionally, the Engineer should go to Amazon CloudWatch Logs and check the logs for the given training job; these logs contain any errors emitted during training and aid the debugging process. Checking the job history on the SageMaker console or reviewing AWS CloudTrail logs would not surface the underlying training error itself.
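A quick sketch of the first check with boto3 (the training job name is a placeholder); the corresponding CloudWatch logs live in the /aws/sagemaker/TrainingJobs log group.

    import boto3

    sm = boto3.client("sagemaker")

    job = sm.describe_training_job(TrainingJobName="my-training-job")
    print(job["TrainingJobStatus"])    # e.g. "Failed"
    print(job.get("FailureReason"))    # reason SageMaker reports when a job fails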

Submit
174. A company has raw user and transaction data stored in Amazon S3, a MySQL database, and Amazon Redshift. A Data Scientist needs to perform an analysis by joining the three datasets from Amazon S3, MySQL, and Amazon Redshift, and then calculating the average of a few selected columns from the joined data.   Which AWS service should the Data Scientist use?

Explanation

Amazon Redshift Spectrum should be used by the Data Scientist. Redshift Spectrum allows users to run complex SQL queries against data stored in Amazon S3, without the need to load the data into Redshift first. It can directly access data from S3 and join it with data in Redshift and MySQL. This makes it a suitable choice for the Data Scientist to join the datasets from Amazon S3, MySQL, and Amazon RedShift, and perform the required analysis.

Submit
175. You are a data scientist working for a cancer screening center. The center has gathered data on many patients that have been screened over the years. The data is obviously skewed toward true negative results, as most screened patients don't have cancer. You are evaluating several machine learning models to decide which model best predicts true positives when using your cancer screening data. You have split your data into a 70/30 ratio of training set to test set. You now need to decide which metric to use to evaluate your models.   Which metric will most accurately determine the model best suited to solve your classification problem?

Explanation

The PR Curve is the most suitable metric to determine the model best suited for the classification problem in this scenario. Since the data is skewed towards true negative results, precision and recall are more appropriate metrics than the ROC curve. Precision measures the proportion of true positive predictions out of all positive predictions, while recall measures the proportion of true positive predictions out of all actual positive cases. The PR Curve combines both precision and recall, providing a more accurate evaluation of the model's performance in identifying true positives.

Submit
176. You work for a real estate company where you are building a machine learning model to predict the prices of houses. You are using a regression decision tree. As you train your model you see that it is overfitted to your training data and that it doesn't generalize well to unseen data.   How can you improve your situation and get better training results in the most efficient way?

Explanation

Using a random forest by building multiple randomized decision trees and averaging their outputs can improve the situation and provide better training results. Random forests help to reduce overfitting by introducing randomness into the model. By building multiple decision trees with different subsets of the data and features, the model can learn from different perspectives and make more accurate predictions. Averaging the outputs of these trees helps to reduce the impact of individual overfitted trees and provides a more generalized prediction for unseen data.
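With scikit-learn the change from a single tree to a random forest is small; the training arrays and hyperparameter values below are assumptions.

    from sklearn.ensemble import RandomForestRegressor

    # An ensemble of randomized trees whose averaged predictions generalize
    # better than one deep, overfitted regression tree
    model = RandomForestRegressor(n_estimators=200, max_depth=10, random_state=0)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)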

Submit
177. A company is setting up an Amazon SageMaker environment. The corporate data security policy does not allow communication over the internet.   How can the company enable the Amazon SageMaker service without enabling direct internet access to Amazon SageMaker notebook instances?

Explanation

If the corporate data security policy does not allow communication over the internet but the company still wants to use Amazon SageMaker, they can set up an Amazon SageMaker VPC (Virtual Private Cloud) configuration to ensure that the service is accessible only within their private network. They can create a Virtual Private Cloud (VPC) for Amazon SageMaker. This involves setting up private subnets within the VPC. Then, they can create VPC endpoints for the necessary AWS services that Amazon SageMaker requires. VPC endpoints allow you to privately connect your VPC to supported AWS services without needing public internet access.

Submit
178. An oil and natural gas company is utilizing machine learning to identify optimal drilling sites. They have selected Amazon SageMaker for developing their models. Their data scientists, working from notebook instances, face lengthy delays during model training. The company aims to reduce these delays to enhance the efficiency of model iterations, allowing data scientists to rapidly test and modify their models without significant code adjustments or extended waits for data loading and model training. How can a team of data scientists reduce the waiting time associated with training machine learning models for discovering drilling locations, using Amazon SageMaker, while minimizing code changes for faster experimentation?

Explanation



Using Amazon SageMaker Estimators in local mode allows data scientists to quickly test and iterate on their machine learning models directly within their SageMaker notebook instances without having to launch separate training jobs on SageMaker's managed training infrastructure. This approach dramatically reduces the waiting time typically associated with starting up training jobs and loading data in the cloud. By training models locally, data scientists can receive immediate feedback and make rapid adjustments to their models, thus accelerating the experimental cycle. This method also minimizes the changes needed in the code, as the same SageMaker Estimator can be used for both local training and full-scale cloud training simply by changing the instance type from local to a specific ML instance type provided by SageMaker.
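A sketch of what switching to local mode can look like; the container image, role, and data path are assumptions, and only the instance_type changes compared with a full training job.

    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri=training_image_uri,    # assumed: the algorithm's container image
        role=role,                       # assumed: the notebook's execution role
        instance_count=1,
        instance_type="local",           # run the container on the notebook instance
        output_path="s3://my-bucket/output/",
    )

    # Local mode can read training data straight from the local file system
    estimator.fit({"train": "file://./data/train"})
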
Submit
179. A Machine Learning Specialist is using Apache Spark for pre-processing training data As part of the Spark pipeline, the Specialist wants to use Amazon SageMaker for training a model and hosting it.   Which of the following would the Specialist do to integrate the Spark application with SageMaker? (Select THREE)

Explanation

To integrate the Spark application with SageMaker, the Machine Learning Specialist would need to perform the following steps: 1) Install the SageMaker Spark library in the Spark environment, which allows for seamless integration between Spark and SageMaker. 2) Use the appropriate estimator from the SageMaker Spark Library to train a model, which provides a high-level API for training models on SageMaker using Spark. 3) Use the SageMaker Model transform method to get inferences from the model hosted in SageMaker, which allows for real-time inference on new data using the trained model.

Submit
180. Which of the following AWS services is NOT directly related to machine learning?

Explanation

AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers. While it can be used to support machine learning applications, it's not a dedicated machine learning service.

Amazon SageMaker is a comprehensive platform for building, training, and deploying machine learning models.

Amazon Rekognition is a service that uses deep learning to analyze images and videos.

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to extract insights from text.

Submit
View My Results

Quiz Review Timeline (Updated): Jan 8, 2025 +

Our quizzes are rigorously reviewed, monitored and continuously updated by our expert board to maintain accuracy, relevance, and timeliness.

  • Current Version
  • Jan 08, 2025
    Quiz Edited by
    ProProfs Editorial Team
  • Oct 02, 2020
    Quiz Created by
    Olatunde.olanayo
Cancel
  • All
    All (180)
  • Unanswered
    Unanswered ()
  • Answered
    Answered ()
A manufacturing company has structured and unstructured data stored in...
A manufacturing company has structured and unstructured data stored in...
10. A Data Scientist for a credit card company is creating a solution...
A Machine Learning Specialist must build out a process to query a...
A Machine Learning Specialist is working with a media company to...
A navigation and transportation company is using satellite images to...
A multi-national banking organization provides loan services to...
This graph shows the training and validation loss against the epochs...
A company is interested in building a fraud detection model....
A Machine Learning Specialist is configuring Amazon SageMaker so...
IT leadership wants to transition a company's existing machine...
An employee found a video clip with audio on a company's social...
A Machine Learning Specialist is required to build a supervised...
A Data Scientist wants to implement a near-real-time anomaly detection...
A video streaming company is looking to create a personalized...
The Chief Editor for a product catalog wants the Research and...
A Machine Learning Specialist deployed a model that provides product...
A Machine Learning Specialist is building a convolutional neural...
A Machine Learning Specialist is designing a system for improving...
Which of the following metrics should a Machine Learning Specialist...
An ML scientist has built a decision tree model using scikit-learn...
A Data Science team is designing a dataset repository where it will...
A Machine Learning Specialist is building a model that will perform...
A Machine Learning Specialist is building a prediction model for a...
A web-based company wants to improve its conversion rate on its...
A Machine Learning Specialist is building a model to predict future...
A company wants to classify user behavior as either fraudulent or...
You work for the information security department of a major...
A Machine Learning Specialist was given a dataset consisting of...
A Machine Learning Specialist is working with a large cyber security...
A Machine Learning Engineer is creating a regression model for...
Amazon Connect has recently been tolled out across a company as a...
You wish to use a SageMaker notebook within a VPC. SageMaker notebook...
During mini-batch training of a neural network for a classification...
The displayed graph is from a forecasting model for testing a time...
An online reseller has a large, multi-column dataset with one column...
A Machine Learning Specialist receives customer data for an online...
A Machine Learning Specialist needs to create a data repository to...
An ad tech company is using an XGBoost model to classify its...
A healthcare company using the AWS Cloud has access to a variety of...
A Machine Learning Specialist has completed a proof of concept for a...
A Data Scientist created a correlation matrix between nine variables...
A manufacturing company has a large set of labeled historical sales...
A real estate company wants to provide its customers with a more...
A log analytics company wants to provide a history of Amazon SageMaker...
A Machine Learning Specialist is building a logistic regression model...
A Data Scientist wants to include "month" as a categorical column in a...
A Data Scientist at a waste recycling company trained a CNN model to...
A video streaming company wants to analyze its VPC flow logs to build...
A manufacturing company asks its Machine Learning Specialist to...
A Machine Learning Specialist prepared the following graph displaying...
A Data Engineer needs to build a model using a dataset containing...
A team of Data Scientists wants to use Amazon SageMaker training jobs...
A large JSON dataset for a project has been uploaded to a private...
A city wants to monitor its air quality to address the consequences of...
A Machine Learning Specialist wants to determine the appropriate...
A video streaming company wants to create a searchable video library...
A Machine Learning Specialist is training a model to identify the make...
A Machine Learning Specialist is building a supervised model that will...
Given the following confusion matrix for a movie classification model,...
A Machine Learning Specialist built an image classification deep...
A monitoring service generates 1 TB of scale metrics record data every...
Which probability distribution would describe the likelihood of...
While working on a neural network project, a Machine Learning...
A Machine Learning Specialist is assigned a TensorFlow project using...
A Machine Learning Specialist needs to be able to ingest streaming...
A Machine Learning Specialist is evaluating an ML model using a custom...
Data and analytics company is expanding its platform on AWS. The...
A manufacturing company wants to increase the longevity of its factory...
A manufacturer of car engines collects data from cars as they are...
A Machine Learning Specialist is creating a new natural language...
A large consumer goods manufacturer has the following products on...
A security and networking company wants to use ML to flag certain IP...
A Machine Learning specialist is examining the root cause for...
A Machine Learning Specialist trained a regression model, but the...
In AWS SageMaker, what feature allows you to distribute machine...
A healthcare organization has an application that takes in sensitive...
A Machine Learning Specialist is implementing a full Bayesian network...
You work for a manufacturing company that produces retail apparel,...
A Machine Learning Specialist is packaging a custom ResNet model into...
A Machine Learning Specialist is working with a large company to...
A retail company intends to use machine learning to categorize new...
A Machine Learning Specialist has built a model using Amazon SageMaker...
A company's Machine Learning Specialist needs to improve the...
A Machine Learning Specialist is working for a credit card processing...
17. An online news organization wants to expand its reach globally by...
A transportation company currently uses Amazon EMR with Apache Spark...
A Machine Learning Specialist works for a credit card processing...
Your marketing department wishes to understand how their products are...
A social networking organization wants to analyze all the comments and...
An analytics company wants to use a fully managed service that...
A financial services company is building a robust serverless data lake...
You would like to draw inferences from a model deployed to Amazon...
An insurance company is developing a new device for vehicles that uses...
You work for a retail firm that wishes to conduct a direct mail...
For the given confusion matrix, what is the recall and precision of...
A Machine Learning Specialist is using Amazon SageMaker to host a...
An interactive online dictionary wants to add a widget that displays...
A term frequency–inverse document frequency (tf–idf)...
A data engineer needs to create a cost-effective data pipeline...
A Data Scientist at a retail company is using Amazon SageMaker to...
A Data Scientist is working on an application that performs sentiment...
A Machine Learning Specialist working for an online fashion company...
A Machine Learning Specialist observes several performance problems...
An advertising and analytics company uses machine learning to predict...
A company is using Amazon Polly to translate plaintext documents to...
A financial organization uses multiple ML models to detect irregular...
A bank's Machine Learning team is developing an approach for...
A Machine Learning Engineer wants to use Amazon SageMaker and the...
A data scientist is working on optimizing a model during the training...
A company is running an Amazon SageMaker training job that will access...
A Data Scientist wants to tune the hyperparameters of a machine...
A Data Scientist wants to use the Amazon SageMaker hyperparameter...
A Machine Learning Specialist is developing a custom video...
A Machine Learning Specialist is developing a daily ETL workflow...
A company is using its genomic data to classify how different human...
A Machine Learning Specialist needs to move and transform data in...
A data scientist is evaluating different binary classification models....
An ML Engineer at a real estate startup wants to use a new...
A Machine Learning Specialist has created a deep learning neural...
A Machine Learning Specialist is using an Amazon SageMaker notebook...
A company has collected customer comments on its products, rating them...
A financial planning company is using the Amazon SageMaker endpoint...
A ride-share company wants to create intelligent conversational...
What factors lead to the wide adoption of neural networks in the last...
A Machine Learning Specialist is preparing data for training on Amazon...
A Machine Learning Specialist is configuring automatic model tuning in...
A company is running a machine learning prediction service that...
If you wanted to build your own Alexa-type device that converses with...
A Data Scientist at a credit card company trained a classification...
A Machine Learning team uses Amazon SageMaker to train an Apache MXNet...
In AWS SageMaker, which of the following statements accurately...
A retail chain has been ingesting purchasing records from its network...
A Machine Learning Specialist uploads a dataset to an Amazon S3 bucket...
Example Corp has an annual sale event from October to December. The...
A Data Scientist is using stochastic gradient descent (SGD) as the...
A Data Scientist needs to migrate an existing on-premises ETL process...
A Data Scientist working for an autonomous vehicle company is building...
A Data Scientist wants to create a linear regression model to train on...
A Machine Learning Specialist is working with multiple data sources...
A Machine Learning Engineer is creating and preparing data for a...
A company is observing low accuracy while training on the default...
A Data Scientist needs to create a serverless ingestion and analytics...
An e-commerce company needs a customized training model to classify...
A Data Scientist at an ad-tech startup wants to update an ML model...
A real estate startup wants to use ML to predict the value of homes in...
A machine learning engineer is preparing a data frame for a supervised...
An insurance company needs to automate claim compliance reviews...
Which of the following best describes the purpose of Amazon SageMaker...
A Data Scientist is developing a machine learning model to predict...
A Marketing Manager at a pet insurance company plans to launch a...
A news organization wants to extract metadata from its articles and...
While reviewing the histogram for residuals on regression evaluation...
You work in the security department of your company's IT division....
A Data Scientist is training a convolutional neural network model to...
An agency collects census information within a country to determine...
A machine translation company is deploying its language translation...
A Machine Learning Specialist is developing a recommendation engine for...
When submitting Amazon SageMaker training jobs using one of the...
A Machine Learning Specialist at a company sensitive to security is...
A Data Science team within a large company uses Amazon SageMaker...
You work for a web retailer where you need to analyze data produced...
An office security agency conducted a successful pilot using 100...
A Machine Learning Specialist kicks off a hyperparameter tuning job...
A Data Scientist wants to gain real-time insights into a data stream...
A Machine Learning Specialist discovers the following statistics while...
You work in the data analytics department of a ride sharing software...
Which AWS service provides a managed environment for training and...
A sports and betting company uses machine learning to predict the odds...
A gaming company has launched an online game where people can start...
You are a machine learning expert working for a marketing firm. You...
A company wants to classify user behavior as either fraudulent or...
A Machine Learning Engineer created a pipeline for training an ML...
A company has raw user and transaction data stored in Amazon S3, a MySQL...
You are a data scientist working for a cancer screening center. The...
You work for a real estate company where you are building a machine...
A company is setting up an Amazon SageMaker environment. The corporate...
An oil and natural gas company is utilizing machine learning to...
A Machine Learning Specialist is using Apache Spark for pre-processing...
Which of the following AWS services is NOT directly related to machine...