AWS Machine Learning - Specialty (MLS-C01 – 2020)

By Olatunde.olanayo, Community Contributor | Quizzes Created: 1 | Total Attempts: 6,165
  • 1. 

    A Machine Learning Specialist must build out a process to query a dataset on Amazon S3 using Amazon Athena. The dataset contains more than 800,000 records stored as plaintext CSV files. Each record contains 200 columns and is approximately 1.5 MB in size. Most queries will span only 5 to 10 columns. How should the Machine Learning Specialist transform the dataset to minimize query runtime?

    • Convert the records to Apache Parquet format
    • Convert the records to JSON format
    • Convert the records to GZIP CSV format
    • Convert the records to XML format
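    Because Parquet is a columnar format, Athena can scan just the 5 to 10 queried columns instead of all 200, which is why converting the records to Apache Parquet minimizes query runtime. As a rough illustration only (file names are placeholders), one way to rewrite a CSV file as Snappy-compressed Parquet with pandas and pyarrow:

        import pandas as pd

        # Toy stand-in for one CSV object; in practice the CSV would be read from S3
        # (pandas accepts s3:// paths when s3fs is installed) or converted by a Glue ETL job.
        df = pd.DataFrame({"col_001": range(3), "col_002": ["a", "b", "c"]})
        df.to_parquet("records.snappy.parquet", engine="pyarrow", compression="snappy")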
About This Quiz

The AWS Certified Machine Learning - Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate's ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.

AWS Machine Learning - Specialty (MLS-C01 – 2020) - Quiz

Quiz Preview

  • 2. 

    A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data. Which solution requires the LEAST effort to be able to query this data?

    • Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.

    • Use AWS Glue to catalogue the data and Amazon Athena to run queries.

    • Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries.

    • Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.

    Correct Answer
    A. Use AWS Glue to catalogue the data and Amazon Athena to run queries.
    Explanation
    The correct answer is to use AWS Glue to catalogue the data and Amazon Athena to run queries. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. It can automatically discover and catalog data stored in Amazon S3, making it easier to query the data using SQL. Amazon Athena is an interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. This combination of AWS Glue and Amazon Athena requires the least effort as it eliminates the need for manual data transformation and provides a simple and efficient way to query the data.
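    As a minimal illustration (the database, table, and bucket names below are placeholders), once AWS Glue has catalogued the S3 data, a SQL query can be submitted to Athena with boto3:

        import boto3

        athena = boto3.client("athena")
        response = athena.start_query_execution(
            QueryString="SELECT machine_id, COUNT(*) FROM sensor_events GROUP BY machine_id",
            QueryExecutionContext={"Database": "manufacturing_db"},      # Glue Data Catalog database
            ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
        )
        print(response["QueryExecutionId"])   # poll get_query_execution / get_query_results for output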


  • 3. 

    A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data. Which solution requires the LEAST effort to be able to query this data?

    • Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.

    • Use AWS Glue to catalogue the data and Amazon Athena to run queries.

    • Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries.

    • Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.

    Correct Answer
    A. Use AWS Glue to catalogue the data and Amazon Athena to run queries.
    Explanation
    The solution that requires the least effort to be able to query the data is to use AWS Glue to catalogue the data and Amazon Athena to run queries. AWS Glue can automatically discover and catalog the structured and unstructured data stored in the Amazon S3 bucket, making it easier to query. Amazon Athena is a serverless query service that allows the Machine Learning Specialist to run SQL queries directly on the data without the need for any infrastructure setup or management. This solution eliminates the need for additional data transformation or ETL processes, reducing the overall effort required.


  • 4. 

    A Data Scientist for a credit card company is creating a solution to predict credit card fraud at the time of transaction. To that end, the Data Scientist is looking to create an ML model to predict fraud and will do so by training that model on an existing dataset of credit card transactions. That dataset contains 1,000 examples of transactions in total, only 50 of which are labeled as fraud. How should the Data Scientist deal with this class imbalance?

    • Use the Synthetic Minority Oversampling Technique (SMOTE) to oversample the fraud records

    • Undersample the non-fraudulent records to improve the class imbalance

    • Use K-fold cross validation when training the model

    • Drop all the fraud examples, and use a One-Class SVM to classify

    Correct Answer
    A. Use the Synthetic Minority Oversampling Technique (SMOTE) to oversample the fraud records
    Explanation
    The Data Scientist should use the Synthetic Minority Oversampling Technique (SMOTE) to oversample the fraud records. This technique generates synthetic samples of the minority class (fraudulent records) in order to balance the class distribution. By creating additional synthetic fraud examples, the ML model will have more data to learn from and improve its ability to predict fraud accurately. This approach is suitable for dealing with class imbalance and can help improve the performance of the model in detecting credit card fraud.
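    A minimal sketch of SMOTE with the imbalanced-learn library, using synthetic data as a stand-in for the 1,000 transactions (roughly 5% fraud):

        from collections import Counter
        from sklearn.datasets import make_classification
        from imblearn.over_sampling import SMOTE

        # Toy dataset mimicking the class imbalance described in the question
        X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
        X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
        print(Counter(y), Counter(y_res))   # the fraud class is oversampled to match the majority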


  • 5. 

    A navigation and transportation company is using satellite images to model weather around the world in order to create optimal routes for its ships and planes. The company is using Amazon SageMaker training jobs to build and train its models. However, during training, it takes too long to download the company’s 100 GB data from Amazon S3 to the training instance before the training starts. What should the company do to speed up its training jobs while keeping the costs low?

    • Increase the instance size for training

    • Increase the batch size in the model

    • Change the input mode to Pipe

    • Create an Amazon EBS volume with the data on it and attach it to the training job

    Correct Answer
    A. Change the input mode to Pipe
    Explanation
    Changing the input mode to Pipe would speed up the training jobs while keeping the costs low. By using Pipe mode, the company can stream the data directly from Amazon S3 to the training instance without the need to download the entire 100 GB data before training starts. This eliminates the time-consuming download process and allows for faster training. Additionally, it helps in reducing storage costs as there is no need to store the data on the training instance.
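    A rough sketch with the SageMaker Python SDK (v2); the container image, role, and S3 paths are placeholders:

        from sagemaker.estimator import Estimator

        estimator = Estimator(
            image_uri="<training-image-uri>",
            role="<execution-role-arn>",
            instance_count=1,
            instance_type="ml.p3.2xlarge",
            input_mode="Pipe",                # stream records from S3 instead of downloading them first
            output_path="s3://example-bucket/model-artifacts/",
        )
        estimator.fit({"train": "s3://example-bucket/satellite-images/train/"})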


  • 6. 

    A Machine Learning Specialist is working with a media company to perform classification on popular articles from the company's website. The company is using random forests to classify how popular an article will be before it is published. A sample of the data being used is below. Given the dataset, the Specialist wants to convert the Day-Of_Week column to binary values. What technique should be used to convert this column to binary values?

    • Binarization

    • One-hot encoding

    • Tokenization

    • Normalization transformation

    Correct Answer
    A. One-hot encoding
    Explanation
    The technique that should be used to convert the Day-Of_Week column to binary values is one-hot encoding. One-hot encoding is a technique used to represent categorical variables as binary vectors. Each category is converted into a binary column, where a value of 1 represents the presence of that category and a value of 0 represents the absence. This is commonly used in machine learning algorithms to handle categorical data and allow them to be used in mathematical calculations.
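    A minimal pandas sketch (the column name and values are illustrative):

        import pandas as pd

        df = pd.DataFrame({"day_of_week": ["Mon", "Tue", "Sun", "Mon"]})
        encoded = pd.get_dummies(df, columns=["day_of_week"])
        print(encoded)   # one binary indicator column per weekday value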


  • 7. 

    This graph shows the training and validation loss against the epochs for a neural network. The network being trained is as follows:
    • Two dense layers, one output neuron
    • 100 neurons in each layer
    • 100 epochs
    • Random initialization of weights
      Which technique can be used to improve model performance in terms of accuracy in the validation set?

    • Early stopping

    • Random initialization of weights with appropriate seed

    • Increasing the number of epochs

    • Adding another layer with the 100 neurons

    Correct Answer
    A. Early stopping
    Explanation
    Early stopping is a technique that can be used to improve model performance in terms of accuracy in the validation set. It involves stopping the training process before all epochs are completed, based on a predefined condition. This helps prevent overfitting and allows the model to generalize better to unseen data. By monitoring the validation loss during training, early stopping can determine when the model starts to overfit and stop training at that point, resulting in improved accuracy on the validation set.
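    A minimal Keras sketch of the described network (two dense layers of 100 neurons, one output neuron) trained with early stopping on the validation loss; the input shape and random data are placeholders:

        import numpy as np
        from tensorflow import keras

        X = np.random.rand(1000, 20).astype("float32")
        y = np.random.randint(0, 2, size=1000)

        model = keras.Sequential([
            keras.Input(shape=(20,)),
            keras.layers.Dense(100, activation="relu"),
            keras.layers.Dense(100, activation="relu"),
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

        early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                                   restore_best_weights=True)
        model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop], verbose=0)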


  • 8. 

    A multi-national banking organization provides loan services to customers worldwide. Many of its customers still submit loan applications in paper form in one of the bank’s branch locations. The bank wants to speed up the loan approval process for this set of customers by using machine learning. More specifically, it wants to create a process in which customers submit the application to the clerk, who scans and uploads it to the system. The system then reads and provides an approval or denial of the application in a matter of minutes. What can the bank use to read and extract the necessary data from the loan applications without needing to manage the process?

    • A custom CNN model

    • An LSTM model

    • Amazon Textract

    • Amazon Personalize

    Correct Answer
    A. Amazon Textract
    Explanation
    The bank can use Amazon Textract to read and extract the necessary data from the loan applications without needing to manage the process. Amazon Textract is a machine learning service that automatically extracts text and data from scanned documents. It can analyze the loan applications and extract relevant information such as personal details, financial data, and loan request information. This would significantly speed up the loan approval process as the system can quickly analyze and provide an approval or denial of the application in a matter of minutes.


  • 9. 

    An employee found a video clip with audio on a company's social media feed. The language used in the video is Spanish. English is the employee's first language, and they do not understand Spanish. The employee wants to do a sentiment analysis. What combination of services is the MOST efficient to accomplish the task?

    • Amazon Transcribe, Amazon Translate, and Amazon Comprehend

    • Amazon Transcribe, Amazon Comprehend, and Amazon SageMaker seq2seq

    • Amazon Transcribe, Amazon Translate, and Amazon SageMaker Neural Topic Model (NTM)

    • Amazon Transcribe, Amazon Translate, and Amazon SageMaker BlazingText

    Correct Answer
    A. Amazon Transcribe, Amazon Translate, and Amazon Comprehend
    Explanation
    The most efficient combination of services to accomplish the task is Amazon Transcribe, Amazon Translate, and Amazon Comprehend. Amazon Transcribe can be used to transcribe the audio in the video clip from Spanish to text. Amazon Translate can then be used to translate the transcribed text from Spanish to English, allowing the employee to understand the content. Finally, Amazon Comprehend can be used to perform sentiment analysis on the translated text, providing insights into the sentiment expressed in the video clip.


  • 10. 

    A Machine Learning Specialist is required to build a supervised image-recognition model to identify a cat. The ML Specialist performs some tests and records the following results for a neural network-based image classifier: Total number of images available = 1,000; Test set images = 100 (constant test set). The ML Specialist notices that, in over 75% of the misclassified images, the cats were held upside down by their owners. Which techniques can be used by the ML Specialist to improve this specific test error?

    • Increase the training data by adding variation in rotation for training images.

    • Increase the number of epochs for model training.

    • Increase the number of layers for the neural network.

    • Increase the dropout rate for the second-to-last layer.

    Correct Answer
    A. Increase the training data by adding variation in rotation for training images.
    Explanation
    The ML Specialist can improve the test error by increasing the training data and adding variation in rotation for training images. This can help the model to learn and recognize cats even when they are held upside down by their owners. By exposing the model to a wider range of cat orientations during training, it can become more robust and accurate in identifying cats in different positions. Increasing the number of epochs, the number of layers, or the dropout rate may not specifically address the issue of misclassifying upside-down cats.
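    A small sketch of rotation-based augmentation using Keras preprocessing layers (the factor of 0.5 corresponds to rotations of up to ±180 degrees; the image array is a toy placeholder):

        import numpy as np
        from tensorflow import keras

        augment = keras.Sequential([
            keras.layers.RandomFlip("horizontal"),
            keras.layers.RandomRotation(0.5),   # up to half a full turn in either direction
        ])

        images = np.random.rand(8, 128, 128, 3).astype("float32")   # stand-in for training photos
        augmented = augment(images, training=True)                  # augmentation applied at training time
        print(augmented.shape)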


  • 11. 

    IT leadership wants to transition a company's existing machine learning data storage environment to AWS as a temporary ad hoc solution. The company currently uses a custom software process that heavily leverages SQL as a query language and exclusively stores generated CSV documents for machine learning. The ideal state for the company would be a solution that allows it to continue to use the current workforce of SQL experts. The solution must also support the storage of CSV and JSON files, and be able to query over semi-structured data. The following are high priorities for the company:
    • Solution simplicity
    • Fast development time
    • Low cost
    • High flexibility
    What technologies meet the company's requirements?

    • Amazon S3 and Amazon Athena

    • Amazon Redshift and AWS Glue

    • Amazon DynamoDB and DynamoDB Accelerator (DAX)

    • Amazon RDS and Amazon ES

    Correct Answer
    A. Amazon S3 and Amazon Athena
    Explanation
    The company's ideal state is to continue using SQL experts and support the storage of csv and JSON files, as well as querying over semi-structured data. Amazon S3 allows for the storage of csv and JSON files, while Amazon Athena is a serverless query service that supports SQL queries on data stored in Amazon S3. This solution is simple, has fast development time, low cost, and high flexibility, making it the best fit for the company's requirements.


  • 12. 

    A Machine Learning Specialist is configuring Amazon SageMaker so multiple Data Scientists can access notebooks, train models, and deploy endpoints. To ensure the best operational performance, the Specialist needs to be able to track how often the Scientists are deploying models, GPU and CPU utilization on the deployed SageMaker endpoints, and all errors that are generated when an endpoint is invoked. Which services are integrated with Amazon SageMaker to track this information? (Select TWO.)

    • AWS CloudTrail

    • AWS Health

    • AWS Trusted Advisor

    • Amazon CloudWatch

    • AWS Config

    Correct Answer(s)
    A. AWS CloudTrail
    A. Amazon CloudWatch
    Explanation
    The correct answer is AWS CloudTrail and Amazon CloudWatch. AWS CloudTrail is used to track API activity and monitor actions taken by users, including model deployments and endpoint invocations. Amazon CloudWatch is used to monitor resource utilization, such as GPU and CPU utilization on the deployed SageMaker endpoints. AWS Health, AWS Trusted Advisor, and AWS Config are not directly integrated with Amazon SageMaker for tracking this information.


  • 13. 

    A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?

    • Oversampling using bootstrapping

    • Undersampling

    • Oversampling using SMOTE

    • Class weight adjustment

    Correct Answer
    A. Oversampling using SMOTE
    Explanation
    With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario.


  • 14. 

    A Data Scientist wants to implement a near-real-time anomaly detection solution for routine machine maintenance. The data is currently streamed from connected devices by AWS IoT to an Amazon S3 bucket and then sent downstream for further processing in a real-time dashboard. What service can the Data Scientist use to achieve the desired outcome with minimal change to the pipeline?

    • Amazon CloudWatch

    • Amazon SageMaker

    • Amazon EMR with Spark

    • Amazon Kinesis Data Analytics

    Correct Answer
    A. Amazon Kinesis Data Analytics
    Explanation
    Amazon Kinesis Data Analytics is the correct answer because it is a service specifically designed for real-time data processing and analytics. It can easily integrate with AWS IoT and Amazon S3 to process the streamed data and provide near-real-time anomaly detection. Using Amazon Kinesis Data Analytics would require minimal changes to the existing pipeline, making it the ideal choice for the Data Scientist.


  • 15. 

    A Machine Learning Specialist deployed a model that provides product recommendations on a company's website. Initially, the model was performing very well and resulted in customers buying more products on average. However, within the past few months the Specialist has noticed that the effect of product recommendations has diminished and customers are starting to return to their original habits of spending less. The Specialist is unsure of what happened, as the model has not changed from its initial deployment over a year ago. Which method should the Specialist try to improve model performance?

    • The model needs to be completely re-engineered because it is unable to handle product inventory changes

    • The model's hyperparameters should be periodically updated to prevent drift

    • The model should be periodically retrained from scratch using the original data while adding a regularization term to handle product inventory changes

    • The model should be periodically retrained using the original training data plus new data as product inventory changes

    Correct Answer
    A. The model should be periodically retrained using the original training data plus new data as product inventory changes
    Explanation
    The correct answer is that the model should be periodically retrained using the original training data plus new data as product inventory changes. This is because the Specialist has noticed a decrease in the effectiveness of the model over time, indicating that the model needs to be updated. By periodically retraining the model with both the original training data and new data that reflects the changes in product inventory, the model can adapt to the evolving patterns and preferences of customers, leading to improved performance.


  • 16. 

    A Machine Learning Specialist is building a convolutional neural network (CNN) that will classify 10 types of animals. The Specialist has built a series of layers in a neural network that will take an input image of an animal, pass it through a series of convolutional and pooling layers, and then finally pass it through a dense and fully connected layer with 10 nodes. The Specialist would like to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes. Which function will produce the desired output?

    • Dropout

    • Smooth L1 loss

    • Softmax

    • Rectified linear units (ReLU)

    Correct Answer
    A. Softmax
    Explanation
    The Softmax function is used to produce a probability distribution of the likelihood that the input image belongs to each of the 10 classes. It converts the output of the neural network into a probability distribution by normalizing the values and ensuring they sum up to 1. This is suitable for multi-class classification tasks, such as classifying different types of animals in this case. Dropout is a regularization technique used to prevent overfitting, Smooth L1 loss is a loss function used in object detection, and Rectified linear units (ReLU) is an activation function commonly used in neural networks.
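    A small NumPy sketch of the softmax computation over 10 illustrative class scores:

        import numpy as np

        def softmax(logits):
            exps = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
            return exps / exps.sum()

        scores = np.array([2.0, 1.0, 0.1, -1.2, 0.5, 0.0, 3.3, -0.4, 1.7, 0.9])  # 10 raw outputs
        probs = softmax(scores)
        print(probs, probs.sum())   # non-negative values that sum to 1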


  • 17. 

    A Machine Learning Specialist is designing a system for improving sales for a company. The objective is to use the large amount of information the company has on users' behavior and product preferences to predict which products users would like based on the users' similarity to other users.   What should the Specialist do to meet this objective?  

    • Build a content-based filtering recommendation engine with Apache Spark ML on Amazon EMR.

    • Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR.

    • Build a model-based filtering recommendation engine with Apache Spark ML on Amazon EMR.

    • Build a combinative filtering recommendation engine with Apache Spark ML on Amazon EMR.

    Correct Answer
    A. Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR.
    Explanation
    To meet the objective of predicting which products users would like based on their similarity to other users, the Machine Learning Specialist should build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR. Collaborative filtering is a technique that analyzes the behavior of multiple users and identifies patterns and similarities between them to make recommendations. This approach is suitable for leveraging the large amount of information on users' behavior and product preferences that the company has. Apache Spark ML is a powerful machine learning library that can handle large-scale data processing, and Amazon EMR provides a scalable and cost-effective platform for running Spark applications.


  • 18. 

    The Chief Editor for a product catalog wants the Research and Development team to build a machine learning system that can be used to detect whether or not individuals in a collection of images are wearing the company's retail brand. The team has a set of training data. Which machine learning algorithm should the researchers use that BEST meets their requirements?

    • Latent Dirichlet Allocation (LDA)

    • Recurrent neural network (RNN)

    • K-means

    • Convolutional neural network (CNN)

    Correct Answer
    A. Convolutional neural network (CNN)
    Explanation
    A Convolutional Neural Network (CNN) is the best machine learning algorithm for the given requirements. CNNs are specifically designed for image recognition tasks and have been proven to be highly effective in detecting patterns and features in images. They are able to automatically learn and extract relevant features from the training data, making them well-suited for identifying whether individuals in a collection of images are wearing the company's retail brand. LDA, RNN, and K-means are not specifically designed for image recognition tasks and may not perform as well in this scenario.


  • 19. 

    Which of the following metrics should a Machine Learning Specialist generally use to compare/evaluate machine learning classification models against each other?

    • Recall

    • Misclassification rate

    • Mean absolute percentage error (MAPE)

    • Area Under the ROC Curve (AUC)

    Correct Answer
    A. Area Under the ROC Curve (AUC)
    Explanation
    The Area Under the ROC Curve (AUC) is a commonly used metric to compare and evaluate machine learning classification models. It measures the ability of the model to distinguish between positive and negative classes by calculating the area under the Receiver Operating Characteristic (ROC) curve. A higher AUC value indicates a better model performance in terms of classification accuracy. Therefore, a Machine Learning Specialist should generally use AUC to compare and evaluate different classification models.
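    A brief scikit-learn sketch comparing two classifiers by AUC on the same held-out data (the toy dataset and models are illustrative):

        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=2000, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
            scores = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
            print(type(model).__name__, roc_auc_score(y_te, scores))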


  • 20. 

    A video streaming company is looking to create a personalized experience for its customers on its platform. The company wants to provide recommended videos to stream based on what other similar users watched previously. To this end, it is collecting its platform’s clickstream data using an ETL pipeline and storing the logs and syslogs in Amazon S3. What kind of algorithm should the company use to create the simplest solution in this situation?

    • Regression

    • Classification

    • Recommender system

    • Reinforcement learning

    Correct Answer
    A. Recommender system
    Explanation
    The company should use a recommender system algorithm to create a personalized experience for its customers. A recommender system analyzes clickstream data and user behavior to provide recommendations based on what other similar users watched previously. This algorithm would be the simplest solution for the company to implement in order to provide recommended videos to stream on its platform.


  • 21. 

    An ML scientist has built a decision tree model using scikit-learn with 1,000 trees. The training accuracy for the model was 99.2% and the test accuracy was 70.3%. Should the Scientist use this model in production?

    • Yes, because it is generalizing well on the training set

    • No, because it is generalizing well on the training set

    • No, because it is not generalizing well on the test set

    • Yes, because it is not generalizing well on the test set

    Correct Answer
    A. No, because it is not generalizing well on the test set
    Explanation
    The model's training accuracy of 99.2% indicates that it is performing well on the training data. However, the test accuracy of only 70.3% suggests that the model is not generalizing well to new, unseen data. Therefore, it is not advisable to use this model in production as it may not perform well on real-world data.


  • 22. 

    A Machine Learning Specialist is building a prediction model for a large number of features using linear models, such as linear regression and logistic regression. During exploratory data analysis the Specialist observes that many features are highly correlated with each other. This may make the model unstable. What should be done to reduce the impact of having such a large number of features?

    • Perform one-hot encoding on highly correlated features

    • Use matrix multiplication on highly correlated features.

    • Create a new feature space using principal component analysis (PCA)

    • Apply the Pearson correlation coefficient

    Correct Answer
    A. Create a new feature space using principal component analysis (PCA)
    Explanation
    When many features in a prediction model are highly correlated, it can lead to instability in the model. To reduce the impact of this issue, the Machine Learning Specialist should create a new feature space using principal component analysis (PCA). PCA is a dimensionality reduction technique that transforms the original features into a new set of uncorrelated variables called principal components. By creating a new feature space using PCA, the Specialist can eliminate the problem of high correlation among features and improve the stability of the prediction model.
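    A minimal scikit-learn sketch (synthetic data in place of the real feature set); keeping enough components to explain 95% of the variance yields a smaller set of uncorrelated features:

        from sklearn.datasets import make_classification
        from sklearn.decomposition import PCA
        from sklearn.preprocessing import StandardScaler

        X, _ = make_classification(n_samples=500, n_features=50, n_informative=10, random_state=0)
        X_scaled = StandardScaler().fit_transform(X)     # PCA is sensitive to feature scale
        X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)
        print(X_reduced.shape)                           # fewer, uncorrelated components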


  • 23. 

    A Machine Learning Specialist receives customer data for an online shopping website. The data includes demographics, past visits, and locality information. The Specialist must develop a machine learning approach to identify the customer shopping patterns, preferences and trends to enhance the website for better service and smart recommendations.   Which solution should the Specialist recommend?

    • Latent Dirichlet Allocation (LDA) for the given collection of discrete data to identify patterns in the customer database.

    • A neural network with a minimum of three layers and random initial weights to identify patterns in the customer database

    • Collaborative filtering based on user interactions and correlations to identify patterns in the customer database

    • Random Cut Forest (RCF) over random subsamples to identify patterns in the customer database

    Correct Answer
    A. Collaborative filtering based on user interactions and correlations to identify patterns in the customer database
    Explanation
    The Specialist should recommend collaborative filtering based on user interactions and correlations to identify patterns in the customer database. This approach is suitable for analyzing customer behavior and preferences based on their past interactions with the website. By analyzing the correlations between users and their interactions, the Specialist can identify patterns and make personalized recommendations to enhance the website's service.


  • 24. 

    A Machine Learning Specialist is building a model to predict future employment rates based on a wide range of economic factors. While exploring the data, the Specialist notices that the magnitude of the input features vary greatly. The Specialist does not want variables with a larger magnitude to dominate the model.   What should the Specialist do to prepare the data for model training?  

    • Apply quantile binning to group the data into categorical bins to keep any relationships in the data by replacing the magnitude with distribution.

    • Apply the Cartesian product transformation to create new combinations of fields that are independent of the magnitude.

    • Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant magnitude.

    • Apply the orthogonal sparse bigram (OSB) transformation to apply a fixed-size sliding window to generate new features of a similar magnitude.

    Correct Answer
    A. Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant magnitude.
    Explanation
    The Specialist should apply normalization to ensure each field has a mean of 0 and a variance of 1. This will standardize the magnitude of the input features and prevent variables with larger magnitudes from dominating the model. Normalization is a common preprocessing step in machine learning to scale the data and make it easier for the model to learn patterns and make accurate predictions.
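    A short scikit-learn sketch (toy numbers standing in for economic indicators on very different scales):

        import numpy as np
        from sklearn.preprocessing import StandardScaler

        X = np.array([[50_000, 3.2], [72_000, 4.1], [41_000, 2.8]])
        X_std = StandardScaler().fit_transform(X)
        print(X_std.mean(axis=0), X_std.std(axis=0))   # each column now has mean ~0 and unit variance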


  • 25. 

    A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided. Based on this information which model would have the HIGHEST accuracy?

    • Long short-term memory (LSTM) model with scaled exponential linear unit (SELU)

    • Logistic regression

    • Support vector machine (SVM) with non-linear kernel

    • Single perceptron with tanh activation function

    Correct Answer
    A. Support vector machine (SVM) with non-linear kernel
    Explanation
    Support vector machines (SVM) with a non-linear kernel are known for their ability to handle complex decision boundaries and capture non-linear relationships in the data. In this case, the data is not linearly separable, as there is overlap between the fraudulent and normal classes in both features. Therefore, using a non-linear kernel in the SVM model would likely result in the highest accuracy by effectively capturing the patterns and relationships in the data.


  • 26. 

    You work for the information security department of a major corporation. You have been asked to build a solution that detects web application log anomalies to protect your organization from fraudulent activity. The system needs to have near-real-time updates to the model where log entry data points dynamically change the underlying model as the log files are updated. Which AWS service component do you use to implement the best algorithm based on these requirements?

    • SageMaker Random Cut Forest

    • Kinesis Data Streams Naive Bayes Classifier

    • Kinesis Data Analytics Random Cut Forest

    • Kinesis Data Analytics Nearest Neighbor

    Correct Answer
    A. Kinesis Data Analytics Random Cut Forest
    Explanation
    The best algorithm to implement the given requirements is the Random Cut Forest algorithm provided by Kinesis Data Analytics. This algorithm is specifically designed for anomaly detection in streaming data, making it suitable for detecting web application log anomalies in near-real-time. It dynamically updates the underlying model as log files are updated, ensuring that the system can adapt to changing data patterns. Therefore, Kinesis Data Analytics Random Cut Forest is the most appropriate AWS service component for this scenario.


  • 27. 

    A Machine Learning Specialist is building a model that will perform time series forecasting using Amazon SageMaker. The Specialist has finished training the model and is now planning to perform load testing on the endpoint so they can configure Auto Scaling for the model variant. Which approach will allow the Specialist to review the latency, memory utilization, and CPU utilization during the load test?

    • Review SageMaker logs that have been written to Amazon S3 by leveraging Amazon Athena and Amazon QuickSight to visualize logs as they are being produced.

    • Generate an Amazon CloudWatch dashboard to create a single view for the latency, memory utilization, and CPU utilization metrics that are outputted by Amazon SageMaker.

    • Build custom Amazon CloudWatch Logs and then leverage Amazon ES and Kibana to query and visualize the data as it is generated by Amazon SageMaker.

    • Send Amazon CloudWatch Logs that were generated by Amazon SageMaker to Amazon ES and use Kibana to query and visualize the log data.

    Correct Answer
    A. Generate an Amazon CloudWatch dashboard to create a single view for the latency, memory utilization, and CPU utilization metrics that are outputted by Amazon SageMaker.
    Explanation
    Generating an Amazon CloudWatch dashboard will allow the Specialist to create a single view for the latency, memory utilization, and CPU utilization metrics that are outputted by Amazon SageMaker. This will provide a convenient and comprehensive way to review and monitor these metrics during the load test.


  • 28. 

    A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. As Data Scientists may create an arbitrary number of new datasets every day the solution has to scale automatically and be cost-effective. Also, it must be possible to explore the data using SQL. Which storage scheme is MOST adapted to this scenario?

    • Store datasets as files in Amazon S3.

    • Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance.

    • Store datasets as tables in a multi-node Amazon Redshift cluster.

    • Store datasets as global tables in Amazon DynamoDB.

    Correct Answer
    A. Store datasets as files in Amazon S3.
    Explanation
    Storing datasets as files in Amazon S3 is the most adapted storage scheme for this scenario because it allows for scalability and cost-effectiveness. With S3, the Data Science team can easily store and retrieve large amounts of training data without worrying about capacity limitations. Additionally, S3 supports SQL-based querying using services like Amazon Athena, allowing for easy exploration of the data using SQL. This solution also aligns with the requirement of being able to create an arbitrary number of new datasets every day, as S3 can handle the storage of a large number of files.


  • 29. 

    An online reseller has a large, multi-column dataset with one column missing 30% of its data. A Machine Learning Specialist believes that certain columns in the dataset could be used to reconstruct the missing data. Which reconstruction approach should the Specialist use to preserve the integrity of the dataset?

    • Listwise deletion

    • Last observation carried forward

    • Multiple imputations

    • Mean substitution

    Correct Answer
    A. Multiple imputations
    Explanation
    The Specialist should use the multiple imputations approach to reconstruct the missing data. This approach involves creating multiple plausible values for the missing data based on the observed data and then analyzing the dataset multiple times with different imputed values. It helps to preserve the integrity of the dataset by accounting for the uncertainty associated with the missing data and providing more accurate estimates.


  • 30. 

    A web-based company wants to improve its conversion rate on its landing page. Using a large historical dataset of customer visits, the company has repeatedly trained a multi-class deep learning network algorithm on Amazon SageMaker. However, there is an overfitting problem: training data shows 90% accuracy in predictions, while test data shows only 70% accuracy. The company needs to boost the generalization of its model before deploying it into production to maximize conversions of visits to purchases. Which action is recommended to provide the HIGHEST accuracy model for the company's test and validation data?

    • Increase the randomization of training data in the mini-batches used in training.

    • Allocate a higher proportion of the overall data to the training dataset

    • Apply L1 or L2 regularization and dropouts to the training.

    • Reduce the number of layers and units (or neurons) from the deep learning network.

    Correct Answer
    A. Apply L1 or L2 regularization and dropouts to the training.
    Explanation
    Applying L1 or L2 regularization and dropouts to the training is recommended to provide the highest accuracy model for the company's test and validation data. Regularization techniques such as L1 or L2 help to prevent overfitting by adding a penalty term to the loss function, which discourages the model from relying too heavily on any single feature. Dropouts, on the other hand, randomly deactivate a certain proportion of neurons during training, forcing the model to learn more robust and generalizable representations. These techniques can help improve the model's ability to generalize and perform well on unseen data, thus boosting the accuracy on the test and validation datasets.
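    A minimal Keras sketch showing L2 weight penalties and dropout added to a multi-class network; the layer sizes, penalty strength, and dropout rate are illustrative:

        from tensorflow import keras
        from tensorflow.keras import layers, regularizers

        model = keras.Sequential([
            keras.Input(shape=(100,)),
            layers.Dense(256, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
            layers.Dropout(0.5),
            layers.Dense(128, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
            layers.Dropout(0.5),
            layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])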


  • 31. 

    A Machine Learning Specialist needs to create a data repository to hold a large amount of time-based training data for a new model. In the source system, new files are added every hour. Throughout a single 24-hour period, the volume of hourly updates will change significantly. The Specialist always wants to train on the last 24 hours of the data. Which type of data repository is the MOST cost-effective solution?

    • An Amazon EBS-backed Amazon EC2 instance with hourly directories

    • An Amazon RDS database with hourly table partitions

    • An Amazon S3 data lake with hourly object prefixes

    • An Amazon EMR cluster with hourly hive partitions on Amazon EBS volumes

    Correct Answer
    A. An Amazon S3 data lake with hourly object prefixes
    Explanation
    An Amazon S3 data lake with hourly object prefixes is the most cost-effective solution because it allows for storing a large amount of time-based training data and easily adding new files every hour. With S3, the volume of updates can change significantly throughout a 24-hour period without incurring additional costs. Additionally, using object prefixes allows for efficient organization and retrieval of the data based on the time intervals required for training.


  • 32. 

    Amazon Connect has recently been rolled out across a company as a contact call center. The solution has been configured to store voice call recordings on Amazon S3. The content of the voice calls is being analyzed for the incidents being discussed by the call operators. Amazon Transcribe is being used to convert the audio to text, and the output is stored on Amazon S3. Which approach will provide the information required for further analysis?

    • Use Amazon Comprehend with the transcribed files to build the key topics.

    • Use Amazon Translate with the transcribed files to train and build a model for the key topics.

    • Use the AWS Deep Learning AMI with Gluon Semantic Segmentation on the transcribed files to train and build a model for the key topics.

    • Use the Amazon SageMaker k-Nearest-Neighbors (kNN) algorithm on the transcribed files to generate a word embeddings dictionary for the key topics.

    Correct Answer
    A. Use Amazon Comprehend with the transcribed files to build the key topics.
    Explanation
    Amazon Comprehend is a natural language processing (NLP) service that can be used to analyze text and extract key topics. In this scenario, the voice calls are being converted to text using Amazon Transcribe, and the transcribed files are stored on Amazon S3. By using Amazon Comprehend with the transcribed files, the company can extract the key topics discussed in the calls, providing the required information for further analysis. This approach leverages the capabilities of Amazon Comprehend to analyze the text data and identify important topics, making it the appropriate choice for this situation.


  • 33. 

    The displayed graph is from a forecasting model for testing a time series. Considering the graph only, which conclusion should a Machine Learning Specialist make about the behavior of the model?

    • The model predicts both the trend and the seasonality well.

    • The model predicts the trend well, but not the seasonality.

    • The model predicts the seasonality well, but not the trend.

    • The model does not predict the trend or the seasonality well.

    Correct Answer
    A. The model predicts both the trend and the seasonality well.
    Explanation
    The graph shows that the predicted values closely follow the actual values for both the trend and the seasonality. This indicates that the model is accurately capturing the patterns and fluctuations in the time series data, suggesting that it predicts both the trend and the seasonality well.


  • 34. 

    A Machine Learning Specialist was given a dataset consisting of unlabeled data. The Specialist must create a model that can help the team classify the data into different buckets. What model should be used to complete this work?

    • K-means clustering

    • Random Cut Forest (RCF)

    • XGBoost

    • BlazingText

    Correct Answer
    A. K-means clustering
    Explanation
    K-means clustering should be used to complete this work because it is a popular unsupervised learning algorithm that is used for clustering data. It is suitable for this task because the dataset consists of unlabeled data and the goal is to classify the data into different buckets. K-means clustering works by partitioning the data into k clusters based on their similarity. It iteratively assigns data points to the nearest cluster centroid and updates the centroids until convergence. This algorithm is widely used for data clustering and can help the Machine Learning Specialist in this task.


  • 35. 

    You wish to use a SageMaker notebook within a VPC. SageMaker notebook instances are Internet-enabled, creating a potential security hole in your VPC. How would you use SageMaker within a VPC without opening up Internet access?

    • Uncheck the option for Internet access when creating your notebook instance, and it will handle the rest automatically.

    • No action is required, the VPC will block the notebook instances from accessing the Internet.

    • Use IAM to restrict Internet access from the notebook instance.

    • Disable direct Internet access when specifying the VPC for your notebook instance, and use VPC interface endpoints (PrivateLink) to allow the connections needed to train and host your model. Modify your instance's security group to allow outbound connections for training and hosting.

    Correct Answer
    A. Disable direct Internet access when specifying the VPC for your notebook instance, and use VPC interface endpoints (PrivateLink) to allow the connections needed to train and host your model. Modify your instance's security group to allow outbound connections for training and hosting.
    Explanation
    To use SageMaker within a VPC without opening up Internet access, you need to disable direct Internet access when specifying the VPC for your notebook instance. Additionally, you should use VPC interface endpoints (PrivateLink) to allow the connections needed for training and hosting your model. To ensure outbound connections for training and hosting, you should modify your instance's security group. This approach ensures that the SageMaker notebook instance remains within the VPC and does not create a potential security hole by accessing the Internet.


  • 36. 

    During mini-batch training of a neural network for a classification problem, a Data Scientist notices that training accuracy oscillates. What is the MOST likely cause of this issue?

    • The class distribution in the dataset is imbalanced

    • Dataset shuffling is disabled

    • The batch size is too big

    • The learning rate is very high

    Correct Answer
    A. The learning rate is very high
    Explanation
    During mini-batch training, the learning rate determines the step size at which the model's parameters are updated. If the learning rate is set too high, the model may overshoot the optimal parameters, causing the training accuracy to oscillate. This is because the model is unable to converge to the optimal solution and keeps bouncing back and forth. Therefore, the most likely cause of the issue is that the learning rate is very high.


  • 37. 

    A Machine Learning Specialist is working with a large cyber security company that manages security events in real time for companies around the world. The cybersecurity company wants to design a solution that will allow it to use machine learning to score malicious events as anomalies on the data as it is being ingested. The company also wants to be able to save the results in its data lake for later processing and analysis. What is the MOST efficient way to accomplish these tasks?

    • Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. Then use Kinesis Data Firehose to stream the results to Amazon S3.

    • Ingest the data into Apache Spark Streaming using Amazon EMR, and use Spark MLlib with k-means to perform anomaly detection. Then store the results in an Apache Hadoop Distributed File System (HDFS) using Amazon EMR with a replication factor of three as the data lake.

    • Ingest the data and store it in Amazon S3. Use AWS Batch along with the AWS Deep Learning AMIs to train a k-means model using TensorFlow on the data in Amazon S3.

    • Ingest the data and store it in Amazon S3. Have an AWS Glue job that is triggered on demand transform the new data. Then use the built-in Random Cut Forest (RCF) model within Amazon SageMaker to detect anomalies in the data.

    Correct Answer
    A. Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. Then use Kinesis Data Firehose to stream the results to Amazon S3.
    Explanation
    The most efficient way to accomplish the tasks of ingesting and analyzing the data in real-time is by using Amazon Kinesis Data Firehose to ingest the data and Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. After detecting the anomalies, the results can be streamed to Amazon S3 using Kinesis Data Firehose. This approach allows for real-time analysis and storage of the results in a scalable and efficient manner.


  • 38. 

    An ad tech company is using an XGBoost model to classify its clickstream data. The company’s Data Scientist is asked to explain how the model works to a group of non-technical colleagues. What is a simple explanation the Data Scientist can provide?

    • XGBoost is an Extreme Gradient Boosting algorithm that is optimized for boosted decision trees

    • XGBoost is a state-of-the-art algorithm that uses logistic regression to split each feature of the data based on certain conditions

    • XGBoost is a robust, flexible, scalable algorithm that uses logistic regression to classify data into buckets

    • XGBoost is an efficient and scalable neural network architecture.

    Correct Answer
    A. XGBoost is an Extreme Gradient Boosting algorithm that is optimized for boosted decision trees
    Explanation
    XGBoost is an Extreme Gradient Boosting algorithm that is optimized for boosted decision trees. It is a powerful and popular algorithm used for classification tasks. It works by iteratively combining multiple weak decision trees to create a strong predictive model. This algorithm is specifically designed to handle large datasets and provides high accuracy and efficiency.
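    For illustration, a small sketch with the open-source xgboost package on synthetic data standing in for the clickstream features:

        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from xgboost import XGBClassifier

        X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, eval_metric="logloss")
        clf.fit(X_tr, y_tr)                       # fits an ensemble of boosted decision trees
        print("test accuracy:", clf.score(X_te, y_te))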


  • 39. 

    A Machine Learning Engineer is creating a regression model for forecasting company revenue based on an internal dataset made up of past sales and other related data. What metric should the Engineer use to evaluate the ML model?

    • Cross-entropy log loss

    • Sigmoid

    • Root Mean squared error (RMSE)

    • Precision

    Correct Answer
    A. Root Mean squared error (RMSE)
    Explanation
    The Machine Learning Engineer should use Root Mean Squared Error (RMSE) to evaluate the regression model. RMSE is a commonly used metric for regression models as it measures the average difference between the predicted and actual values. It provides a single value that represents the overall accuracy of the model's predictions. The lower the RMSE, the better the model's performance in accurately forecasting the company's revenue based on the internal dataset.
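    A tiny NumPy sketch with made-up revenue figures, showing that RMSE is reported in the same units as the target:

        import numpy as np

        actual = np.array([120_000, 98_500, 143_200, 110_750])
        predicted = np.array([118_300, 101_200, 139_900, 115_000])
        rmse = np.sqrt(np.mean((actual - predicted) ** 2))
        print(f"RMSE: {rmse:,.0f}")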


  • 40. 

    A healthcare company using the AWS Cloud has access to a variety of data types, including raw and preprocessed data. The company wants to start using this data for its ML pipeline, but also wants to make sure the data is highly available and located in a centralized repository. What approach should the company take to achieve the desired outcome?

    • Create a data lake using Amazon S3 as the data storage layer

    • Store unstructured data in Amazon DynamoDB and structured data in Amazon RDS

    • Use Amazon FSx to host the data for training

    • Use Amazon Elastic Block Store (Amazon EBS) volumes to store the data with data backup

    Correct Answer
    A. Create a data lake using Amazon S3 as the data storage layer
    Explanation
    The company should create a data lake using Amazon S3 as the data storage layer. This approach allows the company to store both raw and preprocessed data in a centralized repository. Amazon S3 provides highly available and scalable storage for various data types, making it an ideal choice for the company's ML pipeline.


  • 41. 

    A Machine Learning Specialist has completed a proof of concept for a company using a small data sample and now the Specialist is ready to implement an end-to-end solution in AWS using Amazon SageMaker. The historical training data is stored in Amazon RDS. Which approach should the Specialist use for training a model using that data?

    • Write a direct connection to the SQL database within the notebook and pull data in.

    • Push the data from Microsoft SQL Server to Amazon S3 using an AWS Data Pipeline and provide the S3 location within the notebook.

    • Move the data to Amazon DynamoDB and set up a connection to DynamoDB within the notebook to pull data in.

    • Move the data to Amazon ElastiCache using AWS DMS and set up a connection within the notebook to pull data in for fast access.

    Correct Answer
    A. Push the data from Microsoft SQL Server to Amazon S3 using an AWS Data Pipeline and provide the S3 location within the notebook.
    Explanation
    The Specialist should push the data from Microsoft SQL Server to Amazon S3 using an AWS Data Pipeline and provide the S3 location within the notebook. This approach allows for efficient and scalable storage of the historical training data in Amazon S3, which can then be easily accessed and used for training the model in Amazon SageMaker. It also ensures that the data is securely stored and can be easily shared and accessed by other services or users within the AWS environment.


  • 42. 

    A manufacturing company has a large set of labeled historical sales data. The manufacturer would like to predict how many units of a particular part should be produced each quarter. Which machine learning approach should be used to solve this problem?

    • Logistic regression

    • Random Cut Forest (RCF)

    • Principal component analysis (PCA)

    • Linear regression

    Correct Answer
    A. Linear regression
    Explanation
    Linear regression is the appropriate machine learning approach to solve the problem of predicting the number of units of a particular part that should be produced each quarter. Linear regression is used for predicting a continuous numerical value, which aligns with the problem of predicting the quantity of units to be produced. Logistic regression, Random Cut Forest (RCF), and Principal Component Analysis (PCA) are not suitable in this case because they are used for different types of problems such as classification, anomaly detection, and dimensionality reduction, respectively.


  • 43. 

    A Machine Learning Specialist is building a logistic regression model that will predict whether or not a person will order a pizza. The Specialist is trying to build the optimal model with an ideal classification threshold. What model evaluation technique should the Specialist use to understand how different classification thresholds will impact the model's performance?

    • Receiver operating characteristic (ROC) curve

    • Misclassification rate

    • Root Mean Square Error (RMSE)

    • L1 norm

    Correct Answer
    A. Receiver operating characteristic (ROC) curve
    Explanation
    The Specialist should use the Receiver Operating Characteristic (ROC) curve to understand how different classification thresholds will impact the model's performance. The ROC curve is a graphical representation that shows the trade-off between the true positive rate and the false positive rate at various classification thresholds. By analyzing the curve, the Specialist can determine the optimal threshold that balances the true positive rate and the false positive rate, and therefore maximize the model's performance.


  • 44. 

    A log analytics company wants to provide a history of Amazon SageMaker API calls made on its client’s account for security analysis and operational troubleshooting purposes. What must be done in the client’s account to ensure that the company can analyze the API calls?

    • Use IAM roles. “logs:*” are added to those IAM roles.

    • Enable AWS CloudTrail.

    • Enable CloudWatch logs.

    • Use the Amazon SageMaker SDK to call the ‘sagemaker_history()’ function.

    Correct Answer
    A. Enable AWS CloudTrail.
    Explanation
    Enabling AWS CloudTrail is necessary to ensure that the log analytics company can analyze the API calls made on the client's account. AWS CloudTrail provides a comprehensive history of API calls made within an AWS account, including Amazon SageMaker API calls. This allows for security analysis and operational troubleshooting purposes, as requested by the log analytics company. IAM roles and permissions, as well as CloudWatch logs, may be relevant for other purposes but are not specifically mentioned in relation to analyzing API calls. Using the Amazon SageMaker SDK and calling the 'sagemaker_history()' function is not mentioned as a requirement for this specific task.


  • 45. 

    A Data Scientist created a correlation matrix between nine variables and the target variable. The correlation coefficient between two of the numerical variables, variable 1 and variable 5, is -0.95. How should the Data Scientist interpret the correlation coefficient?

    • As variable 1 increases, variable 5 increases

    • As variable 1 increases, variable 5 decreases

    • Variable 1 does not have any influence on variable 5

    • The data is not sufficient to make a well-informed interpretation

    Correct Answer
    A. As variable 1 increases, variable 5 decreases
    Explanation
    The correlation coefficient of -0.95 indicates a strong negative correlation between variable 1 and variable 5. This means that as variable 1 increases, variable 5 tends to decrease. The closer the correlation coefficient is to -1, the stronger the negative correlation. Therefore, the Data Scientist can interpret that there is a strong inverse relationship between variable 1 and variable 5.


  • 46. 

    A real estate company wants to provide its customers with a more accurate prediction of the final sale price for houses they are considering in various cities. To do this, the company wants to use a fully connected neural network trained on data from the previous ten years of home sales, as well as other features. What kind of machine learning problem does this situation represent?

    • Regression

    • Classification

    • Recommender system

    • Reinforcement learning

    Correct Answer
    A. Regression
    Explanation
    This situation represents a regression problem. Regression is a type of machine learning problem where the goal is to predict a continuous numerical value. In this case, the real estate company wants to predict the final sale price of houses, which is a continuous variable. By using a fully connected neural network trained on previous home sales data, the company can make more accurate predictions for their customers.


  • 47. 

    A Data Engineer needs to build a model using a dataset containing customer credit card information.   How can the Data Engineer ensure the data remains encrypted and the credit card information is secure?

    • Use a custom encryption algorithm to encrypt the data and store the data on an Amazon SageMaker instance in a VPC. Use the SageMaker DeepAR algorithm to randomize the credit card numbers.

    • Use an IAM policy to encrypt the data on the Amazon S3 bucket and Amazon Kinesis to automatically discard credit card numbers and insert fake credit card numbers.

    • Use an Amazon SageMaker launch configuration to encrypt the data once it is copied to the SageMaker instance in a VPC. Use the SageMaker principal component analysis (PCA) algorithm to reduce the length of the credit card numbers.

    • Use AWS KMS to encrypt the data on Amazon S3 and Amazon SageMaker, and redact the credit card numbers from the customer data with AWS Glue.

    Correct Answer
    A. Use AWS KMS to encrypt the data on Amazon S3 and Amazon SageMaker, and redact the credit card numbers from the customer data with AWS Glue.
    Explanation
    The correct answer is to use AWS KMS to encrypt the data on Amazon S3 and Amazon SageMaker, and redact the credit card numbers from the customer data with AWS Glue. This solution ensures that the data is encrypted both at rest and in transit, using AWS KMS for encryption. Additionally, AWS Glue can be used to remove or redact the credit card numbers from the dataset, further enhancing the security of the data.


  • 48. 

    A Machine Learning Specialist prepared the following graph displaying the results of k-means for k = [1:10]. Considering the graph, what is a reasonable selection for the optimal choice of k?

    • 1

    • 4

    • 7

    • 10

    Correct Answer
    A. 4
    Explanation
    Based on the graph, we can observe that the decrease in the sum of squared errors (SSE) is significant until k=4, after which the decrease becomes less pronounced. This suggests that k=4 is a reasonable selection for the optimal choice of k, as it provides a good balance between minimizing the SSE and avoiding overfitting.
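    A brief scikit-learn sketch of this "elbow" analysis on synthetic data (the toy blobs below have four true clusters, so the curve flattens around k = 4):

        import matplotlib.pyplot as plt
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs

        X, _ = make_blobs(n_samples=600, centers=4, random_state=0)
        inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                    for k in range(1, 11)]

        plt.plot(range(1, 11), inertias, marker="o")
        plt.xlabel("k")
        plt.ylabel("Within-cluster sum of squared distances (inertia)")
        plt.show()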


  • 49. 

    A manufacturing company asks its Machine Learning Specialist to develop a model that classifies defective parts into one of eight defect types. The company has provided roughly 100,000 images per defect type for training. During the initial training of the image classification model, the Specialist notices that the validation accuracy is 80%, while the training accuracy is 90%. It is known that human-level performance for this type of image classification is around 90%. What should the Specialist consider to fix this issue?

    • A longer training time

    • Making the network larger

    • Using a different optimizer

    • Using some form of regularization

    Correct Answer
    A. Using some form of regularization
    Explanation
    The Specialist should consider using some form of regularization to fix the issue. Regularization techniques such as L1 or L2 regularization can help prevent overfitting, which is indicated by the higher training accuracy compared to the validation accuracy. By adding regularization, the model's complexity is reduced, preventing it from memorizing the training data and improving its generalization ability. This can help bring the validation accuracy closer to the human-level performance of 90%.


Quiz Review Timeline (Updated): Jan 8, 2025


  • Current Version
  • Jan 08, 2025
    Quiz Edited by
    ProProfs Editorial Team
  • Oct 02, 2020
    Quiz Created by
    Olatunde.olanayo