AWS Machine Learning - Specialty (MLS-C01 – 2020)



The AWS Certified Machine Learning - Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate's ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.


Questions and Answers
  • 1. 
    A financial services company is building a robust serverless data lake on Amazon S3. The data lake should be flexible and meet the following requirements:
    • Support querying old and new data on Amazon S3 through Amazon Athena and Amazon Redshift Spectrum.
    • Support event-driven ETL pipelines.
    • Provide a quick and easy way to understand metadata.
    Which approach meets these requirements? (An illustrative sketch follows the options.)
    • A. 

      Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Glue ETL job, and an AWS Glue Data Catalog to search and discover metadata.

    • B. 

      Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Batch job, and an external Apache Hive metastore to search and discover metadata.

    • C. 

      Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Batch job, and an AWS Glue Data Catalog to search and discover metadata.

    • D. 

      Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Glue ETL job, and an external Apache Hive metastore to search and discover metadata.
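
    As an illustration of the event-driven pattern named in these options, below is a minimal, hypothetical AWS Lambda handler that starts an AWS Glue ETL job when a new object lands in the data lake bucket; the job name and argument keys are placeholders, not part of the question.

    ```python
    import boto3

    glue = boto3.client("glue")

    def handler(event, context):
        # Triggered by an S3 ObjectCreated event notification
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # start_job_run kicks off a Glue ETL job; "etl-job" is a placeholder
            glue.start_job_run(
                JobName="etl-job",
                Arguments={"--source_bucket": bucket, "--source_key": key},
            )
    ```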

  • 2. 
    A Machine Learning Specialist is building a prediction model for a large number of features using linear models, such as linear regression and logistic regression. During exploratory data analysis, the Specialist observes that many features are highly correlated with each other, which may make the model unstable. What should be done to reduce the impact of having such a large number of features? (A minimal sketch follows the options.)
    • A. 

      Perform one-hot encoding on highly correlated features

    • B. 

      Use matrix multiplication on highly correlated features.

    • C. 

      Create a new feature space using principal component analysis (PCA)

    • D. 

      Apply the Pearson correlation coefficient
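
    For reference, a minimal scikit-learn sketch of creating a new feature space with PCA; the toy data and the choice of two components are illustrative only.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    x = rng.normal(size=(500, 1))
    # Toy design matrix: the second column is nearly a linear copy of the first
    X = np.hstack([x, 2 * x + rng.normal(scale=0.05, size=(500, 1)),
                   rng.normal(size=(500, 1))])

    pca = PCA(n_components=2)      # keep enough components to cover the variance
    X_new = pca.fit_transform(X)   # uncorrelated principal-component features
    print(pca.explained_variance_ratio_)
    ```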

  • 3. 
    An employee found a video clip with audio on a company's social media feed. The language used in the video is Spanish. English is the employee's first language, and they do not understand Spanish. The employee wants to perform sentiment analysis on the clip. Which combination of services is the MOST efficient way to accomplish the task? (A sketch of such a pipeline follows the options.)
    • A. 

      Amazon Transcribe, Amazon Translate, and Amazon Comprehend

    • B. 

      Amazon Transcribe, Amazon Comprehend, and Amazon SageMaker seq2seq

    • C. 

      Amazon Transcribe, Amazon Translate, and Amazon SageMaker Neural Topic Model (NTM)

    • D. 

      Amazon Transcribe, Amazon Translate, and Amazon SageMaker BlazingText
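
    A hedged boto3 sketch of chaining the three managed services; the bucket, file, and job names are placeholders, and fetching the finished transcript from S3 is elided for brevity.

    ```python
    import boto3

    transcribe = boto3.client("transcribe")
    translate = boto3.client("translate")
    comprehend = boto3.client("comprehend")

    # 1. Speech to Spanish text (asynchronous job; placeholder name and URI)
    transcribe.start_transcription_job(
        TranscriptionJobName="clip-1",
        Media={"MediaFileUri": "s3://bucket/clip.mp4"},
        MediaFormat="mp4",
        LanguageCode="es-ES",
    )

    spanish_text = "..."  # retrieved from the completed transcription job

    # 2. Spanish text to English text
    english = translate.translate_text(
        Text=spanish_text, SourceLanguageCode="es", TargetLanguageCode="en"
    )["TranslatedText"]

    # 3. Sentiment of the English text
    print(comprehend.detect_sentiment(Text=english, LanguageCode="en")["Sentiment"])
    ```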

  • 4. 
    A Machine Learning Specialist deployed a model that provides product recommendations on a company's website. Initially, the model was performing very well and resulted in customers buying more products on average. However, within the past few months, the Specialist has noticed that the effect of product recommendations has diminished and customers are starting to return to their original habits of spending less. The Specialist is unsure of what happened, as the model has not changed from its initial deployment over a year ago. Which method should the Specialist try to improve model performance?
    • A. 

      The model needs to be completely re-engineered because it is unable to handle product inventory changes

    • B. 

      The model's hyperparameters should be periodically updated to prevent drift

    • C. 

      The model should be periodically retrained from scratch using the original data while adding a regularization term to handle product inventory changes

    • D. 

      The model should be periodically retrained using the original training data plus new data as product inventory changes

  • 5. 
    A Machine Learning Specialist is building a convolutional neural network (CNN) that will classify 10 types of animals. The Specialist has built a series of layers in a neural network that will take an input image of an animal, pass it through a series of convolutional and pooling layers, and then finally pass it through a dense and fully connected layer with 10 nodes. The Specialist would like to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes. Which function will produce the desired output? (A worked sketch follows the options.)
    • A. 

      Dropout

    • B. 

      Smooth L1 loss

    • C. 

      Softmax

    • D. 

      Rectified linear units (ReLU)
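
    The softmax function, softmax(z)_i = exp(z_i) / sum_j exp(z_j), turns the 10 raw scores of the final layer into a distribution that sums to 1. A minimal NumPy sketch:

    ```python
    import numpy as np

    def softmax(logits):
        z = logits - logits.max()   # shift by the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    logits = np.random.randn(10)    # one raw score per animal class
    probs = softmax(logits)
    print(probs.sum())              # 1.0: a valid probability distribution
    ```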

  • 6. 
    An online reseller has a large, multi-column dataset with one column missing 30% of its data. A Machine Learning Specialist believes that certain columns in the dataset could be used to reconstruct the missing data. Which reconstruction approach should the Specialist use to preserve the integrity of the dataset? (A short sketch follows the options.)
    • A. 

      Listwise deletion

    • B. 

      Last observation carried forward

    • C. 

      Multiple imputations

    • D. 

      Mean substitution
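
    One way to approximate multiple imputation in scikit-learn is to run IterativeImputer several times with posterior sampling; the tiny matrix below is made up for illustration.

    ```python
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

    # sample_posterior=True draws imputed values instead of point estimates,
    # so different seeds yield multiple plausible completed datasets
    imputations = [
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
        for seed in range(5)
    ]
    ```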

  • 7. 
    A Machine Learning Specialist receives customer data for an online shopping website. The data includes demographics, past visits, and locality information. The Specialist must develop a machine learning approach to identify the customer shopping patterns, preferences, and trends to enhance the website for better service and smart recommendations. Which solution should the Specialist recommend?
    • A. 

      Latent Dirichlet Allocation (LDA) for the given collection of discrete data to identify patterns in the customer database.

    • B. 

      A neural network with a minimum of three layers and random initial weights to identify patterns in the customer database

    • C. 

      Collaborative filtering based on user interactions and correlations to identify patterns in the customer database

    • D. 

      Random Cut Forest (RCF) over random subsamples to identify patterns in the customer database

  • 8. 
    A retail chain has been ingesting purchasing records from its network of 20,000 stores to Amazon S3 using Amazon Kinesis Data Firehose. To support training an improved machine learning model, training records will require new but simple transformations, and some attributes will be combined. The model needs to be retrained daily. Given the large number of stores and the legacy data ingestion, which change will require the LEAST amount of development effort?
    • A. 

      Require the stores to switch to capturing their data locally on AWS Storage Gateway for loading into Amazon S3, then use AWS Glue to do the transformation.

    • B. 

      Deploy an Amazon EMR cluster running Apache Spark with the transformation logic, and have the cluster run each day on the accumulating records in Amazon S3, outputting new/transformed records to Amazon S3

    • C. 

      Spin up a fleet of Amazon EC2 instances with the transformation logic, have them transform the data records accumulating on Amazon S3, and output the transformed records to Amazon S3.

    • D. 

      Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL.

  • 9. 
    A web-based company wants to improve its conversion rate on its landing page. Using a large historical dataset of customer visits, the company has repeatedly trained a multi-class deep learning network algorithm on Amazon SageMaker. However, there is an overfitting problem: training data shows 90% accuracy in predictions, while test data shows only 70% accuracy. The company needs to boost the generalization of its model before deploying it into production to maximize conversions of visits to purchases. Which action is recommended to provide the HIGHEST accuracy model for the company's test and validation data? (A sketch of one such technique follows the options.)
    • A. 

      Increase the randomization of training data in the mini-batches used in training.

    • B. 

      Allocate a higher proportion of the overall data to the training dataset

    • C. 

      Apply L1 or L2 regularization and dropout to the training.

    • D. 

      Reduce the number of layers and units (or neurons) from the deep learning network.
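
    A minimal Keras sketch of applying L2 regularization and dropout; the layer widths, the 20-feature input, and the penalty strength are placeholders.

    ```python
    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    model = keras.Sequential([
        layers.Input(shape=(20,)),                               # placeholder width
        layers.Dense(128, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4)),  # L2 weight penalty
        layers.Dropout(0.5),                                     # randomly drop units
        layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dropout(0.5),
        layers.Dense(3, activation="softmax"),                   # placeholder classes
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    ```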

  • 10. 
    A Machine Learning Specialist needs to create a data repository to hold a large amount of time-based training data for a new model. In the source system, new files are added every hour. Throughout a single 24-hour period, the volume of hourly updates will change significantly. The Specialist always wants to train on the last 24 hours of the data. Which type of data repository is the MOST cost-effective solution?
    • A. 

      An Amazon EBS-backed Amazon EC2 instance with hourly directories

    • B. 

      An Amazon RDS database with hourly table partitions

    • C. 

      An Amazon S3 data lake with hourly object prefixes

    • D. 

      An Amazon EMR cluster with hourly hive partitions on Amazon EBS volumes

  • 11. 
    A Machine Learning Specialist is creating a new natural language processing application that processes a dataset comprised of 1 million sentences. The aim is to then run Word2Vec to generate embeddings of the sentences and enable different types of predictions. Here is an example from the dataset: "The quck BROWN FOX jumps over the lazy dog." Which of the following operations should the Specialist perform to correctly sanitize and prepare the data in a repeatable manner? (Select THREE. A sketch follows the options.)
    • A. 

      Perform part-of-speech tagging and keep the action verb and the nouns only

    • B. 

      Normalize all words by making the sentence lowercase

    • C. 

      Remove stop words using an English stopword dictionary.

    • D. 

      Correct the typography on "quck" to "quick."

    • E. 

      One-hot encode all words in the sentence

    • F. 

      Tokenize the sentence into words.
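
    A minimal, repeatable cleanup covering lowercasing, stop-word removal, and tokenization on the sample sentence; the stop-word set here is a tiny illustrative subset, not a full English dictionary.

    ```python
    STOPWORDS = {"the", "over", "a", "an", "of"}  # illustrative subset only

    def preprocess(sentence):
        tokens = sentence.lower().split()         # normalize case, then tokenize
        return [t for t in tokens if t not in STOPWORDS]

    print(preprocess("The quck BROWN FOX jumps over the lazy dog"))
    # ['quck', 'brown', 'fox', 'jumps', 'lazy', 'dog']
    ```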

  • 12. 
    For the given confusion matrix, what are the recall and precision of the model? (The defining formulas are sketched after the options.)
    • A. 

      Recall = 0.92, Precision = 0.84

    • B. 

      Recall = 0.84, Precision = 0.8

    • C. 

      Recall = 0.92, Precision = 0.8

    • D. 

      Recall = 0.8, Precision = 0.92
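
    The confusion matrix image is not reproduced in this dump, but the defining formulas are fixed: recall = TP / (TP + FN) and precision = TP / (TP + FP). As a sketch:

    ```python
    def recall(tp, fn):
        return tp / (tp + fn)    # of all actual positives, the share found

    def precision(tp, fp):
        return tp / (tp + fp)    # of all predicted positives, the share correct
    ```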

  • 13. 
    A Machine Learning Specialist is implementing a full Bayesian network on a dataset that describes public transit in New York City. One of the random variables is discrete and represents the number of minutes New Yorkers wait for a bus, given that the buses cycle every 10 minutes with a mean wait of 3 minutes. Which prior probability distribution should the ML Specialist use for this variable?
    • A. 

      Poisson distribution

    • B. 

      Uniform distribution

    • C. 

      Normal distribution

    • D. 

      Binomial distribution

  • 14. 
    A Machine Learning Specialist is using Amazon SageMaker to host a model for a highly available customer-facing application. The Specialist has trained a new version of the model, validated it with historical data, and now wants to deploy it to production. To limit any risk of a negative customer experience, the Specialist wants to be able to monitor the model and roll it back, if needed. What is the SIMPLEST approach with the LEAST risk to deploy the model and roll it back, if needed? (An illustrative sketch follows the options.)
    • A. 

      Create a SageMaker endpoint and configuration for the new model version. Redirect production traffic to the new endpoint by updating the client configuration. Revert traffic to the last version if the model does not perform as expected.

    • B. 

      Create a SageMaker endpoint and configuration for the new model version. Redirect production traffic to the new endpoint by using a load balancer. Revert traffic to the last version if the model does not perform as expected.

    • C. 

      Update the existing SageMaker endpoint to use a new configuration that is weighted to send 5% of the traffic to the new variant. Revert traffic to the last version by resetting the weights if the model does not perform as expected.

    • D. 

      Update the existing SageMaker endpoint to use a new configuration that is weighted to send 100% of the traffic to the new variant. Revert traffic to the last version by resetting the weights if the model does not perform as expected.
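
    For reference, shifting traffic between production variants on an existing SageMaker endpoint is a single boto3 call; the endpoint and variant names below are placeholders.

    ```python
    import boto3

    sm = boto3.client("sagemaker")

    # Send 5% of traffic to the new variant; weights are relative
    sm.update_endpoint_weights_and_capacities(
        EndpointName="recommender-endpoint",          # placeholder
        DesiredWeightsAndCapacities=[
            {"VariantName": "current-model", "DesiredWeight": 95.0},
            {"VariantName": "new-model", "DesiredWeight": 5.0},
        ],
    )
    ```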

  • 15. 
    A Machine Learning Specialist is building a model to predict future employment rates based on a wide range of economic factors. While exploring the data, the Specialist notices that the magnitudes of the input features vary greatly. The Specialist does not want variables with a larger magnitude to dominate the model. What should the Specialist do to prepare the data for model training? (A brief sketch follows the options.)
    • A. 

      Apply quantile binning to group the data into categorical bins to keep any relationships in the data by replacing the magnitude with distribution.

    • B. 

      Apply the Cartesian product transformation to create new combinations of fields that are independent of the magnitude.

    • C. 

      Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant magnitude.

    • D. 

      Apply the orthogonal sparse bigram (OSB) transformation to apply a fixed-size sliding window to generate new features of a similar magnitude.
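
    Standardizing each feature to mean 0 and variance 1 computes z = (x - mean) / std per column; a scikit-learn sketch on made-up economic columns:

    ```python
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[50_000.0, 2.1],    # made-up rows with very different scales
                  [80_000.0, 3.4],
                  [120_000.0, 1.7]])

    X_std = StandardScaler().fit_transform(X)   # each column: mean 0, variance 1
    print(X_std.mean(axis=0), X_std.std(axis=0))
    ```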

  • 16. 
    A Data Engineer needs to build a model using a dataset containing customer credit card information. How can the Data Engineer ensure the data remains encrypted and the credit card information is secure?
    • A. 

      Use a custom encryption algorithm to encrypt the data and store the data on an Amazon SageMaker instance in a VPC. Use the SageMaker DeepAR algorithm to randomize the credit card numbers.

    • B. 

      Use an IAM policy to encrypt the data on the Amazon S3 bucket and Amazon Kinesis to automatically discard credit card numbers and insert fake credit card numbers.

    • C. 

      Use an Amazon SageMaker launch configuration to encrypt the data once it is copied to the SageMaker instance in a VPC. Use the SageMaker principal component analysis (PCA) algorithm to reduce the length of the credit card numbers.

    • D. 

      Use AWS KMS to encrypt the data on Amazon S3 and Amazon SageMaker, and redact the credit card numbers from the customer data with AWS Glue.

  • 17. 
    A Machine Learning Specialist is designing a system for improving sales for a company. The objective is to use the large amount of information the company has on users' behavior and product preferences to predict which products users would like, based on the users' similarity to other users. What should the Specialist do to meet this objective? (A sketch follows the options.)
    • A. 

      Build a content-based filtering recommendation engine with Apache Spark ML on Amazon EMR.

    • B. 

      Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR.

    • C. 

      Build a model-based filtering recommendation engine with Apache Spark ML on Amazon EMR.

    • D. 

      Build a combinative filtering recommendation engine with Apache Spark ML on Amazon EMR.
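
    A minimal Spark ML sketch of a collaborative filtering engine using ALS; the interaction triples are made up for illustration.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("cf-demo").getOrCreate()

    # Hypothetical (user, item, rating) interactions
    ratings = spark.createDataFrame(
        [(0, 10, 1.0), (0, 11, 4.0), (1, 10, 5.0), (1, 12, 2.0)],
        ["userId", "itemId", "rating"],
    )

    als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating", rank=8)
    model = als.fit(ratings)
    model.recommendForAllUsers(3).show()   # top-3 items per user
    ```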

  • 18. 
    A Machine Learning Specialist prepared the following graph displaying the results of k-means for k = [1:10]. Considering the graph, what is a reasonable selection for the optimal choice of k? (A sketch of how such a curve is produced follows the options.)
    • A. 

      1

    • B. 

      4

    • C. 

      7

    • D. 

      10
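
    The graph referenced in the question is the classic elbow plot of within-cluster sum of squares against k. A sketch of how such a curve is produced, on made-up data:

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 2)            # made-up 2-D points
    inertias = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(km.inertia_)      # within-cluster sum of squares

    # Plotting k against inertia, the "elbow" where the curve stops dropping
    # steeply is the usual choice of k.
    ```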

  • 19. 
    While reviewing the histogram of residuals on regression evaluation data, a Machine Learning Specialist notices that the residuals do not form a zero-centered bell shape as shown. What does this mean?
    • A. 

      The model might have prediction errors over a range of target values.

    • B. 

      The dataset cannot be accurately represented using the regression model.

    • C. 

      There are too many variables in the model.

    • D. 

      The model is predicting its target values perfectly.

  • 20. 
    An agency collects census information within a country to determine healthcare and social program needs by province and city. The census form collects responses for approximately 500 questions from each citizen. Which combination of algorithms would provide the appropriate insights? (Select TWO)
    • A. 

      The factorization machines (FM) algorithm

    • B. 

      The Latent Dirichlet Allocation (LDA) algorithm

    • C. 

      The principal component analysis (PCA) algorithm

    • D. 

      The k-means algorithm

    • E. 

      The Random Cut Forest (RCF) algorithm

  • 21. 
    A company is using Amazon Polly to convert plaintext documents to speech for automated company announcements. However, company acronyms are being mispronounced in the current documents. How should a Machine Learning Specialist address this issue for future documents? (A sketch appears after the options.)
    • A. 

      Convert current documents to SSML with pronunciation tags.

    • B. 

      Create an appropriate pronunciation lexicon.

    • C. 

      Output speech marks to guide in pronunciation.

    • D. 

      Use Amazon Lex to preprocess the text files for pronunciation.
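
    For reference, a hedged sketch of registering a pronunciation lexicon with Polly; the lexicon name and the W3C-style PLS content are illustrative.

    ```python
    import boto3

    polly = boto3.client("polly")

    # Illustrative PLS lexicon: expand an acronym into its spoken form
    pls = """<?xml version="1.0" encoding="UTF-8"?>
    <lexicon version="1.0"
        xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
        alphabet="ipa" xml:lang="en-US">
      <lexeme><grapheme>W3C</grapheme><alias>World Wide Web Consortium</alias></lexeme>
    </lexicon>"""

    polly.put_lexicon(Name="acronyms", Content=pls)
    polly.synthesize_speech(
        Text="The W3C published an update.", OutputFormat="mp3",
        VoiceId="Joanna", LexiconNames=["acronyms"],
    )
    ```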

  • 22. 
    A Machine Learning Specialist has completed a proof of concept for a company using a small data sample, and now the Specialist is ready to implement an end-to-end solution in AWS using Amazon SageMaker. The historical training data is stored in Amazon RDS. Which approach should the Specialist use for training a model using that data?
    • A. 

      Write a direct connection to the SQL database within the notebook and pull data in.

    • B. 

      Push the data from Microsoft SQL Server to Amazon S3 using an AWS Data Pipeline and provide the S3 location within the notebook.

    • C. 

      Move the data to Amazon DynamoDB and set up a connection to DynamoDB within the notebook to pull data in.

    • D. 

      Move the data to Amazon ElastiCache using AWS DMS and set up a connection within the notebook to pull data in for fast access.

  • 23. 
    A Mobile Network Operator is building an analytics platform to analyze and optimize a company's operations using Amazon Athena and Amazon S3. The source systems send data in CSV format in real time. The Data Engineering team wants to transform the data to the Apache Parquet format before storing it on Amazon S3. Which solution takes the LEAST effort to implement?
    • A. 

      Ingest CSV data using Apache Kafka Streams on Amazon EC2 instances and use Kafka Connect S3 to serialize data as Parquet.

    • B. 

      Ingest CSV data from Amazon Kinesis Data Streams and use AWS Glue to convert data into Parquet.

    • C. 

      Ingest CSV data using Apache Spark Structured Streaming in an Amazon EMR cluster and use Apache Spark to convert data into Parquet.

    • D. 

      Ingest CSV data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose to convert data into Parquet.

  • 24. 
    A Machine Learning Specialist kicks off a hyperparameter tuning job for a tree-based ensemble model using Amazon SageMaker with Area Under the ROC Curve (AUC) as the objective metric. This workflow will eventually be deployed in a pipeline that retrains and tunes hyperparameters each night to model click-through on data that goes stale every 24 hours. With the goal of decreasing the amount of time it takes to train these models, and ultimately to decrease costs, the Specialist wants to reconfigure the input hyperparameter range(s). Which visualization will accomplish this?
    • A. 

      A histogram showing whether the most important input feature is Gaussian.

    • B. 

      A scatter plot with points colored by target variable that uses t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the large number of input variables in an easier-to-read dimension.

    • C. 

      A scatter plot showing the performance of the objective metric over each training iteration.

    • D. 

      A scatter plot showing the correlation between maximum tree depth and the objective metric.

  • 25. 
    A Machine Learning Specialist observes several performance problems with the training portion of a machine learning solution on Amazon SageMaker. The solution uses a large training dataset that is 2 TB in size and uses the SageMaker k-means algorithm. The observed issues include the unacceptably long time it takes before the training job launches and poor I/O throughput while training the model. What should the Specialist do to address the performance issues with the current solution? (A configuration sketch follows the options.)
    • A. 

      Use the SageMaker batch transform feature.

    • B. 

      Compress the training data into Apache Parquet format.

    • C. 

      Ensure that the input mode for the training job is set to Pipe.

    • D. 

      Copy the training dataset to an Amazon EFS volume mounted on the SageMaker instance.
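
    A hedged SageMaker Python SDK sketch; the image URI, role, and S3 path are placeholders. The key setting is input_mode="Pipe", which streams training data from S3 instead of downloading the full 2 TB before the job starts.

    ```python
    import sagemaker
    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="<k-means-image-uri>",      # placeholder algorithm image
        role="<execution-role-arn>",          # placeholder IAM role
        instance_count=1,
        instance_type="ml.m5.4xlarge",
        input_mode="Pipe",                    # stream from S3 instead of File mode
        sagemaker_session=sagemaker.Session(),
    )
    estimator.fit({"train": "s3://bucket/train/"})  # placeholder S3 prefix
    ```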
