In this section, I'll explain how automated machine learning fits into the end-to-end machine learning workflow. A typical machine learning workflow covers the tasks required to first ingest and analyze your data, which includes defining the machine learning problem, exploring and understanding your data, and selecting the right algorithm, or algorithms, to use for your experiments. Next, you prepare and transform your data by performing feature engineering tasks and transforming the data into the format required by the algorithm. Finally, you typically train multiple models across a number of experiments until you have a well-performing model that's created using a specific combination of algorithm, data transformations, and hyperparameters. AutoML aims to apply machine learning and automation to a large portion of the tasks and steps that are required to find the most performant model that can then be deployed. Let's dive into some of the tasks within each of these workflow stages to better understand AutoML capabilities.

In the first week of this course, we talked about data ingestion and exploration for the machine learning problem that you're trying to solve. In this case, you're ingesting product review data that is labeled, meaning the data set includes the target that you're trying to predict. So, in this case, your data has the target classes of positive, neutral, or negative for each product review on input. Based on your data, you can further refine your problem definition to determine what class of problem you're trying to solve. Is this a regression problem, or is this a classification problem? In this case, you can see that you're really trying to determine whether a specific product review is positive, neutral, or negative, meaning you have three different classes that you're trying to predict. So, in this case, you have a classification problem.

Once you've narrowed down your machine learning problem, you want to identify potential algorithms to try as part of your training experiments. After you've done some analysis on your data and determined the type of machine learning problem that you're trying to solve, you can then look at which algorithm, or algorithms, are best suited for your data and the problem you're trying to solve. When you perform that data analysis, you want to understand your data: getting insight into things like data distribution, attribute correlation, and potential quality issues in your data, like missing data. Then, based on your machine learning problem combined with your data analysis, you're able to identify the algorithm, or algorithms, that you'd like to try for your experiments.

After you've done these things, you can then outline some experiments. In this case, let's say you decided to use XGBoost for your first experiment, which you'll also see inside the lab for this week. XGBoost is an implementation of gradient-boosted decision trees and can be used for classification as well as regression problems. However, selecting the right algorithm, or algorithms, is only part of the process. For each algorithm, there are also a number of hyperparameters that you need to consider as you tune your model for optimal performance. Also, each algorithm can have different expectations in terms of the format of the data that it expects on input for training. So, let's take a look at a few considerations related to the product review data set.
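Before getting into those considerations, here's a minimal sketch of what configuring XGBoost for this kind of three-class problem might look like, assuming the open-source xgboost Python package; the feature matrix and labels here are random placeholders rather than the actual lab data:

```python
import numpy as np
import xgboost as xgb

# Placeholder feature matrix and labels; in practice these would come from the
# transformed review text and the sentiment column of the product review data set.
X_train = np.random.rand(100, 20)
y_train = np.random.randint(0, 3, size=100)

# XGBoost expects non-negative integer class labels, so the -1/0/1 sentiment
# values would first be remapped to 0/1/2.
dtrain = xgb.DMatrix(X_train, label=y_train)

params = {
    "objective": "multi:softmax",  # multi-class classification
    "num_class": 3,                # positive, neutral, negative
    "max_depth": 6,                # example hyperparameters you'd tune later
    "eta": 0.3,
}

model = xgb.train(params, dtrain, num_boost_round=50)
```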
The schema for the product review data set includes three attributes on input: first, a review ID, which is just a numeric, unique identifier for a specific review; second, the review text, which is your text-based, or categorical, data that contains the actual product review; and, finally, the sentiment for the review, which is your target, or label, that you're trying to predict. In this case, the sentiment attribute is numeric, and the numeric values represent the classes, where 1 is positive, 0 is neutral, and -1 is negative. So, looking at your data set schema, what transformations are you going to need to make to help ensure that your selected algorithm, in this case XGBoost, can accept and understand your data on input for training?

Looking at your data in combination with some of the statistics that you previously captured and analyzed, you know that the review text is categorical, but you also know that it contains too many unique values to be impactful to your algorithm if you tried something like one-hot encoding. So, in this case, you want to treat this attribute as text and apply text-processing transformations instead. Text transformation is a pretty broad topic because it can include additional data processing, like using tokenization to convert sentences into words, or removing stop words, such as "the" or "is", which may not be impactful to the overall performance of your model. You then typically perform some type of feature extraction, where you map your textual data to vectors. There are a number of different techniques to do this, and text transformations can often take a lot of time and effort to optimize. In the lab this week, you'll use one technique called term frequency-inverse document frequency, or TF-IDF. I'll cover that a bit more later in this session. Finally, your last attribute is your label, which in this case has three unique classes and is already in a numeric format, which looks good for your algorithm.

If you remember, data preparation also includes looking for class imbalance, or signals of data bias, using statistical bias detection techniques. You'll want to determine how you plan to handle class imbalance, which can involve things like changing your performance metric, applying resampling techniques, generating synthetic data, or even changing your selected algorithm. As an example, XGBoost tends to handle class imbalance well, and it also supports additional hyperparameter tuning to further adjust for data imbalance in classification problems, like the imbalance you see here, where the number of positive reviews is significantly larger than the number of neutral or negative reviews. Figuring out how to handle problems like class imbalance can consume a lot of your cycles and require multiple experiments, using different combinations of data transformations and training tasks, before finding that optimal combination of data transformations, algorithm, and hyperparameters that gives you the results you need.

This leads me to the next part of your workflow and your final prepare-and-transform task. Once you've done your data transformations, you can then use your processed data set to create your training and validation data sets. For this, you reserve the largest portion of your data for training your model. This is the data that the model learns from, and you can use it to calculate model metrics, such as training accuracy and training loss.
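Stepping back to the text-transformation step for a moment, here's a minimal sketch of TF-IDF feature extraction, assuming scikit-learn's TfidfVectorizer; the sample reviews are made up, and the lab may use a different implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A few made-up reviews standing in for the review text attribute.
reviews = [
    "This product is great and works exactly as described",
    "It is okay, nothing special",
    "Terrible quality, it broke after one day",
]

# TfidfVectorizer tokenizes each review, drops common English stop words such
# as "the" or "is", and maps the text to sparse TF-IDF feature vectors.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

print(X.shape)  # (number of reviews, vocabulary size)
print(vectorizer.get_feature_names_out())
```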
The validation data set is a second data set, or holdout data set, created from your fully processed training data set, and you'll use it to evaluate your model's performance, usually after each epoch, or full pass through the training set. The purpose of this evaluation is to fine-tune the model hyperparameters and determine how well your model is able to generalize to unseen data. Here, you can calculate metrics like validation accuracy and validation loss.

After you have your data sets ready, you can move on to model training and tuning. Model training and validation is highly iterative and typically happens over many experiments. During this step, your goal is to determine which combination of data, algorithm, and hyperparameters results in the best-performing model. For each combination that you choose, you need to train the model and evaluate it against that holdout data set. Then you repeat these steps until you have a model that is performing well according to your objective metric, whether that's accuracy or something like an F1 score, depending on what you're optimizing for.

As you can imagine, all of these iterations can take a lot of compute, and you typically want to be able to iterate rapidly without running into bottlenecks; this is where training at scale comes in. The cloud gives you access to on-demand resources that allow you to train and experiment at scale, without waiting for or scheduling training time on on-premises resources that are often constrained or limited by GPU, CPU, or storage. Without resource limitations, you can also further optimize your training time using capabilities like distributed training or taking advantage of parallel processing.

So, I just covered the high-level steps from data preparation to model training and tuning. As you can see, there's a lot of work that goes into understanding your data, determining the best algorithm, or algorithms, to use, performing your data preparation and transformations, and finally training and tuning for each of your experiments. Each combination of data, algorithm, and hyperparameters can consume many human hours and compute hours before you get to a model that is performing well. So, this is where AutoML comes in.

There are a lot of different implementations of AutoML, but, in general, AutoML reduces the need for data scientists to build machine learning models, because it uses machine learning to automate the machine learning workflow tasks that are highlighted here in blue and which I just covered in detail for the product review use case. So, does this mean that we no longer need data scientists? Not at all. Data scientists are still critical, but the goal is to allow them to focus on those really hard-to-solve machine learning problems. This can also include having data scientists refine the data transformations, or the code that's generated by AutoML, to further optimize the results that are produced through automated machine learning.

So, I'll spend some time now talking about the kinds of tasks that automated machine learning is designed to accomplish. In the next video, I'll dive into the details of the Amazon SageMaker implementation of AutoML, called Amazon SageMaker Autopilot. With AutoML, you first provide your labeled data set, which includes the target that you're trying to predict. Then, AutoML automatically does some analysis of that data and determines the type of machine learning problem.
So, is this a binary classification, multi-class classification, or regression problem? AutoML will then typically explore a number of algorithms and automatically select the algorithm that best suits your ML problem and your data. Once AutoML selects an algorithm, it will automatically explore various data transformations that are likely to have an impact on the overall performance of your model, and then it will automate the creation of the scripts necessary to perform those data transformations across your tuning experiments. Finally, AutoML will select a number of hyperparameter configurations to explore over those training iterations, to determine which combination of hyperparameters and feature transformation code results in the best-performing model. AutoML capabilities reduce a lot of the repetitive work involved in building and tuning your models through the numerous iterations and experiments that are typically required.

Some common scenarios for AutoML include, first, enabling people who don't have machine learning expertise to build models. AutoML lets people who aren't classically trained data scientists benefit from using machine learning to solve everyday problems, and it allows expert data scientists to focus on those really hard problems that can't be solved through AutoML. Second, AutoML allows you to experiment and build models at scale by reducing the amount of human intervention required across the machine learning workflow, especially on resource-intensive tasks like feature engineering and hyperparameter tuning. Finally, AutoML is all about automation. Even if the automation doesn't get you all the way there, you can still use AutoML to reduce a lot of the repetitive work while your experts focus on high-value tasks, like taking that AutoML output and applying their domain knowledge, doing additional feature engineering, or evaluating and analyzing the results of that AutoML run.

However, there are some considerations when selecting an implementation of AutoML. Depending on the implementation you choose, there may be a trade-off between iterating faster and maintaining the transparency and control that you may be looking for. Some implementations of AutoML provide limited visibility into the background experiments, which may produce a really performant model, but that model is often hard to understand, explain, or reproduce manually. Alternatively, there are implementations of AutoML that not only provide the best model but also provide all of the candidates and the full source code that was used to create that model. This is valuable for being able to understand and explain your model, and it also allows you to take that model and potentially further optimize it for extra performance by doing things like applying some of that additional domain knowledge or doing additional feature engineering on top of the recommended feature engineering code.

In this section, I walked through the tasks, or steps, in the machine learning workflow that often require a lot of resources, not only in terms of human time to perform these tasks, but also in terms of compute cycles or resource costs. Using solutions that take advantage of automated machine learning helps you avoid those challenges by using machine learning to automate either all or part of your model-building activities.
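To make the kind of repetitive loop that AutoML automates a bit more concrete, here's a minimal sketch of a manual split-train-evaluate cycle over a tiny hyperparameter grid, assuming scikit-learn and the xgboost package; the data and grid values are placeholders:

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder features and labels; in practice these would be the TF-IDF vectors
# and the remapped sentiment classes.
X = np.random.rand(500, 30)
y = np.random.randint(0, 3, size=500)

# Reserve the largest portion of the data for training; hold out 20% for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

best_score, best_params = 0.0, None
for max_depth in [4, 6, 8]:              # a tiny example hyperparameter grid
    for learning_rate in [0.1, 0.3]:
        model = xgb.XGBClassifier(
            max_depth=max_depth,
            learning_rate=learning_rate,
            n_estimators=50,
        )
        model.fit(X_train, y_train)
        score = accuracy_score(y_val, model.predict(X_val))  # validation accuracy
        if score > best_score:
            best_score = score
            best_params = {"max_depth": max_depth, "learning_rate": learning_rate}

print(best_params, best_score)
```

Each pass through this loop is one experiment; AutoML's job is to run many of these combinations for you, across transformations and hyperparameters, and keep track of which one performs best.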
Next, I'm going to cover Amazon SageMaker's implementation of AutoML, called SageMaker Autopilot, which not only automates your machine learning workflow tasks but also gives you the level of control and transparency to understand exactly how your data was processed and how your model was built.
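As a rough preview, here's a minimal sketch of what kicking off that kind of job might look like with the SageMaker Python SDK's AutoML estimator; the IAM role, S3 paths, and column name are placeholders, and the exact arguments you'll use in the lab may differ:

```python
import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/ExampleSageMakerRole"  # placeholder IAM role

automl = AutoML(
    role=role,
    target_attribute_name="sentiment",                    # the label column to predict
    max_candidates=10,                                    # limit how many candidate pipelines are explored
    output_path="s3://example-bucket/autopilot-output",   # placeholder output location
    sagemaker_session=session,
)

# The input is a CSV in S3 that includes the target column; Autopilot analyzes the
# data, infers the problem type, and runs the candidate experiments in the background.
automl.fit(
    inputs="s3://example-bucket/data/product_reviews.csv",  # placeholder input path
    wait=False,
)
```

With wait=False, the job runs in the background, and you can come back later to inspect the candidates and the generated code, which is the transparency and control I'll dig into next.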