Welcome. In this video you will see how you can train one model to get very good results on several NLP tasks at once. When training such a model, we append a tag, or prefix, to the input to indicate which task we're training on: machine translation, question answering, summarization, sentiment analysis, or some other type of task. Let's see how you can use this in your own applications.

The multitask training strategy works as follows. If you want to translate from English to German, you prepend the prefix "translate English to German:" and the model gives you the corresponding translation. For a CoLA example like "The course is jumping well.", the model outputs "not acceptable" because the sentence is grammatically incorrect. If you have two sentences and want to measure their similarity, you feed in "stsb sentence1: ... sentence2: ..." and you get back the corresponding score. If you want to summarize, you add the "summarize:" prefix to the article or text, and the model gives you the summary. So that is how it works.

Now for the input and output format. For machine translation, you write "translate [source language] to [target language]:" followed by the sentence. To predict entailment, contradiction, or neutral, you feed in something like "mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity." and the target is "entailment". The model has full visibility over the entire input, both premise and hypothesis, and it is tasked with making a classification by generating the word "entailment". This makes it easy for the model to learn to predict one of the correct class labels given the task prefix, "mnli" in this case. Recall that the main difference between the prefix language model and the BERT architecture is that in the prefix LM the classifier is integrated into the output layer of the transformer decoder. There is also the Winograd schema task, where the model has to resolve a pronoun. For example, given "The city councilmen refused the demonstrators a permit because they feared violence.", you feed the sentence into the model and it is tasked with predicting that "they" refers to the city councilmen.

For the multitask training strategy, there is a table of results in the original paper. We'll talk about what the GLUE benchmark is later, and you can check out the other benchmarks on your own. For the purposes of this week, we'll focus on the GLUE benchmark, which is covered in the next video, and we'll also talk about adapter layers and gradual unfreezing. These are the reported scores, and you can see that the T5 paper actually reaches state of the art on many tasks.

So how much data from each task should the model train on? For the data training strategies, there is examples-proportional mixing, where you sample from each dataset in proportion to its size: you take the same fraction, say 10%, from each dataset, so a larger dataset contributes more examples (10% is just a number I picked, but you get the point). Another strategy is equal mixing, where you take an equal number of samples from each dataset regardless of its size. And there is something in the middle called temperature-scaled mixing, where you adjust a temperature parameter to get a mixture between the two. Next, we'll talk about gradual unfreezing versus adapter layers.
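To make this concrete, here is a minimal Python sketch of how inputs might be formatted with task prefixes, and how the three mixing strategies could assign per-dataset sampling rates. The helper names (`format_example`, `mixing_rates`), the dataset-size cap `K`, and the temperature value are illustrative assumptions, not code from the course or the T5 paper.

```python
# Sketch: task prefixes turn every task into a text-to-text pair.
# The prefix strings follow the examples discussed above; the helpers
# themselves are illustrative, not the course's actual code.

def format_example(task, **fields):
    """Prepend a task prefix so one model can handle many tasks."""
    if task == "translation":
        return f"translate English to German: {fields['text']}"
    if task == "cola":            # grammatical acceptability
        return f"cola sentence: {fields['text']}"
    if task == "stsb":            # sentence similarity
        return f"stsb sentence1: {fields['s1']} sentence2: {fields['s2']}"
    if task == "mnli":            # entailment / contradiction / neutral
        return f"mnli premise: {fields['premise']} hypothesis: {fields['hypothesis']}"
    if task == "summarization":
        return f"summarize: {fields['text']}"
    raise ValueError(f"unknown task: {task}")


def mixing_rates(dataset_sizes, strategy="proportional", K=2**16, temperature=2.0):
    """Per-dataset sampling rates for multitask training.

    - proportional: sample in proportion to (capped) dataset size
    - equal:        same rate for every dataset
    - temperature:  proportional rates raised to 1/T, then renormalized
    K (the size cap) and the temperature are hyperparameters; the values
    here are placeholders.
    """
    capped = [min(n, K) for n in dataset_sizes]
    if strategy == "equal":
        rates = [1.0] * len(dataset_sizes)
    elif strategy == "proportional":
        rates = capped
    elif strategy == "temperature":
        rates = [c ** (1.0 / temperature) for c in capped]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    total = sum(rates)
    return [r / total for r in rates]


print(format_example("mnli",
                     premise="I hate pigeons.",
                     hypothesis="My feelings towards pigeons are filled with animosity."))
print(mixing_rates([100_000, 5_000, 1_000], strategy="temperature"))
```

Note how a high temperature pushes the rates toward equal mixing, while a temperature of 1 recovers examples-proportional mixing; that is the sense in which temperature-scaled mixing sits "in the middle."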
In gradual unfreezing, you unfreeze one layer at a time. You take your neural network, unfreeze the last layer, fine-tune using that while keeping the other layers fixed, then unfreeze the next layer, and so on, until every layer has been unfrozen. With adapter layers, you add a small neural network to each feed-forward sublayer in each block of the transformer. These new feed-forward networks are designed so that their output dimension matches their input dimension, which allows them to be inserted without any structural change. When fine-tuning, only these new adapter layers and the layer normalization parameters are updated (see the sketch at the end of this video).

Now a bit more about fine-tuning. The approach usually taken here has the goal of training a single model that can simultaneously perform many tasks. Most of the model's parameters are shared across all of the tasks, so we might train a single model on many tasks, but when reporting performance we can select a different checkpoint for each task. The tasks here could be, for example, translation, summarization, or masked language modeling, and training runs for 2 to the power of 18 steps.

You have now learned about the multiple training strategies used for your transformer model. In this week's programming exercise, you will explore this in even more detail. Now that you know how to train this model, you need a way to evaluate it. Concretely, you'll evaluate it using the GLUE benchmark, which stands for the General Language Understanding Evaluation benchmark. See you there.
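As a rough illustration of these two fine-tuning options, here is a sketch in PyTorch of an adapter block whose output dimension matches its input (so it can be dropped in after a transformer feed-forward sublayer without structural changes), plus helpers for adapter-only fine-tuning and for gradually unfreezing one layer at a time from the top. The module and parameter names are assumptions for illustration; they do not correspond to the actual T5 or course implementation.

```python
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.

    The output dimension equals the input dimension, so the block can be
    inserted after a feed-forward sublayer without changing any shapes.
    """
    def __init__(self, d_model, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual connection


def adapter_finetune_params(model):
    """Freeze everything except adapter and layer-norm parameters.

    Assumes adapters live in submodules whose names contain "adapter" and
    that layer norms contain "layer_norm" / "LayerNorm" in their names.
    """
    trainable = []
    for name, p in model.named_parameters():
        if "adapter" in name or "layer_norm" in name or "LayerNorm" in name:
            p.requires_grad = True
            trainable.append(p)
        else:
            p.requires_grad = False
    return trainable


def gradual_unfreeze(layers, step, unfreeze_every=1000):
    """Unfreeze one more layer, starting from the last, every `unfreeze_every` steps."""
    n_unfrozen = min(len(layers), step // unfreeze_every + 1)
    for i, layer in enumerate(layers):
        requires_grad = i >= len(layers) - n_unfrozen   # last layers first
        for p in layer.parameters():
            p.requires_grad = requires_grad
```

In a training loop you would call `gradual_unfreeze(model_layers, step)` before each optimizer step, or build the optimizer from `adapter_finetune_params(model)` if you go the adapter route; the schedule and bottleneck size shown here are placeholder choices, not values from the paper.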