In the last video, we talked about the programming model for Spark, where RDDs get generated from external datasets and get partitioned. We said RDDs are immutable, meaning they can't be changed in place, even partially. A transformation operation must be applied to them, converting them into a new RDD. This is essential for keeping track of all the processing that has been applied to our dataset, giving us the ability to keep the chain, or lineage, of RDDs. In addition, as part of a big data pipeline, we start with an RDD, and through several transformation steps, many other RDDs get generated as intermediate products until we get to our final result. We also mentioned that an important feature of Spark is that all these transformations are lazy. This means they don't execute immediately when applied to an RDD. So when we apply a transformation, nothing happens right away. We are basically preparing our big data pipeline to be executed later. When we are done defining all the transformations and perform an action, Spark will take care of finding the best way to execute this computation and then start all the necessary tasks on our worker nodes.

In this video, we will explain some common transformations in Spark. After this video, you will be able to explain the difference between a narrow transformation and a wide transformation, describe map, flatMap, filter, and coalesce as narrow transformations, and list two wide transformations.

Let's take a look at probably the simplest transformation, which is map. It applies a function to each element of an RDD. This is a one-to-one transformation. It is also in the category of element-wise transformations, since it transforms every element of an RDD separately. The code example in the blue box here applies a function called lower to all the elements in text_RDD. The lower function turns all the characters in a line to lower-case letters. So the input is one line of text with any kind of capitalization, and the output is going to be the same line, all lower case. In this example, we have two worker nodes, drawn as orange boxes. The black boxes are partitions of our dataset. We work by partition and not by element; as you will remember, this is a difference between Spark and MapReduce. A partition is just a chunk of our data with some number of elements in it, and the map function gets applied to all the elements in that partition on each worker node locally. Each node applies the map function to the data, or RDD partition, it received independently.

Let's look at a few more transformations in the element-wise category. flatMap is very similar to map. However, instead of returning an individual element for each input element, it returns an RDD with an aggregate of all the results for all the elements. In the example in the blue box, the split_words function takes a line as input, which is one element, and its output is each word as a single element. So it splits a line into words. The same thing gets done for each line. When the output for all the lines is flattened, we get a simple one-dimensional list of words. So we'll get all the words in all the lines in just one list. Depending on the line length, the output partitions might be of different sizes, depicted here by the height of each black box being different. In Spark terms, map and flatMap are narrow transformations.
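To make this concrete, here is a minimal PySpark sketch of the map and flatMap steps described above. The names text_RDD, lower, and split_words follow the video's blue-box examples; the SparkContext setup and the input file path are assumptions added so the sketch is self-contained.

from pyspark import SparkContext

sc = SparkContext(appName="narrow-transformations")

def lower(line):
    # one line of text in, the same line out, all lower case
    return line.lower()

def split_words(line):
    # one line in, a list of words out; flatMap flattens these lists
    return line.split()

text_RDD = sc.textFile("input.txt")          # assumed input file
lower_RDD = text_RDD.map(lower)              # one element in, one element out
words_RDD = lower_RDD.flatMap(split_words)   # one element in, many elements out

print(words_RDD.take(5))                     # an action finally triggers execution

Note that nothing is computed until the take action at the end, which is exactly the lazy behavior described above.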
A narrow transformation refers to processing where the logic depends only on data that is already residing in the partition, and data shuffling is not necessary.

Another very important transformation is filter. Often we're interested in just a subset of our data, or we want to get rid of bad data. The filter transformation takes a function that executes on each element of an RDD partition and returns only the elements for which that function returns true. The example code in the blue box here applies a filter function that keeps only the words starting with the letter a. The function starts_with_a takes the input word, transforms it to lower case, and then checks whether the word starts with a. So the output of this operation will be a list with only words that start with a. This is another narrow transformation, so it gets executed locally without the need to shuffle any RDD partitions across the worker nodes.

The output of filter depends on the input and the filter function. In some cases, even if you started with evenly sized RDD partitions within the worker nodes, the partition sizes can vary significantly across the workers after a filter operation. When this happens, it is a pretty good idea to join some of those partitions to increase performance and even out processing across the cluster. This transformation is called coalesce. Coalesce simply helps with balancing the number and sizes of the data partitions. When you have significantly reduced your initial data after some filters and other transformations, having a large number of partitions might not be very useful anymore. In this case, you can use coalesce to reduce the number of partitions to a more manageable number.

Until now, we talked about narrow transformations that happen on a worker node locally, without having to transfer data over the network. Now, let's start talking about wide transformations. Let's remember our word count example. As part of the word count example, we map the words RDD to generate tuples. The output of map is a key-value pair list where the key is the word and the value is always one. We then apply reduceByKey to the tuples to generate counts, which simply sums the values for each key, or word. Let's imagine for a second that we use groupByKey instead of reduceByKey; we will come back to reduceByKey in just a little bit. Remember, map outputs tuples, which is a list of key-value pairs in the form (word, 1). At each worker node, we will have tuples that have the same word as the key. In this example, we have apple as the key, 1 as the count, and two worker nodes. Trying to group together all the counts of a word across worker nodes requires shuffling of data between these nodes, just like we do for the word apple here. groupByKey is the transformation that helps us combine values with the same key into a list, without applying a special user-defined function to them. As you see on the right, the result of a groupByKey transformation on all the map outputs is that the word apple, as the key, ends up with a list of all the ones. If we instead applied a function to that list, like summing up all the values, then we would have had the word count result, in this case 2. If we need to apply such a function to the group of values related to a key like this, we use the reduceByKey operation. reduceByKey helps us combine the values using a reduce function, which in the word count case is a simple summation.
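Continuing the sketch above, and still assuming the words_RDD produced by the flatMap step, the remaining transformations from this video might look like this in PySpark. The starts_with_a function mirrors the blue-box filter example, and the partition count passed to coalesce is an arbitrary illustration.

def starts_with_a(word):
    # lower-case the word, then keep it only if it starts with 'a'
    return word.lower().startswith('a')

a_words = words_RDD.filter(starts_with_a)    # narrow: runs per partition, no shuffle
a_words = a_words.coalesce(2)                # narrow: merge partitions after heavy filtering

tuples = words_RDD.map(lambda w: (w, 1))     # (word, 1) key-value pairs

# Wide transformations: both shuffle data across worker nodes.
grouped = tuples.groupByKey()                      # e.g. ('apple', [1, 1, ...]) values collected per key
counts = tuples.reduceByKey(lambda a, b: a + b)    # e.g. ('apple', 2) values summed per key

print(counts.take(5))

For word count, reduceByKey is usually preferred over groupByKey because the summation can be applied within each partition before the shuffle, so less data moves over the network.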
In the groupByKey and reduceByKey transformations, we observed behavior that requires shuffling of data across worker nodes; we call such transformations wide transformations. In wide transformations, processing depends on data residing in multiple partitions distributed across worker nodes, and this requires data shuffling over the network to bring related data together. As a summary, we have listed a small number of transformations in Spark with some examples and distinguished between them as narrow and wide transformations. Although this is a good start, I advise you to go through the list provided at the link shown here after you complete this beginner course. Read about the rest of the transformations in Spark before you start programming in Spark, and have fun with transformations.