Having mastered computing the connected components of an undirected graph in linear time, let's now turn our attention to directed graphs, which also arise in all kinds of different applications. Now the good news is that we'll be able to get just a blazingly fast primitive for computing connectivity information for directed graphs. The bad news, if you want to call it that, is we'll have to think a little bit harder. So it won't be as obvious how to do it. But by the end of this lecture you will know a linear-time algorithm, based just on depth-first search, for computing all of the pieces of a directed graph. In fact, it's not even so obvious how to define pieces, how to define connected components, in a directed graph. Certainly not as obvious as it was with undirected graphs. To see what I mean, let's consider the following four-node directed graph. On the one hand, this graph is in some sense in one piece. If this was an actual physical object, made, say, of a bunch of strings connected to each other, and we picked it up with our hands, it wouldn't fall apart into two pieces; it would hang together in one piece. On the other hand, when you think about moving around this network, it's not connected in the sense that we might think about. You cannot get from any one point a to any other point b. For example, if you start at the right-most node in this graph, certainly there's no directed path that will get you to the left-most node. So what's typically studied, and most useful, for directed graphs is what's called strong connectivity. A graph is strongly connected if you can get from any one point to any other point, and vice versa. And the components, then, informally, are the maximal portions of this graph, the maximal regions, which are internally strongly connected. So the maximal regions within which you can get from any one point a to any other point b along a directed path. 
For the record, let me go ahead and give you a formal definition, although this intuition is perfectly valid: just regions where you can get from anywhere to anywhere else. So we define the strongly connected components of a directed graph, or the SCCs for short. And as in the undirected case, we're going to give a somewhat slick definition: rather than talking about maximal regions satisfying some property, we're going to talk about the equivalence classes of a particular equivalence relation. But really it means the same thing; this is just the mathematically more mature way of writing it down. So the equivalence relation we're going to use is on nodes of the graph, and we'll say that a node u is related to a node v if you can get from u to v via a directed path, and also from v to u via some directed path. I'm not going to bore you with the verification that this is an equivalence relation; that's something you can work out in the privacy of your own home. So remember what it means to be an equivalence relation. It has to be reflexive, that is, everybody is related to itself. But of course there is a path from every node to itself. It also has to be symmetric, so if u is related to v, then v is related to u. Well again, by definition we're saying that the vertices are mutually reachable from each other, so that's clearly symmetric by definition. And then it has to be transitive. And the way you prove it's transitive is you just paste paths together, and it just works the way you'd expect it to. So let's illustrate this concretely with a somewhat more complicated directed graph. So here's a directed graph, and I claim that it has exactly four strongly connected components. There's a triangle on the left. A triangle on the right. It has this single node on top. And then it has this directed four-cycle with a diagonal at the bottom. So what I hope is pretty clear is that each of these circled regions is indeed strongly connected. 
That is, if you start from one node in one of these circled regions, you have a directed path to any other node in that region. That's because each of these regions contains a directed cycle through all of its nodes, and on a directed cycle you can get from any one starting point to any other place. And then there's the one strong component that just has the one node, which is trivially strongly connected. What I also claim is true is that all of these regions are maximal, subject to being strongly connected. That's why they're the strongly connected components; that's why they're the equivalence classes of this equivalence relation we just defined. So if you take any pair of nodes which lie in two different circles, you either won't have a directed path from the first one to the second, or you won't have a directed path from the second one back to the first one. In fact, the structure of the strong components in this bigger graph exactly mirrors the directed acyclic graph that we started with in red. So in the same way that in the red four-node graph you can't move from right to left, here in this bigger graph you can't move from any of the circled SCCs on the right to any of the circled SCCs on the left. So, for example, from the right-most nodes there are no directed paths to the left-most nodes. So that's a recap of the definition of the strongly connected components. I've motivated in a separate video some reasons why you might care about computing strongly connected components, in particular on extremely large graphs, which motivates the need for blazingly fast subroutines, so-called for-free primitives, that let you compute connectivity information. So you'd love to be able to just know the pieces of a directed graph. Maybe you don't even know why they're going to be useful, but you just compute them, because why not? It's a for-free primitive. So that's what I'm going to give you in this lecture. The algorithm that we're going to present is based on depth-first search. 
And that's going to be the main reason why it's going to be so blazingly fast: because depth-first search is blazingly fast. Now, you might be wondering, what on earth does graph search have to do with computing components? They don't seem obviously related. So let's return to the same directed graph that I showed you on the previous slide, and see why something like depth-first search might conceivably have some use for computing strong components. Suppose we called depth-first search from this red node as a starting point. What would it explore? So remember what the guarantee of something like depth-first search, or breadth-first search for that matter, is: you find everything that's findable, but naturally nothing else. So what is findable from this red node? Where by findable I mean you can reach it via a directed path emanating from this red node. Well, there's not much you can do. From here you can explore this arc, and you can explore this arc, and then you can go back around. And so, if you do DFS or BFS from this node, you're going to find precisely the nodes in this triangle. All of the other arcs involved go the wrong direction, and they won't be traversed by, say, a depth-first search call. So, why is that interesting? What's interesting is that if we invoke DFS from this red node, or any of the three nodes in this triangle, then it's going to discover precisely this strongly connected component, precisely the three nodes in this circled SCC. So that seems really cool. It seems like maybe we just do a DFS, and boom, we get an SCC. So maybe if we do that over and over again we'll get all the SCCs. That's a good initial intuition, but something can go wrong. Suppose that instead of initiating DFS from one of these three nodes on the triangle, we, say, initiated it from this bottom-most node in green. So remember, what is the guarantee of a graph search subroutine like DFS? It will find everything findable, but of course nothing more. 
So what's findable from this green node? Well, naturally, everything in its own SCC, right? So the four nodes here; it'll certainly find those four nodes. On the other hand, if we start from this green node, since there are arcs that go from this bottom-most SCC to the right-most SCC, not only will this DFS call find the four nodes in the green node's strong component, but it will also traverse these blue arcs and discover the three nodes in the red triangle. So, if we call DFS from this green node, we'll capture all seven of these nodes. So the point is, if we call DFS, it looks like we're going to get a union of possibly multiple SCCs. In fact, in the worst case, if we invoke DFS from the left-most node, what's it going to discover? It's going to discover the entire graph, and that gives us no insight into the strong component structure at all. So the takeaway point is: if you call DFS from just the right place, you'll actually uncover an SCC. If you call it from the wrong place, it will give you no information at all. So the magic of the algorithm that we're going to discuss next is that we'll have a super-slick preprocessing step, which ironically is itself just a depth-first search. With it we can, in linear time, compute exactly where we want to start the subsequent depth-first searches from, so that each invocation gets us exactly one strongly connected component and nothing more. The algorithm that I'm going to show you is due to Kosaraju, and we will prove the following theorem: the strongly connected components of a directed graph can be computed in linear time. And as we'll see, the constants are also very small; it's really just going to be two passes of depth-first search. And again I'm going to remind you that for many problems, it's natural to assume that the number of edges is at least the number of nodes, because you're only interested in cases where the graph is connected. 
Of course, when you're computing connected components, that's one of the most natural cases where you might have a super sparse, broken-up graph. So we're not going to assume m is at least n; that's why linear time is going to be m plus n, because that's the size of the input, and we don't know whether m is bigger than n or n is bigger than m. We have no idea. Kosaraju's algorithm is shocking in its simplicity. It has three steps. Let me tell you what they are. First, very mysteriously, we are going to reverse all of the arcs of the given graph. Totally not clear why that would be an interesting thing to do yet. Then we're going to do our first pass, our first depth-first search, and we're going to do it on the reversed graph. Now, the naive way to implement this would be to literally construct a new copy of the input graph with all of the arcs in the reverse direction, and then just run depth-first search on it. Of course, the sophisticated, sort of obvious optimization would be to just run DFS on the original graph, but going across arcs backward. So I'll let you think through the details of how you'd do that, but it just works: you run DFS, and instead of going forward along edges, you go backward along edges, and that simulates depth-first search on the reversed graph. Now, I've written here DFS-Loop, and that just means the usual outer loop to make sure that you see all of the nodes of the graph even if it's disconnected: you try each starting point separately, and if you haven't already seen it, then you run DFS from that given node. I'll be more detailed on the next slide. And the third step is you run depth-first search again, but this time on the original graph. Now, at this point you should be thinking that I'm totally crazy, right? So what are we trying to do? We're trying to compute these strongly connected components. 
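To make step one concrete, here's a minimal sketch, my own illustration rather than code from the lecture, of the naive implementation that literally materializes the reversed graph; the dictionary-of-lists adjacency representation is an assumption:

```python
def reverse_graph(adj):
    """Return the reversal of a directed graph.

    adj: dict mapping each node to the list of heads of its outgoing arcs.
    """
    rev = {v: [] for v in adj}
    for u, heads in adj.items():
        for v in heads:
            rev[v].append(u)  # the arc u -> v becomes the arc v -> u
    return rev
```

The smarter alternative mentioned above, running DFS on the original graph while traversing arcs backward, amounts to handing DFS the incoming-arc lists instead of building this reversed copy.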
We're trying to actually compute real objects, these maximal regions, and all I'm doing is searching the graph. I do it once forward, I do it once backward. It doesn't seem like I'm computing anything. So here's the catch, and it's a very minor catch. We're going to do a little bit of bookkeeping. It's going to be very little overhead, so we'll still have a blazingly fast algorithm. But with this little bit of bookkeeping, here's what's going to happen: the second depth-first search, which searches the forward graph, will, in its search process, discover the components one at a time, in a very natural way. And that will be really obvious when we do an example, which we'll do in just a second. Now, for the second depth-first search to work in this magical way, where it just discovers the connected components one at a time, it's really important that it executes the depth-first searches in a particular order, that it goes through the nodes of the graph in a particular order. And that is exactly the job of the first pass: the depth-first search on the reversed graph is going to compute an ordering of the nodes such that, when the second depth-first search goes through them in that order, it will just discover the SCCs one at a time, in linear time. So let me say a little bit more about the form of the bookkeeping, and then I'll show you how that bookkeeping is maintained as we do depth-first search. We're going to have a notion of a finishing time of a vertex, and that's going to be computed in the first pass, when we do depth-first search on the reversed graph. And we're going to make use of this data in the second pass: rather than just going through the nodes of the graph in an arbitrary order, like we usually do when we have an outer loop around depth-first search, we're going to make sure that we go through the vertices in decreasing order of these finishing times. 
Now, there's still the question of in what sense this second depth-first search discovers and reports the strongly connected components that it finds. So we're going to label each node in the second pass with what we call a leader. And the idea is that the nodes in the same strongly connected component will be labeled with exactly the same leader node. Again, all of this will be much more clear once we do a concrete example, but I want to have it down for the record right now. So that's the algorithm at a high level. It's really just two passes of DFS with some bookkeeping, but this is underspecified; you shouldn't yet understand how to implement the algorithm. So what do I owe you? I owe you exactly what I mean by the DFS-Loop, although you've seen this more or less in the past: it's just a loop over all the vertices of the graph, and if you haven't yet seen a vertex, you run DFS from that starting point. I need to tell you what finishing times are and how they get computed; they're just going to be the integers 1 to n, recording when depth-first search finishes with each of the nodes. And then I need to tell you how you compute these leaders. So let me tell you all three of those things on the next slide. The workhorse for Kosaraju's strongly connected components algorithm is this DFS-Loop subroutine, and it takes, as input, a graph. So it does not take as input a starting node; it's going to loop over possible starting nodes. Now, for the bookkeeping to compute finishing times, we're going to keep track of a global variable called t, which we initialize to zero. The point of t is to count how many nodes we've totally finished exploring so far. So this is the variable we use to compute those finishing times in the first pass, that magical ordering that I was talking about. Now, we're also going to have a second global variable, s, to compute these things I was calling leaders, and these are only relevant for the second pass. 
So what s is going to keep track of is the most recent vertex from which a DFS was initiated. To keep the code simple, I'm just going to do all of the bookkeeping in DFS-Loop, so really DFS-Loop gets called twice: once on the reversed graph, once on the forward graph. And we only need to compute these finishing times in the first pass, on the reversed graph, and we only need to compute these leaders in the second pass, on the forward graph. But let's just keep them both in there, just for kicks. Now, we're going to need to loop through the vertices, and so the question is, in what order are we going to loop through the vertices? That's going to happen differently in the two different passes, but let me just use some common notation. Let's just assume, in this subroutine, that the nodes are somehow labeled from 1 to n. In our first depth-first search the labeling is totally arbitrary, so these are basically just the names of the nodes, or their positions in the node array, whatever; you just do it in some arbitrary order. The second time we run DFS-Loop, as indicated on the previous slide, we're going to use the finishing times as the labeling. As we'll see, the finishing times are indeed numbers between 1 and n. So what we do is we just iterate through the nodes in decreasing order, and if we haven't already seen node i, then we initiate a DFS from it. As usual, we're going to be maintaining a boolean per node to keep track of whether we have already seen it in one of the DFS passes. Now, remember, the global variable s is responsible for keeping track of the most recent node from which a depth-first search had been initiated, so if i is not explored and we initiate a depth-first search from it, we better reset s. And then we do the usual DFS starting from the source node i. So, for completeness, let me just remind you what the depth-first search subroutine looks like. So now we're given a graph and a starting node. 
So the first time we see a node, we mark it as explored. And just a side note: once a node is marked explored, it's explored for this entire invocation of DFS-Loop. Okay, so even if the DFS from a given node i finishes, and then the outer for loop marches on and encounters i again, it's still going to be marked as explored. Now, one of our bookkeeping jobs is to keep track of the vertex from which the DFS that discovered i got called. So when i is first encountered, we remember that s was the node from which this DFS originated, and that, by definition, is the leader of i. And then we do what we always do with depth-first search: we immediately look at the arcs going out of i, and we recursively invoke DFS on each of their endpoints, although we don't bother if we've already seen those nodes. Now, once this for loop has completed, once we've examined every outgoing arc from i, and for each endpoint j, either we already saw it in the past or we've recursively explored from j and have returned, at that point we call ourselves done with node i; there are no more outgoing arcs to explore. We think of it as being finished. Remember, t is the global variable that's keeping track of how many nodes we're done with, so we increment t because now we're done with i. And we also remember that i was the t-th vertex with which we finished; that is, we set i's finishing time to be t. Because depth-first search is guaranteed to visit every node exactly once, and therefore finishes with every node exactly once, this global counter t will have value 1 when the first node is finished, value 2 when the next node gets finished, then value 3, and so on; when the final node gets finished with, it'll have value n. So the finishing times of the nodes are going to be exactly the integers from 1 to n. Let's make this much more concrete by going through a careful example. 
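Before the example, here's a minimal Python rendering of DFS-Loop with both pieces of bookkeeping, the finishing times and the leaders. This is my own sketch of the routine just described, not the lecture's code; the adjacency-dict representation is an assumption, and a production version would replace the recursion with an explicit stack to avoid hitting the recursion limit on big graphs.

```python
def dfs_loop(adj, order):
    """One pass of Kosaraju's algorithm.

    adj:   dict mapping each node to the list of heads of its outgoing arcs
    order: sequence of candidate starting nodes (pass them already arranged
           in decreasing order of label / finishing time)

    Returns (finish, leader); only finish matters in the first pass and
    only leader matters in the second, but we compute both "for kicks".
    """
    explored = set()
    finish = {}   # node -> finishing time, an integer in 1..n
    leader = {}   # node -> most recent source vertex s when discovered
    t = 0         # global count of nodes we've totally finished exploring

    def dfs(i, s):
        nonlocal t
        explored.add(i)
        leader[i] = s          # by definition, s is the leader of i
        for j in adj.get(i, []):
            if j not in explored:
                dfs(j, s)
        t += 1                 # done with i: no more outgoing arcs to explore
        finish[i] = t          # i is the t-th vertex we finished with

    for i in order:
        if i not in explored:
            dfs(i, i)          # reset s: i leads everything it newly discovers
    return finish, leader
```

Called on the reversed graph with an arbitrary order, this yields the finishing times; called again on the forward graph, with nodes processed in decreasing order of finishing time, it yields leaders that label the SCCs.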
In fact, I think it'll be better for everybody if you, yourself, trace through part of this algorithm on a concrete example. So let me draw a nine-node graph for you. To be clear, let's assume that we've already executed step one of the algorithm and reversed the graph. So this blue graph that I've drawn on the slide is the reversal; we've already reversed the arcs. Moreover, the nodes are labeled in some arbitrary way from 1 to 9; just assume this is how they show up in the node array, for example. And remember, in the DFS-Loop routine you're supposed to process the nodes from top to bottom, from n down to 1. So my question for you then is: in the second step of the algorithm, when we run DFS-Loop and we process the nodes from the highest name, 9, in order down to the lowest name, 1, what are the finishing times that we're going to compute? Now, it is true that you can get different finishing times depending on the different choices that DFS-Loop has to make about which outgoing arc to explore next. But I've given you four options for what the finishing times of the nodes 1, 2, 3, all the way up to 9, respectively, might be, and only one of these four could conceivably be an output of DFS-Loop on this graph. So which one is it? All right, so the answer is the fourth option; that is the only one of these four sets of finishing times that you might see computed by DFS-Loop on this blue graph. So let's go ahead and trace through DFS-Loop and see how we might get this set of finishing times. Remember, in the main loop we start from the highest node, 9, and then we descend down to the lowest node, 1. So we start by invoking DFS from the node 9. We mark 9 as explored, and from 9 there's only one outgoing arc, so there's only one place we can go: we can go to 6. So we mark 6 as explored. 
Now there are two places we can go next: we can either go to 3 or we can go to 8, and in general DFS could do either one. To generate this fourth set of finishing times, I'm going to need to assume that we go to 3 first, okay? So again, what DFS does, what we're assuming it does, is it starts at 9, it has to go to 6, it marks those as explored, then it goes to 3. It does not go to 8 first; it goes to 3 first. Now, from 3 there's only one outgoing arc, which goes to 9, but 9 we've already marked as explored. So it's not going to re-explore 9; it's going to skip that arc. Since that's 3's only outgoing arc, that for loop completes, and 3 is the first node to finish. So when we finish with 3, we increment t, which started at 0 and is now 1, and we set the finishing time of 3 to be 1, just like we said it was in the example. So now we backtrack to 6. Now we have another outgoing arc from 6 to explore, so now we go to 8. From 8 we have to go to 2, from 2 we have to go to 5, and from 5 we have to go to 8. 8 we've already seen, so then we're done with 5, because that was its only outgoing arc. So then we increment t, which is now 2, and the finishing time of 5 is going to be 2, as promised. Now we backtrack to 2; there are no more outgoing arcs from 2, so 2 is going to be the third one that we finish, as promised. Then we finish with 8, so 8 is going to be the fourth node to be done, as promised. And now we backtrack to 6, and 6 is the fifth node to be completed, as promised. And finally we get all the way back to where we started, at 9, and 9 is the sixth node to be completed, as promised. Now, if we were computing those leaders, all of these nodes would get the leader 9, but again, the leaders are only relevant for the second pass. So we're just going to ignore the leaders as we do this trace; we're just going to keep track of finishing times. 
Now, we're not done: all we did is finish with the DFS that was invoked from the node 9, and we found six of the nodes total in that depth-first search. So now we return to the outer for loop and we decrement i. So it started at 9; we're done with that; now we go down to 8. We say, have we already seen 8? Yes, 8 is already explored, so we skip it. We decrement i down to 7 and say, have we already seen node 7? No, we have not, okay? 7 is not yet explored. So we invoke DFS now from node 7. 7 has two outgoing arcs: it can either go to 4 or it can go to 9. Let's say it checks the outgoing arc to 9 first. Now, 9 we already explored. Granted, that was in an earlier part of the for loop, but we remember that; we keep track of who got explored on previous iterations of the for loop, so we don't bother to re-explore 9, and we skip that arc. So now from 7 we have to go to 4, from 4 we have to go to 1, and from 1 we have to go back to 7. 7 has already been explored, so we backtrack, and now we're done with 1. So 1 is the next one we complete, and the finishing time of 1 is going to be 7, as promised. We backtrack to 4; there are no more outgoing arcs from 4 to explore, so that's going to be the eighth one to finish, as promised. And the last one to finish is poor node 7. It is last. So that would be an example of how the DFS-Loop subroutine computes finishing times on a reversed graph. So now let's work through the second pass, on the forward version of the graph, using the same example. Remember, the point of the first pass is to compute a magical ordering, and the magical ordering is these finishing times. So now we're going to throw out the original node names, and we're going to replace the node names in blue by the finishing times in red. We're also going to work with the original graph, which means we have to reverse the arcs back to where they were originally. So those are the two changes you're going to see when I redraw this graph. 
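Before moving on, note that the first-pass trace above can be replayed mechanically. Here's a runnable sketch, my own illustration rather than the lecture's code; the adjacency lists of the reversed nine-node graph are ordered so that DFS makes the same tie-breaking choices as the narration, e.g. from 6 it tries 3 before 8, and from 7 it tries 9 before 4:

```python
# the reversed nine-node graph from the example, as adjacency lists
rev = {9: [6], 6: [3, 8], 3: [9], 8: [2], 2: [5], 5: [8],
       7: [9, 4], 4: [1], 1: [7]}

explored, finish, t = set(), {}, 0

def dfs(i):
    global t
    explored.add(i)
    for j in rev[i]:
        if j not in explored:
            dfs(j)
    t += 1            # done with i: set its finishing time
    finish[i] = t

for i in range(9, 0, -1):   # outer loop: process nodes from 9 down to 1
    if i not in explored:
        dfs(i)

# matches the trace: 3 finishes first, then 5, 2, 8, 6, 9, 1, 4, 7
assert finish == {3: 1, 5: 2, 2: 3, 8: 4, 6: 5, 9: 6, 1: 7, 4: 8, 7: 9}
```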
First of all, all the arcs have reversed orientation. Second of all, all of the nodes have changed names, from their original ones to the finishing times that we just computed. So here's our new graph, with the new node names and all of the arcs with their orientation reversed. And now we run DFS again on this graph. Again, we're going to process the nodes in order from the highest label, 9, down to the lowest label, 1. Moreover, we don't need to compute finishing times in the second pass; we only need to do that in the first pass. In the second pass we have to keep track of the leaders, and remember, the leader of a vertex is the vertex from which the DFS call that first discovered it was initiated. All right, so what's going to happen? Well, in the outer for loop, again, we start with i equal to 9, and we invoke DFS from the node 9. So that's going to be the current leader, because that's where the current DFS got initiated. Now, from 9 there's only one choice: we have to go to 7. From 7 there's only one choice: we have to go to 8. From 8 there's only one choice: we have to go back to 9. And then 9 has already been seen, so we backtrack. We go back to 8, we go back to 7, we go back to 9, and that's it. So when we invoke DFS from node 9, the only things that we encounter are the nodes 7, 8, and 9, and these are all going to be given the leader vertex 9. You will notice that this is indeed one of the strongly connected components of the graph. We just sort of found it with this invocation of DFS from the node 9. So now we go back to the outer for loop, and we say, okay, let's go to node 8. Have we already seen 8? Yes. What about 7, have we already seen 7? Yes. What about 6? Have we already seen 6? We have not; we have not yet discovered 6, so we invoke DFS from node 6, and we reset the global source vertex s to 6. From 6 we can go to 9, or we can go to 1. So let's say we explore 9 first. 
Well, we already saw 9 in an earlier iteration of the for loop, so we don't explore it again; we don't discover 9 now, so we backtrack to 6. We go to 1, from 1 we have to go to 5, and from 5 we have to go to 6, and then we start backtracking again. So the only new nodes that we encounter when we invoke DFS from the node 6 are the vertices 6, 1, and 5. And all of these will have a leader vertex of 6, because that's where we called DFS from when we first discovered these three nodes. And you'll notice this is another SCC of this directed graph. So we invoked DFS again, now from the node 6, and the new nodes it discovered were exactly an SCC of the graph, nothing more, nothing less. So now we return to the outer for loop. We go to node 5; have we already seen 5? Yes. Have we already seen 4? No, we haven't seen 4 yet. So now we invoke DFS from 4. Again, we could try to explore 5, but we've seen that before, so we're not going to explore it again. From 4, then, we have to go to 2, from 2 we have to go to 3, from 3 we have to go back to 4, and then, after all the backtracking, we're done. So the final call to DFS will be from the node 4, and that DFS will newly discover precisely the nodes 2, 3, and 4. They will all have the leader vertex 4, because that was where this DFS was called from. It's true we'll go back to the for loop and we'll check, have we seen 3 yet? Yes. Have we seen 2 yet? Yes. Have we seen 1 yet? Yes, and then the whole thing completes. And what we see is that, using the finishing times computed in that first depth-first search pass, the strongly connected components of this graph just showed up and presented themselves to us, one at a time, on a silver platter. Every time we invoked DFS, the nodes we newly discovered were precisely one of the SCCs, nothing more, nothing less. And that's really what's going on in this algorithm; it turns out this is true in general. 
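The second pass we just traced can be replayed the same way. In this sketch (again my own illustration, not the lecture's code) the forward graph's nodes carry their new names, the first-pass finishing times, and grouping nodes by leader recovers the three SCCs:

```python
# the forward graph, with nodes renamed by their finishing times
fwd = {9: [7], 7: [8], 8: [9], 6: [9, 1], 1: [5], 5: [6],
       4: [5, 2], 2: [3], 3: [4]}

explored, leader = set(), {}

def dfs(i, s):
    explored.add(i)
    leader[i] = s              # s = most recent DFS source vertex
    for j in fwd[i]:
        if j not in explored:
            dfs(j, s)

for i in range(9, 0, -1):      # decreasing order of finishing time
    if i not in explored:
        dfs(i, i)

# each leader's group is exactly one SCC, served up one DFS call at a time
sccs = {}
for v, s in leader.items():
    sccs.setdefault(s, set()).add(v)
assert sccs == {9: {7, 8, 9}, 6: {1, 5, 6}, 4: {2, 3, 4}}
```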
The first pass, DFS on the reversed graph, computes finishing times so that if you then process nodes in decreasing order of finishing time in the second pass, each invocation of DFS will discover exactly one new SCC. So they'll just present themselves to you, one per DFS call, in that second pass's for loop. This is, of course, merely an example. You should not take a single example as proof that this algorithm always works; I will give you a general argument in the next video. But hopefully there is at least a plausibility argument. No longer does this three-step algorithm seem totally insane, and maybe you can imagine, perhaps it works. At least there's something principled going on, where you first compute the right ordering in which to process the nodes, and then the second pass peels off SCCs one at a time, like layers from an onion. One thing that I hope is pretty clear is that this algorithm, correct or not, is blazingly fast. Pretty much all you do is two depth-first searches. And since depth-first search, as we've seen in the past, runs in time linear in the size of the graph, so does Kosaraju's two-pass algorithm. There are a couple of subtleties, and I encourage you to think about them; you'll be forced to think about them in the programming project for week four. So, for example, in the second pass, how do you process the nodes in decreasing order of finishing time? You don't want to sort the nodes by their finishing time, because that would take n log n time. So you need to make sure that in the first pass you remember the nodes in a way that lets you just do a linear scan through them in the second pass. So there are some details, but if your intuition is that this is really just double DFS, properly implemented, that's pretty much exactly right. 
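On that last subtlety, one standard trick (a sketch under my own naming, not the lecture's prescribed implementation) is to never store finishing times as numbers at all: append each node to a list the moment it finishes, and the list itself is the nodes sorted by finishing time, built in O(m + n) with no sort:

```python
def first_pass_order(rev):
    """DFS-Loop on the reversed graph, recording nodes in finishing order."""
    explored, order = set(), []

    def dfs(i):
        explored.add(i)
        for j in rev[i]:
            if j not in explored:
                dfs(j)
        order.append(i)   # position in `order` = finishing time - 1

    for i in rev:         # any starting order works for the first pass
        if i not in explored:
            dfs(i)
    return order

# the second pass then just scans the list backwards, no sorting needed:
#   for i in reversed(first_pass_order(rev)): ...
```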
So, having spelled out the full implementation, argued that it's definitely a linear-time algorithm, and given at least a plausibility argument, via an example, that it might conceivably be correct, let's now turn to the general argument.