So what's the problem? Well, I did say most of the stuff about graph search really doesn't matter, undirected or directed, it's pretty much cosmetic changes. But the big exception is when you're computing connectivity, when you're computing the pieces of the graph. So right now, I'm only going to talk about undirected graphs. The directed case, we can get very efficient algorithms for, but it's quite a bit harder work, so that's going to be covered in detail in a separate video. So for now, focus just on an undirected graph, G. And we're certainly not going to assume that G is connected, and the part of the point here is to figure out whether or not it's connected, i.e., in one piece. So maybe the graph looks like this. So for example maybe the graph has ten vertices and looks like this on the right. And intuitively, especially given that I've drawn it in such a clean way, it's clear that this graph has three pieces. And those are the things that we want to call the connected components. But we do want to sum up more formal definitions, something which is actually in math that we could say is true or false about a given graph. And roughly, we define the connected components of an undirected graph as the maximal regions that are connected. In the sense you can get from any vertex in the region from any other vertex in the region using a path. So, maximal connected regions in that sense. Now the slick way to do this is using an equivalence relation. And I'm going to do this here in part because it's really the right way to think about the directed graph case, which we'll talk about in some detail later. So from your [INAUDIBLE] graphs, so this isn't super important, but let me go ahead and state the formal definition just for completeness about what is a connecting component. What do I mean by a maximal region that's mutually connected. So a good formal definition is as the equivalence classes of the relation on vertices where we define by u being related to v if and only if there's a path between u and v in the graph. So I'll leave it for you to do the simple check that this squiggle is indeed an equivalence relation. I'm going to remind you what equivalence relations are. This is something you generally learn in your first class on proofs or your first class in discrete math. So it's just something which may or may not be true about pairs of objects. In an equivalence relation, you have to satisfy three properties. So first you have to be reflexive, meaning everything has to be related to itself, and indeed in a graph there is a path from any node to itself, namely the empty path. Also a couple of these relations have to be symmetric, meaning that if u and v are related then v and u are related. Because this is an undirected graph it's clear that this is symmetric. If there's a path from u to v in the graph there's also a path from v to u so no problem there. Finally equivalence classes have got to be transitive. So that means if u and v are related and so are v and w and so are u and wo. That is if u and v have a path, v and w have a path, then so does u and w and you can prove transtivity just by pasting the two paths together. And so the upshot is, when you want to say something like the maximal subset of something where everything is the same. The right way to make that mathematical is using equivalence relations. So over in this blue graph we want to say one, three, five, seven, and nine, which are all the same in the sense that they're mutually connected and so that's exactly what this relation is making precise. All five of those nodes are related to each other. 2 and 4 are related to each other. 6, 8, and 10, all pairs of them are related. So equivalence relations always have equivalence classes, the maximal mutual related stuff. And in this graph context, it's exactly these connected components, exactly what you want. So what I want to show you is you can use wrapped in an outer of the vertices to compute, to identify all of the connected components of a graph. In time linear in the graph in n plus n time. Now you might be wondering why do you want to do that. Well there's a lot of reasons, so an obvious one which is relevant for physical networks is to check if a network has broken into two pieces. So certainly if you're an Internet service provider, you want to make sure that from any point in your network, you can reach any other point in the network. And that boils down to just understanding whether the graph that represents you network, is a connected graph, that is if it's in one piece, or not in one piece. So obviously, you can ask this same question about recreational examples. So if you return to the movie graph, maybe you're wondering, can you get from every single actor in the IMDB database to Kevin Bacon? Or are there actors for which you cannot reach Kevin Bacon? Via a sequence of edges, a sequences of movies in which two actors have both played a role. So that's something that boils down to a connectivity computation. If you have networked data and you want to display it, you want to visualize it and show it to a group of people so that they can interpret it. Obviously one thing you want to do is you want to know if there's multiple pieces, and then you want to display the different pieces separately. So let me make sure that one probably a little less obvious application of undirected connectivity is that it gives us a nice quick and dirty heuristic for doing clustering if you have paralyzed information about objects. Let me be a little more concrete. Suppose you have a set of objects that you really care about. This could be documents, maybe web pages that you crawl, something like that. It could be a set of images, either your own or drawn from some data base. Or it could be, for example, a set of genomes. Suppose further, that you have a pairwise function, which for each pair of objects tells you whether they are very much like each other or very much different. So let's suppose that two objects are very similar to each other, like the two web pages that are almost the same. Or there are two genomes where you can get from one to the other with a small number of mutations. Then, they have a low score. So low numbers, close to zero, indicates that the objects are very similar to each other. High numbers, let's say, they can go up to a thousand or something, indicate that they are very different objects. Two webpages that have nothing to do with each other, two genomes for totally unrelated parts, or two images that seem to be of completely different people or even completely different objects. Now here's a graph you can construct using these objects and the similarity data that you have about them. So you can have a graph where the nodes are the objects. Okay, so for each image, for each document, whatever, you have a single node and then for a given pair of nodes, you put in an edge if and only if the two objects are very similar. So for example, you could put in an edge between two objects if and only if the score is at most ten. So remember, the more similar two objects are, the closer there score is to zero. So you're going to get an edge between very similar documents, very similar genomes, very similar images. Now in this graph you've constructed, you can find the connected components. So each of these connected components will be a group of objects, which more or less are all very similar to each other. So this would be a cluster of closely related objects in your database. And you can imagine a lot of reasons why, given a large set of unstructured data, just a bunch of pictures, a bunch of documents or whatever, you might want to find clusters of highly related objects. So we'll probably see more sophisticated heuristics for clustering in some SQL course. But already undirected connectivity gives you a super fast linear time quick and dirty heuristic for identifying clusters of similar objects, given pairwise data about similarity. So that's some reasons you might want to do it. Now let's actually talk about how to compute the components in the near time using just a simple for loop and breadth-first search as it's inner work horse. So here's the code to compute all the connected components of an undirected graph. So first we initialize all nodes as being unexplored. I'm also going to assume that the nodes have names. Let's say that the names are from 1 to n. So these names could just be the position in the node array that these nodes occupy. So this is going to be an outer for loop, which walks through the nodes in an arbitrary order, let's say from 1 to n. This outer for loop is to ensure that every single node of the graph will be inspected for sure at some point in the algorithm. Now, again, one of maxims is that we should never do redundant work, so before we start exploring from some node, we check if they've already been there. And if we haven't seen i before, then we invoke the breath-first search, a term we were talking about previously in the lecture, in the graph G, starting from the node I. So to make sure this is clear, let's just run this algorithm on this blue graph to the right. So we start in the outer for loop and we said i equal to 1. And we say have we explored node number 1 yet. And of course not, we haven't explored anything yet. So the first thing we're going to do is we're going to invoke BFS on node number 1 here. So now we start running the usual breadth for search subroutine starting from this node one and so we explore layer one here is going to be nodes 3 and 5. So we explore them in some order for example maybe node number 3 is what we explore second. Then node number five is what we explore third and then the second layer in this component is going to be the nodes 7 and 9. So we'll explore them in some order as well and say 7 first followed by 9. So after this BFS initiated from node number one completes, of course it will have found everything it could possibly find, namely the five nodes in the same connected component as node number 1. And of course, all of the five of these nodes will be marked as explored. So now we return once that exits we return to the outer for loop we increment I we go to I equal 2, and we say we already explored node number 2 no we have not. And so now we invoke BFS again from node number 2. So that will be the sixth node we explore. There's not much to do from two, all we can do is go to 4, so that's the seventh node we explore. That BFS terminates at finding the nodes in this connected component, then we go back to the outer for loop. We increment i to 3, we say we've already seen node number three. Yes we have, we saw that in the first breadth first search. So we certainly don't bother to BFS from node 3, then we increment item four. Have we seen 4? Yes we have, in the second called to BFS. Have we seen node 5? Yes we have, in the first call to BFS. Have we seen node 6? No, we have not. So the final indication of breadth-first search begins from node number 6. That's going to be the eighth node overall that we see. And then we're going to see the notes 8 and 10 in some order, so for example maybe we first explore note number 8. That's one of the first layer in this component, and then note number 10 is the other note of the first layer in this component. So in general, what's going on, well what we know will happen when we invoked that first search from a given node i. We're going to discover exactly the nodes in i's connected component. Right, anything where there's a path from i to that node, we'll find it. That's the BFS guarantee, that's also the definition of a connected component. All the other nodes which have a path to i. Another thing that I hope was clear from the example, but just to reiterate it, is every search call, when you explore a node, you remember that through the entire for loop. So when we invoke breadth-first search from node number 1, we explore nodes 1, 3, 5, 7 and 9, and we keep those marked as explored for the rest of this algorithm. And that's so we don't do redundant work when we get to later stages of the for loop. So as far as what does this algorithm accomplish, well, it certainly finds every connected component. There is absolutely no way it can miss a node because this for loop literally walks through the nodes, all of them, one at a time. So you're not going to miss a node. Moreover, we know that as soon as you hit a connected component for the first time, and you do breadth-first search from that node, you're going to find the whole thing. That the breadth-first a search guarantee. As far as what's the running time, well it's going to be exactly what we want. It's going to be linear time, which again means proportional to the number of edges plus the number of vertices. And again depending on the graph, one of these might be bigger that the other. So why is it O of m plus n? Well as far as the nodes, we have to do this initialization there where we mark them all as unexplored, so that takes constant time per node. We have just the basic overhead of a for loop, so that's constant time per node. And then we have this check, constant time per node, so that's O of n. And then recall we proved that within breath for research, you do a amount of work proportional. You do constant time for each node in that connected component. Now, each of the nodes of the graph is in exactly one of the connected components. So you'll do constant time for each node in the BFS in which you discover that node. So that's again, o of n over all of the connected components. And as far as the edges, note we don't even bother to look at edges until we're inside one of these breadth first search calls. They played no role in the outer for loop or in the pre-processing. And remember what we proved about an indication of breadth first search. The running time, you only do constant amount of work per edge in the connected component that you're exploring. In the worst case, you look at an edge once from either endpoint and each of that triggers a constant amount of work. So when you discover a given connected component, the edge work is proportional to the number of edges in that kind of component. Each edge of the graph is only in exactly one of the connect components, so over this entire for loop, over all of these BFS calls. For each edge of the graph, you'll only be responsible for a constant amount of work of the algorithm. So summarizing because breadth-first search from a given starting node. That is, works in time proportional to the size of that component, piggybacking on that sub routine and looping over all of the nodes of the graph. We find all of the connecting components in time proportional to the amount of edges and nodes in the entire graph.