So we've now put in a lot of work designing and analyzing super-fast algorithms for reasoning about graphs. In this optional video, I want to show you why you might want such a primitive, especially for computation on extremely large graphs. Specifically, we're going to look at the results of a famous study that computed the strongly connected components of the web graph.

So what is the web graph? It's the graph in which the vertices correspond to webpages. For example, I have my own webpage where I list my research papers and also links to courses, such as this one. The edges are directed and correspond precisely to hyperlinks, the links that bring you from one webpage to another. The tail of an edge is the page that contains the hyperlink, and the head is the page you go to if you click the hyperlink, so this is a directed graph. From my home page you can get to my papers and to my courses; sometimes I have random links to things I like, say my favorite record store. And of course, for many of these webpages there are additional links going out or coming in. For example, from my papers I might link to some of my co-authors, and some of my co-authors might link from their homepages to me. Or there are webpages out there which list the currently available free online courses, and so on. So obviously this is just a tiny, tiny piece of a massive web graph.

The origins of the web were around 1990 or so, but it started to really explode in the mid-90s, and by the year 2000 it was already beyond comprehension. Even though, in Internet years, the year 2000 is sort of the stone age relative to right now, relative to 2012, people were already so overwhelmed by the massive scale of the web graph that they wanted to understand anything about it, even the most basic things. Now, one issue with understanding what the graph looks like is that you don't even have it locally, right? It's distributed over all of these different servers over the entire world. So the first thing people focused on, when they wanted to answer this question, was techniques for crawling: software which just follows lots of hyperlinks and reports back to home base, from which you can assemble at least some kind of sketch of what this graph actually is. But then the question is, even once you have this crawled information, even once you've accessed a good chunk of the nodes and edges of this network, what does it look like?

What makes this a more difficult question than for, say, any other directed graph you might encounter? Simply the massive scale of the web graph; it's just so big. The graph used in the particular study I'm going to discuss was from the year 2000, which, like we said, is sort of the stone age compared to 2012, so the graph was relatively small, but it was still really, really big: something like 200 million nodes and about one and a half billion edges. The reference for the work I'm going to discuss is a paper whose first author is Andrei Broder, with a number of co-authors, which appeared in the WWW Conference, the World Wide Web Conference, in the year 2000. I encourage those of you who are interested to track down the paper online and read the original source.
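To make the model concrete, here is a minimal Python sketch of how a tiny piece of a web graph might be represented as adjacency lists. The page names and the variable name web_graph are hypothetical placeholders for illustration, not data from the study.

```python
# A toy directed "web graph": each key is a page, and its list contains the
# pages it hyperlinks to (tail -> head). The names are made-up placeholders.
web_graph = {
    "homepage":       ["papers", "courses", "record_store"],
    "papers":         ["coauthor_a", "coauthor_b"],
    "coauthor_a":     ["homepage"],        # a co-author linking back to me
    "coauthor_b":     [],
    "courses":        ["course_listing"],
    "course_listing": ["courses"],         # a site listing free online courses
    "record_store":   [],                  # my favorite record store
}

# Every directed edge (tail, head) corresponds to one hyperlink.
edges = [(tail, head) for tail, heads in web_graph.items() for head in heads]
print(len(web_graph), "pages and", len(edges), "hyperlinks")
```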
So Andrei Broder, the lead author, was at that time at a company called AltaVista. How many of you remember AltaVista? Some of you, especially the youngest among you, may never have heard of AltaVista, and maybe can't even conceive of a world in which we didn't have Google. But in fact there was a time, around '97 or so, when we had web search but Google did not yet exist. So this study comes from the very embryonic years of Google, and the data set actually came out of AltaVista instead.

So Broder et al. wanted to get some answers to this question: what does the web graph look like? They approached it in a few ways, but the one I'm going to focus on here is this: they asked what the most detailed structural information is that you can get about the web graph without doing an infeasible amount of computation, really sticking to linear-time algorithms at worst. And what have we seen? We've seen that in a directed graph you can get full connectivity information using depth-first search, and that you can compute strongly connected components in linear time with small constants. That's one of the major things they did in this study. Now, if you wanted to do the same computation today, you'd have one thing going against you and one thing going for you. The obvious thing going against you is that the web is now very much bigger than it was then, certainly by an order of magnitude. The thing going for you is that there are now specialized systems meant to operate on massive data sets, and in particular they can do things like compute connectivity information on graph data. What you have to remember, for those of you who are aware of these terms, is that in 2000 there was no MapReduce and there was no Hadoop. There were no tools for automatically processing massive data sets; these guys really had to do it from scratch.
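Before looking at the findings, here is a minimal Python sketch of the kind of linear-time, two-pass depth-first-search computation of strongly connected components mentioned above, in the style of the Kosaraju approach from the course. It assumes the graph fits in memory as adjacency lists, and the function name kosaraju_scc is just for illustration; this is a sketch of the technique, not the code Broder et al. actually ran on the AltaVista crawl.

```python
from collections import defaultdict

def kosaraju_scc(graph):
    """Strongly connected components of a directed graph given as
    {node: [out-neighbors]}, via two passes of iterative DFS in O(n + m) time."""
    # Build the reversed graph and collect every node that appears.
    rev = defaultdict(list)
    nodes = set(graph)
    for u, nbrs in graph.items():
        for v in nbrs:
            rev[v].append(u)
            nodes.add(v)

    # Pass 1: DFS on the reversed graph, recording nodes in finishing order.
    visited, order = set(), []
    for s in nodes:
        if s in visited:
            continue
        visited.add(s)
        stack = [(s, iter(rev[s]))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(rev[v])))
                    break
            else:
                order.append(u)   # all of u's edges explored, so u is finished
                stack.pop()

    # Pass 2: DFS on the original graph in reverse finishing order;
    # each search tree found here is exactly one strongly connected component.
    assigned, components = set(), []
    for s in reversed(order):
        if s in assigned:
            continue
        assigned.add(s)
        comp, stack = [], [s]
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in graph.get(u, []):
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        components.append(comp)
    return components

# Tiny example: pages a, b, c form one SCC; d is its own SCC.
print(kosaraju_scc({"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}))
```

Both passes touch each node and edge a constant number of times, which is what makes this kind of computation feasible even on a graph with hundreds of millions of nodes and over a billion edges.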
So let me tell you what Broder et al. found when they did strong connectivity computations on the web graph. They explained their results in what they called the bow tie picture of the web. Let's begin with the center, or the knot, of the bow tie. In the middle we have what we're going to call a giant strongly connected component, with the interpretation that this is, in some sense, the core of the web. All of you know what an SCC is at this point: a strongly connected component is a region in which you can get from any point to any other point along a directed path. In the context of the web graph, this giant SCC means that from any webpage inside this blob you can get to any other webpage inside this blob just by traversing a sequence of hyperlinks. And hopefully it doesn't strike you as too surprising that a big chunk of the web is strongly connected, is well connected in this sense, right? If you think about all the different universities in the world, probably, among all of the webpages belonging to those universities, you can get from any one place to any other place. For example, on the homepage where I list my papers, I often include links to my co-authors, who are very commonly at other universities. That already provides a link from some Stanford page to some page at, say, Berkeley or Cornell or wherever. And of course, I'm just one person, just one of many faculty members at Stanford. So if you put all of these together, you would expect all of the different SCCs corresponding to different universities to collapse into a single one, and so on for other sectors as well. And then, once you know that a huge chunk of the web is in the same strongly connected component, say 10% of the web, which would be tens of millions of webpages, you wouldn't expect there to be a second one, right? It would be super weird if there were two different blobs of ten million webpages each that were somehow not mutually reachable from each other. All it takes to collapse two SCCs into one is a lone arc going from one to the other and a lone arc going in the reverse direction. So we do expect a giant SCC, just thinking anecdotally about what the web looks like, and once we realize there is one giant SCC, we don't expect there to be more than one.

All right, so is that the whole story? Is the web graph just one big SCC? Well, one of the perhaps interesting findings of the Broder et al. paper is that there is a giant SCC, but it doesn't actually take up the whole web, or anything really that close to the entire web. So what else would there be in such a picture? Well, there are the other two ends of the bow tie, which are called the in and out regions. In the out region you have a bunch of strongly connected components, not giant ones, since we've established there shouldn't be any other giant SCCs, but small SCCs which you can reach from the giant strongly connected component but from which you cannot get back to it. I encourage you to think about what types of websites you would expect to see in this out part of the bow tie. I'll give you one example. Very often, if you look at a corporate site, including those of well-known corporations which you would definitely expect to be reachable from the giant SCC, it's actually corporate policy that no hyperlinks go from something inside the corporate site to something outside it. So the corporate site is a collection of webpages which is certainly strongly connected; because it's a major corporation, you can certainly get there from the giant SCC; but because of the corporate policy, you can't get back out. Symmetrically, in the in part of the bow tie you have strongly connected components, generally small ones, from which you can reach the giant SCC but which you cannot reach from the giant SCC. Again, I encourage you to think about all the different types of webpages you might expect to see in this in part of the bow tie. One really obvious example would be new webpages: if I just created a webpage and pointed it at Stanford University, it would immediately be in this in component, or rather this in collection of components.

Now, if you think about it, this does not exhaust all of the possibilities for where nodes can lie. There are a few other cases that frankly are pretty weird, but they're there. You can have paths of hyperlinks which bypass the giant SCC and go straight from the in part of the bow tie to the out part; Broder et al. suggested calling these tubes. There are also some very curious outgrowths coming out of the in component which don't make it all the way to the giant SCC, and similarly there's stuff which goes into the out component; Broder et al. recommended calling these strange creatures tendrils. And then, in fact, you can also have some weird, isolated islands of SCCs that are not connected, even weakly, to the giant SCC.
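Once the giant SCC is in hand, classifying the remaining pages into the in part, the out part, and everything else (tubes, tendrils, islands) amounts to two graph searches: one forward from the core along hyperlinks and one backward against them. Here is a rough Python sketch of that idea, again assuming an in-memory adjacency-list graph and a precomputed giant SCC; the function name bowtie_regions is hypothetical, and this illustrates the definitions rather than reproducing the procedure from the paper.

```python
from collections import deque

def bowtie_regions(graph, giant_scc):
    """Split pages into bow-tie regions relative to a given giant SCC.
    `graph` maps each page to the pages it links to; `giant_scc` is the
    set of pages in the core (e.g. the largest component from kosaraju_scc)."""
    # Build the reversed graph so we can search backwards along hyperlinks.
    rev = {u: [] for u in graph}
    for u, nbrs in graph.items():
        for v in nbrs:
            rev.setdefault(v, []).append(u)

    def reachable_from(g, sources):
        """Breadth-first search from a set of source pages."""
        seen, queue = set(sources), deque(sources)
        while queue:
            u = queue.popleft()
            for v in g.get(u, []):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return seen

    core = set(giant_scc)
    out_part = reachable_from(graph, core) - core   # reachable from the core
    in_part = reachable_from(rev, core) - core      # can reach the core
    other = set(rev) - core - out_part - in_part    # tubes, tendrils, islands
    return in_part, core, out_part, other
```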
So this is the picture that emerged from Broder et al.'s strong component computation on the web graph, and here, qualitatively, are some of the main findings they came up with. First of all, the picture on the previous slide is drawn roughly to scale, in the sense that all four parts, the giant SCC, the in part, the out part, and then the residual stuff, the tubes and tendrils, have roughly the same size, more or less 25% of the nodes in the graph. I think this surprised people. Some people might have thought that the core, the giant SCC, would be a bit bigger than just 25 or 28 percent, but it turns out there's a lot of other stuff outside of this strongly connected core. You might wonder if this is just an artifact of the data set being from the stone age, from 2000 or so, but people have rerun this experiment on the web graph in later years. Of course the numbers change, because the graph is growing rapidly, but these qualitative findings have seemed pretty stable throughout subsequent reevaluations of the structure of the web.

On the other hand, while the core of the web is not as big as you might have expected, it is extremely well connected, perhaps better connected than you might have expected. Now, you'd be right to ask what I could mean by unusually well connected. We've already established that this giant core of the web is strongly connected: you can get from any one place to any other place via a sequence of hyperlinks. What else could you want? Well, in fact it has a much richer kind of connectivity, called the small world property. So let me tell you about the small world property, or the phenomenon colloquially known as six degrees of separation. This is an idea that had been in the air at least since the early 20th century, but it was really studied in a major way, and popularized, by Stanley Milgram, a social scientist, back in 1967. Milgram was interested in understanding whether people at great distance are, in fact, connected by a short chain of intermediaries. The way he evaluated this was to run the following experiment. He identified a friend in Boston, Massachusetts, a doctor, I believe, and this was going to be the target. Then he identified a bunch of people who were thought to be far away, both culturally and geographically, specifically in Omaha. For those of you who don't live in the US, just take it on faith that many people in the US would regard Boston and Omaha as being fairly far apart, geographically and otherwise. And what he did is he wrote each of these residents of Omaha the following letter: look, here is the name and address of a doctor who lives in Boston; your job is to get this letter to him. Now, you're not allowed to mail the letter directly to the doctor; instead, you need to mail it to an intermediary, someone you know on a first-name basis. Of course, if you happened to know the doctor on a first-name basis you could mail it straight to him, but that was very unlikely. So what people in Omaha would do is say, well, I don't know any doctors and I don't know anyone in Boston, but at least I know somebody in Pittsburgh, and at least that's closer to Boston than Omaha, that's further eastward.
Or maybe someone would say, well, I don't really know anyone on the east coast, but at least I do know some doctors here in Omaha, and so they'd give the letter to somebody they knew on a first-name basis in Omaha. And then the situation would repeat: whoever got the letter would be given the same instructions. If you know this doctor in Boston on a first-name basis, send him the letter; otherwise, pass the letter on to somebody who seems likely to be closer to him than you are. Now, of course, many of these letters never reached their destination, but what is shocking, at least to me, is that a lot of them did. Something like 25%, at least, of the letters that they started with made it all the way to Boston, which I think says something about people in the late 60s just having more free time on their hands than they do in the early 21st century. I find this hard to imagine, but it's a fact. So you had dozens and dozens of letters reaching this doctor in Boston, and it was possible to trace exactly which path of individuals each letter went along before it eventually reached him. And then what they did is they looked at the distribution of chain lengths: how many intermediaries were required to get from some random person in Omaha to this doctor in Boston? Some chains were as short as two, some were as long as nine, but the average number of hops, the average number of intermediaries, was in the range of five and a half or six. And this is what has given rise to the colloquialism, and even the name of a popular play, six degrees of separation.

So that's the origin myth, that's where this phrase comes from, these experiments with physical letters. But in network science today, the small world property refers to a network which, on the one hand, is richly connected, but in which there are also, in some sense, enough cues about which links are likely to get you closer to a given target. So if you need to route information from point A to point B, not only is there a short path, but if you in some sense follow your nose, you'll actually find a short path. In that sense, routing information is easy in small world networks. Now, this is exactly the property that Broder et al. identified within the giant SCC: it is very rich with short paths, and if you want to get from point A to point B, you can just follow your nose and you'll do great. You don't need a very sophisticated shortest-path algorithm to find a short path. Some of you may have heard of Stanley Milgram not for the small world experiment but for another famous, or maybe infamous, experiment he did earlier in the 60s, which consisted of tricking volunteers into thinking they were subjecting other human beings to massive doses of electric shocks. That experiment wound up causing a rewrite of certain standards of ethics in experimental psychology. You don't hear about it so much when people are talking about networks, but it's another reason why Milgram's work is well known. And just as a point of contrast, outside of this giant strongly connected component, which has this rich small world structure, there is very poor connectivity in the other parts of the web graph.
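One simple way to quantify "rich in short paths" is to estimate the distribution of hop counts inside the giant SCC by running breadth-first search from a sample of pages. Here is a small Python sketch of that idea, assuming an in-memory adjacency-list graph; the helper names bfs_hops and estimate_average_distance are my own, and this sampling-based estimate is only an illustration, not the methodology of the Broder et al. paper.

```python
import random
from collections import deque

def bfs_hops(graph, source):
    """Number of hyperlink hops from `source` to every page reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def estimate_average_distance(graph, core, samples=100, seed=0):
    """Estimate the average directed distance between pages of `core`
    (e.g. the giant SCC) by BFS from a few randomly sampled source pages."""
    rng = random.Random(seed)
    core = list(core)
    total, count = 0, 0
    for s in rng.sample(core, min(samples, len(core))):
        dist = bfs_hops(graph, s)
        for t in core:
            if t != s and t in dist:   # within an SCC, every t is reachable
                total += dist[t]
                count += 1
    return total / count if count else 0.0
```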
So there's lots of cool research going on these days in the study of information networks like the web graph, and I don't want you to get the impression that the entire interaction between algorithms and thinking about information networks has just been this one strongly connected component computation in 2000. Of course, there are all kinds of interactions; I've just singled out one that is easy to explain and that was also a big deal back in the day. But these days lots of stuff is going on. People are thinking about information networks in all kinds of different ways, and of course algorithms, as in almost everything, are playing a very fundamental role. So let me just dash off a few examples, maybe to whet your appetite, in case you want to explore this topic in greater depth outside of this course.

One super interesting question is this: rather than looking at a static snapshot of the web, as we've been doing so far in this video, remember that the web is changing all the time. New pages are getting created, new links are getting created and destroyed, and so on. How does this evolution proceed? Can we devise a mathematical model which faithfully reproduces the most important first-order properties of this evolutionary process? A second issue is to think not just about the dynamics of the graph itself, but about the dynamics of the information carried by the graph, and you could ask this both about the web graph and about other social networks, like, say, Facebook or Twitter. Another really important topic, on which there has been a lot of work but which we still don't fully understand by any means, is getting at the finer-grained structure of networks, including the web graph. In particular, what we'd really like are foolproof methods for identifying communities, that is, groups of nodes, which could be webpages in the web graph or individuals in a social network, that we should think of as grouped together. We discussed this a little bit when we talked about applications of cuts: one motivation for cuts is to identify communities, if you think of a community as being relatively densely connected inside and sparsely connected to the outside. But that's just a baby step. Really, we need much better techniques for both defining and computing communities in these kinds of networks.

So I think these questions are super interesting, both on a mathematical and technical level and because they're very timely; answering them really helps us understand our world better. Unfortunately, they are well outside the scope of the bread-and-butter design and analysis of algorithms, which is what I'm tasked with covering here. But I will leave you with a reference that I recommend if you want to read more about these topics: the quite recent book by David Easley and Jon Kleinberg called Networks, Crowds, and Markets.