We discussed two computational formulations of the assembly problem and algorithms for solving it, but we had to admit that those algorithms had some pretty serious problems with them. So now, let's talk about how real software tools for solving the assembly problem work. And how they deal with repetitive genomes and other issues in practice. So we talked about two different kinds of graphs, overlap graphs and De Bruijn graphs. These were used in different algorithms. The overlap graph was used in shortest common superstring, and the De Bruijn was used in the Eulerian walk algorithm. And so even though we won't be using shortest common superstring or Eulerian walk, we're still going to use these graphs, and the different graphs correspond to two different categories of assembly programs. So assembly programs on the one hand that use the overlap graph are sometimes called overlap-layout-consensus programs. And on the other hand, programs that use the De Bruijn graph are sometimes called De Bruijn graph-based programs. So in both cases, though, the first step is to build the graph, and we've already seen how to do this for these two types of graph. Once we have the graph, well, we're not as close to having the final answer as you might hope. The graph is going to be big and messy and here's an example of a big, messy De Bruiijn graph. This graph is going to be a, a lot messier than all of the tiny examples that we've seen so far. And the walk through the graph that corresponds to the correct assembly is not going to be obvious at all. So why is the graph so messy? Well, there's a few reasons, so first of all we have sequencing errors and sequencing errors can introduce sort of spurious little diversions in the graph, little dead ends. And one of our jobs, if we want to know the true reconstruction of the genome, is to get rid of or otherwise ignore these dead ends. So here, for example, is a dead end that's caused by one sequencing error. And then also these little disconnected bits of the graph off floating around, those are also due to sequencing errors. So another reason the graph can be messy, and this point is specific to the overlap graph, is that it can contain edges that, while they're not incorrect, they don't actually tell us anything that we didn't know already. What do I mean by that? For example, here we have a small overlap graph, and now we'll highlight a few edges here. These three edges, three overlaps here at the top, and two are in blue and one is in green. And now this is an overlap graph so the notes correspond to reeds and the edges correspond to overlaps between the reeds and each edge is labeled with the length of the overlap. So if you look at these three edges, and stop to think about what they mean, you'll realize that the presence of those two blue edges implies the presence of the green edge. In other words, if you just told me that those two blue edges were there, I could figure out that the green edge was also there. The blue edges are saying that there's an overlap of six between these two reads, and an overlap of six between these two reeds. And all the reads are of length seven. And when you put those facts together, you realize that there must be an overlap of length five between these two reads. So the green edges, in some sense, implied by the two blue edges. Or another way you can say that is that the green edge is transitively inferable from the blue edges. And then you can see another example down here. This green edge with the label 4 is transitively inferable by the three edges that are labeled 6. Another reason the graph can be messy is because of polyploidy. So an example of polyploidy that we're familiar with is the human genome. And so for each chromosome of our genome, we know that we inherit two copies, two versions of that chromosome. One from our mother, and one from our father. And for some positions in the genome, some bases in the genome, you'll find that the version inherited from the mother and the version inherited from the father are not the same. There's a different base inherited from the mother and the father. Like in this example here, the blue and the red base. So now let's picture what the corresponding De Bruijn graph would look like. It's going to look like this. And they key point here is that the two bases present at this position in the middle of this stretch of DNA. The base that varies between the two parents, results in this kind of bubble shape that appears in the De Bruijn graph. This bubble here. And the nodes and edges along one edge, one side of this bubble correspond to the camers that are only present in the maternal copy, because they overlap the base that comes from the mother. And then along the other branch, are the camers that are only present in the paternal copy because they overlap the base that came from the father. And so this is not a mistake. This is not like a sequencing error. This is a real base that just happens to be present in two different copies. And so one way to deal with this is to collapse the graph, get rid of the bubbles so that there's just a straight line walk. But then keep a note that says, oh, by the way. At this position I noticed that there's a different base in the mother and in the father. And then one last way that the graph can be messy is because of repeats, because of repetitive DNA. And as we've seen, what repeats do is they'll create ambiguities in the graph. And if we leave these ambiguities in the graph and then try to reconstruct the genome sequences, then we'll probably make mistakes, different mistakes depending on what method we use. For example, shortest common superstring was vulnerable to overcollapsing repeats and then the De Bruijn graph Euler and Walk technique was vulnerable to shuffling around. Bits of the genome in between the repetitive bits. So either way, we got the wrong answer, so how do we deal with this? Basically we deal with it by chopping the assembly into pieces. What do I mean? Even if the graph as a whole has ambiguities in how the genome should be reconstructed, There will be pieces of the graph for which there's no ambiguity. For this piece of the graph, it's clear what the reconstruction is. So maybe we can't put the whole puzzle together, but we can still put portions of the puzzle together. So for example, here is an overlapped graph, where the overall reconstruction is not clear. But this part here is not ambiguous. We know how to get through that part of the graph. We know how to reconstruct that part of the genome. And again, there's only one way to walk through this part of the overlap graph. And this part in the middle, we know what sequence is there but we don't know how many times to walk through it. And so these partial reconstructions that we can put together unambiguously are called contigs, which is short for contiguous sequences. So, this is a principle that real world assemblers use when the overall graph is ambiguous, which it almost always is because of repetitive DNA. We cannot reconstruct the original genome unambiguously, that's what the third law says. So instead what we do is to identify portions of the graph that can be reconstructed unambiguously and assemble those. And so at the end of the day instead of reporting a single assembled genome sequence an assembler will report a set of contigs. That's sort of the best we can do. Nearly every assembly that's been produced, with the exception of pretty simple genomes, like bacterial genomes, for example. Every assembly is fragmented, it's in pieces. Even the human reference genome, which is the most studied genome on the planet, still has holes in it. So for example this figure is showing a result from a recent effort to fix problems with the human reference genome. And everywhere where you see a red diamond, so these are the chromosomes here, labeled with what chromosome they are. So here's chromosome 16. And everywhere where you see a red diamond, like this red diamond here or this red diamond here, that's a place where a gap in the assembly was fixed as part of this study. So, that's a place where the space between two fragments was filled in so you could merge the two fragments. But, there are many, many, even after efforts like this, there are many gaps that remain in the human genome assembly. That's just something we have to live with because of repetitive DNA.