So remember that the third law of assembly says that repetitive DNA makes assembly difficult. But there is, in theory at least, something we can do to counteract the problem of repetitive DNA: we can make the reads longer. So why does it help to make the reads longer? Well, the longer the reads are, the more likely we are to get a read that anchors some repetitive sequence, that glues it to some surrounding non-repetitive sequence. And that's what tells us where the repetitive sequence should go in the assembly. What do I mean by that? Let's look at a few examples.
The first example uses our puzzle analogy. So what's the hardest part of assembling the particular puzzle that I'm showing here? Well, probably the sky, because it's featureless. It's just blue. It's repetitive, in a sense, so it's hard to determine where any sky piece should go. So what happens if we make the puzzle pieces bigger? Does it get easier or harder? Well, clearly, it gets easier. First of all, bigger pieces means fewer pieces, so that's one reason. But the more important reason is that now that the pieces are bigger, the sky pieces are more likely to have some distinguishing feature on them. These pieces here, for example, are mostly sky, but they also have a little bit of cloud on them. And that's going to be very helpful when you try to figure out where those pieces go. So the larger the puzzle pieces, the more likely a given piece is to have some distinguishing feature on it, some non-repetitive sequence if we're talking about genomes, that helps us figure out where the piece belongs.
So here's an example with DNA. This is an example that we saw before, where the greedy shortest common superstring algorithm over-collapsed a repeat. The genome has three copies of the word "long" up here, but the result of greedy shortest common superstring had only two copies of the word "long" here.
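One way to see why the repeat can get over-collapsed is to look at the reads themselves. Here is a minimal sketch, assuming the toy genome is the string a_long_long_long_time and the over-collapsed result is a_long_long_time (both strings are assumptions for illustration, as is the helper name distinct_kmers). It only compares the distinct 6-mers and ignores how many times each one occurs.

```python
def distinct_kmers(text, k):
    """Return the set of distinct length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# Assumed toy strings: the genome with three copies of "long",
# and the over-collapsed assembly with only two copies.
three_copies = "a_long_long_long_time"
two_copies = "a_long_long_time"

# If both strings yield the same distinct 6-mers, then no single
# 6-mer can tell us whether there are two or three copies.
same_reads = distinct_kmers(three_copies, 6) == distinct_kmers(two_copies, 6)
print(same_reads)  # True
```

So under these assumptions, every distinct 6-mer of the three-copy string also appears in the two-copy string, and a shorter superstring with only two copies still contains every read.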
So this is an example of a problem that can be fixed with longer reads. How does that work? Well, instead of letting our reads be all the 6-mers of the genome, let's let our reads be all the 8-mers of the genome, and see what happens. We're going to use the exact same algorithm, and what we can see is that the repeat is no longer over-collapsed. We get the correct assembly down here, with three copies of the word "long". So why do 8-mers give us the right answer, whereas 6-mers give us a wrong answer with an over-collapsed repeat? Well, let's consider one very important 8-mer, this 8-mer right here, g_long_l. This 8-mer is special because it touches all three copies of the word "long": it touches this copy, it goes all the way across this copy, and it touches this copy too. What this means is that this 8-mer alone tells us that there are at least three copies of the word "long" in our original genome. And so now the greedy shortest common superstring algorithm can't over-collapse anymore. The 8-mer prevents that from happening. In this case, the longer reads prevented us from over-collapsing a repeat.
Speaking more generally, the reason that longer reads can counteract the problem of repetitive DNA is that they anchor repetitive sequences to their surrounding non-repetitive context. And if the reads are long enough to extend all the way through the repetitive sequence and overlap the non-repetitive sequence on either side, then that is what's going to allow us to recreate the genome sequence unambiguously. So here, for example, is a picture of a genome. The red bits in the middle are copies of a repeat, so that red sequence is repeated three times throughout this genome. But the surrounding sequences, which are shown in many colors here, are non-repetitive. So how long would a read have to be in order for us to reconstruct this genome unambiguously? Well, here are some candidate read lengths down here. These are different lengths that the reads might be.
And only this topmost horizontal line, this read length, is long enough to span the entire repeat and the unambiguous sequence on either side of the repeat. So that's the only read length that's going to help us, that's going to allow us to make the final assembly completely unambiguous.
Great. So longer reads will help us, but how do we get longer reads? That, as it turns out, is an interesting question, a hard technological question. We said way at the beginning of the course that DNA sequencers are very good at collecting lots of short substrings sampled from the genome. And there really hasn't been a technology invented to date that's capable of reading much longer stretches of DNA, say tens or hundreds of thousands of bases long, very accurately. But there are technologies that get a bit closer, and we'll discuss two of them briefly.
The first is called paired-end sequencing. Way back in the first module we discussed how a second-generation sequencer sequences many DNA templates at once. And it turns out that we can use the same technology to do something just a little bit different, something that gives us a bit more information per DNA template sequenced. So let's say this, in black here, is our template molecule. Normally we'd run the sequencer for some number of sequencing cycles. In reality it would be 100 or 200 or 300 or so sequencing cycles, but in this example, let's just say it's ten sequencing cycles. That leaves some bases of the template that are not sequenced, these bases over here. We didn't sequence those. With paired-end sequencing, instead of sequencing just one end of the template, we can sequence both ends. Depending on how long the template is, we might then get something that looks like this, where we sequenced from both ends and met in the middle.
And so what we get at the end of the day is essentially a read that's twice as long as the read we would have gotten if we had only sequenced in one direction. But we might also get something like this. What if the template molecule is longer than that? Well, then we'll get a bit at one end and a bit at the other end, and there will be this sort of mysterious gap in between, this gap here. But even in this case we can use more or less the same sorts of methods that we talked about in this course to either align these reads to a genome or assemble them into a genome. We just have to deal with the fact that there's some amount of missing sequence between the two ends. That is not completely trivial, but it's something that can be done, and in fact paired-end sequencing is extremely common in practice. It's a way to get about twice as many bases out of every template strand sequenced, without sacrificing much in terms of accuracy or speed. So it's very, very popular.
Okay, so paired-end sequencing gets us a factor of two or so improvement in read length, but that still isn't a huge improvement, right? We want something like an order of magnitude or two orders of magnitude improvement in read length. As it turns out, some very recent technologies can get one or two orders of magnitude improvement in read length, though at the expense of speed and accuracy. How these technologies work exactly is beyond the scope of our course, but one method, shown here, uses essentially a very tiny camera to eavesdrop on the DNA polymerase as it synthesizes the complementary strand. Another method draws the DNA through a tiny hole called a nanopore and measures the electrical current passing through the pore as the DNA moves through it. And that signal in turn tells us which nucleotides are passing through the pore. One thing these technologies have in common is that they sequence one molecule at a time.
And this is in contrast to the sequencers that we talked about in the first module, the sequencing-by-synthesis method, where, you might recall, when we're looking at the light coming from the slide, we're actually sequencing a community of clones that are clustered close together on the slide. So in that scenario we are not sequencing one molecule at a time, but with these technologies we are. Single-molecule sequencers like these are capable of generating reads that are on the order of tens or even hundreds of thousands of bases long. And that's wonderful, because there are very few stretches of repetitive DNA that cannot be resolved, that cannot be anchored to nearby unique sequence, with reads that are that long.
Unfortunately, that length comes at a cost. The reads are very error prone. The sequencer makes lots of mistakes: on the order of 10 to 15% of the time, the sequencer will make a mistake in reading a base. And that means that tools like read aligners and assemblers have to be exceptionally tolerant of these kinds of mismatches and gaps. It's still very early days for these technologies, but at least for some assembly projects there seem to be some exciting early successes, so it's something to keep an eye on.
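To come back to the toy example from earlier in this lecture, here is a rough sketch of how one might ask "how long do the reads have to be before they can tell two copies of a repeat from three?" It assumes the genome is the string a_long_long_long_time and the over-collapsed version is a_long_long_time. Those strings, and the function names distinct_kmers and smallest_distinguishing_k, are assumptions made for illustration. It compares only the distinct k-mers of each string, which is a simplification of what an assembler actually sees.

```python
def distinct_kmers(text, k):
    """Return the set of distinct length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def smallest_distinguishing_k(a, b, k_min, k_max):
    """Find the smallest k in [k_min, k_max] at which a and b have
    different sets of distinct k-mers. Returns (k, witnesses), where
    witnesses are the k-mers found in only one of the two strings,
    or None if no k in the range separates them."""
    for k in range(k_min, k_max + 1):
        kmers_a = distinct_kmers(a, k)
        kmers_b = distinct_kmers(b, k)
        witnesses = kmers_a ^ kmers_b  # symmetric difference
        if witnesses:
            return k, sorted(witnesses)
    return None

# Assumed toy strings: three copies of "long" versus two.
three_copies = "a_long_long_long_time"
two_copies = "a_long_long_time"

print(smallest_distinguishing_k(three_copies, two_copies, 6, 10))
# (8, ['g_long_l'])
```

Under these assumptions, 6-mers and 7-mers look the same for both strings, and 8 is the first read length that separates them. The only witness is g_long_l, the 8-mer that touches all three copies of "long".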