In the previous lecture and practical, we introduced a computational problem called the shortest common superstring problem where given a set of input reads, we looked for the shortest string, which is called a superstring, that contains all of the input strings and sub-strings. So this formulation had a downside though, which is that there were really no efficient solutions to solving this problem. So we saw a solution but it was very slow because it involved us trying every possible ordering for n different input strings and their n factorial such orderings. So in this lecture we'll see an alternative that's actually much faster. Its called greedy shortest common superstring. And we call this algorithm greedy, because the algorithm will make a series of decisions and at each decision point, it will choose the option that reduces the length of the eventual superstring the most. This seems like a good strategy the shorter the superstring the closer we are to the shortest common superstring. However as we'll see making the greedy decision at each point in the algorithm does not necessarily mean that we'll get to an optimal solution. We can visualize the greedy shortest common superstring algorithm using an overlap graph. So let's start with this overlap graph here. So remember that the nodes of the overlap graph correspond to reads and the edges of the overlap graph correspond to overlaps or suffix prefix matches between pairs of reads. Each edge here is labeled with a number that gets the length of the overlap The length of the suffix prefix match. So, the principle behind the greedy shortest common superstring algorithm is that we're going to proceed in rounds. And in each round, we're going to pick the edge that represents the longest remaining overlap in the graph. In other words, we're going to pick the edge that has the greatest number as it's label. And then we're going to merge the nodes on either side of that edge. Picking the longest overlap seems to make sense because the longer the overlaps between the strings, the shorter the final string will be. So that's what we mean by greedy algorithm in this case. We're always going to pick the longest overlap, trying to get to the shortest superstring. So let's see this principle in action. So here we have an overlap graph and so to apply our principle, we must first locate the edge that corresponds to the longest overlap and well, in this case, there's a tie, right? Because there are actually several edges that have an overlap length of two, which is the maximal overlap length in this case. So when this happens we just pick one of the edges at random. So let's say that we pick this edge here, highlight it in red. So now we're going to merge the nodes on either side of this edge. We're going to merge AAA and AAB. So when we merge nodes we replace them with one new node that corresponds to the two labels of the original nodes glued together according to the overlap. In other words if we took the strings and these nodes and made a superstring based on the longest overlap between them, that's the label that we'll put in the new node. So this merge, there, we just did it. So this merge has made the graph a little bit simpler, a little bit smaller. It has one less node in it and it has at least one edge in it, and actually we might remove several edges when we merge two nodes. So now what do we do next? We do the same thing again. So again we're going to pick the edge with maximal overlap length. Again there's a tie, so we're going to pick one at random. So let's say we pick this one, highlight it in red, and then again we'll merge these two nodes. So we'll merge ABB and BBB to get ABBB. So here's our new graph. And we'll do it again, so now let's pick another node with maximal overlap. Or, sorry, edge with maximal overlap. Let's say we pick this one, merge, okay then, pick this edge, then merge, and now we're done. So the answer is sitting right in front of us. Once we've done all that merging, there's just one node left. And this node's label is the superstring that we've obtained from greedy shortest common superstring. If there happened to be multiple nodes at the end of this merging process, in other words, if after one merge we've run out of edges, but there are still multiple nodes, then we can create the superstring just by concatenating the labels of those nodes together. So this process is much faster than the algorithm we discussed in the previous lecture and in the practical. For that algorithm we had to try every possible ordering of the input strings. But for this one, the amount of work we do is greatly reduced. So we'll see examples in the upcoming practical session where our greedy algorithm can finish very quickly, much more quickly than our original algorithm that tried all orderings. But this speed comes at a cost and the cost is that the greedy algorithm doesn't always find the correct answer. So that is the greedy algorithm can sometimes report a superstring, a string that does contain all of the input strings, but it's not necessarily the shortest common superstring. So let's look at an example where the result of the greedy shortest common superstring algorithm returns a superstring but not the shortest one. All right, so this time rather then drawing the overlap graph, we'll visualize what happens in this way. So we'll simply write out the labels of all the nodes on a line like you see here. So this first line here has the labels of all five of the nodes that we saw in the previous overlap graph. And then for each round of the algorithm we'll write a new line. And then the new line will reflect the merging that happened in the previous step. So, for example, the line that you see here is the result of having merged these two nodes from the previous step. And the nodes that get merged I'm always going to highlight in red. So if we repeat this process again and again until we get to the end of the algorithm, we get to this answer that you see down here. This is the result of the greedy shortest common superstring algorithm in this case. There were ties along the way. There were cases where we had a tie in terms of which nodes we could glue together, in terms of which overlaps had the greatest weight. And we broke them in a certain way, and it's a little bit different from the way we broke the ties before. And as a result, we actually ended up with a different superstring. And in fact, in this case, if you look at this line here, here are two strings that have no overlap with each other. There's no suffix, prefix match between these two strings. So, this was a case where if we'd drawn out the graph, we would have seen that the graph had two nodes left and zero edges left at the end of the repeated merging of nodes. So we concatenated them together to get the final superstring. So is this the shortest common superstring? It turns out that no it's not, right. So we can show this easily enough because the superstring that we got from the previous run of the algorithm, where we made slightly different choices about which edges to merge, this superstring is actually shorter than the one that we obtained when we ran the algorithm this time. So we did not get the shortest common superstring running the greedy shortest common superstring algorithm. So now we've seen the greedy algorithm and we've seen that while it's faster than the slow algorithm that we tried before, it has the problem that it doesn't necessarily return the shortest common superstring. It might return a super string that's longer. In the next lecture we'll examine another down side of both common super sting and the greedy version of that algorithm which will bring us to the third law of assembly.