And we are back on the slow crawl toward eventually explaining what I do, out here in the darker recesses of my lab tucked in the remote Kansai countryside.

Aside from breeding deadly mutant monkeys to serve in my army of evil minions when I kickstart the world-domination part of my plot, that is.

Before I go any further, let me remind the casual reader that: 1) it is most likely nice and sunny out there where you live and you would be considerably better off looking at squirrels running through the trees 2) if you have even the slightest inkling of formal mathematical/computer science training, you will be better served foregoing this edulcorated version in favour of one of the 10 million tutorials and entries on bioinformatics available throughout the internets (Wikipedia being a good place to start). The entry written henceforth is geared at some hypothetical grandparents who would care to know what the fuss with modern Science is all about (for instance mine, were they not already perfectly content in the sole knowledge that the good Lord has put all these tiny amino-acids together in the best possible way of all worlds and that modern genetics is the work of the Devil^{1}).

In last month’s episode, we laboriously learnt that Biology abounds with really, *really*, tough problems. Two major points were:

1. For all practical matters, NP-Complete problems are all in the same bag: finding a way to solve one efficiently would mean you can solve any other in roughly the same order of time.

2. Once you have proved that a problem is NP-Complete, trying to find an exact solution for a real-life set of data is about as meaningful as trying to take down Everest with a toothpick. There are, however, plenty of ways to find an *approximate* solution. Proving NP-Completeness is your cue to start looking for approximation algorithms; and thus the fun begins.

Today, instead of going straight into the myriad fun ways in which mathematicians solve biology problems (and which of those I am actually connected to), here is another digression and an illustration everyone has heard of: genome sequencing.

Full genome sequencing (mapping the entire DNA of a given organism) is one of the earliest applications of modern bioinformatics techniques, a seminal example: it starts off as a rather straightforward bio-chemistry problem, soon runs into pesky matters of size, complexity and intractability, goes through a difficult phase of alcohol and substance abuse, but is ultimately saved by the power of Love and Mathematics.

Before I go into the gory details, allow me to dissipate a common misconception about DNA sequencing: it is nowhere near as easy as you might have been led to believe by your TV (most people’s preferred source of Science™ facts). Hearing of “DNA tests”, “DNA crime databases” and other everyday DNA-related techniques might make it sound like sequencing is as easy as sending your saliva swab to the lab and waiting a couple of days for the results. In reality, despite serious advances, actual full genome sequencing is still a multi-year, multi-million-dollar affair. When people talk about DNA in a forensics or medical context, they are usually looking at a *single* nucleotide base, located at a precise position on *one* gene, out of the entire genome. Even cases that require a larger sample of such observations (*e.g.* DNA matching, when it actually involves sequencing at all) are still somewhere in the lower hundreds (if that). That’s a mere 100 bases to look at, against 100+ million for the first organism fully sequenced, 10 years ago (make that 3 billion for humans). Quite a difference in scale. And, of course, this is one of those problems where solving twice the size requires *much* more than twice the time (hopefully by now, this does not surprise you; otherwise you might want to go back and read episode 1 again).

OK, let’s start:

### The Easy Part

Long before full-genome sequencing arose, there were already efficient methods to map *small* fragments of DNA. How small depends a lot on how fast and how reliable you want the result to be, but 100 bases is a good figure to go with (it could be 100 or 1,000: it would not matter much when compared to the aforementioned billions of bases of the human genome). The methods used to do small-scale sequencing are quite nifty, but since they merely rely on boring^{2} bio-chemistry stuff, I will let you google around (Wikipedia’s entry on the topic is not the greatest introduction, for once). For our sake, let’s just say we have biological tools that can accurately map segments of DNA of 100 bases (or smaller) in a fairly short time with near-perfect accuracy.

Other magical tools that we have at our disposal in our bio-chemistry toolbox allow us to somewhat painlessly:

1. replicate a sequence (of any length) an arbitrary number of times.

2. cut sequences in smaller bits at arbitrary positions: usually upon encountering a specific pattern in the sequence and with a certain frequency rate (*e.g.*: upon encountering the pattern ‘CCCGGG’ anywhere in the sequence, the cut will occur 1 out of 6 times).

3. work on a subsequence at a specific position (as if it were cut from the rest): provided you have an *exact* pattern where you want to do it (and the pattern is long enough to be unique in the sequence), you can isolate a subsequence from the rest with absolute precision and accuracy. However, engineering such an artificial cut is infinitely more complex, long and costly than using some of the “ready-made” cutters of 2.

*The same with more science and less words, you ask? Sure, here you go: DNA primers, DNA polymerase, PCR, restriction enzymes, kangaroos ^{3}.*
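For the programmatically inclined, tool 2 can be sketched in a few lines. This is a toy model only: the recognition site, the cut position and the 1-in-6 rate are the illustrative numbers from above, not the behaviour of any particular enzyme.

```python
import random

def digest(seq, site="CCCGGG", cut_rate=1/6, seed=0):
    """Toy restriction enzyme: scan for the recognition site and
    cut in front of it with a fixed probability (tool 2 above)."""
    rng = random.Random(seed)
    fragments, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        if rng.random() < cut_rate:
            fragments.append(seq[start:pos])  # cut just before the site
            start = pos
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments

# with cut_rate=1.0 every site is cut: three fragments, no bases lost
print(digest("AAACCCGGGTTTCCCGGGAAA", cut_rate=1.0))
```

Run it a few times with the default 1-in-6 rate and different seeds and you get a different “splitting pattern” each time, which is exactly the property shotgun sequencing exploits below.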

For the sake of keeping things simple (ha!) we will blatantly ignore the fact that DNA is not a single strand, but two complementary sequences tied together that need to be periodically pried apart or rebuilt during these operations. In the end, it does not matter much to the problem at hand.

Equipped with such tools, how should we go about processing entire chromosomes (usually north of 10k bases for simple organisms, 50 million for humans)?

### The Straightforward Method (aka The Doomed Approach)

“Primer walking” is a (comparatively) simple method that more or less consists of focussing on segments small enough to fit our small-scale sequencing method: start at a given position (see tool 3. above), map a few hundred bases, use the chunk just mapped to find the next starting position, lather, rinse, repeat.

There are two major problems with this method. First, as mentioned, “starting at a specific position” on a sequence is a very *ad hoc* and therefore costly (in money, in time…) operation. Second, that operation requires knowing *where* to start, which means having processed the previous segment (so we can use the end of that segment as the pattern to start at in the following step).

The entire process therefore requires *sequentially* executing a series of operations, each of which takes a certain time. Surprisingly, no NP-Completeness or exponentials are involved here: just a very, very long task with no way to parallelise it, which is an improvement in theory only. In practice, this makes the method viable (albeit slow and costly) for sequences of up to 10,000 or so bases.

On longer sequences, and particularly on our friendly multi-billion genomes, that would be a Fail.
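Just how big a Fail is easy to quantify with a back-of-the-envelope calculation (the one-day-per-step figure below is my own illustrative assumption, not a measured number):

```python
# Primer walking is strictly sequential: each ~500-base step must
# finish before the next one can even be designed.
genome_bases = 3_000_000_000   # human genome, roughly
step_bases = 500               # one small-scale sequencing run
days_per_step = 1              # illustrative assumption

steps = genome_bases // step_bases
years = steps * days_per_step / 365
print(f"{steps:,} sequential steps, roughly {years:,.0f} years")
```

Even if each step took an hour instead of a day, the sequential nature of the walk means nearly 700 years of wall-clock time for one human genome.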

### The Smart Way (aka The One That Gives Headaches)

From the above, it seems obvious the only way to stand a chance at decoding entire genomes in less than a lifetime is to tackle many small segments, possibly all of them, at the same time. And this is exactly what shotgun sequencing does: after making a good number of copies of the original sequence, we sic a bunch of ready-made sequence cutters on them. The resulting mixture is a horrible mess of fragments, each small enough to be sequenced easily. Which is great, except you then have no idea how to put them back together so as to get the bigger picture.

Not clear? Need another silly analogy? Sure:

Imagine you have a (very very long) text printed on a piece of paper: make a dozen photocopies and leave them in a room with your 5-year-old nephew and a good pair of scissors (arranging the pages in such a way that each is cut separately, in its own unique way). Put all the resulting confetti in a bag, shake a few times.

Now put the original text back together. Go ahead, I’ll wait.

Funny enough, this is near-exactly what DNA sequencing entails: once the first phase of sequencing is done, you are given a large bag of short strings (long sequences of ‘A’, ‘G’, ‘T’ and ‘C’ instead of common words in our analogy, but the same for every other purpose). You are looking for the text that is made of these strings in a specific order (and repeated over as many times as there were initial copies). An “easier” problem is to find *any* text on which you could fit each little fragment (with lots of overlap, since there are many copies, each cut a different way) without worrying too much about whether this text really is your original. In practice, to get the real solution, all you have to do is find the *shortest* such text^{4}: the one where you have no “superfluous” bits of text (not covered by any fragment) and as much overlap as possible (ideally in such a way that each bit of text is covered exactly as many times as there were original copies).

This problem is known as the *shortest common superstring*. And guess what: remember that little list I mentioned last week? Yea, bummer.

Even the decision version of the problem (is there a common superstring shorter than a given length?) is NP-Complete. Actually finding the Shortest Common Superstring belongs to an even harder class of problems known as “NP-Hard” (where, as the name lets on, real badass NP problem kids hang out). Given the size of your data (millions of tiny fragments), finding an exact solution is obviously not gonna happen in this lifetime. Neatly enough, solving this problem turns out to be exactly the same as solving a particular brand of Traveling Salesman Problem (remember?). But as I mentioned last week: where standard algorithms fail and exact solutions cannot be found, it is usually still possible to find a decent approximation in a fraction of the time. Sometimes this solution even turns out to be the exact one, or so close that it doesn’t matter. This is the case here.
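For the curious, the textbook greedy approximation fits in a dozen lines of Python. This is a toy sketch, nothing like a production assembler: it simply keeps merging the two fragments that overlap the most, and the tiny fragments are made up for the example.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(fragments):
    """Classic greedy approximation to the shortest common superstring:
    repeatedly merge the pair of fragments with the largest overlap."""
    # a fragment fully contained in another carries no extra information
    frags = [f for f in fragments
             if not any(f != g and f in g for g in fragments)]
    while len(frags) > 1:
        k, a, b = max((overlap(a, b), a, b)
                      for a in frags for b in frags if a != b)
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[k:])  # glue b onto a, sharing the overlap
    return frags[0]

print(greedy_superstring(["ATTAG", "TAGAC", "GACCT", "CCTGA"]))
# prints ATTAGACCTGA: the four fragments reassembled
```

Greedy merging is conjectured to stay within a factor of 2 of the true optimum, but it is very much an approximation: adversarial fragment sets (and, in real life, repetitive DNA) can lead it astray.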

*And excited as I am sure you were all, to jump into the actual mathematics of solving this particular problem, a sense of human decency and the already astronomical word count on this entry conjointly force me to skip to the end and suggest that the morbidly obsessed bravest among you peruse their local Google to look for some fascinating readings on reducing Shortest Common Superstring to Asymmetric Traveling Salesman Problem and/or finding approximation algorithms for the Shortest Common Superstring problem.*

One of the tricks behind Shotgun Sequencing, besides the actual approximation algorithm used to rebuild the original sequence, is the clever use of statistical redundancy to reach a guaranteed level of accuracy. That’s how we are able to say with certainty that the accuracy for the current Human Genome project is “no more than 1 error every 100,000 bases on average” (which sounds an awful lot better than “fuck if we know, but we dang sure hope it’s enough to tell it apart from them monkey DNA”).

### The thankfully brief conclusion

And there we are, I have exposed, in nearly less time than it takes to kill a grizzly with a feather^{5}, how Mathematics and Algorithmics can save the day in Genetics and help ensure that we will all have a bright future made of beautiful blond, three-eyed babies.

Of course, given that Shotgun Sequencing, in its earliest incarnation, dates back about 40 years, there is very little left to be done for young upcoming bioinformaticians in this particular direction. And none of my work has anything to do with common sequencing problems.

But next ~~week~~ ~~month~~ soon, I will maybe talk of more recent challenges in bioinformatics, perhaps even some of mine!^{6}

1. I know: one is not supposed to capitalise the name of God’s evil nemesis, but I am going on the assumption that Satan is a vindictive bastard and one can never be too prudent in courting the good graces of major players of the afterworld. [↩]
2. just kidding, fellow biologist friends: keep looking after those test tubes and churning out my data. kthxbye. [↩]
3. Have fun with your friends: can you spot the famous Australian marsupial among these terms related to DNA engineering? [↩]
4. the only reason this is not 100% true is the pesky problem of highly repetitive sequences. Something unlikely to arise in a real book, but very common in some DNA. [↩]
5. Pro tip: go for that spot under their foot, grizzlies are extremely ticklish there. [↩]
6. … in part 205 to 300 of this entrancing 1400-part series! [↩]

Great writeup. Two questions on shotgunning:

– Since your strings are long sequences of 4 characters (ATGC), there’s going to be tons and tons of repeated sequences. Does something prevent shotgunners from overlapping by too much when gluing these strings back together? TCAAA and AAAGT might actually be TCAAAAAGT (no idea if that’s even a legal sequence but you see what I’m getting at).

– Do they already know the final length of the strings they’re trying to assemble? If so, that’s at least a small foothold for optimization.

Now I understand better why current DNA testing isn’t actually the magic forensics bullet it was claimed to be in the 90s.

Scott: thanks. I’d recommend you check some of the papers and tutorials on shotgun algorithms out there: pretty sure you won’t have any trouble making sense of them (as I pointed out, this is essentially a computer science problem dealing with string manipulation)…

– The overlap is part of what makes it possible to glue it all back together (otherwise you’d have no way of finding the original order). What prevents over-merging is the number of copies of the entire sequence and the somewhat random occurrence of cuts (each enzyme tends to cut at certain places, but only with a given rate of certainty, hence a different “splitting pattern” for each original copy of the sequence). In your example (keeping in mind of course that real fragments are considerably longer: around 100-200 bases), if the original sequence is ‘TCAAAAAGT’ and you have ‘TCAAA’ and ‘AAAGT’ floating around in your final broth, this means you also have an ‘AAGT*’ and a ‘*TCAA’ somewhere… Likely you will also have an ‘AAAAA’, a ‘*TC’, a ‘GT*’ etc. Given enough substrings and enough variation in splitting patterns, a typical shotgunning algorithm will end up finding that the most satisfying superstring is indeed ‘TCAAAAAGT’ (a longer string, *e.g.* ‘TCAAAAAAGT’, would not be fully covered as many times).

– Yes, the real shotgunning algorithm is actually a somewhat iterative process, where you start by cutting the entire DNA into smaller bits of a given size, then smaller again etc.

I cannot find good detailed illustrations of the common statistical approach to solving the shotgunning problem, but to put it extremely roughly (and from memory, since this isn’t at all what I work on), it works a bit along the lines of:

1) look for possibly contiguous pairs of fragments (‘contigs’)

2) (figuratively) put all the contigs found, one under the other, so as to form a sort of diagonal running the length of the original string.

3) use properties of the set of fragments (chiefly the ‘cover’, or total number of identical copies of the original string) to “rectify” the diagonal, i.e. correct possible misalignments between pairs of fragments (this answers your first question)…

4) run 2 and 3 iteratively until you obtain a satisfying alignment of all fragments.

Once again: very rough and possibly inexact in more modern implementations, but should give you an idea: essentially, it’s an iterative optimisation process…

That being said, sequences with ultra-high repetition rates and other idiosyncrasies *will* trip up the algorithm. Which is where people fall back on more expensive methods (BAC). AFAIK, only about 99% of the human genome could be easily decoded with shotgun sequencing (which still leaves a very big chunk, as you can imagine).

About DNA testing: it’s a whole other topic, but the fact that sequencing is a very tough task doesn’t mean that testing is any less reliable. In fact, one of the consequences of the work on sequencing is that there is now a very good understanding of modification patterns across individuals (see SNPs, aka “snips”). So all forensics have to do (and I am not sure they even go that length: there might be even more rudimentary methods available for the same result) is pick a couple hundred bases at one or many very specific high-mutation loci and they get a DNA fingerprint with as large a safety margin as they want (100 SNPs would mean upward of 2^100 combinations: fair to say the chances of an accidental match are low)…
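The arithmetic behind that last parenthesis, for the sceptical (each SNP treated here as a simple two-variant position, which is the rough model behind the 2^100 figure above):

```python
# Each biallelic SNP doubles the number of possible profiles, so
# 100 independent SNPs give 2**100 distinct DNA "fingerprints".
snps = 100
profiles = 2 ** snps
world_population = 8_000_000_000  # rough present-day figure

print(f"{profiles:.2e} possible profiles")
print(f"about {profiles // world_population:.1e} profiles per living human")
```

In other words, more than 10^20 possible fingerprints per living human: any match you do find is rather unlikely to be an accident.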

Hi Dave, love your work.

your bioinf blog distracted me from Kanji box, which in turn was distracting me from trying to learn how to do bioinformatics.

I’m guessing there ought to be some kind of mathematical description for my circular procrastination trouble, but this might lead me into some kind of spiral.

anyway

Cheers!

matt

Hi Matt,

I am surprised the weak bioinfo-related rant above could be of interest to anybody in the field, let alone distracting. But thanks nonetheless!

Good luck with both kanji and bioinfo studies