Seed Backup Drive

Shower thought here:

Imagine sequencing seed DNA and storing it in a digital seed bank. Useful genes could later be printed out and edited back into crops with CRISPR. A backup drive for life.

SciFi? Probably not:

An obvious question that comes to mind is storage. Genes contain a lot of information, and seeds themselves are highly efficient natural storage mechanisms… All that information packed into tiny living DNA pods that can survive drying.

But seeds have to be kept cold and dry. The rule of thumb is that reducing water content by 1% or temperature by 10 degrees Fahrenheit will double a seed’s life span.
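Here is that rule of thumb as a quick sketch. The function name is made up for illustration, and it assumes the two effects simply stack (each 1% of moisture or 10°F counts as one doubling), which real seeds presumably only follow within a limited range.

```python
def lifespan_multiplier(moisture_drop_pct=0.0, temp_drop_f=0.0):
    """Rule-of-thumb factor by which a seed's life span grows.

    Assumption: every 1% drop in water content and every 10 degrees
    Fahrenheit drop in temperature each contribute one independent doubling.
    """
    doublings = moisture_drop_pct / 1.0 + temp_drop_f / 10.0
    return 2.0 ** doublings


# Drying seeds by 2% and cooling them by 20F gives four doublings.
print(lifespan_multiplier(moisture_drop_pct=2, temp_drop_f=20))  # 16.0
```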

Seed banks are our current approach to “datacenters” for genetic diversity. The Svalbard Global Seed Vault is embedded in permafrost… the ideal environment for keeping seeds around for a long time.

Svalbard Global Seed Vault

…But keeping these seeds alive for hundreds of years is a challenge. Seed banks wisely take the position that Lots of Copies Keeps Stuff Safe, using seed swaps and replanting to keep seeds alive.

Backing It Up

Of course, having a digital backup wouldn’t hurt, and digitizing the genetic information could be useful for lots of other reasons. Gene data is a valuable resource for scientific analysis. Indeed, the Jodrell Laboratory maintains a digital DNA barcode bank (not full sequences).

So how much space would full genomes take? Time for some silly back-of-the-envelope math. Whole genome sequencing generates a lot of data – there are about six billion base pairs in each human diploid genome. Storing that can take anywhere between 200GB and 125MB (if you’re just storing mutations).

For scale, the 1000 Genomes Project is about 200TB of data representing 1700 participants. You can download it from AWS. You could say that it’s “webscale”.

Word on the street is that all of the video on YouTube comes out to about 100 petabytes.

1PB = 1,000,000 GB
100PB = 100,000,000 GB
Human genome = 200 GB
(100,000,000 GB / 200 GB) = 500,000

So we could store 500k individual human genomes for the cost of 1 YouTube. Not terribly efficient. For reference the Millennium Seed Bank physically stores 34,000 species and 1,980,405,036 individual seeds.
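The same envelope math as a few lines of Python, in case you want to plug in your own numbers. The 100PB and 200GB figures are just the rough estimates above, not measurements.

```python
GB_PER_PB = 1_000_000      # decimal units: 1PB = 1,000,000GB
YOUTUBE_PB = 100           # rough "word on the street" estimate
HUMAN_GENOME_GB = 200      # upper-end storage estimate per human genome

youtube_gb = YOUTUBE_PB * GB_PER_PB
genomes_per_youtube = youtube_gb // HUMAN_GENOME_GB

print(f"{genomes_per_youtube:,} human genomes per YouTube")  # 500,000
```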

However, YMMV with seed genomes. The human genome is 6 billion base pairs, or 6000Mbp. By contrast, a tomato genome is 900Mbp and a soybean genome is 1115Mbp.

Full human genome = 6,000Mbp and 200GB
Full soy genome = 1115Mbp
6,000 / 1,115 = 5.381
200GB / 5.381 = 37.167GB

Let’s say 40GB per full plant genome. At that size you could store roughly 5x as many tomato or soybean genomes as human genomes in the same space.
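As a sketch, the scaling step looks like this. It assumes storage grows linearly with genome length, which is the same simplification the math above makes; `estimated_gb` is a made-up helper name.

```python
HUMAN_MBP = 6000   # human genome size used above, in Mbp
HUMAN_GB = 200     # storage estimate for one full human genome

def estimated_gb(genome_mbp):
    """Scale the human storage estimate by relative genome length.

    Assumption: storage is directly proportional to base-pair count.
    """
    return HUMAN_GB * genome_mbp / HUMAN_MBP


for name, mbp in {"soybean": 1115, "tomato": 900}.items():
    print(f"{name}: {estimated_gb(mbp):.3f}GB")
# soybean: 37.167GB
# tomato: 30.000GB
```

Both land under the 40GB round number, so 40GB is a slightly pessimistic budget.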

And we could get creative. The difference between individuals in a species is the sum of their mutations, and we can store those mutation diffs for ~125MB.

What if we were to sequence 34,000 species, then store individuals of those species as diffs against the “base genome”? Let’s say we store 34,000 full “base” genomes, the same number of species as the Millennium Seed Bank, and that each genome takes about 40GB.

34,000 species * 40GB = 1,360,000GB
YouTube = 100,000,000GB
100,000,000GB - 1,360,000GB = 98,640,000GB
125MB = 0.125GB
98,640,000GB / 0.125GB = 789,120,000

34k species + 789,120,000 individuals for the space of a YouTube. Still not as efficient as traditional seed banks, but it seems within the realm of plausibility to create a useful digital seed bank.
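Here is the base-genome-plus-diffs budget as a small function. The name and parameters are invented for this sketch, and it assumes the ~125MB human diff size carries over to plants, which is a guess.

```python
def diff_bank_capacity(budget_gb, species, base_genome_gb, diff_mb):
    """Count how many individual diffs fit after storing one base genome per species.

    Assumption: every individual costs the same fixed diff size.
    """
    base_total_gb = species * base_genome_gb
    remaining_mb = (budget_gb - base_total_gb) * 1000  # decimal GB -> MB
    return remaining_mb // diff_mb


individuals = diff_bank_capacity(
    budget_gb=100_000_000,  # one "YouTube"
    species=34_000,
    base_genome_gb=40,
    diff_mb=125,
)
print(f"{individuals:,} individuals")  # 789,120,000
```

Spread evenly, that works out to about 23,000 individuals per species.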

It makes me wonder if we couldn’t take a SETI@home approach to the storage problem… donate a bit of your hard drive space to a BitTorrent swarm that keeps those valuable seed genomes alive. Lots of Copies Keeps Stuff Safe.
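For a feel of what that swarm would need, here is a rough sketch. The copy count and per-person donation are purely hypothetical numbers, and it ignores churn, overhead, and everything else that makes real distributed storage hard.

```python
import math

def volunteers_needed(total_gb, copies, donated_gb_each):
    """Number of volunteers needed to hold `copies` full replicas of the data.

    Assumption: every volunteer donates the same amount and never drops out.
    """
    return math.ceil(total_gb * copies / donated_gb_each)


# Hypothetical: 3 copies of a YouTube-sized bank, 100GB donated per person.
print(f"{volunteers_needed(100_000_000, 3, 100):,} volunteers")  # 3,000,000
```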

Disclaimer: this is me being curious. I’m no expert. Please chime in if you have corrections.

Updated Oct 19, 2016