Interesting overview of the high-throughput sequencing technologies, with quotes below from 454 people and on the cost of storing the data from a single Solexa run, estimated at ~$15K/run.
The Drive for the $1000 Genome
By Kevin Davies
May 15, 2007 | J. Craig Venter recently made his Comedy Central debut on The Colbert Report. Asked by host Stephen Colbert, "What makes you think you can do a better job with life and genetics than God?" Venter shot back: "We have computers!", rendering Colbert momentarily (and uncharacteristically) speechless.....
A potential downside of the SOLiD setup is the premium it puts on compute power and storage. The complete SOLiD system, including compute and workflow pieces, could push the price above $600,000....
"No-one stores the [Sanger] images when they're small, so who's going to store them when they're large? So we want to get you past the images and into analyzing the data," says Rhodes.
By contrast, a typical 454 FLX run produces a paltry 13 GB of raw image data. "After data extraction, namely base calling, we're at a final of just less than 20 GB in total. That's actually quite manageable, especially nowadays with 500-GB hard drives," says Harkins. "We're looking to compress that down so potentially you could burn a DVD for one drive. You could store an instrument run for a few dollars."
Harkins notes that other next-generation sequencing platforms are talking about terabytes per run. "We're talking about pushing the science, but these other companies have a dilemma. It could cost more in computer hardware than reagents for an instrument run," says Harkins.
While Illumina's Smith agrees that, "The really big data is in the images," Illumina offers customers the opportunity to store all of their images, "because there will be people who want to do that. The issue is you get into hundreds of GB or even 1 TB [per run]." And that will only increase in the future. "The customer may decide to store a subset of the images for quality control purposes, or store images for a particularly important run and archive them to a tape backup."
The question for the market, Harkins reckons, is: Do you want to save your raw data? 454 allows users to re-evaluate their raw data. "We had one customer who re-processed his raw data using the updated GS FLX software and is seeing improvements," says Harkins. "When you're talking about 1-2% error down to 0.5%, that leads to tangible improvements for downstream analysis."
But Rhodes dismisses such criticism. With an instrument potentially pumping out 4 Megabases each second over three days, "People don't need the images, they need the data. What you really want is the result," says Rhodes.
During a panel discussion at CHI's Next Generation Sequencing conference, Rhodes said: "Back of the envelope calculations say that if you wanted to store the raw image data, it's 6 TB a week... that could require you to spend $1 million on storage, backup, and stuff. So unless you think you're going to want to go back to every image, it's cheaper to do the experiment again." Rhodes can see certain situations for storing images, say for a precious cDNA clone. "But as a routine workaday measure, no."
"Once you've got to that stage, you still have a large dataset — if you're going to generate 1 billion bases per run, you've got to have quite a lot of bytes as well as bases," says Smith. "But you're no longer in the TB of data, you're back down in the 100 GB or so. So you can reduce the data quantity by not storing the images." Smith says many customers already have the necessary compute infrastructure. ...
But Harkins says the market hasn't come to terms with the dilemma of paying $10,000-20,000 to save a single instrument run's data. "That's going to put the market into a bind," he says. "Throwing raw data away is a paradigm shift I don't think people are ready for yet."
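As a rough sanity check on the figures quoted above, here is a minimal back-of-envelope sketch in Python. It only restates the article's numbers (just under 20 GB per 454 FLX run, 500-GB hard drives, 6 TB of images a week); the 4.7 GB single-layer DVD capacity and the helper function names are my own assumptions, not something from the article.

```python
# Back-of-envelope storage arithmetic using figures quoted in the article.
# Assumption (not from the article): a single-layer DVD holds 4.7 GB.

def runs_per_drive(drive_gb, run_gb):
    """Whole instrument runs that fit on one drive."""
    return int(drive_gb // run_gb)

def compression_needed(run_gb, media_gb):
    """Compression ratio required to fit one run on one piece of media."""
    return run_gb / media_gb

def yearly_volume_tb(tb_per_week, weeks=52):
    """Raw image volume accumulated over a year at a steady weekly rate."""
    return tb_per_week * weeks

flx_run_gb = 20          # "just less than 20 GB in total" (treated as an upper bound)
drive_gb = 500           # "500-GB hard drives"
dvd_gb = 4.7             # assumption: single-layer DVD
image_tb_per_week = 6    # Rhodes: "6 TB a week"

print(runs_per_drive(drive_gb, flx_run_gb))               # 25
print(round(compression_needed(flx_run_gb, dvd_gb), 1))   # 4.3
print(yearly_volume_tb(image_tb_per_week))                # 312
```

So a single 500-GB drive would hold about 25 full 454 runs, fitting one run on a DVD would take a bit over 4x compression under the DVD assumption, and 6 TB a week adds up to roughly 312 TB a year if every image is kept.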