PhD student Jacob Pritt just posted a preprint describing Boiler, a novel and radically lossy tool for compressing RNA sequencing alignments. A major computational burden for researchers today is the need to store enormous datasets. To ask even simple biological questions about a multi-sample RNA-seq dataset, one deals with up to billions of read alignments and terabytes of files. The problem is growing rapidly as sequencers improve and public archives fill with more datasets. The Sequence Read Archive already contains petabytes of data and has a doubling time of about one year.
Boiler uses a novel compression approach inspired by “transform coding.” It transforms alignment data from the “alignment domain,” where each alignment is represented separately, to the “coverage domain,” where per-alignment data is inferred from a handful of empirical distributions. Unlike other compression tools, Boiler’s compression ratio improves rapidly with sequencing depth, making it an extremely appealing option for deeply sequenced samples.
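To make the alignment-to-coverage idea concrete, here is a minimal Python sketch. It is an illustration only, not Boiler's actual code or file format: the function names (to_coverage_domain, run_length_encode) and the toy (start, length) alignment representation are assumptions made for this example. It collapses individual alignments into a per-base coverage vector plus an empirical read-length distribution, then run-length encodes the coverage.

```python
from collections import Counter
from itertools import groupby


def to_coverage_domain(alignments, region_length):
    """Collapse alignments into a coverage vector and a read-length tally.

    alignments: list of (start, length) pairs, 0-based, ungapped
                (a hypothetical simplification of real alignments).
    """
    # Difference array: +1 where an alignment starts, -1 just past its end.
    diff = [0] * (region_length + 1)
    for start, length in alignments:
        diff[start] += 1
        diff[start + length] -= 1

    # A prefix sum over the difference array gives per-base coverage.
    coverage = []
    depth = 0
    for delta in diff[:region_length]:
        depth += delta
        coverage.append(depth)

    # Empirical distribution of read lengths; which read started where is dropped.
    length_counts = Counter(length for _, length in alignments)
    return coverage, length_counts


def run_length_encode(values):
    """Encode a sequence as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


alignments = [(0, 4), (2, 4), (2, 4), (5, 3)]
coverage, length_counts = to_coverage_domain(alignments, 10)
runs = run_length_encode(coverage)
print(coverage)       # [1, 1, 3, 3, 2, 3, 1, 1, 0, 0]
print(runs)           # [(1, 2), (3, 2), (2, 1), (3, 1), (1, 2), (0, 2)]
print(length_counts)  # Counter({4: 3, 3: 1})
```

In this toy representation, the size of the coverage vector depends on the length of the region rather than on the number of reads, and deeper sequencing only changes the values stored in it. That is one intuition for why a coverage-domain encoding can compress better as depth increases. The price is that the original per-read placement can no longer be recovered exactly, which is what makes the approach lossy.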
Boiler software is available here: https://github.com/jpritt/boiler
Boiler preprint is available here: http://biorxiv.org/content/early/2016/02/22/040634
Experiments for the Boiler paper are described here: https://github.com/jpritt/boiler-experiments