Boiler: radically lossy compression

PhD student Jacob Pritt just posted a preprint describing Boiler, a novel and radically lossy tool for compressing RNA sequencing alignments. A major computational burden for researchers today is the need to store enormous datasets. To ask even simple biological questions about a multi-sample RNA-seq dataset, one deals with up to billions of read alignments and terabytes of files. The problem is growing rapidly as sequencers improve and public archives fill with more datasets. The Sequence Read Archive already contains petabytes of data and has a doubling time of about one year.

Boiler uses a novel compression approach inspired by “transform coding.” It transforms alignment data from the “alignment domain,” where each alignment is represented separately, to the “coverage domain,” where per-alignment data is inferred from a handful of empirical distributions. Unlike other compression tools, Boiler’s compression ratio improves rapidly with sequencing depth, making it an extremely appealing option for deeply sequenced samples.
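To make the idea of moving from the alignment domain to the coverage domain concrete, here is a minimal Python sketch. It is not Boiler's actual code or file format; the function name `to_coverage_domain`, the half-open `(start, end)` interval representation, and the choice of read length as the single empirical distribution are all assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby


def to_coverage_domain(alignments, region_length):
    """Summarize alignments as a coverage vector plus a length distribution.

    `alignments` is a list of half-open (start, end) intervals on one region.
    This is an illustrative toy, not Boiler's real data model.
    """
    # Difference array: +1 where an alignment starts, -1 where it ends.
    diff = [0] * (region_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1

    # A prefix sum of the difference array gives per-base coverage.
    coverage = []
    depth = 0
    for delta in diff[:region_length]:
        depth += delta
        coverage.append(depth)

    # Run-length encode the coverage vector as (depth, run_length) pairs.
    coverage_rle = [(value, len(list(run))) for value, run in groupby(coverage)]

    # Keep only an empirical distribution of read lengths, not each read.
    length_dist = Counter(end - start for start, end in alignments)
    return coverage_rle, length_dist


# Hypothetical toy input: four alignments on a 10-base region.
alignments = [(0, 4), (2, 6), (2, 6), (5, 9)]
coverage_rle, length_dist = to_coverage_domain(alignments, 10)
print(coverage_rle)
print(dict(length_dist))
```

In this toy, the individual `(start, end)` pairs are discarded, which is where the loss comes from. Adding more reads over the same region raises the depth values without necessarily adding many new runs, which gives some intuition for why a coverage-domain representation can compress better as sequencing depth grows.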

Boiler software is available here: https://github.com/jpritt/boiler

Boiler preprint is available here:
http://biorxiv.org/content/early/2016/02/22/040634

Experiments for the Boiler paper are described here:
https://github.com/jpritt/boiler-experiments


Published: March 1, 2016
