Student Nae-Chyun Chen and colleagues just posted a new preprint describing their work on the “reference flow” alignment framework. Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, made up of a single string per chromosome. But failing to account for genetic variation causes reference bias and confounds downstream results. Other approaches replace the linear reference with structures like graphs that can include genetic variation, but these incur major computational overhead. The reference flow framework uses information from multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance on both real and simulated data, while using 13% of the memory and running 6 times as fast.
The software is built on Snakemake and is available from GitHub. We have also made available human Bowtie 2 reference-genome indexes (~21 GB download) corresponding to the RandFlow-LD method described in the paper.