Nae-Chyun’s paper describing the “reference flow” alignment framework appeared in the journal Genome Biology today. Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, consisting of one string per chromosome. But a linear reference is an arbitrary point of reference; using a single linear reference causes “reference bias,” a tendency to produce incorrect alignments or to miss alignments for reads that contain bases that differ from the reference.
Some approaches seek to address this bias by replacing the linear reference with structures like graphs that can include genetic variation, at the cost of added computational overhead. Here we propose the “reference flow” framework, which uses information from multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow exhibits a similar level of accuracy and bias avoidance, on both real and simulated data, but with 13% of the memory footprint and 6 times the speed. We describe and evaluate a few variations on this idea, demonstrating that our method can take linkage disequilibrium into account and can cover the 1000 Genomes Project genotype space at either the superpopulation level or the population level, improving on vg’s computational performance either way.
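
To make the idea of reference bias concrete, here is a toy Python sketch. It is not the paper's implementation and not how a real aligner works: the sequences, the ungapped mismatch-counting "aligner", and the names `best_hit` and `align_to_any` are all made up for illustration. It only shows how a read carrying a non-reference allele can fail against one linear reference yet align once a second, population-style reference is also tried.

```python
def mismatches(read, ref, pos):
    """Count mismatching bases when `read` is placed, ungapped, at ref[pos:]."""
    return sum(a != b for a, b in zip(read, ref[pos:pos + len(read)]))


def best_hit(read, ref, max_mismatches=1):
    """Return (position, mismatches) of the best ungapped placement, or None.

    Hypothetical stand-in for an aligner: a placement is accepted only if it
    has at most `max_mismatches` mismatches.
    """
    best = None
    for pos in range(len(ref) - len(read) + 1):
        mm = mismatches(read, ref, pos)
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (pos, mm)
    return best


def align_to_any(read, refs, max_mismatches=1):
    """Try each named reference and keep the hit with the fewest mismatches."""
    hits = []
    for name, ref in refs.items():
        hit = best_hit(read, ref, max_mismatches)
        if hit is not None:
            hits.append((hit[1], name, hit[0]))
    if not hits:
        return None
    mm, name, pos = min(hits)
    return name, pos, mm


# Made-up sequences: the population reference differs by one SNP (A -> T at index 8).
linear_ref = "ACGTTGCAAGCTTAGC"
pop_ref = "ACGTTGCATGCTTAGC"

# A read drawn from the population haplotype, plus one simulated sequencing error
# in its last base. Against the linear reference it shows two mismatches.
read = "TGCATGCA"

print(best_hit(read, linear_ref))   # no placement within the mismatch budget
print(align_to_any(read, {"linear": linear_ref, "population": pop_ref}))
```

Because the two toy references differ only by a substitution, positions mean the same thing in both. References that differ by insertions or deletions would need their coordinates translated back to a common system before hits could be compared.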