There has been a lot of interest in methods for building and aligning to graph genomes. Graph genomes differ from typical “linear” reference genomes because they additionally take genetic variation into account. But basic questions remain: Which variants should we include in the reference? Is including more variants always better? How close do these promising new methods come to the ideal of aligning to a personalized genome? Jacob Pritt, Nae-Chyun Chen and PI Langmead just published a paper in Genome Biology addressing these and other important questions (https://doi.org/10.1186/s13059-018-1595-x). Congratulations to Jacob, who also graduated this month!
In the study, we introduce models for assessing the pros and cons of including particular genetic variants in the graph. We implement these methods in a new, open-source software tool called FORGe, available at https://github.com/langmead-lab/FORGe. Our experiments show that weighing both the pros and cons of each variant is crucial, because simply adding more variants (in the extreme, adding all known variants) can reduce alignment accuracy.
We also show that graph-genome alignment methods (HISAT2 in particular) can be computationally affordable, genuinely improve the bottom line in terms of accuracy and bias, and approach the performance of an ideal personalized genome. FORGe-constructed graph genomes score better on alignment accuracy measures than either linear genomes or typical HISAT2-generated graphs. We make several such optimized HISAT2 graph genomes available as part of the study (ftp://ftp.ccb.jhu.edu/pub/langmead/forge).
In short: graph genomes can be both usable and useful, provided we take care in selecting which variants to include. Please give the study a read.