The Dashing study, “Dashing: fast and accurate genomic distances with HyperLogLog,” authored by Daniel Baker, appeared in Genome Biology today. Dashing is a fast and accurate software tool for estimating similarities between genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that specialize in set unions and intersections. Dashing sketches genomes more rapidly than previous MinHash-based methods such as Mash or BinDash, while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in under 6 minutes.
The Dashing software is available via Bioconda.
Ben made a brief video that explains the approach by combining concepts from MinHash and HyperLogLog sketching.
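For readers who want a feel for the idea, here is a minimal Python sketch of how HyperLogLog sketches can be combined to estimate set similarity. It is an illustration only, not Dashing's code: it uses the classic HyperLogLog estimator with a small-range linear-counting correction, and a plain inclusion-exclusion estimate of the intersection, rather than the more specialized union and intersection estimators that Dashing uses. The class name `HyperLogLog`, the helpers `kmers` and `jaccard_estimate`, the choice of BLAKE2b as the hash, and the default of 2^12 registers are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch (illustrative; not Dashing's implementation)."""

    def __init__(self, p=12):
        self.p = p                    # number of index bits (assumed default)
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item (BLAKE2b is an arbitrary choice here).
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def union(self, other):
        # The sketch of a union is the register-wise maximum of two sketches.
        if self.p != other.p:
            raise ValueError("sketches must use the same number of registers")
        out = HyperLogLog(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)        # linear counting for small sets
        return raw


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_estimate(a, b):
    """Estimate |A n B| / |A u B| from two sketches by inclusion-exclusion."""
    union_size = a.union(b).cardinality()
    if union_size == 0:
        return 0.0
    intersection_size = a.cardinality() + b.cardinality() - union_size
    return max(0.0, intersection_size) / union_size
```

To compare two sequences, one would add every element of `kmers(seq, k)` to a sketch for each sequence and then call `jaccard_estimate` on the pair. The key property is that a union never requires revisiting the input: taking the maximum of each pair of registers gives the sketch that would have been built from the combined set. Inclusion-exclusion is the simplest way to get from there to an intersection, though it tends to be noisy when the intersection is small relative to the union.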