From: Streaming histogram sketching for rapid microbiome analytics
Term | Definition |
---|---|
Consistent weighted sampling | An efficient method of sub-sampling histogram data that takes into account the frequency of each bin |
De novo | Analyses based solely on the collected sequence data |
Dimensionality reduction | Representing the sequence data in a metagenome by a relatively small number of collective quantities |
Dissimilarity measure | A measure of how dissimilar two metagenomes are, typically used to identify significant changes in microbiome composition |
Feature vectors | A set of key quantities of a dataset that can be used as input to a machine learning algorithm |
Histosketch | A small approximate representation of histogram data, such as a k-mer spectrum |
Jaccard similarity | A measure of the similarity of two datasets based on the proportion of shared members |
K-mer | A short sub-sequence of length k extracted from a read or genome |
K-mer spectrum | The set of all observed k-mers, together with their abundances in the sequence dataset |
Locality-sensitive hashing | A method of dimensionality reduction which hashes sequence data in such a way that similar sequences are likely to receive the same hash value |
Reference-based | Making use of existing reference genomes to align and classify new sequencing data |
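
Several of the terms above (k-mer, k-mer spectrum, Jaccard similarity) can be made concrete with a few lines of Python. The sketch below is illustrative only and is not taken from the article's software: the function names `kmer_spectrum`, `jaccard_similarity`, and `weighted_jaccard` are hypothetical, the sequences are toy data, and the weighted (min/max) form is included on the assumption that it is the abundance-aware similarity that consistent weighted sampling is generally used to approximate.

```python
from collections import Counter


def kmer_spectrum(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in a sequence (a toy k-mer spectrum)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def jaccard_similarity(a: Counter, b: Counter) -> float:
    """Proportion of distinct k-mers shared by two spectra; abundances are ignored."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def weighted_jaccard(a: Counter, b: Counter) -> float:
    """Abundance-aware similarity: sum of per-bin minima over sum of per-bin maxima."""
    keys = set(a) | set(b)
    numerator = sum(min(a[key], b[key]) for key in keys)    # Counter returns 0 for absent k-mers
    denominator = sum(max(a[key], b[key]) for key in keys)
    return numerator / denominator if denominator else 1.0


# Toy example: two short sequences that differ at a single position.
spectrum_a = kmer_spectrum("ACGTACGT", 3)
spectrum_b = kmer_spectrum("ACGTTCGT", 3)
print(jaccard_similarity(spectrum_a, spectrum_b))
print(weighted_jaccard(spectrum_a, spectrum_b))
```

Both functions compare full spectra exactly, which becomes expensive for real metagenomes. A sketch such as a histosketch trades that exactness for a small fixed-size summary from which a similar quantity can be estimated.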