Table 1 Summary of technical terms

From: Streaming histogram sketching for rapid microbiome analytics

Term

Definition

Consistent weighted sampling

An efficient method of sub-sampling histogram data that takes into account the frequency of each bin

De novo

Analyses based solely on the collected sequence data

Dimensionality reduction

Representing the sequence data in a metagenome by a relatively small number of collective quantities

Dissimilarity measure

A measure of how dissimilar two metagenomes are, typically used to identify significant changes in microbiome composition

Feature vectors

A set of key quantities of a dataset that can be used as input to a machine learning algorithm

Histosketch

A small approximate representation of histogram data, such as a k-mer spectrum

Jaccard similarity

A measure of the similarity of two datasets based on the proportion of shared members

K-mer

A short sub-sequence extracted from a read or genome

K-mer spectrum

The set of all observed k-mers, together with their abundances in the sequence dataset

Locality-sensitive hashing

A method of dimensionality reduction which hashes sequence data in such a way that similar sequences are kept together

Reference-based

Making use of existing reference genomes to align and classify new sequencing data
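
Several of the terms above (k-mer, k-mer spectrum, Jaccard similarity, dissimilarity measure) can be made concrete with a short Python sketch. This is only an illustration under stated assumptions, not the implementation used in the article: the function names are hypothetical, the reads are toy strings, and the weighted Jaccard dissimilarity is chosen here as one example of a dissimilarity measure on histogram data, since the table does not name a specific one.

```python
from collections import Counter


def kmers(sequence, k):
    """Yield every overlapping sub-sequence of length k from one read."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def kmer_spectrum(reads, k):
    """Count how often each k-mer is observed across a collection of reads."""
    spectrum = Counter()
    for read in reads:
        spectrum.update(kmers(read, k))
    return spectrum


def jaccard_similarity(a, b):
    """Proportion of shared members: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention: two empty datasets count as identical
    return len(set_a & set_b) / len(union)


def weighted_jaccard_dissimilarity(hist_a, hist_b):
    """Illustrative dissimilarity on histograms: 1 - sum(min) / sum(max)."""
    keys = set(hist_a) | set(hist_b)
    total_max = sum(max(hist_a.get(key, 0), hist_b.get(key, 0)) for key in keys)
    if total_max == 0:
        return 0.0  # assumed convention: two empty histograms do not differ
    total_min = sum(min(hist_a.get(key, 0), hist_b.get(key, 0)) for key in keys)
    return 1.0 - total_min / total_max


# Toy reads standing in for two sequence datasets (hypothetical data).
spectrum_a = kmer_spectrum(["ACGTACGT"], k=3)
spectrum_b = kmer_spectrum(["ACGTTT"], k=3)

# Jaccard similarity ignores abundances and compares only which k-mers occur.
print(jaccard_similarity(spectrum_a, spectrum_b))

# The weighted variant also takes the abundance of each bin into account.
print(weighted_jaccard_dissimilarity(spectrum_a, spectrum_b))
```

The plain Jaccard similarity treats each k-mer spectrum as a set, while the weighted variant uses the bin frequencies as well, which is the kind of information that consistent weighted sampling is described above as taking into account.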