NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language

Background Shotgun metagenomes contain a sample of all the genomic material in an environment, allowing for the characterization of a microbial community. In order to understand these communities, bioinformatics methods are crucial. A common first step in processing metagenomes is to compute abundance estimates of different taxonomic or functional groups from the raw sequencing data. Given the breadth of the field, computational solutions need to be flexible and extensible, enabling the combination of different tools into a larger pipeline. Results We present NGLess and NG-meta-profiler. NGLess is a domain specific language for describing next-generation sequence processing pipelines. It was developed with the goal of enabling user-friendly computational reproducibility. It provides built-in support for many common operations on sequencing data and is extensible with external tools with configuration files. Using this framework, we developed NG-meta-profiler, a fast profiler for metagenomes which performs sequence preprocessing, mapping to bundled databases, filtering of the mapping results, and profiling (taxonomic and functional). It is significantly faster than either MOCAT2 or htseq-count and (as it builds on NGLess) its results are perfectly reproducible. Conclusions NG-meta-profiler is a high-performance solution for metagenomics processing built on NGLess. It can be used as-is to execute standard analyses or serve as the starting point for customization in a perfectly reproducible fashion. NGLess and NG-meta-profiler are open source software (under the liberal MIT license) and can be downloaded from https://ngless.embl.de or installed through bioconda. Electronic supplementary material The online version of this article (10.1186/s40168-019-0684-8) contains supplementary material, which is available to authorized users.


Data types
NGless supports the following basic types: In addition, it supports the composite type List of X where X is a basic type. Lists are built with square brackets (e.g., [1,2,3]). All elements of a list must have the same data type.

String
A string can start with either a quote (U+0022, ") or a single quote (U+0027,') or and end with the same character. They can contain any number of characters.
Special sequences start with \. Standard backslashed escapes can be used as LF and CR (\n and \r respectively), quotation marks (\') or slash (\\).

Double
Doubles are specified as decimals [0-9]+ with the decimal point serving as a separator. The prefixdenotes a negative number.
Doubles and Integers are considered numeric types.

Boolean
The two boolean constants are True and False (which can also be written true or false).

Symbol
A symbol is denoted as a token surrounded by curly braces (e.g., {symbol} or {drop}). Symbols are used as function arguments to indicate that there is only a limited set of allowed values for that argument. Additionally, unlike Strings, no operations can be performed with Symbols.

Variables
NGless is a statically typed language and variables are typed. Types are automatically inferred from context. Assignment is performed with = operator:

variable = value
A variable that is all uppercase is a constant and can only be assigned to once.

Operators Unary
The operator (-) returns the symmetric of its numeric argument.
The operator len returns the length of a ShortRead.
The operator not negates its boolean argument

Binary
All operators can only be applied to numeric types. Mixing integers and doubles returns a double. The following binary operators are used for arithmetic: The + operator can also perform concatenation of String objects.
The </> operator is used to concatenate two Strings while also adding a '/' character between them. This is useful for concatenating file paths.

Indexing
Can be used to access only one element or a range of elements in a ShortRead. To access one element, is required an identifier followed by an expression between brackets. (e.g, x[10]).
To obtain a range, is required an identifier and two expressions separated by a ':' and between brackets. Example: • Functions have a single positional parameter, all other must be given by name: The exception is constructs which take a block: they take a single positional parameter and a block. The block is passed using the using keyword: reads = preprocess(reads) using |read|: block ...
The |read| syntax defines an unnamed (lambda) function, which takes a variable called read. The function body is the following block.
There is no possibility of defining new functions within the language. Only built-in functions or those added by modules can be used.

Methods
Methods are called using the syntax object . methodName ( <ARGS> ). As with functions, one argument may be unnamed, all others must be passed by name.

Grammar
This is the extended Backus-Naur form grammar for the NGLess language (using the ISO 14977 conventions). Briefly, the comma (,) is used for concatenation, [x] denotes optional, and {x} denotes zero or more of x.  (* paired is a special-case function with two arguments *) paired = "paired", '(', innerexpression, ',', innerexpression, kwargs ; funcblock = "using", '|', [ variablelist ] If not specified, the encoding is guessed from the file. Gzip and bzip2 compressed files are transparently supported (determined by file extension, .gz and .bz2 for gzip and bzip2 respectively).

paired
Function to load a paired-end sample, from two FastQ files: paired() is an exceptional function which takes two unnamed arguments, specifying the two read files (first mate and second mate) and an optional singles file (which contains unpaired reads).

Return:
ReadSet Arguments by value: samfile Loads a SAM file: This function takes no keyword arguments. BAM files are also supported (determined by the filename), as are sam.gz files. This function takes no keyword arguments. If the filename ends with ".gz", it is assumed to be a gzipped file. When a paired-end input is being preprocessed in single-mode (i.e., each mate is preprocessed independently, it can happen that on eof the mates is discarded, while the other is kept). The default is to collect these into the singles pile. If keep_singles if false, however, they are discarded.

Returns
This function also performs quality control on its output. The user must provide either a path to a FASTA file in the fafile argument or the name of a builtin reference using the reference argument. The fafile argument supports search path expansion.
A list of datasets provided by NGLess can be found at Organisms.
To use any of these, pass in the name as the reference value: NGLess does not ship with any of these datasets, but they are downloaded lazily: i.e., the first time you use them, NGLess will download and cache them. NGLess will also index any database used the first time it is used.
The option block_size_megabases turns on low memory mode (see the corresponding section in the mapping documentation) The option mode_all=True can be passed to include all alignments of both single and paired-end reads in the output SAM/BAM.

mapstats
Computes some basic statistics from a set of mapped reads (number of reads, number mapped, number uniquely mapped). If keep_if is used, then reads are kept if they pass all the conditions. If drop_if they are discarded if they fail to any condition.

Return
By default, select operates on a paired-end read as a whole. If paired=False is passed, however, then link between the two mates is not considered and each read is processed independently.

count
Given a file with aligned sequencing reads (ReadSet), count() will produce a counts If the features to count are ['seqname'], then each read will be assigned to the name of reference it matched and only an input set of mapped reads is necessary. For other features, you will need extra information. This can be passed using the gff_file or functional_map arguments. If you had previously used a reference argument for the map() function, then you can also leave this argument empty and NGLess will use the corresponding annotation file.
The gff_file and functional_map arguments support search path expansion.
features: which features to count. If a GFF file is used, this refers to the "features" field.
subfeatures: this is useful in GFF-mode as the same feature can encode multiple attributes (or, in NGLess parlance, "subfeatures"). By default, NGLess will look for the "ID" or "gene_id" attributes. Consider the following illustration of the effect of different mode options: Reference ************************* -----Position 12345 12345 12345 Read position 1 2 3 4 5 Read_1 feature sets - How to handle multiple mappers (inserts which have more than one "hit" in the reference) is defined by the multiple argument: Argument strand represents whether the data are from a strand-specific (default is false). When the data is not strand-specific, a read is always overlapping with a feature independently of whether maps to the same or the opposite strand. For strand-specific data, the read has to be mapped to the same strand as the feature.
min defines the minimum amount of overlaps a given feature must have, at least, to be kept (default: 0, i.e., keep all counts). If you just want to discard features that are exactly zero, you should set the discard_zeros argument to True.
normalization specifies if and how to normalize to take into account feature size: Note: Before version 0.6, the default was to not include the -1 fraction.
substrim Given a read finds the longest substring, such that all bases are of at least the given quality. The name is a constraction of "substring trim". For example:

smoothtrim
This trims with the same algorithm as substrim but uses a sliding window to average base qualities. For example: read = smoothtrim (read, min_quality=15, window=4) Quality values of bases at the edges of each read are repeated to allow averaging with quality centered on each base. For instance a read: Sequence A T C G with a window A A T C G G Quality 28 25 14 12 of size 3 becomes 28 28 25 14 12 12 and is smoothed: Quality scores are returned to their original value after trimming.

Implementation
NGLess uses Prodigal as the underlying gene finder.

Short reads
Short reads have the following methods: • avg_quality(): the average quality (as a double) • fraction_at_least(q): the fraction of bases of quality greater or equal to q • n_to_zero_quality(): transform the quality scores so that any N (or n) bases in the sequence get a quality of zero.

Mapped reads
Mapped reads contain several methods. None of these methods changes its argument, they return new values. The typical approach is to reassign the result to the same variable as before (see examples above).
• pe_filter: only matches where both mates match are kept. • flag: Takes one of {mapped} or {unmapped} and returns true if the reads were mapped (in a paired-end setting, a read is considered mapped if at least one of the mates mapped). • some_match: Takes a reference name and returns True if the read mapped to that reference name. • allbest: eliminates matches that are not as good as the best. For NGLess, the number of errors (given by the NM field) divided by the length of the longest match is the fractional distance of a match. Thus, a match with 3 errors over 100 bp is considered better than a match with 0 errors over 90bps.
filter filter takes a mapped read and returns a mapped read according to several criteria: • min_match_size: minimum match size • min_identity_pc: minimum percent identity (considered over the matching location, trimming on the left and right are excluded). • max_trim: maximum number of bases trimmed off the ends. Use 0 to specify only global matches.
If more than one test is specified, then they are combined with the AND operation (i.e., all conditions have to be fulfilled for the test to be true).
The default is to discard mappings that do not pass the test, but it can be changed with the action argument, which must be one of {drop} (default: the read is excluded from the output), or {unmatch} (the read is changed so that it no longer reports matching).
You can pass the flag reverse (i.e., reverse=True) to reverse the sign of the test.

NGLess Constants
In NGLess, any variable written in uppercase is a constant, i.e., can only be assigned to once. In addition, there are builtin constants defined by NGLess.

Built in constants
• ARGV This file reads a sam stream from stdin, filters it (using the select call) and writes to standard output in bam format.

Parallel module
This module allows you to run several parallel computations. It provides two functions: lock1 and collect. lock1 :: [string] -> string takes a list of strings and returns a single element. It uses the filesystem to obtain a lock file so that if multiple processes are running at once, each one will return a different element. NGLess also marks results as finished once you have run a script to completion.
The intended usage is that you simply run as many processes as inputs that you have and NGLess will figure everything out.