Syntrophic acetate oxidation replaces acetoclastic methanogenesis during thermophilic digestion of biowaste

Background
Anaerobic digestion (AD) is a globally important technology for effective waste and wastewater management. In AD, microorganisms interact in a complex food web to produce biogas. Here, acetoclastic methanogens and syntrophic acetate-oxidizing bacteria (SAOB) compete for acetate, a major intermediate in the mineralization of organic matter. Although evidence is emerging that syntrophic acetate oxidation is an important pathway for methane production, knowledge about the SAOB is still very limited.

Results
A metabolic reconstruction of metagenome-assembled genomes (MAGs) from a thermophilic solid-state biowaste digester covered the basic functions of the biogas microbial community. Firmicutes was the most abundant phylum in the metagenome (53%), harboring species that participate in functions ranging from the hydrolysis of polymers to syntrophic acetate oxidation. The Wood-Ljungdahl pathway for syntrophic acetate oxidation and the corresponding genes for energy conservation were identified in a Dethiobacteraceae MAG that is phylogenetically related to known SAOB. 16S rRNA gene amplicon sequencing and enrichment cultivation consistently identified the uncultured Dethiobacteraceae together with Syntrophaceticus, Tepidanaerobacter, and unclassified Clostridia as members of a potential acetate-oxidizing core community in nine full-scale digesters, whereas acetoclastic methanogens were barely detected.

Conclusions
The results presented here provide new insights into a remarkable anaerobic digestion ecosystem in which acetate catabolism is mainly realized by Bacteria. Metagenomics and enrichment cultivation revealed a core community of diverse, novel, uncultured acetate-oxidizing bacteria and point to a particular niche for them in the dry fermentation of biowaste. Their genomic repertoire suggests metabolic plasticity in addition to the potential for syntrophic acetate oxidation.

see whether adapter trimming was successful and how it impacted merging performance. Although the BBMerge insert detail file was used for data analysis, it lacks information on ambiguous joins; these and other statistics were therefore extracted from the BBMerge log file using regular expressions in R with the stringr [12] package.
The reads used for annotation were processed further. BBMask [13] replaced low-complexity sections (minimum pentamer entropy of 0.75 in an 80 bp window) in the unmergeable R1 reads and in the merged reads with undetermined base calls. Cutadapt then quality-trimmed the reads with a cutoff of 20 and removed undetermined base calls at the read ends. The minimum length for annotation reads was set to 100 bp; shorter reads were discarded. For the classification FASTQ file, the quality- and low-complexity-trimmed merged and unmerged R1 reads were combined into a single query file. Data on the quality trimming process was also read from the Cutadapt minimal report file.
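For orientation, the masking and trimming steps could be reproduced roughly as follows. This is a minimal sketch assuming the BBTools and Cutadapt command-line interfaces are installed; all file names are placeholders.

```python
# Minimal sketch of the masking and trimming steps; file names are placeholders.
import subprocess

# Mask low-complexity regions (entropy < 0.75 in an 80 bp window, pentamers
# being the default k-mer size) with undetermined base calls (Ns)
subprocess.run(
    ["bbmask.sh", "in=merged.fastq", "out=merged_masked.fastq",
     "entropy=0.75", "window=80"],
    check=True,
)

# Quality-trim at cutoff 20, trim Ns at the ends, drop reads shorter than 100 bp
subprocess.run(
    ["cutadapt", "-q", "20", "--trim-n", "-m", "100",
     "-o", "merged_trimmed.fastq", "merged_masked.fastq"],
    check=True,
)
```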

Read Classification
Reads were classified with three different methods to provide abundance and diversity data. The small subunit rRNA (SSU) non-redundant database, version 132, from the SILVA [14] project was used as the reference database for BBMap [15]. Reads were mapped using default options and a minimum identity (minid) of 76%. The pre-aligned ARB file from the SILVA database served as reference data for classification with SINA [16]. The BAM file produced by BBMap was parsed with the Rsamtools R library and merged with the SINA results. For coding gene classification with Kaiju [17], we used the NCBI non-redundant (nr) RefSeq database. Kaiju was run in its default "greedy" mode with default parameters. The resulting taxonomy is based on the NCBI taxonomy.
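As an illustration, the SSU rRNA mapping step could be invoked along these lines; a sketch assuming the BBMap CLI, with the SILVA reference path and file names as placeholders.

```python
# Sketch of the SSU rRNA read mapping against SILVA 132 with BBMap;
# reference and file paths are placeholders.
import subprocess

subprocess.run(
    ["bbmap.sh",
     "ref=SILVA_132_SSURef_Nr99_tax_silva.fasta",  # placeholder path
     "in=classification_reads.fastq",
     "out=ssu_mapped.bam",
     "minid=0.76"],  # minimum identity of 76% as stated above
    check=True,
)
```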
Sourmash [18] was used to calculate signatures for each sample, with the k-mer size set to 31. The Genome Taxonomy Database (GTDB) taxonomy reference, release 89 [19], was used for LCA classification. The results from all three tools were converted into a format compatible with Krona [20].
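The sourmash steps might look as follows; a sketch using the sourmash command names contemporary with this workflow (the `compute` subcommand has since been superseded by `sketch`), with database and file names as placeholders.

```python
# Sketch of the sourmash steps: compute a k=31 signature per sample and
# classify it against a GTDB release 89 LCA database; paths are placeholders.
import subprocess

subprocess.run(
    ["sourmash", "compute", "-k", "31", "--scaled", "1000",
     "-o", "sample.sig", "classification_reads.fastq"],
    check=True,
)
subprocess.run(
    ["sourmash", "lca", "classify",
     "--db", "gtdb_release89.lca.json",  # placeholder database name
     "--query", "sample.sig"],
    check=True,
)
```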

Read Annotation
Reads were further used as input for DIAMOND [1] BLASTX searches against different protein databases. To minimize false positive results, the e-value filter was set to 10⁻⁵, and only the best hit with the lowest e-value for every query read was kept during result processing. A custom database was constructed to find formate dehydrogenase enzymes, using a UniProtKB [2] query that searched for proteins with the relevant enzyme classes and KEGG [21,22] orthology terms. Another database was based on the hydrogenase enzymes from the HydDB [23] project. Carbohydrate-active enzymes were identified using the latest CAZyDB [24] database version (07-31-2019) of the dbCAN [25] project.
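The e-value filter and best-hit selection could be implemented along these lines; a sketch assuming tab-separated DIAMOND output (outfmt 6), with database and file names as placeholders.

```python
# Sketch of the DIAMOND search plus best-hit filtering; database and file
# names are placeholders. Only the hit with the lowest e-value is kept
# per query read, after an e-value cutoff of 1e-5.
import subprocess

subprocess.run(
    ["diamond", "blastx", "-d", "fdh_custom.dmnd",
     "-q", "annotation_reads.fastq",
     "-e", "1e-5", "--outfmt", "6", "-o", "fdh_hits.tsv"],
    check=True,
)

best = {}
with open("fdh_hits.tsv") as fh:
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        query, evalue = fields[0], float(fields[10])  # outfmt 6: e-value in column 11
        if query not in best or evalue < best[query][1]:
            best[query] = (fields[1], evalue)
```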

Assembly
The MEGAHIT [26] assembler was used with the trimmed and merged reads, passing the unmerged reads as paired and the merged reads as single reads. To optimize for low-coverage, high-diversity samples, the k-mer increase between iterations was set to 10, starting at 27 and ending at 127. Otherwise, default parameters were used, including the minimum k-mer multiplicity of 2 (--min-count) and the minimum output contig length of 200 (--min-contig-len).
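The corresponding MEGAHIT call would look roughly as follows; read file names are placeholders.

```python
# Sketch of the MEGAHIT call with the k-mer schedule described above;
# read file names are placeholders.
import subprocess

subprocess.run(
    ["megahit",
     "-1", "unmerged_R1.fastq", "-2", "unmerged_R2.fastq",  # unmerged pairs
     "-r", "merged.fastq",                                  # merged reads as single
     "--k-min", "27", "--k-max", "127", "--k-step", "10",
     "--min-count", "2", "--min-contig-len", "200",
     "-o", "megahit_out"],
    check=True,
)
```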
To calculate the coverage depth for binning, the trimmed reads were mapped to the assembly with BBMap, using parameters similar to those recommended for binning with MetaBAT2 [27]. Assembly quality statistics were calculated using metaQUAST [28] in Prokarya mode. The minimum contig length (-m) was set to 200. rRNA detection with Barrnap (--rna-finding) and a single-copy ortholog search similar to CheckM [29], using BUSCO [30] (-b), were enabled. Contig classification was performed with the CAT/BAT [31] workflow and sourmash. Sourmash was also applied to calculate the signature of every contig (k set to 31). The same sourmash database used for read classification was used for contig and bin classification.
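The depth table consumed by the binner could be produced along these lines; a sketch assuming the jgi_summarize_bam_contig_depths helper shipped with MetaBAT2 and a coordinate-sorted BAM file, with file names as placeholders.

```python
# Sketch of the coverage depth calculation for binning: map the trimmed reads
# back to the assembly with BBMap, sort the BAM, then summarize per-contig
# depths with the helper shipped with MetaBAT2. File names are placeholders.
import subprocess

subprocess.run(
    ["bbmap.sh", "ref=final.contigs.fa", "in=trimmed_reads.fastq",
     "out=assembly_mapped.bam"],
    check=True,
)
subprocess.run(
    ["samtools", "sort", "-o", "assembly_sorted.bam", "assembly_mapped.bam"],
    check=True,
)
subprocess.run(
    ["jgi_summarize_bam_contig_depths", "--outputDepth", "depth.txt",
     "assembly_sorted.bam"],
    check=True,
)
```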

Binning
Binning was performed with MetaBAT2 [27] using default parameters; the pseudo-random seed was fixed to 42. Completeness and contamination were calculated with the lineage-specific workflow in CheckM [29]. CheckM was also used to calculate tetranucleotide frequencies, depth data, and composition data. The CAT Prodigal and DIAMOND results were reformatted for use in the multi-bin BAT workflow to taxonomically classify the bins. A fraction threshold below 50% allows multiple taxonomic assignments for the same bin. Each bin was also classified again using sourmash.
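The binning and quality estimation steps could be invoked as follows; directory and file names are placeholders, with the seed fixed to 42 as stated above.

```python
# Sketch of the binning and bin quality estimation; paths are placeholders.
import subprocess

subprocess.run(
    ["metabat2", "-i", "final.contigs.fa", "-a", "depth.txt",
     "-o", "bins/bin", "--seed", "42"],
    check=True,
)
# Lineage-specific completeness/contamination estimates with CheckM
subprocess.run(
    ["checkm", "lineage_wf", "-x", "fa", "bins/", "checkm_out/"],
    check=True,
)
```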

Reporting
Most data processing tasks were performed with the data.table [32] R library or simple Python scripts. The final data aggregation steps in the workflow combined the data files from the different samples into a single file. The reports were based on RMarkdown [33] files and additionally used the plotly [8] and FlexDashboard [34] libraries for interactive single-file HTML outputs. All reports contain file information such as size, modification date, and MD5 hash sum for all input files, to minimize the risk of mixing results from different runs and to indicate changes. Details on individual reports are given at www.knutt.org.

KnuttAnnotation
The second pipeline combines the annotation tools that were run on the reconstructed bins. It combines the output files from every tool into general feature format (GFF) files for the contigs and ORFs, as well as different tabular summary files.

MetaErg
MetaErg [35] was used for annotation of the contigs. It was executed with version 132 of the SILVA database for SSU and LSU rRNA classification. The default minimum contig length of 200 base pairs and the default minimum ORF length of 180 amino acids were used. The Prodigal output from MetaErg was used as input for the subsequent ORF annotation tools. The output ORF FASTA file was split into multiple chunks for easier parallelization.

dbCAN
The dbCAN annotation steps were reimplemented with Snakemake and R for workflow integration and stability. The dbCAN version was the same as that used for read annotation. A DIAMOND search was also performed against the transporter class database with an e-value of 10⁻¹⁰. Hotpep [36] annotation was executed with the minimum number of unique k-mers set to 6 and the sum of the conserved k-mer frequencies to 2.6. The tool hmmscan [37] was executed with the CAZy models, the transcription factor models tf1/tf2, and a set of signal transduction protein models. E-value and coverage parameters are listed in Table S5. The search for carbohydrate-active enzyme gene clusters (CGCs) was also reimplemented, as sketched below. The maximum number of spacer genes without any dbCAN annotation within a cluster was set to 1.
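The reimplemented HMM step could take the following shape; a sketch assuming the hmmscan domain table output, with file names and the e-value/coverage cutoffs as placeholders (the actual values are in Table S5).

```python
# Sketch of the reimplemented dbCAN HMM step: run hmmscan against the CAZy
# models and keep hits passing e-value/coverage cutoffs. File names and the
# cutoff values shown here are placeholders.
import subprocess

subprocess.run(
    ["hmmscan", "--domtblout", "cazy.domtbl", "dbCAN-HMMdb-V8.txt", "orfs.faa"],
    check=True,
)

hits = []
with open("cazy.domtbl") as fh:
    for line in fh:
        if line.startswith("#"):
            continue
        f = line.split()
        # domtblout columns (1-based): 1 target, 3 tlen, 4 query,
        # 13 i-Evalue, 16/17 hmm from/to
        evalue = float(f[12])
        coverage = (int(f[16]) - int(f[15]) + 1) / int(f[2])  # HMM coverage
        if evalue <= 1e-15 and coverage >= 0.35:  # placeholder cutoffs
            hits.append((f[3], f[0], evalue, coverage))
```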

HydDB
The HydDB online tool does not provide an official API. Therefore, the HTTP calls between the data entry forms were emulated in a Python script. HTTP communication was performed with the requests [38] package, and the returned HTML pages were parsed with the BeautifulSoup [39] package. This was necessary to read the Cross-Site-Request-Forgery (CSRF) token and to detect the field names, for example for resubmitting the downstream sequence of [FeFe] hydrogenases. Data processing was done with the pandas [40] Python package and FASTA parsing with Biopython [41]. Candidates to be submitted to the HydDB classifier were selected with an RPS-BLAST search using an e-value of 10⁻¹⁰. For every query, the longest non-overlapping annotations were kept. The filtered set of sequences was chunked and then submitted to the classifier.
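The emulated form submission might look roughly as follows; a minimal sketch assuming a Django-style form with a hidden CSRF token. The URL and all field names are placeholders, since the real ones were detected at runtime from the parsed pages.

```python
# Sketch of the emulated HydDB form submission; the URL and field names
# are placeholders detected at runtime in the actual script.
import requests
from bs4 import BeautifulSoup

URL = "https://services.birc.au.dk/hyddb/"  # assumed entry form URL

with requests.Session() as session:
    page = session.get(URL)
    soup = BeautifulSoup(page.text, "html.parser")
    # Read the hidden CSRF token embedded in the form
    token = soup.find("input", {"name": "csrfmiddlewaretoken"})["value"]
    # Field names were detected from the parsed form in the same way
    data = {"csrfmiddlewaretoken": token, "sequences": ">seq1\nMKT..."}
    result = session.post(URL, data=data, headers={"Referer": URL})
```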

KofamKOALA, InterProScan and eggNOG-mapper
For KEGG orthology annotation, KofamKOALA [42] was used with default parameters on the ORFs. Module mapping and reconstruction was performed with a custom R script using the KEGGREST [43] library. Only modules with at least three blocks were allowed to have one missing block. The ortholog search e-value threshold was set to 10⁻³ and the bit score threshold to 60. All ortholog types were used for annotation, as was the automatic scope detection.
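The KofamKOALA step corresponds to the standalone kofam_scan tool and could be invoked roughly as follows; profile and file paths are placeholders.

```python
# Sketch of the KofamKOALA/kofam_scan invocation on the predicted ORFs;
# profile and KO list paths are placeholders.
import subprocess

subprocess.run(
    ["exec_annotation",
     "-p", "profiles/",    # KOfam HMM profiles
     "-k", "ko_list",      # per-KO adaptive score thresholds
     "-f", "detail-tsv",   # tabular output with scores and e-values
     "-o", "kofam_result.tsv",
     "orfs.faa"],
    check=True,
)
```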

KnuttBinPhylo
The KnuttBinPhylo pipeline produces a phylogenetic tree based on a concatenated alignment of different marker proteins, extracted from the bins and from reference proteins in UniProt Proteomes [2].
The marker candidates are configurable and given as Pfam [45] family entries. The default selection is based on the category A set of markers compiled in the PhyloM bacteria set [46].
The hidden Markov models were downloaded directly from the Pfam database. Proteins annotated with those markers were queried from UniProtKB using a search restricted to proteins included in non-redundant, non-excluded proteomes belonging to the Archaea or Bacteria superkingdoms. Each set of reference marker proteins was then aligned using MAFFT [47][48][49] with automatic algorithm selection (--auto) and support for unusual characters (--anysymbol).
The markers were filtered so that the selected markers covered at least 75% of the Archaea, Bacteria, and "unclassified" proteomes, respectively. Proteomes were also filtered, requiring 50% of the selected markers. To resolve multiple marker occurrences in one proteome, the Jukes-Cantor [50] (JC) distances to a subsample of 1000 entries in the marker reference alignment were determined. If the fraction of differing positions between two compared sequences reached the maximum resolvable by the JC correction for amino acids (95%), the distance was set to infinite. Positions with gaps in both sequences were ignored. The distances to the 1000 sampled entries were summarized as the median, which was then used to rank the conflicting entries for each proteome; the lowest-scoring entry was finally selected.
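The pairwise distance could be computed as sketched below, assuming the amino-acid form of the Jukes-Cantor correction (20 states), whose correction diverges once 95% of compared positions differ; the function name is a placeholder.

```python
# Sketch of the pairwise JC distance used for resolving duplicate markers,
# assuming the 20-state amino-acid form of the Jukes-Cantor correction.
import math

def jc_distance(a: str, b: str, gap: str = "-") -> float:
    # Compare aligned sequences, ignoring columns that are gaps in both
    pairs = [(x, y) for x, y in zip(a, b) if not (x == gap and y == gap)]
    if not pairs:
        return math.inf
    p = sum(x != y for x, y in pairs) / len(pairs)
    if p >= 0.95:  # beyond the maximum resolvable by the correction
        return math.inf
    return -(19 / 20) * math.log(1 - (20 / 19) * p)
```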
To analyze the bins, protein-encoding genes were predicted with Prodigal [51] in anonymous/meta mode. HMMER's hmmsearch [37], with the gathering cutoff enabled, was used to search for marker candidates in these predicted proteomes, and the hit regions were extracted. The marker hits in the bins were added to the respective marker reference alignment with MAFFT's add/keeplength options and the FFT-NS-1 method. Bins were also required to have 50% of the selected markers, and the same multiple-hit resolution method was applied to them. Missing marker sequences were filled with gap symbols when concatenating the marker sequences into the alignment sequence.
The tree was built with close and distant references. Selection of these references started by using the concatenated alignment to calculate the JC distances between the bins and all selected reference proteomes. The close references were the five references closest to the bin. For the distant references, the 500 closest references were taken first, and the JC distances among all those entries were calculated to construct a distance matrix. This matrix was used for hierarchical clustering with the R hclust "ward.D2" method, and the resulting dendrogram was cut into 15 groups. From each group, the reference closest to the bin was used as a distant reference. For RAxML [52], a partitioning file was generated separating the concatenated sequences, with each region modeled with the LG model. RAxML tree calculations were performed with the "classic" RAxML using 100 rapid bootstraps for confidence on the best-scoring maximum likelihood tree.
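The distant-reference selection could be re-expressed as follows; an illustrative sketch using SciPy rather than R's hclust, where Ward linkage on the condensed distance matrix corresponds roughly to hclust's "ward.D2". File and variable names are placeholders.

```python
# Illustrative sketch of the distant-reference selection, re-expressed with
# SciPy instead of R's hclust; file and variable names are placeholders.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

dist = np.loadtxt("jc_distances.tsv")   # 500 x 500 JC distance matrix
to_bin = np.loadtxt("jc_to_bin.tsv")    # distance of each reference to the bin

tree = linkage(squareform(dist, checks=False), method="ward")
groups = fcluster(tree, t=15, criterion="maxclust")  # cut into 15 groups

# From each group, keep the reference closest to the bin
distant_refs = [np.where(groups == g)[0][np.argmin(to_bin[groups == g])]
                for g in np.unique(groups)]
```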