From: The evolutionary signal in metagenome phyletic profiles predicts many gene functions

Increasing the diversity rather than the number of metagenomes is crucial for accurate gene function prediction. a The x-axes show the number of sampled metagenomes. The y-axes show cross-validation AUPRC averaged over GO functions from a specific GO domain (column) and of a specific level of generality (row; stratified by IC). Error bars represent the standard error of the mean. Maximum diversity sampling approximately retains the proportions of samples from the environments represented in the full data set. Minimum diversity sampling always begins with the largest environment (e.g., soil); in the second experiment ("sample 2") in panel a, all samples representing soil were removed from the data and sampling started from the second largest environment. b Slopes of the linear regression fits of accuracy scores against increasing data set size for the PP and MPP data sets (using different sampling approaches), shown as the average slope of all segments connecting neighboring points in the plot; complete data are in Additional file 1: Table S6. c The number of environments contained in each data set. BP, Biological process; MF, Molecular function; CC, Cellular component; IC, Information content
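The two sampling schemes described in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `(sample_id, environment)` input format, and the seeding are hypothetical. The rule that minimum diversity sampling moves on to the next largest environment once the largest is exhausted is an assumption; the caption states only that sampling begins with the largest environment.

```python
import random
from collections import defaultdict


def group_by_environment(samples):
    """Map each environment label to the list of its sample ids."""
    groups = defaultdict(list)
    for sample_id, environment in samples:
        groups[environment].append(sample_id)
    return groups


def max_diversity_sample(samples, n, seed=0):
    """Draw roughly n samples while keeping each environment's share
    close to its share in the full data set (proportional sampling).
    Rounding means the result can differ slightly from n."""
    rng = random.Random(seed)
    groups = group_by_environment(samples)
    total = len(samples)
    chosen = []
    for ids in groups.values():
        quota = round(n * len(ids) / total)
        chosen.extend(rng.sample(ids, min(quota, len(ids))))
    return chosen


def min_diversity_sample(samples, n, seed=0):
    """Draw n samples starting from the largest environment.
    Assumption: once an environment is exhausted, sampling continues
    with the next largest one."""
    rng = random.Random(seed)
    groups = group_by_environment(samples)
    by_size = sorted(groups, key=lambda env: len(groups[env]), reverse=True)
    chosen = []
    for environment in by_size:
        needed = n - len(chosen)
        if needed <= 0:
            break
        ids = groups[environment]
        chosen.extend(rng.sample(ids, min(needed, len(ids))))
    return chosen
```

With a toy data set of 60 soil, 30 marine, and 10 gut samples, drawing 20 samples by the first function covers all three environments, while the second function returns soil samples only. This mirrors the contrast between the sampling curves in panel a and the environment counts in panel c.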

