GlobDB & other custom resources
GlobDB - a preprint introducing the GlobDB (Speth et al. 2025) is available on arXiv
Data sources
Genome taxonomy database (GTDB) - Parks et al. 2022. (complete list of references here.)
mOTUs online database (mOTU) - Dmitrijeva et al. 2025
Searchable, planetary-scale microbiome resource (SPIRE) - Schmidt et al. 2023.
Rare biosphere genomes (RBG) - Aroney et al. 2024
Genomic catalog of Earth's microbiomes (GEM) - Roux et al. 2021.
MGnify Genomes - Gurbich et al. 2023
Global ocean microbiome genome catalogue (GOMC) - Chen et al. 2024
Genomic catalog of soil microbiomes (SMAG) - Ma et al. 2023.
Tibetan Plateau Microbial Catalog (TPMC) - Cheng et al. 2024
Mouse and Human Reference Gut Microbiome (MRGM & HRGM2) - Kim et al. 2024 & Ma et al. 2024
curated Food Metagenomic Data (cFMD) - Carlino et al. 2024
Sheep and Goat gut MAGs (SHGO) - Zhang et al. 2024
Anammox microbiota MAGs (AMXMAG) - Wang et al. 2024
Glacier-fed streams (GFS) MAGs - Michoud et al. 2025
Tools
fastANI - Dereplication of the datasets is done using fastANI, version 1.3.4, described in Jain et al. 2018.
dRep - The cFMD, SHGO, and AMXMAG datasets are not available for download in species-dereplicated form. After sequential dereplication against the GlobDB, the remaining genomes in these datasets were dereplicated using dRep, described in Olm et al. 2017.
Anvi'o - Anvi'o contigs databases for the GlobDB genomes, as well as the protein files and GFF files, were created using anvi'o, development version, described in Eren et al. 2021. In addition, anvi'o provides a wide range of functionalities, some of which come with their own citations. If you use the GlobDB anvi'o databases in your work, please check here whether there is work that would be appropriate to cite.
Prodigal - Protein-coding genes in the GlobDB genomes were called using Prodigal, version 2.6.3, described in Hyatt et al. 2010. If you use the protein data files in your work, please cite Prodigal.
GTDB-Tk - Taxonomic assignment for the genomes in the GlobDB that are not derived from the GTDB is described on the methods page and uses GTDB-Tk, described in Chaumeil et al. 2022.
BacDive - Availability of GlobDB genomes in culture collections is assessed using BacDive, described in Schober et al. 2025.
CheckM2 - Completeness, contamination, and basic statistics on the GlobDB genomes were calculated using CheckM2, described in Chklovski et al. 2023.
ProtTrans - Protein language model embeddings were calculated using the ProtT5-XL-U50 model described in Elnaggar et al. 2022
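
As a rough illustration of the sequential dereplication step mentioned under fastANI and dRep, the sketch below filters a set of candidate genomes down to those with no close match in an existing reference set. It is not the GlobDB pipeline itself. It assumes fastANI's tab-separated output (query, reference, ANI, mapped fragments, total fragments) and a 95% ANI species cutoff; the cutoff, the function names, and the file names are assumptions made for this example and are not stated on this page.

```python
import csv
import io

# Assumed species-level ANI cutoff; this value is not given in the text above.
ANI_THRESHOLD = 95.0


def parse_fastani(text):
    """Parse tab-separated fastANI-style output into rows of string fields."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def novel_genomes(candidates, fastani_rows, threshold=ANI_THRESHOLD):
    """Return candidates with no reference hit at or above the threshold."""
    matched = set()
    for query, _reference, ani, *_rest in fastani_rows:
        if float(ani) >= threshold:
            matched.add(query)
    # Keep the input order so the result is reproducible.
    return [genome for genome in candidates if genome not in matched]


# Hypothetical fastANI output: two candidates compared against one reference.
example_output = (
    "new_a.fa\tref_1.fa\t98.7\t950\t1000\n"
    "new_b.fa\tref_1.fa\t82.1\t400\t1000\n"
)

rows = parse_fastani(example_output)
# new_c.fa has no reported hit at all, so it is treated as novel.
remaining = novel_genomes(["new_a.fa", "new_b.fa", "new_c.fa"], rows)
print(remaining)  # ['new_b.fa', 'new_c.fa']
```

In this sketch, the genomes left in `remaining` would be the ones passed on to a further dereplication step such as dRep. Candidates that do not appear in the output are treated as novel, since fastANI does not report pairs with very low similarity.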