GlobDB & AASTK

GlobDB
A manuscript introducing the GlobDB (Speth et al. 2025) has been published in Bioinformatics Advances.

AASTK
There is no publication describing AASTK yet, so please cite the GitHub repository (https://github.com/dspeth/aastk) when you use AASTK.

In addition, several parts of the software were developed independently and should be credited.

If you use AASTK with the GlobDB protein dataset, please cite:
Speth et al. (2025) GlobDB: a comprehensive species-dereplicated microbial genome resource
https://doi.org/10.1093/bioadv/vbaf280

If you use aastk pasr, please cite:
Speth and Orphan (2018) Metabolic marker gene mining provides insight in global mcrA diversity and, coupled with targeted genome reconstruction, sheds further light on metabolic potential of the Methanomassiliicoccales
https://doi.org/10.7717/peerj.5614

The environmental data from aastk meta is derived from the MetaCoOc software. A manuscript is in preparation, but in the meantime please cite:
https://github.com/bcoltman/metacooc

 

Data sources

Genome Taxonomy Database (GTDB) - Parks et al. 2022 (complete list of references here)
mOTUs online database (mOTU) - Dmitrijeva et al. 2025
Searchable, planetary-scale microbiome resource (SPIRE) - Schmidt et al. 2023
Rare biosphere genomes (RBG) - Aroney et al. 2025
Genomic catalog of Earth's microbiomes (GEM) - Roux et al. 2021
MGnify Genomes - Gurbich et al. 2023
Global ocean microbiome genome catalogue (GOMC) - Chen et al. 2024
Genomic catalog of soil microbiomes (SMAG) - Ma et al. 2023
Tibetan Plateau Microbial Catalog (TPMC) - Cheng et al. 2024
Mouse and Human Reference Gut Microbiome (MRGM & HRGM2) - Kim et al. 2024 & Ma et al. 2025
curated Food Metagenomic Data (cFMD) - Carlino et al. 2024
Sheep and Goat gut MAGs - Zhang et al. 2024
Anammox microbiota MAGs - Wang et al. 2024
Glacier-fed streams (GFS) MAGs - Michoud et al. 2025

 

Tools

fastANI - The datasets were dereplicated using fastANI, version 1.3.4, described in Jain et al. 2018.

dRep - The cFMD, sheep and goat gut (SHGO), and anammox microbiota (AMXMAG) datasets did not provide species-dereplicated genome sets for download. After sequential dereplication against the GlobDB, the remaining genomes in these datasets were dereplicated using dRep, described in Olm et al. 2017.

Anvi'o - The anvi'o contigs databases for the GlobDB genomes, as well as the protein files and GFF files, were created using the development version of anvi'o, described in Eren et al. 2021. In addition, anvi'o provides a wide range of functionalities, some of which come with their own citations. If you use the GlobDB anvi'o databases in your work, please check here whether there is work that would be appropriate to cite.

Prodigal - Protein-coding genes in the GlobDB genomes were called using Prodigal, version 2.6.3, described in Hyatt et al. 2010. If you use the protein data files in your work, please cite Prodigal.

GTDB-Tk - Taxonomic assignment for the genomes in the GlobDB that are not derived from the GTDB is described on the methods page and uses GTDB-Tk, described in Chaumeil et al. 2022.

BacDive - Availability of GlobDB genomes in culture collections is assessed using BacDive, described in Schober et al. 2025.

CheckM2 - Completeness, contamination, and basic statistics on the GlobDB genomes were calculated using CheckM2, described in Chklovski et al. 2023.

ProtTrans - Protein language model embeddings were calculated using the ProtT5-XL-U50 model, described in Elnaggar et al. 2022.