File descriptions (version 220)

genome fasta - Fasta files for all genomes in the GlobDB. Fasta headers have been renamed using the anvi'o script anvi-script-reformat-fasta to simplify the contig header lines and include the GlobDB genome ID at the start of every contig.

anvi'o databases - Anvi'o contigs databases for all GlobDB genomes, generated using the program anvi-gen-contigs-database from the fasta files provided in the "genome fasta" archive. These anvi'o databases include the gene calling and annotation information of the genomes, and can be used to generate the protein fasta and gff files. Furthermore, the anvi'o databases can be used as a resource for phylogenomics, pangenomics, or other comparative analyses. Note that the anvi'o contigs databases were created with the development version of anvi'o and can therefore only be used with that version.

protein fasta - Fasta files for all proteins in the GlobDB genomes, called using Prodigal as integrated in anvi'o and exported using the program anvi-get-sequences-for-gene-calls. Note that the headers of all proteins were renamed after export to include the GlobDB genome ID and to be consistent with the IDs in the exported gff files.

gff cog - Files in GFF format exported from anvi'o for all GlobDB genomes, with COG annotations generated by the anvi'o program anvi-run-ncbi-cogs. See the "Methods" page for more details.

gff kegg - Files in GFF format exported from anvi'o for all GlobDB genomes, with KEGG annotations generated by the anvi'o program anvi-run-kegg-kofams. See the "Methods" page for more details.

gff pfam - Files in GFF format exported from anvi'o for all GlobDB genomes, with Pfam annotations generated by the anvi'o program anvi-run-pfams. See the "Methods" page for more details.

cugo - Twelve-column tab-delimited files derived from "gff cog" for all GlobDB genomes. Used for consensus genomic context visualization.

tmhmm - Two-column tab-delimited files for each GlobDB genome, giving the number of transmembrane segments predicted by TMHMM for each protein.

genome statistics - Output of the `statswrapper.sh` program from the BBMap package for all genomes in the GlobDB.

taxonomy - Tab-delimited file containing the GlobDB genome ID and the seven-level taxonomy (assigned by GTDB-Tk), with each level as a separate field.

md5sums - md5sums of the downloadable files
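
The md5sums file can be used to check that a download completed without corruption. The following is a minimal sketch rather than an official GlobDB tool. It assumes the md5sums file uses the usual output format of the `md5sum` program (a checksum, whitespace, then a file name), which this page does not state explicitly. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Parse lines of the assumed form "<checksum>  <file name>" into {file name: checksum}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        checksum, name = line.split(maxsplit=1)
        # md5sum marks binary-mode entries with a leading "*" on the file name.
        expected[name.lstrip("*")] = checksum.lower()
    return expected


def verify_downloads(md5sums_path, directory="."):
    """Return {file name: True, False, or None}; None means the file was not found."""
    expected = parse_md5sums(Path(md5sums_path).read_text())
    results = {}
    for name, checksum in expected.items():
        target = Path(directory) / name
        results[name] = md5_of_file(target) == checksum if target.exists() else None
    return results
```

Calling `verify_downloads` with the path of the downloaded md5sums file and the download directory returns one entry per listed file, so archives that were not downloaded show up as `None` rather than as failures.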

Download links

Dictionary files to translate the GlobDB IDs to the published IDs of each data source
GEM dictionary (1.7 Mb) - tab delimited file
SPIRE dictionary (6 Mb) - tab delimited file
SMAG dictionary (695 Kb) - tab delimited file
dictionary md5sums - md5sums for the dictionary files
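
As an illustration of how a dictionary file might be used, the sketch below loads one into a lookup table. The column layout is an assumption: this page only says the files are tab delimited, so the default of GlobDB ID in the first column and published ID in the second should be checked against the actual file and adjusted if needed. The function name is hypothetical.

```python
import csv


def load_id_dictionary(path, globdb_column=0, source_column=1):
    """Read a tab-delimited dictionary file into {GlobDB ID: published ID}.

    The default column positions are assumptions, not documented facts;
    inspect the file and pass different indices if the layout differs.
    """
    mapping = {}
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) <= max(globdb_column, source_column):
                continue  # skip blank or short lines
            mapping[row[globdb_column]] = row[source_column]
    return mapping
```

If the file turns out to start with a header line, that line ends up as an ordinary entry in the mapping and can simply be removed after loading.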

These are the available files for version 220.
Each tar archive contains a directory with 202,601 files, one for each genome in the GlobDB.
genome fasta (159 Gb) - tar archive
anvi'o databases (382 Gb) - tar archive 
protein fasta (98 Gb) - tar archive
gff cog (19 Gb) - tar archive
gff kegg (13 Gb) - tar archive
gff pfam (14 Gb) - tar archive
cugo (11 Gb) - tar archive
tmhmm (1.3 Gb) - tar archive
taxonomy (21 Mb) - tab delimited file
genome statistics (23 Mb) - tab delimited file
md5sums (612 b) - md5sums of the other files
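
Because each tar archive holds 202,601 files, it can be convenient to read members directly from the archive instead of unpacking everything to disk. The sketch below shows one way to do this with Python's standard `tarfile` module. The function names are hypothetical. Whether the archives are compressed is not stated on this page, so the stream mode `"r|*"` is used because it handles both plain and compressed tar files.

```python
import tarfile


def iter_archive_files(archive_path):
    """Yield (member name, binary file object) for each regular file in a tar archive.

    The archive is read as a forward-only stream, so each file object must be
    consumed before moving on to the next member.
    """
    with tarfile.open(archive_path, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip the directory entry and any other non-file members
            handle = archive.extractfile(member)
            if handle is not None:
                yield member.name, handle


def count_fasta_records(handle):
    """Count fasta records by counting header lines, which start with '>'."""
    return sum(1 for line in handle if line.startswith(b">"))
```

For example, looping over `iter_archive_files` on the downloaded genome fasta archive and calling `count_fasta_records` on each file object gives the number of contigs per genome without writing any extracted files.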

All version 220 files can also be accessed here

Previous releases

The current release of the GlobDB (GlobDB 220) is the first public release of this resource. In the future, links to previous releases will appear here.