GlobDB release 226

Downloads

All release 226 files described below can be accessed here

 

File descriptions (release 226)

globdb_r226_anvio_dbs.tar.gz
Anvi'o contigs databases for all GlobDB genomes, generated using the program anvi-gen-contigs-database from the fasta files provided in globdb_r226_genome_fasta.tar.gz. These anvi'o databases include the gene calling and annotation information of the genomes, and can be used to generate the protein fasta and gff files. Furthermore, the anvi'o databases can be used as a resource for phylogenomics, pangenomics, or other comparative analyses. Note that the anvi'o contigs databases are created with the development version of anvi'o and therefore can only be used with that version.

globdb_r226_dictionaries.tar.gz
Dictionaries of the genome identifiers in source datasets and the corresponding standardized GlobDB IDs. No dictionary is needed for GTDB, MGnify, and MRGM/HRGM2 as the genome identifiers in these datasets were already sufficiently standardized.

globdb_r226_genome_fasta.tar.gz
FastA files for all genomes in the GlobDB. Fasta headers have been renamed using the anvi'o script anvi-script-reformat-fasta to simplify the contig header lines and include the GlobDB genome ID at the start of every contig.

globdb_r226_gff_cog.tar.gz
GFF files exported from anvi'o for all GlobDB genomes, including the COG annotations generated with the anvi'o program anvi-run-ncbi-cogs. See the methods page for more details.

globdb_r226_md5sum.txt
md5sums of the downloadable files in the top level directory.

globdb_r226_protein_annotations.tar.gz
Tab-delimited files of the KEGG/COG/Pfam/dbCAN2 annotations for all GlobDB genomes.

globdb_r226_protein_fasta.tar.gz
FastA files of all proteins in the GlobDB genomes, as predicted by Prodigal (integrated in anvi'o) and exported using the program anvi-get-sequences-for-gene-calls. Note that the headers of all proteins were renamed after export to include the GlobDB genome ID and to be consistent with the IDs in the exported GFF files.

globdb_r226_taxonomy.tsv.gz
Two-column tab-delimited file containing the GlobDB genome ID and the full seven-level taxonomy of each genome. Taxonomy assignment of non-GTDB genomes is described on the methods page.
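
The following is a minimal Python sketch of how the taxonomy file could be read. It assumes that the second column holds a GTDB-style lineage string with seven semicolon-separated ranks (for example d__...;p__...;...;s__...); that layout, the rank names, and the function names split_lineage and read_taxonomy are assumptions made for illustration and are not stated on this page.

```python
import csv
import gzip

# Rank names for a seven-level taxonomy (assumed order: domain to species).
RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def split_lineage(lineage):
    """Split a semicolon-separated lineage into a rank -> name dict.

    Assumes GTDB-style strings such as 'd__Bacteria;p__...'; the 'x__'
    rank prefixes are stripped when present.
    """
    names = [part.strip() for part in lineage.split(";")]
    if len(names) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} ranks, got {len(names)}: {lineage!r}")
    cleaned = [n[3:] if len(n) >= 3 and n[1:3] == "__" else n for n in names]
    return dict(zip(RANKS, cleaned))


def read_taxonomy(path):
    """Yield (genome_id, rank_dict) pairs from the gzipped two-column TSV."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != 2:
                continue  # skip blank or malformed lines
            genome_id, lineage = row
            try:
                yield genome_id, split_lineage(lineage)
            except ValueError:
                continue  # e.g. a header line, if the file has one
```

Calling read_taxonomy("globdb_r226_taxonomy.tsv.gz") streams the file one genome at a time, so the full table never has to be held in memory. Rows that do not parse are skipped silently here; stricter handling may be preferable once the actual layout has been checked.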

globdb_r226_tax_plus_stats.tsv.gz
Tab-delimited file containing taxonomy, basic genome statistics, and contamination/completeness estimates for all GlobDB genomes.

pLM_embeddings
Directory containing the files associated with the ProtT5 protein language model embeddings of 82,972,511 GlobDB protein cluster representatives

pLM_embeddings/globdb_r226_linclust_2plus_annotations.tsv.gz
Subset of the KEGG/COG/Pfam/dbCAN2 annotations, including only annotations for the protein cluster representatives of cluster size >= 2 in a single file

pLM_embeddings/globdb_r226_linclust_2plus.faa.gz
Protein fasta file including only the 82,972,511 protein cluster representatives of cluster size >= 2

pLM_embeddings/globdb_r226_linclust_2plus_IDs.txt.gz
GlobDB IDs of the protein cluster representatives of cluster size >= 2. Can be used to efficiently access the pLM embeddings of specific GlobDB IDs
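
As a rough illustration of how the ID list might be used, the Python sketch below looks up the zero-based line numbers of a set of protein IDs. This page does not describe the internal layout of the HDF5 file, so treating a line number as the matching row of the embeddings is an assumption that should be checked against the file itself. The function name rows_for_ids is hypothetical.

```python
import gzip


def rows_for_ids(ids_path, wanted):
    """Return {protein_id: zero-based line number} for the requested IDs.

    Streams the gzipped ID list once, so the full list of IDs is never
    held in memory, and stops early once every requested ID is found.
    """
    wanted = set(wanted)
    found = {}
    with gzip.open(ids_path, "rt") as handle:
        for row, line in enumerate(handle):
            protein_id = line.strip()
            if protein_id in wanted:
                found[protein_id] = row
                if len(found) == len(wanted):
                    break
    return found
```

IDs that are absent from the list are simply missing from the returned dictionary. If the embeddings are stored in the same order as the ID list (an assumption), the returned line numbers could then serve as row indices when reading the HDF5 file with a library such as h5py.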

pLM_embeddings/globdb_r226_linclust_2plus_ProtT5_embeddings.h5
pLM embeddings of 82,972,511 GlobDB protein cluster representatives

pLM_embeddings/globdb_r226_linclust_cluster_members.tsv.gz
Tab-delimited file listing the GlobDB protein IDs of the proteins included in each cluster

pLM_embeddings/globdb_r226_linclust_md5sums.txt
md5sums of the downloadable files in the pLM_embeddings directory

taxonomic_profiling
Directory containing the sylph database and SingleM metapackage that can be used to generate taxonomic profiles of metagenomes using the GlobDB taxonomy

taxonomic_profiling/globdb_r226_SingleM_metapackage.tar.gz
SingleM metapackage for taxonomic profiling of metagenome datasets using SingleM and the GlobDB taxonomy

taxonomic_profiling/globdb_r226_sylph.syldb
Sylph database for taxonomic profiling of metagenome datasets using Sylph and the GlobDB taxonomy. Requires the globdb_r226_taxonomy.tsv.gz file

taxonomic_profiling/globdb_r226_tax_profile_md5sums.txt
md5sums of the downloadable files in the taxonomic_profiling directory
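
The md5sum listings described above can be used to check downloaded files for corruption. The Python sketch below shows one way to do this. It assumes the listings use the usual md5sum layout of one "HASH  FILENAME" pair per line, which is not confirmed on this page; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(text):
    """Parse 'HASH  FILENAME' lines (assumed md5sum-style layout) into a dict."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        checksum, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading '*' on the file name.
        expected[name.lstrip("*")] = checksum.lower()
    return expected


def verify_downloads(listing_path, download_dir="."):
    """Return {filename: True/False/None}; None means the file is not present locally."""
    expected = parse_md5_listing(Path(listing_path).read_text())
    results = {}
    for name, checksum in expected.items():
        target = Path(download_dir) / name
        results[name] = md5_of_file(target) == checksum if target.exists() else None
    return results
```

For example, verify_downloads("globdb_r226_md5sum.txt", ".") reports True for every file in the current directory whose checksum matches, False for a mismatch, and None for files that have not been downloaded. On systems with GNU coreutils, running md5sum -c on the listing performs the same check under the same layout assumption.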

 

Previous releases

The current release of the GlobDB (GlobDB 226) is the second public release of this resource. Files from previous releases can be accessed below:


Release 220

All release 220 files can be accessed here