faq | GlobDB

How are the genomes in the GlobDB selected?

The GlobDB is a dereplicated set of microbial genomes based on 26 publicly available genome datasets. Please see the Home page for links to the underlying data sources, and the Methods page for details on how the datasets were dereplicated.

Can I freely use the GlobDB genomes?

The GlobDB is licensed under CC BY-SA 4.0, so it can be freely shared and used, but attribution is required. See the How to cite page for information on how to cite the GlobDB and the underlying resources. Any product built on the GlobDB also needs to be distributed under a CC BY-SA 4.0 license. See the bottom of the Home page for more information on the license.

How is it possible that two GlobDB genomes have the same species assignment?

As indicated on the Methods page, the underlying datasets of the GlobDB are each already dereplicated at 95 % average nucleotide identity (ANI). We chose to further dereplicate these individually dereplicated sets against each other at 96 % ANI, based on a crossplot of ANI and aligned fraction (AF). As a result, any non-INSDC dataset genome that has between 95 and 96 % ANI to a GTDB genome is kept as a separate genome, but is assigned the same species as that GTDB genome.
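The effect of the two thresholds can be illustrated with a minimal Python sketch. It is not the GlobDB pipeline: the function name, the labels, and the treatment of values exactly at 95 or 96 % are assumptions made for illustration, and the aligned fraction (AF) criterion is left out.

```python
def classify_against_gtdb(ani_percent, species_ani=95.0, dereplication_ani=96.0):
    """Classify a genome by its ANI (in percent) to a GTDB genome.

    Hypothetical helper: boundary handling (>=) is an assumption,
    and aligned fraction (AF) is deliberately ignored here.
    """
    if ani_percent >= dereplication_ani:
        # Similar enough to be collapsed during the 96 % dereplication.
        return "dereplicated"
    if ani_percent >= species_ani:
        # Survives dereplication, but falls within the 95 % species boundary.
        return "kept, same species"
    # Below the species boundary: a separate genome of a different species.
    return "kept, different species"


for ani in (97.2, 95.5, 93.0):
    print(ani, "->", classify_against_gtdb(ani))
```

Genomes in the middle band (95 to 96 % ANI) are the ones that appear in the GlobDB as two entries with the same species assignment.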

Do I need a specific version of anvi'o to use the contigs databases?

Yes. Anvi'o versions its contigs databases, and the versions are not backwards compatible. Therefore, you need the anvi'o version with which the databases were created, which is currently version 9. In future versions of the GlobDB, the contigs databases will be migrated to keep up with the latest anvi'o release.
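To check which database version a given contigs database carries before opening it with anvi'o, something like the following Python sketch may help. It assumes that the contigs database is an SQLite file with a `self` table of `key`/`value` pairs that includes a `version` entry; the function name is hypothetical. Note that the stored value is the database schema version, which may differ from the anvi'o release number.

```python
import sqlite3


def contigs_db_version(db_path):
    """Return the 'version' value stored in a contigs database, or None.

    Assumption: the database is an SQLite file with a `self` table
    holding key/value metadata, including a 'version' key.
    """
    # Open read-only so that a mistyped path does not create an empty file.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        row = conn.execute(
            "SELECT value FROM self WHERE key = 'version'"
        ).fetchone()
    finally:
        conn.close()
    return None if row is None else row[0]
```

Calling `contigs_db_version("path/to/CONTIGS.db")` (a placeholder path) returns the stored value, which can then be compared with the database version expected by the installed anvi'o.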

How does the GlobDB deal with selenoproteins?

Unfortunately, Prodigal, which is used for gene calling as part of creating the anvi'o databases (see Methods), does not recognize selenoproteins. Therefore, each selenoprotein will be split into two separate amino acid sequences in the GlobDB protein files.