Dereplication of the underlying databases
Each of the four data sources for the GlobDB (GTDB, GEM, SPIRE, SMAG) provides a dereplicated set of genomes, using a 95 % average nucleotide identity (ANI) dereplication criterion. Because each database was built from distinct underlying data, centroids from different datasets may belong to the same species by the operational definition of 95 % ANI, while together representing a cluster of genomes that extends beyond that operational species boundary. It is therefore possible that the genomes represented by two such centroids would not all be classified as the same species if all data were processed together.
The GlobDB therefore conservatively dereplicates the underlying (already dereplicated) datasets at 96 % ANI over a 50 % aligned fraction (AF). We chose these cutoffs after inspecting a crossplot of ANI vs AF values, which shows a clear ANI gap between the dereplicated datasets starting at approximately 96 % ANI. The order of priority for inclusion in the GlobDB is as follows:
1) all GTDB species representatives are included.
2) GEM species representatives not present in the GTDB are included.
3) SPIRE species representatives not present in either GTDB or GEM are included.
4) SMAG species representatives not present in any of the three preceding datasets are included.
Dereplication of the data sources is done using fastANI, version 1.3.4, run with the default fragment length of 3000 bp.
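As an illustration of how the 96 % ANI and 50 % AF cutoffs could be applied to fastANI results, the following is a minimal sketch and not the script used for the GlobDB. It assumes fastANI's default tab-separated output (query, reference, ANI, mapped fragments, total query fragments) and approximates AF as mapped fragments divided by total query fragments, which is an assumption. The function name and the file names in the usage note are hypothetical.

```shell
# Sketch only: list query genomes that have at least one hit passing both cutoffs.
# Assumed input columns (fastANI default output, tab-separated):
#   1 query, 2 reference, 3 ANI, 4 mapped fragments, 5 total query fragments
# Assumption: AF is approximated as column 4 / column 5.
represented_queries() {
    # $1: minimum ANI in percent, $2: minimum aligned fraction (0-1)
    awk -F '\t' -v min_ani="$1" -v min_af="$2" '
        $5 > 0 && $3 >= min_ani && ($4 / $5) >= min_af { seen[$1] = 1 }
        END { for (q in seen) print q }
    ' | sort
}
```

With a results file from a lower priority dataset compared against a higher priority one (for example a hypothetical gem_vs_gtdb.tsv), calling represented_queries 96 0.5 < gem_vs_gtdb.tsv would list the query genomes that, under these assumptions, are already represented and would not be added.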
As the GTDB updates conservatively, minimizing the number of genomes that are removed between versions, and the other three data sources are (thus far) static datasets, updates between versions are done by checking the GEM, SPIRE, and SMAG species representative genomes against each new GTDB release, and dropping any genomes now represented in the GTDB.
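The update step described above amounts to removing a set of identifiers from a list. A minimal sketch, assuming two plain-text files with one genome identifier per line (both file layouts and the function name are hypothetical, not part of the GlobDB release):

```shell
# Sketch only: print the IDs from the current list that are NOT in the
# list of genomes now represented in the new GTDB release.
drop_represented() {
    # $1: file with current genome IDs, one per line
    # $2: file with IDs now represented in the GTDB, one per line
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -v -x -F -f "$2" "$1" || true   # grep exits 1 when nothing is printed
}
```

Whole-line, fixed-string matching is used so that an identifier is only dropped on an exact match and not because it is a prefix of another identifier.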
For convenience in later use, the names of the genomes in the GEM, SPIRE, and SMAG datasets are standardised to GEMOTU, SPIREOTU and SMAGOTU respectively. In addition, the SPIRE dataset includes genomes from the proGenomes3 resource, identified as SPECIV4.
Dictionary files to relate the GlobDB identifiers back to the identifiers from the original publications are available on the "Downloads" page.
Generation of anvi'o databases and annotation
After dereplication, the resulting genome fasta files are turned into anvi'o databases and basic annotation is performed. This is done using a SLURM script (specific to the life science compute cluster, Vienna) with key steps shown below. The variables ${INPUT_LOCAL} and ${INPUT_BASE} correspond to the genome fasta file and the GlobDB ID respectively. The variables $ARRAYTMPDIR and $SLURM_CPUS_PER_TASK are SLURM variables designating a data directory and the number of threads to be used.
Genome fasta files are processed to standardise contig names, and anvi'o contigs databases with the GlobDB ID as database name are created.
anvi-script-reformat-fasta --simplify-names --overwrite-input --prefix ${INPUT_BASE} -r ${INPUT_BASE}_report.txt --seq-type NT ${INPUT_LOCAL}
anvi-gen-contigs-database -f ${INPUT_LOCAL} -n ${INPUT_BASE} -T $SLURM_CPUS_PER_TASK -o ${INPUT_BASE}.db
Then basic annotations are run using HMMs integrated with anvi'o. Information on the sources of HMMs that are included in anvi'o can be found here. Additionally, KEGG, COG, Pfam and CAZy annotations are performed. Please see the links in the file descriptions on the Downloads page for more information on the anvi'o implementation of these annotations. Note that for KEGG annotations, anvi'o employs a heuristic to add (likely) true positive hits that were missed during the first pass, described in more detail here.
anvi-run-hmms -T $SLURM_CPUS_PER_TASK --also-scan-trnas -c ${INPUT_BASE}.db --just-do-it
anvi-run-kegg-kofams -T $SLURM_CPUS_PER_TASK -c ${INPUT_LOCAL} --kegg-data-dir $ARRAYTMPDIR/kegg/ --just-do-it
anvi-run-pfams -T $SLURM_CPUS_PER_TASK -c ${INPUT_LOCAL} --pfam-data-dir $ARRAYTMPDIR/pfam/
anvi-run-cazymes -T $SLURM_CPUS_PER_TASK -c ${INPUT_LOCAL} --cazyme-data-dir $ARRAYTMPDIR/cazy/
anvi-run-ncbi-cogs -T $SLURM_CPUS_PER_TASK -c ${INPUT_LOCAL} --cog-data-dir $ARRAYTMPDIR/cog/
Next, protein fasta files and gff files for KEGG, COG and Pfam annotations are exported, and tar archives containing these files are available under Downloads. These files can also be generated from the anvi'o databases directly. By default, the GlobDB ID is not included in the protein fasta file header, but is added after export.
anvi-get-sequences-for-gene-calls -c ${INPUT_BASE}.db --get-aa-sequences --wrap 0 -o ${INPUT_BASE}.faa
anvi-get-sequences-for-gene-calls --annotation-source COG20_FUNCTION --export-gff3 -c ${INPUT_BASE}.db -o ${INPUT_BASE}_cog.gff
anvi-get-sequences-for-gene-calls --annotation-source KOfam --export-gff3 -c ${INPUT_BASE}.db -o ${INPUT_BASE}_kegg.gff
anvi-get-sequences-for-gene-calls --annotation-source Pfam --export-gff3 -c ${INPUT_BASE}.db -o ${INPUT_BASE}_pfam.gff
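Adding the GlobDB ID to the protein fasta headers after export could look like the sketch below. The exact header layout used in the GlobDB downloads is not specified here, so the separator is passed as an argument and is purely illustrative. The function name is hypothetical.

```shell
# Sketch only: prefix every FASTA header with an ID and a separator.
# The resulting ">ID<sep>original_header" layout is an assumption,
# not the documented GlobDB header format.
add_id_to_headers() {
    # $1: ID to prepend, $2: separator; reads FASTA on stdin, writes to stdout
    awk -v id="$1" -v sep="$2" '
        /^>/ { print ">" id sep substr($0, 2); next }   # header lines
        { print }                                       # sequence lines unchanged
    '
}
```

For example, add_id_to_headers "${INPUT_BASE}" "_" < ${INPUT_BASE}.faa would turn a header such as >0 into >ID_0, leaving the sequences untouched.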
Further processing of protein data files
TMHMM version 2.0c is run on the protein fasta files, and the results are processed to a two column tab-delimited file with headers "prot_ID" and "no_TMH".
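The conversion to the two-column file can be sketched as follows. This is not the script used for the GlobDB; it assumes TMHMM's short output format, with one tab-separated line per protein that contains a PredHel=<n> field giving the number of predicted transmembrane helices. The function name is hypothetical.

```shell
# Sketch only: convert TMHMM short-format output to a two-column table.
# Assumed input: one tab-separated line per protein, e.g.
#   prot1  len=471  ExpAA=159.47  First60=0.01  PredHel=7  Topology=...
tmhmm_to_table() {
    awk -F '\t' '
        BEGIN { OFS = "\t"; print "prot_ID", "no_TMH" }
        NF == 0 { next }                       # skip empty lines
        {
            n = "NA"                           # reported if no PredHel field is found
            for (i = 2; i <= NF; i++)
                if ($i ~ /^PredHel=/) n = substr($i, 9)
            print $1, n
        }
    '
}
```

The function reads the TMHMM output on standard input and writes the table, including the "prot_ID" and "no_TMH" header line, to standard output.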