GUNC Output¶
Output files¶
- Normal Output
GUNC.{DB version}maxCSS_level.tsv - output file with scores for a taxonomic level with the highest CSS score (or the level closest to kingdom if multiple maxima).
- Detailed Output
{PREFIX}.all_levels.tsv - output file with results for each taxonomic level.
Output Columns¶
- genome
name of input genome
- n_genes_called
number of genes called by prodigal or directly provided by the user.
- n_genes_mapped
number of genes mapped by diamond into GUNC refDB.
- n_contigs
number of contigs containing mapped genes.
- taxonomic_level
taxonomic clade labels at this taxonomic level were used to calculate values in all following columns. For each genome, all scores at six levels (species level can be added using a command-line option) are calculated.
- proportion_genes_retained_in_major_clades
only major clades that have >2% of all mapped genes assigned to them are retained to calculate other scores. Value of this column is n_genes_retained/n_genes_mapped.
- genes_retained_index
n_genes_mapped/n_genes_called * proportion_genes_retained_in_major_clades
, i.e. a portion of all called genes retained in major clades.- clade_separation_score
a result of applying a formula explained in GUNC paper to taxonomy and contig labels of genes retained in major clades. Ranges from 0 to 1 and is set to 0 when genes_retained index is <0.4 because that is too few genes left.
- contamination_portion
a portion of genes retained in major clades assigned to all clades except the one clade with the highest proportion of genes assigned to it.
- n_effective_surplus_clades
an Inverse Simpson Index of fractions of all clades - 1 (as 1 genome is expected). It is a score describing the extent of chimerism, i.e. the effective number of surplus clades represented at a taxlevel.
- mean_hit_identity
the mean identity with which genes in abundant lineages (>2%) hit genes in the reference.
- reference_representation_score
genes_retained_index * mean_hit_identity
Estimates how well a genome is represented in the GUNC DB.- pass.GUNC
If a genome passes GUNC analysis it means it is likely to not be chimeric (or that chimerism cannot be detected especially when its reference representation (RRS) is low). A genome passes if clade_separation_score <= 0.45, a cutoff benchmarked using simulated genomes.
Note
Please note that most of genomes having reference_representation_score >0.5 (roughly) are labelled as passing GUNC filters not necessarily because they are non-chimeric but rather because they are so poorly represented in the reference that it is much more difficult to judge.
Note
MIMAG_medium and MIMAG_high filters are incomplete; the MIMAG standard additionally requires data on rRNA and tRNA counts