Key Functions

get_scores.py

gunc.get_scores.get_n_effective_surplus_clades(counts)[source]

Calculate Inverse Simpson Index.

Inverse Simpson Index of all clade labels distribution -1 (as 1 genome is expected)

Parameters

counts (pandas.Series) – Counts of taxons in a taxlevel.

Returns

Score describing the extent of chimerism, i.e. the effective number of surplus clades represented at a taxlevel.

Return type

float

gunc.get_scores.expected_entropy_estimate(probabilities, sample_count)[source]

Compute the expected entropy estimate sampling N elements underlying probabilities p

Parameters
  • probabilities (numpy.ndarray) – probabilities (p) (assumed to sum to 1.0)

  • sample_count (int) – number of samples (N)

Returns

expected entropy

Return type

entropy (float)

gunc.get_scores.calc_expected_conditional_entropy(contigs, taxons)[source]

Compute the expected measured conditional entropy under the null hypothesis that there is no relationship between contig membership and taxonomy

When the bucket is large enough, the estimates are expected to be close enough to the global estimate that we will no longer compute the estimate via the more costly expected_entropy_estimate function

Parameters
Returns

expected measured conditional entropy

Return type

float

gunc.get_scores.get_abundant_lineages_cutoff(sensitive, genes_mapped)[source]

Determine cutoff for abundant lineages.

Removal of all genes coming from clades consisting of <2% of all mapped genes is intended to reduce noise introduced by genes mapping to a wide range of clades due to their poor representation in the reference. In sensitive mode that value is reduced to just 10 genes.

Parameters
  • sensitive (bool) – sets cutoff to 10 if true

  • genes_mapped (int) – total number of genes mapped by diamond to GUNC DB

Returns

cutoff used to determine an abundant lineage

Return type

float

gunc.get_scores.calc_contamination_portion(counts)[source]

Calculate contamination portion

Parameters

counts (pandas.Series) – Counts of taxons in a taxlevel.

Returns

portion of genes assigning to all clades except the one with most genes.

Return type

float

gunc.get_scores.calc_mean_hit_identity(identity_scores)[source]

Calculate mean hit identity score.

Calculates the mean identity with which genes in abundant lineages (>2%) hit genes in the reference.

Parameters

identity_scores (list) – list of identity scores

Returns

the mean of identity scores

Return type

float

gunc.get_scores.calc_conditional_entropy(contigs, taxons)[source]

Compute conditional entropy

Parameters
Returns

measured conditional entropy

Return type

float

gunc.get_scores.calc_clade_separation_score(contamination_portion, conditional_entropy, expected_conditional_entropy)[source]

Get clade separation score (CSS).

CSS = 0, if contamination_portion = 0 or H(T|C) <= H(T|R) CSS = 1 - H(T|C)/H(T|R), else

Parameters
  • contamination_portion (float) – GUNC contamination portion, when equal to 0 means all genes map to the same taxonomic clade.

  • conditional_entropy (float) – H(T|C), the entropy of taxonomic clade labels given their contig assignment.

  • expected_conditional_entropy (float) – H(T|R), the expected value of H(T|C) given identical contig size distribution and given there is no relationship between taxonomic clade and contig labels.

Returns

GUNC CSS

Return type

float

gunc.get_scores.determine_adjustment(genes_retained_index)[source]

Determine if adjustment is necessary.

Adjustment of GUNC CSS score is done by setting it to 0 when there are <40% of all called genes retained in abundant lineages/clades representing >2% of all mapped genes.

Parameters

genes_retained_index (float) – proportion of all called genes retained in abundant lineages (>2%).

Returns

1 or 0. If 1, keeps CSS as it was. If 0, sets CSS to zero.

Return type

int

gunc.get_scores.is_chimeric(clade_separation_score_adjusted)[source]

Determine if chimeric.

The cutoff of 0.45 was identified using benchmarks and is used to call a genome chimeric/contaminated if CSS is higher than this cutoff.

Parameters

clade_separation_score_adjusted (float) – score

Returns

is genome chimeric

Return type

bool

gunc.get_scores.get_scores_for_taxlevel(base_data, tax_level, abundant_lineages_cutoff, genome_name, genes_called, genes_mapped, contig_count, min_mapped_genes)[source]

Run chimerism check.

Calculates the various scores needed to determine if genome ic chimeric.

Parameters
  • base_data (pandas.DataFrame) – Diamond output merged with taxonomy table

  • tax_level (str) – tax level to run

  • abundant_lineages_cutoff (float) – Cutoff value for abundant lineages

  • genome_name (str) – Name of input genome

  • genes_called (int) – Number of genes called by prodigal and used by diamond for mapping to GUNC DB

  • genes_mapped (int) – Number of genes mapped to GUNC DB by diamond

  • contig_count (int) – Count of contigs

  • min_mapped_genes (int) – Minimum number of mapped genes at which to calculate scores

Returns

scores for chosen taxlevel

Return type

OrderedDict

gunc.get_scores.chim_score(diamond_file_path, genes_called=0, sensitive=False, min_mapped_genes=11, use_species_level=False, db='progenomes_2.1', plot=False)[source]

Get chimerism scores for a genome.

Parameters

diamond_file_path (str) – Full path to diamond output

Keyword Arguments
  • genes_called (int) –

  • sensitive (bool) – Run in sensitive mode (default: (False))

  • min_mapped_genes (int) – Minimum number of mapped genes at which to calculate scores (default: (11)

  • use_species_level (bool) – Allow species level to be selected for maxCSS (default: (False))

  • plot (bool) – Return data needed for plotting (default: (False))

  • db (str) – Which db to use: progenomes or gtdb (default: (progenomes)

Returns

GUNC scores

Return type

pandas.DataFrame

gunc.py

gunc.gunc.merge_genecalls(genecall_files, out_dir, file_suffix)[source]

Merge genecall files.

Merges fastas together to run diamond more efficiently. Adds the name of the file to each record (delimiter ‘_-_’) so they can be separated after diamond mapping.

Parameters
  • genecall_files (list) – Paths of genecall fastas to merge

  • out_dir (str) – Directory to put the merged file

  • file_suffix (str) – Suffix of input files

Returns

path of the merged file

Return type

str

gunc.gunc.split_diamond_output(diamond_outfile, out_dir, db)[source]

Split diamond output into per-sample files.

Separate diamond output file into the constituent sample files. This uses the identifiers that were added by merge_genecalls()

Parameters
  • diamond_outfile (str) – path to the diamond file to be split

  • out_dir (str) – Directory in which to put the split files.

  • db (str) – Which db is used: progenomes or gtdb

Returns

Of the split file paths

Return type

list

checkm_merge.py

gunc.checkm_merge.merge_checkm_gunc(checkm_file, gunc_file)[source]

Merge checkM and gunc outputs.

Parameters
  • checkm_file (str) – Path of qa.tsv file form checkm

  • gunc_file (str) – Path of gunc_scores.tsv file

Returns

Merged scores

Return type

pandas.DataFrame

external_tools.py

gunc.external_tools.prodigal(input_file, out_file)[source]

Run prodigal

Parameters
  • input_file (str) – full path of input fasta

  • out_file (str) – fullpath of output file

gunc.external_tools.diamond(input_file, threads, temp_dir, database_file, out_file)[source]

Run diamond.

Parameters
  • input_file (str) – full path of gene calls

  • threads (int) – number of threads to use

  • temp_dir (str) – path of directory to use for tmp files

  • database_file (str) – full path fof diamond db file

  • out_file (str) – full path of output file

gunc.external_tools.get_record_count_in_fasta(fasta_file)[source]

Count number of records in a fasta file.

Parameters

fasta_file (str) – full path of fasta file

Returns

count of records

Return type

str