Key Functions

get_scores.py

gunc.get_scores.get_n_effective_surplus_clades(counts)[source]

Calculate Inverse Simpson Index.

Inverse Simpson Index of all clade labels distribution -1 (as 1 genome is expected)

Parameters:

counts (pandas.Series) – Counts of taxons in a taxlevel.

Returns:

Score describing the extent of chimerism, i.e. the effective number of surplus clades represented at a taxlevel.

Return type:

float

gunc.get_scores.expected_entropy_estimate(probabilities, sample_count)[source]

Compute the expected entropy estimate sampling N elements underlying probabilities p

Parameters:
  • probabilities (numpy.ndarray) – probabilities (p) (assumed to sum to 1.0)

  • sample_count (int) – number of samples (N)

Returns:

expected entropy

Return type:

entropy (float)

gunc.get_scores.calc_expected_conditional_entropy(contigs, taxons)[source]

Compute the expected measured conditional entropy under the null hypothesis that there is no relationship between contig membership and taxonomy

When the bucket is large enough, the estimates are expected to be close enough to the global estimate that we will no longer compute the estimate via the more costly expected_entropy_estimate function

Parameters:
Returns:

expected measured conditional entropy

Return type:

float

gunc.get_scores.get_abundant_lineages_cutoff(sensitive, genes_mapped)[source]

Determine cutoff for abundant lineages.

Removal of all genes coming from clades consisting of <2% of all mapped genes is intended to reduce noise introduced by genes mapping to a wide range of clades due to their poor representation in the reference. In sensitive mode that value is reduced to just 10 genes.

Parameters:
  • sensitive (bool) – sets cutoff to 10 if true

  • genes_mapped (int) – total number of genes mapped by diamond to GUNC DB

Returns:

cutoff used to determine an abundant lineage

Return type:

float

gunc.get_scores.calc_contamination_portion(counts)[source]

Calculate contamination portion

Parameters:

counts (pandas.Series) – Counts of taxons in a taxlevel.

Returns:

portion of genes assigning to all clades except the one with most genes.

Return type:

float

gunc.get_scores.calc_mean_hit_identity(identity_scores)[source]

Calculate mean hit identity score.

Calculates the mean identity with which genes in abundant lineages (>2%) hit genes in the reference.

Parameters:

identity_scores (list) – list of identity scores

Returns:

the mean of identity scores

Return type:

float

gunc.get_scores.calc_conditional_entropy(contigs, taxons)[source]

Compute conditional entropy

Parameters:
Returns:

measured conditional entropy

Return type:

float

gunc.get_scores.calc_clade_separation_score(contamination_portion, conditional_entropy, expected_conditional_entropy)[source]

Get clade separation score (CSS).

CSS = 0, if contamination_portion = 0 or H(T|C) <= H(T|R) CSS = 1 - H(T|C)/H(T|R), else

Parameters:
  • contamination_portion (float) – GUNC contamination portion, when equal to 0 means all genes map to the same taxonomic clade.

  • conditional_entropy (float) – H(T|C), the entropy of taxonomic clade labels given their contig assignment.

  • expected_conditional_entropy (float) – H(T|R), the expected value of H(T|C) given identical contig size distribution and given there is no relationship between taxonomic clade and contig labels.

Returns:

GUNC CSS

Return type:

float

gunc.get_scores.determine_adjustment(genes_retained_index)[source]

Determine if adjustment is necessary.

Adjustment of GUNC CSS score is done by setting it to 0 when there are <40% of all called genes retained in abundant lineages/clades representing >2% of all mapped genes.

Parameters:

genes_retained_index (float) – proportion of all called genes retained in abundant lineages (>2%).

Returns:

1 or 0. If 1, keeps CSS as it was. If 0, sets CSS to zero.

Return type:

int

gunc.get_scores.is_chimeric(clade_separation_score_adjusted)[source]

Determine if chimeric.

The cutoff of 0.45 was identified using benchmarks and is used to call a genome chimeric/contaminated if CSS is higher than this cutoff.

Parameters:

clade_separation_score_adjusted (float) – score

Returns:

is genome chimeric

Return type:

bool

gunc.get_scores.get_scores_for_taxlevel(base_data, tax_level, abundant_lineages_cutoff, genome_name, genes_called, genes_mapped, contig_count, min_mapped_genes)[source]

Run chimerism check.

Calculates the various scores needed to determine if genome ic chimeric.

Parameters:
  • base_data (pandas.DataFrame) – Diamond output merged with taxonomy table

  • tax_level (str) – tax level to run

  • abundant_lineages_cutoff (float) – Cutoff value for abundant lineages

  • genome_name (str) – Name of input genome

  • genes_called (int) – Number of genes called by prodigal and used by diamond for mapping to GUNC DB

  • genes_mapped (int) – Number of genes mapped to GUNC DB by diamond

  • contig_count (int) – Count of contigs

  • min_mapped_genes (int) – Minimum number of mapped genes at which to calculate scores

Returns:

scores for chosen taxlevel

Return type:

OrderedDict

gunc.get_scores.chim_score(diamond_file_path, genes_called=0, sensitive=False, min_mapped_genes=11, use_species_level=False, db='progenomes_2.1', plot=False)[source]

Get chimerism scores for a genome.

Parameters:

diamond_file_path (str) – Full path to diamond output

Keyword Arguments:
  • genes_called (int) –

  • sensitive (bool) – Run in sensitive mode (default: (False))

  • min_mapped_genes (int) – Minimum number of mapped genes at which to calculate scores (default: (11)

  • use_species_level (bool) – Allow species level to be selected for maxCSS (default: (False))

  • plot (bool) – Return data needed for plotting (default: (False))

  • db (str) – Which db to use: progenomes or gtdb (default: (progenomes)

Returns:

GUNC scores

Return type:

pandas.DataFrame

gunc.py

gunc.gunc.merge_genecalls(genecall_files, out_dir, file_suffix)[source]

Merge genecall files.

Merges fastas together to run diamond more efficiently. Adds the name of the file to each record (delimiter ‘/’) so they can be separated after diamond mapping.

Parameters:
  • genecall_files (list) – Paths of genecall fastas to merge

  • out_dir (str) – Directory to put the merged file

  • file_suffix (str) – Suffix of input files

Returns:

path of the merged file

Return type:

str

gunc.gunc.split_diamond_output(diamond_outfile, out_dir, db)[source]

Split diamond output into per-sample files.

Separate diamond output file into the constituent sample files. This uses the identifiers that were added by merge_genecalls()

Parameters:
  • diamond_outfile (str) – path to the diamond file to be split

  • out_dir (str) – Directory in which to put the split files.

  • db (str) – Which db is used: progenomes or gtdb

Returns:

Of the split file paths

Return type:

list

checkm_merge.py

gunc.checkm_merge.merge_checkm_gunc(checkm_file, gunc_file)[source]

Merge checkM and gunc outputs.

Parameters:
  • checkm_file (str) – Path of qa.tsv file form checkm

  • gunc_file (str) – Path of gunc_scores.tsv file

Returns:

Merged scores

Return type:

pandas.DataFrame

external_tools.py

gunc.external_tools.prodigal(input_file, out_file)[source]

Run prodigal

Parameters:
  • input_file (str) – full path of input fasta

  • out_file (str) – fullpath of output file

gunc.external_tools.diamond(input_file, threads, temp_dir, database_file, out_file)[source]

Run diamond.

Parameters:
  • input_file (str) – full path of gene calls

  • threads (int) – number of threads to use

  • temp_dir (str) – path of directory to use for tmp files

  • database_file (str) – full path fof diamond db file

  • out_file (str) – full path of output file

gunc.external_tools.get_record_count_in_fasta(fasta_file)[source]

Count number of records in a fasta file.

Parameters:

fasta_file (str) – full path of fasta file

Returns:

count of records

Return type:

str