Key Functions
get_scores.py
- gunc.get_scores.get_n_effective_surplus_clades(counts)
Calculate the Inverse Simpson Index.
Inverse Simpson Index of the distribution of all clade labels, minus 1 (as one genome is expected).
- Parameters
counts (pandas.Series) – Counts of taxons in a taxlevel.
- Returns
Score describing the extent of chimerism, i.e. the effective number of surplus clades represented at a taxlevel.
- Return type
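As a rough illustration of the description above, the Inverse Simpson Index minus 1 can be sketched in plain Python. The function name and the use of a plain list instead of a pandas.Series are assumptions made for this example; it is not the library's implementation.

```python
def n_effective_surplus_clades(counts):
    """Inverse Simpson Index of the clade counts, minus 1 (illustrative sketch)."""
    total = sum(counts)
    # Simpson's index: probability that two randomly drawn genes share a clade.
    simpson = sum((c / total) ** 2 for c in counts)
    # Inverse Simpson gives the effective number of clades; one is expected.
    return 1 / simpson - 1


single_clade = n_effective_surplus_clades([120])
even_split = n_effective_surplus_clades([50, 50])
```

Under these assumptions a genome whose genes all map to one clade scores 0, and an even split across two clades scores 1, i.e. one effective surplus clade.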
- gunc.get_scores.expected_entropy_estimate(probabilities, sample_count)
Compute the expected entropy estimate when sampling N elements from the underlying probabilities p.
- Parameters
probabilities (numpy.ndarray) – probabilities (p) (assumed to sum to 1.0)
sample_count (int) – number of samples (N)
- Returns
expected entropy
- Return type
entropy (float)
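One way to read this quantity is as the expected value of the plug-in entropy computed from N draws. The sketch below is an assumption about the approach, not the library's code: it uses the fact that the count of each category is marginally Binomial(N, p), it measures entropy in nats, and the function name is hypothetical.

```python
import math


def expected_plugin_entropy(probabilities, sample_count):
    """Expected plug-in entropy (nats) after drawing sample_count items (sketch)."""
    n = sample_count
    expected = 0.0
    for p in probabilities:
        # The count k of one category is Binomial(n, p); k = 0 contributes nothing.
        for k in range(1, n + 1):
            pmf = math.comb(n, k) * p**k * (1 - p) ** (n - k)
            frac = k / n
            expected -= pmf * frac * math.log(frac)
    return expected


two_draws = expected_plugin_entropy([0.5, 0.5], 2)
fifty_draws = expected_plugin_entropy([0.5, 0.5], 50)
```

With two equally likely categories and two draws, the draws agree half the time (entropy 0) and differ half the time (entropy ln 2), so the expectation is 0.5 * ln 2. With more draws the value approaches ln 2 from below.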
- gunc.get_scores.calc_expected_conditional_entropy(contigs, taxons)
Compute the expected measured conditional entropy under the null hypothesis that there is no relationship between contig membership and taxonomy.
When a bucket is large enough, its estimate is expected to be close enough to the global estimate that the more costly expected_entropy_estimate function is no longer used.
- Parameters
contigs (pandas.Series) – contig names in data
taxons (pandas.Series) – taxons in data
- Returns
expected measured conditional entropy
- Return type
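A minimal sketch of the idea described in this entry, under several assumptions: each contig is treated as a bucket whose genes are drawn from the global taxon distribution, entropy is in nats, and buckets at or above a hypothetical `large_bucket` size fall back to the global entropy. The function names, the threshold value, and the use of plain lists are all illustrative and not taken from the library.

```python
import math
from collections import Counter


def expected_plugin_entropy(probabilities, n):
    """Expected plug-in entropy (nats) of n draws from the given probabilities."""
    expected = 0.0
    for p in probabilities:
        for k in range(1, n + 1):
            pmf = math.comb(n, k) * p**k * (1 - p) ** (n - k)
            expected -= pmf * (k / n) * math.log(k / n)
    return expected


def expected_conditional_entropy(contigs, taxons, large_bucket=1000):
    """Expected H(T|C) if taxa were unrelated to contigs (illustrative sketch)."""
    total = len(taxons)
    probs = [count / total for count in Counter(taxons).values()]
    global_entropy = -sum(p * math.log(p) for p in probs)
    expected = 0.0
    for size in Counter(contigs).values():
        # Hypothetical threshold: large buckets reuse the cheap global estimate.
        if size >= large_bucket:
            bucket_entropy = global_entropy
        else:
            bucket_entropy = expected_plugin_entropy(probs, size)
        # Weight each contig by its share of all genes.
        expected += size / total * bucket_entropy
    return expected


contigs = ["c1", "c1", "c2", "c2"]
taxons = ["A", "A", "B", "B"]
small_buckets = expected_conditional_entropy(contigs, taxons)
global_fallback = expected_conditional_entropy(contigs, taxons, large_bucket=1)
```

Note that the binomial model samples with replacement, which is itself a simplification of shuffling a fixed set of labels.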
- gunc.get_scores.get_abundant_lineages_cutoff(sensitive, genes_mapped)
Determine cutoff for abundant lineages.
Removal of all genes coming from clades consisting of <2% of all mapped genes is intended to reduce noise introduced by genes mapping to a wide range of clades due to their poor representation in the reference. In sensitive mode that value is reduced to just 10 genes.
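The rule described above can be sketched as follows. This assumes the cutoff is expressed as a gene count and that clades exactly at the cutoff are kept; neither detail is stated in the entry, and the function name and example clade counts are made up for illustration.

```python
def abundant_lineages_cutoff(sensitive, genes_mapped):
    """Minimum gene count for a clade to count as abundant (illustrative sketch)."""
    if sensitive:
        return 10  # sensitive mode: fixed floor of 10 genes
    return 0.02 * genes_mapped  # default: 2% of all mapped genes


clade_counts = {"Escherichia": 950, "Salmonella": 45, "Bacillus": 5}
cutoff = abundant_lineages_cutoff(False, sum(clade_counts.values()))
# Assumption: a clade sitting exactly on the cutoff is retained.
abundant = {clade: n for clade, n in clade_counts.items() if n >= cutoff}
```

Here 1000 genes are mapped, so the default cutoff is 20 genes and the 5-gene clade is dropped as noise.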
- gunc.get_scores.calc_contamination_portion(counts)
Calculate contamination portion.
- Parameters
counts (pandas.Series) – Counts of taxons in a taxlevel.
- Returns
Portion of genes assigned to all clades except the one with the most genes.
- Return type
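Read literally, the returned value is one minus the share of the largest clade. A minimal sketch, using a plain list in place of the pandas.Series and a hypothetical function name:

```python
def contamination_portion(counts):
    """Share of genes outside the most abundant clade (illustrative sketch)."""
    total = sum(counts)
    return 1 - max(counts) / total


mixed = contamination_portion([80, 15, 5])
clean = contamination_portion([42])
```

With counts of 80, 15 and 5, a fifth of the genes fall outside the dominant clade; a single-clade genome gives 0.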
- gunc.get_scores.calc_mean_hit_identity(identity_scores)
Calculate mean hit identity score.
Calculates the mean identity with which genes in abundant lineages (>2%) hit genes in the reference.
- gunc.get_scores.calc_conditional_entropy(contigs, taxons)
Compute conditional entropy.
- Parameters
contigs (pandas.Series) – IDs of contigs
taxons (pandas.Series) – IDs of taxonomic clade assignments
- Returns
measured conditional entropy
- Return type
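For orientation, the conditional entropy H(T|C) of taxon labels given contig labels is the per-contig entropy weighted by each contig's share of genes. The sketch below assumes natural logarithms and plain Python sequences; the function name is hypothetical and the library may compute this differently.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(contigs, taxons):
    """H(T|C) in nats for paired contig and taxon labels (illustrative sketch)."""
    per_contig = defaultdict(Counter)
    for contig, taxon in zip(contigs, taxons):
        per_contig[contig][taxon] += 1
    total = len(contigs)
    result = 0.0
    for taxon_counts in per_contig.values():
        size = sum(taxon_counts.values())
        # Entropy of the taxon labels found on this one contig.
        entropy = -sum((n / size) * math.log(n / size) for n in taxon_counts.values())
        result += size / total * entropy
    return result


contigs = ["c1", "c1", "c2", "c2"]
separated = conditional_entropy(contigs, ["A", "A", "B", "B"])
mixed = conditional_entropy(contigs, ["A", "B", "A", "B"])
```

When each contig carries a single clade the value is 0; when both clades are evenly mixed within every contig it reaches ln 2.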
- gunc.get_scores.calc_clade_separation_score(contamination_portion, conditional_entropy, expected_conditional_entropy)
Get clade separation score (CSS).
CSS = 0, if contamination_portion = 0 or H(T|C) > H(T|R)
CSS = 1 - H(T|C)/H(T|R), otherwise
- Parameters
contamination_portion (float) – GUNC contamination portion. A value of 0 means all genes map to the same taxonomic clade.
conditional_entropy (float) – H(T|C), the entropy of taxonomic clade labels given their contig assignment.
expected_conditional_entropy (float) – H(T|R), the expected value of H(T|C) given identical contig size distribution and given there is no relationship between taxonomic clade and contig labels.
- Returns
GUNC CSS
- Return type
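The ratio 1 - H(T|C)/H(T|R) only stays within [0, 1] while H(T|C) does not exceed H(T|R), so the sketch below returns 0 whenever the measured entropy is larger than the expected one. The guard against a zero expected entropy is an assumption added to avoid division by zero, and the function name is hypothetical.

```python
def clade_separation_score(contamination_portion, conditional_entropy,
                           expected_conditional_entropy):
    """Clade separation score in [0, 1] (illustrative sketch)."""
    if contamination_portion == 0:
        return 0.0  # all genes map to one clade: nothing to separate
    if expected_conditional_entropy == 0:
        return 0.0  # assumption: avoid dividing by zero
    if conditional_entropy > expected_conditional_entropy:
        return 0.0  # the ratio would make the score negative
    return 1 - conditional_entropy / expected_conditional_entropy


fully_separated = clade_separation_score(0.2, 0.0, 0.5)
half_separated = clade_separation_score(0.2, 0.25, 0.5)
```

A score of 1 means taxon labels are fully determined by contig membership; a score near 0 means contigs carry no more taxonomic signal than expected by chance.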
- gunc.get_scores.determine_adjustment(genes_retained_index)
Determine if adjustment is necessary.
The GUNC CSS is adjusted by setting it to 0 when <40% of all called genes are retained in abundant lineages/clades (those representing >2% of all mapped genes).
- gunc.get_scores.is_chimeric(clade_separation_score_adjusted)
Determine if chimeric.
The cutoff of 0.45 was identified using benchmarks and is used to call a genome chimeric/contaminated if CSS is higher than this cutoff.
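The adjustment and the chimerism call can be sketched together. This assumes genes_retained_index is a fraction between 0 and 1 and that the comparison with 0.45 is strict; the helper names are hypothetical and the library's return values may differ.

```python
RETAINED_GENES_MINIMUM = 0.40  # below this, the CSS is set to 0
CHIMERIC_CUTOFF = 0.45  # benchmark-derived cutoff quoted in this section


def adjust_css(css, genes_retained_index):
    """Zero the CSS when too few called genes remain in abundant lineages."""
    if genes_retained_index < RETAINED_GENES_MINIMUM:
        return 0.0
    return css


def is_chimeric(css_adjusted):
    """Call a genome chimeric when the adjusted CSS exceeds the cutoff."""
    return css_adjusted > CHIMERIC_CUTOFF


well_supported = is_chimeric(adjust_css(0.9, 0.8))
poorly_supported = is_chimeric(adjust_css(0.9, 0.3))
```

The second call shows why the adjustment matters: a high raw CSS backed by too few retained genes is not reported as chimeric.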
- gunc.get_scores.get_scores_for_taxlevel(base_data, tax_level, abundant_lineages_cutoff, genome_name, genes_called, genes_mapped, contig_count, min_mapped_genes)
Run chimerism check.
Calculates the various scores needed to determine whether a genome is chimeric.
- Parameters
base_data (pandas.DataFrame) – Diamond output merged with taxonomy table
tax_level (str) – tax level to run
abundant_lineages_cutoff (float) – Cutoff value for abundant lineages
genome_name (str) – Name of input genome
genes_called (int) – Number of genes called by prodigal and used by diamond for mapping to GUNC DB
genes_mapped (int) – Number of genes mapped to GUNC DB by diamond
contig_count (int) – Count of contigs
min_mapped_genes (int) – Minimum number of mapped genes at which to calculate scores
- Returns
scores for chosen taxlevel
- Return type
OrderedDict
- gunc.get_scores.chim_score(diamond_file_path, genes_called=0, sensitive=False, min_mapped_genes=11, use_species_level=False, db='progenomes_2.1', plot=False)
Get chimerism scores for a genome.
- Parameters
diamond_file_path (str) – Full path to diamond output
- Keyword Arguments
genes_called (int) – Number of genes called by prodigal (default: 0)
sensitive (bool) – Run in sensitive mode (default: False)
min_mapped_genes (int) – Minimum number of mapped genes at which to calculate scores (default: 11)
use_species_level (bool) – Allow species level to be selected for maxCSS (default: False)
plot (bool) – Return data needed for plotting (default: False)
db (str) – Which db to use: progenomes or gtdb (default: progenomes_2.1)
- Returns
GUNC scores
- Return type
gunc.py
- gunc.gunc.merge_genecalls(genecall_files, out_dir, file_suffix)
Merge genecall files.
Merges fastas together to run diamond more efficiently. Adds the name of the file to each record (delimiter ‘_-_’) so they can be separated after diamond mapping.
- gunc.gunc.split_diamond_output(diamond_outfile, out_dir, db)
Split diamond output into per-sample files.
Separate diamond output file into the constituent sample files. This uses the identifiers that were added by merge_genecalls().
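The tag-then-split round trip can be illustrated on record names alone. The sketch assumes the file name is placed before the record name, which the entries above do not specify; the helper names and example identifiers are hypothetical, and no FASTA or diamond files are touched.

```python
from collections import defaultdict

DELIMITER = "_-_"  # delimiter quoted in the merge_genecalls description


def tag_record(sample_name, record_id):
    """Attach the source file name to a record name (assumed order: name first)."""
    return f"{sample_name}{DELIMITER}{record_id}"


def split_by_sample(tagged_ids):
    """Group tagged record names back into their original samples."""
    per_sample = defaultdict(list)
    for tagged in tagged_ids:
        # Split only on the first delimiter so record names stay intact.
        sample, record_id = tagged.split(DELIMITER, 1)
        per_sample[sample].append(record_id)
    return dict(per_sample)


merged = [
    tag_record("genomeA", "contig1_1"),
    tag_record("genomeA", "contig1_2"),
    tag_record("genomeB", "k141_7_3"),
]
per_sample = split_by_sample(merged)
```

A multi-character delimiter such as `_-_` is unlikely to occur inside ordinary gene or contig names, which keeps the split unambiguous.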