Key Functions¶
get_scores.py¶
- gunc.get_scores.get_n_effective_surplus_clades(counts)[source]¶
Calculate Inverse Simpson Index.
Inverse Simpson Index of all clade labels distribution -1 (as 1 genome is expected)
- Parameters:
counts (pandas.Series) – Counts of taxons in a taxlevel.
- Returns:
Score describing the extent of chimerism, i.e. the effective number of surplus clades represented at a taxlevel.
- Return type:
- gunc.get_scores.expected_entropy_estimate(probabilities, sample_count)[source]¶
Compute the expected entropy estimate sampling N elements underlying probabilities p
- Parameters:
probabilities (numpy.ndarray) – probabilities (p) (assumed to sum to 1.0)
sample_count (int) – number of samples (N)
- Returns:
expected entropy
- Return type:
entropy (float)
- gunc.get_scores.calc_expected_conditional_entropy(contigs, taxons)[source]¶
Compute the expected measured conditional entropy under the null hypothesis that there is no relationship between contig membership and taxonomy
When the bucket is large enough, the estimates are expected to be close enough to the global estimate that we will no longer compute the estimate via the more costly expected_entropy_estimate function
- Parameters:
contigs (pandas.Series) – contig names in data
taxons (pandas.Series) – taxons in data
- Returns:
expected measured conditional entropy
- Return type:
- gunc.get_scores.get_abundant_lineages_cutoff(sensitive, genes_mapped)[source]¶
Determine cutoff for abundant lineages.
Removal of all genes coming from clades consisting of <2% of all mapped genes is intended to reduce noise introduced by genes mapping to a wide range of clades due to their poor representation in the reference. In sensitive mode that value is reduced to just 10 genes.
- gunc.get_scores.calc_contamination_portion(counts)[source]¶
Calculate contamination portion
- Parameters:
counts (pandas.Series) – Counts of taxons in a taxlevel.
- Returns:
portion of genes assigning to all clades except the one with most genes.
- Return type:
- gunc.get_scores.calc_mean_hit_identity(identity_scores)[source]¶
Calculate mean hit identity score.
Calculates the mean identity with which genes in abundant lineages (>2%) hit genes in the reference.
- gunc.get_scores.calc_conditional_entropy(contigs, taxons)[source]¶
Compute conditional entropy
- Parameters:
contigs (pandas.Series) – IDs of contigs
taxons (pandas.Series) – IDs of taxonomic clade assignments
- Returns:
measured conditional entropy
- Return type:
- gunc.get_scores.calc_clade_separation_score(contamination_portion, conditional_entropy, expected_conditional_entropy)[source]¶
Get clade separation score (CSS).
CSS = 0, if contamination_portion = 0 or H(T|C) <= H(T|R) CSS = 1 - H(T|C)/H(T|R), else
- Parameters:
contamination_portion (float) – GUNC contamination portion, when equal to 0 means all genes map to the same taxonomic clade.
conditional_entropy (float) – H(T|C), the entropy of taxonomic clade labels given their contig assignment.
expected_conditional_entropy (float) – H(T|R), the expected value of H(T|C) given identical contig size distribution and given there is no relationship between taxonomic clade and contig labels.
- Returns:
GUNC CSS
- Return type:
- gunc.get_scores.determine_adjustment(genes_retained_index)[source]¶
Determine if adjustment is necessary.
Adjustment of GUNC CSS score is done by setting it to 0 when there are <40% of all called genes retained in abundant lineages/clades representing >2% of all mapped genes.
- gunc.get_scores.is_chimeric(clade_separation_score_adjusted)[source]¶
Determine if chimeric.
The cutoff of 0.45 was identified using benchmarks and is used to call a genome chimeric/contaminated if CSS is higher than this cutoff.
- gunc.get_scores.get_scores_for_taxlevel(base_data, tax_level, abundant_lineages_cutoff, genome_name, genes_called, genes_mapped, contig_count, min_mapped_genes)[source]¶
Run chimerism check.
Calculates the various scores needed to determine if genome ic chimeric.
- Parameters:
base_data (pandas.DataFrame) – Diamond output merged with taxonomy table
tax_level (str) – tax level to run
abundant_lineages_cutoff (float) – Cutoff value for abundant lineages
genome_name (str) – Name of input genome
genes_called (int) – Number of genes called by prodigal and used by diamond for mapping to GUNC DB
genes_mapped (int) – Number of genes mapped to GUNC DB by diamond
contig_count (int) – Count of contigs
min_mapped_genes (int) – Minimum number of mapped genes at which to calculate scores
- Returns:
scores for chosen taxlevel
- Return type:
OrderedDict
- gunc.get_scores.chim_score(diamond_file_path, genes_called=0, sensitive=False, min_mapped_genes=11, use_species_level=False, db='progenomes_2.1', plot=False)[source]¶
Get chimerism scores for a genome.
- Parameters:
diamond_file_path (str) – Full path to diamond output
- Keyword Arguments:
genes_called (int) –
sensitive (bool) – Run in sensitive mode (default: (False))
min_mapped_genes (int) – Minimum number of mapped genes at which to calculate scores (default: (11)
use_species_level (bool) – Allow species level to be selected for maxCSS (default: (False))
plot (bool) – Return data needed for plotting (default: (False))
db (str) – Which db to use: progenomes or gtdb (default: (progenomes)
- Returns:
GUNC scores
- Return type:
gunc.py¶
- gunc.gunc.merge_genecalls(genecall_files, out_dir, file_suffix)[source]¶
Merge genecall files.
Merges fastas together to run diamond more efficiently. Adds the name of the file to each record (delimiter ‘/’) so they can be separated after diamond mapping.
- gunc.gunc.split_diamond_output(diamond_outfile, out_dir, db)[source]¶
Split diamond output into per-sample files.
Separate diamond output file into the constituent sample files. This uses the identifiers that were added by
merge_genecalls()