Calculate Inverse Simpson Index.
Inverse Simpson Index of the distribution of all clade labels, minus 1 (as 1 genome is expected).
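As a rough illustration of the description above, the following sketch computes the Inverse Simpson Index of a list of clade labels and subtracts 1. The function name `inverse_simpson_minus_one` and its list-of-labels input are assumptions made for this example, not GUNC's own signature.

```python
from collections import Counter


def inverse_simpson_minus_one(clade_labels):
    """Inverse Simpson Index of the label distribution, minus 1.

    Hypothetical helper for illustration; not part of the GUNC API.
    """
    counts = Counter(clade_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Simpson index: probability that two random draws share a label.
    simpson = sum((n / total) ** 2 for n in counts.values())
    return 1.0 / simpson - 1.0
```

A genome whose genes all map to one clade gives 0 under this sketch, and two equally represented clades give 1.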
- gunc.get_scores.expected_entropy_estimate(probabilities, sample_count)
Compute the expected entropy estimate when sampling N elements from a distribution with underlying probabilities p.
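One way to read this description is as the expected value of the plug-in (empirical) entropy of N draws. The sketch below computes that expectation exactly from binomial marginals. It is an assumption about what the quantity means, not GUNC's implementation. The name `expected_plugin_entropy` and the choice of bits (log base 2) are hypothetical.

```python
import math


def expected_plugin_entropy(probabilities, sample_count):
    """Expected plug-in entropy (bits) of sample_count draws from probabilities.

    Hypothetical sketch: by linearity of expectation, each category
    contributes E[-f * log2(f)], where f = k / N and k ~ Binomial(N, p).
    """
    n = sample_count
    expected = 0.0
    for p in probabilities:
        for k in range(1, n + 1):
            # Probability of observing this category exactly k times.
            weight = math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
            freq = k / n
            expected -= weight * freq * math.log2(freq)
    return expected
```

The double loop costs O(K·N) and the binomial coefficients grow quickly, so this form is only practical for small sample counts.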
- gunc.get_scores.calc_expected_conditional_entropy(contigs, taxons)
Compute the expected measured conditional entropy under the null hypothesis that there is no relationship between contig membership and taxonomy.
When a bucket is large enough, its estimate is expected to be close enough to the global estimate that the more costly expected_entropy_estimate function is no longer used.
- gunc.get_scores.get_abundant_lineages_cutoff(sensitive, genes_mapped)
Determine cutoff for abundant lineages.
Removal of all genes coming from clades consisting of <2% of all mapped genes is intended to reduce noise introduced by genes mapping to a wide range of clades due to their poor representation in the reference. In sensitive mode that value is reduced to just 10 genes.
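The rule described above can be sketched as follows. This assumes the cutoff is expressed as a gene count and that clades at or above the cutoff are kept; both are guesses for illustration. `abundant_lineages_cutoff` and `filter_abundant` are hypothetical names, not the GUNC functions.

```python
from collections import Counter


def abundant_lineages_cutoff(sensitive, genes_mapped):
    """Minimum gene count for a clade to be treated as abundant.

    Hypothetical sketch: 2% of mapped genes, or 10 genes in sensitive mode.
    """
    if sensitive:
        return 10
    return 0.02 * genes_mapped


def filter_abundant(clade_per_gene, cutoff):
    """Keep only genes whose clade has at least `cutoff` genes (assumed boundary)."""
    counts = Counter(clade_per_gene)
    return [clade for clade in clade_per_gene if counts[clade] >= cutoff]
```

With 1000 mapped genes, the default cutoff under this sketch is 20 genes per clade, and sensitive mode lowers it to 10.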
Calculate contamination portion
Calculate mean hit identity score.
Calculates the mean identity with which genes in abundant lineages (>2%) hit genes in the reference.
- gunc.get_scores.calc_conditional_entropy(contigs, taxons)
Compute conditional entropy.
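For orientation, the conditional entropy H(T|C) of taxon labels given contig labels can be sketched as below. The inputs are assumed to be two equal-length sequences with one entry per gene, and the result is in bits. The function name `conditional_entropy` is hypothetical, and GUNC's own data types and log base may differ.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(contigs, taxons):
    """H(T|C) in bits for paired per-gene contig and taxon labels.

    Hypothetical sketch; inputs are equal-length sequences of labels.
    """
    by_contig = defaultdict(Counter)
    for contig, taxon in zip(contigs, taxons):
        by_contig[contig][taxon] += 1
    total = len(contigs)
    entropy = 0.0
    for taxon_counts in by_contig.values():
        contig_size = sum(taxon_counts.values())
        for n in taxon_counts.values():
            # Accumulate -p(c, t) * log2 p(t | c).
            entropy -= (n / total) * math.log2(n / contig_size)
    return entropy
```

When every contig carries a single taxon the result is 0. When each contig holds an even mix of two taxa it is 1 bit.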
- gunc.get_scores.calc_clade_separation_score(contamination_portion, conditional_entropy, expected_conditional_entropy)
Get clade separation score (CSS).
CSS = 0, if contamination_portion = 0 or H(T|C) > H(T|R)
CSS = 1 - H(T|C)/H(T|R), else
contamination_portion (float) – GUNC contamination portion, when equal to 0 means all genes map to the same taxonomic clade.
conditional_entropy (float) – H(T|C), the entropy of taxonomic clade labels given their contig assignment.
expected_conditional_entropy (float) – H(T|R), the expected value of H(T|C) given identical contig size distribution and given there is no relationship between taxonomic clade and contig labels.
- Return type
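A minimal sketch of the CSS definition follows. It reads the zero branch as applying when H(T|C) exceeds H(T|R), so that the score stays within [0, 1]. The guard against a zero expected entropy is an added assumption to avoid division by zero, and the name `clade_separation_score` is hypothetical.

```python
def clade_separation_score(contamination_portion, conditional_entropy,
                           expected_conditional_entropy):
    """Clade separation score in [0, 1]; hypothetical sketch of the rule above."""
    if contamination_portion == 0:
        # All genes map to the same clade: nothing to separate.
        return 0.0
    if expected_conditional_entropy == 0:
        # Assumed guard: avoid dividing by zero.
        return 0.0
    if conditional_entropy > expected_conditional_entropy:
        return 0.0
    return 1.0 - conditional_entropy / expected_conditional_entropy
```

For example, with H(T|C) = 0.5 and H(T|R) = 1.0 the sketch returns 0.5. Because the score is a ratio of two entropies, the log base cancels.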
Determine if adjustment is necessary.
The GUNC CSS is adjusted by setting it to 0 when fewer than 40% of all called genes are retained in abundant lineages (clades representing >2% of all mapped genes).
Determine if chimeric.
A genome is called chimeric/contaminated if its CSS is higher than 0.45, a cutoff identified using benchmarks.
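The adjustment and the chimerism call can be sketched together as below. The function names, the argument `genes_retained`, and the treatment of a zero gene count are assumptions made for illustration. Only the 40% and 0.45 thresholds come from the descriptions above.

```python
def adjusted_css(css, genes_retained, genes_called):
    """Set CSS to 0 when under 40% of called genes are retained (sketch)."""
    # Assumption: with no called genes, treat the retained fraction as 0.
    retained_fraction = genes_retained / genes_called if genes_called else 0.0
    return 0.0 if retained_fraction < 0.4 else css


def is_chimeric(css, cutoff=0.45):
    """Call a genome chimeric when its CSS is strictly above the cutoff."""
    return css > cutoff
```

Under this sketch, a genome with CSS 0.8 but only 30 of 100 called genes retained is adjusted to 0 and is therefore not called chimeric.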
- gunc.get_scores.get_scores_for_taxlevel(base_data, tax_level, abundant_lineages_cutoff, genome_name, genes_called, genes_mapped, contig_count, min_mapped_genes)
Run chimerism check.
Calculates the various scores needed to determine if a genome is chimeric.
base_data (pandas.DataFrame) – Diamond output merged with taxonomy table
tax_level (str) – tax level to run
abundant_lineages_cutoff (float) – Cutoff value for abundant lineages
genome_name (str) – Name of input genome
genes_called (int) – Number of genes called by prodigal and used by diamond for mapping to GUNC DB
genes_mapped (int) – Number of genes mapped to GUNC DB by diamond
contig_count (int) – Count of contigs
min_mapped_genes (int) – Minimum number of mapped genes at which to calculate scores
scores for chosen taxlevel
- Return type
- gunc.get_scores.chim_score(diamond_file_path, genes_called=0, sensitive=False, min_mapped_genes=11, use_species_level=False, db='progenomes_2.1', plot=False)
Get chimerism scores for a genome.
diamond_file_path (str) – Full path to diamond output
- Keyword Arguments
genes_called (int) – Number of genes called by prodigal (default: (0))
sensitive (bool) – Run in sensitive mode (default: (False))
min_mapped_genes (int) – Minimum number of mapped genes at which to calculate scores (default: (11))
use_species_level (bool) – Allow species level to be selected for maxCSS (default: (False))
plot (bool) – Return data needed for plotting (default: (False))
db (str) – Which db to use: progenomes or gtdb (default: (progenomes_2.1))
- Return type
- gunc.gunc.merge_genecalls(genecall_files, out_dir, file_suffix)
Merge genecall files.
Merges fastas together to run diamond more efficiently. Adds the name of the file to each record (delimiter ‘_-_’) so they can be separated after diamond mapping.
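The tagging idea can be illustrated on in-memory FASTA text. The sketch assumes the file name is prepended to each record ID (the order is not stated above) and uses hypothetical helpers `tag_fasta_records` and `merge_fastas` rather than the GUNC function, which works on files.

```python
def tag_fasta_records(fasta_text, sample_name, delimiter="_-_"):
    """Prefix every FASTA header with sample_name and the delimiter."""
    tagged = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumed layout: >sample_-_original_record_id
            line = f">{sample_name}{delimiter}{line[1:]}"
        tagged.append(line)
    return "\n".join(tagged)


def merge_fastas(named_fastas, delimiter="_-_"):
    """Concatenate tagged FASTA texts from a {sample_name: text} mapping."""
    parts = [
        tag_fasta_records(text, name, delimiter)
        for name, text in named_fastas.items()
    ]
    return "\n".join(parts) + "\n"
```

Tagging keeps record IDs unique across samples even when two inputs both contain a record with the same name.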
- gunc.gunc.split_diamond_output(diamond_outfile, out_dir, db)
Split diamond output into per-sample files.
Separate diamond output file into the constituent sample files. This uses the identifiers that were added when the genecall files were merged.
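Splitting can be sketched on a list of tab-separated diamond output lines. The sketch assumes the query ID is the first column and starts with the sample name followed by the '_-_' delimiter. The helper `split_diamond_lines` is hypothetical and returns a dictionary instead of writing per-sample files.

```python
from collections import defaultdict


def split_diamond_lines(lines, delimiter="_-_"):
    """Group diamond output lines by the sample prefix of the query ID."""
    per_sample = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # The first delimiter occurrence is assumed to end the sample name.
        sample, found, rest = line.partition(delimiter)
        if not found:
            raise ValueError(f"no sample identifier in line: {line!r}")
        # Store the line with the original (untagged) query ID restored.
        per_sample[sample].append(rest)
    return dict(per_sample)
```

Each value holds the lines for one sample, in input order, ready to be written to that sample's output file.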
- gunc.external_tools.prodigal(input_file, out_file)
- gunc.external_tools.diamond(input_file, threads, temp_dir, database_file, out_file)