Analysis Module API Reference¶

Complete reference for the `biodbs.analysis` module.

Summary¶

Classes¶

Class	Description
`ORAResult`	Container for over-representation analysis results
`ORATermResult`	Single term result from ORA
`Pathway`	Represents a biological pathway with gene sets

Enums¶

Enum	Description
`Species`	Supported species for ORA (human, mouse, rat, etc.)
`GOAspect`	Gene Ontology aspects (BP, MF, CC)
`CorrectionMethod`	Multiple testing correction methods (FDR, Bonferroni)
`TranslationDatabase`	Databases for automatic ID translation
`PathwayDatabase`	Pathway database sources (KEGG, GO, EnrichR, Reactome)

Core ORA Functions¶

Function	Description
`ora`	Generic ORA against any pathway database
`ora_kegg`	ORA against KEGG pathways
`ora_go`	ORA against Gene Ontology terms
`ora_reactome`	ORA against Reactome pathways (via API)
`ora_reactome_local`	ORA against Reactome pathways (local calculation)
`ora_enrichr`	ORA via EnrichR web service

Utility Functions¶

Function	Description
`hypergeometric_test`	Compute hypergeometric p-value
`multiple_test_correction`	Apply multiple testing correction

GMT Functions¶

Function	Description
`load_gmt`	Load a GMT file into a dict of Pathway objects
`save_gmt`	Write Pathway objects to a GMT file
`fetch_gmt`	Fetch gene sets from KEGG, GO, Reactome, or EnrichR as Pathway objects, optionally saving a GMT file

Enums¶

Species¶

Supported species for ORA. Each member contains: (taxon_id, common_name, kegg_code, scientific_name).

Member	Taxon ID	KEGG Code	Scientific Name
`HUMAN`	9606	`hsa`	Homo sapiens
`MOUSE`	10090	`mmu`	Mus musculus
`RAT`	10116	`rno`	Rattus norvegicus
`ZEBRAFISH`	7955	`dre`	Danio rerio
`FLY`	7227	`dme`	Drosophila melanogaster
`WORM`	6239	`cel`	Caenorhabditis elegans
`YEAST`	559292	`sce`	Saccharomyces cerevisiae

Species ¶

Species(
    taxon_id: int,
    common_name: str,
    kegg_code: str,
    scientific_name: str,
)

Bases: Enum

Species with their NCBI taxon IDs and common names.

Each member carries four pieces of metadata so that any naming convention (common name, KEGG code, taxon ID, scientific name) can be used interchangeably everywhere in biodbs.

Attributes:

Name	Type	Description
`taxon_id`	`int`	NCBI taxonomy ID.
`common_name`	`str`	Lower-case common name (e.g. `"human"`).
`kegg_code`	`str`	Three-letter KEGG organism code (e.g. `"hsa"`).
`scientific_name`	`str`	Binomial scientific name (e.g. `"Homo sapiens"`).

Examples:

>>> from biodbs import Species
>>> translate_gene_ids(["TP53"], from_type=GeneIDType.GENE_SYMBOL,
...                    to_type=GeneIDType.ENSEMBL_GENE_ID,
...                    species=Species.HUMAN)
>>> ora_kegg(genes, species=Species.MOUSE)
>>> ora_go(genes, species=Species.HUMAN)

Source code in biodbs/_funcs/_species.py

def __init__(
    self,
    taxon_id: int,
    common_name: str,
    kegg_code: str,
    scientific_name: str,
):
    self.taxon_id = taxon_id
    self.common_name = common_name
    self.kegg_code = kegg_code
    self.scientific_name = scientific_name

from_taxon_id `classmethod` ¶

from_taxon_id(taxon_id: int) -> Species

Look up a Species by its NCBI taxonomy ID.

Raises:

Type	Description
`ValueError`	If taxon_id is not in the supported set.

Source code in biodbs/_funcs/_species.py

@classmethod
def from_taxon_id(cls, taxon_id: int) -> "Species":
    """Look up a Species by its NCBI taxonomy ID.

    Raises:
        ValueError: If *taxon_id* is not in the supported set.
    """
    for sp in cls:
        if sp.taxon_id == taxon_id:
            return sp
    raise ValueError(
        f"Unknown taxon ID: {taxon_id}. "
        f"Supported: {', '.join(f'{s.name}={s.taxon_id}' for s in cls)}"
    )

from_kegg_code `classmethod` ¶

from_kegg_code(kegg_code: str) -> Species

Look up a Species by its KEGG three-letter organism code.

Raises:

Type	Description
`ValueError`	If kegg_code is not recognised.

Source code in biodbs/_funcs/_species.py

@classmethod
def from_kegg_code(cls, kegg_code: str) -> "Species":
    """Look up a Species by its KEGG three-letter organism code.

    Raises:
        ValueError: If *kegg_code* is not recognised.
    """
    for sp in cls:
        if sp.kegg_code == kegg_code:
            return sp
    raise ValueError(
        f"Unknown KEGG code: {kegg_code!r}. "
        f"Supported: {', '.join(s.kegg_code for s in cls)}"
    )

from_name `classmethod` ¶

from_name(name: str) -> Species

Look up a Species from any of its names.

Accepts the common name ("human"), scientific name ("Homo sapiens"), KEGG code ("hsa"), or the enum member name ("HUMAN"), all case-insensitive.

Raises:

Type	Description
`ValueError`	If name does not match any known species.

Source code in biodbs/_funcs/_species.py

@classmethod
def from_name(cls, name: str) -> "Species":
    """Look up a Species from any of its names.

    Accepts the common name (``"human"``), scientific name
    (``"Homo sapiens"``), KEGG code (``"hsa"``), or the enum
    member name (``"HUMAN"``), all case-insensitive.

    Raises:
        ValueError: If *name* does not match any known species.
    """
    name_lower = name.lower().strip()
    for sp in cls:
        if name_lower in (
            sp.common_name.lower(),
            sp.scientific_name.lower(),
            sp.kegg_code.lower(),
            sp.name.lower(),
        ):
            return sp
    raise ValueError(
        f"Unknown species: {name!r}. "
        f"Supported: {', '.join(s.common_name for s in cls)}"
    )
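The ORA functions accept `species` as a `Species` member, a name, a KEGG code, or a taxon ID. The sketch below shows one way such a value could be normalised by dispatching on its type, in the spirit of the three lookups above. `MiniSpecies` and `resolve` are illustrative stand-ins written for this example; they are not the library's own `Species` or `resolve_species`.

```python
from enum import Enum
from typing import Union


class MiniSpecies(Enum):
    """Two-member stand-in for biodbs.Species (illustrative only)."""

    HUMAN = (9606, "human", "hsa", "Homo sapiens")
    MOUSE = (10090, "mouse", "mmu", "Mus musculus")

    def __init__(self, taxon_id, common_name, kegg_code, scientific_name):
        self.taxon_id = taxon_id
        self.common_name = common_name
        self.kegg_code = kegg_code
        self.scientific_name = scientific_name


def resolve(value: Union[MiniSpecies, str, int]) -> MiniSpecies:
    """Hypothetical helper: accept a member, a taxon ID, or any name."""
    if isinstance(value, MiniSpecies):
        return value
    if isinstance(value, int):
        # Integers are treated as NCBI taxon IDs.
        for sp in MiniSpecies:
            if sp.taxon_id == value:
                return sp
        raise ValueError(f"Unknown taxon ID: {value}")
    # Strings are matched case-insensitively against every known name.
    key = value.strip().lower()
    for sp in MiniSpecies:
        names = (sp.common_name, sp.scientific_name, sp.kegg_code, sp.name)
        if key in (n.lower() for n in names):
            return sp
    raise ValueError(f"Unknown species: {value!r}")
```

With this helper, `resolve(9606)`, `resolve("hsa")`, and `resolve("Homo sapiens")` all return the same member.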

GOAspect¶

Gene Ontology aspects for filtering GO terms.

Member	Value	Description
`BIOLOGICAL_PROCESS`	`"biological_process"`	BP - Biological processes
`MOLECULAR_FUNCTION`	`"molecular_function"`	MF - Molecular functions
`CELLULAR_COMPONENT`	`"cellular_component"`	CC - Cellular components
`ALL`	`"all"`	All GO aspects

GOAspect ¶

Bases: str, Enum

Gene Ontology aspects.

CorrectionMethod¶

Multiple testing correction methods.

Member	Value	Description
`BONFERRONI`	`"bonferroni"`	Bonferroni correction (conservative)
`BH`	`"benjamini_hochberg"`	Benjamini-Hochberg FDR (recommended)
`BY`	`"benjamini_yekutieli"`	Benjamini-Yekutieli FDR
`HOLM`	`"holm"`	Holm-Bonferroni method
`NONE`	`"none"`	No correction

CorrectionMethod ¶

Bases: str, Enum

Multiple testing correction methods.

TranslationDatabase¶

Databases for automatic ID translation.

Member	Value	Description
`NCBI`	`"ncbi"`	NCBI Datasets API — default for `translate_gene_ids`; best for symbol ↔ Entrez ↔ Ensembl
`ENSEMBL`	`"ensembl"`	Ensembl REST xrefs — natural choice for Ensembl IDs
`UNIPROT`	`"uniprot"`	UniProt ID mapping — best for protein-centric translations
`BIOMART`	`"biomart"`	BioMart — widest ID type range, but less reliable
`HGNC`	`"hgnc"`	HGNC REST API — authoritative for human nomenclature (human only)

TranslationDatabase ¶

Bases: str, Enum

Databases available for gene ID translation.

Use these as the `database` parameter in `translate_gene_ids` and the `translation_database` parameter in ORA functions.

Members:

NCBI: NCBI Datasets API. Most stable; best for symbol ↔ Entrez ↔ Ensembl translations. Default for translate_gene_ids.
ENSEMBL: Ensembl REST API (xrefs endpoint). More stable than BioMart; natural choice when working with Ensembl IDs.
UNIPROT: UniProt ID-mapping API. Best for protein-centric translations (UniProt accession, PDB, RefSeq protein).
BIOMART: BioMart / Ensembl query interface. Supports the widest range of ID types but is less reliable than the other options.
HGNC: HGNC REST API. Authoritative for human gene nomenclature; best for translations involving HGNC IDs, approved symbols, aliases, and previous symbols. Human only.

Examples:

>>> from biodbs.translate import TranslationDatabase, translate_gene_ids
>>> translate_gene_ids(["TP53"], from_type="gene_symbol",
...                    to_type="ensembl_gene_id",
...                    database=TranslationDatabase.NCBI)
>>> # Raw strings still work for backwards compatibility
>>> translate_gene_ids(["TP53"], from_type="gene_symbol",
...                    to_type="ensembl_gene_id",
...                    database="ncbi")
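Raw strings can stand in for enum members because the enums on this page derive from both `str` and `Enum`. The sketch below shows that general Python behaviour on a two-member stand-in; `MiniTranslationDatabase` is defined here for illustration only, and how biodbs coerces the argument internally is not shown in this reference.

```python
from enum import Enum


class MiniTranslationDatabase(str, Enum):
    """Stand-in mirroring two members of TranslationDatabase."""

    NCBI = "ncbi"
    BIOMART = "biomart"


# A raw string is coerced to the matching member by value lookup.
db = MiniTranslationDatabase("ncbi")

# Because of the str mixin, a member also compares equal to its value.
same = MiniTranslationDatabase.NCBI == "ncbi"
```

An unknown value such as `MiniTranslationDatabase("nope")` raises `ValueError`, which is the standard `Enum` behaviour.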

PathwayDatabase¶

Pathway database sources.

Member	Value	Description
`KEGG`	`"kegg"`	KEGG pathways
`GO`	`"go"`	Gene Ontology terms
`ENRICHR`	`"enrichr"`	EnrichR libraries
`REACTOME`	`"reactome"`	Reactome pathways

PathwayDatabase ¶

Bases: str, Enum

Supported pathway databases.

Result Classes¶

ORAResult¶

ORAResult `dataclass` ¶

ORAResult(
    results: List[ORATermResult],
    query_genes: List[str],
    mapped_genes: List[str],
    unmapped_genes: List[str],
    background_size: int,
    database: str,
    parameters: Dict[str, Any] = dict(),
)

Result container for over-representation analysis.

significant_terms ¶

significant_terms(
    p_threshold: float = 0.05, use_adjusted: bool = True
) -> "ORAResult"

Filter to only significant terms.

top_terms ¶

top_terms(n: int = 10) -> 'ORAResult'

Get top N terms by adjusted p-value.

as_dataframe ¶

as_dataframe(
    engine: Literal["pandas", "polars"] = "pandas",
) -> "pd.DataFrame"

Convert results to a DataFrame.

summary ¶

summary() -> str

Get a text summary of the results.
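The filtering and ranking methods can be pictured on plain records. This is a minimal sketch, assuming that "significant" means at or below the threshold (the summary output shown elsewhere on this page prints `adj.p <= 0.05`) and that `top_terms` sorts ascending by adjusted p-value. The `significant` and `top` functions and the `rows` data are made up for the example.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]  # simplified stand-in for ORATermResult


def significant(rows: List[Row], p_threshold: float = 0.05,
                use_adjusted: bool = True) -> List[Row]:
    """Keep rows whose (adjusted) p-value is at or below the threshold."""
    key = "adjusted_p_value" if use_adjusted else "p_value"
    return [r for r in rows if r[key] <= p_threshold]


def top(rows: List[Row], n: int = 10) -> List[Row]:
    """Return the n rows with the smallest adjusted p-value."""
    return sorted(rows, key=lambda r: r["adjusted_p_value"])[:n]


# Made-up values, for illustration only.
rows = [
    {"term_id": "T1", "p_value": 0.001, "adjusted_p_value": 0.003},
    {"term_id": "T2", "p_value": 0.040, "adjusted_p_value": 0.060},
    {"term_id": "T3", "p_value": 0.020, "adjusted_p_value": 0.040},
]
```

Here `significant(rows)` drops `T2`, while `significant(rows, use_adjusted=False)` keeps all three records because every raw p-value is below 0.05.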

ORATermResult¶

ORATermResult `dataclass` ¶

ORATermResult(
    term_id: str,
    term_name: str,
    p_value: float,
    adjusted_p_value: float,
    overlap_count: int,
    term_size: int,
    query_size: int,
    background_size: int,
    fold_enrichment: float,
    overlap_genes: List[str],
    database: str,
)

Result for a single term/pathway in ORA.

odds_ratio `property` ¶

odds_ratio: float

Calculate odds ratio for enrichment.
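The exact formulas behind `fold_enrichment` and `odds_ratio` are not reproduced on this page. The sketch below uses the conventional definitions, with `k`, `K`, `n`, `N` standing for `overlap_count`, `term_size`, `query_size`, and `background_size`. Treat it as an assumption about what the values mean, not as the library's implementation; the handling of empty cells in particular is a choice made for this example.

```python
def fold_enrichment(k: int, K: int, n: int, N: int) -> float:
    """Observed overlap fraction divided by the fraction expected by chance."""
    return (k / n) / (K / N)


def odds_ratio(k: int, K: int, n: int, N: int) -> float:
    """Odds ratio of the 2x2 table (in query / not) x (in term / not)."""
    a = k              # in query, in term
    b = n - k          # in query, not in term
    c = K - k          # not in query, in term
    d = N - K - n + k  # in neither
    if b == 0 or c == 0:
        # Assumption: report infinity when a denominator cell is empty.
        return float("inf")
    return (a * d) / (b * c)
```

For a term of 50 genes in a background of 1000, a query of 20 genes with 5 hits gives a fold enrichment of 5 and an odds ratio of 4675/675, roughly 6.9.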

to_dict ¶

to_dict() -> Dict[str, Any]

Convert to dictionary.

Pathway¶

Pathway `dataclass` ¶

Pathway(
    id: str,
    name: str,
    genes: FrozenSet[str],
    database: str,
    species: Optional[str] = None,
    url: Optional[str] = None,
)

A biological pathway or gene set.

Attributes:

Name	Type	Description
`id`	`str`	Unique pathway identifier (e.g., "hsa04110", "R-HSA-69278", "GO:0006915")
`name`	`str`	Human-readable pathway name
`genes`	`FrozenSet[str]`	Set of gene identifiers in this pathway
`database`	`str`	Source database (KEGG, Reactome, GO, etc.)
`species`	`Optional[str]`	Species this pathway belongs to (optional)
`url`	`Optional[str]`	URL to pathway page (optional)

overlap ¶

overlap(gene_list: Set[str]) -> Set[str]

Get genes that overlap with a query gene list.

to_tuple ¶

to_tuple() -> Tuple[str, Set[str]]

Convert to legacy tuple format (name, genes).

from_tuple `classmethod` ¶

from_tuple(
    pathway_id: str,
    data: Tuple[str, Set[str]],
    database: str,
) -> "Pathway"

Create Pathway from legacy tuple format.
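The relationship between the dataclass and the legacy `(name, genes)` tuple can be sketched as follows. `MiniPathway` is a simplified stand-in written for this example; the method bodies are assumptions inferred from the signatures above, not copied from biodbs.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set, Tuple


@dataclass(frozen=True)
class MiniPathway:
    """Simplified stand-in for Pathway (illustrative only)."""

    id: str
    name: str
    genes: FrozenSet[str]
    database: str
    species: Optional[str] = None
    url: Optional[str] = None

    def overlap(self, gene_list: Set[str]) -> Set[str]:
        # Intersection of the pathway genes with the query genes.
        return set(self.genes & gene_list)

    def to_tuple(self) -> Tuple[str, Set[str]]:
        # Legacy format: (name, genes); the ID lives in the dict key.
        return (self.name, set(self.genes))

    @classmethod
    def from_tuple(cls, pathway_id: str, data: Tuple[str, Set[str]],
                   database: str) -> "MiniPathway":
        name, genes = data
        return cls(id=pathway_id, name=name, genes=frozenset(genes),
                   database=database)


cell_cycle = MiniPathway.from_tuple(
    "hsa04110", ("Cell cycle", {"CDK1", "CCNB1"}), "KEGG"
)
```

Storing the genes as a `frozenset` keeps the object hashable and immutable, which is presumably why the real class uses `FrozenSet[str]`.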

Core ORA Functions¶

ora¶

ora ¶

ora(
    genes: List[str],
    gene_sets: Union[
        Dict[str, Tuple[str, Set[str]]],
        Dict[str, Pathway],
        str,
        Path,
    ],
    background: Optional[Set[str]] = None,
    min_overlap: int = 3,
    correction_method: Union[str, CorrectionMethod] = BH,
    database_name: str = "custom",
) -> ORAResult

Perform over-representation analysis with custom gene sets.

Parameters:

Name	Type	Description	Default
`genes`	`List[str]`	List of query genes.	required
`gene_sets`	`Union[Dict[str, Tuple[str, Set[str]]], Dict[str, Pathway], str, Path]`	Gene sets to test, in one of three forms: `Dict[str, Pathway]`, as returned by `fetch_gmt`; `Dict[str, Tuple[str, Set[str]]]`, the legacy tuple format; or a `str` or `pathlib.Path` pointing to a `.gmt` file, which is loaded automatically via `load_gmt`.	required
`background`	`Optional[Set[str]]`	Background gene set (universe). If None, uses union of all genes.	`None`
`min_overlap`	`int`	Minimum overlap required to test a gene set.	`3`
`correction_method`	`Union[str, CorrectionMethod]`	Multiple testing correction method.	`BH`
`database_name`	`str`	Name of the database for result annotation.	`'custom'`

Returns:

Type	Description
`ORAResult`	ORAResult with enrichment results.
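To make the parameters concrete, here is a minimal, self-contained sketch of the testing loop that an ORA of this kind typically performs. It assumes the legacy tuple format, takes the default background to be the union of all gene-set members, and stops at raw p-values (multiple-testing correction is left out). `hypergeom_sf` and `ora_sketch` are illustrative names, not biodbs functions.

```python
from math import comb
from typing import Dict, List, Optional, Set, Tuple


def hypergeom_sf(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) for a hypergeometric draw of n from N with K successes."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


def ora_sketch(
    genes: List[str],
    gene_sets: Dict[str, Tuple[str, Set[str]]],
    background: Optional[Set[str]] = None,
    min_overlap: int = 3,
) -> List[Tuple[str, str, float, List[str]]]:
    """Return (term_id, term_name, p_value, overlap_genes), sorted by p-value."""
    if background is None:
        # Assumed default: the union of all genes across the gene sets.
        background = set().union(*(members for _, members in gene_sets.values()))
    query = set(genes) & background
    rows = []
    for term_id, (name, members) in gene_sets.items():
        members = members & background
        hits = query & members
        if len(hits) < min_overlap:
            continue  # too little overlap to test this gene set
        p = hypergeom_sf(len(hits), len(members), len(query), len(background))
        rows.append((term_id, name, p, sorted(hits)))
    return sorted(rows, key=lambda r: r[2])


demo_sets = {
    "A": ("Set A", {"g1", "g2", "g3", "g4"}),
    "B": ("Set B", {"g5", "g6", "g7", "g8", "g9", "g10"}),
}
demo_rows = ora_sketch(["g1", "g2", "g3", "g5"], demo_sets)
```

In the demo only set `A` reaches the default `min_overlap` of 3, so set `B` is skipped rather than reported with a large p-value.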

ora_kegg¶

ora_kegg ¶

ora_kegg(
    genes: List[str],
    species: Union[Species, str, int] = HUMAN,
    from_id_type: str = "entrez",
    background: Optional[Set[str]] = None,
    min_overlap: int = 3,
    correction_method: Union[str, CorrectionMethod] = BH,
    translation_database: Union[
        str, TranslationDatabase
    ] = BIOMART,
    use_cache: bool = True,
    cache_dir: Optional[str] = None,
    organism: Optional[str] = None,
) -> ORAResult

Perform KEGG pathway over-representation analysis.

Parameters:

Name	Type	Description	Default
`genes`	`List[str]`	List of query genes.	required
`species`	`Union[Species, str, int]`	Species to analyse. Accepts a `Species` member, a common name (`"human"`), a KEGG code (`"hsa"`), a scientific name, or an NCBI taxon ID (`9606`). Defaults to `Species.HUMAN`.	`HUMAN`
`from_id_type`	`str`	Input gene ID type. Automatically translates to Entrez IDs. Supported: "entrez", "symbol", "ensembl", "uniprot", "kegg"	`'entrez'`
`background`	`Optional[Set[str]]`	Background gene set. If None, uses all genes in KEGG.	`None`
`min_overlap`	`int`	Minimum overlap required to test a pathway.	`3`
`correction_method`	`Union[str, CorrectionMethod]`	Multiple testing correction method.	`BH`
`translation_database`	`Union[str, TranslationDatabase]`	Database for ID translation ("biomart", "uniprot", "ncbi").	`BIOMART`
`use_cache`	`bool`	Whether to use cached pathway data.	`True`
`cache_dir`	`Optional[str]`	Directory for cache files.	`None`
`organism`	`Optional[str]`	Deprecated — pass `species` instead. A raw KEGG organism code (e.g. `"hsa"`) still works via this argument for backwards compatibility but will be removed in a future version.	`None`

Returns:

Type	Description
`ORAResult`	ORAResult with KEGG pathway enrichment results.

Example

from biodbs import Species

genes = ["TP53", "BRCA1", "BRCA2", "ATM", "CHEK2"]
# Preferred — use Species enum
result = ora_kegg(genes, species=Species.HUMAN, from_id_type="symbol")
# Also accepted — KEGG code, common name, or taxon ID
result = ora_kegg(genes, species="hsa",   from_id_type="symbol")
result = ora_kegg(genes, species="human", from_id_type="symbol")
result = ora_kegg(genes, species=9606,    from_id_type="symbol")
print(result.summary())

ora_go¶

ora_go ¶

ora_go(
    genes: List[str],
    species: Union[Species, str, int] = HUMAN,
    from_id_type: str = "uniprot",
    aspect: Union[str, GOAspect] = BIOLOGICAL_PROCESS,
    evidence_codes: Optional[List[str]] = None,
    background: Optional[Set[str]] = None,
    min_overlap: int = 3,
    min_term_size: int = 5,
    max_term_size: int = 500,
    correction_method: Union[str, CorrectionMethod] = BH,
    translation_database: Union[
        str, TranslationDatabase
    ] = BIOMART,
    use_cache: bool = True,
    cache_dir: Optional[str] = None,
) -> ORAResult

Perform Gene Ontology over-representation analysis using QuickGO.

Parameters:

Name	Type	Description	Default
`genes`	`List[str]`	List of query genes.	required
`species`	`Union[Species, str, int]`	Species to analyse. Accepts a `Species` member, a common name (`"human"`), a KEGG code (`"hsa"`), a scientific name, or an NCBI taxon ID (`9606`). Defaults to `Species.HUMAN`.	`HUMAN`
`from_id_type`	`str`	Input gene ID type. Automatically translates to UniProt IDs. Supported: "uniprot", "symbol", "ensembl", "entrez"	`'uniprot'`
`aspect`	`Union[str, GOAspect]`	GO aspect to analyze.	`BIOLOGICAL_PROCESS`
`evidence_codes`	`Optional[List[str]]`	Evidence codes to include. Default excludes IEA.	`None`
`background`	`Optional[Set[str]]`	Background gene set. If None, uses all genes in GO.	`None`
`min_overlap`	`int`	Minimum overlap required.	`3`
`min_term_size`	`int`	Minimum genes per GO term.	`5`
`max_term_size`	`int`	Maximum genes per GO term.	`500`
`correction_method`	`Union[str, CorrectionMethod]`	Multiple testing correction method.	`BH`
`translation_database`	`Union[str, TranslationDatabase]`	Database for ID translation.	`BIOMART`
`use_cache`	`bool`	Whether to use cached GO data.	`True`
`cache_dir`	`Optional[str]`	Directory for cache files.	`None`

Returns:

Type	Description
`ORAResult`	ORAResult with GO term enrichment results.

Raises:

Type	Description
`ValueError`	If the species value is not recognised.

Example

from biodbs import Species

genes = ["TP53", "BRCA1", "BRCA2", "ATM", "CHEK2"]
# Preferred — use Species enum
result = ora_go(genes, species=Species.HUMAN, from_id_type="symbol")
# Also accepted — taxon ID, common name, or KEGG code
result = ora_go(genes, species=9606,    from_id_type="symbol")
result = ora_go(genes, species="human", from_id_type="symbol")
result = ora_go(genes, species="hsa",   from_id_type="symbol")
print(result.significant_terms().as_dataframe().head())

ora_reactome¶

ora_reactome ¶

ora_reactome(
    genes: List[str],
    species: str = "Homo sapiens",
    from_id_type: str = "symbol",
    interactors: bool = False,
    include_disease: bool = True,
    min_entities: Optional[int] = None,
    max_entities: Optional[int] = None,
    fetch_overlap_genes: bool = False,
    translation_database: Union[
        str, TranslationDatabase
    ] = BIOMART,
) -> ORAResult

Perform over-representation analysis using Reactome pathway database.

Parameters:

Name	Type	Description	Default
`genes`	`List[str]`	List of gene identifiers.	required
`species`	`str`	Species name (e.g., "Homo sapiens", "Mus musculus").	`'Homo sapiens'`
`from_id_type`	`str`	Input gene ID type. Automatically translates to gene symbols. Supported: "symbol", "ensembl", "entrez", "uniprot"	`'symbol'`
`interactors`	`bool`	Include interactors in the analysis.	`False`
`include_disease`	`bool`	Include disease pathways.	`True`
`min_entities`	`Optional[int]`	Minimum pathway size.	`None`
`max_entities`	`Optional[int]`	Maximum pathway size.	`None`
`fetch_overlap_genes`	`bool`	If True, fetch specific overlap genes (slower).	`False`
`translation_database`	`Union[str, TranslationDatabase]`	Database for ID translation.	`BIOMART`

Returns:

Type	Description
`ORAResult`	ORAResult with Reactome pathway enrichment results.

Example

genes = ["TP53", "BRCA1", "BRCA2", "ATM"]
result = ora_reactome(genes, species="Homo sapiens")
print(result.summary())
# ORA Results Summary (Reactome)
# ========================================
# Query genes: 4
# Mapped genes: 4
# Significant (adj.p <= 0.05): 15

ora_reactome_local¶

ora_reactome_local ¶

ora_reactome_local(
    genes: List[str],
    species: str = "Homo sapiens",
    from_id_type: str = "symbol",
    background: Optional[Set[str]] = None,
    min_overlap: int = 3,
    min_term_size: int = 5,
    max_term_size: int = 500,
    correction_method: Union[str, CorrectionMethod] = BH,
    translation_database: Union[
        str, TranslationDatabase
    ] = BIOMART,
    use_cache: bool = True,
    cache_dir: Optional[str] = None,
) -> ORAResult

Perform local over-representation analysis using Reactome pathway data.

Parameters:

Name	Type	Description	Default
`genes`	`List[str]`	List of gene identifiers.	required
`species`	`str`	Species name (e.g., "Homo sapiens", "Mus musculus").	`'Homo sapiens'`
`from_id_type`	`str`	Input gene ID type. Automatically translates to gene symbols. Supported: "symbol", "ensembl", "entrez", "uniprot"	`'symbol'`
`background`	`Optional[Set[str]]`	Background gene set. If None, uses all genes in pathways.	`None`
`min_overlap`	`int`	Minimum overlap required to test a pathway.	`3`
`min_term_size`	`int`	Minimum genes per pathway.	`5`
`max_term_size`	`int`	Maximum genes per pathway.	`500`
`correction_method`	`Union[str, CorrectionMethod]`	Multiple testing correction method.	`BH`
`translation_database`	`Union[str, TranslationDatabase]`	Database for ID translation.	`BIOMART`
`use_cache`	`bool`	Cache pathway data (recommended).	`True`
`cache_dir`	`Optional[str]`	Directory for cache files.	`None`

Returns:

Type	Description
`ORAResult`	ORAResult with Reactome pathway enrichment results.

Example

genes = ["TP53", "BRCA1", "BRCA2"]
result = ora_reactome_local(genes, species="Homo sapiens")
print(result)
# ORAResult(database='Reactome', num_significant=15, query_genes=3, mapped_genes=3)

# Get top enriched pathways
top_pathways = result.top_terms(n=5)
for term in top_pathways.results:
    print(f"{term.term_name}: p={term.p_value:.2e}")
# Cell Cycle: p=1.23e-05
# DNA Repair: p=2.45e-04
# ...

ora_enrichr¶

ora_enrichr ¶

ora_enrichr(
    genes: List[str],
    gene_set_library: str = "KEGG_2021_Human",
    organism: str = "human",
    from_id_type: str = "symbol",
    translation_database: Union[
        str, TranslationDatabase
    ] = BIOMART,
) -> ORAResult

Perform over-representation analysis using EnrichR web service.

Parameters:

Name	Type	Description	Default
`genes`	`List[str]`	List of gene identifiers.	required
`gene_set_library`	`str`	EnrichR library to use.	`'KEGG_2021_Human'`
`organism`	`str`	Organism ("human", "mouse", "fly", "yeast", "worm", "fish").	`'human'`
`from_id_type`	`str`	Input gene ID type. Automatically translates to gene symbols. Supported: "symbol", "ensembl", "entrez", "uniprot"	`'symbol'`
`translation_database`	`Union[str, TranslationDatabase]`	Database for ID translation.	`BIOMART`

Returns:

Type	Description
`ORAResult`	ORAResult with EnrichR enrichment results.

Example

genes = ["TP53", "BRCA1", "BRCA2", "ATM"]
result = ora_enrichr(genes, "KEGG_2021_Human")
print(result.as_dataframe()[["term_name", "adjusted_p_value"]].head())
#                               term_name  adjusted_p_value
# 0  Homologous recombination_Homo sapiens         0.00012
# 1           Breast cancer_Homo sapiens         0.00045

Utility Functions¶

hypergeometric_test¶

hypergeometric_test ¶

hypergeometric_test(
    k: int, K: int, n: int, N: int
) -> float

Perform hypergeometric test for over-representation.

Calculates P(X >= k) where X follows a hypergeometric distribution. This is a one-sided test for enrichment (over-representation).

Parameters:

Name	Type	Description	Default
`k`	`int`	Number of genes in both query and term (successes in sample).	required
`K`	`int`	Total genes in the term (successes in population).	required
`n`	`int`	Number of query genes (sample size).	required
`N`	`int`	Total genes in background/universe (population size).	required

Returns:

Type	Description
`float`	P-value for the hypergeometric test.
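Written out with the parameter names above, the tail probability is the standard hypergeometric sum. This is the textbook identity, stated here for orientation; it is not taken from the biodbs source.

```latex
P(X \ge k) \;=\; \sum_{i=k}^{\min(K,\,n)}
  \frac{\binom{K}{i}\,\binom{N-K}{n-i}}{\binom{N}{n}}
```

For example, with N = 10, K = 4, n = 4 and k = 3, the sum is (4·6 + 1·1)/210 = 25/210 ≈ 0.119.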

multiple_test_correction¶

multiple_test_correction ¶

multiple_test_correction(
    p_values: List[float],
    method: Union[str, CorrectionMethod] = BH,
) -> List[float]

Apply multiple testing correction to p-values.
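For orientation, the sketch below implements three of the listed methods (Bonferroni, Holm, Benjamini-Hochberg) in their textbook form, using only the standard library. The function names are illustrative and the code is not the library's implementation. Results are returned in the same order as the input.

```python
from typing import List


def bonferroni(p: List[float]) -> List[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p)
    return [min(1.0, x * m) for x in p]


def holm(p: List[float]) -> List[float]:
    """Step-down Holm-Bonferroni adjustment."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        running_max = max(running_max, min(1.0, (m - rank) * p[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(p: List[float]) -> List[float]:
    """Step-up Benjamini-Hochberg FDR adjustment."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
```

On `raw`, Bonferroni yields 0.04, 0.16, 0.12, 0.02 and Benjamini-Hochberg yields 0.02, 0.04, 0.04, 0.02, which shows why the FDR method is the less conservative choice.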

GMT Functions¶

load_gmt¶

load_gmt ¶

load_gmt(
    path: Union[str, Path],
    database: str = "",
    species: Optional[str] = None,
) -> "Dict[str, Any]"

Load a GMT file and return a dict of `Pathway` objects.

Each non-empty line becomes one Pathway:

id ← column 1 (gene set name / pathway ID)
name ← column 2 (description; falls back to id if blank or "na")
genes ← columns 3 onwards
database ← database parameter (default "" — set to the source name)
species ← species parameter (optional)

Parameters:

Name	Type	Description	Default
`path`	`Union[str, Path]`	Path to the `.gmt` file.	required
`database`	`str`	Database label to attach to every `Pathway` (e.g. `"KEGG"`, `"MSigDB_H"`).	`''`
`species`	`Optional[str]`	Species string to attach to every `Pathway` (e.g. `"Homo sapiens"`).	`None`

Returns:

Type	Description
`'Dict[str, Any]'`	`Dict[pathway_id, Pathway]` ready to pass to `ora`.

Raises:

Type	Description
`FileNotFoundError`	If path does not exist.

Example

gene_sets = load_gmt("h.all.v2023.1.Hs.symbols.gmt", database="MSigDB_H")
result = ora(my_genes, gene_sets)
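The column mapping described above can be sketched on in-memory lines. This is a minimal illustration, assuming tab-separated columns and a case-insensitive match on `"na"`; `MiniPathway` and `parse_gmt_lines` are hypothetical names for this example, and the real `load_gmt` reads from a file path instead.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class MiniPathway:
    """Simplified stand-in for Pathway (illustrative only)."""

    id: str
    name: str
    genes: FrozenSet[str]
    database: str = ""
    species: Optional[str] = None


def parse_gmt_lines(lines: Iterable[str], database: str = "",
                    species: Optional[str] = None) -> Dict[str, MiniPathway]:
    """Turn GMT-formatted lines into a dict keyed by gene set ID."""
    pathways: Dict[str, MiniPathway] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        cols = line.split("\t")
        set_id = cols[0].strip()
        desc = cols[1].strip() if len(cols) > 1 else ""
        # Fall back to the ID when the description is blank or "na".
        name = set_id if desc == "" or desc.lower() == "na" else desc
        genes = frozenset(g.strip() for g in cols[2:] if g.strip())
        pathways[set_id] = MiniPathway(set_id, name, genes, database, species)
    return pathways


demo = parse_gmt_lines(
    ["HALLMARK_X\tna\tTP53\tATM\n", "\n", "hsa04110\tCell cycle\tCDK1\tCCNB1\t\n"],
    database="demo",
)
```

The first demo line has `na` as its description, so its `name` falls back to `HALLMARK_X`; the trailing tab on the last line does not produce an empty gene.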

save_gmt¶

save_gmt ¶

save_gmt(
    gene_sets: Union["Dict[str, Any]", "Dict[str, Tuple]"],
    path: Union[str, Path],
) -> Path

Save gene sets to a GMT file.

Accepts the same dict types that `ora` accepts:

Dict[str, Pathway]
Dict[str, Tuple[str, Set[str]]] (legacy tuple format)

Parameters:

Name	Type	Description	Default
`gene_sets`	`Union['Dict[str, Any]', 'Dict[str, Tuple]']`	Gene sets to write.	required
`path`	`Union[str, Path]`	Output file path (created with parent directories if needed).	required

Returns:

Type	Description
`Path`	Resolved `pathlib.Path` of the written file.

Example

gene_sets = fetch_gmt("hsa", database="kegg")
save_gmt(gene_sets, "kegg_hsa.gmt")
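The output format is one tab-separated line per gene set: ID, name, then the genes. A minimal sketch of such a writer for the legacy tuple format follows. `write_gmt` is a hypothetical name, and sorting the genes is a choice made here for reproducible output; whether `save_gmt` orders genes the same way is not stated in this reference.

```python
import tempfile
from pathlib import Path
from typing import Dict, Set, Tuple, Union


def write_gmt(gene_sets: Dict[str, Tuple[str, Set[str]]],
              path: Union[str, Path]) -> Path:
    """Write gene sets as GMT and return the resolved output path."""
    out = Path(path)
    # Create parent directories if they do not exist yet.
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as fh:
        for set_id, (name, genes) in gene_sets.items():
            fh.write("\t".join([set_id, name, *sorted(genes)]) + "\n")
    return out.resolve()


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    written = write_gmt({"S1": ("Set one", {"B", "A"})},
                        Path(tmp) / "nested" / "demo.gmt")
    content = written.read_text(encoding="utf-8")
```

After the round trip, `content` holds a single line with `S1`, `Set one`, `A`, and `B` separated by tabs.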

fetch_gmt¶

fetch_gmt ¶

fetch_gmt(
    name: str,
    database: Literal[
        "kegg",
        "go",
        "gene ontology",
        "reactome",
        "enrichr",
        "msigdb",
    ] = "kegg",
    save_at: Optional[str] = None,
    species: Union[str, "Species"] = "human",
    aspect: str = "biological_process",
    use_cache: bool = True,
    min_term_size: int = 5,
    max_term_size: int = 500,
) -> "Dict[str, Any]"

Fetch a gene set collection from a pathway database and return as Dict[str, Pathway].

The returned dict is immediately usable with `ora`. Pass `save_at` to also write a GMT file for use with gseapy / GSEA Desktop.

Parameters:

Name	Type	Description	Default
`name`	`str`	Database-specific identifier. For `kegg`: KEGG organism code (e.g. `"hsa"`, `"mmu"`). For `reactome`: species name (e.g. `"human"`, `"mouse"`); any value accepted by `Species` works. For `go`: GO aspect, one of `"biological_process"` (default), `"molecular_function"`, `"cellular_component"`, or `"all"`, combined with the `species` parameter. For `enrichr`: EnrichR library name (e.g. `"KEGG_2021_Human"`, `"MSigDB_Hallmark_2020"`); call `biodbs.fetch.EnrichR.funcs.enrichr_get_libraries` to list all available libraries.	required
`database`	`Literal['kegg', 'go', 'gene ontology', 'reactome', 'enrichr', 'msigdb']`	Source database. One of `"kegg"`, `"reactome"`, `"go"`, `"enrichr"`. Case-insensitive.	`'kegg'`
`save_at`	`Optional[str]`	Optional file path for the GMT output. The placeholder `{name}` is replaced with the sanitised name argument. Examples: `"./kegg_hsa.gmt"`, `"./{name}.gmt"`. Pass `None` (default) to skip writing.	`None`
`species`	`Union[str, 'Species']`	Species for KEGG / GO / Reactome lookups. Ignored for EnrichR (library names are already species-specific). Accepts anything that `biodbs._funcs._species.resolve_species` understands.	`'human'`
`aspect`	`str`	GO aspect when `database="go"`. Overridden by name when name is a valid aspect string.	`'biological_process'`
`use_cache`	`bool`	Whether to use and populate the pathway cache.	`True`
`min_term_size`	`int`	Minimum genes per pathway (KEGG / GO / Reactome only).	`5`
`max_term_size`	`int`	Maximum genes per pathway (KEGG / GO / Reactome only).	`500`

Returns:

Type	Description
`'Dict[str, Any]'`	`Dict[pathway_id, Pathway]`, the same type accepted by `ora`.

Raises:

Type	Description
`ValueError`	For unknown database names.
`RuntimeError`	If the EnrichR download fails.

Examples

# All KEGG human pathways → save GMT
ks = fetch_gmt("hsa", database="kegg", save_at="./{name}.gmt")

# Reactome mouse pathways
rs = fetch_gmt("mouse", database="reactome")

# GO Biological Process (human, cached)
gs = fetch_gmt("biological_process", database="go")

# EnrichR Hallmark gene sets
hs = fetch_gmt("MSigDB_Hallmark_2020", database="enrichr",
               save_at="./hallmark.gmt")

# Run ORA immediately
result = ora(my_genes, fetch_gmt("hsa", database="kegg"))

DataFrame Columns¶

When using ORAResult.as_dataframe():

Column	Type	Description
`term_id`	str	Pathway/term ID
`term_name`	str	Pathway/term name
`p_value`	float	Raw p-value
`adjusted_p_value`	float	P-value adjusted by the selected correction method (Benjamini-Hochberg FDR by default)
`overlap_count`	int	Overlapping genes
`term_size`	int	Total genes in term
`query_size`	int	Number of query genes
`background_size`	int	Universe size
`fold_enrichment`	float	Fold enrichment (observed overlap relative to the overlap expected by chance)
`odds_ratio`	float	Odds ratio
`overlap_genes`	str	Comma-separated gene IDs
`database`	str	Source database

EnrichR Libraries¶

Popular gene set libraries available in EnrichR:

Library	Description
`KEGG_2021_Human`	KEGG pathways
`GO_Biological_Process_2021`	GO biological process
`GO_Molecular_Function_2021`	GO molecular function
`GO_Cellular_Component_2021`	GO cellular component
`Reactome_2022`	Reactome pathways
`WikiPathways_2019_Human`	WikiPathways
`MSigDB_Hallmark_2020`	MSigDB Hallmark
`GWAS_Catalog_2019`	GWAS Catalog
`DisGeNET`	Disease-gene associations
`DrugMatrix`	Drug signatures

Get all available libraries:

from biodbs.fetch import enrichr_get_libraries
libraries = enrichr_get_libraries()
