Over-Representation Analysis (ORA)¶

Perform pathway and gene set enrichment analysis.

Overview¶

ORA tests whether genes from a given pathway or functional category appear in your gene list more often than expected by chance, relative to a background gene set.
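
The statistic behind this kind of test is commonly a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The sketch below illustrates that calculation with the standard library only. It is an assumption that biodbs uses this exact test internally, and `ora_p_value` and its argument names are hypothetical, chosen for illustration.

```python
from math import comb


def ora_p_value(overlap: int, query_size: int, term_size: int, background_size: int) -> float:
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of term genes obtained when drawing `query_size` genes
    without replacement from `background_size` genes, of which `term_size`
    belong to the term (pathway or category).
    """
    max_overlap = min(query_size, term_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 5 query genes, 4 of them in a 50-gene pathway,
# tested against a background of 20,000 genes.
p = ora_p_value(overlap=4, query_size=5, term_size=50, background_size=20000)
print(f"p = {p:.3e}")
```

A small p-value here means that an overlap at least this large would rarely arise by chance if the query genes were drawn at random from the background.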

KEGG Pathway Enrichment¶

from biodbs.analysis import ora_kegg

# Using Entrez IDs (default)
result = ora_kegg(
    genes=["7157", "672", "675", "580", "581"],  # Entrez IDs
    organism="hsa"
)

# View significant pathways
significant = result.significant_terms(alpha=0.05)
df = significant.as_dataframe()
print(df[["term_id", "term_name", "p_value", "adjusted_p_value"]])

Using Different ID Types¶

The from_id_type parameter enables automatic ID translation:

from biodbs.analysis import ora_kegg

# Using gene symbols - automatic translation to Entrez
result = ora_kegg(
    genes=["TP53", "BRCA1", "BRCA2", "ATM", "CHEK2"],
    organism="hsa",
    from_id_type="symbol"
)

# Using Ensembl IDs
result = ora_kegg(
    genes=["ENSG00000141510", "ENSG00000012048"],
    organism="hsa",
    from_id_type="ensembl"
)

Parameters¶

Parameter	Type	Default	Description
`genes`	List[str]	required	Genes to analyze
`organism`	str	required	KEGG organism code (e.g., "hsa")
`from_id_type`	str	"entrez"	Input ID type (see below)
`background`	Set[str]	None	Background gene set
`min_overlap`	int	3	Minimum overlap required
`correction_method`	str	"bh"	Multiple testing correction

Supported ID Types¶

ID Type	Aliases	Example
`entrez`	entrezgene, ncbi_gene, gene_id	"7157"
`symbol`	gene_symbol, gene_name	"TP53"
`ensembl`	ensembl_gene_id	"ENSG00000141510"
`uniprot`	uniprot_id, swissprot	"P04637"

Organism Codes¶

Code	Organism
`hsa`	Human
`mmu`	Mouse
`rno`	Rat
`dme`	Fruit fly
`sce`	Yeast

Gene Ontology Enrichment¶

from biodbs.analysis import ora_go

# Using UniProt IDs (default)
result = ora_go(
    genes=["P04637", "P38398", "P51587"],  # UniProt IDs
    taxon_id=9606,  # Human
    aspect="biological_process"
)

# Using gene symbols - automatic translation
result = ora_go(
    genes=["TP53", "BRCA1", "BRCA2"],
    taxon_id=9606,
    from_id_type="symbol"
)

Parameters¶

Parameter	Type	Default	Description
`genes`	List[str]	required	Genes to analyze
`taxon_id`	int	required	NCBI taxonomy ID
`from_id_type`	str	"uniprot"	Input ID type
`aspect`	str	"biological_process"	GO aspect

GO Aspects¶

Aspect	Description
`biological_process`	BP - Larger biological programs the gene product contributes to
`molecular_function`	MF - Biochemical activity
`cellular_component`	CC - Where in the cell

Reactome Pathway Enrichment¶

Reactome is a curated, peer-reviewed pathway database. Two methods are available:

ora_reactome - Uses Reactome's Analysis Service API (recommended for most cases)
ora_reactome_local - Performs local ORA with custom backgrounds

API-Based Analysis (ora_reactome)¶

from biodbs.analysis import ora_reactome

result = ora_reactome(
    genes=["TP53", "BRCA1", "BRCA2", "ATM", "CHEK2"],
    species="Homo sapiens"
)

# View results - __repr__ shows summary
print(result)
# ReactomeFetchedData(847 pathways, 52 significant (FDR≤0.05), query=5 ids)
#   Top: R-HSA-69620 'Cell Cycle Checkpoints' (FDR=1.23e-08)

# Detailed summary
print(result.summary())

# Get significant pathways
significant = result.significant_pathways(fdr_threshold=0.05)
df = significant.as_dataframe()

Local Analysis with Custom Background (ora_reactome_local)¶

Use ora_reactome_local when you need:

Custom background gene set (e.g., only expressed genes)
Different multiple testing correction methods
Offline analysis capability

from biodbs.analysis import ora_reactome_local

# Define your gene list
genes = ["TP53", "BRCA1", "BRCA2", "ATM", "CHEK2"]

# Define custom background (e.g., all expressed genes in your experiment).
# load_expressed_genes() is a placeholder for your own loader; it should
# return a set of gene identifiers (Set[str]).
background = load_expressed_genes()

result = ora_reactome_local(
    genes=genes,
    background=background,
    species="Homo sapiens",
    correction_method="fdr_bh",  # Benjamini-Hochberg
    min_size=5,
    max_size=500
)

print(result.summary())
df = result.significant_terms(alpha=0.05).as_dataframe()
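
To make the role of the background concrete, the sketch below shows one common convention: both the query list and each pathway are restricted to the background before any counts are taken. Whether `ora_reactome_local` does exactly this is an assumption; `restrict_to_background` and the toy gene sets are hypothetical.

```python
def restrict_to_background(query, term_genes, background):
    """Intersect the query and the term with the background, then count."""
    bg = set(background)
    query_in_bg = set(query) & bg
    term_in_bg = set(term_genes) & bg
    return {
        "overlap": sorted(query_in_bg & term_in_bg),
        "query_size": len(query_in_bg),
        "term_size": len(term_in_bg),
        "background_size": len(bg),
    }


# Toy data: a hypothetical expressed-gene background and pathway membership.
background = {"TP53", "BRCA1", "ATM", "CHEK2", "MYC", "EGFR"}
query = ["TP53", "BRCA1", "BRCA2", "ATM"]    # BRCA2 is not in the background
pathway = {"TP53", "ATM", "CHEK2", "RAD51"}  # RAD51 is not in the background

counts = restrict_to_background(query, pathway, background)
print(counts)
```

Genes outside the background (here BRCA2 and RAD51) drop out of every count, which is why the choice of background changes the resulting p-values.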

Parameters (ora_reactome)¶

Parameter	Type	Default	Description
`genes`	List[str]	required	Genes to analyze
`species`	str	"Homo sapiens"	Species name
`from_id_type`	str	"symbol"	Input ID type (symbol, ensembl, entrez, uniprot)
`interactors`	bool	False	Include interactors in analysis
`include_disease`	bool	True	Include disease pathways
`min_entities`	int	None	Minimum pathway size
`max_entities`	int	None	Maximum pathway size
`fetch_overlap_genes`	bool	False	Fetch specific overlap genes (slower)

Parameters (ora_reactome_local)¶

Parameter	Type	Default	Description
`genes`	List[str]	required	Genes to analyze
`species`	str	"Homo sapiens"	Species name
`from_id_type`	str	"symbol"	Input ID type (symbol, ensembl, entrez, uniprot)
`background`	Set[str]	None	Background gene set (None = all pathway genes)
`correction_method`	str	"fdr_bh"	Multiple testing correction
`min_size`	int	5	Minimum pathway size
`max_size`	int	500	Maximum pathway size
`use_cache`	bool	True	Cache pathway-gene mappings

Correction Methods¶

Method	Description
`fdr_bh`	Benjamini-Hochberg FDR (default)
`bonferroni`	Bonferroni correction
`holm`	Holm-Bonferroni
`fdr_by`	Benjamini-Yekutieli FDR
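
For readers unfamiliar with these corrections, the following is a minimal sketch of the standard Benjamini-Hochberg procedure in plain Python. It illustrates the textbook method only; the source does not show how biodbs implements it, and `benjamini_hochberg` is a hypothetical name.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.001, 0.008, 0.039, 0.041, 0.20]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Bonferroni, by comparison, multiplies every p-value by the number of tests (capped at 1), which is stricter and typically leaves fewer significant terms.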

Supported Species¶

Species	Description
`Homo sapiens`	Human
`Mus musculus`	Mouse
`Rattus norvegicus`	Rat
`Danio rerio`	Zebrafish
`Drosophila melanogaster`	Fruit fly
`Saccharomyces cerevisiae`	Yeast

Example with Different Identifiers¶

from biodbs.analysis import ora_reactome

# Using gene symbols (default)
result = ora_reactome(["TP53", "BRCA1", "EGFR"])

# Using UniProt IDs
result = ora_reactome(["P04637", "P38398", "P00533"], from_id_type="uniprot")

# Using Ensembl IDs
result = ora_reactome(
    ["ENSG00000141510", "ENSG00000012048"],
    from_id_type="ensembl"
)

# Using Entrez IDs
result = ora_reactome(["7157", "672", "675"], from_id_type="entrez")

# Mouse genes
result = ora_reactome(
    ["Trp53", "Brca1", "Brca2"],
    species="Mus musculus"
)

When to Use Local vs API¶

Use Case	Recommended
Standard enrichment	`ora_reactome`
Custom background	`ora_reactome_local`
RNA-seq DEG analysis	`ora_reactome_local`
Quick exploratory analysis	`ora_reactome`
Reproducible offline analysis	`ora_reactome_local`

EnrichR Analysis¶

EnrichR provides access to 100+ gene set libraries.

from biodbs.analysis import ora_enrichr

# Using gene symbols (default)
result = ora_enrichr(
    genes=["TP53", "BRCA1", "BRCA2", "ATM"],
    gene_set_library="KEGG_2021_Human"
)

# Using Ensembl IDs - automatic translation
result = ora_enrichr(
    genes=["ENSG00000141510", "ENSG00000012048"],
    gene_set_library="KEGG_2021_Human",
    from_id_type="ensembl"
)

# Using Entrez IDs
result = ora_enrichr(
    genes=["7157", "672", "675"],
    gene_set_library="GO_Biological_Process_2023",
    from_id_type="entrez"
)

Convenience Functions¶

from biodbs.analysis import (
    enrichr_kegg,
    enrichr_go_bp,
    enrichr_go_mf,
    enrichr_go_cc,
    enrichr_reactome,
    enrichr_wikipathways,
)

# KEGG
result = enrichr_kegg(["TP53", "BRCA1", "BRCA2"])

# GO Biological Process
result = enrichr_go_bp(["TP53", "BRCA1", "BRCA2"])

# Reactome
result = enrichr_reactome(["TP53", "BRCA1", "BRCA2"])

Popular Libraries¶

Library	Description
`KEGG_2021_Human`	KEGG pathways
`GO_Biological_Process_2021`	GO BP
`GO_Molecular_Function_2021`	GO MF
`GO_Cellular_Component_2021`	GO CC
`Reactome_2022`	Reactome pathways
`WikiPathways_2019_Human`	WikiPathways
`MSigDB_Hallmark_2020`	Hallmark gene sets

List All Libraries¶

from biodbs.fetch import enrichr_get_libraries

libraries = enrichr_get_libraries()

Working with Results¶

ORAResult Object¶

from biodbs.analysis import ora_kegg

gene_list = ["7157", "672", "675", "580", "581"]  # Entrez IDs
result = ora_kegg(gene_list, organism="hsa")

# Summary statistics
print(result.summary())

# Total terms tested
print(f"Terms tested: {len(result)}")

# As DataFrame
df = result.as_dataframe()

# Significant terms only (alpha is the significance cutoff)
significant = result.significant_terms(alpha=0.05)

# Looser cutoff, filtering on the FDR-adjusted values
significant = result.significant_terms(alpha=0.1, use_fdr=True)

DataFrame Columns¶

df = result.as_dataframe()

# Available columns
print(df.columns.tolist())
# ['term_id', 'term_name', 'p_value', 'q_value',
#  'overlap_count', 'term_size', 'overlap_genes', 'fold_enrichment']
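
The `fold_enrichment` column is not defined in this page. A common definition is the observed fraction of query genes in the term divided by the fraction expected from the background, sketched below. Treat this as an assumption about the usual convention, not a statement of how biodbs computes the column; the function and argument names are hypothetical.

```python
def fold_enrichment(overlap_count: int, query_size: int, term_size: int, background_size: int) -> float:
    """Observed term fraction in the query divided by the expected fraction."""
    observed = overlap_count / query_size
    expected = term_size / background_size
    return observed / expected


# Hypothetical numbers: 4 of 5 query genes fall in a 50-gene term,
# with a background of 20,000 genes.
fe = fold_enrichment(overlap_count=4, query_size=5, term_size=50, background_size=20000)
print(f"fold enrichment = {fe:.1f}")
```

Under this definition, values above 1 indicate more overlap than expected by chance, and values below 1 indicate depletion.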

Examples¶

Complete Workflow¶

from biodbs.analysis import ora_kegg

# Your differentially expressed genes
deg_genes = ["TP53", "BRCA1", "BRCA2", "ATM", "CHEK2", "RAD51", "PALB2"]

# KEGG enrichment (gene symbols are translated to Entrez IDs)
kegg_result = ora_kegg(
    genes=deg_genes,
    organism="hsa",
    from_id_type="symbol"
)

# View results
print(kegg_result.summary())

# Get significant pathways
significant = kegg_result.significant_terms(alpha=0.05)
df = significant.as_dataframe()

# Export
df.to_csv("kegg_enrichment.csv", index=False)

Compare Multiple Gene Lists¶

from biodbs.analysis import ora_kegg

gene_sets = {
    "upregulated": ["TP53", "BRCA1", "ATM"],
    "downregulated": ["MYC", "CCND1", "CDK4"],
}

results = {}
for name, genes in gene_sets.items():
    results[name] = ora_kegg(genes, organism="hsa", from_id_type="symbol")
    print(f"\n{name}:")
    print(results[name].summary())

Multi-Database Enrichment¶

from biodbs.analysis import ora_kegg, ora_reactome, ora_enrichr

genes = ["TP53", "BRCA1", "BRCA2", "ATM", "CHEK2"]

# KEGG
kegg = ora_kegg(genes, organism="hsa", from_id_type="symbol")

# Reactome (direct API)
reactome = ora_reactome(genes, species="Homo sapiens")

# GO BP (via EnrichR)
go_bp = ora_enrichr(genes, gene_set_library="GO_Biological_Process_2021")

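# Compare results. The ora_reactome result exposes significant_pathways()
# (see the Reactome section); the other results use significant_terms().
counts = {
    "KEGG": len(kegg.significant_terms(alpha=0.05)),
    "Reactome": len(reactome.significant_pathways(fdr_threshold=0.05)),
    "GO_BP": len(go_bp.significant_terms(alpha=0.05)),
}
for name, n_significant in counts.items():
    print(f"{name}: {n_significant} significant terms")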

Custom Gene Sets with GMT Files¶

GMT (Gene Matrix Transposed) is the standard interchange format used by GSEA, MSigDB, and most enrichment tools. biodbs can load, save, and fetch GMT gene sets, and ora() accepts a GMT path directly.
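
For orientation, each line of a GMT file is tab-separated: a gene set name, a description (often a URL or a placeholder), then the member genes. The sketch below parses that layout with the standard library. It is independent of biodbs's own `load_gmt`, and `parse_gmt` and the sample data are hypothetical.

```python
import io
from typing import Dict, Set, TextIO


def parse_gmt(handle: TextIO) -> Dict[str, Set[str]]:
    """Parse GMT text into a {gene_set_name: set_of_genes} mapping."""
    gene_sets: Dict[str, Set[str]] = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and lines without any genes
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


# Two made-up gene sets in GMT layout.
example = io.StringIO(
    "DNA_REPAIR\thttp://example.org/dna_repair\tTP53\tBRCA1\tBRCA2\n"
    "CELL_CYCLE\tna\tCDK4\tCCND1\n"
)
sets = parse_gmt(example)
print(sorted(sets["DNA_REPAIR"]))
```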

Loading a GMT File¶

from biodbs.analysis import load_gmt, ora

gene_sets = load_gmt("h.all.v2023.1.Hs.symbols.gmt")
result = ora(["TP53", "BRCA1", "BRCA2"], gene_sets)

You can also pass the path straight to ora():

result = ora(["TP53", "BRCA1", "BRCA2"], "h.all.v2023.1.Hs.symbols.gmt")

Fetching GMT Gene Sets from a Database¶

from biodbs.analysis import fetch_gmt, ora

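# Your gene list; its identifiers must match those used in the gene sets
my_genes = ["TP53", "BRCA1", "BRCA2"]

# Fetch KEGG human pathways as a GMT dict (and save locally)
gene_sets = fetch_gmt("hsa", database="kegg", save_at="./kegg_hsa.gmt")
result = ora(my_genes, gene_sets)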

# Fetch an EnrichR library
gene_sets = fetch_gmt("KEGG_2021_Human", database="enrichr")

Saving Gene Sets to GMT¶

from biodbs.analysis import save_gmt, fetch_gmt

gene_sets = fetch_gmt("hsa", database="kegg")
save_gmt(gene_sets, "my_kegg.gmt")

Related Resources¶

Data Fetching¶

KEGG - Fetch pathway data, gene lists, and compound information.
Reactome - Access curated pathway data and pathway-gene mappings.
QuickGO - Retrieve GO terms and gene annotations.
EnrichR - Direct EnrichR access with 100+ gene set libraries.

ID Translation¶

Gene ID Translation - Convert gene identifiers before enrichment analysis.
Protein ID Translation - Map UniProt accessions to gene symbols.

Knowledge Graphs¶

Building Graphs - Build pathway-based knowledge graphs from ORA results.
