Quick Start

This guide covers the essentials of biodbs: fetching data, translating identifiers, and running enrichment analysis.

The Four Namespaces

biodbs is organized into four main namespaces:

# Data fetching - retrieve data from databases
from biodbs.fetch import uniprot_get_entry, pubchem_get_compound

# ID translation - map between identifier systems
from biodbs.translate import translate_gene_ids, translate_protein_ids

# Analysis - enrichment and statistics
from biodbs.analysis import ora_kegg, ora_go

# Knowledge graph - build and export graphs (requires optional dependencies)
from biodbs.graph import build_disease_graph, to_networkx
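
Because the graph namespace depends on optional packages, a script that only sometimes needs it can guard the import. The following is a minimal sketch, assuming that importing biodbs.graph raises ImportError when its optional dependencies are missing; GRAPH_AVAILABLE and require_graph_support are hypothetical names, not part of biodbs.

```python
# Assumption: importing biodbs.graph raises ImportError when biodbs or its
# optional graph dependencies are not installed.
try:
    from biodbs.graph import build_disease_graph, to_networkx
    GRAPH_AVAILABLE = True
except ImportError:
    build_disease_graph = to_networkx = None
    GRAPH_AVAILABLE = False


def require_graph_support() -> None:
    """Raise a readable error if the graph namespace cannot be used.

    Hypothetical helper for illustration; not part of biodbs.
    """
    if not GRAPH_AVAILABLE:
        raise RuntimeError(
            "biodbs.graph is unavailable; install the optional graph dependencies."
        )
```

Calling require_graph_support() at the top of any graph-specific code path turns a late attribute error into an early, descriptive one.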

Fetching Data

Protein Data (UniProt)

Use uniprot_get_entry to fetch protein data, uniprot_search_by_gene to search by gene name, and gene_to_uniprot to map gene symbols.

from biodbs.fetch import (
    uniprot_get_entry,
    uniprot_search_by_gene,
    gene_to_uniprot,
)

# Get a protein by UniProt accession
protein = uniprot_get_entry("P04637")  # TP53
print(f"Name: {protein.entries[0].protein_name}")
print(f"Gene: {protein.entries[0].gene_name}")
print(f"Organism: {protein.entries[0].organism_name}")

# Search by gene name
results = uniprot_search_by_gene("BRCA1", organism=9606)
for entry in results.entries:
    print(f"{entry.primaryAccession}: {entry.protein_name}")

# Map gene names to UniProt accessions
mapping = gene_to_uniprot(["TP53", "BRCA1", "EGFR"])
print(mapping)
# {'TP53': 'P04637', 'BRCA1': 'P38398', 'EGFR': 'P00533'}

For more control, use the UniProt_Fetcher class directly.

Chemical Data (PubChem)

Use pubchem_get_compound to get compound data, pubchem_search_by_name to search by name, and pubchem_get_properties to get specific properties.

from biodbs.fetch import (
    pubchem_get_compound,
    pubchem_search_by_name,
    pubchem_get_properties,
)

# Get compound by CID
compound = pubchem_get_compound(2244)  # Aspirin
print(compound.results)

# Search by name
results = pubchem_search_by_name("caffeine")
cids = results.get_cids()

# Get specific properties
props = pubchem_get_properties(
    2244,
    properties=["MolecularWeight", "MolecularFormula", "CanonicalSMILES"]
)

For more control, use the PubChem_Fetcher class directly.

Gene Data (Ensembl/BioMart)

Use ensembl_lookup for gene lookups, biomart_get_genes for batch queries, and biomart_convert_ids for ID conversion.

from biodbs.fetch import (
    ensembl_lookup,
    biomart_get_genes,
    biomart_convert_ids,
)

# Lookup gene in Ensembl
gene = ensembl_lookup("ENSG00000141510")  # TP53

# Get genes via BioMart
genes = biomart_get_genes(["ENSG00000141510", "ENSG00000012048"])
df = genes.as_dataframe()

# Convert IDs
converted = biomart_convert_ids(
    ["TP53", "BRCA1"],
    from_type="external_gene_name",
    to_type="ensembl_gene_id"
)

For more control, use the Ensembl_Fetcher or BioMart_Fetcher classes directly.

ID Translation

Translate between different identifier systems using functions from the translate module:

translate_gene_ids - Gene ID conversion via BioMart
translate_protein_ids - Protein ID mapping via UniProt
translate_chemical_ids - Chemical ID translation via PubChem
translate_gene_to_uniprot - Gene symbols to UniProt accessions

from biodbs.translate import (
    translate_gene_ids,
    translate_protein_ids,
    translate_chemical_ids,
    translate_gene_to_uniprot,
)

# Gene symbols to Ensembl IDs
result = translate_gene_ids(
    ["TP53", "BRCA1"],
    from_type="external_gene_name",
    to_type="ensembl_gene_id",
    return_dict=True
)
# {'TP53': 'ENSG00000141510', 'BRCA1': 'ENSG00000012048'}

# Gene symbols to UniProt accessions
mapping = translate_gene_to_uniprot(["TP53", "BRCA1", "EGFR"])
# {'TP53': 'P04637', 'BRCA1': 'P38398', 'EGFR': 'P00533'}

# UniProt to NCBI Gene IDs
result = translate_protein_ids(
    ["P04637", "P00533"],
    from_type="UniProtKB_AC-ID",
    to_type="GeneID",
    return_dict=True
)
# {'P04637': '7157', 'P00533': '1956'}

# Chemical name to PubChem CID
result = translate_chemical_ids(
    ["aspirin", "caffeine"],
    from_type="name",
    to_type="cid"
)
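
Dictionary results like the ones above can be chained to bridge two identifier systems that have no direct mapping. The sketch below uses plain dictionaries copied from the example comments, so it runs without network access; chain_mappings is a hypothetical helper, not a biodbs function.

```python
from typing import Dict, List, Tuple


def chain_mappings(
    first: Dict[str, str], second: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Compose two ID mappings and report source IDs that could not be chained.

    Hypothetical helper for illustration; not part of biodbs.
    """
    chained: Dict[str, str] = {}
    unmapped: List[str] = []
    for source_id, intermediate_id in first.items():
        if intermediate_id in second:
            chained[source_id] = second[intermediate_id]
        else:
            unmapped.append(source_id)
    return chained, unmapped


# Values taken from the example output comments above.
symbol_to_uniprot = {"TP53": "P04637", "BRCA1": "P38398", "EGFR": "P00533"}
uniprot_to_geneid = {"P04637": "7157", "P00533": "1956"}

symbol_to_geneid, unmapped = chain_mappings(symbol_to_uniprot, uniprot_to_geneid)
print(symbol_to_geneid)  # {'TP53': '7157', 'EGFR': '1956'}
print(unmapped)          # ['BRCA1']
```

Tracking the unmapped IDs separately makes it obvious when a translation step silently drops entries, here because P38398 was not included in the second query.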

Enrichment Analysis

Perform over-representation analysis using the analysis module:

ora_kegg - KEGG pathway enrichment
ora_go - Gene Ontology enrichment
ora_reactome - Reactome pathway enrichment
ORAResult - Result container with filtering and export methods

from biodbs.analysis import ora_kegg, ora_go

# KEGG pathway enrichment
result = ora_kegg(
    gene_list=["TP53", "BRCA1", "BRCA2", "ATM", "CHEK2"],
    organism="hsa",
    id_type="symbol"  # Auto-converts to Entrez IDs
)

# View results
print(result.summary())
df = result.as_dataframe()
print(df[["term_id", "term_name", "p_value", "overlap_genes"]])

# Get significant terms
significant = result.significant_terms(alpha=0.05)
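
For intuition about the p_value column: over-representation analysis is commonly scored with a one-sided hypergeometric test, which asks how likely it is to see at least the observed overlap by chance. The sketch below shows that calculation with the standard library only. It is not taken from the biodbs implementation, which may differ in background definition or p-value adjustment; hypergeom_sf and the example counts are hypothetical.

```python
from math import comb


def hypergeom_sf(overlap: int, background: int, term_size: int, list_size: int) -> float:
    """Return P(X >= overlap) for a hypergeometric draw.

    background: number of genes in the background set
    term_size:  number of background genes annotated to the term
    list_size:  number of genes in the query list
    overlap:    number of query genes annotated to the term
    """
    upper = min(term_size, list_size)
    total = comb(background, list_size)
    # math.comb returns 0 when k > n, so impossible overlaps contribute nothing.
    favourable = sum(
        comb(term_size, i) * comb(background - term_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


# Hypothetical counts: 3 of 5 query genes fall in a 100-gene pathway,
# against a background of 20,000 genes.
p = hypergeom_sf(overlap=3, background=20_000, term_size=100, list_size=5)
print(f"p-value: {p:.3e}")
```

A small p-value here means the overlap is larger than expected for a random gene list of the same size; when many terms are tested, the raw p-values are usually corrected for multiple testing before thresholding.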

Output Formats

All fetch operations return data objects with multiple export options:

from biodbs.fetch import uniprot_get_entries

data = uniprot_get_entries(["P04637", "P00533", "P38398"])

# As dictionary
records = data.as_dict()

# As pandas DataFrame
pandas_df = data.as_dataframe(engine="pandas")

# As Polars DataFrame
polars_df = data.as_dataframe(engine="polars")

# Filter and transform
reviewed = data.filter_reviewed()
human_only = data.filter_by_organism(9606)

# Get specific data
accessions = data.get_accessions()
gene_names = data.get_gene_names()
sequences = data.get_sequences()

Next Steps

Learn about core concepts
Explore specific databases in Data Fetching
See all ID translation options
Perform enrichment analysis
Build knowledge graphs in the Graph module