ID Translation Overview¶
The biodbs.translate module provides functions for mapping between different biological identifier systems.
Related sections:
- API Reference - Complete function documentation
- Data Fetching - Fetch data using translated IDs
- Analysis - Use translated IDs in ORA analysis
- Knowledge Graph - Build graphs with cross-referenced entities
Available Translators¶
| Category | Functions | Description |
|---|---|---|
| Gene IDs | translate_gene_ids |
Map between gene identifier systems |
| Protein IDs | translate_protein_ids, translate_gene_to_uniprot |
UniProt-based protein mapping |
| Chemical IDs | translate_chemical_ids |
Map between chemical identifiers |
Quick Start¶
from biodbs.translate import (
translate_gene_ids,
translate_protein_ids,
translate_chemical_ids,
translate_gene_to_uniprot,
)
# Gene symbols to Ensembl IDs
genes = translate_gene_ids(
["TP53", "BRCA1"],
from_type="external_gene_name",
to_type="ensembl_gene_id",
return_dict=True
)
# {'TP53': 'ENSG00000141510', 'BRCA1': 'ENSG00000012048'}
# Gene symbols to UniProt accessions
proteins = translate_gene_to_uniprot(["TP53", "BRCA1", "EGFR"])
# {'TP53': 'P04637', 'BRCA1': 'P38398', 'EGFR': 'P00533'}
# UniProt to NCBI Gene ID
mapping = translate_protein_ids(
["P04637", "P00533"],
from_type="UniProtKB_AC-ID",
to_type="GeneID",
return_dict=True
)
# {'P04637': '7157', 'P00533': '1956'}
# Chemical names to PubChem CIDs
chemicals = translate_chemical_ids(
["aspirin", "caffeine"],
from_type="name",
to_type="cid"
)
Multiple Target Types¶
All main translation functions can return multiple target ID types in a single call:
# Get multiple ID types at once (more efficient than separate calls)
result = translate_gene_ids(
["TP53", "BRCA1"],
from_type="external_gene_name",
to_type=["ensembl_gene_id", "entrezgene_id", "hgnc_id"],
)
# external_gene_name ensembl_gene_id entrezgene_id hgnc_id
# 0 TP53 ENSG00000141510 7157 HGNC:11998
# 1 BRCA1 ENSG00000012048 672 HGNC:1100
# Chemical IDs
result = translate_chemical_ids(
["aspirin"],
from_type="name",
to_type=["cid", "smiles", "inchikey"],
)
# Protein IDs
result = translate_protein_ids(
["P04637"],
from_type="UniProtKB_AC-ID",
to_type=["GeneID", "Ensembl", "Gene_Name"],
)
Output Formats¶
All translation functions support two output formats:
Dictionary (return_dict=True)¶
mapping = translate_gene_ids(
["TP53", "BRCA1"],
from_type="external_gene_name",
to_type="ensembl_gene_id",
return_dict=True
)
# {'TP53': 'ENSG00000141510', 'BRCA1': 'ENSG00000012048'}
DataFrame (return_dict=False, default)¶
df = translate_gene_ids(
["TP53", "BRCA1"],
from_type="external_gene_name",
to_type="ensembl_gene_id"
)
# external_gene_name ensembl_gene_id
# 0 TP53 ENSG00000141510
# 1 BRCA1 ENSG00000012048
Database Selection¶
Many translators support multiple backend databases:
# Using BioMart (default)
result = translate_gene_ids(
["TP53"],
from_type="external_gene_name",
to_type="ensembl_gene_id",
database="biomart"
)
# Using Ensembl REST API
result = translate_gene_ids(
["ENSG00000141510"],
from_type="ensembl_gene_id",
to_type="HGNC",
database="ensembl"
)
# Using NCBI
result = translate_gene_ids(
["TP53"],
from_type="symbol",
to_type="entrez_id",
database="ncbi"
)
# Using UniProt
result = translate_gene_ids(
["TP53"],
from_type="Gene_Name",
to_type="UniProtKB",
database="uniprot"
)
Species Support¶
Specify species for organism-specific translations:
# Human (default)
result = translate_gene_ids(
["TP53", "BRCA1"],
from_type="external_gene_name",
to_type="ensembl_gene_id",
species="human"
)
# Mouse
result = translate_gene_ids(
["Trp53", "Brca1"],
from_type="external_gene_name",
to_type="ensembl_gene_id",
species="mouse"
)
Error Handling¶
Missing or unmappable IDs return None or NaN:
mapping = translate_gene_to_uniprot(
["TP53", "NOT_A_GENE", "BRCA1"]
)
# {'TP53': 'P04637', 'BRCA1': 'P38398'}
# Note: 'NOT_A_GENE' is not in the result
Next Steps¶
- Gene ID Translation - Detailed gene ID mapping guide
- Protein ID Translation - UniProt-based protein mapping
- Chemical ID Translation - Chemical identifier mapping
- ORA Analysis - Use translated IDs in enrichment analysis