Translate Module API Reference¶
Complete reference for the biodbs.translate module.
Key Features¶
- Universal ID aliases: Use GeneIDType enum values (e.g. "gene_symbol", "entrez_id") instead of database-native field names; the correct name is resolved per backend automatically.
- Multiple Target Types: All main translation functions accept either a single target type or a list. When a list is provided, all target IDs are returned in one call.
from biodbs.translate import translate_gene_ids, GeneIDType
# Universal alias (works with any database)
result = translate_gene_ids(["TP53"], from_type="gene_symbol", to_type="ensembl_gene_id")
# Enum members are equivalent
result = translate_gene_ids(["TP53"], from_type=GeneIDType.GENE_SYMBOL,
to_type=GeneIDType.ENSEMBL_GENE_ID)
# Multiple target types — more efficient than multiple calls
result = translate_gene_ids(
["TP53"],
from_type="gene_symbol",
to_type=["ensembl_gene_id", "entrez_id", "hgnc_id"]
)
Enums¶
GeneIDType¶
Bases: str, Enum
Universal gene / protein identifier types.
Use these members as from_type / to_type in translate_gene_ids
instead of database-specific field names. The correct native name for the
chosen database is resolved automatically.
Raw native strings (e.g. "external_gene_name", "Gene_Name") are
still accepted everywhere and passed through unchanged.
Examples:
>>> from biodbs.translate import GeneIDType
>>> translate_gene_ids(["TP53"], from_type=GeneIDType.GENE_SYMBOL,
... to_type=GeneIDType.ENSEMBL_GENE_ID)
>>> translate_gene_ids(["TP53"], from_type="gene_symbol", # same
... to_type="ensembl_gene_id")
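Because GeneIDType derives from both str and Enum, a member compares equal to its plain string value, which is consistent with enum members and strings being interchangeable in the calls above. The following is a minimal sketch of that mechanism. `GeneIDTypeSketch`, its three members, and `normalize_id_type` are assumptions modelled on the aliases in this reference, not the library's actual definitions.

```python
from enum import Enum


class GeneIDTypeSketch(str, Enum):
    """Hypothetical stand-in for GeneIDType (illustration only)."""

    GENE_SYMBOL = "gene_symbol"
    ENSEMBL_GENE_ID = "ensembl_gene_id"
    ENTREZ_ID = "entrez_id"


def normalize_id_type(id_type):
    """Return the plain string for an enum member or a raw string.

    Known aliases are round-tripped through the enum. Anything else is
    treated as a database-native name and passed through unchanged.
    """
    try:
        return GeneIDTypeSketch(id_type).value
    except ValueError:
        return str(id_type)


# A str-based enum member is equal to its string value.
print(GeneIDTypeSketch.GENE_SYMBOL == "gene_symbol")
print(normalize_id_type(GeneIDTypeSketch.ENTREZ_ID))
print(normalize_id_type("external_gene_name"))
```

The pass-through branch mirrors the statement that raw native strings are accepted unchanged; how the real module handles unknown names is not shown in this reference.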
TranslationDatabase¶
Bases: str, Enum
Databases available for gene ID translation.
Use these as the database parameter in translate_gene_ids and
the translation_database parameter in ORA functions.
Members
- NCBI: NCBI Datasets API. Most stable; best for symbol ↔ Entrez ↔ Ensembl translations. Default for translate_gene_ids.
- ENSEMBL: Ensembl REST API (xrefs endpoint). More stable than BioMart; natural choice when working with Ensembl IDs.
- UNIPROT: UniProt ID-mapping API. Best for protein-centric translations (UniProt accession, PDB, RefSeq protein).
- BIOMART: BioMart / Ensembl query interface. Supports the widest range of ID types but is less reliable than the other options.
- HGNC: HGNC REST API. Authoritative for human gene nomenclature; best for translations involving HGNC IDs, approved symbols, aliases, and previous symbols. Human only.
Examples:
>>> from biodbs.translate import TranslationDatabase, translate_gene_ids
>>> translate_gene_ids(["TP53"], from_type="gene_symbol",
... to_type="ensembl_gene_id",
... database=TranslationDatabase.NCBI)
>>> # Raw strings still work for backwards compatibility
>>> translate_gene_ids(["TP53"], from_type="gene_symbol",
... to_type="ensembl_gene_id",
... database="ncbi")
Functions Summary¶
Gene Translation¶
| Function | Description |
|---|---|
| translate_gene_ids | Translate gene IDs between databases |
| translate_gene_ids_kegg | Translate gene IDs using KEGG API |
Chemical Translation¶
| Function | Description |
|---|---|
| translate_chemical_ids | Translate chemical IDs via PubChem |
| translate_chemical_ids_kegg | Translate chemical IDs using KEGG API |
| translate_chembl_to_pubchem | Map ChEMBL IDs to PubChem CIDs |
| translate_pubchem_to_chembl | Map PubChem CIDs to ChEMBL IDs |
Protein Translation¶
| Function | Description |
|---|---|
| translate_protein_ids | Translate protein IDs via UniProt ID mapping |
| translate_gene_to_uniprot | Map gene symbols to UniProt accessions |
| translate_uniprot_to_gene | Map UniProt accessions to gene symbols |
| translate_uniprot_to_pdb | Map UniProt accessions to PDB IDs |
| translate_uniprot_to_ensembl | Map UniProt accessions to Ensembl gene IDs |
| translate_uniprot_to_refseq | Map UniProt accessions to RefSeq protein IDs |
Gene Translation¶
translate_gene_ids¶
translate_gene_ids(
ids: List[str],
from_type: Union[GeneIDType, str],
to_type: Union[
GeneIDType, str, List[Union[GeneIDType, str]]
],
species: Union[Species, str, int] = HUMAN,
database: Union[
TranslationDatabase,
Literal[
"ncbi", "ensembl", "uniprot", "biomart", "hgnc"
],
] = NCBI,
return_dict: bool = False,
) -> Union[
Dict[str, str], Dict[str, Dict[str, str]], DataFrame
]
Translate gene IDs between different identifier types.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ids | List[str] | List of gene IDs to translate. | required |
| from_type | Union[GeneIDType, str] | Source ID type. | required |
| to_type | Union[GeneIDType, str, List[Union[GeneIDType, str]]] | Target ID type(s). Can be a single string or a list of strings. When a list is provided, multiple target IDs are returned. | required |
| species | Union[Species, str, int] | Species to translate for. Accepts a :class: | HUMAN |
| database | Union[TranslationDatabase, Literal['ncbi', 'ensembl', 'uniprot', 'biomart', 'hgnc']] | Database backend for translation. Accepts a :class: | NCBI |
| return_dict | bool | If True, return a dict mapping from_id -> to_id (or dict of to_ids when to_type is a list). If False (default), return a DataFrame. | False |
Supported ID types for NCBI
- symbol / gene_symbol: Gene symbol (e.g., "TP53")
- entrez_id / gene_id: NCBI Gene ID (e.g., "7157")
- refseq_accession: RefSeq accession (e.g., "NM_000546.6")
- ensembl_gene_id: Ensembl gene ID (output only)
- uniprot / swiss_prot: UniProt accession (output only)
Supported ID types for BioMart
- ensembl_gene_id: Ensembl gene ID (e.g., "ENSG00000141510")
- ensembl_transcript_id: Ensembl transcript ID
- ensembl_peptide_id: Ensembl protein ID
- external_gene_name: Gene symbol (e.g., "TP53")
- hgnc_symbol: HGNC symbol
- hgnc_id: HGNC ID (e.g., "HGNC:11998")
- entrezgene_id: NCBI Entrez gene ID
- uniprot_gn_id: UniProt gene name
- refseq_mrna: RefSeq mRNA ID
- refseq_peptide: RefSeq protein ID
Supported ID types for Ensembl REST
- Input (from_type): Ensembl stable IDs (ENSG*, ENST*, ENSP*)
- Output (to_type): Filter by external_db name (e.g., "HGNC", "EntrezGene", "Uniprot_gn", "RefSeq_mRNA", "RefSeq_peptide")
Supported ID types for UniProt
- UniProtKB_AC-ID: UniProt accession (e.g., "P04637")
- Gene_Name: Gene symbol (e.g., "TP53")
- GeneID: NCBI Gene ID (e.g., "7157")
- Ensembl: Ensembl gene ID
- RefSeq_Protein: RefSeq protein ID
- PDB: PDB structure ID
Supported ID types for HGNC (human only)
- symbol / gene_symbol / hgnc_symbol: Approved gene symbol
- hgnc_id: HGNC ID (e.g., "HGNC:11998")
- entrez_id: NCBI Gene ID
- ensembl_gene_id: Ensembl stable gene ID
- uniprot_id → uniprot_ids field (first accession returned)
- refseq_mrna / refseq_protein → refseq_accession field (first returned)
Returns:
| Type | Description |
|---|---|
| Union[Dict[str, str], Dict[str, Dict[str, str]], DataFrame] | When to_type is a string: Dict mapping source IDs to target IDs, or DataFrame with both columns. |
| Union[Dict[str, str], Dict[str, Dict[str, str]], DataFrame] | When to_type is a list: Dict mapping source IDs to dicts of {target_type: target_id}, or DataFrame with from_type column and one column per target type. |
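The two dict shapes described above (flat for a single target type, nested for a list of target types) can be handled uniformly by downstream code. The sketch below shows one way to do that. `dict_result_to_rows` is a hypothetical helper, and the sample dicts are hand-written to match the documented shapes rather than captured from a real call.

```python
from typing import Dict, List, Union

SingleResult = Dict[str, str]
MultiResult = Dict[str, Dict[str, str]]


def dict_result_to_rows(
    result: Union[SingleResult, MultiResult],
    from_type: str,
    to_type: str = "to_id",
) -> List[Dict[str, str]]:
    """Flatten either documented dict shape into a list of row dicts.

    `to_type` names the target column and is only used for the flat shape;
    the nested shape already carries its own target-type keys.
    """
    rows = []
    for source_id, target in result.items():
        row = {from_type: source_id}
        if isinstance(target, dict):
            row.update(target)      # nested shape: one key per target type
        else:
            row[to_type] = target   # flat shape: a single target ID
        rows.append(row)
    return rows


# Hand-written samples in the documented shapes (not real library output).
single = {"TP53": "ENSG00000141510", "BRCA1": "ENSG00000012048"}
multi = {"TP53": {"ensembl_gene_id": "ENSG00000141510", "entrez_id": "7157"}}

print(dict_result_to_rows(single, "gene_symbol", "ensembl_gene_id"))
print(dict_result_to_rows(multi, "gene_symbol"))
```

When a tabular result is all that is needed, leaving return_dict at its default of False avoids this step, since the function then returns a DataFrame directly.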
Example
Gene symbols to Ensembl IDs using the universal enum:
from biodbs.translate import GeneIDType
result = translate_gene_ids(
["TP53", "BRCA1", "EGFR"],
from_type=GeneIDType.GENE_SYMBOL,
to_type=GeneIDType.ENSEMBL_GENE_ID,
)
print(result)
# external_gene_name ensembl_gene_id
# 0 TP53 ENSG00000141510
# 1 BRCA1 ENSG00000012048
# 2 EGFR ENSG00000146648
Raw database-native strings still work (backwards compatible):
result = translate_gene_ids(
["TP53", "BRCA1"],
from_type="external_gene_name", # BioMart native
to_type="ensembl_gene_id",
database="biomart", # native names are passed through unchanged, so select the matching backend
)
Ensembl IDs to HGNC (using Ensembl REST API):
result = translate_gene_ids(
["ENSG00000141510", "ENSG00000012048"],
from_type=GeneIDType.ENSEMBL_GENE_ID,
to_type=GeneIDType.GENE_SYMBOL,
database="ensembl",
)
Multiple target types (BioMart):
translate_gene_ids_kegg¶
Translate gene IDs using the KEGG database.
Useful for converting between KEGG gene IDs and external databases.
Supported databases
- KEGG organism codes: "hsa" (human), "mmu" (mouse), "rno" (rat), etc.
- ncbi-geneid: NCBI Entrez Gene ID
- ncbi-proteinid: NCBI Protein ID
- uniprot: UniProt accession
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ids | List[str] | List of gene IDs to translate (e.g., ["hsa:7157", "hsa:672"]). | required |
| from_db | str | Source database. Use KEGG entry IDs or external DB name. | required |
| to_db | str | Target database name. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with source and target ID columns. |
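The "hsa:7157" form used in the ids example is an organism code and an entry joined by a colon. Below is a minimal sketch of splitting and checking such IDs before passing them on. `split_kegg_gene_id` and `KEGG_ORGANISMS` are hypothetical helpers for illustration only, and the organism table holds just the three codes listed above.

```python
from typing import Tuple

# Only the organism codes named in this reference; KEGG defines many more.
KEGG_ORGANISMS = {"hsa": "human", "mmu": "mouse", "rno": "rat"}


def split_kegg_gene_id(kegg_id: str) -> Tuple[str, str]:
    """Split 'hsa:7157' into ('hsa', '7157').

    Raises ValueError when the ID lacks the '<prefix>:<entry>' form.
    """
    prefix, sep, entry = kegg_id.partition(":")
    if not sep or not prefix or not entry:
        raise ValueError(f"not a prefixed KEGG ID: {kegg_id!r}")
    return prefix, entry


for kegg_id in ["hsa:7157", "hsa:672"]:
    organism, entry = split_kegg_gene_id(kegg_id)
    print(organism, KEGG_ORGANISMS.get(organism, "unknown"), entry)
```

Validating the prefix locally is a design choice for catching malformed input early; it is an assumption that this is useful here, since the reference does not say how the function reacts to unprefixed IDs.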
Chemical Translation¶
translate_chemical_ids¶
translate_chemical_ids(
ids: List[str],
from_type: str,
to_type: Union[str, List[str]],
return_dict: bool = False,
) -> Union[
Dict[str, str], Dict[str, Dict[str, str]], DataFrame
]
Translate chemical/compound IDs between different identifier types.
Uses PubChem for ID conversion.
Supported ID types
- cid: PubChem Compound ID
- name: Compound name
- smiles: SMILES string (canonical)
- inchikey: InChIKey
- inchi: InChI string
- formula: Molecular formula
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ids | List[str] | List of compound identifiers to translate. | required |
| from_type | str | Source ID type ("cid", "name", "smiles", "inchikey"). | required |
| to_type | Union[str, List[str]] | Target ID type(s). Can be a single string or a list of strings. When a list is provided, multiple target IDs are returned. Valid types: "cid", "name", "smiles", "inchikey", "inchi", "formula". | required |
| return_dict | bool | If True, return dict mapping from_id -> to_id (or dict of to_ids when to_type is a list). | False |
Returns:
| Type | Description |
|---|---|
| Union[Dict[str, str], Dict[str, Dict[str, str]], DataFrame] | When to_type is a string: Dict or DataFrame with translated IDs. |
| Union[Dict[str, str], Dict[str, Dict[str, str]], DataFrame] | When to_type is a list: Dict mapping source IDs to dicts of {target_type: target_id}, or DataFrame with from_type column and one column per target type. |
Example
Names to CIDs:
result = translate_chemical_ids(
["aspirin", "ibuprofen"],
from_type="name",
to_type="cid",
)
print(result)
# name cid cid
# 0 aspirin 2244 2244
# 1 ibuprofen 3672 3672
CIDs to SMILES:
result = translate_chemical_ids(
["2244", "3672"],
from_type="cid",
to_type="smiles",
return_dict=True,
)
print(result)
# {'2244': 'CC(=O)OC1=CC=CC=C1C(=O)O', '3672': 'CC(C)CC1=CC=C(C=C1)C(C)C(=O)O'}
Multiple target types:
translate_chemical_ids_kegg¶
Translate chemical/compound IDs using the KEGG database.
Useful for converting between KEGG compound/drug IDs and external databases.
Supported databases
- compound: KEGG Compound
- drug: KEGG Drug
- pubchem: PubChem CID
- chebi: ChEBI ID
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ids | List[str] | List of compound IDs to translate (e.g., ["cpd:C00022", "dr:D00001"]). If empty, converts entire database. | required |
| from_db | str | Source database (compound, drug, or entries). | required |
| to_db | str | Target database name. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with source and target ID columns. |
translate_chembl_to_pubchem¶
translate_chembl_to_pubchem(
chembl_ids: List[str], return_dict: bool = False
) -> Union[Dict[str, int], DataFrame]
Translate ChEMBL molecule IDs to PubChem CIDs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chembl_ids | List[str] | List of ChEMBL IDs (e.g., ["CHEMBL25", "CHEMBL1201585"]). | required |
| return_dict | bool | If True, return dict mapping ChEMBL ID -> PubChem CID. | False |
Returns:
| Type | Description |
|---|---|
| Union[Dict[str, int], DataFrame] | Dict or DataFrame with ChEMBL IDs and corresponding PubChem CIDs. |
translate_pubchem_to_chembl¶
translate_pubchem_to_chembl(
cids: List[int], return_dict: bool = False
) -> Union[Dict[int, str], DataFrame]
Translate PubChem CIDs to ChEMBL molecule IDs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cids | List[int] | List of PubChem CIDs (e.g., [2244, 3672]). | required |
| return_dict | bool | If True, return dict mapping CID -> ChEMBL ID. | False |
Returns:
| Type | Description |
|---|---|
| Union[Dict[int, str], DataFrame] | Dict or DataFrame with PubChem CIDs and corresponding ChEMBL IDs. |
Protein Translation¶
translate_protein_ids¶
translate_protein_ids(
ids: List[str],
from_type: str,
to_type: Union[str, List[str]],
organism: int = 9606,
return_dict: bool = False,
) -> Union[
Dict[str, str], Dict[str, Dict[str, str]], DataFrame
]
Translate protein/gene IDs using the UniProt ID mapping service.
This function provides comprehensive ID translation between various biological databases using the UniProt ID mapping API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ids | List[str] | List of IDs to translate. | required |
| from_type | str | Source ID type. Common options: "UniProtKB_AC-ID" (UniProt accession, e.g., "P04637"); "Gene_Name" (gene name/symbol, e.g., "TP53"); "GeneID" (NCBI Gene ID, e.g., "7157"); "Ensembl" (Ensembl gene ID, e.g., "ENSG00000141510"); "RefSeq_Protein" (RefSeq protein ID); "PDB" (PDB structure ID). | required |
| to_type | Union[str, List[str]] | Target ID type(s). Can be a single string or a list of strings. When a list is provided, multiple target IDs are returned. Common options: "UniProtKB" (UniProt entry, returns accession); "UniProtKB_AC-ID" (UniProt accession); "GeneID" (NCBI Gene ID); "Ensembl" (Ensembl gene ID); "Ensembl_Protein" (Ensembl protein ID); "RefSeq_Protein" (RefSeq protein ID); "PDB" (PDB structure ID); "STRING" (STRING database ID); "ChEMBL" (ChEMBL target ID). | required |
| organism | int | NCBI taxonomy ID (default: 9606 for human). Only used for Gene_Name -> UniProt mapping. | 9606 |
| return_dict | bool | If True, return dict mapping from_id -> to_id (or dict of to_ids when to_type is a list). If False, return DataFrame. | False |
Returns:
| Type | Description |
|---|---|
| Union[Dict[str, str], Dict[str, Dict[str, str]], DataFrame] | When to_type is a string: Dict mapping source IDs to target IDs, or DataFrame with mapping. |
| Union[Dict[str, str], Dict[str, Dict[str, str]], DataFrame] | When to_type is a list: Dict mapping source IDs to dicts of {target_type: target_id}, or DataFrame with from column and one column per target type. |
Example
UniProt to NCBI Gene ID:
result = translate_protein_ids(
["P04637", "P00533"],
from_type="UniProtKB_AC-ID",
to_type="GeneID",
)
print(result)
# from to
# 0 P04637 7157
# 1 P00533 1956
Gene names to UniProt:
result = translate_protein_ids(
["TP53", "EGFR"],
from_type="Gene_Name",
to_type="UniProtKB",
)
print(result)
# from to
# 0 TP53 P04637
# 1 EGFR P00533
Multiple target types:
translate_gene_to_uniprot¶
translate_gene_to_uniprot(
gene_names: List[str],
organism: int = 9606,
reviewed_only: bool = True,
return_dict: bool = True,
) -> Union[Dict[str, str], DataFrame]
Translate gene names/symbols to UniProt accessions.
This is a convenience function for the common use case of mapping gene symbols to their canonical UniProt protein accessions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gene_names | List[str] | List of gene names/symbols (e.g., ["TP53", "BRCA1"]). | required |
| organism | int | NCBI taxonomy ID (default: 9606 for human). | 9606 |
| reviewed_only | bool | Only return reviewed (Swiss-Prot) entries. | True |
| return_dict | bool | If True, return dict. If False, return DataFrame. | True |
Returns:
| Type | Description |
|---|---|
| Union[Dict[str, str], DataFrame] | Dict or DataFrame mapping gene names to UniProt accessions. |
translate_uniprot_to_gene¶
translate_uniprot_to_gene(
accessions: List[str], return_dict: bool = True
) -> Union[Dict[str, str], DataFrame]
Translate UniProt accessions to gene names/symbols.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| accessions | List[str] | List of UniProt accessions (e.g., ["P04637", "P00533"]). | required |
| return_dict | bool | If True, return dict. If False, return DataFrame. | True |
Returns:
| Type | Description |
|---|---|
| Union[Dict[str, str], DataFrame] | Dict or DataFrame mapping UniProt accessions to gene names. |
translate_uniprot_to_pdb¶
translate_uniprot_to_pdb(
accessions: List[str], return_dict: bool = True
) -> Union[Dict[str, List[str]], DataFrame]
Translate UniProt accessions to PDB structure IDs.
Note: One protein may have multiple PDB structures.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| accessions | List[str] | List of UniProt accessions. | required |
| return_dict | bool | If True, return dict. If False, return DataFrame. | True |
Returns:
| Type | Description |
|---|---|
| Union[Dict[str, List[str]], DataFrame] | Dict mapping accessions to lists of PDB IDs, or DataFrame. |
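Since one protein may have several PDB structures, the dict form maps each accession to a list. A minimal sketch of expanding such a one-to-many mapping into (accession, PDB ID) pairs follows. `explode_mapping` is a hypothetical helper, and the sample dict is hand-written in the documented shape; it is neither real library output nor a complete list of structures.

```python
from typing import Dict, List, Tuple


def explode_mapping(mapping: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Turn {accession: [pdb_id, ...]} into a flat list of (accession, pdb_id)."""
    return [
        (accession, pdb_id)
        for accession, pdb_ids in mapping.items()
        for pdb_id in pdb_ids
    ]


# Hand-written sample in the documented Dict[str, List[str]] shape.
sample = {"P04637": ["1TUP", "1TSR"], "P00533": ["1M17"]}

for accession, pdb_id in explode_mapping(sample):
    print(accession, pdb_id)
```

The same helper would apply to translate_uniprot_to_refseq, whose documented dict return type is also Dict[str, List[str]].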
translate_uniprot_to_ensembl¶
translate_uniprot_to_ensembl(
accessions: List[str], return_dict: bool = True
) -> Union[Dict[str, str], DataFrame]
Translate UniProt accessions to Ensembl gene IDs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| accessions | List[str] | List of UniProt accessions. | required |
| return_dict | bool | If True, return dict. If False, return DataFrame. | True |
Returns:
| Type | Description |
|---|---|
| Union[Dict[str, str], DataFrame] | Dict or DataFrame mapping UniProt accessions to Ensembl IDs. |
translate_uniprot_to_refseq¶
translate_uniprot_to_refseq(
accessions: List[str], return_dict: bool = True
) -> Union[Dict[str, List[str]], DataFrame]
Translate UniProt accessions to RefSeq protein IDs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| accessions | List[str] | List of UniProt accessions. | required |
| return_dict | bool | If True, return dict. If False, return DataFrame. | True |
Returns:
| Type | Description |
|---|---|
| Union[Dict[str, List[str]], DataFrame] | Dict mapping accessions to lists of RefSeq IDs, or DataFrame. |
ID Type Reference¶
Universal Gene ID Aliases¶
Use these values for from_type / to_type in translate_gene_ids. The correct
database-native name is resolved automatically per backend.
| Universal alias | BioMart | NCBI | UniProt | Ensembl REST | HGNC |
|---|---|---|---|---|---|
| gene_symbol | external_gene_name | symbol | Gene_Name | HGNC | symbol |
| ensembl_gene_id | ensembl_gene_id | ensembl_gene_id | Ensembl | ensembl_gene_id | ensembl_gene_id |
| ensembl_transcript_id | ensembl_transcript_id | — | — | — | — |
| ensembl_protein_id | ensembl_peptide_id | — | — | — | — |
| entrez_id | entrezgene_id | gene_id | GeneID | EntrezGene | entrez_id |
| hgnc_id | hgnc_id | — | — | — | hgnc_id |
| hgnc_symbol | hgnc_symbol | — | — | — | symbol |
| uniprot_id | uniprot_gn_id | uniprot | UniProtKB_AC-ID | Uniprot_gn | uniprot_ids |
| refseq_mrna | refseq_mrna | refseq_accession | — | RefSeq_mRNA | refseq_accession |
| refseq_protein | refseq_peptide | refseq_accession | RefSeq_Protein | RefSeq_peptide | refseq_accession |
| pdb_id | — | — | PDB | — | — |
Native database strings (e.g. "external_gene_name") are also accepted and passed through unchanged.
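The alias table can be read as a two-level lookup: universal alias first, then backend. The sketch below illustrates that reading with a few rows copied from the table. `ALIAS_TABLE` and `resolve_id_type` are hypothetical names, and raising ValueError for a missing backend entry is an assumption, since this reference does not describe how the library treats unsupported combinations.

```python
# A few rows copied from the alias table; keys are the database strings
# accepted by the database parameter.
ALIAS_TABLE = {
    "gene_symbol": {
        "biomart": "external_gene_name",
        "ncbi": "symbol",
        "uniprot": "Gene_Name",
        "ensembl": "HGNC",
        "hgnc": "symbol",
    },
    "entrez_id": {
        "biomart": "entrezgene_id",
        "ncbi": "gene_id",
        "uniprot": "GeneID",
        "ensembl": "EntrezGene",
        "hgnc": "entrez_id",
    },
    "pdb_id": {"uniprot": "PDB"},
}


def resolve_id_type(id_type: str, database: str) -> str:
    """Map a universal alias to the backend's native name.

    Strings that are not known aliases are assumed to be native names
    already and are returned unchanged.
    """
    per_backend = ALIAS_TABLE.get(id_type)
    if per_backend is None:
        return id_type  # native pass-through
    if database not in per_backend:
        # Assumed behaviour for a dash in the table.
        raise ValueError(f"{id_type!r} has no equivalent in {database!r}")
    return per_backend[database]


print(resolve_id_type("gene_symbol", "uniprot"))
print(resolve_id_type("entrez_id", "biomart"))
print(resolve_id_type("external_gene_name", "biomart"))
```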
Protein ID Types (UniProt mapping)¶
| ID Type | Description |
|---|---|
| UniProtKB_AC-ID | UniProt accession |
| Gene_Name | Gene symbol |
| GeneID | NCBI Gene ID |
| Ensembl | Ensembl gene ID |
| RefSeq_Protein | RefSeq protein ID |
| PDB | PDB structure ID |
Chemical ID Types (PubChem)¶
| ID Type | Description |
|---|---|
| name | Compound name |
| cid | PubChem CID |
| smiles | SMILES string |
| inchikey | InChIKey |
| inchi | InChI string |
| formula | Molecular formula |