Gene ID Translation

Translate between different gene identifier systems using translate_gene_ids.

Quick Start

from biodbs.translate import translate_gene_ids

# Gene symbols → Ensembl IDs (using universal aliases)
result = translate_gene_ids(
    ["TP53", "BRCA1", "EGFR"],
    from_type="gene_symbol",
    to_type="ensembl_gene_id",
)

# Gene symbols → Entrez IDs
result = translate_gene_ids(
    ["TP53", "BRCA1"],
    from_type="gene_symbol",
    to_type="entrez_id",
    database="ncbi",      # default
)

Universal ID Type Aliases

Every database has its own field names for the same concept. biodbs provides a set of universal aliases that work regardless of which backend you choose — the correct native name is resolved automatically.

GeneIDType enum

from biodbs.translate import GeneIDType, translate_gene_ids

# Use enum members as from_type / to_type
result = translate_gene_ids(
    ["TP53", "BRCA1"],
    from_type=GeneIDType.GENE_SYMBOL,
    to_type=GeneIDType.ENSEMBL_GENE_ID,
)

# Plain strings with the same values work identically
result = translate_gene_ids(
    ["TP53", "BRCA1"],
    from_type="gene_symbol",   # same as GeneIDType.GENE_SYMBOL
    to_type="ensembl_gene_id", # same as GeneIDType.ENSEMBL_GENE_ID
)

Universal aliases and their values

| GeneIDType member | String value | Description | Example |
|---|---|---|---|
| GENE_SYMBOL | "gene_symbol" | Approved gene symbol | "TP53" |
| ENSEMBL_GENE_ID | "ensembl_gene_id" | Ensembl stable gene ID | "ENSG00000141510" |
| ENSEMBL_TRANSCRIPT_ID | "ensembl_transcript_id" | Ensembl transcript ID | "ENST00000269305" |
| ENSEMBL_PROTEIN_ID | "ensembl_protein_id" | Ensembl protein ID | "ENSP00000269305" |
| ENTREZ_ID | "entrez_id" | NCBI Entrez Gene ID | "7157" |
| HGNC_ID | "hgnc_id" | HGNC identifier | "HGNC:11998" |
| HGNC_SYMBOL | "hgnc_symbol" | HGNC-curated symbol | "TP53" |
| UNIPROT_ID | "uniprot_id" | UniProt accession | "P04637" |
| REFSEQ_MRNA | "refseq_mrna" | RefSeq mRNA accession | "NM_000546" |
| REFSEQ_PROTEIN | "refseq_protein" | RefSeq protein accession | "NP_000537" |
| PDB_ID | "pdb_id" | PDB structure ID | "2OCJ" |

How aliases resolve per database

When you pass a universal alias, it is automatically mapped to the native field name required by the chosen backend. Native field names are also accepted and passed through unchanged — so existing code keeps working.

| Universal alias | BioMart | NCBI | UniProt | Ensembl REST | HGNC |
|---|---|---|---|---|---|
| gene_symbol | external_gene_name | symbol | Gene_Name | HGNC | symbol |
| ensembl_gene_id | ensembl_gene_id | ensembl_gene_id | Ensembl | ensembl_gene_id | ensembl_gene_id |
| ensembl_transcript_id | ensembl_transcript_id | | | | |
| ensembl_protein_id | ensembl_peptide_id | | | | |
| entrez_id | entrezgene_id | gene_id | GeneID | EntrezGene | entrez_id |
| hgnc_id | hgnc_id | | | | hgnc_id |
| hgnc_symbol | hgnc_symbol | | | | symbol |
| uniprot_id | uniprot_gn_id | uniprot | UniProtKB_AC-ID | Uniprot_gn | uniprot_ids |
| refseq_mrna | refseq_mrna | refseq_accession | | RefSeq_mRNA | refseq_accession |
| refseq_protein | refseq_peptide | refseq_accession | RefSeq_Protein | RefSeq_peptide | refseq_accession |
| pdb_id | | | PDB | | |

Native strings are always accepted

If you pass a value that is not in the alias map (e.g. "external_gene_name" or "Gene_Name"), it is forwarded to the database unchanged. This means database-native field names still work, but universal aliases are preferred for portability.
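
The resolution rule described above (use the alias if it is known, otherwise pass the value through) can be pictured as a dictionary lookup with a fallback. The sketch below is not the biodbs implementation: `ALIAS_MAP` and `resolve_id_type` are hypothetical names, and the map holds only a few rows copied from the table above.

```python
# Hypothetical sketch of alias resolution; not the biodbs source code.
# Rows are copied from the per-database alias table in this guide.
ALIAS_MAP = {
    "ncbi": {"gene_symbol": "symbol", "entrez_id": "gene_id"},
    "uniprot": {"gene_symbol": "Gene_Name", "entrez_id": "GeneID"},
    "biomart": {"gene_symbol": "external_gene_name", "entrez_id": "entrezgene_id"},
}


def resolve_id_type(id_type: str, database: str) -> str:
    """Return the native field name for a universal alias.

    Values that are not universal aliases are returned unchanged,
    mirroring the pass-through behaviour described in the text.
    """
    return ALIAS_MAP.get(database, {}).get(id_type, id_type)


print(resolve_id_type("gene_symbol", "biomart"))         # external_gene_name
print(resolve_id_type("external_gene_name", "biomart"))  # external_gene_name
```

Because unknown values fall through untouched, the same call site can accept either a universal alias or a database-native name.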

Choosing a Database

from biodbs.translate import TranslationDatabase, translate_gene_ids

result = translate_gene_ids(
    ["TP53", "BRCA1"],
    from_type="gene_symbol",
    to_type="entrez_id",
    database=TranslationDatabase.NCBI,   # or "ncbi"
)

| Database | String | Best for | Human only? |
|---|---|---|---|
| NCBI (default) | "ncbi" | symbol ↔ Entrez ↔ Ensembl; most stable | No |
| Ensembl REST | "ensembl" | Ensembl ID lookups; more stable than BioMart | No |
| UniProt | "uniprot" | UniProt accession, PDB, RefSeq protein | No |
| BioMart | "biomart" | Widest range of ID types; batch queries | No |
| HGNC | "hgnc" | HGNC IDs, approved symbols, aliases | Yes |
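
One way to narrow the choice is to check which backends list every ID type a query needs. The helper below is a hypothetical sketch, not part of biodbs: `SUPPORTED`, `HUMAN_ONLY`, and `candidate_databases` are illustrative names, and the support sets are transcribed by hand from the per-database tables in this guide.

```python
# Hypothetical helper, not part of biodbs. The support sets are transcribed
# by hand from the per-database tables in this guide.
SUPPORTED = {
    "ncbi": {"gene_symbol", "ensembl_gene_id", "entrez_id", "uniprot_id",
             "refseq_mrna", "refseq_protein"},
    "ensembl": {"gene_symbol", "ensembl_gene_id", "entrez_id", "uniprot_id",
                "refseq_mrna", "refseq_protein"},
    "uniprot": {"gene_symbol", "ensembl_gene_id", "entrez_id", "uniprot_id",
                "refseq_protein", "pdb_id"},
    "biomart": {"gene_symbol", "ensembl_gene_id", "ensembl_transcript_id",
                "ensembl_protein_id", "entrez_id", "hgnc_id", "hgnc_symbol",
                "uniprot_id", "refseq_mrna", "refseq_protein"},
    "hgnc": {"gene_symbol", "hgnc_symbol", "hgnc_id", "entrez_id",
             "ensembl_gene_id", "uniprot_id", "refseq_mrna", "refseq_protein"},
}
HUMAN_ONLY = {"hgnc"}


def candidate_databases(from_type, to_types, species="human"):
    """List backends whose tables cover every requested ID type."""
    needed = {from_type, *to_types}
    return [
        db for db, types in SUPPORTED.items()
        if needed <= types and (species == "human" or db not in HUMAN_ONLY)
    ]


print(candidate_databases("gene_symbol", ["pdb_id"]))            # ['uniprot']
print(candidate_databases("gene_symbol", ["hgnc_id"]))           # ['biomart', 'hgnc']
print(candidate_databases("gene_symbol", ["hgnc_id"], "mouse"))  # ['biomart']
```

The result is only a shortlist; the "Best for" column above still decides between backends that cover the same types.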

Per-Database Details

NCBI (default)

Queries the NCBI Datasets API. Best starting point for most symbol ↔ ID translations.

result = translate_gene_ids(
    ["TP53", "BRCA1"],
    from_type="gene_symbol",   # resolves to "symbol"
    to_type="entrez_id",       # resolves to "gene_id"
    database="ncbi",
)

Supported ID types (universal alias → native):

| Universal alias | Native field |
|---|---|
| gene_symbol | symbol |
| ensembl_gene_id | ensembl_gene_id |
| entrez_id | gene_id |
| uniprot_id | uniprot |
| refseq_mrna / refseq_protein | refseq_accession |

Ensembl REST

Uses the Ensembl /xrefs endpoint. Natural choice when starting from Ensembl IDs.

result = translate_gene_ids(
    ["ENSG00000141510", "ENSG00000012048"],
    from_type="ensembl_gene_id",
    to_type="entrez_id",        # resolves to "EntrezGene"
    database="ensembl",
)

Supported ID types (universal alias → native):

| Universal alias | Native field |
|---|---|
| gene_symbol | HGNC |
| ensembl_gene_id | ensembl_gene_id |
| entrez_id | EntrezGene |
| uniprot_id | Uniprot_gn |
| refseq_mrna | RefSeq_mRNA |
| refseq_protein | RefSeq_peptide |

UniProt

Uses the UniProt ID-mapping API. Best for anything involving UniProt accessions, PDB IDs, or RefSeq protein IDs.

result = translate_gene_ids(
    ["TP53", "BRCA1"],
    from_type="gene_symbol",   # resolves to "Gene_Name"
    to_type="uniprot_id",      # resolves to "UniProtKB_AC-ID"
    database="uniprot",
)

Supported ID types (universal alias → native):

| Universal alias | Native field |
|---|---|
| gene_symbol | Gene_Name |
| ensembl_gene_id | Ensembl |
| entrez_id | GeneID |
| uniprot_id | UniProtKB_AC-ID |
| refseq_protein | RefSeq_Protein |
| pdb_id | PDB |

BioMart

Uses Ensembl BioMart. Supports the widest variety of ID types but is slower and less reliable than the other options for simple symbol translations.

result = translate_gene_ids(
    ["TP53", "BRCA1"],
    from_type="gene_symbol",        # resolves to "external_gene_name"
    to_type="ensembl_transcript_id", # unique to BioMart
    database="biomart",
)

Supported ID types (universal alias → native):

| Universal alias | Native field |
|---|---|
| gene_symbol | external_gene_name |
| ensembl_gene_id | ensembl_gene_id |
| ensembl_transcript_id | ensembl_transcript_id |
| ensembl_protein_id | ensembl_peptide_id |
| entrez_id | entrezgene_id |
| hgnc_id | hgnc_id |
| hgnc_symbol | hgnc_symbol |
| uniprot_id | uniprot_gn_id |
| refseq_mrna | refseq_mrna |
| refseq_protein | refseq_peptide |

HGNC

Uses the HGNC REST API. Authoritative source for approved human gene symbols, HGNC IDs, and aliases. Human genes only.

result = translate_gene_ids(
    ["TP53", "BRCA1"],
    from_type="gene_symbol",   # resolves to "symbol"
    to_type="hgnc_id",         # resolves to "hgnc_id"
    database="hgnc",
)

result = translate_gene_ids(
    ["HGNC:11998", "HGNC:1100"],
    from_type="hgnc_id",
    to_type="ensembl_gene_id",
    database="hgnc",
)

Supported ID types (universal alias → native):

| Universal alias | Native field |
|---|---|
| gene_symbol / hgnc_symbol | symbol |
| hgnc_id | hgnc_id |
| entrez_id | entrez_id |
| ensembl_gene_id | ensembl_gene_id |
| uniprot_id | uniprot_ids |
| refseq_mrna / refseq_protein | refseq_accession |

Multiple Target Types

Pass a list to to_type to retrieve several ID types in one call:

result = translate_gene_ids(
    ["TP53", "BRCA1", "EGFR"],
    from_type="gene_symbol",
    to_type=["ensembl_gene_id", "entrez_id", "hgnc_id"],
    database="biomart",
)
#   gene_symbol    ensembl_gene_id  entrez_id     hgnc_id
# 0        TP53  ENSG00000141510       7157  HGNC:11998
# 1       BRCA1  ENSG00000012048        672   HGNC:1100
# 2        EGFR  ENSG00000146648       1956   HGNC:3236

# As nested dict
result = translate_gene_ids(
    ["TP53", "BRCA1"],
    from_type="gene_symbol",
    to_type=["ensembl_gene_id", "entrez_id"],
    return_dict=True,
)
# {'TP53': {'ensembl_gene_id': 'ENSG00000141510', 'entrez_id': '7157'}, ...}
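
Downstream code often needs just one target column out of the nested dict. The sketch below works on a literal that copies the shape shown in the comment above, so it runs without calling biodbs; `pluck` is a hypothetical helper name, and the BRCA1 entry is assumed to follow the same shape as the TP53 entry.

```python
# Literal copy of the nested-dict shape shown above (values taken from the
# example output in this guide); no biodbs call is made here.
nested = {
    "TP53": {"ensembl_gene_id": "ENSG00000141510", "entrez_id": "7157"},
    "BRCA1": {"ensembl_gene_id": "ENSG00000012048", "entrez_id": "672"},
}


def pluck(nested_mapping, target):
    """Flatten {input_id: {target: value}} to {input_id: value} for one target.

    Inputs that lack the requested target map to None.
    """
    return {source: ids.get(target) for source, ids in nested_mapping.items()}


print(pluck(nested, "entrez_id"))
# {'TP53': '7157', 'BRCA1': '672'}
```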

KEGG Translation

translate_gene_ids_kegg uses KEGG's conv endpoint, which maps between KEGG organism-specific gene IDs and external databases.

from biodbs.translate import translate_gene_ids_kegg

# KEGG IDs → Entrez Gene IDs
result = translate_gene_ids_kegg(
    ["hsa:7157", "hsa:672"],
    from_db="hsa",
    to_db="ncbi-geneid",
)

# KEGG IDs → UniProt accessions
result = translate_gene_ids_kegg(
    ["hsa:7157"],
    from_db="hsa",
    to_db="uniprot",
)

KEGG database codes

| Code | Description |
|---|---|
| hsa | Human genes |
| mmu | Mouse genes |
| rno | Rat genes |
| ncbi-geneid | NCBI Entrez Gene ID |
| ncbi-proteinid | NCBI Protein ID |
| uniprot | UniProt accession |
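
The examples above pass KEGG IDs in `organism:gene` form, such as `hsa:7157`, and KEGG's conv endpoint itself returns prefixed identifiers such as `ncbi-geneid:7157`. The two helpers below are a hypothetical sketch for adding and removing those prefixes; they are not part of biodbs, and whether `translate_gene_ids_kegg` already strips prefixes from its output is not stated here.

```python
def to_kegg_ids(gene_ids, organism="hsa"):
    """Prefix bare gene IDs with a KEGG organism code, e.g. 7157 -> 'hsa:7157'."""
    return [f"{organism}:{gene_id}" for gene_id in gene_ids]


def strip_db_prefix(identifier):
    """Drop a leading 'db:' prefix; values without a prefix are returned as-is."""
    return identifier.split(":", 1)[-1]


print(to_kegg_ids([7157, 672]))             # ['hsa:7157', 'hsa:672']
print(strip_db_prefix("ncbi-geneid:7157"))  # 7157
```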

Species Support

All databases except HGNC support multiple species:

ids = ["Trp53", "Brca1"]   # mouse gene symbols
result = translate_gene_ids(ids, from_type="gene_symbol",
                             to_type="ensembl_gene_id", species="mouse")

Accepted values for species: "human" (default), "mouse", "rat", "zebrafish", "fly", "worm", "yeast" — or a Species enum member.

Return Formats

ids = ["TP53", "BRCA1"]

# DataFrame (default): one row per input ID
df = translate_gene_ids(ids, from_type="gene_symbol", to_type="entrez_id")

# Dict: {input_id: translated_id} for a single target
mapping = translate_gene_ids(ids, from_type="gene_symbol",
                              to_type="entrez_id", return_dict=True)
# {'TP53': '7157', 'BRCA1': '672'}

# Dict: {input_id: {target: value, ...}} for multiple targets
mapping = translate_gene_ids(ids, from_type="gene_symbol",
                              to_type=["entrez_id", "ensembl_gene_id"],
                              return_dict=True)
# {'TP53': {'entrez_id': '7157', 'ensembl_gene_id': 'ENSG00000141510'}, ...}
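
With the single-target dict form, it can be useful to separate inputs that translated from those that did not. This guide does not say how biodbs represents a missing translation, so the sketch below assumes it shows up as `None`, NaN, or an empty string; `split_mapped` is a hypothetical helper and the example dict is made up for illustration.

```python
import math


def split_mapped(mapping):
    """Separate translated IDs from inputs that came back empty.

    Assumption (not confirmed by this guide): a missing translation
    appears as None, NaN, or "".
    """
    mapped, unmapped = {}, []
    for source_id, target in mapping.items():
        is_nan = isinstance(target, float) and math.isnan(target)
        if target is None or target == "" or is_nan:
            unmapped.append(source_id)
        else:
            mapped[source_id] = target
    return mapped, unmapped


# Illustrative input only; "NOT_A_GENE" stands in for an unrecognised symbol.
example = {"TP53": "7157", "BRCA1": "672", "NOT_A_GENE": None}
mapped, unmapped = split_mapped(example)
print(mapped)    # {'TP53': '7157', 'BRCA1': '672'}
print(unmapped)  # ['NOT_A_GENE']
```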

Related Sections

  • HGNC: Direct HGNC API access (symbol search, cross-reference lookup).
  • BioMart: Batch gene annotation queries.
  • Ensembl: REST API for detailed gene lookups.
  • NCBI: NCBI Gene database.
  • UniProt: Protein-centric ID mapping.
  • KEGG: KEGG gene identifiers.
  • Protein ID Translation: Gene ↔ UniProt mapping convenience functions.
  • Over-Representation Analysis: Translate IDs before pathway enrichment.