ClinVar¶
Access clinical variant data via the NCBI ClinVar API.
Overview¶
ClinVar is NCBI's archive of relationships between genomic variation and human health. biodbs provides:
- Gene and condition search — find variants by gene symbol or disease name in one step
- Batch fetch — retrieve full variant summaries by variation UID
- Clinical significance filtering — easily isolate pathogenic variants
- PubMed linking — find publications associated with a variant
- VCV / RCV XML access — retrieve full structured records
Quick Start¶
from biodbs.fetch import clinvar_search_gene, clinvar_search_condition
# Pathogenic variants in BRCA1
data = clinvar_search_gene("BRCA1", clinical_significance="pathogenic")
df = data.as_dataframe()
print(df[["variation_id", "title", "clinical_significance"]])
# Variants linked to Lynch syndrome
data = clinvar_search_condition("Lynch syndrome", retmax=100)
print(len(data))
Searching¶
By Gene Symbol¶
from biodbs.fetch import clinvar_search_gene
# All variants for TP53
data = clinvar_search_gene("TP53", retmax=500)
# Pathogenic only
data = clinvar_search_gene("TP53", clinical_significance="pathogenic")
# Include multi-gene variants
data = clinvar_search_gene("TP53", single_gene=False)
By Condition / Disease¶
from biodbs.fetch import clinvar_search_condition
data = clinvar_search_condition("Breast cancer", retmax=200)
data = clinvar_search_condition("Lynch syndrome",
                                clinical_significance="pathogenic")
Custom Entrez Query¶
from biodbs.fetch import clinvar_search, clinvar_count, clinvar_fetch_by_id
# Count first
n = clinvar_count("BRCA2[gene] AND pathogenic[clnsig]")
print(f"{n} pathogenic BRCA2 variants")
# Then fetch
uids = clinvar_search("BRCA2[gene] AND pathogenic[clnsig]", retmax=50)
data = clinvar_fetch_by_id(uids)
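Entrez queries are plain strings of `term[field]` clauses joined with boolean operators, so they can be assembled programmatically. The following is a minimal sketch that assumes only the `[gene]` and `[clnsig]` field tags used above. `build_query` is a hypothetical helper for illustration and is not part of biodbs.

```python
def build_query(gene=None, significance=None):
    """Join the supplied filters into an Entrez query string with AND.

    Hypothetical helper for illustration; not part of biodbs.
    """
    clauses = []
    if gene:
        clauses.append(f"{gene}[gene]")
    if significance:
        clauses.append(f"{significance}[clnsig]")
    if not clauses:
        raise ValueError("at least one filter is required")
    return " AND ".join(clauses)


query = build_query(gene="BRCA2", significance="pathogenic")
print(query)  # BRCA2[gene] AND pathogenic[clnsig]
```

The resulting string can be passed to `clinvar_count` or `clinvar_search` exactly like a hand-written query.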
Fetching by ID¶
from biodbs.fetch import clinvar_fetch_by_id
# By variation UID
data = clinvar_fetch_by_id([65533, 14206, 31124])
print(data.as_dataframe())
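When a UID list is long, it can help to split it into smaller batches before fetching. The sketch below assumes a batch size of 200, which is an arbitrary choice for illustration and not a documented biodbs limit. `chunked` is a hypothetical helper, and the commented-out loop marks where `clinvar_fetch_by_id` would be called.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


uids = list(range(1, 451))          # placeholder UIDs, not real ClinVar records
batches = list(chunked(uids, 200))
print([len(b) for b in batches])    # [200, 200, 50]

# for batch in batches:
#     data = clinvar_fetch_by_id(batch)
```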
Filtering Results¶
data = clinvar_search_gene("BRCA1", retmax=1000)
# Keep only pathogenic / likely pathogenic variants
path = data.pathogenic()
print(f"{len(path)} pathogenic variants")
# Access individual variants
for variant in path:
    print(variant.variation_id, variant.title, variant.clinical_significance)
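Beyond `pathogenic()`, iterating over the results makes ad hoc summaries straightforward. The sketch below tallies variants by `clinical_significance` using `collections.Counter`. It runs on `SimpleNamespace` stand-ins with placeholder values so that it works without network access; with biodbs you would presumably iterate over the fetched container instead.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-in records with placeholder values; real code would iterate over
# the container returned by clinvar_search_gene().
variants = [
    SimpleNamespace(variation_id="1", clinical_significance="Pathogenic"),
    SimpleNamespace(variation_id="2", clinical_significance="Benign"),
    SimpleNamespace(variation_id="3", clinical_significance="Pathogenic"),
    SimpleNamespace(variation_id="4", clinical_significance=None),
]

# clinical_significance may be None, so substitute a readable label.
counts = Counter(v.clinical_significance or "not provided" for v in variants)
print(counts.most_common())
```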
Working with Results¶
The ClinVarFetchedData container¶
Printing the container shows a brief summary:
data = clinvar_search_gene("BRCA1", clinical_significance="pathogenic", retmax=50)
print(data)
# <ClinVarFetchedData: 50 variants (of 1243 found)>
It supports standard sequence operations:
len(data) # 50
data[0] # first variant
data[-1] # last variant
data[0:5] # slice → list of ClinVarVariant
# Iterate
for variant in data:
    print(variant.accession, variant.clinical_significance)
Lookup by identifier¶
Any of the standard ClinVar identifier formats works as a key:
data["65533"] # by variation UID (must be a str; an int is read as a positional index)
data["VCV000065533"] # by VCV accession (with or without version suffix)
data["VCV000065533.5"] # versioned VCV accession
data["RCV000031124"] # by RCV accession
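One way such a lookup could work is to strip any version suffix and then dispatch on the accession prefix. The sketch below illustrates that idea only and is not biodbs's actual implementation; `classify_key` is a hypothetical function.

```python
import re

# Optional ".N" version suffix at the end of an accession.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def classify_key(key):
    """Return (kind, normalized_key) for a ClinVar identifier string.

    Hypothetical helper: kind is "vcv", "rcv", or "uid".
    """
    if not isinstance(key, str):
        raise TypeError("identifier keys must be str")
    base = _VERSION_SUFFIX.sub("", key.strip())
    upper = base.upper()
    if upper.startswith("VCV"):
        return "vcv", upper
    if upper.startswith("RCV"):
        return "rcv", upper
    if base.isdigit():
        return "uid", base
    raise KeyError(key)


print(classify_key("VCV000065533.5"))  # ('vcv', 'VCV000065533')
print(classify_key("RCV000031124"))    # ('rcv', 'RCV000031124')
print(classify_key("65533"))           # ('uid', '65533')
```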
The ClinVarVariant record¶
Printing a single variant shows its key details.
All available fields:
| Field | Type | Example |
|---|---|---|
| variation_id | str | "65533" |
| accession | str | "VCV000065533" |
| accession_version | str | "VCV000065533.5" |
| title | str | "NM_007294.4(BRCA1):c.5266dupC (p.Gln1756ProfsTer25)" |
| variation_type | str \| None | "single nucleotide variant" |
| clinical_significance | str \| None | "Pathogenic" |
| review_status | str \| None | "reviewed by expert panel" |
| last_evaluated | str \| None | "2018-09-21" |
| gene_symbols | List[str] | ["BRCA1"] |
| gene_ids | List[str] | ["672"] |
| conditions | List[str] | ["Hereditary breast ovarian cancer syndrome"] |
| rcv_accessions | List[str] | ["RCV000031124"] |
| protein_change | str \| None | "Gln1756ProfsTer25" |
| chromosome | str \| None | "17" |
| start | int \| None | 43045703 |
| stop | int \| None | 43045703 |
| assembly | str \| None | "GRCh38" |
Computed properties:
v.is_pathogenic # True if Pathogenic or Likely pathogenic
v.primary_gene # first entry of gene_symbols, or None
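The `is_pathogenic` rule can also be reproduced on plain strings, for example when filtering an exported CSV. The sketch below assumes that the match is case-insensitive and that combined labels such as "Pathogenic/Likely pathogenic" are split on "/". Both are assumptions, and biodbs's own implementation may differ.

```python
PATHOGENIC_LABELS = {"pathogenic", "likely pathogenic"}


def looks_pathogenic(significance):
    """True if any '/'-separated label is Pathogenic or Likely pathogenic."""
    if not significance:
        return False
    labels = (part.strip().lower() for part in significance.split("/"))
    return any(label in PATHOGENIC_LABELS for label in labels)


print(looks_pathogenic("Pathogenic"))              # True
print(looks_pathogenic("Likely pathogenic"))       # True
print(looks_pathogenic("Uncertain significance"))  # False
print(looks_pathogenic(None))                      # False
```

Matching whole labels rather than substrings keeps values such as "Conflicting interpretations of pathogenicity" from being counted as pathogenic.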
Collection-level accessors¶
data.variation_ids() # ["65533", "14206", ...] — all variation UIDs
data.accessions() # ["VCV000065533", ...] — all VCV accessions
data.gene_symbols() # ["BRCA1", "BRCA2", ...] — unique symbols across all variants
data.total_count # total hits from the original search (may exceed len(data))
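Because `total_count` can exceed `len(data)`, the two numbers together show how much of the result set is still unfetched. The arithmetic sketch below uses the figures from the container summary shown earlier (50 fetched of 1243 found). The values are hard-coded in place of `len(data)` and `data.total_count`.

```python
import math

fetched = 50        # stands in for len(data)
total = 1243        # stands in for data.total_count
page_size = 50      # the retmax used for the search

remaining = total - fetched
pages_needed = math.ceil(total / page_size)

print(remaining)     # 1193
print(pages_needed)  # 25
```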
Convert to DataFrame¶
df = data.as_dataframe()
print(df.columns.tolist())
# ['variation_id', 'accession', 'accession_version', 'title', 'variation_type',
# 'clinical_significance', 'review_status', 'last_evaluated', 'gene_symbols',
# 'gene_ids', 'conditions', 'rcv_accessions', 'protein_change',
# 'chromosome', 'start', 'stop', 'assembly']
List-valued fields (gene_symbols, gene_ids, conditions, rcv_accessions) are joined with ";" in the DataFrame for easy CSV export.
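After a CSV round trip, the joined fields come back as plain strings and need splitting to recover the lists. The stdlib-only sketch below assumes the ";" separator described above and treats a missing or empty value as an empty list (the latter is an assumption). `split_list_fields` is a hypothetical helper, and the row values are taken from the field table.

```python
import csv
import io

LIST_FIELDS = ("gene_symbols", "gene_ids", "conditions", "rcv_accessions")


def split_list_fields(row, sep=";"):
    """Return a copy of `row` with the ';'-joined fields split into lists."""
    out = dict(row)
    for field in LIST_FIELDS:
        value = out.get(field) or ""
        out[field] = [item for item in value.split(sep) if item]
    return out


# A tiny in-memory CSV standing in for an exported DataFrame.
text = "variation_id,gene_symbols,rcv_accessions\n65533,BRCA1,RCV000031124\n"
rows = [split_list_fields(r) for r in csv.DictReader(io.StringIO(text))]
print(rows[0]["gene_symbols"])  # ['BRCA1']
print(rows[0]["gene_ids"])      # []
```

Note that a value which itself contains ";" would be split incorrectly by this approach, so it is worth spot-checking the conditions column.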
XML Records¶
from biodbs.fetch import clinvar_fetch_vcv, clinvar_fetch_rcv
# Full VCV XML (variation-centric record)
xml = clinvar_fetch_vcv("VCV000065533")
# Full RCV XML (variation-condition pair)
xml = clinvar_fetch_rcv("RCV000031124")
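Assuming these functions return the XML as text (the return type is not stated here), the standard library's `xml.etree.ElementTree` can parse it. The sketch below inventories the element tags in a toy document. The element names are placeholders and do not reflect ClinVar's actual schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy XML with placeholder element names; NOT the real ClinVar schema.
xml_text = """
<Record id="example">
  <Item kind="a">first</Item>
  <Item kind="b">second</Item>
  <Note>placeholder</Note>
</Record>
"""

root = ET.fromstring(xml_text.strip())

# Count how often each tag occurs, to get a feel for an unfamiliar record.
tag_counts = Counter(element.tag for element in root.iter())
print(tag_counts["Item"])   # 2
print(root.attrib["id"])    # example
```

Running the same tag inventory on a real VCV or RCV record is a quick way to discover which elements are worth extracting.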
PubMed Links¶
from biodbs.fetch import clinvar_link_pubmed
pmids = clinvar_link_pubmed(65533)
print(f"Found {len(pmids)} linked PubMed articles")
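To turn the returned PMIDs into clickable links, you can format them against PubMed's public URL pattern. The sketch below assumes `pmids` is a list of integers or numeric strings. `pubmed_urls` is a hypothetical helper, and the values are placeholders rather than real output of `clinvar_link_pubmed`.

```python
PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


def pubmed_urls(pmids):
    """Map each PMID (int or numeric str) to its PubMed article URL."""
    urls = []
    for pmid in pmids:
        text = str(pmid).strip()
        if not text.isdigit():
            raise ValueError(f"not a PMID: {pmid!r}")
        urls.append(PUBMED_URL.format(pmid=text))
    return urls


# Placeholder PMIDs for illustration only.
print(pubmed_urls([12345, "67890"]))
```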
Using the Fetcher Class¶
from biodbs.fetch.ClinVar import ClinVar_Fetcher
fetcher = ClinVar_Fetcher()
uids = fetcher.search("BRCA1[gene] AND pathogenic[clnsig]")
data = fetcher.fetch_summary(uids[:20])
Related Resources¶
- NCBI — Gene summaries, taxonomy, and genome assemblies from NCBI.
- Disease Ontology — Structured disease terms to complement ClinVar condition names.
- UniProt — Fetch protein-level functional data for genes identified in ClinVar.