# Working with Entrez Databases in Biopython

Biological data is scattered across many specialized databases: papers in PubMed, DNA sequences in Nucleotide, proteins in Protein, and gene records in Gene. NCBI’s Entrez system ties these resources together, and Biopython gives you a clean Python interface for working with them. Once you know the basics, you can automate literature searches, fetch sequence records, and connect related records across databases without leaving your script.

In this tutorial, you will learn how to use `Bio.Entrez` to search databases, inspect summary information, download full records, and follow links between databases.

## What `Bio.Entrez` does

`Bio.Entrez` is Biopython’s interface to NCBI’s Entrez Programming Utilities, often called E-utilities. These are web services, so your Python code sends requests over the internet and gets back structured data such as XML, plain text, or sequence files.

A typical workflow looks like this:

1. Search a database with `esearch`
2. Read the list of matching IDs
3. Fetch summaries with `esummary` or full records with `efetch`
4. Optionally follow links between databases with `elink`

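Before making requests, identify yourself by setting `Entrez.email` to a real address. NCBI uses it to contact you if your requests cause problems, before resorting to blocking access.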

## Your first Entrez request: searching PubMed

Let’s start by searching PubMed for articles about CRISPR and counting how many matches were found.

```python
from Bio import Entrez

# Always identify yourself when using Entrez
Entrez.email = "your_email@example.com"

with Entrez.esearch(db="pubmed", term="CRISPR[Title/Abstract]", retmax=5) as handle:
    record = Entrez.read(handle)

print("Total matches:", record["Count"])
print("Returned IDs:", record["IdList"])
```

```text
Total matches: 59639
Returned IDs: ['41811196', '41811192', '41811082', '41810550', '41810388']
```

This example sends a search request to the PubMed database. The `term` argument uses PubMed’s query syntax, and `retmax=5` limits how many IDs are returned in this request. `Entrez.read()` parses the XML response into Python data structures such as dictionaries and lists.

## Understanding the search result

The object returned by `Entrez.read()` is usually a nested dictionary-like structure. It is helpful to inspect its keys so you know what data is available.

```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

with Entrez.esearch(db="pubmed", term="CRISPR[Title/Abstract]", retmax=3) as handle:
    record = Entrez.read(handle)

print("Keys in the result:")
for key in record:
    print("-", key)

print("\nID list:")
for pmid in record["IdList"]:
    print(pmid)
```

```text
Keys in the result:
- Count
- RetMax
- RetStart
- IdList
- TranslationSet
- QueryTranslation

ID list:
41811196
41811192
41811082
```

This code shows the main fields returned by `esearch`. The most important ones are usually `Count`, which tells you the total number of matches, and `IdList`, which contains the IDs returned in the current request.
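One detail worth checking in your own session: the values in the parsed result, including `Count`, typically come back as strings rather than integers, so convert them before doing arithmetic. The sketch below uses a hand-written dictionary that mimics the result shown above (a stand-in, not a live response) to work out how many matching IDs were not returned.

```python
# Hand-written stand-in for a parsed esearch result (not a live response).
# The values mirror the example output above.
record = {
    "Count": "59639",
    "RetMax": "3",
    "RetStart": "0",
    "IdList": ["41811196", "41811192", "41811082"],
}

total = int(record["Count"])        # total matches on the server
start = int(record["RetStart"])     # offset of the first returned ID
returned = len(record["IdList"])    # IDs included in this response
remaining = total - (start + returned)

print(f"Received {returned} of {total} IDs; {remaining} still on the server")
```

Comparing `Count` with the length of `IdList` is a quick way to notice that `retmax` has cut off the results.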

## Fetching article summaries with `esummary`

A list of PubMed IDs is useful, but usually you want article titles and publication details. That is what `esummary` is for.

```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

with Entrez.esearch(db="pubmed", term="CRISPR[Title/Abstract]", retmax=3) as search_handle:
    search_results = Entrez.read(search_handle)

id_list = search_results["IdList"]
id_string = ",".join(id_list)

with Entrez.esummary(db="pubmed", id=id_string) as summary_handle:
    summaries = Entrez.read(summary_handle)

for article in summaries:
    print("Title:", article.get("Title", "No title available"))
    print("PubDate:", article.get("PubDate", "No date available"))
    print("Source:", article.get("Source", "No source available"))
    print("-" * 60)
```

```text
Title: Glycosomal Phosphoenolpyruvate Carboxykinase CRISPR/Cas9-Deletion and Its Role in Trypanosoma cruzi Metacyclogenesis and Infectivity in Mammalian Host.
PubDate: 2026 Mar 31
Source: FASEB J
------------------------------------------------------------
Title: A genome-wide MAGIC kit for recombinase-independent mosaic analysis in Drosophila.
PubDate: 2026 Mar 11
Source: Elife
------------------------------------------------------------
Title: Characterization of the cell division-associated peptidoglycan amidase AmiA of Chlamydia trachomatis.
PubDate: 2026 Mar 11
Source: J Bacteriol
------------------------------------------------------------
```

This example first searches PubMed, then sends those PubMed IDs to `esummary`. The summary records are much lighter than full article records, so they are often a good first step when exploring results.
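When the ID list grows beyond a handful of entries, one option is to send it to `esummary` in smaller batches. The helper below is a minimal sketch of that idea. `chunked` and the batch size of 2 are illustrative choices, not part of Biopython, and the `Entrez.esummary` call appears only as a comment so the snippet runs offline.

```python
def chunked(ids, size):
    """Yield successive slices of `ids`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


pmids = ["41811196", "41811192", "41811082", "41810550", "41810388"]

id_strings = []
for batch in chunked(pmids, 2):
    id_string = ",".join(batch)
    id_strings.append(id_string)
    # In a real script, each string would be sent with:
    # Entrez.esummary(db="pubmed", id=id_string)
    print(id_string)
```

Each printed line is one comma-separated ID string, in the same form as `id_string` in the example above.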

## Downloading full sequence records with `efetch`

Entrez is not only for literature. You can also retrieve biological sequences from NCBI databases such as `nucleotide` and `protein`.

Here is an example that fetches a GenBank record from the nucleotide database and saves it to a local file.

```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

accession = "NM_000546"  # TP53 transcript example

with Entrez.efetch(
    db="nucleotide",
    id=accession,
    rettype="gb",
    retmode="text"
) as fetch_handle:
    genbank_text = fetch_handle.read()

with open("tp53_record.gb", "w", encoding="utf-8") as output_handle:
    output_handle.write(genbank_text)

print("Saved GenBank record to tp53_record.gb")
```

```text
Saved GenBank record to tp53_record.gb
```

`efetch` downloads the full record. Here, `rettype="gb"` asks for GenBank format, and `retmode="text"` requests the plain-text version of that format. Saving records to files is useful when you want to inspect them later or parse them with other Biopython tools.

## Parsing a fetched GenBank record with `SeqIO`

Once you have a GenBank file, you can use `Bio.SeqIO` to work with the sequence and annotations in Python.

```python
from Bio import Entrez, SeqIO

Entrez.email = "your_email@example.com"

accession = "NM_000546"

with Entrez.efetch(
    db="nucleotide",
    id=accession,
    rettype="gb",
    retmode="text"
) as handle:
    record = SeqIO.read(handle, "genbank")

print("ID:", record.id)
print("Name:", record.name)
print("Description:", record.description)
print("Sequence length:", len(record.seq))
print("Number of features:", len(record.features))
```

```text
ID: NM_000546.6
Name: NM_000546
Description: Homo sapiens tumor protein p53 (TP53), transcript variant 1, mRNA
Sequence length: 2512
Number of features: 70
```

In this version, we do not save the file first. Instead, we pass the network handle directly to `SeqIO.read()`, which parses the GenBank data into a `SeqRecord` object. This is a common Biopython pattern and makes scripts shorter and cleaner.

## Searching the nucleotide database directly

Let’s search NCBI’s nucleotide database for a specific organism and gene name, then fetch the first matching FASTA record.

```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

query = 'BRCA1[Gene] AND Homo sapiens[Organism]'

with Entrez.esearch(db="nucleotide", term=query, retmax=1) as search_handle:
    search_results = Entrez.read(search_handle)

id_list = search_results["IdList"]

if id_list:
    first_id = id_list[0]
    with Entrez.efetch(
        db="nucleotide",
        id=first_id,
        rettype="fasta",
        retmode="text"
    ) as fetch_handle:
        fasta_data = fetch_handle.read()

    print(fasta_data)
else:
    print("No records found.")
```

```text
>PX673138.1 Homo sapiens isolate AFA-BRCA1 breast cancer type 1 susceptibility protein (BRCA1) gene, partial cds
CCTGATGGGTTGTGATTTGGTTTCTTTCAACATGATTTTGAAGTCAGAGGAGATGTGGTCAATGGAAGAA
ACCACCAAGGTCCAAAGCGAGCAAGAGAATCCCAGGACAGAAAGGTAAAGCTCCCTCCCTCAAGTTGACA
AAAATCTCACCCCACCACTCTGTATTCCACTCCCCTTTGCAGAGATGGGCCGCTTCATTTTGTAAGACTT
ATTACATACATACACAGTGCTAGATACTTTCACACAGGTTCTTTTTTCACTCTTCCATCCCAACCACATA
AATAAGTATTGTCTCTACTTTATGAATGATAAAACTAAGAGATTTAGAGAGGCTGTGTAATTTGGGATTC
CC
```

This code combines `esearch` and `efetch` in a simple pipeline. It is a practical pattern when you know what you want to search for but do not yet know the accession number.
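If you build queries like this one from variables, a small helper can keep the field tags consistent. This is only a sketch: `build_term` is a hypothetical function, not part of Biopython, and it assumes every clause has the simple `value[Field]` shape used above.

```python
def build_term(clauses, operator="AND"):
    """Join (field, value) pairs into an Entrez query string."""
    parts = [f"{value}[{field}]" for field, value in clauses]
    return f" {operator} ".join(parts)


query = build_term([("Gene", "BRCA1"), ("Organism", "Homo sapiens")])
print(query)
```

The resulting string can be passed as the `term` argument of `Entrez.esearch`. More complex queries (nested parentheses, ranges, `NOT`) would need a richer helper than this.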

## Linking related records with `elink`

One of the most powerful features of Entrez is that records in different databases are connected. For example, a Gene record may link to related proteins, sequences, or publications.

The next example starts with a Gene ID and finds linked PubMed articles.

```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

gene_id = "7157"  # TP53 gene

with Entrez.elink(dbfrom="gene", db="pubmed", id=gene_id) as handle:
    link_results = Entrez.read(handle)

pubmed_ids = []

for linkset_db in link_results[0].get("LinkSetDb", []):
    for link in linkset_db.get("Link", []):
        pubmed_ids.append(link["Id"])

print("Linked PubMed IDs:")
for pmid in pubmed_ids[:10]:
    print(pmid)
```

```text
Linked PubMed IDs:
19232460
19245591
19245247
27152024
19242059
19239324
19238535
27166782
27167113
27147571
```

`elink` helps you move from one Entrez database to another. In this example, the gene record acts as the starting point, and the script collects PubMed IDs that are related to that gene.
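A single `elink` response can contain more than one `LinkSetDb` entry, each with its own `LinkName`, so the same ID may appear more than once when you collect everything into one list. The sketch below groups IDs by link name and removes duplicates. The nested structure is a hand-written stand-in for a parsed response: the link names are examples of what NCBI may return, and the pairing of IDs with link names is invented for illustration.

```python
# Hand-written stand-in for a parsed elink result (not a live response).
# The pairing of IDs with link names is invented for illustration.
link_results = [
    {
        "DbFrom": "gene",
        "IdList": ["7157"],
        "LinkSetDb": [
            {
                "DbTo": "pubmed",
                "LinkName": "gene_pubmed",
                "Link": [{"Id": "19232460"}, {"Id": "19245591"}],
            },
            {
                "DbTo": "pubmed",
                "LinkName": "gene_pubmed_rif",
                "Link": [{"Id": "19245591"}, {"Id": "27152024"}],
            },
        ],
    }
]

# Group the linked IDs by the kind of link that produced them.
ids_by_link_name = {}
for linkset in link_results:
    for linkset_db in linkset.get("LinkSetDb", []):
        ids = [link["Id"] for link in linkset_db.get("Link", [])]
        ids_by_link_name.setdefault(linkset_db["LinkName"], []).extend(ids)

# dict.fromkeys keeps the first occurrence of each ID and preserves order.
unique_ids = list(dict.fromkeys(
    pmid for ids in ids_by_link_name.values() for pmid in ids
))

for name, ids in ids_by_link_name.items():
    print(name, len(ids))
print("Unique PubMed IDs:", unique_ids)
```

Keeping the link name alongside the IDs also makes it easier to pick one specific relationship later instead of merging them all.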

## Working with larger result sets using WebEnv and QueryKey

If a search returns many results, sending long ID lists around is inefficient. Entrez can store your search history on the NCBI side and let you reuse it in later requests. This is called using the history server.

```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

with Entrez.esearch(
    db="pubmed",
    term="cancer genomics[Title/Abstract]",
    usehistory="y",
    retmax=5
) as search_handle:
    search_results = Entrez.read(search_handle)

print("Count:", search_results["Count"])
print("WebEnv:", search_results["WebEnv"])
print("QueryKey:", search_results["QueryKey"])

with Entrez.efetch(
    db="pubmed",
    rettype="medline",
    retmode="text",
    retstart=0,
    retmax=3,
    webenv=search_results["WebEnv"],
    query_key=search_results["QueryKey"]
) as fetch_handle:
    medline_text = fetch_handle.read()

print(medline_text[:1500])
```

```text
Count: 2408
WebEnv: MCID_69b1c06d29a14bb3b70a8d1d
QueryKey: 1

PMID- 41802450
OWN - NLM
STAT- Publisher
LR  - 20260309
IS  - 2383-5001 (Electronic)
IS  - 2288-8128 (Linking)
DP  - 2026 Mar 9
TI  - Combined hepatocellular-cholangiocarcinoma: a contemporary pathologic and 
      molecular perspective.
LID - 10.17998/jlc.2026.03.06 [doi]
AB  - Combined hepatocellular cholangiocarcinoma (cHCC-CCA) is a rare primary liver 
      carcinoma characterized by the unequivocal coexistence of hepatocytic and 
      cholangiocytic differentiation within a single tumor. Despite its low incidence, 
      cHCC-CCA has received considerable attention because of its marked histologic 
      heterogeneity, diagnostic challenges, and poorer clinical outcomes than 
      conventional hepatocellular carcinoma. Historically, the biological nature of 
      cHCC-CCA has been controversial, with competing hypotheses, including derivation 
      from hepatic progenitor cells, collision of independent tumors, and 
      transdifferentiation between hepatocytic and biliary lineages. Recent advances in 
      genomic and transcriptomic profiling have substantially improved this 
      understanding. Accumulating evidence indicates that most cHCC-CCAs arise from a 
      common clonal origin and subsequently undergo divergent differentiation rather 
      than representing true collision tumors. Transcriptomic analyses further 
      demonstrate that cHCC-CCAs span a biological continuum between hepatocellular- 
      and cholangiocytic-like states, with intermediat
```

When `usehistory="y"` is set, the search results are stored remotely by NCBI. You can then fetch batches of records using `WebEnv` and `QueryKey` instead of manually building huge ID strings. This is especially useful for larger projects.
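To walk through a stored result set in batches, you need a sequence of `retstart` and `retmax` values that covers the full count. The helper below is a minimal offline sketch of that bookkeeping. `batch_offsets` and the batch size of 1000 are illustrative assumptions, and the count of 2408 is taken from the example output above.

```python
def batch_offsets(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for retstart in range(0, total, batch_size):
        # The last batch may be smaller than batch_size.
        yield retstart, min(batch_size, total - retstart)


total = 2408  # e.g. int(search_results["Count"])
offsets = list(batch_offsets(total, 1000))

for retstart, retmax in offsets:
    # In a real script, each pair would be passed to Entrez.efetch together
    # with the webenv and query_key values from the search.
    print(f"retstart={retstart} retmax={retmax}")
```

Each pair replaces the fixed `retstart=0, retmax=3` values used in the example, while `webenv` and `query_key` stay the same for every batch.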

## Getting data in XML and reading it with `Entrez.read`

Some Entrez utilities return XML that Biopython can parse directly into Python objects. This is often easier than reading plain text if you want to extract specific fields.

```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

with Entrez.efetch(db="pubmed", id="31452104", retmode="xml") as handle:
    records = Entrez.read(handle)

article = records["PubmedArticle"][0]
citation = article["MedlineCitation"]
article_info = citation["Article"]

title = article_info["ArticleTitle"]
journal = article_info["Journal"]["Title"]

print("Title:", title)
print("Journal:", journal)
```

```text
Title: Molegro Virtual Docker for Docking.
Journal: Methods in molecular biology (Clifton, N.J.)
```

This example fetches a PubMed record in XML format and navigates through the nested structure to extract the article title and journal name. XML responses can look complicated at first, but they are very useful for reliable programmatic access.
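Not every record contains every field (some PubMed entries have no abstract, for instance), so chained indexing can raise `KeyError` or `IndexError`. One way to cope is a small lookup helper that returns a default instead. The sketch below is an assumption-laden illustration: `dig` is a hypothetical helper, and `records` is a hand-written dictionary that mimics only the part of the parsed structure used above.

```python
def dig(data, *path, default=None):
    """Follow keys and indexes through nested dicts and lists.

    Returns `default` as soon as any step is missing.
    """
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hand-written stand-in for a parsed efetch result (not a live response).
records = {
    "PubmedArticle": [
        {
            "MedlineCitation": {
                "Article": {
                    "ArticleTitle": "Molegro Virtual Docker for Docking.",
                    "Journal": {
                        "Title": "Methods in molecular biology (Clifton, N.J.)"
                    },
                }
            }
        }
    ]
}

article_path = ("PubmedArticle", 0, "MedlineCitation", "Article")
title = dig(records, *article_path, "ArticleTitle")
abstract = dig(
    records, *article_path, "Abstract", "AbstractText",
    default="No abstract available",
)

print("Title:", title)
print("Abstract:", abstract)
```

Because the objects returned by `Entrez.read()` behave like dictionaries and lists, the same helper should work on real parsed records, though that is worth confirming against your Biopython version.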

## Using an API key

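NCBI allows API keys for higher request rates: without a key you are limited to about 3 requests per second, and with a key the default limit rises to 10 per second. If you have one, you can set it once near the top of your script.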

```python
from Bio import Entrez

Entrez.email = "your_email@example.com"
Entrez.api_key = "YOUR_NCBI_API_KEY"

with Entrez.esearch(db="pubmed", term="biopython", retmax=3) as handle:
    record = Entrez.read(handle)

print(record["IdList"])
```

This code works exactly like earlier examples, but it includes an API key. Replace the placeholder string with your real key if you have one.
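To avoid committing a real key to version control, you might read it from the environment instead of hardcoding it. The sketch below shows one way to do that. The variable names `NCBI_EMAIL` and `NCBI_API_KEY` and the `entrez_settings` helper are assumptions made for this example, not names defined by Biopython or NCBI, and the rate values reflect NCBI's documented defaults of 3 requests per second without a key and 10 with one.

```python
import os


def entrez_settings(env=None):
    """Collect Entrez settings from an environment-style mapping."""
    env = os.environ if env is None else env
    api_key = env.get("NCBI_API_KEY")  # assumed variable name
    return {
        "email": env.get("NCBI_EMAIL", "your_email@example.com"),
        "api_key": api_key,
        # NCBI's documented default limits: 3/s without a key, 10/s with one.
        "max_requests_per_second": 10 if api_key else 3,
    }


# A plain dict stands in for os.environ so the example runs anywhere.
settings = entrez_settings({"NCBI_EMAIL": "me@example.org", "NCBI_API_KEY": "abc123"})
print(settings["max_requests_per_second"])

# In a real script you would then set:
# Entrez.email = settings["email"]
# Entrez.api_key = settings["api_key"]
```

Calling `entrez_settings()` with no argument reads from the real environment, so the same script works with or without a key.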

## A complete mini-workflow

The next script puts several ideas together. It searches PubMed, gets summaries for the first few articles, and writes the titles to a text file.

```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

query = "single-cell RNA-seq[Title/Abstract]"

with Entrez.esearch(db="pubmed", term=query, retmax=5) as search_handle:
    search_results = Entrez.read(search_handle)

id_list = search_results["IdList"]

if not id_list:
    print("No articles found.")
else:
    with Entrez.esummary(db="pubmed", id=",".join(id_list)) as summary_handle:
        summaries = Entrez.read(summary_handle)

    with open("pubmed_titles.txt", "w", encoding="utf-8") as output_handle:
        for i, article in enumerate(summaries, start=1):
            title = article.get("Title", "No title available")
            output_handle.write(f"{i}. {title}\n")

    print("Saved article titles to pubmed_titles.txt")
```

```text
Saved article titles to pubmed_titles.txt
```

This is the kind of script you could adapt for a class project or a small analysis pipeline. It shows how Entrez can help you collect structured biological information quickly.

## Common pitfalls

A few issues come up often when people first use `Bio.Entrez`.

### 1. Forgetting to set `Entrez.email`

NCBI expects you to identify yourself. Always set `Entrez.email` before making requests.

### 2. Using the wrong database name

The database name must match an Entrez database, such as `pubmed`, `nucleotide`, `protein`, or `gene`. A search that looks valid can still fail if the database name is wrong.

### 3. Mixing text and XML parsing

If you request `retmode="xml"`, parse the response with `Entrez.read()`. If you request plain text such as FASTA or GenBank, use `handle.read()` or a `SeqIO` parser instead; `Entrez.read()` only understands XML and will fail on text formats.

### 4. Requesting too much data at once

For large result sets, use `retmax`, `retstart`, and the history server features instead of trying to fetch everything in one call.

## When to use each Entrez utility

Here is a good rule of thumb:

* Use `esearch` when you need IDs that match a query
* Use `esummary` when you want compact record summaries
* Use `efetch` when you need full records or sequence files
* Use `elink` when you want to jump between related databases

Once you understand those four tools, you can solve many real bioinformatics data-access problems.

## Conclusion

`Bio.Entrez` is one of the most practical parts of Biopython because it connects Python directly to major NCBI databases. With only a few functions, you can search PubMed, fetch GenBank records, parse XML metadata, and follow links between genes, sequences, proteins, and papers. That makes Entrez a great tool for automating repetitive research tasks and building small bioinformatics workflows.
