Biological data is scattered across many specialized databases: papers in PubMed, DNA sequences in Nucleotide, proteins in Protein, and gene records in Gene. NCBI’s Entrez system ties these resources together, and Biopython gives you a clean Python interface for working with them. Once you know the basics, you can automate literature searches, fetch sequence records, and connect related records across databases without leaving your script.

In this tutorial, you will learn how to use `Bio.Entrez` to search databases, inspect summary information, download full records, and follow links between databases.

## What `Bio.Entrez` does

`Bio.Entrez` is Biopython’s interface to NCBI’s Entrez Programming Utilities, often called E-utilities. These are web services, so your Python code sends requests over the internet and gets back structured data such as XML, plain text, or sequence files.

A typical workflow looks like this:

1. Search a database with `esearch`
2. Read the list of matching IDs
3. Fetch summaries with `esummary` or full records with `efetch`
4. Optionally follow links between databases with `elink`

Before making requests, you should identify yourself with an email address.

## Your first Entrez request: searching PubMed

Let’s start by searching PubMed for articles about CRISPR and counting how many matches were found.
```python
from Bio import Entrez

# Always identify yourself when using Entrez
Entrez.email = "your_email@example.com"

with Entrez.esearch(db="pubmed", term="CRISPR[Title/Abstract]", retmax=5) as handle:
    record = Entrez.read(handle)

print("Total matches:", record["Count"])
print("Returned IDs:", record["IdList"])
```

This example sends a search request to the PubMed database. The `term` argument uses PubMed’s query syntax, and `retmax=5` limits how many IDs are returned in this request. `Entrez.read()` parses the XML response into Python data structures such as dictionaries and lists.

## Understanding the search result

The object returned by `Entrez.read()` is usually a nested dictionary-like structure. It is helpful to inspect its keys so you know what data is available.
```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

with Entrez.esearch(db="pubmed", term="CRISPR[Title/Abstract]", retmax=3) as handle:
    record = Entrez.read(handle)

print("Keys in the result:")
for key in record:
    print("-", key)

print("\nID list:")
for pmid in record["IdList"]:
    print(pmid)
```

This code shows the main fields returned by `esearch`. The most important ones are usually `Count`, which tells you the total number of matches, and `IdList`, which contains the IDs returned in the current request.

## Fetching article summaries with `esummary`

A list of PubMed IDs is useful, but usually you want article titles and publication details. That is what `esummary` is for.
```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

with Entrez.esearch(db="pubmed", term="CRISPR[Title/Abstract]", retmax=3) as search_handle:
    search_results = Entrez.read(search_handle)

id_list = search_results["IdList"]
id_string = ",".join(id_list)

with Entrez.esummary(db="pubmed", id=id_string) as summary_handle:
    summaries = Entrez.read(summary_handle)

for article in summaries:
    print("Title:", article.get("Title", "No title available"))
    print("PubDate:", article.get("PubDate", "No date available"))
    print("Source:", article.get("Source", "No source available"))
    print("-" * 60)
```

This example first searches PubMed, then sends those PubMed IDs to `esummary`. The summary records are much lighter than full article records, so they are often a good first step when exploring results.

## Downloading full sequence records with `efetch`

Entrez is not only for literature. You can also retrieve biological sequences from NCBI databases such as `nucleotide` and `protein`.

Here is an example that fetches a GenBank record from the nucleotide database and saves it to a local file.
```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

accession = "NM_000546"  # TP53 transcript example

with Entrez.efetch(
    db="nucleotide",
    id=accession,
    rettype="gb",
    retmode="text"
) as fetch_handle:
    genbank_text = fetch_handle.read()

with open("tp53_record.gb", "w", encoding="utf-8") as output_handle:
    output_handle.write(genbank_text)

print("Saved GenBank record to tp53_record.gb")
```

`efetch` downloads the full record. Here, `rettype="gb"` asks for GenBank format, and `retmode="text"` requests the plain-text version of that format. Saving records to files is useful when you want to inspect them later or parse them with other Biopython tools.

## Parsing a fetched GenBank record with `SeqIO`

Once you have a GenBank file, you can use `Bio.SeqIO` to work with the sequence and annotations in Python.
```python
from Bio import Entrez, SeqIO

Entrez.email = "your_email@example.com"

accession = "NM_000546"

with Entrez.efetch(
    db="nucleotide",
    id=accession,
    rettype="gb",
    retmode="text"
) as handle:
    record = SeqIO.read(handle, "genbank")

print("ID:", record.id)
print("Name:", record.name)
print("Description:", record.description)
print("Sequence length:", len(record.seq))
print("Number of features:", len(record.features))
```

In this version, we do not save the file first. Instead, we pass the network handle directly to `SeqIO.read()`, which parses the GenBank data into a `SeqRecord` object. This is a common Biopython pattern and makes scripts shorter and cleaner.

## Searching the nucleotide database directly

Let’s search NCBI’s nucleotide database for a specific organism and gene name, then fetch the first matching FASTA record.
```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

query = 'BRCA1[Gene] AND Homo sapiens[Organism]'

with Entrez.esearch(db="nucleotide", term=query, retmax=1) as search_handle:
    search_results = Entrez.read(search_handle)

id_list = search_results["IdList"]

if id_list:
    first_id = id_list[0]
    with Entrez.efetch(
        db="nucleotide",
        id=first_id,
        rettype="fasta",
        retmode="text"
    ) as fetch_handle:
        fasta_data = fetch_handle.read()
    print(fasta_data)
else:
    print("No records found.")
```

This code combines `esearch` and `efetch` in a simple pipeline. It is a practical pattern when you know what you want to search for but do not yet know the accession number.

## Linking related records with `elink`

One of the most powerful features of Entrez is that records in different databases are connected. For example, a Gene record may link to related proteins, sequences, or publications.

The next example starts with a Gene ID and finds linked PubMed articles.
```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

gene_id = "7157"  # TP53 gene

with Entrez.elink(dbfrom="gene", db="pubmed", id=gene_id) as handle:
    link_results = Entrez.read(handle)

pubmed_ids = []
for linkset_db in link_results[0].get("LinkSetDb", []):
    for link in linkset_db.get("Link", []):
        pubmed_ids.append(link["Id"])

print("Linked PubMed IDs:")
for pmid in pubmed_ids[:10]:
    print(pmid)
```

`elink` helps you move from one Entrez database to another. In this example, the gene record acts as the starting point, and the script collects PubMed IDs that are related to that gene.

## Working with larger result sets using WebEnv and QueryKey

If a search returns many results, sending long ID lists around is inefficient. Entrez can store your search history on the NCBI side and let you reuse it in later requests. This is called using the history server.
```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

with Entrez.esearch(
    db="pubmed",
    term="cancer genomics[Title/Abstract]",
    usehistory="y",
    retmax=5
) as search_handle:
    search_results = Entrez.read(search_handle)

print("Count:", search_results["Count"])
print("WebEnv:", search_results["WebEnv"])
print("QueryKey:", search_results["QueryKey"])

with Entrez.efetch(
    db="pubmed",
    rettype="medline",
    retmode="text",
    retstart=0,
    retmax=3,
    webenv=search_results["WebEnv"],
    query_key=search_results["QueryKey"]
) as fetch_handle:
    medline_text = fetch_handle.read()

print(medline_text[:1500])
```

When `usehistory="y"` is set, the search results are stored remotely by NCBI. You can then fetch batches of records using `WebEnv` and `QueryKey` instead of manually building huge ID strings. This is especially useful for larger projects.

## Getting data in XML and reading it with `Entrez.read`

Some Entrez utilities return XML that Biopython can parse directly into Python objects. This is often easier than reading plain text if you want to extract specific fields.
```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

with Entrez.efetch(db="pubmed", id="31452104", retmode="xml") as handle:
    records = Entrez.read(handle)

article = records["PubmedArticle"][0]
citation = article["MedlineCitation"]
article_info = citation["Article"]

title = article_info["ArticleTitle"]
journal = article_info["Journal"]["Title"]

print("Title:", title)
print("Journal:", journal)
```
This example fetches a PubMed record in XML format and navigates through the nested structure to extract the article title and journal name. XML responses can look complicated at first, but they are very useful for reliable programmatic access.
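Not every PubMed record contains every field; some, for instance, have no abstract, and indexing straight into a missing key raises an error. The sketch below shows one way to walk the nested structure defensively. The helper name `dig` and the hand-written `sample_article` dictionary are illustrative assumptions, not part of Biopython; with real data you would pass an element of `records["PubmedArticle"]` instead.

```python
def dig(data, *path, default=None):
    """Follow a sequence of keys or indexes into nested data.

    Returns `default` as soon as a key or index is missing.
    """
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hand-written stand-in that mimics the shape of one parsed PubMed article.
sample_article = {
    "MedlineCitation": {
        "Article": {
            "ArticleTitle": "An example title",
            "Journal": {"Title": "An Example Journal"},
            # Note: this stand-in record has no "Abstract" key.
        }
    }
}

title = dig(sample_article, "MedlineCitation", "Article", "ArticleTitle")
journal = dig(sample_article, "MedlineCitation", "Article", "Journal", "Title")
abstract = dig(
    sample_article,
    "MedlineCitation", "Article", "Abstract", "AbstractText", 0,
    default="No abstract available",
)

print("Title:", title)
print("Journal:", journal)
print("Abstract:", abstract)
```

Because the helper only relies on `[]` lookups, it should work equally well on plain dictionaries and lists and on the dictionary-like and list-like objects that `Entrez.read()` returns.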
## Using an API key
NCBI issues API keys that raise the allowed request rate from 3 to 10 requests per second. If you have one, you can set it once near the top of your script.
```python
from Bio import Entrez

Entrez.email = "your_email@example.com"
Entrez.api_key = "YOUR_NCBI_API_KEY"

with Entrez.esearch(db="pubmed", term="biopython", retmax=3) as handle:
    record = Entrez.read(handle)

print(record["IdList"])
```
This code works exactly like earlier examples, but it includes an API key. Replace the placeholder string with your real key if you have one.
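Hard-coding a key in a script makes it easy to leak through version control. One option is to read it from the environment instead. In this sketch, the function name `configure_entrez` and the variable names `NCBI_EMAIL` and `NCBI_API_KEY` are assumptions chosen for illustration; neither NCBI nor Biopython prescribes them. The function takes the module to configure as an argument, so it can be tried with a stand-in object and no network connection.

```python
import os
from types import SimpleNamespace


def configure_entrez(entrez, environ=None):
    """Set email and, if available, api_key on an Entrez-like module.

    NCBI_EMAIL and NCBI_API_KEY are hypothetical variable names.
    """
    if environ is None:
        environ = os.environ
    email = environ.get("NCBI_EMAIL")
    if not email:
        raise ValueError("Set NCBI_EMAIL before making Entrez requests")
    entrez.email = email
    # The API key is optional; leave the attribute alone when it is not set.
    api_key = environ.get("NCBI_API_KEY")
    if api_key:
        entrez.api_key = api_key
    return entrez


# Demonstration with a stand-in object instead of the real Bio.Entrez module.
fake_entrez = SimpleNamespace(email=None, api_key=None)
configure_entrez(fake_entrez, {"NCBI_EMAIL": "your_email@example.com"})
print(fake_entrez.email, fake_entrez.api_key)
```

In a real script you would write `from Bio import Entrez` and call `configure_entrez(Entrez)` once before the first request.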
## A complete mini-workflow
The next script puts several ideas together. It searches PubMed, gets summaries for the first few articles, and writes the titles to a text file.
```python
from Bio import Entrez

Entrez.email = "your_email@example.com"

query = "single-cell RNA-seq[Title/Abstract]"

with Entrez.esearch(db="pubmed", term=query, retmax=5) as search_handle:
    search_results = Entrez.read(search_handle)

id_list = search_results["IdList"]

if not id_list:
    print("No articles found.")
else:
    with Entrez.esummary(db="pubmed", id=",".join(id_list)) as summary_handle:
        summaries = Entrez.read(summary_handle)

    with open("pubmed_titles.txt", "w", encoding="utf-8") as output_handle:
        for i, article in enumerate(summaries, start=1):
            title = article.get("Title", "No title available")
            output_handle.write(f"{i}. {title}\n")

    print("Saved article titles to pubmed_titles.txt")
```

This is the kind of script you could adapt for a class project or a small analysis pipeline. It shows how Entrez can help you collect structured biological information quickly.

## Common pitfalls

A few issues come up often when people first use `Bio.Entrez`.

### 1. Forgetting to set `Entrez.email`

NCBI expects you to identify yourself. Always set `Entrez.email` before making requests.

### 2. Using the wrong database name

The database name must match an Entrez database, such as `pubmed`, `nucleotide`, `protein`, or `gene`. A search that looks valid can still fail if the database name is wrong.

### 3. Mixing text and XML parsing

If you request `retmode="xml"`, use `Entrez.read()` to parse it. If you request plain text such as FASTA or GenBank text, use `handle.read()` or parse it with `SeqIO`.

### 4. Requesting too much data at once

For large result sets, use `retmax`, `retstart`, and the history server features instead of trying to fetch everything in one call.

## When to use each Entrez utility

Here is a good rule of thumb:

* Use `esearch` when you need IDs that match a query
* Use `esummary` when you want compact record summaries
* Use `efetch` when you need full records or sequence files
* Use `elink` when you want to jump between related databases

Once you understand those four tools, you can solve many real bioinformatics data-access problems.

## Conclusion

`Bio.Entrez` is one of the most practical parts of Biopython because it connects Python directly to major NCBI databases. With only a few functions, you can search PubMed, fetch GenBank records, parse XML metadata, and follow links between genes, sequences, proteins, and papers. That makes Entrez a great tool for automating repetitive research tasks and building small bioinformatics workflows.
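As a closing sketch of the batching advice from the pitfalls section, the helper below computes the `retstart` and `retmax` values needed to page through a large result set. The function name `batch_windows` and the batch sizes are assumptions for illustration; the commented loop shows how the values might plug into an `efetch` call that uses the history server.

```python
def batch_windows(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # The last window may be smaller than batch_size.
        yield start, min(batch_size, total - start)


# Example: 25 matching records fetched 10 at a time.
windows = list(batch_windows(25, 10))
print(windows)

# With a real search, the windows could be used like this
# (search_results comes from an esearch call made with usehistory="y"):
#
# for retstart, retmax in batch_windows(int(search_results["Count"]), 100):
#     with Entrez.efetch(
#         db="pubmed", rettype="medline", retmode="text",
#         retstart=retstart, retmax=retmax,
#         webenv=search_results["WebEnv"],
#         query_key=search_results["QueryKey"],
#     ) as handle:
#         text = handle.read()
```

Keeping the window arithmetic in its own function makes it easy to check without touching the network, and the fetch loop stays short.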