**GenBank files** are one of the most information-rich formats used in bioinformatics. Unlike [FASTA files](/tutorials/biopython-fasta-files), which typically store only sequence data, GenBank files include extensive biological annotations such as:

* gene locations
* coding sequences (CDS)
* regulatory regions
* [protein translations](/tutorials/biopython-translating-dna-to-protein)
* organism metadata
* references and publication information

Because of this, GenBank files are widely used in genome annotation, plasmid analysis, and biological databases.

In this tutorial, you will learn how to use **Biopython** to work with GenBank files. Specifically, you will learn how to:

* download a GenBank file
* read sequence data
* explore metadata and annotations
* inspect genomic features
* extract genes and coding sequences
* convert GenBank files to other formats

The key tool we will use is the `Bio.SeqIO` module.

---

## Downloading an Example GenBank File

First, let's download a GenBank file to work with.

```python
import requests

# Example GenBank file from the Biopython test suite
url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb"

response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# Save the raw bytes locally for the rest of the tutorial
with open("plasmid.gbk", "wb") as f:
    f.write(response.content)

print("Downloaded plasmid.gbk")
```

This code downloads an example GenBank file from the Biopython repository and saves it locally as `plasmid.gbk`. The file contains an annotated plasmid sequence that we will analyze throughout this tutorial.

---

## Reading a GenBank File

GenBank files are read using `SeqIO.parse()` with the `"genbank"` format.

```python
from Bio import SeqIO

# parse() yields one SeqRecord per record in the file
for record in SeqIO.parse("plasmid.gbk", "genbank"):
    print(record.id)
    print("Sequence length:", len(record.seq))
```

`SeqIO.parse()` reads each sequence record in the GenBank file. Each record is stored as a **SeqRecord** object containing the sequence and associated annotations.

Many GenBank files contain multiple sequence records, which is why we use `parse()` rather than `read()`. `SeqIO.read()` is the alternative for files that hold exactly one record, as this example file does; it raises an error if the file contains zero records or more than one.

---

## Accessing Sequence Information

Each GenBank record contains several useful attributes.

```python
from Bio import SeqIO

for record in SeqIO.parse("plasmid.gbk", "genbank"):
    print("ID:", record.id)
    print("Name:", record.name)
    print("Description:", record.description)
    print("Sequence length:", len(record.seq))
    print()
```

Important fields include:

- **`record.id`**: accession identifier, including its version number
- **`record.name`**: short sequence name, taken from the LOCUS line
- **`record.description`**: full description from the file's DEFINITION line
- **`record.seq`**: the DNA sequence

The sequence itself is stored as a **Seq object**.

---

## Exploring GenBank Annotations

GenBank files store rich metadata in the `annotations` dictionary.
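
Because `annotations` is an ordinary Python dictionary, keys that a given file lacks can be handled with `.get()`. The sketch below uses a hand-built record with made-up values rather than the downloaded file, so the identifier and organism name are placeholders.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record; the id and organism are placeholders, not real data
demo = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGA"),
    id="DEMO0001.1",
    annotations={"organism": "Hypothetical organism", "molecule_type": "DNA"},
)

# .get() returns a fallback instead of raising KeyError for a missing key
organism = demo.annotations.get("organism", "unknown")
date = demo.annotations.get("date", "not recorded")

print("Organism:", organism)
print("Date:", date)
```

The next example lists the keys that are actually present in `plasmid.gbk`.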

```python
from Bio import SeqIO

# Take the first record from the file
record = next(SeqIO.parse("plasmid.gbk", "genbank"))

for key in record.annotations:
    print(key)
```

Common annotation keys include:

- `organism`
- `taxonomy`
- `references`
- `source`
- `date`
- `sequence_version`

These provide biological context about the sequence.

---

## Accessing the Organism Name

You can easily extract organism information from the annotations.

```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

print("Organism:", record.annotations["organism"])
print("Taxonomy:", record.annotations["taxonomy"])
```

This information comes directly from the GenBank metadata and is useful when analyzing genomic datasets. The `taxonomy` value is a list of lineage names, ordered from the broadest rank to the most specific.

---

## Working with Sequence Features

One of the most powerful aspects of GenBank files is their **feature annotations**.

Features describe biological regions such as:

- genes
- coding sequences (CDS)
- promoters
- exons
- regulatory elements

You can access these through `record.features`.
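
To make the structure of a feature concrete, the sketch below builds one by hand instead of reading it from a file. The sequence, the coordinates, and the gene name `demoA` are made up for illustration; `SeqFeature` and `FeatureLocation` are the Biopython classes behind the entries in `record.features`.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hypothetical 24-base sequence, used only for illustration
demo = SeqRecord(Seq("ATGGCCATTGTAATGGGCCGCTGA"), id="DEMO0001.1")

# Locations use Python-style coordinates: 0-based start, exclusive end.
# A GenBank file would show this same span as 1..24.
gene = SeqFeature(
    FeatureLocation(0, 24, strand=+1),
    type="gene",
    qualifiers={"gene": ["demoA"]},  # qualifier values are lists
)
demo.features.append(gene)

for feature in demo.features:
    print(feature.type, int(feature.location.start), int(feature.location.end))
    print(feature.qualifiers["gene"])
```

The next example lists the features stored in the downloaded file.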

```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

print("Number of features:", len(record.features))

# Show the type of the first five features
for feature in record.features[:5]:
    print(feature.type)
```

Each feature is a **SeqFeature object** containing the type of feature and its location on the sequence.

---

## Extracting Gene Features

We can filter features to find specific types, such as genes.
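
Before filtering, it can help to see which feature types a record contains at all. The sketch below counts types with `collections.Counter` on a small hand-built record; the three features and their coordinates are hypothetical stand-ins for what a parsed file would provide.

```python
from collections import Counter

from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hypothetical record with three hand-made features
demo = SeqRecord(Seq("ATGGCCATTGTAATGGGCCGCTGA"), id="DEMO0001.1")
demo.features = [
    SeqFeature(FeatureLocation(0, 24), type="source"),
    SeqFeature(FeatureLocation(0, 24, strand=+1), type="gene"),
    SeqFeature(FeatureLocation(0, 24, strand=+1), type="CDS"),
]

# Tally how often each feature type occurs
type_counts = Counter(feature.type for feature in demo.features)

# Keep only the features of one type
genes = [feature for feature in demo.features if feature.type == "gene"]

print(type_counts)
print("Gene features:", len(genes))
```

The same filter, applied to the downloaded plasmid, looks like this: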

```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

for feature in record.features:
    if feature.type == "gene":
        print(feature.location)
        print(feature.qualifiers.get("gene"))
```

Feature objects contain:

- **`feature.type`**: feature category (gene, CDS, etc.)
- **`feature.location`**: coordinates on the sequence
- **`feature.qualifiers`**: additional metadata such as gene name or product

Qualifier values are stored as lists of strings, so `feature.qualifiers.get("gene")` returns a list, or `None` when the feature has no `gene` qualifier.

---

## Extracting Coding Sequences (CDS)

Coding sequences represent protein-coding regions of DNA.
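
As a rough illustration of how a CDS relates to its protein, the sketch below translates a hand-built coding region and compares the result with a `translation` qualifier. Everything here is hypothetical: the sequence, the coordinates, and the qualifier values. It assumes NCBI translation table 11 (the bacterial code), which suits a bacterial plasmid but is not read from the file. The `extract()` call is covered in the next section.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature

# Hypothetical sequence: 3 flanking bases, a 24-base coding region, 3 flanking bases
dna = Seq("CCC" + "ATGGCCATTGTAATGGGCCGCTGA" + "GGG")

cds = SeqFeature(
    FeatureLocation(3, 27, strand=+1),
    type="CDS",
    qualifiers={
        "product": ["hypothetical demo protein"],
        "translation": ["MAIVMGR"],
    },
)

# Pull out the coding DNA, then translate it
coding_dna = cds.extract(dna)
# Table 11 is an assumption for this sketch; to_stop drops the stop codon
protein = coding_dna.translate(table=11, to_stop=True)

print("Protein:", protein)
print("Matches qualifier:", str(protein) == cds.qualifiers["translation"][0])
```

The loop below reads the gene and product qualifiers from the CDS features of the downloaded file.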

```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

for feature in record.features:
    if feature.type == "CDS":
        # Qualifier values are lists; fall back to a placeholder if missing
        gene = feature.qualifiers.get("gene", ["unknown"])[0]
        protein = feature.qualifiers.get("product", ["unknown protein"])[0]

        print("Gene:", gene)
        print("Protein:", protein)
        print()
```

CDS features often contain important qualifiers such as:

- gene name
- protein product
- translation (amino acid sequence)

These are useful for functional genomics analysis.

---

## Extracting the DNA Sequence of a Feature

You can also extract the exact DNA sequence corresponding to a feature.
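
Extraction takes the strand into account. The sketch below uses a made-up 16-base sequence and two hypothetical features covering the same span on opposite strands, to show how the result differs.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature

parent = Seq("AAAACCCCGGGGTTTT")  # hypothetical sequence for illustration

# Same coordinates, opposite strands
forward = SeqFeature(FeatureLocation(2, 10, strand=+1), type="misc_feature")
reverse = SeqFeature(FeatureLocation(2, 10, strand=-1), type="misc_feature")

forward_seq = forward.extract(parent)  # the span as written
reverse_seq = reverse.extract(parent)  # the span, reverse-complemented

print(forward_seq)  # AACCCCGG
print(reverse_seq)  # CCGGGGTT
```

The following example applies `extract()` to the first CDS in the downloaded file.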

```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

for feature in record.features:
    if feature.type == "CDS":
        sequence = feature.extract(record.seq)
        print("CDS:", sequence)
        break  # stop after the first CDS
```

The `extract()` method retrieves the subsequence defined by the feature location. For a feature on the reverse strand, it returns the reverse complement of that span.

---

## Converting GenBank to FASTA

Sometimes you want only the raw DNA sequence.
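
For a quick look at what the FASTA output will contain, a `SeqRecord` can also be rendered as a string with its `format()` method. The sketch below uses a hand-built record with a placeholder identifier and description rather than the downloaded file.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record standing in for one parsed from a GenBank file
demo = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGA"),
    id="DEMO0001.1",
    description="hypothetical demo sequence",
)

# format() returns the record as text in the requested format
fasta_text = demo.format("fasta")
print(fasta_text)
```

The header line combines the record's `id` and `description`; annotations and features are not carried over into FASTA. Writing to a file works as follows: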
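
```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))

# Write the first record only
SeqIO.write([record], "plasmid.fasta", "fasta")
```

This writes the first record of the GenBank file in FASTA format. To convert every record in a multi-record file, pass the `SeqIO.parse()` iterator itself to `SeqIO.write()` instead of a single-item list.

---

## Writing a GenBank File

You can also write modified sequence records back to a GenBank file.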
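
Records parsed from a GenBank file already carry the metadata the writer needs. A record built from scratch does not, and recent Biopython releases refuse to write GenBank output unless the `molecule_type` annotation is set. The sketch below shows this with a hypothetical record, writing to an in-memory buffer instead of a file and parsing the text back as a check.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record; identifiers and description are placeholders
demo = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGA"),
    id="DEMO0001.1",
    name="DEMO0001",
    description="hypothetical demo sequence",
)
# Needed by the GenBank writer in recent Biopython versions
demo.annotations["molecule_type"] = "DNA"

buffer = StringIO()
count = SeqIO.write([demo], buffer, "genbank")  # returns the number of records written

# Parse the text back to confirm the sequence survived the round trip
round_trip = SeqIO.read(StringIO(buffer.getvalue()), "genbank")
print(count, round_trip.seq == demo.seq)
```

For a record read from `plasmid.gbk`, no extra setup is needed: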

```python
from Bio import SeqIO

record = next(SeqIO.parse("plasmid.gbk", "genbank"))
records = [record]

SeqIO.write(records, "copy.gbk", "genbank")
```

This example reads the first record from the GenBank file and writes it to a new file, `copy.gbk`.

---

## Conclusion

GenBank files store much more than sequence data: they include rich biological annotations that describe genes, coding regions, regulatory elements, and organism metadata.

Using **Biopython**, you can easily access and analyze this information in Python.

In this tutorial, you learned how to:

- read GenBank files with `SeqIO`
- explore sequence metadata and annotations
- access genomic features
- extract genes and coding sequences
- retrieve subsequences from features
- convert GenBank files to other formats

These capabilities are essential when working with **genome annotations, plasmid maps, and public biological databases**.