FASTA is one of the most common file formats in bioinformatics. FASTA files store [DNA and RNA sequences](/tutorials/biopython-nucleotide-sequences) or [protein sequences](/tutorials/biopython-amino-acid-sequences) in a simple text format and are widely used in genomic databases, sequence analysis pipelines, and research workflows.

If you're studying biology, bioinformatics, or computational biology, you'll almost certainly encounter FASTA files. Fortunately, the **Biopython** library provides convenient tools for reading, parsing, and writing FASTA data in Python.

In this tutorial, you'll learn how to:

* Read sequences from a FASTA file
* Access sequence IDs and descriptions
* Iterate through multiple sequences
* Calculate sequence statistics
* Write new FASTA files

We'll use the `Bio.SeqIO` module, which is designed for reading and writing biological sequence file formats.

---

## Downloading an Example FASTA File

Before working with FASTA files, let's download a small example file from the web.

```python
import requests

# Download an example FASTA file from the Biopython repository
url = "https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# Save the downloaded text locally
with open("orchids.fasta", "w") as f:
    f.write(response.text)

print("FASTA file downloaded.")
```

This code downloads a publicly available FASTA file from the Biopython repository and saves it locally as `orchids.fasta`. The file contains multiple DNA sequences that we will use in the examples throughout this tutorial.

---

## Reading a FASTA File

The `SeqIO.parse()` function is the most common way to read FASTA files in Biopython.
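
For intuition about what the parser has to do, it helps to see the format itself: each record starts with a header line beginning with `>`, followed by one or more lines of sequence. The plain-Python sketch below illustrates this. The sample text and the `read_fasta` helper are invented for illustration and are not how Biopython implements its parser.

```python
from io import StringIO

# Hypothetical two-record FASTA text, invented for illustration.
fasta_text = """>seq1 first example
ATGCGTAC
GTAGCTAG
>seq2 second example
ATGGGCTA
"""

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence data may be wrapped over several lines
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

for header, sequence in read_fasta(StringIO(fasta_text)):
    seq_id = header.split()[0]  # first word of the header, like record.id
    print(seq_id, len(sequence))
```

`SeqIO.parse()` handles this bookkeeping for you and yields `SeqRecord` objects instead of plain tuples.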

```python
from Bio import SeqIO

# Parse the FASTA file and print each record's ID
for record in SeqIO.parse("orchids.fasta", "fasta"):
    print(record.id)
```

* **`SeqIO.parse()`** reads sequences from a file and yields them one at a time.
* The first argument is the filename.
* The second argument (`"fasta"`) tells Biopython the file format.
* Each sequence is returned as a **SeqRecord** object, which the loop binds to the variable `record`.

A `SeqRecord` contains useful information such as the sequence ID, description, and the sequence itself.

---

## Accessing Sequence Information

Each FASTA entry contains several pieces of information. Let's explore them.

```python
from Bio import SeqIO

# Print the main attributes of every record
for record in SeqIO.parse("orchids.fasta", "fasta"):
    print("ID:", record.id)
    print("Description:", record.description)
    print("Sequence length:", len(record.seq))
    print("First 20 bases:", record.seq[:20])
    print()
```

* **`record.id`**: The sequence identifier (the first word of the header line).
* **`record.description`**: The full FASTA header line, without the leading `>`.
* **`record.seq`**: The biological sequence.
* **`len(record.seq)`**: The length of the sequence.

The sequence itself is stored as a **Seq object**, which behaves much like a Python string but includes additional biological functionality.

---

## Counting the Number of Sequences

Sometimes you just want to know how many sequences are in a FASTA file.
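
Since every FASTA record begins with a header line starting with `>`, a rough count can be obtained without any parser. The sketch below uses a made-up in-memory sample and a hypothetical helper, `count_fasta_headers`. It assumes a well-formed file in which `>` appears at the start of a line only in headers.

```python
from io import StringIO

def count_fasta_headers(handle):
    """Count lines that start with '>' in an open FASTA text handle."""
    return sum(1 for line in handle if line.startswith(">"))

# Made-up sample with three records, for illustration only.
sample = StringIO(">a\nACGT\n>b\nGGCC\nTTAA\n>c\nAT\n")
print(count_fasta_headers(sample))
```

For a real file you would pass an open file object instead of the `StringIO` sample. The `SeqIO.parse()` loop in this section counts the same records while also giving you access to each one.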

```python
from Bio import SeqIO

# Count records one at a time without storing them
count = 0
for record in SeqIO.parse("orchids.fasta", "fasta"):
    count += 1

print("Number of sequences:", count)
```

This code iterates through each record and increments a counter. FASTA files can contain thousands or even millions of sequences, so iterating like this avoids loading everything into memory at once.

---

## Converting FASTA Records to a List

If the FASTA file is small, you may prefer to load all sequences into a list.

```python
from Bio import SeqIO

# Load every record into memory as a list
records = list(SeqIO.parse("orchids.fasta", "fasta"))

print("Total sequences:", len(records))
print("First sequence ID:", records[0].id)
print("Sequence length:", len(records[0].seq))
```

* `list()` converts the iterator returned by `SeqIO.parse()` into a list.
* This allows random access, such as `records[0]` or `records[5]`.

Be careful with very large FASTA files, as loading everything into memory can consume a lot of RAM.

---

## Calculating GC Content

A common task in sequence analysis is calculating the **[GC content](/tutorials/biopython-ambiguous-nucleotides)**, which is the percentage of nucleotides that are G or C.
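
To make that definition concrete before applying it to the file, here is a minimal plain-Python sketch on hard-coded strings. The helper name `gc_percent` and the choice to return 0 for an empty sequence are assumptions made for illustration, not part of Biopython. Ambiguity codes such as `S` (G or C) are deliberately not counted.

```python
def gc_percent(sequence: str) -> float:
    """Return the percentage of G and C characters in a nucleotide string."""
    sequence = sequence.upper()
    if not sequence:
        # Assumption: report 0 for an empty sequence rather than dividing by zero
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence) * 100

print(gc_percent("ATGC"))              # 2 of 4 bases are G or C
print(round(gc_percent("atgcgc"), 2))  # lowercase input is handled too
```

The Biopython loop that follows applies the same arithmetic to each record in `orchids.fasta`.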

```python
from Bio import SeqIO

# Calculate GC content for each sequence
for record in SeqIO.parse("orchids.fasta", "fasta"):
    seq = record.seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    gc_content = (g + c) / len(seq) * 100
    print(record.id, "GC%:", round(gc_content, 2))
```

This code:

1. Reads each sequence
2. Counts the number of **G** and **C** bases
3. Calculates the percentage of [GC content](/tutorials/biopython-ambiguous-nucleotides)

[GC content](/tutorials/biopython-ambiguous-nucleotides) is important in many areas of genomics because it can influence gene expression, sequencing behavior, and genome stability.

---

## Writing a New FASTA File

Biopython can also write FASTA files using `SeqIO.write()`.

```python
from Bio import SeqIO

# Filter sequences longer than 600 bases
records = []
for record in SeqIO.parse("orchids.fasta", "fasta"):
    if len(record.seq) > 600:
        records.append(record)

# Write filtered sequences to a new FASTA file
SeqIO.write(records, "long_sequences.fasta", "fasta")
```

* `SeqIO.write()` writes sequence records to a file.
* The first argument is a list (or any other iterable) of records.
* The second argument is the output filename.
* The third argument is the file format.

This example filters the sequences to keep only those longer than 600 bases and writes them into a new FASTA file.

---

## Creating FASTA Records Manually

You can also create new FASTA sequences programmatically.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

# Build two records by hand
record1 = SeqRecord(
    Seq("ATGCGTACGTAGCTAGCTAG"),
    id="Example1",
    description="Example DNA sequence"
)
record2 = SeqRecord(
    Seq("ATGGGCTAGCTAGGCTA"),
    id="Example2",
    description="Another DNA sequence"
)

# Write both records to a new FASTA file
records = [record1, record2]
SeqIO.write(records, "example_sequences.fasta", "fasta")
```

* **`Seq`** represents the biological sequence.
* **`SeqRecord`** stores sequence metadata like ID and description.
* The records are written to a FASTA file using `SeqIO.write()`.

This is useful when generating sequences from simulations, analyses, or custom pipelines.

---

## Conclusion

FASTA files are fundamental to bioinformatics, and **Biopython** makes them easy to work with in Python. Using the `SeqIO` module, you can efficiently read, analyze, and write sequence data.

In this tutorial, you learned how to:

* Parse FASTA files with `SeqIO.parse()`
* Access sequence IDs, descriptions, and sequences
* Count and analyze sequences
* Calculate [GC content](/tutorials/biopython-ambiguous-nucleotides)
* Write new FASTA files
* Create FASTA records programmatically

These skills form the foundation for many real-world bioinformatics workflows, including genome analysis, sequence filtering, and building data processing pipelines.
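
As a closing sketch that ties these steps together, hand-built records can be written to an in-memory buffer and parsed straight back, which avoids touching the disk. It assumes Biopython is installed. The variable names and the use of `io.StringIO` in place of a filename are choices made for this illustration.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Build two records by hand (same data as the manual example in this tutorial)
records = [
    SeqRecord(Seq("ATGCGTACGTAGCTAGCTAG"), id="Example1", description="Example DNA sequence"),
    SeqRecord(Seq("ATGGGCTAGCTAGGCTA"), id="Example2", description="Another DNA sequence"),
]

# Write to an in-memory text buffer instead of a file on disk
buffer = StringIO()
written = SeqIO.write(records, buffer, "fasta")
print("Records written:", written)

# Rewind the buffer and parse the FASTA text back into SeqRecord objects
buffer.seek(0)
parsed = list(SeqIO.parse(buffer, "fasta"))
for record in parsed:
    print(record.id, len(record.seq), record.description)
```

`SeqIO.write()` returns the number of records it wrote, which makes a convenient sanity check. After the round trip, `description` holds the full header line, ID included, so the first record's description reads `Example1 Example DNA sequence`.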