Biopython is an open-source collection of non-commercial Python modules for computational biology and bioinformatics. It makes robust, well-tested code easily accessible to researchers. Python is an object-oriented programming language well suited to automating common tasks, and the availability of reusable libraries saves development time and lets researchers focus on scientific questions. Biopython is continually updated and maintained by a large team of volunteers across the globe.[1]
Biopython | |
---|---|
Original author(s) | Chapman B, Chang J[1] |
Initial release | December 17, 2002 |
Stable release | v1.85 |
Repository | https://biopython.org/wiki/SourceCode[2] |
Written in | Python, C |
Platform | Cross platform |
Type | Bioinformatics |
License | Biopython License |
Website | biopython |
Biopython contains parsers for diverse bioinformatic sequence, alignment, and structure formats. Sequence formats include FASTA, FASTQ, GenBank, and EMBL. Alignment formats include Clustal, BLAST, PHYLIP, and NEXUS. Structural formats include PDB, which contains the 3D atomic coordinates of macromolecules. It also provides access to biological databases such as NCBI, Expasy, PDB, and BioSQL. These tools can be used in scripts or incorporated into users' own software.[3] Biopython contains a standard sequence class, sequence alignment, and motif analysis tools. It also has clustering algorithms, a module for structural biology, and a module for phylogenetic analysis.[4]
History
The development of Biopython began in 1999, and it was first released in July 2000.[5] The first “semi-complete” release followed in March 2001 and the first “semi-stable” release in December 2002. It was developed during a similar time frame and with analogous goals to other projects that added bioinformatics capabilities to their respective programming languages, including BioPerl, BioRuby and BioJava. Early developers on the project included Jeff Chang, Andrew Dalke and Brad Chapman, though over 100 people have made contributions to date.[6] In 2007, a similar Python project, PyCogent, was established.[7]
The initial scope of Biopython involved accessing, indexing and processing biological sequence files. Data retrieved from common biological databases is parsed into a Python data structure. While this is still a major focus, modules added over the following years have extended its functionality to cover additional areas of biology. A key challenge in designing parsers for bioinformatics file formats is how frequently the formats change, owing to inadequate curation of the structure of the data and to changes in database contents. Biopython addresses this problem with a standard event-oriented parser design (see Key features and examples).[1]
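The idea behind an event-oriented parser can be sketched in plain Python. The example below is only an illustration of the general pattern, not Biopython's actual code: the function names `scan_fasta` and `consume` and the event names `"title"` and `"sequence"` are hypothetical. A scanner that knows the file layout emits events, and a separate consumer builds records from those events, so a change in the file format should only require changes to the scanner.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# An event is a (kind, value) pair; the kinds used here are made up for this sketch.
Event = Tuple[str, str]


def scan_fasta(lines: Iterable[str]) -> Iterator[Event]:
    """Scanner: knows the file layout and turns each line into an event."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        if line.startswith(">"):
            yield ("title", line[1:])
        else:
            yield ("sequence", line)


def consume(events: Iterable[Event]) -> List[Dict[str, str]]:
    """Consumer: builds records from events without knowing the file layout."""
    records: List[Dict[str, str]] = []
    for kind, value in events:
        if kind == "title":
            records.append({"title": value, "sequence": ""})
        elif kind == "sequence" and records:
            records[-1]["sequence"] += value
    return records


lines = [">seq1 example", "AGGC", "TTCT", ">seq2", "CGTA"]
records = consume(scan_fasta(lines))
print(records)
```

Because the consumer only sees events, the same consumer could in principle be reused with a scanner written for a different file layout.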
As of version 1.77, Biopython no longer supports Python 2.[8] The current stable release, version 1.85, was released on 15 January 2025. It supports only Python 3, and recent releases require NumPy (rather than the older Numeric package).[9]
Design
Wherever possible, Biopython follows the conventions used by the Python programming language to make it easier for users familiar with Python. For example, Seq and SeqRecord objects can be manipulated via slicing, in a manner similar to Python's strings and lists. It is also designed to be functionally similar to other Bio* projects, such as BioPerl.[5] It is organized into modular sub-packages, e.g., Bio.Seq, Bio.Align, Bio.PDB and Bio.Entrez, each of them useful in a different bioinformatics domain. It uses object-oriented principles such as encapsulation and polymorphism, notably in the classes Seq, SeqRecord and Bio.PDB.Structure. It can also interoperate with other Python tools (Pandas, Matplotlib and SciPy).[3]
Biopython can read and write most common file formats for each of its functional areas, and its license is permissive and compatible with most other software licenses, which allows Biopython to be used in a variety of software projects.[10]
Requirements
Biopython is currently supported and tested with the following software:[11]
- Python 3 or PyPy3
- NumPy
Key features and examples
Input and output
Biopython can read and write a number of common formats. When reading files, descriptive information in the file is used to populate the members of Biopython classes, such as SeqRecord. This allows records of one file format to be converted into others.
Very large sequence files can exceed a computer's memory resources, so Biopython provides various options for accessing records in large files. They can be loaded entirely into memory in Python data structures, such as lists or dictionaries, providing fast access at the cost of memory usage. Alternatively, the files can be read from disk as needed, with slower performance but lower memory requirements.
>>> # Read a GenBank file record by record, so that large sequence files
>>> # do not exhaust memory, and write each record to a new FASTA file.
>>> from Bio import SeqIO
>>> input_file = "sequence_1.gb"
>>> output_file = "converted_sequences.fasta"
>>> # SeqIO.parse returns an iterator, so the whole file is never loaded at once
>>> with open(output_file, "w") as out_handle:
...     for record in SeqIO.parse(input_file, "genbank"):
...         # Each record is a SeqRecord populated with metadata
...         print(f"Processing record: {record.id} - {record.description}")
...         SeqIO.write(record, out_handle, "fasta")  # Convert and write to FASTA
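The trade-off between loading records into memory and reading them from disk on demand can be sketched with `SeqIO.to_dict` and `SeqIO.index`. This is a minimal illustration rather than a recommended workflow: the two-record FASTA content and the file name `example.fasta` are invented for the example, and the file is written to a temporary directory so the sketch is self-contained.

```python
import os
import tempfile

from Bio import SeqIO

# Made-up FASTA content used only for this illustration
fasta_text = ">seq1\nAGGCTTCTCGTA\n>seq2\nTTGACA\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    fasta_path = os.path.join(tmp_dir, "example.fasta")
    with open(fasta_path, "w") as handle:
        handle.write(fasta_text)

    # Option 1: load every record into a dictionary held in memory
    in_memory = SeqIO.to_dict(SeqIO.parse(fasta_path, "fasta"))
    memory_seq = str(in_memory["seq2"].seq)

    # Option 2: index the file; each record is parsed from disk when accessed
    on_disk = SeqIO.index(fasta_path, "fasta")
    disk_seq = str(on_disk["seq1"].seq)
    record_count = len(on_disk)
    on_disk.close()  # Release the underlying file handle

print(memory_seq, disk_seq, record_count)
```

Both objects are looked up by record identifier, so code written against one can usually be switched to the other when file size becomes a concern.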
Sequences
A core concept in Biopython is the biological sequence, and this is represented by the Seq class.[12] A Biopython Seq object is similar to a Python string in many respects: it supports the Python slice notation, can be concatenated with other sequences and is immutable. This object includes both general string-like and biological sequence-specific methods. It is best to store information about the biological type (DNA, RNA, protein) separately from the sequence, rather than using an explicit alphabet argument.
>>> # Create a DNA sequence and perform some typical manipulations
>>> from Bio.Seq import Seq
>>> dna_sequence = Seq("AGGCTTCTCGTA")
>>> print(dna_sequence)
AGGCTTCTCGTA
>>> print(dna_sequence[2:7])
GCTTC
>>> print(dna_sequence.reverse_complement())
TACGAGAAGCCT
>>> rna_sequence = dna_sequence.transcribe()
>>> print(rna_sequence)
AGGCUUCUCGUA
>>> print(rna_sequence.translate())
RLLV
Sequence annotation
The SeqRecord class describes sequences, along with information such as name, description and features in the form of SeqFeature objects. Each SeqFeature object specifies the type of the feature and its location. Feature types can be ‘gene’, ‘CDS’ (coding sequence), ‘repeat_region’, ‘mobile_element’ or others, and the position of features in the sequence can be exact or approximate.
>>> # Read a GenBank file, print the sequence's name and description, then
>>> # display the details of one annotated feature (here a CDS) in the record.
>>> from Bio import SeqIO
>>> seq_record = SeqIO.read("sequence.gb", "genbank")
>>> # Access metadata
>>> print(seq_record.name)
NC_005816
>>> print(seq_record.description)
Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence
>>> # Print the feature at index 14, if the record has that many features
>>> if len(seq_record.features) > 14:
...     print(seq_record.features[14])
... else:
...     print("Feature index 14 not available")
type: CDS
location: [6115:6421](+)
qualifiers:
Key: codon_start, Value: ['1']
Key: inference, Value: ['COORDINATES: similar to AA sequence:RefSeq:WP_002221218.1']
Key: locus_tag, Value: ['YP_RS22235']
Key: note, Value: ['Derived by automated computational analysis using gene prediction method: Protein Homology.']
Key: old_locus_tag, Value: ['pPCP07', 'YP_pPCP07']
Key: product, Value: ['hypothetical protein']
Key: protein_id, Value: ['WP_002221218.1']
Key: transl_table, Value: ['11']
Key: translation, Value: ['MSKTKSGRHRLSKTDKRLLAALVVAGYEERTARDLIQKHVYTLTQADLRHLVSEISNGVGQSQAYDAIYQAR
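Features can also be created in code rather than read from a file. The following sketch attaches three features to a toy record and extracts the subsequence each one covers. The sequence, identifiers and coordinates are invented for illustration; `BeforePosition` is used to show one way an approximate start can be expressed.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import BeforePosition, FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Toy record; the ID and description are placeholders
record = SeqRecord(Seq("AGGCTTCTCGTA"), id="example", description="toy record")

# Exact locations on the forward and reverse strands
forward = SeqFeature(FeatureLocation(2, 7, strand=1), type="gene")
reverse = SeqFeature(FeatureLocation(2, 7, strand=-1), type="CDS")

# Approximate location: the feature starts somewhere before position 2
fuzzy = SeqFeature(FeatureLocation(BeforePosition(2), 7), type="repeat_region")

record.features.extend([forward, reverse, fuzzy])

# extract() returns the part of the parent sequence covered by the feature,
# reverse-complemented when the feature lies on the reverse strand
forward_seq = str(forward.extract(record.seq))
reverse_seq = str(reverse.extract(record.seq))

print(forward_seq)
print(reverse_seq)
print(fuzzy.location)
```

Keeping the location on the feature, rather than slicing the sequence by hand, means strand handling is done once inside `extract`.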
Accessing online databases
Through the Bio.Entrez module, users of Biopython can download biological data from NCBI databases. Each of the functions provided by the Entrez search engine is available through functions in this module, including searching for and downloading records.
>>> # Fetch nucleotide sequence records from NCBI for the given accession IDs,
>>> # read the GenBank-formatted data and print the first two lines of each.
>>> # The output can also be written to a file.
>>> from Bio import Entrez
>>> Entrez.email = "email@gmail.com"
>>> record_ids = ["NM_000546.6", "NM_001354689.3"]
>>> for record_id in record_ids:
...     with Entrez.efetch(db="nucleotide", id=record_id, rettype="gb", retmode="text") as handle:
...         line_count = 0
...         for line in handle:
...             print(line.rstrip())
...             line_count += 1
...             if line_count == 2:  # Print only the first 2 lines of the record
...                 break
LOCUS NM_000546 2512 bp mRNA linear PRI 12-JUN-2025
DEFINITION Homo sapiens tumor protein p53 (TP53), transcript variant 1, mRNA.
LOCUS NM_001354689 3251 bp mRNA linear PRI 12-JUN-2025
DEFINITION Homo sapiens Raf-1 proto-oncogene, serine/threonine kinase (RAF1)
Phylogeny
The Bio.Phylo module provides tools for working with and visualising phylogenetic trees. A variety of file formats are supported for reading and writing, including Newick, NEXUS and phyloXML. Common tree manipulations and traversals are supported via the Tree and Clade objects. Examples include converting and collating tree files, extracting subsets from a tree, changing a tree's root, and analysing branch features such as length or score.[14]
Rooted trees can be drawn in ASCII or using matplotlib (see Figure 1), and the Graphviz library can be used to create unrooted layouts (see Figure 2).
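A short sketch of reading and inspecting a tree with Bio.Phylo is given below. The Newick string, its taxon names and its branch lengths are invented for illustration, and the tree is read from an in-memory string rather than a file so that the example is self-contained.

```python
from io import StringIO

from Bio import Phylo

# Made-up four-taxon tree with branch lengths, in Newick format
newick = "((A:1,B:2):1,(C:3,D:4):2);"
tree = Phylo.read(StringIO(newick), "newick")

# Traverse the tree: collect the names of the terminal (leaf) clades
leaf_names = [leaf.name for leaf in tree.get_terminals()]

# Analyse branch lengths: path length between two leaves, and the overall total
a_to_c = tree.distance("A", "C")
total_length = tree.total_branch_length()

print(leaf_names)
print(a_to_c, total_length)

# Draw the tree as ASCII art on standard output
Phylo.draw_ascii(tree)
```

The same `tree` object could be written back out in another supported format with `Phylo.write`, which is how file conversion is typically done.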
Genome diagrams
The GenomeDiagram module provides methods of visualising sequences within Biopython.[16] Sequences can be drawn in a linear or circular form (see Figure 3), and many output formats are supported, including PDF and PNG. Diagrams are created by making tracks and then adding sequence features to those tracks. By looping over a sequence's features and using their attributes to decide if and how they are added to the diagram's tracks, one can exercise much control over the appearance of the final diagram. Cross-links can be drawn between different tracks, allowing one to compare multiple sequences in a single diagram.
Macromolecular structure
The Bio.PDB module can load molecular structures from PDB and mmCIF files, and was added to Biopython in 2003.[17] The Structure object is central to this module, and it organises macromolecular structure in a hierarchical fashion: Structure objects contain Model objects which contain Chain objects which contain Residue objects which contain Atom objects. Disordered residues and atoms get their own classes, DisorderedResidue and DisorderedAtom, that describe their uncertain positions.
Using Bio.PDB, one can navigate through individual components of a macromolecular structure file, such as examining each atom in a protein. Common analyses can be carried out, such as measuring distances or angles, comparing residues and calculating residue depth.
>>> # Parse a PDB file, print the chain IDs of the first model and the coordinates
>>> # of the atoms in residue number 100 of each chain. This demonstrates navigating
>>> # the structure hierarchy and accessing data for a specific residue.
>>> from Bio.PDB import PDBParser
>>> # Parse the PDB file
>>> parser = PDBParser(QUIET=True)
>>> structure = parser.get_structure("2yox", "2yox.pdb")
>>> # Iterate over models, stopping after the first one
>>> for model in structure:
...     print(f"Model ID: {model.id}")
...     # Iterate over chains in the model
...     for chain in model:
...         print(f"  Chain ID: {chain.id}")
...         # Check if residue 100 exists in this chain
...         if 100 in chain:
...             residue = chain[100]
...             print("    Coordinates of atoms in residue 100:")
...             # Print coordinates of each atom in residue 100
...             for atom in residue:
...                 print(atom.coord)
...         else:
...             print("    Residue 100 not found in this chain.")
...     break  # Only the first model is examined
Model ID: 0
Chain ID: A
Coordinates of atoms in residue 100:
[ 9.837 18.218 81.24 ]
[ 9.644 18.809 79.938]
[ 8.772 20.066 80.01 ]
[ 7.572 19.996 80.27 ]
[ 9.07 17.788 78.962]
[ 8.989 18.261 77.529]
[10.352 18.647 76.938]
[11.281 17.832 76.922]
[10.486 19.917 76.503]
Chain ID: B
Coordinates of atoms in residue 100:
[23.712 13.531 36.955]
[23.197 12.95 35.746]
[23.961 11.693 35.339]
[25.138 11.757 34.935]
[23.183 13.97 34.623]
[22.49 13.49 33.361]
[21.022 13.13 33.571]
[20.22 13.96 34.039]
[20.66 11.867 33.253]
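The hierarchy and a simple distance measurement can also be illustrated without a PDB file by building a toy structure in code. Everything in this sketch is invented for illustration: the structure ID `toy`, the single glycine residue, and the coordinates, which were chosen to give a round number rather than a realistic bond length. The `make_atom` helper is hypothetical and not part of Biopython.

```python
import numpy as np
from Bio.PDB.Atom import Atom
from Bio.PDB.Chain import Chain
from Bio.PDB.Model import Model
from Bio.PDB.Residue import Residue
from Bio.PDB.Structure import Structure


def make_atom(name, xyz, serial, element):
    """Hypothetical helper: create an Atom with placeholder B-factor and occupancy."""
    coord = np.array(xyz, dtype="f")
    return Atom(name, coord, 0.0, 1.0, " ", f" {name:<3}", serial, element=element)


# Build the hierarchy from the bottom up: atoms, residue, chain, model, structure
residue = Residue((" ", 1, " "), "GLY", " ")
residue.add(make_atom("N", (0.0, 0.0, 0.0), 1, "N"))
residue.add(make_atom("CA", (3.0, 4.0, 0.0), 2, "C"))
chain = Chain("A")
chain.add(residue)
model = Model(0)
model.add(chain)
structure = Structure("toy")
structure.add(model)

# Navigate back down by model ID, chain ID, residue number and atom name
n_atom = structure[0]["A"][1]["N"]
ca_atom = structure[0]["A"][1]["CA"]

# Subtracting two Atom objects gives the distance between them
distance = float(ca_atom - n_atom)
atom_count = len(list(structure.get_atoms()))

print(f"N-CA distance: {distance:.1f}")
print(f"Atoms in structure: {atom_count}")
```

With a parsed file, the same indexing and subtraction would apply to the `structure` returned by `PDBParser.get_structure`.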
Population genetics
The Bio.PopGen module adds support to Biopython for Genepop, a software package for statistical analysis of population genetics.[18] This allows for analyses of Hardy–Weinberg equilibrium, linkage disequilibrium and other features of a population's allele frequencies.
This module can also carry out population genetic simulations using coalescent theory with the fastsimcoal2 program.[19]
Wrappers for command line tools
Biopython previously included command-line wrappers for tools such as BLAST, Clustal, EMBOSS, and SAMtools. These wrappers allowed users to run external tools from within their code using specialized Biopython classes.
However, the Bio.Application modules and their wrappers have been deprecated and will be removed in future Biopython releases. The main reason is the high maintenance burden of keeping them up to date with the evolving external tools.
The recommended approach is to construct and execute command-line tool commands directly using Python’s built-in subprocess module. This method provides flexibility and removes the dependency on the Biopython wrappers. subprocess is a native Python module for running external commands and programs and capturing their output.[20]
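A minimal sketch of the subprocess approach is shown below. The helper name `run_tool` is hypothetical, and the command is a stand-in that calls the Python interpreter itself so the example runs without any bioinformatics tool installed. In practice the list would hold the external tool's executable and its arguments.

```python
import subprocess
import sys
from typing import List


def run_tool(command: List[str]) -> str:
    """Run an external command and return its standard output as text."""
    completed = subprocess.run(
        command,
        capture_output=True,  # Collect stdout and stderr instead of printing them
        text=True,            # Decode the output to str
        check=True,           # Raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout


# Stand-in command: replace with the real tool and its arguments
stand_in = [sys.executable, "-c", "print('alignment finished')"]
output = run_tool(stand_in)
print(output.strip())
```

Passing the command as a list rather than a single shell string avoids quoting problems with file names, and `check=True` turns a failed run into an exception instead of silently returning empty output.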
References
- Chapman, Brad; Chang, Jeff (August 2000). "Biopython: Python tools for computational biology". ACM SIGBIO Newsletter. 20 (2): 15–19. doi:10.1145/360262.360268. S2CID 9417766.
- Reference unavailable: the Wikidata citation could not be displayed.
- Cock, Peter J. A.; Antao, Tiago; Chang, Jeffrey T.; Chapman, Brad A.; Cox, Cymon J.; Dalke, Andrew; Friedberg, Iddo; Hamelryck, Thomas; Kauff, Frank; Wilczynski, Bartek; de Hoon, Michiel J. L. (2009-03-20). "Biopython: freely available Python tools for computational molecular biology and bioinformatics". Bioinformatics. 25 (11): 1422–1423. doi:10.1093/bioinformatics/btp163. hdl:10400.1/5523. ISSN 1367-4811.
- "Introduction — Biopython 1.85 documentation". biopython.org. Retrieved 2025-08-15.
- Chapman, Brad (11 March 2004), The Biopython Project: Philosophy, functionality and facts (PDF), retrieved 11 September 2014
- List of Biopython contributors, archived from the original on 11 September 2014, retrieved 11 September 2014
- Knight, R; Maxwell, P; Birmingham, A; Carnes, J; Caporaso, J. G.; Easton, B. C.; Eaton, M; Hamady, M; Lindsay, H; Liu, Z; Lozupone, C; McDonald, D; Robeson, M; Sammut, R; Smit, S; Wakefield, M. J.; Widmann, J; Wikman, S; Wilson, S; Ying, H; Huttley, G. A. (2007). "PyCogent: A toolkit for making sense from sequence". Genome Biology. 8 (8): R171. doi:10.1186/gb-2007-8-8-r171. PMC 2375001. PMID 17708774.
- Daley, Chris, Biopython 1.77 released, retrieved 6 October 2021
- "Download · Biopython". biopython.org. Retrieved 2025-08-15.
- Refer to the Biopython website for other papers describing Biopython, and a list of over one hundred publications using/citing Biopython.
- "Biopython".
- Chang, Jeff; Chapman, Brad; Friedberg, Iddo; Hamelryck, Thomas; de Hoon, Michiel; Cock, Peter; Antao, Tiago; Talevich, Eric; Wilczynski, Bartek (29 May 2014), Biopython Tutorial and Cookbook, retrieved 28 August 2014
- Zmasek, Christian M; Zhang, Qing; Ye, Yuzhen; Godzik, Adam (24 October 2007). "Surprising complexity of the ancestral apoptosis network". Genome Biology. 8 (10): R226. doi:10.1186/gb-2007-8-10-r226. PMC 2246300. PMID 17958905.
- Talevich, Eric; Invergo, Brandon M; Cock, Peter JA; Chapman, Brad A (21 August 2012). "Bio.Phylo: A unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython". BMC Bioinformatics. 13 (209): 209. doi:10.1186/1471-2105-13-209. PMC 3468381. PMID 22909249.
- "Klebsiella pneumoniae strain KPS77 plasmid pKPS77, complete sequence". NCBI. Retrieved 10 September 2014.
- Pritchard, Leighton; White, Jennifer A; Birch, Paul RJ; Toth, Ian K (March 2006). "GenomeDiagram: a python package for the visualization of large-scale genomic data". Bioinformatics. 22 (5): 616–617. doi:10.1093/bioinformatics/btk021. PMID 16377612.
- Hamelryck, Thomas; Manderick, Bernard (10 May 2003). "PDB file parser and structure class implemented in Python". Bioinformatics. 19 (17): 2308–2310. doi:10.1093/bioinformatics/btg299. PMID 14630660.
- Rousset, François (January 2008). "GENEPOP'007: a complete re-implementation of the GENEPOP software for Windows and Linux". Molecular Ecology Resources. 8 (1): 103–106. Bibcode:2008MolER...8..103R. doi:10.1111/j.1471-8286.2007.01931.x. PMID 21585727. S2CID 25776992.
- Excoffier, Laurent; Foll, Matthieu (1 March 2011). "fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios". Bioinformatics. 27 (9): 1332–1334. doi:10.1093/bioinformatics/btr124. PMID 21398675.
- "Bio.Application package — Biopython 1.74 documentation". biopython.org. Retrieved 2025-08-15.