Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology.
Bioconductor | |
---|---|
Stable release | 3.21 / 16 April 2025 |
Operating system | Linux, macOS, Windows |
Platform | R programming language |
Type | Bioinformatics |
License | Artistic License 2.0 |
Website | www |
Bioconductor is based primarily on the statistical R programming language, but contains contributions in other programming languages. It has two releases each year, each tied to a specific version of R. At any one time there is a release version, which corresponds to the released version of R, and a development version, which corresponds to the development version of R. Most users will find the release version appropriate for their needs. In addition, many genome annotation packages are available that are mainly, but not solely, oriented towards different types of microarrays.
The project was started in the fall of 2001 and is overseen by the Bioconductor core team, based primarily at the Fred Hutchinson Cancer Research Center, with other members coming from international institutions.
Packages
Most Bioconductor components are distributed as R packages, which are add-on modules for R. Initially, most of the Bioconductor software packages focused on the analysis of single-channel Affymetrix and two-or-more-channel cDNA/oligo microarrays. As the project has matured, the functional scope of the software packages has broadened to include the analysis of all types of genomic data, such as SAGE, sequence, or SNP data.
Goals
The broad goals of the project are to:
- Provide widespread access to a broad range of powerful statistical and graphical methods for the analysis of genomic data.
- Facilitate the inclusion of biological metadata in the analysis of genomic data, e.g. literature data from PubMed, annotation data from LocusLink/Entrez.
- Provide a common software platform that enables the rapid development and deployment of plug-able, scalable, and interoperable software.
- Further scientific understanding by producing high-quality documentation and reproducible research.
- Train researchers on computational and statistical methods for the analysis of genomic data.
Main features
- Documentation and reproducible research. Each Bioconductor package contains at least one vignette, which is a document that provides a textual, task-oriented description of the package's functionality. These vignettes come in several forms. Many are simple "How-to"s that are designed to demonstrate how a particular task can be accomplished with that package's software. Others provide a more thorough overview of the package or might even discuss general issues related to the package. In the future, the Bioconductor project is looking towards providing vignettes that are not specifically tied to a package, but rather are demonstrating more complex concepts. As with all aspects of the Bioconductor project, users are encouraged to participate in this effort.
- Statistical and graphical methods. The Bioconductor project aims to provide access to a wide range of powerful statistical and graphical methods for the analysis of genomic data. Analysis packages are available for pre-processing Affymetrix, Illumina, and cDNA array data; identifying differentially expressed genes; graph-theoretical analyses; and plotting genomic data. In addition, the R package system itself provides implementations for a broad range of state-of-the-art statistical and graphical techniques, including linear and non-linear modeling, cluster analysis, prediction, resampling, survival analysis, and time series analysis.
- Genome annotation. The Bioconductor project provides software for associating microarray and other genomic data in real time with biological metadata from web databases such as GenBank, LocusLink and PubMed (annotate package). Functions are also provided for incorporating the results of statistical analysis in HTML reports with links to annotation web resources. With the AnnotationDbi package, software tools are available for assembling and processing genomic annotation data from databases such as GenBank, the Gene Ontology Consortium, LocusLink, UniGene, the UCSC Human Genome Project and others. Data packages are distributed to provide mappings between different probe identifiers (e.g. Affy IDs, LocusLink, PubMed). Customized annotation libraries can also be assembled. The project also contains packages for genomic and phylogenetic analysis (e.g. ggtree; the related phytools package is distributed through CRAN).
- Open source. The Bioconductor project has a commitment to full open source discipline, with distribution via a SourceForge.net-like platform. All contributions are expected to exist under an open source license such as Artistic 2.0, GPL2, or BSD. There are many different reasons why open-source software is beneficial to the analysis of microarray data and to computational biology in general. The reasons include:
  - To provide full access to algorithms and their implementation
  - To facilitate software improvements through bug fixing and plug-ins
  - To encourage good scientific computing and statistical practice by providing appropriate tools and instruction
  - To provide a workbench of tools that allow researchers to explore and expand the methods used to analyze biological data
  - To ensure that the international scientific community is the owner of the software tools needed to carry out research
  - To lead and encourage commercial support and development of those tools that are successful
  - To promote reproducible research by providing open and accessible tools with which to carry out that research (reproducible research is distinct from independent verification)
- Open development. Users are encouraged to become developers by contributing either Bioconductor-compliant packages or documentation. Additionally, Bioconductor provides a mechanism for linking together different groups with common goals to foster collaboration on software, possibly at the level of shared development.
Milestones
Each release of Bioconductor is developed to work best with a chosen version of R.[1] In addition to bug fixes and updates, a new release typically adds packages. The table below maps each Bioconductor release to an R version and shows the number of Bioconductor software packages available for that release.
Version | Release date | Package count | R dependency |
---|---|---|---|
3.21 | 16 Apr 2025 | 2341 | R 4.5 |
3.20 | 30 Oct 2024 | 2289 | R 4.4 |
3.18 | 25 Oct 2023 | 2266 | R 4.3 |
3.16 | 2 Nov 2022 | 2183 | R 4.2 |
3.14 | 27 Oct 2021 | 2083 | R 4.1 |
3.11 | 28 Apr 2020 | 1903 | R 4.0 |
3.10 | 30 Oct 2019 | 1823 | R 3.6 |
3.8 | 31 Oct 2018 | 1649 | R 3.5 |
3.6 | 31 Oct 2017 | 1473 | R 3.4 |
3.4 | 18 Oct 2016 | 1296 | R 3.3 |
3.2 | 14 Oct 2015 | 1104 | R 3.2 |
3.0 | 14 Oct 2014 | 934 | R 3.1 |
2.13 | 15 Oct 2013 | 749 | R 3.0 |
2.11 | 3 Oct 2012 | 610 | R 2.15 |
2.9 | 1 Nov 2011 | 517 | R 2.14 |
2.8 | 14 Apr 2011 | 466 | R 2.13 |
2.7 | 18 Nov 2010 | 418 | R 2.12 |
2.6 | 23 Apr 2010 | 389 | R 2.11 |
2.5 | 28 Oct 2009 | 352 | R 2.10 |
2.4 | 21 Apr 2009 | 320 | R 2.9 |
2.3 | 22 Oct 2008 | 294 | R 2.8 |
2.2 | 1 May 2008 | 260 | R 2.7 |
2.1 | 8 Oct 2007 | 233 | R 2.6 |
2.0 | 26 Apr 2007 | 214 | R 2.5 |
1.9 | 4 Oct 2006 | 188 | R 2.4 |
1.8 | 27 Apr 2006 | 172 | R 2.3 |
1.7 | 14 Oct 2005 | 141 | R 2.2 |
1.6 | 18 May 2005 | 123 | R 2.1 |
1.5 | 25 Oct 2004 | 100 | R 2.0 |
1.4 | 17 May 2004 | 81 | R 1.9 |
1.3 | 30 Oct 2003 | 49 | R 1.8 |
1.2 | 29 May 2003 | 30 | R 1.7 |
1.1 | 19 Oct 2002 | 20 | R 1.6 |
1.0 | 1 May 2002 | 15 | R 1.5 |
Application of Bioconductor in small-RNA seq and microRNA data analysis
Introduction
Small RNA sequencing is a widely used technique for studying microRNAs (miRNAs), small interfering RNAs (siRNAs), and piwi-interacting RNAs (piRNAs), which play a crucial role in RNA-mediated gene silencing, also known as RNA silencing or gene silencing. RNA silencing employs different types of substrates, which give rise to different RNA populations such as miRNAs and siRNAs. In the laboratory, small RNA sequencing typically starts with extraction of RNA from cells or tissues, followed by adapter ligation to the 5' and 3' ends of the small RNAs, and then reverse transcription and PCR amplification to generate cDNA libraries. Finally, high-throughput sequencing (most commonly on Illumina platforms) produces millions of short reads. The resulting data then undergo computational processing to align the reads to the reference genome of the species under study or to miRNA databases.
Bioconductor in RNA Biology
Bioconductor (BioC)[2] is a widely used open-source platform for analysing small-RNA sequencing and other genomic data. It is built primarily on the R programming language and offers a wide range of packages for bioinformatics and computational biology, including packages[3] for handling small-RNA-seq data. Popular Bioconductor packages such as DESeq2,[4] edgeR,[5] limma with voom,[6][7] GenomicAlignments,[8] GenomicFeatures,[8] Rsubread[9] (including its featureCounts function[11]), and ShortRead[10] provide robust analysis of RNA-seq data.[12]
DESeq2
DESeq2 models read counts from RNA-seq data with a negative binomial distribution for differential expression analysis.[13] It is widely used for dispersion estimation, normalization, and visualization with PCA plots or MA plots.[4]
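The negative binomial model mentioned above is usually written in a mean-dispersion form in which the variance of a count grows as mean + dispersion × mean². The following Python sketch only illustrates that relationship; it is not code from DESeq2, and the function name and the example values are hypothetical.

```python
def nb_variance(mean: float, dispersion: float) -> float:
    """Variance of a negative binomial count in the mean-dispersion form.

    Assumes the parameterization var = mean + dispersion * mean**2,
    so dispersion = 0 reduces to the Poisson case (var = mean).
    """
    if mean < 0 or dispersion < 0:
        raise ValueError("mean and dispersion must be non-negative")
    return mean + dispersion * mean ** 2


# Hypothetical gene with an average of 100 reads per sample.
for alpha in (0.0, 0.01, 0.1):
    print(f"dispersion={alpha}: variance={nb_variance(100.0, alpha):.1f}")
```

Larger dispersion values describe genes whose counts vary more between replicates than sampling noise alone would explain, which is why dispersion estimation is central to this kind of analysis.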
edgeR
edgeR also models read counts from RNA-seq data with a negative binomial distribution for differential expression analysis. It is often used for experiments with a relatively small number of samples.[5][14]
limma + voom
The voom function in limma transforms count data to log2 counts per million (log2-CPM) and estimates the mean-variance relationship of the transformed values, so that the linear modelling methods limma provides for microarray data can also be applied to RNA-seq data.[15]
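As a rough illustration of the log2-CPM transformation, the Python sketch below applies the offsets described in the voom publication (0.5 added to each count and 1 added to each library size). It assumes that the library size is simply the column sum of the count matrix; the function name and the toy counts are hypothetical, and the mean-variance modelling step is not shown.

```python
import math


def log2_cpm(counts):
    """Convert a genes x samples matrix of raw counts to log2-CPM values.

    Each count r in a sample with library size R becomes
    log2((r + 0.5) / (R + 1) * 1e6). Library size is taken to be the
    column sum, which is an assumption of this sketch.
    """
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [math.log2((row[j] + 0.5) / (lib_sizes[j] + 1.0) * 1e6)
         for j in range(n_samples)]
        for row in counts
    ]


# Toy matrix: two genes (rows) measured in two samples (columns).
toy_counts = [[10, 20],
              [90, 180]]
for gene_values in log2_cpm(toy_counts):
    print([round(v, 3) for v in gene_values])
```

The small offsets keep the logarithm finite for genes with zero counts.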
GenomicAlignments
GenomicAlignments is used to import and manipulate read alignments stored in formats such as BAM and to assign aligned reads to genes or miRNAs for downstream analysis.[8][16]
GenomicFeatures
GenomicFeatures is used to build transcript-centric annotation databases, stored as TxDb objects, that hold information about genes, exons, and transcripts, for example from GTF/GFF files.[8][17]
Rsubread
Rsubread is used mostly for read mapping and read summarization. Its align() function provides an efficient alternative to external aligners such as STAR or HISAT2, and its featureCounts() function summarizes the mapped reads.[18]
ShortRead
ShortRead is often used to pre-process raw FASTQ files and to check the quality of raw data from sequencing platforms such as Illumina.[10]
Computational Workflow
Data Import and Quality Control
FASTQ files[19] are typically imported with Bioconductor packages such as ShortRead,[10] which also produces quality assessment reports.
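To make the quality-control step more concrete, the following Python sketch reads four-line FASTQ records and computes a mean Phred score per read. It assumes the Sanger encoding (ASCII offset 33) and unwrapped records; it is not how ShortRead is implemented, and the function names and the example record are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred quality of one read, assuming Phred+33 encoding."""
    if not qual:
        raise ValueError("empty quality string")
    return sum(ord(ch) - offset for ch in qual) / len(qual)


# Hypothetical single-read example; 'I' encodes Phred 40 under Phred+33.
sequence = "TGAGGTAGTAGGTTGTATAGTT"
example = io.StringIO(f"@read1\n{sequence}\n+\n{'I' * len(sequence)}\n")
for name, seq, qual in read_fastq(example):
    print(name, len(seq), mean_phred(qual))
```

A real quality report would also summarize per-position quality and base composition across all reads.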
Adapter Trimming and Filtering
External tools such as Cutadapt[20] and Trimmomatic[21] are used to remove adapter sequences from the raw FASTQ files, which improves read quality.
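The idea behind 3' adapter trimming can be sketched in Python as follows. This is a deliberately simplified, exact-match version: tools such as Cutadapt also tolerate sequencing errors, which this sketch does not. The function name, the `min_overlap` parameter, and the adapter sequence are illustrative assumptions.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching only.

    The read is cut at the first full adapter occurrence; otherwise a
    partial adapter of at least min_overlap bases at the read end is removed.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Look for the longest adapter prefix that forms the end of the read.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


# Illustrative adapter and insert sequences (assumptions for this sketch).
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"
insert = "TGAGGTAGTAGGTTGTATAGTT"
print(trim_3prime_adapter(insert + ADAPTER, ADAPTER))      # full adapter present
print(trim_3prime_adapter(insert + ADAPTER[:5], ADAPTER))  # adapter cut off by read length
```

Because small RNAs are shorter than typical read lengths, most reads in such libraries run into the adapter, which is why this step matters for small-RNA data in particular.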
Read Alignment
The processed reads are aligned to the reference genome. The alignment can be done with Rsubread or with external tools such as STAR, and the results are stored in standard formats such as SAM (Sequence Alignment Map) or BAM (Binary Alignment Map) files.
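As an illustration of what an alignment record contains, the Python sketch below parses the mandatory tab-separated fields of one SAM line and interprets two of the flag bits defined in the SAM format (0x4 for unmapped, 0x10 for reverse strand). The record type, the function names, and the example line with its coordinates are hypothetical.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    seq: str     # read sequence


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9])


def is_mapped(record):
    return not record.flag & 0x4


def is_reverse_strand(record):
    return bool(record.flag & 0x10)


# Hypothetical alignment of a 22-nt read to the reverse strand.
line = "read1\t16\tchr9\t94175962\t255\t22M\t*\t0\t0\tAACTATACAACCTACTACCTCA\t*"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, is_mapped(rec), is_reverse_strand(rec))
```

BAM stores the same information in compressed binary form, so in practice it is read through a dedicated library rather than parsed as text.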
Annotation of microRNAs
Bioconductor supports the integration of miRBase data: packages such as miRBaseConverter,[22] AnnotationHub,[23] and org.Mm.eg.db[24] are used to annotate reads against known miRNAs.
Quantification
Reads that map to known genes or microRNAs are counted, and the counts are summarized across samples.
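A minimal Python sketch of this counting step, for a single sample, is shown below. It assumes closed genomic intervals and skips reads that overlap more than one feature, which is one common convention rather than the only possible one. The function name, feature names, and coordinates are hypothetical.

```python
def count_reads(reads, features):
    """Count reads per feature by interval overlap.

    reads:    iterable of (chrom, start, end) tuples
    features: dict mapping feature name -> (chrom, start, end)
    Reads overlapping zero or several features are not counted.
    """
    counts = {name: 0 for name in features}
    for chrom, start, end in reads:
        hits = [
            name
            for name, (f_chrom, f_start, f_end) in features.items()
            if chrom == f_chrom and start <= f_end and end >= f_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


# Hypothetical miRNA loci and aligned reads.
features = {
    "mir-a": ("chr1", 100, 180),
    "mir-b": ("chr1", 150, 230),
    "mir-c": ("chr2", 50, 120),
}
reads = [
    ("chr1", 105, 126),  # only mir-a
    ("chr1", 160, 181),  # mir-a and mir-b: ambiguous, skipped
    ("chr1", 200, 221),  # only mir-b
    ("chr2", 60, 81),    # only mir-c
    ("chr3", 10, 31),    # no feature
]
print(count_reads(reads, features))
```

Running the same function on every sample and placing the results side by side gives the genes-by-samples count matrix that differential expression tools take as input.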
Differential Expression Analysis
After microRNA expression has been mapped and quantified, well-established packages such as DESeq2 and edgeR are used for differential expression analysis.
Visualization
To interpret and present miRNA expression results, visualization packages such as ggplot2,[25] pheatmap,[26] and ComplexHeatmap are used to generate volcano plots, PCA (principal component analysis) plots, MA plots, and heatmaps of the differential expression data.
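The coordinates of a classic MA plot can be computed as in the Python sketch below: M is the log2 ratio of two conditions and A is their average log2 expression. The pseudocount, the function name, and the expression values are assumptions of this sketch, and individual packages may define the axes somewhat differently.

```python
import math


def ma_values(mean_a, mean_b, pseudocount=1.0):
    """Return (M, A) for one gene from its mean expression in two conditions.

    M = log2(b) - log2(a)        (log fold change of b over a)
    A = (log2(a) + log2(b)) / 2  (average log expression)
    A pseudocount is added first so that zero counts stay finite.
    """
    log_a = math.log2(mean_a + pseudocount)
    log_b = math.log2(mean_b + pseudocount)
    return log_b - log_a, (log_a + log_b) / 2


# Hypothetical mean normalized counts for three miRNAs: (control, treated).
expression = {"mir-a": (3, 15), "mir-b": (127, 127), "mir-c": (63, 7)}
for name, (control, treated) in expression.items():
    m, a = ma_values(control, treated)
    print(f"{name}: M={m:+.2f}, A={a:.2f}")
```

Plotting M against A for every gene shows whether large fold changes are concentrated among weakly expressed genes, where estimates are noisier.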
Resources
- Gentleman, R.; Carey, V.; Huber, W.; Irizarry, R.; Dudoit, S. (2005). Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer. ISBN 978-0-387-25146-2.
- Gentleman, R. (2008). R Programming for Bioinformatics. Chapman & Hall/CRC. ISBN 978-1-4200-6367-7.
- Hahne, F.; Huber, W.; Gentleman, R.; Falcon, S. (2008). Bioconductor Case Studies. Springer. ISBN 978-0-387-77239-4.
- Gentleman, Robert C.; Carey, Vincent J.; Bates, Douglas M.; Bolstad, Ben; Dettling, Marcel; Dudoit, Sandrine; Ellis, Byron; Gautier, Laurent; Ge, Yongchao; Gentry, Jeff; Hornik, Kurt; Hothorn, Torsten; Huber, Wolfgang; Iacus, Stefano; Irizarry, Rafael; Leisch, Friedrich; Li, Cheng; Maechler, Martin; Rossini, Anthony J.; Sawitzki, Gunther; Smith, Colin; Smyth, Gordon; Tierney, Luke; Yang, Jean Y. H.; Zhang, Jianhua (2004). "Bioconductor: open software development for computational biology and bioinformatics". Genome Biology. 5 (10): R80. doi:10.1186/gb-2004-5-10-r80. PMC 545600. PMID 15461798.
References
- [1] "Bioconductor – Release Announcements". bioconductor.org. Bioconductor. Retrieved 28 May 2019.
- [2] Steffen Durinck, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, Wolfgang Huber, BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis, Bioinformatics, Volume 21, Issue 16, August 2005, Pages 3439–3440, https://doi.org/10.1093/bioinformatics/bti525
- [3] https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html
- [4] Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8. PMID 25516281; PMCID: PMC4302049.
- [5] Mark D. Robinson, Davis J. McCarthy, Gordon K. Smyth, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics, Volume 26, Issue 1, January 2010, Pages 139–140, https://doi.org/10.1093/bioinformatics/btp616
- [6] Law CW, Alhamdoosh M, Su S, Dong X, Tian L, Smyth GK, Ritchie ME. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Res. 2016 Jun 17;5:ISCB Comm J-1408. doi: 10.12688/f1000research.9005.3. PMID 27441086; PMCID: PMC4937821.
- [7] Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015 Apr 20;43(7):e47. doi: 10.1093/nar/gkv007. Epub 2015 Jan 20. PMID 25605792; PMCID: PMC4402510.
- [8] Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L, Morgan M. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015 Feb;12(2):115-21. doi: 10.1038/nmeth.3252. PMID 25633503; PMCID: PMC4509590.
- [9] Liao Y, Smyth GK, Shi W. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res. 2019 May 7;47(8):e47. doi: 10.1093/nar/gkz114. PMID 30783653; PMCID: PMC6486549.
- [10] Morgan M, Anders S, Lawrence M, Aboyoun P, Pagès H, Gentleman R. ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Bioinformatics. 2009 Oct 1;25(19):2607-8. doi: 10.1093/bioinformatics/btp450. Epub 2009 Aug 3. PMID 19654119; PMCID: PMC2752612.
- [11] Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014 Apr 1;30(7):923-30. doi: 10.1093/bioinformatics/btt656. Epub 2013 Nov 13. PMID 24227677.
- [12] Koch CM, Chiu SF, Akbarpour M, Bharat A, Ridge KM, Bartom ET, Winter DR. A Beginner's Guide to Analysis of RNA Sequencing Data. Am J Respir Cell Mol Biol. 2018 Aug;59(2):145-157. doi: 10.1165/rcmb.2017-0430TR. PMID 29624415; PMCID: PMC6096346.
- [13] Chen Y, Lun AT, Smyth GK. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Res. 2016 Jun 20;5:1438. doi: 10.12688/f1000research.8987.2. PMID 27508061; PMCID: PMC4934518.
- [14] Chen Y, Lun AT, Smyth GK. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Res. 2016 Jun 20;5:1438. doi: 10.12688/f1000research.8987.2. PMID 27508061; PMCID: PMC4934518.
- [15] Law CW, Alhamdoosh M, Su S, Dong X, Tian L, Smyth GK, Ritchie ME. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Res. 2016 Jun 17;5:ISCB Comm J-1408. doi: 10.12688/f1000research.9005.3. PMID 27441086; PMCID: PMC4937821.
- [16] https://bioconductor.org/packages/devel/bioc/html/GenomicAlignments.html
- [17] https://bioconductor.org/packages/devel/bioc/html/GenomicFeatures.html#:~:text=Extract%20the%20genomic%20locations%20of,tools%20from%20the%20txdbmaker%20package.
- [18] Liao Y, Smyth GK, Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res. 2013 May 1;41(10):e108. doi: 10.1093/nar/gkt214. Epub 2013 Apr 4. PMID 23558742; PMCID: PMC3664803.
- [19] Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010 Apr;38(6):1767-71. doi: 10.1093/nar/gkp1137. Epub 2009 Dec 16. PMID 20015970; PMCID: PMC2847217.
- [20] https://cutadapt.readthedocs.io/en/stable/
- [21] Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014 Aug 1;30(15):2114-20. doi: 10.1093/bioinformatics/btu170. Epub 2014 Apr 1. PMID 24695404; PMCID: PMC4103590.
- [22] Xu, T., Su, N., Liu, L. et al. miRBaseConverter: an R/Bioconductor package for converting and retrieving miRNA name, accession, sequence and family information in different versions of miRBase. BMC Bioinformatics 19 (Suppl 19), 514 (2018). https://doi.org/10.1186/s12859-018-2531-5
- [23] https://bioconductor.org/packages/devel/bioc/html/AnnotationHub.html
- [24] https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html
- [25] https://ggplot2-book.org/introduction.html#:~:text=ggplot2%20is%20an%20R%20package,graphs%20by%20combining%20independent%20components.
- [26] https://rdrr.io/cran/pheatmap/
External links
- Official website
- The R Project: GNU R is a programming language for statistical computing.
- Bioconductor Releases
- The community of the Debian GNU/Linux distribution strives towards automated building of Bioconductor packages for their distribution (archived 2007-08-11 at the Wayback Machine). BioKnoppix and Quantian are projects extending Knoppix that have contributed bootable Debian GNU/Linux CDs providing Bioconductor installations.