No edit summary |
DocWatson42 (talk | contribs) Cleaned up layout and other matters. |
Line 1:
{{Short description|Computational analysis of large, complex sets of biological data}}
{{cs1 config|name-list-style=vanc|display-authors=6}}▼
{{For|the journal|Bioinformatics (journal)}}
{{Not to be confused with|Biological computation|Genetic algorithm}}
▲{{cs1 config|name-list-style=vanc|display-authors=6}}
{{Use dmy dates|date=September 2020}}
[[File:WPP domain alignment.PNG|500px|thumbnail|right|Early bioinformatics: computational alignment of experimentally determined sequences of a class of related proteins; see {{Section link||Sequence analysis}} for further information.]]
[[Image:Genome viewer screenshot small.png|thumbnail|220px|Map of the human X chromosome (from the [[National Center for Biotechnology Information]] (NCBI) website)]]
'''Bioinformatics''' ({{IPAc-en|audio=en-us-bioinformatics.ogg|ˌ|b|aɪ|.|oʊ|ˌ|ɪ|n|f|ɚ|ˈ|m|æ|t|ɪ|k|s}}) is an [[interdisciplinary]] field of [[science]] that develops methods and [[Bioinformatics software|software tool]]s for understanding [[biology|biological]] data, especially when the data sets are large and complex. Bioinformatics uses [[biology]], [[chemistry]], [[physics]], [[computer science]], [[computer programming]], [[Information engineering (field)|information engineering]], [[mathematics]] and [[statistics]] to analyze and interpret [[biological data]].<ref>{{cite book |last1=Gagniuc |first1=Paul |title=Algorithms in Bioinformatics: Theory and Implementation |date=17 August 2021 |publisher=Wiley |isbn=978-1-119-69796-1 |pages=1-528 |edition=
Computational, statistical, and computer programming techniques have been used for [[In silico|computer simulation]] analyses of biological queries. They include specific, reusable analysis "pipelines", particularly in the field of [[genomics]], such as for the identification of [[gene]]s and single [[nucleotide]] polymorphisms ([[Single-nucleotide polymorphism|SNPs]]). These pipelines are used to better understand the genetic basis of disease, unique adaptations, desirable properties (especially in agricultural species), or differences between populations. Bioinformatics also includes [[proteomics]], which tries to understand the organizational principles within [[nucleic acid]] and [[protein]] sequences.<ref>{{cite web |vauthors=Lesk AM |date=26 July 2013 |title=Bioinformatics |url=https://www.britannica.com/science/bioinformatics |website=Encyclopaedia Britannica |access-date=17 April 2017 |archive-date=14 April 2021 |archive-url=https://web.archive.org/web/20210414103621/https://www.britannica.com/science/bioinformatics |url-status=live }}</ref>
Line 21 ⟶ 22:
=== Sequences ===
[[File: Example DNA sequence.png|thumbnail|right|Sequences of genetic material are frequently used in bioinformatics and are easier to manage using computers than manually.]]
[[File:Muscle alignment view.png|thumb|369x369px|These are sequences being compared in a MUSCLE multiple sequence alignment (MSA). Each sequence name (leftmost column) is from various louse species, while the sequences themselves are in the second column.]]▼
Sequencing speed has risen and cost has fallen dramatically since the completion of the Human Genome Project: some labs are able to [[sequence]] over 100,000 billion bases each year, and a full genome can be sequenced for $1,000 or less.<ref>{{cite web | vauthors = Colby B | date = 2022 | work = Sequencing.com | title = Whole Genome Sequencing Cost | url = https://sequencing.com/education-center/whole-genome-sequencing/whole-genome-sequencing-cost | access-date = 8 April 2022 | archive-date = 15 March 2022 | archive-url = https://web.archive.org/web/20220315025036/https://sequencing.com/education-center/whole-genome-sequencing/whole-genome-sequencing-cost | url-status = live }}</ref>
Line 27 ⟶ 29:
In the 1970s, new techniques for sequencing DNA were applied to bacteriophage MS2 and ΦX174, and the extended nucleotide sequences were then parsed with informational and statistical algorithms. These studies illustrated that well-known features, such as the coding segments and the triplet code, are revealed in straightforward statistical analyses and served as a proof of concept that bioinformatics would be insightful.<ref>{{cite journal | vauthors = Erickson JW, Altman GG |title=A Search for Patterns in the Nucleotide Sequence of the MS2 Genome |journal=Journal of Mathematical Biology |date=1979 |volume=7 |issue=3 |pages=219–230 |doi=10.1007/BF00275725 |s2cid=85199492 }}</ref><ref>{{cite journal | vauthors = Shulman MJ, Steinberg CM, Westmoreland N | title = The coding function of nucleotide sequences can be discerned by statistical analysis | journal = Journal of Theoretical Biology | volume = 88 | issue = 3 | pages = 409–20 | date = February 1981 | pmid = 6456380 | doi = 10.1016/0022-5193(81)90274-5 | bibcode = 1981JThBi..88..409S }}</ref>
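As a rough illustration of how a simple statistical analysis can expose the triplet code, the following Python sketch tabulates nucleotide frequencies separately for each of the three codon positions. It is not the method used in the cited studies; the function names and the toy sequence are hypothetical and chosen only for illustration. In a real coding region read in frame, the three positions tend to show different base compositions, whereas in non-coding sequence they tend to look alike.

```python
from collections import Counter

def codon_position_counts(seq):
    """Count each nucleotide separately at codon positions 1, 2 and 3.

    Assumes the sequence is read in frame starting at index 0.
    Characters other than A, C, G, T are ignored.
    """
    counts = [Counter(), Counter(), Counter()]
    for i, base in enumerate(seq.upper()):
        if base in "ACGT":
            counts[i % 3][base] += 1
    return counts

def position_frequencies(seq):
    """Convert the per-position counts into relative frequencies."""
    result = []
    for c in codon_position_counts(seq):
        total = sum(c.values())
        result.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return result

# Hypothetical toy sequence (eight codons), not taken from MS2 or PhiX174.
toy_coding = "ATGGCTGCAGCTGGTGCAGCTTAA"

for pos, freqs in enumerate(position_frequencies(toy_coding), start=1):
    print(f"codon position {pos}: {freqs}")
```

In this toy sequence, guanine dominates the first codon position and cytosine the second, which is the kind of position-dependent bias such analyses look for.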
▲[[File:Muscle alignment view.png|thumb|369x369px|These are sequences being compared in a MUSCLE multiple sequence alignment (MSA). Each sequence name (leftmost column) is from various louse species, while the sequences themselves are in the second column.]]
== Goals ==
Line 49:
{{main|Sequence alignment|Sequence database|Alignment-free sequence analysis}}
Since the bacteriophage [[Phi X 174|Phage Φ-X174]] was [[sequencing|sequenced]] in 1977,<ref name="pmid870828">{{cite journal | vauthors = Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA, Hutchison CA, Slocombe PM, Smith M | title = Nucleotide sequence of bacteriophage phi X174 DNA | journal = Nature | volume = 265 | issue = 5596 | pages = 687–95 | date = February 1977 | pmid = 870828 | doi = 10.1038/265687a0 | s2cid = 4206886 | bibcode = 1977Natur.265..687S }}</ref> the [[DNA sequence]]s of thousands of organisms have been decoded and stored in databases. This sequence information is analyzed to determine genes that encode [[protein]]s, RNA genes, regulatory sequences, structural motifs, and repetitive sequences. A comparison of genes within a [[species]] or between different species can show similarities between protein functions, or relations between species (the use of [[molecular systematics]] to construct [[phylogenetic tree]]s). With the growing amount of data, it long ago became impractical to analyze DNA sequences manually. [[Computer program]]s such as [[BLAST (biotechnology)|BLAST]] are used routinely to search sequences—as of 2008, from more than 260,000 organisms, containing over 190 billion [[nucleotide]]s.<ref name="pmid18073190">{{cite journal | vauthors = Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL | title = GenBank | journal = Nucleic Acids Research | volume = 36 | issue = Database issue | pages = D25-30 | date = January 2008 | pmid = 18073190 | pmc = 2238942 | doi = 10.1093/nar/gkm929 }}</ref>
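The idea of searching a collection of sequences for the best match to a query can be sketched with a small dynamic-programming example in Python. This is a minimal Smith–Waterman-style local alignment score, not BLAST itself, which relies on faster heuristics rather than exhaustive dynamic programming. The scoring values, the function names and the two-entry "database" are assumptions made for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses the Smith-Waterman recurrence with a linear gap penalty and
    keeps only two rows of the score matrix in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def best_hit(query, database):
    """Return the name of the database entry scoring highest against query."""
    return max(database, key=lambda name: local_alignment_score(query, database[name]))

# Hypothetical miniature "database" of named sequences.
toy_database = {
    "seq1": "TTTTGATTACATTTT",
    "seq2": "CCCCCCCC",
}

print(best_hit("GATTACA", toy_database))
```

Exhaustive scoring of this kind grows with the product of the sequence lengths, which is one reason heuristic tools are used for searches against databases containing billions of nucleotides.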
[[File:Sequencing analysis steps.png|center|600px|Image: 450 pixels Sequencing analysis steps]]▼
===DNA sequencing===
{{main|DNA sequencing}}
Before sequences can be analyzed, they are obtained from a data storage bank, such as GenBank. [[DNA sequencing]] is still a non-trivial problem as the raw data may be noisy or affected by weak signals. [[Algorithm]]s have been developed for [[base calling]] for the various experimental approaches to DNA sequencing.
▲[[File:Sequencing analysis steps.png|center|600px|Image: 450 pixels Sequencing analysis steps]]
===Sequence assembly===
Line 180 ⟶ 181:
===Molecular interaction networks===
[[File:The protein interaction network of Treponema pallidum.png|200px|thumbnail|right|Interactions between proteins are frequently visualized and analyzed using networks. This network is made up of protein–protein interactions from ''[[Treponema pallidum]]'', the causative agent of [[syphilis]] and other diseases.<ref>{{cite journal | vauthors = Titz B, Rajagopala SV, Goll J, Häuser R, McKevitt MT, Palzkill T, Uetz P | title = The binary protein interactome of Treponema pallidum--the syphilis spirochete | journal = PLOS ONE | volume = 3 | issue = 5 | pages = e2292 | date = May 2008 | pmid = 18509523 | pmc = 2386257 | doi = 10.1371/journal.pone.0002292 | bibcode = 2008PLoSO...3.2292T | veditors = Hall N | doi-access = free }}</ref>]]▼
{{main|Protein–protein interaction prediction|interactome}}
▲[[File:The protein interaction network of Treponema pallidum.png|200px|thumbnail|right|Interactions between proteins are frequently visualized and analyzed using networks. This network is made up of protein–protein interactions from ''[[Treponema pallidum]]'', the causative agent of [[syphilis]] and other diseases.<ref>{{cite journal | vauthors = Titz B, Rajagopala SV, Goll J, Häuser R, McKevitt MT, Palzkill T, Uetz P | title = The binary protein interactome of Treponema pallidum--the syphilis spirochete | journal = PLOS ONE | volume = 3 | issue = 5 | pages = e2292 | date = May 2008 | pmid = 18509523 | pmc = 2386257 | doi = 10.1371/journal.pone.0002292 | bibcode = 2008PLoSO...3.2292T | veditors = Hall N | doi-access = free }}</ref>]]
Tens of thousands of three-dimensional protein structures have been determined by [[X-ray crystallography]] and [[protein nuclear magnetic resonance spectroscopy]] (protein NMR), and a central question in structural bioinformatics is whether it is practical to predict possible protein–protein interactions based only on these 3D shapes, without performing [[protein–protein interaction]] experiments. A variety of methods have been developed to tackle the [[protein–protein docking]] problem, though much work remains to be done in this field.
Other interactions encountered in the field include protein–ligand (including drug) and [[protein–peptide]]. Molecular dynamics simulation of movement of atoms about rotatable bonds is the fundamental principle behind computational [[algorithm]]s, termed docking algorithms, for studying [[interactome|molecular interactions]].
==Biodiversity informatics==
{{main|Biodiversity informatics}}
Biodiversity informatics deals with the collection and analysis of [[biodiversity]] data, such as [[taxonomic database]]s, or [[microbiome]] data. Examples of such analyses include [[phylogenetics]], [[niche modelling]], [[species richness]] mapping, [[DNA barcoding]], or [[species]] identification tools. A growing area is also [[Macroecology|macro-ecology]], i.e. the study of how biodiversity is connected to [[ecology]] and human impact, such as [[climate change]].
==Others==
===Literature analysis===
{{main|Text mining|Biomedical text mining}}
The enormous volume of published literature makes it virtually impossible for individuals to read every paper, resulting in disjointed sub-fields of research. Literature analysis aims to employ computational and statistical linguistics to mine this growing library of text resources. For example:
* Abbreviation recognition – identifying the long form and abbreviation of biological terms
* [[Named-entity recognition]] – recognizing biological terms such as gene names
Line 204 ⟶ 206:
===High-throughput image analysis===
Computational technologies are used to automate the processing, quantification and analysis of large amounts of high-information-content [[medical imaging|biomedical imagery]]. Modern [[image analysis]] systems can improve an observer's [[accuracy]], [[Objectivity (science)|objectivity]], or speed. Image analysis is important for both [[diagnostics]] and research. Some examples are:
* high-throughput and high-fidelity quantification and sub-cellular localization ([[high-content screening]], cytohistopathology, [[Bioimage informatics]])
* [[morphometrics]]
Line 215 ⟶ 218:
===High-throughput single cell data analysis===
{{main|Flow cytometry bioinformatics}}
Computational techniques are used to analyse high-throughput, low-measurement single cell data, such as that obtained from [[flow cytometry]]. These methods typically involve finding populations of cells that are relevant to a particular disease state or experimental condition.
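A very simple way to find cell populations can be sketched in Python as fixed-threshold "quadrant gating" on two markers, followed by computing the fraction of cells in each population. This is only an illustrative sketch: real flow cytometry analyses typically also involve steps such as compensation, data transformation and automated clustering, and the marker labels, thresholds and intensity values below are made up.

```python
from collections import Counter

def gate_cells(cells, threshold_a, threshold_b):
    """Assign each cell to one of four populations by two marker thresholds.

    cells is an iterable of (marker_a, marker_b) intensity pairs.
    Returns a Counter keyed by labels such as "A+B-".
    """
    counts = Counter()
    for a, b in cells:
        label = ("A+" if a >= threshold_a else "A-") + \
                ("B+" if b >= threshold_b else "B-")
        counts[label] += 1
    return counts

def population_fractions(cells, threshold_a, threshold_b):
    """Return the fraction of all cells falling into each gated population."""
    counts = gate_cells(cells, threshold_a, threshold_b)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}

# Hypothetical intensities for five cells measured on two markers.
toy_cells = [(900, 50), (850, 40), (30, 700), (20, 20), (910, 60)]

print(population_fractions(toy_cells, threshold_a=500, threshold_b=500))
```

Comparing such fractions between samples from different conditions is one basic way to flag populations that may be relevant to a disease state or experimental condition.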
Line 224 ⟶ 228:
==Databases==
{{main|List of biological databases|Biological database}}
Databases are essential for bioinformatics research and applications. Databases exist for many different information types, including DNA and protein sequences, molecular structures, phenotypes and biodiversity. Databases can contain both empirical data (obtained directly from experiments) and predicted data (obtained from analysis of existing data). They may be specific to a particular organism, pathway or molecule of interest. Alternatively, they can incorporate data compiled from multiple other databases. Databases can differ in format and access mechanism, and can be public or private.
Line 241 ⟶ 246:
===Open-source bioinformatics software===
{{Main articles|List of bioinformatics software}}
Many [[free and open-source software]] tools have existed and continued to grow since the 1980s.<ref name="obf-main">{{cite web |title=Open Bioinformatics Foundation: About us |url=http://www.open-bio.org/wiki/Main_Page |website=Official website |publisher=[[Open Bioinformatics Foundation]] |access-date=10 May 2011 |archive-date=12 May 2011 |archive-url=https://web.archive.org/web/20110512022059/http://open-bio.org/wiki/Main_Page |url-status=live }}</ref> The combination of a continued need for new [[algorithm]]s for the analysis of emerging types of biological readouts, the potential for innovative ''[[in silico]]'' experiments, and freely available [[open code]] bases has created opportunities for research groups to contribute to bioinformatics regardless of [[Funding of science|funding]]. Open-source tools often act as incubators of ideas, or as community-supported [[Plug-in (computing)|plug-ins]] in commercial applications. They may also provide ''[[de facto]]'' standards and shared object models for assisting with the challenge of bioinformation integration.
Line 280 ⟶ 286:
== See also ==
{{Columns-list|colwidth=30em|
* [[Biodiversity informatics]]
Line 288 ⟶ 295:
* [[Cyberbiosecurity]]
* [[Functional genomics]]
* [[Gene Disease Database]]▼
* [[Health informatics]]
* [[International Society for Computational Biology]]
Line 298 ⟶ 306:
* [[Phylogenetics]]
* [[Proteomics]]
▲* [[Gene Disease Database]]
}}
Line 306 ⟶ 313:
== Further reading ==
<!-- It's possible that some of these were used as the original sources for the article. -->
{{Library resources box}}
{{refbegin|35em}}
* Sehgal et al.: Structural, phylogenetic and docking studies of D-amino acid oxidase activator (DAOA), a candidate schizophrenia gene. Theoretical Biology and Medical Modelling 2013 10:3.
Line 332 ⟶ 340:
== External links ==
<!-- Please use the talk page to propose any additions to this section. If you do not do this, the link will almost certainly be deleted. Also, do not list bioinformatics research groups or centers.-->
* [http://expasy.org Bioinformatics Resource Portal (SIB)]
{{Bioinformatics}}
Line 346 ⟶ 354:
{{Computer science}}
{{Health informatics}}
▲{{Portal bar|Biology|Evolutionary biology}}
{{Authority control}}