{{Short description|Chemical species structure notation}}
{{Redirect|SMILES|other uses|Smiles (disambiguation)}}
{{Use mdy dates|date=July 2020}}
{{Infobox file format
| name = SMILES
| extension =
| owner =
| genre = [[chemical file format]]
| container for =
| contained by =
| extended from =
| extended to =
}}
[[Image:SMILES.png|thumb|class=skin-invert-image|300px|SMILES generation algorithm for [[
The '''
The original SMILES specification was initiated in the 1980s. It has since been modified and extended. In 2007, an [[open standard]] called OpenSMILES was developed in
==History==
The original SMILES specification was initiated by [[David Weininger]] at the USEPA Mid-Continent Ecology Division Laboratory in [[Duluth, Minnesota|Duluth]] in the 1980s.<ref name="
It has since been modified and extended by others, most notably by [[Daylight Chemical Information Systems]]. In 2007, an [[open standard]] called "OpenSMILES" was developed by the [[Blue Obelisk]] open-source chemistry community. Other 'linear' notations include the [[Wiswesser Line Notation]] (WLN), [[ROSDAL]] and [[SYBYL Line Notation|SLN]] (Tripos Inc).
In July 2006, the [[International Union of Pure and Applied Chemistry|IUPAC]] introduced the [[International Chemical Identifier|InChI]] as a standard for formula representation. SMILES is generally considered to have the advantage of being
== Terminology ==
The term SMILES refers to a line notation for encoding molecular structures; specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings; the exact meaning is usually apparent from the context. The terms "canonical" and "isomeric" can lead to some confusion when applied to SMILES: they describe different attributes of SMILES strings and are not mutually exclusive.
Typically, a number of equally valid SMILES strings can be written for a molecule. For example, <code>CCO</code>, <code>OCC</code> and <code>C(O)C</code> all specify the structure of [[ethanol]]. Algorithms have been developed to generate the same SMILES string for a given molecule; of the many possible strings, these algorithms choose only one of them. This SMILES is unique for each structure, although dependent on the [[canonicalization]] algorithm used to generate it, and is termed the canonical SMILES. These algorithms first convert the SMILES to an internal representation of the molecular structure; an algorithm then examines that structure and produces a unique SMILES string. Various algorithms for generating canonical SMILES have been developed and include those by
The original paper that described the CANGEN<ref name="
SMILES notation allows the specification of [[molecular configuration|configuration at tetrahedral centers]] and of double bond geometry. These structural features cannot be specified by connectivity alone, and SMILES strings that encode this information are therefore termed isomeric SMILES. A notable feature of these rules is that they allow rigorous partial specification of chirality. The term isomeric SMILES is also applied to SMILES in which [[isomer]]s are specified.
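The idea that several strings such as <code>CCO</code>, <code>OCC</code> and <code>C(O)C</code> describe the same molecule, while a canonicalization procedure selects exactly one string, can be illustrated with a minimal Python sketch. It is not CANGEN or any published algorithm. It assumes a toy subset of SMILES (single-letter atoms, single bonds, parentheses, no rings), and the function names <code>parse</code>, <code>encode</code> and <code>canonical</code> are hypothetical.

```python
def parse(smiles):
    """Parse a toy, acyclic SMILES subset into (atoms, neighbor lists).

    Assumption: only single-letter atoms and parentheses appear.
    """
    atoms, bonds = [], {}
    stack, prev = [], None
    for ch in smiles:
        if ch == "(":
            stack.append(prev)      # remember the branch point
        elif ch == ")":
            prev = stack.pop()      # return to the branch point
        else:
            idx = len(atoms)
            atoms.append(ch)
            bonds[idx] = []
            if prev is not None:    # adjacency implies a single bond
                bonds[idx].append(prev)
                bonds[prev].append(idx)
            prev = idx
    return atoms, bonds


def encode(atoms, bonds, node, parent=None):
    """Encode the tree rooted at `node`, with child encodings sorted."""
    children = sorted(
        encode(atoms, bonds, nb, node) for nb in bonds[node] if nb != parent
    )
    return atoms[node] + "".join("(" + child + ")" for child in children)


def canonical(smiles):
    """Pick one encoding: the smallest over all possible root atoms."""
    atoms, bonds = parse(smiles)
    return min(encode(atoms, bonds, i) for i in range(len(atoms)))


# The three spellings of ethanol collapse to one string;
# dimethyl ether (COC) does not.
print(canonical("CCO") == canonical("OCC") == canonical("C(O)C"))
print(canonical("CCO") == canonical("COC"))
```

As in the text, the result depends on the procedure chosen: a different tie-breaking rule would yield a different, equally valid, canonical string.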
== Graph-based definition ==
In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a [[depth-first search|depth-first]] [[tree traversal]] of a [[chemical graph]]. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a [[spanning tree (mathematics)|spanning tree]]. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.
* of the starting atom used for the depth-first traversal, and
* of the order in which branches are listed when encountered.
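The traversal described in this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a reference implementation: the molecule is given as a hydrogen-suppressed graph with single bonds only, and the function name <code>write_smiles</code> and its parameters are hypothetical.

```python
def write_smiles(atoms, bonds, start=0):
    """Write a SMILES string by depth-first traversal.

    atoms: list of element symbols (hydrogens already removed)
    bonds: list of (i, j) index pairs, all treated as single bonds
    start: index of the atom where the traversal begins
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    children = {i: [] for i in neighbors}      # spanning-tree edges
    ring_labels = {i: [] for i in neighbors}   # labels for broken cycles
    seen_atoms, seen_closures = set(), set()

    def explore(node, parent):
        seen_atoms.add(node)
        for nb in neighbors[node]:
            if nb == parent:
                continue
            if nb not in seen_atoms:
                children[node].append(nb)
                explore(nb, node)
            elif frozenset((node, nb)) not in seen_closures:
                # This bond closes a cycle: break it and label both ends.
                n = len(seen_closures) + 1
                label = str(n) if n < 10 else "%" + str(n)
                seen_closures.add(frozenset((node, nb)))
                ring_labels[node].append(label)
                ring_labels[nb].append(label)

    def emit(node):
        text = atoms[node] + "".join(ring_labels[node])
        kids = children[node]
        for kid in kids[:-1]:                  # all but the last branch
            text += "(" + emit(kid) + ")"
        if kids:
            text += emit(kids[-1])             # last branch needs no parentheses
        return text

    explore(start, None)
    return emit(start)


ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
print(write_smiles(*ethanol, start=0))  # CCO
print(write_smiles(*ethanol, start=2))  # OCC
print(write_smiles(*ethanol, start=1))  # C(C)O
ring = (["C"] * 6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])
print(write_smiles(*ring))              # C1CCCCC1
```

The three outputs for ethanol show how the choice of starting atom and the branch order change the string without changing the molecule.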
==SMILES definition as strings of a context-free language==
From the viewpoint of formal language theory, a SMILES string is a word of a context-free language and can be parsed with a context-free parser. This representation has been used to predict biochemical properties (including toxicity and [[biodegradability]]), based on the main principle of chemoinformatics that similar molecules have similar properties. The predictive models implemented a syntactic pattern recognition approach (which involved defining a molecular distance)<ref>{{cite journal | vauthors = Sidorova J, Anisimova M | title = NLP-inspired structural pattern recognition in chemical application. | journal = Pattern Recognition Letters | date = August 2014 | volume = 45 | pages = 11–16 | doi = 10.1016/j.patrec.2014.02.012 | bibcode = 2014PaReL..45...11S }}</ref> as well as a more robust scheme based on statistical pattern recognition.<ref>{{cite journal | vauthors = Sidorova J, Garcia J | title = Bridging from syntactic to statistical methods: Classification with automatically segmented features from sequences. | journal = Pattern Recognition | date = November 2015 | volume = 48 | issue = 11 | pages = 3749–3756 | doi = 10.1016/j.patcog.2015.05.001 | bibcode = 2015PatRe..48.3749S | hdl = 10016/33552 | hdl-access = free }}</ref>
== Description ==
=== Atoms ===
[[Atom]]s are represented by the standard abbreviation of the [[chemical element]]s, in square brackets, such as <code>[Au]</code> for [[gold]]. Brackets may be omitted in the common case of atoms which:
# are in the "[[CHON|organic subset]]" of [[boron|B]], [[carbon|C]], [[nitrogen|N]], [[oxygen|O]], [[phosphorus|P]], [[sulfur|S]], [[fluorine|F]], [[chlorine|Cl]], [[bromine|Br]], or [[iodine|I]], and
# have no [[formal charge]], and
# have the number of hydrogens attached implied by the SMILES valence model (typically their normal valence, but for N and P it is 3 or 5, and for S it is 2, 4 or 6), and
# are the normal [[isotope]]s, and
# are not [[Stereocenter|chiral centers]].
All other elements must be enclosed in brackets, and have charges and hydrogens shown explicitly. For instance, the SMILES for [[water (molecule)|water]] may be written as either <code>O</code> or <code>[OH2]</code>. Hydrogen may also be written as a separate atom; water may also be written as <code>[H]O[H]</code>.
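The implied hydrogen count for unbracketed atoms can be sketched in Python. The sketch assumes the commonly documented rule that an atom takes the lowest of its normal valences that is at least the sum of its explicit bond orders; aromatic atoms are not covered, and the names <code>NORMAL_VALENCES</code> and <code>implicit_hydrogens</code> are hypothetical.

```python
# Normal valences of the organic subset, as listed in the text above.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol, bond_order_sum):
    """Hydrogens implied for an unbracketed, non-aromatic atom.

    Assumption: the lowest normal valence not smaller than the sum of
    explicit bond orders is used; above the highest valence, no
    hydrogens are implied.
    """
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0


print(implicit_hydrogens("O", 0))  # 2, so "O" is water
print(implicit_hydrogens("C", 1))  # 3, e.g. either carbon in "CC"
print(implicit_hydrogens("N", 4))  # 1, the valence of 5 applies
```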
Bonds between [[Aliphatic compound|aliphatic]] atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES string. Although single bonds may be written as <code>-</code>, this is usually omitted. For example, the SMILES for [[ethanol]] may be written as <code>C-C-O</code>, <code>CC-O</code> or <code>C-CO</code>, but is usually written <code>CCO</code>.
Double, triple, and quadruple [[chemical bond|bonds]] are represented by the symbols <code>=</code>, <code>#</code>, and <code>$</code> respectively as illustrated by the SMILES <code>O=C=O</code> ([[carbon dioxide]] {{CO2}}), <code>C#N</code> ([[hydrogen cyanide]] HCN) and <code>[Ga
An additional type of bond is a "non-bond", indicated with <code>.</code>, to indicate that two parts are not bonded together. For example, aqueous [[sodium chloride]] may be written as <code>[Na+].[Cl-]</code> to show the dissociation.
An aromatic "one and a half" bond may be indicated with <code>:</code>; see {{
Single bonds adjacent to double bonds may be represented using <code>/</code> or <code>\</code> to indicate stereochemical configuration; see {{
===Rings===
=== Aromaticity ===
[[Aromaticity|Aromatic]] rings such as [[benzene]] may be written in one of three forms:
# In [[
# Using the aromatic bond symbol <code>:</code>, e.g. <code>C:1:C:C:C:C:C1</code>,{{Citation needed|date=June 2025|reason=Not mentioned in www.daylight.com/dayhtml/doc/theory/theory.smiles.html, probably SMARTS related.}} or
# Most commonly, by writing the constituent B, C, N, O, P and S atoms in lower-case forms <code>b</code>, <code>c</code>, <code>n</code>, <code>o</code>, <code>p</code> and <code>s</code>, respectively.
The Daylight and OpenEye algorithms for generating canonical SMILES differ in their treatment of aromaticity.
[[Image:3-cyanoanisole SMILES.svg|right|thumb|class=skin-invert-image|350px|Visualization of 3-cyanoanisole as <code>COc(c1)cccc1C#N</code>.]]
=== Branching ===
Branches are described with parentheses, as in <code>CCC(=O)O</code> for [[propionic acid]] and <code>FC(F)F</code> for [[fluoroform]]. The first atom within the parentheses, and the first atom after the parenthesized group, are both bonded to the same branch point atom. The bond symbol
Substituted rings can be written with the branching point in the ring as illustrated by the SMILES <code>COc(c1)cccc1C#N</code> ([https://web.archive.org/web/20130522091354/http://www.daylight.com/daycgi/depict?434f6328633129636363633143234e see depiction]) and <code>COc(cc1)ccc1C#N</code> ([https://web.archive.org/web/20130522074308/http://www.daylight.com/daycgi/depict?434f6328636331296363633143234e see depiction]) which encode the 3 and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.
Branches may be written in any order. For example, [[bromochlorodifluoromethane]] may be written as <code>FC(Br)(Cl)F</code>, <code>BrC(F)(F)Cl</code>, <code>C(F)(Cl)(F)Br</code>, or the like. Generally, a SMILES form is easiest to read if the simpler branch comes first, with the final, unparenthesized portion being the most complex. The only caveats to such rearrangements are:
* If ring numbers are reused, they are paired according to their order of appearance in the SMILES string. Some adjustments may be required to preserve the correct pairing.
* If stereochemistry is specified, adjustments must be made; see {{
The one form of branch that does ''not'' require parentheses is the ring-closing bond: the SMILES fragment <code>C1N</code> is equivalent to <code>C(1)N</code>, both denoting a bond between the <code>C</code> and the <code>N</code>. Choosing ring-closing bonds
=== Stereochemistry ===
{{See also|Skeletal formula}}[[File:Trans-1,2-difluoroethylene.svg|thumb|right|class=skin-invert-image|upright=0.5|''trans''-1,2-difluoroethylene]]
<!--[[File:Cis-1,2-difluoroethylene.svg|thumb|right|class=skin-invert-image|upright=0.5|''cis''-1,2-difluoroethylene]]-->
SMILES permits, but does not require, specification of [[stereoisomer]]s.
Bond direction symbols always come in groups of at least two, of which the first is arbitrary. That is, <code>F\C=C\F</code> is the same as <code>F/C=C/F</code>. When alternating single-double bonds are present, the groups are larger than two, with the middle directional symbols being adjacent to two double bonds. For example, the common form of (2,4)-hexadiene is written <code>C/C=C/C=C/C</code>.
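The statement that the first direction symbol is arbitrary can be checked with a small Python sketch. It handles only the simplest pattern, one substituent on each side of a single <code>C=C</code> bond, and relies on the usual reading that matching symbols denote the ''trans'' arrangement and differing symbols the ''cis'' one. The function names <code>flip</code> and <code>classify</code> are hypothetical.

```python
import re

# Matches only the simplest case: X/C=C/Y with "/" or "\" on each side.
PATTERN = re.compile(r"^([A-Za-z]+)([/\\])C=C([/\\])([A-Za-z]+)$")


def flip(smiles):
    """Swap every '/' with '\\' and vice versa."""
    return smiles.translate(str.maketrans("/\\", "\\/"))


def classify(smiles):
    """Return 'trans' if both direction symbols match, else 'cis'."""
    match = PATTERN.match(smiles)
    if match is None:
        raise ValueError("unsupported pattern: " + smiles)
    return "trans" if match.group(2) == match.group(3) else "cis"


for text in ("F/C=C/F", "F\\C=C\\F", "F/C=C\\F"):
    # Flipping all symbols never changes the classification.
    print(text, classify(text), classify(flip(text)))
```

Because only the relative orientation of the two symbols matters, flipping every symbol in a group leaves the encoded geometry unchanged.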
[[File:Beta-Carotene_conjugation.svg|thumb|right|
As a more complex example, [[beta-carotene]] has a very long backbone of alternating single and double bonds, which may be written <code>CC1CCC/C(C)=C1/C=C/C(C)=C/C=C/C(C)=C/C=C/C=C(C)/C=C/C=C(C)/C=C/C2=C(C)/CCCC2(C)C</code>.
For example, consider the [[amino acid]] [[alanine]]. One of its SMILES forms is <code>NC(C)C(=O)O</code>, more fully written as <code>N[CH](C)C(=O)O</code>. [[L-alanine|<small>L</small>-Alanine]], the more common [[enantiomer]], is written as <code>N[C@@H](C)C(=O)O</code> ([https://web.archive.org/web/20130704043108/http://www.daylight.com/daycgi/depict?4e5b434040485d28432943283d4f294f see depiction]). Looking from the nitrogen–carbon bond, the hydrogen (<code>H</code>), methyl (<code>C</code>), and carboxylate (<code>C(=O)O</code>) groups appear clockwise. <small>D</small>-Alanine can be written as <code>N[C@H](C)C(=O)O</code> ([https://web.archive.org/web/20130522072012/http://www.daylight.com/daycgi/depict?4e5b4340485d28432943283d4f294f see depiction]).
While the order
Normally, the first of the four bonds appears to the left of the carbon atom. If the SMILES is written beginning with the chiral carbon, such as <code>C(C)(N)C(=O)O</code>, then all four are to the right, and the first to appear (the <code>[CH]</code> bond in this case) is used as the reference to order the following three: <small>L</small>-alanine may also be written <code>[C@@H](C)(N)C(=O)O</code>.
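Whether <code>@</code> and <code>@@</code> must be exchanged when the neighbors of a chiral atom are listed in a different order can be decided from the parity of the reordering: an even permutation keeps the symbol and an odd one swaps it. The following Python sketch illustrates this with the alanine example. The neighbor labels and the function names are hypothetical, and the implicit hydrogen is treated as a neighbor in the position described in the text.

```python
def permutation_is_even(original, reordered):
    """True if `reordered` is an even permutation of `original`."""
    positions = [original.index(item) for item in reordered]
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    return inversions % 2 == 0


def chirality_after_reordering(symbol, original, reordered):
    """Keep '@'/'@@' for an even reordering, swap it for an odd one."""
    if permutation_is_even(original, reordered):
        return symbol
    return "@" if symbol == "@@" else "@@"


# Neighbor order in N[C@@H](C)C(=O)O: N, H, methyl, carboxyl.
written = ["N", "H", "CH3", "COOH"]
# Neighbor order in [C@@H](C)(N)C(=O)O: H, methyl, N, carboxyl.
chiral_first = ["H", "CH3", "N", "COOH"]
print(chirality_after_reordering("@@", written, chiral_first))  # @@
# Exchanging only the last two neighbors is an odd permutation.
print(chirality_after_reordering("@@", written, ["N", "H", "COOH", "CH3"]))  # @
```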
===Isotopes===
[[Isotopes]] are specified with a number equal to the integer isotopic mass preceding the atomic symbol. [[Benzene]] in which one atom is [[carbon-14]] is written as <code>[
=== Examples ===
{|class=wikitable
|-
!Molecule||Structure||SMILES
|-----
|[[Dinitrogen]]
|-----
|[[Methyl isocyanate]] (MIC)
|[[File:Methyl isocyanate.svg|frameless|120px|class=skin-invert-image]]
|<code>CN=C=O</code>
|-----
|-----
|[[Vanillin]]
|[[Image:Vanillin.svg|class=skin-invert-image|70px|
|<code>O=Cc1ccc(O)c(OC)c1</code><br/>
|-----
|[[Melatonin]] (C<sub>13</sub>H<sub>16</sub>N<sub>2</sub>O<sub>2</sub>)
|[[Image:Melatonin2.svg|class=skin-invert-image|160px|
|<code>CC(=O)NCCC1=CNc2c1cc(OC)cc2</code><br/><code>CC(=O)NCCc1c[nH]c2ccc(OC)cc12</code>
|-----
|[[Flavopereirin]] (C<sub>17</sub>H<sub>15</sub>N<sub>2</sub>)
|[[Image:Flavopereirine.svg|class=skin-invert-image|160px|
|<code>CCc(c1)ccc2[n+]1ccc3c2[nH]c4c3cccc4</code><br/><code>CCc1c[n+]2ccc3c4ccccc4[nH]c3c2cc1</code>
|-----
|[[Nicotine]] (C<sub>10</sub>H<sub>14</sub>N<sub>2</sub>)
|[[Image:Nicotine.svg|class=skin-invert-image|80px|
|<code>CN1CCC[C@H]1c2cccnc2</code>
|-----
|[[Oenanthotoxin]] (C<sub>17</sub>H<sub>22</sub>O<sub>2</sub>)
|[[Image:Oenanthotoxin-structure.png|class=skin-invert-image|180px|
|<code>CCC[C@@H](O)CC\C=C\C=C\C#CC#C\C=C\CO</code><br/><code>CCC[C@@H](O)CC/C=C/C=C/C#CC#C/C=C/CO</code>
|-----
|[[Pyrethrin]] II (C<sub>22</sub>H<sub>28</sub>O<sub>5</sub>)
|[[Image:Pyrethrin-II-2D-skeletal.svg|class=skin-invert-image|180px|
|<code>CC1=C(C(=O)C[C@@H]1OC(=O)[C@@H]2[C@H](C2(C)C)/C=C(\C)/C(=O)OC)C/C=C\C=C</code>
|-----
|[[Aflatoxin]] B<sub>1</sub> (C<sub>17</sub>H<sub>12</sub>O<sub>6</sub>)
|[[Image:Aflatoxin B1.svg|class=skin-invert-image|130px|
|<code>O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5</code>
|-----
|[[Glucose]] (β-<small>D</small>-glucopyranose) (C<sub>6</sub>H<sub>12</sub>O<sub>6</sub>)
|[[Image:Beta-D-Glucose.svg|class=skin-invert-image|140px|
|<code>OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C
|-----
|[[Bergenin]] (cuscutin, a [[resin]]) (C<sub>14</sub>H<sub>16</sub>O<sub>9</sub>)
|[[Image:Cuscutine.svg|class=skin-invert-image|130px|
|<code>OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H]2[C@@H]1c3c(O)c(OC)c(O)cc3C(=O)O2</code>
|-----
|A [[pheromone]] of the Californian [[scale insect]]
|[[Image:Pheromone cochenille californienne.svg|class=skin-invert-image|180px|
|<code>CC(=O)OCCC(/C)=C\C[C@H](C(C)=C)CCC=C</code>
|-----
|(2''S'',5''R'')-[[Chalcogran]]: a [[pheromone]] of the [[Scolytinae|bark beetle]] ''[[Pityogenes chalcographus]]''<ref>{{cite journal |
|[[Image:2S,5R-chalcogran-skeletal.svg|class=skin-invert-image|130px|''(2<i>S</i>,5<i>R</i>)-2-ethyl-1,6-dioxaspiro[4.4]nonane'']]
|<code>CC[C@H](O1)CC[C@@]12CCCO2</code>
|-----
|[[Thujone|α-Thujone]] (C<sub>10</sub>H<sub>16</sub>O)
|[[Image:Alpha-thujone.svg|class=skin-invert-image|100px|
|<code>CC(C)[C@@]12C[C@@H]1[C@@H](C)C(=O)C2</code>
|-----
|[[Thiamine]] (vitamin B<sub>1</sub>, C<sub>12</sub>H<sub>17</sub>N<sub>4</sub>OS<sup>+</sup>)
|[[Image:Thiamin.svg|class=skin-invert-image|150px|
|<code>OCCc1c(C)[n+](cs1)Cc2cnc(C)nc2N</code>
|}
{{Clear}}
To illustrate a molecule with more than 9 rings, consider [[cephalostatin]]-1,<ref
{{Clear}}
:[[Image:Cephalostatine-1.svg|class=skin-invert-image|360px|
Starting with the left-most methyl group in the figure:
:<code>CC(C)(O1)C[C@@H](O)[C@@]1(O2)[C@@H](C)[C@@H]3CC=C4[C@]3(C2)C(=O)C[C@H]5[C@H]4CC[C@@H](C6)[C@]5(C)Cc(n7)c6nc(C[C@@]89(C))c7C[C@@H]8CC[C@@H]%10[C@@H]9C[C@@H](O)[C@@]%11(C)C%10=C[C@H](O%12)[C@]%11(O)[C@H](C)[C@]%12(O%13)[C@H](O)C[C@@]%13(C)CO</code>
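The string above uses both single-digit ring-closure labels and two-digit labels prefixed with <code>%</code>. A minimal Python sketch for listing the labels in a SMILES string follows. It assumes that digits inside square brackets (isotopes, hydrogen counts, charges) are not ring closures and that a <code>%</code> is always followed by exactly two digits. The function name <code>ring_closure_labels</code> is hypothetical.

```python
def ring_closure_labels(smiles):
    """Return ring-closure labels in order of appearance.

    Assumptions: bracket atoms are skipped entirely, and '%' introduces
    a two-digit label.
    """
    labels = []
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            i = smiles.index("]", i) + 1       # skip the bracket atom
        elif ch == "%":
            labels.append(int(smiles[i + 1:i + 3]))
            i += 3
        elif ch.isdigit():
            labels.append(int(ch))
            i += 1
        else:
            i += 1
    return labels


print(ring_closure_labels("C%10CCCCC%10"))         # [10, 10]
print(ring_closure_labels("C1CC[C@@H]2CCCCC2C1"))  # [1, 2, 2, 1]
print(ring_closure_labels("[13CH4]"))              # []
```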
=== Other examples of SMILES ===
The SMILES notation is described extensively in the SMILES theory manual provided by
== Extensions ==
[[Smiles arbitrary target specification|SMARTS]] is a line notation for specification of substructural patterns in molecules. While it uses many of the same symbols as SMILES, it also allows specification of [[Wildcard character|wildcard]] atoms and bonds, which can be used to define substructural queries for [[chemical database]] searching. One common misconception is that SMARTS-based substructural searching involves matching of SMILES and SMARTS strings. In fact, both SMILES and SMARTS strings are first converted to internal graph representations which are searched for [[Glossary of graph theory#Subgraphs|subgraph]] [[isomorphism]].
{{anchor|SMIRKS}}SMIRKS, a superset of "reaction SMILES" and a subset of "reaction SMARTS", is a line notation for specifying reaction transforms. The general syntax for the reaction extensions is <code>REACTANT>AGENT>PRODUCT</code> (without spaces), where any of the fields can either be left blank or filled with multiple molecules
SMILES corresponds to discrete molecular structures. However, many materials are macromolecules, which are too large (and often stochastic) for SMILES strings to be generated conveniently. BigSMILES is an extension of SMILES that aims to provide an efficient representation system for macromolecules.<ref>{{cite journal | vauthors = Lin TS, Coley CW, Mochigase H, Beech HK, Wang W, Wang Z, Woods E, Craig SL, Johnson JA, Kalow JA, Jensen KF, Olsen BD | display-authors = 6 | title = BigSMILES: A Structurally-Based Line Notation for Describing Macromolecules | journal = ACS Central Science | volume = 5 | issue = 9 | pages = 1523–1531 | date = September 2019 | pmid = 31572779 | pmc = 6764162 | doi = 10.1021/acscentsci.9b00476 }}</ref>
== Conversion ==
SMILES can be converted back to two-dimensional representations using structure diagram generation (SDG) algorithms (Helson, 1999). This conversion is not always unambiguous. Conversion to three-dimensional representation is achieved by energy-minimization approaches. There are many downloadable and web-based conversion utilities.
== See also ==
== References ==
{{Reflist
{{Molecular visualization}}