Simplified Molecular Input Line Entry System: Difference between revisions

Revision as of 05:31, 11 November 2004 edit Keenan Pepper (talk \| contribs) Autopatrolled, Administrators 19,075 edits m →Extensions: grammar ← Previous edit		Latest revision as of 11:29, 3 August 2025 edit undo Graeme Bartlett (talk \| contribs) Administrators 258,941 edits →References: var cols
(671 intermediate revisions by more than 100 users not shown)
{{Short description\|Chemical species structure notation}}
{{Redirect\|SMILES\|other uses\|Smiles (disambiguation)}}
{{Use mdy dates\|date=July 2020}}
{{Infobox file format \| name = SMILES \| extension = .smi \| owner = \| genre = [[chemical file format]] \| container for = \| contained by = \| extended from = \| extended to = }}
[[Image:SMILES.png\|thumb\|class=skin-invert-image\|300px\|SMILES generation algorithm for [[ciprofloxacin]]: break cycles, then write as branches off a main backbone]]
The '''Simplified Molecular Input Line Entry System''' ('''SMILES''') is a specification in the form of a [[line notation]] for describing the structure of [[chemical species]] using short [[ASCII]] [[string (computer science)\|strings]]. SMILES strings can be imported by most [[molecule editor]]s for conversion back into [[two-dimensional]] drawings or [[dimension\|three-dimensional]] models of the molecules.

The original SMILES specification was initiated in the 1980s. It has since been modified and extended. In 2007, an [[open standard]] called OpenSMILES was developed in the [[open source]] chemistry community.
~~==Graph based definition==~~ ==History== In terms of a graph based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree-traversal of a chemical graph. The chemical graph is first trimmed to remove Hydrogen atoms and cycles are broken to make it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Brackets are used to indicate points of branching on the tree. The original SMILES specification was initiated by [[David Weininger]] at the USEPA Mid-Continent Ecology Division Laboratory in [[Duluth, Minnesota\|Duluth]] in the 1980s.<ref name="Weininger-1988">{{cite journal\| vauthors = Weininger D \| title=SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules\| journal=Journal of Chemical Information and Computer Sciences\| volume=28\| issue= 1\|pages=31–6\|date=February 1988\|doi=10.1021/ci00057a005 }}</ref><ref name="Weininger-1989">{{cite journal\| vauthors = Weininger D, Weininger A, Weininger JL \| title=SMILES. 2. Algorithm for generation of unique SMILES notation\| journal=Journal of Chemical Information and Modeling\| volume=29\| issue=2\| pages=97–101\|date=May 1989\|doi=10.1021/ci00062a008 }}</ref><ref name="Weininger-1990">{{cite journal\| vauthors = Weininger D \| title=SMILES. 3. DEPICT. 
Graphical depiction of chemical structures\| journal=Journal of Chemical Information and Modeling\| volume=30\| issue= 3\|pages=237–43\|date=August 1990\|doi=10.1021/ci00067a005 }}</ref><ref name="Swanson-2004">{{cite book \| vauthors = Swanson RP \| veditors = Rayward WB, Bowden ME \|title=The History and Heritage of Scientific and Technological Information Systems: Proceedings of the 2002 Conference of the American Society of Information Science and Technology and the Chemical Heritage Foundation \|date=2004 \|publisher=[[Information Today]] \|___location=Medford, NJ \|isbn=978-1-57387-229-4 \|page=205 \|url=https://books.google.com/books?id=76OOQannpBgC&pg=PA205 \|ref=ASIST monograph series 2002 \|chapter=The Entrance of Informatics into Combinatorial Chemistry \|chapter-url=https://wayback.archive-it.org/2118/20100925010036/http://64.251.202.97/pubs/asist2002/17-swanson.pdf }}</ref> Acknowledged for their parts in the early development were "Gilman Veith and Rose Russo (USEPA) and Albert Leo and [[Corwin Hansch]] ([[Pomona College]]) for supporting the work, and Arthur Weininger (Pomona; Daylight CIS) and Jeremy Scofield (Cedar River Software, Renton, WA) for assistance in programming the system."<ref name="Weininger-1998">{{cite web\| vauthors = Weininger D \|title=Acknowledgements on Daylight Tutorial smiles-etc page\|url=http://www.daylight.com/meetings/summerschool98/course/dave/smiles-etc.html\|access-date=24 June 2013 \|date=1998 }}</ref> The [[United States Environmental Protection Agency\|Environmental Protection Agency]] funded the initial project to develop SMILES.<ref name="Anderson-1987">{{cite book \|year=1987 \|title= SMILES: A line notation and computerized interpreter for chemical structures \|id=Report No. EPA/600/M-87/021 \|publisher=[[United States Environmental Protection Agency\|U.S. 
EPA]], Environmental Research Laboratory-Duluth \|___location=Duluth, MN \|url=https://nepis.epa.gov/Exe/ZyPDF.cgi/2000CAUR.PDF?Dockey=2000CAUR.PDF \| vauthors = Anderson E, Veith GD, Weininger D }}</ref><ref name="SMILES Tutorial: What is SMILES?">{{Cite web\|url=http://www.epa.gov/med/Prods_Pubs/smiles.htm \| archive-url = https://web.archive.org/web/20080328080430/https://www.epa.gov/med/Prods_Pubs/smiles.htm \| archive-date = 28 March 2008 \|title=SMILES Tutorial: What is SMILES? \|publisher=[[United States Environmental Protection Agency\|U.S. EPA]] \|access-date=2012-09-23 }}</ref> It has since been modified and extended by others, most notably by [[Daylight Chemical Information Systems]]. In 2007, an [[open standard]] called "OpenSMILES" was developed by the [[Blue Obelisk]] open-source chemistry community. Other 'linear' notations include the [[Wiswesser Line Notation]] (WLN), [[ROSDAL]] and [[SYBYL Line Notation\|SLN]] (Tripos Inc). ~~==Examples==~~ In July 2006, the [[International Union of Pure and Applied Chemistry\|IUPAC]] introduced the [[International Chemical Identifier\|InChI]] as a standard for formula representation. SMILES is generally considered to have the advantage of being more human-readable than InChI; it also has a wide base of software support with extensive theoretical backing (such as [[graph theory]]). [[Atom]]s are represented by the standard abbreviation of the [[chemical element]]s, in square brackets, such as [Au] for [[gold]]. [[Hydroxide]] [[anion]] is [OH-]. If the brackets are omitted, the proper number of implicit hydrogen atoms is assumed; for instance the SMILES for [[water]] is simply O and that for [[ethanol]] is CCO. The [[chemical bond\|double-bonded]] [[carbon dioxide]] is represented as O=C=O and the triple-bonded [[hydrogen cyanide]] as C#N. [[Cyclohexane]] is represented as C1CCCCC1, the idea being that the two ones label the same position in the molecule, thus forming a ring with six carbons. 
Branches are described with parentheses, as in CCC(=O)O for [[propionic acid]] and FC(F)F, or alternatively C(F)(F)F, for [[fluoroform]]. ==~~Extensions~~ Terminology == The term SMILES refers to a line notation for encoding molecular structures and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings; the exact meaning is usually apparent from the context. The terms "canonical" and "isomeric" can lead to some confusion when applied to SMILES. The terms describe different attributes of SMILES strings and are not mutually exclusive. Typically, a number of equally valid SMILES strings can be written for a molecule. For example, <code>CCO</code>, <code>OCC</code> and <code>C(O)C</code> all specify the structure of [[ethanol]]. Algorithms have been developed to generate the same SMILES string for a given molecule; of the many possible strings, these algorithms choose only one of them. This SMILES is unique for each structure, although dependent on the [[canonicalization]] algorithm used to generate it, and is termed the canonical SMILES. These algorithms first convert the SMILES to an internal representation of the molecular structure; an algorithm then examines that structure and produces a unique SMILES string. Various algorithms for generating canonical SMILES have been developed and include those by Daylight Chemical Information Systems, [[OpenEye Scientific Software]], [[MEDIT]], [[Chemical Computing Group]], [[MolSoft]] LLC, and the [[Chemistry Development Kit]]. A common application of canonical SMILES is indexing and ensuring uniqueness of molecules in a [[Chemical database\|database]]. SMARTS is a modification of SMILES that allows, in addition to the SMILES elements, the specification of [[wildcard]] atoms and bonds. This is used in specifying search structures and is widely used in [[chemical database]] search applications. 
This practice has led to a common misconception that chemical substructure search is achieved computationally by matching SMILES/SMARTS strings, when in fact it is achieved by the computationally more intensive search for [[subgraph]] [[isomorphism]] in the graphs reconstructed from the SMILES representations. The original paper that described the CANGEN<ref name="Weininger-1989" /> algorithm claimed to generate unique SMILES strings for graphs representing molecules, but the algorithm fails for a number of simple cases (e.g. [[cuneane]], 1,2-dicyclopropylethane) and cannot be considered a correct method for representing a graph canonically.<ref>{{cite book \| vauthors = Neglur G, Grossman RL, Liu B \|publisher=Springer \|location=Berlin \|isbn=978-3-540-27967-9 \|volume=3615 \|pages=145–157 \| veditors = Ludäscher B \| series = Lecture Notes in Computer Science \|title=Data Integration in the Life Sciences \|chapter=Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples \|access-date=2013-02-12 \|year=2005 \|chapter-url=https://doi.org/10.1007%2F11530084_13 \|doi=10.1007/11530084_13 }}</ref> There is currently no systematic comparison across commercial software to test whether such flaws exist in those packages. Since SMILES is generated by tree traversal, the string can vary depending on the root node chosen as well as the order in which nodes are encountered. A unique or 'canonical' form of the SMILES representation can be generated by applying rules to preprocess the tree before tree traversal. A common application of unique SMILES is for exact matching of two structures and also for ensuring uniqueness among molecules in a database. SMILES notation allows the specification of [[molecular configuration\|configuration at tetrahedral centers]], and double bond geometry. 
These are structural features that cannot be specified by connectivity alone, and therefore SMILES which encode this information are termed isomeric SMILES. A notable feature of these rules is that they allow rigorous partial specification of chirality. The term isomeric SMILES is also applied to SMILES in which [[isomer]]s are specified. ~~Important enhancements to SMILES include extensions to store information on [[stereochemistry]].~~

== Graph-based definition ==

In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a [[depth-first search\|depth-first]] [[tree traversal]] of a [[chemical graph]]. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a [[spanning tree (mathematics)\|spanning tree]]. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.

The resultant SMILES form depends on the choices:
* of the bonds chosen to break cycles,
* of the starting atom used for the depth-first traversal, and
* of the order in which branches are listed when encountered.

~~==External links==~~
* SMILES tutorial, http://www.daylight.com/dayhtml/smiles/smiles-intro.html
* Web-based applications capable of converting SMILES strings to 2D structure images
** http://www.daylight.com/daycgi/depict Simple, no customization
** http://cactus.nci.nih.gov/services/gifcreator/ More complicated, superior image quality and many customization options
* Molecule editor applet that can create SMILES, http://www.molinspiration.com/jme/index.html
* SMILES parsing, http://www.dalkescientific.com/writings/diary/archive/

==SMILES definition as strings of a context-free language==

~~[[Category:Chemistry]]~~ From the viewpoint of formal language theory, SMILES is a word. A SMILES is parsable with a context-free parser. 
The use of this representation has been in the prediction of biochemical properties (incl. toxicity and [[biodegradability]]) based on the main principle of chemoinformatics that similar molecules have similar properties. The predictive models implemented a syntactic pattern recognition approach (which involved defining a molecular distance)<ref>{{cite journal \| vauthors = Sidorova J, Anisimova M \| title = NLP-inspired structural pattern recognition in chemical application. \| journal = Pattern Recognition Letters \| date = August 2014 \| volume = 45 \| pages = 11–16 \| doi = 10.1016/j.patrec.2014.02.012 \| bibcode = 2014PaReL..45...11S }}</ref> as well as a more robust scheme based on statistical pattern recognition.<ref>{{cite journal \| vauthors = Sidorova J, Garcia J \| title = Bridging from syntactic to statistical methods: Classification with automatically segmented features from sequences. \| journal = Pattern Recognition \| date = November 2015 \| volume = 48 \| issue = 11 \| pages = 3749–3756 \| doi = 10.1016/j.patcog.2015.05.001 \| bibcode = 2015PatRe..48.3749S \| hdl = 10016/33552 \| hdl-access = free }}</ref> ~~[[de:SMILES]]~~ == Description == === Atoms === [[Atom]]s are represented by the standard abbreviation of the [[chemical element]]s, in square brackets, such as <code>[Au]</code> for [[gold]]. Brackets may be omitted in the common case of atoms which: # are in the "[[CHON\|organic subset]]" of [[boron\|B]], [[carbon\|C]], [[nitrogen\|N]], [[oxygen\|O]], [[phosphorus\|P]], [[sulfur\|S]], [[fluorine\|F]], [[chlorine\|Cl]], [[bromine\|Br]], or [[iodine\|I]], and # have no [[formal charge]], and # have the number of hydrogens attached implied by the SMILES valence model (typically their normal valence, but for N and P it is 3 or 5, and for S it is 2, 4 or 6), and # are the normal [[isotope]]s, and # are not [[Stereocenter\|chiral centers]]. All other elements must be enclosed in brackets, and have charges and hydrogens shown explicitly. 
For instance, the SMILES for [[water (molecule)\|water]] may be written as either <code>O</code> or <code>[OH2]</code>. Hydrogen may also be written as a separate atom; water may also be written as <code>[H]O[H]</code>. When brackets are used, the symbol <code>H</code> is added if the atom in brackets is bonded to one or more hydrogen, followed by the number of hydrogen atoms if greater than 1, then by the sign <code>+</code> for a positive charge or by <code>-</code> for a negative charge. For example, <code>[NH4+]</code> for [[ammonium]] ({{chem\|NH\|4\|+}}). If there is more than one charge, it is normally written as digit; however, it is also possible to repeat the sign as many times as the ion has charges: one may write either <code>[Ti+4]</code> or <code>[Ti++++]</code> for [[titanium]](IV) Ti<sup>4+</sup>. Thus, the [[hydroxide]] [[anion]] ({{OH-}}) is represented by <code>[OH-]</code>, the [[hydronium]] cation ({{H3O+}}) is <code>[OH3+]</code> and the [[cobalt]](III) [[cation]] (Co<sup>3+</sup>) is either <code>[Co+3]</code> or <code>[Co+++]</code>. ===Bonds=== A bond is represented using one of the symbols <code>. - = # $ : / \</code>. Bonds between [[Aliphatic compound\|aliphatic]] atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES string. Although single bonds may be written as <code>-</code>, this is usually omitted. For example, the SMILES for [[ethanol]] may be written as <code>C-C-O</code>, <code>CC-O</code> or <code>C-CO</code>, but is usually written <code>CCO</code>. Double, triple, and quadruple [[chemical bond\|bonds]] are represented by the symbols <code>=</code>, <code>#</code>, and <code>$</code> respectively as illustrated by the SMILES <code>O=C=O</code> ([[carbon dioxide]] {{CO2}}), <code>C#N</code> ([[hydrogen cyanide]] HCN) and <code>[Ga+]$[As-]</code> ([[gallium arsenide]]). 
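
The bracket-atom conventions described under Atoms above (element symbol, optional <code>H</code> with a count, optional charge written either as a sign with a digit or as repeated signs) can be illustrated with a short Python sketch. This is only a minimal illustration, not a reference parser: the names <code>BRACKET_ATOM</code> and <code>parse_bracket_atom</code> are hypothetical, and isotopes, chirality marks and aromatic symbols are deliberately left out.

```python
import re

# Hypothetical helper for illustration only: it covers the element symbol,
# the hydrogen count and the charge, and nothing else.
BRACKET_ATOM = re.compile(
    r"^\[(?P<symbol>[A-Z][a-z]?)"       # element symbol, e.g. N, Ti, Co
    r"(?P<hydrogens>H\d*)?"             # optional H, optionally followed by a count
    r"(?P<charge>[+-]\d+|\++|-+)?\]$"   # "+4" / "-2" style, or repeated signs
)


def parse_bracket_atom(text):
    """Return (symbol, hydrogen_count, charge) for a bracket atom such as "[NH4+]"."""
    match = BRACKET_ATOM.match(text)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {text!r}")

    hydrogens = match.group("hydrogens")
    # A bare "H" means one hydrogen; "H4" means four.
    hydrogen_count = 0 if hydrogens is None else int(hydrogens[1:] or 1)

    charge_text = match.group("charge")
    if charge_text is None:
        charge = 0
    elif charge_text[1:].isdigit():
        charge = int(charge_text)            # sign followed by a digit: "+4" -> 4
    else:
        sign = 1 if charge_text[0] == "+" else -1
        charge = sign * len(charge_text)     # repeated signs: "+++" -> 3

    return match.group("symbol"), hydrogen_count, charge


print(parse_bracket_atom("[NH4+]"))    # ('N', 4, 1)
print(parse_bracket_atom("[Ti++++]"))  # ('Ti', 0, 4)
print(parse_bracket_atom("[OH-]"))     # ('O', 1, -1)
```

Under these assumptions the two spellings of a multiple charge, such as <code>[Co+3]</code> and <code>[Co+++]</code>, parse to the same tuple.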
An additional type of bond is a "non-bond", indicated with <code>.</code>, to indicate that two parts are not bonded together. For example, aqueous [[sodium chloride]] may be written as <code>[Na+].[Cl-]</code> to show the dissociation. An aromatic "one and a half" bond may be indicated with <code>:</code>; see {{section link\|\|Aromaticity}} below. Single bonds adjacent to double bonds may be represented using <code>/</code> or <code>\</code> to indicate stereochemical configuration; see {{section link\|\|Stereochemistry}} below. ===Rings=== Ring structures are written by breaking each ring at an arbitrary point (although some choices will lead to a more legible SMILES than others) to make an [[Open-chain compound\|acyclic]] structure and adding numerical ring closure labels to show connectivity between non-adjacent atoms. For example, [[cyclohexane]] and [[dioxane]] may be written as <code>C1CCCCC1</code> and <code>O1CCOCC1</code> respectively. For a second ring, the label will be 2. For example, [[decalin]] (decahydronaphthalene) may be written as <code>C1CCCC2C1CCCC2</code>. SMILES does not require that ring numbers be used in any particular order, and permits ring number zero, although this is rarely used. Also, it is permitted to reuse ring numbers after the first ring has closed, although this usually makes formulae harder to read. For example, [[bicyclohexyl]] is usually written as <code>C1CCCCC1C2CCCCC2</code>, but it may also be written as <code>C0CCCCC0C0CCCCC0</code>. Multiple digits after a single atom indicate multiple ring-closing bonds. For example, an alternative SMILES notation for decalin is <code>C1CCCC2CCCCC12</code>, where the final carbon participates in both ring-closing bonds 1 and 2. If two-digit ring numbers are required, the label is preceded by <code>%</code>, so <code>C%12</code> is a single ring-closing bond of ring 12. Either or both of the digits may be preceded by a bond type to indicate the type of the ring-closing bond. 
For example, [[cyclopropene]] is usually written <code>C1=CC1</code>, but if the double bond is chosen as the ring-closing bond, it may be written as <code>C=1CC1</code>, <code>C1CC=1</code>, or <code>C=1CC=1</code>. (The first form is preferred.) <code>C=1CC-1</code> is illegal, as it explicitly specifies conflicting types for the ring-closing bond. Ring-closing bonds may not be used to denote multiple bonds. For example, <code>C1C1</code> is not a valid alternative to <code>C=C</code> for [[ethylene]]. However, they may be used with non-bonds; <code>C1.C2.C12</code> is a peculiar but legal alternative way to write [[propane]], more commonly written <code>CCC</code>. Choosing a ring-break point adjacent to attached groups can lead to a simpler SMILES form by avoiding branches. For example, [[cyclohexane-1,2-diol]] is most simply written as <code>OC1CCCCC1O</code>; choosing a different ring-break location produces a branched structure that requires parentheses to write.

=== Aromaticity ===

[[Aromaticity\|Aromatic]] rings such as [[benzene]] may be written in one of three forms:
# In [[August Kekulé\|Kekulé]] form with alternating single and double bonds, e.g. <code>C1=CC=CC=C1</code>,
# Using the aromatic bond symbol <code>:</code>, e.g. <code>C:1:C:C:C:C:C1</code>,{{Citation needed\|date=June 2025\|reason=Not mentioned in www.daylight.com/dayhtml/doc/theory/theory.smiles.html, probably SMARTS related.}} or
# Most commonly, by writing the constituent B, C, N, O, P and S atoms in lower-case forms <code>b</code>, <code>c</code>, <code>n</code>, <code>o</code>, <code>p</code> and <code>s</code>, respectively.
In the last case, bonds between two aromatic atoms are assumed (if not explicitly shown) to be aromatic bonds. Thus, [[benzene]], [[pyridine]] and [[furan]] can be represented respectively by the SMILES <code>c1ccccc1</code>, <code>n1ccccc1</code> and <code>o1cccc1</code>. 
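
The ring-closure labelling rules described under Rings above (single-digit labels, <code>%</code>-prefixed two-digit labels, reuse of a label once its ring has closed, and an optional bond symbol on either label) can be sketched in Python. This is a minimal illustration under stated assumptions, not a full SMILES parser: <code>TOKEN</code> and <code>ring_closures</code> are hypothetical names, atoms are simply numbered in order of appearance, and because no molecular graph is built the sketch does not detect cases such as <code>C1C1</code>.

```python
import re

# Hypothetical tokenizer: bracket atoms, organic-subset atoms (upper- and
# lower-case), bond symbols, ring-closure labels, and parentheses or dots.
TOKEN = re.compile(
    r"(?P<atom>\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops])"
    r"|(?P<bond>[-=#$:/\\])"
    r"|(?P<ring>%\d\d|\d)"
    r"|(?P<other>[().])"
)


def ring_closures(smiles):
    """Return a list of (first_atom, second_atom, bond_symbol_or_None)."""
    atom = -1             # index of the most recently read atom
    pending_bond = None   # bond symbol written just before the next token
    open_labels = {}      # ring label -> (atom index, bond symbol or None)
    closures = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at {pos}")
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind == "atom":
            atom += 1
            pending_bond = None
        elif kind == "bond":
            pending_bond = text
        elif kind == "ring":
            if atom < 0:
                raise ValueError("ring label before any atom")
            label = int(text.lstrip("%"))
            if label in open_labels:
                # Second occurrence closes the ring and frees the label.
                start, start_bond = open_labels.pop(label)
                if start_bond and pending_bond and start_bond != pending_bond:
                    raise ValueError(f"conflicting bond types on ring {label}")
                closures.append((start, atom, start_bond or pending_bond))
            else:
                open_labels[label] = (atom, pending_bond)
            pending_bond = None
        else:
            pending_bond = None   # parentheses and dots carry no ring information
    if open_labels:
        raise ValueError(f"unclosed ring labels: {sorted(open_labels)}")
    return closures


print(ring_closures("C1CCCC2C1CCCC2"))  # [(0, 5, None), (4, 9, None)]
print(ring_closures("C1CCCC2CCCCC12"))  # [(0, 9, None), (4, 9, None)]
print(ring_closures("C=1CC1"))          # [(0, 2, '=')]
```

Both decalin spellings yield two ring-closing bonds, and the three cyclopropene variants <code>C=1CC1</code>, <code>C1CC=1</code> and <code>C=1CC=1</code> give the same result, while <code>C=1CC-1</code> raises an error in this sketch.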
Aromatic nitrogen bonded to hydrogen, as found in [[pyrrole]] must be represented as <code>[nH]</code>; thus [[imidazole]] is written in SMILES notation as <code>n1c[nH]cc1</code>. When aromatic atoms are singly bonded to each other, such as in [[biphenyl]], a single bond must be shown explicitly: <code>c1ccccc1-c2ccccc2</code>. This is one of the few cases where the single bond symbol <code>-</code> is required. (In fact, most SMILES software can correctly infer that the bond between the two rings cannot be aromatic and so will accept the nonstandard form <code>c1ccccc1c2ccccc2</code>.) The Daylight and OpenEye algorithms for generating canonical SMILES differ in their treatment of aromaticity. [[Image:3-cyanoanisole SMILES.svg\|right\|thumb\|class=skin-invert-image\|350px\|Visualization of 3-cyanoanisole as <code>COc(c1)cccc1C#N</code>.]] === Branching === Branches are described with parentheses, as in <code>CCC(=O)O</code> for [[propionic acid]] and <code>FC(F)F</code> for [[fluoroform]]. The first atom within the parentheses, and the first atom after the parenthesized group, are both bonded to the same branch point atom. The bond symbol must appear inside the parentheses; outside (e.g. <code>CCC=(O)O</code>) is invalid. Substituted rings can be written with the branching point in the ring as illustrated by the SMILES <code>COc(c1)cccc1C#N</code> ([https://web.archive.org/web/20130522091354/http://www.daylight.com/daycgi/depict?434f6328633129636363633143234e see depiction]) and <code>COc(cc1)ccc1C#N</code> ([https://web.archive.org/web/20130522074308/http://www.daylight.com/daycgi/depict?434f6328636331296363633143234e see depiction]) which encode the 3 and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable. Branches may be written in any order. For example, [[bromochlorodifluoromethane]] may be written as <code>FC(Br)(Cl)F</code>, <code>BrC(F)(F)Cl</code>, <code>C(F)(Cl)(F)Br</code>, or the like. 
Generally, a SMILES form is easiest to read if the simpler branch comes first, with the final, unparenthesized portion being the most complex. The only caveats to such rearrangements are: * If ring numbers are reused, they are paired according to their order of appearance in the SMILES string. Some adjustments may be required to preserve the correct pairing. * If stereochemistry is specified, adjustments must be made; see {{section link\|\|Stereochemistry}} below. The one form of branch which does ''not'' require parentheses are ring-closing bonds: the SMILES fragment <code>C1N</code> is equivalent to <code>C(1)N</code>, both denoting a bond between the <code>C</code> and the <code>N</code>. Choosing ring-closing bonds adjacent to branch points can reduce the number of parentheses required. For example, [[toluene]] is normally written as <code>Cc1ccccc1</code> or <code>c1ccccc1C</code>, avoiding the parentheses required if written as <code>c1cc(C)ccc1</code> or <code>c1cc(ccc1)C</code>. === Stereochemistry === {{See also\|Skeletal formula}}[[File:Trans-1,2-difluoroethylene.svg\|thumb\|right\|class=skin-invert-image\|upright=0.5\|''trans''-1,2-difluoroethylene]] <!--[[File:Cis-1,2-difluoroethylene.svg\|thumb\|right\|class=skin-invert-image\|upright=0.5\|''cis''-1,2-difluoroethylene]]--> SMILES permits, but does not require, specification of [[stereoisomer]]s. Configuration around double bonds is specified using the characters <code>/</code> and <code>\</code> to show directional single bonds adjacent to a double bond. 
For example, <code>F/C=C/F</code> ([https://web.archive.org/web/20130522072357/http://www.daylight.com/daycgi/depict?462f433d432f46 see depiction]) is one representation of ''[[trans isomer\|trans]]''-[[1,2-difluoroethylene]], in which the fluorine atoms are on opposite sides of the double bond (as shown in the figure), whereas <code>F/C=C\F</code> ([https://web.archive.org/web/20130522074206/http://www.daylight.com/daycgi/depict?462f433d435c46 see depiction]) is one possible representation of ''[[Cis-trans isomerism\|cis]]''-1,2-difluoroethylene, in which the fluorines are on the same side of the double bond. Bond direction symbols always come in groups of at least two, of which the first is arbitrary. That is, <code>F\C=C\F</code> is the same as <code>F/C=C/F</code>. When alternating single-double bonds are present, the groups are larger than two, with the middle directional symbols being adjacent to two double bonds. For example, the common form of (2,4)-hexadiene is written <code>C/C=C/C=C/C</code>. [[File:Beta-Carotene_conjugation.svg\|thumb\|right\|class=skin-invert-image\|upright=0.866\|[[Beta-carotene]], with the eleven double bonds highlighted.]] As a more complex example, [[beta-carotene]] has a very long backbone of alternating single and double bonds, which may be written <code>CC1CCC/C(C)=C1/C=C/C(C)=C/C=C/C(C)=C/C=C/C=C(C)/C=C/C=C(C)/C=C/C2=C(C)/CCCC2(C)C</code>. Configuration at [[Stereocenter\|tetrahedral carbon]] is specified by <code>@</code> or <code>@@</code>. Consider the four bonds in the order in which they appear, left to right, in the SMILES form. Looking toward the central carbon from the perspective of the first bond, the other three are either clockwise or counter-clockwise. These cases are indicated with <code>@@</code> and <code>@</code>, respectively (because the <code>@</code> symbol itself is a counter-clockwise spiral). 
[[File:L-Alanin - L-Alanine.svg\|thumb\|right\|upright=0.5\|<small>L</small>-Alanine]] For example, consider the [[amino acid]] [[alanine]]. One of its SMILES forms is <code>NC(C)C(=O)O</code>, more fully written as <code>N[CH](C)C(=O)O</code>. [[L-alanine\|<small>L</small>-Alanine]], the more common [[enantiomer]], is written as <code>N[C@@H](C)C(=O)O</code> ([https://web.archive.org/web/20130704043108/http://www.daylight.com/daycgi/depict?4e5b434040485d28432943283d4f294f see depiction]). Looking from the nitrogen–carbon bond, the hydrogen (<code>H</code>), methyl (<code>C</code>), and carboxylate (<code>C(=O)O</code>) groups appear clockwise. <small>D</small>-Alanine can be written as <code>N[C@H](C)C(=O)O</code> ([https://web.archive.org/web/20130522072012/http://www.daylight.com/daycgi/depict?4e5b4340485d28432943283d4f294f see depiction]). While the order in which branches are specified in SMILES is normally unimportant, in this case it matters; swapping any two groups requires reversing the chirality indicator. If the branches are reversed so alanine is written as <code>NC(C(=O)O)C</code>, then the configuration also reverses; <small>L</small>-alanine is written as <code>N[C@H](C(=O)O)C</code> ([https://web.archive.org/web/20130522073747/http://www.daylight.com/daycgi/depict?4e5b434040485d2843283d4f294f2943 see depiction]). Other ways of writing it include <code>C[C@H](N)C(=O)O</code>, <code>OC(=O)[C@@H](N)C</code> and <code>OC(=O)[C@H](C)N</code>. Normally, the first of the four bonds appears to the left of the carbon atom, but if the SMILES is written beginning with the chiral carbon, such as <code>C(C)(N)C(=O)O</code>, then all four are to the right, but the first to appear (the <code>[CH]</code> bond in this case) is used as the reference to order the following three: <small>L</small>-alanine may also be written <code>[C@@H](C)(N)C(=O)O</code>. 
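
The statement that swapping any two groups requires reversing the chirality indicator can be read as a parity rule: two spellings describe the same configuration when the neighbor order differs by an even permutation and the indicators match, or by an odd permutation and the indicators differ. The Python sketch below illustrates that reading. It is a minimal sketch, not a SMILES parser: <code>same_configuration</code> is a hypothetical helper, and the neighbor labels are written out by hand in the order they appear in each string, with the bracket hydrogen listed where the description above places it.

```python
def permutation_is_odd(reference, other):
    """Return True if `other` is an odd permutation of `reference`."""
    index = {label: i for i, label in enumerate(reference)}
    perm = [index[label] for label in other]
    # Count inversions; the parity of this count is the parity of the permutation.
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2 == 1


def same_configuration(order_a, tag_a, order_b, tag_b):
    """Compare two tetrahedral centers given neighbor orders and '@'/'@@' tags."""
    if tag_a not in ("@", "@@") or tag_b not in ("@", "@@"):
        raise ValueError("tags must be '@' or '@@'")
    if sorted(order_a) != sorted(order_b) or len(set(order_a)) != 4:
        raise ValueError("both orders must list the same four distinct neighbors")
    # Even permutation: tags must match. Odd permutation: tags must differ.
    return (tag_a == tag_b) != permutation_is_odd(order_a, order_b)


# N[C@@H](C)C(=O)O, the L-alanine spelling used as the reference here
reference = ["N", "H", "CH3", "COOH"]

# N[C@H](C(=O)O)C: two branches swapped, indicator reversed
print(same_configuration(reference, "@@", ["N", "H", "COOH", "CH3"], "@"))   # True
# [C@@H](C)(N)C(=O)O: string starts at the chiral carbon
print(same_configuration(reference, "@@", ["H", "CH3", "N", "COOH"], "@@"))  # True
# N[C@H](C)C(=O)O: same order, opposite indicator (D-alanine)
print(same_configuration(reference, "@@", reference, "@"))                   # False
```

Under this reading, all of the alternative L-alanine spellings listed above compare equal to the reference, and only the D-alanine spelling differs.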
The SMILES specification includes elaborations on the <code>@</code> symbol to indicate stereochemistry around more complex chiral centers, such as [[trigonal bipyramidal molecular geometry]]. ===Isotopes=== [[Isotopes]] are specified with a number equal to the integer isotopic mass preceding the atomic symbol. [[Benzene]] in which one atom is [[carbon-14]] is written as <code>[14cH]1ccccc1</code> and [[deuterochloroform]] is <code>[2H]C(Cl)(Cl)Cl</code>. === Examples === {\|class=wikitable \|- !Molecule\|\|Structure\|\|SMILES formula \|----- \|[[Dinitrogen]] \|N≡N \|<code>N#N</code> \|----- \|[[Methyl isocyanate]] (MIC) \|[[File:Methyl isocyanate.svg\|frameless\|120px\|class=skin-invert-image]] \|<code>CN=C=O</code> \|----- \|[[Copper(II) sulfate]] \|Cu<sup>2+</sup>{{chem\|SO\|4\|2−}} \|<code>[Cu+2].[O-]S(=O)(=O)[O-]</code> \|----- \|[[Vanillin]] \|[[Image:Vanillin.svg\|class=skin-invert-image\|70px\|Molecular structure of vanillin]] \|<code>O=Cc1ccc(O)c(OC)c1</code><br/><code>COc1cc(C=O)ccc1O</code> \|----- \|[[Melatonin]] (C<sub>13</sub>H<sub>16</sub>N<sub>2</sub>O<sub>2</sub>) \|[[Image:Melatonin2.svg\|class=skin-invert-image\|160px\|Molecular structure of melatonin]] \|<code>CC(=O)NCCC1=CNc2c1cc(OC)cc2</code><br/><code>CC(=O)NCCc1c[nH]c2ccc(OC)cc12</code> \|----- \|[[Flavopereirin]] (C<sub>17</sub>H<sub>15</sub>N<sub>2</sub>) \|[[Image:Flavopereirine.svg\|class=skin-invert-image\|160px\|Molecular structure of flavopereirin]] \|<code>CCc(c1)ccc2[n+]1ccc3c2[nH]c4c3cccc4</code><br/><code>CCc1c[n+]2ccc3c4ccccc4[nH]c3c2cc1</code> \|----- \|[[Nicotine]] (C<sub>10</sub>H<sub>14</sub>N<sub>2</sub>) \|[[Image:Nicotine.svg\|class=skin-invert-image\|80px\|Molecular structure of nicotine]] \|<code>CN1CCC[C@H]1c2cccnc2</code> \|----- \|[[Oenanthotoxin]] (C<sub>17</sub>H<sub>22</sub>O<sub>2</sub>) \|[[Image:Oenanthotoxin-structure.png\|class=skin-invert-image\|180px\|Molecular structure of oenanthotoxin]] 
\|<code>CCC[C@@H](O)CC\C=C\C=C\C#CC#C\C=C\CO</code><br/><code>CCC[C@@H](O)CC/C=C/C=C/C#CC#C/C=C/CO</code> \|----- \|[[Pyrethrin]] II (C<sub>22</sub>H<sub>28</sub>O<sub>5</sub>) \|[[Image:Pyrethrin-II-2D-skeletal.svg\|class=skin-invert-image\|180px\|Molecular structure of pyrethrin II]] \|<code>CC1=C(C(=O)C[C@@H]1OC(=O)[C@@H]2[C@H](C2(C)C)/C=C(\C)/C(=O)OC)C/C=C\C=C</code> \|----- \|[[Aflatoxin]] B<sub>1</sub> (C<sub>17</sub>H<sub>12</sub>O<sub>6</sub>) \|[[Image:Aflatoxin B1.svg\|class=skin-invert-image\|130px\|Molecular structure of aflatoxin B<sub>1</sub>]] \|<code>O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5</code> \|----- \|[[Glucose]] (β-<small>D</small>-glucopyranose) (C<sub>6</sub>H<sub>12</sub>O<sub>6</sub>) \|[[Image:Beta-D-Glucose.svg\|class=skin-invert-image\|140px\|Molecular structure of glucopyranose]] \|<code>OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)1</code> \|----- \|[[Bergenin]] (cuscutin, a [[resin]]) (C<sub>14</sub>H<sub>16</sub>O<sub>9</sub>) \|[[Image:Cuscutine.svg\|class=skin-invert-image\|130px\|Molecular structure of cuscutine (bergenin)]] \|<code>OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H]2[C@@H]1c3c(O)c(OC)c(O)cc3C(=O)O2</code> \|----- \|A [[pheromone]] of the Californian [[scale insect]] \|[[Image:Pheromone cochenille californienne.svg\|class=skin-invert-image\|180px\|(3''Z'',6''R'')-3-methyl-6-(prop-1-en-2-yl)deca-3,9-dien-1-yl acetate]] \|<code>CC(=O)OCCC(/C)=C\C[C@H](C(C)=C)CCC=C</code> \|----- \|(2''S'',5''R'')-[[Chalcogran]]: a [[pheromone]] of the [[Scolytinae\|bark beetle]] ''[[Pityogenes chalcographus]]''<ref>{{cite journal \| vauthors = Byers JA, Birgersson G, Löfqvist J, Appelgren M, Bergström G \| title = Isolation of pheromone synergists of bark beetle, Pityogenes chalcographus, from complex insect-plant odors by fractionation and subtractive-combination bioassay \| journal = Journal of Chemical Ecology \| volume = 16 \| issue = 3 \| pages = 861–876 \| date = March 1990 \| pmid = 24263601 \| doi = 10.1007/BF01016496 \| 
bibcode = 1990JCEco..16..861B \| s2cid = 226090 }}</ref> \|[[Image:2S,5R-chalcogran-skeletal.svg\|class=skin-invert-image\|130px\|(2''S'',5''R'')-2-ethyl-1,6-dioxaspiro[4.4]nonane]] \|<code>CC[C@H](O1)CC[C@@]12CCCO2</code> \|----- \|[[Thujone\|α-Thujone]] (C<sub>10</sub>H<sub>16</sub>O) \|[[Image:Alpha-thujone.svg\|class=skin-invert-image\|100px\|Molecular structure of thujone]] \|<code>CC(C)[C@@]12C[C@@H]1[C@@H](C)C(=O)C2</code> \|----- \|[[Thiamine]] (vitamin B<sub>1</sub>, C<sub>12</sub>H<sub>17</sub>N<sub>4</sub>OS<sup>+</sup>) \|[[Image:Thiamin.svg\|class=skin-invert-image\|150px\|Molecular structure of thiamin]] \|<code>OCCc1c(C)[n+](cs1)Cc2cnc(C)nc2N</code> \|} {{Clear}} To illustrate a molecule with more than 9 rings, consider [[cephalostatin]]-1,<ref name="PubChem-183413">{{cite web \|title=CID 183413 \|url=https://pubchem.ncbi.nlm.nih.gov/compound/183413 \|website=[[PubChem]] \|access-date=May 12, 2012 \|language=en}}</ref> a steroidic 13-ringed [[pyrazine]] with the [[empirical formula]] C<sub>54</sub>H<sub>74</sub>N<sub>2</sub>O<sub>10</sub> isolated from the [[Indian Ocean]] [[hemichordate]] ''[[Cephalodiscus gilchristi]]'': {{Clear}} :[[Image:Cephalostatine-1.svg\|class=skin-invert-image\|360px\|Molecular structure of cephalostatin-1]] Starting with the left-most methyl group in the figure: :<code>CC(C)(O1)C[C@@H](O)[C@@]1(O2)[C@@H](C)[C@@H]3CC=C4[C@]3(C2)C(=O)C[C@H]5[C@H]4CC[C@@H](C6)[C@]5(C)Cc(n7)c6nc(C[C@@]89(C))c7C[C@@H]8CC[C@@H]%10[C@@H]9C[C@@H](O)[C@@]%11(C)C%10=C[C@H](O%12)[C@]%11(O)[C@H](C)[C@]%12(O%13)[C@H](O)C[C@@]%13(C)CO</code> <code>%</code> appears in front of the index of ring closure labels above 9; see {{Section link\|\|Rings}} above. === Other examples of SMILES === The SMILES notation is described extensively in the SMILES theory manual provided by Daylight Chemical Information Systems and a number of illustrative examples are presented. 
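As a rough aid for checking hand-written examples such as those above, the following minimal Python sketch lists the ring-closure labels of a SMILES string and reports any label that is opened but never closed. It is an illustration only, not part of any SMILES toolkit: the function names <code>ring_closure_labels</code> and <code>unclosed_ring_labels</code> are hypothetical, the sketch assumes that <code>%</code> is always followed by exactly two digits, and it does not validate any other part of the syntax.

```python
def ring_closure_labels(smiles):
    """Return the ring-closure labels of a SMILES string in reading order.

    Digits inside square brackets (isotopes, charges, atom maps) are skipped,
    because they are not ring-closure labels.
    Assumption: '%' introduces a two-digit label, e.g. '%10'.
    """
    labels = []
    in_bracket = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif not in_bracket:
            if ch == "%":
                digits = smiles[i + 1:i + 3]
                if len(digits) != 2 or not digits.isdigit():
                    raise ValueError(f"'%' at position {i} is not followed by two digits")
                labels.append(int(digits))
                i += 2  # skip the two digits just consumed
            elif ch.isdigit():
                # Outside brackets every single digit is its own label,
                # so '12' means label 1 followed by label 2.
                labels.append(int(ch))
        i += 1
    return labels


def unclosed_ring_labels(smiles):
    """Return the sorted labels that are opened but never closed."""
    open_labels = set()
    for label in ring_closure_labels(smiles):
        if label in open_labels:
            open_labels.remove(label)  # second occurrence closes the ring
        else:
            open_labels.add(label)     # first occurrence opens it
    return sorted(open_labels)


# Chalcogran from the table above: two rings, both closed.
print(ring_closure_labels("CC[C@H](O1)CC[C@@]12CCCO2"))   # [1, 1, 2, 2]
print(unclosed_ring_labels("CC[C@H](O1)CC[C@@]12CCCO2"))  # []
# A string with a ring bond that is never closed.
print(unclosed_ring_labels("C1CCCC"))                      # [1]
```

Because a label may be reused once its ring has been closed, the sketch toggles each label between open and closed rather than counting occurrences.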
Daylight's depict utility lets users check their own SMILES examples and is a valuable educational tool. == Extensions == [[Smiles arbitrary target specification\|SMARTS]] is a line notation for specifying substructural patterns in molecules. While it uses many of the same symbols as SMILES, it also allows specification of [[Wildcard character\|wildcard]] atoms and bonds, which can be used to define substructural queries for [[chemical database]] searching. One common misconception is that SMARTS-based substructural searching involves matching of SMILES and SMARTS strings. In fact, both SMILES and SMARTS strings are first converted to internal graph representations, which are then searched for [[Glossary of graph theory#Subgraphs\|subgraph]] [[isomorphism]]. {{anchor\|SMIRKS}}SMIRKS, a superset of "reaction SMILES" and a subset of "reaction SMARTS", is a line notation for specifying reaction transforms. The general syntax for the reaction extensions is <code>REACTANT>AGENT>PRODUCT</code> (without spaces), where any of the fields can either be left blank or filled with multiple molecules separated by a dot (<code>.</code>), and other descriptions dependent on the base language. Atoms can additionally be identified with a number (e.g. <code>[C:1]</code>) for mapping.<ref>{{cite web \|title=SMIRKS Tutorial \|url=http://daylight.com/dayhtml_tutorials/languages/smirks/ \| publisher = Daylight Chemical Information Systems, Inc. \|access-date=29 October 2018}}</ref><ref>{{cite web \|title=Reaction SMILES and SMIRKS \|url=http://www.daylight.com/meetings/summerschool01/course/basics/smirks.html \|access-date = 29 October 2018 \| publisher = Daylight Chemical Information Systems, Inc. }}</ref> SMILES corresponds to discrete molecular structures. However, many materials are macromolecules, which are too large (and often too stochastic in structure) to be conveniently written as SMILES. 
BigSMILES is an extension of SMILES that aims to provide an efficient representation system for macromolecules.<ref>{{cite journal \| vauthors = Lin TS, Coley CW, Mochigase H, Beech HK, Wang W, Wang Z, Woods E, Craig SL, Johnson JA, Kalow JA, Jensen KF, Olsen BD \| display-authors = 6 \| title = BigSMILES: A Structurally-Based Line Notation for Describing Macromolecules \| journal = ACS Central Science \| volume = 5 \| issue = 9 \| pages = 1523–1531 \| date = September 2019 \| pmid = 31572779 \| pmc = 6764162 \| doi = 10.1021/acscentsci.9b00476 }}</ref> == Conversion == SMILES can be converted back to two-dimensional representations using structure diagram generation (SDG) algorithms.<ref name="Helson-1999">{{cite book \| vauthors = Helson HE \| year = 1999 \| chapter = Structure Diagram Generation \| title = Reviews in Computational Chemistry \| veditors = Lipkowitz KB, Boyd DB \|location=New York \|pages=313–398 \|publisher=Wiley-VCH \|doi=10.1002/9780470125908.ch6 \|volume=13 \| isbn = 978-0-470-12590-8 }}</ref> This conversion is sometimes ambiguous. Conversion to a three-dimensional representation is achieved by energy-minimization approaches. There are many downloadable and web-based conversion utilities. == See also == * [[Smiles arbitrary target specification\|SMILES arbitrary target specification]] (SMARTS), an extension of SMILES for specification of substructural queries * [[SYBYL Line Notation]], another line notation * [[International Chemical Identifier]] (InChI), the [[IUPAC]]'s alternative to SMILES * [[Molecular Query Language]], a [[query language]] that also allows numerical properties, e.g. 
physicochemical values or distances * [[Chemistry Development Kit]], 2D layout and conversion software * [[OpenBabel]], [[JOELib]], [[OELib]] (conversion) == References == {{Reflist}} {{Molecular visualization}} {{chemistry software\|state=collapsed}} {{DEFAULTSORT:Simplified Molecular Input Line Entry System}} [[Category:Chemical nomenclature]] [[Category:Encodings]] [[Category:Chemical file formats]]