The simplified molecular-input line-entry system (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.
SMILES | |
---|---|
Filename extension | .smi |
Internet media type | chemical/x-daylight-smiles |
Type of format | chemical file format |

The original SMILES specification was initiated in the 1980s. It has since been modified and extended. In 2007, an open standard called OpenSMILES was developed in the open-source chemistry community. Other linear notations include the Wiswesser line notation (WLN), ROSDAL, and SYBYL Line Notation (SLN).
History
The original SMILES specification was initiated by David Weininger at the USEPA Mid-Continent Ecology Division Laboratory in Duluth in the 1980s.[1][2][3][4] Acknowledged for their parts in the early development were "Gilman Veith and Rose Russo (USEPA) and Albert Leo and Corwin Hansch (Pomona College) for supporting the work, and Arthur Weininger (Pomona; Daylight CIS) and Jeremy Scofield (Cedar River Software, Renton, WA) for assistance in programming the system."[5] The Environmental Protection Agency funded the initial project to develop SMILES.[6][7]
It has since been modified and extended by others, most notably by Daylight Chemical Information Systems. In 2007, an open standard called "OpenSMILES" was developed by the Blue Obelisk open-source chemistry community. Other 'linear' notations include the Wiswesser Line Notation (WLN), ROSDAL and SLN (Tripos Inc).
In July 2006, the IUPAC introduced the InChI as a standard for formula representation. SMILES is generally considered to have the advantage of being slightly more human-readable than InChI; it also has a wide base of software support with extensive theoretical backing (such as graph theory).
Terminology
The term SMILES refers to a line notation for encoding molecular structures and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings; the exact meaning is usually apparent from the context. The terms "canonical" and "isomeric" can lead to some confusion when applied to SMILES. The terms describe different attributes of SMILES strings and are not mutually exclusive.
Typically, a number of equally valid SMILES strings can be written for a molecule. For example, CCO, OCC and C(O)C all specify the structure of ethanol. Algorithms have been developed to generate the same SMILES string for a given molecule; of the many possible strings, these algorithms choose only one of them. This SMILES is unique for each structure, although dependent on the canonicalization algorithm used to generate it, and is termed the canonical SMILES. These algorithms first convert the SMILES to an internal representation of the molecular structure; an algorithm then examines that structure and produces a unique SMILES string. Various algorithms for generating canonical SMILES have been developed and include those by Daylight Chemical Information Systems, OpenEye Scientific Software, MEDIT, Chemical Computing Group, MolSoft LLC, and the Chemistry Development Kit. A common application of canonical SMILES is indexing and ensuring uniqueness of molecules in a database.
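The equivalence of the three ethanol strings can be checked with a minimal sketch. It assumes a very small SMILES subset (single-letter atoms, implicit single bonds, parentheses for branches; no rings, charges, or stereochemistry) and compares the resulting graphs by brute force. The names `parse_simple` and `same_molecule` are hypothetical, and this is not one of the canonicalization algorithms mentioned above.

```python
from itertools import permutations

def parse_simple(smiles):
    """Parse a tiny SMILES subset into (atoms, bonds).

    Supported: single-letter atoms (C, N, O, ...), implicit single bonds,
    and parentheses for branches. Rings, charges and stereo are not handled.
    """
    atoms, bonds = [], set()
    prev = None   # index of the atom the next atom will bond to
    stack = []    # saved branch points
    for ch in smiles:
        if ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        elif ch.isalpha() and ch.isupper():
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.add(frozenset((prev, idx)))
            prev = idx
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return atoms, bonds

def same_molecule(a, b):
    """Brute-force graph isomorphism; only sensible for very small molecules."""
    atoms_a, bonds_a = parse_simple(a)
    atoms_b, bonds_b = parse_simple(b)
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    n = len(atoms_a)
    for perm in permutations(range(n)):
        # The mapping must preserve element symbols and the bond set.
        if all(atoms_a[i] == atoms_b[perm[i]] for i in range(n)):
            mapped = {frozenset(perm[i] for i in bond) for bond in bonds_a}
            if mapped == bonds_b:
                return True
    return False

print(same_molecule("CCO", "OCC"))    # ethanol written from the other end
print(same_molecule("CCO", "C(O)C"))  # ethanol written with a branch
print(same_molecule("CCO", "COC"))    # dimethyl ether, a different molecule
```

Trying every permutation is factorial in the number of atoms, which is why practical toolkits compute a canonical atom ordering instead of comparing graphs pairwise.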
The original paper that described the CANGEN[2] algorithm claimed to generate unique SMILES strings for graphs representing molecules, but the algorithm fails for a number of simple cases (e.g. cuneane, 1,2-dicyclopropylethane) and cannot be considered a correct method for representing a graph canonically.[8] There is currently no systematic comparison across commercial software to test if such flaws exist in those packages.
SMILES notation allows the specification of configuration at tetrahedral centers, and double bond geometry. These are structural features that cannot be specified by connectivity alone, and therefore SMILES which encode this information are termed isomeric SMILES. A notable feature of these rules is that they allow rigorous partial specification of chirality. The term isomeric SMILES is also applied to SMILES in which isomers are specified.
Graph-based definition
In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.
The resultant SMILES form depends on the choices:
- of the bonds chosen to break cycles,
- of the starting atom used for the depth-first traversal, and
- of the order in which branches are listed when encountered.
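The traversal described above can be sketched as follows, under simplifying assumptions: the graph is connected, hydrogens have already been removed, every bond is treated as a single bond, and no more than nine ring closures are needed. The function `write_smiles` and its parameters are hypothetical names chosen for illustration.

```python
def write_smiles(atoms, bonds, start=0):
    """Emit a SMILES-like string by depth-first traversal.

    atoms: list of element symbols (hydrogens already removed)
    bonds: iterable of (i, j) index pairs, all treated as single bonds
    start: index of the atom at which the traversal begins
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    visited = set()
    children = {i: [] for i in neighbours}   # spanning-tree edges
    closures = {i: [] for i in neighbours}   # ring labels attached to each atom
    ring_bonds = set()                        # bonds broken to open cycles

    def explore(atom, parent):
        visited.add(atom)
        for nxt in neighbours[atom]:
            if nxt == parent:
                continue
            if nxt not in visited:
                children[atom].append(nxt)
                explore(nxt, atom)
            elif frozenset((atom, nxt)) not in ring_bonds:
                # A bond to an already visited atom closes a cycle.
                ring_bonds.add(frozenset((atom, nxt)))
                label = len(ring_bonds)
                if label > 9:
                    raise ValueError("this sketch supports at most 9 ring closures")
                closures[atom].append(label)
                closures[nxt].append(label)

    def emit(atom):
        text = atoms[atom] + "".join(str(label) for label in closures[atom])
        # All but the last child are written as parenthesised branches.
        for child in children[atom][:-1]:
            text += "(" + emit(child) + ")"
        if children[atom]:
            text += emit(children[atom][-1])
        return text

    explore(start, None)
    return emit(start)

ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
for start in range(3):
    print(start, write_smiles(*ethanol, start=start))

cyclohexane = (["C"] * 6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])
print(write_smiles(*cyclohexane))
```

Changing `start`, or the order in which bonds are listed, changes the output string while the molecule stays the same, which is the dependence on traversal choices listed above.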
SMILES definition as strings of a context-free language
From the viewpoint of formal language theory, SMILES is a word. A SMILES string can be parsed with a context-free parser. This representation has been used in the prediction of biochemical properties (including toxicity and biodegradability), based on the main principle of chemoinformatics that similar molecules have similar properties. The predictive models implemented a syntactic pattern recognition approach (which involved defining a molecular distance) as well as a more robust scheme based on statistical pattern recognition.
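As an illustration of context-free parsing, the sketch below is a recursive-descent recognizer for a deliberately simplified grammar. The grammar, the atom list, and the name `is_valid_subset` are assumptions made for this example and do not reproduce any published SMILES grammar; bracket atoms, charges, stereochemistry, and two-digit ring labels are left out. It checks syntax only and does not verify that ring-closure labels come in pairs.

```python
# Two-letter symbols come first so that "Cl" is not read as "C" followed by "l".
ATOMS = ("Cl", "Br", "B", "C", "N", "O", "P", "S", "F", "I",
         "b", "c", "n", "o", "p", "s")
BONDS = "-=#:"

def is_valid_subset(smiles):
    """Recognise the simplified grammar

        chain ::= atom item*
        item  ::= bond? atom | bond? digit | '(' bond? chain ')'
    """
    pos = 0

    def atom():
        nonlocal pos
        for symbol in ATOMS:
            if smiles.startswith(symbol, pos):
                pos += len(symbol)
                return True
        return False

    def bond():
        # A bond symbol is optional, so this never fails.
        nonlocal pos
        if pos < len(smiles) and smiles[pos] in BONDS:
            pos += 1

    def chain():
        nonlocal pos
        if not atom():
            return False
        while pos < len(smiles) and smiles[pos] != ")":
            if smiles[pos] == "(":
                pos += 1
                bond()
                # The branch is itself a chain: this recursion is what
                # makes the nesting of parentheses context-free.
                if not chain() or pos >= len(smiles) or smiles[pos] != ")":
                    return False
                pos += 1
            else:
                bond()
                if pos < len(smiles) and smiles[pos].isdigit():
                    pos += 1
                elif not atom():
                    return False
        return True

    return chain() and pos == len(smiles)

for candidate in ["CCO", "CC(=O)O", "c1ccccc1", "C(O", "C)", "C="]:
    print(candidate, is_valid_subset(candidate))
```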
Molecule | SMILES |
---|---|
 | CC[C@H](O1)CC[C@@]12CCCO2 |
α-Thujone (C10H16O) | CC(C)[C@@]12C[C@@H]1[C@@H](C)C(=O)C2 |
Thiamine (vitamin B1, C12H17N4OS+) | OCCc1c(C)[n+](cs1)Cc2cnc(C)nc2N |
To illustrate a molecule with more than 9 rings, consider cephalostatin-1,[9] a steroidic 13-ringed pyrazine with the empirical formula C54H74N2O10 isolated from the Indian Ocean hemichordate Cephalodiscus gilchristi:
Starting with the left-most methyl group in the figure:
CC(C)(O1)C[C@@H](O)[C@@]1(O2)[C@@H](C)[C@@H]3CC=C4[C@]3(C2)C(=O)C[C@H]5[C@H]4CC[C@@H](C6)[C@]5(C)Cc(n7)c6nc(C[C@@]89(C))c7C[C@@H]8CC[C@@H]%10[C@@H]9C[C@@H](O)[C@@]%11(C)C%10=C[C@H](O%12)[C@]%11(O)[C@H](C)[C@]%12(O%13)[C@H](O)C[C@@]%13(C)CO
Note that % appears in front of the index of ring closure labels above 9; see § Rings above.
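A minimal sketch of how the two label forms might be read is given below. It assumes that a ring-closure label is either a single digit or `%` followed by exactly two digits, and that digits inside square brackets (isotopes, hydrogen counts, charges) are not ring closures. The name `ring_labels` is hypothetical.

```python
import re
from collections import Counter

# A ring-closure label is either '%' plus two digits, or a single digit.
RING_LABEL = re.compile(r"%(\d\d)|(\d)")

def ring_labels(smiles):
    """Return ring-closure labels in order of appearance, as integers."""
    # Drop bracket atoms first so that digits such as the 2 in [CH2] are ignored.
    without_brackets = re.sub(r"\[[^\]]*\]", "", smiles)
    return [int(two or one) for two, one in RING_LABEL.findall(without_brackets)]

print(ring_labels("C1CCCCC1"))
print(ring_labels("[C@@H]%10CC[CH2]C%10"))
print(Counter(ring_labels("C12CCCCC1CCCC2")))
```

In a well-formed string each label is opened and closed, so every label should occur an even number of times; applied to the cephalostatin-1 string above, the labels 1 to 13 should each appear twice.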
Other examples of SMILES
The SMILES notation is described extensively in the SMILES theory manual provided by Daylight Chemical Information Systems and a number of illustrative examples are presented. Daylight's depict utility provides users with the means to check their own examples of SMILES and is a valuable educational tool.
Extensions
SMARTS is a line notation for specification of substructural patterns in molecules. While it uses many of the same symbols as SMILES, it also allows specification of wildcard atoms and bonds, which can be used to define substructural queries for chemical database searching. One common misconception is that SMARTS-based substructural searching involves matching of SMILES and SMARTS strings. In fact, both SMILES and SMARTS strings are first converted to internal graph representations which are searched for subgraph isomorphism.
SMIRKS, a superset of "reaction SMILES" and a subset of "reaction SMARTS", is a line notation for specifying reaction transforms. The general syntax for the reaction extensions is REACTANT>AGENT>PRODUCT (without spaces), where any of the fields can either be left blank or filled with multiple molecules delimited with a dot (.), and other descriptions dependent on the base language. Atoms can additionally be identified with a number (e.g. [C:1]) for mapping,[10] for example in [CH2:1]=[CH:2][CH:3]=[CH:4][CH2:5][H:6]>>[H:6][CH2:1][CH:2]=[CH:3][CH:4]=[CH2:5].[11]
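The field structure and the atom-map numbers can be pulled apart with plain string handling, as in the sketch below. It assumes that `>` occurs only as the field separator and that map numbers are written at the end of a bracket atom. The helper names `split_reaction` and `atom_map_numbers` are hypothetical, and the esterification string is an illustrative input rather than an example from the cited tutorials.

```python
import re

def split_reaction(reaction):
    """Split 'REACTANT>AGENT>PRODUCT' into three lists of molecule strings."""
    fields = reaction.split(">")
    if len(fields) != 3:
        raise ValueError("expected exactly two '>' separators")
    # An empty field means no molecules; otherwise molecules are dot-separated.
    return tuple(field.split(".") if field else [] for field in fields)

def atom_map_numbers(smiles):
    """Return the atom-map numbers found in bracket atoms such as [CH2:1]."""
    return [int(n) for n in re.findall(r"\[[^\]]*:(\d+)\]", smiles)]

mapped = ("[CH2:1]=[CH:2][CH:3]=[CH:4][CH2:5][H:6]"
          ">>[H:6][CH2:1][CH:2]=[CH:3][CH:4]=[CH2:5]")
reactants, agents, products = split_reaction(mapped)
print(agents)                          # the agent field is blank here
print(atom_map_numbers(reactants[0]))
print(atom_map_numbers(products[0]))

print(split_reaction("CC(=O)O.OCC>[H+]>CC(=O)OCC.O"))
```

Comparing the map numbers on both sides shows which reactant atom corresponds to which product atom; in the mapped example the hydrogen labelled 6 moves from one end of the chain to the other.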
Conversion
SMILES can be converted back to two-dimensional representations using structure diagram generation (SDG) algorithms.[12] This conversion is not always unambiguous. Conversion to three-dimensional representation is achieved by energy-minimization approaches. There are many downloadable and web-based conversion utilities.
See also
- SMILES arbitrary target specification (SMARTS), an extension of SMILES for specification of substructural queries
- SYBYL Line Notation, another line notation
- International Chemical Identifier (InChI), the IUPAC's alternative to SMILES
- Molecular Query Language, a query language allowing also numerical properties, e.g. physicochemical values or distances
- Chemistry Development Kit, 2D layout and conversion software
- OpenBabel, JOELib, OELib (conversion)
References
- ^ Weininger, David (February 1988). "SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules". Journal of Chemical Information and Computer Sciences. 28 (1): 31–6. doi:10.1021/ci00057a005.
- ^ a b Weininger, David; Weininger, Arthur; Weininger, Joseph L. (May 1989). "SMILES. 2. Algorithm for generation of unique SMILES notation". Journal of Chemical Information and Modeling. 29 (2): 97–101. doi:10.1021/ci00062a008.
- ^ Weininger, David (August 1990). "SMILES. 3. DEPICT. Graphical depiction of chemical structures". Journal of Chemical Information and Modeling. 30 (3): 237–43. doi:10.1021/ci00067a005.
- ^ Swanson, Richard Pommier (2004). "The Entrance of Informatics into Combinatorial Chemistry" (PDF). In Rayward, W. [Warden] Boyd; Bowden, Mary Ellen (eds.). The History and Heritage of Scientific and Technological Information Systems: Proceedings of the 2002 Conference of the American Society of Information Science and Technology and the Chemical Heritage Foundation. Medford, NJ: Information Today. p. 205. ISBN 9781573872294.
- ^ Weininger, Dave (1998). "Acknowledgements on Daylight Tutorial smiles-etc page". Retrieved June 24, 2013.
- ^ Anderson, E.; Veith, G. D.; Weininger, D. (1987). SMILES: A line notation and computerized interpreter for chemical structures (PDF). Duluth, MN: U.S. EPA, Environmental Research Laboratory-Duluth. Report No. EPA/600/M-87/021.
- ^ "SMILES Tutorial: What is SMILES?". U.S. EPA. Retrieved September 23, 2012.
- ^ Hutchison D, Kanade T, Kittler J, Klienberg JM, Mattern F, Mitchell JC, Naor M, Nierstrasz O, Rangan CP, Steffen B, Sudan M, Terzopoulos D, Tygar D, Vardi MY, Weikum G, Raschid L, Neglur G, Grossman RL, Liu B (2005). "Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples". In Ludäscher B (ed.). Data Integration in the Life Sciences. Lecture Notes in Computer Science. Vol. 3615. Berlin: Springer. pp. 145–157. doi:10.1007/11530084_13. ISBN 978-3-540-27967-9.
- ^ "CID 183413". PubChem. Retrieved May 12, 2012.
- ^ "SMIRKS Tutorial". Daylight. Retrieved October 29, 2018.
- ^ "Reaction SMILES and SMIRKS". Retrieved October 29, 2018.
- ^ Helson, H. E. (1999). "Structure Diagram Generation". In Lipkowitz, K. B.; Boyd, D. B. (eds.). Rev. Comput. Chem. Vol. 13. New York: Wiley-VCH. pp. 313–398. doi:10.1002/9780470125908.ch6.