Simplified Molecular Input Line Entry System: Difference between revisions

Terminology: Removed a deadlink
The term SMILES refers to a line notation for encoding molecular structures, and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used for a single SMILES string or for several SMILES strings; the exact meaning is usually apparent from context. The terms "canonical" and "isomeric" can cause some confusion when applied to SMILES: they describe different attributes of SMILES strings and are not mutually exclusive.
 
Typically, a number of equally valid SMILES strings can be written for a molecule. For example, <code>CCO</code>, <code>OCC</code> and <code>C(O)C</code> all specify the structure of [[ethanol]]. Algorithms have been developed that always generate the same SMILES string for a given molecule, choosing exactly one of the many possible strings. This string is unique for each structure, although it depends on the [[canonicalization]] algorithm used to generate it, and is termed the canonical SMILES. Such algorithms first convert the input SMILES to an internal representation of the molecular structure, then examine that structure and produce a unique SMILES string. Various algorithms for generating canonical SMILES have been developed, including those by Daylight Chemical Information Systems, [[OpenEye Scientific Software]], [[MEDIT]], [[Chemical Computing Group]], [[MolSoft LLC]], and the [[Chemistry Development Kit]]. A common application of canonical SMILES is indexing molecules in a [[Chemical database|database]] and ensuring that each is stored only once.
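
As a minimal illustration of this two-step idea (parse to a graph, then write one deterministic string), the following Python sketch handles only a toy subset of SMILES: single-letter atoms, implicit single bonds and branches, with no rings, charges or stereochemistry. The function names <code>parse</code>, <code>write</code> and <code>canonical</code> are hypothetical, and the rule used here (sort the branches, then take the lexicographically smallest string over all starting atoms) is an assumption chosen for brevity; it is not CANGEN or any vendor's algorithm.

```python
def parse(smiles):
    """Parse a toy SMILES subset into (atoms, bonds).

    Supported: single-letter atoms, implicit single bonds, '(' and ')'.
    Not supported: rings, charges, bond symbols, stereochemistry.
    """
    atoms = []   # atoms[i] is the element symbol of atom i
    bonds = {}   # bonds[i] is the list of neighbours of atom i
    stack = []   # branch points still open
    prev = None  # atom the next atom will be bonded to
    for ch in smiles:
        if ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        else:
            idx = len(atoms)
            atoms.append(ch)
            bonds[idx] = []
            if prev is not None:
                bonds[idx].append(prev)
                bonds[prev].append(idx)
            prev = idx
    return atoms, bonds


def write(atoms, bonds, node, parent=None):
    """Write the subtree rooted at `node`, with branches in sorted order."""
    children = sorted(
        write(atoms, bonds, nbr, node) for nbr in bonds[node] if nbr != parent
    )
    if not children:
        return atoms[node]
    # All but the last child become parenthesised branches.
    branches = "".join(f"({child})" for child in children[:-1])
    return atoms[node] + branches + children[-1]


def canonical(smiles):
    """Return one fixed string per acyclic structure (toy rule, see lead-in)."""
    atoms, bonds = parse(smiles)
    return min(write(atoms, bonds, root) for root in range(len(atoms)))


for s in ("CCO", "OCC", "C(O)C"):
    print(s, "->", canonical(s))
```

Under this rule all three ethanol strings above map to <code>C(C)O</code>; a different rule would select a different, equally valid string, which is why a canonical SMILES is only meaningful relative to the algorithm that produced it.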
 
The original paper that described the CANGEN<ref name="Weininger-1989" /> algorithm claimed to generate unique SMILES strings for graphs representing molecules, but the algorithm fails for a number of simple cases (e.g. [[cuneane]], 1,2-dicyclopropylethane) and cannot be considered a correct method for representing a graph canonically.<ref>{{cite book |publisher=Springer |location=Berlin |isbn=978-3-540-27967-9 |volume=3615 |pages=145–157 | editor-first = Bertram | editor-last=Ludäscher | last1 = Hutchison | first1 = David | first2 = Takeo | last2 = Kanade | first3 = Josef | last3 = Kittler | first4 = Jon M. | last4 = Kleinberg | author-link4 = Jon Kleinberg | first5 = Friedemann | last5 = Mattern | first6 = John C. | last6 = Mitchell | first7 = Moni | last7 = Naor | author-link7 = Moni Naor | first8 = Oscar | last8 = Nierstrasz | first9 = C. Pandu | last9 = Rangan | first10 = Bernhard | last10 = Steffen | author-link10 = Bernhard Steffen (computer scientist) | first11 = Madhu | last11 = Sudan | author-link11 = Madhu Sudan | first12 = Demetri | last12 = Terzopoulos | first13 = Doug | last13 = Tygar | first14 = Moshe Y. | last14 = Vardi | author-link14 = Moshe Y. Vardi | first15 = Gerhard | last15 = Weikum | first16 = Louiqa | last16 = Raschid |author16-link=Louiqa Raschid | first17 = Greeshma | last17 = Neglur | first18 = Robert L. | last18 = Grossman | first19 = Bing | last19 = Liu | name-list-style = vanc | series = Lecture Notes in Computer Science |title=Data Integration in the Life Sciences |chapter=Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples |access-date=2013-02-12 |year=2005 |chapter-url=https://doi.org/10.1007%2F11530084_13 |doi=10.1007/11530084_13 }}</ref> There is currently no systematic comparison across commercial software packages to test whether they share such flaws.