SMILES is a simple yet comprehensive chemical nomenclature.
The answer to the most commonly asked question about SMILES is: yes,
it is an acronym, meaning Simplified Molecular Input Line Entry Specification.
(SMILES originated in the depths of the US government, where humorous
names for things are frowned upon unless they are acronyms.)
SMILES is widely used as a general-purpose chemical nomenclature and
data exchange format. However, SMILES differs in several fundamental ways
from most chemical nomenclatures and other chemical formats. It is useful
to review a few fundamental concepts before digging into the specifics
of the SMILES language.
SMARTS
Substructure searching, the process
of finding a particular pattern (subgraph) in a molecule (graph), is one
of the most important tasks for computers in chemistry. It is used in
virtually every application that employs a digital representation of a
molecule, including depiction (to highlight a particular functional group),
drug design (searching a database for similar structures and activity),
analytical chemistry (looking for previously-characterized structures
and comparing their data to that of an unknown), and a host of other problems.
SMARTS is a language that allows you to specify substructures using
rules that are straightforward extensions of SMILES. For example, to search
a database for phenol-containing structures, one would use the SMARTS
string "[OH]c1ccccc1", which should be familiar to those acquainted
with SMILES. In fact, almost all SMILES specifications are valid SMARTS
targets (see "SMARTS Exceptions," below). Using SMARTS, flexible
and efficient substructure-search specifications can be made in terms
that are meaningful to chemists.
In the SMILES language, there are two fundamental types of symbols:
atoms and bonds. Using these SMILES symbols, one can specify
a molecule's graph (its "nodes" and "edges") and assign
"labels" to the components of the graph (that is, say what type
of atom each node represents, and what type of bond each edge represents).
The same is true in SMARTS: One uses atomic and bond symbols to specify
a graph. However, in SMARTS the labels for the graph's nodes and edges
(its "atoms" and "bonds") are extended to include
"logical operators" and special atomic and bond symbols; these
allow SMARTS atoms and bonds to be more general. For example, the SMARTS
atomic symbol [C,N] is an atom that can be aliphatic C or aliphatic N;
the SMARTS bond symbol "~" (tilde) matches any bond.
SMARTS provides a number of primitive symbols describing atomic properties
beyond those used in SMILES (atomic symbol, charge, and isotopic specifications).
The following tables list the atomic primitives used in SMARTS (all SMILES
atomic symbols are also legal). In these tables <n> stands for a
digit, <c> for chiral class.
SMARTS Atomic Primitives

Symbol    Symbol name       Atomic property requirements       Default
*         wildcard          any atom                           (no default)
a         aromatic          aromatic                           (no default)
A         aliphatic         aliphatic                          (no default)
D<n>      degree            <n> explicit connections           exactly one [1]
H<n>      total-H-count     <n> attached hydrogens             exactly one [1]
h<n>      implicit-H-count  <n> implicit hydrogens             exactly one [1]
R<n>      ring membership   in <n> SSSR rings                  any ring atom
r<n>      ring size         in smallest SSSR ring of size <n>  any ring atom
v<n>      valence           total bond order <n>               exactly one [1]
X<n>      connectivity      <n> total connections              exactly one [1]
-<n>      negative charge   -<n> charge                        -1 charge (-- is -2, etc)
+<n>      positive charge   +<n> formal charge                 +1 charge (++ is +2, etc)
#<n>      atomic number     atomic number <n>                  (no default)
@         chirality         anticlockwise                      anticlockwise, default class
@@        chirality         clockwise                          clockwise, default class
@<c><n>   chirality         chiral class <c> chirality <n>     (no default)
@<c><n>?  chiral or unspec  chirality <c><n> or unspecified    (no default)
<n>       atomic mass       explicit atomic mass               unspecified mass

[1] Note that atomic primitive "H" can have two meanings, implying a
property or the element itself. [H] means hydrogen atom. [*H2] means any
atom with exactly two hydrogens attached.
Examples:

C              aliphatic carbon atom
c              aromatic carbon atom
a              aromatic atom
[#6]           carbon atom
[Ca]           calcium atom
[++]           atom with a +2 charge
[R]            atom in any ring
[D3]           atom with 3 explicit bonds (implicit H's don't count)
[X3]           atom with 3 total bonds (includes implicit H's)
[v3]           atom with bond orders totaling 3 (includes implicit H's)
C[C@H](F)O     match chirality (H-F-O anticlockwise viewed from C)
C[C@?H](F)O    matches if chirality is as specified or is not specified
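
The difference between the degree (D), connectivity (X), valence (v) and hydrogen-count (H, h) primitives is easy to lose in prose, so the following sketch computes each of them for two atoms. The Atom record, its field names and the helper functions are hypothetical and exist only for illustration; they are not part of any toolkit, and only integer bond orders are handled.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    """Hypothetical hydrogen-suppressed atom record (not a toolkit type)."""
    symbol: str
    implicit_h: int = 0                        # hydrogens not stored as atoms
    bonds: list = field(default_factory=list)  # (neighbor symbol, bond order)


def degree(atom):
    """D<n>: explicit connections only."""
    return len(atom.bonds)


def connectivity(atom):
    """X<n>: explicit connections plus implicit hydrogens."""
    return len(atom.bonds) + atom.implicit_h


def valence(atom):
    """v<n>: total bond order; each implicit hydrogen adds one."""
    return sum(order for _, order in atom.bonds) + atom.implicit_h


def total_h(atom):
    """H<n>: implicit hydrogens plus explicit hydrogen neighbors."""
    return atom.implicit_h + sum(1 for sym, _ in atom.bonds if sym == "H")


def implicit_h(atom):
    """h<n>: implicit hydrogens only."""
    return atom.implicit_h


# Carbonyl carbon of acetaldehyde (CC=O): one implicit H, C single, O double.
carbonyl_c = Atom("C", implicit_h=1, bonds=[("C", 1), ("O", 2)])
# Nitrogen of trimethylamine (CN(C)C): three single bonds, no hydrogens.
amine_n = Atom("N", implicit_h=0, bonds=[("C", 1), ("C", 1), ("C", 1)])

for name, atom in (("carbonyl C", carbonyl_c), ("amine N", amine_n)):
    print(name, degree(atom), connectivity(atom), valence(atom), total_h(atom))
```

Under these assumptions [X3] would accept both atoms, while [D3] and [v3] would accept only the amine nitrogen, because the implicit hydrogen on the carbonyl carbon counts toward X and v but not toward D.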
Atom and bond primitive specifications may be combined to form expressions
by using logical operators. In the following table, "e" is an
atom or bond SMARTS expression (which may be a primitive). The logical
operators are listed in order of decreasing precedence (high precedence
operators are evaluated first).
SMARTS Logical Operators

Symbol        Expression    Meaning
exclamation   !e1           not e1
ampersand     e1&e2         e1 and e2 (high precedence)
comma         e1,e2         e1 or e2
semicolon     e1;e2         e1 and e2 (low precedence)
All atomic expressions which are not simple primitives must be enclosed
in brackets. The default operation is `&' (high precedence "and"),
i.e., two adjacent primitives without an intervening logical operator
must both be true for the expression (or subexpression) to be true.
The ability to form expressions gives the SMARTS user a great deal of
power to specify exactly what is desired. The two forms of the AND operator
are used in SMARTS instead of grouping operators.
Examples:

[CH2]    aliphatic carbon with two hydrogens (methylene carbon)
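
The way the two AND operators stand in for grouping can be seen in a small evaluator. This is a minimal sketch rather than a SMARTS parser: the evaluate function and the truth mapping are hypothetical, the expression is given without its enclosing brackets, and only explicitly written operators are handled (the implicit "&" between adjacent primitives is not).

```python
def evaluate(expr, truth):
    """Evaluate a SMARTS-style logical expression for one atom.

    `truth` maps each primitive token (e.g. "c", "n", "H1") to a bool for
    the atom being tested. Splitting on the lowest-precedence operator
    first makes it bind most loosely: ';' then ',' then '&' then '!'.
    """
    if ";" in expr:
        return all(evaluate(part, truth) for part in expr.split(";"))
    if "," in expr:
        return any(evaluate(part, truth) for part in expr.split(","))
    if "&" in expr:
        return all(evaluate(part, truth) for part in expr.split("&"))
    if expr.startswith("!"):
        return not evaluate(expr[1:], truth)
    return truth[expr]


# An aromatic carbon with no attached hydrogen.
atom = {"c": True, "n": False, "H1": False}
print(evaluate("c,n&H1", atom))  # grouped as: c or (n and H1)
print(evaluate("c,n;H1", atom))  # grouped as: (c or n) and H1
```

For this atom the first expression is true and the second is false: the high-precedence "&" ties H1 to n alone, whereas the low-precedence ";" applies H1 to the whole or-expression.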
Any SMARTS expression may be used to define an atomic environment by
writing a SMARTS starting with the atom of interest in this form:
$(SMARTS)
Such definitions may be considered atomic properties. These expressions
can be used in the same manner as other atomic primitives (also, they can
be nested).
Recursive SMARTS expressions are used as in the following examples:

*C                atom connected to methyl (or methylene) carbon
*CC               atom connected to ethyl carbon
[$(*C);$(*CC)]    atom in both above environments (matches CCC)
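
Treating an environment as an atomic property can be sketched by writing each environment as a predicate on a single atom and combining the predicates with a logical AND. The graph encoding and function names below are assumptions made for illustration; the target is propane (CCC), in which every carbon is aliphatic.

```python
# Propane, CCC, as a hydrogen-suppressed graph: atom index -> neighbor indices.
symbols = ["C", "C", "C"]
neighbors = {0: [1], 1: [0, 2], 2: [1]}


def env_star_c(atom):
    """$(*C): the atom has at least one carbon neighbor."""
    return any(symbols[n] == "C" for n in neighbors[atom])


def env_star_cc(atom):
    """$(*CC): the atom has a carbon neighbor bonded to a further carbon."""
    return any(
        symbols[n] == "C" and symbols[m] == "C"
        for n in neighbors[atom]
        for m in neighbors[n]
        if m != atom  # the path must not double back onto the starting atom
    )


def both(atom):
    """[$(*C);$(*CC)]: both environments hold for the same atom."""
    return env_star_c(atom) and env_star_cc(atom)


matches = [a for a in neighbors if both(a)]
print(matches)
```

In this sketch the two terminal carbons satisfy both environments, while the central carbon satisfies only the first, since neither of its neighbors leads on to a further carbon.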
The additional power of such expressions is illustrated by the following
example which derives an expression for methyl carbons which are ortho
to oxygen and meta to a nitrogen on an aromatic ring.
SMARTS may contain "zero-level" parentheses which can be used
to group dot-disconnected fragments. This grouping operator allows SMARTS
to express more powerful component queries. In general, a single set of
parentheses may surround any legal SMARTS expression. Two or more of these
expressions may be combined into more complex SMARTS:
(SMARTS)
(SMARTS).(SMARTS)
(SMARTS).SMARTS
The semantics of the "zero-level" parentheses are that all
of the atom and bond expressions within a set of zero-level parentheses
must match within a single component of the target.
SMARTS       SMILES       Match behavior
C.C          CCCC         yes, no component level grouping specified
(C.C)        CCCC         yes, both carbons in the query match the same component
(C).(C)      CCCC         no, the query must match carbons in two different components
(C).(C)      CCCC.CCCC    yes, the query does match carbons in two different components
(C).C        CCCC         yes, both carbons in the query match the same component
(C).(C).C    CCCC.CCCC    yes, the first two carbons match different components, the third matches a carbon anywhere
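
The grouping rule in the table can be checked with a brute-force sketch. It assumes that query atoms in the same zero-level group must land in one target component, that different groups must land in different components, and that ungrouped atoms are unconstrained. The function name and the list encodings are hypothetical, and because every query in the table is a set of disconnected carbons, atom types and bonds are not tested at all.

```python
from itertools import permutations


def component_match(query_groups, target_components):
    """Return True if the query atoms can be mapped onto distinct target
    atoms while respecting zero-level grouping.

    query_groups: one entry per query atom; an int names the zero-level
        parenthesis group the atom sits in, None means ungrouped.
    target_components: one entry per target atom, giving its component index.
    """
    n = len(query_groups)
    for mapping in permutations(range(len(target_components)), n):
        group_to_comp = {}
        ok = True
        for q, t in enumerate(mapping):
            g = query_groups[q]
            if g is None:
                continue  # ungrouped atoms may match anywhere
            comp = target_components[t]
            if group_to_comp.setdefault(g, comp) != comp:
                ok = False  # one group spread over two components
                break
        # Different groups must occupy different components.
        if ok and len(set(group_to_comp.values())) == len(group_to_comp):
            return True
    return False


butane = [0, 0, 0, 0]           # CCCC: one component
two_butanes = [0] * 4 + [1] * 4  # CCCC.CCCC: two components

print(component_match([None, None], butane))      # C.C
print(component_match([0, 0], butane))            # (C.C)
print(component_match([0, 1], butane))            # (C).(C)
print(component_match([0, 1], two_butanes))       # (C).(C)
print(component_match([0, None], butane))         # (C).C
print(component_match([0, 1, None], two_butanes)) # (C).(C).C
```

Only the third call fails, which agrees with the one "no" row of the table.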
These component-level grouping operators were added specifically for
reaction processing. Without this construct, it is impossible to distinguish
inter- versus intra-molecular reaction queries. For example:
All SMILES expressions are also valid SMARTS expressions, but the semantics
changes because SMILES describes molecules whereas SMARTS describes patterns.
The molecule represented by a SMILES string is usually, but not always,
matched by the same string when used as a SMARTS.
SMILES is interpreted as a molecule, and it is the resultant molecule
(not the SMILES string) which is subject to searching. Similarly, SMARTS
is interpreted as a pattern; it is this pattern (not the SMARTS string)
which is matched against molecules. For instance, the SMILES "C1=CC=CC=C1"
(cyclohexatriene) is interpreted as the benzene molecule. This molecule
will be matched by the SMARTS c1ccccc1, which is interpreted as the pattern
"6 aromatic carbons in a ring". The SMARTS "C1=CC=CC=C1"
makes a pattern ("six aliphatic carbons in a ring with alternating
single and double bonds") which will not match benzene. It
will, however, match the nonaromatic phenylate cation with SMILES C1=CC=CC=[CH+]1.
When atoms are specified without brackets in SMILES, default values
are used; in SMARTS, unspecified properties are not defined to be part
of the pattern. For instance, the SMILES O means an aliphatic oxygen with
zero charge and two hydrogens, i.e. water. In SMARTS, the same expression
means any aliphatic oxygen regardless of charge, hydrogen count, etc.;
for example, it will match the oxygen in water, but also those in ethanol, acetone,
molecular oxygen, and hydroxide and hydronium ions. Specifying [OH2] limits
the pattern to match only water (this is also the fully specified SMILES
for water).
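
The difference between SMILES defaults and SMARTS "unspecified" properties can be sketched with plain dictionaries. The field names and the atom_matches helper are assumptions for illustration only: a target atom carries a value for every property, while a pattern lists just the properties it constrains.

```python
# Target atoms as fully specified records, as a SMILES interpreter might
# produce them (field names are illustrative, not from any toolkit).
water_o = {"element": "O", "aromatic": False, "charge": 0, "h_count": 2}
ethanol_o = {"element": "O", "aromatic": False, "charge": 0, "h_count": 1}
hydronium_o = {"element": "O", "aromatic": False, "charge": 1, "h_count": 3}

# A SMARTS atom constrains only the properties it mentions.
smarts_O = {"element": "O", "aromatic": False}                 # O
smarts_OH2 = {"element": "O", "aromatic": False, "h_count": 2}  # [OH2]


def atom_matches(pattern, atom):
    """True if every property named in the pattern has that value in the atom."""
    return all(atom[key] == value for key, value in pattern.items())


for name, atom in (("water", water_o), ("ethanol", ethanol_o),
                   ("hydronium", hydronium_o)):
    print(name, atom_matches(smarts_O, atom), atom_matches(smarts_OH2, atom))
```

Here the pattern for O accepts all three oxygens because it says nothing about charge or hydrogen count, while the pattern for [OH2] accepts only the water oxygen.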
There are a few anachronisms in most SMILES interpreters which can also
lead to confusion. Some SMILES interpreters allow implicit hydrogens to
be added as explicit atoms on input as a shortcut. E.g., the SMILES for
1H-pyrrole is "[nH]1cccc1" which is matched by itself as SMARTS
and by "n1cccc1". The current Daylight SMILES interpreter will
also accept "Hn1cccc1" for (not very good) reasons of historical
compatibility; this generates the same (hydrogen-suppressed) molecule
as does "[nH]1cccc1" and is matched by the same SMARTS. However,
the SMARTS "Hn1cccc1" does not match this molecule.
Most SMARTS expressions are not valid SMILES expressions. For instance,
the string "cOc" is a valid SMARTS, matching an aliphatic oxygen
connected to two aromatic carbons as part of a larger molecule (e.g. diphenyl
ether). However, "cOc" does not describe a molecule per se,
and is therefore not a valid SMILES.
The Daylight 4.x SMARTS Toolkit provides a function, dt_smarts_opt(),
which automatically optimizes a SMARTS by reordering, expanding, and/or
consolidating atom and bond expressions. Programs which use this feature
(e.g. the Merlin program) can be expected to be near optimal in terms
of the time used to search typical organic structures.
When this optimization method is not used, there are some things which
can be done to facilitate efficient (fast) searching operations using
SMARTS. It is important to recognize that SMARTS target strings are processed
in strictly left-to-right order. For this reason, substantial gains in
speed can be achieved by following these guidelines:
- Uncommon atoms or bond arrangements should be placed
  early in SMARTS targets.
- In an "and-expression", the less common atom
  or bond specifications should be placed early.
- In an "or-expression", the less common atom
  or bond specifications should be placed last.
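
The reasoning behind the and/or ordering guidelines can be illustrated by counting primitive tests under left-to-right, short-circuit evaluation. This is a simplified cost model and not a description of how any particular toolkit is implemented; the count_tests function, the predicates and the 95-carbon, 5-bromine target are all hypothetical.

```python
def count_tests(atoms, predicates, mode):
    """Count the primitive tests needed to evaluate an and/or expression
    over all atoms, testing predicates left to right with short-circuiting."""
    tests = 0
    for atom in atoms:
        for pred in predicates:
            tests += 1
            result = pred(atom)
            if mode == "and" and not result:
                break  # one failure settles an and-expression
            if mode == "or" and result:
                break  # one success settles an or-expression
    return tests


# Hypothetical target: 95 carbons and 5 bromines.
atoms = ["C"] * 95 + ["Br"] * 5


def is_c(a):
    return a == "C"    # common


def is_br(a):
    return a == "Br"   # uncommon


def not_h(a):
    return a != "H"    # true for every atom here


and_rare_first = count_tests(atoms, [is_br, not_h], "and")
and_rare_last = count_tests(atoms, [not_h, is_br], "and")
or_rare_last = count_tests(atoms, [is_c, is_br], "or")
or_rare_first = count_tests(atoms, [is_br, is_c], "or")
print(and_rare_first, and_rare_last, or_rare_last, or_rare_first)
```

With the uncommon test first, the and-expression rejects most atoms after a single test; with the common test first, the or-expression accepts most atoms after a single test. In both cases the recommended ordering roughly halves the number of tests on this target.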