Sunday, November 12, 2017

Atom_mapper: determining atom correspondence between reactants and products

Many TS search algorithms interpolate between reactants and products and require that the atom order is identical. This is time-consuming to enforce by hand and hard to automate, but this paper by the folks at Schrödinger describes a solution that seems to work pretty well. The general idea is also fairly easy to implement if you are familiar with RDKit or similar cheminformatics tools, and here I describe my implementation so far.

The basic idea
If two molecules are identical but have different atom numbering then RDKit can map the atoms using "GetSubstructMatch". So (pseudocode) "OCC.GetSubstructMatch(CCO)" gives "(2,1,0)" meaning that atom 2 in CCO corresponds to atom 0 in OCC and so forth.  And "RenumberAtoms(CCO, (2,1,0))" gives OCC.
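Here's what that looks like in practice, as a minimal sketch (match[i] gives, for atom i in the query molecule, the index of the matching atom in the molecule the method is called on):

from rdkit import Chem

reactant = Chem.MolFromSmiles("CCO")
product = Chem.MolFromSmiles("OCC")

# match[i] = index of the atom in product that matches atom i in reactant
match = product.GetSubstructMatch(reactant)
print(match)  # (2, 1, 0)

# renumber the reactant so its atom order agrees with the product's
# ((2,1,0) is its own inverse; in general the permutation may need inverting)
reordered = Chem.RenumberAtoms(reactant, list(match))
print(Chem.MolToSmiles(reordered, canonical=False))  # OCC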

Let's say our reaction is CH3-CH=O -> CH2=CH-OH.  "product.GetSubstructMatch(reactant)" will not give a match because the molecules are different. To focus on connectivity we remove bond orders: CH3-CH-O -> CH2-CH-OH, but the molecules are still different. The next step is to break each bond in turn and compare. For the reactant you get [H + CH2-CH-O, CH3 + CH-O, H + CH3-C-O, O + CH3-CH2] and the first entry will match the last entry in the corresponding list for the product [H + CH-CH-OH, CH2 + CH-OH, H + CH2-C-OH, H + CH2-CH-O].

It is very easy to check for equivalence by converting the list of fragmented molecules to a set of corresponding SMILES strings and looking for identical elements in the two sets: "matches = list(set(react_frag_smiles) & set(prod_frag_smiles))". If there is a match then the atoms can be matched as before because they are identical (note that the bond breaking does not alter the atom numbering of the molecule). If no match is found then the procedure is repeated for the fragments.
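Here is a rough sketch of the fragment-and-compare step (my illustration, not the actual Atom_mapper code; hydrogens are made explicit and the H counts frozen so that removing bond orders doesn't change them):

from rdkit import Chem

def strip_bond_orders(smiles):
    # explicit hydrogens, frozen H counts, all bond orders set to one
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    for atom in mol.GetAtoms():
        atom.SetNoImplicit(True)
    for bond in mol.GetBonds():
        bond.SetBondType(Chem.BondType.SINGLE)
    return mol

def fragment_smiles(mol):
    # canonical SMILES after breaking each bond in turn;
    # addDummies=False leaves the atom numbering untouched
    smiles = set()
    for bond in mol.GetBonds():
        frag = Chem.FragmentOnBonds(mol, [bond.GetIdx()], addDummies=False)
        smiles.add(Chem.MolToSmiles(frag))
    return smiles

reactant = strip_bond_orders("CC=O")  # CH3-CH=O
product = strip_bond_orders("C=CO")   # CH2=CH-OH
print(fragment_smiles(reactant) & fragment_smiles(product))  # the common H + CH2-CH-O fragmentation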

One problem is that equivalent atoms are not matched in 3D. For example, if you label (i.e. number) the equivalent H atoms in a methyl group then the methyl group is chiral and RDKit will create a random chirality.  So, the Cartesian coordinates of the methyl hydrogens in the CCO example above are not necessarily superimposable, which creates problems in TS search algorithms that use interpolation.

Following the ideas presented in the Schrödinger paper, I try to solve this issue by switching pairs of equivalent atoms and comparing the RMSDs between reactant and product before and after.  However, this will only work if the reactants and products have the same conformation, so I use a modified version of ConstrainedEmbed (ConstrainedEmbedRP) which uses either Cartesian or distance constraints to match two structures using a UFF force field minimization of one of the structures.
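The pair-switching step could look something like this sketch, assuming both molecules already have 3D coordinates in matching atom order and that the list of symmetry-equivalent atom index pairs (pairs) has been determined elsewhere:

from rdkit import Chem
from rdkit.Chem import AllChem

def swap_equivalent_pairs(mol, ref, pairs):
    # greedily swap symmetry-equivalent atom pairs in mol whenever
    # that lowers the RMSD to ref; atoms are matched by index
    n = mol.GetNumAtoms()
    atom_map = list(zip(range(n), range(n)))
    best_rmsd = AllChem.AlignMol(mol, ref, atomMap=atom_map)
    for i, j in pairs:
        order = list(range(n))
        order[i], order[j] = order[j], order[i]
        candidate = Chem.RenumberAtoms(mol, order)  # also permutes coordinates
        rmsd = AllChem.AlignMol(candidate, ref, atomMap=atom_map)
        if rmsd < best_rmsd:
            mol, best_rmsd = candidate, rmsd
    return mol, best_rmsd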

Using ConstrainedEmbed has the added advantage of producing reactant and product structures that are in the same orientation and, therefore, often good starting points for interpolation.

Atom_mapper
* I have implemented these ideas in Atom_mapper and made it available on GitHub under the MIT license.
Usage  
* The input is SMILES strings for the reactants and products, but the code can be adapted to work with user-supplied coordinates via sdf files.
* The output is two sdf files labelled "template" and "mol" with (hopefully) identical atom ordering. Template is the more rigid of the two and its coordinates come from an unconstrained UFF minimization. Mol is superimposed on template using ConstrainedEmbedRP.
* Both geometries should probably be minimized using the energy function you plan to use for the TS search and, depending on the optimization algorithm, the structures should probably be rigidly superimposed before interpolation.
Current limitations
* I haven't tested the code extensively. If you run into cases where it doesn't work, file a bug report on GitHub.
* No conformational search is done, so the conformations chosen will be random. Solution: start with coordinates rather than SMILES.
* If the reactants or products consist of two molecules, their relative orientation resulting from the UFF minimization will be somewhat arbitrary. This is not a problem if only the reactants or only the products consist of two molecules, since the coordinates will be superimposed on a more rigid molecule. However, it is a problem if the reactants and products both consist of two molecules. Solution: start with coordinates rather than SMILES, but I am not sure how to determine the best intermolecular orientation automatically.
* The code only permutes pairs of equivalent atoms, so it won't work for equivalent groups, e.g. the methyl groups in CC(C)(C)CO, because we need to switch several pairs (corresponding to the C and 3 H atoms) simultaneously. Not sure how to fix that easily. Suggestions welcome.

There are also some general limitations to this "active bonds" approach as described in the article.


This work is licensed under a Creative Commons Attribution 4.0

Sunday, September 3, 2017

Automatic generation of a set of molecules

Many quantum chemistry projects have reached a point where setup and analysis consume more human time than CPU time, e.g. it takes all day to set up enough input files to keep the computer busy overnight. Many people use scripts to automatically extract and analyse the data, but few use scripts to generate the coordinates.

Here I show how this can easily be done using the RDKit toolkit.  The following Python script adds F and OH substituents to benzene, generating all 91 possible combinations of mono- through hexa-substituted molecules.
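A minimal sketch of such a script (a reconstruction of the idea, not the original code):

from rdkit import Chem
from rdkit.Chem import AllChem

parent_smiles = 'c1ccccc1'
rxn_smarts_list = ['[cX3;H1:1]>>[*:1]F', '[cX3;H1:1]>>[*:1]O']
substitutions = 6

molecules = set()
current = {Chem.CanonSmiles(parent_smiles)}
for _ in range(substitutions):
    next_generation = set()
    for smiles in current:
        mol = Chem.MolFromSmiles(smiles)
        for rxn_smarts in rxn_smarts_list:
            rxn = AllChem.ReactionFromSmarts(rxn_smarts)
            # each application adds one substituent to one matching site
            for products in rxn.RunReactants((mol,)):
                product = products[0]
                Chem.SanitizeMol(product)
                next_generation.add(Chem.MolToSmiles(product))
    molecules |= next_generation
    current = next_generation

print(len(molecules))  # 91
for smiles in sorted(molecules):
    print(smiles)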

However, the script can easily be changed to do something else by changing parent_smiles and rxn_smarts_list.  If you are not familiar with SMILES, start here; there are plenty of GUIs, such as this, that generate SMILES.

To use the Reaction SMARTS you have to learn SMARTS, which can be a bit tricky, but it is a very powerful tool. For example, if you change [cX3;H1:1]>>[*:1]F to [*;H1:1]>>[*:1]F then the program will add F to any atom with one H, i.e. also to the OH group, creating the OF substituent.  So if you set substitutions = 2, you'll get mono-substituted Ph-OF in addition to mono- and di-substituted Ph-F and Ph-OH.

Similarly, if you add [cX3;H1:1][cX3;H1:2]>>[c:1]1[c:2]cccc1 to the list (and use substitutions = 2) you'll get un- and mono-substituted naphthalene as well as un-substituted anthracene and phenanthrene.

In my experience, the only thing that limits what I can build with this approach is my understanding of SMARTS.  Hope this is of some use to you.



This work is licensed under a Creative Commons Attribution 4.0

Thursday, August 10, 2017

Predicting most labile CH bond using semiempirical methods



Yesterday, I mentioned on Twitter that I had written some prototype code to predict the most labile proton using PM3/COSMO.  A few people expressed interest so I put the code on GitHub.

The code is based on some previous work: it removes all CH protons/hydrides/H atoms in turn and finds the position with the lowest free energy.  I've just finished the code so I have no idea how well it works or whether PM3/COSMO is the best choice.  Also, there are very few comments, and some aspects of the setup (paths, etc.) are particular to my machine and need to be changed.
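Just to illustrate the enumeration step (a sketch, not the actual GitHub code), removing each distinct C-H proton with RDKit might look like this:

from rdkit import Chem

def ch_deprotonations(smiles):
    # yield canonical SMILES for the carbanion formed by removing one
    # proton from each C-H carbon; symmetry-equivalent sites collapse
    # to the same canonical SMILES
    mol = Chem.MolFromSmiles(smiles)
    seen = set()
    for atom in mol.GetAtoms():
        if atom.GetSymbol() == 'C' and atom.GetTotalNumHs() > 0:
            anion = Chem.RWMol(mol)
            a = anion.GetAtomWithIdx(atom.GetIdx())
            a.SetNoImplicit(True)
            a.SetNumExplicitHs(atom.GetTotalNumHs() - 1)
            a.SetFormalCharge(-1)
            Chem.SanitizeMol(anion)
            smi = Chem.MolToSmiles(anion)
            if smi not in seen:
                seen.add(smi)
                yield smi

for smi in ch_deprotonations('CC(=O)CC'):  # butanone as an example
    print(smi)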

In its current form the method doesn't consider different reference molecules, so the prediction is "pure PM3/COSMO".  The basic code to do this is there but I haven't extended it yet. Also, the code only considers CH groups, so if you have other groups (OH, NH, etc.) with lower pKa they won't be considered.  This is very much a work in progress, but if you have suggestions, test cases, ideas, or any other kind of feedback please let me know.  I hope you find it useful.


This work is licensed under a Creative Commons Attribution 4.0

Saturday, July 15, 2017

Planned papers for 2017 - six months in

In January I wrote about the papers I plan to publish and made this list:

Accepted
1. Protein structure refinement using a quantum mechanics-based chemical shielding predictor
2. Prediction of pKa values for drug-like molecules using semiempirical quantum chemical methods

Probable
3. Intermolecular Interactions in the Condensed Phase: Evaluation of Semi-empirical Quantum Mechanical Methods
4. Fast Prediction of the Regioselectivity of Electrophilic Aromatic Substitution Reactions of Heteroaromatic Systems Using Semi-Empirical Quantum Chemical Methods
5. Benchmarking cost vs. accuracy for computation of NMR shielding constants by quantum mechanical methods
6. Improved prediction of chemical shifts using machine learning
7. PM6 for all elements in GAMESS, including PCM interface

Probably not in 2017
8. Protonator: an open source program for the rapid prediction of the dominant protonation states of organic molecules in aqueous solution
9. pKa prediction using semi-empirical methods: difficult cases
10. Prediction of C-H pKa values and homolytic bond strengths using semi-empirical methods
11. High throughput transition state determination using semi-empirical methods

The status is

Published
1. Protein structure refinement using a quantum mechanics-based chemical shielding predictor
2. Prediction of pKa values for drug-like molecules using semiempirical quantum chemical methods

Probable (at least submission)
4. Fast Prediction of the Regioselectivity of Electrophilic Aromatic Substitution Reactions of Heteroaromatic Systems Using Semi-Empirical Quantum Chemical Methods

We had a draft ready to be sent in in mid-April, but decided to include a "few" more types of heteroaromatics, and the study has now ballooned to nearly 600 compounds (up from about 150 in the original draft)! The calculations and most of the analysis are done, but the paper basically has to be rewritten, and this really has to be done by my synthetic chemistry co-author, since the study is now much more about the chemistry of these heteroaromatics and much less about the method.  Even though it's out of my hands I am still hopeful that we can submit it this year.

9. pKa prediction using semi-empirical methods: difficult cases automation

This paper is 2/3 written and presents a completely automated PM3-based pKa prediction protocol. The method works quite well, but most outliers turn out to be due to high-energy conformations. The main remaining issue is to find a conformer-search protocol that consistently gives low-energy conformations. Depending on how much time I have to devote to paper 4 and the proposal mentioned below, I am still hopeful I can get this published this year.

7. PM6 for all elements in GAMESS, including PCM interface

This will basically be a paper on the accuracy of the SMD method using semiempirical methods.  I'm hopeful we will submit this year, but there is still a lot to do.  Among other things, it explains why PM3/COSMO works best for pKa predictions: the solvation energy errors happen to be smallest for this method.

10. Prediction of C-H pKa values and homolytic bond strengths using semi-empirical methods

I am planning to submit a proposal on prediction of pKa, homolytic bond strengths, etc. as predictors of CH-activation sites. The proposal is due September 25th, so much of my "spare" time until then will be spent on getting preliminary results and writing.


This work is licensed under a Creative Commons Attribution 4.0

Sunday, May 14, 2017

Predicting pKa values using PM3 - conformer search using implicit solvation

Disclaimer: These are preliminary results and may contain errors

In a previous post I showed that minimising RDKit-generated conformers with MMFF led to slightly worse PM3/COSMO pKa predictions. One reason might be that the MMFF minimisation is done in the gas phase.  Anders Christensen reminded me that TINKER can do MMFF/GBSA minimisations, so I used TINKER to minimise 20 and 50 RDKit-generated conformers and used these as initial guesses for the PM3/COSMO energy minimisations (20mmtk and 50mmtk).

As you can see there is relatively little difference in the overall performance of Xmm and Xmmtk. The error distribution is a bit tighter for the tk variants and 50mmtk does better at finding structures closer to the "global" minimum, but 50nomm is still the best choice for avoiding >1 pKa errors.

So, either GBSA is not a good substitute for COSMO or MMFF is not a good substitute for PM3.  In light of this recent paper my money is on the latter.  So I'll go with 50nomm for now.


This work is licensed under a Creative Commons Attribution 4.0

Saturday, May 13, 2017

Computing apparent pKa values using QM energies of isodesmic reactions

A few days ago I presented some preliminary pKa results.  Almost all the molecules in the set have more than one ionisable group. Here's how I compute the apparent pKa values

The microscopic pKa values are computed by
\begin{equation}
\label{eqn:pka}
\mathrm{pK_a}=\mathrm{pK_a^{ref}} + \frac{\Delta G^\circ}{RT\ln (10)}
\end{equation}
where $\Delta G^\circ$ denotes the change in standard free energy for the isodesmic reaction
\begin{equation}
\mathrm{ BH^+ + B_{ref} \rightleftharpoons B + B_{ref}H^+ }
\end{equation}
If there is more than one titratable site then several protonation states can contribute to the apparent pKa.  Here I follow the approach described by Bochevarov et al. If $\mathrm{D}$ and $\mathrm{P}$ differ by one proton and there are $M$ deprotonated sites and $N$ protonated sites then
\begin{equation}
\mathrm{K_{app}}  = \frac{([\mathrm{D}_1] + [\mathrm{D}_2] + ... [\mathrm{D}_M])[\mathrm{H}^+]}{[\mathrm{P}_1] + [\mathrm{P}_2] + ... [\mathrm{P}_N]}
\end{equation}
which can be rewritten in terms of microscopic pKa values
\begin{equation}
\label{eqn:pkapp}
\mathrm{K_{app}}  = \sum^M_j \frac{1}{\sum^N_i 10^{pK_{ij}} }
\end{equation}
The sum contains microscopic pKa values for which more than one protonation state is changed. For example, a molecule with three ionizable groups $(\mathrm{B_1H^+B_2H^+B_3H^+})$ will have the following microscopic pKa value
\begin{equation}
K_{ij} = \frac{[\mathrm{B_1B_2H^+B_3}][\mathrm{H^+}]}{[\mathrm{B_1H^+B_2B_3H^+}]}
\end{equation}
in the expression for the apparent pKa value for going from the doubly to the singly protonated state. However, the error in such a pKa value is considerably higher due to less error cancellation and such pKa values are therefore neglected in the sum.

In the Bochevarov et al. implementation the user defines the titratable group for which the apparent pKa is computed.  However, in my approach all possible protonation states are considered, so the assignment of the apparent pKa value to a particular ionizable group is not immediately obvious.  Inspection of Eq $\ref{eqn:pkapp}$ shows that the largest microscopic pKa values will dominate the sum over $N$, while the smallest of these maximum pKa values will dominate the sum over $M$. Thus, the apparent pKa is assigned to the functional group corresponding to the microscopic pKa
\begin{equation}
\label{eqn:max}
pK^\prime_{ij} = \min_{j}(\{ \max_{i}(\{ pK_{ij} \} ) \} )
\end{equation}
In some cases there are several microscopic pKa values that contribute almost equally to the respective sums, and for these cases the apparent pKa value cannot meaningfully be assigned to a single ionizable site.  In my approach I include all microscopic pKa values that are within 0.6 pKa units of the respective maximum and minimum values defined in Eq $\ref{eqn:max}$.   I chose 0.6 because that is the site-site interaction energy below which two groups titrate simultaneously.
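The bookkeeping is easy to script. Here is a toy example with made-up pKa values (not my actual code, which is linked below):

import math

def apparent_pka(micro_pkas):
    # micro_pkas[j][i] = microscopic pKa connecting protonated state i
    # with deprotonated state j
    k_app = sum(1.0 / sum(10.0**pk for pk in row) for row in micro_pkas)
    return -math.log10(k_app)

# two deprotonated states, each connected to two protonated states
print(apparent_pka([[4.5, 5.2], [6.1, 6.8]]))  # about 5.3, dominated by min-max pK = 5.2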

You can find the code I wrote to do this here. It is not cleaned up yet.



This work is licensed under a Creative Commons Attribution 4.0

Thursday, May 11, 2017

Small molecule energy minimisation and conformational search using Tinker

In this post I outlined the (possible) need for conformational analysis using a continuum model. Anders Christensen reminded me that Tinker can do this so I looked into it. This blogpost and this GitHub repo were particularly helpful.  Here's what I found.

Generating atomic charges
Tinker has MMFF force field parameters except the atomic charges.  These can be generated from an sdf file using sdf2tinkerxyz (which relies on OpenBabel). When you download from SourceForge you get a bin file, which I didn't know what to do with so I downloaded the Linux executable from here instead.

./sdf2tinkerxyz -k default.key < xxx.sdf

creates yyy.xyz (coordinates) and yyy.key (charges), where yyy is the title in the sdf file (which I'll assume = xxx in the following). "default.key" is a file with Tinker keywords that is added to xxx.key. Here's mine

# Solvation Model
SOLVATE GBSA


#Force Field Parameters
PARAMETERS /opt/tinker/tinker/params/mmff.prm


Energy minimisation
To isolate the effect of continuum solvation on the pKa predictions, I want to generate conformers with RDKit and minimise them with MMFF/GBSA using Tinker.  This is done by

/opt/tinker/tinker/bin/minimize xxx.xyz -k xxx.key 0.01 > xxx.tout

"0.01" is the convergence criterium (in kcal/molÅ??). xxx.tout contains the energy information and a xxx.xyz_2 file is generated with the optimized coordinates.  This is in a format that OpenBabel can't handle, so I need to write a converter.  The main challenge is to translate the atom types into names. See the list called lookup in this code.

Conformational Search using SCAN
Tinker also has its own conformational search method. It looks like it's based on eigenvector following but I haven't looked closely at it yet.  This is done by

/opt/tinker/tinker/bin/scan xxx.xyz -k xxx.key 0 10 20 0.00001 > xxx.tout

Here I use the settings used in the DP4 GitHub repo.  "0" is automatic selection of torsional angles, "10" is the number of search directions (modes?), "20" is the energy threshold for local minima, and "0.00001" is the optimisation convergence criterion.  You get a bunch of coordinate files and xxx.tout holds the energy information.

I haven't really played with this option yet. If you have any tips, please leave a comment.


This work is licensed under a Creative Commons Attribution 4.0

Sunday, May 7, 2017

Predicting pKa values using PM3 - new code and conformer search

Disclaimer: These are preliminary results and may contain errors

The code I used in my previous pKa prediction paper had some limitations that I have now mostly removed: the setup of protonation states is automated and includes acids like acetic acid and phenol, there is rudimentary tautomer generation, and the reference molecules are chosen a bit more systematically.

I ran the new code on the same set of molecules and saw a few differences.  Some were expected and due to different references and overall protonation states, but most major differences were due to different conformations, even though I used the same code to generate conformers. This led me to look at the effect of conformer generation.

The original study used 20 MMFF-minimised conformations ("20mm") generated using RDKit as starting points for PM3/COSMO minimisations.  Now I've tried skipping the MMFF minimisation ("nomm"), 50 conformations with and without MMFF minimisation, and a scheme where I MMFF-minimise 50 or 100 conformations and select conformers that have energies within either 10 or 20 kcal/mol of the lowest energy ("XecutY") for each protonation state/tautomer.
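For reference, the conformer generation itself takes only a few lines of RDKit (a sketch of the general scheme; the details of my actual scripts differ):

from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles('CC(N)C(=O)O'))  # example molecule
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=50, randomSeed=42)

# the "mm" variants: MMFF-minimise every conformer
results = AllChem.MMFFOptimizeMoleculeConfs(mol)  # list of (not_converged, energy)

# the "ecut" variants: keep conformers within 10 kcal/mol of the minimum
energies = [energy for _, energy in results]
e_min = min(energies)
keep = [cid for cid, e in zip(conf_ids, energies) if e - e_min < 10.0]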



The RMSEs are very similar, but 50nomm has the fewest cases where the error in the pKa value (ΔpKa) is greater than 1 pH unit. However, the major outlier for 20mm and 20nomm is sparteine, for which 20mm and 20nomm actually find a conformer of the protonated state that is 4.9 kcal/mol lower than for 50nomm.

This lead me to look at the energies themselves. The numbers labelled ΔE in the table are computed as follows. For a given molecule I find the lowest energy for the protonated and deprotonated state for each method and identify the method with the lowest energy $E_{min}$. Then I count the number of molecules for which the difference to $E_{min}$ is greater than 1.36 kcal/mol and average that number for protonated and deprotonated to get one number per method.  

So there are 4 molecules for which 50nomm doesn't find something close to the "global" minimum for either the protonated or deprotonated state, or both. In principle, I should go on and try 100nomm to see if I can "converge", but that's starting to become pretty expensive and the point is to develop a practically useful tool.

The 100ecut20 results suggest that the MM gas-phase energetics don't represent the PM3/COSMO energetics all that well.  So it would be interesting to try an MM conformational search with an implicit solvent model. RDKit can't do this, but MacroModel and NAMD should be able to.  It also looks like the next release of OpenMM will have this capability.

Suggestions and comments most welcome.



This work is licensed under a Creative Commons Attribution 4.0

Saturday, April 15, 2017

Finding the function that is its own derivative

I saw this "derivation" some years ago but now I can't find it.  If anyone knows the source please let me know.

Let's say we want to find a function $f(x)$ for which
$$ \frac{d}{dx} f(x) = f^\prime (x) = f(x) $$
Well, how about $f(x) = x$?
$$ f(x) = x \implies f^\prime (x) = 1$$
No, because $f^\prime (x) \ne x$. OK, how about $f(x) = 1+x$?
$$ f(x) = 1+x \implies f^\prime (x) = 1$$
That gets me a little closer.  I get the "$1$" back, but I need something whose derivative will give me back the $x$:
$$ f(x) = 1+x+\tfrac{1}{2}x^2 \implies f^\prime (x) = 1 + x$$
Now I need something whose derivative will give me back the $\tfrac{1}{2}x^2$:
$$ f(x) = 1+x+\tfrac{1}{2}x^2 + \tfrac{1}{2 \cdot 3}x^3 \implies f^\prime (x) = 1 + x + \tfrac{1}{2}x^2$$
OK, so if I use infinitely many terms then this function fits the bill
$$ f(x) = 1+x+\tfrac{1}{2}x^2 + \tfrac{1}{2 \cdot 3}x^3 + \cdots \tfrac{1}{n!}x^n \cdots  \implies f^\prime (x) = 1 + x + \tfrac{1}{2}x^2 + \cdots \tfrac{1}{(n-1)!}x^{n-1} \cdots $$
Can I write $f(x)$ in some compact form?  Well, $f(0) = 1$ and $f(1) $ will give me some number, which I'll call $e$.
$$ f(1) = 1+1+\tfrac{1}{2} + \tfrac{1}{2 \cdot 3} + \cdots = e  $$
You only need a few terms before $e$ has converged to the first two decimal places (2.717).

For $x$ = 2 you need a few more terms to get the same precision, but the number (7.382) is roughly 2.717 x 2.717, an agreement that quickly gets better with a few more terms
$$ f(2) = 1+2+\tfrac{1}{2}4 + \tfrac{1}{2 \cdot 3}8 + \cdots = e^2  $$
So
$$ f(x) = e^x = 1+x+\tfrac{1}{2}x^2 + \tfrac{1}{2 \cdot 3}x^3 + \cdots \tfrac{1}{n!}x^n \cdots  $$
and
$$  \frac{d}{dx} e^x = e^x $$
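
If you want to check the convergence yourself, the partial sums are a few lines of Python:

from math import factorial

def f(x, n_terms):
    # partial sum 1 + x + x^2/2! + ... + x^(n-1)/(n-1)!
    return sum(x**n / factorial(n) for n in range(n_terms))

print(f(1, 6))   # 2.7166..., e to two decimal places already
print(f(2, 10))  # 7.3887..., close to f(1, 10)**2 = 7.3890...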


This work is licensed under a Creative Commons Attribution 4.0

Where does the ln come from in S = k ln(W) ? - Take 2


Some years ago I wrote this post. Now I want to come at it from a different angle.

A closed system spontaneously goes towards the state with maximum multiplicity, $W$. For a system with $N$ molecules with energies $\varepsilon_1, \varepsilon_2, \ldots $ we therefore want to find the values of $N_i$ that maximises $W(N_1, N_2, \ldots)$.

This is easier to do for $\ln W$ than $W$, which is fine because $W$ is a maximum when $\ln W$ is a maximum, and this happens when
$$ \frac{N_i}{N}= p_i = \frac{e^{-\beta \varepsilon_i}}{q} $$
$\beta$ can be found by comparison to experiment.  For example, the energy of an ideal monatomic gas with $N_A$ particles is
$$ U^{\mathrm{Trans}} = N_A \langle \varepsilon^{\mathrm{Trans}} \rangle = N_A\sum_i p_i \varepsilon_i = \frac{3N_A}{2\beta} = \tfrac{3}{2}RT \implies \beta = \frac{1}{kT} $$
where $R$ is determined by measuring the temperature increase due to adding a known amount of energy to the system.

So far we have used $\ln W$ instead of $W$ for convenience, but is there something special about $\ln W$? Yes, the change in $\ln W$ can be expressed quite simply
$$ d \ln W = \beta \sum_i \varepsilon_i dN_i = N \beta \sum_i \varepsilon_i dp_i =  \frac{dU}{kT} $$
So the change in internal energy $U$ due to a redistribution of molecules among energy levels is equal to the change in $\ln W$ (as opposed to $W$) multiplied by $kT$
$$ dU = Td\left( k \ln W \right) = TdS$$
The final question is whether there is something special about $\ln = \log_e$ as opposed to, say, $\log_a$ where $a \ne e$? Well, $\log_a W$ is a maximum when
$$ \frac{N_i}{N}= p_i = \frac{a^{-\beta \varepsilon_i}}{q} $$
There is an extra term in the derivation but that cancels out in the end.  So no change there.

What about $\beta$?  There are two changes.  The previous derivation of $U^{\mathrm{Trans}}$ relied on this relation (I'll drop the "Trans" label for the moment)
$$  \varepsilon_i e^{-\beta \varepsilon_i}  = - \frac{d }{d \beta} e^{-\beta \varepsilon_i} \implies \langle \varepsilon \rangle = - \frac{1}{q} \frac{dq}{d\beta} $$
which now becomes
$$  \varepsilon_i a^{-\beta \varepsilon_i}  = - \frac{1}{\ln(a)} \frac{d }{d \beta} a^{-\beta \varepsilon_i} \implies \langle \varepsilon \rangle = - \frac{1}{q \ln(a)} \frac{dq}{d\beta}$$
While $q$ has an extra $\ln (a)$ term, the derivative wrt $\beta$ is still the same and
$$ U^{\mathrm{Trans}} = N_A \langle \varepsilon^{\mathrm{Trans}} \rangle = N_A\sum_i p_i \varepsilon_i = \frac{3N_A}{2 \ln (a) \beta} = \tfrac{3}{2}RT \implies \beta = \frac{1}{\ln (a) kT} $$
So, $S = k \log_e W$ is special in the sense that the proportionality constant $k$ is the experimentally measured ideal gas constant divided by Avogadro's number.  In any other base we have to write either $S = k \ln (a) \log_a W$ where $k = R/N_A$ or $S = k^\prime \log_a W$ where $k^\prime = \ln(a)R/N_A$.

Clearly, $S = k \log_e W$ is the most natural choice, and this is because $e$ is the (only) value of $a$ for which
$$ \frac{d}{dx} a^x = a^x $$
In fact that is one way to define what $e$ actually is.



This work is licensed under a Creative Commons Attribution 4.0

Tuesday, February 28, 2017

Biotechnology and drug-design: My latest paper explained without the jargon

Our latest paper has just appeared in the open access journal Chemical Science. It's ultimately related to biotechnology and drug design so first some background.

Background
Most biotechnology and medicine involves proteins in some way. Many diseases involve mutations that alter the function of proteins, most drugs are molecules that bind to proteins and inhibit their actions, and a large part of industrial biotechnology involves making new types of proteins such as enzymes. Like with everything else, it is easier to deal with proteins if you know what they look like, but protein structure determination can be very difficult and we don't know what many important proteins actually look like.

The most popular way of determining protein structure is a technique called x-ray crystallography where you basically take an x-ray of a crystal made from the protein. Unfortunately, it can be very difficult or impossible to grow crystals of some proteins, and if you can't get the protein to crystallise you can't use x-ray crystallography to find the structure.  The other main way of determining protein structure is a technique called NMR spectroscopy where you basically take an MRI of a solution containing the protein. The advantage is that there is no need for crystallisation, but the disadvantage is that it is difficult to extract enough information from the "NMR-MRI" to get a good structure.

The "NMR-MRI" of a protein actually provides a unique fingerprint of each protein so in principle all one has to do is generate a lot of possible structures of a protein, compute the NMR fingerprint for each, and compare to the measured fingerprint. The structure with the best fingerprint match should be the correct protein structure.  The questions are how to best generate the structure and how to best predict the NMR fingerprint using the structure.

The New Study
In 2015 we published a new method for predicting NMR fingerprints and in the paper that just got published we combined it with a method for generating a lot of protein structures. We started with known x-ray structures and generated millions of relatively small variations of the structure and found the structure with the best match.  We started from a known structure to answer the question: what is the best match we can hope for? The answer is: not perfect but good enough.  Now that we know this the next step will be to start with a structure we know is wrong and see if the program can find the right structure.  Also, our NMR fingerprint method does not generate fingerprints for all parts of the protein so we need to improve the model as well.




This work is licensed under a Creative Commons Attribution 4.0

Monday, February 13, 2017

What are the most important reactions in drug synthesis and stability?

This is a rough draft that I hope to flesh out based on feedback, i.e. "I'm asking, not telling". Please leave a comment or tweet me

Synthesis
1. Aromatic electrophilic substitution of heteroaromatics
2. Suzuki coupling of heteroaryl halides
3. Diels-Alder reaction (of what)?
4. Michael reaction (of what)?
5. Friedel-Crafts alkylation (of what)?
6. Nazarov cyclization reaction (of what)?
7. ...

Stability (probably solvolysis and oxidative degradation, but can we be more specific?)
1. Autooxidation of C-H bonds?
2. Ester hydrolysis?
3. ...



This work is licensed under a Creative Commons Attribution 4.0

Sunday, February 5, 2017

Preprints and the speed of publishing

When I talk about preprints with colleagues some of them say "Oh, what's the rush?" or "Publishing is so fast these days. Why, my last paper was online 6 weeks after submission don't you know" and then go on to clean their pipe with a thoughtful smile or get up to stoke the coal fire.

I'm currently writing a couple of proposals, and as I was updating the reference list on one of them I noticed that two preprints I cited apparently still haven't been "published" by a journal.  One of them first appeared on arXiv on October 7th and the other on October 27th.

Both papers are important to the proposal in the sense that they changed my thinking on what is possible and I think the proposal would be less ambitious if I hadn't read them. So I am very happy these authors chose to deposit them as preprints.  I wish more people would do this and I wish fewer journals/journal editors would stand in the way of them doing so.


This work is licensed under a Creative Commons Attribution 4.0

Saturday, January 28, 2017

Drug design: My latest paper explained without the jargon

Our latest paper has just appeared in the Journal of Physical Chemistry A.  If you don't have access to this journal you can find a free version of an earlier draft here. It's ultimately related to making better drugs so first some background.

Background
Designing new drugs currently involves a lot of trial and error, so you have to pay a lot of smart scientists a lot of money for a long time to design new drugs - a cost that is ultimately passed on to you and me as consumers.  There are many, many reasons why drug design is so difficult. One of them is that we often don't know fundamental properties of drug candidates such as the charge of the molecule at a given pH. Obviously, it is hard to figure out whether or how a drug candidate interacts with the body if you don't even know whether it is positive, negative or neutral.

It is not too difficult to measure the charge at a given pH, but modern-day drug design involves the screening of hundreds of thousands of molecules and it is simply not feasible to measure them all. Besides, you have to make the molecules to do the measurement, which may be a waste of time if they turn out to have the wrong charge. There are several computer programs that can predict the charge at a given pH very quickly, but they have been known to fail quite badly from time to time.  The main problem is that these programs rely on a database of experimental data, and if the molecule of interest doesn't resemble anything in the database this approach will fail.

Last year we developed a "new" method for predicting the charge of a molecule that relies less on experimental data but is fast enough to be of practical use in drug design. We showed that the basic approach works reasonably well for small prototypical molecules, and we even tested one drug-like molecule where one of the commercial programs fails and showed that our new method performs better (but not great).

The New Study
We test the method on 48 drug molecules and show that it works reasonably well.  It is not quite as accurate as the methods that rely on experimental data, but this is probably because many of the molecules we test are in the databases that those programs use.  But we felt we had to test these molecules first because they are among the first molecules other users will use to test the method. The next step is to test the method on molecules where some of the existing methods perform poorly. We also have to think about how best to make this method available to researchers who are actually doing the drug design.



This work is licensed under a Creative Commons Attribution 4.0 

Saturday, January 21, 2017

Prediction of the Regioselectivity of Electrophilic Aromatic Substitution Reactions of Heteroaromatic Systems Using Semi-Empirical Quantum Chemical Methods

Art Winter tweeted this paper by Morten Jørgensen and co-workers last year and I decided to see if semi-empirical methods could help here.  The paper uses ChemDraw's chemical shift predictor to predict where a bromine atom will be added to a heteroaromatic molecule using electrophilic aromatic substitution reactions. They tested this on 132 different compounds and achieved an 80% success rate, which is very good.

Googling a bit led me to this paper by Wang and Streitwieser, where they show a correlation between the rate of electrophilic aromatic substitution reactions and the lowest proton affinity of the protonated species.  This suggests that the protonated carbon with the lowest proton affinity (or pKa if solvent is included) should be the reacting carbon.  So I tested this using semiempirical QM methods for these 132 compounds.  When I say "I" I should say that +Jimmy Charnley Kromann ran many of the calculations and Monika Kruszyk provided most of the structures as ChemDraw files, which I could convert to SMILES strings using OpenBabel. These are preliminary results and may contain errors.

The reactions for the 132 compounds are not all run in the same solvent, so I first tested gas phase, chloroform (i.e. dielectric 4.8) and DMF (dimethylformamide, dielectric 37) using PM3 and COSMO in MOPAC. I chose PM3/COSMO because that gave the best results in a previous pKa study. The most representative choice of solvent seems to be chloroform, where PM3/COSMO predicted the correct bromination site in 95% of the cases, i.e. it fails for 7 cases. Gas phase and DMF fail for 14 and 8 cases, respectively, so it's important to include solvent, but the value of the dielectric constant is not all that important.  Using chloroform as the solvent, I then tested AM1, PM6, PM6-DH+, PM7 and DFTB3/SMD (using GAMESS for the last one), which resulted in 12, 12, 12, 9, and 13 wrong predictions, respectively. One of the compounds includes an Si atom, which the DFTB3 parameter set I used couldn't handle, so the 13 wrong predictions are out of 131 compounds.  Anyway, PM3/COSMO/chloroform works best.

In some cases the lowest pKa value is quite close to some of the other pKa values, so I took an approach similar to that of Jørgensen and co-workers: if the correct bromination site is included in the set of atoms with pKa values within 0.74 pH units (corresponding to 1 kcal/mol at room temperature) of the lowest value, then I counted it as correct.  For PM3/COSMO/chloroform this occurred 10 times. In 9 cases the set included 2 atoms and in 1 case, 3 atoms.  In one of the 9 cases (15) there are only two possible bromination sites, so this case is not a successful prediction and PM3/COSMO/chloroform actually gets 8 wrong. However, in all other cases there are more possibilities than those predicted. Furthermore, in all but 2 of these 10 cases the atom with the lowest pKa is the "correct" atom.

Bromination, or more generally halogenation, is often a first step towards adding an aryl group, usually via a Suzuki reaction.  Often there is more than one halogen of the same type, so there is also interest in predicting where the aryl group will go.  I tried the PM3/COSMO/chloroform approach on the six molecules in this paper by Houk and co-workers. Computing pKa's of the halogenated carbon atoms led to correct predictions in 4 of the 6 cases, while computing proton affinities of the carbon atoms in the non-halogenated parent compounds led to correct predictions in 2 of the 6 cases. The former approach seems promising but needs to be tested on a much larger set of molecules.

The next step is to write this up and get the setup and analysis code into a shape that we can distribute. I've also started thinking about how to make the approach more generally available and usable for non-experts. A grant proposal is also in the works, so if we're successful that should definitely be possible to achieve.


This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Monday, January 16, 2017

Open access chemistry publishing options in 2017

I just noticed that my go-to journal increased its APC again.*  Now there's a flat fee of $1095, so I am re-evaluating my options for impact-neutral OA publishing. I don't think PeerJ is greedy, so the most likely explanation is that their old model was not sustainable.  I now feel I have been a bit too hard on some other OA publishers (e.g. here and here, but not here).

While price and impact-neutrality are the main considerations, open peer review is a nice bonus that I became accustomed to at PeerJ. In my experience it makes for much better reviews and keeps the tone civil.

Impact neutral journals

$0. Royal Society Open Science has an APC waiver and open peer review. In 2018 the APC will become £900. (The RSC manages "the journal’s chemistry section by commissioning articles and overseeing the peer-review process")

€750. Research Ideas and Outcomes (disclaimer: I am subject editor), open peer review.

$1000 F1000Research. Open peer review

$1095 PeerJ. Open peer review.

$1350 Cogent Chemistry. Has a "pay what you can" policy. Closed peer review. HT +Stephan P. A. Sauer

$1495 PLoS ONE. Closed peer review.

$1675 Scientific Reports. Closed peer review

$2000 ACS Omega. Price for CC-BY by ACS member ($140/year). Closed peer review.

So it looks like Royal Society Open Science is the next thing for me to try, as long as the APC waiver is in place.

Free or reasonably priced journals that judge perceived impact

$0 Chemical Science Closed peer review

$0 Beilstein Journal of Organic Chemistry. Closed peer review. HT +Wendy Patterson

$0 Beilstein Journal of Nanotechnology. Closed peer review. HT +Wendy Patterson

$500 ACS Central Science Price for CC-BY by ACS member ($140/year). Closed peer review.

£500 RSC Advances. Closed peer review. (Normally £750)

Let me know if I have missed anything.

Last update: 2017.03.05

*I just noticed that the membership model still exists though the price has increased. I already have a premium membership, so this may still be a viable option for me. If you are a single author or have only one co-author this is still the way to go.


This work is licensed under a Creative Commons Attribution 4.0

Sunday, January 15, 2017

Making your computational protocol available to the non-expert

I recently read this paper by Jonathan Goodman and co-workers which I learned about through this highlight by Steven Bachrach.  The DP4 method is a protocol for computing chemical shifts of organic molecules using DFT and comparing the chemical shifts to experimental values.  This paper automates the method, switches to free software packages (NWCHEM instead of Gaussian and TINKER instead of Macromodel), and tests the applicability for drug like molecules.  The python and Java code is made available on Github under the MIT license.

I like everything about this paper and what follows is not a criticism of this paper.

The method is clearly aimed at organic chemists who use NMR to figure out what they made or isolated. Let's say they want to try DP4 to see how well it works on some molecule they are currently working on.

What's needed to get started
1. Access to a multicore Linux computer.  The method requires quite a few B3LYP/6-31G(d,p) NMR calculations and, given the typical size of organic molecules, it will probably not be practically possible to even test this method on a desktop computer.  Even if it is, the instructions for PyDP4 assume you are using Linux, so you'd have to somehow deal with that if you, like many, have a Windows machine.

2. Installation. You have to install NWCHEM, Tinker, OpenBabel and configure PyDP4.

3. Coordinates. PyDP4 requires an sdf file as input.  You have to figure out what that is and how to make one.

4. Familiarity with Linux.  All this assumes that you are familiar with Linux. How many synthetic organic chemists are?

If you'll be using DP4 a lot, all of this may be worth doing but perhaps not just to try it?  If you don't have access to a Linux cluster, buying one for the occasional NMR calculation may be hard to justify. If one is convinced/determined enough, the most likely solution would probably be to find and pay an undergrad to do all this using an older computer you were gonna throw out anyway.  Or maybe your department has a shared cluster and a sysadmin who could handle the installation.

Alternative 1: Web server
One alternative is to make DP4 available as a web server, where the user can upload the sdf file and other data.  If one includes a GUI all 4 problems are solved ... for the user.  The problem for the developer is that this could eat up a lot of your own computational resources. One could probably do something smart to only use idle cycles, but the best case scenario (lots of users) also becomes the worst case scenario.  Perhaps there's a way to crowdsource this?

Alternative 2: Virtual machine
Another alternative is to make DP4 available as a virtual machine (VM).


This mostly solves the installation issue. The main problem here is that the user still needs to find a reasonably powerful computer to run this on. The other problem is that the developer needs to test the VM installation on various operating systems and keep up to date as new ones appear. Perhaps there's a way to crowdsource all this?

Alternative 3: Amazon Web Services or Google Compute Engine
Another alternative is to make DP4 available as a VM image for AWS or GCE.  This mostly solves the CPU and installation issue. The user creates an AWS or GCE account and imports the VM image and then pays Amazon and Google for computer time using a credit card. For reasonably sized molecules the cost would probably be less than $10/molecule as far as I can tell.

I don't have any direct experience with AWS or GCE so I don't know how slick the interface can be made. All examples I have seen have involved ssh to the AWS/GCE console, so some Linux knowledge is required.

Alternative 4: AWS/GCE-based Web server
Another alternative is to combine 2 and 3. The problem here is how to bill the individual user for their CPU usage. There are probably ways to do this, but it's starting to sound like a lot of work to set up and manage. Perhaps by adding a surcharge one could pay someone to handle this on a part-time basis.  Perhaps existing companies would be interested in offering such a service?

Licensing issues
As far as I can tell the licenses of NWCHEM, TINKER, and OpenBabel allow for all 4 alternatives.  

The bigger issue
A key step in making a computational chemistry-based method such as DP4 usable to the non-expert is clearly automation and careful testing.  Another is using free software (I have access to Gaussian but I am not going to buy Macromodel just to try out DP4!). Kudos to Goodman and co-workers for doing this. But if we want to target the non-experts, I think we should try to go a bit further. One could even imagine something like this in the impact/dissemination section of a proposal:
The computational methodology is based on free software packages, and the code needed for automation and analysis, written as part of the proposed work, will be made available on GitHub under an open source license.  Furthermore, Company X will make the approach available on the AWS cloud computing platform, which will allow the non-expert to use the approach without installation or investment in in-house computational resources and greatly increase usage. Company X handles the set-up, billing for on-demand CPU-time, usage statistics, and provides a rudimentary GUI for the approach for a one-time fee of $2000, which is included in the budget.
Anyway, just some thoughts.  Have I missed other ways of getting a relatively CPU-intensive computational chemistry method into the hands of non-experts?


This work is licensed under a Creative Commons Attribution 4.0

Saturday, January 7, 2017

Planned papers for 2017

A year ago I thought I'd probably publish three papers in 2016:

Listed as probable in 2016
1. Benchmarking of PM6 and DFTB3 for barrier heights computed using enzyme active site models.
2. pKa prediction using PM6 - part 1
3. Protein structure refinement using ProCS15 - starting from x-ray structure

and this basically turned out to be correct, as you can see from the links, except that paper number 3 is officially published in 2017 because Chemical Science still uses issues. So I will have to list it as a 2017 paper, meaning I published two papers in 2016.  Not my best year.

Here's the plan for 2017

Accepted
1. Protein structure refinement using a quantum mechanics-based chemical shielding predictor
2. Prediction of pKa values for drug-like molecules using semiempirical quantum chemical methods

Probable
3. Intermolecular Interactions in the Condensed Phase: Evaluation of Semi-empirical Quantum Mechanical Methods
4. Fast Prediction of the Regioselectivity of Electrophilic Aromatic Substitution Reactions of Heteroaromatic Systems Using Semi-Empirical Quantum Chemical Methods
5. Benchmarking cost vs. accuracy for computation of NMR shielding constants by quantum mechanical methods
6. Improved prediction of chemical shifts using machine learning
7. PM6 for all elements in GAMESS, including PCM interface

Probably not in 2017
8. Protonator: an open source program for the rapid prediction of the dominant protonation states of organic molecules in aqueous solution
9. pKa prediction using semi-empirical methods: difficult cases
10. Prediction of C-H pKa values and homolytic bond strengths using semi-empirical methods
11. High throughput transition state determination using semi-empirical methods



This work is licensed under a Creative Commons Attribution 4.0


Reviews of Prediction of pKa values for drug-like molecules using semiempirical quantum chemical methods

I have been remiss in posting reviews of my papers. I submitted the paper to Journal of Physical Chemistry A on November 2, 2016, received first round of reviews November 29, and second round of reviews December 12.  The paper was accepted January 5, 2017 and has appeared online.

Round 1
Reviewer(s)' Comments to Author:

Reviewer: 1

Recommendation: This paper is not recommended because it does not provide new physical insights.

Comments:
This is an interesting study on very important subject - prediction of pKa for drug-like molecules. Standard free energy of a molecule is determined as the sum of heat of formation/electronic energy and solvation free energy and these terms are obtained by various semiempirical QM (SQM) methods and two continuous solvent models. Author used SQM methods as a black box and compared them on the basis of their performance to predict pKa. This is, however, not justified since the SQM methods used described differently system under study. For example, PM6-DH+ describes well H-bonding and dispersion energy contrary to e.g. PM3 and AM1. Consequently, structures stabilized by H-bonding and dispersion will be described much better by the former method. Further, PM7 was parametrized to cover dispersion in core parametrization, contrary to PM6 (and PM3) where it should be included a posteriori by e.g. DH+ term. Consequently, PM7 should be also better suited than, e.g. PM6. The question arises how good those methods work and here performance of these methods should be compared with some higher-level method like DFT.

Further, SQM methods were in the last 5 years already used for protein - ligand interactions but these papers were not mentioned at all.

On the basis of above-mentioned arguments I cannot recommend the paper for publication in JPC.

Reviewer: 2

Recommendation: This paper is publishable subject to minor revisions noted.  Further review is not needed.

Comments:
This is simply excellent work on an important topic. The only thing is that the author could put the importance of his work in an even greater perspective. Semi-empirical methods are becoming increasingly important also in materials science and the pKa is of high importance also in this field, as it is a good indicator of general chemical stability (like it is used in organic chemistry) of molecular (especially organic) materials for technical applications. A recent example is the search for new organic electrolyte solvents for Lithium-air battery devices, where current design principles strongly rely on pKa values (see for instance http://pubs.rsc.org/en/Content/ArticleLanding/2015/CP/C5CP02937F#!divAbstract ).

Round 2
Reviewer(s)' Comments to Author:

Reviewer: 1

Recommendation: This paper is not recommended because it does not provide new physical insights.

Comments:
Since the ms was not modified according my comments I cannot recommend it for publication.


Reviewer: 3

Recommendation: This paper is publishable subject to minor revisions noted.  Further review is not needed.

Comments:
This paper evaluates a number of semi-empirical quantum mechanical (SQM) methods for their suitability in calculating the pKa’s of amine groups in drug-like molecules, with the hope that these methods can be used for high-throughput screening.  This paper is suitable for publication in the special issue, subject to minor revision.

(a) The paper shows that pKa’s calculated by some SQM methods is sufficiently accurate for high-throughput screening.

(b) Indicate the accuracy of related QM calculations (e.g. Eckert and Klamt) and the relative cost of QM vs SQM calculations (order of magnitude will do)

(c) How much better is the SQM approach than the empirical methods cited by the author? (add a comparison in the tables)

(d) The need for 26 reference compounds for 53 amine groups in 48 molecules is disturbingly high (so much so that the null hypothesis has errors only a factor of 2 larger than the best results). What are the errors in the SQM calculated pKa’s if a much smaller number of reference compounds are used? (e.g. 6 or less)  If the errors are acceptable, this could make it possible to automate the procedure so that it could be used to screen larger sets of molecules extracted from typical industrial databases (10,000 – 10,000,000 compounds).



This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Reviews of Protein structure refinement using a quantum mechanics-based chemical shielding predictor

I have been remiss in posting reviews of my papers. I received this review on November 11, 2016 of a manuscript I submitted to Chemical Science on September 29, 2016.  The paper was accepted November 17 and has appeared online.

REVIEWER REPORT(S):
Referee: 1

Recommendation: Accept

Comments:
Review of 'Protein Structure Refinement Using a Quantum Mechanics-Based
          Chemical Shielding Predictor'


The authors present a method to refine protein structures with respect to
chemical shifts evaluated by their QM-based ProCS15 method. First applications
to a set of different protein structures showed that small structural changes
lead to a significant reduction of the RMSD.

Empirical methods to predict NMR shifts have shown to be able to deliver
results that correlate well with experimental at almost no computational cost,
in particular in comparison with quantum chemical methods. However, these
methods are also insensitive with respect to structural changes of the
molecular structure. In this work, the authors analyse their empirical ProCS15
method, which is parametrized based on quantum chemical reference calculations,
with respect to structural changes in the molecular geometry. First examples
show that their method has a similar high sensitivity with respect to structure
changes as quantum chemical methods. The results indicate that ProCS15 can
hold a 'predictive power' beyond previous empirical methods, i.e., in
applications to more exotic molecular geometries and conformations.

The manuscript is well written and of appropriate length, and certainly of
great interest for the readers of Chemical Science. The presented applications
have been thoroughly analyzed and results are well outlined for the reader.
Since I've have only a few comments/suggestions, no further revision prior to
publication is necessary. However, I would strongly suggest to consider my
suggestion on the ordering of sections (see below).

Comments:

+ My main point is actually regarding to the order of sections in the
 manuscript.  Since the different methods used are constantly refered to in
 the result-section, I would recommend to first outline the
 theory/computational methodology and then present the results of the
 illustrative calculations on the test systems.

+ In the summary, the authors mention that their method might be used to
 improve the accuracy of QM or QM/MM calculations of NMR chemical shifts.
 It is certainly difficult to judge the quality of the ProCS15-optimized
 structures objectively, i.e., without refering to secondary properties like
 NMR shifts. However, it would be interesting to see the impact of the
 structural changes in quantum chemical calculations.
 This point might be beyond the scope of this work, but is certainly worthwile
 to be considered by the authors as a future project.

+ Just a comment on the DFT-based reference calculations used to parametrize
 the ProCS15 method: It might be worthwile considering the use of the KT2
 functional by Keal and Tozer [JCP 119, 3015 (2003)] and the basis sets
 pcS-x/pcSseg-x by Frank Jensen [JCTC 4, 719 (2008):JCTC 10, 1074 (2014)].
 Both functional and basis sets are optimized for NMR chemical shift
 calculations. A benchmark of those method was done by Flaig et al. [JCTC 10,
 572 (2014)].


This work is licensed under a Creative Commons Attribution 3.0 Unported License.