Rowan pKa: Fast and Accurate Prediction of pKa Values with Minimal Empiricism

why we chose this problem; how Rowan pKa works; benchmarking on real systems; the future of ML + quantum chemistry

Mar 08, 2024

We’re excited to announce the release of Rowan pK_a, our new workflow for predicting the acid dissociation constants (pK_as) of small molecules! Developing this workflow has been a legitimate scientific undertaking, and in parallel to this launch we’re publishing a preprint describing our methodology both on ChemRxiv and on our website.

Here’s what pK_a in Rowan looks like - this calculation took about 160 s to run.

Understanding a molecule’s pK_a is incredibly important: pK_a values dictate whether a molecule will be ionized or neutral at a given pH, which can be used to predict solubility, membrane permeability, blood–brain-barrier penetration, hERG toxicity, phospholipidosis risk, and much more. Accurate pK_a predictions help medicinal chemists design compounds with the desired physicochemical and pharmacological parameters, making pK_a calculation a problem of “immense interest” in computational chemistry.
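To make the link between pK_a and ionization state concrete, here is a minimal sketch of the Henderson–Hasselbalch relationship for a single titratable site. The function names and the example pK_a values are illustrative assumptions, not part of Rowan's workflow; for a basic site, the pK_a passed in is that of the conjugate acid.

```python
def fraction_deprotonated(pka: float, ph: float) -> float:
    """Henderson-Hasselbalch: fraction of a single site that is deprotonated at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def fraction_ionized(pka: float, ph: float, is_acid: bool) -> float:
    """Fraction of the site that carries a charge.

    An acid (HA -> A-) is charged when deprotonated; a base (BH+ -> B) is
    charged when protonated. For a base, `pka` is the conjugate-acid pK_a.
    """
    deprotonated = fraction_deprotonated(pka, ph)
    return deprotonated if is_acid else 1.0 - deprotonated


# Hypothetical example sites at physiological pH (values chosen for illustration only).
PHYSIOLOGICAL_PH = 7.4
for label, pka, is_acid in [("acidic site", 4.4, True), ("basic site", 9.4, False)]:
    charged = fraction_ionized(pka, PHYSIOLOGICAL_PH, is_acid)
    print(f"{label}: pK_a = {pka}, fraction ionized at pH {PHYSIOLOGICAL_PH} = {charged:.3f}")
```

Because the dependence is exponential, an error of one pK_a unit shifts the ratio of charged to neutral species by a factor of ten, which is why prediction accuracy matters so much for downstream properties.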

Generally, pK_a values are predicted through two different approaches: quantitative structure–property relationships (QSPR) generated by fits to empirical data, or quantum chemistry-based methods that directly compute the difference in energy between the protonated and deprotonated forms of the molecule. Both methods have strengths and weaknesses. QSPR methods are very fast and display good accuracy for compounds like those in their training set, but struggle to generalize to unseen molecules and often can’t handle effects that depend on 3D structure, like intramolecular hydrogen bonding. In contrast, quantum chemical methods for pK_a prediction often perform very well for unseen compounds and naturally handle conformational effects, but are often too slow to be practical.
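The energy-difference idea behind the quantum chemistry-based approach can be sketched with the standard thermodynamic relation pK_a = ΔG / (RT ln 10), where ΔG is the aqueous free energy of deprotonation (HA → A⁻ + H⁺). The snippet below is only an illustration of that relation, not Rowan's implementation: the function names are hypothetical, and it assumes ΔG already includes a value for the solvated proton. Practical workflows often add empirical corrections on top.

```python
import math

GAS_CONSTANT_KCAL = 1.98720425864e-3  # kcal/(mol*K)


def kcal_per_pka_unit(temperature_k: float = 298.15) -> float:
    """Free-energy change (kcal/mol) that corresponds to one pK_a unit: RT ln 10."""
    return GAS_CONSTANT_KCAL * temperature_k * math.log(10)


def pka_from_free_energy(delta_g_kcal: float, temperature_k: float = 298.15) -> float:
    """Convert an aqueous deprotonation free energy (kcal/mol) into a pK_a.

    Assumes delta_g_kcal = G(A-) + G(H+) - G(HA), all in solution.
    """
    return delta_g_kcal / kcal_per_pka_unit(temperature_k)


print(f"One pK_a unit at 298.15 K corresponds to {kcal_per_pka_unit():.2f} kcal/mol")
```

At room temperature one pK_a unit corresponds to roughly 1.4 kcal/mol, so small errors in the computed energies translate directly into noticeable pK_a errors.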

The Rowan pK_a workflow follows the same philosophy as quantum chemistry-based pK_a prediction methods, but with one important difference: instead of running any quantum chemical calculations, we run calculations with AIMNet2, a recently released machine-learned interatomic potential from Olexandr Isayev and co-workers. AIMNet2 is orders of magnitude faster than density-functional theory and performs reasonably well on main-group thermochemistry benchmarks, making it well suited for pK_a prediction.

Of course, building a black-box pK_a prediction workflow is a little complex, and we had to figure out how to make all the other steps fast enough that the speed of AIMNet2 can shine. Here’s a visual overview of everything that Rowan does in a pK_a calculation:

We’ve put a lot of work into benchmarking this method: in the associated preprint, you can see how Rowan pK_a performs on eight different datasets, including medicinal chemistry datasets selected to mimic real-world usage. At a high level, the mean absolute error is typically around 1 pK_a unit, although as molecules get larger and more conformationally complex this error naturally increases.
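For readers who want to run the same kind of comparison on their own compounds, the mean absolute error is simple to compute. This is a minimal sketch with a hypothetical function name; the numbers in the usage lines are placeholders for illustration, not results from the preprint.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], experimental: Sequence[float]) -> float:
    """Average of |predicted - experimental| over paired pK_a values."""
    if len(predicted) != len(experimental) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)


# Placeholder values for illustration only (not data from the preprint).
placeholder_predicted = [4.0, 9.5, 7.0]
placeholder_experimental = [4.5, 9.0, 8.0]
print(f"MAE = {mean_absolute_error(placeholder_predicted, placeholder_experimental):.2f} pK_a units")
```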

Since Rowan pK_a takes conformations into account, unlike most QSPR methods, we can start to explore interesting effects in large molecules. We modeled some data from Andrei Yudin and co-workers showing that cyclization dramatically attenuated the basicity of a pyrrolidine embedded in a peptide. The match isn’t perfect, but Rowan pK_a gets the trend in basicity right!

Rowan pK_a is also fast enough for routine usage on libraries of compounds: we looked at the first 100 rings from Peter Ertl’s new database of medicinal chemistry-relevant ring systems, and found that the pK_a calculations took, on average, 23.1 s to complete (on an EC2 c5.2xlarge instance). Similarly, this pK_a calculation took only 30 s to generate three distinct microscopic values:

Our workflow is good, but it isn’t perfect: Rowan pK_a is not as fast as QSPR, and conformationally complex molecules can still take a long time. Predicting the pK_a values of upadacitinib, AbbVie’s Janus kinase inhibitor, involved 5 microscopic pK_a calculations, 824 distinct conformations, and 60 AIMNet2 optimization/frequency calculations, and took 33 minutes in total. Rowan pK_a also displays higher errors than state-of-the-art quantum chemistry-based approaches. AIMNet2 isn’t yet as accurate as high-level DFT methods with big basis sets, and these inaccuracies show up in the final values.

Nevertheless, we think that Rowan pK_a is an exciting and useful addition to the computer-assisted drug design toolbox, and we’re excited to start testing it out in the real world. If you already have a Rowan account, you can start running pK_a calculations today through the same interface you’re already familiar with—and if you don’t have a Rowan account, you can make one here! We’re working on building the infrastructure needed to allow for pK_a prediction at scale (i.e. not through the web browser); if you want to run lots of calculations, please contact us and we can figure out a solution for you and your group/company.

More generally, we’re excited to apply this same playbook—using general pre-trained ML potentials as a surrogate for electronic structure theory—to other tasks in computational chemistry. If there’s a particular problem that interests you, please reach out and let us know!

Jonathon Vandezande

Mar 8, 2024

Great work! This looks like a really interesting workflow, and is another nice point on the Pareto frontier.

Do you find that the AIMNet2 filtering of conformers has a significant effect on the chosen conformers relative to the GFN2-xTB energies? My work in using GFN2-xTB conformational energies to screen catalysts has been promising, but I'm always wondering about using a generalized ML force-field to select conformers based on energy, as the GFN2-xTB energies are often lacking, but DFT energies are expensive.

Rowan Newsletter