Rowan pKa: Fast and Accurate Prediction of pKa Values with Minimal Empiricism
why we chose this problem; how Rowan pKa works; benchmarking on real systems; the future of ML + quantum chemistry
We’re excited to announce the release of Rowan pKa, our new workflow for predicting the acid dissociation constants (pKas) of small molecules! Developing this workflow has been a legitimate scientific undertaking, and in parallel to this launch we’re publishing a preprint describing our methodology both on ChemRxiv and on our website.
Understanding a molecule’s pKa is incredibly important: pKa values dictate whether a molecule will be ionized or neutral at a given pH, which can be used to predict solubility, membrane permeability, blood–brain-barrier penetration, hERG toxicity, phospholipidosis risk, and much more. Accurate pKa predictions help medicinal chemists design compounds with the desired physicochemical and pharmacological parameters, making pKa calculation a problem of “immense interest” in computational chemistry.
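To make the link between pKa and ionization state concrete, here is a minimal sketch based on the Henderson–Hasselbalch relation for a single ionizable site. It is a textbook illustration, not part of the Rowan workflow; the function name `fraction_ionized` and the example values are hypothetical.

```python
def fraction_ionized(pka: float, ph: float, kind: str) -> float:
    """Fraction of a single ionizable site that carries a charge at a given pH.

    kind="acid": HA <-> A- + H+   (the charged form is the deprotonated A-)
    kind="base": BH+ <-> B + H+   (the charged form is the protonated BH+;
                                   pka is that of the conjugate acid BH+)
    Assumes one independent site and ideal dilute-solution behavior.
    """
    if kind == "acid":
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    if kind == "base":
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    raise ValueError("kind must be 'acid' or 'base'")


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    for label, pka, kind in [("carboxylic acid", 4.8, "acid"), ("amine", 9.5, "base")]:
        print(label, round(fraction_ionized(pka, ph=7.4, kind=kind), 4))
```

Because the charged fraction depends on 10 raised to the difference between pKa and pH, an error of one pKa unit can shift the predicted neutral fraction by roughly a factor of ten when the site is mostly ionized.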
Generally, pKa values are predicted through two different approaches: quantitative structure–property relationships (QSPR) generated by fits to empirical data, or quantum chemistry-based methods that directly compute the difference in energy between the protonated and deprotonated forms of the molecule. Both methods have strengths and weaknesses. QSPR methods are very fast and display good accuracy for compounds like those in their training set, but struggle to generalize to unseen molecules and often can’t handle effects that depend on 3D structure, like intramolecular hydrogen bonding. In contrast, quantum chemical methods for pKa prediction often perform very well for unseen compounds and naturally handle conformational effects, but are often too slow to be practical.
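The thermodynamic relationship behind the energy-based approach can be sketched in a few lines. This is the standard conversion between a deprotonation free energy and a pKa, not a description of any particular implementation; the function names are hypothetical, and the reference-compound form is shown because it sidesteps the free energy of the solvated proton, which is hard to compute directly.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def pka_from_delta_g(delta_g_kcal: float, temperature: float = 298.15) -> float:
    """Convert an aqueous deprotonation free energy (kcal/mol) into pKa units.

    Uses pKa = dG / (R * T * ln 10). The input must already include every
    contribution to the reaction free energy, including the proton term.
    """
    return delta_g_kcal / (R_KCAL * temperature * math.log(10))


def relative_pka(delta_g_kcal: float, delta_g_ref_kcal: float,
                 pka_ref: float, temperature: float = 298.15) -> float:
    """Estimate a pKa relative to a reference acid with a known pKa.

    Both free energies are G(deprotonated) - G(protonated) computed the same
    way, so the proton free energy cancels in the difference.
    """
    return pka_ref + pka_from_delta_g(delta_g_kcal - delta_g_ref_kcal, temperature)
```

At 298.15 K, one pKa unit corresponds to about 1.36 kcal/mol, so the underlying energies need to be accurate to roughly that level for an energy-based prediction to land within one pKa unit.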
The Rowan pKa workflow follows the same philosophy as quantum chemistry-based pKa prediction methods, but with one important difference: rather than running any quantum chemical calculations, we run calculations with AIMNet2, a recently released machine-learned interatomic potential from Olexandr Isayev and co-workers. AIMNet2 is orders of magnitude faster than density-functional theory and performs reasonably well on main-group thermochemistry benchmarks, making it well suited for pKa prediction.
Of course, building a black-box pKa prediction workflow is a little complex, and we had to figure out how to make all the other steps fast enough that the speed of AIMNet2 can shine. Here’s a visual overview of everything that Rowan does in a pKa calculation:
We’ve put a lot of work into benchmarking this method: in the associated preprint, you can see how Rowan pKa performs on eight different datasets, including medicinal chemistry datasets selected to mimic real-world usage. At a high level, the mean absolute error is typically around 1 pKa unit, although as molecules get larger and more conformationally complex this error naturally increases.
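For readers who want to run the same kind of comparison on their own data, here is a minimal sketch of the mean absolute error metric. The function name is hypothetical, and the numbers in the usage example are made up for illustration; they are not taken from the preprint.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], experimental: Sequence[float]) -> float:
    """Average absolute difference between predicted and experimental pKa values."""
    if len(predicted) != len(experimental):
        raise ValueError("predicted and experimental must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one value is required")
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)


if __name__ == "__main__":
    # Made-up numbers, used only to show the call.
    predicted = [4.1, 9.8, 7.0]
    experimental = [4.8, 9.3, 7.9]
    print(mean_absolute_error(predicted, experimental))
```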
Since Rowan pKa takes conformations into account, unlike most QSPR methods, we can start to explore interesting effects in large molecules. We modeled some data from Andrei Yudin and co-workers showing that cyclization dramatically attenuated the basicity of a pyrrolidine embedded in a peptide. The match isn’t perfect, but Rowan pKa gets the trend in basicity right!
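One common way to fold multiple conformers into a single free energy is Boltzmann weighting. The sketch below illustrates that general idea only; whether Rowan pKa combines conformers in exactly this way is not stated in this post, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def boltzmann_weights(energies_kcal: Sequence[float], temperature: float = 298.15) -> List[float]:
    """Relative populations of conformers given their free energies in kcal/mol."""
    rt = R_KCAL * temperature
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / rt) for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]


def ensemble_free_energy(energies_kcal: Sequence[float], temperature: float = 298.15) -> float:
    """Free energy of the whole ensemble: G = -RT * ln(sum_i exp(-G_i / RT))."""
    rt = R_KCAL * temperature
    e_min = min(energies_kcal)
    partition = sum(math.exp(-(e - e_min) / rt) for e in energies_kcal)
    return e_min - rt * math.log(partition)
```

In a scheme like this, a conformer that becomes accessible in only one protonation state (for example, one stabilized by an intramolecular hydrogen bond) lowers that state's ensemble free energy and therefore shifts the predicted pKa, which is the kind of 3D effect a purely 2D method cannot capture.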
Rowan pKa is also fast enough for routine usage on libraries of compounds: we looked at the first 100 rings from Peter Ertl’s new database of medicinal chemistry-relevant ring systems, and found that the pKa calculations took, on average, 23.1 s to complete (on an EC2 c5.2xlarge instance). Similarly, this pKa calculation took only 30 s to generate three distinct microscopic values:
Our workflow is good, but it isn’t perfect: Rowan pKa is not as fast as QSPR, and conformationally complex molecules can still take a long time. Predicting the pKa values of upadacitinib, AbbVie’s Janus kinase inhibitor, involved 5 microscopic pKa calculations, 824 distinct conformations, 60 AIMNet2 optimization/frequency calculations, and took 33 minutes in total. Rowan pKa also displays higher errors than state-of-the-art quantum chemistry-based approaches. AIMNet2 isn’t yet as accurate as high-level DFT methods with big basis sets, and these inaccuracies show up in the final values.
Nevertheless, we think that Rowan pKa is an exciting and useful addition to the computer-assisted drug design toolbox, and we’re excited to start testing it out in the real world. If you already have a Rowan account, you can start running pKa calculations today through the same interface you’re already familiar with—and if you don’t have a Rowan account, you can make one here! We’re working on building the infrastructure needed to allow for pKa prediction at scale (i.e. not through the web browser); if you want to run lots of calculations, please contact us and we can figure out a solution for you and your group/company.
More generally, we’re excited to apply this same playbook—using general pre-trained ML potentials as a surrogate for electronic structure theory—to other tasks in computational chemistry. If there’s a particular problem that interests you, please reach out and let us know!
Great work! This looks like a really interesting workflow, and is another nice point on the Pareto frontier.
Do you find that the AIMNet2 filtering of conformers has a significant effect on the chosen conformers relative to the GFN2-xTB energies? My work in using GFN2-xTB conformational energies to screen catalysts has been promising, but I'm always wondering about using a generalized ML force-field to select conformers based on energy, as the GFN2-xTB energies are often lacking, but DFT energies are expensive.