Descriptors & PCA
empirical vs physics-based modeling; molecular representations; our descriptors workflow; PCA for visualizing chemical space; metal-organic complex spin states; input structures with PubChem; AI Grant
At Rowan, we’re working on building a design and simulation platform for drug designers, materials scientists, and anyone else who works with atoms. So far, we’ve mainly focused on problems where physics-based simulation can provide accurate answers quickly: pKa prediction, tautomer searching, conformer searching, reaction modeling, and so on. However, there are many important molecular properties that are out of scope for these quantum mechanics–style simulations or for which an empirical approach is far superior.
To solve these problems, people sometimes use a toolbox of data science and machine learning tools collectively known as cheminformatics. Basically, cheminformatics uses data to model the relationship between structures and properties statistically rather than physically. This typically begins by generating some numerical representation of the molecules in question to embed them in a high-dimensional vector space. Once you have these vector embeddings, you can generate maps of molecule space, evaluate the similarity of two sets, perform clustering, train models on categorization or regression tasks, and lots more. (We can’t summarize all of cheminformatics in a single paragraph, but there are many great resources out there if you want to learn more!)
The above discussion purposely avoids specifying the nature of the molecular representations employed. In reality, there are a number of competing descriptor schemes out there:
Molecular fingerprints, which turn 2D graph-based representations of molecules into bit vectors, are often used. Fingerprints are very effective for properties that depend on local molecular structure, but lack knowledge of 3D structure.
Cheminformatic descriptors, like “number of hydrogen bonds” or fancier metrics, are also very commonly used. These can be very effective if the right descriptors are chosen, but finding the right ones can be challenging.
Quantum mechanics-based descriptors can be very helpful, as recent work from Green and co-workers shows, but are somewhat more expensive to compute.
And learned molecular representations like those generated by an autoencoder have attracted considerable attention in recent years (e.g. COATI).
Expert cheminformaticians typically have pipelines in place to generate descriptors as needed for their specific problems, but it’s considerably tougher for the average scientist to wander through a sea of packages, find the right tool for the job, and figure out what all the abbreviations mean.
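To make the fingerprint idea a little more concrete, here is a minimal sketch of the Tanimoto similarity that is commonly used to compare two bit-vector fingerprints. It assumes each fingerprint is stored as a set of "on" bit indices; the molecule labels and bit indices are made up for illustration, and the `tanimoto` and `most_similar` helpers are hypothetical names rather than part of any particular package.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints stored as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen for this sketch: two empty fingerprints are identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def most_similar(query: str, library: dict[str, set[int]]) -> list[tuple[str, float]]:
    """Rank every other molecule in the library by similarity to the query, best first."""
    scores = [
        (name, tanimoto(library[query], fp))
        for name, fp in library.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical on-bit indices, invented for illustration (not real fingerprints).
fingerprints = {
    "mol_a": {3, 17, 42, 101, 256},
    "mol_b": {3, 17, 42, 300},
    "mol_c": {5, 900},
}

ranking = most_similar("mol_a", fingerprints)
```

Because the score only looks at which bits two molecules share, it reflects the point made above: fingerprints capture local 2D structure well but know nothing about 3D geometry.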
Descriptors Workflow
All this to say—today, we’re excited to launch our descriptors workflow, which first optimizes molecular geometries with GFN2-xTB before calculating over 1,800 molecular descriptors using Mordred (paper, community-maintained GitHub) and xTB. This workflow calculates both 2D descriptors, which encode information about connectivity and atom types, and 3D descriptors, which encode information about molecular geometry.
Once you’ve submitted a descriptors calculation, our platform will run the optimization and calculate descriptors within minutes. When the calculation finishes, you can view the optimized structure, each descriptor calculated with Mordred, and the per-atom descriptors we’ve calculated with xTB.
You can select multiple calculations to view the results in tabular format. In this format, you can filter which descriptors are shown by name and module, and sort by any column. To take these descriptors off Rowan for use elsewhere, you can download the entire table as a CSV. Cheminformatics descriptors typically have arcane and cryptic names, so we’ve added short descriptions of what each descriptor means: simply hover over the name to read them! (This is one of the features that we’re proudest of.)
Because it can get hard to select more than a few calculations by hand, we’ve added a “View all in folder” button to the top of the “Compare Descriptors Calculations” page. This view will hide the sidebar and load in all the descriptors calculations in your current folder (this requires patience, as it can take a minute).
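If you download the table as a CSV, the same filter-and-sort steps are easy to reproduce offline. The sketch below uses only Python's standard library and assumes a layout of one row per molecule and one column per descriptor, with an identifier column called `name`; the real export may be laid out differently. The column names follow Mordred's naming style, the values are placeholders, and the helper functions are hypothetical.

```python
import csv
import io

# Hypothetical export: placeholder values, assumed one row per molecule.
SAMPLE_CSV = """name,MW,SLogP,nHBDon,nHBAcc
mol_1,267.2,-1.9,4,9
mol_2,194.2,,0,6
mol_3,78.1,1.7,0,0
"""


def load_descriptors(handle, id_column="name"):
    """Read a descriptor table into {molecule: {descriptor: float or None}}."""
    table = {}
    for row in csv.DictReader(handle):
        values = {}
        for key, cell in row.items():
            if key == id_column:
                continue
            try:
                values[key] = float(cell)
            except (TypeError, ValueError):
                values[key] = None  # blank or non-numeric cell
        table[row[id_column]] = values
    return table


def filter_by_name(table, substring):
    """Keep only descriptors whose name contains the substring (case-insensitive)."""
    needle = substring.lower()
    return {
        mol: {k: v for k, v in desc.items() if needle in k.lower()}
        for mol, desc in table.items()
    }


def rank_by(table, descriptor):
    """Sort molecules by one descriptor, largest first, skipping missing values."""
    scored = [
        (mol, desc[descriptor])
        for mol, desc in table.items()
        if desc.get(descriptor) is not None
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


table = load_descriptors(io.StringIO(SAMPLE_CSV))
hbond_only = filter_by_name(table, "hb")
by_weight = rank_by(table, "MW")
```

To use a real download, replace `io.StringIO(SAMPLE_CSV)` with an open file handle and adjust `id_column` to match the actual header.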
Dimensionality Reduction with PCA
Once you’ve loaded in a number of descriptors, you can use our principal component analysis (PCA) tool to visualize and understand your molecular data. PCA is a dimensionality reduction technique that projects high-dimensional data onto a few new dimensions, each chosen to capture as much of the remaining variance in the data as possible. On Rowan, we use the two most information-dense dimensions, PCA(1) and PCA(2), to plot your data on a simple scatterplot.
Once you’ve run PCA, you can hover over each point to view the associated structure, color the points on the scatterplot by any descriptor’s value, view the loadings of each PCA dimension, and download the coordinates of each point as a CSV to use with other software.
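For readers curious about what the two plotted dimensions and their loadings represent, here is a small dependency-free sketch of PCA. It is not a description of how Rowan computes PCA: it assumes descriptors are standardized first, and it uses power iteration with deflation to find the top two eigenvectors of the covariance matrix, where a production tool would more likely call a linear algebra library. All function names and the example values are hypothetical.

```python
import math
import random


def standardize(rows):
    """Center every column and scale it to unit variance (constant columns become 0)."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
            for col, m in zip(columns, means)]
    return [[(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
            for row in rows]


def covariance(X):
    """Sample covariance matrix of already-centered data."""
    n, d = len(X), len(X[0])
    return [[sum(row[i] * row[j] for row in X) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def top_eigenpair(C, iterations=500, seed=0):
    """Power iteration: (eigenvalue, unit eigenvector) for the largest eigenvalue of C."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in C]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(C, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(C, v)))
    return eigenvalue, v


def pca_2d(rows):
    """Return (scores, loadings, variances) for the first two principal components."""
    X = standardize(rows)
    C = covariance(X)
    loadings, variances = [], []
    for _ in range(2):
        lam, v = top_eigenpair(C)
        loadings.append(v)
        variances.append(lam)
        # Deflation: remove the component just found so the next pass finds the following one.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(len(v))]
             for i in range(len(v))]
    scores = [[sum(x * w for x, w in zip(row, v)) for v in loadings] for row in X]
    return scores, loadings, variances


# Hypothetical descriptor table: 5 molecules x 3 descriptors (placeholder values).
descriptors = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.7],
    [4.0, 8.2, 0.1],
    [5.0, 9.9, 0.9],
]
scores, loadings, variances = pca_2d(descriptors)
```

Each row of `scores` is the (x, y) position of one molecule on the scatterplot, and each entry of `loadings` says how strongly every original descriptor contributes to that axis. A descriptor with a large absolute loading is one that drives the separation you see along that dimension.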
To playtest this workflow, I generated descriptors for the 103 compounds in this adenosine analogue library. (View them yourself here.) Visualizing this chemical space makes it easy to see the different sorts of compounds present in the library: urea-linked compounds and guanidine-linked compounds cluster separately, several properties vary in consistent ways throughout the clusters, and there’s a single compound that sits by itself far from either cluster (upon inspection, it has a quinone group that’s not found in any other compound). Exploring chemical space in this way has been used to great effect in selecting representative compounds or understanding the structure of various datasets (e.g. Merck informer libraries); it’s surprising how much insight can be gained!
The cheminformatics tooling ecosystem is broad, and we understand that what we’ve built here is just a baby step into a very complex field: this interface isn’t suited for vast libraries of compounds or sophisticated model training and hosting. If you have ideas for what we should do next to best support your work in cheminformatics or would like a custom solution for your business, we’d love to hear from you (you can reach us at contact@rowansci.com).
Predicting Spin States of Metal-Organic Complexes
Molecules can exist in a variety of spin states (the arrangement of the spins of the electrons within a molecule). Most organic molecules are closed-shell species, where all up [↑] and down [↓] spins of electrons are paired [↑↓] in orbitals. Most organic molecules that are radicals (having unpaired electrons) have only a single unpaired electron (doublet ground state), so there is no ambiguity about the correct spin state. However, metal-containing molecules often have many partially filled orbitals that are close in energy, and many different orbital occupations are possible, making the preferred spin state non-obvious. Determining the lowest-energy spin state of a molecule or complex is crucial for understanding its geometry and properties. Open-shell species can have radically different structures from their closed-shell counterparts, and the number of unpaired electrons has significant effects on the reactivity and spectra of these species. Read Jonathon’s full blog post on predicting the spin states of metal-organic complexes and its use in synthesis and modeling.
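To see why the spin state becomes ambiguous for metals, it helps to count the possibilities. The sketch below uses the standard relation multiplicity = 2S + 1, which equals the number of unpaired electrons plus one, and lists every multiplicity that a given number of d electrons could adopt in five d orbitals. It is a deliberately simplified counting exercise (metal d electrons only, no ligand-centered radicals), not the prediction method from the blog post, and `possible_multiplicities` is a hypothetical helper.

```python
MULTIPLICITY_NAMES = {
    1: "singlet", 2: "doublet", 3: "triplet",
    4: "quartet", 5: "quintet", 6: "sextet",
}


def possible_multiplicities(n_d_electrons: int, n_orbitals: int = 5) -> list[int]:
    """List the spin multiplicities (2S + 1) reachable by n electrons in n_orbitals orbitals."""
    if not 0 <= n_d_electrons <= 2 * n_orbitals:
        raise ValueError("electron count does not fit in the given orbitals")
    # At most one unpaired electron per orbital; past half filling, extra electrons must pair up.
    max_unpaired = min(n_d_electrons, 2 * n_orbitals - n_d_electrons)
    # Pairing two electrons removes two unpaired spins at a time, so parity is preserved.
    lowest = n_d_electrons % 2
    return [unpaired + 1 for unpaired in range(lowest, max_unpaired + 1, 2)]


# A d6 metal center (for example Fe(II)) has three candidate spin states.
d6_states = [MULTIPLICITY_NAMES[m] for m in possible_multiplicities(6)]
```

For a d6 center this gives singlet, triplet, and quintet candidates, each of which would need its own calculation to find out which is lowest in energy.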
PubChem and CID Input
Additionally, you can now input structures into Rowan from PubChem names and CIDs. This makes it easy to load in common chemicals or therapeutics and run calculations on them: there are over 100 million molecules in PubChem, so if you’re looking for a known molecule there’s a good chance that it’ll be present!
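If you’d like to pull the same structures into your own scripts, PubChem’s public PUG REST interface accepts both names and CIDs. The sketch below only builds the request URL for an SDF record; it says nothing about how Rowan performs its own lookup. `pubchem_sdf_url` is a hypothetical helper, and treating an all-digit input as a CID is an assumption of this example.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_sdf_url(identifier: str, want_3d: bool = False) -> str:
    """Build a PUG REST URL that returns an SDF record for a compound name or CID."""
    identifier = identifier.strip()
    # Assumption for this sketch: purely numeric input is a CID, anything else is a name.
    namespace = "cid" if identifier.isdigit() else "name"
    url = f"{PUG_REST}/compound/{namespace}/{quote(identifier, safe='')}/SDF"
    if want_3d:
        url += "?record_type=3d"  # ask for a computed 3D conformer when one is available
    return url


aspirin_by_cid = pubchem_sdf_url("2244")  # 2244 is the CID for aspirin
acetic_acid_by_name = pubchem_sdf_url("acetic acid", want_3d=True)
```

Passing the resulting URL to `urllib.request.urlopen` (or any HTTP client) downloads the structure file, which can then be uploaded wherever you need it.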
Other Rowan Updates
We’re really excited to share that we’ve been selected as one of the AI Grant batch 4 companies! This support gives us access to an incredible talent network and resources that will help us continue to build out our platform to better support scientific innovation.