Uni-pK_a

The official implementation of the model Uni-pK_a in the paper Bridging Machine Learning and Thermodynamics for Accurate pK_a Prediction.

Published paper at [JACS Au] | Relevant preprint at [ChemRxiv] | Small molecule protonation state ranking demo at [Bohrium App] | Full datasets at [AISSquare]

This machine-learning-based pK_a prediction model achieves state-of-the-art accuracy on several macro-pK_a datasets of drug-like small molecules.

The two core components of the Uni-pK_a framework are:

A microstate enumerator to systematically build the protonation ensemble from a single structure.
A molecular machine learning model to predict the free energy of each single structure.

The model reaches its expected accuracy at inference after comprehensive data preparation by the enumerator, pretraining on the ChEMBL dataset, and finetuning on our Dwar-iBond dataset.

Microstate Enumerator

Introduction

It uses an iterated template-matching algorithm to enumerate all microstates in adjacent macrostates of a molecule's protonation ensemble, starting from at least one microstate stored as SMILES.

The protonation template smarts_pattern.tsv modifies and augments the one from the paper MolGpka: A Web Server for Small Molecule pKa Prediction Using a Graph-Convolutional Neural Network and its open-source implementation (MIT license) in the GitHub repository MolGpKa.
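The iterated enumeration described above can be pictured as a fixed-point loop: the two pools keep feeding each other until neither gains a new microstate. The following is a toy sketch only. It assumes a microstate can be represented as the set of protonated site indices, whereas the actual enumerator works on SMILES and SMARTS templates; the function name `enumerate_adjacent_macrostates` and its arguments are hypothetical and not part of the repository.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Toy stand-in for a microstate: the indices of its protonated sites.
Microstate = FrozenSet[int]


def enumerate_adjacent_macrostates(
    seed_acid: Iterable[int], sites: Iterable[int]
) -> Tuple[Set[Microstate], Set[Microstate]]:
    """Grow the acid (A) and base (B) pools until neither gains a new microstate."""
    all_sites = frozenset(sites)
    acid_pool: Set[Microstate] = {frozenset(seed_acid)}
    base_pool: Set[Microstate] = set()
    while True:
        # Stand-in for deprotonation templates: remove one proton from each acid microstate.
        new_base = {a - {s} for a in acid_pool for s in a}
        # Stand-in for protonation templates: add one proton to each known base microstate.
        new_acid = {b | {s} for b in base_pool | new_base for s in all_sites - b}
        if new_base <= base_pool and new_acid <= acid_pool:
            return acid_pool, base_pool  # fixed point: nothing new was found
        base_pool |= new_base
        acid_pool |= new_acid


acid_pool, base_pool = enumerate_adjacent_macrostates(seed_acid={0, 1}, sites={0, 1, 2})
print(sorted(sorted(m) for m in acid_pool))
print(sorted(sorted(m) for m in base_pool))
```

With three sites and a seed that carries protons on sites 0 and 1, both pools end up with three microstates each: every two-proton state in the acid pool and every one-proton state in the base pool.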

Usage

The recommended environment is

python = 3.8.13
rdkit = 2021.09.5
numpy = 1.20.3
pandas = 1.5.2

Reconstruct a plain pK_a dataset to the Uni-pK_a standard macro-pK_a format with fully enumerated microstates

cd enumerator
python main.py reconstruct -i <input> -o <output> -m <mode>

The <input> dataset is assumed to be a csv-like file with a column storing SMILES. Two cases are allowed for each entry in the dataset.

1. It contains only one SMILES. The Enumerator builds the protonated/deprotonated macrostate and completes the original macrostate.
   - When <mode> is "A", the entry is treated as an acid (placed in the A pool).
   - When <mode> is "B", the entry is treated as a base (placed in the B pool).
2. It contains a string like "A1,...,Am>>B1,...,Bn", where A1,...,Am are comma-separated SMILES of microstates in the acid macrostate (all placed in the A pool), and B1,...,Bn are comma-separated SMILES of microstates in the base macrostate (all placed in the B pool). The Enumerator completes both macrostates.

The <mode>, "A" (default) or "B", determines which pool (A or B) provides the reference structures and serves as the starting point of the enumeration.
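To make the two entry formats concrete, here is a minimal sketch of how an entry could be split into an A pool and a B pool. The helper `parse_entry` is hypothetical and written only for illustration; it is not the parser used in main.py, and it assumes that SMILES strings never contain commas or ">>".

```python
from typing import List, Tuple


def parse_entry(entry: str, mode: str = "A") -> Tuple[List[str], List[str]]:
    """Split one dataset entry into (A pool, B pool) lists of SMILES."""
    if mode not in ("A", "B"):
        raise ValueError("mode must be 'A' or 'B'")
    if ">>" in entry:
        # Case 2: "A1,...,Am>>B1,...,Bn" already names both macrostates.
        acid_part, base_part = entry.split(">>", 1)
        acids = [s.strip() for s in acid_part.split(",") if s.strip()]
        bases = [s.strip() for s in base_part.split(",") if s.strip()]
        return acids, bases
    # Case 1: a single SMILES goes into the pool selected by the mode.
    smiles = entry.strip()
    return ([smiles], []) if mode == "A" else ([], [smiles])


print(parse_entry("CC(=O)O>>CC(=O)[O-]"))
print(parse_entry("c1ccncc1", mode="B"))
```

In the first call the mode has no effect on the split, because the entry already assigns each SMILES to a pool.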

The <output> dataset is then constructed after the enumeration.

Build protonation ensembles from single molecules

Example:

cd enumerator
python main.py ensemble -i ../dataset/sampl6.tsv -o example_out.tsv -u 2 -l -2 -t simple_smarts_pattern.tsv

The example uses the SAMPL6 dataset as input. A reconstructed pK_a dataset, or any molecular dataset with a "SMILES" column holding one molecular SMILES per entry, is supported as input. The output file (here example_out.tsv) contains the original SMILES and the macrostates whose total charge lies between the upper bound set by -u (default +2) and the lower bound set by -l (default -2). A simpler template, simple_smarts_pattern.tsv, is provided for cleaner protonation ensembles; it discards some structural motifs that are rare in aqueous solution.
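For a custom molecule list, an input file with the required "SMILES" column can be prepared with the standard library alone. This is a sketch under assumptions: the tab delimiter is inferred from the .tsv extension of the example input, and the helper `write_smiles_tsv` and the file name my_molecules.tsv are hypothetical.

```python
import csv
from typing import Iterable


def write_smiles_tsv(path: str, smiles_list: Iterable[str]) -> None:
    """Write a one-column, tab-separated file with a "SMILES" header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["SMILES"])  # column name expected by the ensemble command
        for smiles in smiles_list:
            writer.writerow([smiles])


if __name__ == "__main__":
    # Hypothetical file name; pass it to the ensemble command via -i.
    write_smiles_tsv("my_molecules.tsv", ["CC(=O)O", "c1ccncc1"])
```

The resulting file can then take the place of ../dataset/sampl6.tsv in the command above.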

Machine Learning Model

Introduction

It is a Uni-Mol-based neural network. By embedding the neural network into the thermodynamic relationship between free energy and pK_a throughout the training and inference stages, the framework preserves physical consistency and adapts to multiple tasks.
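One way to picture the link between microstate free energies and a macro-pK_a is the ratio of Boltzmann sums over the acid and base macrostates. The sketch below rests on two conventions that are assumptions made for illustration, not the confirmed output convention of the model: free energies are dimensionless (in units of k_B T), and base-side free energies already include the released proton at the pH 0 standard state. The function name `macro_pka` is hypothetical.

```python
import math
from typing import Sequence


def macro_pka(acid_free_energies: Sequence[float], base_free_energies: Sequence[float]) -> float:
    """Macro-pK_a from dimensionless microstate free energies (units of k_B*T).

    Assumed convention: pK_a = log10(Z_A) - log10(Z_B), with Z = sum_i exp(-G_i).
    """

    def log_partition(energies: Sequence[float]) -> float:
        # Log-sum-exp shifted by the lowest energy for numerical stability.
        lowest = min(energies)
        return -lowest + math.log(sum(math.exp(-(g - lowest)) for g in energies))

    ln_ratio = log_partition(acid_free_energies) - log_partition(base_free_energies)
    return ln_ratio / math.log(10)


# One acid microstate and one base microstate that lies 5*ln(10) k_B*T higher.
print(macro_pka([0.0], [5 * math.log(10)]))
# A second, degenerate acid microstate raises the macro-pK_a by log10(2).
print(macro_pka([0.0, 0.0], [5 * math.log(10)]))
```

Because every microstate enters through the same sums, a micro-pK_a (one microstate per side) and a macro-pK_a (full pools) come from the same set of predicted free energies, which is the sense in which such a formulation stays physically consistent across tasks.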

Usage

The recommended environment is the following Docker image.

docker pull dptechnology/unimol:latest-pytorch1.11.0-cuda11.3

See the Uni-Mol repository for details.

After downloading the full datasets, use scripts/pretrain_pka_mlm_aml.sh to pretrain the model, scripts/finetune_pka_aml.sh to finetune it, infer_test.sh to test the trained model on a macro-pK_a dataset, and infer_free_energy.sh to infer the free energy of given structures for any pK_a-related task.

Todo

Ready-to-run training workflow
