Research Projects   

Taba: A Tool to Analyze the Binding Affinity     


The Science
The basic idea behind Taba is that the structural features that determine ligand-binding affinity are already imprinted in the three-dimensional structures of protein-ligand complexes. An ensemble of crystallographic structures for which ligand-binding data are available provides the raw data that Taba uses to generate a target-based polynomial scoring function. To build this scoring function, Taba reads all structures available for a biological system of interest and calculates the average distance for each type of atom pair. For instance, consider intermolecular Carbon-Carbon distances, where one Carbon belongs to the protein and the other to the ligand. Taba calculates the average intermolecular distance for the Carbon-Carbon pair and, by analogy with a mass-spring system, treats this length as the equilibrium distance for that pair. For a given structure, displacement from this equilibrium distance increases the energy of the system. This is a deliberately naïve analogy. We modeled our protein-ligand interactions as illustrated in the figure below.

Protein-ligand as a mass-spring system. We used the atomic coordinates for the complex CDK2-roscovitine (PDB: 2A4L) (De Azevedo et al., 1997).

For each pair of atoms, Taba calculates the average intermolecular distance and treats it as the equilibrium distance for that pair. There is one equilibrium distance for the Carbon-Carbon pair, another for the Carbon-Oxygen pair, and so on. The animated figure below shows the oscillation of a mass-spring system: displacement from equilibrium generates a restoring force that pushes the system back in the opposite direction, producing harmonic motion.


Mass-spring system in undamped oscillation (the animation above was generated with Mathematica; the code is available here).
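The averaging step and the mass-spring analogy can be sketched in a few lines of Python. This is a minimal illustration, not Taba's source code: the function names, the 4.5 Å default cutoff, the toy coordinates, and the use of the squared displacement as the energy-like term are assumptions made for the example, since the text only states that displacement from the equilibrium distance increases the energy.

```python
import math
from collections import defaultdict

def pair_distances(protein_atoms, ligand_atoms, cutoff=4.5):
    """Group protein-ligand distances (in Å) by (protein element, ligand element).

    Atoms are (element, (x, y, z)) tuples; the cutoff value is an assumption.
    """
    grouped = defaultdict(list)
    for p_elem, p_xyz in protein_atoms:
        for l_elem, l_xyz in ligand_atoms:
            d = math.dist(p_xyz, l_xyz)
            if d <= cutoff:
                grouped[(p_elem, l_elem)].append(d)
    return grouped

def equilibrium_distances(complexes, cutoff=4.5):
    """Average distance per pair type, pooled over an ensemble of complexes."""
    pooled = defaultdict(list)
    for protein_atoms, ligand_atoms in complexes:
        for pair, ds in pair_distances(protein_atoms, ligand_atoms, cutoff).items():
            pooled[pair].extend(ds)
    return {pair: sum(ds) / len(ds) for pair, ds in pooled.items()}

def spring_terms(protein_atoms, ligand_atoms, r0, cutoff=4.5):
    """Sum of squared displacements from the equilibrium distance, per pair type.

    The quadratic form is the assumed mass-spring analogue, not a confirmed
    detail of Taba.
    """
    terms = {}
    for pair, ds in pair_distances(protein_atoms, ligand_atoms, cutoff).items():
        if pair in r0:
            terms[pair] = sum((d - r0[pair]) ** 2 for d in ds)
    return terms

# Two toy "complexes" with made-up coordinates (not real structures).
complex_a = ([("C", (0.0, 0.0, 0.0))], [("C", (3.0, 0.0, 0.0))])
complex_b = ([("C", (0.0, 0.0, 0.0))],
             [("C", (4.0, 0.0, 0.0)), ("O", (0.0, 2.0, 0.0))])

r0 = equilibrium_distances([complex_a, complex_b])
terms_b = spring_terms(*complex_b, r0)
print(r0)
print(terms_b)
```

Each value in `terms_b` is one candidate explanatory variable for a given structure: the further its distances lie from the ensemble average for that pair type, the larger the term.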

As highlighted above, applying Taba requires an ensemble of crystallographic structures for which ligand-binding affinity is known. This set of structures is used to train the model. In the first round, Taba calculates the average distance for each pair of atoms. In the second round, Taba applies supervised machine-learning techniques to determine the relative weight of each type of atom pair. Taba treats the intermolecular distances for each pair of atoms as explanatory variables. The response variable is the log of the binding affinity, for instance log(Ki), where Ki is the inhibition constant. Taba considers the following atoms from the protein structure: C, N, O, S, and P. For the ligands, Taba uses the following atoms: C, N, O, S, F, Cl, Br, I, and P.
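The second round can be pictured as an ordinary regression problem. The sketch below assumes scikit-learn's `LinearRegression` as one possible supervised method and uses synthetic numbers in place of real structural data; it fits one weight per atom-pair type against log(Ki). The pair subset, the helper names, and all values are hypothetical, and the text does not say which regression method Taba selects.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical subset of protein-ligand atom-pair types used as features.
PAIR_TYPES = ["C-C", "C-N", "C-O"]

def make_synthetic_data(n_complexes=40, seed=0):
    """Build fake per-pair terms and a fake log(Ki) with known weights."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 5.0, size=(n_complexes, len(PAIR_TYPES)))
    true_weights = np.array([0.8, -0.3, 0.5])
    noise = rng.normal(0.0, 0.05, size=n_complexes)
    y = X @ true_weights - 7.0 + noise  # synthetic response, stands in for log(Ki)
    return X, y, true_weights

def fit_pair_weights(X, y):
    """Fit one relative weight per pair type (explanatory variable)."""
    return LinearRegression().fit(X, y)

X, y, true_weights = make_synthetic_data()
model = fit_pair_weights(X, y)
for name, weight in zip(PAIR_TYPES, model.coef_):
    print(f"{name}: {weight:+.3f}")
print(f"intercept: {model.intercept_:+.3f}")
```

Because the data are generated from known weights, the fitted coefficients should land close to them, which is a quick sanity check on the setup before real structures are used.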

The Experiment

To use Taba, we adopt a specific concept of experiment. For Taba, an experiment is a set of files: structures in the Protein Data Bank (PDB) format (Berman et al., 2000), a list of PDB access codes, a ligand-binding information file, a configuration file, transformed files for regression, and result files. Thus, when we refer to an experiment, we mean the set of data generated for a set of PDB entries of a particular protein family and their associated records. Every experiment has its own folder, named by the user.

The Tool

Taba has a main screen, where you can select the desired task, and six other screens with various functionalities. In addition to these feature screens, there is a help screen and another with general information about Taba. To run an experiment, follow the order of the buttons on the main screen, from left to right.

The Main Features of Taba

Experiment Management: Before starting any experiment, use this feature, which lets you save the current experiment, open an existing experiment, or delete the current experiment. Before deleting an experiment, check whether you need to save it first.

Downloading PDB files: This feature allows the user to download data from the PDB site (Berman et al., 2000). Taba can download two types of files: the PDB file with the atomic coordinates and a second file with the binding-affinity information. The binding affinity can be the inhibition constant (Ki), half-maximal inhibitory concentration (IC50), half-maximal effective concentration (EC50), or dissociation constant (Kd).
The PDB codes obtained from the site, following user-defined search criteria, must be pasted into the appropriate box on the download screen. Before pasting, use the clean button to clear the code field and the experiment name. After pasting the PDB codes, fill in the experiment name field and use the save option. Then select the download button. When the progress bar reaches 100%, you may close this screen. Whenever the download screen opens, the PDB codes of the current experiment are loaded.
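The handling of pasted PDB codes can be sketched as follows. This is an assumption-laden illustration rather than Taba's implementation: the helper names are hypothetical, the validation rule is the classic four-character PDB identifier (a digit followed by three alphanumerics), and the URL pattern is that of the public RCSB file download service, which Taba may or may not use. The sketch only builds URLs; it does not contact the network.

```python
import re

# URL pattern of the public RCSB file download service (assumed here;
# the text does not say which endpoint Taba uses).
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{code}.pdb"

# Classic PDB identifier: one digit followed by three letters or digits.
PDB_CODE = re.compile(r"[0-9][A-Z0-9]{3}")

def parse_pdb_codes(pasted):
    """Split pasted text into unique, upper-case PDB codes, keeping order."""
    codes = []
    for token in re.split(r"[,;\s]+", pasted.strip()):
        code = token.upper()
        if PDB_CODE.fullmatch(code) and code not in codes:
            codes.append(code)
    return codes

def download_urls(codes):
    """Build one coordinate-file URL per PDB code."""
    return [RCSB_DOWNLOAD.format(code=code) for code in codes]

pasted_text = "2a4l, 1HCK\n2A4L bad!"
codes = parse_pdb_codes(pasted_text)
for url in download_urls(codes):
    print(url)
```

Normalizing and deduplicating the codes before any download mirrors the "clean, paste, save" sequence described above and avoids fetching the same entry twice.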

Generate files for regression: This functionality is essential for the generation of machine-learning models. Taba uses the scikit-learn library to implement supervised machine-learning techniques (Pedregosa et al., 2011). The user can select the maximum intermolecular distance that Taba will consider between a ligand atom and a protein atom. The allowed values are 3.5, 4.5, 6.0, 7.5, and 9.0 Å. This feature randomly generates two file sets, one for training and another for testing; the user can select the seed for this random split. For each dataset (training and test), four files are generated, to be selected later for regression. Taba uses binding information from three databases: PDBbind (Wang et al., 2004), BindingDB (Liu et al., 2007), and Binding MOAD (Hu et al., 2005). Three of the files correspond to these databases, one each, and the fourth groups all three together.
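A seeded split of this kind can be sketched with the Python standard library. The function names and the 30% test fraction are assumptions for illustration; the text specifies only the allowed cutoff values and that the user chooses the seed.

```python
import random

# Cutoff values (in Å) listed as allowed in the text.
ALLOWED_CUTOFFS = (3.5, 4.5, 6.0, 7.5, 9.0)

def check_cutoff(value):
    """Return the cutoff if it is one of the allowed values, else raise."""
    if value not in ALLOWED_CUTOFFS:
        raise ValueError(f"cutoff must be one of {ALLOWED_CUTOFFS}, got {value}")
    return value

def split_codes(codes, seed, test_fraction=0.3):
    """Randomly split structure codes into (training, test) lists.

    The codes are sorted first so that the result depends only on the set
    of codes and the seed. The test fraction is an assumed default.
    """
    shuffled = sorted(codes)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical structure labels standing in for PDB codes.
codes = [f"cpx{i:02d}" for i in range(10)]
cutoff = check_cutoff(4.5)
train, test = split_codes(codes, seed=42)
print(f"cutoff = {cutoff} Å, training = {len(train)}, test = {len(test)}")
```

Reusing the same seed reproduces the same training and test sets, which is what makes two regression runs on the same experiment comparable.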


Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.N.; Bourne, P.E. The Protein Data Bank. Nucleic Acids Res., 2000, 28(1), 235-242.

De Azevedo, W.F.; Leclerc, S.; Meijer, L.; Havlicek, L.; Strnad, M.; Kim, S.H. Inhibition of cyclin-dependent kinases by purine analogues: crystal structure of human cdk2 complexed with roscovitine. Eur. J. Biochem., 1997, 243(1-2), 518-526.

Hu, L.; Benson, M.L.; Smith, R.D.; Lerner, M.G.; Carlson, H.A. Binding MOAD (Mother Of All Databases). Proteins: Struct. Funct. Genet., 2005, 60(3), 333-340.

Liu, T.; Lin, Y.; Wen, X.; Jorissen, R.N.; Gilson, M.K. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res., 2007, 35(Database issue), D198-D201.

Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res., 2011, 12, 2825-2830.

Wang, R.; Fang, X.; Lu, Y.; Wang, S. The PDBbind Database: Collection of Binding Affinities for Protein-Ligand Complexes with Known Three-Dimensional Structures. J. Med. Chem., 2004, 47(12), 2977-2980.

SAnDReS: Statistical Analysis of Docking Results and Scoring functions

SAnDReS code is available from GitHub.


SAnDReS draws inspiration from several protein-ligand projects that we have been working on over the last two decades. These projects began in the 1990s with pioneering studies of intermolecular interactions between cyclin-dependent kinase and inhibitors (De Azevedo et al., 1996; 1997). SAnDReS is a free and open-source (GNU General Public License) computational environment for developing machine-learning models to predict ligand-binding affinity. It is also a tool for statistical analysis of docking simulations and for evaluating the predictive performance of computational models developed to calculate binding affinity. We have implemented machine-learning techniques to generate regression models based on experimental binding affinity and scoring functions such as the PLANTS and MolDock scores. The scikit-learn library offers a wide spectrum of supervised machine-learning techniques for regression, such as Stochastic Gradient Descent and Support Vector Regression. SAnDReS was developed in the Python programming language with the SciPy, NumPy, scikit-learn, and Matplotlib libraries. SAnDReS can analyze data obtained from any protein-ligand docking program; the only requirements are protein structures in Protein Data Bank (PDB) format, ligands in Structure Data File (SDF) format, and docking and scoring function data in comma-separated values (CSV) format. The program has been applied to several datasets of crystallographic structures for which ligand-binding affinity is available, in order to generate scoring functions tailored to the biological system of interest (Xavier et al., 2016).
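The kind of statistical analysis described here can be illustrated with a short, standard-library-only sketch that reads scoring data from CSV text and compares scores with experimental affinity. It is not SAnDReS code: the column names, the helper functions, and the four rows of numbers are invented for the example, and Spearman correlation and RMSE are shown only as two common measures of predictive performance.

```python
import csv
import io
import math

def read_columns(csv_text, score_col, affinity_col):
    """Read two numeric columns from CSV text (column names are hypothetical)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    scores = [float(row[score_col]) for row in rows]
    affinities = [float(row[affinity_col]) for row in rows]
    return scores, affinities

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def rmse(predicted, observed):
    """Root-mean-square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))

# Invented example data: docking score and experimental log(Ki) per complex.
EXAMPLE_CSV = """complex,score,log_ki
cpx1,-95.2,-8.1
cpx2,-80.4,-6.9
cpx3,-110.7,-9.3
cpx4,-70.1,-7.0
"""

scores, log_ki = read_columns(EXAMPLE_CSV, "score", "log_ki")
rho = spearman(scores, log_ki)
print(f"Spearman correlation between score and log(Ki): {rho:.2f}")
```

A rank-based measure is a natural first check for scoring functions, since a score only needs to order ligands correctly to be useful, whereas RMSE applies once a regression model predicts affinity on the experimental scale.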

Gallery of Plots Generated by the Program SAnDReS      

Scatter plots and ROC Curve generated by SAnDReS  

The flowchart below illustrates the main steps to integrate a molecular docking program and SAnDReS.

Flowchart for application of SAnDReS to analyze docking results and develop scoring functions. Grey boxes indicate tasks carried out by SAnDReS.

Below is a list of biological systems being investigated using Taba and SAnDReS.

-5-Enolpyruvylshikimate-3-Phosphate Synthase (EPSP Synthase) (EC   PubMed   IC50   Kd   Ki
-11-Beta-Hydroxysteroid Dehydrogenase (EC   PubMed   IC50   Ki        

-Acetylcholinesterase (EC   PubMed   MOTM   IC50   Kd   Ki   
-ADAM 17 Endopeptidase (EC   PubMed   IC50   Kd   Ki    
-Adenosine A2a Receptor (EC   PubMed   IC50   Kd   Ki           
-Adenosine Deaminase (EC   PubMed   IC50   Ki    
-Adenosylhomocysteinase (EC   PubMed    IC50   Ki   
-Aldehyde Reductase (EC   PubMed   IC50   Kd   Ki    
-Angiotensin-Converting Enzyme (EC   PubMed   IC50   Kd   Ki    

-Beta-2 Adrenergic Receptor (EC   PubMed   IC50   Kd   Ki   
-Beta-Glucocerebrosidase  (EC   PubMed   IC50   Ki   
-Beta-Lactamase (EC   PubMed   MOTM   IC50   Kd   Ki   
-Beta-Secretase 1 (EC   PubMed   MOTM    IC50   Kd   Ki   

-Carbonic Anhydrase II (EC   PubMed   MOTM   IC50   Kd   Ki     
-Caspase-3 (or Apopain) (EC   PubMed     MOTM   IC50   Kd   Ki    
-Catechol O-Methyltransferase (EC   PubMed   IC50   Ki    
-Chorismate Synthase (EC   PubMed   Kd   
-c-Jun N-Terminal Kinase 3 (EC   PubMed   IC50   Kd   Ki   
-Coagulation Factor X (EC   PubMed    IC50   Kd   Ki   
-Coagulation Factor VIIa (EC   PubMed   IC50   Kd   Ki    
-C-X-C Chemokine Receptor Type 4 (EC   PubMed   IC50   Kd   Ki    
-Cyclin-Dependent Kinase (EC   PubMed   IC50   Kd   Ki    
-Cyclooxygenase-1 and 2 (EC   PubMed   MOTM   IC50   Kd   Ki   
-Cytochrome P450 2C9 (EC 1.14.13.-)   PubMed   MOTM   IC50   Kd   Ki   

-Dihydrofolate Reductase (EC   PubMed   MOTM   IC50   Kd   Ki
-Dihydroorotate Dehydrogenase (EC   PubMed   IC50   Ki       
-Dipeptidyl Peptidase IV (EC   PubMed   IC50   Kd   Ki       
-Dopamine D3 Receptor (EC   PubMed   IC50   Kd   Ki    
-Dual Specificity Mitogen-Activated Protein Kinase Kinase 1 (EC   PubMed   IC50   Kd   Ki           

-Enoyl-[Acyl-Carrier-Protein] Reductase (NADH) (EC   PubMed   IC50   Kd   Ki       
-Epidermal Growth Factor Receptor ErbB1 (EC   PubMed   MOTM   IC50   Kd   Ki  

-Farnesyl Diphosphate Synthase (EC   PubMed   IC50   Kd   Ki             
-Fibroblast Growth Factor Receptor 1 (EC   PubMed   IC50   Kd   Ki           
-FK506-Binding Protein 1 (EC   PubMed   IC50   Kd   Ki           
-Focal Adhesion Kinase 1 (EC   PubMed   IC50   Kd   Ki           

-GAR Transformylase (EC   PubMed   Ki          
-Glucocorticoid Receptor (EC   PubMed   IC50   Kd   Ki          

-Hepatocyte Growth Factor Receptor (EC   PubMed   IC50   Kd   Ki              
-Hexokinase Type IV (EC   PubMed   Kd   Ki           
-High-Resolution Crystallographic Structures   DeltaG   IC50   Kd   Ki     
-Histone Deacetylase (EC   PubMed   IC50   Kd   Ki                  
-HMG-CoA Reductase (EC   PubMed   IC50   Kd   Ki           
-Human Immunodeficiency Virus Type 1 Integrase (EC 2.7.7.-)   PubMed   MOTM   IC50   Kd   Ki   
-Human Immunodeficiency Virus Type 1 Protease (EC   PubMed   MOTM   IC50   Kd   Ki              
-Human Immunodeficiency Virus Type 1 Reverse Transcriptase (EC   PubMed  MOTM   IC50   Kd   Ki          
-Hydrolases (EC 3.-.-.-)   PubMed   IC50   Kd   Ki        

-Insulin-Like Growth Factor I Receptor (EC   PubMed   IC50   Kd   Ki          
-Isomerases (EC 5.-.-.-)   PubMed   IC50   Kd   Ki          
-Kinases (EC 2.7.-.-)   PubMed   IC50   Kd   Ki         

-Leukotriene-A(4) Hydrolase (or LTA-4 Hydrolase) (EC   PubMed   IC50   Kd   Ki        
-Ligases (EC 6.-.-.-)   PubMed   MOTM   IC50   Kd   Ki             
-Lyases (EC 4.-.-.-)   PubMed   IC50   Kd   Ki        
-Macrophage Colony Stimulating Factor Receptor (EC   PubMed   IC50   Kd   Ki        
-MAP Kinase-Activated Protein Kinase 2 (EC   PubMed   IC50   Kd   Ki       
-MAP Kinase (EC   PubMed   IC50   Kd   Ki    
-Matrix Metalloproteinase 13 (EC 3.4.24.-)   PubMed   IC50   Kd   Ki       
-Monoamine Oxidase B (EC   PubMed   IC50   Kd   Ki       
-Muscle Glycogen Phosphorylase (EC   PubMed   MOTM   IC50   Kd   Ki         

-Nitric-Oxide Synthase (EC   PubMed   MOTM   IC50   Kd   Ki         
-Neuraminidase (EC   PubMed   MOTM   IC50   Kd   Ki   

-Oxidoreductases (EC 1.-.-.-)   PubMed   IC50   Kd   Ki            

-Peptide Deformylase (EC   PubMed   IC50   Kd   Ki               
-Peroxisome Proliferator-Activated Receptor (EC   PubMed   IC50   Kd   Ki     
-Phosphodiesterase 5A (EC   PubMed   IC50   Kd   Ki           
-Phospholipase A(2) Group IIA (EC   PubMed   IC50   Kd   Ki        
-Poly [ADP-ribose] Polymerase-1 (EC   PubMed   IC50   Kd   Ki            
-Protein Farnesyltransferase/Geranylgeranyltransferase Type I Alpha Subunit (EC or EC   PubMed   IC50   Kd   Ki        
-Protein Kinase C Beta (EC   PubMed   IC50   Kd   Ki            
-Purine Nucleoside Phosphorylase (EC   PubMed   IC50   Kd   Ki

-Renin (EC   PubMed   IC50   Kd   Ki    
-Rho-Associated Protein Kinase 1 (EC   PubMed   IC50   Kd   Ki    
-Serine/Threonine-Protein Kinase AKT (EC   PubMed   IC50   Kd   Ki
-Serine/Threonine-Protein Kinase AKT2 (EC   PubMed   IC50   Kd   Ki    
-Serine/Threonine-Protein Kinase B-raf (EC   PubMed   MOTM   IC50   Kd   Ki       
-Serine/Threonine-Protein Kinase PLK1 (or Polo Kinase) (EC   PubMed   IC50   Kd   Ki     
-Serine/Threonine-Protein Kinase WEE1 (EC   PubMed   IC50   Kd   Ki     
-Shikimate Kinase (EC   PubMed   Kd   Ki    
-Stem Cell Growth Factor Receptor (EC   PubMed   IC50   Kd   Ki     

-TGF-Beta Receptor Type I (EC   PubMed   IC50   Kd   Ki     
-Thymidine Kinase (EC   PubMed   IC50   Kd   Ki    
-Thymidylate Synthase (EC   PubMed   IC50   Kd   Ki    
-Thrombin (or Fibrinogenase) (EC   PubMed   IC50   Kd   Ki    
-Transferases (EC 2.-.-.-)   PubMed   IC50   Kd   Ki     
-Trypsin I  (EC   PubMed   IC50   Kd   Ki      
-Tryptase Beta-1 (EC   PubMed   IC50   Ki        
-Tyrosine-Protein Kinase (EC   PubMed   IC50   Kd   Ki       

-Urokinase-Type Plasminogen Activator (EC   PubMed   IC50   Kd   Ki    

-Vascular Endothelial Growth Factor Receptor 2 (EC   PubMed   IC50   Kd   Ki  

Last updated on December 4th 2017 by Ms. Gabriela Bitencourt Ferreira.

Related Links  

     -A Database of Useful Decoys: Enhanced (DUD-E)
     -Enzyme Nomenclature Database (Expasy)     
     -Scikit-learn Machine Learning Techniques for Regression   
     -Protein Data Bank (PDB)     
     -UCI Machine Learning Repository   
     -Wolfram Demonstration Projects for Machine Learning   
     -Wolfram Demonstration Projects for Regression     
     -Wolfram Demonstration Projects for Stochastic Gradient Descent          

Evolutionary Algorithms Applied to the Study of Intermolecular Interactions  

The present research project aims to study protein-ligand interactions through the application of evolutionary algorithms and empirical scoring functions. We will use structural information available at the Protein Data Bank (PDB), together with published binding-affinity data, to obtain training sets for empirical scoring functions that predict binding affinity. These scoring functions will be tuned using the available binding-affinity data, sorted by enzymatic class, which makes them specific to the molecular system being simulated. In this way, we hope to contribute to the understanding of the intermolecular interactions between proteins and ligands, a pivotal topic for computer-based drug design.

Keywords: Protein, binding affinity, drug design, evolutionary algorithms, differential evolution
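To make the idea of tuning a scoring function with an evolutionary algorithm concrete, here is a minimal sketch of differential evolution (the common rand/1/bin variant) fitting the weights of a linear scoring function to synthetic data. It is an illustration under stated assumptions, not the project's code: the linear form of the scoring function, the control parameters (population size, `F`, `CR`, number of generations), the bounds, and the data are all chosen for the example.

```python
import random

def rmse(weights, X, y):
    """RMSE of a linear scoring function: weights[0] + sum(w_j * x_j)."""
    total = 0.0
    for row, target in zip(X, y):
        predicted = weights[0] + sum(w * v for w, v in zip(weights[1:], row))
        total += (predicted - target) ** 2
    return (total / len(y)) ** 0.5

def differential_evolution(loss, dim, bounds=(-5.0, 5.0), pop_size=20,
                           F=0.8, CR=0.9, generations=200, seed=0):
    """Minimize `loss` over `dim` real parameters with DE/rand/1/bin."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    fitness = [loss(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == j_rand:
                    value = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    value = min(max(value, lo), hi)  # clip to the bounds
                else:
                    value = pop[i][j]
                trial.append(value)
            trial_fitness = loss(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_fitness <= fitness[i]:
                pop[i], fitness[i] = trial, trial_fitness
    best = min(range(pop_size), key=fitness.__getitem__)
    return pop[best], fitness[best]

# Synthetic training set: two descriptor terms per complex, exact linear response.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
y = [1.0 + 0.5 * x1 - 0.25 * x2 for x1, x2 in X]

best_weights, best_loss = differential_evolution(lambda w: rmse(w, X, y), dim=3)
print("weights:", [round(w, 3) for w in best_weights])
print("RMSE:", best_loss)
```

Because selection is greedy, the best loss in the population never increases from one generation to the next, which makes the method easy to monitor while tuning against binding-affinity data.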

Funding: R$ 120.000,00 (one-hundred and twenty thousand reais)
Funding Agency: Conselho Nacional de Desenvolvimento Científico e Tecnológico - National Council for Scientific and Technological Development (
Period: From March/2015 to February/2019. 
Principal Investigator: Walter F. de Azevedo Jr., Ph.D.
Process Number: 308883/2014-4