Computer-based molecular modeling in drug discovery
Posted: Sun May 25, 2008 8:26 am
Computer modeling in drug discovery
I have been asked to comment on a treatment protocol being put forward based on computer modeling of small molecules in the ligand binding sites of receptors. I am not familiar with the protocol, but I can comment on some of the many caveats that must be considered when interpreting computer-based modeling of drug-protein interactions. The bottom line is that the modeling by itself is only suggestive of possibilities, not proof. It requires much more work: experimental testing with in vitro assays, cell-based assays, and structural biology, and only then can a candidate proceed into animal models and eventually into clinical trials.
First, the computer modeling begins with a 3D structure determined by X-ray crystallography, by NMR, or from a homology model built on a similar known structure. All of these approaches have drawbacks in providing a true representation of the protein. The 3D coordinates give a largely static view of the protein. The protein must be crystallized into a highly repetitive stacking of the homogeneous protein molecules so that, when X-rays are directed at the crystal, a consistent diffraction pattern is produced as the X-rays hit individual atoms and have their paths altered. Doing this with millions of the molecules in a crystal then allows the average position of each atom of the basic molecule to be determined mathematically. These are just average positions and can be off by the resolution distance given for a crystal structure, such as 2 angstroms, which is about a typical atom-atom bond length (like a carbon-carbon bond). So a structure at 5 angstrom resolution is very crude; a structure is considered a good depiction at 2 angstroms or better. There are more than 45,000 protein structures posted on the Protein Data Bank (www.rcsb.org), where most researchers go for proteins to model.
Now, in generating a crystal structure, the crystallographer (or structural biologist) has to work out the conditions to get the proteins to stack. If the proteins get too tightly packed, they precipitate out and are just a clump of junk. If the proteins are too loose, they are still in random positions relative to each other, so the average positioning of atoms cannot be determined. Even under the best circumstances, some of the amino acids at either end of the protein may remain undefined because they are still free to move about, giving random data as to their position relative to the rest of the protein. You'll notice this in many published structures: residues 1-4 or so are often missing. To simplify the work, sometimes the crystallographer will work with only the supposedly critical portion of the protein, such as the enzyme active site. This can usually give a more definitive picture of that portion of the protein, but it can miss allosteric sites, those sites away from the protein's ligand binding site that nevertheless have a significant effect on the protein's activity.
As the crystallographer works out the conditions, the best crystals for X-ray work often form under very harsh conditions, far from the in vivo conditions. The crystal may only form near 0°C (32°F) and with a high salt content in the buffer. Crystallographers often add heavy metal ions too, since these give strong diffraction patterns that help the crystallographer initially see the overall repetitive pattern of the molecule, i.e., is the molecule always in the same orientation, or do two of them appear as a mirrored pair at 180°, or three at 120°, or in some even more complex repetitive pattern? By the time a crystal is obtained, the proteins may be so dense (but still soluble) that they distort each other's atomic positions due to the pressure of their stacking. So the static picture that results may have a distorted ligand binding site and may be missing some sites on the protein that are important to overall activity. The final structure may also lack required cofactors. And key water molecules may not be placed, even though a specific water molecule in the structure can provide an electrostatic bridge that helps hold a ligand in the active site.
In the cell, proteins are very flexible and have a lot of motion: motion of the side chains off each amino acid, and motion of large domains of the protein. Many proteins are more active when they bind in homodimeric groups (with another copy of the same protein) or heterodimeric groups (with a different protein). Also, some proteins may be embedded in a lipid layer (a detergent-like or oily environment) and then move into the more aqueous environment of the cell as they translocate to the nucleus to convey a signal. It is difficult to crystallize a protein across heterogeneous environments like these, but those different environments can be very important to the protein's normal functioning. The motion of the protein needs to be taken into account, since it can be very dramatic. Kinases are a popular target in cancer-related drug discovery, but kinases are usually very flexible: the active site can be tight or open very wide, so drugs of a range of sizes could be effective inhibitors. This flexibility of protein structures is only now being addressed in virtual screening, but it is very important.
The software used in molecular modeling and virtual screening (docking thousands of small molecules into a protein's active site one at a time and calculating potential binding energies to see whether each molecule could be a good inhibitor) is very new; it has only been effective for perhaps the past 10 years, and it is getting better. Many virtual screening labs will use software from more than one vendor so that a consensus is obtained about particular molecules as drug candidates. Using several different software packages for the same screen means several different algorithms are used to judge whether a particular molecule is a good potential inhibitor; the weaknesses of one algorithm may be compensated by the others, so that the truly better molecules show the best results virtually. These are only potential candidates. Predicting binding affinities and other kinetic data (such as IC50 values) accurately is well beyond the capabilities of the current software. Virtual results always, always, always need experimental confirmation, always. We usually do not want to show a molecule's structure to our chemists until we can confirm its effect experimentally; the chemists don't want to go on just computer-based (in silico) results.
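To illustrate the multi-vendor consensus idea, here is a minimal sketch of rank-based consensus scoring. The molecule names and docking scores are made up, and real pipelines use more sophisticated fusion schemes, but the principle is the same: convert each program's scores to ranks so no one scoring scale dominates, then average.

```python
from collections import defaultdict

def consensus_rank(score_tables):
    """score_tables: list of dicts mapping molecule id -> docking score
    (lower = better, as with most docking energy scores).
    Returns molecule ids ordered by average rank across programs."""
    rank_sums = defaultdict(float)
    for table in score_tables:
        # sort molecules best-to-worst for this program
        ordered = sorted(table, key=table.get)
        for rank, mol in enumerate(ordered, start=1):
            rank_sums[mol] += rank
    n = len(score_tables)
    # lower average rank = stronger consensus candidate
    return sorted(rank_sums, key=lambda m: rank_sums[m] / n)

# Toy example: made-up scores from three hypothetical docking programs
prog_a = {"mol1": -9.2, "mol2": -7.5, "mol3": -8.1}
prog_b = {"mol1": -8.8, "mol2": -9.0, "mol3": -7.0}
prog_c = {"mol1": -10.1, "mol2": -6.5, "mol3": -8.0}
print(consensus_rank([prog_a, prog_b, prog_c]))  # mol1 ranks first
```

Here mol1 is ranked near the top by all three programs, so it wins the consensus even though it is not the single best scorer in every program.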
Once a molecule is confirmed in the experimental assays, we will try it in cell-based assays to see if it gets into the cell, a major issue. We will also try to determine experimentally whether it is hitting the target we expect it to hit. And we will take the molecule and try to co-crystallize it with the protein to see whether it is docking in the active site as our original screening predicted.
So, what I am saying is that a lot of experimental work must be done before validity can be given to the virtual work. When we virtually screen a protein target using a library of, for example, 50,000 molecules, we do not claim that a true drug candidate will be in the top 5,000 molecules based on virtual scores. We think of it as enriching the top 10-20% to improve our chances of finding a good molecule quickly without having to experimentally screen all 50,000 molecules. The virtual screening can at least eliminate those molecules in the 50,000 that are completely wrong for the protein, such as being too big for the active site or having the wrong electrostatic charges. This can save a lot of time and money on experimental work, but it doesn't guarantee that any of our molecules are good, and we may need to go to another collection of small molecules to test.
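The "enrichment" idea above is commonly quantified as an enrichment factor: how many more known actives land in the top fraction of the ranked list than random picking would place there. A small sketch, with an entirely made-up 100-molecule library and ranking:

```python
def enrichment_factor(ranked_ids, active_ids, top_fraction=0.10):
    """Ratio of the active hit rate in the top fraction of a ranked
    screening list to the hit rate expected from random selection."""
    n_top = max(1, int(len(ranked_ids) * top_fraction))
    hits_in_top = sum(1 for m in ranked_ids[:n_top] if m in active_ids)
    hit_rate_top = hits_in_top / n_top
    hit_rate_overall = len(active_ids) / len(ranked_ids)
    return hit_rate_top / hit_rate_overall

# Toy library of 100 molecules with 5 known actives (a1..a5);
# suppose the virtual screen placed 3 of them in the top 10 positions.
actives = {"a1", "a2", "a3", "a4", "a5"}
ranked = (["a1", "m1", "a2", "m2", "m3", "a3", "m4", "m5", "m6", "m7"]
          + ["a4", "a5"] + [f"m{i}" for i in range(8, 96)])
print(enrichment_factor(ranked, actives))  # 6.0
```

An enrichment factor of 6 means the top 10% of the list is six times richer in actives than the library as a whole, which is exactly the kind of head start, not a guarantee, that virtual screening buys.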
All that being said, I was able to use virtual screening to find a drug candidate targeting S-adenosylmethionine decarboxylase back in 2004. It is still being developed towards a preclinical drug, i.e. the chemists are trying various modifications to make it even more potent. In that project I took the 1,990 molecules of the NCI Diversity set, virtually screened them and then had our collaborators test 133 of the top 300. I believe our hit was number 76 out of that. So sometimes virtual screening can be a good start, but it always takes experimental work to confirm it.
Once you have a molecule of interest, how do you know it won't hit other proteins and cause side effects? That too takes experimental cell-based work, but there are attempts to use virtual screening to research this. We are developing a system that takes protein structures from the Protein Data Bank, docks your molecule of interest into hundreds of different proteins, and reports those proteins for which there appears to be a strong interaction. Not many labs are doing this yet because of the amount of work involved in preparing the virtual structures of the proteins. We have over 800 proteins in our system now, and we are trying to incorporate different conformations of them to represent their flexibility. We want to get an NIH grant to develop this further and make it available to other researchers. In the meantime, I am going to meet with our internal computer guys to see if they will take over the project's software development and let us focus on preparing the proteins for it. We have a unique approach to determining the important interactions, and we should have a paper coming out on it soon. We presented the system at a software user conference a year ago; they were very interested and liked our scoring approach better than the other approaches that could be used. If the system works right, it can point to potential adverse effects. But it can also help with retargeting of known approved drugs; for example, an arthritis drug may hit some proteins that could be of interest in cancer research. We have validated our system with published data (e.g., trichostatin A, a known inhibitor of deacetylases, finds the deacetylases in our protein collection; staurosporine, a known inhibitor of kinases, finds many of the kinases in our protein collection).
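The shape of such an inverse-docking ("target fishing") workflow can be sketched as below. This is only an illustration of the loop, not the system described above: the `dock()` function is a placeholder for whatever docking engine is used, and the score threshold and per-target scores are invented for the example.

```python
def fish_targets(molecule, protein_structures, dock, threshold=-8.0):
    """Dock one molecule against many prepared protein structures and
    report those whose score passes the threshold
    (more negative = stronger predicted binding)."""
    hits = []
    for protein_id, structure in protein_structures.items():
        score = dock(molecule, structure)
        if score <= threshold:
            hits.append((protein_id, score))
    # strongest predicted interactions first
    return sorted(hits, key=lambda h: h[1])

# Toy stand-in for a real docking engine: fixed made-up scores per target
fake_scores = {"HDAC1": -9.4, "CDK2": -7.1, "COX2": -8.6}
dock = lambda mol, structure: fake_scores[structure]
proteins = {pid: pid for pid in fake_scores}  # structures elided in this toy
print(fish_targets("trichostatin A", proteins, dock))
# [('HDAC1', -9.4), ('COX2', -8.6)]
```

In a real system, each entry in `protein_structures` would be a carefully prepared structure (protonation states, cofactors, binding-site definition), which is precisely the labor-intensive preparation mentioned above.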
So, in conclusion, virtual modeling is helpful but needs to be tied to experimental confirmation. As for the protocol in question, I did not see details on the virtual modeling software used or on experimental confirmation done. I think the author was suggesting that his virtual work should be tested experimentally. The virtual work needs validation before one can develop therapies around it.
Wesley