PDB FAQ
From PDBWiki
Here you find answers to frequently asked questions about the Protein Data Bank (PDB) and working with structural models.
Help
Please contribute to this collection by adding questions and answers that are of general interest to users of macromolecular models. If you would like to ask a question to which you do not know the answer, please do so on the PDB discussion page.
To add to this page, click the 'edit' button above. For more information about how to edit pages in a wiki and MediaWiki markup in particular see:
Basic topics
The following basic topics have been described in separate PDBWiki articles, and are therefore excluded from the FAQ.
- What is the PDB?
- What is a PDB code?
- What is an EC number?
- What is a biological unit?
- Wikipedia article about protein structure and structure determination
Questions & Answers
Mostly taken from the pdb-l mailing list
Technical
Q: How can I download (all / any) structures from the PDB?
A: There are many ways to obtain structures from the PDB. Your choice of method will probably depend on your requirements.
- Mirroring
- To download all, or a substantial portion, of the PDB archive, you probably want to access the PDB FTP services, here. Information about these services is provided by the RCSB here and here. In addition to FTP access, you can use rsync to stay up to date with PDB releases. Although a detailed discussion of rsync is beyond the scope of this FAQ, an example rsync script is provided here, which you may freely modify for your personal use.
- From a given PDB ID or list of IDs
- Details of FTP / script interfaces. See also Automated Downloads of PDB Data.
- Of a given protein
- To find the structures of a given protein, you probably want to search the PDB. To perform such a search, visit the PDB homepage, or simply construct a URL like this one: http://www.rcsb.org/pdb/search/navbarsearch.do?inputQuickSearch=atp+synthase
- This is only the simplest approach; many other methods apply, depending on the circumstances in which the question arises.
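The quick-search URL shown above can also be built programmatically. The following is a minimal Python sketch that assumes only the base URL and the `inputQuickSearch` parameter quoted in this answer; the helper name `quick_search_url` is illustrative, and the endpoint may have changed since this FAQ was written.

```python
from urllib.parse import urlencode

# Base URL and parameter name are copied from the example URL in this FAQ;
# the service may no longer respond at this address (assumption).
SEARCH_BASE = "http://www.rcsb.org/pdb/search/navbarsearch.do"


def quick_search_url(query: str) -> str:
    """Return a quick-search URL for a free-text query (hypothetical helper)."""
    # urlencode escapes the query string and encodes spaces as '+'
    return SEARCH_BASE + "?" + urlencode({"inputQuickSearch": query})


print(quick_search_url("atp synthase"))
```

Building the URL with `urlencode` rather than by string concatenation keeps queries containing spaces or punctuation valid.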
Q: How can I download the PDB in relational database format?
One possibility is to download all desired entries in mmCIF format from the PDB FTP site and then use OpenMMS to create relational database tables. Supported DBMSs include MySQL and Oracle.
Update: OpenMMS is no longer supported by the RCSB. You may however find some useful information in the OpenMMS mailing list archives.
Q: How can I parse a PDB entry and extract data?
Parsing a PDB entry can be very simple or quite tricky, depending on the data that you want to extract. There are several options available, depending on your language of choice.
- Perl
- For a robust solution, try the STAR (CIF) Parser to parse the PDB entry from the mmCIF format. For reference, an example script is included here.
- Java
- http://sw-tools.pdb.org/ ...
- Other
- ...
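For the simple end of the spectrum, coordinates can be read directly from the fixed-width ATOM and HETATM records of a PDB-format file. The sketch below is a minimal Python illustration, not a robust parser: the function names are hypothetical, and it assumes well-formed records with occupancy and B-factor present. It ignores everything else in the entry (multiple models, CONECT records, and so on).

```python
def parse_atom_line(line: str) -> dict:
    """Parse one ATOM/HETATM record using the fixed columns of the PDB format."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "alt_loc": line[16].strip(),      # column 17
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21].strip(),        # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66
        "element": line[76:78].strip(),   # columns 77-78 (may be absent)
    }


def parse_atoms(lines):
    """Yield a dict for every ATOM or HETATM record in an iterable of lines."""
    for line in lines:
        if line.startswith(("ATOM  ", "HETATM")):
            yield parse_atom_line(line)


# Illustrative record, assembled field by field so the column widths are explicit.
SAMPLE = (
    "ATOM  " + "    1" + " " + " N  " + " " + "MET" + " " + "A" + "   1" + " " + "   "
    + "  27.340" + "  24.430" + "   2.614" + "  1.00" + "  9.67" + " " * 10 + " N"
)

print(parse_atom_line(SAMPLE))
```

To read a file, pass an open file handle, for example `list(parse_atoms(open("your.pdb")))`. For anything beyond coordinates, an established parser for the language of your choice is the safer option.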
General
Q: What about structure X, Y, or Z?
Answering that question is the whole idea of PDBWiki! For example, see some of the latest discussions.
Q: How can I visualise a protein structure?
There are many different software packages for 'molecular visualization', and even whole communities who work exclusively in this area! For more information see the 'Molvist mailing list', where you can ask your question to that community directly.
Some popular software packages are:
- Chimera - feature rich, scriptable, high-quality graphics and animations
- FirstGlance in Jmol (linked to structure papers in Nature, the ConSurf Server that colors proteins by evolutionary conservation, and other resources) is very easy to use, automatically displays context-sensitive help and color keys, has tooltips, and makes visualization of non-covalent interactions especially easy. It has a limited set of canned views and does not allow customization of the molecular scene. It does show all salt bridges or cation-pi orbital interactions in a few clicks.
- Proteopedia.Org shows a rotatable, zoomable view of every entry in the PDB (in Jmol). Its Scene Authoring Tools make it easy to put customized molecular scenes online quickly. It is a wiki, so you can create new pages and annotate existing ones with text or new molecular scenes.
- Jmol is available both as a stand-alone application and as a Java applet that works in web browsers. It is extremely powerful for visualization, including surfaces, molecular orbitals, and cavities, translucency, independent movement of multiple models, biological units and crystal symmetry operations, animation of chemical reactions and conformational changes, arbitrary non-molecular arrows, planes, labels, captions, etc. Very active development and email discussion.
- DeepView - has an easy to use menu interface. (Comment: I love DeepView for its powerful modeling capabilities, but for ease of use, I'd rate it 2 on a scale of 10. Luckily Gale Rhodes has written some excellent tutorials. These and other DeepView help are linked to the DeepView section of the World Index of Molecular Visualization Resources. Eric Martz 04:12, 25 June 2008 (KST))
- PyMol - open source, scriptable, high-quality graphics and animations
- RasMol - fast, lightweight, robust, scriptable
- RasTop - derived from RasMol. Has a friendly graphical user interface and can save working sessions for later use. Otherwise it is much like RasMol and uses similar scripting methods.
Wikipedia also maintains a list of molecular graphic visualisation tools.
See also the freeware and commercial software sections at the World Index of Molecular Visualization Resources - molvisindex.org.
Q: How can I calculate the RMSD between two structures?
- maxcluster: a command-line tool for fast RMSD (and GDT) calculation that also performs clustering.
- In PyMol you can get the Cα RMSD with 'align obj1////CA, obj2////CA, cycles=0'. If the sequences are not identical, the superposition will be based on an internally computed sequence alignment.
- A clear explanation of RMSD, including a Python implementation, is available here
- See also Wikipedia's article on the Kabsch algorithm and the original reference:
Kabsch, Wolfgang (1976) "A solution for the best rotation to relate two sets of vectors", Acta Crystallographica A32:922-923
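For orientation, the RMSD itself is simple to compute once the atoms are paired and the structures superimposed. The Python sketch below covers only that final step; it does not perform the optimal superposition (the Kabsch step), which the tools above handle. The function name and the coordinate representation are assumptions made for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) tuples.

    Assumes atoms are already paired one-to-one and the two structures
    are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the paired atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

Calling this on coordinates that have not been superimposed gives a value that depends on the arbitrary placement of the two structures, so it is only meaningful after fitting.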
Q: Where can I find software for superimposing multiple protein structures?
A: See Wikipedia article on structure alignment software.
Answers from pdb-l:
- I would use the MSDfold service. It allows multiple alignment between uploaded sets of files or with the whole PDB archive, SCOP sets etc.
- I don't know which software is best, but CEMC does the job.
- I would recommend VMD (visual molecular dynamics) for this.
- I myself am quite happy with the results of MultiProt
A very nice command-line application for either one-to-one or multiple alignments is ProFit (http://www.bioinf.org.uk/software/profit/index.html). It does a very good job even with similar proteins in different conformations. It is distributed as a Linux or Windows binary and can be compiled on Mac OS X.
Some general considerations:
- Is the problem to superimpose multiple conformations of the same chain (such as predicted structures or NMR models)? Or is the problem to align and superimpose different proteins? What are you doing the superposition for? How good does it have to be? Are some parts of the chain more important than others to superimpose well? Do you know which parts those are?
Q: How can I produce an all-against-all protein structure similarity matrix?
- I believe that you can download DALI lite and run it on your own machine with whatever PDB files you want (it also has an all-against-all option).
- Another program is MUSTANG, which can be run on Linux and Mac OS X.
- There is a meta-server for structural similarity comparison of proteins - Protein Comparison, Knowledge, Similarity and Information (ProCKSI)
- Maxcluster is quite convenient and does clustering as well
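If you prefer to drive a pairwise comparison tool yourself, the bookkeeping for an all-against-all matrix is small. The Python sketch below assumes a symmetric scoring function; `similarity_matrix` and `toy_score` are hypothetical names, and `toy_score` is only a placeholder standing in for a real structural comparison such as one of the programs listed above.

```python
from itertools import combinations


def similarity_matrix(items, score):
    """Return a dict-of-dicts matrix of score(a, b) for every pair in `items`.

    `items` maps a label to whatever object `score` understands.
    `score` is assumed to be symmetric, so each pair is computed once.
    """
    names = list(items)
    matrix = {a: {} for a in names}
    for a in names:
        matrix[a][a] = score(items[a], items[a])
    for a, b in combinations(names, 2):
        value = score(items[a], items[b])
        matrix[a][b] = value
        matrix[b][a] = value
    return matrix


def toy_score(a, b):
    """Placeholder score (difference in length); NOT a structural measure."""
    return abs(len(a) - len(b))


structures = {"structure_A": [1, 2, 3], "structure_B": [1], "structure_C": [1, 2]}
matrix = similarity_matrix(structures, toy_score)
```

For n structures this performs n(n-1)/2 pairwise comparisons plus n self-comparisons, which is worth bearing in mind before running it against a large set.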
Q: What is the R-factor?
Sources: [3]
In overview, the R-factor is a measure of how well a particular structural model agrees with the experimental data. Or simply, "a measure of agreement between the crystallographic model and the original X-ray diffraction data".
For a detailed description of the R-factor, see the relevant section of Gale Rhodes's excellent on-line tutorial; R-factor and Free R-factor
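As a rough illustration, the conventional R-factor is the sum of the absolute differences between observed and calculated structure-factor amplitudes, divided by the sum of the observed amplitudes. The Python sketch below assumes the two lists are already paired reflection by reflection and on the same scale; the function name is illustrative, and refinement programs apply further details such as scaling and resolution cutoffs.

```python
def r_factor(f_obs, f_calc):
    """Conventional R-factor: sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|).

    Assumes `f_obs` and `f_calc` are paired per reflection and already
    on a common scale (the scale factor is not fitted here).
    """
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("amplitude lists must be non-empty and of equal length")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator
```

The free R-factor uses the same formula but is evaluated only on a set of reflections that was held out of refinement.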
Q: What is the B-factor?
The B-factor (or temperature factor) is an indicator of the thermal motion of an atom.
However, it should be pointed out that the B-factor reflects a mix of real thermal displacement, static disorder (multiple but defined conformations) and dynamic disorder (no defined conformation), and these categories overlap.
For a detailed description of the B-factor, see the relevant section of Gale Rhodes's excellent on-line tutorial; Temperature factor
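To get a feel for the numbers, an isotropic B-factor can be converted to a root-mean-square displacement using the standard relation B = 8π²⟨u²⟩, where ⟨u²⟩ is the mean-square displacement of the atom. The Python sketch below assumes isotropic B-factors in Å²; the function name is illustrative. As noted above, the resulting displacement lumps together thermal motion and disorder.

```python
import math


def rms_displacement(b_factor: float) -> float:
    """RMS displacement (in Å) implied by an isotropic B-factor (in Å^2).

    Uses B = 8 * pi^2 * <u^2>, so sqrt(<u^2>) = sqrt(B / (8 * pi^2)).
    """
    if b_factor < 0:
        raise ValueError("B-factor must be non-negative")
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))


# Print the implied displacement for a few illustrative B-factors
for b in (10.0, 20.0, 40.0, 80.0):
    print(f"B = {b:5.1f} A^2  ->  rms displacement = {rms_displacement(b):.2f} A")
```

Because of the square root, quadrupling the B-factor only doubles the implied displacement.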
Searching
Q: Where can I find a subset of PDB structures suitable for my analysis?
A: This question is surprisingly common on the list but is almost impossible to answer without more details about the kind of analysis that you want to perform. Some issues to consider before you ask this question are:
- Do I want lots of very similar structures for a detailed analysis, e.g. many structures of mutants of the same protein?
- Do I (rather) need lots of diverse structures to make a general analysis? See Q: How can I obtain a non-redundant (or representative) set of PDB structures?
- Is the quality of the structures critical to my analysis, or could I compare structures of different quality?
- Do I need a set of protein oligomers and/or complexes?
- Lots more issues to consider...
Actually there is simply no generic answer to this question. The datasets used will be as varied as the analysis performed. However, you can use the above as a guide.
Q: How can I obtain a non-redundant (or representative) set of PDB structures?
Sources: [6] [7] (Some more to add.)
A: There are several sources for non-redundant subsets of PDB chains, including:
For non-redundant sets of protein complexes, try:
For non-redundant sets of domains, use SCOP or CATH. Specifically you can obtain non-redundant sets of SCOP domain sequences via the ASTRAL database server.
Q: How do I find structures which have been solved by multiple experimental methods (e.g. X-ray and NMR)?
- See http://www.pdb.org/pdb/statistics/clusterExpMethods.do
- Or use PDBWiki!: simply browse through Experimental method
Q: How do I find structures containing a specific ligand?
Searching for all instances of a given chemical compound?
- A name helps (see below). For that you could try http://www.chem.qmul.ac.uk/iupac/
- You can start from the PDB home site, select "Search" then "Ligands". Now you can draw your molecule in the great "ChemAxon" applet (here), select "substructure" at the bottom and push the "search" button.
- Similarly http://ligand-depot.rutgers.edu/sketch.html - This is the ChemAxon tool again, but has been suggested to work better here than above.
- A similar system can be found here as http://www.ebi.ac.uk/msd-srv/msdmotif/chem/
- Or http://www.ebi.ac.uk/msd-srv/msdchem
- I am pretty sure relibase can do this via a 2D/3D search, here.
Searching for a specific ligand conformation?
Sources;
[11]
More specifically, the rotational state of an amide (cis or trans, relative to a known atom)...
ValLigURL almost (but not quite) does what you want. Upload your ligand and it will retrieve all the copies of the same ligand from the pdb, superimpose them on yours, and report the results. By reducing the atoms in the search ligand to a set including just a single dihedral, and unchecking "Ignore ligands with different number of atoms," you can sort by RMS to find just what you are looking for.
Searching for all instances of a named compound?
This question gets into the whole issue of chemical codes in the PDB (the chemical component dictionary). This is a big topic. For example, see the remediated chemical component dictionary and the remediation discussion archive.
- Lots of links about 'ligand' resources to put here...
Searching for all ligands containing a specific group?
Sources;
[12]
Try MSDChem.
Q: How do I find structures for a certain source organism?
Sources; [13]
Several possibilities:
- Use PDBWiki!: browse through Source organism
- RCSB PDB's advanced search: use 'Source Organism' under 'Biology and Chemistry' subsection. In fact one can narrow down a search by many other criteria: ligands, EC, GO classification ...
- OCA database browser: use the 'Organism' field
- Jena library's search by Genus/Species classification
Q: How do I find structures with a specific function?
Other than searching the PDB website directly, you can try to use the GO2PDB mapping provided by the Jena library. It allows searching by Gene Ontology terms.
Q: How do I find structures with unknown function?
See Structures with unknown function
Analysis/Modelling
Q: How do I build a homology model?
Good question! In general it depends on how good a model you need to make, and how much effort you are prepared to put into it. For many purposes, all you need is a good sequence alignment.
Some suggestions are:
- Swiss-Model
- offers a range of levels of manual intervention.
- DeepView/SPDV
- gives you a lot of power to manually refine the model. The downside is that it tends to crash a lot. On the upside, it has recently been updated and offers a direct
- MolIDE
- offers a simple GUI-based modeling approach using PSIBLAST, LOOPY and SCWRL.
- MODELLER
- if you want total control. No GUI but probably one of the most stable and powerful methods available for free.
- What If
- allows homology modelling via a web interface.
Q: Where can I find a database of 3D models?
Sources; [16]
Try:
Q: What tool do you recommend for tertiary structure prediction?
Some servers with the best overall performance in the CASP7 assessment are:
Q: How do I find (potential) 3D functional sites in a protein structure?
A: See this Wikiomics article on Searching for 3D functional sites in a protein structure.
Some web servers that predict functional sites based on 3D structure (Sources: [17] [18]):
- SPPIDER - Solvent accessibility based Protein-Protein Interface iDEntification and Recognition
- ELM - Functional site prediction
- ConSurf - Server for the Identification of Functional Regions in Proteins
- CASTp - Computed Atlas of Surface Topography of proteins
- PINUP - Protein binding site prediction with an empirical scoring function
- Protein-Protein Interaction Server
- InterPreTS - Prediction of the potential interaction of two proteins from their sequences
- PPI-Pred - Protein-Protein Interface (Binding Site) Prediction
- ProMate - Predicting the location of potential protein-protein binding sites for unbound proteins
- PRISM - protein interactions by structural matching
- PI2PE - cons-PPISP for finding protein-protein interfaces, or DISPLAR for protein-DNA interfaces
It is also always useful to visit the home site of CAPRI - Critical Assessment of PRediction of Interactions to learn more about groups and servers in this field.
Q: How do I define or select interacting residues?
Sources: [19] [20] [21] [22] [23] [24]
The most common way to answer this question is to design, create and publish your own database of interacting residues, and then query that database for the data you need. Alternatively, you can use one of the existing databases.
Some of those include (in no particular order); the Protein Interaction Calculator | Molsurfer | SCOPPI | CSU | CMA | InterPare | PIBASE | 3DID | ...
There is a page on MetaBase listing a similar set of resources under the entry for PSIMAP
Additionally, there are several molecular visualization tools that will allow you to measure explicit distances between atoms and to define selections of atoms within a specified distance of other selections. These tools allow you to probe interacting residues interactively. FirstGlance in Jmol makes it very easy to visualize all the interactions with any selected moiety (in its Contacts dialog) and to measure distances. See Q: How can I visualise a protein structure?
There is also a standalone program called "Ligplot" (http://www.biochem.ucl.ac.uk/bsm/ligplot/ligplot.html). Its main aim is to generate 2D plots of a ligand and the protein residues that interact with it. It can highlight H bonds and hydrophobic contacts. If used with NACCESS (see Q: How do I find the solvent accessible surface area for a certain structure?), it also displays accessible surface area. As a by-product, it generates a small PDB-formatted file containing just the ligand and the interacting residues.
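The distance-based selection mentioned above (atoms within a specified distance of another selection) can also be sketched in a few lines of Python. The function name, the residue labels and the 4.0 Å default cutoff are all assumptions for illustration; the appropriate cutoff depends on the kind of interaction you are looking for, and the dedicated tools listed here apply more careful, interaction-specific criteria.

```python
import math


def residues_in_contact(atoms_a, atoms_b, cutoff=4.0):
    """Return sorted (residue_a, residue_b) pairs with any atom pair within `cutoff` Å.

    Each atom is given as (residue_label, (x, y, z)). This is a brute-force
    comparison of every atom in one set against every atom in the other.
    """
    pairs = set()
    for res_a, xyz_a in atoms_a:
        for res_b, xyz_b in atoms_b:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                pairs.add((res_a, res_b))
    return sorted(pairs)


# Illustrative, made-up coordinates for two chains
chain_a = [("A:10", (0.0, 0.0, 0.0)), ("A:11", (10.0, 0.0, 0.0))]
chain_b = [("B:5", (3.0, 0.0, 0.0)), ("B:6", (20.0, 0.0, 0.0))]
contacts = residues_in_contact(chain_a, chain_b)
```

The brute-force loop is fine for a single interface; for whole-archive analyses a spatial index or one of the existing databases is a better fit.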
Q: How do I get the electrostatic surface potential for a certain structure?
Sources; [25]
- See APBS tutorial
- Use PyMol: Pymol can generate the potential map for you while displaying the potential spectrum of your protein. Steps:
- Load your PDB file 'your.pdb' into PyMol
- From the main menu choose Plugin -> APBS tools
- Click 'Use another PQR' and click 'Choose Externally Generated PQR:' to load the PQR file (your.pqr) that you generated beforehand with the pdb2pqr program. You can also go to the 'Configuration' tab to set parameters such as temperature, dielectric constants, etc.
- Click on 'Set grid'
- Click on 'Run APBS'
- Wait for a few seconds/minutes (depending on your molecule size)
- Then you can go to 'Visualization' panel to see the potential color in the PyMol Viewer once you click 'Update'
- There are several buttons for you to explore. You will probably want to show the 'Molecular Surface' of your protein, so click 'Show'.
Q: How do I find the solvent accessible surface area for a certain structure?
A: There are a few options:
Q: How do I find/predict/display secondary structure?
Defining Secondary Structure Elements in structures
The PDB files in the RCSB contain information about the secondary structure of protein entries. This information is called the 'authors' definition'. You should be careful if you use the secondary structure assignments provided by the authors in the PDB file. You might expect that they are of high quality because they were curated by people who know the proteins quite well. But there are frequently overlaps between different secondary structure elements. Also, not all PDB entries contain secondary structure assignments.
You can assign secondary structures using several different automatic methods:
- DSSP
- defines secondary structure from hydrogen bonding. It only has one adjustable parameter (the H-bond energy cutoff).
- STRIDE
- similar to DSSP, but also uses geometric criteria for the backbone, which allows better definition on low-quality structures where H-bonds may not be clear. STRIDE does not include the "S" (bend) category of DSSP.
- DEFINE
- is no longer used these days; it worked from completely different principles.
A comparison of DSSP, STRIDE and DEFINE can be found in PMID 10081963. DSSP and STRIDE disagree on the boundaries of helices. A number of different methods are also compared in PMID 8332595. The original DSSP paper by Kabsch and Sander gives a very clear explanation of how the method works (PMID 6667333).
The advantage of using DSSP or STRIDE over the structure-author definitions is that they follow a consistent model. Of course, many authors base their secondary structure definitions on those from DSSP.
How long is a helix/sheet?
Source; [30]
Where may I find a statistical analysis of the length of alpha-helices and beta-strands in a representative set of globular proteins?
Please see: Sandeep Kumar and Manju Bansal. 1998. Geometrical and sequence characteristics of alpha helices in globular proteins. Biophysical Journal, 75, 4, 1935 - 1944. PMID 9746534
There are some statistics on this for DSSP, DEFINE and STRIDE secondary structure definitions for the Rost and Sander 126 protein set in:
Cuff, J. A. and Barton, G. J. (1999), "Evaluation and improvement of multiple sequence methods for protein secondary structure prediction", PROTEINS, 34, 508-519. PMID 10081963
The paper includes a Figure that shows the distribution of lengths.
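If you want to compute such a length distribution for your own set of structures, a per-residue assignment string is a convenient starting point. The Python sketch below assumes one string per chain using DSSP-style one-letter codes, in which 'H' denotes an α-helix and 'E' an extended strand; the function names are illustrative, and chain breaks are not handled.

```python
from collections import Counter
from itertools import groupby


def segment_lengths(ss_string, state):
    """Lengths of consecutive runs of one secondary-structure code.

    `ss_string` holds one code per residue (e.g. 'H' for helix, 'E' for strand).
    """
    return [len(list(run)) for code, run in groupby(ss_string) if code == state]


def length_distribution(ss_strings, state):
    """Count how often each run length occurs across many assignment strings."""
    counts = Counter()
    for ss in ss_strings:
        counts.update(segment_lengths(ss, state))
    return counts


# Made-up assignment string, for illustration only
example = "--HHHH--EEE-HH"
helix_runs = segment_lengths(example, "H")
strand_runs = segment_lengths(example, "E")
```

Note that different assignment methods place segment boundaries differently, so the resulting distributions depend on which method produced the strings.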
Predicting secondary structure
Source; [31]
There are several nearly equivalent secondary-structure prediction techniques, all based on neural nets: PSIPRED (PMID 10493868) and the SAM-Txx (PMID 16187355) methods are among the best, but they are mostly very close in accuracy. See also YASSPP (PMID 16763996). PSIPRED can be downloaded here.
Displaying secondary structure on a sequence
If the sequence is from a known structure, you can use PDBsum. To display a multiple sequence alignment with given secondary structure elements, there are several choices:
- ALSCRIPT
- makelogo
- LOOK (developed by Chris Lee and S. Subbiah)
- Genemine
- Multalign Viewer
Meta
Q: Disclaimer?
If you find errors or omissions in this FAQ, please feel free to fix them.
Q: Who maintains this FAQ?
This page is part of PDBWiki. As such it is maintained as a collaborative effort of wiki users. Many of the questions and answers here are based on postings to the pdb-l mailing list at SDSC.
Q: How do I...?
See Help:Contents or post your question over at the PDB discussion page.
