The topology of a protein structure is a highly simplified description of its fold, including only the sequence of its secondary structure elements and their relative spatial positions and approximate orientations.
Structural comparison methods are sometimes applicable to interface comparison, because whole-structure comparison involves many more residues and is therefore deemed a harder problem than comparing interfaces.
Thursday, August 30, 2007
Tuesday, August 28, 2007
Prediction of protein–protein interactions by combining structure and sequence conservation in protein interfaces
Authors: A. Selim Aytuna, Attila Gursoy∗ and Ozlem Keskin∗ 2005
Basic idea: If A interacts with B and a's interface resembles A's and b's resembles B's, then predict a interacts with b.
Input: "template dataset" with known interacting interfaces; "target dataset" with interfaces we want to predict.
"The template dataset handles structure and sequence conservation by combining two previously generated datasets: the structurally non-redundant dataset of protein–protein interfaces extracted from the PDB and the set of conserved residues on these interfaces (computational hotspots). The target dataset is a sequentially non-redundant set of all protein complexes and chains in the PDB."
"Proteins associate through binding sites. These sites are believed to contribute to the biomolecular recognition and binding of proteins by providing specific chemical and physical properties necessary for these processes."
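The template-matching rule above can be sketched in a few lines. This is only an illustrative skeleton: the `similar` predicate and the way interfaces are represented are placeholders, not the authors' actual combined structural-alignment and hotspot-conservation scoring.

```python
# Sketch of the prediction rule: if template pair (A, B) interacts,
# a resembles A, and b resembles B, predict that a interacts with b.
# `similar(x, y)` is a hypothetical interface-similarity test.

def predict_interactions(templates, targets, similar):
    """templates: list of (A, B) interface pairs known to interact.
    targets: list of candidate interfaces.
    Returns predicted interacting pairs (a, b) from targets."""
    predictions = []
    for (A, B) in templates:
        for a in targets:
            if not similar(a, A):
                continue
            for b in targets:
                if similar(b, B):
                    predictions.append((a, b))
    return predictions
```

The real method scores structural resemblance of interfaces and checks conservation of computational hotspots; here that is all collapsed into the single `similar` call.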
Friday, August 24, 2007
Structure-based querying of proteins using wavelets
We elect to normalize the input signal, fixing the size of the distance matrix to 128*128. This normalization occurs through interpolation or extrapolation, depending on whether the input protein is longer or shorter than 128 residues.
Interpolation smooths or averages the excess points, while extrapolation generates additional data for proteins of smaller lengths.
128 was chosen because the proteins in the dataset are mostly shorter than 256 residues, and a number that is a power of 2 suits their use.
I have thought about enlarging smaller matrices by extrapolating the values that are present. Although a continuous Markov chain may offer some hope, the generated values are still artificial. Doing this may introduce noise that distorts the results, so for now I think we stick with the superimposition method.
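A minimal sketch of normalizing a distance matrix to a fixed size by resampling. Nearest-neighbour index mapping is used here purely for simplicity; the paper's actual interpolation/extrapolation scheme is not specified in detail, so this is an assumption.

```python
import numpy as np

def normalize_matrix(D, size=128):
    """Resample an n x n residue-residue distance matrix to size x size.

    For n > size this downsamples (dropping/averaging excess points);
    for n < size it upsamples (repeating existing values). Both cases
    are handled by mapping each output index back onto an input index.
    """
    n = D.shape[0]
    idx = np.minimum(np.arange(size) * n // size, n - 1)
    return D[np.ix_(idx, idx)]
```

Because rows and columns are resampled with the same index map, a symmetric input stays symmetric.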
Friday, August 17, 2007
Another method to compare two matrices proposed by Marcel and Good operator definition
Three pieces of advice are valuable:
1. Start small. I should take a look at short chains of similar interfaces (how to define similar interfaces?) and build the right intuition.
2. A good comparison operator should, to some extent, output small distance values for two matrices generated from biologically known similar interfaces, and large distance values for matrices generated from biologically known different interfaces.
3. The compatibility problem can be handled in one of the following two ways:
i. Suppose A is a big matrix and B a small one. Locate a sub-matrix C (compatible with B) in A such that dis(B, C) is smaller than for any other sub-matrix of A.
ii. Select entries from A, in the order they appear in A, to form a sub-matrix D such that dis(B, D) is smaller than for any other D-like sub-matrix of A.
After the results are obtained, they can be compared with those obtained from my method of enlarging small matrices, and the better one (or both) reported.
This weekend, it would be good to find out the sizes of those 11,000+ matrices. This info will help in deciding whether the method of enlarging small matrices is sound.
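Option (i) can be sketched as a sliding-window search. Restricting the search to contiguous windows along A's diagonal is my simplifying assumption (it keeps the sub-matrix a genuine distance matrix); option (ii) would consider arbitrary ordered index subsets, which is combinatorially much larger.

```python
import numpy as np

def best_submatrix(A, B):
    """Slide a k x k window along A's diagonal (k = B's size) and
    return (offset, distance) of the window minimizing the Frobenius
    distance sqrt(sum((C - B)^2)) to B."""
    k = B.shape[0]
    n = A.shape[0]
    best_off, best_d = 0, float("inf")
    for off in range(n - k + 1):
        C = A[off:off + k, off:off + k]
        d = np.sqrt(((C - B) ** 2).sum())
        if d < best_d:
            best_off, best_d = off, d
    return best_off, best_d
```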
Wednesday, August 15, 2007
Reject MatAlign
Reasons:
1. Comparing M1 (p*q) and M2 (r*s) generates p*r score matrices; building each score matrix takes one run of the Needleman–Wunsch algorithm, and then p*r further NW runs are applied on these score matrices. In total about 2*p*r NW runs, which is too computationally costly.
2. Distance matrices of proteins are symmetric; a method designed for asymmetric matrices also has to deal with the transposed matrix.
Justification for adding dummies and time efficiency consideration
From the definition of the distance between two matrices, sqrt(sum((A[i, j] - B[i, j])^2)), we can see that dummy values cancel out wherever both matrices hold a dummy at the same position. If, at a particular position, one matrix has a dummy value and the other a real value, the dummy minus the real value is a reasonable measure of their difference.
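The padding and distance computation above can be sketched as follows; the 20 Å dummy value is the one proposed earlier, and the helper names are mine.

```python
import numpy as np

DUMMY = 20.0  # dummy distance value in angstroms, as proposed earlier

def pad_to(D, size, fill=DUMMY):
    """Enlarge a small n x n distance matrix to size x size by padding
    the extra rows/columns with the dummy value."""
    P = np.full((size, size), fill)
    n = D.shape[0]
    P[:n, :n] = D
    return P

def dist(A, B):
    """Frobenius distance sqrt(sum((A - B)^2)) between equal-size matrices."""
    return np.sqrt(((A - B) ** 2).sum())
```

Where both padded matrices hold the dummy, the term (A[i, j] - B[i, j]) is zero and contributes nothing; where only one does, the term contributes (dummy - real)^2, exactly as argued above.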
Clustering takes O(N^2) time, which is its weakness with respect to scalability. After data cleanup, 11,558 interfaces remain. For now, clustering still completes in reasonable time.
Tuesday, August 14, 2007
Discussion with Prof. Tan
First discussion after I received all the data and code from Zeyar.
Ideas for comparing two matrices of different sizes (10 <= n <= 200):
1. Enlarge the smaller ones to 200*200 matrices by adding dummy values (20 Å in this case).
2. Enlarge the smaller ones to 200*200 by repeating the values that appear in the matrices.
3. Enlarge the smaller ones to 200*200 by applying stochastic modelling (say, a Markov chain).
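Idea 2 (enlarging by repetition) can be sketched by tiling the small matrix and cropping; this particular tile-and-crop formulation is my assumption about what "repeating the values" would look like.

```python
import numpy as np

def enlarge_by_repeat(D, size=200):
    """Grow a small n x n matrix to size x size by tiling (repeating)
    its existing values, then cropping to the target size."""
    reps = -(-size // D.shape[0])  # ceiling division: tiles needed per axis
    return np.tile(D, (reps, reps))[:size, :size]
```

Unlike dummy padding, this keeps every entry a "real" observed distance, but the repeated blocks no longer correspond to actual residue pairs, so it is not obviously less artificial.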
first post
My friend introduced this blog tool to me. This is the first post of my blog.
This blog will be absolutely private and record the HYP progress.