Xiaohui Li: 2007

Sunday, December 30, 2007

Method to get around the problems

As mentioned in the last post, there are two problems to be solved.
1. Store the distances separately for each matrix. Read the necessary files when needed.
2. Since the ordering matters, a matrix is randomly picked to cluster.

Two problems

I carried on with my own matrix comparison method, in which sliding window is used. Two problems have been identified:
1. I have only computed the distance between A and B where A is before B by name. The distance between B and A should also be computed.
2. In clustering, the ordering of the matrices' appearance is important. For instance, A, B and C are three matrices. Assume A is a single-member cluster. The distance between A and C is bigger than threshold (they are not in the same cluster). And that d(B, C) < d(A,B)< threshold. d(B,C) and d(A,B) are the smallest two distances between matrix B and others. If B appears before C, then B is clustered with A (d(A,B) satisfies the two conditions). If C appears before B, a new cluster is built with a single member C, and when B is examined, B will be clustered with C, since their distance is the smallest.

Friday, October 19, 2007

stuck a bit

After discussion with Prof. Tan, I found the clustering method is not a good method to predict protein structure, because clustering is based on matrices derived from 3-d structures, and the primary sequences are simply ignored. Unless after clustering analyisis, we could find substantial similarites among members' primary sequence, there is no conclusion of the correlation between the primary sequence and 3-d structrue. In other words, we cannot predict the struture given the clustering information.

So now, I am going to change my goal of predicting protein. The work flow should be as follows:
1.
Rerun Zeyar's program to get a correct output.
Meanwhile, analyze the clusters from the clusters directory.

2.Try split the matrices into 10x10 matrices and use one clustering, and see if can get a better clustering. CM--split-->sub matrices->clusters of sub matrices->bit vectors->clusters of bit vectors->clusters of CM->cluster analysis->the applicatioin mentioned in Zeyar's paper/thesis.

Friday, September 21, 2007

Subsequent work

1. Talk to Zeyar and find the problem with his code
2. Continue to develop the rest of the matrix comparison program; make it automatically load the files in the matrix directory, read the matrix and compare.
3. When clustering is done, analysize the clusters with methods that are similar to emerging patterns. Identify the substructures (domains) that are unique in the cluster under analysis. If not found, try to use the most often occured substructure in a domain. This method has to be compared with the BLAST-Homology methods and show that my method is superior in terms of accuracy.

Thursday, September 20, 2007

Xquery--An XML Query Language

Kosmix--A new searching tool

Kosmix exists quietly in the internet until recently I am doing a survey in database integration. Compared with google, it has the advantages of pre-organizing the searching results without compromising the speed. (At least in end-user level) Categories like trusted souces and advanced readings might be helpful. However, the limited results in each category may obscure many uses. Need to investigate more.

Tuesday, September 4, 2007

check for matalign, singular value decomposition and contact map overlap

They may be useful in comparing two matrices

Thursday, August 30, 2007

Topology of a protein structure

The topology of a protein structure is a highly simplified description of tis fold including only the sequence of a secondary structure elements, and their relative spatial positions and approximate orientations.

Structural comparison methods are sometimes applicable for interface comparisons because structural comparison involves many more residues, which is deemed to be a harder problem than interfaces.

Tuesday, August 28, 2007

Prediction of protein–protein interactions by combining structure and sequence conservation in protein interfaces

Authors: A. Selim Aytuna, Attila Gursoy∗ and Ozlem Keskin∗ 2005
Basic idea: If A interacts with B and a's interface resembles A's and b's resembles B's, then predict a interacts with b.
Input: "template dataset" with known interacting interfaces; "target dataset" with interfaces we want to predict.

"The template dataset handles structure
and sequence conservation by combining two previously generated datasets:
the structurally non-redundant dataset of protein–protein interfaces extracted
from the PDB and the set of conserved residues on these interfaces (computational
hotspots). The target dataset is a sequentially non-redundant set of
all protein complexes and chains in the PDB."

"Proteins associate through binding sites. These sites are believed to
contribute to the biomolecular recognition and binding of proteins
by providing specific chemical and physical properties necessary
for these processes. "

Friday, August 24, 2007

Structure-based querying of proteins using wavelets

We elect to normalize the input signal, fixing the size of the idstance matrix to 128*128. This normalization occurs through interpolation or extrapolation, depending on whether the input protein was shorter or longer than 128 residues.

Interpolation is to smooth or average the excess points, while extrapolation is to generate additional data for proteins of smaller lengths.

128 was chosen because the proeins in the dataset are mostly shoter than 256, and a number that is power of 2 suits their use.

I have thought about enlarging smaller matrices by extrapolating the values that are present. Although continuous Markov Chain may promise some hope, the values generated are still faked ones. By doing this, we may artificially introduce noisy which perhaps distorts the results. So I think currently we stick to the superimposition method.

Friday, August 17, 2007

Another method to compare two matrices proposed by Marcel and Good operator definition

Three advices are valuable:

1. Start from small. I should take a look at short chains of similar interfaces (how to define similar interfaces) and get the right intuition.

2. Good comparison operator should, to some extend, output small distance value for two matrices generated from biologically known similar interfaces, large distance values for matrices generated from biologically known different interfaces.

3. The compatibility problem can be done in one of the following two ways:
i. Suppose A is a big matrix and B a small one. Locate in a sub-matrix C (compatible with B) in A such that dis(B, C) is smaller than any other sub-matrix from A.
ii.Select entries from A in the order that they appear in A to form a sub-matrix D, such that dis(B, D) is smaller than any other D-like sub-matrix from A.

After the results are got, they can be compared with those got from my enlarging small matrices method, and report the better one or both.

In this weekend, it is better to know the sizes of those 11,000+ matrices. This info will help in deciding if enlarging small matrices method is good.

Wednesday, August 15, 2007

Reject MatAlign

Reasons:
1. Comparing M1(p*q) and M2 (r*s) will generate p*r score matrices, each SM one Needleman-Wunsch algo, and p*r NW algo applied on these SM's. Totally 2*p*r NW algo, which is too computationally costly.

2. Distance matrices of proteins are symmetric; dealing with asymmetric matrices have to care about the transposed matrix.

Justification for adding dummies and time efficiency consideration

From the definition of the distance of two matrices ( sqrt( sum( (A[i, j]-B[i, j])^2))), we can see dummy values are eliminated when we compute the difference of corresponding terms in the two matrices. If in a particular position, one matrix has a dummy value, the other has real value, dummy value minus real value is a good measure of their difference.

Clustering takes O(N^2), which is the weakness against scalability. After data cleanup, there are 11,558 interfaces remain. Currently, clustering still takes reasonable time to complete.

Tuesday, August 14, 2007

Discussion with Prof. Tan

First disussion after I receive all data and code from Zeyar.

Ideas of comparing two matrices of different sizes (10<=n<=200):
1. Enlarge the smallers to 200*2oo matrices by adding dummy values (20A in this case).
2. Enlarge the smallers to 200*200 by repeating the values that appear in the matrices.
3. Enlarge the smallers to 200*200 by applying stochastic modelling (say Markov chain).

first post

My friend introduces this blog tool to me. This is the first post for my blog.
This blog will be absolutely private and record the HYP progress.