Sunday, December 30, 2007

Method to get around the problems

As mentioned in the last post, there are two problems to be solved. The fixes are:
1. Store the distances separately for each matrix, and read only the necessary files when they are needed.
2. Since the ordering matters, pick the next matrix to cluster at random.
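
The two fixes above could be sketched roughly as follows. This is only an illustration, not the actual program: the JSON file layout, the function names (`save_distances`, `load_distances`, `random_order`) and the distance values are all made up for the example.

```python
import json
import random
import tempfile
from pathlib import Path


def save_distances(out_dir, name, distances):
    """Write the distances from matrix `name` to every other matrix
    into its own file (hypothetical layout: <out_dir>/<name>.json)."""
    path = Path(out_dir) / f"{name}.json"
    path.write_text(json.dumps(distances))
    return path


def load_distances(out_dir, name):
    """Read back only the file for the matrix that is currently needed."""
    return json.loads((Path(out_dir) / f"{name}.json").read_text())


def random_order(names, seed=None):
    """Return the matrices in a random processing order."""
    rng = random.Random(seed)
    order = list(names)
    rng.shuffle(order)
    return order


# Small demo with invented distances; d(A,B) and d(B,A) are stored
# separately because the two directions may differ.
with tempfile.TemporaryDirectory() as tmp:
    save_distances(tmp, "A", {"B": 0.4, "C": 0.9})
    save_distances(tmp, "B", {"A": 0.5, "C": 0.3})
    loaded = load_distances(tmp, "B")

order = random_order(["A", "B", "C"], seed=0)
```

Keeping one file per matrix means only the rows that are actually being compared have to be in memory at any time.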

Two problems

I carried on with my own matrix comparison method, in which a sliding window is used. Two problems have been identified:
1. I have only computed the distance between A and B where A comes before B by name. The distance between B and A should also be computed.
2. In clustering, the order in which the matrices appear is important. For instance, let A, B and C be three matrices. Assume A is a single-member cluster, and the distance between A and C is bigger than the threshold (so they are not in the same cluster). Assume also that d(B,C) < d(A,B) < threshold, and that d(B,C) and d(A,B) are the two smallest distances between matrix B and the others. If B appears before C, then B is clustered with A (d(A,B) satisfies the two conditions). If C appears before B, a new cluster is built with the single member C, and when B is examined, B will be clustered with C, since their distance is the smallest.
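
The ordering effect in problem 2 can be reproduced with a toy example. The sketch below is not my actual clustering code: it assumes a simple greedy rule (each matrix joins the cluster of its nearest already-seen matrix if that distance is below the threshold, otherwise it starts a new cluster), treats distances as symmetric for simplicity, and uses invented distance values that satisfy d(B,C) < d(A,B) < threshold < d(A,C).

```python
def greedy_cluster(order, dist, threshold):
    """Greedy, order-dependent clustering (assumed rule, for illustration).

    Each item joins the cluster that holds its nearest already-seen item,
    provided that distance is below `threshold`; otherwise it starts a
    new single-member cluster.
    """
    clusters = []
    for item in order:
        best = None  # (distance, index of cluster)
        for idx, members in enumerate(clusters):
            for m in members:
                d = dist[frozenset((item, m))]
                if d < threshold and (best is None or d < best[0]):
                    best = (d, idx)
        if best is None:
            clusters.append([item])
        else:
            clusters[best[1]].append(item)
    return clusters


# Invented distances: d(B,C) < d(A,B) < threshold < d(A,C)
dist = {
    frozenset(("A", "B")): 0.4,
    frozenset(("B", "C")): 0.3,
    frozenset(("A", "C")): 0.9,
}
THRESHOLD = 0.5

b_first = greedy_cluster(["A", "B", "C"], dist, THRESHOLD)
c_first = greedy_cluster(["A", "C", "B"], dist, THRESHOLD)
```

Under this assumed rule, B ends up with A when B appears before C, and with C when C appears first, so the same three matrices give different clusters depending on the order.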

Friday, October 19, 2007

stuck a bit

After a discussion with Prof. Tan, I found that the clustering method is not a good way to predict protein structure, because the clustering is based on matrices derived from 3-D structures, and the primary sequences are simply ignored. Unless, after cluster analysis, we find substantial similarities among the members' primary sequences, no conclusion can be drawn about the correlation between primary sequence and 3-D structure. In other words, we cannot predict the structure given the clustering information.

So now I am going to change my goal of predicting protein structure. The workflow should be as follows:
1. Rerun Zeyar's program to get a correct output. Meanwhile, analyze the clusters from the clusters directory.
2. Try splitting the matrices into 10x10 sub-matrices and use one clustering, and see if this gives a better clustering. CM -> (split) -> sub-matrices -> clusters of sub-matrices -> bit vectors -> clusters of bit vectors -> clusters of CM -> cluster analysis -> the application mentioned in Zeyar's paper/thesis.
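
The splitting step could look something like the sketch below. It assumes each matrix is a square list of rows and that the blocks do not overlap; rows and columns that do not fill a complete 10x10 block are simply dropped here, which is a choice made for the example and not something decided yet. The function name `split_blocks` is hypothetical.

```python
def split_blocks(matrix, size=10):
    """Split a square matrix (list of rows) into non-overlapping
    size x size sub-matrices, scanning row by row.

    Leftover rows/columns at the edges that do not fill a whole
    block are dropped (an assumption made for this sketch).
    """
    n = len(matrix)
    blocks = []
    for r in range(0, n - size + 1, size):
        for c in range(0, n - size + 1, size):
            blocks.append([row[c:c + size] for row in matrix[r:r + size]])
    return blocks


# Demo on a 25x25 matrix whose entries encode their own position.
demo = [[r * 25 + c for c in range(25)] for r in range(25)]
sub = split_blocks(demo, size=10)
```

For the 25x25 demo matrix this gives four 10x10 blocks; the last 5 rows and columns are left out.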

Friday, September 21, 2007

Subsequent work

1. Talk to Zeyar and find the problem with his code
2. Continue to develop the rest of the matrix comparison program; make it automatically load the files in the matrix directory, read the matrix and compare.
3. When clustering is done, analyze the clusters with methods similar to emerging patterns. Identify the substructures (domains) that are unique to the cluster under analysis. If none are found, try to use the most frequently occurring substructure in a domain. This method has to be compared with the BLAST-homology methods to show that my method is superior in terms of accuracy.

Thursday, September 20, 2007

Xquery--An XML Query Language

Kosmix--A new searching tool

Kosmix had been sitting quietly on the internet until I came across it recently while doing a survey on database integration. Compared with Google, it has the advantage of pre-organizing the search results without compromising speed (at least at the end-user level). Categories like trusted sources and advanced readings might be helpful. However, the limited number of results in each category may hide many useful ones. Need to investigate more.

Tuesday, September 4, 2007