Sunday, December 30, 2007

Method to get around the problems

The last post identified two problems. Here is how I plan to get around them:
1. Store the distances separately for each matrix, and read a matrix's file only when it is needed.
2. Since the ordering matters, pick the next matrix to cluster at random.
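
The two workarounds above can be sketched roughly as follows. This is only a minimal illustration, not my actual code: the JSON file layout, the function names (save_distances, load_distances, random_order) and the example distances are all assumptions made up for the sketch.

```python
import json
import random
import tempfile
from pathlib import Path


def save_distances(directory, name, distances):
    """Write the distances from matrix `name` to every other matrix into its own file."""
    path = Path(directory) / f"{name}.json"
    path.write_text(json.dumps(distances))


def load_distances(directory, name):
    """Read back only the file that belongs to matrix `name`."""
    path = Path(directory) / f"{name}.json"
    return json.loads(path.read_text())


def random_order(names, seed=None):
    """Return the matrix names in a random order (a seed makes the run repeatable)."""
    order = list(names)
    random.Random(seed).shuffle(order)
    return order


# Hypothetical usage: one file per matrix, loaded on demand.
with tempfile.TemporaryDirectory() as tmp:
    save_distances(tmp, "A", {"B": 0.3, "C": 0.9})
    loaded = load_distances(tmp, "A")

order = random_order(["A", "B", "C"], seed=0)
```

Keeping one file per matrix means that the distances from A to the others and from B to the others are stored independently, so d(A,B) and d(B,A) can both be kept without holding everything in memory.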

Two problems

I carried on with my own matrix comparison method, which uses a sliding window. Two problems have been identified:
1. So far I have only computed the distance between A and B where A comes before B by name. The distance between B and A should also be computed.
2. In clustering, the order in which the matrices appear is important. For instance, let A, B and C be three matrices, and assume A is a single-member cluster. The distance between A and C is bigger than the threshold, so they are not in the same cluster. Also, d(B,C) < d(A,B) < threshold, and d(B,C) and d(A,B) are the two smallest distances between matrix B and the others. If B appears before C, then B is clustered with A, because d(A,B) satisfies the two conditions. If C appears before B, a new cluster is built with the single member C, and when B is examined it is clustered with C, since their distance is the smallest.
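
The order dependence in problem 2 can be reproduced with a small sketch. It assumes a simple greedy rule (each matrix joins the cluster of its nearest already-seen matrix if that distance is below the threshold, otherwise it starts a new cluster), which may differ from my real procedure. The distance values and the threshold are made up, and the distances are treated as symmetric only to keep the example short.

```python
def greedy_cluster(order, dist, threshold):
    """Cluster matrices one at a time, in the given order (assumed greedy rule)."""
    clusters = []
    for m in order:
        best_cluster = None
        best_d = threshold  # only distances strictly below the threshold qualify
        for cluster in clusters:
            for member in cluster:
                d = dist[frozenset((m, member))]
                if d < best_d:
                    best_cluster, best_d = cluster, d
        if best_cluster is None:
            clusters.append([m])    # nothing close enough: start a new cluster
        else:
            best_cluster.append(m)  # join the cluster of the nearest matrix
    return clusters


# Made-up distances with d(B,C) < d(A,B) < threshold < d(A,C).
dist = {
    frozenset(("A", "B")): 0.3,
    frozenset(("B", "C")): 0.2,
    frozenset(("A", "C")): 0.9,
}
threshold = 0.5

b_first = greedy_cluster(["A", "B", "C"], dist, threshold)
c_first = greedy_cluster(["A", "C", "B"], dist, threshold)
```

With B before C, B joins A's cluster. With C before B, C starts its own cluster and B then joins C. The same three matrices give different clusterings depending only on the order.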