Monday, March 24, 2008

Data mining routines

The routine steps are:
  1. Fill in blank cells (process missing data)
  2. Feature selection
  3. Remove outliers
  4. Train classifier
  5. Validate the trained classifier
Before or after step 2, one may want to discretize the features to make them categorical.
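
As a rough illustration of discretization, here is a minimal sketch of equal-width binning in plain Python. The function name `equal_width_bins`, the `n_bins` parameter, and the sample `ages` list are made up for this example; equal-width binning is only one of several possible discretization schemes.

```python
def equal_width_bins(values, n_bins=3):
    """Map each numeric value to a bin index in 0..n_bins-1 (equal-width binning)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical numeric feature turned into three categories.
ages = [18, 22, 35, 41, 58, 63]
age_bins = equal_width_bins(ages, n_bins=3)
print(age_bins)  # [0, 0, 1, 1, 2, 2]
```

The bin indices can then be treated as categorical values by whichever feature selection or classification method comes next.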

Many algorithms have been proposed for each of these steps. One will probably have to try several of them to get a high-accuracy classifier.
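
To make the five steps concrete, here is a minimal sketch in plain Python. Every choice in it is an assumption made for illustration, not a recommendation: mean imputation for step 1, a variance threshold for step 2, a z-score rule for step 3, a nearest-centroid classifier for step 4, and accuracy on a held-out set for step 5. The toy data and all function names are hypothetical, and outlier removal is applied to the training rows only.

```python
import statistics


def impute_mean(rows):
    """Step 1: replace None in each column with that column's mean."""
    cols = list(zip(*rows))
    means = [statistics.mean(v for v in col if v is not None) for col in cols]
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def select_features(rows, min_variance=1e-9):
    """Step 2: return indices of columns whose variance exceeds a threshold."""
    cols = list(zip(*rows))
    return [j for j, col in enumerate(cols) if statistics.pvariance(col) > min_variance]


def project(rows, keep):
    """Keep only the selected columns of every row."""
    return [[row[j] for j in keep] for row in rows]


def remove_outliers(rows, labels, z_max=3.0):
    """Step 3: drop rows with any feature more than z_max std devs from its mean."""
    cols = list(zip(*rows))
    mu = [statistics.mean(c) for c in cols]
    sd = [statistics.pstdev(c) or 1.0 for c in cols]  # avoid division by zero
    keep = [i for i, row in enumerate(rows)
            if all(abs(v - m) / s <= z_max for v, m, s in zip(row, mu, sd))]
    return [rows[i] for i in keep], [labels[i] for i in keep]


def train_centroids(rows, labels):
    """Step 4: nearest-centroid classifier; returns {label: centroid}."""
    centroids = {}
    for lab in set(labels):
        members = [r for r, l in zip(rows, labels) if l == lab]
        centroids[lab] = [statistics.mean(c) for c in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(lab):
        return sum((a - b) ** 2 for a, b in zip(centroids[lab], row))
    return min(centroids, key=sq_dist)


def accuracy(centroids, rows, labels):
    """Step 5: fraction of held-out rows classified correctly."""
    hits = sum(predict(centroids, r) == l for r, l in zip(rows, labels))
    return hits / len(rows)


# Hypothetical toy data: two informative columns plus one constant column.
train_raw = [[1.0, 1.2, 7.0], [0.8, None, 7.0], [1.1, 0.9, 7.0],
             [5.0, 5.1, 7.0], [None, 4.8, 7.0], [5.2, 5.0, 7.0]]
train_y = ["a", "a", "a", "b", "b", "b"]
test_raw = [[0.9, 1.0, 7.0], [5.1, 4.9, 7.0]]  # assumed to have no missing cells
test_y = ["a", "b"]

train_X = impute_mean(train_raw)                      # step 1
keep = select_features(train_X)                       # step 2 (drops the constant column)
train_X, test_X = project(train_X, keep), project(test_raw, keep)
train_X, train_y = remove_outliers(train_X, train_y)  # step 3, training data only
model = train_centroids(train_X, train_y)             # step 4
acc = accuracy(model, test_X, test_y)                 # step 5
print(keep, acc)
```

Swapping any one function for a different algorithm leaves the rest of the sketch unchanged, which is what makes it practical to try several candidates per step.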

3 comments:

Unknown said...

My colleague tried discretizing the features and then applying density-based clustering to his dataset. He got accuracy as high as 97%. Sometimes discretization is very useful!

Will Dwinnell said...

One important note: Outliers should be removed only from the training data.

Since outliers will be encountered in the field, and presumably the model will need to execute on them, outliers should not be removed from the test data.

-Will Dwinnell
Data Mining in MATLAB

Unknown said...

To Will: thank you for leaving your comment here. Nice to meet you here on Blogger.

I agree that outliers should not be removed from the test data. Actually, apparent outliers in the test data may not be true outliers; perhaps they only look that way because the trained classifier is inadequate.