Monday, July 28, 2008

configure outlook and/or thunderbird for NUSNET email

Besides the user and password, two most important points for setting up IMAP account:
incoming server: imap.nus.edu.sg
outgoing server: smtp.nus.edu.sg

Tuesday, April 8, 2008

A sample R session

# Data description:
#This example comes from a June 96 Consumer Reports article
rating 69 brands of beer. The article gives several pieces of
information about each beer, including price, number of calories,
alcohol content, bitterness, maltyness, a quality rating, a category,
and an indication of where the beer are available (which we'll ignore
for the time being).

The variables (with abbreviated names) are as follows:
price price in dollars
qlty quality rating (0 = worst, 100=best)
N available in Northern US? (1=Yes, 0=No)
E East, see N above
W West, see N above
S South, see N above
cal number of calories per 12 oz serving
alc percent alcohol
bitter bitterness (0=less bitter, 100=most bitter)
malty maltyness (0=less malty, 100=most malty)
class beer category (1,2,3,4,5,6)
1 = craft lager
2 = craft ale
3 = imported lager
4 = regular or ice beer
5 = light beer
6 = nonalcoholic

R code
beer <- read.table(
'http://www.student.math.uwaterloo.ca/~stat441/stat441_09_01/beer.dat'
,head=TRUE,row.names=1)
# head option is for column names, row.names=1 says use first column
# of data for names of the rows.

# type the name of the object to see it displayed on the screen
beer
# you can also see/edit the data in a spreadsheet-like viewer
edit(beer) # for looking only
beer <- edit(beer) # will actually save changes

# make a new variable with the types in no-numeric format
beer$type <- as.factor(
c('c.lager','c.ale','imp.lager','reg','light','nalc')[beer$class])

# basic summaries of each column
summary(beer)

# some graphical summaries
hist(beer$price)
library(lattice) #for histogram and xyplot commands

histogram(~malty|class,data=beer)

plot(beer$price,beer$quality)
plot(beer$malty,beer$bitter)

pairs(beer[,c('price','qlty','malty','bitter')])
pairs(beer[,c('price','qlty','malty','bitter')],col=beer$class,pch=19)
pairs(beer[,c('price','qlty','malty','bitter')],pch=beer$class)

xyplot(bitter~malty | type,data=beer)
coplot(bitter~malty | type,data=beer) # coplot is similar here
coplot(bitter~malty | price,data=beer) # coplot shows bitter vs. malty
# for different slices of "price"

# look at dependence of price on type
stripplot(price~type,data=beer,pch=19)

mylm <- lm(price~qlty+bitter+malty+type,data=beer)
summary(mylm)

Data Analysis Using R

Copy and paste stuff. Very useful and I don't want it buried in seas of files.

Basics

· R is case sensitive, object-oriented

· A command ends with a semi-colon (;). The last semi-colon can be omitted.

· A comment begins with # regardless of its location. The single quotes ('') and double quotes ("") are used interchangably.

· Packages contains data sets and functions, are accessed through library().

· Objects include vectors, lists, data frames, matrices (array), and factors.

· An R list is an object consisting of an ordered collection of objects known as its components. Lists are a general form of vector in which the various elements need not be of the same type, and are often themselves vectors or lists.

· Data frames are matrix-like structures, in which the columns can be of different types. A data frame is a list with class "data.frame".

· A factor is a vector object used to specify a discrete classification (grouping) of the components of other vectors of the same length.

· Matrices or more generally arrays are multi-dimensional generalizations of vectors. An array can be considered as a multiply subscripted collection of data entries

· The "pi" is the constant 3.141592654. The "NA" indicates a missing value (default).

· The "pkg" (package); "d" (data frame); "m" (matrix); "v" (vector), url, file (file), obj (objects), fit (fitted model), n (number); s (string).

Basic Commands

· * quit(); q()

· * help(command); help.start()

· * search(); help.search()

· * dir(); methods()

· * library(p); identify(); attach(); detatch()

· * remove(); rm()

· * start:end; c(); rep(); seq()

· * scan(); print(); str(); ls()

· * cat(); cat("concaternate", c, "and print", "\t")

· * options(prompt='.', continue="///", digits=10); getOption("width")

· * source(); source.url() /* run commands in a file */

·

Simple examples

· library() # list packages available

· library(car) # load a package

· list(data()) # list data sets in the current package

· summary(Davis)

· list(Davis)

· list(Davis$weight)

· stem(Davis[,2]) # equal to stem(Davis$weight)

· stem(Davis$height, scale=4)

· boxplot(Davis$weight)

· w<-Davis$weight

· h<-Davis$height

· plot(w ~ h)

· cor(Davis[,c(2:3)])

· cor.test(w,h)

· t.test(Davis[,2], mu=65)

· t.test(Davis$height, Davis$weight, mu=100, paired=FALSE)

· var.test(Davis$height, Davis$weight)

· d<=read.csv("c:/temp/R/nes.csv", header=TRUE)

· list(names(d)) # list variable names

OPERATOR/FUNCTION

Operators

· * <- (left assignment), -> (right assignment)

· * +, -, *, /, ^, %% (modulus)

· * >, >=, <, <=, == (equal), != (not equal)

· * & (and), | (or)

· * %*% (matrix product); %/% (division)

· * %o% (Outer product); %x% (Kronecker product)

· * %in% (Matching operator);

Functions

* abs(); sin(); cos(); tan(); exp(); sqrt(); min(); max()

* log(); log(v,10); log10(); log2(); log(v, base=10)

* mean(); sum(); median(); range(); var(); sd()

* rank(); ave(v, group); by(group)

* c(a, b, c); c(start:end); seq(start:end); seq(10, 100, by=5)

* rep(n, time); rep(7, 3); rep(start:end, time)

* rep(1:3, c(2,2,2)); rep(1:3, each=2); rep(1:3, c(1:3))

* seq(1,4); seq(1,10, by=2); seq(0,1, length=10)

* length(), sort(), order(); rev(v) ## to reverse

* dnorm(1.96); dt(1.96, 100); df(1.96, 1, 100); dchisq(1.96, 10)

* pnorm(1.96); pt(1.96, 100); pf(1.96, 1, 100); pchisq(1.96, 10)

* rpois(n, lamda); rnorm(n); rt(n, df); rt(n, df=c(1:10)); rexp(n)

* substring(s, start, stop); substr(s, start, stop); nchar(s)

* date()

* mode() ## type of object

INPUT OUTPUT

Reading Text Files

* source(f); /* to execute commands in the file */

* read.table(f); read.table.url(url)

* download.file(url); url.show(url)

* m<-read.table("f:/temp/cigar.txt", header=TRUE)

* m<-read.table('f:/temp/cigar.txt')

* names(m)<-c("a", "b", "c")

* read.csv(f, header=TRUE, sep=",", quote="\"", dec=".")

* read.csv2(f, header=TRUE, sep=";", quote="\"", dec=",")

* read.delim(f, header=TRUE, sep="\t", quote="\"", dec=".")

* read.delim2(f, header=TRUE, sep="\t", quote="\"", dec=",")

* m<-read.csv("nes2.csv, header=TRUE)

* read.fwf(file, widths=c(3,5,3), header="FALSE, sep="", as.is=FALSE)

* as.is=TRUE; as.is=T # not to be converted into a factor

* na.strings<-c(".", "NA", "", "#") # characters for missing

* cnt=count.fields(df); which(cnt=7);

Reading Data Frames

* load(d);

* data(d);

data(d, package="p")

* data.frame(v1, v2) /* to make a data frame out of vectors */

* m3<-data.frame(as.matrix(m[,2:4]))

* m2<-edit(m); m2<-edit(data.frame(m)) # modify the dataframe

* data.entry(df)

Handling Data

* m2<-match(v1, v2, nomatch=0) # data merging

* m2<-match(m[,1], m[,3])

merge(df1, df2, by=’name) #merge two data frames by common column

Writing Data

* cat(); print()

* cat("y x1 x2", "2 4 2", "5 2 7", file="sample.txt", sep="\n")

* write.(obj, f)

* write.table(df, file='firms.csv', sep=",", row.names=NA, col.names=NA)

* save(f, obj); save.image(f)

* sink(); format()

MATRICES

Defining Matrices

* m<-c(1, 2, 3, 4); c(1, 2, 3, 4)->m; assign("m", c(1, 2, 3, 4))

* m<-data.frame(column1=c(1,2,3), column2=c(4,5,6)); ## 2 by 3

* rep(c(1,2,3), 2); rep(c(1,2,3), each=2);

* rep(c(1,2,3), c(2,2,2,)); m<-c(c1=15, c2=54, c3=50)

* seq(1,4); seq(1,10, by=2); seq(0,1, length=10);

* intm<-1:4; intm<-numeric(); intm[1]m<-1; intm[2]m>-2

* strm

* blm<-c(T,F); blm<-v1>10; ## a boolean vector of TRUE and FALSE

* m<-scan()

* mm<-matrix(1:12,4); mm<-matrix(1:12, nrow=4)

* mm<-matrix(1:12, ncol=3); mm<-matrix(1:12, nrow=4)

* mm<-matrix(1:12, nrow=4, ncol=3); mm<-matrix(1:12, 4, 3)

* arrm<-array(1:10); arrm<-array(1:10, dim=c(2,5))

* cbind(); rbind(); gl(); expand.grid()

* list()

Referring Matrices

* m[,2]; v=m[2,]; m[-1, -3] ## to extract elements

* m[c(1, 5, 6)]; m2=m[-c(1, 5, 6)] ## to extract elements

* m<-c(c1=15, c2=54, c3=50); m<-c("c1", "c3")

* m2<-m$c2; m2<-m[,2]; m2<-m[,"c2"]; m2<-m[[2]]

* m[,3:5]; m3<-m[,c(3, 4, 5)]; m3<-m[,c("c3", "c4", "c5")]

* m<-c(4, 2, 4); names(m)<-c("Grape", "Pear", "Apple")

* m1$v2 /*variable 2 of the data frame 1*/

* white(); which.max(); which(min)

* attr(m, which); attributes(obj)

Matrix Functions

* t(); det(); rank(); eigen(); diag(); prod(); crossprod()

* sum(); mean(); var(); sd(); min(); max(); prod(); cumsum(); cumprod()

* is.na(m) ## to check if m contains a missing value

* rowsum(); colsum(); nrow(); ccol()

* dim(m); dimnames(m)

* merge(df1, df2)

* as.factor(); as.matrix(), as.vector(); /* conversion*/

* is.factor(); is.matrix(), is.vector();

* class(); unclass()

* na.omit(); na.fail(); unique(); table(); sample()

* as.array(); as.data.frame()

* as.numeric(); as.characters(); as.logical(); as.complex()

REGRESSION

Ordinary Least Squares (OLS)

* lm(); glm()

* m.ols<-lm(v1~v2+v3, data=m) ## linear model

* lm(v1~v2+v3, data=m); summary(lm(v1~v2+v3, data=m)); summary(m.ols)

* names(m.ols); coef(m.ols); fitted(m.ols); resid(m.ols)

* predict(fit); AIC(fit); logLik(fit); deviance(fit)

* model.matrix(v1~v2+v3, data=m)

* m.ols2<-model.matrix(v1~v2+v3, data=m); summary(m.ols2)

Binary Response Regressions

* m.logit<-glm(v1~v2+v3,family=binomial(link=logit),data=m)

* summary(m.logit); coef(m.logit); fitted(m.logit); resid(m.logit)

* lsfit(v1,v2)

* nls(); m.nonlin<-lm(v1~v2+v2^2, data=m)

* anova(m.ols, m.nonlin)

* m.qr<-qr(m) ## QR Decomposition of a Matrix

STATISTICS

Descriptives

* summary(m); fivenum(m)

* stem(v); boxplot(v); boxplot(v1, v2); hist(v)

* qqnorm(v); qqline(v)

* rug(); lines()

* table() /*to make a table*/

* tabulate()

Multivariate Analysis

* cor(m); cor(sqrt(m)) ## Pearson correlation

* cor.test(v1, v2)

* prcomp() /* Principal components in the mva package*/

* kmeans() /* Kmeans cluster analysis in the mva package*/

* factanal() /* Factor analysis in the mva package*/

* cancor() /* Canonical correlation in the mva package*/

Categorical Data Analysis

* chisq.test(v1,v2) ## Pearson Chi-squared Test

* fisher.test(v1,v2) ## Fisher Exact Test

* friedman.test(v1,v2) ## Friedman Test

* prop.test(); binom.test() ## sign test

* kruskal.test(v1,v2) ## Kruskal-Wallis Rank Sum Test

* wilcox.test(v1,v2) ## Wilcoxon Rank Sum (Mann-Whitney) Test

* ks.test(v1,v2) ## Two Sample Kolmogorov-Smirnov Test

* bartlett.test(v1,v2) ## Bartlett Test for Homogeneity of Variances

ANOVA

T-test

* t.test(v1,v2); t.test(v1,v2, var.equal=FALSE)

* t.test(v1,v2, mu=0 paired=FALSE)

* t.test(v1.v2, mu=10, paired=F, var.equal=T)

* power.t.test(v1,v2); pairwise.t.test()

* var.test(v1,v2) ## F test for equal variance

ANOVA

* m.anova<-aov(v1~v2+v3, data=m)

* aov(); anova()

* summary(m.anova)

* power.anova.test() ## Power calculations for balanced one-way ANOVA tests

PROGRAMMING

Modules

frame_name<-function(arguments) {...}

mile.to.km<-function(mile) {mile*8/5}

km<-mile.to.km(c(35, 55, 75))

Flow Control

if (condition) {...} else if (condition) {...} else {...}

while (condition ) {...} # {} may be omitted for a single line expression

for (index in start:end) {...}

for (i in 1:100) {sum <- sum + i}

repeat {...}

switch (statement, list)

Programming Functions

* expression(); parse(); deparse(); eval()

* optim() /* general-purpose optimization */

* nlm() /* Newton algorithm */

* lm() /* linear models */

* nls() /* nonlinear least squares model */

GRAPHICS

Plotting

* plot(y~x, data=m, pch=16) # plotting character (pch)

* pairs(m) # scatterplot matrix

* xyrange<-range(m) # to get range of m

* plot(y~x, data=m, xlim=xyrange, ylim=xyrange)

* abline(0,1)

* plot((0:10), sin((1:10)*pi, type="1") # 1 joins the points

* barplot(); boxplot(); stem(); hist();

* matplot() /* matrix plot */

* pairs(m) /* scatterplots */

* coplot() /* conditional plot */

* stripplot() /* strip plot */

* qqplot(); qqnorm(); qqline() /* quantile0quantile plot */

Options

* points() # to add points to a plot

* lines() # to add lines

* text() # to add texts

* mtext() # to add margin texts

* axis() # to control axis

* par(cex=1.25 mex=1.25)

* par(mfrow=c(2,2), mfcol=c(1,1))

Monday, March 24, 2008

Lessons from today's presentation

  1. Don't speak too fast. Make yourself clear.
  2. Case study is very important to demonstrate the usefulness of your program, especially if you want to convince your audience.
  3. Explain the concepts clear before going to your results, otherwise the audience would be confused and lost in the presentation
  4. For any anomalies in the results, better to investigate the cause. They are popular places where people would ask question.
  5. Explain the particular reason for choosing, e.g. hierarchical clustering. Why not other clustering methods? (The reason is that k-means and density-based clustering require spatial points; I've only got the distance matrix.
  6. Demonstrate improvements over past work.

Datamining routines

Routines are:
  1. Fill in blank cells (process missing data)
  2. Feature selection
  3. Remove outliers
  4. Train classifier
  5. Validate the classifier trained
After or before doing the step 2, one may want to discretize the features to make them categorical.

Each of the step may have many proposed algorithms to complete. One probably has to try many of them to get a high accuracy classifier.

Sunday, March 23, 2008

Archving NUS Web mail in Thunderbird

0. Create a fold "NUS Emails Archive" under "Local Folders"

My folder structure in Thunderbird become:
Local Folders
->Unsent
->Trash
->NUS Emails Archive
NUS Email
->Inbox
->Drafts
->Sent

  1. Click Inbox under NUS Email
  2. Drag all emails you want to archive locally to NUS Email Archive
  3. Go to "C:\Documents and Settings\user name\Application Data\Thunderbird\Profiles\2r1eew7w.default\Mail\Local Folders" to view the file "NUS Emails Archive" and its size change.

Thursday, March 20, 2008

Rediscovery of the power of R

R is really nice and powerful in producing graphs. A lot to learn!

Google Help : Cheat Sheet

From http://www.google.com/help/cheatsheet.html

Here's a quick list of some of our most popular tools to help refine and improve your search. For additional help with Google Web Search or any other Google product, you can visit our main Google Help page.

OPERATOR EXAMPLE FINDS PAGES CONTAINING...
vacation hawaii the words vacation and Hawaii .
Maui OR Hawaii either the word Maui or the word Hawaii
"To each his own" the exact phrase to each his own
virus computer the word virus but NOT the word computer
+sock Only the word sock, and not the plural or any tenses or synonyms
~auto loan loan info for both the word auto and its synonyms: truck, car, etc.
define:computer definitions of the word computer from around the Web.
red * blue the words red and blue separated by one or more words.
I'm Feeling Lucky Takes you directly to first web page returned for your query.
CALCULATOR OPERATORS MEANING TYPE INTO SEARCH BOX
+ addition 45 + 39
- subtraction 45 – 39
* multiplication 45 * 39
/ division 45 / 39
% of percentage of 45% of 39
^ raise to a power 2^5
(2 to the 5th power)
ADVANCED OPERATORS MEANING WHAT TO TYPE INTO SEARCH BOX (& DESCRIPTION OF RESULTS)
site: Search only one website admission site:www.stanford.edu
(Search Stanford Univ. site for admissions info.)
[#][#] Search within a
range of numbers
DVD player $100..150
(Search for DVD players between $100 and $150)
link: linked pages link:www.stanford.edu
(Find pages that link to the Stanford University website.)
info: Info about a page info:www.stanford.edu
(Find information about the Stanford University website.)
related: Related pages related:www.stanford.edu
(Find websites related to the Stanford University website.)

©2008 Google

Wednesday, March 12, 2008

Medicel

"Medicel develops technologies and methods to help scientists turn data into information and knowledge." A sophisticated platform in which data from various sources e.g. HPLC, MS can be processed. It would be wonderful if they also have a data mining feature to run commonly used machine learning and data mining algorithms. From a developer's perspective, this platform is excellent, because it satisfies various daily requirements. It is great with 100 engineers developing 7 years. With moderate training, a junior engineer without much biological background can process tremendous amount of data from biological and medical experiments in various formats very efficiently. The cost of the software is also very high --- half a million.

Doctors do not like this software, though, for reasons that I can understand. From a doctor's perspective, it is good enough to have a small tool to complete a specific task. Simplicity is beauty. Too many functions and features are too distracting. They would think I am a doctor, not a programmer. Give me minimum advice, and I can get the work done.

It is still a valuable tool for a bioinformatician to do case analysis. Some ideas in that software can benefit us in our future plan of designing a similar but far simpler platform.

Future projects and project management tools

1. Design biological platform so that softwares and program in analyzing gene expression etc. can be integrated. (propose to James)
2. Outlier analysis (Try to find its application in biological context; customer relation management is a potential application area )
3. Program that accepts the IC values and outputs the desired cutting point so that gene selection achieves certain accuracy
4. Imaging processing tool to analyse the cancer cell lines
5. Android (pending)

Potential project management portals are: SourceForge.net and Assembla. wetpaint is a nice project management wiki, but Assembla provides both wiki and subversion, which are more suitable for team projects.

Wednesday, February 27, 2008

New clustering algo

I am going to replace current clustering algo inherited from Zeyar's with Neighbor joining. K-means or k-medoids cannot be applied because the points' coordinates are not known, only the distance matrix is known.