r - Check if one sample exists in the sample space (From PCA or other cluster analysis) -

- January 15, 2015

i have 200 50 matrix, here 200 means 200 compounds (row) , 50 means 50 independent varialbes (column), , use 200 * 50 matrix cluster analysis (e.g. k-mean etc.), can plot show distributions these 2000 compounds.

my question when have new compound, have same 50 independent variable 200 * 50 matrix, how can test if new compound located in cluster space?

thanks.

edit: plz note not need find element in data.frame. think first step cluster data (for example, using pca , plot(pca1, pca2)), test if new record located in plot or out. like picture, (2) belongs cluster , (1) not belong cluster space, this.

here simple solution:

step1: setup data

set.seed(1) refdata <- data.frame(matrix(runif(200*50),nrow=200))  newrec01 <- refdata[11,]    # record exists in data newrec02 <- runif(50)       # record not exist in data

step2: testing:

true %in% sapply(1:nrow(refdata),function(i) all(newrec01 == refdata[i,])) true %in% sapply(1:nrow(refdata),function(i) all(newrec02 == refdata[i,]))

if needed can package in function:

checknewrec <- function(refdata, newrec) {   true %in% sapply(1:nrow(refdata),function(i) all(newrec == refdata[i,])) }  checknewrec(refdata, newrec01) checknewrec(refdata, newrec02)

edit: based on new input below, try following:

prep: code comments:

  <- rbind(refdata, newrec02)     pca <- prcomp(all)    pca1 <- pca$x[, 1]    pca2 <- pca$x[, 2]    pca1.in <- pca1[-length(pca1)]   pca2.in <- pca2[-length(pca2)]

now need define cluster in way. simplicity, lets assume single cluster.

step1: find out centroid of refdata:

  cent <- c(mean(pca1.in),mean(pca2.in))

step2: find out distance of data points center of refdata:

  ssq <- (pca1 - mean(pca1.in))^2 + (pca2 - mean(pca2.in))^2

step3: need choose cut off distance center beyond new incoming record considered "outside" cluster. simplicity, taking decision @ 95th % quantile:

  dec <- (quantile(head(ssq,-1), 0.95) > tail(ssq,1))

step4: decision has been made on classification of newrec, can plot it:

  plot(pca1, pca2)    points(pca1[length(pca1)], pca2[length(pca2)],           col = ifelse(dec, "red", "green"),pch="x")

additionally, verify our decision, lets plot errors, , see newrec fall!!

  hist(ssq, main="error histogram",xlab="square error")   points(pca1[length(pca1)], pca2[length(pca2)],          col = ifelse(dec, "red", "green"),pch="x")   text(pca1[length(pca1)], pca2[length(pca2)],labels="new rec",col="red",pos=3)

hope helps!!

Search This Blog

GHI

r - Check if one sample exists in the sample space (From PCA or other cluster analysis) -

Comments

Post a Comment

Popular posts from this blog

reporting services - Visible Export Data Feed option SSRS report -

git - Initial Commit: "fatal: could not create leading directories of ..." -

parcelable - Unmarshalling unknown type code exception while reading parcel values in Android 4.4+ -