r - Check if one sample exists in the sample space (From PCA or other cluster analysis) -


i have 200 50 matrix, here 200 means 200 compounds (row) , 50 means 50 independent varialbes (column), , use 200 * 50 matrix cluster analysis (e.g. k-mean etc.), can plot show distributions these 2000 compounds.

my question when have new compound, have same 50 independent variable 200 * 50 matrix, how can test if new compound located in cluster space?

thanks.

edit: plz note not need find element in data.frame. think first step cluster data (for example, using pca , plot(pca1, pca2)), test if new record located in plot or out. like picture, (2) belongs cluster , (1) not belong cluster space, this.

here simple solution:

step1: setup data

set.seed(1) refdata <- data.frame(matrix(runif(200*50),nrow=200))  newrec01 <- refdata[11,]    # record exists in data newrec02 <- runif(50)       # record not exist in data 

step2: testing:

true %in% sapply(1:nrow(refdata),function(i) all(newrec01 == refdata[i,])) true %in% sapply(1:nrow(refdata),function(i) all(newrec02 == refdata[i,])) 

if needed can package in function:

checknewrec <- function(refdata, newrec) {   true %in% sapply(1:nrow(refdata),function(i) all(newrec == refdata[i,])) }  checknewrec(refdata, newrec01) checknewrec(refdata, newrec02) 

edit: based on new input below, try following:

prep: code comments:

  <- rbind(refdata, newrec02)     pca <- prcomp(all)    pca1 <- pca$x[, 1]    pca2 <- pca$x[, 2]    pca1.in <- pca1[-length(pca1)]   pca2.in <- pca2[-length(pca2)] 

now need define cluster in way. simplicity, lets assume single cluster.

step1: find out centroid of refdata:

  cent <- c(mean(pca1.in),mean(pca2.in)) 

step2: find out distance of data points center of refdata:

  ssq <- (pca1 - mean(pca1.in))^2 + (pca2 - mean(pca2.in))^2 

step3: need choose cut off distance center beyond new incoming record considered "outside" cluster. simplicity, taking decision @ 95th % quantile:

  dec <- (quantile(head(ssq,-1), 0.95) > tail(ssq,1))  

step4: decision has been made on classification of newrec, can plot it:

  plot(pca1, pca2)    points(pca1[length(pca1)], pca2[length(pca2)],           col = ifelse(dec, "red", "green"),pch="x") 

additionally, verify our decision, lets plot errors, , see newrec fall!!

  hist(ssq, main="error histogram",xlab="square error")   points(pca1[length(pca1)], pca2[length(pca2)],          col = ifelse(dec, "red", "green"),pch="x")   text(pca1[length(pca1)], pca2[length(pca2)],labels="new rec",col="red",pos=3) 

hope helps!!


Comments

Popular posts from this blog

java - Intellij Synchronizing output directories .. -

git - Initial Commit: "fatal: could not create leading directories of ..." -