r - Clustered Standard Errors with data containing NAs -


i'm unable cluster standard errors using r , guidance based on post. cl function returns error:

error in tapply(x, cluster1, sum) : arguments must have same length 

after reading on tapply i'm still not sure why cluster argument wrong length, , causing error.

here link data set i'm using.

https://www.dropbox.com/s/y2od7um9pp4vn0s/ec%201820%20-%20dd%20data%20with%20controls.csv

here r code:

# read in data charter<-read.csv(file.choose()) view(charter) colnames(charter)  # standardize naep scores charter$naep.standardized <- (charter$naep - mean(charter$naep, na.rm=t))/sd(charter$naep, na.rm=t)  # change nas in year.passed column 2014 charter$year.passed[is.na(charter$year.passed)]<-2014  # add column indicator in treatment (passed legislation) charter$treatment<-ifelse(charter$year.passed<=charter$year,1,0)  # fit model charter.model<-lm(naep ~ factor(year) + factor(state) + treatment, data = charter) summary(charter.model) # account clustered standard errors state cl(dat=charter, fm=charter.model, cluster=charter$state)  # accounting controls charter.model.controls<-lm(naep~factor)  # clustered standard errors # ---------  # function calculates clustered standard errors # source: http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/ cl   <- function(dat, fm, cluster){   require(sandwich, quietly = true)   require(lmtest, quietly = true)   m <- length(unique(cluster))   n <- length(cluster)   k <- fm$rank   dfc <- (m/(m-1))*((n-1)/(n-k))   print(k)   uj  <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum));   vcovcl <- dfc*sandwich(fm, meat=crossprod(uj)/n)   coeftest(fm, vcovcl)  }  # calculate clustered standard errors  cl(charter, charter.model, charter$state) 

the inner workings of function little on head.

when execute code, notice there missing observations in linear model:

> summary(charter.model)  call: lm(formula = naep ~ factor(year) + factor(state) + treatment,      data = charter)  residuals:      min       1q   median       3q      max  -15.2420  -1.6740  -0.2024   1.8345  12.3580   coefficients:                             estimate std. error t value pr(>|t|)     (intercept)                 250.4983     1.2115 206.767  < 2e-16 *** factor(year)1992              3.7970     0.7198   5.275 2.17e-07 *** factor(year)1996              7.0436     0.8607   8.183 3.64e-15 ***  [..]  residual standard error: 3.128 on 404 degrees of freedom   (759 observations deleted due missingness) multiple r-squared:  0.9337,    adjusted r-squared:  0.9239  f-statistic: 94.85 on 60 , 404 df,  p-value: < 2.2e-16 

this causing error in tapply(x, cluster1, sum) : arguments must have same length error message see.

in cl(dat=charter, fm=charter.model, cluster=charter$state) cluster variable charter$state should have exact same length number of observations used in regression estimation (which due nas not same number of rows in original data frame).


to work around can following.

  1. first off you're using older version of arai's function (cl) (see fama-macbeth , cluster-robust (by firm , time) standard errors in r references both old or new versions, latter being called clx).

  2. second think arai's original approach function bit convoluted, , doesn't follow standard interface of vcov* functions sandwich. that's why came modified version of clx. made code bit more readable, , interface more expect sandwich vcov* function:

    vcovcl <- function(x, cluster.by, type="sss", dfcw=1){     # r-codes (www.r-project.org) computing     # clustered-standard errors. mahmood arai, jan 26, 2008.      # arguments of function are:     # fitted model, cluster1 , cluster2     # need install libraries `sandwich' , `lmtest'      # reweighting var-cov matrix within model     require(sandwich)     cluster <- cluster.by     m <- length(unique(cluster))        n <- length(cluster)     stopifnot(n == length(x$residuals))     k <- x$rank     ##only stata small-sample correction supported right      ##see plm >= 1.5-4     stopifnot(type=="sss")       if(type=="sss"){         dfc <- (m/(m-1))*((n-1)/(n-k))     }     uj  <- apply(estfun(x), 2, function(y) tapply(y, cluster, sum))     mycov <- dfc * sandwich(x, meat=crossprod(uj)/n) * dfcw     return(mycov) } 

if try function on data see catches specific issue:

> coeftest(charter.model, vcov=function(x) vcovcl(x, charter$state))  error: n == length(x$residuals) not true 

to avoid issue proceed follows:

> charter.x <- na.omit(charter[ , c("state",                                    all.vars(formula(charter.model)))]) > coeftest(charter.model, vcov=function(x) vcovcl(x, charter.x$state))   t test of coefficients:                                 estimate  std. error     t value  pr(>|t|)     (intercept)                  2.5050e+02  9.3781e-01  2.6711e+02 < 2.2e-16 *** factor(year)1992             3.7970e+00  5.6019e-01  6.7780e+00 4.330e-11 *** factor(year)1996             7.0436e+00  8.8574e-01  7.9522e+00 1.856e-14 *** factor(year)2000             8.4313e+00  1.0906e+00  7.7311e+00 8.560e-14 *** factor(year)2003             1.2392e+01  1.1670e+00  1.0619e+01 < 2.2e-16 *** factor(year)2005             1.3490e+01  1.1747e+00  1.1484e+01 < 2.2e-16 *** factor(year)2007             1.6334e+01  1.2469e+00  1.3100e+01 < 2.2e-16 *** factor(year)2009             1.8118e+01  1.2556e+00  1.4430e+01 < 2.2e-16 *** factor(year)2011             1.9110e+01  1.3459e+00  1.4199e+01 < 2.2e-16 *** factor(year)2013             1.9301e+01  1.4896e+00  1.2957e+01 < 2.2e-16 *** factor(state)alaska          1.4178e+01  8.7686e-01  1.6169e+01 < 2.2e-16 *** factor(state)arizona         8.6313e+00  8.1439e-01  1.0598e+01 < 2.2e-16 *** factor(state)arkansas        4.3313e+00  8.1439e-01  5.3185e+00 1.736e-07 *** factor(state)california      3.1103e+00  9.1619e-01  3.3948e+00 0.0007549 *** factor(state)colorado        1.7939e+01  7.9736e-01  2.2498e+01 < 2.2e-16 *** factor(state)connecticut     1.8031e+01  8.1439e-01  2.2141e+01 < 2.2e-16 *** factor(state)d.c.           -1.8369e+01  8.1439e-01 -2.2555e+01 < 2.2e-16 *** factor(state)delaware        1.2050e+01  7.9736e-01  1.5113e+01 < 2.2e-16 *** factor(state)florida         7.3838e+00  7.9736e-01  9.2602e+00 < 2.2e-16 *** factor(state)georgia         6.4313e+00  8.1439e-01  7.8971e+00 2.724e-14 *** factor(state)hawaii          3.3313e+00  8.1439e-01  4.0906e+00 5.196e-05 *** factor(state)idaho           1.7118e+01  7.8321e-01  2.1857e+01 < 2.2e-16 *** factor(state)illinois        1.2670e+01  8.2224e-01  1.5409e+01 < 2.2e-16 *** factor(state)indianna        1.7174e+01  6.1079e-01  2.8117e+01 < 2.2e-16 *** factor(state)iowa            2.0074e+01  6.8460e-01  2.9322e+01 < 2.2e-16 *** factor(state)kansas          2.0123e+01  8.6796e-01  2.3184e+01 < 2.2e-16 *** factor(state)kentucky        1.0200e+01  4.1999e-14  2.4287e+14 < 2.2e-16 *** factor(state)louisiana      -1.6866e-01  8.1439e-01 -2.0710e-01 0.8360322     factor(state)maine           2.0231e+01  1.7564e-01  1.1518e+02 < 2.2e-16 *** factor(state)maryland        1.4274e+01  6.1079e-01  2.3369e+01 < 2.2e-16 *** factor(state)massachusetts   2.4868e+01  8.3960e-01  2.9619e+01 < 2.2e-16 *** factor(state)michigan        1.2031e+01  8.1439e-01  1.4773e+01 < 2.2e-16 *** factor(state)minnesota       2.5110e+01  9.1619e-01  2.7407e+01 < 2.2e-16 *** factor(state)mississippi    -3.5470e+00  1.7564e-01 -2.0195e+01 < 2.2e-16 *** factor(state)missouri        1.3447e+01  7.2706e-01  1.8495e+01 < 2.2e-16 *** factor(state)montana         2.2512e+01  8.4814e-01  2.6543e+01 < 2.2e-16 *** factor(state)nebraska        1.9600e+01  4.3105e-14  4.5471e+14 < 2.2e-16 *** factor(state)nevada          4.9800e+00  8.6796e-01  5.7375e+00 1.887e-08 *** factor(state)new hampshire   2.2026e+01  7.6338e-01  2.8853e+01 < 2.2e-16 *** factor(state)new jersey      2.0651e+01  7.6338e-01  2.7052e+01 < 2.2e-16 *** factor(state)new mexico      1.5313e+00  8.1439e-01  1.8803e+00 0.0607809 .   factor(state)new york        1.2152e+01  7.1259e-01  1.7054e+01 < 2.2e-16 *** factor(state)north carolina  1.2231e+01  8.1439e-01  1.5019e+01 < 2.2e-16 *** factor(state)north dakota    2.4278e+01  1.0420e-01  2.3299e+02 < 2.2e-16 *** factor(state)ohio            1.7118e+01  7.8321e-01  2.1857e+01 < 2.2e-16 *** factor(state)oklahoma        8.4518e+00  7.8321e-01  1.0791e+01 < 2.2e-16 *** factor(state)oregon          1.6535e+01  7.3538e-01  2.2486e+01 < 2.2e-16 *** factor(state)pennsylvania    1.6651e+01  7.6338e-01  2.1812e+01 < 2.2e-16 *** factor(state)rhode island    9.5313e+00  8.1439e-01  1.1704e+01 < 2.2e-16 *** factor(state)south carolina  9.5346e+00  8.3960e-01  1.1356e+01 < 2.2e-16 *** factor(state)south dakota    2.1211e+01  3.5103e-01  6.0425e+01 < 2.2e-16 *** factor(state)tennessee       4.9148e+00  6.1473e-01  7.9951e+00 1.375e-14 *** factor(state)texas           1.4231e+01  8.1439e-01  1.7475e+01 < 2.2e-16 *** factor(state)utah            1.5114e+01  7.2706e-01  2.0787e+01 < 2.2e-16 *** factor(state)vermont         2.3474e+01  2.0299e-01  1.1564e+02 < 2.2e-16 *** factor(state)virginia        1.6252e+01  7.1259e-01  2.2807e+01 < 2.2e-16 *** factor(state)washington      1.9073e+01  1.8183e-01  1.0489e+02 < 2.2e-16 *** factor(state)west virginia   5.0000e+00  4.2022e-14  1.1899e+14 < 2.2e-16 *** factor(state)wisconsin       1.9994e+01  8.2447e-01  2.4251e+01 < 2.2e-16 *** factor(state)wyoming         1.8231e+01  8.1439e-01  2.2386e+01 < 2.2e-16 *** treatment                    1.2108e+00  1.0180e+00  1.1894e+00 0.2349682     --- signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

it's not nice, gets job done. cl work fine , yield same result above:

cl(dat=charter, fm=charter.model, cluster=charter.x$state) 

a better way go use multiwayvcov package. per packages's website, improvement upon arai's code:

transparent handling of observations dropped due missingness

using petersen data simulated nas , cluster.vcov():

library("lmtest") library("multiwayvcov")  data(petersen) set.seed(123) petersen[ sample(1:5000, 15), 3] <- na  m1 <- lm(y ~ x, data = petersen) summary(m1) ##  ## call: ## lm(formula = y ~ x, data = petersen) ##  ## residuals: ##    min     1q median     3q    max  ## -6.759 -1.371 -0.018  1.340  8.680  ##  ## coefficients: ##             estimate std. error t value pr(>|t|)     ## (intercept)  0.02793    0.02842   0.983    0.326     ## x            1.03635    0.02865  36.175   <2e-16 *** ## --- ## signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ##  ## residual standard error: 2.007 on 4983 degrees of freedom ##   (15 observations deleted due missingness) ## multiple r-squared:  0.208,  adjusted r-squared:  0.2078  ## f-statistic:  1309 on 1 , 4983 df,  p-value: < 2.2e-16  coeftest(m1, vcov=function(x) cluster.vcov(x, petersen$firmid)) ##  ## t test of coefficients: ##  ##             estimate std. error t value pr(>|t|)     ## (intercept) 0.027932   0.067198  0.4157   0.6777     ## x           1.036354   0.050700 20.4407   <2e-16 *** ## --- ## signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

for different approach using plm package see:


Comments

Popular posts from this blog

java - Intellij Synchronizing output directories .. -

git - Initial Commit: "fatal: could not create leading directories of ..." -