Beware of intercept-only random effects

In my experience, many users of Linear Mixed Effect (LME) models tend to assume that the random effects only influence the intercept, and not the slopes, of the model. This practice might lead to falsely rejecting null hypotheses with a high probability, if the true model in fact has a more complex random effect structure. Further, such intercept-only random effect models might be worse than working with a simple linear model instead.

Data from the classical "sleepstudy" dataset, which is used as an example in the lme4 package. Here we have 18 groups (patients) with 10 observations in each group. The response variable has clear group-specific differences both in the mean and in the association with the covariate.

Linear mixed effect (LME) models are very common in a wide range of applications. Typically, they are used when the data have some natural group structure, for instance multiple measurements on the same location or on the same patient, and one wishes to allow for potential differences between these groups. The group-differences might conceivably manifest themselves both in the means (intercept) and in the association with covariates (slopes), but I have the impression that the vast majority of LME-applications use random effects on the intercept only. From some applications I have been involved in, I have formed the slightly unprecise impression that such intercept-only models can be particularly misleading, and further, that they can be worse than ignoring the group-structure altogether and use a simple linear model. Here, "misleading" (or "worse") expresses that the models appear to produce unrealistically small p-values for many of the coefficients. This post is an attempt at investigating this impression in a very simplified set-up.

Assume that we have a single covariate $x$, $n$ groups indexed by $i=1,...,n$, and $m_i$ observations in each group. In our set-up, the question of main interest is to test whether the coefficient of $x$ is different from zero or not. We let our true data-generating model be a linear mixed effect model with random effects on both the slope and the intercept,

$\text {True model: } y_i = \beta_0 + b_{0,i} + (\beta_1 + b_{1,i})x_i + \epsilon_i,$

$y_i$, $x_i$ and $\epsilon_i$ are $m_i \times 1$ vectors. The fixed effect of $x$, $\beta_1$, is the parameter of main interest. The random errors, $\epsilon_i$, are drawn independently from $N_{m_i} (0, \sigma^2 I)$ and the random effects $b_i = (b_{0,i},b_{1_i})$ are drawn independently from $N_2 (0, \sigma^2 D)$, where D is the $2 \times 2$ covariance matrix of the random effects,

$D = \begin{bmatrix} \gamma_0 & \gamma_{01} \\ \gamma_{01} & \gamma_{1} \end{bmatrix}.$

In addition to the true model we will consider two candidate models,

$\text{Intercept only: } y_i = \beta_{I,0} + b_{0,I,i} + \beta_{I,1} x_i + \epsilon_i, \quad \epsilon_i \sim N_{m_i}(0, \sigma^2_I I), \quad b_{0,I,i} \sim N(0,\sigma_I^2 \gamma_I);\\ \text {Linear model: } y_i = \beta_{L,0} + \beta_{L,1} x_i + \epsilon_i, \quad \epsilon_i \sim N_{m_i}(0, \sigma^2_L I).$

The subscripts $I$ and $L$ signal that the parameters do not necessarily mean exactly the same thing as in the true model. One might think that the intercept-only model, being in some sense closer to the true model, is better than the simple linear model. The point of this post is to show that this is not necessarily the case.

As mentioned above, we are interested in testing whether $\beta_1$ is different some zero. Therefore we will study the behaviour of the three test statistics for $\beta_1$ from the three models under the true data-generating mechanism,

$Z_T = \frac{\widehat \beta_1}{SE_{T}}, \qquad Z_I = \frac{\widehat \beta_{I,1}}{SE_{I}}, \qquad Z_L = \frac{\widehat \beta_{L,1}}{SE_{L}}.$

The estimators here are the usual maximum likelihood estimators; the standard errors are obtained from the inverse (observed) Fisher Information matrix as usual. These quantities thus correspond to the test-statistics coming out of traditional R packages and functions, like ${\rm lme4}$ for the LME models, and $\rm lm$ for the linear model. The following results are obtained from large-sample approximations, i.e. when we assume that the number of groups $n$ increases towards infinity. In a finite sample there might be other things going on in addition to the effect we will see here. By using some of the results in "Focused model selection for linear mixed models, with an application to whale ecology." by Cunen, Walløe and Hjort, we have worked out the (asymptotic) distribution of each of the three test statistic above under the true LME model.

From now on we consider the case where $\beta_1=0$, so there really is no association between the response and the covariate. In that case we expect that we should wrongly reject the null hypothesis of no association between $x$ and $y$ 5% of the times. Equivalently, we know that the distribution of $Z_T$ will be approximately standard normal (and with the approximation getting better as $n$ increases). Under the true LME model, the test statistics $Z_I$ and $Z_L$ have distributions that are different from the standard normal, see for example the following figure.

Distribution of the three test statistics, with $\beta_0=1$, $\sigma=1$, $D=[2, 0; 0, 1]$, $\bar x =1$, $sd(x)=0.5$.

All three are centered at zero, so there is no systematic bias in the estimation of $\beta_1$ (at least asymptotically), but the two wrong models will produce large test statistics at a much higher rate than when using the true model. Of course that is hardly surprising, since these models are, as I just said, wrong and we cannot expect the ordinary guarantees to hold. What is perhaps a bit surprising is that for this choice of $\beta$, $\sigma$, $D$ (and $x$) the intercept-only model is worse than the simple linear model: it has a much higher chance of producing a large test statistics which will lead to rejection of the null hypothesis. This is un-intuitive since we might feel that the intercept-only model is somehow closer to the true model than the linear one, which does not take the group structure into account at all.

The precise fate of the two "wrong" test statistics depends a lot on where we are in the parameter space, and also on characteristics of the covariate. See a couple of different scenarios in the figure below. Sometimes both wrong models are very bad, sometimes both are rather good, and sometimes $Z_I$ is indeed closer to the standard normal than $Z_L$. Still, from my brief investigations the phenomenon I have described here seems to be reasonably widespread.

Distribution of the three test statistics. Left: with $\beta_0=1$, $\sigma=1$, $D=[2, 0; 0, 1]$, $\bar x =1$, $sd(x)=2$. Right: with $\beta_0=1$, $\sigma=5$, $D=[5, 0; 0, 0.1]$, $\bar x =1$, $sd(x)=1$.

I hope to come back with more details in a future post, but for now I can indicate the following that the two wrong models are particularly bad when:

$\gamma_0$ and $\gamma_1$ are large (see notation above);
$\sigma$ is small;
the group sizes $m_i$ are big;
the spread in the covariate is large.

In most settings I have investigated, $Z_I$ is further from a standard normal than $Z_L$, unless $\sigma$ is particularly large.