
PyMC Multilevel Example
=======================

A nice example of using PyMC for multilevel (aka "Random Effects") modeling came through on the PyMC mailing list today. I've put it into a git repo so that I can play around with it a little, and collect up the feedback that the list generates.

The original message, from Scott B, lays out the situation with refreshing precision. I continue to be impressed with the skill-level that new PyMC users frequently begin with (for fun, compare to the questions on the NetworkX mailing list sometime... I guess this is an indication of the relative success in the popularization of network science and Bayesian statistics).

Anyways, here is the original message::

 Hi All,

 New to this group and to PyMC (and mostly new to Python). In any case,
 I'm writing to ask for help in specifying a multilevel model (mixed
 model) for what I call partially clustered designs. An example of a
 partially clustered design is a 2-condition randomized psychotherapy
 study where subjects are randomly assigned to conditions. In condition
 1, subjects are treated in small groups of say 8 people apiece. In
 condition 2, usually a control, subjects are only assessed on the
 outcome (they don't receive an intervention). Thus you have clustering
 in condition 1 but not condition 2.

 The model for a 2-condition study looks like (just a single time point
 to keep things simpler):

 y_ij = b_0 + b_1*Tx + u_j*Tx + e_ij

 where y_ij is the outcome for the ith person in cluster j (in most
 multilevel modeling software and in PROC MCMC in SAS, subjects in the
 unclustered condition are all in clusters of just 1 person), b_0 is
 the overall intercept, b_1 is the treatment effect, Tx is a dummy
 coded variable coded as 1 for the clustered condition and 0 for the
 unclustered condition, u_j is the random effect for cluster j, and
 e_ij is the residual error. The variance among clusters is \sigma^2_u
 and the residual variance is \sigma^2_e (ideally you would estimate a
 unique residual variance by condition).

 Because u_j interacts with Tx, the random effect only contributes
 to the clustered condition.

 In my PyMC model, I expressed the model in matrix form - I find it
 easier to work with, especially for dealing with the cluster effects.
 Namely:

 Y = XB + ZU + R

 where X is an n x 2 design matrix for the overall intercept and
 intervention effect, B is a 1 x 2 matrix of regression coefficients, Z
 is an n x c design matrix for the cluster effects (where c is equal to
 the number of clusters in the clustered condition), and U is a c x 1
 matrix of cluster effects. The way I've written the model below, I
 have R as an n x n diagonal matrix with \sigma^2_e on the diagonal and
 0's on the off-diagonal.

 All priors below are based on a model fit in PROC MCMC in SAS. I'm
 trying to replicate the analyses in PyMC so I don't want to change the
 priors.

 The data for my code can be downloaded here:
 http://dl.dropbox.com/u/613463/run1.csv

 The data consist of 200 total subjects. 100 in the clustered condition
 and 100 in the unclustered. In the clustered condition there are 10
 clusters of 10 people each. There is a single outcome variable.

 The PyMC set-up file can be downloaded here (also pasted below):
 http://dl.dropbox.com/u/613463/analysis1.py

 The file for running the PyMC set-up file can be downloaded here:
 http://dl.dropbox.com/u/613463/run_analysis1.py

 I have 3 specific questions about the model:

 1 - Given the description of the model, have I successfully specified
 the model? The results are quite similar to PROC MCMC, though the
 cluster variance (\sigma^2_u) differs more than I would expect due to
 Monte Carlo error. The differences make me wonder if I haven't got it
 quite right in PyMC.

 2 - Is there a better (or more efficient) way to set up the model? The
 model runs quickly but I am trying to learn Python and to get better
 at optimizing how to set up models (especially multilevel models).

 3 - How can I change my specification so that I can estimate unique
 residual variances for clustered and unclustered conditions? Right now
 I've estimated just a single residual variance. But I typically want
 separate estimates for the residual variances per intervention
 condition.

 Thanks so much for your help. My code follows my signature.

 Best,
 Scott
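
Scott's actual code isn't reproduced in this excerpt, so to fix notation for the notes below, here is a minimal sketch of one way to express the model above in PyMC 2. The data setup and priors are illustrative stand-ins, not Scott's actual file::

    import numpy as np
    import pymc

    # Stand-in data with the structure described above (the real
    # outcome lives in run1.csv): 100 clustered subjects in 10
    # clusters of 10, plus 100 unclustered controls.
    n, c = 200, 10
    treat = np.repeat([1, 0], 100)             # 1 = clustered condition
    cluster = np.repeat(np.arange(c), 10)      # cluster ids, condition 1 only
    Z = np.zeros((n, c))
    Z[np.arange(100), cluster] = 1.            # unclustered rows stay all zero
    y = np.random.randn(n)                     # placeholder outcome

    # Illustrative vague priors (Scott's file uses the PROC MCMC priors).
    beta0 = pymc.Normal('beta0', mu=0., tau=1e-4)
    beta1 = pymc.Normal('beta1', mu=0., tau=1e-4)
    var_u = pymc.Uniform('var_u', lower=0., upper=100.)
    var_e = pymc.Uniform('var_e', lower=0., upper=100.)

    tau_u = pymc.Lambda('tau_u', lambda v=var_u: 1. / v)
    U = pymc.Normal('U', mu=0., tau=tau_u, size=c)   # cluster effects

    @pymc.deterministic
    def mu(b0=beta0, b1=beta1, U=U):
        return b0 + b1 * treat + np.dot(Z, U)        # XB + ZU in matrix terms

    tau_e = pymc.Lambda('tau_e', lambda v=var_e: 1. / v)
    Y = pymc.Normal('Y', mu=mu, tau=tau_e, value=y, observed=True)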

Here are my notes on the matter:

  1. This code worked for me without any modification. :) Although when I tried to run it twice in the same IPython session, it gave me strange complaints (for pymc version 2.1alpha, wall time 78s). For the newest version in the git repo (pymc version 2.2grad, commit ca77b7aa28c75f6d0e8172dd1f1c3f2cf7358135, wall time 75s), it didn't complain.

  2. I find the data wrangling section of the model quite opaque. If there is a difference between the PROC MCMC and PyMC results, this is the first place I would look. I've been searching for the most transparent ways to deal with data in Python, so I can share some of my findings as applied to this block of code.

  3. The model could probably be faster. This is the time for me to recall the two cardinal rules of program optimization: 1) Don't Optimize, and 2) (for experts only) Don't Optimize Yet.

    That said, the biggest influence on how long PyMC takes to run is the choice of step methods. But adjusting them is a delicate operation. More on this to follow.

  4. Changing the specification is the true power of the PyMC approach, and the reason it is worth the extra effort, since a random effects model like yours is just one line of STATA. So I'd like to write out in detail how to change it. More on this to follow.

  5. Indentation should be 4 spaces. Diverging from this inane detail will make Python people itchy.

I think that speeding up the MCMC is more fun than cleaning up the data wrangling, so I'll assume that the data is being set up as desired for now and mess around with the step methods. Commit linked here https://github.com/aflaxman/pymc-multilevel-example/commit/03d0e64653a99085ac5f391acd0ad5839ddfaae4 is a small example of this, which doesn't speed things up, but could mean fewer samples are required of the chain.

To actually speed things up (and also potentially reduce the number of samples required), I'm going to combine the beta0 and beta1 stochs. This reduces the wall time to 58s, but the diagnostic plots make me think it needs a longer burn-in period. I like to throw out the first half of my samples, so I'll make that change, too. Pro tip: I added a reload(analysis1) to the run_analysis1.py script, so that my changes to the model automatically make it into IPython when I say time run run_analysis1. Commit linked here https://github.com/aflaxman/pymc-multilevel-example/commit/6bd55cef9c5d854f7755924c705ab8e111e20b3c
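
In terms of the sketch above (the commit has the real details), the change is roughly::

    # Before: beta0 and beta1 were separate scalar stochs, each handled
    # by its own Metropolis step method. After: a single array-valued
    # stoch, so both coefficients are proposed and updated together.
    B = pymc.Normal('B', mu=np.zeros(2), tau=1e-4)

    # This replaces the mu deterministic from the sketch above.
    @pymc.deterministic
    def mu(B=B, U=U):
        return B[0] + B[1] * treat + np.dot(Z, U)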

I'm not keeping careful track of whether the results change as I modify the step methods, which is probably not a recommended practice. But it is fun to see the methods get faster and faster as I group more and more stochs into the AM step method. (43s wall time for M.B and M.U in one AM; 32s wall time for all stochs together, but the results are not pretty.) https://github.com/aflaxman/pymc-multilevel-example/commit/49dea954c0c681ad74a2b427a2e6ca594e501c4f
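
The grouping itself is just a use_step_method call before sampling; something like::

    M = pymc.MCMC([B, U, var_u, var_e, Y])

    # One AdaptiveMetropolis step method for B and U together: it makes
    # joint proposals from an empirical covariance that adapts as the
    # chain runs.
    M.use_step_method(pymc.AdaptiveMetropolis, [B, U])

    # Throw out the first half of the samples as burn-in.
    M.sample(iter=20000, burn=10000)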

Sometimes I've had good luck with choosing initial values for the MCMC by maximizing the posterior or something related to it. Here is code to do that, which doesn't take very long (36s wall time, and respectable-looking results are back) https://github.com/aflaxman/pymc-multilevel-example/commit/e8c376b8685c64c7183a8aeb14ec05b095042652
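
The pattern, again in terms of the names from the sketch above, is roughly::

    # pymc.MAP maximizes the joint log-probability with scipy's
    # optimizers and leaves each stochastic's value at the optimum when
    # fit() returns, so the MCMC below starts near the posterior mode.
    start = pymc.MAP([B, U, var_u, var_e, Y])
    start.fit()

    M = pymc.MCMC([B, U, var_u, var_e, Y])
    M.use_step_method(pymc.AdaptiveMetropolis, [B, U])
    M.sample(iter=20000, burn=10000)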

So that little burst of messing around has halved the run time. Too bad I didn't work on the things I was supposed to be doing for the last hour.

Another hour of messing around has given me some confidence that 200K samples is certainly enough, and 20K samples is probably enough to guide model development. To assess this, I relied on the autocorrelation plot of the stoch traces, which I added to the Matplot.plot routine, since I think everyone should use this approach. Commit linked here https://github.com/aflaxman/pymc-multilevel-example/commit/e8545ca5bec6e3a1e5dbb7ba20bb957bc67c0d95
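
I patched the plots into the Matplot.plot routine in the commit, but PyMC's Matplot module also exposes the diagnostic as a standalone function, so without any patching you can do something like::

    import pymc

    # Autocorrelation of each stoch's trace: the slower it decays with
    # increasing lag, the less information each sample carries, and the
    # longer the chain needs to be.
    pymc.Matplot.autocorrelation(M)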

Now I'm ready to try simplifying the data wrangling section, and then everything will be in order to extend the model to have different variance for the clustered and unclustered observations.

I like to use pylab's csv2rec function for loading data files. It deals nicely with the column headings, and converts columns to the correct types pretty reliably. I'm going to do two commits in the process of changing this code, one which checks carefully that I'm not breaking anything, and then another to yield a nice short block of code. Commits linked here https://github.com/aflaxman/pymc-multilevel-example/commit/8c06499b287e1bbd9dd7a4146bc627d6e9187ad8 and here https://github.com/aflaxman/pymc-multilevel-example/commit/11ff14dab2377491e8d4d155acfd74b7ca6c93d0
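
For reference, the usage is just the following (the column names here are hypothetical; check run1.csv's header row)::

    import pylab as pl

    # csv2rec reads the header row and infers each column's type,
    # returning a numpy record array with attribute access by name.
    data = pl.csv2rec('run1.csv')

    y = data.y            # hypothetical column names
    treat = data.treat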

Now, on to changing the model! I added a separate variance parameter for the treatment and control groups in a slightly sneaky way. It only took 10 keystrokes (roughly), but maybe it would be better to use more code and have a clearer approach. Commit linked here https://github.com/aflaxman/pymc-multilevel-example/commit/a46531085c673e62458b8b28153f50aebae036de

Before::

    B: [-0.07 0.3 ]
    u: [-0.14 0.02 0.3 0.5 0.36 -1.02 0.33 0.23 -0.28 -0.35]
    var_e1: 0.94
    var_u: 0.4

After::

    B: [-0.06 0.31]
    u: [-0.13 0. 0.28 0.53 0.39 -1.08 0.34 0.22 -0.31 -0.35]
    var_e1: [ 1.12 0.79]
    var_u: 0.41

Since I sneakily used the indicator run1.treat to index the var_e1 array, var_e1[0] is the control variance and var_e1[1] is the treatment variance. So the change in the model barely moves the parameter means, but it does show that the residual variance of the control group is higher than that of the treatment group.
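
In terms of the sketch above, the trick looks something like this: make the residual variance array-valued, then use the 0/1 indicator as a fancy index::

    # An array-valued variance stoch: one entry per condition.
    var_e1 = pymc.Uniform('var_e1', lower=0., upper=100., size=2)

    # Indexing by the 0/1 treatment indicator broadcasts the right
    # variance to each observation: var_e1[0] for controls, var_e1[1]
    # for the treatment condition.
    @pymc.deterministic
    def tau_e(v=var_e1):
        return 1. / v[treat]

    Y = pymc.Normal('Y', mu=mu, tau=tau_e, value=y, observed=True)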

Too bad I didn't generate before-and-after numbers on the uncertainty intervals; that is something that might have changed in an interesting way with this more complicated model.

In conclusion, here is why I think that this approach is worth the extra effort, compared to using xtmixed in STATA. Although the STATA model is almost done when you figure out the single line xtmixed y treat ||j:, that is not quite what you want. The control condition is not supposed to have random effects, so you have to find a hack for that, maybe xtmixed y treat ||j: treat, nointercept. And then you want to add different variances for treatment and control, which is doable, but also hacky, maybe with xtmixed y treat ||j: treat, nointercept ||treat:. And now you've got a very short line, but one that is going to be very hard to understand in the future. And what if you want to fiddle with the priors? Or change the noise model? Much better to have 30 clear lines of PyMC, in my opinion.

p.s. Django and Rails have gotten a lot of mileage out of emphasizing convention in frequently performed tasks, and I think that PyMC models could also benefit from this approach. I'm sure I can't develop our conventions myself, but I have changed all the file names to move towards what I think we might want them to look like. Commit linked here https://github.com/aflaxman/pymc-multilevel-example/commit/603112c1de3d0fd8fa8fdc76a21ec3f290d30a06 My analyses often have these basic parts: data, model, fitting code, graphics code. Maybe yours do too.