Estimating Ordinal Regression Models with rstanarm
Jonah Gabry and Ben Goodrich
2026-03-09
Source:vignettes/polr.Rmd
polr.RmdIntroduction
This vignette explains how to estimate models for ordinal outcomes
using the stan_polr function in the
rstanarm package.
Steps 3 and 4 are covered in more depth by the vignette entitled “How to Use the rstanarm Package”. This vignette focuses on Step 1.
One of the strengths of doing MCMC with Stan — as opposed to a Gibbs sampler — is that reparameterizations are essentially costless, which allows the user to specify priors on parameters that are either more intuitive, numerically stable, or computationally efficient without changing the posterior distribution of the parameters that enter the likelihood. Advantageous parameterizations are already built into the Stan programs used in the rstanarm package, so it is just a matter of using these vignettes to explain how the priors work in the context of these reparameterizations.
Likelihood
Ordinal outcomes fall in one of categories. One way to motivate an ordinal model is to introduce a latent variable, , that is related to the observed outcomes via an observation mechanism: where is a vector of cutpoints of length .
Then is modeled as a linear function of predictors where has mean zero and unit scale but can be specified as being drawn from one of several distributions. Note that there is no “intercept” in this model since the data cannot distinguish an intercept from the cutpoints. However, if , then can be referred to as either the cutpoint or the intercept.
A Bayesian can treat as another unknown parameter, although for computational efficiency the Stan code essentially integrates each out of the posterior distribution, leaving the posterior distribution of and . Nevertheless, it is useful to motivate the model theoretically as if were just an unknown parameter with a distribution truncated by the relevant element(s) of .
Priors
If were observed we would simply have a linear regression model for it, and the description of the priors in the vignette entitled “Estimating Linear Models with the rstanarm Package” would apply directly. Another way to say the same thing is conditional on a realization of , we have a linear regression model and the description of the priors in the other vignette does apply (and should be read before continuing with this subsection).
The stan_lm function essentially specifies a prior on
,
where
is the upper triangular matrix in the QR decomposition of the design
matrix,
.
Furthermore, in stan_lm,
where
is the proportion of variance in the outcome that is attributable to the
coefficients in a linear model.
The main difference in the context of a model for an ordinal outcome is that the scale of is not identified by the data. Thus, the ordinal model specifies that , which implies that is an intermediate parameter rather than a primitive parameter.
It is somewhat more difficult to specify a prior value for the in an ordinal model because refers to the proportion of variance in the that is attributable to the predictors under a linear model. In general, the tends to be lower in an ordinal model than in a linear model where the continuous outcome is observed.
The other difference is that an ordinal model does not have a global intercept but rather a vector of cutpoints. The implied prior on these cutpoints used by the rstanarm package is somewhat novel. The user instead specifies a Dirichlet prior on , which is to say the prior probability of the outcome falling in each of the categories given that the predictors are at their sample means. The Dirichlet prior is for a simplex random variable, whose elements are non-negative and sum to . The Dirichlet PDF can be written as where is a simplex vector such that .
The Dirichlet prior is one of the easiest to specify because the so-called “concentration” hyperparameters can be interpreted as prior counts, i.e., prior observations for each of the J categories (although they need not be integers). If for every (the default used by rstanarm) then the Dirichlet prior is jointly uniform over the space of these simplexes. This corresponds to a prior count of one observation falling in each of the ordinal categories when the predictors are at their sample means and conveys the reasonable but weak prior information that no category has probability zero. If, for each , then the prior mode is that the categories are equiprobable, with prior probability of the outcome falling in each of the categories. The larger the value of the more sharply peaked the distribution is at the mode.
The -th cutpoint is then given by where is an inverse CDF function, which depends on the assumed distribution of . Common choices include the normal and logistic distributions. The scale parameter of this distribution is again . In short, by making each a function of , it allows us to specify a Dirichlet prior on , which is simpler than specifying a prior on directly.
Example
In this section, we start with an ordinal model of tobacco
consumption as a function of age and alcohol consumption. Frequentist
estimates can be obtained using the polr function in the
MASS package:
To obtain Bayesian estimates, we prepend stan_ and
specify the priors:
library(rstanarm)
post0 <- stan_polr(tobgp ~ agegp + alcgp, data = esoph,
prior = R2(0.25), prior_counts = dirichlet(1),
seed = 12345)
print(post0, digits = 1)The point estimates, represented by the posterior medians, are
qualitatively similar to the maximum-likelihood estimates but are
somewhat shrunk toward zero due to the regularizing prior on the
coefficients. Since these cutpoints are actually known, it
would be more appropriate for the model to take that into account, but
stan_polr does not currently support that.
Next, we utilize an example from the MASS package where low birthweight is the binary outcome of interest. First, we recode some of the variables:
data("birthwt", package = "MASS")
birthwt$race <- factor(birthwt$race, levels = 1:3,
labels = c("white", "black", "other"))
birthwt$bwt <- birthwt$bwt / 1000 # convert from grams to kilograms
birthwt$low <- factor(birthwt$low, levels = 0:1, labels = c("no", "yes"))It is usually a good idea to rescale variables by constants so that all the numbers are in single or double digits. We start by estimating a linear model for birthweight in kilograms, flipping the sign so that positive coefficients are associated with lower birthweights.
post1 <- stan_lm(-bwt ~ smoke + age + race + ptl + ht + ftv,
data = birthwt, prior = R2(0.5),
seed = 12345)
print(post1)Next, we estimate an “ordinal” model for the incidence of low
birthweight, which is defined as a birth weight of less than
kilograms. Even though this outcome is binary, a binary variable is a
special case of an ordinal variable with
categories and is acceptable to stan_polr. We can think of
bwt as something proportional to
and pretend that it is not observed, forcing us to estimate an ordinal
model.
post2 <- stan_polr(low ~ smoke + age + race + ptl + ht + ftv, data = birthwt,
prior = R2(0.5), prior_counts = dirichlet(c(1,1)),
method = "probit", seed = 12345)This prior seems to have worked well in this case because none of the points in the plot are above , which would have indicated the the posterior is very sensitive to those observations. If we compare the estimated coefficients,
they have the same signs and similar magnitudes, with the exception
of the “Intercept”. In an ordinal model where the outcome only has
categories, this “Intercept” is actually
,
but it is more conventional to call it the “Intercept” so that it agrees
with stan_glm when
family = binomial(link = 'probit'). Recall that
in an ordinal model, so if we rescale the coefficients from a linear
model by dividing by the posterior median of
,
the resulting coefficients are even closer to those of the ordinal
model.
This illustrates the fundamental similarity between a linear model for a continuous observed outcome and a linear model for a latent that generates an ordinal observed outcome. The main difference is when the outcome is continuous and observed, we can estimate the scale of the errors meaningfully. When the outcome is ordinal, we can only fix the scale of the latent errors to arbitrarily.
Finally, when
,
the stan_polr function allows you to specify
non-NULL values of the shape and
rate arguments, which implies a “scobit” likelihood where
the probability of success is given by
,
where
is the logistic CDF and
is a skewing parameter that has a gamma prior with a given
shape and rate. If
,
then the relationship between
and the probability of success is asymmetric. In principle, it seems
appropriate to estimate
but in practice, a lot of data is needed to estimate
with adequate precision. In the previous example, if we specify
shape = 2 and rate = 2 to reflect the prior
beliefs that
is expected to be
but has a variance of
,
then the loo calculation yields many Pareto shape
parameters that are excessively large. However, with more than
observations, such a model may be more fruitful.
Conclusion
The posterior distribution for an ordinal model requires priors on
the coefficients and the cutpoints. The priors used by the
stan_polr function are unconventional but should work well
for a variety of problems. The prior on the coefficients is essentially
the same as that used by the stan_lm function but omits a
scale parameter because the standard deviation of the latent
is not identified by the data. The cutpoints are conditionally
deterministic given a simplex vector for the probability of falling in
each of the
ordinal categories given that the predictors are at their sample means.
Thus, a Dirichlet prior — which is relatively easy to specify and has a
good default of jointly uniform — on this simplex completes the
posterior distribution.
This approach provides an alternative to stan_glm with
family = binomial() even if the outcome variable has only
two categories. The stan_glm function has more options for
the prior on the coefficients and the prior on the intercept (which can
be interpreted as the first cutpoint when
).
However, it may be more difficult to obtain efficient sampling with
those priors.