The ideas here are based on Chapter 9 of Särndal, Swensson and
Wretman, which deals with two-phase sampling and the regression
estimation of population totals. Here we are interested in multiphase
sampling and in settings where many statistics may be computed with the
same set of calibrated weights, so the priorities and notation are
slightly different.
As in the book, we consider the setting where the sampling
probabilities are known (or well-estimated) in advance and raking
adjustments are small. When sampling probabilities
are grossly incorrect (e.g., because of missing data or non-response), so
that the raking adjustments are not small, we would want different
computations, such as those of Chang & Kott (Biometrika, 2008).
Estimation
In multiphase sampling we have a sequence of
nested subsamples, with sampling probabilities that can depend on any
data at previous phases. We write
$\pi_{i,1}$, $\pi_{i,2|1}$, $\pi_{i,3|2}$, …, $\pi_{i,k|k-1}$, …
as the sampling probabilities for observation $i$
at each phase, and similarly $\pi_{ij,k|k-1}$ for pairwise probabilities. In general, we
use a $k|k-1$ subscript for quantities related to the $k$th
phase of sampling and a $k$
subscript for quantities related to the cumulative effect of sampling
from phase 1 through phase $k$.
All sums are over the population unless otherwise specified; restriction
to samples is achieved with the sampling indicators
$R_{i,k|k-1}$ (and $R_{i,k}$).
We also write
$$\pi^*_{ij,k} = \pi_{ij,1}\,\pi_{ij,2|1}\cdots\pi_{ij,k|k-1}$$
for the forward cumulative pairwise probabilities.
The product
$\pi^*_{i,k} = \pi_{i,1}\,\pi_{i,2|1}\cdots\pi_{i,k|k-1}$
is not in general the marginal sampling probability for
observation $i$,
because each $\pi_{i,k|k-1}$
can depend on data up to phase $k-1$,
so $\pi^*_{i,k}$
is a random variable. To get the true marginal probabilities we would
need to integrate out all the dependence on data measured at
intermediate phases. However, it is still true that
$$E\left[\frac{R_{i,k}}{\pi^*_{i,k}}\right] = 1,$$
which is the key fact we need to estimate totals.
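A minimal sketch of why this expectation equals 1 even though the cumulative product is random, using exact enumeration on a toy two-phase design. The design (Bernoulli sampling at phase 1, and a phase-2 probability equal to one over the phase-1 sample size) and the names `phase2_prob` and `expectations` are illustrative assumptions, not part of the original text.

```python
from fractions import Fraction
from itertools import product

N = 3                      # toy population size (assumption)
p1 = Fraction(1, 2)        # phase-1 Bernoulli inclusion probability (assumption)

def phase2_prob(n1):
    # Hypothetical rule: the phase-2 probability depends on phase-1 data,
    # here through the phase-1 sample size n1.
    return Fraction(1, n1)

def expectations(i=0):
    """Exact E[R_{i,2} / pi*_{i,2}], P(R_{i,2} = 1), and the set of pi* values."""
    e_ratio = Fraction(0)
    marginal = Fraction(0)
    pi_star_values = set()
    for s1 in product([0, 1], repeat=N):      # every possible phase-1 sample
        prob_s1 = Fraction(1)
        for r in s1:
            prob_s1 *= p1 if r else 1 - p1
        if not s1[i]:
            continue                          # unit i not sampled: ratio is 0
        p2 = phase2_prob(sum(s1))
        pi_star = p1 * p2                     # random: varies with the sample s1
        pi_star_values.add(pi_star)
        # Unit i is kept at phase 2 with probability p2, contributing 1 / pi_star.
        e_ratio += prob_s1 * p2 / pi_star
        marginal += prob_s1 * p2
    return e_ratio, marginal, pi_star_values

e_ratio, marginal, pi_star_values = expectations(0)
print(e_ratio, marginal)   # 1 7/24
```

In this toy design the cumulative product takes three different values (1/2, 1/4, 1/6), none of which is the marginal probability 7/24, yet the weighted indicator still has expectation exactly 1.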
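It’s also still true (per Särndal et al.) that the variance of a total
can be estimated by something very like the Horvitz-Thompson formula
$$\widehat{\mathrm{var}}[\hat T] = \sum_{i,j} R_{i,K}R_{j,K}\,\check{\Delta}^*_{ij}\,\check{y}_i\check{y}_j$$
where, with $K$ the final phase,
$\check{y}_i = y_i/\pi^*_{i,K}$,
$\Delta^*_{ij} = \pi^*_{ij,K} - \pi^*_{i,K}\pi^*_{j,K}$,
and
$$\check{\Delta}^*_{ij} = \frac{\Delta^*_{ij}}{\pi^*_{ij,K}}.$$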
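As a hedged illustration of a Horvitz-Thompson-style variance estimate built from cumulative probabilities, the sketch below assumes the estimator sums, over pairs of finally sampled units, the weighted covariance (pairwise probability minus product of marginals, divided by the pairwise probability) times the two weighted observations. The function name `ht_like_variance` and the input layout (a matrix of cumulative pairwise probabilities with the first-order probabilities on the diagonal) are assumptions made for the example.

```python
from fractions import Fraction

def ht_like_variance(y, pi_star_pair):
    """Variance estimate from cumulative pairwise probabilities.

    y            : observations for the units in the final-phase sample
    pi_star_pair : matrix of cumulative pairwise probabilities, with the
                   first-order cumulative probabilities on the diagonal
    """
    n = len(y)
    total = Fraction(0)
    for i in range(n):
        for j in range(n):
            pi_i = pi_star_pair[i][i]
            pi_j = pi_star_pair[j][j]
            pi_ij = pi_star_pair[i][j]
            dcheck = (pi_ij - pi_i * pi_j) / pi_ij   # weighted covariance
            total += dcheck * (y[i] / pi_i) * (y[j] / pi_j)
    return total

# Two units sampled independently, so the off-diagonal weighted covariance is 0.
half, quarter = Fraction(1, 2), Fraction(1, 4)
pair = [[half, half * quarter],
        [half * quarter, quarter]]
print(ht_like_variance([2, 4], pair))   # 200
```

With independent sampling only the diagonal terms survive, giving (1/2)(2/0.5)^2 + (3/4)(4/0.25)^2 = 8 + 192 = 200.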
In the absence of raking we could speed up computation using a
recursive relationship for constructing $\check{\Delta}^*_{ij}$
from the weighted covariances $\check{\Delta}_{ij,1}$, $\check{\Delta}_{ij,2|1}$,
and so on at each phase, subscripted down to the subsample remaining at
phase $K$.
This also has the advantage of simplicity. However, we can’t do this
precomputation with raking, and it has the disadvantage of not giving
components of variance at each phase, which we like having. We do use
the recursive combination for constructing $\check{\Delta}_{ij,k|k-1}$
in multistage designs within a single phase.
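One way such a recursion could be written, offered as a sketch rather than as the package's actual computation: if a weighted covariance is defined as one minus the product of the two marginal probabilities over the pairwise probability, and cumulative probabilities are products over phases, then one minus the cumulative weighted covariance is the product over phases of one minus the per-phase weighted covariance. The helper names `dcheck` and `combine` are hypothetical.

```python
from fractions import Fraction

def dcheck(pi_i, pi_j, pi_ij):
    """Weighted covariance (pi_ij - pi_i * pi_j) / pi_ij for one pair."""
    return (pi_ij - pi_i * pi_j) / pi_ij

def combine(dchecks):
    """Combine per-phase weighted covariances for one pair of units,
    assuming cumulative probabilities are products over phases."""
    keep = Fraction(1)
    for d in dchecks:
        keep *= 1 - d
    return 1 - keep

# Per-phase probabilities for one pair of units (made-up numbers).
d1 = dcheck(Fraction(1, 2), Fraction(1, 2), Fraction(1, 5))   # phase 1
d2 = dcheck(Fraction(1, 2), Fraction(1, 3), Fraction(1, 4))   # phase 2 given phase 1

# Direct computation from the cumulative products of the same probabilities.
direct = dcheck(Fraction(1, 4), Fraction(1, 6), Fraction(1, 20))

print(combine([d1, d2]), direct)   # 1/6 1/6
```

The same combination rule applies to stages within a phase whenever the overall probabilities factor as products of per-stage probabilities.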
Per-phase variances
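We need to use a summation over phases. We consider an estimated
total as a telescoping sum over phases
$$\hat T_K - T = (\hat T_1 - T) + (\hat T_2 - \hat T_1) + \cdots + (\hat T_K - \hat T_{K-1}),$$
where $\hat T_k = \sum_i R_{i,k}\,y_i/\pi^*_{i,k}$ is the estimate based on the phase-$k$ sample, $K$ is the final phase, and each term is the error incurred by one phase of sampling. That
is
$$\hat T_1 - T = \sum_i \left(\frac{R_{i,1}}{\pi_{i,1}} - 1\right) y_i$$
and in general
$$\hat T_k - \hat T_{k-1} = \sum_i \frac{R_{i,k-1}}{\pi^*_{i,k-1}}\left(\frac{R_{i,k|k-1}}{\pi_{i,k|k-1}} - 1\right) y_i.$$
These are all uncorrelated, because each depends on sampling only at one
phase. They aren’t independent, because the available data for sampling
at phase $k$
depends on all previous phases, but they do form a martingale difference
sequence. I will write $\check{y}_{i,k}$
for $y_i/\pi^*_{i,k}$,
the weighted observation at phase $k$,
so that we have
$$\hat T_K - T = \sum_{k=1}^{K}\sum_i R_{i,k-1}\left(R_{i,k|k-1} - \pi_{i,k|k-1}\right)\check{y}_{i,k},$$
with the convention $R_{i,0} = 1$.
The variance of this sum is thus the sum of variances
$$\mathrm{var}[\hat T_K] = \sum_{k=1}^{K} \mathrm{var}[\hat T_k - \hat T_{k-1}], \qquad \hat T_0 = T,$$
and each variance is (conditional on the sampling so far) of the usual
Horvitz-Thompson form
$$\sum_{i,j} R_{i,k-1}R_{j,k-1}\,\Delta_{ij,k|k-1}\,\check{y}_{i,k}\check{y}_{j,k}, \qquad \Delta_{ij,k|k-1} = \pi_{ij,k|k-1} - \pi_{i,k|k-1}\pi_{j,k|k-1},$$
which could be estimated at phase $k$
by
$$\sum_{i,j} R_{i,k}R_{j,k}\,\check{\Delta}_{ij,k|k-1}\,\check{y}_{i,k}\check{y}_{j,k}, \qquad \check{\Delta}_{ij,k|k-1} = \frac{\Delta_{ij,k|k-1}}{\pi_{ij,k|k-1}}.$$
That’s still not enough, because we don’t necessarily have $y_i$
until phase $K$,
so we need to weight down to phase $K$:
$$\widehat{\mathrm{var}}_k = \sum_{i,j} \frac{R_{i,K}R_{j,K}}{\pi^*_{ij,K}/\pi^*_{ij,k}}\,\check{\Delta}_{ij,k|k-1}\,\check{y}_{i,k}\check{y}_{j,k}.$$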
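The per-phase decomposition can be checked numerically. The sketch below assumes each phase's estimated component is a sum over final-sample pairs of the per-phase weighted covariance times the two observations weighted to that phase, divided by the product of the later phases' pairwise probabilities. The function names and the input layout (one pairwise-probability matrix per phase, first-order probabilities on the diagonal) are assumptions for illustration.

```python
from fractions import Fraction as F

def cumulative(pair_by_phase, k):
    """Elementwise product of the first k per-phase pairwise matrices."""
    n = len(pair_by_phase[0])
    out = [[F(1)] * n for _ in range(n)]
    for P in pair_by_phase[:k]:
        out = [[out[i][j] * P[i][j] for j in range(n)] for i in range(n)]
    return out

def phase_component(y, pair_by_phase, k):
    """Estimated variance contribution of phase k (1-based)."""
    n, K = len(y), len(pair_by_phase)
    P = pair_by_phase[k - 1]
    cum_k = cumulative(pair_by_phase, k)
    cum_K = cumulative(pair_by_phase, K)
    total = F(0)
    for i in range(n):
        for j in range(n):
            dcheck = (P[i][j] - P[i][i] * P[j][j]) / P[i][j]
            later = cum_K[i][j] / cum_k[i][j]      # weight down to phase K
            total += dcheck * (y[i] / cum_k[i][i]) * (y[j] / cum_k[j][j]) / later
    return total

def overall(y, pair_by_phase):
    """Single Horvitz-Thompson-style estimate from cumulative probabilities."""
    n = len(y)
    cum = cumulative(pair_by_phase, len(pair_by_phase))
    return sum((cum[i][j] - cum[i][i] * cum[j][j]) / cum[i][j]
               * (y[i] / cum[i][i]) * (y[j] / cum[j][j])
               for i in range(n) for j in range(n))

# Two units in the final sample of a two-phase design (made-up probabilities).
phases = [
    [[F(1, 2), F(1, 5)], [F(1, 5), F(1, 2)]],   # phase 1
    [[F(1, 2), F(1, 4)], [F(1, 4), F(1, 3)]],   # phase 2 given phase 1
]
y = [2, 4]
components = [phase_component(y, phases, k) for k in (1, 2)]
print(components, overall(y, phases))   # components 48 and 544, overall 592
```

Under these assumptions the components add up exactly to the single-formula estimate, while still reporting how much each phase contributes.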
Raking
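Raking of phase $k$
to phase $k-1$
involves estimating raking adjustments $g_i$
that satisfy calibration constraints
$\sum_i R_{i,k}\,g_i x_i/\pi^*_{i,k} = \sum_i R_{i,k-1}\,x_i/\pi^*_{i,k-1}$
on variables $x_i$
available at phase $k-1$.
Let $P$
and $P^\perp = I - P$
be the projections on to and orthogonal to the space spanned by $x$
given weights $1/\pi^*_{i,k}$
and let $\hat y = Py$
and $e = P^\perp y$.
The calibrated total estimate is
$$\hat T_k^{\mathrm{cal}} = \sum_i \frac{R_{i,k}\,g_i}{\pi^*_{i,k}}\,e_i + \sum_i \frac{R_{i,k}\,g_i}{\pi^*_{i,k}}\,\hat y_i.$$
Applying the calibration constraints to the second term we have
$$\hat T_k^{\mathrm{cal}} = \sum_i \frac{R_{i,k}\,g_i}{\pi^*_{i,k}}\,e_i + \sum_i \frac{R_{i,k-1}}{\pi^*_{i,k-1}}\,\hat y_i$$
and since $\hat y_i$
is not random (to first order) conditional on phase $k-1$,
the estimated variance contribution for phase $k$
comes from just the first term and is
$$\sum_{i,j} \frac{R_{i,K}R_{j,K}}{\pi^*_{ij,K}/\pi^*_{ij,k}}\,\check{\Delta}_{ij,k|k-1}\,\frac{g_i e_i}{\pi^*_{i,k}}\,\frac{g_j e_j}{\pi^*_{j,k}},$$
with the variances of the other phases being unaffected.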
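A small sketch of the decomposition of a calibrated total into a residual part and a fitted part. It uses linear calibration with a single auxiliary variable in place of raking proper, which is an assumption made to keep the example short and exact. The names `linear_calibration` and `wls_split`, the weights, and the target total are all hypothetical.

```python
from fractions import Fraction as F

def linear_calibration(w, x, target):
    """Adjustments g_i = 1 + lam * x_i chosen so that sum(w * g * x) == target.
    Linear calibration is used here as a stand-in for raking (assumption)."""
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    lam = (target - swx) / swxx
    return [1 + lam * xi for xi in x]

def wls_split(w, x, y):
    """Weighted projection of y on x: returns (beta, fitted values, residuals)."""
    beta = (sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
            / sum(wi * xi * xi for wi, xi in zip(w, x)))
    fitted = [beta * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return beta, fitted, resid

# Phase-k sample: weights are inverse cumulative probabilities (made-up numbers).
w = [F(4), F(4), F(6)]
x = [1, 2, 3]
y = [3, 5, 10]
target = F(32)             # assumed estimate of the x total from phase k-1

g = linear_calibration(w, x, target)
beta, fitted, resid = wls_split(w, x, y)

calibrated_total = sum(wi * gi * yi for wi, gi, yi in zip(w, g, y))
residual_part = sum(wi * gi * ei for wi, gi, ei in zip(w, g, resid))
fitted_part = beta * target    # second term after applying the constraint

print(calibrated_total == residual_part + fitted_part)   # True
```

Only `residual_part` depends on which units were drawn at phase k once `beta` is treated as fixed, which is why the phase-k variance contribution is computed from the adjusted residuals.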