
The ideas here are based on Chapter 9 of Särndal, Swensson and Wretman, which deals with two-phase sampling and the regression estimation of population totals. Here we are interested in multiphase sampling and in settings where many statistics may be computed with the same set of calibrated weights, so the priorities and notation are slightly different.

As in the book, we consider the setting where the sampling probabilities are known (or well-estimated) in advance and raking adjustments are small. When the sampling probabilities $\pi$ are grossly incorrect (e.g., because of missing data or non-response), so that the raking adjustments are not small, we would want different computations such as those of Chang & Kott (Biometrika, 2008).

Estimation

In multiphase sampling we have a sequence of $K$ nested subsamples, with sampling probabilities that can depend on any data at previous phases. We write $\pi_{i,1}$, $\pi_{i,2|1}$, $\pi_{i,3|2}$, …, $\pi_{i,k|k-1}$, …, $\pi_{i,K|K-1}$ as the sampling probabilities for observation $i$ at each phase, and similarly for pairwise probabilities. In general, we use a $,k|k-1$ subscript for quantities related to the $k$th phase of sampling and a $,k$ subscript for quantities related to the cumulative effect of sampling from phase 1 through phase $k$. All sums are over the population unless otherwise specified; restriction to samples is achieved with the sampling indicators $R_{i,k}$ (and $R_i\equiv R_{i,K}$). We also write $\pi^\dagger_{ij,k}=\pi^*_{ij,K}/\pi^*_{ij,k}=\prod_{\ell=k+1}^K \pi_{ij,\ell|\ell-1}$ for the forward cumulative pairwise probabilities, where $\pi^*_{ij,k}$ is the product of the pairwise probabilities through phase $k$.

The product $\pi_i^*\equiv\pi^*_{i,K}=\pi_{i,1}\pi_{i,2|1}\pi_{i,3|2}\cdots\pi_{i,K|K-1}$ is not in general the marginal sampling probability for observation $i$, because each $\pi_{i,k|k-1}$ can depend on data up to phase $k-1$, so $\pi_i^*=\pi_{i,K}^*$ is a random variable. To get the true marginal probabilities we would need to integrate out all the dependence on data measured at intermediate phases. However, it is still true that $E[R_{i,k}/\pi_{i,k}^*]=1$, which is the key fact we need to estimate totals:

$$\hat T_X = \sum_i \frac{R_i}{\pi_i^*}X_i$$
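The claim that $E[R_i/\pi_i^*]=1$ even when $\pi_i^*$ is random can be checked by brute force on a toy design. The sketch below is only an illustration under assumptions that are not in the text: a four-unit population, independent (Poisson) sampling at both phases, and a phase-2 probability of one over the realised phase-1 sample size. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def poisson_samples(units, probs):
    """Yield (sample, probability) for independent sampling of `units`."""
    for flags in product([0, 1], repeat=len(units)):
        p = Fraction(1)
        for u, f in zip(units, flags):
            p *= probs[u] if f else 1 - probs[u]
        if p > 0:
            yield [u for u, f in zip(units, flags) if f], p

x = {0: 3, 1: 5, 2: 8, 3: 13}              # hypothetical population values
pi1 = {i: Fraction(1, 2) for i in x}       # phase-1 probabilities (assumed)

expected_ratio = {i: Fraction(0) for i in x}   # accumulates E[R_i / pi*_i]
marginal = {i: Fraction(0) for i in x}         # accumulates P(R_i = 1)
expected_total = Fraction(0)                   # accumulates E[T_hat_X]

for s1, p1 in poisson_samples(list(x), pi1):
    # Phase-2 probabilities depend on phase-1 data: here, the sample size.
    pi2 = {i: Fraction(1, len(s1)) for i in s1}
    for s2, p2 in poisson_samples(s1, pi2):
        for i in s2:
            weight = 1 / (pi1[i] * pi2[i])     # 1 / pi*_i, a random quantity
            expected_ratio[i] += p1 * p2 * weight
            marginal[i] += p1 * p2
            expected_total += p1 * p2 * weight * x[i]

print(expected_ratio, marginal, expected_total)
```

In this toy design $\pi_i^*$ takes several different values depending on the phase-1 sample, none of which needs to equal the marginal probability, yet the weighted indicator still has expectation one and the estimated total is unbiased.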

It’s also still true (per Särndal et al) that the variance of a total can be estimated by something very like the Horvitz-Thompson formula

$$\widehat{\mathrm{var}}[\hat T_X]= \sum_{i,j} R_iR_j\check{X}_i\check{X}_j\check{\Delta}_{ij}$$

where $\check{X}_i=X_i/\pi_i^*$, $\Delta_{ij}=\pi^*_{ij}-\pi_i^*\pi_j^*$, and $\check{\Delta}_{ij}=\Delta_{ij}/\pi^*_{ij}=(1-\pi^*_{i}\pi^*_j/\pi^*_{ij})$.
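As a sanity check on the weighted-covariance form, the following sketch evaluates the double sum for the simplest possible case: a single phase of simple random sampling without replacement, where the answer should agree with the textbook formula $N^2(1-n/N)s^2/n$. The data and the single-phase design are assumptions made purely for illustration.

```python
from fractions import Fraction

def ht_variance(xs, pi, pi_pair):
    """Sum over sampled pairs of x_check_i * x_check_j * Delta_check_ij."""
    total = Fraction(0)
    for i, xi in enumerate(xs):
        for j, xj in enumerate(xs):
            pij = pi if i == j else pi_pair    # pi_ii is just pi_i
            delta_check = 1 - pi * pi / pij    # Delta_ij / pi_ij
            total += (xi / pi) * (xj / pi) * delta_check
    return total

N, sample = 10, [2, 7, 4, 9]                   # hypothetical data
n = len(sample)
pi = Fraction(n, N)                            # first-order probability
pi_pair = Fraction(n * (n - 1), N * (N - 1))   # pairwise probability

v_ht = ht_variance(sample, pi, pi_pair)

# Textbook SRS variance estimator for comparison.
mean = Fraction(sum(sample), n)
s2 = sum((xi - mean) ** 2 for xi in sample) / (n - 1)
v_srs = N**2 * (1 - pi) * s2 / n

print(v_ht, v_srs)
```

The double loop only visits sampled units, which plays the role of the indicators $R_iR_j$ in the population-level sum.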

In the absence of raking we could speed up computation using a recursive relationship for constructing $\check{\Delta}$ from the weighted covariances $\check{\Delta}_1$, $\check{\Delta}_{2|1}$, and so on at each phase, subscripted down to the subsample remaining at phase $K$. This also has the advantage of simplicity. However, we can’t do this precomputation with raking, and it has the disadvantage of not giving components of variance at each phase, which we like having. We do use the recursive combination for constructing $\check{\Delta}_{k|k-1}$ in multistage designs within a single phase.

Per-phase variances

We need to use a summation over phases. We consider the error in an estimated total as a telescoping sum over phases

$$\hat T_X-T_X= \left(\hat T_{X,1}-T_X\right)+ \left(\hat T_{X,2}-\hat T_{X,1}\right)+\cdots+\left(\hat T_{X}-\hat T_{X,K-1}\right)$$

where each term is the error incurred by one phase of sampling. That is,

$$\hat T_{X,1}-T_X=\sum_i \frac{R_{i,1}}{\pi^*_{i,1}}x_i- \sum_i x_i$$

$$\hat T_{X,2}-\hat T_{X,1}=\sum_i \frac{R_{i,2}}{\pi^*_{i,2}}x_i- \sum_i \frac{R_{i,1}}{\pi^*_{i,1}}x_i$$

and in general

$$\hat T_{X,k}-\hat T_{X,k-1}=\sum_i \frac{R_{i,k}}{\pi^*_{i,k}}x_i- \sum_i \frac{R_{i,k-1}}{\pi^*_{i,k-1}}x_i$$

These are all uncorrelated, because each depends on sampling only at one phase. They aren’t independent, because the available data for sampling at phase $k$ depends on all previous phases, but they do form a martingale difference sequence. I will write $\check{x}_{i,k}$ for $x_i/\pi^*_{i,k}$, the weighted observation at phase $k$, so that we have

$$\hat T_{X,k}-\hat T_{X,k-1}=\sum_i R_{i,k}\check{x}_{i,k}- \sum_i R_{i,k-1}\check{x}_{i,k-1}$$
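The uncorrelatedness can be spelled out in a few lines. The sketch below introduces two symbols that are not in the text and are only assumptions for this derivation: $D_k$ for the phase-$k$ increment and $\mathcal{F}_{k-1}$ for everything observed through phase $k-1$. It uses $R_{i,k}=R_{i,k-1}R_{i,k|k-1}$ and $\pi^*_{i,k}=\pi^*_{i,k-1}\pi_{i,k|k-1}$.

```latex
% D_k and \mathcal{F}_{k-1} are notation assumed for this sketch only.
\begin{align*}
D_k &\equiv \hat T_{X,k}-\hat T_{X,k-1}, \\
E\!\left[\frac{R_{i,k}}{\pi^*_{i,k}} \,\middle|\, \mathcal{F}_{k-1}\right]
  &= \frac{R_{i,k-1}}{\pi^*_{i,k-1}}\,
     \frac{E\!\left[R_{i,k|k-1}\mid \mathcal{F}_{k-1}\right]}{\pi_{i,k|k-1}}
   = \frac{R_{i,k-1}}{\pi^*_{i,k-1}}, \\
E\!\left[D_k \mid \mathcal{F}_{k-1}\right] &= 0, \\
\mathrm{cov}\!\left[D_j, D_k\right]
  &= E\!\left[D_j\, E\!\left[D_k \mid \mathcal{F}_{k-1}\right]\right] = 0
  \qquad (j<k).
\end{align*}
```

The last line uses the fact that $D_j$ is determined by the data through phase $k-1$ whenever $j<k$, which is what makes the increments a martingale difference sequence without requiring independence.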

The variance of this sum is thus the sum of variances

$$\mathrm{var}\left[\hat T_X\right]= \mathrm{var}\left[\hat T_{X,1}-T_X\right]+ \mathrm{var}\left[\hat T_{X,2}-\hat T_{X,1}\right]+\cdots+\mathrm{var}\left[\hat T_{X}-\hat T_{X,K-1}\right]$$

and each variance is (conditional on the sampling so far) of the usual Horvitz-Thompson form

$$\mathrm{var}\left[\hat T_{X,k}-\hat T_{X,k-1}\right]=\sum_{i,j}R_{i,k-1}R_{j,k-1}\mathrm{cov}\left[R_{i,k|k-1},R_{j,k|k-1}\right]\check{x}_{i,k}\check{x}_{j,k}$$

which could be estimated at phase $k$ by

$$\widehat{\mathrm{var}}_k\left[\hat T_{X,k}-\hat T_{X,k-1}\right]=\sum_{i,j}\frac{R_{i,k}R_{j,k}}{\pi_{ij,k|k-1}}\mathrm{cov}\left[R_{i,k|k-1},R_{j,k|k-1}\right]\check{x}_{i,k}\check{x}_{j,k}=\sum_{i,j}R_{i,k}R_{j,k}\check{\Delta}_{ij,k|k-1}\check{x}_{i,k}\check{x}_{j,k}$$

That’s still not enough, because we don’t necessarily have $x$ until phase $K$, so we need to weight down to phase $K$:

$$\widehat{\mathrm{var}}_K\left[\hat T_{X,k}-\hat T_{X,k-1}\right]=\sum_{i,j}R_{i,k}R_{j,k}\frac{R_{i,K}R_{j,K}}{\pi^*_{ij,K}/\pi^*_{ij,k}}\check{\Delta}_{ij,k|k-1}\check{x}_{i,k}\check{x}_{j,k}=\sum_{i,j}\frac{R_{i,K}R_{j,K}}{\pi^\dagger_{ij,k}}\check{\Delta}_{ij,k|k-1}\check{x}_{i,k}\check{x}_{j,k}$$
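To see the per-phase estimators at work, here is a rough two-phase sketch ($K=2$) that enumerates every possible sample and compares the expected value of each weighted-down estimator with the true variance of the corresponding increment. It assumes independent (Poisson) sampling at both phases, so that $\check{\Delta}_{ij,k|k-1}=0$ for $i\neq j$ and $\check{\Delta}_{ii,k|k-1}=1-\pi_{i,k|k-1}$, and it assumes a phase-2 probability of one over the phase-1 sample size. Neither assumption comes from the text, and all names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def poisson_samples(units, probs):
    """Yield (sample, probability) for independent sampling of `units`."""
    for flags in product([0, 1], repeat=len(units)):
        p = Fraction(1)
        for u, f in zip(units, flags):
            p *= probs[u] if f else 1 - probs[u]
        if p > 0:
            yield [u for u, f in zip(units, flags) if f], p

x = {0: 3, 1: 5, 2: 8, 3: 13}                      # hypothetical population
T = sum(x.values())
pi1 = {0: Fraction(1, 2), 1: Fraction(1, 2),
       2: Fraction(3, 4), 3: Fraction(3, 4)}       # assumed phase-1 design

true_var = [Fraction(0), Fraction(0)]   # var of each phase's increment
mean_est = [Fraction(0), Fraction(0)]   # expectation of each estimator

for s1, p1 in poisson_samples(list(x), pi1):
    pi2 = {i: Fraction(1, len(s1)) for i in s1}    # depends on phase-1 data
    t1 = sum(x[i] / pi1[i] for i in s1)
    for s2, p2 in poisson_samples(s1, pi2):
        t2 = sum(x[i] / (pi1[i] * pi2[i]) for i in s2)
        # Phase 1 term: Delta_check = 1 - pi1, x_check uses pi*_1 = pi1,
        # and the forward probability pi_dagger_{ii,1} is pi2.
        v1 = sum((1 - pi1[i]) * (x[i] / pi1[i]) ** 2 / pi2[i] for i in s2)
        # Phase 2 term: Delta_check = 1 - pi2, x_check uses pi*_2 = pi1 * pi2,
        # and pi_dagger_{ii,2} is 1 because phase 2 is the last phase.
        v2 = sum((1 - pi2[i]) * (x[i] / (pi1[i] * pi2[i])) ** 2 for i in s2)
        p = p1 * p2
        true_var[0] += p * (t1 - T) ** 2
        true_var[1] += p * (t2 - t1) ** 2
        mean_est[0] += p * v1
        mean_est[1] += p * v2

print(true_var, mean_est)
```

Both estimators use only units that survive to phase 2, which is the point of dividing by $\pi^\dagger_{ij,k}$: the phase-1 component can be estimated even though $x$ is only observed on the final subsample.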

Raking

Raking of phase $k$ to phase $k-1$ involves estimating raking adjustments $g_{i,k|k-1}$ that satisfy calibration constraints on variables $A_i$ available at phase $k-1$:

$$\sum_i R_{i,k}g_{i,k|k-1}\frac{1}{\pi_{i,k}^*}A_i=\sum_i R_{i,k-1}\frac{1}{\pi^*_{i,k-1}}A_i$$

Let $\Pi_A$ and $\Pi_{\bar A}$ be the projections onto and orthogonal to the space spanned by $A$ given weights $1/\pi_{k-1}^*$, and let $\hat x_i=\Pi_Ax_i$ and $e_i=\Pi_{\bar A}x_i$. The calibrated total estimate is

$$\hat T_X=\sum_i\frac{R_ig_{i,k|k-1}}{\pi^*_i}x_i=\sum_iR_ig_{i,k|k-1}\check{x}_i=\sum_iR_ig_{i,k|k-1}\check{e}_{i,k}+\sum_i R_{i,k}g_{i,k|k-1}\check{\hat x}_{i,k}$$

Applying the calibration constraints to the second term we have

$$\sum_iR_ig_{i,k|k-1}\check{e}_{i,k}+\sum_i R_{i,k}g_{i,k|k-1}\check{\hat x}_{i,k}=\sum_iR_ig_{i,k|k-1}\check{e}_{i,k}+\sum_i R_{i,k-1}\check{\hat x}_{i,k-1}$$

and since $\hat x_{i,k-1}$ is not random (to first order) conditional on phase $k-1$, the estimated variance contribution for phase $k$ comes from just the first term and is

$$\widehat{\mathrm{var}}_{k|k-1}\left[\hat T_X\right]=\sum_{i,j}\frac{R_{i,K}R_{j,K}}{\pi^\dagger_{ij,k}}g_{i,k|k-1}g_{j,k|k-1}\check{e}_{i,k}\check{e}_{j,k}\check{\Delta}_{ij,k|k-1}$$

with the variances of the other phases being unaffected.
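The calibration constraints can be illustrated with a small sketch. It swaps in linear calibration, $g_i = 1 + A_i^\top\lambda$ with $A_i=(1,a_i)$, as a stand-in for raking, because the linear case has a closed-form solution; this substitution, the data, and all names below are assumptions for illustration and are not taken from the text.

```python
from fractions import Fraction

def calibrate_linear(rows, targets):
    """Return g_i = 1 + lam0 + lam1 * a_i meeting both calibration constraints.

    rows: (a_i, weight) pairs for the phase-k sample, weight = 1 / pi*_{i,k}.
    targets: phase-(k-1) estimates of the totals of 1 and of a.
    """
    # Weighted cross-products of A_i = (1, a_i) on the phase-k sample.
    m00 = sum(w for a, w in rows)
    m01 = sum(w * a for a, w in rows)
    m11 = sum(w * a * a for a, w in rows)
    # Shortfall between the phase-(k-1) totals and the uncalibrated ones.
    r0, r1 = targets[0] - m00, targets[1] - m01
    # Solve the 2x2 system by Cramer's rule.
    det = m00 * m11 - m01 * m01
    lam0 = (r0 * m11 - r1 * m01) / det
    lam1 = (m00 * r1 - m01 * r0) / det
    return [1 + lam0 + lam1 * a for a, w in rows]

w_prev = Fraction(4)                    # 1 / pi*_{i,k-1}, assumed constant
a_prev = [1, 2, 3, 5, 8, 13]            # auxiliary variable at phase k-1
targets = (w_prev * len(a_prev), w_prev * sum(a_prev))

kept = [0, 2, 5]                        # phase-k subsample, pi_{k|k-1} = 1/2
rows = [(Fraction(a_prev[i]), w_prev * 2) for i in kept]
g = calibrate_linear(rows, targets)

cal_totals = (sum(w * gi for (a, w), gi in zip(rows, g)),
              sum(w * gi * a for (a, w), gi in zip(rows, g)))

def x_of(a):
    return 2 + 3 * a                    # x lies exactly in the span of A

est_prev = sum(w_prev * x_of(a) for a in a_prev)               # phase k-1
est_raw = sum(w * x_of(a) for a, w in rows)                    # uncalibrated
est_cal = sum(w * gi * x_of(a) for (a, w), gi in zip(rows, g)) # calibrated

print(g, cal_totals, est_prev, est_raw, est_cal)
```

Because this $x$ has zero residual $e_i$, the calibrated phase-$k$ estimate reproduces the phase-$(k-1)$ estimate exactly, so phase $k$ adds no variance; for a general $x$ only the residual part would contribute, as in the variance formula above.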