# Multivariate allocation in two-stage sampling design

Two-stage sampling design with stratification of PSUs are very common for households surveys in Official Statistics. However, their sample allocations are usually a non-easy task, because the households surveys have multi-domains and multi-purpose objectives, so they have to provide accurate estimates for different variables and different domains (e.g. geographical areas such as national, regional, etc.).

The whole sample size of a survey, $$n$$, is often an exogenous information, because it is usually defined by budget and, sometimes, also by logistic constraints.

Instead, the allocation of the sample $$n$$ among the strata ($$h=1,\dots,L$$), generally the lower level at which estimates are required, can be defined in different ways (mainly with uniform, proportional or optimal allocation). In Official Statistics, optimal allocation is the most widely used method, especially because information on the size of the strata and the variance of target variables in the strata can be obtained. In particular, the variances of the target variables or at least proxy variables for each stratum can be computed, for instance, from register data or previous occasions of the same survey.

The main idea for optimal allocation is that strata with larger size and larger variability with respect to the target variables need more sample size to provide better estimates. Since, in our case, data from previous surveys occasions were available, an optimal allocation could be deployed.

R2BEAT extends the Neyman (Neyman (1934)) - Tschuprow (Tschprow (1923)) allocation method to the case of several variables, adopting a generalization of the Bethel’s proposal (Bethel (1989)). Besides, enables the determination of optimal sample allocation for two-stage stratified samples, by using the function beat.2st.

Before using this function, a preliminary step is needed. In fact, it is necessary to define some parameters for performing the function (see the package documentation for the input formats). In particular:

1. population size of each stratum in terms of households or individuals depending on the aim of the survey N);
2. strata to be censused (CENS);
3. cost of interviews per stratum (COST);
4. number of PSUs to be selected in each stratum (MINCS);
5. average dimension of SSUs in each stratum (DELTA, $$\Delta$$);
6. minimum number of SSUs to be interviewed in each selected PSU (MINIMUM);
7. mean of each target variables in each strata (M, $$\hat{p}$$ or $$\hat{x}$$);
8. estimated population standard deviation of each target variable in each stratum (S, $$\hat{S}$$);
9. intra-class correlation coefficients for each target variable in each stratum (RHO, $$\hat{\rho}$$);
10. estimator effect for each target variable in each stratum (EFFST, $$\widehat{effst}$$).

To better understand the optimal allocation, a few words on points 7-10 are necessary. In particular, the mean can be estimated using the function svystatTM of the package ReGenesees. For the estimated population standard deviation, a distinction between binary and quantitative variable is needed. In the case of binary variables, the estimated population standard deviation of variable $$i$$ is equal to

$\hat{S} = \sqrt{\hat{p}_i \times (1 - \hat{p}_i)}$

where $$\hat{p}_i$$ is an estimate of the proportion of individuals with a given characteristic in the population strata, that is the mean of a binary variable. While, in the case of a quantitative variable, it is equal to $\begin{eqnarray} \label{S_quant} \hat{S} = \sqrt{\hat{M}_i^2 - \hat{\bar{Y}}_i^{2}} \end{eqnarray}$ where $$\hat{M}_i^2 = \frac{\sum_{k \in s} y_{ik}^2 \ w_k}{N}$$ and $$\hat{\bar{Y}}_i = \frac{ \sum_{k \in s} y_{ik} \ w_k}{N}$$ are the quadratic mean and the arithmetic mean estimated on sample data of the previous occasion(s) of the survey, respectively, with $$y_{ik}$$ the value of the target variable $$i$$ observed on the unit $$k$$, $$w_k$$ the corresponding sampling weight and $$N$$ the population size.

A crucial statistic for the optimal sample allocation in two-stage sampling design is the intra-class correlation coefficient, $$\hat{\rho}$$. Due to logistics, in household surveys the sample is not wide-spread in the whole country, but often ‘clusterised’ in PSUs and, then, in SSUs (clusters of family members). If the units of clusters are too similar to each another, it is not efficient to collect too many units from the same cluster. Anyway, sometimes a small lost in efficiency of the estimates is accepted, because it is paid off by a gain in the organization of the survey and, moreover, by cost reduction.

Therefore, to optimize the sample size, it is important to compute the intra-class correlation coefficient, because it provides a measure of data clustering in PSUs and SSUs. In general, if the value of $$\hat{\rho}$$ is close to 1 the clustering is high and it is convenient to collect only a few units in the cluster. On the contrary, if $$\hat{\rho}$$ is close to 0, collection of units from the same cluster does not affect the efficiency of the estimates.

To compute $$\hat{\rho}$$, another important statistic must be introduced: the design effect (DEFF, $$deff$$). The design effect measures how much the sampling variance under the adopted sampling design is inflated with respect to simple random sample ($$srs$$), with the same sample size. In formula: $\begin{eqnarray} \label{deff} deff (\hat{\bar{Y}}_{i}) & = & \frac{\text{var}(\hat{\bar{Y}}_{i})_{des}}{\text{var}(\hat{\bar{Y}}_{i})_{srs}} \\ \label{deff1} & = & 1+\hat{\rho}_i \ (b-1)) \end{eqnarray}$ where $$b$$ is the average cluster (i.e. PSU) size in terms of the final sampling units and $$\hat{\rho}_i$$ the intra-class correlation within the cluster (PSU) for the variable $$i$$.

The package R2BEAT takes into account a more general expression of $$deff$$. This expression refers to a typical situation for households surveys in which PSUs are assigned to Self-Representing (SR) strata, that is they are included for sure in the sample, or to Not-Self-Representing (NSR) strata, where they are selected by chance. In practice, this assignment is usually performed by comparing their measure of size (MOS) with respect to the threshold: $\begin{equation} \label{threshold} \lambda = \frac{\bar{m} \ \Delta}{f} \end{equation}$ where $$\bar{m}$$ is the minimum number of SSUs to be interviewed in each selected PSU (MINIMUM), $$f=n/N$$ is the sampling fraction and $$\Delta$$ (DELTA) is the average dimension of the SSU in terms of elementary survey units. Then DELTA must be set equal to 1 if, for the survey, the selection units are the same as the elementary units (that is, household-household or individuals-individuals), whereas it must be set equal to the average dimension of the households if the elementary units are individuals, while the selection units are the households.

PSUs with MOS exceeding the threshold are identified as SR, while the remaining PSUs are NSR.

Then the extended expression of $$deff$$ is $\begin{equation} deff (\hat{\bar{Y}}_{i}) = \frac{N_{SR}^2}{n_{SR}} (1 + (\hat{\rho}_{i,SR} \ (b_{SR} - 1)) + \frac{N_{NSR}^2}{n_{NSR}} (1 + (\hat{\rho}_{i,NSR} \ (b_{NSR} - 1)) \label{deff2} \end{equation}$

where, for $$SR$$ and $$NSR$$ strata,

The design effect is equal to 1 under the $$srs$$ design and increases for each additional stage of selection, due to intra-class correlation coefficient which is, usually, positive. It can be computed using the function svystatTM from ReGenesees, setting up the parameter deff=TRUE.

The intra-class correlation coefficient for $$NSR$$ can be derived from the expression of $$deff$$, that is $\begin{equation} \hat{\rho}_{i,NSR} = \frac{deff_i - 1}{b_{SNR}-1}. \end{equation}$ While it is not necessary to compute the intra-class correlation coefficient for $$SR$$ strata because, just one PSU is selected and the intra-class correlation is 1 by definition.

The last statistic to be defined is the estimator effect (EFFST, $$effst$$), that is, how much the sampling variance of the applied estimator under the adopted design is inflated or deflated with respect to the HT estimator, on the same sample. The $$effst$$ is equal to $\begin{equation} \label{effst} effst (\hat{\bar{Y}}_{i})_{st} = \frac{\text{var}(\hat{\bar{Y}}_{i})_{st}}{\text{var}(\hat{\bar{Y}}_{i})_{HT}} \\ \end{equation}$ It is an optional parameter for R2BEAT, but it useful to take into account, from the allocation phase, of the impact on the estimates of a different estimator other than the Horvitz-Thompson estimator ($$HT$$). Indeed, the most applied estimator for the households survey is the calibrated estimator that, through the use of auxiliary variables, provides better estimates than HT. Then, a reduction of the final sample size can be expected when applying the calibration estimator in place of the HT.

The optimal allocation is defined by R2BEAT solving the minimum optimization problem

$\left\lbrace \begin{array}{l} C = \text{min} \\ \sigma\left(\hat{\bar{Y}}_{i,h}\right) \leq \delta \left(\hat{\bar{Y}}_{i,h} \right) \hspace{0.5cm} \begin{array}{c} i = 1, \dots, J \\ h = 1, \dots, L \end{array} \end{array} \right.,$

where $$C$$ is the global cost of the survey (if COST is equal to 1 in all the strata, $$C=n$$) and $$\sigma\left(\hat{\bar{Y}}_{i,h}\right)$$ is the error we expect to observe estimating the $$Y_i$$ variable in the stratum $$h$$ (expected errors), that must be lower than the precision constraints defined by the user.

The solution is obtained with an iterative algorithm. Then, $$\hat{S}$$ is multiplied by the new design effect and the estimator effect and a new allocation is computed. The algorithm stops when the difference between two consecutive iterations is lower than a predefined threshold. At each step, an allocation is provided and the design effect is updated.

The package R2BEAT provides several outputs that help the evaluation of the allocation.

It is important to point out that, as stated at the beginning of this section, the $$n$$ is usually given. Then, to find the optimal allocation it is necessary to tune the precision constraints until the desired sample size is matched.

# First stage selection

Once performed the allocation step, the function StratSel allows to perform the selection of the Primary Stage Units (PSUs).

The function carries out the stratification of the PSUs, for each estimation domain, according to a size threshold. PSUs with measure of size exceeding the threshold are identified as SR (see also description in the previous section). The remaining NSR PSUs are ordered by their measure of size and divided into strata whose sizes are approximately equal to the threshold multiplied by the number of PSUs to be selected in each stratum (MINCS). In this way, strata are composed by PSUs having as equal as possible size. For each PSU, StratSel determines the size of final sample units on the basis of the size planned in input. Then the selection of a fixed number of PSUs per stratum is carried out using the Sampford’s method (unequal probabilities, without replacement, fixed sample size), implemented by the UPsampford function of the R package sampling.

The function produces an output list, reporting the overall information on selected PSUs. In particular, in the fourth output, it provides the first order inclusion probability for each PSU, that is, for the first selection stage.

# Second stage selection

The inclusion probability for the second stage can be easily obtained dividing the number of SSUs to be selected in the PSU by the measure of size of the PSU.

Then, the design weights for each unit in the sample are equal to the inverse of the product between the first order and the second stage inclusion probabilities.

The inclusion probability of the first and second stage of the sampling scheme are such that the overall inclusion probabilities are almost constant for each unit in the sample, thus obtaining self-weighting samples.

Once performed the first stage selection step, the function select_SSU allows to perform the selection of the Primary Stage Units (PSUs).