Multivariate optimal allocation for different domains in one- and two-stage stratified sample designs. R2BEAT extends the Neyman (1934) – Tschuprow (1923) allocation method to the case of several variables, adopting a generalization of Bethel’s proposal (1989). Beyond this, it determines the sample allocation in the multivariate and multi-domain case for two-stage stratified samples, and it also performs the selection of both Primary Stage Units (PSUs) and Secondary Stage Units (SSUs).
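
As a point of reference for the univariate case that R2BEAT generalizes, Neyman allocation distributes a fixed total sample size across strata in proportion to the product of stratum size and stratum standard deviation. The sketch below uses base R only. The helper neyman_alloc, the stratum sizes, the standard deviations and the total sample size are hypothetical values chosen for illustration and are not part of R2BEAT; it also ignores rounding and the constraint that a stratum sample cannot exceed the stratum population.

```r
# Minimal sketch of univariate Neyman allocation (base R only).
# neyman_alloc() is a hypothetical helper, not an R2BEAT function.
neyman_alloc <- function(N_h, S_h, n) {
  stopifnot(length(N_h) == length(S_h), n > 0)
  w <- N_h * S_h        # allocation is proportional to N_h * S_h
  n * w / sum(w)        # real-valued stratum sample sizes that sum to n
}

# Hypothetical strata: population sizes and standard deviations of one variable
N_h <- c(1000, 3000, 6000)
S_h <- c(20, 10, 5)

n_h <- neyman_alloc(N_h, S_h, n = 400)
print(n_h)
```

Smaller but more variable strata receive a larger share than a proportional allocation would give them. The multivariate, multi-domain case handled by R2BEAT has no closed form of this kind and is solved iteratively.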

R2BEAT handles the complexity of optimal sample allocation in two-stage sampling designs and provides several outputs for evaluating the allocation. Its name stands for “R ‘to’ Bethel Extended Allocation for Two-stage”. It is an extension of another open-source software called Mauss-R (Multivariate Allocation of Units in Sampling Surveys), implemented by ISTAT researchers (https://www.istat.it/en/methods-and-tools/methods-and-it-tools/design/design-tools/mauss-r). Mauss-R determines the optimal sample allocation in multivariate and multi-domain estimation for one-stage stratified samples.

To complete the suite of tools developed by Istat for stratified sample design, we cite SamplingStrata (https://CRAN.R-project.org/package=SamplingStrata), which jointly optimizes both the stratification of the sampling frame and the allocation, still in the multivariate multi-domain case (only for one-stage designs), and MultiWay.Sample.Allocation, which determines the optimal sample allocation for multi-way stratified sampling designs and incomplete stratified sampling designs (https://www.istat.it/en/methods-and-tools/methods-and-it-tools/design/design-tools/multiwaysampleallocation).

For a complete illustration of the methodology see the vignette “R2BEAT methodology and use” (https://barcaroli.github.io/R2BEAT/articles/R2BEAT_methodology.html).

A complete example is illustrated in the vignette “Two-stage sampling design workflow” (https://barcaroli.github.io/R2BEAT/articles/R2BEAT_workflow.html).

Installation

You can install the released version of R2BEAT from CRAN with:

install.packages("R2BEAT")

or the latest development version of R2BEAT from GitHub with:

install.packages("devtools")
devtools::install_github("barcaroli/R2BEAT")
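
A quick way to confirm that the installation succeeded is to ask R whether the package can be found. This is a minimal sketch using base R functions only. The variable r2beat_available is introduced here for illustration, and nothing is installed or loaded by this check.

```r
# Check whether R2BEAT is installed without attaching it.
r2beat_available <- requireNamespace("R2BEAT", quietly = TRUE)

if (r2beat_available) {
  # Report the installed version
  message("R2BEAT version: ", as.character(utils::packageVersion("R2BEAT")))
} else {
  message("R2BEAT is not installed; use one of the commands above.")
}
```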

Examples with different scenarios

Jupyter notebook: Binder

Comparative evaluation of R2BEAT with packages “PracTools” and “samplesize4surveys”

Jupyter notebook: Binder

Example

library(R2BEAT)

#-------------------------------------------------------------------------------
# Read sampling frame
library(readr)
pop <- read_rds("https://github.com/barcaroli/R2BEAT_workflows/blob/master/pop.RDS?raw=true")
str(pop)
# 'data.frame': 2258507 obs. of  13 variables:
# $ region       : Factor w/ 3 levels "north","center",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ province     : Factor w/ 6 levels "north_1","north_2",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ municipality : num  1 1 1 1 1 1 1 1 1 1 ...
# $ id_hh        : Factor w/ 963018 levels "H1","H10","H100",..: 1 1 1 2 3 3 3 3 1114 1114 ...
# $ id_ind       : int  1 2 3 4 5 6 7 8 9 10 ...
# $ stratum      : Factor w/ 24 levels "1000","2000",..: 12 12 12 12 12 12 12 12 12 12 ...
# $ stratum_label: chr  "north_1_6" "north_1_6" "north_1_6" "north_1_6" ...
# $ sex          : int  1 2 1 2 1 1 2 2 1 1 ...
# $ cl_age       : Factor w/ 8 levels "(0,14]","(14,24]",..: 3 7 8 5 4 6 6 4 4 1 ...
# $ active       : num  1 1 0 1 1 1 1 1 1 0 ...
# $ income_hh    : num  30488 30488 30488 21756 29871 ...
# $ unemployed   : num  0 0 0 0 0 0 0 0 0 0 ...
# $ inactive     : num  0 0 1 0 0 0 0 0 0 1 ...
#-------------------------------------------------------------------------------
# Precision constraints
cv <- as.data.frame(list(DOM=c("DOM1","DOM2"),
                         CV1=c(0.02,0.03),
                         CV2=c(0.03,0.06),
                         CV3=c(0.03,0.06),
                         CV4=c(0.05,0.08)))
cv
#    DOM  CV1  CV2  CV3  CV4
# 1 DOM1 0.02 0.03 0.03 0.05
# 2 DOM2 0.03 0.06 0.06 0.08
#-------------------------------------------------------------------------------
# Preparation
samp_frame <- pop
samp_frame$one <- 1
id_PSU <- "municipality"  
id_SSU <- "id_ind"        
strata_var <- "stratum"   
target_vars <- c("income_hh","active","inactive","unemployed")   
deff_var <- "stratum"     
domain_var <- "region"  
delta <- 1       # households = survey units
minimum <- 50    # minimum number of SSUs to be interviewed in each selected PSU
deff_sugg <- 1.5 # suggestion for the deff value
inp <- prepareInputToAllocation1(samp_frame,
                                 id_PSU,
                                 id_SSU,
                                 strata_var,
                                 target_vars,
                                 deff_var,
                                 domain_var,
                                 minimum,
                                 delta,
                                 deff_sugg)
#-------------------------------------------------------------------------------
# Optimal allocation
alloc <- beat.2st(stratif = inp$strata, 
                  errors = cv, 
                  des_file = inp$des_file, 
                  psu_file = inp$psu_file, 
                  rho = inp$rho, 
                  deft_start = NULL,
                  effst = inp$effst, 
                  minPSUstrat = 2,
                  minnumstrat = 50)
#   iterations PSU_SR PSU NSR PSU Total  SSU
# 1          0      0       0         0 7887
# 2          1     31     104       135 8328
# 3          2     39     104       143 8317
# 4          3     38     104       142 8320
#-------------------------------------------------------------------------------
# First stage selection
sample_1st <- select_PSU(alloc, type="ALLOC", pps=TRUE, plot=TRUE)
sample_1st$PSU_stats
#    STRATUM PSU PSU_SR PSU_NSR  SSU SSU_SR SSU_NSR
# 1     1000   2      2       0  286    286       0
# 2     2000   9      3       6  452    152     300
# 3     3000   4      0       4  200      0     200
# 4     4000   2      0       2  100      0     100
# 5     5000   2      2       0  219    219       0
# 6     6000   2      0       2  100      0     100
# 7     7000   2      0       2  100      0     100
# 8     8000   2      0       2  100      0     100
# 9     9000   1      1       0  557    557       0
# 10   10000   6      6       0  587    587       0
# 11   11000  26      2      24 1300    100    1200
# 12   12000   8      0       8  400      0     400
# 13   13000   1      1       0  703    703       0
# 14   14000   4      4       0  577    577       0
# 15   15000  27      9      18 1361    461     900
# 16   16000  18      0      18  900      0     900
# 17   17000   1      1       0  154    154       0
# 18   18000   4      2       2  200    100     100
# 19   19000   7      1       6  350     50     300
# 20   20000   4      0       4  200      0     200
# 21   21000   1      1       0  125    125       0
# 22   22000   3      3       0  150    150       0
# 23   23000   4      0       4  200      0     200
# 24   24000   2      0       2  100      0     100
# 25   Total 142     38     104 9421   4221    5200
#-------------------------------------------------------------------------------
# Second stage selection
sample_2st <- select_SSU(df=pop,
                         PSU_code="municipality",
                         SSU_code="id_ind",
                         PSU_sampled=sample_1st$sample_PSU)
head(sample_2st)
#   municipality id_ind region province  id_hh stratum stratum_label sex  cl_age active income_hh
# 1           10   1570  north  north_1 H49617   12000     north_1_6   2 (64,74]      1  25393.18
# 2           10   1632  north  north_1 H49638   12000     north_1_6   1 (64,74]      1  24261.16
# 3           10   1638  north  north_1 H49639   12000     north_1_6   1 (24,34]      1  60761.11
# 4           10   1690  north  north_1 H49656   12000     north_1_6   2 (34,44]      1  40065.25
# 5           10   1709  north  north_1 H49665   12000     north_1_6   1 (44,54]      1  13320.58
# 6           10   1767  north  north_1 H49682   12000     north_1_6   1 (34,44]      1  27106.55
#   unemployed inactive  Prob_1st   Prob_2st    Prob_tot weight SR nSR stratum_2
# 1          0        0 0.2363018 0.02665245 0.006298022 158.78  0   1   12000-1
# 2          0        0 0.2363018 0.02665245 0.006298022 158.78  0   1   12000-1
# 3          0        0 0.2363018 0.02665245 0.006298022 158.78  0   1   12000-1
# 4          0        0 0.2363018 0.02665245 0.006298022 158.78  0   1   12000-1
# 5          0        0 0.2363018 0.02665245 0.006298022 158.78  0   1   12000-1
# 6          0        0 0.2363018 0.02665245 0.006298022 158.78  0   1   12000-1
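
The probability and weight columns printed above appear to be linked in the usual way for two-stage designs: the overall inclusion probability is the product of the two stage probabilities, and the weight is its reciprocal. The sketch below checks this in base R using the values from the first row of the output. The variable names are chosen here for illustration, the interpretation of the columns is inferred from their names, and the inputs are rounded as displayed, so the match is approximate.

```r
# Values copied from the first row of head(sample_2st) above (rounded as printed)
prob_1st <- 0.2363018    # Prob_1st: first-stage (PSU) inclusion probability
prob_2st <- 0.02665245   # Prob_2st: second-stage (SSU) inclusion probability

prob_tot <- prob_1st * prob_2st   # overall inclusion probability
weight   <- 1 / prob_tot          # design weight as the inverse of prob_tot

print(signif(prob_tot, 7))   # compare with the Prob_tot column (0.006298022)
print(round(weight, 2))      # compare with the weight column (158.78)
```

Under this reading, summing the weight column over the whole sample gives an estimate of the number of individuals in the population.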