Optimum stratification using decision tree

Authors
Abstract
Stratified sampling is one of the most widely used sampling designs. In some cases the population is already stratified; in others, the researcher must determine the strata boundaries. The optimum stratification is the set of strata boundaries for which the variance of the estimator of the population mean (or total) reaches its lowest value. In traditional methods, the variance of the estimator is treated as a function of the strata boundaries for the response variable, and minimizing it yields equations that usually have to be solved numerically. The first deficiency of this approach is that it does not take all auxiliary variables into account. For example, in estimating average income, stratifying the population by factors such as gender and job history can not only increase the efficiency of the estimator but also make the results easier to interpret and generalize. The second deficiency is that the resulting equations are complex and have no closed-form, easily interpretable solution.
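For reference, the variance that the traditional methods minimize can be written in the standard textbook form below. The notation is an assumption, not taken from the paper: H strata, stratum weights W_h = N_h / N, within-stratum variances S_h^2, and stratum sample sizes n_h. Both W_h and S_h^2 depend on the strata boundaries, which is what makes the minimization difficult.

```latex
% Variance of the stratified estimator of the population mean
V(\bar{y}_{st}) = \sum_{h=1}^{H} W_h^{2}\,\frac{S_h^{2}}{n_h}\left(1 - \frac{n_h}{N_h}\right),
\qquad W_h = \frac{N_h}{N}.

% Special case: proportional allocation (n_h = n W_h), finite population correction ignored
V_{prop}(\bar{y}_{st}) = \frac{1}{n}\sum_{h=1}^{H} W_h S_h^{2}.
```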

In this paper, we construct the optimum stratification using a new criterion that combines the variance with a penalty for increasing the number of strata, so that the important auxiliary variables selected in building the decision tree determine the strata boundaries. The procedure starts from the saturated tree and prunes it successively until the root node is reached, reducing the number of strata at each step; the optimum stratification is then chosen according to the proposed combined criterion.
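The pruning procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the criterion is taken to be the variance of the stratified mean under proportional allocation (finite population correction ignored) plus `lam` times the number of strata; the "saturated" tree is grown until no split with at least `min_leaf` units per side lowers the within-node sum of squares; pruning collapses the split with the smallest loss first; and the income data are synthetic. All function names are hypothetical.

```python
import random

def sse(ys):
    """Sum of squared deviations of ys from their mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def grow(X, y, idx, min_leaf=5):
    """Grow the tree: split while some admissible split lowers the SSE."""
    node = {"idx": idx, "left": None, "right": None}
    best = None
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in idx})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [i for i in idx if X[i][j] <= thr]
            right = [i for i in idx if X[i][j] > thr]
            if min(len(left), len(right)) < min_leaf:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, left, right)
    if best is not None and best[0] < sse([y[i] for i in idx]):
        node["left"] = grow(X, y, best[1], min_leaf)
        node["right"] = grow(X, y, best[2], min_leaf)
    return node

def leaves(node):
    """Terminal nodes of the tree; each one is a stratum."""
    if node["left"] is None:
        return [node]
    return leaves(node["left"]) + leaves(node["right"])

def prunable(node):
    """Internal nodes whose two children are both leaves."""
    if node["left"] is None:
        return []
    if node["left"]["left"] is None and node["right"]["left"] is None:
        return [node]
    return prunable(node["left"]) + prunable(node["right"])

def gain(node, y):
    """SSE reduction achieved by the split at this node."""
    parts = (node["left"]["idx"], node["right"]["idx"])
    return sse([y[i] for i in node["idx"]]) - sum(sse([y[i] for i in p]) for p in parts)

def criterion(strata, y, n, lam):
    """Assumed criterion: (1/n) * sum_h W_h S_h^2 + lam * (number of strata)."""
    N = sum(len(s) for s in strata)
    variance = sum(sse([y[i] for i in s]) for s in strata) / (N * n)
    return variance + lam * len(strata)

def prune_path(root, y, n, lam):
    """Collapse the weakest split repeatedly down to the root, scoring each stage."""
    def stage():
        strata = [leaf["idx"] for leaf in leaves(root)]
        return (len(strata), criterion(strata, y, n, lam), strata)
    path = [stage()]
    while root["left"] is not None:
        weakest = min(prunable(root), key=lambda t: gain(t, y))
        weakest["left"] = weakest["right"] = None  # merge two strata into one
        path.append(stage())
    return path

# Synthetic population (assumption): income driven by job history and gender.
random.seed(0)
N, n = 150, 30
X = [(random.randint(0, 30), random.randint(0, 1)) for _ in range(N)]
y = [20 + 2 * e + 10 * g + random.gauss(0, 5) for e, g in X]

path = prune_path(grow(X, y, list(range(N))), y, n, lam=0.05)
H_best, crit_best, strata_best = min(path, key=lambda s: s[1])
print("strata:", H_best, "criterion:", round(crit_best, 3))
```

In this sketch, a larger `lam` pushes the selected stratification toward the root (fewer strata), while `lam = 0` always selects the fully grown tree, since every retained split strictly lowers the variance term.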
Keywords
