Multiple imputation to fill in missing data in soil physico-hydrical properties database
Keywords:
Soil database. Incomplete data. Markov Chain Monte Carlo. Missing predictors.
Abstract
Missing values in databases are a common and almost inevitable problem. Multiple imputation (MI) is an efficient
statistical method for estimating missing values in an incomplete dataset. To test this approach for a soil database, we
hypothesized that the imputation of missing data provides a statistically more accurate database than the complete case analysis
(CCA). The overall goal of our study was to evaluate the efficiency of MI using the MICE (Multivariate Imputation by
Chained Equations) algorithm to fill in missing data in a database of soil physico-hydrical properties, and to show that
imputation is more feasible than CCA. Preliminary analyses were performed to check the suitability of the proposed
algorithm. The missing values of each variable were imputed using fitted linear regression models, in which the variable with
missing data entered as the dependent variable and the other variables correlated with it entered
as covariates. The two approaches were compared through the values of the estimates, their standard errors, and 95% confidence
intervals. The missing-data pattern was multivariate and arbitrary, and organic matter was the variable with the largest amount of
missing data. The significance of the covariates varied depending on the variable to be estimated. The results showed that the
MICE algorithm performed better than CCA: although the two methods gave statistically similar results,
multiple imputation maintains the size of the database and preserves the general distribution.
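
The comparison of estimates, standard errors, and 95% confidence intervals described above requires the results from the individual imputed datasets to be combined into a single set of values. The abstract does not state how this was done; a minimal sketch, assuming the commonly used Rubin's rules, is given below. The function name `pool_rubin`, the input numbers, and the normal quantile (a large-sample simplification of the usual t reference distribution) are illustrative assumptions, not values or choices taken from the study.

```python
import math
from statistics import NormalDist, mean, variance


def pool_rubin(estimates, std_errors, level=0.95):
    """Pool one parameter across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching estimates and standard errors, m >= 2")
    q_bar = mean(estimates)                        # pooled point estimate
    within = mean([se ** 2 for se in std_errors])  # mean within-imputation variance
    between = variance(estimates)                  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between         # total variance of the pooled estimate
    se = math.sqrt(total)
    # Assumption: normal quantile instead of the t distribution with Rubin's degrees of freedom.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return {"estimate": q_bar, "se": se, "ci": (q_bar - z * se, q_bar + z * se)}


# Hypothetical regression coefficient estimated on m = 3 imputed datasets.
pooled = pool_rubin([1.0, 1.2, 0.8], [0.1, 0.1, 0.1])
print(pooled)
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates differ across imputations, which is how MI carries the uncertainty due to the missing values into the final interval.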