Abstract
Regression analysis is often affected by high dimensionality, severe multicollinearity, and a large proportion of missing data. These problems may mask important relationships and even lead to biased conclusions. This paper proposes a novel computationally efficient method that integrates data imputation and variable selection to address these issues. More specifically, the proposed method incorporates a new multiple imputation algorithm based on matrix completion (Multiple Accelerated Inexact Soft-Impute), a more stable and accurate new randomized lasso method (Hybrid Random Lasso), and a consistent method to integrate a variable selection method with multiple imputation. Compared to existing methodologies, the proposed approach offers greater accuracy and consistency through mechanisms that enhance robustness against different missing data patterns and sampling variations. The method is applied to analyze the Asian American minority subgroup in the 2017 National Youth Risk Behavior Survey, where key risk factors related to the intention for suicide among Asian Americans are studied. Through simulations and real data analyses on various regression and classification settings, the proposed method demonstrates enhanced accuracy, consistency, and efficiency in both variable selection and prediction. Copyright © 2023 Elsevier B.V. All rights reserved.
Original language | English |
---|---|
Article number | 107877 |
Journal | Computational Statistics and Data Analysis |
Volume | 192 |
DOIs | https://doi.org/10.1016/j.csda.2023.107877 |
Publication status | Published - Apr 2024 |
Citation
Liang, L., Zhuang, Y., & Yu, P. L. H. (2024). Variable selection for high-dimensional incomplete data. Computational Statistics and Data Analysis, 192, Article 107877. https://doi.org/10.1016/j.csda.2023.107877
Keywords
- High-dimensional
- Missing data
- Variable selection
- Multiple imputation
- Randomized lasso