Ebad Fardzadeh Haghish: mlim: Multiple Imputation with Automated Machine Learning


Supervised and unsupervised machine learning algorithms have commonly been used for single imputation, replacing missing data with the most plausible values. However, there has not yet been any attempt to apply automated machine learning to fine-tune a model for each imputed variable and to address some of the practical challenges of working with factor data. In this presentation, I will introduce mlim, an R package that supports Elastic Net, Random Forest, Gradient Boosting Machine, Extreme Gradient Boosting, Deep Learning, and Stacked Ensemble algorithms for performing single or bootstrapping-based multiple imputation. I will discuss the pros and cons of using these state-of-the-art machine learning algorithms for missing data imputation. Moreover, I will compare mlim with other well-known R packages such as missForest to examine whether 1) fine-tuning a model for imputing each variable (feature) and 2) automatically balancing factor variables that suffer from class imbalance can lead to lower imputation error and fairer imputations. mlim is already available on CRAN, but a more recent version of the package is available on GitHub (https://github.com/haghish/mlim).


About the speaker:

Haghish is a postdoctoral researcher at the Department of Psychology, University of Oslo. His current research centers on machine learning applications for mental health research, with a particular interest in rare diseases, which relates to his interest in severe class imbalance problems. Haghish is also an avid statistical software developer for both R and Stata.

Published Sep. 26, 2022 10:53 AM - Last modified Sep. 26, 2022 10:53 AM