Linn Cecilie Bergersen: Preselection in High-dimensional Penalized Regression Problems Guided by Freezing

Linn Cecilie Bergersen (Matematisk Institutt, Universitetet i Oslo) will speak on

Preselection in High-dimensional Penalized Regression Problems Guided by Freezing

Abstract

Relating genomic measurements, such as gene expression or SNPs, to a specific phenotype of interest often involves having a large number P of covariates compared to the sample size n. While P>>n problems can be solved by penalized regression methods like the lasso, challenges remain if P is so large that the design matrix cannot be handled by standard statistical software. For example, in genome-wide association studies the number of SNPs can exceed 1 million, and it is often necessary to reduce the number of covariates prior to the analysis. This is often called preselection. We introduce the concept of freezing, which enables reliable preselection of covariates in lasso-type problems. Our rule works in combination with cross-validation to choose the optimal amount of tuning with respect to prediction performance, and finds the solution of the full problem with P covariates using only a subset p<<P of them. By investigating freezing patterns, we are able to avoid preselection bias, even if variables are preselected based on univariate relevance measures connected to the response. We demonstrate the concept in simulation experiments and observe impressive data reduction rates, without losing variables that are actually selected in the full problem. We also apply our rule to genomic data, including an ultra high-dimensional regression setting where we are not able to fit the full regression model.
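The preselection step mentioned in the abstract, screening covariates by a univariate relevance measure connected to the response before fitting a penalized regression, can be sketched as follows. This is a minimal toy illustration, not the authors' freezing rule: the simulated data, the marginal-correlation score, and the cutoff p are all assumptions chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, P, p = 100, 5000, 500  # samples, full covariate count, preselected subset size

# Simulate a P>>n problem: only the first 5 covariates carry signal
X = rng.standard_normal((n, P))
beta = np.zeros(P)
beta[:5] = 3.0
y = X @ beta + rng.standard_normal(n)

# Univariate relevance measure: absolute marginal covariance with the response
scores = np.abs(X.T @ (y - y.mean())) / n

# Keep only the p covariates with the strongest univariate scores
keep = np.argsort(scores)[-p:]

# The truly relevant covariates should survive the screening step,
# so a lasso fit on the reduced n-by-p matrix X[:, keep] can still
# recover them while being far cheaper than the full problem.
print(sorted(set(range(5)) & set(keep)))
```

In this sketch the working set is fixed once; the point of the freezing rule described above is to verify, across the cross-validated tuning path, that such a reduced fit actually coincides with the solution of the full P-covariate problem.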

This is joint work with Ismaïl Ahmed, Arnoldo Frigessi, Ingrid K. Glad and Sylvia Richardson.

 

Published Apr. 13, 2012 3:45 PM - Last modified June 4, 2012 9:32 PM