Selected topics in high-dimensional statistical learning.

dc.contributor.advisor Young, Dean M.
dc.contributor.author Ramey, John A.
dc.date.issued 2012-08
dc.description.abstract Advances in microarray technology have equipped researchers to measure gene expression levels simultaneously from thousands of genes, yielding increasingly large and complex data sets. However, due to the cost and time required to obtain individual observations, the sample sizes of the resulting data sets are often much smaller than the number of gene expressions measured. Hence, due to the curse of dimensionality [Bellman, 1961], the analysis of these data sets with classic multivariate statistical methods is challenging and, at times, impossible. Consequently, numerous supervised and unsupervised learning methods have been proposed to improve upon classic methods. In Chapter 2 we formulate a clustering stability evaluation method based on decision-theoretic principles to assess the quality of clusters proposed by a clustering algorithm used to identify subtypes of cancer for diagnosis. We demonstrate that our proposed clustering-evaluation method is better suited to comparing clustering algorithms and to providing superior interpretability compared to the figure of merit (FOM) method from Yeung, Haynor, and Ruzzo [2001] and the cluster stability evaluation method from Hennig [2007] using three artificial data sets and a well-known microarray data set from Khan et al. [2001]. In Chapter 3 we investigate model selection of the regularized discriminant analysis (RDA) classifier proposed by Friedman [1989]. Using four small-sample, high-dimensional data sets, we compare the classification performance of RDA models selected with five conditional error-rate estimators to models selected with the leave-one-out (LOO) error-rate estimator, which has been recommended for RDA model selection by Friedman [1989]. We recommend the 10-fold cross-validation (CV) estimator and the bootstrap CV estimator from Fu, Carroll, and Wang [2005] for model selection with the RDA classifier.
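The model-selection procedure described above can be illustrated in code. Friedman's RDA uses two regularization parameters; as a hedged, simplified sketch (not the thesis's implementation), the example below uses a single parameter `gamma` that shrinks the pooled covariance toward a scaled identity, and selects it by the 10-fold cross-validated error rate. All function names and the one-parameter simplification are assumptions for illustration.

```python
# Hedged sketch: choosing a discriminant-analysis shrinkage parameter by
# 10-fold cross-validated error rate. Friedman's RDA has two regularization
# parameters; this illustration simplifies to ONE parameter gamma that
# shrinks the pooled covariance toward a scaled identity. Names are
# hypothetical, not from the thesis.
import numpy as np

def fit_predict(X_train, y_train, X_test, gamma):
    """Classify X_test by Mahalanobis distance under a shrunken covariance."""
    classes = np.unique(y_train)
    p = X_train.shape[1]
    means = np.array([X_train[y_train == k].mean(axis=0) for k in classes])
    resid = np.concatenate([X_train[y_train == k] - means[i]
                            for i, k in enumerate(classes)])
    S = resid.T @ resid / max(len(X_train) - len(classes), 1)  # pooled covariance
    # Shrink toward a scaled identity; gamma = 0 is ordinary LDA, gamma = 1 spherical.
    S_reg = (1 - gamma) * S + gamma * (np.trace(S) / p) * np.eye(p)
    S_inv = np.linalg.inv(S_reg)
    diff = X_test[:, None, :] - means[None, :, :]
    d = np.einsum('nkp,pq,nkq->nk', diff, S_inv, diff)
    return classes[np.argmin(d, axis=1)]

def cv_error(X, y, gamma, folds=10, seed=0):
    """Estimate the error rate of the gamma-model by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    errors = []
    for fold in np.array_split(idx, folds):
        train = np.setdiff1d(idx, fold)
        pred = fit_predict(X[train], y[train], X[fold], gamma)
        errors.append((pred != y[fold]).mean())
    return float(np.mean(errors))

# Model selection: keep the gamma with the smallest estimated error, e.g.
# best = min([0.0, 0.25, 0.5, 0.75, 1.0], key=lambda g: cv_error(X, y, g))
```

Leave-one-out estimation is the special case `folds = len(X)`; the 10-fold estimator recommended above trades a little bias for far less variance and computation.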
In Chapters 4 and 5 we consider the diagonal linear discriminant analysis (DLDA) classifier, the shrinkage-based DLDA (SDLDA) classifier from Pang, Tong, and Zhao [2009], and the shrinkage-mean-based DLDA (SmDLDA) classifier from Tong, Chen, and Zhao [2012]. We propose four alternative classifiers and, using six well-known microarray data sets, demonstrate that they are often superior to the diagonal classifiers because they preserve off-diagonal classificatory information by nearly simultaneously diagonalizing the sample covariance matrix of each class. en_US
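For context, the baseline DLDA rule the chapters above build on can be sketched as follows: it is ordinary LDA with the pooled sample covariance replaced by its diagonal, which remains estimable when features far outnumber samples. This is a minimal illustration with hypothetical function names; the shrinkage variants (SDLDA, SmDLDA) and the thesis's proposed classifiers are not shown.

```python
# Hedged sketch of the diagonal linear discriminant analysis (DLDA) rule:
# LDA with the pooled covariance replaced by its diagonal, so each feature
# contributes (x_j - mean_kj)^2 / var_j independently. Function names are
# hypothetical illustrations, not the thesis's code.
import numpy as np

def dlda_fit(X, y):
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # Pooled within-class variance of each feature (diagonal of pooled cov).
    resid = np.concatenate([X[y == k] - X[y == k].mean(axis=0) for k in classes])
    pooled_var = (resid ** 2).sum(axis=0) / max(len(X) - len(classes), 1)
    return classes, means, pooled_var

def dlda_predict(X, classes, means, pooled_var):
    # Assign each point to the class minimizing the variance-scaled distance
    # sum_j (x_j - mean_kj)^2 / var_j (equal class priors assumed).
    d = ((X[:, None, :] - means[None, :, :]) ** 2 / pooled_var).sum(axis=2)
    return classes[np.argmin(d, axis=1)]
```

Because only per-feature variances are estimated, DLDA works even when the number of features exceeds the sample size, but it discards all covariance between features; the off-diagonal information the abstract refers to is exactly what this diagonal restriction throws away.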
dc.rights Baylor University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. Contact for inquiries about permission. en_US
dc.subject Supervised learning. en_US
dc.subject Unsupervised learning. en_US
dc.subject Clustering. en_US
dc.subject Clustering stability. en_US
dc.subject Clustering evaluation. en_US
dc.subject Classification. en_US
dc.subject Naive Bayes classifier. en_US
dc.subject Regularized discriminant analysis. en_US
dc.subject Diagonal discriminant analysis. en_US
dc.subject Error-rate estimation. en_US
dc.subject Gene expression data. en_US
dc.subject Microarray data. en_US
dc.title Selected topics in high-dimensional statistical learning. en_US
dc.type Thesis (Ph.D.) en_US
dc.rights.accessrights Worldwide access en_US
dc.contributor.department Statistical Sciences. en_US
dc.contributor.schools Baylor University. Dept. of Statistical Sciences. en_US
