  •   KR-TBZMED Home
  • School of Health and Nutrition
  • Theses(HN)
  • View Item

Comparison of Regularization and Machine Learning Approaches in Variable Selection and Prediction and its Application in Biological Data Analysis

View/Open
پایان نامه نسخه نهایی.pdf (9.949Mb)
Date
2022
Author
Hamidi, Farzaneh
Abstract
Introduction: Early diagnosis of ovarian cancer, and identification of the genes that affect it, play a key role in the treatment and life of the patient. Combining gene expression data from microarray technology with machine learning algorithms makes it possible to provide new, intelligent methods for the health and treatment system that can diagnose ovarian cancer with high accuracy.

Objectives: To compare regularization and machine learning approaches (LASSO, Elastic net, and Boruta) in variable selection and prediction, and to apply them to ovarian cancer microarray data.

Methods: We used Boruta, LASSO, and Elastic net to select, in the training sample, the miRNAs most strongly related to ovarian cancer that produce the highest prediction accuracy. We used SMOTE oversampling to balance the outcome classes in the GSE106817 data. We then used five-fold cross-validation to tune the hyperparameters of decision tree (DT), random forest (RF), logistic regression (LR), extreme gradient boosting (XGBT), and artificial neural network (ANN) models on the balanced sample, using the most important variables selected by Boruta, LASSO, and Elastic net. Once the prediction models were developed, we applied them to the test samples GSE113486 and GSE113740 to verify their accuracy. Among the five ML algorithms, we looked for the one with the highest predictive power in terms of the area under the ROC curve (AUC). Sensitivity, specificity, positive predictive value, negative predictive value, misclassification rate, and Kappa were also assessed. The analysis followed the guidelines for developing transparent multivariable prediction models. We used the "Boruta" and "glmnet" packages in R. The study also investigates the shrinkage strategy, focusing on the regularized linear regression variants LASSO and Elastic net, alongside Boruta, a wrapper method that implements a feature selection algorithm for finding all relevant variables. Boruta is a wrapper around a random forest classification algorithm: it iteratively removes the variables that a statistical test shows to be less important than random probes. The performance of these techniques was also studied in a simulation environment, which is discussed in Section 3 of the thesis; the following section summarizes the results.

Results: Using these methods, we obtained a very small set of important variables. Judged by the evaluation criteria, the results had considerable validity, and the selected miRNAs were identified as potentially strong biomarkers for ovarian cancer. Each selected miRNA individually showed significant expression levels in cancer cases (p = 0.001, AUC > 90%), in the original data set (p = 0.001, AUC > 98%), and in the external evaluation data (p = 0.001, AUC > 95%), so all five classification models built on these miRNAs had high and significant AUCs. The simulation box plots showed that, as the sample size increases, LASSO performs better than Elastic net, followed by Boruta, when correlations are high, whereas Boruta performs better than Elastic net and LASSO when correlations are low. In the high-dimensional scenarios, Boruta performed best when the correlation was strong, followed by Elastic net and then LASSO, whereas Elastic net performed better than Boruta and LASSO when correlations were low or weak.

Conclusion: The findings of this study provide significant evidence that a set of miRNAs from serum miRNA profiles are promising diagnostic biomarkers for ovarian cancer. The simulation phase of the study showed that, under conditions such as high correlation and high dimensionality, Boruta performs better than Elastic net and LASSO.
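
The evaluation measures named in the Methods (sensitivity, specificity, positive and negative predictive value, misclassification rate, and Kappa) can all be derived from a 2x2 confusion table of predicted against true class. The following is a minimal illustrative sketch in Python, not the thesis's own code (the thesis reports using R). The names `ConfusionCounts` and `binary_metrics` are hypothetical, and the counts in the example are made up for illustration; they are not results from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """2x2 confusion table for a binary classifier (hypothetical helper)."""
    tp: int  # cancer cases predicted as cancer
    fp: int  # controls predicted as cancer
    fn: int  # cancer cases predicted as control
    tn: int  # controls predicted as control


def binary_metrics(c: ConfusionCounts) -> dict:
    """Return the evaluation measures listed in the abstract.

    Assumes every denominator is non-zero (both classes are present
    and both labels are predicted at least once).
    """
    n = c.tp + c.fp + c.fn + c.tn
    accuracy = (c.tp + c.tn) / n

    # Agreement expected by chance, used by Cohen's kappa: for each class,
    # multiply the predicted proportion by the true proportion, then sum.
    chance_pos = ((c.tp + c.fp) / n) * ((c.tp + c.fn) / n)
    chance_neg = ((c.fn + c.tn) / n) * ((c.fp + c.tn) / n)
    expected = chance_pos + chance_neg

    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "misclassification_rate": 1 - accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Made-up counts for illustration only; not taken from the thesis.
example = binary_metrics(ConfusionCounts(tp=30, fp=10, fn=20, tn=40))
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Kappa is included alongside the misclassification rate because it corrects raw agreement for the agreement expected by chance, which matters when the outcome classes are imbalanced.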
URI
http://dspace.tbzmed.ac.ir:80/xmlui/handle/123456789/66835
Collections
  • Theses(HN)

Knowledge repository of Tabriz University of Medical Sciences using DSpace software copyright © 2018  HTMLMAP
Contact Us | Send Feedback
Theme by 
Atmire NV
 

 
