Application of partial least squares based methods to analyze traffic data and comparison of their performance with common methods

Jamali-Dolatabad, Milad

Date

2020

Author

Jamali-Dolatabad, Milad

Abstract

Introduction: Regression methods are widely used to assess relationships between variables. High-dimensional data give rise to issues such as multicollinearity, poor interpretability of variable effects, and low validity of the results obtained with conventional methods. Traffic accidents remain one of the critical unsolved problems throughout the world, and particularly in Iran.
Objectives: This study aimed to introduce regression methods based on Partial Least Squares (PLS) for analyzing traffic data and to compare the efficiency of these approaches with conventional methods in predicting the traffic accident-related mortality rate from factors related to pedestrians, vehicles, drivers, passengers, and accidents, using traffic accident data from 2013 and 2014 in East Azerbaijan, West Azerbaijan, and Ardabil.
Methods: In this case-control study, police-recorded accident data from East Azerbaijan, West Azerbaijan, and Ardabil in 2013 and 2014 were used. Descriptive indicators were used to present an overall description of the data. In the data analysis phase, all deaths were considered cases, and three times as many individuals were randomly selected from the survivors as controls. First, the statistical assumptions (missing values, outliers, and multicollinearity) were assessed, and three approaches were adopted to handle missing data: imputation using multivariate modeling (MICE), imputation with the Nonlinear Iterative Partial Least Squares (NIPALS) method, and deletion of missing values. Then, to predict the death of injured persons from accident-related factors and the characteristics of the injured in all groups (pedestrians, drivers, and passengers), Ridge and Lasso models, principal component analysis (PCA), and two variants of the Partial Least Squares method (R-PLS and PLS-DA) were used.
To validate the models, 70% of the data were used as the training set and 30% as the test set. Models were fitted on the training data, and goodness-of-fit indicators (sensitivity, specificity, area under the ROC curve, and accuracy) were calculated on the test set. Finally, general linear models were used to compare the validation indicators. All statistical analyses were performed using R Statistical Software (version 4.0.0) with the “glmnet”, “mixOmics”, “mice”, and “plsgenomics” packages.
Results: The mean area under the ROC curve was 0.840 for the conventional logistic model, 0.861 for the Ridge (RR) method, 0.856 for the Lasso (LR) method, 0.778 for the PCA method, 0.848 for the PLS-DA method, and 0.839 for the R-PLS method. The highest value was attributed to the LR method and the lowest to the PCA method. The difference in the mean of this index among models was statistically significant (P=0.040), and the PCA method had a significantly lower area under the ROC curve than the other models. With respect to the handling of missing values, the mean area under the ROC curve was 0.833 with deletion of missing values, 0.842 with MICE imputation, and 0.837 with NIPALS imputation. Overall, the mean area under the curve was higher with the imputation approaches than with the deletion approach, but this difference was not statistically significant (P=0.897). The mean sensitivity (P<0.001) and accuracy (P<0.001) also differed significantly, and both indices had their lowest means with the PCA method. The mean specificity differed significantly across models as well (P=0.043); on this index, however, the highest mean was attributed to the PCA model.
In addition, factors such as whether the accident was intercity or intra-city, the type of road, the location of the accident, and the type of vehicle involved had a significant impact on accident severity in all three groups of data sets.
Conclusion: This study showed that all models gave acceptable and approximately similar results, but supervised models performed better than unsupervised models. The models used in this study (RR, LR, PCA, PLS-DA, and R-PLS) have better performance in estimating the effects of predictors and in accounting for the effects of low-impact variables. These methods are recommended for studies with a large number of predictor variables and data with multicollinearity. In practical terms, in all groups the most important factors affecting accident severity were related to the characteristics of the accident location; providing solutions to improve the quality of relief services according to the location of the accident can therefore be of great help in reducing the traffic accident mortality rate. Other variables, such as vehicle features, collision characteristics, and characteristics of the injured person, contributed to accident severity as well. Interventions to improve vehicle and environmental safety, as well as to promote individuals' safety knowledge, can be greatly helpful in reducing the mortality rate due to traffic accidents.
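The validation step described in the Methods (a 70/30 split, then sensitivity, specificity, accuracy, and area under the ROC curve on the test set) can be illustrated with a minimal sketch. The study itself used R; the Python below is only an illustration, and the function names, the 0.5 cut-off, the seed, and the toy labels and scores are all assumptions that do not come from the thesis.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def confusion_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity and accuracy at an assumed probability cut-off."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, y_score):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return {
        "sensitivity": tp / (tp + fn),   # share of cases (deaths) detected
        "specificity": tn / (tn + fp),   # share of controls correctly cleared
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


def roc_auc(y_true, y_score):
    """AUC as the probability that a random case outscores a random control.

    Ties count as one half (the Mann-Whitney formulation).
    """
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Toy test-set labels (1 = died, 0 = survived) and predicted probabilities.
# These values are made up purely for illustration.
toy_labels = [1, 1, 0, 0, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.1, 0.6, 0.2, 0.8, 0.3]
print(confusion_metrics(toy_labels, toy_scores))
print(roc_auc(toy_labels, toy_scores))
```

In a workflow like the one described, the scores would be the predicted probabilities of death from each fitted model on the 30% test set, and the resulting indicators would then be compared across models and missing-data approaches.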

URI

http://dspace.tbzmed.ac.ir:8080/xmlui/handle/123456789/62173

Collections

Theses(HN)