Application of partial least squares-based methods to analyze traffic data and comparison of their performance with common methods
Abstract
Introduction: Regression methods are widely used to assess the relationships
between variables. High-dimensional data give rise to issues such as
multicollinearity, poor interpretability of the effects of variables, and low validity
of the results obtained with conventional methods. Traffic accidents remain
one of the critical unsolved problems throughout the world, and
particularly in Iran.
Objectives: This study aimed to introduce regression methods based on Partial
Least Squares for analyzing traffic data and to compare the efficiency of these
approaches with that of conventional methods in predicting traffic accident-related
mortality from factors related to pedestrians, vehicles, drivers, passengers,
and accidents, using traffic accident data from 2013 and 2014 in
East and West Azerbaijan and Ardabil.
Methods: In this case-control study, police-recorded accident data from East and
West Azerbaijan and Ardabil for 2013 and 2014 were used. Descriptive statistics
were used to summarize the data. In the data analysis phase, all deaths
were considered cases, and three times as many controls as cases were
randomly selected from the surviving individuals. First, statistical
assumptions (missing values, outliers, and multicollinearity) were assessed.
To handle missing data, three approaches were adopted: imputation using multivariate
modeling (MICE), imputation with the Nonlinear Iterative Partial Least Squares
(NIPALS) method, and deletion of records with missing values. Then, to
predict the death of injured persons from accident-related factors and the
characteristics of the injured persons in each group (pedestrians, drivers, and
passengers), Ridge and Lasso models, principal component analysis, and two
approaches of the Partial Least Squares method (R-PLS and PLS-DA) were used. To
validate the models, 70% of the data were used as the training set
and 30% as the test set. Models were fitted using the training
data, and goodness-of-fit indicators (sensitivity, specificity, area under the ROC curve,
and accuracy) were calculated using the test set. Finally, general linear models were
used to compare the validation indicators across models. All statistical analyses were
performed using R Statistical Software (version 4.0.0) with the “glmnet”, “mixOmics”,
“mice”, and “plsgenomics” packages.
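The sampling and validation steps described above can be illustrated with a minimal sketch. It is written in plain Python for illustration only; the study itself used R, and this is not the authors' code. The function names (sample_controls, train_test_split, confusion_metrics, roc_auc), the 0.5 classification threshold, and the toy data are all assumptions introduced here.

```python
import random


def sample_controls(cases, non_cases, ratio=3, seed=0):
    """Randomly draw `ratio` controls per case from the non-case pool."""
    rng = random.Random(seed)
    return rng.sample(non_cases, ratio * len(cases))


def train_test_split(records, train_frac=0.7, seed=0):
    """Shuffle a copy of the records and split it into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for binary labels (1 = death)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


def roc_auc(y_true, scores):
    """AUC as the share of (case, control) pairs in which the case has the
    higher predicted score; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels and predicted probabilities (hypothetical, not study data).
    y_true = [1, 1, 0, 0, 0, 0, 1, 0]
    scores = [0.9, 0.4, 0.3, 0.6, 0.1, 0.2, 0.8, 0.5]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
    print(confusion_metrics(y_true, y_pred))
    print(roc_auc(y_true, scores))
```

In a workflow of this kind, the indicators would be computed once per model and per missing-data approach on the held-out 30%, so that the resulting values can then be compared across models.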
Results: The mean area under the ROC curve was 0.840 for the conventional logistic
model, 0.861 for the RR method, 0.856 for the LR method, 0.778 for the PCA method,
0.848 for the PLS-DA method, and 0.839 for the R-PLS method. The highest value
was attributed to the LR method and the lowest value to the PCA method.
The mean difference in this index across models was statistically significant (P=0.040),
and the PCA method had a significantly lower area under the ROC curve than the other
models. Regarding the handling of missing values, the mean area under the
ROC curve was 0.833 with deletion of missing values, 0.842 with MICE imputation,
and 0.837 with imputation by the NIPALS algorithm. Overall, the mean
area under the curve was higher with the imputation approaches than with the
deletion approach; however, this difference was not statistically significant
(P=0.897). The mean sensitivity index (P<0.001) and accuracy (P<0.001) were also
significantly different across models, and both indices had the lowest mean for the PCA
method. The mean specificity index also differed significantly across models
(P=0.043); however, for this index the highest mean was attributed
to the PCA model. In addition, factors such as whether the accident was inter-city
or intra-city, the type of road, the location of the accident, and the type of
vehicle involved in the accident had a significant impact on the severity of the
accident in all three groups.
Conclusion: This study showed that all models gave acceptable and approximately
similar results, but supervised models performed better than unsupervised models.
The models used in this study (RR, LR, PCA, PLS-DA, and R-PLS) perform better
in estimating the effects of predictors and in accounting for the effects of
low-impact variables. Using these methods is recommended in studies with a large
number of predictor variables and data with multicollinearity. In practice, in all
groups, the most important factors affecting the severity of the accident were related
to the characteristics of the accident location; therefore, solutions that
improve the quality of relief services according to the location of the accident can
be of great help in reducing the traffic accident mortality rate. Other variables, such
as vehicle features, collision characteristics, and characteristics of the injured person,
also contributed to the severity of the accident. Interventions that improve vehicle
and environmental safety and promote individuals’ safety knowledge
can be of great help in reducing the mortality rate due to traffic
accidents.