Spatiotemporal Data Clustering Based on a Mixture Model and Its Application to the Clustering of Air Pollutant Data Recorded at the Air Quality Monitoring Stations in Tabriz
Abstract
Introduction: Air pollution, one dimension of environmental pollution, has adverse effects on human societies. Air quality monitoring and the validation and analysis of monitored data are therefore of great importance for assessing air quality and its effects on health. Ignoring the spatial and temporal dimensions of pollutant concentration data reduces the validity of results obtained with traditional clustering methods. The purpose of this study was to compare the efficiency of a new mixture model-based clustering method that considers both spatial and temporal dimensions (STM) with mixture model-based clustering that considers only the temporal dimension (TM) and with mixture model-based clustering that ignores both dimensions (MCLUST), using O3 and PM10 data recorded in 2017 in Tabriz, East Azerbaijan province.
Methods: This methodological study used air pollution data (O3 and PM10) recorded in Tabriz during 2017 by the East Azerbaijan Environmental Protection Organization. In the analysis phase, erroneous values were removed from the hourly pollutant data and outliers were excluded based on z-scores. Several imputation approaches were considered to deal with missing values. The geographical coordinates of the stations and the hour and day on which each concentration was recorded were then entered into STM and TM as the spatial and temporal dimensions. Goodness-of-fit indicators were calculated for each model and compared. Finally, the relationship between meteorological parameters and the clusters obtained from the better-fitting model was analyzed using the Mann-Kendall correlation coefficient at the 0.05 and 0.01 significance levels. All analyses were performed in R software version 4.0.2.
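The preprocessing steps described in the Methods (z-score outlier exclusion followed by imputation of missing values) can be illustrated with a minimal sketch. This is not the authors' R code: it is written in Python for illustration only, the function names `exclude_outliers` and `interpolate_linear` are hypothetical, and the z-score cutoff `z_max` is an assumption, since the threshold used in the study is not stated here. The imputation shown is linear interpolation between the nearest observed neighbours.

```python
from statistics import mean, pstdev


def exclude_outliers(values, z_max=3.0):
    """Mark readings whose |z-score| exceeds z_max as missing (None).

    z_max is an assumed cutoff; the study's actual threshold is not given.
    """
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    if sigma == 0:
        # All observed readings are identical, so nothing can be an outlier.
        return list(values)
    return [
        None if v is not None and abs((v - mu) / sigma) > z_max else v
        for v in values
    ]


def interpolate_linear(values):
    """Fill interior gaps (None) by linear interpolation.

    Gaps before the first or after the last observed reading are left as None,
    because there is no neighbour on one side to interpolate from.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical hourly concentrations with gaps and one implausible spike.
hourly = [10, 12, None, 16, 500, 14, 12, None, None, 6]
cleaned = exclude_outliers(hourly, z_max=2.0)
imputed = interpolate_linear(cleaned)
```

In this toy series the spike is first turned into a missing value and then filled, together with the original gaps, from its neighbours. A low cutoff is used in the example only because a very short series cannot produce large z-scores.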
Results: Among the imputation methods compared, linear interpolation performed best. The missing data were therefore imputed by linear interpolation, and the completed data series was entered into the clustering analysis. In terms of the absolute value of BIC, STM had the highest value and MCLUST the lowest for both pollutants. The numbers of O3 and PM10 clusters were 3 and 4 for STM, 4 and 5 for TM, and 9 and 9 for MCLUST, respectively. For both pollutants, STM yielded the fewest clusters. In the clusters obtained by STM, the relationship between pollutant concentration and temperature was positive and significant in all clusters for O3 and in some clusters for PM10. The relationships of both pollutants with relative humidity and rainfall were negative and significant at the 0.05 and 0.01 levels. The fourth cluster of PM10 had the highest mean concentration and average temperature and the lowest precipitation, whereas the third cluster had the lowest mean concentration. The second cluster of O3 had the lowest mean concentration, average temperature, and wind speed, whereas the third cluster had the highest mean concentration. Temperature and rainfall had a greater effect than the other parameters on increases and decreases in pollutant concentration.
Conclusions: In this method, the temporal and spatial dimensions are entered into the model simultaneously, so there is no need to investigate pollutant concentrations separately at different times and locations. As a result, both the volume of results to be examined and the error rate in decision making decrease.
Keywords: Clustering, Air Pollutants, Mixture Model, Spatio-Temporal Model, Imputation Method, Tabriz.