HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER’S GRADUATION THESIS

PM2.5 Prediction Using Genetic Algorithm-Based Feature Selection and Encoder-Decoder Model

NGUYEN MINH HIEU
hieu.vn

Major: Data Science and Artificial Intelligence
Thesis advisor: Dr. Nguyen Phi Le
Institute: School of Information and Communication Technology

HA NOI, 09/2021

Graduation Thesis Assignment

Name: Nguyen Minh Hieu
Phone:
Email: hieu.vn
Class: 20BKHDL-E
Affiliation: Hanoi University of Science and Technology

I, Nguyen Minh Hieu, hereby warrant that the work and presentation in this thesis were performed by myself under the supervision of Dr. Nguyen Phi Le. All the results presented in this thesis are truthful and are not copied from any other works.
All references in this thesis, including images, tables, figures, and quotes, are clearly and fully documented in the bibliography. I will take full responsibility for any copied content that violates school regulations.

Hanoi, 28th September, 2021
Author
Nguyen Minh Hieu

Attestation of thesis advisor: …………………………….

Hanoi, 28th September, 2021
Thesis Advisor
Dr. Nguyen Phi Le

Acknowledgements

First of all, I would like to deeply thank my family, especially my parents, who have worked hard to raise me. My parents have always been with me and created the best conditions for me to have everything I needed for my studies. They are my spiritual anchor and have given me a springboard to overcome difficulties and challenges.

I would like to express my gratitude to my advisor, Dr. Nguyen Phi Le, for supporting my studies and research on this subject. She is a very kindhearted and supportive person who has guided me from the first day I worked with her. Moreover, I would like to thank Dr. Nguyen Thanh Hung, who has spent his precious time supporting and advising me and who, along with Dr. Nguyen Phi Le, has given me opportunities to work on many amazing projects.

My sincere thanks also go to all the people in the ICN laboratory of the BK. I have had a wonderful time working with talented and special peers. I learned a lot from them, and they always spread positive energy to me.
Finally, I would like to thank my friends, who have always stood by me, shared my joys and sorrows, and supported and helped me all the time.

Abstract

The concentration of fine particulate matter (PM2.5), which represents inhalable particles with diameters of 2.5 micrometers and smaller, is a vital air quality index. Such particles can penetrate deep into the human lungs and severely affect human health. This paper studies accurate PM2.5 prediction, which can potentially contribute to reducing or avoiding the negative consequences. Our approach’s novelty is to utilize the genetic algorithm (GA) and an encoder-decoder (E-D) model for PM2.5 prediction. The GA performs feature selection and removes outliers to enhance the prediction accuracy. The encoder-decoder model with long short-term memory (LSTM), which relaxes the restrictions between the input and output of the model, can be used to effectively predict PM2.5. We evaluate the proposed model on air quality datasets from Hanoi and Taiwan. The evaluation results show that our model achieves excellent performance. Using the E-D model alone, we obtain predictions that are up to 53.7% more accurate than those of previous works. Moreover, the GA in our model has the advantage of obtaining the optimal feature combination for predicting PM2.5. By combining the GA-based feature selection algorithm and the E-D model, our proposed approach further improves the accuracy by at least 13.
Contents

Graduation Thesis Assignment
List of Figures
List of Tables
List of Equations
Existing solutions and problems
Goals and approaches
Structure of thesis
Machine learning overview
Deep learning overview
Long short-term memory
Encoder-Decoder model
The importance of features
Proposed Forecasting Framework (OFFGED)
GA-based feature selection
Encoder-Decoder model-based prediction
Dataset and evaluation settings
Impact of the GA’s number of generations
Comparing feature selection algorithms
Comparing prediction models
Comparing ED-LSTM, AE-BiLSTM, and AC-LSTM
Comparing ED-LSTM and ST-DNN
Novel Training Strategy (LTS2)

List of Figures

An example of an artificial neural network.
Structure of RNN.
Structure of the LSTM unit.
The basic structure of the encoder-decoder model.
An example of feature extraction.
An example of feature selection.
An example of feature construction.
The basic structure of Genetic Algorithm.
Overview of the proposed model.
Encoding a feature combination (the white and gray cells represent the selected feature encoded by 1 and 0, respectively).
Illustration of the GA’s crossover and mutation operations.
Structure of the LSTM-based encoder-decoder model.
Impact of the number of generations.
Comparison of feature selection algorithms.
Comparison of GA-based feature selection and using all the features for the Hanoi dataset.
Comparison between models using Hanoi dataset with all features.
Comparison between models using Hanoi dataset with features selected by GA.
MAE of the proposed model with different output lengths.
Comparison between models using Taiwan dataset with features selected by [11].
Notation of the proposed GA-based training mechanism.
Training strategy – ¿, shuffling = (true, false).
Training strategy – ¿, shuffling = (false, false).
Training strategy – ¿, shuffling = (true, true).
Training strategy – ¿, shuffling = (false, true).

List of Tables

Details of missing data in the datasets.
ED-LSTM, AE-BiLSTM, and AC-LSTM use all features (Hanoi dataset).
ED-LSTM, AE-BiLSTM, and AC-LSTM use features selected by GA (Hanoi dataset).
Comparing the MAE of the proposed ED-LSTM model and the ST-DNN model (using the features proposed by [11]).
Correlation of features.
Hyperparameters of training strategy.
Training strategy for different cases.
Comparing proposed method combining new training strategy with related works.

List of Equations

LSTM decoder equation of the first time step.
LSTM decoder equation.
Prediction result of one time step.

Abbreviations and terms

Acronym   Meaning
LSTM      Long Short-Term Memory
RNN       Recurrent Neural Network
MAE       Mean Absolute Error
RMSE      Root Mean Squared Error
l         Sequence length
h         Horizon

Introduction

1.1 PM2.5 forecasting problem

Industrialization and urbanization have brought considerable convenience to human lives. However, they are generally associated with severe air pollution.
Accordingly, people have raised concerns about air quality, especially near living areas. Fine particulate matter (PM2.5) is one of the most important indexes for evaluating the severity of air pollution, which is directly related to human health. PM2.5 particles in the air can bypass the nose and throat and penetrate deep into the lungs, causing many diseases, such as cardiovascular disease and respiratory disease. In [1], the authors reveal that long-term exposure to PM2.5 may lead to heart attack and stroke. Therefore, accurate PM2.5 forecasting is crucial and may help governments and citizens find suitable solutions to control or prevent negative conditions.

1.2 Existing solutions and problems

PM2.5 forecasting is a time series prediction problem that is commonly solved using recurrent neural networks (RNNs), including LSTM [2]. The LSTM-based model has advantages in air quality prediction [3]. In [4], the authors also use LSTM but combine gas and PM2.5 concentrations to predict air quality in Taiwan. The work in [5] exploits deep learning to build a hybrid neural network model that can forecast PM2.5 multiple steps ahead. In [6], Yanlin et al. present a hybrid model that integrates graph convolutional networks and LSTM to predict PM2.5.
In [7], the authors utilize the k-nearest neighbor algorithm to mine spatial-temporal information. The historical information of related locations is then used as the input of the LSTM, adaptive temporal extractor (ASE), and artificial neural network (ANN) models. Several other deep learning models for predicting air quality can be found in [8] - [11].

Despite considerable effort, air quality prediction models still suffer from two issues: restrictions on the input and output lengths and unoptimized feature selection. The first issue is that the number of time steps in a model’s output cannot exceed that of the input; i.e., the model cannot predict more future time steps than the input data’s length. Therefore, it is essential to remove this limitation in PM2.5 prediction. The second issue arises from the fact that air quality data include dozens of factors other than PM2.5, such as the concentrations of other pollutants, temperature, and humidity. These factors may or may not be related to PM2.5.
However, appropriate use of some of these factors may improve the prediction accuracy. Meanwhile, misuse may not only degrade the accuracy but also add extra computational time. Therefore, choosing the optimal feature combination is essential.

1.3 Goals and approaches

This paper aims to address the two issues described above. As a solution, we propose a novel PM2.5 prediction model that combines a genetic algorithm (GA) and an encoder-decoder (E-D) model. The GA is exploited to perform feature selection in a near-optimal manner, thereby enriching the prediction model. Additionally, we leverage the encoder-decoder model to build a PM2.5 prediction model with high accuracy. As a result, the proposed model can efficiently handle different sizes (in terms of the number of time steps) of input and output.

To demonstrate the effectiveness of our proposed approach, we evaluate the GA-based feature selection on the Hanoi [12] and Taiwan [11] datasets. The evaluations show that the GA-based feature selection outperforms other methods. We then compare our model to the state-of-the-art method ST-DNN in [11] using the Taiwan dataset. Compared to ST-DNN, our model improves the accuracy from 14. By combining the GA-based feature selection algorithm and the E-D model, our proposed approach further increases the accuracy by at least 3%.

1.4 Structure of thesis

The remainder of this paper is organized as follows.
We describe the motivations in Section II. Section III presents our proposal. The performance evaluation is presented in Section IV. Section VI introduces related works. Finally, Section VII concludes the paper.

Related works

This section briefly reviews work related to PM2.5/air quality prediction models and GA-related methods. The work in [3] predicted air quality using a deep learning model that includes three parts. In the first part, the training data are fed into an LSTM layer with an input sequence length of 8 and an output length of 1.
Second, the predicted data are labeled according to the daily air quality index (AQI) values. Finally, a decision unit is developed to map the observed data and predicted alarm situations. The model succeeds in employing LSTM with high accuracy, but its input and output lengths are not flexible.

Several other models, such as ST-DNN [11], deep air learning (DAL) [10], and GC-DCRNN [14], exploit spatial data to formulate the relationships between spatial-temporal data. However, ST-DNN and GC-DCRNN do not identify the factors that affect air quality. Additionally, the models have a high time cost because of the preprocessing. The DAL model performs feature selection during the training process by inserting a layer between the input layer and the second layer of the neural network. DAL, however, aims to discover the importance of different input features to the predictions, not to increase the prediction accuracy.
The authors identify the main factors relevant to variations in air quality and provide evidence to support air pollution prevention and control.

In [28], the authors use a sequence-to-sequence model to predict PM2.5. They feed all air pollutants into the model without considering whether they are appropriate. The main problem is that predicting all air pollutants will result in an "accumulation of errors." For example, when each feature’s prediction results are inaccurate, they will negatively affect the PM2.5 prediction.
Even if a feature does not affect PM2.5, its prediction error will still make the outcome less accurate.

In [20], the dataset consists of five features other than PM2.5; therefore, there are 2^5 = 32 possible feature subsets (31 excluding the empty set). However, the authors do not describe how to select the optimal combination. In the experiments, the authors present results for seven combinations without explaining why these combinations are selected.
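
To illustrate how a feature combination can be searched for systematically rather than picked by hand, the following is a minimal sketch of a GA over binary feature masks (1 = feature selected, 0 = not selected). It is not the implementation used in this thesis: the feature names, the `toy_error` fitness function, and all hyperparameter values are assumptions chosen only to keep the example self-contained. In practice, the fitness would be the validation error of a prediction model trained on the selected features.

```python
import random

# Hypothetical feature names, for illustration only.
FEATURES = ["PM10", "temperature", "humidity", "wind_speed", "pressure"]


def toy_error(mask):
    """Stand-in for the validation error of a model trained on the masked features.

    Assumption: features 0 and 2 reduce the error, every other selected
    feature adds a small penalty. Lower is better.
    """
    helpful = {0, 2}
    error = 10.0
    for i, bit in enumerate(mask):
        if bit:
            error += -3.0 if i in helpful else 0.5
    return error


def crossover(a, b, rng):
    """One-point crossover: prefix of parent a, suffix of parent b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(mask, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in mask]


def ga_select(n_features, fitness, pop_size=8, generations=20, rate=0.1, seed=0):
    """Return the best mask found; the better half of each generation survives."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]  # elitism: keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rate, rng))
        population = parents + children
    return min(population, key=fitness)


best_mask = ga_select(len(FEATURES), toy_error)
selected = [name for name, bit in zip(FEATURES, best_mask) if bit]
```

Because the better half of the population is carried over unchanged, the best error never increases from one generation to the next; the number of generations then trades search time against the quality of the selected combination.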
Yan et al. use the E-D model to predict PM2.5. The authors use all other features, including the monthly average PM2.5 concentration, daily average PM2.5 concentration, PM10 concentration, AQI, SO2, CO, NO2, O3, average temperature, humidity, pressure, and wind speed per hour per day.
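
Sequence-to-sequence models such as those reviewed above are trained on pairs of an input window over several features and a target window of future PM2.5 values. The sketch below shows one way such pairs could be cut from a multivariate hourly series when the sequence length l and the horizon h are chosen independently, so that h may exceed l. The function name `make_windows`, the feature layout, and the synthetic readings are assumptions for illustration, not part of any cited work.

```python
def make_windows(rows, target_index, l, h):
    """Cut (input, target) pairs from a multivariate time series.

    rows: list of per-time-step feature lists
    target_index: column holding the value to predict (here PM2.5)
    l: input sequence length; h: horizon (number of future steps)
    """
    samples = []
    for start in range(len(rows) - l - h + 1):
        x = rows[start:start + l]  # l past steps, all features
        y = [row[target_index] for row in rows[start + l:start + l + h]]
        samples.append((x, y))
    return samples


# Synthetic hourly readings: [PM2.5, temperature, humidity] (made-up values).
series = [[float(t), 20.0 + t, 60.0 - t] for t in range(10)]

# The horizon (5) is longer than the input window (3).
pairs = make_windows(series, target_index=0, l=3, h=5)
first_input, first_target = pairs[0]
```

With ten time steps, l = 3, and h = 5, this yields three training pairs; the first target holds the PM2.5 values at steps 3 through 7.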