VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF INFORMATION SYSTEMS

NGUYEN LE KHOA - 18520925

GRADUATION THESIS
ANALYZING AND FORECASTING WIKIPEDIA WEB TRAFFIC
INFORMATION SYSTEMS ENGINEERING

INSTRUCTOR
DR. DO TRONG HOP

HO CHI MINH CITY, 2022

INFORMATION OF THE GRADUATE THESIS BOARD
Graduation thesis grading committee, established under Decision No. of the President of the University of Information Technology.
- Chairman
- Secretary
- Commissioner

ACKNOWLEDGEMENTS
I would like to thank Dr. Do Trong Hop for accompanying me and wholeheartedly guiding me throughout the study and research for my graduation thesis. I would also like to thank the lecturers of the University of Information Technology in general, and of the Faculty of Information Systems in particular, for their dedication and enthusiasm in imparting helpful knowledge and the skills needed to succeed in the future.
I would also like to thank my tutor and friends for always supporting, encouraging, and helping me, and for giving helpful advice. From the bottom of my heart, I sincerely thank you!
(Signature)

UNIVERSITY OF INFORMATION TECHNOLOGY
AEF Advanced Education Program
ADVANCED PROGRAM IN INFORMATION SYSTEMS

THESIS PROPOSAL
THESIS TITLE: ANALYZING AND FORECASTING WIKIPEDIA WEB TRAFFIC
Advisor: Do Trong Hop
Duration: 10/8/2022-15/11/2022
Students: Nguyen Le Khoa - 18520925
Contents
1. Scope:
• Data analysis
• Building prediction model
2. Objectives:
o The training dataset consists of approximately 145k time series.
Each of these time series represents the number of daily views of a different Wikipedia article, starting from July 1st, 2015 up until December 31st, 2016 (550 days). The training stage is based on traffic from January 1st, 2017 up until March 1st, 2017.
o For each time series, you are provided the name of the article as well as the type of traffic that this time series represents (all, mobile, desktop, spider). You may use this metadata and any other publicly available data to make predictions.
Unfortunately, the data source for this dataset does not distinguish between traffic values of zero and missing values. A missing value may mean the traffic was zero or that the data is not available for that day.
o To reduce the submission file size, each page and date combination has been given a shorter Id. The mapping between page names and the submission Id is given in the key files.
o Files used for the first stage will end in '_1'. Files used for the second stage will end in '_2'. Both will have identical formats. The complete training data for the second stage will be made available prior to the second stage.
.csv - contains traffic data.
This is a CSV file where each row corresponds to a particular article and each column corresponds to a particular date. Some entries are missing data. The page names contain the Wikipedia project (e.org), type of access (e.g. desktop) and type of agent (e.
In other words, each article name has the following format: 'name_project_access_agent' (e.org_all-access_spider').
.csv - gives the mapping between the page names and the shortened Id column used for prediction
sample_submission_*.csv - a submission file showing the correct format.
3. Methodology:
• Visualization method
• Statistical method
4. Expected results:
• Analyzing and forecasting future web traffic for approximately 145,000 Wikipedia articles
5. Research timelines:
o Duration: 3 months
o Analysis of the dataset by the visualization method and the statistical method
o Building prediction model
Approved by the advisor(s)
Ho Chi Minh city, 19/9/2022
Signature(s) of advisor(s): Do Trong Hop
Signature(s) of student(s): Nguyen Le Khoa

TABLE OF CONTENTS
THESIS PROPOSAL
CHAPTER 1.
Goal and [illegible]
Web Traffic Prediction of Wikipedia Pages
Web Traffic Time Series Forecasting using ARIMA and LSTM RNN
THEORETICAL KNOWLEDGE OF TIME-SERIES ANALYSIS AND FORECASTING
The definition of time series analysis
Important terms to understand time series analysis
Discovering basic dataset information
The distribution of Language, Access, and Type of agent
How languages affect the view per page/project?
Using Prediction Model to forecast the traffic flow on Wikipedia
Check Stationary and Non-stationary
Auto Correlation and Partial Correlation
Building and Testing Model
CONCLUSION AND FUTURE WORKS
REFERENCES

CATALOG OF IMAGES
Chapter 3:
Figure 3.1 Basic Time-Series Prediction Model
Figure 3.2 Time Series Analysis Process
Figure 3.3 Time Series Analysis Elements
Figure 3.4 ARIMA model elements
Figure 3.5 Time Series Analysis Formula
Chapter 4:
Figure 4.1 Wikipedia Web Traffic Dataset in wrong time-series format
Figure 4.2 Wikipedia Web Traffic Dataset in proper time-series format
Figure 4.3 Web Traffic Dataset dividing into topic, language, access and type based on page/project name
Figure 4.4 Web Traffic Dataset dividing project, number of columns, no of column with nulls, % of nulls, total views, average views
Figure 4.5 The distribution of Language
Figure 4.6 The distribution of Access
Figure 4.7 The distribution of Type of agent
Figure 4.8 Pages/Projects in Different Language
Figure 4.9 Average Views Per Each Page/Project
Figure 4.10 Popular Pages In en
Figure 4.11 Popular Pages In es
Figure 4.12 Popular Pages In ru
Figure 4.13 Popular Pages In de
Figure 4.14 Popular Pages In ja
Figure 4.15 Popular Pages In fr
Figure 4.16 Popular Pages In zh
Figure 4.17 Popular Pages In common
Figure 4.18 Popular Pages In www
Figure 4.19 The number of views in English Pages - Awaken My Love
Figure 4.20 The number of views in English Pages - Cloverfield Lane
Figure 4.21 The number of views in English Pages - Fiji Water
Figure 4.22 The number of views in English Pages - Internet of things
Figure 4.23 The number of views in English Pages - 2016 North Korean Nuclear
Figure 4.24 The number of views in English Pages - TV Series Unfortunate Events
Figure 4.25 The number of views in English Pages - Hunky Dory
Figure 4.26 The number of views in Spanish Pages - Anorexia
Figure 4.27 The number of views in Spanish Pages - Charles Perrault
Figure 4.28 The number of views in Spanish Pages - Fiesta
Figure 4.29 The number of views in Spanish Pages - La usurpadora
Figure 4.30 The number of views in French Pages - Leicester
Figure 4.31 The number of views in French Pages - Roselyne
Figure 4.32 The number of views in French Pages - Noel
Figure 4.33 English Language represented in dataset
Figure 4.34 Japanese Language represented in dataset
Figure 4.35 German Language represented in dataset
Figure 4.36 Unidentified Language represented in dataset
Figure 4.37 French Language represented in dataset
Figure 4.38 Chinese Language represented in dataset
Figure 4.39 Russian Language represented in dataset
Figure 4.40 Spanish Language represented in dataset
Figure 4.41 Chart represented a specific English page, all access and all agents
Figure 4.42 Chart represented a specific Japanese page, all access and all agents
Figure 4.43 Chart represented a specific German page, all access and all agents
Figure 4.44 Chart represented a specific Unidentified page, all access and all agents
Figure 4.45 Chart represented a specific French page, all access and all agents
Figure 4.46 Chart represented a specific Chinese page, all access and all agents
Figure 4.47 Chart represented a specific Russian page, all access and all agents
Figure 4.48 Chart represented a specific Spanish page, all access and all agents
Figure 4.49 Auto Regressive and Partial Auto Regressive of specific English page, all access and all agents
Figure 4.50 Auto Regressive and Partial Auto Regressive of specific Japanese page, all access and all agents
Figure 4.51 Auto Regressive and Partial Auto Regressive of specific German page, all access and all agents
Figure 4.52 Auto Regressive and Partial Auto Regressive of specific Unidentified page, all access and all agents
Figure 4.53 Auto Regressive and Partial Auto Regressive of specific French page, all access and all agents
Figure 4.54 Auto Regressive and Partial Auto Regressive of specific Chinese page, all access and all agents
Figure 4.55 Auto Regressive and Partial Auto Regressive of specific Russian page, all access and all agents
Figure 4.56 Auto Regressive and Partial Auto Regressive of specific Spanish page, all access and all agents
Figure 4.57 Rolling Mean and Standard Deviation by Language: es
Figure 4.58 Rolling Mean and Standard Deviation by Language: zh
Figure 4.59 Rolling Mean and Standard Deviation by Language: fr
Figure 4.60 Rolling Mean and Standard Deviation by Language: en
Figure 4.61 Rolling Mean and Standard Deviation by Language: ns (commons)
Figure 4.62 Rolling Mean and Standard Deviation by Language: ru
Figure 4.63 Rolling Mean and Standard Deviation by Language: www
Figure 4.64 Rolling Mean and Standard Deviation by Language: de
Figure 4.65 Rolling Mean and Standard Deviation by Language: ja
Figure 4.66 Autocorrelation of Wikipedia Web Traffic Time Series
Figure 4.67 Partial Correlation of Wikipedia Web Traffic Time Series
Figure 4.68 Training Dataset by AR Model Predictions
Figure 4.69 Testing by AR Model Predictions
Figure 4.70 Training by ARIMA Model Predictions
Figure 4.71 Testing Dataset Model Predictions
Figure 4.72 Training Dataset by Prophet Model Predictions
Figure 4.73 Testing Dataset by Prophet Model Predictions
Figure 4.74 Log-scale page views of all Wikipedia pages through April 2015 to December 2016
Figure 4.75 Prophet Components Plot (Trend)
Figure 4.76 Prophet Components Plot (Weekly Trend)
Figure 4.77 Training Dataset by LSTM Model Predictions
Figure 4.78 Testing Dataset by LSTM Model Predictions

CATALOG OF ACRONYMS
ID   Acronym   Description
1    AR        Auto Regressive
2    ACF       Auto Correlation Function
3    ADF       Augmented Dickey-Fuller
4    ARIMA     Auto Regressive Integrated Moving Average
5    cuDNN     NVIDIA CUDA Deep Neural Network library
6    DDoS      Distributed Denial-of-Service
7    DWT       Discrete Wavelet Transform
8    FFT       Fast Fourier Transform
9    GDP       Gross Domestic Product
10   GPU       Graphics Processing Unit
11   I         Integrated
12   LSTM      Long Short-Term Memory
13   RNN       Recurrent Neural Network
14   RSS       Residual Sum of Squares
15   RegEx     Regular Expression
16   PACF      Partial Auto Correlation Function
17   TSA       Time Series Analysis
18   SMAPE     Symmetric Mean Absolute Percentage Error
19   URL       Uniform Resource Locator

ABSTRACT
In recent years, forecasting web page traffic has become increasingly important, which has increased the need to research strategies for effectively predicting the future values of numerous time series. In this study, I employ forecasting models to predict web traffic. I use the Web Traffic Time Series Forecasting dataset published by Google on Kaggle to predict future traffic to Wikipedia articles.
Web traffic prediction can also be useful to website owners in three further ways: finding an effective load-balancing method for cloud-based websites, forecasting future trends based on historical data, and understanding user behaviour. To achieve the goals of this research project, time-series models based on ARIMA and LSTM were built. Finally, to determine how well the suggested approach predicts future traffic to Wikipedia pages, I compare the results of the built models to find which is more effective and accurate. After analyzing the data and forecasting web traffic with these prediction models, I found that a more accurate and efficient method for predicting web traffic time series is to employ a Long Short-Term Memory recurrent neural network together with an autoregressive integrated moving average model.
The approach can predict how many people will visit a website in the future, and it improves as more user data is incorporated. The system may be used by any website to improve business analysis and load management [5], and even by companies planning future marketing campaigns.
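Comparing forecasting models, as the abstract describes, needs a common error measure. The abstract does not name one. The acronym catalogue lists SMAPE, which is also the evaluation metric of the Kaggle competition the dataset comes from, so the following is a minimal sketch assuming SMAPE is the measure used. The function name `smape`, the convention for days where both values are zero, and the sample numbers are illustrative assumptions, not results from this thesis.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 to 200).

    Assumption: a day where both the actual and the forecast value are
    zero contributes zero error instead of a division by zero.
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        if denom != 0:
            total += abs(a - f) / denom
    return 100.0 * total / len(actual)


# Hypothetical daily view counts for one page and two candidate forecasts.
actual = [100, 120, 0, 90]
forecast_a = [110, 115, 0, 80]
forecast_b = [150, 60, 10, 90]

score_a = smape(actual, forecast_a)
score_b = smape(actual, forecast_b)
print(f"model A: {score_a:.2f}%  model B: {score_b:.2f}%")
```

A lower score means the forecast is closer to the observed traffic, so under this measure the hypothetical model A would be preferred over model B. Because the dataset does not distinguish zero traffic from missing values, the zero-handling convention above affects the score and should be stated alongside any reported result.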