VIETNAM NATIONAL UNIVERSITY HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Le Viet Ha

ENHANCING WEBSHELL DETECTION WITH DEEP LEARNING-POWERED METHODS

PHD DISSERTATION IN INFORMATION SYSTEMS

Major: Information Systems
Code: 9480104.01

PhD STUDENT: Le Viet Ha
SUPERVISORS: Nguyen Ngoc Hoa, Phung Van On

CONFIRMATION OF THE TRAINING UNIVERSITY

Ha Noi - 2024

DECLARATION OF AUTHORSHIP

I, Le Viet Ha, declare that this dissertation titled "ENHANCING WEBSHELL DETECTION WITH DEEP LEARNING-POWERED METHODS" and the work presented in it are my own. I confirm that:
- This work was done mainly while in candidature for the degree of Ph.D. at VNU University of Engineering and Technology.
- This dissertation has not previously been submitted for any degree.
- The results in my dissertation are my independent work, except where works in collaboration have been included.
- Other appropriate acknowledgments are given within this dissertation by explicit references.

Signed:
Date:

ACKNOWLEDGEMENTS

This dissertation would not have been possible without the support, guidance, and encouragement of many individuals. First and foremost, I would like to express my deepest gratitude to my supervisors, Associate Professor Nguyen Ngoc Hoa and Doctor Phung Van On, whose expertise, patience, and unwavering support have been instrumental in the completion of this research. Your insightful feedback and continuous motivation have pushed me to refine my work and think critically, for which I am profoundly grateful.
I am deeply appreciative of the support from my colleagues and friends, whose encouragement and camaraderie have provided me with the energy and resilience to persevere through the challenges of this journey. Lastly, but most importantly, I owe a great debt of gratitude to my family, whose love and understanding have been my constant source of strength. This accomplishment would not have been possible without you. Thank you all for your contributions to this work and to my life.
ABSTRACT

The increasing prevalence of webshell attacks poses a significant threat to web application security, necessitating the development of robust detection mechanisms. The dissertation pursues two research directions for detecting webshells: scanning web application source code and in-depth analysis of HTTP traffic. First, the dissertation proposes an advanced DL-Powered Source-Code Scanning Framework, called ASAF, that integrates signature-based techniques with deep learning algorithms to enhance the detection of both known and unknown webshells. We design the framework to facilitate the creation of customized detection models for various programming languages.
As case studies, the dissertation chose PHP as the interpreted language and ASP.NET as the compiled language, building a complete ASAF-based model for each and comparing it experimentally with other research results to demonstrate its effectiveness. Second, the dissertation introduces a deep neural network that detects webshells by analysing the HTTP traffic of web applications in real time. The study proposes an algorithm that improves the loss function used in the deep learning model to address the problem of data imbalance. To demonstrate its effectiveness, we evaluated the model and compared it with other studies on the same CSE-CIC-IDS2018 dataset.
We have also integrated the model with the NetIDPS system to improve its capacity to identify new webshells. The integrated system then proactively prevents these attacks by automatically adding attack source IPs to the blacklist and creating rules that block URIs querying webshells on the web server. These research contributions have been demonstrated through one national patent, two SCI-E journal articles, one E-SCI journal article, one national journal article, two WoS conference papers, and one pending patent, and have been applied in practice in the national research project, code number KC01.19/16-20, granted by the Ministry of Science and Technology of Vietnam.

TABLE OF CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
INTRODUCTION
    Research Motivations
    Research Challenges
    Objectives of Dissertation
    Research Scope
    Methodologies
    Research Contributions
1 THEORETICAL BACKGROUND AND PRELIMINARIES
    Fundamental Concepts
    Webshell Evasion
    Webshell Detection Approaches
    Webshell Dataset Collection
    Non-AI Approaches
    AI-Powered Source Code Analysis Approaches
    AI-Powered Network Analysis Approaches
    Dissertation Research Direction
    Summary of Chapter 1
2 DL-POWERED WEBSHELL DETECTION BY SOURCE CODE ANALYSIS
    Proposed DL-Powered Source Code Analysis Framework
    PHP Webshell Detection
    Yara-Based Analysis
    Dataset Collecting and Cleaning
    Hyperparameter Tuning CNN Model
    Experimental Results and Evaluation
    Results and Evaluation
    ASP.NET Webshell Detection
    Yara-Based Analysis
    CNN Model Hyperparameter Tuning
    Dataset Collecting and Cleaning
    Experimental Results and Evaluations
    Results and Evaluation
    Summary of Chapter 2
3 DL-POWERED PROACTIVE WEBSHELL DETECTION AND PREVENTION BY HTTP TRAFFIC ANALYSIS
    Proactive Webshell Detection and Prevention
    Deep Learning Intrusion Detection Model
    Webshell Detection and Prevention
    Handling Imbalanced Datasets
    Experiments and Evaluation
    Results and Evaluation
    Comparisons and Discussions
    Summary of Chapter 3
CONCLUSION AND FUTURE WORKS
    Contribution Highlights
    Dissertation Limitations
    Future Works
BIBLIOGRAPHY

LIST OF FIGURES

The conversion process from programming languages to machine code
Example of Apache web server architecture
Interpreter process
China Chopper webshell attack stages
Four stages of webshell attack
Webshell classification based on communication
Behinder webshell sample
Decoding and decrypting the obfuscated string
Contents of the deobfuscated function
Decoded system command
Classification of webshell features
Correlational links between ASAF components
Opcode vectorization module
Dataset collecting and cleaning
CNN model architecture
Proactive webshell detection method based on signatures and DNN
DNN architecture for webshell detection
Architecture of testbed system

LIST OF TABLES

Top 15 opcodes used exclusively by malware
Some widely used webshell datasets
Summary of related works
Non-duplicate benign and webshell datasets
PHP-ASAF hyperparameter tuning values
Confusion matrix of PHP webshell detection by using Yara
Key metrics of PHP webshell detection by using Yara (%)
Confusion matrix of PHP webshell detection by using Yara
Key metrics of PHP webshell detection by using CNN (%)
Confusion matrix of PHP webshell detection by using PHP-ASAF
Key metrics of PHP webshell detection by using CNN (%)
Comparison of different webshell detection approaches on our dataset (%)
ASP.NET-ASAF hyperparameter tuning values
ASP.NET webshell and benign datasets
Confusion matrix of ASP.NET webshell detection by using Yara
Key metrics of ASP.NET webshell detection by using Yara (%)
Confusion matrix of ASP.NET webshell detection by using CNN
Key metrics of ASP.NET webshell detection by using CNN (%)
Confusion matrix of webshell detection by using ASP.NET-ASAF
Key metrics of webshell detection by using ASP.NET-ASAF
Total flows in cleaned datasets
Number of training and testing samples
Hyperparameter optimization value
Result of hyperparameter optimization with 5-fold cross-validation for DS1
DLWSD 5-fold cross-validation with DS1
DLWSD 5-fold cross-validation with DS2
Weighted-DLWSD 5-fold cross-validation with DS1
Weighted-DLWSD 5-fold cross-validation with DS2
Experiment results with DS3 enhanced by balancing classes
Comparison of DLWSD with other methods with DS2

ABBREVIATIONS

APT      Advanced Persistent Threat
ANN      Artificial Neural Network
AES      Advanced Encryption Standard
CNN      Convolutional Neural Network
DNN      Deep Neural Network
DT       Decision Tree
DL       Deep Learning
HTTP     HyperText Transfer Protocol
IDS      Intrusion Detection System
IPS      Intrusion Prevention System
GBDT     Gradient Boosted Decision Trees
LSTM     Long Short-Term Memory
ML       Machine Learning
MLP      Multilayer Perceptron
NB       Naive Bayes
OpCode   Operation Code
RNN      Recurrent Neural Network
RSA      Rivest-Shamir-Adleman
SVM      Support Vector Machine
SSL      Secure Sockets Layer
TLS      Transport Layer Security
TF-IDF   Term Frequency - Inverse Document Frequency
RF       Random Forest
WAF      Web Application Firewall

INTRODUCTION

Research Motivations

Webshell Attack

Nowadays, digital transformation is considered an important and inevitable trend for many countries around the world. In Vietnam, digital transformation has become a topic of interest in recent years, most clearly demonstrated by the National Digital Transformation Program that has been issued. Advances in web development technology [22, 11] have made web applications increasingly popular, gradually replacing traditional native applications because they do not depend on the operating system.
Most applications serving e-government and digital transformation in Vietnam today are built on web platforms, a typical example being the National Public Service Portal system (https://dichvucong.vn/p/home/dvc-trang-chu.html). As a result, information security for web systems has become increasingly important. Malicious code injection (webshell) attacks [33, 95, 68] are the most common and also the most hazardous type of web application attack [28]. According to recent Microsoft 365 Defender data ("Web shell attacks continue to rise", https://www.com/en-us/security/blog/2021/02/11/web-shell-attacks-continue-to-rise), the use of webshell attacks has not only continued but also accelerated.
Webshell attacks [103] pose a severe threat to organisations due to the extensive damage and vulnerabilities they introduce after compromising web-facing servers. As pieces of malicious code written in common web development programming languages (e.g., ASP, PHP, and JSP) and installed on web servers, webshells allow attackers to remotely execute arbitrary system commands, exfiltrate sensitive files, install additional payloads, and pivot laterally into internal networks. Attackers can also use webshells to maintain stealthy persistence in order to prolong exploitation after the initial breach. Many advanced webshells feature extensive capabilities via graphical user interfaces, including brute-forcing credentials, uploading malware, and interacting with databases.
Once a webshell is uploaded, attackers have an unrestricted foothold within the victim’s infrastructure. Webshells are especially dangerous because they can bypass conventional network perimeter defences by using allowed protocols such as HTTP or HTTPS [96]. Their flexible and compact nature also allows webshells to evade detection through obfuscation and polymorphism [3, 65]. Overall, webshells represent a serious threat because they serve as a pivot point that gives attackers an unimpeded gateway into the victim's systems.
Advances in detection techniques have struggled to keep pace as attackers continually release new, heavily obfuscated webshell tools to evade defenses. Manual inspection is time-consuming, given that a single webshell update could require hours of expert reverse engineering. Detecting obfuscated webshells poses significant challenges for security research. Attackers continuously adapt their exploitation techniques to evade detection, deploying webshells that are encoded with schemes such as base64 or hex encoding or that use custom encryption schemes.
According to analysis from Cloudflare, over two-thirds of webshells exhibit some form of obfuscation. Advanced polymorphic webshells such as “Chameleon” can rapidly mutate their appearance across attacks while maintaining their core malicious functions. The ease of automating webshell obfuscation and morphing has outpaced improvements in detection approaches designed to discern underlying patterns in intentionally distorted malicious code. Defenders also face challenges in obtaining robust datasets that span the various obfuscation schemes and are needed to train machine learning models.
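The evasion effect described above can be illustrated with a minimal Python sketch. It assumes a hypothetical three-entry keyword list (`SIGNATURES`) and a toy PHP snippet; neither comes from the dissertation, and real rule sets are far richer. A plain substring scan flags the readable sample but misses a variant whose function name is base64-encoded, whereas a scan that first decodes base64-looking string literals recovers it.

```python
import base64
import binascii
import re

# Hypothetical keyword list, for illustration only.
SIGNATURES = ("system", "shell_exec", "passthru")

# Quoted string literals that consist only of base64 alphabet characters.
B64_LITERAL = re.compile(r"""['"]([A-Za-z0-9+/]{4,}={0,2})['"]""")


def naive_scan(source):
    """Return the set of signature keywords that appear verbatim in the source."""
    return {sig for sig in SIGNATURES if sig in source}


def decode_literals(source):
    """Decode every string literal that is valid base64 and valid UTF-8."""
    decoded = []
    for literal in B64_LITERAL.findall(source):
        try:
            raw = base64.b64decode(literal, validate=True)
            decoded.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not base64, or not text: ignore this literal
    return decoded


def decode_aware_scan(source):
    """Scan the source and, in addition, the decoded form of its base64 literals."""
    hits = naive_scan(source)
    for text in decode_literals(source):
        hits |= naive_scan(text)
    return hits


# Toy samples: a readable call and the same call with its name base64-encoded.
plain = "<?php system($_GET['cmd']); ?>"
hidden_name = base64.b64encode(b"system").decode("ascii")
obfuscated = f"<?php $f = base64_decode('{hidden_name}'); $f($_GET['cmd']); ?>"

print("naive scan, plain sample:       ", naive_scan(plain))
print("naive scan, obfuscated sample:  ", naive_scan(obfuscated))
print("decode-aware, obfuscated sample:", decode_aware_scan(obfuscated))
```

The decode-aware scan only handles one encoding layer of one scheme. Hex-encoded, nested, or custom-encrypted payloads would each need their own decoding step, which is one reason purely signature-based scanning struggles to keep pace with automated obfuscation.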
Webshell Detection

Two primary approaches to webshell detection exist: source code analysis and network-based analysis.

Source code analysis directly analyses web application source code for webshells using analysis tools. It works by inspecting repositories for suspicious functions, commands, file inclusions, or other constructs indicative of a webshell payload. This makes it possible to identify inactive webshells injected into the code before production deployment.
Analysing source code rather than running software also makes it possible to catch webshells compiled directly into applications. However, code analysis faces challenges in detecting highly obfuscated or customised webshells designed to mask their malicious intent. Without runtime context, benign code can also generate false positives.

Network-based webshell detection [98] operates by analysing web traffic as it enters or exits the network perimeter.
This is commonly implemented through Web Application Firewalls (WAFs) [10, 36] or Intrusion Detection and Prevention Systems (IDPSs) [67, 8, 7, 15] examining packets and connections.