VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF INFORMATION SYSTEMS

Pham Minh Quang - 19522099
Nguyen Huynh Thao Nhu - 19521970

GRADUATION THESIS
Machine Learning Methods for Cancer Classification
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS

THESIS ADVISOR
Ph. NGUYEN THANH BINH

HO CHI MINH CITY, 2024

ASSESSMENT COMMITTEE

The Assessment Committee is established under Decision ............, dated ............, of the Rector of the University of Information Technology.
1. ............ - Chairman.
2. ............ - Secretary.
3. ............

ACKNOWLEDGEMENT

We would like to express our sincere appreciation to the University of Information Technology for providing an enriching and supportive environment that consistently offers motivating opportunities for individual growth in both learning and research. The diverse array of events, including seminars, student fairs, job fairs, and career days, has played a pivotal role in shaping our personal and professional development.
Furthermore, our heartfelt gratitude extends to our dedicated supervisor, Ph. Nguyen Thanh Binh, whose unwavering support and academic guidance have been indispensable throughout this research journey. Nguyen Thanh Binh not only played a crucial role in shaping the direction and quality of our work but also provided continuous support and valuable insights. Without Nguyen Thanh Binh's guidance, the excellence achieved in this thesis would not have been possible.
In addition to acknowledging our academic mentors, we express deep appreciation to our families for their unwavering support during the challenging process of completing this thesis. Their presence by our side alleviated our worries and stress, allowing us to concentrate on our academic pursuits with greater ease.

Throughout the thesis writing process, we remained committed and worked relentlessly. We made a concerted effort to overcome every hurdle with tenacity and enthusiasm, even though the journey was not without its share of hardships and challenges.
We hope that the final thesis demonstrates our commitment and diligence. Should any flaws be found in the finished product, we ask our mentors for empathy and understanding. Their patience and encouragement will be a great source of inspiration for us as we work to grow and learn from this experience. We sincerely thank our mentors for the support and guidance they provided throughout this journey, and we trust that our commitment will be evident in the final achievement of the thesis.
We believe that the finished product accurately captures our passion and commitment.

With our sincere appreciation,
Pham Minh Quang
Nguyen Huynh Thao Nhu

DEPARTMENTAL COMMENTS

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
CHAPTER 2: BACKGROUND AND THEORY
Basic Knowledge of Gene
What is DNA?
What is Gene?
Basic Knowledge of Gene Expression
Stages in Gene Expression
Regulation of gene expression
Post-transcriptional regulation
Post-translational regulation
Methods for measuring gene expression
Quantitative Polymerase Chain Reaction (QPCR)
Basic Knowledge of Acute Leukemia (Blood Cancer)
What is Blood Cancer?
What is Acute Lymphoblastic Leukaemia (ALL)?
Definition of Acute Lymphoblastic Leukaemia
Types of ALL
Philadelphia positive ALL
What is Acute Myeloid Leukaemia (AML)?
Definition of Acute Myeloid Leukaemia
Types of AML
AML starts in the bone marrow
Symptoms of blood cancer
Cancer-Causing Agents
Dangerous level of blood cancer
Impact of Initial and Prolonged Exposure to Carcinogen
Blood cancer treatment
Machine Learning Based Approaches
Advantages and disadvantages
Advantages and Disadvantages of Random Forest
XGBoost
Advantages and disadvantages

CHAPTER 3: RESEARCH METHODOLOGY
About the Dataset
Explicate Problem

CHAPTER 4: OUR EXPERIMENTS AND RESULTS
General Processing Model
Import actual dataset the ALL/AML label
Import training set and testing set
Feature Engineering
Model Building
Classification Report Confusion Matrix of Naive Bayes Model
Logistic Regression
Confusion Matrix of Logistic Regression
Classification Report Confusion Matrix of Logistic Regression Model
Support Vector Machine
Confusion Matrix of Support Vector Machine
Classification Report Confusion Matrix of Support Vector Machine Model
Confusion Matrix of Decision Tree
Classification Report Confusion Matrix of Decision Tree Model
Random Forest
Confusion Matrix of Random Forest
Classification Report Confusion Matrix of Random Forest Model
Confusion Matrix of XGBoost
Classification Report Confusion Matrix of XGB Model
Confusion Matrix of Adaboost
Classification Report Confusion Matrix of Adaboost Model
Confusion Matrix of Neural Network
Classification Report Confusion Matrix of Neural Network Model
Confusion Matrix of K-means Clustering
Classification Report Confusion Matrix of K-means Clustering Model
Compare Evaluation of Built Models
Performance Metrics
Conclusion of comparing the evaluation of the models

CHAPTER 5: CONCLUSION AND FUTURE RESEARCH DIRECTIONS

CHAPTER 6: REFERENCES

LIST OF FIGURES

Structure of the DNA Double Helix
Figure 3. Chromosomes of Human Genome
Figure 5. Gene Expression
Regulation of Transcription in Eukaryotic Cells
Different types of blood cancer: (A) Leukemia, (B) Lymphoma, and (C) Myeloma
Acute Lymphoblastic Leukemia
Figure 11. Schematic representation of the Philadelphia chromosome
Acute Myeloid Leukaemia
Figure 13. Diagram of a person's hip bones
Figure 14. Different stages that cells ...
Figure 15. Symptoms of blood cancer
Figure 16. Stages of Blood Cancer
Figure 17. Current available treatment strategy for Blood Cancer
Figure 19. Support Vector Machine (SVM)
Figure 20. Random Forest
Figure 21. Random Forest Formula
Figure 22. XGBoost Formula
Figure 23. Adaboost
Figure 24. Neural Network
Figure 25. Neural Network Formula
Figure 26. Visualization of K-Means
Figure 34. Data Scaling
Figure 37. Confusion Matrix of Naive Bayes Model with Origin Data
Figure 38. Confusion Matrix of Naive Bayes Model with Data Scaler
Confusion Matrix of Naive Bayes Model with PCA
Figure 40. Confusion Matrix of Logistic Regression Model with Origin Data
Confusion Matrix of Logistic Regression Model with Data Scaler
Confusion Matrix of Logistic Regression Model with PCA
Figure 43. Confusion Matrix of Support Vector Machine Model with Origin Data
Confusion Matrix of Support Vector Machine Model with Data Scaler
Confusion Matrix of Support Vector Machine Model with PCA
Confusion Matrix of Decision Tree Model with Origin Data
Confusion Matrix of Decision Tree Model with Data Scaler
Confusion Matrix of Decision Tree Model with PCA