Analysis of Gene Expression Data for Glioma Grade Classification and Survival Time Prediction
Abstract
A glioma is a type of tumor that originates in the glial cells of the brain or spine. Gliomas account for about 30 percent of all brain and central nervous system tumors and 80 percent of all malignant brain tumors. Brain tumors arise when there is a defect in the DNA of normal brain cells. To understand the genetic basis of the tumor, we need to identify the candidate genes. Genes are segments of DNA that contain the code for a specific protein that functions in one or more types of cells in the body. Gene expression data are high dimensional, and genes vary in size depending on the sizes of the proteins for which they code. Because of this complexity and high dimensionality, the classification of tumor samples remains a challenge. In this work, we present a comparative study of different feature selection methods and propose a new methodological approach that identifies gene expression patterns useful for classifying unknown samples. If tumor types are classified correctly and survival time is predicted, treatment planning can be improved. We propose a novel pipeline framework for glioma analysis that applies several feature selection algorithms followed by classifier selection to predict the tumor type from gene expression data, to find the top n genes as features for tumor type, and to estimate the possible survival time in days. We applied twelve well-known machine learning algorithms, namely Decision Trees (DT), Random Forest (RF), Bagging (BAG), Gradient Boosting (GB), Gaussian Naïve Bayes (NB), Multi-Layer Perceptron (MLP), Support Vector Machines (SVM), Logistic Regression (LR), K-Nearest Neighbors (KNN), AdaBoost (AB), Linear Discriminant Analysis (LDA), and Extra Trees Classifier (ET), with five feature selection methods: Univariate Feature Selection, Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), Independent Component Analysis (ICA), and Factor Analysis (FA), on two datasets. The datasets were collected from the National Center for Biotechnology Information and used to analyze the accuracy of tumor type prediction, known as grade classification, and of survival time in days. Univariate Feature Selection achieved the best performance on both datasets compared with the other feature selection methods. On the Freije dataset, the best grade classification accuracy was achieved by the AdaBoost classifier (98.75%), whereas on the Phillips dataset the Extra Trees classifier had the best accuracy (95.67%). For survival classification, SVM achieved the best accuracy on both the Freije dataset (94.5%) and the Phillips dataset (85.75%). The median survival time is 1098 days for the Freije dataset and 2275 days for the Phillips dataset.
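To illustrate what univariate selection of the top n genes might look like, the following is a minimal pure-Python sketch, not the thesis implementation. It assumes a one-way ANOVA F statistic as the per-gene score; the abstract does not state which univariate score was used. The function names (f_score, top_n_genes), the grade labels, and the synthetic data are hypothetical stand-ins and are not taken from the Freije or Phillips datasets.

```python
import random
from statistics import mean


def f_score(values, labels):
    """One-way ANOVA F statistic for a single gene across label groups."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    k, n = len(groups), len(values)
    between = 0.0  # variation of group means around the grand mean
    within = 0.0   # variation of samples around their own group mean
    for g in groups.values():
        m = mean(g)
        between += len(g) * (m - grand) ** 2
        within += sum((v - m) ** 2 for v in g)
    between /= k - 1
    within /= n - k
    return between / within if within > 0 else float("inf")


def top_n_genes(X, labels, n):
    """Return the column indices of the n genes with the largest F scores."""
    n_genes = len(X[0])
    scores = [f_score([row[j] for row in X], labels) for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)[:n]


# Synthetic stand-in data (assumption): 40 samples, 50 genes, two grades.
# Only genes 2 and 7 are shifted in the second grade; the rest are noise.
random.seed(0)
labels = [3] * 20 + [4] * 20
INFORMATIVE = {2, 7}
X = [
    [random.gauss(4.0 if (j in INFORMATIVE and y == 4) else 0.0, 1.0)
     for j in range(50)]
    for y in labels
]

selected = top_n_genes(X, labels, n=2)
print(sorted(selected))
```

In a full pipeline of the kind the abstract describes, the selected columns would then be passed to each candidate classifier, and the classifier with the best cross-validated accuracy would be kept.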