Improving Machine Learning Methods for Handling Data Imbalance Problem
Abstract
Class-imbalanced datasets are widespread in many fields, including health, security, and banking. When trained on an imbalanced dataset, a standard supervised learning algorithm is biased toward the majority class. In real-life applications, however, the minority class instances are usually of greater interest than the majority class instances. To classify imbalanced datasets, numerous strategies have recently been employed in the literature, based on sampling methods (undersampling the majority class and oversampling the minority class), cost-sensitive learning, and ensemble learning. However, deleting majority class samples at random under a uniform distribution may cause needless data loss. In this paper, we propose three cluster-based undersampling models to prevent unnecessary data loss. First, we inject the test data into the training data for clustering. Then, we select the 25% of samples closest to the centroid and 25% from the boundary. For the last method, we remove 50% of the majority class data surrounding the minority class data. We evaluate our methods on 49 datasets and report auROC, auPR, F1-score, and MCC. According to the experimental results, the proposed methods are promising strategies for dealing with severely imbalanced datasets.
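The centroid-and-boundary selection described above can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes plain k-means over the majority class samples only, Euclidean distance, and that "boundary" means the samples farthest from their cluster centroid. The function names `kmeans` and `cluster_undersample`, the parameters `k`, `near_frac`, and `far_frac`, and the synthetic data are all hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    # Final assignment against the last centroid update.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


def cluster_undersample(X_major, k=3, near_frac=0.25, far_frac=0.25, seed=0):
    """Return sorted indices of the majority samples to keep.

    Assumption: within each cluster, keep the `near_frac` closest to the
    centroid and the `far_frac` farthest from it (taken here as "boundary").
    """
    centroids, labels = kmeans(X_major, k, seed=seed)
    keep = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if idx.size == 0:
            continue
        d = np.linalg.norm(X_major[idx] - centroids[j], axis=1)
        order = idx[np.argsort(d)]  # nearest to farthest
        n_near = int(round(near_frac * idx.size))
        n_far = int(round(far_frac * idx.size))
        keep.extend(order[:n_near])
        keep.extend(order[idx.size - n_far:])  # empty slice when n_far == 0
    return np.unique(np.array(keep, dtype=int))


# Usage on synthetic majority-class data (illustrative only).
rng = np.random.default_rng(0)
X_major = rng.normal(size=(200, 2))
kept = cluster_undersample(X_major, k=4)
print(len(kept), "of", len(X_major), "majority samples kept")
```

With both fractions set to 0.25, roughly half of the majority samples are retained, and the retained samples are chosen by their position within a cluster rather than uniformly at random.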
Collections
- M.Sc Thesis/Project [149]