Undersampling strategy based on clustering to improve the performance of splice site classification in human genes
Abstract
The recognition of splice sites plays an important role in annotating the structure of eukaryotic genes. Detecting such sites is a highly imbalanced classification task because the number of negative examples found in DNA sequences is much higher than the number of positive ones. One possible strategy for dealing with this imbalance is to use training sets that are more balanced than the original dataset. It is then necessary to choose which subset of the majority-class examples will compose those sets. To improve learning in this problem, we propose a new undersampling procedure in which the negative examples used to train the classifier are selected based on clusters obtained from the majority class. The experimental results show that, for the splice site problem, this strategy can increase classification performance compared to simpler undersampling techniques. © 2013 IEEE.
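
As an illustration of the general idea of cluster-based undersampling, the sketch below clusters the negative (majority) examples and then draws a roughly proportional number of examples from each cluster. It is not the procedure evaluated in the paper, whose details the abstract does not give. The one-hot encoding, the plain k-means clustering, the proportional quota per cluster, and the names `one_hot`, `kmeans`, and `cluster_undersample` are all assumptions made for this example.

```python
import numpy as np

def one_hot(seqs):
    """Encode equal-length DNA strings over A/C/G/T as flat one-hot vectors (assumed encoding)."""
    alphabet = "ACGT"
    out = np.zeros((len(seqs), len(seqs[0]) * 4))
    for i, s in enumerate(seqs):
        for p, base in enumerate(s):
            out[i, 4 * p + alphabet.index(base)] = 1.0
    return out

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd-style k-means; returns the cluster label of each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center: shape (n, k).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave the center unchanged if its cluster is empty
                centers[j] = members.mean(axis=0)
    return labels

def cluster_undersample(X_neg, n_keep, k=5, seed=0):
    """Return indices of roughly n_keep negatives, sampled proportionally per cluster."""
    rng = np.random.default_rng(seed)
    labels = kmeans(X_neg, k, seed=seed)
    keep = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx) == 0:
            continue
        # Hypothetical quota rule: proportional to cluster size, at least one example.
        quota = min(len(idx), max(1, round(n_keep * len(idx) / len(X_neg))))
        keep.extend(rng.choice(idx, size=quota, replace=False))
    return np.array(sorted(keep), dtype=int)

# Toy usage with random sequences standing in for negative splice site candidates.
rng = np.random.default_rng(0)
negatives = ["".join(row) for row in rng.choice(list("ACGT"), size=(200, 12))]
X_neg = one_hot(negatives)
selected = cluster_undersample(X_neg, n_keep=40, k=5)
balanced_negatives = X_neg[selected]  # combine with the positives to train a classifier
```

Because the quotas are rounded per cluster, the number of selected negatives can differ from `n_keep` by a few examples. Sampling from every cluster is one way to keep the reduced training set representative of the different regions of the majority class, in contrast to purely random undersampling.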