J. Duhaney, T. Khoshgoftaar, and A. Napolitano – 11th International Conference on Machine Learning and Applications (ICMLA), December 2012
Class imbalance is prevalent in many real-world datasets. It occurs when one or more classes in a dataset contain significantly fewer examples than the remaining classes. When trained on highly imbalanced datasets, traditional machine learning techniques often ignore the minority class(es) and label all instances as belonging to the majority class in order to maximize accuracy. This problem has been studied in many domains, but there is little or no research on the effect of class imbalance in fault data for condition monitoring of an ocean turbine. This study takes a first step toward bridging that gap by providing insight into how class imbalance in vibration data can affect a learner’s ability to reliably identify changes in the ocean turbine’s operational state. To do so, we empirically evaluate the performance of three popular, but very different, machine learning algorithms trained on four datasets with varying class distributions (one balanced and three imbalanced) to distinguish between a normal and an abnormal state. All data used in this study were collected from the testbed for an ocean turbine and were undersampled to simulate the different levels of imbalance. We find, as in other domains, that all three learners tended to perform worse overall when trained on data with a highly skewed class distribution (0.1% of examples in a faulty/abnormal state and the remaining 99.9% captured in a normal operational state). However, the Logistic Regression and Decision Tree classifiers performed better when only 5% of the examples represented an abnormal state (the remaining 95% indicating normal operation) than they did when no imbalance was present.
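To make the undersampling idea and the accuracy problem concrete, here is a minimal Python sketch. It is an illustration only, not the procedure used in the study: the function names are hypothetical, and it assumes that imbalance is simulated by keeping every normal example and randomly discarding abnormal ones until they form the desired fraction of the dataset.

```python
import random


def undersample_to_ratio(normal, abnormal, minority_fraction, seed=0):
    """Keep all normal examples and randomly keep just enough abnormal
    examples that they make up `minority_fraction` of the result.

    Assumption: the abnormal class is the one being undersampled.
    """
    if not 0 < minority_fraction < 1:
        raise ValueError("minority_fraction must be strictly between 0 and 1")
    # Solve n_abnormal / (n_normal + n_abnormal) = minority_fraction.
    n_abnormal = round(minority_fraction * len(normal) / (1 - minority_fraction))
    if n_abnormal > len(abnormal):
        raise ValueError("not enough abnormal examples for this fraction")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return list(normal), rng.sample(abnormal, n_abnormal)


def majority_baseline_accuracy(n_normal, n_abnormal):
    """Accuracy of a classifier that labels every example as normal."""
    return n_normal / (n_normal + n_abnormal)
```

With 9,990 normal examples and a target of 0.1%, this keeps 10 abnormal examples, and a classifier that always predicts "normal" already scores 99.9% accuracy while detecting no faults at all. That is why accuracy alone is a misleading measure on such data.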