Computer Vision News

Class Imbalance in Classification Tasks

Use the right performance metrics

The last important thing to remember is to always use the right metrics for the task. With imbalanced datasets, accuracy is dominated by the majority class, so it is often preferable to use metrics such as precision (the positive predictive value) and recall (sensitivity, or true positive rate) and to visualise performance using ROC curves and confusion matrices. The counts and scores behind these can be tracked as metrics when compiling the model. The snippet below is an example of how to add these metrics to the analysis of any Keras classification model.

    # Assumes the Keras API bundled with TensorFlow
    from tensorflow import keras

    # Confusion-matrix counts plus the summary metrics derived from them
    METRICS = [
        keras.metrics.TruePositives(name='tp'),
        keras.metrics.FalsePositives(name='fp'),
        keras.metrics.TrueNegatives(name='tn'),
        keras.metrics.FalseNegatives(name='fn'),
        keras.metrics.BinaryAccuracy(name='accuracy'),
        keras.metrics.Precision(name='precision'),
        keras.metrics.Recall(name='recall'),
        keras.metrics.AUC(name='auc'),
    ]

    # `model` is any classification model defined beforehand
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=METRICS)

Let's see the ROC curves generated by the customised function highlighted earlier. These are an important indication of how sensitivity and specificity vary in the model and across labels.*

*All graphs were produced using only a random subset of the NIH Chest X-ray Dataset, representing 5% of the 112,120 X-ray images with disease labels from 30,805 unique patients. Better results can of course be obtained by using a higher proportion of the dataset.

Conclusion

I hope this article helps you get some ideas on techniques and tools to start dealing with class imbalance. Naturally, an optimal fix would be to acquire more data, enlarge the dataset through an enhanced labelling process, or include synthetic data in the study. But, as we just showed, other options are available, and they also extend to bootstrapping and more complex cost modelling. An important tool is offered by the imblearn library, which has been used extensively in this tutorial. Have fun exploring it!
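As a side note on the metrics section, it may help to see how precision, recall and the points of a ROC curve follow from the four confusion-matrix counts (tp, fp, tn, fn) that the Keras metrics track. The following is a minimal sketch in plain Python, not the customised function used in this article: the helper names (confusion_counts, summary_metrics, roc_points) are hypothetical, and the labels and scores are made-up toy values, not taken from the NIH dataset.

```python
def confusion_counts(y_true, y_score, threshold=0.5):
    """Return (tp, fp, tn, fn) for binary labels and predicted scores."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summary_metrics(tp, fp, tn, fn):
    """Derive accuracy, precision, recall and specificity from the counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


def roc_points(y_true, y_score, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold."""
    points = []
    for t in thresholds:
        tp, fp, tn, fn = confusion_counts(y_true, y_score, t)
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        points.append((fpr, tpr))
    return points


# Toy, made-up data: 2 positives among 10 samples (imbalanced on purpose).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.3, 0.6, 0.2, 0.1, 0.1, 0.2, 0.3, 0.1, 0.4]

counts = confusion_counts(y_true, y_score, threshold=0.5)
metrics = summary_metrics(*counts)
curve = roc_points(y_true, y_score, thresholds=[0.0, 0.25, 0.5, 0.75, 1.01])

print(counts)
print(metrics)
print(curve)
```

On this toy data the accuracy is 0.8 even though only half of the positive samples are found (recall 0.5), which is the kind of gap that precision, recall and ROC curves make visible on imbalanced datasets.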

Computer Vision News - May 2021