Bags of Tricks for Multi-Label Classification

Tips and essentials for boosting your model's performance in multi-label classification

Andy Wang
5 min read · Aug 26, 2021

What is Multi-Label Classification?

As we all may know, binary classification classifies a given input into one of two classes, 1 or 0. Multi-label (or multi-target) classification predicts multiple binary targets at once from the same input. For example, a model could predict whether a given picture shows a dog or a cat, and whether the animal has long or short fur.

Unlike in multi-class classification, the targets are not mutually exclusive in multi-label classification, meaning that one input can belong to multiple classes at the same time. Here, I will share a few tips that could improve the performance of your multi-label classification model.

Scoring Metrics

Most metrics that are used for binary classification can be applied to multi-label problems by calculating the metric for each label column, then averaging the scores. One metric we could use is log loss, or binary cross-entropy. For a metric that better accounts for class imbalance, we could use column-wise ROC-AUC.
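As a minimal sketch of this column-wise averaging (y_true and y_prob stand in for your validation labels and predicted probabilities, both of shape (n_samples, n_labels)):

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

def column_wise_scores(y_true, y_prob):
    """Average binary log loss and ROC-AUC over the label columns."""
    losses, aucs = [], []
    for j in range(y_true.shape[1]):
        losses.append(log_loss(y_true[:, j], y_prob[:, j]))
        aucs.append(roc_auc_score(y_true[:, j], y_prob[:, j]))
    return np.mean(losses), np.mean(aucs)

# roc_auc_score(y_true, y_prob, average="macro") gives the same averaged AUC directly.
```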

The ROC curve (source: Wikipedia)

Modeling Tricks

Before we get to the fancier tricks for the features, I have a few tips on designing a model that suits the multi-label setting.

For most non-NN models, our only option is to train one classifier for each target and combine the predictions afterward. The scikit-learn library provides a simple wrapper class for this, OneVsRestClassifier.
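A minimal sketch of the wrapper in action, assuming X is a feature matrix and Y a binary label-indicator matrix of shape (n_samples, n_labels):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One logistic regression is fitted per label column of Y.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

# Probabilities come back with one column per label.
probs = clf.predict_proba(X)
```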

Although this makes your classifier able to perform multi-label tasks, it is not an approach you should default to, as it has a couple of disadvantages. First, the training time will be relatively long, since we train a new model for each target. Second, the models cannot learn the relationships between different labels, i.e. label correlation.

The second problem can be addressed with two-stage training, where we use the predictions of the targets, combined with the original features, as the input data for the second-stage models. The downside is that the training time increases drastically, since you now have to train double the number of models you had before.
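Here is one way such a two-stage setup might look (a sketch only; X and Y are assumed to be NumPy arrays, and out-of-fold predictions are used in stage one so the targets do not leak into stage two):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier

# Stage 1: out-of-fold predictions for every label.
oof = np.zeros(Y.shape)
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    stage1 = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    stage1.fit(X[train_idx], Y[train_idx])
    oof[valid_idx] = stage1.predict_proba(X[valid_idx])

# Stage 2: original features plus the stage-1 label predictions.
stage2 = OneVsRestClassifier(LogisticRegression(max_iter=1000))
stage2.fit(np.hstack([X, oof]), Y)
```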

Neural networks (NNs) are more suitable for this situation. The number of labels is simply the number of output neurons in the network. We can then apply any binary classification loss to those outputs, and the model predicts all targets simultaneously. This fixes both problems of non-NN models: we only need to train one model, and the network can learn label correlations through its shared hidden layers.
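A minimal Keras sketch of such a network (the layer sizes and training settings are illustrative; X and Y are the same arrays as above):

```python
import tensorflow as tf

n_features, n_labels = X.shape[1], Y.shape[1]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(128, activation="relu"),
    # One sigmoid output per label, so every target is predicted at once.
    tf.keras.layers.Dense(n_labels, activation="sigmoid"),
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(multi_label=True)])
model.fit(X, Y, epochs=10, batch_size=32, validation_split=0.2)
```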


Supervised Feature Selection Methods

Before you get started on any feature engineering or selection, your features should be normalized or standardized. Using a QuantileTransformer will reduce the skewness of your data so that the features follow an approximately normal distribution. Another option is to standardize the features, which is done by subtracting the mean from the data, then dividing by the standard deviation. Both transformations aim to make your data more robust for the model, but the QuantileTransformer is more computationally expensive.
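Both transforms are available in scikit-learn; a quick sketch (X is again your feature matrix):

```python
from sklearn.preprocessing import QuantileTransformer, StandardScaler

# Map each feature to an approximately normal distribution.
qt = QuantileTransformer(output_distribution="normal", random_state=0)
X_quantile = qt.fit_transform(X)

# Or standardize: subtract the mean, divide by the standard deviation.
scaler = StandardScaler()
X_standard = scaler.fit_transform(X)
```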

Using supervised feature selection methods is a bit tricky in this situation, as most algorithms are designed for a single target. To work around this, we can convert the multi-label problem into a multi-class one. One popular approach is the label powerset, where each unique label combination in the training data is converted into one class. The scikit-multilearn library has tools for this.


After the transformation, we can use methods such as information gain and chi2 to select features. While this approach is viable, things become tricky when we have hundreds or even thousands of unique label combinations; this is where unsupervised feature selection methods can be better.
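A sketch of the idea, assuming Y is the label-indicator matrix and X_nonneg a non-negative version of the features (chi2 requires non-negative inputs, e.g. min-max scaled values); scikit-multilearn's LabelPowerset provides the same label transformation:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Label powerset: map each unique row of the label matrix to a single class id.
_, y_powerset = np.unique(Y, axis=0, return_inverse=True)

# Chi-squared feature selection against the powerset classes.
X_selected = SelectKBest(chi2, k=20).fit_transform(X_nonneg, y_powerset)

# mutual_info_classif can be swapped in as an information-gain style criterion.
```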

Unsupervised Feature Selection Methods

With unsupervised methods, we don't need to account for the multi-label structure at all, since these methods do not depend on the labels.

Here are a few algorithms:

  • Principal Component Analysis or other similar factor analysis methods. These remove redundant information from the features and extract useful insights for the model. One important note is to standardize the data before applying PCA, so that every feature contributes equally to the analysis. Another tip: instead of keeping only the reduced features that the algorithm produces, we can concatenate them back onto the original data as extra information that the model can choose to use (see the sketch after this list).
  • Variance Threshold. This is a simple yet effective way to reduce the dimensionality of the features: we throw out the features that have a low variance, or spread. This can be tuned by finding a better threshold for selection; 0.5 is a good starting point.
  • Clustering. We can create new features by clustering the input data, then assigning the corresponding cluster id to each row as a new feature column.
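A combined sketch of these three ideas (the component count, threshold, and cluster count are illustrative choices, not recommendations):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)  # standardize before PCA

# PCA: keep a few components and append them to the original features.
pca_features = PCA(n_components=10, random_state=0).fit_transform(X_std)
X_augmented = np.hstack([X, pca_features])

# Variance threshold: drop near-constant features.
X_reduced = VarianceThreshold(threshold=0.5).fit_transform(X)

# Clustering: use each row's cluster id as one extra feature column.
cluster_ids = KMeans(n_clusters=8, random_state=0).fit_predict(X_std)
X_with_clusters = np.hstack([X, cluster_ids.reshape(-1, 1)])
```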

Upsampling Methods

Upsampling methods are used when the classification data is highly imbalanced: we generate artificial samples for the rare class(es) so that the model pays more attention to them. For creating new samples in the multi-label setting, we can use MLSMOTE, or Multi-label Synthetic Minority Over-sampling Technique. An implementation can be found here.

MLSMOTE is a modification of the original SMOTE method. After we generate a synthetic data point for the minority class and assign it the corresponding minority label, we also generate the other labels for that data point by counting how many times each label appears among the neighboring data points and keeping every label whose frequency exceeds half of the data points counted.
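A condensed sketch of that procedure for a single rare label (mlsmote_sketch, minority_label, n_samples, and k are names and parameters introduced here for illustration; the linked implementation covers the full algorithm):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mlsmote_sketch(X, Y, minority_label, n_samples=100, k=5, random_state=0):
    """Generate synthetic samples for one rare label, MLSMOTE-style."""
    rng = np.random.default_rng(random_state)
    idx = np.where(Y[:, minority_label] == 1)[0]          # rows carrying the rare label
    neigh = NearestNeighbors(n_neighbors=k + 1).fit(X[idx]).kneighbors(X[idx])[1]

    new_X, new_Y = [], []
    for _ in range(n_samples):
        i = rng.integers(len(idx))
        j = neigh[i][rng.integers(1, k + 1)]               # random neighbor, skipping self
        lam = rng.random()
        # SMOTE-style interpolation between the sample and its neighbor.
        new_X.append(X[idx[i]] + lam * (X[idx[j]] - X[idx[i]]))
        # Keep every label present in more than half of the reference + neighbors.
        group = Y[idx[neigh[i]]]
        new_Y.append((group.sum(axis=0) > (k + 1) / 2).astype(int))
    return np.vstack(new_X), np.vstack(new_Y)
```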

Conclusion

In this article, we discussed multiple methods and approaches that can be utilized to improve the performance of your multi-label classification model.
