Abstract
Machine learning classifiers typically rely on the assumption of balanced training datasets, with sufficient examples per class to facilitate effective model learning. However, this assumption often fails to hold. Consider a common scenario where the positive class has only a few labelled instances compared to thousands in the negative class. This class imbalance, coupled with limited labelled data, poses a significant challenge for machine learning algorithms, especially in the ever-growing data landscape. This challenge is further amplified when dealing with short text datasets, as these inherently provide less information for computational models to leverage. While techniques like data sampling and fine-tuning pre-trained language models exist to address these limitations, our analysis reveals that they are inconsistent in achieving reliable performance. We propose a novel model that leverages contrastive learning within a two-stage approach to overcome these challenges. Our proposed framework involves unsupervised fine-tuning of a language model to learn representations of short text, followed by fine-tuning on a few labels integrated with GPT-generated text, using a novel contrastive learning algorithm designed to model short texts effectively and handle class imbalance simultaneously. Our experimental results demonstrate that the proposed method significantly outperforms established baseline models. Copyright © 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Original language | English |
---|---|
Title of host publication | Proceedings of Web Information Systems Engineering: 25th International Conference, WISE 2024 |
Editors | Mahmoud BARHAMGI, Hua WANG, Xin WANG |
Place of Publication | Singapore |
Publisher | Springer |
Pages | 60-75 |
ISBN (Electronic) | 9789819605736 |
ISBN (Print) | 9789819605729 |
DOIs | https://doi.org/10.1007/978-981-96-0573-6_5 |
Publication status | Published - 2025 |
Citation
Alsuhaibani, A., Razzak, I., Jameel, S., Wang, X., & Xu, G. (2025). CLIMB: Imbalanced data modelling using contrastive learning with limited labels. In M. Barhamgi, H. Wang, & X. Wang (Eds.), Proceedings of Web Information Systems Engineering: 25th International Conference, WISE 2024 (pp. 60-75). Springer. https://doi.org/10.1007/978-981-96-0573-6_5
Keywords
- Fine Tune
- Few labels
- Short text classifications
- Imbalanced data