This project addresses the challenges posed by long-tailed data distributions and low-resource scenarios.
Such conditions make it difficult for models to generalize across classes with sparse data representation, highlighting the need for robust solutions.
Additionally, the project investigates biases and failure cases in Vision-Language Models like CLIP, which often struggle with specific dataset characteristics.
Few Shot Learning;
LoRA;
BitFit;
Meta-Adapter;
Data Augmentation;
Python;
PyTorch;
Stable Diffusion;
The primary objective of this project was to evaluate the effectiveness of few-shot learning methods in addressing long-tailed datasets.
Through analyzing failure cases and exploring method combinations, the project aimed to enhance model robustness in low-resource settings and identify opportunities for improving Vision-Language Models like CLIP.
This project was developed as part of the "Trends and Applications of Computer Vision" course at the University of Trento during the academic year 2023/2024, in the last year of my Master's degree in Artificial Intelligence Systems.
The course, taught by professors Massimiliano Mancini and Giulia Boato, focuses on exploring cutting-edge research topics in computer vision, providing students with the opportunity to engage with real-world challenges and extend state-of-the-art methodologies.
Real-world datasets often exhibit a long-tailed distribution where a few classes (head classes) are well-represented, while the majority (tail classes) lack sufficient examples. This imbalance creates challenges for machine learning models, which struggle to generalize effectively across all classes. The issue becomes more pronounced in low-resource scenarios, where collecting additional data is impractical. Vision-Language Models like CLIP, while powerful, face difficulties in these settings due to biases and their reliance on extensive pre-trained datasets.
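To make the imbalance concrete, here is a synthetic Zipf-like class profile of the kind commonly used to model long-tailed datasets (the numbers are illustrative, not EuroSAT's actual counts):

```python
import numpy as np

# Synthetic Zipf-like class-frequency profile: a few head classes hold
# most of the samples, while tail classes are left with a handful each.
num_classes = 10
ranks = np.arange(1, num_classes + 1)
counts = (1000 / ranks**1.5).astype(int)  # head classes large, tail tiny

head = counts[:2].sum()   # two best-represented classes
tail = counts[2:].sum()   # everything else
print(f"class counts: {counts.tolist()}")
print(f"head share: {head / counts.sum():.2%}")
```

With this profile, just two of the ten classes account for well over half of all samples, which is exactly the regime where standard training over-fits the head and under-serves the tail.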
This project began with a comprehensive literature review, categorizing existing techniques for addressing long-tailed data distributions into four main approaches:
From these categories, four relevant techniques were selected and implemented for our study:
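Among the selected techniques, LoRA injects trainable low-rank updates into frozen weight matrices. A minimal PyTorch sketch of the idea, reduced to a single linear layer (rank, scaling, and initialization are chosen for illustration and are not the project's exact settings):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base weights plus a trainable
    low-rank update, W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512), r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4 * 512 = 4096 trainable parameters vs ~262k frozen
```

Because `B` starts at zero, the wrapped layer initially behaves exactly like the frozen original, and only the small `A`/`B` matrices receive gradients.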
The experiments were designed to assess model performance, particularly in failure cases. Key steps included:
The EuroSAT dataset, which consists of satellite imagery for land use and cover classification, was chosen for its long-tailed yet structured nature. The baseline zero-shot CLIP configuration demonstrated noticeable strengths for certain classes, such as Annual Crop Land and Highway or Road, due to distinct visual patterns.
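The zero-shot baseline scores each image against a text prompt per class and picks the most similar one. A sketch of those mechanics, with random tensors standing in for CLIP's actual encoder outputs (the prompt template and ~100 logit scale follow common CLIP usage):

```python
import torch
import torch.nn.functional as F

# CLIP-style zero-shot classification. Random tensors stand in for the
# encoders' outputs; class names follow EuroSAT's label set.
classes = ["annual crop land", "highway or road", "sea or lake", "river"]
torch.manual_seed(0)
image_emb = F.normalize(torch.randn(1, 512), dim=-1)            # image encoder output
text_emb = F.normalize(torch.randn(len(classes), 512), dim=-1)  # "a satellite photo of {c}"

logits = 100.0 * image_emb @ text_emb.T   # cosine similarity * logit scale
probs = logits.softmax(dim=-1)
pred = classes[probs.argmax().item()]
```

All later methods in this study keep this scoring scheme and only change how the underlying embeddings are produced or adapted.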
More interesting is the visualization of failure cases, which we categorized into two main patterns:
The Circuit Diagrams dataset, recently introduced in this paper, is a collection of labeled images used for classifying various types of circuit diagrams. These diagrams represent different electrical components and their interconnections, making them highly structured yet visually diverse. The dataset presents unique challenges, such as variations in diagram layout, symbol orientation, and component grouping. To effectively understand the intricacies of this dataset, models must accurately recognize both individual components and the relationships between them.
This dataset proved particularly challenging for CLIP, as the base model likely has little to no prior knowledge of such structures, making it an interesting case to study its behavior.
The baseline model's predictions were quite noisy, reaching an overall accuracy of only 12.4%. The feature it most consistently latched onto was the text present in the images, a known failure mode of CLIP.
The Circuit Diagrams test set is highly imbalanced: most classes have only a few samples (around 5 to 10), while a handful of classes dominate the test set. Applying LoRA produced an unusual behavior: the model collapsed to predicting the majority class, yet still achieved only 18% accuracy. This was unexpected, as the model was trained in a few-shot setting with a fixed number of samples per class, which should have discouraged such collapse. It is possible that the majority class, "converter/power supply/charger/inverter," contains visual features that are particularly easy for the model to identify, though we could not fully explain this given our limited background in electronics.
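Collapse of this kind is easy to miss if one only looks at overall accuracy; per-class recall exposes it immediately. A tiny illustration with hypothetical labels (class names shortened, counts invented for clarity):

```python
# Hypothetical predictions illustrating majority-class collapse: overall
# accuracy looks non-trivial while every tail class has zero recall.
y_true = ["converter"] * 10 + ["resistor"] * 10 + ["amplifier"] * 10
y_pred = ["converter"] * 30   # the model answers the majority class everywhere

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = {
    c: sum(p == c for t, p in zip(y_true, y_pred) if t == c) / y_true.count(c)
    for c in set(y_true)
}
print(overall)  # 0.333...
print(recall)   # converter: 1.0, resistor: 0.0, amplifier: 0.0
```

On a test set where the majority class dominates, the same degenerate predictor can look deceptively competitive, which is why we inspected per-class behavior throughout.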
On the other hand, the application of BitFit enhanced CLIP’s tendency to focus on textual information in the images without causing the collapse behavior seen with LoRA. This was particularly interesting, as it demonstrated BitFit’s ability to leverage the original model's knowledge, rather than optimizing parameters for a specific task. The bias towards text in images became much stronger with this approach.
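BitFit's conservatism follows from what it tunes: only bias terms, with every other weight kept at its pre-trained value. A minimal sketch of that selection logic (the helper name and the toy MLP are illustrative, not the project's code):

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module) -> int:
    """BitFit sketch: mark only bias parameters as trainable,
    freeze everything else. Returns the trainable-parameter count."""
    trainable = 0
    for name, p in model.named_parameters():
        p.requires_grad = name.endswith("bias")
        if p.requires_grad:
            trainable += p.numel()
    return trainable

mlp = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
n = apply_bitfit(mlp)
print(n)  # 256 + 10 = 266 bias parameters remain trainable
```

Since the weight matrices never move, the representation stays close to the pre-trained one, which is consistent with BitFit amplifying CLIP's existing text bias rather than relearning the task.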
Our best-performing model configuration (achieving 17.25% accuracy) was obtained through a combination of methods, specifically BitFit, Meta-Adapter, and label preserving/breaking augmentations.
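As a rough illustration of the two augmentation families, here is one transform that keeps the label valid and one that destroys it; the specific operations (horizontal flip, pixel shuffling) are simplified stand-ins for the project's actual augmentations:

```python
import torch

def label_preserving(img: torch.Tensor) -> torch.Tensor:
    """A horizontal flip keeps the class identity intact."""
    return torch.flip(img, dims=[-1])

def label_breaking(img: torch.Tensor) -> torch.Tensor:
    """Shuffling pixels destroys the structure the label depends on,
    yielding a sample the model should no longer match to the class."""
    c, h, w = img.shape
    perm = torch.randperm(h * w)
    return img.reshape(c, -1)[:, perm].reshape(c, h, w)

img = torch.rand(3, 64, 64)
kept, broken = label_preserving(img), label_breaking(img)
```

Label-preserving variants expand the few-shot training set, while label-breaking variants can serve as negatives that discourage the model from keying on low-level statistics alone.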
The attention maps clearly show a tendency for the model to avoid the circuit structure and focus instead on blank areas of the images.
When applying Meta-Adapter either alone or in combination with other methods, we noticed that the model's similarity scores between modalities were much higher than usual. This led us to ask: Does this Meta-Learning strategy also reduce the gap between the text and vision modalities?
The answer to this question is "possibly." While the similarity scores are significantly higher, the standard deviation also increases by a few points. This suggests that further investigation is needed to better understand this behavior, ideally using additional datasets.
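One way to quantify this is to track the cosine similarity of matched image-text pairs alongside the distance between the two modalities' centroids, a common proxy for the modality gap. A sketch with random unit vectors standing in for real CLIP embeddings:

```python
import torch
import torch.nn.functional as F

# Probing the modality gap: random unit vectors stand in for the CLIP
# image/text embeddings of matched pairs.
torch.manual_seed(0)
img = F.normalize(torch.randn(100, 512), dim=-1)
txt = F.normalize(torch.randn(100, 512), dim=-1)

pair_sim = (img * txt).sum(dim=-1)         # cosine similarity per matched pair
gap = (img.mean(0) - txt.mean(0)).norm()   # distance between modality centroids
print(f"mean sim {pair_sim.mean():.3f} +/- {pair_sim.std():.3f}, gap {gap:.3f}")
```

Comparing these statistics before and after applying Meta-Adapter is the kind of measurement that would be needed, across several datasets, to turn the "possibly" above into a firmer answer.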
The results of applying various methods to the EuroSAT and Circuit Diagrams datasets revealed key insights into model behavior and performance.
For EuroSAT, the baseline CLIP model initially struggled with low accuracy due to the challenge of handling satellite imagery with complex land classifications. However, after incorporating LoRA, a significant improvement was observed, both in terms of cluster separability and the richness of the attention maps. The addition of BitFit further enhanced performance, especially in classes like "Sea or Lake" and "River," where a global understanding of the image was crucial.
Combining LoRA, BitFit, and Meta-Adapter yielded the best results, with the model achieving an impressive 90.95% accuracy. This configuration outperformed the individual methods, highlighting the value of a multi-strategy approach.
For the Circuit Diagrams dataset, the CLIP model initially faced difficulties due to the highly structured yet visually diverse nature of circuit diagrams, especially with text-heavy images. LoRA caused the model to collapse on the majority class, while BitFit showed more balanced performance by leveraging the model's existing knowledge. The best configuration for Circuit Diagrams was obtained by combining BitFit, Meta-Adapter, and label preserving/breaking augmentations, reaching 17.25% accuracy.
The attention maps from the best model configurations in both datasets displayed clear tendencies towards focusing on certain image features. In EuroSAT, the model correctly identified land types and geographical features, while in Circuit Diagrams, the model often ignored the circuit structure, focusing instead on blank spaces.
My work involved hands-on application of all the proposed methods, along with a detailed analysis of model performance and failure cases across different datasets. I also manually modified the models to collect comprehensive output information, an integral task that required a solid understanding of the entire system. Additionally, the intuition regarding the potential reduction of the modality gap through Meta-Adapter was my own, and I led the investigation into its impact on the model's behavior.