- Oct-03-24: AgriCLIP paper, pretraining dataset, and the code are released.
We present AgriCLIP, a vision-language foundational model dedicated to the domain of agriculture and livestock. First, we propose a large-scale dataset, named ALive, that leverages a customized prompt generation strategy to overcome the scarcity of expert annotations. Our ALive dataset covers crops, livestock, and fisheries, with around 600,000 image-text pairs. Second, we propose a training pipeline that integrates both contrastive and self-supervised learning to learn both global semantic and local fine-grained domain-specialized features. Experiments on a diverse set of 20 downstream tasks demonstrate the effectiveness of the AgriCLIP framework.
- Our primary contribution is the creation of a large, diverse image-text dataset derived solely from vision-based agricultural datasets.
- Our second contribution is a training pipeline that combines image-text contrastive and image-only self-supervised learning to boost global semantic features with fine-grained visual details.
- We follow a three-stage training pipeline combining contrastive learning, DINO-based training, and encoder alignment to capture both global semantic and local fine-grained features.
- We conduct a comprehensive evaluation on diverse downstream tasks, demonstrating AgriCLIP's effectiveness in zero-shot settings.
We gather 25 training datasets across crops, fish, and livestock, creating the Agriculture and Livestock (ALive) dataset with 600k images covering a wide range of conditions. This includes various crop growth stages, classifications, and different farming environments for animals and fish. Next, we design a customized prompt generation strategy in which dataset- and class-level information is leveraged to provide context and fine-grained details for each image. For instance, instead of using a generic CLIP prompt like “a photo of a boron-deficient leaf,” we craft prompts like “a photo of a leaf with boron deficiency, characterized by yellow patches and curled edges.” We then use GPT-4 to generate diverse variations of these prompts.
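The class-level prompt composition can be sketched as below. The class names and symptom descriptions here are illustrative stand-ins; the released code and the GPT-4 paraphrasing step may differ.

```python
# Hypothetical mapping from class names to fine-grained symptom
# descriptions (illustrative, not the actual ALive metadata).
CLASS_DETAILS = {
    "boron deficiency": "yellow patches and curled edges",
    "potassium deficiency": "brown scorching along the leaf margins",
}

def build_prompt(class_name: str, subject: str = "leaf") -> str:
    """Compose a fine-grained prompt from class-level information."""
    detail = CLASS_DETAILS[class_name]
    return f"a photo of a {subject} with {class_name}, characterized by {detail}"

print(build_prompt("boron deficiency"))
# Each base prompt can then be passed to GPT-4 to generate diverse
# paraphrased variations per image.
```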
📥 Download the Pre-Training Dataset: Access our pre-training dataset: ALive Dataset.
To evaluate the performance of AgriCLIP, we assemble a set of 20 datasets (Downstream data) to test the model’s ability to generalize to unseen concepts. The evaluation set is entirely disjoint from the ALive pre-training set.
📥 Download the Downstream data: Access our downstream dataset: Downstream Dataset.
We recommend setting up a conda environment for the project:
conda create --name=agriclip python=3.10
conda activate agriclip
git clone https://github.com/umair1221/AgriCLIP.git
cd AgriCLIP
pip install -r requirements.txt
export PYTHONPATH="./:$PYTHONPATH"
1. Prepare data
Please download the dataset from ALive Dataset.
After downloading, the next step is to extract feature representations from both models, i.e., DINO and CLIP. Then run the following command to produce the aligned model, which is subsequently used for zero-shot evaluation.
python AgriCLIP_alignment/train_linear_aligner.py --data-path "/path/to/your/dataset" \
--dino-weights-path "/path/to/your/dino_pretrain.pth" \
--clip-weights-path "/path/to/your/clip_pretrain.pth" \
--path-dino-features "/path/to/your/dino_features.npy" \
--path-clip-features "/path/to/your/clip_features.npy" \
--output-model-path "./path/to/save/aligned_model.pth"
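Conceptually, the aligner learns a map from the DINO feature space into the CLIP embedding space. A minimal sketch with a closed-form least-squares fit is shown below; the actual `train_linear_aligner.py` may instead train a linear layer with gradient descent, and the feature dimensions here are illustrative.

```python
import numpy as np

# Random stand-ins for the precomputed feature files
# (dino_features.npy / clip_features.npy).
rng = np.random.default_rng(0)
dino_feats = rng.normal(size=(1000, 384))   # e.g. DINO ViT-S features
clip_feats = rng.normal(size=(1000, 512))   # e.g. CLIP ViT-B/32 features

# Solve min_W || dino_feats @ W - clip_feats ||^2 in closed form.
W, *_ = np.linalg.lstsq(dino_feats, clip_feats, rcond=None)

# Project DINO features into the CLIP embedding space.
aligned = dino_feats @ W
print(aligned.shape)  # (1000, 512)
```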
Downstream datasets can either be downloaded manually from Downstream-Data or with the script below:
pip install gdown
python Dataset/download_downstream.py --output-dir "/path/to/your/dataset/storage"
Use the command below to perform zero-shot inference with AgriCLIP.
python AgriCLIP_alignment/AgriClip_zeroshot.py --dataset-name "Banana Deficiency" \
--data-path "/path/to/dataset" \
--dino-path "Weights/dino_pretrain.pth" \
--aligner-path "Weights/Aligned_Models/Agri_Dino_aligner_DPT_CPT.pth" \
--batch-size 32 \
--num-workers 4
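Zero-shot classification then reduces to comparing the aligned image embedding against CLIP text embeddings of the class prompts via cosine similarity. The sketch below uses random arrays as stand-ins for the real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
image_feat = rng.normal(size=(512,))     # aligned image embedding (stand-in)
text_feats = rng.normal(size=(5, 512))   # one CLIP text embedding per class prompt

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length for cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Cosine similarity between the image and each class prompt.
sims = l2_normalize(text_feats) @ l2_normalize(image_feat)
pred = int(np.argmax(sims))              # index of the best-matching class
print(pred)
```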
| Model Name | Weights |
|---|---|
| DINO | Feature representations of ALive data for alignment |
| CLIP | Feature representations of ALive data for alignment |
- Text2Concept: Our approach is inspired by this work. We thank the authors for their cross-model alignment code.
- DINO: Provides the self-supervised training capability.
- CLIP: A good resource for zero-shot classification using text prompts.
@misc{nawaz2024agriclip,
title={AgriCLIP: Adapting CLIP for Agriculture and Livestock via Domain-Specialized Cross-Model Alignment},
author={Umair Nawaz and Muhammad Awais and Hanan Gani and Muzammal Naseer and Fahad Khan and Salman Khan and Rao Muhammad Anwer},
year={2024},
eprint={2410.01407},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.01407},
}