# FusionDetect: Fake Image Detection with DINOv2 + CLIP

A hybrid deep learning model for fake image detection that combines DINOv2 and CLIP features with optional robustness enhancements.

## Features

- **Dual Backbone Architecture**: Leverages both DINOv2 and CLIP vision transformers
- **Robustness Enhancements**: JPEG compression and Gaussian blur augmentations
- **Flexible Training**: Multiple classifier head configurations and fine-tuning options
- **Multi-GPU Support**: Parallel training across multiple GPUs

## Project Structure

```
├── dataset.py          # Custom dataset class with data augmentation
├── model.py            # Hybrid model architecture (DINOv2 + CLIP)
├── train_concat.py     # Main training and evaluation script
└── README.md
```

## Installation

```bash
pip install torch torchvision pillow open_clip_torch
```

## Usage

### Training

```bash
python train_concat.py \
    --train_fake_dir /path/to/train/fake/images \
    --train_real_dir /path/to/train/real/images \
    --test_fake_dir /path/to/test/fake/images \
    --test_real_dir /path/to/test/real/images \
    --save_model_path /path/to/save/models \
    --clip_variant ViT-L-14 \
    --dino_variant dinov2_vitb14 \
    --num_layers 4 \
    --batch_size 256 \
    --epochs 10 \
    --gpu 0
```

### Evaluation

```bash
python train_concat.py \
    --train_fake_dir /path/to/train/fake/images \
    --train_real_dir /path/to/train/real/images \
    --test_fake_dir /path/to/test/fake/images \
    --test_real_dir /path/to/test/real/images \
    --model_path /path/to/saved/model.pth \
    --dino_variant dinov2_vitl14 \
    --clip_variant ViT-L-14 \
    --num_layers 4 \
    --gpu 0 \
    --eval
```

### Robustness Training (with augmentations)

```bash
# Same arguments as the training command, plus --aug_prob
python train_concat.py \
    ... \
    --aug_prob 0.3  # apply JPEG/blur with 30% probability during training
```

### Robustness Evaluation

```bash
# Same arguments as the evaluation command, plus --jpeg and --blur
python train_concat.py \
    ... \
    --jpeg 95 --blur 2  # apply JPEG QF=95 and blur sigma=2 during testing
```

## Key Arguments

- `--clip_variant`: CLIP model variant (`ViT-L-14`, `ViT-H-14-quickgelu`)
- `--dino_variant`: DINOv2 model variant (`dinov2_vits14`, `dinov2_vitb14`, `dinov2_vitl14`)
- `--num_layers`: Number of layers in the classifier head (1-5)
- `--aug_prob`: Probability of applying JPEG/blur augmentations during training
- `--jpeg`: JPEG quality factors for evaluation (e.g., `95 75 50`)
- `--blur`: Gaussian blur sigma values for evaluation (e.g., `1 2 3`)
- `--featup`: Use FeatUp feature upsampling
- `--mixstyle`: Apply MixStyle for domain generalization
- `--finetune_clip`: Fine-tune the CLIP model during training
- `--finetune_dino`: Fine-tune the DINOv2 model during training

## Dataset Structure

Organize your dataset as follows:

```
dataset/
├── train/
│   ├── 0_real/
│   └── 1_fake/
└── test/
    ├── 0_real/
    └── 1_fake/
```

## Output

- Trained models are saved in the directory given by `--save_model_path`
- Training arguments are logged to `args.txt`
- Model files are named with the epoch number, accuracy, and average precision

## Example Commands

### Full Training Example

```bash
python train_concat.py \
    --train_fake_dir /media/external_16TB_1/amirtaha_amanzadi/datasets/sample_3_datasets/all_3_cham_sd14/1_fake/ \
    --train_real_dir /media/external_16TB_1/amirtaha_amanzadi/datasets/sample_3_datasets/all_3_cham_sd14/0_real/ \
    --test_fake_dir /media/external_16TB_1/amirtaha_amanzadi/datasets/Chameleon-train-test/test/1_fake/ \
    --test_real_dir /media/external_16TB_1/amirtaha_amanzadi/datasets/Chameleon-train-test/test/0_real/ \
    --save_model_path /media/external_16TB_1/amirtaha_amanzadi/dino/ablation/clip-l14_dino-b14 \
    --clip_variant ViT-L-14 \
    --dino_variant dinov2_vitb14 \
    --num_layers 4 \
    --batch_size 256 \
    --epochs 10 \
    --gpu 0
```

### Full Evaluation Example

```bash
python train_concat.py \
    --train_fake_dir /media/external_16TB_1/amirtaha_amanzadi/datasets/sample_3_datasets/all_3_cham_sd14/1_fake/ \
    --train_real_dir /media/external_16TB_1/amirtaha_amanzadi/datasets/sample_3_datasets/all_3_cham_sd14/0_real/ \
    --test_fake_dir /media/external_16TB_1/amirtaha_amanzadi/datasets/sample_3_datasets/Gen-Img-Cham_test/1_fake/ \
    --test_real_dir /media/external_16TB_1/amirtaha_amanzadi/datasets/sample_3_datasets/Gen-Img-Cham_test/0_real/ \
    --model_path /media/external_16TB_1/amirtaha_amanzadi/dino/saved_models/robustness/aug_prob_30/ep20_acc_0.7718_ap_0.7451.pth \
    --dino_variant dinov2_vitl14 \
    --clip_variant ViT-L-14 \
    --num_layers 4 \
    --gpu 0 \
    --eval
```

## Notes

- For robustness testing, pass `--jpeg` and `--blur` during evaluation
- For robust training, pass `--aug_prob` to enable random augmentations
- Multiple GPUs can be specified with a comma-separated list, e.g. `--gpu 0,1,2`
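
Because saved models are named with the epoch number, accuracy, and average precision, a small helper can choose which file to pass to `--model_path`. The following is a minimal sketch, not part of the repository: the helper names `parse_checkpoint` and `best_checkpoint` are hypothetical, and the filename pattern is an assumption inferred from the single example name `ep20_acc_0.7718_ap_0.7451.pth` used in this README.

```python
import re
from pathlib import Path
from typing import Iterable, NamedTuple, Optional

# Assumed naming scheme, inferred from the example checkpoint
# "ep20_acc_0.7718_ap_0.7451.pth"; adjust if train_concat.py differs.
CKPT_RE = re.compile(r"^ep(\d+)_acc_(\d+(?:\.\d+)?)_ap_(\d+(?:\.\d+)?)\.pth$")


class Checkpoint(NamedTuple):
    path: Path
    epoch: int
    acc: float
    ap: float


def parse_checkpoint(path) -> Optional[Checkpoint]:
    """Return the metrics encoded in a checkpoint filename, or None if it does not match."""
    path = Path(path)
    match = CKPT_RE.match(path.name)
    if match is None:
        return None
    return Checkpoint(
        path=path,
        epoch=int(match.group(1)),
        acc=float(match.group(2)),
        ap=float(match.group(3)),
    )


def best_checkpoint(paths: Iterable, key: str = "ap") -> Optional[Checkpoint]:
    """Pick the checkpoint with the highest `key` ("ap", "acc", or "epoch")."""
    parsed = (parse_checkpoint(p) for p in paths)
    candidates = [c for c in parsed if c is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda c: getattr(c, key))
```

For example, `best_checkpoint(Path("/path/to/save/models").glob("*.pth"))` would return the entry with the highest average precision; files that do not match the assumed pattern, such as `args.txt`, are skipped.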