Video Action Recognition Using Transfer Learning and Attention Mechanisms

This project focuses on video action recognition using deep learning techniques, leveraging transfer learning from language models and attention mechanisms.

Getting Started

1. Dataset Preparation

1.1. Download the Kinetics dataset:

  • Use save_kinetics_dataset.ipynb to download the dataset.
  • Alternatively, you can use download_k400.ipynb.

1.2. Save the dataset:

  • Store the downloaded dataset in your Google Drive for easy access.
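
If you work in Google Colab, the flow inside save_kinetics_dataset.ipynb amounts to the sketch below; the paths are placeholders, so adjust them to your own Drive layout.

```python
# Minimal sketch: mount Google Drive and copy the downloaded dataset into it.
# The actual download logic lives in save_kinetics_dataset.ipynb / download_k400.ipynb;
# both paths below are placeholders.
import shutil
from google.colab import drive

drive.mount('/content/drive')

local_dir = '/content/kinetics400'                 # where the notebook downloaded the clips
drive_dir = '/content/drive/MyDrive/kinetics400'   # destination folder in your Drive

shutil.copytree(local_dir, drive_dir, dirs_exist_ok=True)
```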

2. Label Preprocessing

2.1. Update Kinetics labels:

  • Run preprocess_kinetics_labels.ipynb.
  • This notebook uses GPT-4 to generate a detailed description for each action label.
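
Conceptually, the label-expansion step looks like the sketch below; the client setup, model name, and prompt are illustrative, and the notebook is the source of truth.

```python
# Minimal sketch: expand each Kinetics action label into a richer description with GPT-4.
# Assumes the `openai` package (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def describe_action(label: str) -> str:
    """Ask GPT-4 for a one-sentence visual description of an action label."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; use whichever GPT-4 variant you have access to
        messages=[{
            "role": "user",
            "content": f"Describe the action '{label}' in one visually detailed sentence.",
        }],
    )
    return response.choices[0].message.content.strip()

# e.g. describe_action("playing violin")
```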

3. Model Training

3.1. Post-pretraining of VideoMAE:

  • Execute postpretrain_VideoMAE_to_CLIP_Space.ipynb.
  • This notebook trains a transformer layer to map VideoMAE embeddings to CLIP space.
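
At a high level, the mapping trained in that notebook can be pictured as follows; the layer sizes (768 for VideoMAE-Base, 512 for CLIP ViT-B/32) and the cosine loss are assumptions for illustration.

```python
# Minimal sketch: map VideoMAE token embeddings into CLIP's embedding space.
# Dimensions and loss are assumptions; see postpretrain_VideoMAE_to_CLIP_Space.ipynb.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoMAEToCLIP(nn.Module):
    def __init__(self, videomae_dim=768, clip_dim=512, nhead=8):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=videomae_dim, nhead=nhead, batch_first=True
        )
        self.proj = nn.Linear(videomae_dim, clip_dim)

    def forward(self, videomae_tokens):              # (B, num_tokens, videomae_dim)
        x = self.encoder(videomae_tokens)
        x = x.mean(dim=1)                            # pool tokens into one video embedding
        return F.normalize(self.proj(x), dim=-1)     # unit-norm, like CLIP features

def cosine_loss(video_emb, clip_text_emb):
    # Pull each video embedding toward the CLIP text embedding of its expanded label.
    return 1 - F.cosine_similarity(video_emb, clip_text_emb, dim=-1).mean()
```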

4. Testing

4.1. Prepare the test dataset:

  • Download the UCF101 dataset.
  • Update the UCF101 labels using GPT-4, similar to the Kinetics label preprocessing step.

4.2. Run the test:

  • Use test.ipynb to evaluate the model’s performance.
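
Evaluation amounts to zero-shot classification in CLIP space; a simplified version of the loop in test.ipynb might look like this (all names are illustrative).

```python
# Minimal sketch: zero-shot action recognition by comparing a mapped video embedding
# against unit-norm CLIP text embeddings of the UCF101 label descriptions.
# `mapper`, `video_tokens`, `text_features`, and `class_names` are assumed inputs.
import torch

@torch.no_grad()
def predict_action(mapper, video_tokens, text_features, class_names):
    """video_tokens: (1, num_tokens, 768) VideoMAE features for one clip.
    text_features: (num_classes, 512) CLIP embeddings of the label descriptions."""
    video_emb = mapper(video_tokens)            # (1, 512), unit-norm
    similarity = video_emb @ text_features.T    # (1, num_classes)
    return class_names[similarity.argmax(dim=-1).item()]
```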

Prerequisites

  • Python 3.x
  • Jupyter Notebook
  • PyTorch
  • Transformers library
  • CLIP model
  • VideoMAE model
  • Access to GPT-4 API for label preprocessing
  • Google Drive (for storing datasets)

Usage

  1. Follow the steps in the “Getting Started” section to prepare your data and train the model.
  2. Ensure all datasets are properly saved in your Google Drive.
  3. Run the notebooks in the order specified above.
  4. For testing, make sure you have the UCF101 dataset prepared and labels updated before running test.ipynb.

The model processes multiple frames sampled from each video and produces rich representations in CLIP space.
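
Frame sampling is currently uniform (see uniform-sampler-video-embedder.py); a minimal sketch of the idea, with the function name chosen here for illustration, is shown below. The adaptive frame selection unit listed under Future Work would replace this step.

```python
# Minimal sketch: uniformly sample a fixed number of frame indices from a clip.
import numpy as np

def uniform_frame_indices(num_frames_in_video: int, num_samples: int = 16) -> np.ndarray:
    """Return `num_samples` indices evenly spread across the video."""
    return np.linspace(0, num_frames_in_video - 1, num_samples).round().astype(int)

# e.g. uniform_frame_indices(300, 16) picks 16 evenly spaced frames from a 300-frame clip.
```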

Future Work

  • Implement an adaptive frame selection unit
  • Extend to more diverse datasets
  • Integrate multimodal inputs (e.g., audio)
  • Fine-tune hyperparameters