BSc project of Parham Saremi. The goal of the project was to detect the geographical region of a food using textual and visual features extracted from its recipe and ingredients.

Required data for running on other servers

  • Crawled Images (Resized to 384): (
  • FastText Model Folder: (
  • Extracted Image Features: (
  • Extracted Text Features: (
  • Crawled Images (Original Size): ( /media/external_10TB/10TB/Behnamnia/HPC-BACKUP_01.02.07/food/ACM-MM/IngredientsCrawling/[crawled_images_full]|[crawled_images_full_v2]

The “Crawled Images (Original Size)” path contains the original ingredient images that were obtained from Google. These images were then resized to 384 to make them easier to transfer between servers. Since our image models use a maximum input size of 300, a larger size is unnecessary. The original data folder is over 200 GB and the resized folder is still over 20 GB, so only their paths are provided here. These folders are not required to run the final model, because the model works on features that were already extracted from the images with pre-trained models.

The FastText model is a non-contextual model used to extract embeddings of ingredient names. The model is approximately 1 GB in size, so only its path is provided here. This model is required to run the final training code.

The “Extracted Image Features” path refers to the folder containing the extracted features from ingredient images using pre-trained image models. These image features are necessary to run the main training code.

The “Extracted Text Features” path refers to the folder containing the extracted features from recipes using the BERT model. These features are also required to run the main training code.

Structure of the Available Files


The Data folder contains the following files:

  • train.json: the train split of the RecipeDB dataset
  • val.json: the validation split of the RecipeDB dataset
  • region.json: a JSON file that lists all regions and assigns a number to each of them
  • ingredient_counts.json: a JSON file that lists every ingredient in the RecipeDB dataset together with its count over the whole dataset
  • image_dict_ings.json: a list of crawled image names
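
As a minimal sketch of how these files might be read, the snippet below loads region.json and ingredient_counts.json with the standard json module. The exact schemas are not documented here, so the assumed layouts (a region name to number mapping and an ingredient name to count mapping), the helper names load_json and most_common_ingredients, and the sample values are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_json(path):
    """Read a UTF-8 JSON file and return the parsed object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def most_common_ingredients(counts, k):
    """Return the k most frequent names (assumes a name -> count mapping)."""
    return sorted(counts, key=counts.get, reverse=True)[:k]


# Hypothetical stand-ins for region.json and ingredient_counts.json;
# the real files may be structured differently.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "region.json").write_text(
        json.dumps({"Italian": 0, "Mexican": 1}), encoding="utf-8"
    )
    (data_dir / "ingredient_counts.json").write_text(
        json.dumps({"salt": 120, "basil": 15, "lime": 40}), encoding="utf-8"
    )

    region_to_index = load_json(data_dir / "region.json")
    ingredient_counts = load_json(data_dir / "ingredient_counts.json")

# Invert the mapping so that predicted class numbers can be turned back into names.
index_to_region = {idx: name for name, idx in region_to_index.items()}
top_two = most_common_ingredients(ingredient_counts, 2)
```

If the real region.json stores a list rather than a mapping, the inversion step would need to be adapted accordingly.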


The following list explains all the files that appear in the utils folder:

  • a Python util file that provides functions to get embeddings from FastText.

  • utility functions that help in loading and saving the config.

  • a file that contains functions to handle Batch Normalizations.

  • an implementation of RecipeDB dataset using PyTorch Dataset class.

  • an implementation of SAM optimizer, which is used in this project.


  • used for ingredient visual embedding extraction.

  • used for recipe text embedding extraction.

  • implementation of PyTorch Image-Text-Transformer model that is required for solving the problem.

  • code used for loading the data, creating the model, and feeding the data to the model.

  • best_config.yml: YAML config file used for specifying the hyperparameters of the model.
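
For readers unfamiliar with the SAM (Sharpness-Aware Minimization) optimizer mentioned in the list above, the following is a rough sketch of the general two-step update on a toy quadratic loss. It is not the project's PyTorch implementation: the functions loss_grad and sam_step, the toy loss, and the default values of lr and rho are assumptions made for illustration, with rho playing the role that sam_rho presumably plays in the config.

```python
import math


def loss_grad(w):
    """Gradient of the toy loss L(w) = 0.5 * sum(a_i * w_i^2)."""
    curvatures = (1.0, 4.0)
    return [a * wi for a, wi in zip(curvatures, w)]


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: probe a nearby worst-case point, then descend from w."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12  # avoid division by zero
    # First step: move the weights a distance rho along the normalized gradient.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Second step: apply the gradient taken at the perturbed point
    # to the original (unperturbed) weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(w, loss_grad)
```

In a PyTorch training loop, the same idea requires two forward and backward passes per batch, one at the current weights and one at the perturbed weights.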

How to extract features

Text features

We can extract text features from RecipeDB’s recipes using the python file. At its beginning, this file defines two paths: the path to the JSON data file and the output path. The script reads the JSON files Data/train.json and Data/val.json to extract embeddings, and the output path is the location where the final embeddings are saved. Below is the command for running this script:


Image features

Image features can be extracted using the code, which defines the following variables at its beginning:

input_dir = '/home/dml/food/CuisineAdaptation/crawled-images-full-384'
output_root_dir = 'image-features-full'

Using these fields, we can define the path to the input image folder (which contains images of different ingredients resized to 384x384) and the output root directory, which indicates where the embeddings will be saved.

This script will load and run five pretrained models on the input data and save their embeddings in the output folder. Keep in mind that the output embedding for an ingredient is the average of all the embeddings extracted from its corresponding images.
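
The averaging step described above can be sketched as follows. This is a simplified illustration on plain Python lists, not the extraction script itself: the function name average_embeddings, the (ingredient, vector) input format, and the toy vectors are assumptions.

```python
from collections import defaultdict


def average_embeddings(image_embeddings):
    """Average per-image vectors into one vector per ingredient.

    image_embeddings: iterable of (ingredient_name, vector) pairs, where
    all vectors have the same length.
    """
    grouped = defaultdict(list)
    for ingredient, vector in image_embeddings:
        grouped[ingredient].append(vector)

    averaged = {}
    for ingredient, vectors in grouped.items():
        n = len(vectors)
        # zip(*vectors) walks the vectors one dimension at a time.
        averaged[ingredient] = [sum(dim) / n for dim in zip(*vectors)]
    return averaged


# Toy 3-dimensional embeddings; features from real image models are much larger.
toy = [
    ("tomato", [1.0, 2.0, 3.0]),
    ("tomato", [3.0, 4.0, 5.0]),
    ("basil", [0.5, 0.5, 0.5]),
]
features = average_embeddings(toy)
```

With real features, the same grouping would typically be done on tensors or arrays, with the mean taken along the image axis.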

How to run the train code

The training code takes a configuration file as its only input. The command for running the training code is as follows:

python3 --config best_config.yml

The fields of the configuration file are listed below. Each entry is written as name: "description" :> type.


    epochs: "Number of epochs for training" :> int
    batch_size: "Batch size for the dataset" :> int
    max_lr: "Max learning rate to pass to the scheduler" :> float
    weight_decay: "Weight decay value to pass to optimizer" :> float
    device: "Device to use for training, either 'cuda' or 'cpu'" :> str
    num_workers: "Number of workers for dataloader" :> int
    sam_rho: "Hyperparameter for SAM optimizer" :> float
text_model: "bert-base-uncased" :> str
image_model: "Name of the image model. Available values: resnet18, resnet50, resnet101, efficientnet_b0, efficientnet_b3" :> str
image_features_path: "Path for the extracted image features" :> str
text_features_path: "Path for the extracted text features" :> str
use_recipe_text: "Should the model use recipe embeddings?" :> bool
use_image_ingredients: "Should the model use ingredient image embeddings?" :> bool
use_text_ingredients: "Should the model use ingredient text embeddings?" :> bool
        layers: "Number of transformer blocks like T or TTTT" :> str
        H: "Embedding size for the transformer" :> int
            L: "Number of layers for each transformer block" :> int
            n_heads: "Number of heads for each transformer block" :> int
        final_ingredient_feature_size: "What is the ingredient feature size after we get the output from the transformer?" :> int
    image_feature_size: "What is the size of the image features reduced to in the beginning?" :> int
    text_feature_size: "What is the size of the text features from recipes reduced to?" :> int
    final_classes: "This will be replaced in the code. Just set it to -1." :> int
    embedding_size: "What is the embedding size of ingredient text features?" :> int
    dataset_path: "Path for the RecipeDB dataset" :> str
    fasttext_path: "Path to the fasttext .model file" :> str
    target: "Type of target. Should be 'region' for this project." :> str
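
As a rough sketch of how a loaded configuration could be checked against the field types listed above, the snippet below validates a plain dictionary and parses the --config argument. It covers only a subset of the fields and treats them as a flat mapping, whereas the real file nests some fields under parent keys that are not shown here. The names EXPECTED_TYPES, validate_config, and parse_args, as well as the example values, are hypothetical, and loading the YAML file itself is left out.

```python
import argparse

# Expected types for a subset of the documented fields (flattened for illustration).
EXPECTED_TYPES = {
    "epochs": int,
    "batch_size": int,
    "max_lr": float,
    "weight_decay": float,
    "device": str,
    "image_model": str,
    "use_recipe_text": bool,
    "target": str,
}

IMAGE_MODELS = {
    "resnet18", "resnet50", "resnet101", "efficientnet_b0", "efficientnet_b3",
}


def validate_config(config):
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in config:
            problems.append(f"missing field: {key}")
        elif type(config[key]) is not expected:
            # Exact type match, so True is rejected where an int is expected.
            problems.append(f"{key} should be {expected.__name__}")
    if config.get("device") not in ("cuda", "cpu"):
        problems.append("device should be 'cuda' or 'cpu'")
    if config.get("image_model") not in IMAGE_MODELS:
        problems.append("unknown image_model")
    return problems


def parse_args(argv=None):
    """Parse the single --config argument that the training command takes."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="path to the YAML config file")
    return parser.parse_args(argv)


# Placeholder values for illustration only; they are not the tuned hyperparameters.
example = {
    "epochs": 10,
    "batch_size": 32,
    "max_lr": 0.001,
    "weight_decay": 0.01,
    "device": "cpu",
    "image_model": "resnet18",
    "use_recipe_text": True,
    "target": "region",
}
problems = validate_config(example)
args = parse_args(["--config", "best_config.yml"])
```

Running such a check before training starts surfaces a mistyped field immediately instead of partway through an epoch.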