T2 - kijouneli/EAR-MP-2025 GitHub Wiki

Title

Gourmet Plate Transformer (GPT): Food photograph-based recipe generation and Q&A with VLMs

Summary

  • Motivation

    • Ever wondered how the food you saw on Instagram was cooked? Have you tried to find the recipe for your favorite restaurant dish on the internet?
  • Goal & Methodologies

    • In this project, we aim to build a Vision-Language Model (VLM) that receives a food photograph as input and generates a recipe for the dish.
    • To achieve this goal, we can try several approaches, depending on the students’ familiarity with deep learning.

(L1) Train an accurate food photo classification model that identifies the food class, then retrieve the matching recipe from a large dataset.

(L2) Train a classifier as in (L1), then use a Retrieval-Augmented Generation (RAG) approach to prompt an LLM to generate the recipe.

(L3) Using the LLM from (L2), build a chatbot that can answer your questions while you cook.

(L4) Using VLMs, perform supervised finetuning (SFT) on a large-scale dataset of food photographs and recipes to construct an end-to-end recipe generation model.

(L5) Treating step-by-step recipe instructions as the reasoning process for cooking a meal, try a chain-of-thought (CoT) approach for “zero-shot” recipe generation from images.

  • Prerequisites: Some familiarity with training & finetuning deep models
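
To make approaches (L1) and (L2) more concrete, the following is a minimal sketch of the retrieval and prompting steps only. It assumes an image classifier has already produced a food class label. The names `Recipe`, `RECIPE_DB`, `retrieve_recipes`, and `build_rag_prompt` are hypothetical, as is the toy recipe; a real system would index Recipe1M or RecipeNLG and send the prompt to an LLM of the team's choice.

```python
from dataclasses import dataclass


@dataclass
class Recipe:
    title: str
    ingredients: list[str]
    steps: list[str]


# Hypothetical toy store keyed by food class (assumption for illustration).
# A real project would build this index from Recipe1M or RecipeNLG.
RECIPE_DB: dict[str, list[Recipe]] = {
    "pancake": [
        Recipe(
            title="Basic pancakes",
            ingredients=["flour", "milk", "egg"],
            steps=["Whisk the ingredients into a batter.", "Fry ladlefuls in a hot pan."],
        )
    ],
}


def retrieve_recipes(food_class: str, top_k: int = 1) -> list[Recipe]:
    """Return up to top_k stored recipes for the predicted class (empty if unknown)."""
    return RECIPE_DB.get(food_class.strip().lower(), [])[:top_k]


def build_rag_prompt(food_class: str, recipes: list[Recipe]) -> str:
    """Assemble an LLM prompt that grounds generation in the retrieved recipes."""
    context_blocks = []
    for recipe in recipes:
        numbered_steps = "\n".join(
            f"{i}. {step}" for i, step in enumerate(recipe.steps, start=1)
        )
        context_blocks.append(
            f"Title: {recipe.title}\n"
            f"Ingredients: {', '.join(recipe.ingredients)}\n"
            f"Steps:\n{numbered_steps}"
        )
    context = "\n\n".join(context_blocks) if context_blocks else "(no reference recipe found)"
    return (
        f"The photo shows: {food_class}.\n\n"
        f"Reference recipes:\n{context}\n\n"
        "Write a step-by-step recipe for this dish, using the references where relevant."
    )


if __name__ == "__main__":
    predicted_class = "pancake"  # stand-in for the classifier's output in (L1)
    print(build_rag_prompt(predicted_class, retrieve_recipes(predicted_class)))
```

Keeping retrieval and prompt construction as separate functions means the exact-match lookup could later be swapped for an embedding-based search without touching the prompt template.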

Deliverables

Poster, Demo (hopefully)

Expected number of team members

2-4 students

Expected duration in months

4-6 months

Data sets

Recipe1M, RecipeNLG, and, if possible, additional food photographs and recipes collected from the web

GPU cloud server

Some A6000 GPUs available
