T2 - kijouneli/EAR-MP-2025 GitHub Wiki
Title
Gourmet Plate Transformer (GPT): Food photograph-based recipe generation and Q&A with VLMs
Summary
-
Motivation
- Ever wondered how the food you saw on Instagram was cooked? Have you tried to find the recipe for your favorite restaurant dish on the internet?
Goal & Methodologies
- In this project, we aim to build a Vision-Language Model (VLM) that receives a food photograph as input and generates a recipe for the dish.
- To achieve this goal, we can try several approaches, depending on the students’ familiarity with deep learning.
(L1) Train a food photo classification model that identifies the food class, then retrieve the matching recipe from a large dataset.
(L2) Train a classifier as in (L1), then use a Retrieval-Augmented Generation (RAG)-based approach to prompt an LLM to generate the recipe.
(L3) Using the LLM setup from (L2), build a chatbot that can answer questions while the user is cooking.
(L4) Perform supervised finetuning (SFT) of a VLM on a large-scale food photograph and recipe dataset to build an end-to-end recipe generation model.
(L5) Treat step-by-step recipe instructions as the reasoning process for cooking a meal, and try a chain-of-thought (CoT)-based approach for “zero-shot” recipe generation from images.
- Prerequisites: Some familiarity with training and finetuning deep models
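
To make approaches (L1) and (L2) more concrete, here is a minimal sketch of the classify, retrieve, and prompt steps. Everything in it is an assumption made for illustration: `classify_photo` is a hypothetical stub standing in for a trained classifier, `CORPUS` is a tiny invented stand-in for a dataset such as Recipe1M or RecipeNLG, and the keyword-overlap retrieval is the simplest possible placeholder for a real retriever. The resulting prompt would then be passed to whichever LLM the team chooses.

```python
from dataclasses import dataclass


@dataclass
class Recipe:
    title: str
    ingredients: list
    steps: list


# Toy corpus invented for illustration; a real project would load Recipe1M or RecipeNLG.
CORPUS = [
    Recipe("Margherita pizza",
           ["pizza dough", "tomato sauce", "mozzarella", "basil"],
           ["Stretch the dough.", "Add sauce and cheese.", "Bake until browned."]),
    Recipe("Pepperoni pizza",
           ["pizza dough", "tomato sauce", "mozzarella", "pepperoni"],
           ["Stretch the dough.", "Add sauce and toppings.", "Bake until browned."]),
    Recipe("Caesar salad",
           ["romaine", "croutons", "parmesan", "caesar dressing"],
           ["Chop the romaine.", "Toss with dressing.", "Top with croutons."]),
]


def classify_photo(image_path: str) -> str:
    """Hypothetical stub for the (L1) classifier; always returns a fixed label."""
    # A real implementation would run an image model on image_path.
    return "pizza"


def retrieve(label: str, corpus: list, k: int = 2) -> list:
    """Return up to k recipes whose title shares at least one word with the label."""
    words = set(label.lower().split())
    scored = [(len(words & set(r.title.lower().split())), r) for r in corpus]
    scored = [pair for pair in scored if pair[0] > 0]
    # Stable sort: recipes with equal scores keep their corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:k]]


def build_prompt(label: str, retrieved: list) -> str:
    """Assemble a RAG-style prompt: predicted label plus retrieved recipes as context."""
    context = "\n\n".join(
        f"Title: {r.title}\n"
        f"Ingredients: {', '.join(r.ingredients)}\n"
        f"Steps: {' '.join(r.steps)}"
        for r in retrieved
    )
    return (
        f"The photo shows: {label}.\n"
        f"Reference recipes:\n{context}\n\n"
        "Write a step-by-step recipe for this dish."
    )


if __name__ == "__main__":
    label = classify_photo("photo.jpg")  # hypothetical file name
    print(build_prompt(label, retrieve(label, CORPUS)))
```

For (L1) alone, the output of `retrieve` is already the answer; (L2) adds the prompt-building step so that an LLM can adapt the retrieved recipes rather than return them verbatim.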
Deliverables
Poster, Demo (hopefully)
Expected number of team members
2-4 students
Expected duration in months
4-6 months
Data sets
Recipe1M, RecipeNLG, plus additional food photographs and recipes collected from the web, if possible
GPU cloud server
Some A6000 GPUs available