Plate2Recipe – Food Image to Recipe Generation
Multimodal two-stage pipeline: food image → ingredients → structured recipe generation.
GitHub · Report (PDF)
Stack: PyTorch · Hugging Face Datasets/Transformers · ViT · GPT-2 · TensorFlow (LSTM baseline)
- Problem: Generate structured recipes from a single food image (ingredients + instructions).
- Approach: ViT-based ingredient recognition + recipe text generation (GPT-2 / LSTM) conditioned on ingredients.
- Outcome: Produced coherent, structured recipes; documented practical limits around robustness and scalability.
This project aims to create an end-to-end system that converts food images into structured recipes, combining computer vision and natural language processing techniques.
- Two-stage multimodal pipeline: food image → ingredient/food label prediction (ViT) → recipe text generation (GPT-2 / LSTM).
- Vision (Food-101 fine-tuning): fine-tuned google/vit-base-patch16-224-in21k on Food-101 (224×224 inputs; augmentation such as cropping, flips, and rotations; trained on a T4 GPU in Colab). A minimal fine-tuning sketch follows this list.
- Ingredient extraction (Recipe1M+): trained an ingredient extractor using ViT feature embeddings and a decoder inspired by FIRE (Chhikara et al., 2023), with preprocessing to remove unreadable/empty images.
- Recipe generation (RecipeNLG): fine-tuned GPT-2 Medium using a structured prompt format (ingredients → title/ingredients/directions); the prompt-formatting sketch after this list shows the idea. Compared training on 100k vs 10k samples due to compute constraints.
- Iteration learnings: lower training loss on the 100k run did not imply better outputs (generations were poor); after adjusting the setup and training on 10k samples, outputs became more coherent and ingredient-consistent.
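A minimal sketch of the Food-101 fine-tuning stage, assuming the Hugging Face food101 dataset and the Trainer API; the split handling, hyperparameters, and collate function are illustrative rather than the exact configuration used in the project.

```python
import torch
from datasets import load_dataset
from transformers import (AutoImageProcessor, Trainer, TrainingArguments,
                          ViTForImageClassification)

MODEL = "google/vit-base-patch16-224-in21k"

# Food-101 from the HF hub: 101 food classes with "train"/"validation" splits.
train_ds = load_dataset("food101", split="train")
eval_ds = load_dataset("food101", split="validation")

processor = AutoImageProcessor.from_pretrained(MODEL)
model = ViTForImageClassification.from_pretrained(MODEL, num_labels=101)

def preprocess(batch):
    # Resize + normalize to the 224x224 RGB format the checkpoint expects.
    inputs = processor([img.convert("RGB") for img in batch["image"]],
                       return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

train_ds = train_ds.with_transform(preprocess)
eval_ds = eval_ds.with_transform(preprocess)

def collate(examples):
    return {
        "pixel_values": torch.stack([e["pixel_values"] for e in examples]),
        "labels": torch.tensor([e["labels"] for e in examples]),
    }

args = TrainingArguments(
    output_dir="vit-food101",
    per_device_train_batch_size=16,   # illustrative; small enough for a Colab T4
    num_train_epochs=3,
    learning_rate=2e-5,
    fp16=True,
    remove_unused_columns=False,      # keep the raw "image" column for with_transform
)

trainer = Trainer(model=model, args=args, data_collator=collate,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```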
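And a sketch of the structured prompt used to condition GPT-2 on ingredients. The field names (ner, title, ingredients, directions) follow the RecipeNLG schema, but the section markers and the example record are illustrative, not the exact format from the report.

```python
from transformers import GPT2Tokenizer

MAX_LEN = 512  # keep each training example inside GPT-2's context window

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def format_recipe(recipe):
    # Flatten one RecipeNLG-style record into the ingredients -> title /
    # ingredients / directions layout. At generation time the prompt stops
    # after "Title:" and the model continues with the rest of the structure.
    return (
        "Ingredients: " + ", ".join(recipe["ner"]) + "\n"
        "Title: " + recipe["title"] + "\n"
        "Ingredient list:\n- " + "\n- ".join(recipe["ingredients"]) + "\n"
        "Directions:\n" + "\n".join(recipe["directions"])
        + tokenizer.eos_token
    )

example = {
    "ner": ["chicken", "garlic", "lemon"],
    "title": "Lemon Garlic Chicken",
    "ingredients": ["2 chicken breasts", "3 cloves garlic", "1 lemon"],
    "directions": ["Preheat oven to 200C.", "Roast the chicken for 25 minutes."],
}

# Truncation enforces the length control needed to fit the model limits.
tokens = tokenizer(format_recipe(example), truncation=True, max_length=MAX_LEN)
```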
Results & Failure Modes
- ViT fine-tuning: training/eval loss decreased and eval accuracy increased during Food-101 fine-tuning (reported learning curves).
- GPT-2 generation: pretrained GPT-2 produced irrelevant/hallucinated recipe text; early fine-tuning attempts produced nonsensical outputs.
- 10k vs 100k GPT-2 runs: the 10k run produced more coherent step-by-step instructions and better ingredient adherence, while the 100k run often degraded despite lower training loss.
- LSTM baseline: the character-level LSTM showed signs of overfitting and produced unstable outputs; early stopping was identified as a likely improvement (a sketch follows below).
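If the LSTM baseline were revisited, early stopping is a one-line change in Keras; the monitored metric and patience below are illustrative, not values from the project.

```python
import tensorflow as tf

# Stop when validation loss stops improving and keep the best weights,
# instead of training for a fixed epoch count and overfitting.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,                  # epochs without improvement before stopping
    restore_best_weights=True,
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, callbacks=[early_stop])
```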
Engineering Notes
- Dataset handling: used Hugging Face Datasets for efficient loading and preprocessing; relied on streaming/iterating patterns to manage large corpora.
- Image preprocessing: standardized inputs to 224×224, normalized RGB channels, and applied augmentation to improve robustness; an augmentation/streaming sketch follows this list.
- Prompt formatting matters: GPT-2 training required structured recipe formatting (title/ingredients/directions) and length control to fit model limits.
- Quality ≠ loss: the 100k-sample GPT-2 run produced poor text despite lower training loss; the smaller 10k run produced more coherent, usable outputs after adjustments.
- Pipeline brittleness: ingredient prediction errors propagate directly into generation quality—making the vision stage a major bottleneck for end-to-end performance.
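A sketch of the augmentation and streaming patterns described above, assuming torchvision transforms and Hugging Face Datasets streaming mode; the crop scale, rotation range, and normalization values are illustrative.

```python
from datasets import load_dataset
from torchvision import transforms

# Train-time augmentation: random crops, flips, and small rotations,
# followed by a 224x224 resize and per-channel normalization.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

def augment(batch):
    batch["pixel_values"] = [train_tf(img.convert("RGB")) for img in batch["image"]]
    return batch

# Streaming iterates over the corpus without downloading it all up front.
stream = load_dataset("food101", split="train", streaming=True)
stream = stream.map(augment, batched=True)
```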
What I'd Improve Next
- Evaluation harness: add quantitative checks for ingredient precision/recall and recipe consistency (ingredient coverage in generated steps).
- Constrained generation: enforce ingredient grounding, e.g., post-check that generated recipes don’t introduce unseen ingredients (see the sketch after this list).
- Better conditioning: move from plain ingredient prompts to structured slots (title → ingredients → steps) with stronger formatting constraints.
- Robustness: test across image quality, occlusion, and multi-dish scenes; improve ingredient extractor with calibration and thresholding.
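A sketch of the kind of post-generation check proposed above: score how many predicted ingredients actually appear in the generated text, and flag likely hallucinated ingredients. The string matching and the small vocabulary are naive and purely illustrative; a real check would use a proper ingredient lexicon with synonym and plural handling.

```python
import re

def grounding_report(predicted_ingredients, generated_text):
    """Check a generated recipe against the ingredients it was conditioned on.

    Returns (coverage, unseen) where:
      coverage - fraction of predicted ingredients mentioned in the text
      unseen   - words flagged as likely hallucinated ingredients (naive check)
    """
    text = generated_text.lower()
    predicted = {ing.lower() for ing in predicted_ingredients}

    used = {ing for ing in predicted if ing in text}
    coverage = len(used) / len(predicted) if predicted else 0.0

    # Illustrative mini-vocabulary of common ingredients used for the
    # hallucination check; anything mentioned but not predicted is flagged.
    vocab = {"chicken", "beef", "pork", "shrimp", "tofu", "garlic", "onion",
             "butter", "cream", "cheese", "lemon", "basil", "cilantro"}
    mentioned = {w for w in re.findall(r"[a-z]+", text) if w in vocab}
    unseen = mentioned - predicted
    return coverage, sorted(unseen)

coverage, unseen = grounding_report(
    ["chicken", "garlic", "lemon"],
    "Roast the chicken with garlic, lemon and a knob of butter.",
)
print(coverage, unseen)   # 1.0 ['butter']
```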