Plate2Recipe – Food Image to Recipe Generation

Multimodal two-stage pipeline: food image → ingredients → structured recipe generation.

GitHub · Report (PDF)


Stack: PyTorch · Hugging Face Datasets/Transformers · ViT · GPT-2 · TensorFlow (LSTM baseline)


  • Problem: Generate structured recipes from a single food image (ingredients + instructions).
  • Approach: ViT-based ingredient recognition + recipe text generation (GPT-2 / LSTM) conditioned on ingredients.
  • Outcome: Produced coherent, structured recipes; documented practical limits around robustness and scalability.

This project aims to create an end-to-end system that converts food images into structured recipes, combining computer vision and natural language processing techniques.

  • Two-stage multimodal pipeline: food image → ingredient/food label prediction (ViT) → recipe text generation (GPT-2 / LSTM); a minimal inference sketch follows this list.
  • Vision (Food-101 fine-tuning): fine-tuned google/vit-base-patch16-224-in21k on Food-101 (224×224 inputs with augmentation such as random crops, flips, and rotations; trained on a T4 GPU in Colab).
  • Ingredient extraction (Recipe1M+): trained an ingredient extractor using ViT feature embeddings and a decoder inspired by FIRE (Chhikara et al., 2023), with preprocessing to remove unreadable/empty images.
  • Recipe generation (RecipeNLG): fine-tuned GPT-2 Medium using a structured prompt format (ingredients → title/ingredients/directions). Compared training on 100k vs 10k samples due to compute constraints.
  • Iteration learnings: lower training loss on the 100k run did not translate into better outputs (generations were poor); after adjusting the training setup and switching to the 10k subset, outputs became more coherent and ingredient-consistent.
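
Below is a minimal inference-time sketch of the two-stage pipeline, assuming the fine-tuned ViT and GPT-2 checkpoints have been saved locally. The checkpoint paths and prompt template are illustrative, not the project's exact ones, and the ingredient-extraction stage is collapsed into a single ViT label for brevity.

```python
from PIL import Image
import torch
from transformers import (
    ViTImageProcessor, ViTForImageClassification,
    GPT2TokenizerFast, GPT2LMHeadModel,
)

# Illustrative checkpoint paths; the project's actual fine-tuned weights may differ.
vit_processor = ViTImageProcessor.from_pretrained("./vit-food101-finetuned")
vit_model = ViTForImageClassification.from_pretrained("./vit-food101-finetuned").eval()

gpt2_tokenizer = GPT2TokenizerFast.from_pretrained("./gpt2-medium-recipenlg")
gpt2_model = GPT2LMHeadModel.from_pretrained("./gpt2-medium-recipenlg").eval()

def image_to_recipe(image_path: str, max_new_tokens: int = 256) -> str:
    # Stage 1: predict the food label / ingredient cue from the image with ViT.
    image = Image.open(image_path).convert("RGB")
    inputs = vit_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = vit_model(**inputs).logits
    label = vit_model.config.id2label[int(logits.argmax(-1))]

    # Stage 2: condition GPT-2 on the predicted label/ingredients via a structured prompt.
    prompt = f"Ingredients: {label}\nTitle:"  # hypothetical prompt template
    input_ids = gpt2_tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        output_ids = gpt2_model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            pad_token_id=gpt2_tokenizer.eos_token_id,
        )
    return gpt2_tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(image_to_recipe("pizza.jpg"))
```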

Results & Failure Modes

  • ViT fine-tuning: training/eval loss decreased and eval accuracy increased over the course of Food-101 fine-tuning (learning curves are in the linked report).
  • GPT-2 generation: pretrained GPT-2 produced irrelevant/hallucinated recipe text; early fine-tuning attempts produced nonsensical outputs.
  • 10k vs 100k GPT-2 runs: the 10k run produced more coherent step-by-step instructions and better ingredient adherence, while the 100k run often degraded despite lower training loss.
  • LSTM baseline: the character-level LSTM showed signs of overfitting and produced unstable outputs; early stopping was identified as a likely improvement (see the sketch after this list).
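
For the LSTM baseline, early stopping is a small change in Keras. This is a sketch under assumed hyperparameters and dummy data, not the project's actual baseline code:

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 128   # assumed character vocabulary size
SEQ_LEN = 100      # assumed input window length

# Minimal character-level LSTM standing in for the project's baseline.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy (input window, next character) pairs; a real run would use the recipe corpus.
x = np.random.randint(0, VOCAB_SIZE, size=(512, SEQ_LEN))
y = np.random.randint(0, VOCAB_SIZE, size=(512,))

# Stop once validation loss stops improving and roll back to the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True,
)

model.fit(x, y, validation_split=0.1, epochs=50, callbacks=[early_stop])
```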

Engineering Notes

  • Dataset handling: used Hugging Face Datasets for efficient loading and preprocessing; relied on streaming/iterating patterns to manage large corpora.
  • Image preprocessing: standardized inputs to 224×224, normalized RGB channels, and applied augmentation to improve robustness.
  • Prompt formatting matters: GPT-2 training required structured recipe formatting (title/ingredients/directions) and length control to fit the model's context window; a formatting sketch follows this list.
  • Quality ≠ loss: the 100k-sample GPT-2 run produced poor text despite lower training loss; the smaller 10k run produced more coherent, usable outputs after adjustments.
  • Pipeline brittleness: ingredient prediction errors propagate directly into generation quality, making the vision stage a major bottleneck for end-to-end performance.
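
A sketch of the kind of structured formatting and length control described above; the field layout and token budget are illustrative rather than the exact format used for fine-tuning:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
MAX_TOKENS = 1024  # GPT-2's context window

def format_example(title: str, ingredients: list[str], directions: list[str]) -> str:
    """Serialize one RecipeNLG-style record: ingredient prompt, then the structured target."""
    text = (
        "Ingredients: " + ", ".join(ingredients) + "\n"
        "Title: " + title + "\n"
        "Ingredients:\n" + "\n".join(f"- {i}" for i in ingredients) + "\n"
        "Directions:\n" + "\n".join(f"- {step}" for step in directions)
        + tokenizer.eos_token
    )
    # Length control: truncate anything that would exceed the context window.
    ids = tokenizer(text, truncation=True, max_length=MAX_TOKENS)["input_ids"]
    return tokenizer.decode(ids)

print(format_example(
    "Tomato Soup",
    ["tomatoes", "onion", "vegetable stock", "olive oil"],
    ["Sauté the onion in olive oil.",
     "Add the tomatoes and stock; simmer for 20 minutes.",
     "Blend until smooth."],
))
```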

What I'd Improve Next

  • Evaluation harness: add quantitative checks for ingredient precision/recall and recipe consistency (ingredient coverage in generated steps).
  • Constrained generation: enforce ingredient grounding (e.g., post-check that generated recipes don’t introduce unseen ingredients); a sketch of this check and the coverage metric above follows this list.
  • Better conditioning: move from plain ingredient prompts to structured slots (title → ingredients → steps) with stronger formatting constraints.
  • Robustness: test across image quality, occlusion, and multi-dish scenes; improve ingredient extractor with calibration and thresholding.
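
A minimal sketch of what these checks could look like, using naive whole-word matching (ignoring quantities and synonyms); the function names and ingredient vocabulary are illustrative:

```python
import re

def mentioned(ingredient: str, text: str) -> bool:
    """Crude check: does the ingredient appear as a whole word in the text?"""
    return re.search(rf"\b{re.escape(ingredient.lower())}\b", text.lower()) is not None

def ingredient_coverage(predicted: list[str], recipe_text: str) -> dict:
    """Recall-like coverage: how many predicted ingredients the generated recipe actually uses."""
    used = [ing for ing in predicted if mentioned(ing, recipe_text)]
    return {
        "coverage": len(used) / max(len(predicted), 1),
        "missing": [ing for ing in predicted if ing not in used],
    }

def unseen_ingredients(recipe_text: str, predicted: list[str], vocabulary: list[str]) -> list[str]:
    """Flag known ingredients that appear in the recipe but were never predicted from the image."""
    predicted_set = {p.lower() for p in predicted}
    return [w for w in vocabulary if w.lower() not in predicted_set and mentioned(w, recipe_text)]

recipe = "Dice the onion and tomatoes, simmer in stock, then stir in cream."
print(ingredient_coverage(["onion", "tomatoes", "stock"], recipe))
print(unseen_ingredients(recipe, ["onion", "tomatoes", "stock"], ["cream", "butter", "garlic"]))
```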