April 5, 2025

LLM-based text-to-image and image-to-image search


This project explores how multimodal models can encode both text and images into the same embedding space, enabling nearest neighbor search from either modality.

This unlocks intuitive and flexible product search experiences:

  • 🔴 Text-to-image search: Search with a natural language prompt like “a powerful and dramatic red dress” and retrieve visually matching products.
  • 🖼 Image-to-image search: Upload a photo of a look you love and discover similar items across the catalog.

At the core of this system is OpenAI’s CLIP, used for embedding both product images and search queries into a shared vector space. The retrieval is powered by FAISS, Meta’s efficient similarity search library.
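A minimal sketch of that pipeline, assuming CLIP ViT-B/32 loaded through Hugging Face Transformers and a flat inner-product FAISS index over L2-normalized embeddings; the two catalog file names are hypothetical placeholders, not the project’s actual data.

```python
import faiss
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed the product images into the shared vector space.
product_images = ["red_dress.jpg", "blue_jacket.jpg"]  # hypothetical catalog files
images = [Image.open(path) for path in product_images]
with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=images, return_tensors="pt")).numpy()
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)  # unit vectors

# Index the catalog with FAISS (inner product on unit vectors == cosine similarity).
index = faiss.IndexFlatIP(image_emb.shape[1])
index.add(image_emb.astype("float32"))

# Text-to-image search: embed the query with the same model, then retrieve neighbors.
with torch.no_grad():
    text_inputs = processor(text=["a powerful and dramatic red dress"], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs).numpy()
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

scores, ids = index.search(text_emb.astype("float32"), 5)
print([product_images[i] for i in ids[0] if i != -1])  # best matches first
```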

The online demo uses a small dataset (~100 products from online fashion retailers), so results should be viewed as a proof-of-concept rather than a production system.

🧩 Features

  • 🔍 Multimodal embedding of images and text into a shared vector space
  • 🧠 FAISS-based fast nearest neighbor retrieval
  • 🧪 Dual search modes: text-to-product and image-to-product (the image-to-image path is sketched after this list)
  • 📸 Visual preview of top matches
  • 📊 Suitable for e-commerce, fashion tech, and style-based recommendation engines
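As a companion to the text-to-image snippet above, here is a minimal sketch of the image-to-image mode. It reuses the `model`, `processor`, `index`, and `product_images` objects defined there, and the query file name is hypothetical.

```python
import numpy as np
import torch
from PIL import Image

# Reuses model, processor, index, product_images from the previous snippet.
query = Image.open("street_style_photo.jpg")  # hypothetical photo of a look you like
with torch.no_grad():
    query_emb = model.get_image_features(**processor(images=query, return_tensors="pt")).numpy()
query_emb /= np.linalg.norm(query_emb, axis=1, keepdims=True)

scores, ids = index.search(query_emb.astype("float32"), 5)
for score, i in zip(scores[0], ids[0]):
    if i != -1:
        print(f"{product_images[i]}  (cosine similarity: {score:.3f})")
```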

💡 Technologies used

  • OpenAI CLIP – Multimodal text/image encoder (via Hugging Face Transformers)
  • FAISS – Scalable vector index and similarity search
  • Python – Core backend and logic
  • Streamlit – UI for demo and result display (see the UI sketch after this list)
  • Pandas / NumPy – Dataset handling and result formatting
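The demo UI can be wired up in a short Streamlit script. The sketch below is an illustration of that wiring, not the project’s actual app; it again assumes the `model`, `processor`, `index`, and `product_images` objects from the earlier snippets (in a real app these would be cached, e.g. with `st.cache_resource`), and the widget labels are illustrative.

```python
import numpy as np
import streamlit as st
import torch
from PIL import Image

# model, processor, index, product_images are assumed to be loaded as in the earlier snippets.

def top_matches(embedding, k=5):
    """Return the catalog images closest to a (1, d) CLIP embedding."""
    embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
    _, ids = index.search(embedding.astype("float32"), k)
    return [product_images[i] for i in ids[0] if i != -1]

st.title("Text and image product search")
mode = st.radio("Search mode", ["Text-to-image", "Image-to-image"])

if mode == "Text-to-image":
    prompt = st.text_input("Describe what you are looking for")
    if prompt:
        with torch.no_grad():
            emb = model.get_text_features(**processor(text=[prompt], return_tensors="pt", padding=True)).numpy()
        st.image(top_matches(emb), width=180)
else:
    upload = st.file_uploader("Upload a photo", type=["jpg", "jpeg", "png"])
    if upload:
        with torch.no_grad():
            emb = model.get_image_features(**processor(images=Image.open(upload), return_tensors="pt")).numpy()
        st.image(top_matches(emb), width=180)
```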

๐ŸŒ Resources

👉 View live demo – feel free to wake it up if it is sleeping

👉 GitHub repo