LLM-based text-to-image and image-to-image search
This project explores the capabilities of multimodal models to encode both text and images into the same embedding space, enabling seamless nearest neighbor search based on either modality.
This unlocks intuitive and flexible product search experiences:
- Text-to-image search: search with a natural language prompt like "a powerful and dramatic red dress" and retrieve visually matching products.
- Image-to-image search: upload a photo of a look you love and discover similar items across the catalog.
At the core of this system is OpenAI's CLIP, used for embedding both product images and search queries into a shared vector space. The retrieval is powered by FAISS, Meta's efficient similarity search library.
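The snippet below is a minimal sketch of that idea, not the project's actual code: it uses the CLIP classes from Hugging Face Transformers to embed a product image and a text query into the same L2-normalized vector space. The `openai/clip-vit-base-patch32` checkpoint is an assumption; the demo may use a different variant.

```python
# Minimal sketch (illustrative, not the project's code): embed an image and a
# text query into the same CLIP vector space via Hugging Face Transformers.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(image_source) -> np.ndarray:
    """Encode an image (file path or file-like object) into a unit-length CLIP embedding."""
    inputs = processor(images=Image.open(image_source), return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    features = features / features.norm(dim=-1, keepdim=True)  # L2-normalize for cosine similarity
    return features[0].numpy()

def embed_text(query: str) -> np.ndarray:
    """Encode a text query into the same embedding space."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    features = features / features.norm(dim=-1, keepdim=True)
    return features[0].numpy()
```

Because both encoders are normalized into the same space, a single index over product image embeddings can serve queries coming from either modality.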
The online demo uses a small dataset (~100 products from online fashion retailers), so results should be viewed as a proof-of-concept rather than a production system.
Features
- Multimodal embedding of images and text into a shared vector space
- FAISS-based fast nearest neighbor retrieval (see the retrieval sketch after this list)
- Dual search modes: text-to-product and image-to-product
- Visual preview of top matches
- Suitable for e-commerce, fashion tech, and style-based recommendation engines
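A minimal retrieval sketch, assuming the `embed_image` / `embed_text` helpers above and an illustrative list of product image paths (the catalog names below are hypothetical): it builds a flat inner-product FAISS index, which is equivalent to cosine similarity on unit vectors, and answers both search modes from the same index.

```python
# Minimal sketch: index product embeddings with FAISS and query in either mode.
import faiss
import numpy as np

product_paths = ["catalog/dress_001.jpg", "catalog/dress_002.jpg"]  # hypothetical paths
vectors = np.stack([embed_image(p) for p in product_paths]).astype("float32")

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
index.add(vectors)

def search(query_vector: np.ndarray, k: int = 5):
    """Return (similarity score, product image path) pairs for the k nearest products."""
    scores, ids = index.search(query_vector.reshape(1, -1).astype("float32"), k)
    return [(float(s), product_paths[i]) for s, i in zip(scores[0], ids[0])]

# Text-to-product and image-to-product queries hit the same index:
print(search(embed_text("a powerful and dramatic red dress")))
print(search(embed_image("my_inspiration_photo.jpg")))
```

A flat index is sufficient at the demo's scale (~100 products); a larger catalog could swap in an approximate FAISS index without changing the query code.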
Technologies used
- OpenAI CLIP – Multimodal text/image encoder (via Hugging Face Transformers)
- FAISS – Scalable vector index and similarity search
- Python – Core backend and logic
- Streamlit – UI for demo and result display (a minimal UI sketch follows this list)
- Pandas / NumPy – Dataset handling and result formatting
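For illustration only, a rough sketch of how the two search modes could be exposed in Streamlit, reusing the `embed_text` / `embed_image` / `search` helpers sketched above; the actual demo app will differ in layout and details.

```python
# Minimal sketch (not the project's actual app): a Streamlit page with both search modes.
import streamlit as st

st.title("Product search demo")
mode = st.radio("Search mode", ["Text prompt", "Upload an image"])

if mode == "Text prompt":
    query = st.text_input("Describe what you are looking for")
    results = search(embed_text(query)) if query else []
else:
    upload = st.file_uploader("Reference photo", type=["jpg", "jpeg", "png"])
    results = search(embed_image(upload)) if upload else []

# Visual preview of the top matches
for score, path in results:
    st.image(path, caption=f"similarity: {score:.3f}")
```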
Resources
View the live demo – feel free to wake it up if it is sleeping
GitHub repo