Scraping a JavaScript website using Scrapy, Splash, Docker and MongoDB



GitHub repository for this post:


So I was a senior data scientist at this lovely retail tech company that had a lot of data coming from crawlers monitoring e-commerce sites. I worked with NLP, interpreting product descriptions, and with computer vision, identifying colours, patterns, details and so on. The thing is - I never actually built the crawlers, so I wanted to try to do it myself!

I picked a couple of Brazilian retailer websites (this one and this one) that use JavaScript and have pagination, and I wanted to get all products from every page of the dresses category and save the following information to a MongoDB database (a rough Scrapy Item for these fields is sketched after the list):

  • Product id
  • Retailer id
  • Product URL
  • Product name
  • Product price
  • URLs of all available images
  • Crawl timestamp
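
To make the target concrete, here is a minimal sketch of how these fields could be declared as a Scrapy Item. The class and field names are my own for illustration, not necessarily the ones used in the repository:

```python
# Rough sketch of an Item holding the fields listed above
# (names are illustrative, not taken from the actual repo).
import scrapy


class ProductItem(scrapy.Item):
    product_id = scrapy.Field()
    retailer_id = scrapy.Field()
    url = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()
    image_urls = scrapy.Field()
    crawled_at = scrapy.Field()
```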

Scraping dynamic websites

Most e-commerce sites are dynamic, meaning the page is rendered with client-side JavaScript. So in order to get the full HTML, each page needs to be loaded by something that behaves like a browser.

Splash

In a nutshell, Splash is like a browser. It opens a page and acts as a browser would, processing the JavaScript and rendering the page properly.

Scrapy

Scrapy is a Python crawling framework. It is very flexible, has a huge community, and is fast and easy to use. This is where we define which website we want to crawl, the rules for extracting the fields we want, and how to interact with controls and paginators. It also has an item pipeline feature that lets us define what to do with each scraped item - in our case, store it in a database. It is probably the most popular web scraping tool for Python.
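
Here is a minimal sketch of what a spider rendering pages through Splash could look like, using the scrapy-splash plugin. The start URL, CSS selectors and pagination selector are placeholders - they depend entirely on the retailer's actual markup - and the project also needs the scrapy-splash middlewares and SPLASH_URL configured in settings.py:

```python
# Sketch of a spider that renders each page through Splash.
# Selectors and URLs below are placeholders, not the real site's markup.
import scrapy
from scrapy_splash import SplashRequest


class DressesSpider(scrapy.Spider):
    name = "dresses"

    def start_requests(self):
        url = "https://www.example-retailer.com.br/dresses"  # placeholder
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # Each product card on the listing page (illustrative selectors)
        for product in response.css("div.product-card"):
            yield {
                "product_id": product.attrib.get("data-product-id"),
                "url": response.urljoin(product.css("a::attr(href)").get()),
                "name": product.css("h3::text").get(),
                "price": product.css("span.price::text").get(),
                "image_urls": product.css("img::attr(src)").getall(),
            }

        # Follow the paginator, still rendering through Splash
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield SplashRequest(
                response.urljoin(next_page), self.parse, args={"wait": 2}
            )
```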

MongoDB

MongoDB is a NoSQL database that stores JSON-like documents. It also has a huge community, integrates easily with Python (via pymongo), and offers a cloud-hosted service with a fair free tier.
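
As a sketch of how the item pipeline could hand products to MongoDB, here is a minimal pipeline that upserts each item using its product_id as the document _id. The setting names (MONGO_URI, MONGO_DATABASE) and the collection name are assumptions for illustration; the pipeline would also need to be enabled in ITEM_PIPELINES:

```python
# Sketch of an item pipeline that upserts each product into MongoDB.
# Setting names and collection name are assumptions, not the repo's.
import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "products"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Use product_id as the MongoDB document id and upsert
        doc = dict(item)
        doc["_id"] = doc.pop("product_id")
        self.db["dresses"].replace_one({"_id": doc["_id"]}, doc, upsert=True)
        return item
```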

Docker

Probably my favorite dev tool, Docker creates an isolated environment (a container) for your app to run in. This container encapsulates everything the app needs, so it solves the famous “it works on my machine” problem - the container is like a mini machine running your app. Super convenient and easy to use. Here we are using two containers: one for Splash and one for running Scrapy.

Connecting the dots

As mentioned, the websites use JavaScript, so I’m using Splash to mimic a browser. Splash already runs in a container (the image is provided by Scrapinghub), and I decided to run Scrapy from a container as well, so I put together a minimal Dockerfile with Scrapy and pymongo and a docker-compose.yml to run everything. This makes it easy to deploy wherever I want (or run it locally, isolated on my machine). I’m saving the results to a MongoDB cloud collection, using each product_id as the document id.
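
As a rough idea of what that docker-compose.yml could look like (service names, build details and the environment variable are illustrative, not copied from the repository):

```yaml
# Sketch of the two-container setup: Splash plus the Scrapy runner.
version: "3"
services:
  splash:
    image: scrapinghub/splash
    ports:
      - "8050:8050"
  scraper:
    build: .            # Dockerfile installing scrapy, scrapy-splash and pymongo
    depends_on:
      - splash
    environment:
      - SPLASH_URL=http://splash:8050   # read by settings.py (assumption)
    command: scrapy crawl dresses
```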

Closing thoughts

Being able to scrape websites is a great skill to have - it empowers you to build data apps, monitor websites, create event-based alerts, the sky is the limit! For instance, since I’m into computer vision, here I’m focusing on getting the product image URLs to use in other examples that will be published here - so if there’s no link to other articles right now, come back soon!!

Hope you enjoy and try to build something yourself!
