Scrapy is a useful web-crawling framework for Python.
On its own, Scrapy can handle static websites. A static website has fixed content coded in HTML and is displayed in the browser exactly as it is stored.
A dynamic website, however, contains content that changes depending on various factors. To crawl such sites, a browser that can run JavaScript is needed. Splash is a JavaScript rendering service that loads the dynamic content.
This article will show you how to set it up!
How to Install Scrapy Splash?
First off, let's look at how to install and set up Splash.
There is a little more to this than just installing the Python package using pip.
To run Splash, a piece of software called Docker is needed.
Docker is an open-source containerization platform. It enables developers to package applications into containers: standardized executable components that combine application source code with the operating system libraries and dependencies required to run that code in any environment.
Use this link to download Docker:
Once Docker is installed and you can start the Docker app, execute the following command in a shell.
This will download the Splash Docker image.
docker pull scrapinghub/splash
After that, in the Docker app, select Images. scrapinghub/splash should now be listed there, like in the image below. From here, press the Run button to the right of the image.
A window will then appear. Press Optional Settings to expand it.
Fill in the name you want for the container. I simply used "splash" for mine.
The "Local host" field also needs to be filled in. It suggests 8050 by default, so I went with that. After these fields are filled in, press the Run button in the lower right corner of the window.
In your Docker app, navigate to Containers / Apps. The splash container should now appear, like this.
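The same container can most likely be started from a script instead of the Docker app. Below is a minimal sketch, assuming the docker command-line tool is installed and on your PATH. The helper names build_splash_run_command and start_splash are made up for this example and are not part of Docker or Splash. The name "splash" and host port 8050 simply mirror the values entered in the window above.

```python
import subprocess


def build_splash_run_command(name="splash", host_port=8050):
    """Build a `docker run` command for the scrapinghub/splash image.

    Hypothetical helper for illustration; not part of Docker or Splash.
    """
    return [
        "docker", "run",
        "--detach",                        # run the container in the background
        "--name", name,                    # container name, "splash" as in the steps above
        "--publish", f"{host_port}:8050",  # host port -> Splash's port inside the container
        "scrapinghub/splash",
    ]


def start_splash(name="splash", host_port=8050):
    """Start the container; raises CalledProcessError if docker reports an error."""
    subprocess.run(build_splash_run_command(name, host_port), check=True)
```

Calling start_splash() once should leave a running container named splash. Running it a second time with the same name is expected to fail, because Docker does not allow two containers with the same name.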
To make sure everything is running as it should, either open a browser and go to http://localhost:8050/, or press the button that says Open in Browser, like in the image above. That will open http://localhost:8050/ in your preferred browser.
If everything is working, the Splash page should appear.
The references also include a link to the Splash documentation on how to install Docker and set it up for Splash [1].
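The browser check can also be done from Python. This is a small sketch that assumes Splash is published on port 8050 as configured earlier. The function name splash_is_up is invented for this example. It only checks that the address answers with HTTP status 200, not that the page really comes from Splash.

```python
import urllib.request


def splash_is_up(url="http://localhost:8050/", timeout=5.0):
    """Return True if an HTTP GET to `url` answers with status 200.

    Hypothetical helper for illustration; not part of Splash or scrapy-splash.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts are all OSError subclasses
        return False
```

If splash_is_up() returns False, a likely cause is that the container is not running or that a different port was chosen in the "Local host" field.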
Now it's time to install the scrapy-splash package using pip. Run the following command in a shell, in your environment of choice, to download and install scrapy-splash.
pip install scrapy-splash
Once scrapy-splash has been successfully installed, everything should be good to go.
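Before a Scrapy project can actually send requests through Splash, it usually needs a few extra lines in its settings.py. The sketch below follows the configuration described in the scrapy-splash README as best I recall it, so treat the exact middleware names and priority numbers as something to verify against the README for your installed version. SPLASH_URL assumes the container is published on port 8050, as set up above.

```python
# settings.py of your Scrapy project (sketch; verify against the scrapy-splash README)

# Address of the running Splash container; 8050 is the port chosen earlier
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep cookies and compression working
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoid storing duplicate Splash arguments for every request
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

If you picked a different port in the Docker app, change SPLASH_URL to match.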
Where to Go From Here?
You can now dive into our tutorial on how to scrape dynamic websites using scrapy-splash here: