Custom Images contract¶
Warning
This package is deprecated, please use shub (>=2.5.0) instead.
A contract is a set of requirements that any custom crawler Docker image has to comply with in order to run on Scrapy Cloud.
Scrapy crawler Docker images are already supported via the scrapinghub-entrypoint-scrapy contract implementation. If you want to run crawlers built with a framework or language other than Scrapy/Python, you have to make sure your image follows the contract statements listed below.
Contract statements¶
- The Docker image should be able to run via the start-crawl command without arguments:
docker run myscrapyimage start-crawl
- The Docker image should be able to return a list of spiders via the list-spiders command without arguments:
docker run myscrapyimage list-spiders
- The crawler should be able to get all needed parameters from system environment variables.
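As a minimal sketch of the second statement, a non-Scrapy image could install a small script as its list-spiders executable that simply prints one spider name per line. The spider names and the one-name-per-line output format here are assumptions for illustration, not part of the contract text above.

```python
#!/usr/bin/env python
"""Hypothetical list-spiders entrypoint for a non-Scrapy image.

A minimal sketch: the image would install this script as the
`list-spiders` executable. The spider names below are invented.
"""

SPIDERS = ["example-spider", "another-spider"]  # assumed spider registry


def main():
    # Print each known spider name on its own line (assumed output format).
    for name in SPIDERS:
        print(name)


if __name__ == "__main__":
    main()
```

A start-crawl executable would be installed in the image the same way.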
Environment variables¶
SHUB_JOB_DATA¶
Job arguments, in JSON format.
Example:
{"key": "1111112/2/2", "project": 1111112, "version": "version1",
"spider": "spider-name", "spider_type": "auto", "tags": [],
"priority": 2, "scheduled_by": "user", "started_by": "admin",
"pending_time": 1460374516193, "running_time": 1460374557448, ... }
SHUB_SETTINGS¶
Job settings (i.e. organization / project / spider / job settings), in JSON format.
There are several layers of settings, each serving different needs.
The settings may contain the following sections (dict keys):
organization_settings
project_settings
spider_settings
job_settings
enabled_addons
Organization / project / spider / job settings define the same settings at different levels, with different priorities. Enabled addons define Scrapinghub addon-specific settings and may have an extended structure.
All the settings should replicate the Dash API project /settings/get.json endpoint response (except job_settings, if present):
http -a APIKEY: http://dash.scrapinghub.com/api/settings/get.json project==PROJECTID
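The layered structure above implies that a crawler has to merge the sections into one effective settings dict, letting higher-priority layers override lower ones. The following is a sketch under the assumption that priority increases from organization to job settings; it is not the entrypoint's actual code.

```python
import json
import os

# Layers ordered from lowest to highest priority (assumed ordering).
PRIORITY_ORDER = [
    "organization_settings",
    "project_settings",
    "spider_settings",
    "job_settings",
]


def effective_settings(shub_settings):
    """Flatten the layered SHUB_SETTINGS dict: later (higher-priority)
    layers override keys set by earlier ones."""
    merged = {}
    for layer in PRIORITY_ORDER:
        merged.update(shub_settings.get(layer, {}))
    return merged


settings = effective_settings(
    json.loads(os.environ.get("SHUB_SETTINGS", "{}")))
```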
Note
All environment variables starting with SHUB_
are reserved for Scrapinghub internal use and shouldn't be used for any other purpose (they will be dropped/replaced on job start).
Scrapy entrypoint¶
A base support wrapper written in Python that implements the Custom Images contract to run Scrapy-based Python crawlers and scripts on Scrapy Cloud.
The main functions of this wrapper are:
- providing the start-crawl entrypoint
- providing the list-spiders entrypoint (starting from version 0.7.0)
- translating system environment variables to Scrapy crawl/list commands
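The translation step can be pictured as building a Scrapy command line from the job environment. This is a hedged sketch: the `spider_args` key and the exact mapping are assumptions for illustration, and the real scrapinghub-entrypoint-scrapy code may differ.

```python
import json
import os


def build_crawl_argv(environ):
    """Sketch: turn job environment variables into a `scrapy crawl`
    command line. The "spider_args" key is a hypothetical name for
    per-job spider arguments, not a documented field."""
    job_data = json.loads(environ.get("SHUB_JOB_DATA", "{}"))
    argv = ["scrapy", "crawl", job_data["spider"]]
    for name, value in job_data.get("spider_args", {}).items():
        argv += ["-a", "%s=%s" % (name, value)]
    return argv
```

A wrapper would then exec this argv instead of asking the user to invoke Scrapy directly.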
Beyond that, the wrapper provides a number of additional features:
- parsing job data from environment
- processing job args and settings
- running a job with Scrapy
- collecting stats
- advanced logging & error handling
- transparent integration with Scrapinghub storage
- custom scripts support
scrapinghub-entrypoint-scrapy package is available on: