Custom Images contract

This is the set of requirements that any custom Docker image has to comply with in order to run on Scrapy Cloud.

Scrapy crawler Docker images are already supported via the scrapinghub-entrypoint-scrapy contract implementation. If you want to run crawlers built with a framework or language other than Scrapy/Python, you have to make sure your image follows the contract statements listed below. This means implementing your own scripts according to the specification. You can find example projects written in other frameworks and programming languages in the custom-images-examples repository. The shub bootstrap command can be used to clone these projects.

Contract statements

  1. The Docker image should be able to run via the start-crawl command without arguments. start-crawl should be executable and located on the search path.

    docker run myscrapyimage start-crawl

    The crawler will be started by the unprivileged user nobody in the writable directory /scrapinghub; the HOME environment variable will also be set to /scrapinghub. Beware that this directory is added dynamically when the job starts: if the Docker image already contains this directory, it will be erased.

  2. The Docker image should be able to return its metadata via the shub-image-info command without arguments. shub-image-info should be executable and located on the search path. For now, only a few fields are supported, and all of them are required:

  • project_type - a string project type, one of [scrapy, casperjs, other],

  • spiders - a list of non-empty string spider names.

    docker run myscrapyimage shub-image-info
    {"project_type": "casperjs", "spiders": ["spiderA", "spiderB"]}


shub-image-info is an extension of (and a replacement for) the former list-spiders command. It provides metadata in a structured form, which simplifies non-Scrapy development and allows custom images to be parametrized in a more configurable way.

The command should also handle an optional --debug flag by returning debug information about the image in an additional debug field: the name/version of the operating system, installed packages, etc. For example, for a Python-based custom image it could be a good idea to include the output of a pip freeze call. The debug field is plain text; it is deliberately left unstructured to keep things simple.
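As an illustration, a minimal shub-image-info script for a hypothetical image might look like the sketch below; the hard-coded spider names and the use of pip freeze for the debug field are assumptions for the example, not part of the contract:

```python
#!/usr/bin/env python3
"""Sketch of a shub-image-info command: print image metadata as JSON."""
import json
import subprocess
import sys

# Hypothetical hard-coded metadata; a real image would list its own spiders.
metadata = {"project_type": "other", "spiders": ["spiderA", "spiderB"]}

if "--debug" in sys.argv[1:]:
    # Plain-text debug info; for a Python image, `pip freeze` output fits well.
    metadata["debug"] = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True, text=True,
    ).stdout

print(json.dumps(metadata))
```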

  3. The crawler should be able to get all needed parameters from system environment variables.


The simplest way to place scripts on the search path is to create a symbolic link to each script in a directory that is present in the PATH environment variable. Here’s an example Dockerfile:

FROM python:3
RUN mkdir -p /spiders
WORKDIR /spiders
ADD . /spiders
# Create a symbolic link in /usr/sbin because it's present in the PATH
RUN ln -s /spiders/start-crawl /usr/sbin/start-crawl
RUN ln -s /spiders/shub-image-info /usr/sbin/shub-image-info
# Make scripts executable
RUN chmod +x /spiders/start-crawl /spiders/shub-image-info
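To make the contract concrete, here is a hedged sketch of what a minimal start-crawl launcher for a non-Scrapy stack could look like. crawl() is a placeholder for your framework’s entry point, and the spider_args field is an assumption; the SHUB_JOB_DATA environment variable is described under Environment variables below:

```python
#!/usr/bin/env python3
"""Sketch of a start-crawl command: read job parameters from the environment."""
import json
import os


def crawl(spider, spider_args):
    # Placeholder: start your framework's crawler here.
    print(f"running {spider} with args {spider_args}")


if __name__ == "__main__":
    # Job metadata arrives as JSON in the SHUB_JOB_DATA environment variable.
    job_data = json.loads(os.environ.get("SHUB_JOB_DATA", "{}"))
    crawl(job_data.get("spider"), job_data.get("spider_args", {}))
```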

Environment variables

SHUB_SPIDER

Spider name.

SHUB_JOBKEY

Job key in the format PROJECT_ID/SPIDER_ID/JOB_ID.

SHUB_JOB_DATA

Job arguments, in JSON format:

{"key": "1111112/2/2", "project": 1111112, "version": "version1",
"spider": "spider-name", "spider_type": "auto", "tags": ["tagA", "tagB"],
"priority": 2, "scheduled_by": "user", "started_by": "john",
"pending_time": 1460374516193, "running_time": 1460374557448, ... }

Some useful fields:

  • project - integer project ID

  • spider - string spider name

  • job_cmd - list of string arguments for the job, e.g. ["--flagA", "--key1=value1"]

  • spider_args - dictionary with spider arguments, e.g. {"arg1": "val1"}

  • version - string project version used to run the job

  • deploy_id - integer project deploy ID used to run the job

  • units - number of units used by the job

  • priority - job priority value

  • tags - list of string tags for the job, e.g. ["tagA", "tagB"]

  • state - current state name of the job

  • pending_time - UNIX timestamp when the job was added, in milliseconds

  • running_time - UNIX timestamp when the job was started, in milliseconds

  • scheduled_by - username of the user who scheduled the job

If you specified custom metadata via the meta field when scheduling the job, that data will also be present in the dictionary.


SHUB_JOB_DATA may contain other undocumented fields. They are for the platform’s internal use and are not part of the contract, i.e. they can appear or be removed at any time.
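Putting this together, a crawler can recover its identifiers from the environment with a few lines of Python. In the sketch below the example values are invented to match the documented formats, and the variable names (SHUB_JOBKEY for the job key, SHUB_JOB_DATA for the job data) follow the scrapinghub-entrypoint-scrapy implementation:

```python
import json
import os

# Invented example values matching the documented formats.
os.environ.setdefault("SHUB_JOBKEY", "1111112/2/2")
os.environ.setdefault("SHUB_JOB_DATA", json.dumps(
    {"key": "1111112/2/2", "project": 1111112,
     "spider": "spider-name", "tags": ["tagA", "tagB"]}))

# The job key splits into PROJECT_ID/SPIDER_ID/JOB_ID.
project_id, spider_id, job_id = os.environ["SHUB_JOBKEY"].split("/")

job_data = json.loads(os.environ["SHUB_JOB_DATA"])
print(project_id, job_data["spider"], job_data["tags"])
```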


SHUB_SETTINGS

Job settings (i.e. organization / project / spider / job settings), in JSON format.

There are several layers of settings, and they all serve different needs.

The settings may contain the following sections (dict keys):

  • organization_settings

  • project_settings

  • spider_settings

  • job_settings

  • enabled_addons

The organization / project / spider / job settings define the same settings at the corresponding levels, but with different priorities. Enabled addons define settings specific to Scrapinghub addons and may have an extended structure.

All the settings should replicate the response of the Dash API /settings/get.json endpoint for the project (except job_settings, if present):

http -a APIKEY: https://app.scrapinghub.com/api/settings/get.json project==PROJECTID
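A common way to resolve the layered settings in a crawler is to merge them from the broadest scope to the narrowest, so job settings override spider, project, and organization settings. The sketch below assumes that priority order and the SHUB_SETTINGS variable name from the scrapinghub-entrypoint-scrapy implementation; enabled_addons is skipped because of its extended structure:

```python
import json
import os


def resolve_settings(raw):
    """Merge setting layers; narrower scopes override broader ones."""
    layers = ("organization_settings", "project_settings",
              "spider_settings", "job_settings")
    merged = {}
    for layer in layers:
        merged.update(raw.get(layer, {}))
    return merged


raw_settings = json.loads(os.environ.get("SHUB_SETTINGS", "{}"))
settings = resolve_settings(raw_settings)
```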


All environment variables starting with SHUB_ are reserved for Scrapinghub internal use and shouldn’t be used for any other purpose (they will be dropped/replaced when a job starts).

Scrapy entrypoint

A base support wrapper, written in Python, that implements the Custom Images contract to run Scrapy-based Python crawlers and scripts on Scrapy Cloud.

The main functions of this wrapper are:

  • providing start-crawl entrypoint

  • providing shub-image-info entrypoint (starting from version 0.11.0)

  • translating system environment variables to Scrapy crawl / list commands

Beyond that, the wrapper provides a number of other features:

  • parsing job data from environment

  • processing job args and settings

  • running a job with Scrapy

  • collecting stats

  • advanced logging & error handling

  • transparent integration with Scrapinghub storage

  • custom scripts support

The scrapinghub-entrypoint-scrapy package is available on PyPI and GitHub.

Scrapy addons

If you have Scrapy addons enabled in the Dash UI, you may encounter errors similar to:

[sh_scrapy.settings] Addon import error scrapy_pagestorage.PageStorageMiddleware:  No module named scrapy_pagestorage

Since you are in control of your Docker image’s content, you should add all missing packages to the requirements.txt file yourself (including dependencies related to the Scrapy addons), or disable the corresponding addons in the Dash UI.
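For example, if the Page Storage addon is enabled, the error above would typically be fixed by adding the addon’s package to requirements.txt; scrapy-pagestorage is assumed here to be the package providing the scrapy_pagestorage module:

```
scrapy
scrapinghub-entrypoint-scrapy
scrapy-pagestorage
```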