Custom Images contract
This is the set of requirements that any custom Docker image must comply with in order to run on Scrapy Cloud.
Scrapy crawler Docker images are already supported via
the scrapinghub-entrypoint-scrapy contract implementation.
If you want to run crawlers built with a framework or language other than Scrapy/Python,
you have to make sure your image follows the contract statements listed below.
In practice, this means implementing your own scripts according to the specification below.
You can find example projects written in other frameworks and programming languages in
the custom-images-examples repository. The shub bootstrap command can be used to clone these projects.
Contract statements
Docker image should be able to run via the start-crawl command without arguments. start-crawl should be executable and located on the search path.
docker run myscrapyimage start-crawl
The crawler will be started by the unprivileged user nobody in a writable directory /scrapinghub. The HOME environment variable will be set to /scrapinghub as well. Beware that this directory is added dynamically when the job starts; if the Docker image already contains this directory, it will be erased.
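For illustration, a minimal start-crawl implemented in Python might look like the sketch below. Only the SHUB_SPIDER and SHUB_JOB_DATA environment variables (described later in this section) come from the contract; the mycrawler module and its run() function are made-up placeholders for your own framework.
#!/usr/bin/env python3
# Hypothetical start-crawl sketch: reads job parameters from the environment
# and launches a made-up crawler implementation.
import json
import os
import sys

def main():
    spider = os.environ["SHUB_SPIDER"]                      # e.g. "test-spider"
    job_data = json.loads(os.environ.get("SHUB_JOB_DATA", "{}"))
    spider_args = job_data.get("spider_args", {})
    from mycrawler import run    # "mycrawler" is an assumption, standing in for your framework
    return run(spider, **spider_args)

if __name__ == "__main__":
    sys.exit(main())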
Docker image should be able to return its metadata via the shub-image-info command without arguments. shub-image-info should be executable and located on the search path. For now only a few fields are supported, and all of them are required:
project_type - a string project type, one of [scrapy, casperjs, other]
spiders - a list of non-empty string spider names
docker run myscrapyimage shub-image-info
{"project_type": "casperjs", "spiders": ["spiderA", "spiderB"]}
Note
shub-image-info is an extension of (and a replacement for) the former list-spiders command: it provides metadata in a structured form, which simplifies non-Scrapy development and allows custom images to be parametrized in a more configurable way.
The command may also handle the optional --debug flag by returning debug information about the image inside an additional debug field: the name/version of the operating system, installed packages, etc. For example, for a Python-based custom image it is a good idea to include the output of a pip freeze call. The data format of the debug field is plain text; it is not structured, to keep things simple.
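As a rough sketch (not required by the contract in this exact form), a Python-based shub-image-info could simply print the metadata JSON and, when given --debug, attach the pip freeze output; the hard-coded spider names below are placeholders.
#!/usr/bin/env python3
# Sketch of a shub-image-info script: prints the image metadata as JSON and,
# when called with --debug, adds a plain-text "debug" field (pip freeze output).
import json
import subprocess
import sys

def main():
    metadata = {
        "project_type": "other",            # one of: scrapy, casperjs, other
        "spiders": ["spiderA", "spiderB"],  # hard-coded for the example
    }
    if "--debug" in sys.argv[1:]:
        freeze = subprocess.run([sys.executable, "-m", "pip", "freeze"],
                                capture_output=True, text=True)
        metadata["debug"] = freeze.stdout
    print(json.dumps(metadata))

if __name__ == "__main__":
    main()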
The crawler should be able to get all needed parameters from system environment variables.
Note
The simplest way to place the scripts on the search path is to create symbolic links to them in a directory that is present in the PATH environment variable. Here's an example Dockerfile:
FROM python:3
RUN mkdir -p /spiders
WORKDIR /spiders
ADD . /spiders
# Create a symbolic link in /usr/sbin because it's present in the PATH
RUN ln -s /spiders/start-crawl /usr/sbin/start-crawl
RUN ln -s /spiders/shub-image-info /usr/sbin/shub-image-info
# Make scripts executable
RUN chmod +x /spiders/start-crawl /spiders/shub-image-info
Environment variables
SHUB_SPIDER
Spider name.
Example:
test-spider
SHUB_JOBKEY
Job key in the format PROJECT_ID/SPIDER_ID/JOB_ID.
Example:
123/45/67
SHUB_JOB_DATA
Job arguments, in JSON format.
Example:
{"key": "1111112/2/2", "project": 1111112, "version": "version1",
"spider": "spider-name", "spider_type": "auto", "tags": ["tagA", "tagB"],
"priority": 2, "scheduled_by": "user", "started_by": "john",
"pending_time": 1460374516193, "running_time": 1460374557448, ... }
Some useful fields:
Field | Description | Example
---|---|---
key | Job key in the format PROJECT_ID/SPIDER_ID/JOB_ID | 1111112/2/2
project | Integer project ID | 1111112
spider | String spider name | spider-name
job_cmd | List of string arguments for the job |
spider_args | Dictionary with spider arguments |
version | String project version used to run the job | version1
deploy_id | Integer project deploy ID used to run the job |
units | Amount of units used by the job |
priority | Job priority value | 2
tags | List of string tags for the job | ["tagA", "tagB"]
state | Current job state name |
pending_time | UNIX timestamp when the job was added, in milliseconds | 1460374516193
running_time | UNIX timestamp when the job was started, in milliseconds | 1460374557448
scheduled_by | Username who scheduled the job | user
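For example, a crawler written in Python could pick up a few of these fields like this (a sketch; which fields are present depends on how the job was scheduled):
# Sketch: reading a few SHUB_JOB_DATA fields from inside the container.
import json
import os

job_data = json.loads(os.environ["SHUB_JOB_DATA"])
job_key = job_data["key"]                       # e.g. "123/45/67"
spider_args = job_data.get("spider_args", {})   # dictionary with spider arguments
tags = job_data.get("tags", [])                 # e.g. ["tagA", "tagB"]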
If you specified custom metadata via the meta field when scheduling the job, that data will also be present in the dictionary.
Warning
SHUB_JOB_DATA
may contain other undocumented fields. They are for the platform's internal use and are not part of the contract, i.e. they can appear or be removed at any time.
SHUB_SETTINGS
Job settings (i.e. organization / project / spider / job settings), in JSON format.
There are several layers of settings, each serving different needs.
The settings may contain the following sections (dict keys):
organization_settings
project_settings
spider_settings
job_settings
enabled_addons
Organization / project / spider / job settings define the corresponding levels of the same settings but with different priorities. Enabled addons define Scrapinghub addon-specific settings and may have an extended structure.
All the settings should replicate the Dash API project /settings/get.json endpoint response (except job_settings, if present):
http -a APIKEY: http://dash.scrapinghub.com/api/settings/get.json project==PROJECTID
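A sketch of collapsing these layers into one settings dict is shown below; note that the precedence used (organization, then project, then spider, then job settings overriding each other in that order) is an assumption rather than part of the documented contract.
# Sketch: merging the SHUB_SETTINGS layers into a single dict. The precedence
# used here (organization < project < spider < job) is an assumption.
import json
import os

shub_settings = json.loads(os.environ.get("SHUB_SETTINGS", "{}"))
merged = {}
for layer in ("organization_settings", "project_settings",
              "spider_settings", "job_settings"):
    merged.update(shub_settings.get(layer, {}))
enabled_addons = shub_settings.get("enabled_addons")  # addon-specific settings, structure may vary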
Note
All environment variables starting with SHUB_ are reserved for Scrapinghub internal use
and shouldn't be used for any other purposes (they will be dropped/replaced on job start).
Scrapy entrypoint
A base support wrapper written in Python that implements the Custom Images contract to run Scrapy-based Python crawlers and scripts on Scrapy Cloud.
The main functions of this wrapper are:
providing the start-crawl entrypoint
providing the shub-image-info entrypoint (starting from version 0.11.0)
translating system environment variables to Scrapy crawl/list commands (sketched below)
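Conceptually, that translation step builds a Scrapy command line from the environment, roughly like the sketch below (this is not the actual scrapinghub-entrypoint-scrapy code):
# Rough sketch of how environment variables map onto a "scrapy crawl" command;
# the real entrypoint does considerably more than this.
import json
import os

job_data = json.loads(os.environ.get("SHUB_JOB_DATA", "{}"))
argv = ["scrapy", "crawl", os.environ["SHUB_SPIDER"]]
for name, value in job_data.get("spider_args", {}).items():
    argv += ["-a", "%s=%s" % (name, value)]
print(" ".join(argv))   # e.g. scrapy crawl test-spider -a arg1=value1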
Beyond that, the wrapper provides a number of other features:
parsing job data from environment
processing job args and settings
running a job with Scrapy
collecting stats
advanced logging & error handling
transparent integration with Scrapinghub storage
custom scripts support
The scrapinghub-entrypoint-scrapy package is available on PyPI and GitHub.
Scrapy addons
If you have Scrapy addons enabled in the Dash UI, you may encounter errors similar to this one:
[sh_scrapy.settings] Addon import error scrapy_pagestorage.PageStorageMiddleware: No module named scrapy_pagestorage
As you are in control of your Docker image content, you should add all missing packages
to the requirements.txt file yourself (including dependencies required by the Scrapy addons),
or disable the corresponding addons in the Dash UI.