Deploying custom Docker images

Note

This feature is currently only available for paying customers.

It’s possible to deploy Docker images with spiders to Scrapy Cloud. To run spiders in custom Docker images, the image must follow the Custom images contract - a set of requirements the image has to comply with to be compatible with Scrapy Cloud.

Deployment

This section describes how to build and deploy a custom Docker image to Scrapy Cloud. For all the following steps it’s assumed that commands are executed at the root directory of your project.

1. Create Dockerfile

The most important thing you need to build and deploy Docker images is a Dockerfile. Please follow the link if you are not familiar with the concept, as understanding it is crucial when using the custom Docker images feature.

If you want to migrate an existing Scrapy project, there’s a tool that may help you - please read this section. In all other cases you’re responsible for writing your own Dockerfile. The resulting Dockerfile should produce a Docker image that follows the Custom images contract - follow the link to find an example Dockerfile.
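For reference, here is a minimal sketch of what such a Dockerfile might look like for a Scrapy project. It is an illustration only, not the canonical example from the contract; the base image, the myproject package name and the file layout are assumptions about your project:

FROM python:2.7
ENV TERM xterm
# "myproject" is a placeholder for your actual Scrapy project package
ENV SCRAPY_SETTINGS_MODULE myproject.settings
RUN mkdir -p /app
WORKDIR /app
# install Python dependencies; requirements.txt is assumed to include
# scrapy and scrapinghub-entrypoint-scrapy, which provides the
# start-crawl entry point that Scrapy Cloud expects
COPY ./requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
# copy and install the project itself (assumes the project has a setup.py)
COPY . /app
RUN python setup.py install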

2. Deploy to Scrapy Cloud

Once you have the Dockerfile, run the shub deploy command to build the Docker image. If there’s no scrapinghub.yml configuration file at the project root, shub will start a wizard that helps you configure the project and saves the configuration file. If you already have scrapinghub.yml at the project root, please ensure that image deploy is configured for the target project. If the target project already exists in the configuration file but image deploy is not configured, you can run shub image build to build the image for the first time, and shub will help you configure the image repository.
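For example, a minimal scrapinghub.yml with image deploy enabled for all targets could look like this (the project id is a placeholder; the build command section below describes the supported formats in more detail):

projects:
  default: 12345
image: true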

The deploy consists of 3 stages, which are described below. Normally shub deploy executes all 3 stages in a single run, but in some cases it might be useful to run them separately, so there are commands bundled under shub image that allow you to execute each stage on its own.
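Running the stages one by one with shub image is equivalent to a single shub deploy run, for example:

$ shub image build    # build the Docker image from your Dockerfile
$ shub image push     # push the built image to the configured registry
$ shub image deploy   # deploy the pushed image to Scrapy Cloud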

Build

During the build stage the Docker image is built from the given Dockerfile. This stage can be started manually with the shub image build command:

$ shub image build
...
The image images.scrapinghub.com/project/XXXXXX:YYYYYY build is completed.

At the end of the build, shub will automatically run a few tests to make sure everything is ready for deployment. You can also run the tests manually after the build:

$ shub image test

Note

If you want to access Docker build logs you can invoke the command in verbose mode:

$ shub image build -v

Push

During the push stage the image is pushed to the repository defined in the scrapinghub.yml file. This stage can be started manually with the shub image push command:

$ shub image push
...
The image images.scrapinghub.com/project/XXXXXX:YYYYYY pushed successfully.

In the example above, the image was pushed to the default Scrapinghub images registry images.scrapinghub.com.

Note

If you want to access Docker push logs you can invoke the command in verbose mode:

$ shub image push -v

Deploy

During the deploy stage the image is deployed to Scrapy Cloud. This stage can be started manually with the shub image deploy command:

$ shub image deploy
...
You can check deploy results later with 'shub image check --id 1'.
Deploy results:
 {'status': 'started'}
 {'project': XXXXXX, 'status': 'ok', 'version': 'YYYYYY', 'spiders': 1}

Now you can schedule your spiders via the web dashboard or shub.
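For example, scheduling one of your spiders with shub looks like this (myspider is a placeholder spider name):

$ shub schedule myspider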

Note

The deploy step for a project might be slow the first time you do it.

Create Docker image for existing Scrapy project

If you have an existing Scrapy project and want to run it using a custom Docker image, you’ll need to create a Dockerfile for it. The shub image init command creates a template Dockerfile, which should be suitable for the majority of Scrapy projects that run on Scrapy Cloud:

$ shub image init

If your project has a requirements.txt file, you can easily add it like this:

$ shub image init --requirements path/to/requirements.txt

Warning

If you have a Scrapy project but don’t want to use the generated Dockerfile, or need to use a different base image, you may want to install the scrapinghub-entrypoint-scrapy Python package inside your image. It is a support layer that passes data from the job to Scrapinghub storage. Otherwise, you will need to send data to Scrapinghub storage using the HTTP API.
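For instance, in a pip-based image it is usually enough to add a line like the following to your Dockerfile (a sketch, assuming pip is available in your base image):

# support layer that forwards job data to Scrapinghub storage
RUN pip install scrapinghub-entrypoint-scrapy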

Commands

Each of the commands we used in the steps above has options that allow you to customize its behavior. For example, the push command allows you to pass your registry credentials via the --username and --password options. This section lists the options available for each command.

build

This command uses the Dockerfile to build the image that’s going to be deployed later.

It reads the target images from the scrapinghub.yml file. You should add an image option to it using one of the following formats:

projects:
  default: 11111
  prod: 22222
# image deploy is enabled for all targets
image: true

Or:

projects:
  default:
    id: 12345
    # image deploy is enabled only for default target
    image: true
  prod: 33333

Options for build

--list-targets

List available targets and exit.

--target <text>

Define the image for release. The <text> parameter must be one of the target names listed by --list-targets.

Default value: default

-V/--version <text>

Tag your image with <text>. You probably won’t need to set this manually, because the tool sets it for you automatically.

If you pass the -V/--version parameter here, you will have to pass the exact same value to any other commands that accept this parameter (push and deploy).

Default value: identifier generated by shub.

-S/--skip-tests

Skip testing the image with shub image test after the build.

-v/--verbose

Increase the tool’s verbosity.

-f/--file

Use this option to pass a custom Dockerfile name (default is ‘PATH/Dockerfile’).

Default value: Dockerfile

Example:

$ shub image build --list-targets
default
private
fallback
$ shub image build --target private --version 1.0.4
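If your Dockerfile has a non-default name, point the build at it with -f (the filename below is a hypothetical example):

$ shub image build -f Dockerfile.scrapy-cloud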

push

This command pushes the image built by the build command to the registry (the default or another one specified with the --target option).

Options for push

--list-targets

List available targets and exit.

--target <text>

Define the image for release. The <text> parameter must be one of the target names listed by --list-targets.

Default value: default

-V/--version <text>

Tag your image with <text>. If you provided a custom version to the build command, make sure to provide the same value here.

Default value: identifier generated by shub.

--username <text>

Set the username to authenticate in the Docker registry.

Note: we don’t store your credentials and you’ll be able to use OAuth2 in the near future.

--password <text>

Set the password to authenticate in the Docker registry.

--email <text>

Set the email to authenticate in the Docker registry (if needed).

--apikey <text>

Use provided apikey to authenticate in the Scrapy Cloud Docker registry.

--insecure

Use the Docker registry in insecure mode.

-v/--verbose

Increase the tool’s verbosity.

Most of these options are related to Docker registry authentication. If you don’t provide them, shub will try to push your image using Docker’s plain HTTP --insecure-registry mode.

Example:

$ shub image push --target private --version 1.0.4 \
--username johndoe --password johndoepwd

This example authenticates the user johndoe against the registry configured for the private target in scrapinghub.yml.

deploy

This command deploys your release image to Scrapy Cloud.

Options for deploy

--list-targets

List available targets and exit.

--target <text>

Target name that defines the project and image to be deployed.

Default value: default

-V/--version <text>

The image version that you want to deploy to Scrapy Cloud. If you provided a custom version to the build and push commands, make sure to provide the same value here.

Default value: identifier generated by shub

--username <text>

Set the username to authenticate in the Docker registry.

Note: we don’t store your credentials and you’ll be able to use OAuth2 in the near future.

--password <text>

Set the password to authenticate in the registry.

--email <text>

Set the email to authenticate in the Docker registry (if needed).

--apikey <text>

Use provided apikey to authenticate in the Scrapy Cloud Docker registry.

--insecure

Use the Docker registry in insecure mode.

--async

Warning

Deploy in asynchronous mode is deprecated.

Make the deploy asynchronous. When enabled, the tool exits as soon as the deploy is started in the background. You can then check the status of your deploy periodically via the check command.

Default value: False

-v/--verbose

Increase the tool’s verbosity.

Example:

$ shub image deploy --target private --version 1.0.4 \
--username johndoe --password johndoepwd

This command will deploy the image from the private target, using user credentials passed as parameters.

upload

It is a shortcut for the build -> push -> deploy chain of commands.

Example:

$ shub image upload private --version 1.0.4 \
--username johndoe --password johndoepwd

Options for upload

The upload command accepts the same parameters as the deploy command, except for --target, which can be passed as an argument.

check

This command checks the status of your deployment and is useful when you do the deploy in asynchronous mode.

Warning

Deploy in asynchronous mode is deprecated.

By default, the check command will return results from the last deploy.

Options for check

--id <number>

The id of the deploy whose status you want to check.

Default value: the id of the latest deploy.

Example:

$ shub image check --id 0

The command above will check the status of the first deploy made (id 0).

test

This command checks if your local setup meets the requirements for a deployment at Scrapy Cloud. You can run it right after the build command to make sure everything is ready to go before you push your image with the push command.

Options for test

--list-targets

List available targets and exit.

--target <text>

Target name that defines an image that is going to be tested.

Default value: default

-V/--version <text>

The image version that you want to test. If you provided a custom version to the deploy, make sure to provide the same value here.

-v/--verbose

Increase the tool’s verbosity.
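Example (reusing the private target and version from the build example above):

$ shub image test --target private --version 1.0.4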

list

This command lists spiders for your project based on the image you built and your project settings in Dash. You can run it right after the build command to make sure that all your spiders are found.

Options for list

--list-targets

List available targets and exit.

--target <text>

Target name that defines the image to list spiders from.

Default value: default

-V/--version <text>

The image version that you want to use to extract the spiders list. If you provided a custom version to the deploy, make sure to provide the same value here.

-s/--silent-mode

Silent mode: suppress errors if no project is found for the given target in scrapinghub.yml.

-v/--verbose

Increase the tool’s verbosity.
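Example (again reusing the private target from the earlier examples):

$ shub image list --target private --version 1.0.4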

init

This command helps to migrate existing Scrapy projects to custom Docker images. It generates a Dockerfile that can be used later by the build or upload commands.

The generated Dockerfile will likely fit your needs. But if it doesn’t, it’s just a matter of editing the file.

Options for init

--project <text>

Define the Scrapy project where the settings are going to be read from.

Default value: the default project from the current folder’s scrapy.cfg.

--base-image <text>

Define which base Docker image your custom image will build upon.

Default value: python:2.7

--requirements <path>

Set path as the Python requirements file for this project.

Default value: requirements.txt in the project directory.

--add-deps <list>

Provide additional system dependencies to install in your image along with the default ones. The <list> parameter should be a comma separated list with no spaces between dependencies.

--list-recommended-reqs

List recommended Python requirements for a Scrapy Cloud project and exit.

Example:

$ shub image init --base-image scrapinghub/base:12.04 \
--requirements other/requirements-dev.txt \
--add-deps phantomjs,tmux

Troubleshooting

Image not found while deploying

If you don’t use the default Scrapinghub repository, make sure the repository you set in the image option of your scrapinghub.yml exists in the registry. Consider this example:

projects:
    default: 555555
image: johndoe/scrapy-crawler

shub will try to deploy the image to http://hub.docker.com/johndoe/scrapy-crawler, since hub.docker.com is the default Docker registry. So, to make it work, you have to log into your account there and create the repository.

Otherwise, you are going to get an error message like this:

Deploy results: {u'status': u'error', u'last_step': u'pulling', u'error': u"DockerCmdFailure(u'Error: image johndoe/scrapy-crawler not found',)"}

Uploading to a private repository

If you are using a private repository to push your images to, make sure to pass your registry credentials to both the push and deploy commands:

$ shub image push --username johndoe --password yourpass
$ shub image deploy --username johndoe --password yourpass

Or pass them to the upload command:

$ shub image upload --username johndoe --password yourpass
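The repository itself is selected through the image option in scrapinghub.yml. A sketch, assuming shub accepts a registry-qualified image name here (registry.example.com is a hypothetical registry host):

projects:
    default: 555555
image: registry.example.com/johndoe/scrapy-crawler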

Container works locally but fails in Scrapy Cloud

Prior to running start-crawl in Scrapy Cloud, some configuration is applied to ensure the process runs in isolation. This can lead to issues that are hard to debug and whose root cause is hard to find. To help with this, the steps below closely mirror what actually runs in Scrapy Cloud.

Run your container in interactive mode with bash (or any other shell that is available). Please replace the 2 occurrences of <SPIDER-NAME> with the name of the spider you want to run:

$ docker run \
-it \
-e SHUB_JOBKEY=123/4/5 \
-e SHUB_JOB_DATA='{
    "_shub_worker": "kumo",
    "api_url": "https://app.zyte.com/api/",
    "auth": "SOME AUTH KEY NOT REQUIRED FOR THIS TEST",
    "deploy_id": 1,
    "key": "123/4/5",
    "pending_time": 1632739881823,
    "priority": 2,
    "project": 123,
    "running_time": 1632739882059,
    "scheduled_by": "some_user",
    "spider": "<SPIDER-NAME>",
    "spider_type": "manual",
    "started_by": "jobrunner",
    "state": "running",
    "tags": [],
    "units": 1,
    "version": "1.0"
}' \
-e SHUB_JOB_ENV='{}' \
-e SHUB_JOB_MEMORY_LIMIT=950 \
-e SHUB_JOB_UID=123 \
-e SHUB_SETTINGS='{
    "deploy_id": 1,
    "enabled_addons": [],
    "job_settings": {},
    "organization_settings": {},
    "project_settings": {},
    "spider_settings": {},
    "status": "ok",
    "version": "1.0"
}' \
-e SHUB_SPIDER=<SPIDER-NAME> \
--net bridge \
--volume=/scrapinghub \
--rm=true \
--name=scrapy-cloud-container \
my-docker-image \
/bin/bash

Connect to the container in a new terminal window and open a named pipe to communicate through sh_scrapy:

$ docker exec -it scrapy-cloud-container /bin/bash
$ mkfifo -m 0600 /dev/scrapinghub
$ chown 65534:65534 /dev/scrapinghub
$ cat /dev/scrapinghub

Go back to the first window and start the crawling process:

$ export SHUB_FIFO_PATH=/dev/scrapinghub
$ start-crawl

Switch back to the second window (the named pipe one) to see the results coming out.