.. _deploy-custom-image:

==============================
Deploying custom Docker images
==============================

.. note::
    This feature is currently only available for paying customers.

It's possible to deploy Docker images with spiders to Scrapy Cloud.

To run spiders in custom `Docker`_ images it's necessary to follow the :ref:`Custom images contract `, a set of requirements that an image must comply with to be compatible with Scrapy Cloud.

.. _Docker: https://docs.docker.com/

Deployment
==========

This section describes how to build and deploy a custom Docker image to Scrapy Cloud.

For all the following steps it's assumed that commands are executed at the root directory of your project.

1. Create Dockerfile
--------------------

The most important thing you need to be able to build and deploy Docker images is a `Dockerfile`_. Please follow the link if you are not familiar with the concept, as understanding it is crucial when using the custom Docker images feature.

If you want to migrate an existing Scrapy project, there's a tool that may help you: please read :ref:`this section `. In all other cases you're responsible for writing your own Dockerfile. The resulting Dockerfile should produce a Docker image that follows the :ref:`Custom images contract `; follow the link to find an example Dockerfile.

.. _Dockerfile: https://docs.docker.com/engine/reference/builder/

2. Deploy to Scrapy Cloud
-------------------------

Once you have the Dockerfile, run the :ref:`shub deploy ` command to build the Docker image. If there's no :ref:`scrapinghub.yml ` configuration file at the project root, shub will start a wizard that will help to configure the project and will save the configuration file. If you already have :ref:`scrapinghub.yml ` at the project root, please ensure that :ref:`image deploy is configured ` for the target project.
If the target project already exists in the configuration file but image deploy is not configured, you can run :ref:`shub image build ` to build the image for the first time and shub will help you to configure the image repository.

The deploy consists of 3 stages which are described below. Normally :ref:`shub deploy ` will execute all 3 stages in a single run, but in some cases it might be useful to run those stages separately, so there are commands bundled under :ref:`shub image ` that allow you to execute each stage on its own.

Build
^^^^^

During the build stage the Docker image is built from the given Dockerfile. This stage can be started manually with the :ref:`shub image build ` command::

    $ shub image build
    ...
    The image images.scrapinghub.com/project/XXXXXX:YYYYYY build is completed.

At the end of the command, shub will automatically run a few tests to make sure everything is alright for deployment. You can run the tests manually after the build::

    $ shub image test

.. note::
    If you want to access Docker build logs you can invoke the command in the verbose mode::

        $ shub image build -v

Push
^^^^

During the push stage the image is pushed to the repository defined in the :ref:`scrapinghub.yml ` file. This stage can be started manually with the :ref:`shub image push ` command::

    $ shub image push
    ...
    The image images.scrapinghub.com/project/XXXXXX:YYYYYY pushed successfully.

In the example above, the image was pushed to the default Scrapinghub images registry ``images.scrapinghub.com``.

.. note::
    If you want to access Docker push logs you can invoke the command in the verbose mode::

        $ shub image push -v

Deploy
^^^^^^

During the deploy stage the image is deployed to Scrapy Cloud. This stage can be started manually with the :ref:`shub image deploy ` command::

    $ shub image deploy
    ...
    You can check deploy results later with 'shub image check --id 1'.
    Deploy results:
    {'status': 'started'}
    {'project': XXXXXX, 'status': 'ok', 'version': 'YYYYYY', 'spiders': 1}

Now you can schedule your spiders via web dashboard or shub.

.. note::
    The deploy step for a project might be slow the first time you run it.

.. _create-image-for-scrapy-project:

Create Docker image for existing Scrapy project
===============================================

If you have an existing Scrapy project and you want to run it using a custom Docker image you'll need to create a `Dockerfile`_ for it. There's a :ref:`shub image init ` command that creates a template Dockerfile, which should be suitable for the majority of the Scrapy projects that run on Scrapy Cloud::

    $ shub image init

If your project has a ``requirements.txt`` file you can easily add it like this::

    $ shub image init --requirements path/to/requirements.txt

.. warning::
    If you have a Scrapy project but don't want to use the generated Dockerfile or need to use a different base image, you may want to install the `scrapinghub-entrypoint-scrapy`_ Python package inside your image. It is a support layer that passes data from the job to Scrapinghub storage. Otherwise you will need to send data to Scrapinghub storage using `HTTP API`__.

.. _scrapinghub-entrypoint-scrapy: https://pypi.python.org/pypi/scrapinghub-entrypoint-scrapy
__ https://doc.scrapinghub.com/scrapy-cloud.html#storage-scrapinghub-com

.. _commands:

Commands
========

Each of the commands we used in the steps above has some options that allow you to customize its behavior. For example, the :ref:`push ` command allows you to pass your registry credentials via the ``--username`` and ``--password`` options. This section lists the options available for each command.

.. _commands-build:

build
-----

This command uses the Dockerfile to build the image that's going to be deployed later.

It reads the target images from the :ref:`scrapinghub.yml ` file. You should add an ``image`` key to it using one of the following formats:

.. code-block:: yaml

    projects:
      default: 11111
      prod: 22222
    # image deploy is enabled for all targets
    image: true

Or:

.. code-block:: yaml

    projects:
      default:
        id: 12345
        # image deploy is enabled only for default target
        image: true
      prod: 33333

Options for build
^^^^^^^^^^^^^^^^^

.. function:: --list-targets

List available targets and exit.

.. function:: --target

Define the image for release. The ```` parameter must be one of the target names listed by ``--list-targets``.

**Default value**: ``default``

.. function:: -V/--version

Tag your image with ````. You'll probably not need to set this manually, because the tool automatically sets this for you. If you pass the ``-V``/``--version`` parameter here, you will have to pass the exact same value to any other commands that accept this parameter (:ref:`push ` and :ref:`deploy `).

**Default value**: identifier generated by shub.

.. function:: -S/--skip-tests

Skip testing the image with ``shub image test`` after the build.

.. function:: -v/--verbose

Increase the tool's verbosity.

.. function:: -f/--file

Use this option to pass a custom Dockerfile name (default is 'PATH/Dockerfile').

**Default value**: ``Dockerfile``

**Example:**

::

    $ shub image build --list-targets
    default
    private
    fallback
    $ shub image build --target private --version 1.0.4

.. _commands-push:

push
----

This command pushes the image built by the ``build`` command to the registry (the ``default`` or another one specified with the ``--target`` option).

Options for push
^^^^^^^^^^^^^^^^

.. function:: --list-targets

List available targets and exit.

.. function:: --target

Define the image for release. The ```` parameter must be one of the target names listed by ``--list-targets``.

**Default value**: ``default``

.. function:: -V/--version

Tag your image with ````. If you provided a custom version to the :ref:`build ` command, make sure to provide the same value here.

**Default value**: identifier generated by shub.

.. function:: --username

Set the username to authenticate in the Docker registry.

**Note**: we don't store your credentials and you'll be able to use OAuth2 in the near future.

.. function:: --password

Set the password to authenticate in the Docker registry.

.. function:: --email

Set the email to authenticate in the Docker registry (if needed).

.. function:: --apikey

Use the provided API key to authenticate in the Scrapy Cloud Docker registry.

.. function:: --insecure

Use the Docker registry in insecure mode.

.. function:: -v/--verbose

Increase the tool's verbosity.

Most of these options are related to Docker registry authentication. If you don't provide them, shub will try to push your image using the plain HTTP ``--insecure-registry`` docker mode.

**Example:**

::

    $ shub image push --target private --version 1.0.4 \
    --username johndoe --password johndoepwd

This example authenticates the user ``johndoe`` to the registry ``your.own.registry:port`` (as defined in the :ref:`build command example `).

.. _commands-deploy:

deploy
------

This command deploys your release image to Scrapy Cloud.

Options for deploy
^^^^^^^^^^^^^^^^^^

.. function:: --list-targets

List available targets and exit.

.. function:: --target

Target name that defines where the image is going to be pushed to.

**Default value**: ``default``

.. function:: -V/--version

The image version that you want to deploy to Scrapy Cloud. If you provided a custom version to the :ref:`build ` and :ref:`push ` commands, make sure to provide the same value here.

**Default value**: identifier generated by shub.

.. function:: --username

Set the username to authenticate in the Docker registry.

**Note**: we don't store your credentials and you'll be able to use OAuth2 in the near future.

.. function:: --password

Set the password to authenticate in the registry.

.. function:: --email

Set the email to authenticate in the Docker registry (if needed).

.. function:: --apikey

Use the provided API key to authenticate in the Scrapy Cloud Docker registry.

.. function:: --insecure

Use the Docker registry in insecure mode.

.. function:: --async

.. warning:: Deploy in asynchronous mode is deprecated.

Make the deploy asynchronous. When enabled, the tool will exit as soon as the deploy is started in the background. You can then check the status of your deploy task periodically via the :ref:`check ` command.

**Default value**: ``False``

.. function:: -v/--verbose

Increase the tool's verbosity.

**Example:**

::

    $ shub image deploy --target private --version 1.0.4 \
    --username johndoe --password johndoepwd

This command will deploy the image from the ``private`` target, using user credentials passed as parameters.

.. _commands-upload:

upload
------

It is a shortcut for the build -> push -> deploy chain of commands.

**Example:**

::

    $ shub image upload private --version 1.0.4 \
    --username johndoe --password johndoepwd

Options for upload
^^^^^^^^^^^^^^^^^^

The ``upload`` command accepts the same parameters as the :ref:`deploy ` command, except for ``--target``, which can be passed as an argument.

.. _commands-check:

check
-----

This command checks the status of your deployment and is useful when you do the deploy in asynchronous mode.

.. warning:: Deploy in asynchronous mode is deprecated.

By default, the ``check`` command will return results from the last deploy.

Options for check
^^^^^^^^^^^^^^^^^

.. function:: --id

The id of the deploy whose status you want to check.

**Default value**: the id of the latest deploy.

**Example:**

::

    $ shub image check --id 0

The command above will check the status of the first deploy made (id 0).

.. _commands-test:

test
----

This command checks if your local setup meets the requirements for a deployment at Scrapy Cloud. You can run it right after the :ref:`build command ` to make sure everything is ready to go before you push your image with the :ref:`push command `.

Options for test
^^^^^^^^^^^^^^^^

.. function:: --list-targets

List available targets and exit.

.. function:: --target

Target name that defines the image that is going to be tested.

**Default value**: ``default``

.. function:: -V/--version

The image version that you want to test. If you provided a custom version to the :ref:`deploy `, make sure to provide the same value here.

.. function:: -v/--verbose

Increase the tool's verbosity.

list
----

This command lists spiders for your project based on the image you built and your project settings in Dash. You can run it right after the :ref:`build command ` to make sure that all your spiders are found.

Options for list
^^^^^^^^^^^^^^^^

.. function:: --list-targets

List available targets and exit.

.. function:: --target

Target name that defines the image to get the spiders list from.

**Default value**: ``default``

.. function:: -V/--version

The image version that you want to use to extract the spiders list. If you provided a custom version to the :ref:`deploy `, make sure to provide the same value here.

.. function:: -s/--silent-mode

Silent mode: suppress errors in case the project isn't found for a given target in :ref:`scrapinghub.yml `.

.. function:: -v/--verbose

Increase the tool's verbosity.

.. _commands-init:

init
----

This command helps to migrate existing Scrapy projects to custom Docker images. It generates a ``Dockerfile`` that can be used later by the :ref:`build ` or :ref:`upload ` commands.

The generated Dockerfile will likely fit your needs. But if it doesn't, it's just a matter of editing the file.

Options for init
^^^^^^^^^^^^^^^^

.. function:: --project

Define the Scrapy project where the settings are going to be read from.

**Default value**: ``default`` from current folder's ``scrapy.cfg``.

.. function:: --base-image

Define which `base Docker image `_ your custom image will build upon.

**Default value**: ``python:2.7``

.. function:: --requirements

Set ``path`` as the Python requirements file for this project.
**Default value**: project directory ``requirements.txt``

.. function:: --add-deps

Provide additional system dependencies to install in your image along with the default ones. The ```` parameter should be a comma-separated list with no spaces between dependencies.

.. function:: --list-recommended-reqs

List recommended Python requirements for a Scrapy Cloud project and exit.

**Example:**

::

    $ shub image init --base-image scrapinghub/base:12.04 \
    --requirements other/requirements-dev.txt \
    --add-deps phantomjs,tmux

Troubleshooting
===============

Image not found while deploying
-------------------------------

If you don't use the default Scrapinghub repository, make sure the repository you set in the ``image`` setting of your :ref:`scrapinghub.yml ` exists in the registry. Consider this example:

.. code-block:: yaml

    projects:
      default: 555555
    image: johndoe/scrapy-crawler

shub will try to deploy the image to http://hub.docker.com/johndoe/scrapy-crawler, since `hub.docker.com `_ is the default Docker registry. So, to make it work, you have to log into your account there and create the repository. Otherwise, you are going to get an error message like this::

    Deploy results: {u'status': u'error', u'last_step': u'pulling', u'error': u"DockerCmdFailure(u'Error: image johndoe/scrapy-crawler not found',)"}

Uploading to a private repository
---------------------------------

If you are using a private repository to push your images to, make sure to pass your registry credentials to both the :ref:`push ` and :ref:`deploy ` commands::

    $ shub image push --username johndoe --password yourpass
    $ shub image deploy --username johndoe --password yourpass

Or pass them to the :ref:`upload ` command::

    $ shub image upload --username johndoe --password yourpass

Container works locally but fails in Scrapy Cloud
-------------------------------------------------

Prior to running ``start-crawl`` in Scrapy Cloud, some configurations are set to ensure we can run an isolated process.
This can lead to issues whose root cause is quite hard to find and debug. To aid in this process, below you will find some steps that are quite similar to what actually runs in Scrapy Cloud.

Run your container in interactive mode with ``bash`` (or any other shell that is available). Please replace the 2 occurrences of ```` with the actual spider that is to run::

    $ docker run \
        -it \
        -e SHUB_JOBKEY=123/4/5 \
        -e SHUB_JOB_DATA='{
            "_shub_worker": "kumo",
            "api_url": "https://app.zyte.com/api/",
            "auth": "SOME AUTH KEY NOT REQUIRED FOR THIS TEST",
            "deploy_id": 1,
            "key": "123/4/5",
            "pending_time": 1632739881823,
            "priority": 2,
            "project": 123,
            "running_time": 1632739882059,
            "scheduled_by": "some_user",
            "spider": "",
            "spider_type": "manual",
            "started_by": "jobrunner",
            "state": "running",
            "tags": [],
            "units": 1,
            "version": "1.0"
        }' \
        -e SHUB_JOB_ENV='{}' \
        -e SHUB_JOB_MEMORY_LIMIT=950 \
        -e SHUB_JOB_UID=123 \
        -e SHUB_SETTINGS='{
            "deploy_id": 1,
            "enabled_addons": [],
            "job_settings": {},
            "organization_settings": {},
            "project_settings": {},
            "spider_settings": {},
            "status": "ok",
            "version": "1.0"
        }' \
        -e SHUB_SPIDER= \
        --net bridge \
        --volume=/scrapinghub \
        --rm=true \
        --name=scrapy-cloud-container \
        my-docker-image \
        /bin/bash

Connect to the container in a new terminal window and open a named pipe to communicate through ``sh_scrapy``::

    $ docker exec -it scrapy-cloud-container /bin/bash
    $ mkfifo -m 0600 /dev/scrapinghub
    $ chown 65534:65534 /dev/scrapinghub
    $ cat /dev/scrapinghub

Go back to the first window and start the crawling process::

    $ export SHUB_FIFO_PATH=/dev/scrapinghub
    $ start-crawl

Switch back to the second window (the named pipe one) to see the results coming out.
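Because the ``docker run`` invocation above needs the spider name in two places (the ``spider`` field of ``SHUB_JOB_DATA`` and the ``SHUB_SPIDER`` variable), it can be convenient to generate the command instead of editing it by hand. The following is only a minimal sketch: the helper name ``build_docker_run_args``, the spider name ``myspider`` and the default image name are illustrative assumptions, not part of shub or Scrapy Cloud, and the values simply mirror the example above.

```python
import json
import shlex


def build_docker_run_args(spider, image="my-docker-image"):
    """Hypothetical helper: rebuild the `docker run` arguments shown above."""
    job_data = {
        "_shub_worker": "kumo",
        "api_url": "https://app.zyte.com/api/",
        "auth": "SOME AUTH KEY NOT REQUIRED FOR THIS TEST",
        "deploy_id": 1,
        "key": "123/4/5",
        "pending_time": 1632739881823,
        "priority": 2,
        "project": 123,
        "running_time": 1632739882059,
        "scheduled_by": "some_user",
        "spider": spider,  # first occurrence of the spider name
        "spider_type": "manual",
        "started_by": "jobrunner",
        "state": "running",
        "tags": [],
        "units": 1,
        "version": "1.0",
    }
    settings = {
        "deploy_id": 1,
        "enabled_addons": [],
        "job_settings": {},
        "organization_settings": {},
        "project_settings": {},
        "spider_settings": {},
        "status": "ok",
        "version": "1.0",
    }
    env = {
        "SHUB_JOBKEY": "123/4/5",
        "SHUB_JOB_DATA": json.dumps(job_data),
        "SHUB_JOB_ENV": "{}",
        "SHUB_JOB_MEMORY_LIMIT": "950",
        "SHUB_JOB_UID": "123",
        "SHUB_SETTINGS": json.dumps(settings),
        "SHUB_SPIDER": spider,  # second occurrence of the spider name
    }
    args = ["docker", "run", "-it"]
    for name, value in env.items():
        args += ["-e", f"{name}={value}"]
    args += [
        "--net", "bridge",
        "--volume=/scrapinghub",
        "--rm=true",
        "--name=scrapy-cloud-container",
        image,
        "/bin/bash",
    ]
    return args


if __name__ == "__main__":
    # Print a shell-quoted command line that can be pasted into a terminal.
    print(shlex.join(build_docker_run_args("myspider")))
```

Running the script prints a single, shell-quoted command line. Passing the list to ``subprocess.run`` instead would avoid shell quoting altogether, at the cost of having to keep the interactive terminal attached.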