Scrapy crawler in Cron job

For anyone who installed scrapy with pip3 (or similar), here is a simple one-line crontab entry that calls scrapy by its full path:

*/10 * * * * cd ~/project/path && ~/.local/bin/scrapy crawl something >> ~/crawl.log 2>&1

Replace:

*/10 * * * * with your cron pattern

~/project/path with the path to your scrapy project (where your scrapy.cfg is)

something with the spider name (use scrapy list in your project to find out)

~/crawl.log with the path to your log file (if you want logging)
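
If you are not sure where scrapy was installed, something like the following may help. It is only a sketch: cron_line is a hypothetical helper of mine (not part of scrapy), and the fallback to ~/.local/bin/scrapy assumes a pip3 --user style install like the one above. It prints a crontab entry that you can paste in after checking it.

```shell
#!/bin/sh
# Hypothetical helper (not part of scrapy): print a crontab entry for one spider.
# Arguments: project directory, spider name, log file path.
cron_line() {
    # command -v prints the full path of scrapy if it is on the current PATH
    scrapy_bin="$(command -v scrapy 2>/dev/null || true)"
    # Assumption: fall back to the location used in the entry above
    [ -n "$scrapy_bin" ] || scrapy_bin="$HOME/.local/bin/scrapy"
    printf '*/10 * * * * cd %s && %s crawl %s >> %s 2>&1\n' \
        "$1" "$scrapy_bin" "$2" "$3"
}

# Example call with the placeholder values from the entry above
cron_line "$HOME/project/path" something "$HOME/crawl.log"
```

Using the absolute path means the entry does not depend on whatever PATH cron happens to provide.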


I solved this problem by setting PATH inside the bash script that cron runs:

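#!/bin/bash

# cron starts with a minimal PATH, so add the directory that contains scrapy
export PATH="$PATH:/usr/local/bin"

# run from the project directory (where scrapy.cfg is); stop if it is missing
cd /myfolder/crawlers/ || exit 1
scrapy crawl my_spider_name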

Adding the following lines with crontab -e runs my scrapy crawl at 5 AM every day. This is a slightly modified version of crocs' answer. Note the 0 in the minute field: with * there, the job would start every minute from 5:00 to 5:59.

PATH=/usr/bin
0 5 * * * cd project_folder/project_name/ && scrapy crawl spider_name

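Without setting PATH, cron gave me the error "command not found: scrapy". I assume this is because the scrapy executable lives in /usr/bin on my Ubuntu machine and the PATH cron was using did not include that directory. Running which scrapy in a normal shell shows which directory you need.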

Note that the complete path to my scrapy project is /home/user/project_folder/project_name. I ran the env command from cron and noticed that jobs start in /home/user, so I left /home/user out of the path in my crontab above. Use an absolute path if you would rather not rely on that.

The cron log can be helpful while debugging. On Ubuntu, cron entries go to the syslog:

grep CRON /var/log/syslog
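
Another way to debug is to run the command by hand in an environment that looks roughly like cron's. This is a rough sketch: run_like_cron is a name I made up, and the PATH value /usr/bin:/bin is an assumption about cron's default, so check what env prints from a real cron job on your system.

```shell
#!/bin/sh
# Hypothetical helper: run a command string with a stripped-down environment.
# env -i starts from an empty environment; only the variables listed are set.
run_like_cron() {
    # Assumption: /usr/bin:/bin approximates cron's default PATH
    env -i HOME="$HOME" SHELL=/bin/sh PATH=/usr/bin:/bin /bin/sh -c "$1"
}

# Check whether scrapy can be found with that minimal PATH
run_like_cron 'command -v scrapy || echo "scrapy not found on cron-like PATH"'
```

If scrapy is not found here, it will most likely fail under cron too, and you need either a PATH line in the crontab or the full path to scrapy.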

Another option is to skip the shell script and chain the two commands together directly in the cron job. Just make sure the PATH variable is set before the first scrapy job in the crontab. Run:

    crontab -e 

to edit it and have a look. I have several scrapy crawlers that run at various times: some every 5 minutes, others twice a day.

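    PATH=/usr/local/bin
    */5 * * * * cd /myfolder/crawlers/ && scrapy crawl my_spider_name_1
    0 1,13 * * * cd /myfolder/crawlers/ && scrapy crawl my_spider_name_2

All jobs listed after the PATH line will find scrapy. Here the first one runs every 5 minutes and the second twice a day, at 1 AM and 1 PM (the minute field must be 0, otherwise the job starts every minute of those two hours). A crontab edited with crontab -e has no user column; that extra field is only used in /etc/crontab and /etc/cron.d files. I found this easier to manage. Keep in mind that a PATH line in a crontab replaces cron's default PATH rather than extending it, so if your jobs run other binaries you may need to add their locations as well, for example PATH=/usr/local/bin:/usr/bin:/bin.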