Why are my input/output processors in Scrapy not working?

The documentation says: "However, there is one more place where you can specify the input and output processors to use: in the Item Field metadata."

I suspect the documentation is misleading or out of date. According to the source code, the input_processor field attribute is only ever read from inside an ItemLoader instance, so the processors declared in the Field metadata have no effect unless you populate the item through an Item Loader.

You can use a built-in one and leave your DmozItem definition as is:

import scrapy
from scrapy.loader import ItemLoader

# DmozItem is your existing item class, defined or imported elsewhere

class DmozSpider(scrapy.Spider):
    # ...

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = ItemLoader(DmozItem(), selector=sel)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()

This way the loader picks up the input_processor and output_processor arguments declared on the Item Fields and applies them when the item is loaded.
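
To make the mechanism concrete, here is a toy sketch in plain Python. It is not Scrapy's real implementation: ToyLoader, FIELDS and strip_and_upper are made-up names, and the only thing borrowed from Scrapy is the fact that a Field is just a dict holding metadata. It illustrates why the metadata does nothing on its own and only matters once a loader looks it up.

```python
# Toy stand-in for scrapy.Field / ItemLoader, NOT the real implementation.
# It only shows that field metadata is inert until a loader reads it.

class Field(dict):
    """Like scrapy.Field: a plain dict that just stores metadata."""


def strip_and_upper(values):
    # Hypothetical input processor used for this illustration.
    return [v.strip().upper() for v in values]


# Hypothetical item schema: field name -> metadata
FIELDS = {"desc": Field(input_processor=strip_and_upper)}


class ToyLoader:
    def __init__(self, fields):
        self.fields = fields
        self.values = {}

    def add_value(self, name, values):
        # The loader, not the item, looks the processor up in the metadata.
        processor = self.fields[name].get("input_processor", lambda vs: vs)
        self.values.setdefault(name, []).extend(processor(values))

    def load_item(self):
        return dict(self.values)


# Assigning the value directly: the metadata is never consulted.
plain_item = {"desc": ["  hello  "]}

# Going through the loader: the processor runs.
loader = ToyLoader(FIELDS)
loader.add_value("desc", ["  hello  "])
loaded_item = loader.load_item()
```

In this sketch plain_item keeps the raw, padded string, while loaded_item holds the processed value, which mirrors what happens when you fill a Scrapy item by hand versus through an ItemLoader.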


Or you can define the processors inside a custom Item Loader instead of the Item class:

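import scrapy
from scrapy.loader import ItemLoader
# On Scrapy older than 2.2, import these from scrapy.loader.processors instead
from itemloaders.processors import Join, MapCompose


class DmozItem(scrapy.Item):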
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()


class MyItemLoader(ItemLoader):
    desc_in = MapCompose(
        lambda x: ' '.join(x.split()),
        lambda x: x.upper()
    )

    desc_out = Join()

And use it to load items in your spider:

# Method of DmozSpider; the rest of the class is omitted here
def parse(self, response):
    for sel in response.xpath('//ul/li'):
        loader = MyItemLoader(DmozItem(), selector=sel)
        loader.add_xpath('title', 'a/text()')
        loader.add_xpath('link', 'a/@href')
        loader.add_xpath('desc', 'text()')
        yield loader.load_item()
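
If it is unclear what desc_in and desc_out from MyItemLoader actually do to the extracted values, the following plain-Python approximation may help. It does not use Scrapy at all: the raw list is made-up sample data, and the two functions only imitate MapCompose (apply each function, in order, to every value) and Join() with its default single-space separator.

```python
# Plain-Python approximation of desc_in / desc_out from MyItemLoader.
# The sample input below is invented for illustration.
raw = ["  Some \n  description ", "more   text"]


def desc_in(values):
    # MapCompose applies each function, in order, to every extracted value.
    out = []
    for v in values:
        v = " ".join(v.split())  # collapse runs of whitespace
        v = v.upper()            # then uppercase
        out.append(v)
    return out


def desc_out(values):
    # Join() with its default separator joins the values with one space.
    return " ".join(values)


collected = desc_in(raw)
result = desc_out(collected)
```

So the input processor cleans each text node separately as it is added, and the output processor merges the cleaned nodes into a single string when load_item() is called.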