How can I lazily read multiple JSON values from a file/stream in Python?

JSON generally isn't very good for this sort of incremental use; there's no standard way to serialise multiple objects so that they can easily be loaded one at a time, without parsing the whole lot.

The object-per-line solution you're using is seen elsewhere too. Scrapy calls it 'JSON lines':

  • https://docs.scrapy.org/en/latest/topics/exporters.html?highlight=exporters#jsonitemexporter
  • http://www.enricozini.org/2011/tips/python-stream-json/

You can do it slightly more Pythonically:

for jsonline in f:
    yield json.loads(jsonline)   # or do the processing in this loop

I think this is about the best way: it doesn't rely on any third-party libraries, and it's easy to understand what's going on. I've used it in some of my own code as well.
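
For completeness, here's the same idea as a self-contained generator (the filename and function name are just examples):

import json

def iter_json_lines(path):
    # One JSON value per line; parsed lazily as the file is read.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:                        # skip blank lines
                yield json.loads(line)

for obj in iter_json_lines("data.jsonl"):   # example filename
    print(obj)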


Sure, you can do this. You just have to use raw_decode directly. This implementation loads the whole file into memory and operates on that string (much as json.load does); if you have large files, you can modify it to read from the file only as necessary without much difficulty.

import json
from json.decoder import WHITESPACE   # undocumented internal, but present in CPython

def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
    # Accept either a file-like object or a string.
    if hasattr(string_or_fp, 'read'):
        string = string_or_fp.read()
    else:
        string = str(string_or_fp)

    decoder = cls(**kwargs)
    idx = WHITESPACE.match(string, 0).end()
    while idx < len(string):
        obj, end = decoder.raw_decode(string, idx)   # parse one value, get its end index
        yield obj
        idx = WHITESPACE.match(string, end).end()    # skip whitespace to the next value

Usage: just as you requested, it's a generator.
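
For example (the input string is just an illustration):

for obj in iterload('{"a": 1} {"b": 2} [1, 2, 3]'):
    print(obj)
# {'a': 1}
# {'b': 2}
# [1, 2, 3]

And here is a rough sketch of the "read only as necessary" variant mentioned above: keep a string buffer and pull in more data whenever raw_decode can't finish a value. It assumes a text-mode file object; the chunk size and function name are my own choices, and it doesn't guard against a value (e.g. a bare number) that happens to end exactly at a chunk boundary:

import json

def iterload_chunked(fp, chunksize=65536, cls=json.JSONDecoder, **kwargs):
    decoder = cls(**kwargs)
    buf = ''
    while True:
        chunk = fp.read(chunksize)
        buf += chunk
        while True:
            buf = buf.lstrip()           # raw_decode doesn't skip leading whitespace
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if not chunk:            # EOF and still can't parse: genuinely malformed
                    raise
                break                    # probably a partial value; read another chunk
            yield obj
            buf = buf[end:]
        if not chunk:                    # EOF and buffer fully consumed
            return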


A little late maybe, but I had this exact problem (well, more or less). My standard solution for these problems is usually to just do a regex split on some well-known root object, but in my case it was impossible. The only feasible way to do this generically is to implement a proper tokenizer.

After not finding a generic-enough and reasonably well-performing solution, I ended up doing it myself and wrote the splitstream module. It is a pre-tokenizer that understands JSON and XML and splits a continuous stream into multiple chunks for parsing (it leaves the actual parsing up to you, though). To get some kind of performance out of it, it is written as a C module.

Example:

import sys
import json

from splitstream import splitfile

for jsonstr in splitfile(sys.stdin, format="json"):
    yield json.loads(jsonstr)   # inside a generator, or do the processing here