How to parse huge JSON file as stream in Json.NET?

I think we can do better than the accepted answer, using more features of JsonReader to make a more generalized solution.

As a JsonReader consumes tokens from a JSON document, the path of the current token is recorded in the JsonReader.Path property.
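To see what Path reports, it can help to dump the token stream for a small input. This is only a sketch, assuming the Newtonsoft.Json package is referenced; the PathDemo class, the TracePaths helper and the sample JSON are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public static class PathDemo
{
    // Hypothetical helper: returns one "TokenType Path" line per token read.
    public static List<string> TracePaths(string json)
    {
        var lines = new List<string>();
        using (var reader = new JsonTextReader(new StringReader(json)))
        {
            while (reader.Read())
            {
                lines.Add($"{reader.TokenType} {reader.Path}");
            }
        }
        return lines;
    }

    public static void Main()
    {
        // Made-up sample input: an array of objects with a "value" property.
        const string json = @"[{""value"":""a""},{""value"":""b""}]";
        foreach (string line in TracePaths(json))
        {
            Console.WriteLine(line);
        }
    }
}
```

A property name and the value that follows it report the same path (here [0].value), so a filter based on Path alone would see that path twice and also has to look at TokenType.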

We can use this to precisely select deeply nested data from a JSON file, using regex to ensure that we're on the right path.

Start with the following extension method:

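using System.Collections.Generic;
using System.Text.RegularExpressions;
using Newtonsoft.Json;

public static class JsonReaderExtensions
{
    // Lazily deserializes every value whose JsonReader.Path matches the regex.
    public static IEnumerable<T> SelectTokensWithRegex<T>(
        this JsonReader jsonReader, Regex regex)
    {
        JsonSerializer serializer = new JsonSerializer();
        while (jsonReader.Read())
        {
            // A property name shares its path with the value that follows it,
            // so skip the name and deserialize only the value token.
            if (regex.IsMatch(jsonReader.Path)
                && jsonReader.TokenType != JsonToken.PropertyName)
            {
                yield return serializer.Deserialize<T>(jsonReader);
            }
        }
    }
}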

The data you are concerned with lies on paths:

[0]
[1]
[2]
... etc

We can construct the following regex to precisely match this path:

var regex = new Regex(@"^\[\d+\]$");

It now becomes possible to stream objects out of your data (without fully loading or parsing the entire JSON) as follows. The returned sequence is lazy, so enumerate it while the reader is still open:

IEnumerable<MyObject> objects = jsonReader.SelectTokensWithRegex<MyObject>(regex);

Or, if we want to dig even deeper into the structure, we can be even more precise with our regex:

var regex = new Regex(@"^\[\d+\]\.value$");
IEnumerable<string> objects = jsonReader.SelectTokensWithRegex<string>(regex);

This extracts only the value property of each item in the array.
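The anchors (^ and $) are what make these patterns precise. The following sketch checks both regexes against a handful of candidate paths; the PathRegexDemo class and the sample paths are invented for illustration, and only System.Text.RegularExpressions is used.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PathRegexDemo
{
    // The two patterns from the answer.
    public static readonly Regex Items = new Regex(@"^\[\d+\]$");
    public static readonly Regex Values = new Regex(@"^\[\d+\]\.value$");

    public static void Main()
    {
        // Made-up paths of the kind JsonReader.Path can report.
        string[] paths =
        {
            "", "[0]", "[12]", "[0].value", "[0].values", "[0].value.inner"
        };

        foreach (string path in paths)
        {
            Console.WriteLine(
                $"'{path}': item={Items.IsMatch(path)}, value={Values.IsMatch(path)}");
        }
    }
}
```

Only "[0]" and "[12]" match the first pattern, and only "[0].value" matches the second. Without the trailing $, the second pattern would also accept "[0].values" and "[0].value.inner".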

I've found this technique extremely useful for extracting specific data from huge (100 GiB) JSON dumps, directly from HTTP using a network stream (with low memory requirements and no intermediate storage required).
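For the HTTP case, one possible wiring looks like the sketch below. It is not a definitive implementation: the URL is a placeholder, the MyObject shape and the ReadItems helper are assumptions made for illustration, and the extension method is repeated only so that the example compiles on its own. It assumes the Newtonsoft.Json package and a compiler that supports async Main (C# 7.1 or later).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using Newtonsoft.Json;

// Hypothetical item shape; replace with your own type.
public class MyObject
{
    [JsonProperty("value")]
    public string Value { get; set; }
}

public static class JsonReaderExtensions
{
    // Same method as in the answer, repeated so the sketch is self-contained.
    public static IEnumerable<T> SelectTokensWithRegex<T>(
        this JsonReader jsonReader, Regex regex)
    {
        JsonSerializer serializer = new JsonSerializer();
        while (jsonReader.Read())
        {
            if (regex.IsMatch(jsonReader.Path)
                && jsonReader.TokenType != JsonToken.PropertyName)
            {
                yield return serializer.Deserialize<T>(jsonReader);
            }
        }
    }
}

public static class StreamingDemo
{
    // Hypothetical helper: lazily yields matching items from any stream.
    // The readers stay open until the caller finishes enumerating.
    public static IEnumerable<T> ReadItems<T>(Stream stream, Regex regex)
    {
        using (var streamReader = new StreamReader(stream))
        using (var jsonReader = new JsonTextReader(streamReader))
        {
            foreach (T item in jsonReader.SelectTokensWithRegex<T>(regex))
            {
                yield return item;
            }
        }
    }

    public static async Task Main()
    {
        const string url = "https://example.com/dump.json"; // placeholder URL
        var regex = new Regex(@"^\[\d+\]$");

        using (var http = new HttpClient())
        using (Stream body = await http.GetStreamAsync(url))
        {
            // Items are deserialized one at a time as the response arrives.
            foreach (MyObject item in ReadItems<MyObject>(body, regex))
            {
                Console.WriteLine(item.Value);
            }
        }
    }
}
```

Because ReadItems accepts any Stream, the same helper works with a FileStream or a MemoryStream, which makes it easy to try out on a small local file first.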


This should resolve your problem. Basically, it works just like your initial code, except that it only deserializes an object when the reader hits the { character in the stream; otherwise it skips ahead until it finds another start-object token.

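using System.IO;
using Newtonsoft.Json;

JsonSerializer serializer = new JsonSerializer();
// File.OpenRead opens the file read-only, which is all we need here.
using (FileStream s = File.OpenRead("bigfile.json"))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        // Deserialize only when the reader is at a "{" (start of an object).
        // Deserialize consumes the whole object, nested objects included.
        if (reader.TokenType == JsonToken.StartObject)
        {
            MyObject o = serializer.Deserialize<MyObject>(reader);
            // Process o here, one object at a time.
        }
    }
}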

Tags: C#, Json, Json.Net