JAVA - Best approach to parse huge (extra large) JSON file

I will suggest to have a look at Jackson Api it is very easy to combine the streaming and tree-model parsing options: you can move through the file as a whole in a streaming way, and then read individual objects into a tree structure.

As an example, let's take the following input:

{ 
  "records": [ 
    {"field1": "aaaaa", "bbbb": "ccccc"}, 
    {"field2": "aaa", "bbb": "ccc"} 
  ] ,
  "special message": "hello, world!" 
}

Just imagine the fields being sparse or the records having a more complex structure.

The following snippet illustrates how this file can be read using a combination of stream and tree-model parsing. Each individual record is read in a tree structure, but the file is never read in its entirety into memory, making it possible to process JSON files gigabytes in size while using minimal memory.

import org.codehaus.jackson.map.*;
import org.codehaus.jackson.*;

import java.io.File;

public class ParseJsonSample {
    public static void main(String[] args) throws Exception {
        JsonFactory f = new MappingJsonFactory();
        JsonParser jp = f.createJsonParser(new File(args[0]));
        JsonToken current;
        current = jp.nextToken();
        if (current != JsonToken.START_OBJECT) {
            System.out.println("Error: root should be object: quiting.");
            return;
        }
        while (jp.nextToken() != JsonToken.END_OBJECT) {
            String fieldName = jp.getCurrentName();
            // move from field name to field value
            current = jp.nextToken();
            if (fieldName.equals("records")) {
                if (current == JsonToken.START_ARRAY) {
                    // For each of the records in the array
                    while (jp.nextToken() != JsonToken.END_ARRAY) {
                        // read the record into a tree model,
                        // this moves the parsing position to the end of it
                        JsonNode node = jp.readValueAsTree();
                        // And now we have random access to everything in the object
                        System.out.println("field1: " + node.get("field1").getValueAsText());
                        System.out.println("field2: " + node.get("field2").getValueAsText());
                    }
                } else {
                    System.out.println("Error: records should be an array: skipping.");
                    jp.skipChildren();
                }
            } else {
                System.out.println("Unprocessed property: " + fieldName);
                jp.skipChildren();
            }
        }
    }
}

As you can guess, the nextToken() call each time gives the next parsing event: start object, start field, start array, start object, ..., end object, ..., end array, ...

The jp.readValueAsTree() call allows to read what is at the current parsing position, a JSON object or array, into Jackson's generic JSON tree model. Once you have this, you can access the data randomly, regardless of the order in which things appear in the file (in the example field1 and field2 are not always in the same order). Jackson supports mapping onto your own Java objects too. The jp.skipChildren() is convenient: it allows to skip over a complete object tree or an array without having to run yourself over all the events contained in it.


You don't need to switch to Jackson. Gson 2.1 introduced a new TypeAdapter interface that permits mixed tree and streaming serialization and deserialization.

The API is efficient and flexible. See Gson's Streaming doc for an example of combining tree and binding modes. This is strictly better than mixed streaming and tree modes; with binding you don't waste memory building an intermediate representation of your values.

Like Jackson, Gson has APIs to recursively skip an unwanted value; Gson calls this skipValue().


Declarative Stream Mapping (DSM) library allows you to define mappings between your JSON or XML data and your POJO. So you don't need to write a custom parser. İt has powerful scripting(Javascript, groovy, JEXL) support. You can filter and transform data while you are reading. You can call functions for partial data operation while you are reading data. DSM read data as a Stream so it uses very low memory.

For example,

{
    "company": {
         ....
        "staff": [
            {
                "firstname": "yong",
                "lastname": "mook kim",
                "nickname": "mkyong",
                "salary": "100000"
            },
            {
                "firstname": "low",
                "lastname": "yin fong",
                "nickname": "fong fong",
                "salary": "200000"
            }
        ]
    }
}

imagine the above snippet is a part of huge and complex JSON data. we only want to get stuff that has a salary higher than 10000.

First of all, we must define mapping definitions as follows. As you see, it is just a yaml file that contains the mapping between POJO fields and field of JSON data.

result:
      type: object     # result is map or a object.
      path: /.+staff  # path is regex. its match with /company/staff
      function: processStuff  # call processStuff function when /company/stuff tag is closed
      filter: self.data.salary>10000   # any expression is valid in JavaScript, Groovy or JEXL
      fields:
        name:  
          path: firstname
        sureName:
          path: lastname
        userName:
          path: nickname
        salary: long

Create FunctionExecutor for process staff.

FunctionExecutor processStuff=new FunctionExecutor(){

            @Override
            public void execute(Params params) {

                // directly serialize Stuff class
                //Stuff stuff=params.getCurrentNode().toObject(Stuff.class);

                Map<String,Object> stuff= (Map<String,Object>)params.getCurrentNode().toObject();
                System.out.println(stuff);
                // process stuff ; save to db. call service etc.
            }
        };

Use DSM to process JSON

     DSMBuilder builder = new DSMBuilder(new File("path/to/mapping.yaml")).setType(DSMBuilder.TYPE.XML);

       // register processStuff Function
        builder.registerFunction("processStuff",processStuff);

        DSM dsm= builder.create();
        Object object =  dsm.toObject(xmlContent);

Output: (Only stuff that has a salary higher than 10000 is included)

{firstName=low, lastName=yin fong, nickName=fong fong, salary=200000}

Tags:

Java

Json

Gson