Python conversion from JSON to JSONL

A simple way to do this is with the jq command in your terminal.

To install jq on Debian and derivatives:

$ sudo apt-get install jq

CentOS/RHEL users should run:

$ sudo yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
$ sudo yum install jq -y

Basic usage:

$ jq -c '.[]' some_json.json >> output.jsonl

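If you need to handle huge files, I strongly recommend the --stream flag, which makes jq parse your JSON in streaming mode. In that mode jq emits [path, leaf] events instead of whole values, so a plain '.[]' filter no longer returns the array elements; they have to be rebuilt with fromstream:

$ jq -cn --stream 'fromstream(1|truncate_stream(inputs))' some_json.json >> output.jsonl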

But if you need to do this from a Python script, you can use bigjson, a useful library that parses JSON in streaming mode:

$ pip3 install bigjson

To read a huge JSON file (in my case, it was 40 GB):

import json

import bigjson

# Read the JSON file in streaming mode
with open('input_file.json', 'rb') as f:
    json_data = bigjson.load(f)

    # Open the output file
    with open('output_file.jsonl', 'w') as outfile:
        # Iterate over the elements of the top-level array
        for data in json_data:
            # Convert the lazily parsed element to plain Python objects
            dict_data = data.to_python()

            # Write one JSON document per line
            outfile.write(json.dumps(dict_data) + "\n")

If you want, try parallelizing this code to improve performance, and post the result here :)

Documentation and source code: https://github.com/henu/bigjson
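
If installing a third-party package is not an option, something similar can be sketched with only the standard library. This is not how bigjson works internally; it is a rough alternative that reads the file in chunks and uses json.JSONDecoder.raw_decode to pull one top-level array element at a time. The function name iter_array_items and the chunk_size parameter are made up for this example, and it assumes the top level is an array, that the opening bracket appears within the first chunk, and that each single element fits in memory.

```python
import json
import re

_WS = re.compile(r"[ \t\n\r]*")  # JSON whitespace


def iter_array_items(fp, chunk_size=1 << 16):
    """Yield the elements of a top-level JSON array from a text file object.

    Hypothetical helper: only the current chunk plus one element is held
    in memory, not the whole document.
    """
    decoder = json.JSONDecoder()
    buf = fp.read(chunk_size)
    pos = _WS.match(buf).end()
    if buf[pos:pos + 1] != "[":
        raise ValueError("expected a top-level JSON array")
    pos += 1
    at_eof = False
    while True:
        pos = _WS.match(buf, pos).end()
        if buf[pos:pos + 1] == "]":
            return
        # Skip the comma that separates this element from the previous one
        start = pos + 1 if buf[pos:pos + 1] == "," else pos
        start = _WS.match(buf, start).end()
        try:
            value, end = decoder.raw_decode(buf, start)
            after = _WS.match(buf, end).end()
            # Accept the value only if a delimiter follows it; otherwise
            # a number such as 12 might really be the start of 123
            complete = buf[after:after + 1] in (",", "]")
        except json.JSONDecodeError:
            complete = False
        if complete:
            yield value
            pos = after
            continue
        if at_eof:
            raise ValueError("truncated or malformed JSON array")
        chunk = fp.read(chunk_size)
        at_eof = not chunk
        buf = buf[pos:] + chunk  # drop consumed text, append new data
        pos = 0


def convert(src_path, dst_path):
    """Write each array element of src_path as one line of dst_path."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for item in iter_array_items(src):
            dst.write(json.dumps(item) + "\n")
```

Calling convert('input_file.json', 'output_file.jsonl') would then play the same role as the bigjson loop above. The sketch is lenient about stray commas and re-parses an element whenever it straddles a chunk boundary, so treat it as a starting point and not a tuned parser.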


The jsonlines package is made exactly for your use case:

import jsonlines

items = [
    {'a': 1, 'b': 2},
    {'a': 123, 'b': 456},
]
with jsonlines.open('output.jsonl', 'w') as writer:
    writer.write_all(items)

(Yes, I wrote it years after you posted your original question.)


Your input appears to be a sequence of Python objects; it certainly is not a valid JSON document.

If you have a list of Python dictionaries, then all you have to do is dump each entry into a file separately, followed by a newline:

import json

with open('output.jsonl', 'w') as outfile:
    for entry in JSON_file:
        json.dump(entry, outfile)
        outfile.write('\n')

The default configuration for the json module is to output JSON without newlines embedded.

Assuming your A, B and C names are really strings, that would produce:

{"index": 1, "met": "1043205", "no": "A"}
{"index": 2, "met": "000031043206", "no": "B"}
{"index": 3, "met": "0031043207", "no": "C"}
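
On the point about embedded newlines: a quick way to convince yourself is to dump an entry whose string value contains a line break and then read the lines back. The entries below are invented for illustration, and io.StringIO stands in for the real output file.

```python
import io
import json

# Hypothetical entries; the first one has a newline inside a string value
entries = [
    {"index": 1, "note": "first line\nsecond line"},
    {"index": 2, "note": "plain"},
]

buffer = io.StringIO()  # stands in for open('output.jsonl', 'w')
for entry in entries:
    json.dump(entry, buffer)  # the newline is written as the two characters \n
    buffer.write("\n")        # the only real line break is the record separator

# Reading JSONL back is the reverse: one json.loads() call per line
lines = buffer.getvalue().splitlines()
restored = [json.loads(line) for line in lines]
```

Each entry ends up on exactly one physical line, and parsing the lines gives back the original dictionaries.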

If you started with a JSON document containing a list of entries, just parse that document first with json.load()/json.loads().
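
Put together, that case might look like the sketch below. The function name json_array_to_jsonl and the file paths are placeholders, and it assumes the whole document fits in memory and that its top-level value is a list.

```python
import json


def json_array_to_jsonl(src_path, dst_path):
    """Convert a JSON file holding a list into a JSON Lines file.

    Returns the number of entries written.
    """
    # Parse the whole document first
    with open(src_path, encoding="utf-8") as src:
        entries = json.load(src)
    if not isinstance(entries, list):
        raise ValueError("top-level JSON value must be a list")

    # Then dump each entry on its own line
    with open(dst_path, "w", encoding="utf-8") as dst:
        for entry in entries:
            json.dump(entry, dst)
            dst.write("\n")
    return len(entries)
```

For example, json_array_to_jsonl('input.json', 'output.jsonl') would write one line per list entry.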
