Reading big arrays from a big JSON file in PHP

Your problem is basically related to how the programming language you use manages memory when accessing data from a huge (storage-purpose) file.

For example, when you chain the operations in a single statement, as in the code you mentioned (below),

$data = json_decode(file_get_contents(storage_path("test/ts/ts_big_data.json")), true);

what happens is that the memory used by the runtime Zend engine increases too much, because it has to allocate memory units to keep track of each ongoing file operation involved in your statement - keeping in memory not only the opened file but also a pointer to it - until that buffer is finally overwritten and the memory released (freed) again. It's no wonder that if you force the execution of both file_get_contents(), which reads the file into a string, and json_decode() in a single statement, you force the interpreter to keep all 3 "things" in memory: the file itself, the reference created (the string), and also the decoded structure (the JSON data).

On the contrary, if you break the statement into several ones, the memory held by the first data structure (the file) can be unloaded once the operation of getting its content and writing it into another variable (or file) is fully performed. As long as you don't define a variable in which to save the data, it will still stay in memory (as a blob - with no name, no storage address, just content). For this reason, it is much more CPU- and RAM-effective - when working with big data - to break everything into smaller steps.

So, first, start by simply rewriting your code as follows:

$somefile = file_get_contents(storage_path("test/ts/ts_big_data.json"));

$data = json_decode($somefile, true);

When the first line is executed, the memory held for ts_big_data.json is released (think of it as being purged and made available again to other processes).

When the second line is executed, $somefile's memory buffer can be released too. The take-away is that instead of always holding 3 memory buffers just to store the data structures, you only hold 2 at any given time (ignoring, of course, the other memory used to actually process the file). Not to mention that when working with arrays (and JSON files are exactly that), dynamically allocated memory grows dramatically, not linearly as we might tend to think. The bottom line is that instead of losing roughly 50% on memory allocation for the files (3 big buffers taking 50% more space than just 2 of them), it is better to handle the execution of the functions 'touching' these huge files in smaller steps.
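If you want to make that release explicit instead of waiting for the variables to go out of scope, here is a minimal sketch along the same lines (unset() is standard PHP; the path is the one from your question):

// Read the raw file into a string (first buffer).
$somefile = file_get_contents(storage_path("test/ts/ts_big_data.json"));

// Decode it into a PHP array (second buffer).
$data = json_decode($somefile, true);

// Drop the raw string as soon as it is no longer needed,
// so only the decoded array stays in memory.
unset($somefile);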

To understand this, imagine that you access only what is needed at a certain moment in time. This is close to the principle called YAGNI - You Aren't Gonna Need It - from the Extreme Programming practices (see the reference here: https://wiki.c2.com/?YouArentGonnaNeedIt), an idea inherited from the old C or COBOL times.

The next approach to follow is to break the file into more pieces, but structured ones (a relational data structure), as in database table(s).

Obviously, you have to save the data pieces again, as blobs, in the database. The advantage is that retrieving data from a DB is much faster than from a file (due to the indexes the SQL engine allocates when generating and updating the tables). A table with one or two indexes can be accessed lightning fast by a structured query. Again, the indexes are pointers to the main storage of the data.
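As an illustration only (the SQLite file, table and column names are hypothetical - adapt them to your real schema), a rough sketch of storing the decoded records in an indexed table via PDO:

// Hypothetical example: store each decoded record in an indexed table
// so later lookups hit the index instead of re-reading the whole file.
$pdo = new PDO('sqlite:' . storage_path('test/ts/ts_data.sqlite'));
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec('CREATE TABLE IF NOT EXISTS ts_records (
    id INTEGER PRIMARY KEY,
    recorded_at TEXT,
    payload TEXT
)');
$pdo->exec('CREATE INDEX IF NOT EXISTS idx_recorded_at ON ts_records (recorded_at)');

$data = json_decode(file_get_contents(storage_path("test/ts/ts_big_data.json")), true);

$stmt = $pdo->prepare('INSERT INTO ts_records (recorded_at, payload) VALUES (:recorded_at, :payload)');
$pdo->beginTransaction();
foreach ($data as $record) {
    $stmt->execute([
        ':recorded_at' => $record['recorded_at'] ?? null, // assumed field name
        ':payload'     => json_encode($record),           // the rest stored as a blob
    ]);
}
$pdo->commit();

// Later: one indexed lookup instead of scanning the whole file.
$q = $pdo->prepare('SELECT payload FROM ts_records WHERE recorded_at = :t');
$q->execute([':t' => '2020-01-01T00:00:00Z']);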

One important topic, however: if you still want to keep working with JSON (as the content and type of data storage, instead of tables in a DB), you cannot update it locally without rewriting it globally. I am not sure what you meant by reading the time-related function values in the JSON file. Do you mean that your JSON file is continuously changing? Better to break it into several tables, so each separate one can change without affecting the whole mega-structure of the data. Easier to manage, easier to maintain, easier to locate changes.

My understanding is that the best solution would be to split the same file into several JSON files, stripping out the values you don't need. BY THE WAY, DO YOU ACTUALLY NEED ALL THE STORED DATA?

I wouldn't come up with code right now unless you explain the issues above to me (so we can have a conversation), and then I will edit my answer accordingly. I wrote a question yesterday related to handling blobs - and storing them on the server - in order to accelerate a data update on a server using a cron process. My data was about 25MB+, not 500+ as in your case; however, I must understand the use case for your situation.

One more thing: how was the file you must process created? Why do you only manage its final form instead of intervening in how it is fed with data? My opinion is that you might stop storing data into it as you did so far (and thus stop adding to your pain), turn it into historical data storage only from now on, and store the future data in something more elastic (such as MongoDB or other NoSQL databases).

You probably don't need code so much as a solid and useful strategy and way of working with your data first.

Programming comes last, after you have decided on the detailed architecture of your web project.


JSON is a great format and a way better alternative to XML. In the end, JSON is almost one-to-one convertible to XML and back.

Big files can get bigger, so we don't want to read all the content into memory and we don't want to parse the whole file. I had the same issue with XXL-size JSON files.

I think the issue lies not in a specific programming language, but in the implementation and the specifics of the formats.

I have 3 solutions for you:

  1. Native PHP implementation (preferred)

There is a library, https://github.com/pcrov/JsonReader, that is almost as fast as the streamed XMLReader. Example:

use pcrov\JsonReader\JsonReader;

require 'vendor/autoload.php'; // Composer autoloader

$reader = new JsonReader();
$reader->open("data.json");

// Walk the document node by node; read("type") advances to the next
// node named "type" without loading the whole file into memory.
while ($reader->read("type")) {
    echo $reader->value(), "\n";
}
$reader->close();

This library will not read the whole file into memory or parse all the lines. It traverses the tree of the JSON object step by step, on command.

  2. Let go of the format (cons: multiple conversions)

Preprocess the file into a different format like XML or CSV. There are very lightweight Node.js libs like https://www.npmjs.com/package/json2csv for converting JSON to CSV; the converted file can then be read line by line, as sketched below.
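A minimal sketch of consuming the converted file one row at a time with fgetcsv() - assuming the output is called data.csv and has a header row (which json2csv writes by default):

// Hypothetical example: process the converted CSV row by row
// instead of loading everything into memory at once.
$handle = fopen("data.csv", "r");
if ($handle === false) {
    die("Cannot open data.csv");
}

$header = fgetcsv($handle); // first row: column names
while (($row = fgetcsv($handle)) !== false) {
    $record = array_combine($header, $row);
    // ... work with one record at a time here ...
}
fclose($handle);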

  3. Use some NoSQL DB (cons: additional complex software to install and maintain)

For example, Redis or CouchDB (you can import a JSON file into CouchDB).
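For illustration, a minimal sketch using the phpredis extension (the key scheme and the one-record-per-key layout are assumptions, not a prescribed design):

// Hypothetical example: load the big file once, push each record into Redis,
// then read single records back without touching the big file again.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$data = json_decode(file_get_contents("data.json"), true);
foreach ($data as $i => $record) {
    $redis->set("ts:record:$i", json_encode($record)); // one key per record
}

// Later, fetch just the record you need:
$record = json_decode($redis->get("ts:record:42"), true);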