How to use Mathematica to train a network using out-of-core classification?

There are two parts to your question: (1) how to do out-of-core classification, and (2) why the result is bad.

For the first part, you can use a generator function that reads batches from the file. For the second part, the result is bad because the data is not randomized.

fileName = "/Users/xslittlegrass/Downloads/test_data_SE.dat";

data = RandomSample@Flatten@Table[{x, y} -> x*y, {x, -1, 1, .05}, {y, -1, 1, .05}];
mydata = Flatten[data /. {(a_ -> b_) -> {a, b}}];
Close@BinaryWrite[fileName, mydata, "Real32", ByteOrdering -> -1];

Note that I use RandomSample to shuffle the data before writing it to the file.
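As a quick sanity check on the file layout, here is a minimal sketch that assumes fileName and mydata from the snippet above are still defined. It only uses FileByteCount and BinaryReadList, and it relies on each sample being stored as three consecutive Real32 values (x, y, x*y).

```mathematica
(* Assumes fileName and mydata are defined as in the snippet above. *)
(* Each Real32 value occupies 4 bytes, so the file size should match. *)
FileByteCount[fileName] == 4*Length[mydata]

(* Read the first two samples back and group them as {x, y, x*y} triples. *)
Partition[BinaryReadList[fileName, "Real32", 6, ByteOrdering -> -1], 3]
```

If the first expression is not True, the 3*4 bytes-per-sample arithmetic used when reading batches will not line up with the file.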

file = OpenRead[fileName, BinaryFormat -> True];
net = NetChain[{32, Tanh, 1}, "Input" -> 2, "Output" -> "Scalar"];
size = FileByteCount[fileName];
(* Read batchSize samples (3 Real32 values of 4 bytes each per sample); *)
(* rewind to the start of the file first if fewer than batchSize samples remain. *)
read[file_, batchSize_] := (
   If[StreamPosition[file] + batchSize*3*4 > size, 
    SetStreamPosition[file, 0]];
   BinaryReadList[file, "Real32", batchSize*3]);

batchSize = 128;

We can define a generator that reads the data from the file in batches:

generator = Function[#[[1 ;; 2]] -> #[[3]] & /@ Partition[read[file, #BatchSize], 3]];
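Before training, it can help to call the generator by hand and look at what it returns. The sketch below assumes file, size, read and generator are defined as above. It mimics the call NetTrain makes by passing an association with a "BatchSize" key; the batch size of 4 and the variable name batch are arbitrary choices for illustration.

```mathematica
(* Assumes file, size, read and generator are defined as above. *)
(* Mimic NetTrain's call with a small, hand-picked batch size. *)
batch = generator[<|"BatchSize" -> 4|>];

Length[batch]  (* number of examples in the batch *)
First[batch]   (* a single rule of the form {x, y} -> x*y *)

(* Rewind so that training starts from the beginning of the file. *)
SetStreamPosition[file, 0];
```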

net = NetTrain[net, generator, BatchSize -> batchSize, MaxTrainingRounds -> 1000]
Close[file];

The result looks much better now:

ContourPlot[net[{x, y}], {x, -1, 1}, {y, -1, 1}, 
 ColorFunction -> "RedGreenSplit", PlotLegends -> Automatic]

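To put a number on the fit rather than judging it from the plot alone, one option is to compare the trained net with the target function x*y on a grid. This is only a sketch: it assumes net is the trained network from above, and the names test and the 0.1 grid spacing are chosen here for illustration.

```mathematica
(* Assumes net is the trained network from above. *)
(* Build a grid of {x, y} test points on [-1, 1] x [-1, 1]. *)
test = Flatten[Table[{x, y}, {x, -1, 1, 0.1}, {y, -1, 1, 0.1}], 1];

(* Mean absolute error between the net's predictions and the true products. *)
Mean[Abs[net[test] - (Times @@@ test)]]
```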


Okay, here's how you do out-of-core training with HDF5:

input = RandomReal[1, {1000, 2}];
output = RandomReal[1, {1000, 2}];

Get["GeneralUtilities`"];
ExportStructuredHDF5["test.h5", <|"Input" -> input, 
  "Output" -> output|>]

NetTrain[LinearLayer["Input" -> 2, "Output" -> 2], File["test.h5"]]

The use of ExportStructuredHDF5 is just for convenience; you could also use Export, but it doesn't support associations directly. Again, you'll need to make a dataset that consists of extendible columns if you want a real-world out-of-core example.

Also note that you need to randomize the order of the data yourself before writing it to the H5 file.
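One way to do that shuffling, sketched under the assumption that input and output are the in-memory arrays defined above with one example per row, is to apply the same random permutation to both so that each input stays paired with its output. The names perm, shuffledInput and shuffledOutput are introduced here for illustration only.

```mathematica
(* Assumes input and output are defined as above, one example per row. *)
(* Draw one random permutation and apply it to both arrays. *)
perm = RandomSample[Range[Length[input]]];
shuffledInput = input[[perm]];
shuffledOutput = output[[perm]];

(* Write the shuffled arrays with the same call used above. *)
Get["GeneralUtilities`"];
ExportStructuredHDF5["test.h5", <|"Input" -> shuffledInput, 
  "Output" -> shuffledOutput|>]
```

This only works while the data still fits in memory; for a dataset that is truly out of core, the shuffling has to happen in whatever tool writes the file.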


@xslittlegrass's answer is perfect, but I want to give a heads-up that we will ship a way to stream training data to NetTrain from an ".h5" file that can be arbitrarily big (e.g. hundreds of gigabytes). This will hopefully ship in 11.1.1 or 11.2. The ".h5" file must have a very simple format: one dataset for each port, so in your example an "Input" dataset and an "Output" dataset.

Unfortunately, it will remain undocumented for now because our existing HDF5 exporter cannot create extendible datasets using the documented functionality, so it's hard for you to use Mathematica to create the out-of-core dataset in the first place. You could of course create it in something else, like Python. But for some power users it will be just the ticket, and much faster than using BinaryRead plus your own generator.