How do you make a Neural Net?

The most basic neural nets are just a DotPlusLayer and some layer that provides nonlinearity. I recommend starting with that. This is essentially what you see in textbooks. You can see examples like this in the documentation for NetTrain.

Let's make some data for your example of a function that converts RGB values to hues:

rndExample := 
 With[{hueValue = RandomReal[]}, 
  List @@ ColorConvert[Hue[hueValue], "RGB"] -> hueValue];

data = Table[rndExample, {5000}];

Make a net that has a DotPlusLayer and some nonlinear layer after it. I added a SummationLayer at the end just so we'd end up with a scalar.

net = NetChain[{DotPlusLayer[12], ElementwiseLayer[Tanh], SummationLayer[]}, "Input" -> 3]

We can now train it on our data:

trainedNN = NetTrain[net, data]

Run trainedNN on your test data and compare it to the original desired output:

(trainedNN /@ data[[All, 1]]) - (data[[All, 2]])

You can use MaxTrainingRounds to change how long it is trained for. This example is far from perfect, but you can check and see that it does a decent job of guessing what the value should be. You can easily get improved results, for example by increasing the training time or adding another DotPlusLayer after the Tanh layer, as in the sketch below.
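A minimal sketch of both tweaks, assuming the same data as above (the extra layer and the round count are illustrative choices, not tuned values):

net2 = NetChain[{DotPlusLayer[12], ElementwiseLayer[Tanh], 
    DotPlusLayer[12], ElementwiseLayer[Tanh], SummationLayer[]}, 
   "Input" -> 3];
trainedNN2 = NetTrain[net2, data, MaxTrainingRounds -> 2000];
Mean[Abs[(trainedNN2 /@ data[[All, 1]]) - data[[All, 2]]]] (* mean absolute error on the training data *)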


From A comprehensive overview of network layers and functions

Motivation

As Mathematica v.11 was released earlier this month with a host of new [[experimental]] functions, but with only a limited number of examples on curated data that do not cover all layers, options, etc., I am posting this with the intent of augmenting Mathematica's documentation via the cumulative knowledge of the coding wizards of Stack Exchange.

Why read this?

It includes more details and examples than the documentation (currently). Also, you will see that there are a lot of peculiarities that may or may not be causing errors in your code. E.g. if you are using NetDecoder for classes, it automatically assumes that you are using the UnitVector encoding; there is no decoder for Booleans; etc.

Disclaimer

Until this point I have implemented my neural networks in Mathematica by hand. I have a good grasp of the theory behind neural networks (and the associated mathematics); however, Mathematica, being proprietary, does not make it clear which algorithms it uses to implement its network layers. Therefore I honestly am not sure of some of these layers' underpinnings.

Correction from Sebastian: Mathematica uses the same definitions as all the other frameworks. Definitions have become rather standardized, as everyone wants to use the same backends, like cuDNN.

Layers

BatchNormalizationLayer

There are several layers introduced in v.11 that cannot be used uninitialized; this is one of them. Input must be either a rank-1 or rank-3 tensor. To be honest, I do not think we can see its true effect on a single input, as demonstrated by the two examples below. I believe the effect of batch normalization only shows up during NetTrain. If someone has a simple example of this, please let me know and I will update this.

batchNet = NetInitialize[BatchNormalizationLayer["Input"->{1}]];

ListLinePlot[Flatten[Table[{i - batchNet[{i}]}, {i, -20, 50}]]]

tensor334 = {{{1, 2, 10, 20}, {3, 4, 30, 40}, {5, 6, 50, 60}}, {{5, 6,
  50, 60}, {7, 8, 70, 80}, {9, 10, 90, 100}}, {{1000, 10000, 1, 
 5}, {500, 5000, 12, 215}, {21312, 325, 6234, 412}}};

batchNet = NetInitialize[BatchNormalizationLayer["Input" -> {3, 3, 4}]];

batchNet[tensor334]

CatenateLayer

The CatenateLayer is pretty simple. It is the net analogue of joining lists together (roughly Flatten at the first level, if you put two lists inside a list). If you aren't exactly impressed by this yet, it has more interesting abilities when it comes to network architecture (see example 2). Because Mathematica designed each net's nodes / neurons / "ports" to receive only one input (with the exception of loss layers, which take an association of two rules), it is unintuitive to project two separate nodes onto a third node. CatenateLayer (and others like it) allows for this. Running example 2 in your notebook will show how the typically linear NetGraphs (in most of Mathematica's examples) can become quite a bit more intricate. To demonstrate how intricate it can become (should you have the patience to code it), see example 3. Correction above from Sebastian.

JHM gives a clear distinction between the purposes of CatenateLayer and FlattenLayer; see the FlattenLayer section.

Example 1

a = {{{1},{2},{3}}};
a//Dimensions
b = CatenateLayer[][a]
b//Dimensions
c = CatenateLayer[][b]
c//Dimensions

Example 2

NetGraph[{BatchNormalizationLayer[], Tanh, LogisticSigmoid, Tanh, TotalLayer[], TotalLayer[], TotalLayer[], CatenateLayer[], DotPlusLayer[50]}, {1 -> 2, 1 -> 3, 1 -> 4, 2 -> 5, 3 -> 5, 3 -> 6, 4 -> 6, 2 -> 7, 4 -> 7, 5 -> 8, 6 -> 8, 7 -> 8, 8 -> 9}, "Input" -> 2]

Example 3

NetInitialize[
 NetGraph[{BatchNormalizationLayer[], Tanh, LogisticSigmoid, Tanh, 
   TotalLayer[], TotalLayer[], TotalLayer[], CatenateLayer[], 
   DotPlusLayer[50], BatchNormalizationLayer[], Tanh, LogisticSigmoid,
    Tanh, TotalLayer[], TotalLayer[], TotalLayer[], CatenateLayer[], 
   DotPlusLayer[50], DropoutLayer[], DropoutLayer[], TotalLayer[], 
   LogisticSigmoid, BatchNormalizationLayer[], Tanh, LogisticSigmoid, 
   Tanh, TotalLayer[], TotalLayer[], TotalLayer[], CatenateLayer[], 
   DotPlusLayer[50], DropoutLayer[], TotalLayer[], TotalLayer[], 
   TotalLayer[], TotalLayer[], TotalLayer[], TotalLayer[], 
   DotPlusLayer[50], DotPlusLayer[1]}, {1 -> 2, 1 -> 3, 1 -> 4, 
   2 -> 5, 3 -> 5, 3 -> 6, 4 -> 6, 2 -> 7, 4 -> 7, 5 -> 8, 6 -> 8, 
   7 -> 8, 8 -> 9, 10 -> 11, 10 -> 12, 10 -> 13, 11 -> 14, 12 -> 14, 
   11 -> 15, 13 -> 15, 13 -> 16, 12 -> 16, 16 -> 17, 15 -> 17, 
   14 -> 17, 17 -> 18, 18 -> 19, 9 -> 20, 20 -> 21, 19 -> 21, 
   21 -> 22, 23 -> 24, 23 -> 25, 23 -> 26, 24 -> 27, 25 -> 27, 
   24 -> 28, 25 -> 29, 26 -> 28, 26 -> 29, 27 -> 30, 28 -> 30, 
   29 -> 30, 30 -> 31, 31 -> 32, 32 -> 21, 30 -> 33, 8 -> 33, 8 -> 34,
    17 -> 34, 30 -> 35, 17 -> 35, 33 -> 36, 34 -> 36, 34 -> 37, 
   35 -> 37, 37 -> 38, 36 -> 38, 38 -> 39, 39 -> 21, 22 -> 40}, 
  "Input" -> 85]]

CrossEntropyLossLayer

Depending on your familiarity with information theory this layer may or may not make much sense to you. I recommend Information Theory: a tutorial introduction if you are new to this concept and want to learn more (PDF download from the author's ResearchGate account).

For a brief and informal description: entropy (information) is defined somewhat backwards relative to most people's intuition. Unlike probability, where we give an event the value 1 if we are certain it will occur, here if we know something we give it the value 0. Why? Because if we know something will happen, then when that event occurs we do not gain any extra knowledge. Along those lines, you can think of entropy as the surprise, or the amount of information we gain, when something happens.

For example, this gives an output of 0 (for an index target, the loss is essentially minus the log of the probability assigned to the target class, and $-\log 1 = 0$):

CrossEntropyLossLayer[][<|"Input" -> {1}, "Target" -> 1|>]

And these do not. As the input grows relative to the target, you can see that the value's magnitude increases.

CrossEntropyLossLayer[][<|"Input" -> {2}, "Target" -> 1|>]
CrossEntropyLossLayer[][<|"Input" -> {20}, "Target" -> 1|>]
CrossEntropyLossLayer[][<|"Input" -> {200}, "Target" -> 1|>]

ListLinePlot[Table[{i, CrossEntropyLossLayer[][<|"Input" -> {i}, "Target" -> 1|>]}, {i, 0,100}]]

Here we have been supplying the index of the target class. There is also the ability to use the "Probabilities" specification to pass your data as a vector of class probabilities (see the sketch below).
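A minimal sketch of that, assuming the CrossEntropyLossLayer["Probabilities"] form described in the documentation, where both ports take probability vectors:

CrossEntropyLossLayer["Probabilities"][<|"Input" -> {0.7, 0.2, 0.1}, "Target" -> {1., 0., 0.}|>]
(* should be -Log[0.7], roughly 0.357 *)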

ConvolutionLayer

The convolution layer is similar to the pooling layer (below), with the exception of being able to specify the number of outputs.

This layer, unlike others (which can take either an arbitrary-rank tensor, or rank-1 or rank-3 tensors), can only take a rank-3 numerical tensor.

The example below shows how different kernel sizes of {h,w} affect this {3,3,3} tensor input. Here the output channels are limited to 1 for clarity. The first list in this output is the dimensions of the output, followed by the output.

If it seems too confusing, in short: let's say you provide a tensor with dimensions {a,b,c} to ConvolutionLayer[n, {h,w}]. The resulting output would be (most likely$*$) a tensor with dimensions {n, b-h+1, c-w+1}. It should be clear that your kernel can't be larger than the second and third dimensions of your input tensor.

$*$ we will talk about this formula in the PoolingLayer section.

Table[
 With[{out = 
    NetInitialize[ConvolutionLayer[1, {i, j}, "Input" -> {3, 3, 3}]][
     {
      {{1, 2, 3}, {3, 2, 1}, {7, 8, 9}},
      {{4, 5, 6}, {6, 5, 4}, {1, 2, 3}},
      {{7, 8, 9}, {9, 8, 7}, {4, 5, 6}}
      }]},
  (* show the output dimensions alongside the output itself, from the same initialized layer *)
  {Dimensions[out], out}],
 {i, 1, 3}, {j, 1, 3}] // MatrixForm

DeconvolutionLayer

It basically "undoes" the Convolution. However do not be mistaken, if you feed the output of convolution to deconvolution you will not recieved the same result.

a={
{{1, 2, 3}, {3, 2, 1}, {7, 8, 9}},
{{4, 5, 6}, {6, 5, 4}, {1, 2, 3}},
{{7, 8, 9}, {9, 8, 7}, {4, 5, 6}}
};
a//Dimensions
b=NetInitialize[ConvolutionLayer[1, {1, 1}, "Input" -> {3, 3, 3}]][a];
b//Dimensions
c=NetInitialize[DeconvolutionLayer[3, {1, 1}, "Input" -> {1, 3, 3}]][b];
c//Dimensions

DropoutLayer

This is a pretty important layer (in my opinion). It is similar to the dropout method used by neural network enthusiasts to make neural networks more akin to other ensemble methods like random forests.

In essence it takes one argument, p, which is the probability that each of its input elements is set to zero during training, and it scales the remaining elements by $\frac{1}{1-p}$. This is similar to the BatchNormalizationLayer, I guess, in the sense that you cannot see its effect without training. E.g. DropoutLayer[.5][Range[-3, 3]] will just give you Range[-3, 3] back unless you are using NetTrain. So this makes trying to adapt this layer for other purposes a bit more tricky. If you know of a way to invoke this without NetTrain, please let me know.
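One hedged aside: in later versions of Mathematica, the NetEvaluationMode option can reportedly force training-mode behavior at evaluation time, which would make the dropout visible outside NetTrain. A sketch, treating that option as an assumption:

DropoutLayer[0.5][Range[-3, 3], NetEvaluationMode -> "Train"]
(* roughly half the elements should come back as zero and the rest scaled up; the exact output is random *)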

EmbeddingLayer

This is one of several net layers (including ConvolutionLayer, DotPlusLayer, etc.) that the documentation specifically calls "trainable" (parameters can be modified during training). It too must be initialized before use. It has two arguments, n and size: it takes integers in [1,n] and maps them into vectors of length size.

NetInitialize[EmbeddingLayer[2, 3]][{1}]
NetInitialize[EmbeddingLayer[5, 3]][{1,3,2}]

There exist other ways to get this functionality using the other layers. I think this layer really only exists because of classes. The documentation gives the following example:

NetInitialize[EmbeddingLayer[2, 3, "Input" -> NetEncoder[{"Class",{True, False}}]]]

This to me is odd, as NetEncoder has a specific option for Booleans; however, that would not work here, as the first argument, n, cannot be zero (by definition).

From Sebastian

Suppose you are trying to create a vector representation of a very high-dimensional categorical input (like words). NetEncoder will produce one-hot encoded vectors of the same dimension as the number of categories. This can be absolutely massive, and make training impossible. EmbeddingLayer solves this: it maps integers directly to a low-dimensional vector subspace, and this embedding is trained. This allows things like Word2Vec to be implemented (for example). The docs should make this use-case clearer.

FlattenLayer

I forgot this in my original post. Thank you JHM for providing this.

Catenate[{{1, 2, 3}, {4, 5, 6}}]
(* {1, 2, 3, 4, 5, 6} *)

CatenateLayer[][{{1, 2, 3}, {4, 5, 6}}]
(* {1, 2, 3, 4, 5, 6} *)

CatenateLayer[][{{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}]
(* {{1, 2}, {3, 4}, {5, 6}, {7, 8}} *)

Flatten[{{1, 2, 3}, {4, 5, 6}}]
(* {1, 2, 3, 4, 5, 6} *)

FlattenLayer[][{{1, 2, 3}, {4, 5, 6}}]
(* {1., 2., 3., 4., 5., 6.} *)

FlattenLayer[][{{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}}]
(* {1., 2., 3., 4., 5., 6., 7., 8.} *)

CatenateLayer and FlattenLayer have different usage. If you want to connect your lists (i.e. flatten only the outermost level), then CatenateLayer would be the correct choice. If you want to make your rank-n tensor into a vector, then FlattenLayer would be appropriate.

Note: FlattenLayer automatically converts Integers to Reals. I do not know whether that is intended.

MeanAbsoluteLossLayer

This layer does exactly what you think it would; therefore I am only putting two lines of code, which should make it pretty apparent.

MeanAbsoluteLossLayer[][<|"Input" -> {1, 1, 1, 4}, "Target" -> {1, 1, 1, 4}|>]

MeanAbsoluteLossLayer[][<|"Input" -> {1, 1, 1, 4}, "Target" -> {1.1, 0.9, 1, 4}|>]
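As a quick check, the second value is just the mean of the elementwise absolute differences:

Mean[Abs[{1, 1, 1, 4} - {1.1, 0.9, 1, 4}]]
(* 0.05 *)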

If one specifies the Input to be an integer n, e.g.

MeanAbsoluteLossLayer["Input"->n];

Then one can run it over a tensor whose innermost level has length n.

e.g.

 MeanAbsoluteLossLayer["Input"->3][<|"Input"->{{1,1,1},{1,.9,1.1}},"Target"->{{1,1,1},{1,1,1}|>];

MeanSquaredLossLayer

Similar to the above but, you know, it squares the differences first...

MeanSquaredLossLayer[][<|"Input" -> {1, 1, 1, 4}, "Target" -> {1, 1, 1, 4}|>]
MeanSquaredLossLayer[][<|"Input" -> {1, 1, 1, 4}, "Target" -> {1.1, 0.9, 1, 4}|>]

PoolingLayer

This is similar to ConvolutionLayer, at least in that it takes the same inputs; the difference lies in the function the kernel applies. You can specify the kernel to use the functions Max, Mean, or Total. How this works will be clear in the example.

As promised, however, let's talk about the transformation of a tensor with dimensions {a,b,c} by this layer. Foremost, unlike convolution, you cannot change the number of output channels, i.e. the first dimension of the output will always be the same as a: Dimensions[output] will yield {a,x,y}. So what are the values of x and y? Let us assume you do not mess with the PaddingSize or Stride options. Then x will be the minimum of b-h and b-1, where h is the kernel height, and similarly y will be the minimum of c-w and c-1, where w is the kernel width.

If you want to mess with PaddingSize (how many zeros are padded onto your input) and Stride (the step size of your kernel), then the formula for x becomes $\left\lfloor \frac{\min(b+2p-k+s-1,\; b+2p-1)}{s} \right\rfloor$, where k is the corresponding kernel dimension; the same goes for y with c in place of b.

PoolingLayer[{1, 2}][
 {
  {{1, 2, 3}, {3, 2, 1}, {7, 8, 9}},
  {{4, 5, 6}, {6, 5, 4}, {1, 2, 3}},
  {{7, 8, 9}, {9, 8, 7}, {4, 5, 6}}
  }
 ]
PoolingLayer[{2, 2}][
 {
  {{1, 2, 3}, {3, 2, 1}, {7, 8, 9}},
  {{4, 5, 6}, {6, 5, 4}, {1, 2, 3}},
  {{7, 8, 9}, {9, 8, 7}, {4, 5, 6}}
  }
 ]
PoolingLayer[{3, 2}][
 {
  {{1, 2, 3}, {3, 2, 1}, {7, 8, 9}},
  {{4, 5, 6}, {6, 5, 4}, {1, 2, 3}},
  {{7, 8, 9}, {9, 8, 7}, {4, 5, 6}}
  }
 ]

As the default kernel function is Max, this output shouldn't be too hard to understand.
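For the other kernel functions there is the "Function" option; a quick sketch using Mean (the option name here is taken from the PoolingLayer documentation):

PoolingLayer[{2, 2}, "Function" -> Mean][
 {
  {{1, 2, 3}, {3, 2, 1}, {7, 8, 9}},
  {{4, 5, 6}, {6, 5, 4}, {1, 2, 3}},
  {{7, 8, 9}, {9, 8, 7}, {4, 5, 6}}
  }
 ]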

ReshapeLayer

Layer name says it all. There are some limitations, of course: whatever input you try to reshape needs to fit exactly into that shape. It will not automatically pad, nor will it return the input untransformed.

ReshapeLayer[{2, 2, 2}][Range[8]]
ReshapeLayer[{1, 8, 1}][Range[8]]

SummationLayer

This really doesn't need any explanation. It is basically the same as Total[], only you should probably specify your input size.

 SummationLayer[]
 SummationLayer["Input"->{3}][{1,2,3}]
 Total[{1,2,3}]

From Sebastian

To be more precise, it's more like Total[array, Infinity]. TotalLayer acts more like Total[{array1, array2, ...}].
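A quick comparison sketch of that point (the rank-2 input specification for SummationLayer is my own illustrative choice):

SummationLayer["Input" -> {2, 2}][{{1, 2}, {3, 4}}]
Total[{{1, 2}, {3, 4}}, Infinity]
(* both should give 10; the layer returns it as a real *)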

TotalLayer

This is like list addition: it adds elementwise whatever lists you throw at it.

TotalLayer[][{{1,2,3},{1,2,3}}]
{1,2,3} + {1,2,3}

Useful functions

NetExtract

This is probably one of the more important functions. I mean, your weights are the nuts and bolts of your network. It has two arguments: the net, and then the layers you want. You can refer to them either by layer number or by name.

simpleNet = NetChain[<|"Tanh" -> Tanh, "Dot" -> DotPlusLayer[7],"Sigmoid" -> LogisticSigmoid|>, "Input" -> 7];
NetExtract[simpleNet, {"Dot", "Weights"}] // MatrixForm
NetExtract[simpleNet, {"Dot", "Weights"}] // MatrixPlot

NOTE

You cannot get the weights of elementwise layers; it will return an error. This means you cannot extract all the weights at once. This to me is very stupid (and obnoxious if you use elementwise layers in the middle of your network). It could at least return "Elementwise layer". Anyway, that is just a gripe of mine.

NetEncoder

NetEncoder takes some of the work out of preprocessing your data. E.g. if you have used networks before, you are probably used to giving your classes numerical labels; this just automates that.

Print["Scalar", "\t", a = NetEncoder["Scalar"][Range[-3, 3]]];
Print["Class(unitVector)", "\t", 
  b = NetEncoder[{"Class", {"Apple", "Banana"}, 
      "UnitVector"}][{"Apple", "Apple", "Banana"}]];
Print["Class(index)", "\t", 
  c = NetEncoder[{"Class", {"Apple", "Banana"}, "Index"}][{"Apple", 
     "Apple", "Banana"}]];
Print["Boolean", "\t", d = NetEncoder["Boolean"][{True, False, True}]]

NetDecoder

This supposedly undoes the encoder. However, there are some, let's say, peculiarities to it. Foremost, there is no NetDecoder for Boolean. NetDecoder for "Scalar" is also very much like the flatten / catenate layers that we have seen before.

NetDecoder["Scalar"][{{-3.`}, {-2.`}, {-1.`}, {0.`}, {1.`}, {2.`},{3.`}}]

Another peculiarity is that NetDecoder automatically assumes that your classes come as a unit vector. If you specify "Index" it will produce an error. Actually, even explicitly specifying "UnitVector" will cause an error...

NetDecoder[{"Class", {"Apple", "Banana"}}][{0, 1}]
NetDecoder[{"Class", {"Apple", "Banana"}, "Index"}][{1}]
NetDecoder[{"Class", {"Apple", "Banana"}, "UnitVector"}][{0, 1}]

From Sebastian

This is because your network cannot output an index: no layer allows you to do this. Also, it assumes you are giving the decoder a probability vector for the classes, which allows you to use it like a ClassifierFunction.
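For instance, handing the default class decoder a probability vector returns the most likely class:

dec = NetDecoder[{"Class", {"Apple", "Banana"}}];
dec[{0.3, 0.7}]
dec[{0.9, 0.1}]
(* "Banana" and "Apple", respectively *)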

NetGraph

In my mind, their implementation of NetGraph is kind of silly. Why? Because if you define a net and then pass it to NetGraph, it doesn't produce a graph; however, if you define your net inside NetGraph it still functions just like NetChain, but you also get the picture, and who doesn't love a nice picture?

From Sebastian

It does produce a graph, just the net you passed in looks like any other layer (until you click on it and see its structure). The philosophy is that you can use NetChain or NetGraph objects exactly as you would use normal layers inside other NetChain or NetGraph objects. It allows for nice definitions of things like Inception networks etc. It also solves namespacing issues elegantly when composing containers.
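A small sketch of that composability (the inner chain here is a toy example of my own): a NetChain dropped into a NetGraph behaves like any other layer:

inner = NetChain[{DotPlusLayer[10], ElementwiseLayer[Tanh]}];
NetGraph[{inner, SummationLayer[]}, {1 -> 2}, "Input" -> 3]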

The difference between implementing your network in NetGraph rather than NetChain is that you get a lot more flexibility in defining your network architecture, as you will see in the examples below. Note: if you do not specify the underlying graph structure, it is assumed that your network is linear.

tinyNet = NetInitialize[
 NetGraph[{BatchNormalizationLayer[], Tanh, LogisticSigmoid, Tanh, 
   TotalLayer[], TotalLayer[], TotalLayer[], CatenateLayer[], 
   DotPlusLayer[50], DotPlusLayer[1], Tanh}, {1 -> 2, 1 -> 3, 1 -> 4, 
   2 -> 5, 3 -> 5, 3 -> 6, 4 -> 6, 2 -> 7, 4 -> 7, 5 -> 8, 6 -> 8, 
   7 -> 8, 8 -> 9, 9 -> 10, 10 -> 11}, "Input" -> 3]]

smallNet = 
 NetInitialize[
  NetGraph[{BatchNormalizationLayer[], Tanh, LogisticSigmoid, Tanh, 
    TotalLayer[], TotalLayer[], TotalLayer[], CatenateLayer[], 
    DotPlusLayer[50], BatchNormalizationLayer[], Tanh, 
    LogisticSigmoid, Tanh, TotalLayer[], TotalLayer[], TotalLayer[], 
    CatenateLayer[], DotPlusLayer[50], DropoutLayer[], DropoutLayer[],
     TotalLayer[], LogisticSigmoid}, {1 -> 2, 1 -> 3, 1 -> 4, 2 -> 5, 
    3 -> 5, 3 -> 6, 4 -> 6, 2 -> 7, 4 -> 7, 5 -> 8, 6 -> 8, 7 -> 8, 
    8 -> 9, 10 -> 11, 10 -> 12, 10 -> 13, 11 -> 14, 12 -> 14, 
    11 -> 15, 13 -> 15, 13 -> 16, 12 -> 16, 16 -> 17, 15 -> 17, 
    14 -> 17, 17 -> 18, 18 -> 19, 9 -> 20, 20 -> 21, 19 -> 21, 
    21 -> 22}, "Input" -> 2122]]

smallishNet = 
 NetInitialize[
  NetGraph[{BatchNormalizationLayer[], Tanh, LogisticSigmoid, Tanh, 
    TotalLayer[], TotalLayer[], TotalLayer[], CatenateLayer[], 
    DotPlusLayer[50], BatchNormalizationLayer[], Tanh, 
    LogisticSigmoid, Tanh, TotalLayer[], TotalLayer[], TotalLayer[], 
    CatenateLayer[], DotPlusLayer[50], DropoutLayer[], DropoutLayer[],
     TotalLayer[], LogisticSigmoid, BatchNormalizationLayer[], Tanh, 
    LogisticSigmoid, Tanh, TotalLayer[], TotalLayer[], TotalLayer[], 
    CatenateLayer[], DotPlusLayer[50], DropoutLayer[]}, {1 -> 2, 
    1 -> 3, 1 -> 4, 2 -> 5, 3 -> 5, 3 -> 6, 4 -> 6, 2 -> 7, 4 -> 7, 
    5 -> 8, 6 -> 8, 7 -> 8, 8 -> 9, 10 -> 11, 10 -> 12, 10 -> 13, 
    11 -> 14, 12 -> 14, 11 -> 15, 13 -> 15, 13 -> 16, 12 -> 16, 
    16 -> 17, 15 -> 17, 14 -> 17, 17 -> 18, 18 -> 19, 9 -> 20, 
    20 -> 21, 19 -> 21, 21 -> 22, 23 -> 24, 23 -> 25, 23 -> 26, 
    24 -> 27, 25 -> 27, 24 -> 28, 25 -> 29, 26 -> 28, 26 -> 29, 
    27 -> 30, 28 -> 30, 29 -> 30, 30 -> 31, 31 -> 32, 32 -> 21}, 
   "Input" -> 2122]]

smallNotReallyNet = 
 NetInitialize[
  NetGraph[{BatchNormalizationLayer[], Tanh, LogisticSigmoid, Tanh, 
    TotalLayer[], TotalLayer[], TotalLayer[], CatenateLayer[], 
    DotPlusLayer[50], BatchNormalizationLayer[], Tanh, 
    LogisticSigmoid, Tanh, TotalLayer[], TotalLayer[], TotalLayer[], 
    CatenateLayer[], DotPlusLayer[50], DropoutLayer[], DropoutLayer[],
     TotalLayer[], LogisticSigmoid, BatchNormalizationLayer[], Tanh, 
    LogisticSigmoid, Tanh, TotalLayer[], TotalLayer[], TotalLayer[], 
    CatenateLayer[], DotPlusLayer[50], DropoutLayer[], TotalLayer[], 
    TotalLayer[], TotalLayer[], TotalLayer[], TotalLayer[], 
    TotalLayer[], DotPlusLayer[50], DotPlusLayer[1]}, {1 -> 2, 1 -> 3,
     1 -> 4, 2 -> 5, 3 -> 5, 3 -> 6, 4 -> 6, 2 -> 7, 4 -> 7, 5 -> 8, 
    6 -> 8, 7 -> 8, 8 -> 9, 10 -> 11, 10 -> 12, 10 -> 13, 11 -> 14, 
    12 -> 14, 11 -> 15, 13 -> 15, 13 -> 16, 12 -> 16, 16 -> 17, 
    15 -> 17, 14 -> 17, 17 -> 18, 18 -> 19, 9 -> 20, 20 -> 21, 
    19 -> 21, 21 -> 22, 23 -> 24, 23 -> 25, 23 -> 26, 24 -> 27, 
    25 -> 27, 24 -> 28, 25 -> 29, 26 -> 28, 26 -> 29, 27 -> 30, 
    28 -> 30, 29 -> 30, 30 -> 31, 31 -> 32, 32 -> 21, 30 -> 33, 
    8 -> 33, 8 -> 34, 17 -> 34, 30 -> 35, 17 -> 35, 33 -> 36, 
    34 -> 36, 34 -> 37, 35 -> 37, 37 -> 38, 36 -> 38, 38 -> 39, 
    39 -> 21, 22 -> 40}, "Input" -> 85]]

NetPort

From Sebastian

The inputs and outputs of layers and containers (like NetGraph and NetChain) are called "Ports". NetPort is a way to unambiguously refer to one of these inputs/outputs. For example, MeanSquaredLossLayer has two input ports, "Target" and "Input". Consider:

NetGraph[{ElementwiseLayer[Tanh], DropoutLayer[], 
MeanSquaredLossLayer[]}, {1 -> NetPort[3, "Target"], 2 -> NetPort[3, "Input"]}]

NetPort allows you to specify exactly which input/output you are referring to.

It also lets you name the outputs of your graph as ports...
NetGraph[{DotPlusLayer[5], SummationLayer[]}, {1 -> NetPort["output"], 2 -> NetPort["sum"]}]

Alternatively, you can name the layers themselves directly in NetChain:

simpleNet = NetChain[<|"Tanh" -> Tanh, "Dot" -> DotPlusLayer[7], "Sigmoid" -> LogisticSigmoid|>, "Input" -> 7];

How to save your trained net

Very simple: use the extension .wlnet.

Export["file_of_your_net.wlnet",yourNet];

Import just requires this file...
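So loading it back in a fresh session should be just (a sketch using the file name above):

yourNet = Import["file_of_your_net.wlnet"]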

Graphs of basic elementwise layer functions

simpleNet = NetChain[{Ramp}];
ListLinePlot[simpleNet[Range[-5, 5]]]
simpleNet = NetChain[{Tanh}];
ListLinePlot[simpleNet[Range[-5, 5]]]
simpleNet = NetChain[{LogisticSigmoid}];
ListLinePlot[simpleNet[Range[-5, 5]]]
simpleNet = NetChain[{SoftmaxLayer[]}];
ListLinePlot[simpleNet[Range[-5, 5]]]

I had the same question, and it might be good to start with simpler examples. The simplest perceptron has two inputs.

So, a single perceptron has a combiner stage and a thresholding stage. The combiner stage is like a summing junction that performs the summation $y = \sum_i w_i x_i + b$, where the $x_i$ are the inputs, the $w_i$ are the weights (which are adjusted by a training process), and $b$ is the bias term (which can be thought of as a weight multiplied by a constant fixed input $x_0$). This hidden weight is not always drawn in perceptron network diagrams but is always there.

In Mathematica, the whole perceptron can be modeled in a single statement:

f = NetChain[{DotPlusLayer[1], ElementwiseLayer[LogisticSigmoid]}, 
  "Input" -> 2]

The DotPlusLayer[1] implements the combiner with 1 output, while the ElementwiseLayer[] implements the threshold (you have several choices, I used the LogisticSigmoid function). The input space is set to two inputs.

We can train this example. To do this first you need some training data, so let's try to make this solve the AND function:

trainingDataAnd = {{0, 0} -> {0}, {0, 1} -> {0}, {1, 0} -> {0}, {1, 1} -> {1}}

You have to initialize the perceptron to give an initial weight assignment:

f = NetInitialize[f]

and you can test the untrained neuron with an example input:

f[{0.5, 0.5}]

To train, you use the NetTrain[] function:

trainedAnd = NetTrain[f, trainingDataAnd]
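To sanity-check the result, you can map the trained net over all four inputs; each output is a length-1 vector that should be close to 0, 0, 0 and 1 respectively (how close depends on the training run):

trainedAnd /@ {{0, 0}, {0, 1}, {1, 0}, {1, 1}}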

From here, you can extend to more complex examples. For example, a 2-layer perceptron, which is capable of training on the XOR function, can be realized in a multi-layer network:

f = NetChain[{DotPlusLayer[2], ElementwiseLayer[Tanh], 
   DotPlusLayer[1], ElementwiseLayer[Tanh]}, "Input" -> 2]

In this case, we have a hidden layer with two outputs (using the first DotPlusLayer and ElementwiseLayer), followed by an output layer with a single perceptron (the second DotPlusLayer and ElementwiseLayer). Note that the ElementwiseLayer functions automatically adjust to accommodate the width of the previous stage.
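A quick training sketch for XOR (the data, round count, and final check are my own illustrative choices; depending on the random initialization you may need more rounds or a re-run):

trainingDataXor = {{0, 0} -> {0}, {0, 1} -> {1}, {1, 0} -> {1}, {1, 1} -> {0}};
trainedXor = NetTrain[f, trainingDataXor, MaxTrainingRounds -> 2000];
trainedXor /@ {{0, 0}, {0, 1}, {1, 0}, {1, 1}}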

So, this is really a mere warmup exercise, and the other answers to this question have expanded on the power and sophistication of other parts of the neural network capabilities to tackle more complicated problems.