Linear vs nonlinear neural network?

When it comes to nonlinear regression, this refers to how the weights affect the output. If the function is not linear with respect to the weights, then your problem is a nonlinear regression problem. So, for example, let's look at a Feedforward Neural Network with one hidden layer where the activation functions in the hidden layer are some function $\sigma$ and the output layer has linear activation functions. Given this, the mathematical representation can be:

$$y = W_2\,\sigma(W_1 x + b_1) + b_2$$

where we assume $\sigma$ can operate on scalars and vectors with this notation to make it easy. $W_1$, $W_2$, $b_1$, and $b_2$ are the weights you are aiming to estimate with the regression. If this were linear regression, $\sigma(z)$ would equal $z$, because that would make $y$ linearly dependent on $W_1$ and $b_1$. But if $\sigma$ is nonlinear, say $\tanh(z)$, then $y$ is nonlinearly dependent on the weights $W_1$ and $b_1$.
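
To make this concrete, here is a minimal NumPy sketch (the shapes and variable names are my own, purely for illustration) showing that with an identity activation the hidden layer collapses into a single linear map of the inputs, while with tanh the output depends non-linearly on $W_1$ and $b_1$:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))            # 5 samples, 3 inputs

W1 = rng.normal(size=(3, 4))           # hidden-layer weights
b1 = rng.normal(size=4)
W2 = rng.normal(size=(4, 1))           # linear output layer
b2 = rng.normal(size=1)

def forward(x, sigma):
    """One hidden layer with activation sigma, linear output layer."""
    return sigma(x @ W1 + b1) @ W2 + b2

# Identity activation: the whole network is equivalent to one linear regression in x.
y_lin = forward(x, lambda z: z)
print(np.allclose(y_lin, x @ (W1 @ W2) + (b1 @ W2 + b2)))   # True

# tanh activation: no such collapse; y is non-linear in W1 and b1.
y_nonlin = forward(x, np.tanh)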

Now provided you understand all that, I am surprised you haven't seen discussion of the nonlinear case, because that's pretty much all people talk about in textbooks and research. The use of things like stochastic gradient descent, Nonlinear Conjugate Gradient, RProp, and other methods is to help find local minima (and hopefully good local minima) for these nonlinear regression problems, even though a global optimum is not typically guaranteed.
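
As a toy illustration of that optimization view, here is a sketch (the model and data are invented for this example) that fits the nonlinear-in-the-weights model $y \approx a\,\tanh(b x)$ with plain stochastic gradient descent; which minimum you land in depends on the random start, since $(a, b)$ and $(-a, -b)$ give the same function:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
y = 2.0 * np.tanh(0.7 * x) + rng.normal(scale=0.05, size=200)   # synthetic data

a, b = rng.normal(size=2)        # the model a*tanh(b*x) is nonlinear in b
lr = 0.01
for epoch in range(300):
    for i in rng.permutation(len(x)):
        err = a * np.tanh(b * x[i]) - y[i]
        # Gradients of the squared error with respect to a and b:
        grad_a = 2 * err * np.tanh(b * x[i])
        grad_b = 2 * err * a * (1 - np.tanh(b * x[i]) ** 2) * x[i]
        a -= lr * grad_a
        b -= lr * grad_b

print(a, b)   # roughly (2.0, 0.7), or the equivalent minimum (-2.0, -0.7)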


Any non-linearity from the input to output makes the network non-linear. In the way we usually think about and implement neural networks, those non-linearities come from activation functions.

If we are trying to fit non-linear data and only have linear activation functions, our best approximation to the non-linear data will be linear since that's all we can compute. You can see an example of a neural network trying to fit non-linear data with only linear activation functions here.

However, if we change the linear activation function to something non-linear like ReLU, then we can see a better non-linear fitting of the data. You can see that here.
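
A quick way to see the difference without a full training loop (this is only a sketch with fixed random hidden weights, fitting just the output layer by least squares, not a reproduction of the linked demos): on $y = x^2$, an identity activation can only produce a straight line, while ReLU features can bend to follow the curve.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 100).reshape(-1, 1)
y = x ** 2                                    # non-linear target

W1 = rng.normal(size=(1, 20))                 # fixed random hidden weights
b1 = rng.normal(size=20)

def fit_output_layer(h):
    """Fit the linear output layer by ordinary least squares."""
    H = np.hstack([h, np.ones((len(h), 1))])
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    return H @ coef

fit_identity = fit_output_layer(x @ W1 + b1)                  # linear activation
fit_relu = fit_output_layer(np.maximum(0, x @ W1 + b1))       # ReLU activation

print("RMSE with linear activation:", float(np.sqrt(np.mean((fit_identity - y) ** 2))))
print("RMSE with ReLU activation:  ", float(np.sqrt(np.mean((fit_relu - y) ** 2))))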


For starters, a neural network can approximate essentially any function (not just linear functions). Have a look at this: http://neuralnetworksanddeeplearning.com/chap4.html.

A neural network has non-linear activation layers, which are what give it its non-linear element.

The function relating the input to the output is determined by the neural network and the amount of training it gets. If you supply two variables with a linear relationship, then your network will learn this, as long as you don't overfit. Similarly, a complex enough neural network can learn approximately any function.


WARNING: I do not advocate the use of linear activation functions only, especially in simple feed forward architectures.

Okay, I think I need to take some time and rewrite this answer explicitly because many people are misinterpreting the point I am trying to make.

First let me point out that we can talk about linearity in parameters or linearity in the variables.

The activation function is NOT necessarily what makes a neural network non-linear (technically speaking).

For example, notice that the following regression predicted values are considered linear predictions, despite non-linear transformations of the inputs because the output constitutes a linear combination of the parameters (although this model is non-linear in its variables):

$$\hat{y}_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 \tag{1}$$
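
For instance, a model of the form (1) can be estimated with ordinary least squares, a purely linear solver, precisely because it is linear in the $\beta$'s (the data here are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 3.0, size=50)
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + rng.normal(scale=0.1, size=50)

# Non-linear transformations of the input, but linear in the betas,
# so a closed-form linear solve recovers them.
X = np.column_stack([np.ones_like(x), x, x ** 2])
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
print(betas)    # approximately [1.0, 2.0, -0.5]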

Now for simplicity, let us consider a single neuron, single layer neural network:

$$\hat{y}_i = \sigma(w x_i + b) \tag{2}$$

If the transfer function is linear then:

$$\hat{y}_i = w x_i + b \tag{3}$$

As you have already probably noticed, this is a linear regression. Even if we were to add multiple inputs and neurons, each with a linear activation function, we would now only have an ensemble of regressions (all linear in their parameters and therefore this simple neural network is linear):

$$\hat{y}_{i,j} = \sum_k w_{k,j}\, x_{i,k} + b_j \tag{4}$$
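
In other words, a single linear layer with several neurons is just $\hat{Y} = XW + b$, and fitting it is the same as fitting each output column as its own linear regression; a small check with arbitrary random data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 samples, 3 inputs
Y = rng.normal(size=(100, 2))                 # 2 output neurons / targets

Xb = np.hstack([X, np.ones((100, 1))])        # append a bias column

# One linear layer with 2 neurons, fit jointly by least squares.
W_layer, *_ = np.linalg.lstsq(Xb, Y, rcond=None)

# Each output column fit as an independent linear regression gives the same weights.
w0, *_ = np.linalg.lstsq(Xb, Y[:, 0], rcond=None)
w1, *_ = np.linalg.lstsq(Xb, Y[:, 1], rcond=None)
print(np.allclose(W_layer[:, 0], w0), np.allclose(W_layer[:, 1], w1))   # True True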

Now going back to (3), let's add a second layer, so that we have a neural network with two layers, one neuron each (both with linear activation functions):

$$h_i = w_1 x_i + b_1 \tag{5}$$ (first layer)

$$\hat{y}_i = w_2 h_i + b_2 \tag{6}$$ (second layer)

Now notice:

$$\hat{y}_i = w_2 (w_1 x_i + b_1) + b_2$$

Reduces to:

$$\hat{y}_i = w^* x_i + b^*$$

Where $w^* = w_2 w_1$ and $b^* = w_2 b_1 + b_2$

Which means that our two-layered network (each layer with a single neuron) is not linear in its parameters despite every activation function in the network being linear; however, it is still linear in the variables. Thus, once training has finished, the model will be linear in both variables and parameters. Both of these are important because you cannot replicate this simple two-layered network with a single regression and still capture all the effects of the model. Further, let me state clearly: if you use a model with multiple layers there is no guarantee that the output will be non-linear in its variables (if you use a simple MLP with linear activation functions, your picture is still going to be a line).
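
A quick numeric check of that reduction (parameter values chosen arbitrarily):

import numpy as np

w1, b1 = 0.8, -0.3      # first layer (one neuron, linear activation)
w2, b2 = -1.5, 0.6      # second layer (one neuron, linear activation)

x = np.linspace(-2, 2, 5)
y_two_layer = w2 * (w1 * x + b1) + b2

# Still a line in x, but the effective slope is a *product* of parameters.
w_star = w2 * w1
b_star = w2 * b1 + b2
print(np.allclose(y_two_layer, w_star * x + b_star))    # True

The output is still a line in $x$, but the effective slope $w^*$ is a product of parameters, which is exactly the non-linearity in the parameters described above.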

That being said, let's take a look at the following statement from @Pawu regarding this answer:

The answer is very misleading and makes it sound, that we can learn non-linear relationships using only linear transformations, which is simply not true. When we back-propagate, we take the derivative of a single weight w1 and fix everything else. Now as mentioned above, we are still moving on a linear function.

While you could argue that what @Pawu is saying is technically true, I think they are implying:

The answer is very misleading and makes it sound, that we can learn non-linear relationships using only linear activation functions, which is simply not true.

I would argue that this modified statement is wrong and can easily be demonstrated incorrect. There is an implicit assumption being made about the architecture of the model. It is true that if you restrict yourself to certain network architectures you cannot introduce non-linearities without activation functions, but that is an arbitrary restriction and does not generalize to all network models.

Let me make this concrete. First, take the simple XOR problem. This is a basic classification problem where you are attempting to establish a boundary between data points in a configuration like so:

xor problem

The kicker about this problem is that it is not linearly separable, meaning no single straight line will be able to classify the points perfectly. Now if you read anywhere on the internet, I am sure they will say that this problem cannot be solved with a neural network using only linear activation functions (notice nothing is said about the architecture). This statement is only true in an extremely limited context and wrong generally.

Allow me to demonstrate. Below is a very simple hand-written neural network. This network takes randomly generated weights between -1 and 1, an "xor_network" function which defines the architecture (notice: no sigmoid, hardlims, etc., only linear transformations of the form $mX$ or $mX + b$), and trains using standard backward propagation:

#%% Packages
import numpy as np

#%% Data
data = np.array([[0, 0, 0],[0, 1, 1],[1, 0, 1],[1, 1, 0]])
np.random.shuffle(data)
train_data = data[:,:2]
target_data = data[:,2]

#%% XOR architecture
class XOR_class():

    def __init__(self, train_data, target_data, alpha=.1, epochs=10000):
        self.train_data = train_data
        self.target_data = target_data
        self.alpha = alpha
        self.epochs = epochs
        
        #Random weights
        self.W0 = np.random.uniform(low=-1, high=1, size=(2)).T
        self.b0 = np.random.uniform(low=-1, high=1, size=(1))
        self.W2 = np.random.uniform(low=-1, high=1, size=(2)).T
        self.b2 = np.random.uniform(low=-1, high=1, size=(1))

    #xor network (linear transfer functions only; note the multiplicative
    #connection X1 = n0*X0 between the layers -- this is the "trick" discussed below)
    def xor_network(self, X0):
        n0 = np.dot(X0, self.W0) + self.b0
        X1 = n0*X0
        a = np.dot(X1, self.W2) + self.b2
        return a, X1

    #Training the xor network
    def train(self):
        for epoch in range(self.epochs):
            for i in range(len(self.train_data)):
                # Forward Propagation:
                X0 = self.train_data[i]
                a, X1 = self.xor_network(X0)
        
                # Backward Propagation:
                e = self.target_data[i] - a
                s_2 = -2*e
        
                # Update Weights:
                self.W0 = self.W0 - (self.alpha*s_2*X0)
                self.b0 = self.b0 - (self.alpha*s_2)
                self.W2 = self.W2 - (self.alpha*s_2*X1)
                self.b2 = self.b2 - (self.alpha*s_2)
        
        #Restart training if we get lost in the parameter space.
        if np.isnan(a) or (a > 1) or (a < -1):
            print('Bad initialization, reinitializing.')
            self.W0 = np.random.uniform(low=-1, high=1, size=(2)).T
            self.b0 = np.random.uniform(low=-1, high=1, size=(1))
            self.W2 = np.random.uniform(low=-1, high=1, size=(2)).T
            self.b2 = np.random.uniform(low=-1, high=1, size=(1))
            self.train()
    
    #Predicting using the trained weights.
    def predict(self, test_data):
        for i in test_data:
            a, X1 = self.xor_network(i)
            #I cut off decimals past 12 for convenience, not necessary.
            print(f'input: {i} - output: {np.round(a, 12)}')

Now let's take a look at the output:

#%% Execution
xor = XOR_class(train_data, target_data)
xor.train()
np.random.shuffle(data)
test_data = data[:,:2]
xor.predict(test_data)
input: [1 0] - output: [1.]
input: [0 0] - output: [0.]
input: [0 1] - output: [1.]
input: [1 1] - output: [0.]

And what do you know, I guess we can learn non-linear relationships using only linear activation functions and multiple layers (that's right: classification with pure linear activation functions, no sigmoid needed)...

The only catch here is that I cut off all decimals past 12, but let's be honest 7.3 X 10^-16 is basically 0.

Now to be fair I am doing a little trick, where I am using the network connections to get the non-linear result, but that's the whole point I am trying to drive home: THE MAGIC OF NON-LINEARITY FOR NEURAL NETWORKS IS IN THE LAYERS, NOT JUST THE ACTIVATION FUNCTIONS.

Thus the answer to your question, "what makes a neural network non-linear" is: non-linearity in the parameters or, obviously, non-linearity in the variables.

This non-linearity in the parameters/variables comes about in two ways: 1) having more than one layer with neurons in your network (as exhibited above), or 2) having activation functions that result in weight non-linearities.

For an example of non-linearity coming about through activation functions, suppose our input space, weights, and biases are all constrained such that they are all strictly positive (for simplicity). Now using (2) (single layer, single neuron) and the activation function $\sigma(z) = z^2$, we have the following:

$$\hat{y}_i = (w x_i + b)^2$$

Which reduces to:

$$\hat{y}_i = w^* x_i^2 + c^* x_i + b^*$$

Where $w^* = w^2$, $c^* = 2wb$, and $b^* = b^2$

Now, ignoring whatever issues this neural network has, it should be clear that, at the very least, it is non-linear in the parameters and the variables, and that this non-linearity has been introduced solely by the choice of activation function.
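
A quick numeric check of that expansion (arbitrary strictly positive values, matching the positivity assumption above):

import numpy as np

w, b = 0.7, 1.2                      # strictly positive, as assumed above
x = np.linspace(0.1, 3.0, 6)         # strictly positive inputs

y_neuron = (w * x + b) ** 2          # single neuron, squaring activation

# The same predictions written with the substituted parameters:
w_star, c_star, b_star = w ** 2, 2 * w * b, b ** 2
print(np.allclose(y_neuron, w_star * x ** 2 + c_star * x + b_star))   # True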

Finally, yes, neural networks can model complex data structures that cannot be modeled using linear models (see the XOR example above).

EDIT:

As pointed out by @hH1sG0n3, non-linearity in the parameters does not follow directly from many common activation functions (e.g. sigmoid). This is not to say that common activation functions do not make neural networks nonlinear (they are non-linear in the variables), but that the non-linearity introduced by them is degenerate without parameter non-linearity. For example, a single-layer MLP with sigmoid activation functions will produce outputs that are non-linear in the variables, in that the output is not proportional to the input, but in reality this is just an array of Generalized Linear Models. This should be especially obvious if we were to transform the targets by the appropriate link function, where now the activation functions would be linear.

Now this is not to say that activation functions don't play an important role in the non-linearity of neural networks (clearly they do), but that their role is more to alter/expand the solution space. Said differently, non-linearities in the parameters (usually expressed through many layers/connections) are necessary for non-degenerate solutions that go beyond regression. When we have a model with non-linearity in the parameters, we have a whole different beast than regression.
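
To see the link-function point concretely: passing a single sigmoid neuron's output through the logit link recovers the underlying linear predictor, which is exactly the GLM structure (a minimal sketch with made-up weights):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))
w = rng.normal(size=3)
b = 0.4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

y_hat = sigmoid(x @ w + b)           # single layer, sigmoid activation

# The logit link undoes the activation and leaves something linear in x, w, and b.
print(np.allclose(logit(y_hat), x @ w + b))   # True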

At the end of the day all I want to do with this post is point out that the "magic" of neural networks is also in the layers and to dispel the ubiquitous myth that a multilayered neural network with linear activation functions is always just a bunch of linear regressions.