XOR not learned using keras v2.0

I think it's a "local minimum" in the loss function.

Why?

I have run this same code over and over a few times; sometimes it converges to the right result, and sometimes it gets stuck at a wrong one. Notice that this code recreates the model every time I run it. (If I keep training a model that has already reached the wrong result, it simply stays there forever.)

from keras.models import Sequential
from keras.layers import Dense, Activation
import numpy as np

m = Sequential()
m.add(Dense(2, input_dim=2, activation='tanh'))
#m.add(Activation('tanh'))

m.add(Dense(1, activation='sigmoid'))
#m.add(Activation('sigmoid'))

# XOR inputs and targets
X = np.array([[0,0],[0,1],[1,0],[1,1]], 'float32')
Y = np.array([[0],[1],[1],[0]], 'float32')

m.compile(optimizer='adam', loss='binary_crossentropy')
m.fit(X, Y, batch_size=1, epochs=20000, verbose=0)
print(m.predict(X))

Running this code, I have found some different outputs:

  • Wrong: [[ 0.00392423], [ 0.99576807], [ 0.50008368], [ 0.50008368]]
  • Right: [[ 0.08072935], [ 0.95266515], [ 0.95266813], [ 0.09427474]]

What conclusion can we draw from this?

The optimizer is not dealing properly with this local minimum. If it gets lucky (with a proper weight initialization), it will fall into a good minimum and produce the right results.

If it gets unlucky (with a bad weight initialization), it will fall into a local minimum without really knowing that there are better places in the loss surface, and its learning rate is simply not big enough to escape this minimum. The small gradient keeps it circling around the same point.

If you take the time to study the gradients that appear in the wrong case, you will probably see that they keep pointing towards that same point, and increasing the learning rate a little may be enough to escape the hole.
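As a quick experiment of my own (not something from the question), you could compile the same model with an explicitly larger learning rate for Adam and check whether the stuck runs become less frequent. The lr=0.1 below is an arbitrary illustration, not a tuned value:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import numpy as np

X = np.array([[0,0],[0,1],[1,0],[1,1]], 'float32')
Y = np.array([[0],[1],[1],[0]], 'float32')

m = Sequential()
m.add(Dense(2, input_dim=2, activation='tanh'))
m.add(Dense(1, activation='sigmoid'))

# Same architecture, but Adam with a larger step size than its default 0.001.
# lr=0.1 is just an illustrative guess; tune it if you try this.
m.compile(optimizer=Adam(lr=0.1), loss='binary_crossentropy')
m.fit(X, Y, batch_size=1, epochs=2000, verbose=0)
print(m.predict(X))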

Intuition makes me think that such very small models have more prominent local minima.
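One way to check that intuition (again my own sketch, not part of the question) is to rebuild and retrain the model several times and count how often the predictions come out wrong, along these lines:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.array([[0,0],[0,1],[1,0],[1,1]], 'float32')
Y = np.array([[0],[1],[1],[0]], 'float32')

stuck = 0
runs = 10  # small number just to keep the experiment short
for i in range(runs):
    # Fresh model each run, so each run gets a new random initialization.
    m = Sequential()
    m.add(Dense(2, input_dim=2, activation='tanh'))
    m.add(Dense(1, activation='sigmoid'))
    m.compile(optimizer='adam', loss='binary_crossentropy')
    m.fit(X, Y, batch_size=1, epochs=5000, verbose=0)
    # Count a run as "stuck" if any prediction ends up on the wrong side of 0.5.
    if np.any((m.predict(X) > 0.5).astype('float32') != Y):
        stuck += 1

print('stuck in a bad minimum in %d of %d runs' % (stuck, runs))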


Instead of just increasing the number of epochs, try using relu for the activation of your hidden layer instead of tanh. Making only that change to the code you provide, I am able to obtain the following result after only 2000 epochs (Theano backend):

import numpy as np
print(np.__version__)     # 1.11.3
import theano
print(theano.__version__) # 0.9.0
import keras
print(keras.__version__)  # 2.0.2

from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import Adam, SGD

np.random.seed(100)

model = Sequential()
model.add(Dense(units = 2, input_dim=2, activation = 'relu'))
model.add(Dense(units = 1, activation = 'sigmoid'))
X = np.array([[0,0],[0,1],[1,0],[1,1]], "float32")
y = np.array([[0],[1],[1],[0]], "float32")
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X, y, epochs=2000, batch_size=1,verbose=0)
print(model.evaluate(X,y))
print(model.predict_classes(X))
4/4 [==============================] - 0s
0.118175707757
4/4 [==============================] - 0s
[[0]
[1]
[1]
[0]]

It would be easy to conclude that this is due to the vanishing gradient problem. However, the simplicity of this network suggests that this isn't the case. Indeed, if I change the optimizer from 'adam' to SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False) (the default values), I can see the following result after 5000 epochs with tanh activation in the hidden layer.

import numpy as np

from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import Adam, SGD

np.random.seed(100)

model = Sequential()
model.add(Dense(units = 2, input_dim=2, activation = 'tanh'))
model.add(Dense(units = 1, activation = 'sigmoid'))
X = np.array([[0,0],[0,1],[1,0],[1,1]], "float32")
y = np.array([[0],[1],[1],[0]], "float32")
model.compile(loss='binary_crossentropy', optimizer=SGD())
model.fit(X, y, epochs=5000, batch_size=1,verbose=0)

print(model.evaluate(X,y))
print(model.predict_classes(X))
4/4 [==============================] - 0s
0.0314897596836
4/4 [==============================] - 0s
[[0]
 [1]
 [1]
 [0]]

Edit: 5/17/17 - Included complete code to enable reproduction


I cannot add a comment to Daniel's response as I don't have enough reputation, but I believe he's on the right track. While I have not personally tried running XOR with Keras, here's an article that might be interesting: it analyzes the various regions of local minima for a 2-2-1 network, showing that higher numerical precision leads to fewer instances of gradient descent getting stuck.

The Local Minima of the Error Surface of the 2-2-1 XOR Network (Ida G. Sprinkhuizen-Kuyper and Egbert J.W. Boers)
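If you want to experiment with the precision angle yourself, here is a minimal sketch of my own (not something from the paper) that switches the Keras backend to 64-bit floats before building the model:

import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

# Ask the backend to use 64-bit floats instead of the default float32.
K.set_floatx('float64')

X = np.array([[0,0],[0,1],[1,0],[1,1]], 'float64')
Y = np.array([[0],[1],[1],[0]], 'float64')

model = Sequential()
model.add(Dense(2, input_dim=2, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X, Y, epochs=5000, batch_size=1, verbose=0)
print(model.predict(X))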

On a side note, I wouldn't consider using a 2-4-1 network to be over-fitting the problem. Having 4 linear cuts on the 0-1 plane (cutting it into a 2x2 grid) instead of 2 cuts (cutting the corners off diagonally) just separates the data in a different way; since we only have 4 data points and no noise in the data, a neural network that uses 4 linear cuts isn't describing "noise" instead of the XOR relationship. A sketch of such a 2-4-1 network follows below.
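For reference, here is a minimal sketch of that 2-4-1 variant in Keras (my own illustration; the 4-unit hidden layer is the only change from the earlier code):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.array([[0,0],[0,1],[1,0],[1,1]], 'float32')
y = np.array([[0],[1],[1],[0]], 'float32')

model = Sequential()
model.add(Dense(4, input_dim=2, activation='tanh'))  # 4 hidden units -> 4 linear cuts
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X, y, epochs=2000, batch_size=1, verbose=0)
print(model.predict(X))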