How to implement the ReLU function in NumPy

You can do it in a much easier way:

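def ReLU(x):
    # (x > 0) is a boolean mask; multiplying by it casts True/False to 1/0
    return x * (x > 0)

def dReLU(x):
    # Derivative of ReLU: 1.0 where x > 0, 0.0 elsewhere (including x == 0)
    return 1. * (x > 0)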
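A quick usage sketch of the two functions above, assuming NumPy float arrays as input. The `backward` helper is hypothetical: it only illustrates how `dReLU` is typically used as a mask on an upstream gradient during backpropagation (chain rule through the ReLU).

```python
import numpy as np


def ReLU(x):
    # Boolean mask (x > 0) is cast to 0/1 when multiplied with x
    return x * (x > 0)


def dReLU(x):
    # Derivative: 1.0 where x > 0, 0.0 elsewhere (including x == 0)
    return 1. * (x > 0)


def backward(upstream_grad, x):
    # Hypothetical helper: the gradient passes through only where x > 0
    return upstream_grad * dReLU(x)


x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(ReLU(x))
print(dReLU(x))
print(backward(np.ones_like(x), x))
```

Note that this treats the derivative at exactly 0 as 0, which is a convention; ReLU is not differentiable there.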

I'm completely revising my original answer because of points raised in the other answers and comments. Here is the new benchmark script:

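import time
import numpy as np


def fancy_index_relu(m):
    # In place: set every negative entry of m to 0
    m[m < 0] = 0


relus = {
    "max": lambda x: np.maximum(x, 0),
    # out=x writes the result back into x instead of allocating a new array
    "in-place max": lambda x: np.maximum(x, 0, out=x),
    "mul": lambda x: x * (x > 0),
    "abs": lambda x: (abs(x) + x) / 2,
    "fancy index": fancy_index_relu,
}

for name, relu in relus.items():
    n_iter = 20
    # Fresh stack of 20 arrays (about 4 GB of float64) per implementation,
    # so the in-place variants never see data an earlier call already rectified
    x = np.random.random((n_iter, 5000, 5000)) - 0.5

    t1 = time.perf_counter()
    for i in range(n_iter):
        relu(x[i])
    t2 = time.perf_counter()

    # Average wall-clock time per call, in milliseconds
    print("{:>12s}  {:3.0f} ms".format(name, (t2 - t1) / n_iter * 1000))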

It takes care to use a different ndarray for each implementation and iteration, because the in-place variants overwrite their input. Here are the results:

         max  126 ms
in-place max  107 ms
         mul  136 ms
         abs   86 ms
 fancy index  132 ms
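The reason for a fresh array per call is that two of the variants write into their input. A small sketch, using a hypothetical 2×3 array, shows that `np.maximum(x, 0, out=x)` and the fancy-index assignment both modify the array itself, while `np.maximum(x, 0)` returns a new array and leaves its input untouched. Timing an in-place variant repeatedly on the same array would therefore measure work on data that is already non-negative.

```python
import numpy as np

original = np.array([[-1.0, 2.0, -3.0],
                     [4.0, -5.0, 6.0]])

# Out-of-place: returns a new array, the input keeps its negative values
a = original.copy()
out = np.maximum(a, 0)
print(np.shares_memory(out, a), a.min())

# In-place max: the ufunc writes into `b` and returns that same object
b = original.copy()
res = np.maximum(b, 0, out=b)
print(res is b, b.min())

# Fancy-index assignment: mutates `c` directly and returns nothing
c = original.copy()
c[c < 0] = 0
print(c.min())
```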

There are a couple of ways.

>>> x = np.random.random((3, 2)) - 0.5
>>> x
array([[-0.00590765,  0.18932873],
       [-0.32396051,  0.25586596],
       [ 0.22358098,  0.02217555]])
>>> np.maximum(x, 0)
array([[ 0.        ,  0.18932873],
       [ 0.        ,  0.25586596],
       [ 0.22358098,  0.02217555]])
>>> x * (x > 0)
array([[-0.        ,  0.18932873],
       [-0.        ,  0.25586596],
       [ 0.22358098,  0.02217555]])
>>> (abs(x) + x) / 2
array([[ 0.        ,  0.18932873],
       [ 0.        ,  0.25586596],
       [ 0.22358098,  0.02217555]])
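Note that the multiplication output above shows `-0.` where the input was negative: under IEEE 754, a negative number times `0.0` (the cast `False`) is negative zero. It compares equal to `0.0`, so it rarely matters, but it shows up when printing and in sign-sensitive functions such as `np.signbit`. A sketch, assuming a small fixed input, that checks the three methods agree numerically and shows where the signed zero appears:

```python
import numpy as np

x = np.array([-0.3, 0.0, 0.2])

methods = {
    "max": np.maximum(x, 0),
    "mul": x * (x > 0),
    "abs": (abs(x) + x) / 2,
}

for name, y in methods.items():
    # == treats -0.0 and 0.0 as equal, so all three agree numerically,
    # but signbit reveals the negative zero produced by "mul"
    print(name, np.array_equal(y, methods["max"]), np.signbit(y))

# Adding 0.0 turns -0.0 into +0.0 (IEEE 754, round-to-nearest)
cleaned = x * (x > 0) + 0.0
```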

Timing the three methods with the following code (run in IPython or Jupyter, since %timeit is an IPython magic command):

import numpy as np

x = np.random.random((5000, 5000)) - 0.5
print("max method:")
%timeit -n10 np.maximum(x, 0)

print("multiplication method:")
%timeit -n10 x * (x > 0)

print("abs method:")
%timeit -n10 (abs(x) + x) / 2

We get:

max method:
10 loops, best of 3: 239 ms per loop
multiplication method:
10 loops, best of 3: 145 ms per loop
abs method:
10 loops, best of 3: 288 ms per loop

So in this run, multiplication is the fastest of the three methods.
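For timing outside a notebook, a rough plain-Python equivalent can be sketched with the standard-library `timeit` module. The function name `time_relu_methods` and its defaults are hypothetical, and the figures it reports will vary with hardware and NumPy version, so they need not match the numbers above. All three methods are out-of-place, so reusing one input array across calls is safe here.

```python
import timeit

import numpy as np


def time_relu_methods(size=5000, number=10, repeat=3, seed=0):
    """Best-of-`repeat` time per call, in seconds, for each ReLU variant."""
    rng = np.random.default_rng(seed)
    x = rng.random((size, size)) - 0.5

    methods = {
        "max": lambda: np.maximum(x, 0),
        "multiplication": lambda: x * (x > 0),
        "abs": lambda: (abs(x) + x) / 2,
    }

    results = {}
    for name, fn in methods.items():
        # timeit.repeat returns the total time of `number` calls, once per repeat
        totals = timeit.repeat(fn, number=number, repeat=repeat)
        # Like %timeit's "best of 3", keep the fastest repeat, per single call
        results[name] = min(totals) / number
    return results
```

Calling `time_relu_methods()` with the defaults mirrors the 5000×5000, 10 loops, best of 3 setup above; a smaller `size` gives a quick sanity check.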