softmax and sigmoid functions for the output layer

softmax() is the right choice when you want a probability distribution over the classes, i.e. outputs that sum to 1. sigmoid() is used when you want each output to lie between 0 and 1 independently; the outputs need not sum to 1.
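For concreteness, here is a minimal NumPy sketch of both properties (the logit values are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    # Each element is squashed into (0, 1) independently of the others.
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw network outputs

print(softmax(logits), softmax(logits).sum())  # [0.659 0.242 0.099], sums to 1.0
print(sigmoid(logits), sigmoid(logits).sum())  # [0.881 0.731 0.525], sum ~ 2.14 (not 1)
```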

In your case, you wish to classify between two alternatives. I would recommend softmax(), as it gives you a probability distribution to which you can apply a cross-entropy loss.
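As a sketch of that recipe (NumPy, with made-up logits and a one-hot target; a real framework's cross-entropy loss would fuse the softmax and the log for numerical stability):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([1.5, -0.5])   # hypothetical outputs of a 2-unit output layer
target = np.array([1.0, 0.0])    # one-hot label: the first alternative is correct

probs = softmax(logits)                  # probability distribution over the two classes
loss = -np.sum(target * np.log(probs))  # cross-entropy loss
print(probs, loss)                       # [0.881 0.119], loss ~ 0.127
```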


The sigmoid and softmax functions serve different purposes. For a detailed explanation of when to use sigmoid vs. softmax in neural network design, see the article "Classification: Sigmoid vs. Softmax."

Short summary:

If you have a multi-label classification problem where there is more than one "right answer" (the outputs are NOT mutually exclusive), apply a sigmoid function to each raw output independently. The sigmoid allows high probability for all of your classes, some of them, or none of them.
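A minimal sketch of the multi-label case (NumPy, with made-up logits and a hypothetical 0.5 decision threshold):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw outputs for three independent labels, e.g. "cat", "dog", "outdoor"
logits = np.array([3.0, 1.2, -2.0])
probs = sigmoid(logits)   # each in (0, 1), computed independently
predicted = probs > 0.5   # any subset of labels can be "on" at once
print(probs)      # [0.953 0.769 0.119]
print(predicted)  # [ True  True False] -- two labels active simultaneously
```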

If you instead have a multi-class classification problem where there is only one "right answer" (the outputs are mutually exclusive), then use a softmax function. The softmax enforces that the sum of the probabilities of your output classes is equal to one, so in order to increase the probability of a particular class, your model must correspondingly decrease the probability of at least one of the other classes.
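That trade-off is easy to see numerically: raising one logit pulls probability mass away from the others (again a sketch with made-up values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([1.0, 1.0, 1.0])))  # [0.333 0.333 0.333]
print(softmax(np.array([3.0, 1.0, 1.0])))  # [0.787 0.107 0.107] -- class 0 up, others down
```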