What are C classes for a NLLLoss loss function in Pytorch?

I agree with you that the documentation for nn.NLLLoss() is far from ideal, but I think we can clarify your problem here by first noting that "class" is often used as a synonym for "category" in a Machine Learning context.

Therefore, when PyTorch talks about C classes, it is actually referring to the number of distinct categories that you are trying to train your network on. So, in the classical example of a network trying to classify between "cats" and "dogs", C = 2, since an image is either a cat or a dog.

Specifically for this classification problem, each sample has exactly one true category (a picture cannot depict both a cat AND a dog, only ever one of them), which is why we can conveniently indicate the corresponding category of an image by its index (say 0 indicates a cat and 1 a dog). Now we can simply compare the network output to the category we want.

BUT, in order for this to work, we also need to be clear which network outputs these loss values refer to, since our network will generally make predictions via a softmax over several output neurons, meaning that we generally have more than a single value. Fortunately, PyTorch's nn.NLLLoss does this automatically for you: it picks out the (log-)probability the network assigned to the target index.
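
To make that concrete, here is a small sketch (the values and class names are made up) of what nn.NLLLoss does with a log-softmax output and a class-index target:

import torch
import torch.nn as nn

# 1 sample, C = 2 classes (say 0 = cat, 1 = dog); the raw scores are arbitrary
log_probs = nn.LogSoftmax(dim=1)(torch.tensor([[2.0, 0.5]]))  # shape (1, 2)
target = torch.tensor([0])                                    # this sample is a cat

loss = nn.NLLLoss()(log_probs, target)
print(loss, -log_probs[0, 0])  # NLLLoss just negates the log-probability at the target index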

Your above example with the LogSoftmax in fact only produces a single output value, which is a critical case for this example. This way, you basically only have an indication of whether or not something exists, which doesn't make much sense in a classification example (it would fit a regression case better, but that would require a totally different loss function to begin with).

Last, but not least, you should also consider that we generally have 2D tensors as input, since batching (the simultaneous computation of multiple samples) is generally considered a necessary step for performance. Even if you choose a batch size of 1, your inputs still have to be of shape (batch_size, input_dimensions), and consequently your output tensors of shape (batch_size, number_of_categories).

This explains why most of the examples you find online perform the LogSoftmax() over dim=1, since this is the "in-distribution axis" (the class axis), and not the batch axis (which would be dim=0).
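
To illustrate the shapes (the sizes here are arbitrary, just a sketch):

import torch
import torch.nn as nn

batch_size, num_classes = 4, 5           # arbitrary sizes for illustration
x = torch.randn(batch_size, num_classes) # (batch_size, number_of_categories)

log_probs = nn.LogSoftmax(dim=1)(x)      # normalize over the class axis (dim=1), not the batch axis
print(log_probs.exp().sum(dim=1))        # each sample gets its own distribution: every row sums to 1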

If you simply want to fix your problem, the easiest way would be to extend your random tensor by an additional dimension (torch.randn([1, 5], requires_grad=True)), and then to compare against just one class index in your target tensor (print(loss(output, torch.tensor([1])))).
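
Put together, a minimal fixed version of your snippet (as far as I can reconstruct it) would look like this:

import torch
import torch.nn as nn

m = nn.LogSoftmax(dim=1)                         # over the class axis
loss = nn.NLLLoss()

input = torch.randn([1, 5], requires_grad=True)  # (batch_size=1, C=5)
output = m(input)

print(loss(output, torch.tensor([1])))           # one class index per sample in the batch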


Basically you are missing a concept of batch.

Long story short, every input to the loss (and the one passed through the network) requires a batch dimension (i.e. how many samples are used).

Breaking it up, step by step:

Your example vs documentation

Each step will be compared side by side to make it clearer (documentation on top, your example below).

Inputs

input = torch.randn(3, 5, requires_grad=True)
input = torch.randn(5, requires_grad=True)

In the first case (docs), an input with 5 features is created and 3 samples are used. In your case there is only the batch dimension (5 samples); you have no features, which are required. If you meant to have one sample with 5 features you should do:

input = torch.randn(1, 5, requires_grad=True)
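
Alternatively (just a common idiom, not the only way), you can add the missing batch dimension to an existing 1D tensor:

input = torch.randn(5, requires_grad=True).unsqueeze(0)  # shape (1, 5): one sample, 5 features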

LogSoftmax

LogSoftmax is done across the features dimension; you are doing it across the batch dimension.

m = nn.LogSoftmax(dim=1) # apply over features
m = nn.LogSoftmax(dim=0) # apply over batch

The latter usually makes no sense for this operation, as samples are independent of each other.
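
A quick sketch of the difference (sizes match the documentation example):

import torch
import torch.nn as nn

x = torch.randn(3, 5)                            # 3 samples, 5 features

print(nn.LogSoftmax(dim=1)(x).exp().sum(dim=1))  # over features: each row (sample) sums to 1
print(nn.LogSoftmax(dim=0)(x).exp().sum(dim=0))  # over batch: each column sums to 1, mixing samples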

Targets

As this is multiclass classification and each element in the target vector represents one sample, you can pass as many numbers as you want (as long as each is smaller than the number of classes; in the documentation example that is 5, hence [0-4] is fine).

train = torch.tensor([1, 0, 4])
train = torch.tensor([1, 0, 0])

I assume you wanted to pass a one-hot vector as the target as well. PyTorch doesn't work that way, as it's memory inefficient (why store everything one-hot encoded when you can just pinpoint the class exactly; in your case it would be 0).

Only the outputs of the neural network are one-hot encoded in order to backpropagate the error through all output nodes; it's not needed for the targets.
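
If your labels happen to be stored one-hot, a simple way (one option among others) to get the index targets PyTorch expects is argmax:

import torch

one_hot = torch.tensor([[0, 1, 0, 0, 0],
                        [1, 0, 0, 0, 0],
                        [1, 0, 0, 0, 0]])
train = one_hot.argmax(dim=1)  # tensor([1, 0, 0]) -- the class indices NLLLoss expects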

Final

You shouldn't use torch.nn.LogSoftmax at all for this task. Just use torch.nn.Linear as the last layer and torch.nn.CrossEntropyLoss with your targets.
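
For example, a minimal sketch (layer sizes are arbitrary here; torch.nn.CrossEntropyLoss combines LogSoftmax and NLLLoss internally):

import torch
import torch.nn as nn

model = nn.Linear(10, 5)              # 10 input features, 5 classes; no LogSoftmax at the end
criterion = nn.CrossEntropyLoss()     # applies LogSoftmax + NLLLoss internally

input = torch.randn(3, 10)            # 3 samples
target = torch.tensor([1, 0, 4])      # class indices, same format as above
loss = criterion(model(input), target)
loss.backward()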