What does DeepLab's --train_crop_size actually do?

Yes, in your case the images are cropped during training. Cropping enables a larger batch size within the memory limits of your system. With a larger batch size, each optimization (= training) step is based on multiple instances instead of only one (or very few), which usually leads to more stable gradients and better results. Normally a random crop is used so that, over the course of training, the network sees all parts of every image.
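To make the idea concrete, here is a minimal sketch of a random crop for segmentation data. This is not DeepLab's actual preprocessing code; the function name and the assumption of a 3-channel image with a single-channel label map are mine. The key point it illustrates is that image and label must be cropped at the *same* random offset:

```python
import tensorflow as tf

def random_crop_pair(image, label, crop_size=(513, 513)):
    """Crop image and label at the same random offset.

    `crop_size` plays the role of DeepLab's --train_crop_size;
    513x513 is just an illustrative value.
    """
    # Concatenate along the channel axis so both tensors
    # receive the identical crop window.
    label = tf.cast(label, image.dtype)
    combined = tf.concat([image, label], axis=-1)
    cropped = tf.image.random_crop(
        combined, size=(crop_size[0], crop_size[1], 4))  # 3 image + 1 label channel
    return cropped[..., :3], tf.cast(cropped[..., 3:], tf.int32)
```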

Training or deploying a "fully convolutional" CNN does not require a fixed input size. The dimensionality reduction (caused by striding or pooling) is typically a factor of 2^n, with padding applied at the input edges where needed. Example: your encoder reduces each spatial dimension by a factor of 2^4 before the decoder upsamples it again. So you only have to make sure that your input dimensions are a multiple of 2^4; the exact input size does not matter, it merely defines the spatial dimensions of the hidden layers during training. In the case of DeepLab, the framework automatically adapts the given input dimensions to the required multiple of 2^x to make it even easier for you to use.
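DeepLab performs this adjustment internally; the helper below only illustrates the arithmetic (the function name is mine, not part of DeepLab). For an encoder that halves each spatial dimension four times, the factor is 2^4 = 16:

```python
import math

def pad_to_multiple(height, width, factor=16):
    """Round spatial dimensions up to the nearest multiple of `factor`.

    factor = 2**4 matches an encoder that halves each spatial
    dimension four times before the decoder upsamples again.
    """
    new_h = math.ceil(height / factor) * factor
    new_w = math.ceil(width / factor) * factor
    return new_h, new_w

# e.g. a 500x375 image would be padded to 512x384
# before being fed through the network.
print(pad_to_multiple(500, 375))  # (512, 384)
```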

Evaluation instances should never be randomly cropped, since only a deterministic evaluation process guarantees meaningful (and reproducible) evaluation results. During evaluation there is no optimization step, so a batch size of one is fine.
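A deterministic alternative for evaluation is a fixed, centered pad/crop. Again a sketch rather than DeepLab's own code; the name `eval_preprocess` and the 513x513 size (which would correspond to something like --eval_crop_size) are assumptions:

```python
import tensorflow as tf

def eval_preprocess(image, eval_size=(513, 513)):
    """Deterministic padding/cropping for evaluation.

    Unlike the random crop used during training, this always
    takes the centered window, so repeated evaluations of the
    same image produce identical results.
    """
    return tf.image.resize_with_crop_or_pad(
        image, eval_size[0], eval_size[1])
```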