What is the mAP metric and how is it calculated?

For detection, a common way to determine whether an object proposal was right is Intersection over Union (IoU, IU). This takes the set A of proposed object pixels and the set B of true object pixels and calculates:

$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
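As a minimal sketch of the IoU calculation on pixel sets, assuming pixels are represented as `(row, column)` tuples (the function name `iou` and the empty-set convention are illustrative choices, not taken from any library):

```python
def iou(proposed, truth):
    """Intersection over Union of two collections of pixel coordinates."""
    proposed, truth = set(proposed), set(truth)
    union = proposed | truth
    if not union:
        # Assumption: two empty sets are scored as 0.0 to avoid dividing by zero.
        return 0.0
    return len(proposed & truth) / len(union)


# Two 2x2 squares that share one column of two pixels.
A = {(0, 0), (0, 1), (1, 0), (1, 1)}
B = {(0, 1), (0, 2), (1, 1), (1, 2)}

# 2 shared pixels out of 6 distinct pixels gives 1/3, so this is not a hit at 0.5.
print(iou(A, B))
print(iou(A, B) > 0.5)
```

For bounding boxes the same ratio is usually computed from box areas instead of explicit pixel sets.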

Commonly, IoU > 0.5 means that the proposal was a hit; otherwise it was a miss. For each class, one can calculate the

  • True Positive TP(c): a proposal was made for class c and there actually was an object of class c
  • False Positive FP(c): a proposal was made for class c, but there is no object of class c
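  • Average Precision for class c:

$$\frac{\#TP(c)}{\#TP(c) + \#FP(c)}$$

The mAP (mean average precision) is then:

$$\mathrm{mAP} = \frac{1}{|\text{classes}|} \sum_{c \in \text{classes}} \frac{\#TP(c)}{\#TP(c) + \#FP(c)}$$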

Note: If one wants better proposals, one increases the IoU threshold from 0.5 to a higher value (up to 1.0, which would require a perfect match). One can denote this with mAP@p, where p \in (0, 1) is the IoU threshold.

mAP@[.5:.95] means that the mAP is calculated at multiple IoU thresholds (from 0.5 to 0.95 in steps of 0.05, as in the COCO evaluation) and then averaged again.
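The averaging over thresholds can be sketched as follows, assuming the COCO convention of ten IoU thresholds from 0.5 to 0.95 in steps of 0.05. The names `coco_style_thresholds`, `averaged_map`, and `toy_map_at` are hypothetical; `toy_map_at` is only a stand-in for a real evaluation that returns mAP at one IoU threshold.

```python
def coco_style_thresholds(start=0.5, stop=0.95, step=0.05):
    """IoU thresholds start, start + step, ..., stop (rounded to 2 decimals)."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def averaged_map(map_at, thresholds):
    """Average the value of map_at(threshold) over all thresholds."""
    return sum(map_at(t) for t in thresholds) / len(thresholds)


def toy_map_at(threshold):
    # Stand-in only: a real implementation would evaluate the detector at this
    # IoU threshold. Here mAP simply decreases as the threshold gets stricter.
    return 1.0 - threshold


thresholds = coco_style_thresholds()
print(thresholds)
print(averaged_map(toy_map_at, thresholds))
```

A stricter threshold can only turn hits into misses, so the per-threshold values normally decrease as the threshold increases, and the average rewards detectors that localize well.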

Edit: For more detailed information, see the COCO evaluation metrics.


mAP is Mean Average Precision.

Its use differs between the field of Information Retrieval (references [1], [2]) and multi-class classification (object detection) settings.

To calculate it for Object Detection, you calculate the average precision for each class in your data based on your model predictions. Average precision is related to the area under the precision-recall curve for a class. Taking the mean of these per-class average precisions then gives you the Mean Average Precision.

To calculate Average Precision, see [3].


Quotes are from the above mentioned Zisserman paper - 4.2 Evaluation of Results (Page 11):

First, an "overlap criterion" is defined as an intersection-over-union greater than 0.5 (i.e. if a predicted box satisfies this criterion with respect to a ground-truth box, it is considered a detection). Then a matching is made between the GT boxes and the predicted boxes using this "greedy" approach:

Detections output by a method were assigned to ground truth objects satisfying the overlap criterion in order ranked by the (decreasing) confidence output. Multiple detections of the same object in an image were considered false detections e.g. 5 detections of a single object counted as 1 correct detection and 4 false detections
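The greedy matching quoted above can be sketched for a single image and a single class as follows. This is an illustration under assumptions, not the official evaluation code: boxes are `(x1, y1, x2, y2)` tuples, each detection is compared with the ground-truth box it overlaps most, and the names `box_iou` and `match_detections` are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, gt_boxes, iou_threshold=0.5):
    """Label each (confidence, box) detection as true or false positive.

    Returns a list of (confidence, is_true_positive), sorted by decreasing
    confidence. Each ground-truth box can be claimed by one detection only.
    """
    matched = [False] * len(gt_boxes)
    results = []
    for conf, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Find the ground-truth box with the largest overlap.
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(gt_boxes):
            overlap = box_iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_iou > iou_threshold and not matched[best_idx]:
            matched[best_idx] = True
            results.append((conf, True))
        else:
            # Either no sufficient overlap, or a duplicate of a claimed object.
            results.append((conf, False))
    return results


gt = [(0, 0, 10, 10)]
dets = [(0.8, (1, 1, 10, 10)), (0.9, (0, 0, 10, 10)), (0.7, (20, 20, 30, 30))]
print(match_detections(dets, gt))
```

In the example, the 0.8 detection overlaps the object well (IoU 0.81) but is still a false positive, because the 0.9 detection has already claimed that object.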

Hence each predicted box is either a True-Positive or a False-Positive. Each ground-truth box is either matched by a True-Positive or missed (a False-Negative). There are no True-Negatives.

Then the average precision is computed by averaging the precision values on the precision-recall curve at the recall levels 0, 0.1, ..., 1 (i.e. the average of 11 precision values). To be more precise, we consider a slightly corrected PR curve, where for each curve point (p, r), if there is a different curve point (p', r') such that p' > p and r' >= r, we replace p with the maximum p' of those points.
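A minimal sketch of this 11-point computation, assuming the detections have already been labelled as true or false positives and sorted by decreasing confidence. The function name `eleven_point_ap` is hypothetical, and recall levels that are never reached are given a precision of 0, which is the convention this answer arrives at in its edit.

```python
def eleven_point_ap(is_tp_sorted, num_gt):
    """11-point interpolated average precision.

    is_tp_sorted: booleans for the detections in decreasing-confidence order.
    num_gt: number of ground-truth objects of this class.
    """
    tp = fp = 0
    points = []  # (recall, precision) after each detection
    for hit in is_tp_sorted:
        if hit:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_gt, tp / (tp + fp)))

    levels = [i / 10 for i in range(11)]  # 0, 0.1, ..., 1
    total = 0.0
    for level in levels:
        # Corrected curve: best precision among points with recall >= level.
        candidates = [p for r, p in points if r >= level]
        total += max(candidates) if candidates else 0.0  # unreachable recall
    return total / len(levels)


# 3 detections (hit, miss, hit) for a class with 4 ground-truth objects.
print(eleven_point_ap([True, False, True], num_gt=4))
```

In the example, recall levels 0 to 0.2 get precision 1, levels 0.3 to 0.5 get precision 2/3, and levels 0.6 to 1 are never reached, so the result is (3 + 2) / 11 = 5/11.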

What is still unclear to me is what is done with those GT boxes that are never detected (even when the confidence threshold is 0). This means that there are certain recall values that the precision-recall curve will never reach, which leaves the average precision computation above undefined.

Edit:

Short answer: in the region where the recall is unreachable, the precision drops to 0.

One way to explain this is to assume that when the threshold for the confidence approaches 0, an infinite number of predicted bounding boxes light up all over the image. The precision then immediately goes to 0 (since there is only a finite number of GT boxes) and the recall keeps growing on this flat curve until we reach 100%.
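Written out as a formula, this convention might look as follows. This is a sketch: the symbol p(r') for the precision at a reached recall r' and the name p_interp are notation introduced here.

```latex
\mathrm{AP} = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{\text{interp}}(r),
\qquad
p_{\text{interp}}(r) =
\begin{cases}
\max_{r' \ge r} p(r') & \text{if some recall } r' \ge r \text{ is reached} \\
0 & \text{otherwise}
\end{cases}
```

Every ground-truth box that is never detected therefore lowers the average precision through the zero terms at the high recall levels.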