Normalizing grades across multiple graders

I am not entirely sure whether this is an "answer" or a "comment", but I'll supply it as an answer.

First off, I'm answering from the perspective of academia -- not from the perspective of perfect statistical analysis or experimental design or anything like that. So put away any screams that this fails some sort of t-test or does not produce a normal distribution -- those aren't the goals of academic work in general.

I think you might be trying to solve this problem the wrong way in all honesty. You state:

> We also blindly grade a single student submission at a time and re-calibrate the rubric and its application till we have consensus.

and then

> Last Spring, this led to a significant number of grading iterations for each assignment, and we eventually ended up averaging across iterations, rather than converging to a common grade for each student.

This sounds like an immense amount of work to give students grades. But worse than that, it's clear the earlier work doesn't do anything for you, since you had to take the consensus-achieved "re-calibrated rubric" and then, if I'm reading correctly, redo and jury-rig together a bunch of things to produce grades.

If it seems like an immense struggle to come up with grades, then at least one of the following seems true to me:

  1. You and your team are suffering from a strong case of OCD.
  2. You're probably not achieving real consensus on how to grade things so much as minimal acquiescence to a single answer.
  3. You and your team are fighting against gravity -- i.e. you are trying to counteract a feature of human grading that you should just accept and work around.
  4. You and your team misunderstand the nature of academic work and grading (for more on that topic, see this question and especially this answer: https://academia.stackexchange.com/a/31526/20058).

I see three potential solutions:

Rather than trying to QC your way to perfect grade normalization, accept that individual grading differences are ineradicable, and arrange things so students' work is spread across a distribution of graders and grades are not so subject to this fluctuation as to be questionable. For instance, have each grader look at a sample of five "standards" (an A, B, C, D, and F) and see what grades they assign to them. Use this to categorize graders as severe, neutral, or soft, and make it so everyone gets a fair mix of graders.
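A minimal sketch of that calibration step in Python, assuming you record each grader's scores on the five standards; every name, score, and the ±2-point tolerance below is invented for illustration:

```python
from statistics import mean

# Grades the instructors agreed the five "standard" submissions deserve.
# All names and numbers here are hypothetical.
reference = {"std_A": 95, "std_B": 85, "std_C": 75, "std_D": 65, "std_F": 50}

# What each grader actually gave those same five submissions.
grader_scores = {
    "alice": {"std_A": 90, "std_B": 80, "std_C": 70, "std_D": 60, "std_F": 45},
    "bob":   {"std_A": 96, "std_B": 86, "std_C": 76, "std_D": 66, "std_F": 52},
    "carol": {"std_A": 99, "std_B": 92, "std_C": 83, "std_D": 74, "std_F": 60},
}

def classify(scores, tolerance=2.0):
    """Label a grader by their average deviation from the reference grades."""
    bias = mean(scores[k] - reference[k] for k in reference)
    if bias < -tolerance:
        return "severe"
    if bias > tolerance:
        return "soft"
    return "neutral"

for name, scores in grader_scores.items():
    print(name, classify(scores))  # alice: severe, bob: neutral, carol: soft
```

Once graders are bucketed, distributing each bucket evenly across the submissions (round-robin, say) keeps any one student from drawing only severe or only soft graders.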

And/or make the rubric so explicitly clear that individual differences don't matter, e.g.:

"one point for a program that complies, one point for a program that
executes without crashing, one point for a program that produces the correct output, two points for using a recursive function, one point for mentioning "iterative sort" / "iterative sorting" in the description.
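A rubric that mechanical can be encoded directly, which removes most room for grader judgment. A sketch, where the submission fields and check functions are hypothetical stand-ins for whatever your autograder or TA checklist actually records:

```python
# Hypothetical rubric: each line is (description, points, check).
RUBRIC = [
    ("compiles",                  1, lambda s: s["compiled"]),
    ("runs without crashing",     1, lambda s: s["ran_ok"]),
    ("produces correct output",   1, lambda s: s["output_correct"]),
    ("uses a recursive function", 2, lambda s: s["uses_recursion"]),
    # "iterative sort" also matches "iterative sorting".
    ("mentions iterative sort",   1, lambda s: "iterative sort" in s["description"].lower()),
]

def score(submission):
    """Sum the points for every rubric line the submission satisfies."""
    return sum(pts for _desc, pts, check in RUBRIC if check(submission))

example = {
    "compiled": True, "ran_ok": True, "output_correct": False,
    "uses_recursion": True, "description": "We used an iterative sorting pass.",
}
print(score(example))  # 5 of 6: everything but correct output
```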

And/or figure out where the individual grading differences actually happen and minimize the weight of those spots in the final grades.
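One low-effort way to locate those spots: have every grader score the same handful of submissions, then compute the per-item spread; the high-variance items are where graders diverge. A sketch with invented graders, items, and points:

```python
from statistics import pstdev

# scores[grader][rubric_item] for one shared calibration submission.
scores = {
    "alice": {"compiles": 1, "correct_output": 1, "style": 2, "design": 3},
    "bob":   {"compiles": 1, "correct_output": 1, "style": 4, "design": 6},
    "carol": {"compiles": 1, "correct_output": 0, "style": 1, "design": 8},
}

items = next(iter(scores.values())).keys()
spread = {item: pstdev(g[item] for g in scores.values()) for item in items}

# Highest-spread items first: these rubric lines drive the disagreement.
for item, sd in sorted(spread.items(), key=lambda kv: -kv[1]):
    print(f"{item:15} sd={sd:.2f}")
# design sd=2.05, style sd=1.25, correct_output sd=0.47, compiles sd=0.00
```

Items like the hypothetical "design" line above then either get tighter wording in the rubric or a smaller share of the total points.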