Use of automated assessment of programming assignments

TL;DR

We did something along these lines for Java-based programming assignments. Students and TAs generally like it. It was a lot of work for us.

(Very) Long Version

I used to teach software engineering at a large public university in Austria. We implemented a similar approach for two courses there: a 400+ student bachelor-level distributed systems course and a 150-student master-level course on middleware and enterprise application engineering. Both courses included large, time-consuming Java-based programming assignments, which students had to solve individually three times per semester.

Traditionally, students would submit their solutions for both courses as Maven projects via our university-wide Moodle system. After the deadline, TAs would download and execute the submissions and grade them manually based on extensive checklists that we provided. There was usually a bit of huffing and puffing among students about this grading. Sometimes, TAs would not understand correct solutions (after looking at dozens of similar programs, your mind tends to get sloppy). Sometimes, different TAs would grade similar programs differently (the sheer scale required some parallelization of grading, and the tasks were complex, so it was impossible to write checklists that covered all possible cases). Sometimes, the assignments were actually under-specified or unclear, and students lost points over simple misunderstandings. Sometimes, applications that worked on the student's machine failed on the TA's machine. Generally, it was hard for students to estimate in advance how many points their submissions would be worth. Given that these two courses are amongst the most difficult and most time-consuming courses in the entire SE curriculum, this was far from optimal.

Hence, we decided to move to a more automated solution. Basically, we codified our various checklists into a set of hundreds of JUnit test cases, which we gave to our students in source code. Additionally, we held back a smaller set of tests, which were similar but used different test data. The tests also served as the reference for expected behavior: if the assignment text did not specify how, e.g., a given component should behave in a given borderline case, then whatever the tests expected was the required behavior. Our promise to the students was that, if all the tests for a given task passed and the student did not game our test system (more on this below), not much could go wrong with the grading anymore (minor point deductions were still possible for things impossible to test automatically, but those could not amount to more than a small percentage of all points).
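
To give an idea of what such published tests look like, here is a minimal sketch in the same spirit (JUnit 4). NameService and its expected behavior are invented for this answer, not taken from our actual assignments:

    import org.junit.Test;
    import static org.junit.Assert.*;

    // Hypothetical example of a "published" grading test. NameService and its
    // expected behavior are invented for illustration; the real tests targeted
    // the distributed-systems components of the assignments.
    public class NameServiceTest {

        // Straightforward requirement taken from the (hypothetical) assignment text.
        @Test
        public void lookupReturnsPreviouslyRegisteredAddress() {
            NameService service = new NameService();
            service.register("node-1", "192.168.0.10:4711");
            assertEquals("192.168.0.10:4711", service.lookup("node-1"));
        }

        // Borderline case the written text might leave open: the test itself
        // fixes the expected behavior (here: return null rather than throw).
        @Test
        public void lookupOfUnknownNameReturnsNull() {
            NameService service = new NameService();
            assertNull(service.lookup("unknown-node"));
        }
    }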

EDIT after comments: Note that I said more automated, not fully automated. Human TAs still look at the code. There is no grading web service or anything like that (this usually does not work, as keshlam correctly notes). On a high level, what we did was provide the requirements to students not only as written text, but also in the form of executable unit tests. The new process is roughly as follows:

  1. TA downloads submission from Moodle and starts tests (takes about 10 minutes).
  2. While tests are running, TA browses over the code, does some spot checking of code, and checks a few things not covered by the tests.
  3. When the tests are done, the TA notes down the results of the tests and his own observations.
  4. If something severe happens (e.g., compile error), the TA sighs, takes a deep breath, and falls back to manually grading (and, likely, complaining a bit in our mailing list).

Advantages:

  • Students now have a good feeling about how many points they will get in the end. No nasty surprises anymore.
  • The TA effort for grading is now much, much lower, maybe 1/3 of the previous time. Hence, they have more time to look at problematic solutions and help students actually complete the assignments.
  • The tests, good or bad (more on this later), are the same for everybody. Lenient and strict TAs are much less of a problem.
  • Working solutions now work no matter if the grading TA understands the solution or not.
  • Students get rapid feedback on the quality of their solution, but not so rapid that it makes sense to just program against the tests without thinking. One good side effect of the way we built our tests is that simply executing the tests for many tasks takes 10+ minutes (starting an application server, deploying code, ...). Students need to think before starting tests, just as they would if they were testing against, e.g., a real staging server. (A rough sketch of what such a test class looks like follows this list.)
  • Students do not need to waste time writing test applications / test data. Before, one problem that made the student effort in these courses skyrocket was that students not only had to program the assignments themselves, but also needed to write various demo programs / test data sets to test and showcase their solutions. This has become entirely obsolete with the automated test framework, significantly reducing the amount of boilerplate code that students need to write. In the end, we can now focus more on the actual content of the labs, and not on writing stupid test code.
  • All in all, both our personal impression and the student evaluations show that the majority of students appreciate the automated test system.
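
To illustrate the long-running tests mentioned in the list above, a test class in this style might be structured roughly as follows. All class names (EmbeddedContainer, OrderClient) are invented stand-ins, not the actual course infrastructure:

    import org.junit.AfterClass;
    import org.junit.BeforeClass;
    import org.junit.Test;
    import static org.junit.Assert.*;

    // Hypothetical sketch of a long-running grading test. The expensive server
    // start / deploy work happens once per test class, which is why a full test
    // run takes several minutes.
    public class OrderProcessingTest {

        private static EmbeddedContainer container;

        @BeforeClass
        public static void startContainer() throws Exception {
            container = new EmbeddedContainer();          // boot the application server in-memory
            container.deploy("target/assignment2.war");   // deploy the student's build
        }

        @AfterClass
        public static void stopContainer() throws Exception {
            container.undeploy("target/assignment2.war");
            container.shutdown();
        }

        // The actual checks are cheap compared to the setup above.
        @Test
        public void submittedOrderIsAcknowledged() throws Exception {
            OrderClient client = container.clientFor(OrderClient.class);
            assertTrue(client.submitOrder("order-42").isAcknowledged());
        }
    }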

Problems:

  • The initial effort for us was certainly non-trivial. Coming up with the first version of the tests required a concerted effort by 6 or more TAs and 2 junior faculty over multiple months during the summer (not full-time, of course, but still). Note that this was despite using a widely adopted, standard testing framework - the effort went into codifying our grading rules, finding and defining all corner cases, and writing everything down as useful tests. Further, as the assignments were complex distributed systems, so were the tests - we had to start application servers in-memory, hot-deploy and undeploy code, and make sure that all of this worked out of the box on all major OS versions. All in all, writing the tests was a real project that required a lot of leadership and time commitment.
  • While, overall, our total time spent arguing about grading has decreased, it is not zero. Most importantly, as students can now see the grading guidelines clearly written down in source code, they seem to feel more compelled to question requirements and grading decisions than before. Hence, we now get a lot of "But why should I do it like that? Doing it in X-Y-Other-Way would be much better!" questions and complaints (incidentally, in practically all cases, the "much better" way is also much less work / much easier for the student).
  • While we were able to cover most of our original assignments, some things are impossible to cover in tests. In these cases, TAs still need to grade manually. In addition, in order to make our test framework technically work, students are now a bit more restricted in how they can solve the assignments than before.
  • Some students feel compelled to try to game our grading system. They put significant effort into finding solutions that do not actually solve the assignments but still get full points. In general, these efforts fail, as we have fairly sophisticated backend tests that do not only check whether the behavior is correct at the interface level, but also look "under the hood" (a sketch of this idea follows this list). Occasionally they succeed, which leads us to improve our tests.
  • Initially, we received some flak from a few students for a small number of bugs in our tests. The bugs were easy to fix and mostly not particularly severe, and most students understood that if you roll out a couple thousand lines of Java code for the first time, bugs will happen. However, some took the opportunity to complain. A lot.
  • We had the impression that the number of copied solutions (plagiarism) was up the first time we rolled out the automated tests. Note that we had automated plagiarism checks long before, and nothing had changed in this regard, but apparently students assumed that cheating would go unnoticed with the automated testing.
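
For the "under the hood" checks mentioned in the list, the basic idea can be sketched like this; TransferService, TestHarness, and AccountRepository are invented names for illustration, not our actual test code:

    import org.junit.Test;
    import static org.junit.Assert.*;

    // Hypothetical sketch of an "under the hood" check: the test is not satisfied
    // with the right return value alone, it also inspects the backing store to
    // verify that the work was really done rather than faked for the test.
    public class TransferServiceBackendTest {

        @Test
        public void transferUpdatesBothAccountsInTheDatabase() throws Exception {
            // Seed two accounts with a balance of 1000 each via the (hypothetical) harness.
            TestHarness.seedAccount("acc-1", 1000);
            TestHarness.seedAccount("acc-2", 1000);
            TransferService service = TestHarness.lookupService(TransferService.class);

            // Interface-level check: the call reports success.
            assertTrue(service.transfer("acc-1", "acc-2", 100));

            // Backend check: the balances in the database actually changed.
            AccountRepository repo = TestHarness.directDatabaseAccess();
            assertEquals(900, repo.balanceOf("acc-1"));
            assertEquals(1100, repo.balanceOf("acc-2"));
        }
    }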

TL;DR

We used fully automated testing for programming assignments in a second-year course. We directly showed students their grade and allowed them to re-submit as often as they liked. Students loved it, and it freed up time for the TAs to hold far more office hours.

Long version

We developed our own in-house solution for a second-year Data Structures course. The exercises came with compiling skeleton code and interfaces for the test software to hook into. Students would submit to a web server running a Perl script, which would compile the code alongside the testing framework, run a number of tests, and display the grade. We did not give the students the tests themselves, but did display the outcome of each test, usually in a way that made clear where the problem was without revealing the exact test data. In addition, students were allowed to submit as many times as they liked. We explicitly encouraged them to submit early and often.
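
As an illustration of what such a skeleton might look like (IntStack is invented here, not one of the actual exercises):

    // IntStack.java -- fixed interface that the grading framework hooks into.
    public interface IntStack {
        void push(int value);
        int pop();               // behavior on an empty stack is fixed in the assignment text
        boolean isEmpty();
    }

    // ArrayIntStack.java -- compiling but unimplemented skeleton handed to students.
    public class ArrayIntStack implements IntStack {
        @Override public void push(int value) { throw new UnsupportedOperationException("TODO"); }
        @Override public int pop()            { throw new UnsupportedOperationException("TODO"); }
        @Override public boolean isEmpty()    { throw new UnsupportedOperationException("TODO"); }
    }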

Advantages:

  • It allowed us to harshly punish code that did not work or did not compile (we gave 0 marks for both).
  • Complete uniformity in grading. Fully automating the grading meant that solutions that implemented the same functionality would always get the same mark.
  • By being very clear about the requirements, and at the same time being fair by allowing them to re-submit, we significantly reduced complaints about grading.
  • We had one TA (myself) who was mainly responsible for developing tests, and one who maintained the submission server. This freed up the other TAs to hold many more office hours. We had TAs available 4 hours per day, every day of the week.
  • Students can correct (and learn from!) their mistakes before the grade is final. The good students in particular did not rest until they had a perfect grade.

Disadvantages:

  • Can promote plagiarism. We did not bother with any plagiarism checks, but this could become a significant problem, especially if assignments are reused from previous years.
  • Non-functional requirements are hard to test. While it is possible to test certain coding style requirements automatically, this is much harder than having a TA look it over. It is also much easier to game these automatic tests.
  • Can promote 'coding against the tests'. Some students did not bother to implement even the most basic correctness tests themselves, instead relying purely on the server. For the next iteration of the course, we are considering basing part of the mark on whether student-written tests can successfully detect our deliberately incorrect solutions (a sketch of this idea follows this list). This will hopefully force them to think about testing their own code as well.
  • Can shift focus from high-level understanding of the material to implementation details. Automated testing relies on the students being able to produce valid code. If the students are struggling with syntax, it can get in the way of higher-level objectives like the algorithms and data structures involved.
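
The idea of marking student-written tests mentioned above could be sketched roughly as follows, reusing the IntStack interface from the earlier sketch; StackFactory is an invented placeholder for whatever mechanism the grading server would use to inject the implementation under test:

    import org.junit.Test;
    import static org.junit.Assert.*;

    // Hypothetical sketch of grading student-written tests: the same student test
    // class is run once against a correct reference implementation (it must pass)
    // and once against a deliberately broken one (it must fail).
    public class StudentStackTest {

        @Test
        public void popReturnsValuesInReverseOrder() {
            IntStack stack = StackFactory.createStack();  // correct or seeded-broken implementation
            stack.push(1);
            stack.push(2);
            assertEquals(2, stack.pop());
            assertEquals(1, stack.pop());
            assertTrue(stack.isEmpty());
        }
    }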

Conclusions:

Overall, the students, the professor, and the TAs all really appreciated this form of grading. The students received instant, unambiguous feedback on their work and got the chance to learn from their mistakes while it still counted. At the same time, it allowed the majority of TAs to focus on helping the students understand the material, instead of on tedious evaluation.

The Java testing framework is available on Bitbucket. Unfortunately, it lacks documentation.