Tips for testing data intensive legacy application

Get a copy of Working Effectively with Legacy Code by Michael Feathers. It is full of useful advice for working with large, untested codebases.

Another good book is Object Oriented Reengineering Patterns. Most of the book is not specific to object-oriented software. The full text is available for free download in PDF format.

From my own experience: try to...

  • Automate the build and deployment
  • Get the database schema into version control, if it isn't yet. Usually databases include reference data that the transactional code needs to exist before it can work. Get this under version control too. Tools like dbdeploy can help you easily rebuild a schema and reference data from a sequence of deltas.
  • Install a version of the database (and any other infrastructure services) onto your development workstation. This will let you work on the database without continually having to go through DBAs. It's also faster than using a schema on a shared server in a remote datacentre. All major commercial database servers have free (as in beer) development versions that work on Windows (if you're stuck in the unenviable situation of developing on Windows and deploying on Unix).
  • Before starting work on an area of the code, write end-to-end tests that roughly cover the behaviour of the area you're working on. An end-to-end test should exercise the system from outside -- by controlling its user interface or interacting through network services -- so you won't need to change the code to put it into place. It will act as an (imperfect) regression test and give you more confidence to refactor the internals of the system towards a structure that is easier to unit test.
  • If there are manual test plans, read them and see what can be automated. Most manual test plans are almost entirely scripted and so are low-hanging fruit for automation
  • Once you've got end-to-end tests coverage, refactor the code into more loosely coupled units as you modify and/or extend it. Surround those units with unit tests.

Things to avoid:

  • Copying data from the production database into the environment you use for automated testing. This will make your tests unpredictable. Sure, run the system against a copy of production data, but use that for exploratory testing, not regression testing.
  • Rolling back transactions at the end of tests to isolate tests from one another. This will not test behaviour that only happens when transactions are committed, and will throw away data that is valuable for diagnosing test failures. Instead, tests should ensure the database is in a known initial state when they start.
  • Creating a "tiny" data set for tests to run against. This makes tests hard to understand because they cannot be read as a single unit. The "tiny" data set soon grows very large as you add tests for different scenarios. Instead, tests can insert data into the database to set up the test-fixture.