Autodetect Presence of CSV Headers in a File

In the purely abstract sense, I don't think there is an foolproof algorithmic answer to your question since it boils down to: "How do I distinguish dataA from dataB if I know nothing about either of them?". There will always be the potential for dataA to be indistinguishable from dataB. That said, I would start with the simple and only add complexity as needed. For example, if examining the first five rows, for a given column (or columns) if the datatype in rows 2-5 are all the same but differ from the datatype in row 1, there's a good chance that a header row is present (increased sample sizes reduce the possibility of error). This would (sorta) solve #1/#3 - perhaps throw an exception if the rows are all populated but the data is indistinguishable to allow the calling program to decide what to do next. For #2, simply don't count a row as a row unless and until it pulls non-null data....that would work in all but an empty file (in which case you'd hit EOF). It would never be foolproof, but it might be "close enough".


As others have pointed out, you can't do this with 100% reliability. There are cases where getting it 'mostly right' is useful, however - for example, spreadsheet tools with CSV import functionality often try to figure this out on their own. Here's a few heuristics that would tend to indicate the first line isn't a header:

  • The first row has columns that are not strings or are empty
  • The first row's columns are not all unique
  • The first row appears to contain dates or other common data formats (eg, xx-xx-xx)

It really depends on just how "general" you want your tool to be. If the data will always be numeric, you have it easy as long as you assume non-numeric headers (which seems like a pretty fair assumption).

But beyond that, if you don't already know what patterns are present in the data, then you can't really test for them ahead of time.

FWIW, I actually just wrote a script for parsing out some stuff from TSVs, all from the same source. The source's approach to headers/formatting was so scattered that it made sense to just make the script ask me questions from the command line while executing. (Is this a header? Which columns are important?). So no automation, but it let's me fly through the data sets I'm working on, instead of trying to anticipate each funny formatting case. Also, my answers are saved in a file, so I only have to be involved once per file. Not ideal, but efficient.


In the most general sense, this is impossible. This is a valid csv file:
Name
Jim
Tom
Bill

Most csv readers will just take hasHeader as an option, and allow you to pass in your own header if you want. Even in the case you think you can detect, that being character headers and numeric data, you can run into a catastrophic failure. What if your column is a list of BMW series?
M
3
5
7

You will process this incorrectly. Worst of all, you will lose the best car!