CSV Autodetection in Java

There are always going to be non-CSV files that look like CSV, and vice versa. For instance, there's the pathological (but perfectly valid) CSV file that frankc posted in the Java link you cited:

Name
Jim
Tom
Bill

The best one can do, I think, is some sort of heuristic estimate of the likelihood that a file is CSV. Some heuristics I can think of are:

  1. There is a candidate separator character that appears on every line (or, put another way, every line splits into more than one token).
  2. Given a candidate separator character, most (but not necessarily all) of the lines have the same number of fields.
  3. The presence of a first line that looks like it might be a header increases the likelihood of the file containing CSV data.

One can probably think up other heuristics. The approach would then be to develop a scoring algorithm based on these. The next step would be to score a collection of known CSV and non-CSV files. If there is a clear-enough separation, then the scoring could be deemed useful and the scores should tell you how to set a detection threshold.
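
As a rough illustration of what such a scorer could look like in Java, here is a minimal sketch; the candidate separator list, the numeric-header test, and the 0.1 header bonus are arbitrary assumptions for illustration, not tuned values.

import java.util.List;
import java.util.regex.Pattern;

public class CsvScorer {

    // The candidate separators are an assumption; extend the list as needed.
    private static final char[] CANDIDATE_SEPARATORS = {',', ';', '\t', '|'};

    /** Crude 0..1 estimate of how CSV-like the given lines look. */
    public static double score(List<String> lines) {
        if (lines.size() < 2) {
            return 0.0;
        }
        double best = 0.0;
        for (char sep : CANDIDATE_SEPARATORS) {
            String sepRegex = Pattern.quote(String.valueOf(sep));

            // Heuristic 1: the candidate separator appears on every line.
            if (!lines.stream().allMatch(l -> l.indexOf(sep) >= 0)) {
                continue;
            }

            // Heuristic 2: what fraction of lines share the first line's field count?
            int headerFields = lines.get(0).split(sepRegex, -1).length;
            long matching = lines.stream()
                    .filter(l -> l.split(sepRegex, -1).length == headerFields)
                    .count();
            double consistency = (double) matching / lines.size();

            // Heuristic 3: a header-looking first line (no purely numeric fields)
            // nudges the score up slightly.
            boolean headerish = true;
            for (String field : lines.get(0).split(sepRegex, -1)) {
                if (field.trim().matches("-?\\d+(\\.\\d+)?")) {
                    headerish = false;
                    break;
                }
            }
            best = Math.max(best, Math.min(consistency + (headerish ? 0.1 : 0.0), 1.0));
        }
        return best;
    }
}

Scoring a corpus of known CSV and non-CSV files with something like this is what would tell you whether a usable detection threshold exists.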


If you can't constrain what's used as a delimiter, you can use brute force.

You could iterate through all possible combinations of quote character, column delimiter, and record delimiter (256 * 255 * 254 = 16,581,120 combinations for single-byte character values). Take this file as an example:

id,text,date
1,"Bob says, ""hi
..."", with a sigh",1/1/2012

Remove all quoted columns; this can be done with a regex replace:

// Quick JavaScript example of the regex; you'd replace the quote char with whichever character you're currently testing.
var test = 'id,text,date\n1,"Bob says, ""hi\n..."", with a sigh",1/1/2012';
console.log(test.replace(/"(""|.|\n|\r)*?"/gm, ""));

id,text,date
1,,1/1/2012
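
Since the question is about Java, the same replacement can be expressed with java.util.regex; this is just a direct translation of the JavaScript above, with the quote character hard-coded for the sketch.

import java.util.regex.Pattern;

public class QuoteStripper {
    public static void main(String[] args) {
        String test = "id,text,date\n1,\"Bob says, \"\"hi\n...\"\", with a sigh\",1/1/2012";

        // Same reluctant regex as the JavaScript example; swap '"' for whichever
        // quote character is currently being tested.
        Pattern quoted = Pattern.compile("\"(\"\"|.|\n|\r)*?\"");
        System.out.println(quoted.matcher(test).replaceAll(""));
        // prints:
        // id,text,date
        // 1,,1/1/2012
    }
}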

Split on record delimiter

["id,text,date", "1,,1/1/2012"]

Split records on column delimiter

[ ["id", "text", "date"], ["1", "", "1/1/2012"] ]

If the number of columns per record matches, you have some CSV confidence:

3 == 3

If the column counts don't match, try another combination of record delimiter, column delimiter, and quote character.
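
Put together, one way to sketch the whole check in Java is below; looksLikeCsv, the restricted candidate character lists, and the plain split-based tokenising are illustrative assumptions rather than a complete detector (real record delimiters can also be multi-character, e.g. CRLF).

import java.util.regex.Pattern;

public class CsvBruteForce {

    /** Does the data form a consistent table with the given quote/column/record characters? */
    static boolean looksLikeCsv(String data, char quote, char colDelim, char recDelim) {
        // 1. Remove quoted spans so embedded delimiters don't confuse the splits.
        String q = Pattern.quote(String.valueOf(quote));
        String stripped = data.replaceAll(q + "(" + q + q + "|.|\n|\r)*?" + q, "");

        // 2. Split on the record delimiter.
        String[] records = stripped.split(Pattern.quote(String.valueOf(recDelim)));
        if (records.length < 2) {
            return false; // a single record gives no evidence of structure
        }

        // 3. Split each record on the column delimiter and
        // 4. require a consistent column count greater than one.
        String col = Pattern.quote(String.valueOf(colDelim));
        int columns = records[0].split(col, -1).length;
        if (columns < 2) {
            return false;
        }
        for (String record : records) {
            if (record.split(col, -1).length != columns) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String test = "id,text,date\n1,\"Bob says, \"\"hi\n...\"\", with a sigh\",1/1/2012";

        // Brute-force loop over candidate combinations, restricted here to a few
        // common characters rather than all 16.5 million byte triples.
        char[] quotes = {'"', '\''};
        char[] cols = {',', ';', '\t', '|'};
        char[] recs = {'\n'};
        for (char qc : quotes) {
            for (char cc : cols) {
                for (char rc : recs) {
                    if (looksLikeCsv(test, qc, cc, rc)) {
                        System.out.println("possible combination: quote=" + qc
                                + " col=" + cc + " record=(char " + (int) rc + ")");
                    }
                }
            }
        }
        // Note: more than one combination can pass this test; the column type
        // checks described in the EDIT below are one way to break ties.
    }
}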

EDIT

Actually parsing the data once you have confidence in the delimiters, and checking for column type uniformity, might be a useful extra step (see the sketch after this list):

  • Are all of the columns in the first (header?) row strings?
  • Does column X always parse to null/empty or to a valid value of a single type (int, float, date)?
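
As a minimal sketch of that second check in Java (the set of types tried and the date pattern are assumptions; real data may need more formats):

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;

public class ColumnTypeCheck {

    // The date pattern is an assumption to match the example data above.
    private static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("M/d/yyyy");

    /** True if every value in the column is empty or parses as one consistent type. */
    static boolean isUniform(List<String> columnValues) {
        return columnValues.stream().allMatch(ColumnTypeCheck::isIntOrEmpty)
            || columnValues.stream().allMatch(ColumnTypeCheck::isFloatOrEmpty)
            || columnValues.stream().allMatch(ColumnTypeCheck::isDateOrEmpty);
    }

    static boolean isIntOrEmpty(String v) {
        if (v.isEmpty()) return true;
        try { Long.parseLong(v); return true; } catch (NumberFormatException e) { return false; }
    }

    static boolean isFloatOrEmpty(String v) {
        if (v.isEmpty()) return true;
        try { Double.parseDouble(v); return true; } catch (NumberFormatException e) { return false; }
    }

    static boolean isDateOrEmpty(String v) {
        if (v.isEmpty()) return true;
        try { LocalDate.parse(v, DATE); return true; } catch (DateTimeParseException e) { return false; }
    }
}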

The more CSV data (rows, columns) there is to work with, the more confidence you can extract from this method.

I think this question is kind of silly / overly general: if you have a stream of unknown data, you'd definitely want to check for all of the "low-hanging fruit" first. Binary formats usually have fairly distinct header signatures, and then there are XML and JSON for easily detectable text formats.
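
For that first pass, a sketch can be as simple as peeking at the first bytes before falling back to the CSV heuristics; the signature list here is deliberately incomplete and the method name is made up.

import java.nio.charset.StandardCharsets;

public class FormatSniffer {

    /** Very rough first-pass guess; anything unrecognised falls through to CSV scoring. */
    static String sniff(byte[] head) {
        // A few well-known binary signatures (far from exhaustive).
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) return "png";
        if (startsWith(head, new byte[] {'P', 'K', 0x03, 0x04})) return "zip";
        if (startsWith(head, new byte[] {'%', 'P', 'D', 'F'})) return "pdf";

        // Easily detectable text formats.
        String text = new String(head, StandardCharsets.UTF_8).trim();
        if (text.startsWith("<")) return "xml/html";
        if (text.startsWith("{") || text.startsWith("[")) return "json";
        return "unknown"; // candidate for the CSV detection described above
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}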