Data Type Recognition/Guessing of CSV Data in Python

After putting some thought into it, this is how I would design the algorithm myself (a sketch in code follows the list):

  • For performance reasons, take a sample of each column (say, 1% of the rows)
  • Run a regex match against each cell in the sample to detect its candidate data type
  • Choose the column's data type based on the resulting frequency distribution
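
To make the design concrete, here is a minimal sketch of this approach in Python (the regex patterns and the sample_rate, min_rows, and threshold defaults are illustrative assumptions, not tuned values):

    import csv
    import random
    import re
    from collections import Counter

    # Illustrative patterns only; real data needs more (dates, currency, etc.).
    PATTERNS = [
        ("int", re.compile(r"^[+-]?\d+$")),
        ("float", re.compile(r"^[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?$")),
        ("bool", re.compile(r"^(true|false)$", re.IGNORECASE)),
    ]

    def guess_cell_type(value):
        """Return the first matching pattern name, falling back to 'str'."""
        for name, pattern in PATTERNS:
            if pattern.match(value.strip()):
                return name
        return "str"

    def guess_column_types(path, sample_rate=0.01, min_rows=100, threshold=0.9):
        """Guess one type per column from a sample of the rows."""
        with open(path, newline="") as fh:
            reader = csv.reader(fh)
            header = next(reader)
            counters = [Counter() for _ in header]
            for i, row in enumerate(reader):
                # Keep the first min_rows rows so small files are fully
                # sampled, then keep roughly sample_rate of the rest.
                if i >= min_rows and random.random() > sample_rate:
                    continue
                for counter, cell in zip(counters, row):
                    if cell:  # empty cells carry no type information
                        counter[guess_cell_type(cell)] += 1
        types = []
        for counter in counters:
            total = sum(counter.values())
            best, count = counter.most_common(1)[0] if total else ("str", 0)
            # Only commit to a type if enough sampled cells agree on it.
            types.append(best if total and count / total >= threshold else "str")
        return dict(zip(header, types))

    print(guess_column_types("example.csv"))

The threshold parameter is the knob behind the second question below: at 0.9, a column where 95% of sampled cells parse as integers is typed int, while a noisier column falls back to str.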

Two questions arise:

  • What's a sufficient sample size? For small data sets? For large data sets?
  • What's a high enough threshold for selecting a data type based on the frequency distribution?

You may be interested in this Python library, which does exactly this kind of type guessing on CSV and XLS files for you:

  • https://github.com/okfn/messytables
  • https://messytables.readthedocs.org/ - docs

It scales happily to very large files, to streaming data off the internet, and so on.
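
Basic usage follows this pattern (adapted from the messytables documentation; headers_guess locates the header row, and type_guess votes on a type per column from a sample of rows):

    from messytables import (
        CSVTableSet, headers_guess, headers_processor,
        offset_processor, type_guess,
    )

    with open("example.csv", "rb") as fh:
        table_set = CSVTableSet(fh)
        row_set = table_set.tables[0]

        # Find the header row and register processors to skip past it.
        offset, headers = headers_guess(row_set.sample)
        row_set.register_processor(headers_processor(headers))
        row_set.register_processor(offset_processor(offset + 1))

        # Guess one type per column from the sampled rows.
        types = type_guess(row_set.sample, strict=True)
        print(dict(zip(headers, types)))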

There is also an even simpler wrapper library that provides a command-line tool, named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy).

The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164


Maybe csvsql (part of csvkit) could be useful here? I have no idea how efficient it is, but it definitely gets the job done for generating SQL CREATE TABLE statements from CSVs.

$ csvsql so_many_columns.csv  >> sql_create_table_with_char_types.txt
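
If you need DDL for a specific database, csvsql's --dialect flag (documented in csvkit) selects the SQL flavor; the file names here are just examples:

$ csvsql --dialect postgresql so_many_columns.csv > postgres_create_table.sql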