Ruby: How can I detect/intelligently guess the delimiter used in a CSV file?

It looks like the Python implementation just checks a few dialects: excel or excel_tab. So a simple implementation that only checks for "," or "\t" is:

COMMON_DELIMITERS = ['","',"\"\t\""].freeze

def sniff(path)
  # The block form closes the file handle once the first line is read.
  first_line = File.open(path, &:gets)
  return unless first_line

  snif = {}
  COMMON_DELIMITERS.each do |delim| 
    snif[delim] = first_line.count(delim)
  end
  snif = snif.sort { |a,b| b[1]<=>a[1] }

  snif[0][0] if snif.size > 0
end

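Note: this returns the full quoted delimiter it finds, e.g. ",". To get just the , character, change snif[0][0] to snif[0][0][1].

Also, I'm using count(delim) because it is a little faster. Be aware that String#count treats its argument as a set of characters and counts every character in that set, not occurrences of the whole string. The quote characters are therefore counted too, but equally for every candidate, so the comparison still works. If you added a delimiter composed of two (or more) identical characters, like --, it would count each occurrence twice (or more) when weighing the type, so in that case it may be better to use scan(delim).length.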
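To make the difference between count and scan concrete, here is a small sketch. The sample line is made up for illustration; only String#count and String#scan from core Ruby are used.

```ruby
# A made-up sample line: three quoted fields separated by commas.
line = %q("a","b","c")

# String#count treats '","' as the character set { ", , } and counts
# every matching character: 6 quotes + 2 commas.
char_hits = line.count('","')            # => 8

# String#scan matches the whole three-character substring ",".
substring_hits = line.scan('","').length # => 2
```

Either number can be used for ranking candidates, as long as the same method is applied to every candidate.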


Here is Gary S. Weaver's answer as we are using it in production. It is a good solution that works well.

class ColSepSniffer
  NoColumnSeparatorFound = Class.new(StandardError)
  EmptyFile = Class.new(StandardError)

  COMMON_DELIMITERS = [
    '","',
    '"|"',
    '";"'
  ].freeze

  def initialize(path:)
    @path = path
  end

  def self.find(path)
    new(path: path).find
  end

  def find
    fail EmptyFile unless first

    if valid?
      delimiters[0][0][1]
    else
      fail NoColumnSeparatorFound
    end
  end

  private

  def valid?
    !delimiters.collect(&:last).reduce(:+).zero?
  end

  # Example for a file whose first line uses ";" as the separator:
  # delimiters #=> [["\";\"", 54], ["\",\"", 0], ["\"|\"", 0]]
  # delimiters[0] #=> ["\";\"", 54]
  # delimiters[0][0] #=> "\";\""
  # delimiters[0][0][1] #=> ";"
  def delimiters
    @delimiters ||= COMMON_DELIMITERS.inject({}, &count).sort(&most_found)
  end

  def most_found
    ->(a, b) { b[1] <=> a[1] }
  end

  def count
    ->(hash, delimiter) { hash[delimiter] = first.count(delimiter); hash }
  end

  def first
    @first ||= file.first
  end

  def file
    @file ||= File.open(@path)
  end
end

Spec

require "spec_helper"

describe ColSepSniffer do
  describe ".find" do
    subject(:find) { described_class.find(path) }

    let(:path) { "./spec/fixtures/google/products.csv" }

    context "when , delimiter" do
      it "returns separator" do
        expect(find).to eq(',')
      end
    end

    context "when ; delimiter" do
      let(:path) { "./spec/fixtures/google/products_with_semi_colon_seperator.csv" }

      it "returns separator" do
        expect(find).to eq(';')
      end
    end

    context "when | delimiter" do
      let(:path) { "./spec/fixtures/google/products_with_bar_seperator.csv" }

      it "returns separator" do
        expect(find).to eq('|')
      end
    end

    context "when empty file" do
      it "raises error" do
        expect(File).to receive(:open) { [] }
        expect { find }.to raise_error(described_class::EmptyFile)
      end
    end

    context "when no column separator is found" do
      it "raises error" do
        expect(File).to receive(:open) { [''] }
        expect { find }.to raise_error(described_class::NoColumnSeparatorFound)
      end
    end
  end
end

I'm not aware of any sniffer implementation in the CSV library included in Ruby 1.9. It will try to auto-discover the row separator, but the column separator is assumed to be a comma by default.
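As a quick illustration of that default, the sketch below parses the same made-up data with and without an explicit col_sep. It only uses CSV.parse_line and CSV.parse from the standard csv library; the sample strings are assumptions for the example.

```ruby
require "csv"

line = "a;b;c\r\n"

# Without col_sep the comma default applies, so the line is one field.
default_fields = CSV.parse_line(line)                   # => ["a;b;c"]

# Passing the separator explicitly splits the fields as intended.
semicolon_fields = CSV.parse_line(line, col_sep: ";")   # => ["a", "b", "c"]

# The row separator, by contrast, is detected automatically.
rows = CSV.parse("a,b\r\nc,d\r\n")                      # => [["a", "b"], ["c", "d"]]
```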

One idea would be to try parsing a sample number of rows (5% of total maybe?) using each of the possible separators. Whichever separator results in the same number of columns most consistently is probably the correct separator.
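A minimal sketch of that idea follows. The method name guess_col_sep, the CANDIDATE_SEPARATORS list, and the scoring rule (number of sampled lines sharing the most common column count, ignoring a count of one) are all assumptions made for illustration, not part of the CSV library.

```ruby
require "csv"

# Hypothetical candidate list; extend it as needed.
CANDIDATE_SEPARATORS = [",", ";", "\t", "|"].freeze

# Hypothetical helper: parses each sampled line with each candidate and
# returns the separator whose most common column count (greater than one)
# occurs on the most lines, or nil if no candidate splits anything.
def guess_col_sep(lines, candidates: CANDIDATE_SEPARATORS)
  scores = candidates.map do |sep|
    widths = lines.map do |line|
      begin
        CSV.parse_line(line, col_sep: sep).to_a.size
      rescue CSV::MalformedCSVError
        0 # treat unparsable lines as having no columns
      end
    end

    # Count how often each column width occurs in the sample.
    counts = widths.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
    width, hits = counts.max_by { |_, n| n }

    # A most common width of 0 or 1 means this separator did not split the rows.
    [sep, width.to_i > 1 ? hits : 0]
  end

  best, score = scores.max_by { |_, s| s }
  score.to_i > 0 ? best : nil
end

sample = ["id;name;price", "1;Widget, large;9.99", "2;Gadget;19.99"]
guessed = guess_col_sep(sample) # => ";"
```

For a real file, the sample could be taken with something like File.foreach(path).first(50) rather than a percentage of the total, which avoids reading the whole file just to count its lines.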