How can I inspect a Hadoop SequenceFile for which I lack full schema information?

From shell:

$ hdfs dfs -text /user/hive/warehouse/table_seq/000000_0

or directly from Hive, which is much faster for small files because it runs inside an already-started JVM:

hive> dfs -text /user/hive/warehouse/table_seq/000000_0

works for sequence files.


Check the SequenceFileReadDemo class in the sample code for 'Hadoop: The Definitive Guide'. Sequence files have their key/value types embedded in the header, so use SequenceFile.Reader.getKeyClass() and SequenceFile.Reader.getValueClass() to get the type information.
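Along those lines, a minimal sketch that just prints the embedded types (assuming hadoop-common is on the classpath; the class and path here are only illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class SeqFileTypes {
  // Print the key/value classes a SequenceFile declares in its header.
  public static void printTypes(Configuration conf, Path path) throws Exception {
    try (SequenceFile.Reader reader = new SequenceFile.Reader(
        conf, SequenceFile.Reader.file(path))) {
      System.out.println("key class:   " + reader.getKeyClass().getName());
      System.out.println("value class: " + reader.getValueClass().getName());
    }
  }
}
```

Run it against any file, e.g. printTypes(new Configuration(), new Path("file:///tmp/part-00000")), and it tells you what to read the file with before you write any deserialization code.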


My first thought would be to use the Java API for sequence files to try to read them. Even if you don't know which Writable the file uses, you can guess and check the error messages (there may be a better way that I don't know).

For example:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

private void readSeqFile(Path pathToFile) throws IOException {
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);

  SequenceFile.Reader reader = new SequenceFile.Reader(fs, pathToFile, conf);
  try {
    Text key = new Text(); // this could be the wrong type
    Text val = new Text(); // also could be wrong

    while (reader.next(key, val)) {
      System.out.println(key + ":" + val);
    }
  } finally {
    reader.close(); // always release the underlying stream
  }
}

This program will throw an exception if those are the wrong types, but the exception message should name the Writable classes the key and value actually are.
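You can also skip the guessing entirely: ask the reader for the declared classes and instantiate them reflectively. A sketch of that approach, assuming hadoop-common is available and that both types are Writables with no-arg constructors (the usual case):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileDump {
  // Dump every record using whatever key/value types the file declares.
  public static void dump(Configuration conf, Path path) throws Exception {
    try (SequenceFile.Reader reader = new SequenceFile.Reader(
        conf, SequenceFile.Reader.file(path))) {
      // Instantiate the exact classes from the header, so no type guessing.
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, val)) {
        System.out.println(key + "\t" + val);
      }
    }
  }
}
```

This relies on each Writable's toString() being readable, which holds for the common types (Text, IntWritable, BytesWritable) but may print object hashes for custom ones.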

Edit: actually, if you run less file.seq you can usually read some of the header and see what the Writable types are (at least for the first key/value). On one file, for example, I see:

SEQ^F^Yorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable
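That line decodes mechanically: three magic bytes SEQ, one version byte (^F is 6), then the key and value class names, each prefixed by its length as a Hadoop VInt (^Y is 25, the length of org.apache.hadoop.io.Text; " is 34, the length of org.apache.hadoop.io.BytesWritable). A dependency-free sketch of that decoding, simplified by assuming class names under 128 bytes so each length prefix is a single byte:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SeqHeaderPeek {
  // Returns { keyClassName, valueClassName, version } from the start of a
  // SequenceFile stream. Simplified: assumes single-byte VInt lengths.
  public static String[] readHeader(DataInputStream in) throws IOException {
    byte[] magic = new byte[3];
    in.readFully(magic);
    if (!new String(magic, StandardCharsets.US_ASCII).equals("SEQ")) {
      throw new IOException("Not a SequenceFile");
    }
    int version = in.readUnsignedByte();
    String keyClass = readShortString(in);
    String valueClass = readShortString(in);
    return new String[] { keyClass, valueClass, String.valueOf(version) };
  }

  private static String readShortString(DataInputStream in) throws IOException {
    int len = in.readUnsignedByte(); // single-byte VInt for lengths < 128
    byte[] buf = new byte[len];
    in.readFully(buf);
    return new String(buf, StandardCharsets.UTF_8);
  }
}
```

This is only a peek at the first few header fields; the real header continues with compression flags, metadata, and a sync marker, which the SequenceFile.Reader API handles for you.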

Tags:

Apache

Hadoop