Read file contents to variable and iterate over each character in file (hexdump)

As Heiko Oberdiek points out in this answer, pdfTeX defines a new, expandable primitive \pdffiledump which can be used to read files in binary mode. The syntax of the command is

\pdffiledump offset 0 length <length>{<filename>}

where for <length> we can use another primitive \pdffilesize{<filename>}. The result is a sequence of pairs XX, where XX is the hex representation of each character in the input file. The rest of the processing is similar to the answer below, beside we don't need the extra hex conversion.

\documentclass{article}

\makeatletter

\def\showbinary#1{%
    \begingroup
    \xdef\@temp{\pdffiledump offset 0 length \pdffilesize{#1}{#1}}%
    \expandafter\analyze\expandafter{\@temp}%
    \endgroup
}

\def\analyze#1{%
    \count@=0
    \if\relax\detokenize{#1}\relax\else
        \expandafter\analyze@#1\@end
    \fi
}
\def\analyze@#1#2#3\@end{%
    #1#2
    \advance\count@ by 1
    \ifnum\count@>15
        \count@=0
        \par
    \fi
%
    \let\@next=\relax
    \if\relax\detokenize{#3}\relax\else
        \def\@next{\analyze@#3\@end}%
    \fi
    \@next
}

\makeatother

\begin{document}
\ttfamily
\showbinary{ascii.txt}
\end{document}

outputs

enter image description here


Old answer

Not a complete answer, but this is the best I could come up with for reading a binary file:

\documentclass{article}

\makeatletter

\def\showbinary#1{%
    \begingroup
    \count@=0
    \loop
        \catcode\count@=12
        \advance\count@ by 1
    \ifnum\count@<256
    \repeat
%
    \endlinechar=-1
    \everyeof{\noexpand}%
    \xdef\@temp{\@@input #1 }%
%
    \analyze\@temp
    \endgroup
}

\def\analyze#1{%
    \expandafter\analyze@#1\@end
}
\def\analyze@#1#2\@end{%
    \count@=`#1\relax
    \expandafter\hex\expandafter{\the\count@}
    \let\@next=\relax
    \if\relax\detokenize{#2}\relax\else
        \def\@next{\analyze@#2\@end}%
    \fi
    \@next
}

\def\hex#1{%
    \begingroup
    \count@=#1\relax
    \divide\count@ by 16
    \hexchar\count@
%
    \multiply\count@ by 16
    \advance\count@ by -#1\relax
    \multiply\count@ by -1
    \hexchar\count@
    \ifnum\count@=15\par\fi
    \endgroup
}
\def\hexchar#1{%
    \ifcase#10\or1\or2\or3\or4\or5\or6\or7\or8\or9\or A\or B\or C\or D\or E\or F\else x\fi
}

\makeatother

\begin{document}
\ttfamily
\showbinary{ascii.txt}
\end{document}

outputs

enter image description here

ascii.txt is a binary file including all characters from 0x00 to 0xFF. First all those characters are set to catcode 12 (other), then the file is \input'ed and its contents stored in a macro \@temp. Afterwards we iterate over each character in \@temp to output its hex representation.

As you can see, three characters are missing: 0x09 (\t), 0x0A (\n) and 0x0D (\r). The latter two are likely because TeX files are read in text mode and not in binary mode. Not sure if something can be done about that. The tab character is missing in this particular test file because TeX treats tabs like spaces when they occur at the end of a line (the \t is immediately followed by \n) and thus removes it from the input line.


You're using LaTeX, not Plain, so it doesn't make much sense to not use packages. With a bit of expl3 code you can make yourself a proper hexdump of a file.

My previous answer (see edit history) used a rather simple expl3 code to read in the file and make a hexdump out of it. However the code was rather slow (it took about 60 seconds to produce 7 pages of hexdump of a 6 kB file).

I made a slightly optimized version (takes about half a second to process the same file :-) with a few more niceties: it's faster, it has some key-val properties to control the output, it's faster, it uses \pdf@filedump from pdftexcmds to avoid losing line feeds and spaces, and it's much faster :-)

Here it is:

\documentclass{article}
\usepackage{pdftexcmds}
\usepackage{xparse}
\ExplSyntaxOn
\cs_new_eq:Nc \__hexdump_filedump:nnn { pdf@filedump }
\cs_new_eq:Nc \__hexdump_filesize:n { pdf@filesize }
\int_new:N \l__hexdump_begin_int
\int_new:N \l__hexdump_bytes_int
\int_new:N \l__hexdump_filesize_int
\int_new:N \l__hexdump_byte_int
\int_new:N \l__hexdump_byte_ptr_int
\int_new:N \l__hexdump_word_int
\int_new:N \l__hexdump_word_ptr_int
\int_new:N \l__hexdump_column_int
\int_new:N \l__hexdump_column_ptr_int
\int_new:N \l__hexdump_line_length_int
\int_new:N \l__hexdump_address_size_int
\int_new:N \l__hexdump_address_int
\bool_new:N \l__hexdump_address_bool
\tl_new:N \l__hexdump_dump_tl
\tl_new:N \l__hexdump_font_tl
\tl_new:N \l__hexdump_visible_tl
\clist_new:N \l__hexdump_cols_clist
\seq_new:N \l__hexdump_cols_seq
\cs_generate_variant:Nn \str_count:n { f }
\keys_define:nn { hexdump }
  {
    , begin   .int_set:N   = \l__hexdump_begin_int
    , begin   .initial:n   = { 0 }
    , length  .int_set:N   = \l__hexdump_bytes_int
    , length  .initial:n   = { -1 }
    , byte    .int_set:N   = \l__hexdump_byte_int
    , byte    .initial:n   = { 2 }
    , columns .clist_set:N = \l__hexdump_cols_clist
    , columns .initial:n   = { 4, 4 }
    , font    .tl_set:N    = \l__hexdump_font_tl
    , font    .initial:n   = \ttfamily
  }
\NewDocumentCommand \hexdump { o m }
  {
    \group_begin:
      \IfValueT {#1} { \keys_set:nn { hexdump } {#1} }
      \hexdump:n {#2}
    \group_end:
  }
\cs_new_protected:Npn \hexdump:n #1
  {
    \file_if_exist:nTF {#1}
      { \__hexdump_read:n {#1} }
      { \msg_error:nnn { hexdump } { file-not-found } {#1} }
  }
\cs_new_protected:Npn \__hexdump_read:n #1
  {
    \int_set:Nn \l__hexdump_filesize_int { \__hexdump_filesize:n {#1} }
    \__hexdump_assert_int:Nnn \l__hexdump_begin_int
      { \c_zero_int } { \l__hexdump_filesize_int }
    \int_compare:nNnT { \l__hexdump_bytes_int } = { -1 }
      { \int_set:Nn \l__hexdump_bytes_int { \l__hexdump_filesize_int } }
      {
        \__hexdump_assert_int:Nnn \l__hexdump_bytes_int
          { \c_zero_int } { \l__hexdump_filesize_int }
      }
    \tl_set:Nx \l__hexdump_dump_tl
      {
        \__hexdump_filedump:nnn
          { \l__hexdump_begin_int } { \l__hexdump_bytes_int }
          {#1}
      }
    \tl_map_function:nN { \. \? \! \: \; \, } \__hexdump_french_spacing:N
    \tl_use:N \l__hexdump_font_tl
    \__hexdump:N \l__hexdump_dump_tl
  }
\cs_new_protected:Npn \__hexdump_french_spacing:N #1
  { \char_set_sfcode:nn { `#1 } { 1000 } }
\cs_new_protected:Npn \__hexdump_assert_int:Nnn #1 #2 #3
  { \int_set:Nn #1 { \int_min:nn { \int_max:nn { #1 } { #2 } } { #3 } } }
\msg_new:nnn { hexdump } { file-not-found }
  { File~`#1'~not~found. }
\cs_new_protected:Npn \__hexdump:N #1
  {
    \__hexdump_initialise:
    \exp_last_unbraced:NV \__hexdump:NNw #1
      \q_recursion_tail \q_recursion_tail \q_recursion_stop
  }
\cs_new_protected:Npn \__hexdump_initialise:
  {
    \seq_set_from_clist:NN \l__hexdump_cols_seq \l__hexdump_cols_clist
    \int_set:Nn \l__hexdump_word_int { \seq_item:Nn \l__hexdump_cols_seq { 1 } }
    \int_set:Nn \l__hexdump_column_int { \seq_count:N \l__hexdump_cols_seq }
    \int_set:Nn \l__hexdump_address_size_int
      { \str_count:f { \int_to_hex:n { \l__hexdump_bytes_int } } }
    \int_set_eq:NN \l__hexdump_address_int \l__hexdump_begin_int
    \int_set:Nn \l__hexdump_line_length_int
      { \l__hexdump_byte_int * ( \seq_use:Nn \l__hexdump_cols_seq { + } ) }
    \exp_args:NNf \seq_put_right:Nn \l__hexdump_cols_seq
      { \seq_item:Nn \l__hexdump_cols_seq { 1 } }
    \bool_set_true:N \l__hexdump_address_bool
    \int_zero:N \l__hexdump_byte_ptr_int
    \int_zero:N \l__hexdump_word_ptr_int
    \int_zero:N \l__hexdump_column_ptr_int
  }
\cs_new_protected:Npn \__hexdump:NNw #1 #2
  {
    \quark_if_recursion_tail_stop_do:Nn #1
      { \__hexdump_end: }
    \bool_if:NT \l__hexdump_address_bool { \__hexdump_address: }
    #1 #2
    \tl_put_right:Nx \l__hexdump_visible_tl
      {
        \__hexdump_if_visible_ascii:nTF { "#1#2 }
          { \char_generate:nn { "#1#2 } { 12 } }
          { . }
      }
    \__hexdump_ptr_check:
    \__hexdump:NNw
  }
\cs_new_protected:Npn \__hexdump_ptr_check:
  {
    \__hexdump_ptr_step:nn { byte }
      {
        \c_space_tl
        \__hexdump_ptr_step:nn { word }
          {
            \int_set:Nn \l__hexdump_word_int
              {
                \seq_item:Nn \l__hexdump_cols_seq
                  { \l__hexdump_column_ptr_int + 2 }
              }
            \c_space_tl
            \__hexdump_ptr_step:nn { column }
              { \tex_unskip:D \__hexdump_dump_visible: }
          }
      }
  }
\cs_new_protected:Npn \__hexdump_ptr_step:nn #1 #2
  {
    \int_incr:c { l__hexdump_#1_ptr_int }
    \int_compare:nNnT
        { \int_use:c { l__hexdump_#1_ptr_int } }
          =
        { \int_use:c { l__hexdump_#1_int } }
      {
        \int_zero:c { l__hexdump_#1_ptr_int }
        #2
      }
  }
\prg_new_protected_conditional:Npnn \__hexdump_if_visible_ascii:n #1 { TF }
  {
    \int_compare:nNnTF {#1} > {31}
      {
        \int_compare:nNnTF {#1} < {127}
          { \prg_return_true: }
          { \prg_return_false: }
      }
      { \prg_return_false: }
  }
\cs_new_protected:Npn \__hexdump_address:
  {
    \bool_set_false:N \l__hexdump_address_bool
    \exp_args:Nf \__hexdump_address:nn
      { \str_count:f { \int_to_hex:n { \l__hexdump_address_int } } }
      { \l__hexdump_address_size_int }
    \int_add:Nn \l__hexdump_address_int { \l__hexdump_line_length_int }
  }
\cs_new_protected:Npn \__hexdump_address:nn #1 #2
  {
    \prg_replicate:nn { #2 - #1 } { 0 }
    \int_to_hex:n { \l__hexdump_address_int } : ~
  }
\cs_new_protected:Npn \__hexdump_dump_visible:
  {
    | \tl_use:N \l__hexdump_visible_tl |
    \tl_clear:N \l__hexdump_visible_tl
    \bool_set_true:N \l__hexdump_address_bool
    \tex_par:D
  }
\cs_new_protected:Npn \__hexdump_end:
  {
    \bool_if:NF \l__hexdump_address_bool
      {
        \c_space_tl \c_space_tl
        \tl_put_right:Nn \l__hexdump_visible_tl { ~ }
        \__hexdump_ptr_check:
        \__hexdump_end:
      }
  }
\ExplSyntaxOff
\begin{document}
\hexdump{somebinary.file}
\end{document}

Visible bytes (ASCII 32 – 126) are printed, and everything else is represented by a . in the right pane:

enter image description here


Here's an expandable version (using two “forbidden functions”), using ideas of siracusa.

\documentclass{article}
\usepackage{xparse}

\ExplSyntaxOn

\NewExpandableDocumentCommand{\hexdump}{O{~}m}
 {
  \awa_hexdump:ne {#1} { \tex_filedump:D~offset~0~length~\tex_filesize:D{#2}{#2} }
 }
% there's not yet an official interface to \pdffiledump and \filesize

\cs_new:Nn \awa_hexdump:nn
 {
  \__awa_hexdump_read_byte:nNNN {#1} #2 \q_nil \q_stop
 }
\cs_generate_variant:Nn \awa_hexdump:nn { ne }

\cs_new:Nn \__awa_hexdump_read_byte:nNNN
 {
  \quark_if_nil:nTF { #4 }
   % true: print the last two digits and ignores the trailer
   { #2#3 \use_none:n }
   % false: print two digits, a comma and some space
   { #2#3#1 \__awa_hexdump_read_byte:nNNN { #1 } #3 }
 }

\ExplSyntaxOff

\begin{document}

\raggedright\ttfamily
\hexdump{cmr10.tfm}

\hexdump[,\hspace{0pt plus 1fill}]{\jobname.tex}

\end{document}

I used a copy of the standard cmr10.tfm file. The optional argument (default a space) is for controlling the delimiter between two bytes.

The picture shows the last two lines from the first call and the first two lines from the second call.

enter image description here

A check for the existence of the file can be easily added.