Macro for the average width of a character

The macros are fairly low-level TeX, so they are easy to use in LaTeX once a few missing definitions are added. With these definitions in place, you can simply \input lang-frq.mkii,
lang-frd.mkii, and the helper file supp-mis.mkii (on the destination page, click raw to download) and use ConTeXt's \averagecharwidth directly.

% Copy definition of \emptybox from supp-box.mkii
\ifx\voidbox\undefined      \newbox\voidbox \fi
\def\emptybox{\box\voidbox}

% Copy definition of \startnointerference from syst-new.mkii
\newbox\nointerferencebox

\def\startnointerference
  {\setbox\nointerferencebox\vbox
   \bgroup}

\def\stopnointerference
  {\egroup
   \setbox\nointerferencebox\emptybox}

% Load a trimmed down version of ConTeXt macros
\input supp-mis.mkii

\input lang-frq.mkii 
\input lang-frd.mkii

% Set the main language. (I don't know what the LaTeX equivalent of
% \currentmainlanguage is.)
\def\currentmainlanguage{en}

\documentclass{article}
\begin{document}
The average character width is \the\averagecharwidth
\end{document}

NOTE: Comment out line 116 of lang-frd.mkii (the one that reads \startcharactertable[en] 100 x \stopcharactertable % kind of default).


Here's a naive approach.

  1. Store the entire document in a token list
  2. Count the number of occurrences of each alphabetic character (mostly)
  3. Divide each character count by total number of alphabetic characters to get the relative frequency of that character.
  4. Multiply that ratio by the width of the character and sum to get average character width.
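
Steps 3 and 4 amount to a frequency-weighted mean. Written out, with n_c the number of occurrences of character c, N the total count, and w_c the width of c (these symbols are introduced here for illustration and do not appear in the code):

```latex
\[
  \bar{w} \;=\; \sum_{c} \frac{n_c}{N}\, w_c ,
  \qquad
  N \;=\; \sum_{c} n_c .
\]
```

A text width aimed at 66 characters per line is then 66 times \bar{w}.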

Some notes:

  • Brace groups and control sequences each count as a single token, so things like \begin{environment} and \par won't match any alphabetic characters. This is an advantage.
  • At the same time, the words within \text{some text} won't get counted. This is a disadvantage.
  • Capital letters can be taken into account but it is slow.
  • I don't think that I missed anything significant, but you never know.
  • Edit: Spaces are now included in the calculation, and the effect of the macro is cumulative. In dealing with spaces, I made the assumption that stretching and shrinking cancel one another in the long run and that the average width of a space is just the normal width of a space. Someone please let me know if there's a better way to deal with that.
  • Edit: Compile twice to automatically adjust textwidth to desired value.
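
Two of the notes above can be checked in isolation. The following is a minimal sketch, separate from the code below: it shows that expl3 counts a brace group as one item, and it prints the natural width, stretch, and shrink of the interword space in the current font, which is what the assumption about spaces relies on.

```latex
\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
% A brace group is a single item: this prints 4 (a, b, {cd}, e),
% so the c and d inside the group are never looked at.
\tl_count:n { ab{cd}e }
\ExplSyntaxOff

% Natural width, stretch and shrink of the interword space
% in the current font (\fontdimen 2, 3 and 4 respectively).
\the\fontdimen2\font, \the\fontdimen3\font, \the\fontdimen4\font
\end{document}
```

If the stretch and shrink values are small compared with the natural width, treating a space as having its natural width should not distort the average much.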

Anyway, for straight text, this gives an exact average character width. The result becomes less accurate the more printed text is hidden in brace groups.

\documentclass{article}
\usepackage{xparse}
\usepackage{siunitx}
\usepackage{booktabs}
\usepackage{environ}
\usepackage{longtable} % needed for the longtable used in \showtable

\ExplSyntaxOn

\bool_new:N \g_has_run_bool
\tl_new:N \l_aw_text_tl
\int_new:N \l_aw_tot_int
\int_new:N \g_aw_tot_alph_int
\fp_new:N \g_wid_space_fp
\int_new:N \g_space_int
\fp_new:N \g_rat_space_fp
\fp_new:N \g_aw_avg_width_fp
\dim_new:N \myalphabetwidth
\dim_new:N \mytextwidth
\input{testing.aux}
\tl_const:Nx \c_aw_the_alphabet_tl {abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ,.;?()!' \token_to_str:N :}

% this can be changed to an environment or renamed or whatever
\NewDocumentCommand {\avgwidthstart} {}
  {
    \aw_avg_width:w
  }

\NewDocumentCommand {\avgwidthend}{}{}

% Here is the environment version; using just "text" as a name is probably a bad idea.
\NewEnviron{awtext}
{
  \expandafter\avgwidthstart\BODY\avgwidthend
}

\makeatletter

\cs_new:Npn \aw_avg_width:w #1 \avgwidthend
  {
    % if first run, then generate variables to be used
    \bool_if:NF \g_has_run_bool
      {
        \tl_map_inline:Nn \c_aw_the_alphabet_tl
        {
          \int_new:c {g_##1_int}
          \fp_new:c {g_rat_##1_fp}
          \fp_new:c {g_wid_##1_fp}
        }
      }
    \tl_set:Nn \l_aw_text_tl {#1}

    % this can be used rather than the preceding line to take capital 
    % letters into account, but is Slooooooow
    %\tl_set:Nx \l_aw_text_tl {\tl_expandable_lowercase:n {#1}}

    \int_set:Nn \l_aw_tot_int {\tl_count:N \l_aw_text_tl}
    \tl_map_function:NN \c_aw_the_alphabet_tl \aw_get_counts:n
    \deal_with_spaces:n {#1}
    \tl_map_function:NN \c_aw_the_alphabet_tl \aw_calc_ratios:n
    \tl_map_function:NN \c_aw_the_alphabet_tl \aw_calc_avg_width:n
    \fp_gset_eq:NN \g_aw_avg_width_fp \l_tmpa_fp
    \fp_zero:N \l_tmpa_fp

    % the dimension \myalphabetwidth gives the width of the alphabet based on your character freq,
    % can be accessed by \the\myalphabetwidth
    \dim_gset:Nn \myalphabetwidth {\fp_to_dim:n {\fp_eval:n {61*\g_aw_avg_width_fp}}}

    % the dimension \mytextwidth gives the recommended \textwidth based on 66 chars per line.
    % can be accessed by \the\mytextwidth
    \dim_gset:Nn \mytextwidth {\fp_to_dim:n {\fp_eval:n {66*\g_aw_avg_width_fp}}}
    \protected@write\@mainaux{}{\mytextwidth=\the\mytextwidth}
    \bool_gset_true:N \g_has_run_bool

    % and lastly print the content
    #1
  }

\makeatother

\cs_new:Npn \aw_get_counts:n #1
  {
    % make a temporary token list from the document body 
    \tl_set_eq:NN \l_tmpb_tl \l_aw_text_tl
    % remove all occurrences of the character
    \tl_remove_all:Nn \l_tmpb_tl {#1}
    % occurrences in current block = total token count minus the count after removal
    \int_set:Nn \l_tmpa_int {\int_eval:n{\l_aw_tot_int -\tl_count:N \l_tmpb_tl}}
    % add to appropriate int the number of occurrences of that character in current block
    \int_gadd:cn {g_#1_int} {\l_tmpa_int}
    % add this to the total
    \int_gadd:Nn \g_aw_tot_alph_int {\l_tmpa_int}
  }

\cs_new:Npn \deal_with_spaces:n #1
  {
    \tl_set:Nn \l_tmpa_tl {#1}
    % rescan body with spaces as characters
    \tl_set_rescan:Nnn \l_tmpb_tl {\char_set_catcode_letter:N \ }{#1}
    % find number of new characters introduced.  add to number of spaces and alph chars
    \int_set:Nn \l_tmpa_int {\tl_count:N \l_tmpb_tl -\tl_count:N \l_tmpa_tl}
    \int_gadd:Nn \g_space_int {\l_tmpa_int}
    \int_gadd:Nn \g_aw_tot_alph_int {\l_tmpa_int}
    % since this comes after the rest of chars are dealt with, tot_alph is final total
    \fp_gset:Nn \g_rat_space_fp {\g_space_int/\g_aw_tot_alph_int}
    % get width of space and use it.  obviously space is stretchable, so i'll assume
    % that the expansions and contractions cancel one another over large text.  is this
    % a terrible assumption???
    \hbox_set:Nn \l_tmpa_box {\ }
    \fp_gset:Nn \g_wid_space_fp {\dim_to_fp:n {\box_wd:N \l_tmpa_box}}
    \fp_add:Nn \l_tmpa_fp {\g_wid_space_fp*\g_rat_space_fp}
  }

\cs_new:Npn \aw_calc_ratios:n #1
  {
    % divide number of occurrences of char by total alphabetic chars
    \fp_gset:cn {g_rat_#1_fp}{{\int_use:c {g_#1_int}}/\g_aw_tot_alph_int}
  }

\cs_new:Npn \aw_calc_avg_width:n #1
  {
    % only need to find char widths once
    \bool_if:NF \g_has_run_bool
      {
        % find width of char box
        \hbox_set:Nn \l_tmpa_box {#1}
        \fp_gset:cn {g_wid_#1_fp}{\dim_to_fp:n {\box_wd:N \l_tmpa_box}}
      }
    % multiply it by char frequency and add to avg width
    \fp_add:Nn \l_tmpa_fp {{\fp_use:c {g_wid_#1_fp}}*{\fp_use:c {g_rat_#1_fp}}}
  }
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This part is just for fun. Delete it and the showtable command from the document if
% it isn't wanted
\tl_new:N \l_aw_tab_rows_tl
\seq_new:N \g_aw_the_alphabet_seq

\NewDocumentCommand {\showtable}{}
    {
      \clearpage
      \aw_make_table:
    }

\cs_generate_variant:Nn \seq_set_split:Nnn {NnV}
\cs_new:Npn \aw_make_table:
    {
      \thispagestyle{empty}
      \seq_set_split:NnV \g_aw_the_alphabet_seq {} \c_aw_the_alphabet_tl
      \seq_map_function:NN \g_aw_the_alphabet_seq \aw_generate_row:n
      \begin{table}
      \centering
      \sisetup{round-mode = places,round-precision = 5,output-decimal-marker={,},table-format = 3.5}
      \begin{tabular}{lll}
        \toprule
        {Average\,text\,width}&{Average\,character\,width}&{Average\,alphabet\,width}\\
        \midrule
        \the\mytextwidth&\fp_eval:n {round(\g_aw_avg_width_fp,5)}pt&\the\myalphabetwidth\\
        \bottomrule
      \end{tabular}\par
      \end{table}
      \vfil
      \centering
      \sisetup{round-mode = places,round-precision = 5,output-decimal-marker={,},table-format = 3.5}
      \begin{longtable}{cS}
        \toprule
        {Letter}&{Actual}\\
        \midrule
        spaces&\fp_eval:n {\g_rat_space_fp*100}\%\\
        \tl_use:N \l_aw_tab_rows_tl
        \bottomrule
      \end{longtable}\par
    }

\cs_new:Npn \aw_generate_row:n #1
    {
      \tl_put_right:Nn \l_aw_tab_rows_tl {#1&}
      \tl_put_right:Nx \l_aw_tab_rows_tl {\fp_eval:n {100*{\fp_use:c {g_rat_#1_fp}}}\%}
      \tl_put_right:Nn \l_aw_tab_rows_tl {\\}
    }

\ExplSyntaxOff

    \begin{document}

    \avgwidthstart
    My audit group's Group Manager and his wife have an infant I can describe only as fierce.
    Its expression is fierce; its demeanor is fierce; its gaze over bottle or pacifier or finger-fierce, 
    intimidating, aggressive. I have never heard it cry. When it feeds or sleeps, its pale face reddens,
    which makes it look all the fiercer.
    \avgwidthend

    \avgwidthstart
    On those workdays when our Group Manager, Mr. Yeagle, brought it in to the District office, hanging papoose-style in a nylon device on his back, the infant appeared to 
    be riding him as a mahout does an elephant. It hung there, radiating authority. Its back lay directly 
    against Mr. Yeagle's, its large head resting in the hollow of its father's neck and forcing our Group 
    Manager's head out and down into a posture of classic oppression. They made a creature with two faces,
    one of which was calm and blandly adult and the other unformed and yet emphatically fierce. The infant 
    never wiggled or fussed in the device. Its gaze around the corridor at the rest of us gathered waiting 
    for the morning elevator was level and unblinking and (it seemed) almost accusing. The infant's face, as 
    I experienced it, was mostly eyes and lower lip, its nose a mere pinch, its forehead milky and domed, 
    its pale red hair wispy, no eyebrows or lashes or even eyelids I could see. I never saw it blink. Its 
    features seemed suggestions only. It had roughly as much face as a whale does. I did not like it at all.\par\noindent
    http://harpers.org/media/pdf/dfw/HarpersMagazine-2008-02-0081893.pdf
    \avgwidthend

    \begin{awtext}
    Here is some more text in an environment this time.  This text is included in the calculation of the average width.
    \end{awtext}
    \showtable{}

    \end{document}
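
The listing stores the recommended width in \mytextwidth and writes it to the .aux file, but nothing in it assigns that value to \textwidth. One way to close the loop, as an untested sketch, is to put something like the following in the preamble after \ExplSyntaxOff. The guard against 0pt is an assumption added here, so that the assignment is skipped when no value has been read back yet.

```latex
% Hypothetical addition, not part of the original listing:
% use the width read back from the .aux file, if there is one.
\ifdim\mytextwidth>0pt
  \setlength{\textwidth}{\mytextwidth}
\fi
```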

Here are the character frequencies for the given text

Explanation

The gist I get from this "average width of a character" thing is the following.

  • People have decided that having ~66 characters per line improves readability of text.
  • Since line width is fixed, the actual number of characters per line depends on which characters are typed, e.g., a line of all m's will contain fewer characters than a line of all i's, since an m is wider than an i.
  • Thus, to set a reasonable line width to approximate 66 characters per line, we need to know the relative frequencies of the characters that are used in the document. If most of the characters are wide, then we need wider lines. If most of the characters are narrow then we need correspondingly narrower lines.
  • Therefore, we calculate the average width of the characters that are used and use this to determine line width. For example, if our document consists of m's and i's in an equal ratio (50/50), then "the average character" has a width somewhere between that of an m and that of an i. Specifically, the average character has width x=(wd(m)+wd(i))/2, and we should set our \textwidth to 66*x. Extrapolating to an arbitrary document, we calculate the weighted average of the widths of the characters used, according to their relative frequencies within the document, and multiply this by 66 (or use it in whatever way) to get the \textwidth that best accommodates the 66-characters-per-line criterion.
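
The m-and-i example can be tried directly. This is a minimal sketch, assuming an engine with e-TeX's \dimexpr; the length names \wdm, \wdi and \avgwd are made up for this illustration.

```latex
\documentclass{article}
\newlength{\wdm}
\newlength{\wdi}
\newlength{\avgwd}
\begin{document}
\settowidth{\wdm}{m}% width of an m in the current font
\settowidth{\wdi}{i}% width of an i in the current font
% x = (wd(m) + wd(i)) / 2
\setlength{\avgwd}{\dimexpr(\wdm+\wdi)/2\relax}
Average width: \the\avgwd.
Suggested text width for 66 characters: \the\dimexpr 66\avgwd\relax.
\end{document}
```

For a real document the two equal weights are replaced by the measured relative frequencies, which is what the code above does.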