Capitalizing strings ignoring closed class words

\documentclass[a4paper]{article}
\usepackage[latin1]{inputenc}
\usepackage{xparse}
\ExplSyntaxOn
\NewDocumentCommand{\capitalize}{>{\SplitList{~}}m}{
  \CapitalizeFirst#1\Capitalize\unskip
}
\ExplSyntaxOff
\def\Sentinel{\Capitalize}
\def\CapitalizeFirst#1{\MakeUppercase#1 \Capitalize}
\def\Capitalize#1{%
  \def\next{#1}%
  \ifx\next\Sentinel
    \expandafter\unskip
  \else
    \CheckInList{#1}\space\expandafter\Capitalize
  \fi}
\def\CheckInList#1{%
  \ifcsname List@\detokenize{#1}\endcsname
    #1%
  \else
    \MakeUppercase#1%
  \fi}
\makeatletter
\def\AppendToList#1{%
  \@for\next:=#1\do
  {\expandafter\let\csname List@\detokenize\expandafter{\next}\endcsname\empty}}
\makeatother
\AppendToList{a,is,of}

\begin{document}
\capitalize{here is a list of words école}
\end{document}

This won't work with UTF-8 input under pdflatex (it does under XeLaTeX or LuaLaTeX), because \MakeUppercase is applied only to the first byte of a two-, three- or four-byte sequence (for Western languages probably only two). For it to work, the whole sequence of bytes has to be fed to \MakeUppercase.

To be clearer: \MakeUppercase uppercases its argument, and the usual call is \MakeUppercase{word}. Here we're writing \MakeUppercase#1 instead (without braces), so only the first token (usually a character) is uppercased. This is where it fails with input such as \'ecole: the token passed to \MakeUppercase would be \', which it doesn't know what to do with. With école (and a one-byte encoding such as latin1), \MakeUppercase processes é and gives the correct result.

With UTF-8 this would fail: what we see as é on screen when writing a LaTeX document is actually two bytes (C3 and A9), and again \MakeUppercase would be passed only the first one. So a more complex routine is necessary.
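The braced versus unbraced behaviour described above can be seen with plain ASCII input. This is only a minimal sketch to illustrate how TeX grabs an undelimited argument; it is not part of the solution itself.

```latex
\documentclass{article}
\begin{document}
% Braced: the whole word is the argument, so every letter is uppercased.
\MakeUppercase{word}

% Unbraced: TeX takes only the next token (w) as the argument,
% so just the first letter is uppercased and "ord" is typeset as is.
\MakeUppercase word
\end{document}
```

This is exactly why the unbraced form suits capitalizing a first letter, and why it breaks as soon as the first "letter" is really several tokens.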

To make this work with pdflatex and UTF-8, the definitions of \CheckInList and \CapitalizeFirst above can be changed to the following (the code must sit between \makeatletter and \makeatother, since the macro names contain @):

\def\CapitalizeFirst#1{\expandafter\UC@next#1 \Capitalize}
\def\CheckInList#1{%
  \ifcsname List@\detokenize{#1}\endcsname
    #1%
  \else
    \expandafter\UC@next#1%
  \fi}
\def\UC@next#1{%
  \ifx#1\UTFviii@two@octets
     \expandafter\@firstoffour
  \else
    \ifx#1\UTFviii@three@octets
      \expandafter\expandafter\expandafter\@secondoffour
    \else
      \ifx#1\UTFviii@four@octets
        \expandafter\expandafter\expandafter\expandafter\expandafter
        \@thirdoffour
      \else
        \expandafter\expandafter\expandafter\expandafter\expandafter
        \expandafter\expandafter\@fourthoffour
      \fi
    \fi
  \fi
  {\UC@two}{\UC@three}{\UC@four}{\MakeUppercase}#1}
\def\UC@two#1#2#3{\MakeUppercase{#1#2#3}}
\def\UC@three#1#2#3#4{\MakeUppercase{#1#2#3#4}}
\def\UC@four#1#2#3#4#5{\MakeUppercase{#1#2#3#4#5}}
\providecommand\@firstoffour[4]{#1}
\providecommand\@secondoffour[4]{#2}
\providecommand\@thirdoffour[4]{#3}
\providecommand\@fourthoffour[4]{#4}

However, accent commands are not allowed (nor are they in the other version).


UPDATE

After a few years, here's a better implementation, thanks to new expl3 features; it works for all engines.

\documentclass[a4paper]{article}

\usepackage{ifxetex}

\ifxetex
  \usepackage{fontspec}
\else
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
\fi

\usepackage{xparse}

\ExplSyntaxOn
\NewDocumentCommand{\capitalize}{>{\SplitList{~}}m}
 {
  \seq_clear:N \l_capitalize_words_seq
  \ProcessList{#1}{\CapitalizeFirst}
  \seq_use:Nn \l_capitalize_words_seq { ~ }
 }
\NewDocumentCommand{\CapitalizeFirst}{m}
 {
  \capitalize_word:n { #1 }
 }

\sys_if_engine_pdftex:TF
 {
  \cs_set_eq:Nc \capitalize_tl_set:Nn { protected@edef }
 }
 {
  \cs_set_eq:NN \capitalize_tl_set:Nn \tl_set:Nn
 }

\cs_new_protected:Nn \capitalize_word:n
 {
  \capitalize_tl_set:Nn \l_capitalize_word_tl { #1 }
  \seq_if_in:NfTF \g_capitalize_exceptions_seq { \tl_to_str:n { #1 } }
   % exception word
   { \seq_put_right:Nn \l_capitalize_words_seq { #1 } } % exception word
   % to be uppercased
   { \seq_put_right:Nx \l_capitalize_words_seq { \tl_mixed_case:V \l_capitalize_word_tl } }
 }
\cs_generate_variant:Nn \tl_mixed_case:n { V }
\NewDocumentCommand{\AppendToList}{m}
 {
  \clist_map_inline:nn { #1 }
   {
    \seq_gput_right:Nx \g_capitalize_exceptions_seq { \tl_to_str:n { ##1 } }
   }
 }
\cs_generate_variant:Nn \seq_if_in:NnTF { Nf }
\seq_new:N \l_capitalize_words_seq
\seq_new:N \g_capitalize_exceptions_seq
\ExplSyntaxOff

\AppendToList{a,is,of,óf}

\begin{document}
X\capitalize{here is a list of words óf école}X
\end{document}
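The code above relies on \tl_mixed_case:n, which, to my knowledge, has been dropped from recent expl3 releases in favour of the \text_... functions. What follows is a simplified sketch of the same idea, assuming a LaTeX kernel from 2020-10 or later (so that \NewDocumentCommand and \ExplSyntaxOn are available without packages) and an l3kernel that provides \text_titlecase_first:n. The capsketch variable names are made up for this illustration, and the sketch compares words token by token instead of through \tl_to_str:n.

```latex
\documentclass{article}

\ExplSyntaxOn
% Hypothetical variables for this sketch only
\seq_new:N \g_capsketch_exceptions_seq
\seq_new:N \l_capsketch_in_seq
\seq_new:N \l_capsketch_out_seq

% Add a comma-separated list of words that must stay lowercase
\NewDocumentCommand{\AppendToList}{m}
 {
  \clist_map_inline:nn { #1 }
   { \seq_gput_right:Nn \g_capsketch_exceptions_seq { ##1 } }
 }

\NewDocumentCommand{\capitalize}{m}
 {
  % Split the input at spaces and process it word by word
  \seq_set_split:Nnn \l_capsketch_in_seq { ~ } { #1 }
  \seq_clear:N \l_capsketch_out_seq
  \seq_map_inline:Nn \l_capsketch_in_seq
   {
    \seq_if_in:NnTF \g_capsketch_exceptions_seq { ##1 }
     % exception word: keep as is
     { \seq_put_right:Nn \l_capsketch_out_seq { ##1 } }
     % any other word: capitalize its first letter
     {
      \seq_put_right:Nn \l_capsketch_out_seq
       { \text_titlecase_first:n { ##1 } }
     }
   }
  % Typeset the words again, separated by spaces
  \seq_use:Nn \l_capsketch_out_seq { ~ }
 }
\ExplSyntaxOff

\AppendToList{a,is,of}

\begin{document}
\capitalize{here is a list of words école}
\end{document}
```

Because \text_titlecase_first:n is meant to cope with multibyte UTF-8 input in pdflatex, the engine-dependent \protected@edef step of the original should not be needed here, but I have not tested this across engines.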


A ConTeXt solution:

You can use the command \applytosplitstringwordspaced for this:

\def\IgnoredWords
  {a,is,to,of,or,and}

\define[1]\CapitalizeWithIgnoreWord
  {\doifinsetelse{#1}\IgnoredWords{#1}{\Words{#1}}}

\def\CapitalizeWithIgnore
  {\applytosplitstringwordspaced\CapitalizeWithIgnoreWord}

\starttext
  \CapitalizeWithIgnore{This is some of my input or another and to the end.}
\stoptext

The \applytosplitstringwordspaced command splits the input into words and applies the macro \CapitalizeWithIgnoreWord, which takes one argument, to each of them. That macro simply tests whether the given word is a member of the ignore list: if it is, the word is printed unchanged; otherwise it is printed capitalized via \Words.


The titlecaps package is newly introduced and demonstrated here: Headings in uppercase. It handles titling of diacritical marks (e.g., umlauts) and national symbols (e.g., oe), and it is compatible with (i.e., its argument can include) commands that change font characteristics, such as \textit{}, \scshape, and \footnotesize. It also lets you designate words, for example prepositions and conjunctions, that are to be kept lowercase rather than titled. Punctuation should not affect the package's ability either to capitalize a word or to recognize it as a designated lowercase word.
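A minimal sketch of how this might look in practice. The command names \titlecap and \Addlcwords are not given in the text above; they are quoted from the package documentation as I recall it, so check them against your installed version.

```latex
\documentclass{article}
\usepackage{titlecaps}

% Words to keep lowercase, given as a space-separated list
\Addlcwords{a is of}

\begin{document}
% Capitalizes each word except those registered with \Addlcwords
\titlecap{here is a list of words}
\end{document}
```

Note that the exception list is separated by spaces here, not by commas as in the \AppendToList macros of the other answers.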