How can I use \dtlsort with different alphabets?

This problem might be similar to that of sorting index entries. When you want to force a specific sort placement (indirectly, a sort order), you have to provide the sorting algorithm with some help.

In your application, I would supply a regular English alphabet replacement or Norsk English that will be used to sort entries on. To that end, create an additional field in your database where you replace each æ with za, each ø with zb and each å with zc (say), in order to have them sorted sequentially after z.

enter image description here

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{datatool,xparse}

\NewDocumentCommand{\vocab}{m o m m}{%\parskip=0pt \textbf{#1} (#2) #3~\par%
  \DTLnewrow{vocabsort}%
  \DTLnewdbentry{vocabsort}{norsk}{#1}% Norsk
  \IfNoValueF{#2}{\DTLnewdbentry{vocabsort}{norsksort}{#2}}% Norsk sort
  \DTLnewdbentry{vocabsort}{class}{#3}% Class
  \DTLnewdbentry{vocabsort}{english}{#4}% English
}
\newenvironment{vocabsection}[1]{\noindent \textbf{#1} \par \medskip
  \DTLifdbexists{vocabsort}{\DTLcleardb{vocabsort}}{\DTLnewdb{vocabsort}}%
}% 
{ \dtlsort[norsk]{norsksort}{vocabsort}{\dtlicompare}%
  \displaysorteddb
}
\newcommand{\displaysorteddb}{%
  \begin{DTLenvforeach*}{vocabsort}{\norsk=norsk, \class=class, \english=english}
    \textbf{\norsk} (\class) \english~\par
  \end{DTLenvforeach*}
}

\begin{document}

\begin{vocabsection}{Body Vocab}     
  \vocab{hør}[hzbr]{n.}{hair}  % ø > zb
  \vocab{hode}{n.}{head}
  \vocab{øye}[zbye]{n.}{eyes}  % ø > zb
  \vocab{øre}[zbre]{n.}{ear}   % ø > zb
  \vocab{hake}{n.}{chin}
  \vocab{nakke}{n.}{neck}
  \vocab{hånd}[hzcnd]{n.}{hand}% å > zc
  \vocab{finger}{n.}{finger}
  \vocab{arm}{n.}{arm}
\end{vocabsection}

\end{document}

The sort order is supplied as an optional second argument using the interface

\vocab{<norsk>}[<norsk sort>]{<class>}{<english>}

The addition of <norsk sort> into the database is conditional on its existence. The macro

\dtlsort[norsk]{norsksort}{vocabsort}{\dtlicompare}

uses the field norsksort as sorting preference in the database vocabsort (using a case insensitive compare), and norsk only if norsksort is absent (null or \DTLstringnull).

As mentioned, this is no different than using the a@b suggestion with \index; that is, place b in the index, but sort it as a (see section 2.2 The Basics/p 4 of the makeindex documentation). The typical usage would involve something like \index{alpha@$\alpha$}. One could, of course, also change the above \vocab interface to take a similar input, if needed.

(You'll need version datatool v2.25 for this to work. I've only just uploaded it to CTAN, so it make take a day or so to reach the mirrors and TeX distributions.)

The comparison handlers provided by datatool (actually defined in datatool-base.sty) compare two strings by assigning a numerical code to each element of the strings. The code is assigned as follows:

If empty, the code is set to -1.
If the character (or macro) is identified as a word break, the code is set to the same as the character code for the space character (32).
If the "character" is actually a macro (command), the code is set to 0. As from v2.24, if this macro is identified as the first octet of a UTF-8 character (and the UTF-8 option has been enabled), then the code is instead obtained using \dtlsetUTFviiicharcode{octets}{register} (for case-sensitive comparisons) or \dtlsetUTFviiilccharcode{octets}{register} (for case-insensitive comparisons).
If none of the above apply, the code is set using \dtlsetcharcode{character}{register} (for case-sensitive comparisons) or \dtlsetlccharcode{character}{register} (for case-insensitive comparisons). These commands were introduced in v2.24. Earlier versions simply set the character code using TeX's backtick method (with \lccode for case-insensitive comparisons).

(Note that although these new commands were introduced in v2.24, the more recent v2.25 fixes a bug that affects this answer.)

There are therefore two approaches that depend on the document's input encoding.

Latin-1

(This should also work with other encodings where each character is viewed as a single TeX token.)

The two commands that are relevant in this case are \dtlsetcharcode and \dtlsetlccharcode. The default definitions are:

\newcommand*{\dtlsetcharcode}[2]{#2=`#1\relax}
\newcommand*{\dtlsetlccharcode}[2]{#2=\lccode`#1\relax}

(The first argument is the character and the second is a count register in which to store the character code.) Using this method, the codes for the Norwegian characters are as follows:

(For reference A -> 65, Z -> 90, a -> 97, z -> 122.) So this puts all the Norwegian characters after z in the above order. If you want to change this order, you need to redefine \dtlsetcharcode and \dtlsetlccharcode. For example, for the Norwegian alphabet:

\renewcommand*{\dtlsetcharcode}[2]{%
  \ifstrequal{#1}{æ}%
  {%
    #2=123\relax
  }%
  {%
    \ifstrequal{#1}{ø}%
    {%
      #2=124\relax
    }%
    {%
      \ifstrequal{#1}{å}%
      {%
        #2=125\relax
      }%
      {%
        \ifstrequal{#1}{Æ}%
        {%
          #2=91\relax
        }%
        {%
          \ifstrequal{#1}{Ø}%
          {%
            #2=92\relax
          }%
          {%
            \ifstrequal{#1}{Å}%
            {%
              #2=93\relax
            }%
            {%
              #2=`#1\relax
            }%
          }%
        }%
      }%
    }%
  }%
}

\renewcommand*{\dtlsetlccharcode}[2]{%
  \ifstrequal{#1}{æ}%
  {%
    #2=123\relax
  }%
  {%
    \ifstrequal{#1}{ø}%
    {%
      #2=124\relax
    }%
    {%
      \ifstrequal{#1}{å}%
      {%
        #2=125\relax
      }%
      {%
        \ifstrequal{#1}{Æ}%
        {%
          #2=123\relax
        }%
        {%
          \ifstrequal{#1}{Ø}%
          {%
            #2=124\relax
          }%
          {%
            \ifstrequal{#1}{Å}%
            {%
              #2=125\relax
            }%
            {%
              #2=\lccode`#1\relax
            }%
          }%
        }%
      }%
    }%
  }%
}

These use the etoolbox conditional \ifstrequal for clarity. There are more efficient ways of doing this using TeX's conditionals, such as \ifnum. For example:

\renewcommand*{\dtlsetcharcode}[2]{%
  #2=`#1\relax
  \ifnum#2=230\relax
    #2=123\relax
  \else
    \ifnum#2=248\relax
      #2=124\relax
    \else
      \ifnum#2=229\relax
        #2=125\relax
      \else
       \ifnum#2=198\relax
          #2=91\relax
       \else
         \ifnum#2=216\relax
            #2=92\relax
         \else
           \ifnum#2=197\relax
              #2=93\relax
           \fi
         \fi
       \fi
      \fi
    \fi
  \fi
}

\renewcommand*{\dtlsetlccharcode}[2]{%
  #2=\lccode`#1\relax
  \ifnum#2=230\relax
    #2=123\relax
  \else
    \ifnum#2=248\relax
      #2=124\relax
    \else
      \ifnum#2=229\relax
        #2=125\relax
      \fi
    \fi
  \fi
}

These redefinitions can be added to your example:

\documentclass[11pt]{memoir}
\usepackage{multicol}
\usepackage[latin1]{inputenc}
\usepackage{datatool}

\renewcommand*{\dtlsetcharcode}[2]{%
  #2=`#1\relax
  \ifnum#2=230\relax
    #2=123\relax
  \else
    \ifnum#2=248\relax
      #2=124\relax
    \else
      \ifnum#2=229\relax
        #2=125\relax
      \else
       \ifnum#2=198\relax
          #2=91\relax
       \else
         \ifnum#2=216\relax
            #2=92\relax
         \else
           \ifnum#2=197\relax
              #2=93\relax
           \fi
         \fi
       \fi
      \fi
    \fi
  \fi
}

\renewcommand*{\dtlsetlccharcode}[2]{%
  #2=\lccode`#1\relax
  \ifnum#2=230\relax
    #2=123\relax
  \else
    \ifnum#2=248\relax
      #2=124\relax
    \else
      \ifnum#2=229\relax
        #2=125\relax
      \fi
    \fi
  \fi
}

\newcommand{\vocab}[3]{%
    \DTLnewrow{vocabsort}%
    \DTLnewdbentry{vocabsort}{norsk}{#1}%
    \DTLnewdbentry{vocabsort}{class}{#2}%
    \DTLnewdbentry{vocabsort}{english}{#3}%
}
\newenvironment{vocabsection}[1]{\noindent \hspace{1em} \textbf{#1} %
    \DTLifdbexists{vocabsort}{\DTLcleardb{vocabsort}}{\DTLnewdb{vocabsort}}%
}% 
{ \dtlsort{norsk}{vocabsort}{\dtlicompare}%
    \displaysorteddb
}
\newcommand{\displaysorteddb}{%
    \begin{multicols}{2}
        \begin{hangparas}{0.5em}{1}
            \begin{DTLenvforeach*}{vocabsort}{\norsk=norsk, \class=class, \english=english}
                \textbf{\norsk} (\class) \english~\par
            \end{DTLenvforeach*}
        \end{hangparas}
    \end{multicols}
}

\begin{document}

    \subsection{Foo}

    \begin{vocabsection}{Body Vocab}
    \vocab{hør}{n.}{hair}
    \vocab{hode}{n.}{head}
    \vocab{øye}{n.}{eyes}
    \vocab{øre}{n.}{ear}
    \vocab{hake}{n.}{chin}
    \vocab{nakke}{n.}{neck}
    \vocab{hånd}{n.}{hand}
    \vocab{finger}{n.}{finger}
    \vocab{arm}{n.}{arm}
    \end{vocabsection}

\end{document}\newcommand{\displaysorteddb}{%
    \begin{multicols}{2}
        \begin{hangparas}{0.5em}{1}
            \begin{DTLenvforeach*}{vocabsort}{\norsk=norsk, \class=class, \english=english}
                \textbf{\norsk} (\class) \english~\par
            \end{DTLenvforeach*}
        \end{hangparas}
    \end{multicols}
}

\begin{document}

    \subsection{Foo}

    \begin{vocabsection}{Body Vocab}
    \vocab{hør}{n.}{hair}
    \vocab{hode}{n.}{head}
    \vocab{øye}{n.}{eyes}
    \vocab{øre}{n.}{ear}
    \vocab{hake}{n.}{chin}
    \vocab{nakke}{n.}{neck}
    \vocab{hånd}{n.}{hand}
    \vocab{finger}{n.}{finger}
    \vocab{arm}{n.}{arm}
    \end{vocabsection}

\end{document}

This produces:

image of document

which has the desired ordering: hake, hode, hør, hånd.

UTF-8

UTF-8 support is enabled by loading inputenc with the utf8 option before loading datatool-base (which is automatically loaded by datatool). You can also enable or disable it later with \dtlenableUTFviii or \dtldisableUTFviii.

In this case, \dtlsetUTFviiicharcode and \dtlsetUTFviiilccharcode are used to set the character code where the first argument consists if the two octet tokens that make up the UTF-8 character. The default definitions of these commands are:

\newcommand*\dtlsetUTFviiicharcode[2]{\dtlsetdefaultUTFviiicharcode{#1}{#2}}
\newcommand*\dtlsetUTFviiilccharcode[2]{\dtlsetdefaultUTFviiilccharcode{#1}{#2}}

These default commands set the code for some common supplemental accented Latin characters to be the same as the code for the unaccented version. (For example, é is given the same character code as e. So a UTF-8 version of your example document is:

\documentclass[11pt]{memoir}
\usepackage{multicol}
\usepackage[utf8]{inputenc}
\usepackage{datatool}

\renewcommand*{\dtlsetUTFviiicharcode}[2]{%
  \ifstrequal{#1}{Æ}%
  {%
    #2=91\relax
  }%
  {%
    \ifstrequal{#1}{Ø}%
    {%
      #2=92\relax
    }%
    {%
      \ifstrequal{#1}{Å}%
      {%
        #2=93\relax
      }%
      {%
        \ifstrequal{#1}{æ}%
        {%
          #2=123\relax
        }%
        {%
          \ifstrequal{#1}{ø}%
          {%
            #2=124\relax
          }%
          {%
            \ifstrequal{#1}{å}%
            {%
              #2=125\relax
            }%
            {%
              \dtlsetdefaultUTFviiicharcode{#1}{#2}%
            }%
          }%
        }%
      }%
    }%
  }%
}

\renewcommand*{\dtlsetUTFviiilccharcode}[2]{%
  \ifstrequal{#1}{Æ}%
  {%
    #2=123\relax
  }%
  {%
    \ifstrequal{#1}{Ø}%
    {%
      #2=124\relax
    }%
    {%
      \ifstrequal{#1}{Å}%
      {%
        #2=125\relax
      }%
      {%
        \ifstrequal{#1}{æ}%
        {%
          #2=123\relax
        }%
        {%
          \ifstrequal{#1}{ø}%
          {%
            #2=124\relax
          }%
          {%
            \ifstrequal{#1}{å}%
            {%
              #2=125\relax
            }%
            {%
              \dtlsetdefaultUTFviiilccharcode{#1}{#2}%
            }%
          }%
        }%
      }%
    }%
  }%
}


\newcommand{\vocab}[3]{%
    \DTLnewrow{vocabsort}%
    \DTLnewdbentry{vocabsort}{norsk}{#1}%
    \DTLnewdbentry{vocabsort}{class}{#2}%
    \DTLnewdbentry{vocabsort}{english}{#3}%
}
\newenvironment{vocabsection}[1]{\noindent \hspace{1em} \textbf{#1} %
    \DTLifdbexists{vocabsort}{\DTLcleardb{vocabsort}}{\DTLnewdb{vocabsort}}%
}% 
{ \dtlsort{norsk}{vocabsort}{\dtlicompare}%
    \displaysorteddb
}
\newcommand{\displaysorteddb}{%
    \begin{multicols}{2}
        \begin{hangparas}{0.5em}{1}
            \begin{DTLenvforeach*}{vocabsort}{\norsk=norsk, \class=class, \english=english}
                \textbf{\norsk} (\class) \english~\par
            \end{DTLenvforeach*}
        \end{hangparas}
    \end{multicols}
}

\begin{document}

    \subsection{Foo}

    \begin{vocabsection}{Body Vocab}     
    \vocab{hør}{n.}{hair}
    \vocab{hode}{n.}{head}
    \vocab{øye}{n.}{eyes}
    \vocab{øre}{n.}{ear}
    \vocab{hake}{n.}{chin}
    \vocab{nakke}{n.}{neck}
    \vocab{hånd}{n.}{hand}
    \vocab{finger}{n.}{finger}
    \vocab{arm}{n.}{arm}
    \end{vocabsection}

\end{document}

This produces the same result:

image same as before

How can I use \dtlsort with different alphabets?

Latin-1

UTF-8

Tags:

Sorting

Alphabet

Datatool

Scandinavian Letters

Related

Recent Posts