Collating Ogoneks

Swithun Crowe
Friday 5 May 2017

No, this isn’t one of Captain Haddock’s expletives, uttered after drinking a bottle of whisky. Which isn’t to say that there was no cursing or desire for strong drink when Research Computing helped PhD student Vittorio Mattioli to produce a digital edition of two versions of the Icelandic Eddic poem Grímnismál, as part of his thesis.
Vittorio came to us looking for advice on how to mark up the poem using Text Encoding Initiative XML. We advised, and Vittorio did the hard work of marking up the 3000+ words in the poems, identifying the lemmata, providing translations of each lemma and normalising the place and personal names.
Research Computing’s job was to take the TEI XML and produce a web page which would allow readers to explore the text, taking advantage of all the semantic mark-up present in the source. The result is a resource which lets the user view both versions of the poem (or either one) side by side, verse for verse. Mousing over each word, one can see the lemma and meaning or normalised name, and clicking on each word takes one to the glossary or list of names, where one can then follow links to all other attestations of the word.
That involved writing some clever XSLT. But the frustrating, though ultimately satisfying, part was sorting the glossary and list of names. Normally sorting in XSLT is straightforward:

<xsl:apply-templates select="path/to/elements">
  <xsl:sort select="element/to/sort/on" />
</xsl:apply-templates>

The problem arose because sorting a list of Icelandic words correctly requires specifying Icelandic collation rules rather than following Unicode code point order. The Icelandic alphabet has 32 letters, with acute-accented vowels coming directly after their unaccented counterparts. The ash (Æ) and O diaeresis (Ö) are considered to be letters in their own right and come at the end of the alphabet.
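To make the difference concrete, here is a minimal Python sketch (the project itself used XSLT, so this is only an illustration). It assumes the 32-letter modern Icelandic alphabet spelled out in `ALPHABET`; the helper name `icelandic_key` is hypothetical, and characters outside that alphabet are simply sorted last.

```python
import unicodedata

# Modern Icelandic alphabet (32 letters) in dictionary order.
ALPHABET = "aábdðeéfghiíjklmnoóprstuúvxyýþæö"
RANK = {letter: i for i, letter in enumerate(ALPHABET)}


def icelandic_key(word):
    """Sort key: each letter's position in ALPHABET (unknown characters last)."""
    # Normalise to precomposed characters so that "a" + combining acute == "á".
    letters = unicodedata.normalize("NFC", word).lower()
    return [RANK.get(ch, len(ALPHABET)) for ch in letters]


words = ["ör", "þing", "ár", "bók", "æska", "api", "dagur"]

# Plain sorted() compares code points, so accented letters land after "z".
by_code_point = sorted(words)

# Sorting with the key follows the alphabet: á after a, then þ, æ, ö at the end.
by_alphabet = sorted(words, key=icelandic_key)

print(by_code_point)  # ['api', 'bók', 'dagur', 'ár', 'æska', 'ör', 'þing']
print(by_alphabet)    # ['api', 'ár', 'bók', 'dagur', 'þing', 'æska', 'ör']
```

A hand-written alphabet like this is fine for a demonstration, but a real collation also has to deal with case, punctuation and letters from other languages, which is why relying on a locale is the more practical route.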
We investigated using a different XSLT processor – Saxon XSLT instead of xsltproc, as it offers more features, including custom collation. But then we discovered that one can use a different language’s collation rules simply by specifying the language:

<xsl:apply-templates select="path/to/elements">
  <xsl:sort select="element/to/sort/on" lang="is" />
</xsl:apply-templates>

This produced almost the correct results, apart from the O ogonek (Ǫ). Modern Icelandic doesn’t use this letter (it uses O diaeresis instead, which is the last letter of the Icelandic alphabet), so the Icelandic collation rules simply put it after the other O forms. The problem was how to get O ogonek to come at the end of the alphabet.
After much cursing, we worked out how to modify the Icelandic locale file on my computer (/usr/share/i18n/locales/is_IS), so that O ogonek could be sorted by Old Icelandic collation rules:

<U01EA> <U00F6>;<OGONEK>;<CAPITAL>;IGNORE
<U01EB> <U00F6>;<OGONEK>;<SMALL>;IGNORE

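Read as collation weights, each of these lines gives O ogonek one weight per comparison level: at the first level it counts as O diaeresis (U+00F6), and the ogonek only matters at the second level. The Python sketch below imitates that two-level comparison. It illustrates the idea under that reading and is not how the C library implements it; `PRIMARY_AS` and `old_icelandic_key` are hypothetical names, and putting the plain letter before the ogonek form when two words tie is an assumption of the sketch.

```python
import unicodedata

# Modern Icelandic alphabet (32 letters); "ö" is the last letter.
ALPHABET = "aábdðeéfghiíjklmnoóprstuúvxyýþæö"
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

# Assumption: at the first level, "ǫ" is compared as if it were "ö".
PRIMARY_AS = {"ǫ": "ö"}


def old_icelandic_key(word):
    """Two-level sort key: alphabet position first, ogonek as tie-breaker."""
    letters = unicodedata.normalize("NFC", word).lower()
    # Level 1: position in the alphabet, with "ǫ" standing in for "ö".
    primary = [RANK.get(PRIMARY_AS.get(ch, ch), len(ALPHABET)) for ch in letters]
    # Level 2: 1 where the letter carries an ogonek, 0 otherwise.
    secondary = [1 if ch in PRIMARY_AS else 0 for ch in letters]
    return (primary, secondary)


words = ["ǫnd", "ormr", "æsir", "óðinn", "ǫl", "þórr"]
in_order = sorted(words, key=old_icelandic_key)

print(in_order)  # ['ormr', 'óðinn', 'þórr', 'æsir', 'ǫl', 'ǫnd']
```

With this key, words beginning with O ogonek fall at the very end of the list, after þ and æ, which is the behaviour the glossary needed.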
This was an interesting project to be involved in, both for its subject matter and for the technical problems and solutions it threw up, and we’re grateful to Vittorio for bringing it to us.
Thus ends the small tale of the ogonek.
