First part of a translation currently being done from German:

Computer Alphabetizing

Principles, Problems, and an Attempt at Improvement

By Hans Christophersen © Copyright 1996

Introduction

In this article I shall describe some of the problems of computer alphabetizing.
A special problem arises when sorting national characters. Different countries have different requirements concerning the succession of accented and special letters.
Software generally cannot satisfy national requirements. This is unsatisfactory for countries whose national alphabets contain accented and special letters that must be sorted in an order different from the one produced by software made in other countries.
I shall put forth an algorithm that can be adapted to reproduce any required sorting order.

International Alphabetizing

Words and phrases in dictionaries and encyclopaedias are alphabetized because of the requirement that information should be retrieved as quickly as possible, and thus text strings should be sorted in a well-known order. But this order may vary from country to country. Alphabetizing is most complicated when more than one language is involved. In a telephone book there will be names originating from many different languages, and it would be impossible to sort these names by different criteria.
The sorting order of the letters A-Z, a-z is the same in all the European languages I have examined, but some national special letters are sorted differently in different countries. In Germany ö is alphabetized under o, but according to the Swedish rules it is the last letter of the alphabet.
We have to give a more explicit formulation of the notion of international alphabetizing, because national rules conflict. In every case of alphabetizing one language must be predominant, assuring that all alphabetizing rules of this language be followed. Foreign letters that do not exist in this language can be alphabetized following the rules of the foreign language. In a Swedish encyclopaedia it will be possible to place ß under ss according to German rules, because ß is not found in any Swedish word.

The special rules in the languages I have investigated may be grouped in four categories:

1. Special letters with a specific place:

The bold printed special letters are sorted as follows in the countries concerned:
1) Danish and Norwegian: ... Z, z, Æ, æ, Ø, ø, Å, å.
2) Swedish: ... z, Z, å, Å, ä, Ä, ö, Ö.
3) Icelandic: ... X, x, Y, y, Þ, þ, Æ, æ, Ö, ö.
4) Spanish: ... N, n, Ñ, ñ, O, o ...

2. Special letters alphabetized as ordinary letters:

I. Some letters are ligatures that are decomposed:
1) French: Œ and œ (to be alphabetized as OE or Oe, and oe).
2) German: ß (to be alphabetized as ss, but after the sequence ss, even if it is originally a ligature of s + z).

II. These special letters are treated as D/d with a diacritical mark:
1) Faroese: Ð and ð.
2) Icelandic: Ð, ð and d with a cross (which cannot be represented here).

3. Letter compositions with a specific place (digraphs):

1) Danish/Norwegian: aa (as å; in Danish before 1948 as aa, in Norwegian before 1917 as aa).
2) Spanish: ch (between c and d); ll (between l and m).
(In Spanish lynchar precedes llaca, and czar precedes chabacanada).
Note: In 1994 it was decided that in future the special Spanish sequences should be abandoned. I consider this fact to be without importance to the principal notion of specific national rules, because it is a general problem not only concerning Spain. In Welsh alphabetization "ch" and "ll", among other double letters, are also treated as separate letters. Thus the Spanish problem can be seen as just one example of a difficulty to be dealt with.
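
The traditional Spanish rule can be sketched in Python as follows. This is only an illustration, not the algorithm announced in the introduction: the names SPANISH_ALPHABET and spanish_key are assumptions, accented vowels are not handled, and every ch or ll is treated as a digraph, even in foreign words.

```python
# Traditional Spanish alphabet with ch and ll as letters of their own
# (the convention described above, abandoned in 1994).
SPANISH_ALPHABET = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {letter: i for i, letter in enumerate(SPANISH_ALPHABET)}

def spanish_key(word):
    """Split a word into alphabet units, preferring the digraphs ch and ll."""
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            key.append(RANK[pair])      # digraph counts as one letter
            i += 2
        else:
            # Characters outside the table sort after all known letters.
            key.append(RANK.get(word[i], len(RANK)))
            i += 1
    return key

words = ["llaca", "chabacanada", "lynchar", "czar"]
print(sorted(words, key=spanish_key))
# czar, chabacanada, lynchar, llaca
```

The key is a list of ranks rather than a string, so that a digraph can occupy a single position between its neighbours in the alphabet.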

4. Letters to be sorted in one group:

Letters with diacritical marks are to be sorted as the basis letter (the primary letter) without a diacritical mark (with the exceptions mentioned above for individual languages). But when two text strings are equal without regarding the diacritical marks, the diacritical marks are decisive for the sequence. A letter without diacritical mark will always precede the same letter with a diacritical mark. Thus we have a sequence as e.g. Foret, Forêt, Forez.
But having a sequence like Foréz, Forèz we must realize a requirement for a rule to sequence accented letters.
In French alphabetization the sequencing rule is:

´ (accent aigu, acute accent)
` (accent grave, grave accent)
^ (accent circonflexe, circumflex)
¨ (tréma, dieresis)

According to Danish standard the same sequence is used when such letters occur in proper names, e.g. é, è, ê, ë. But in Italian the first two accents have to be interchanged; thus è precedes é in Italian.
When sorting is done using the sequence of the native character set of a VAX computer (ANSI with small deviations), the sequence of the diacritical marks is: ` ´ ^ ˜ ¨ . Here we also have the tilde ( ˜ ). In other languages than Spanish it would be apposite to treat the tilde as a diacritical mark, but in Spanish ñ and Ñ are treated as specific letters (as described under 1) special letters with a specific place).
A colleague of mine thinks that the ANSI sequence being the same as the Italian order (especially ` grave accent, and ´ acute accent) is due to the greater penetration of the Italian lobbying in America.

Alternative sorting criteria

The following two examples relate to Danish and German, but they should be seen as general problems. Analogous conditions may be found in any other language not dealt with here.
Danish: Previously it was common to sort v and w in one group, but with v preceding w, e.g.: ver, wer, vet. In the dictionary Nudansk Ordbog this sorting principle was used until the middle of the 1980s. Now it is considered obsolete and is hardly used any more.
After the spelling reform in 1948 the letter å was introduced to replace the letter combination (digraph) aa corresponding with one phonetic sound. The letter å had been used to represent this sound in Swedish for more than 200 years, and has been used in Norwegian since 1917. It was allowed that aa could remain unchanged in proper names, and most people having names with the digraph aa in it did not want to change the spelling of their name. The choice was to write, e.g. Østergaard or Østergård. According to standardized Danish alphabetizing rules aa has to be sorted as å, the last letter of the Danish alphabet. This rule should be applied also when aa is pronounced as a long a, e.g. in the river name Maas, but this rule is considered foolish by most people, and that is why it is seldom used. For example telephone books deviate from this rule.
German: Nowadays the German letters ä, ö, ü are normally treated as a, o, u respectively with the diacritical mark ¨ (umlaut). There is another possibility for alphabetizing these letters, mainly used in older dictionaries: ordering the letters according to their origin as ae, oe, ue. The umlaut mark is a remnant of an old German hand-written e that was stylized to two strokes, and later to two points. Using the first principle we have a sort order such as: ae, aeh, afh, äg, äi. Using the second principle these syllables will be sorted: ae, äg, aeh, äi, afh. The first principle is the most common one, and in Austria and in Switzerland it is the only principle used. The second principle is used, for example, in German state libraries and in German telephone books (e.g. Grützner preceding Gruffke).
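
The two German principles can be compared with a small Python sketch. The function names dictionary_key and phonebook_key are assumptions chosen for this illustration, and the sketch expects the umlaut letters as single precomposed characters.

```python
UMLAUT_AS_BASE = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})
UMLAUT_AS_E = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})

def dictionary_key(text):
    """First principle: ä, ö, ü are a, o, u with a diacritical mark."""
    folded = text.lower()
    # The folded spelling is the tie-breaker, so a precedes ä when the
    # base letters are equal (ä has the higher code point).
    return (folded.translate(UMLAUT_AS_BASE), folded)

def phonebook_key(text):
    """Second principle: ä, ö, ü are expanded to ae, oe, ue."""
    folded = text.lower()
    return (folded.translate(UMLAUT_AS_E), folded)

syllables = ["äi", "afh", "aeh", "äg", "ae"]
print(sorted(syllables, key=dictionary_key))  # ae, aeh, afh, äg, äi
print(sorted(syllables, key=phonebook_key))   # ae, äg, aeh, äi, afh
```

With phonebook_key, Grützner is compared as gruetzner and therefore precedes Gruffke, as in the telephone book example.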

Three optional sorting criteria

1) With significant space character (»Nothing preceding Something«)
Example: à condition, a conto, Ackja, Acre.
The space character is considered to be a character preceding all visible characters. According to Danish standard (Dansk Standard Nr. 377), this criterion is applied by Danish libraries, and it is used in Denmark as the main rule.

2) Strictly alphabetically (»character by character«)
Example: Ackja, à condition, a conto, Acre.
The space character is considered as being not present. This sequencing rule is used, e.g. in the Danish orthography dictionary edited by Dansk Sprognævn, and in the German dictionaries Duden Deutsches Universalwörterbuch and Duden Rechtschreibung.

3) According to interdependence
Example: Affe, Affengeschwindigkeit, Affenhaus, Affekt.
These words are German words. The first one means 'monkey', and the next two words are compound words initiated with Affe[n]-, and the last one has no connection to the others.
Words with similar meanings, derivatives, and compounds are held together, but the sequencing is a logical one depending on meanings, not on the alphabetical order of the letters.
A modified version of the interdependence criterion is found in the German dictionary Duden Das große Wörterbuch der deutschen Sprache in 6 Bänden. Here the words are found in this sequence: Affe, Affekt, Affenhaus, Affengeschwindigkeit. 'Affe' has been separated from the 'Affe[n]' compounds. Next the 'Affe[n]' compounds have been divided into two groups, and Affenhaus is classed with the first group, and Affengeschwindigkeit with the second one. Thus Affenhaus precedes Affengeschwindigkeit, even if they had to be interchanged according to the alphabetical order of the letters.

Doing the sorting with a computer the first criterion can be managed very easily, because two text strings are compared without any intervening changes, i.e. in their original representation.
Using the second criterion, all space characters have to be removed from the two text strings that are compared in order to determine their alphabetical sequence. But the original text strings have to be retained, because two text strings might be equal after the spaces are removed, and in this case the sequence should be determined from the original character sequence of the strings. An unambiguous sequence is a logical requirement when computers do the sorting. An example is the two text strings a ba and aba. After the space character is removed from the first string, the two strings seem equal, but it would be inappropriate not to be able to determine a sequence from the original strings, i.e. whether a ba precedes aba, or vice versa.
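
The first two criteria can be sketched in Python as sort keys. The helper names are assumptions for this illustration. The tie-breaker on the original string, which places a ba before aba, is one possible choice and is not prescribed by any of the rules quoted here.

```python
import unicodedata

def base_letters(text):
    """Lower-case the text and drop diacritical marks (primary letters only)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def significant_space_key(text):
    # Criterion 1: the space stays in the key and precedes all letters.
    return (base_letters(text), text)

def strict_key(text):
    # Criterion 2: spaces are removed for the comparison; the original
    # string is kept as a tie-breaker so the order stays unambiguous.
    return (base_letters(text).replace(" ", ""), text)

entries = ["Acre", "a conto", "Ackja", "à condition"]
print(sorted(entries, key=significant_space_key))
# à condition, a conto, Ackja, Acre
print(sorted(entries, key=strict_key))
# Ackja, à condition, a conto, Acre
print(sorted(["aba", "a ba"], key=strict_key))
# a ba, aba
```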
The third criterion, sorting according to interdependence, can not easily be implemented. It is obvious that a non-alphabetical sequencing cannot depend on alphabetical order. In some cases a special sort sequence is obtained through a sort key being stored in a database.
Some English compound words may be written as one word, as two (or more) words, or with a hyphen. That is a problem because the same word would occur at different places. For instance, when making a frequency list from texts by different authors, it is tedious work to gather differently spelled words into one group.
When the words to be sorted come from a text file with word divisions in it, it may be difficult to decide whether a hyphen should be kept because it is a component of a compound word, or removed because it only marks a word division. A human being can make the decision, but it is much more difficult to make an algorithm that will let computers make the decision. A problem arises in German because of the rule that words with ck are hyphenated k-k. In case of word division Lücke is written Lük-ke. But a personal name Lükke would also be hyphenated Lük-ke. A computer would not be able to decide whether the origin is Lücke or Lükke.

According to my research into how alphabetizing is done in different languages, the rule is that numerals (0-9) precede all letters. Lower-case and upper-case letters are sorted without regard to case (e.g. Deb, dek, Del). But when two words are equal apart from case, upper-case will precede lower-case in most languages (e.g. Karl preceding karl). In Swedish karl will precede Karl.

Sort sequence of special characters

When computers are doing the sorting of text strings it is required that each special character has a predefined position in order to determine the sequence. This requirement is especially important when an entry word (lemma) consists of nothing else than special characters, e.g. in a computer book the shift operator <<.
In traditional lexicography there are three ways of treating special characters:

As blind character: The punctuation marks (e.g. comma and question mark) are not considered, i.e. the sorting is done as if these characters were not there. Apostrophes (e.g. in: Peter's Bar) and ellipsis hyphens (e.g. in German: Männer- und Frauenberufe) are also treated as blind characters.
Blind characters are . , ; : ? ! ( ) [ ] » « " „ ” “ ' -
A special case is the dash. According to typographical rules it is longer than a hyphen ( – en-dash, or — em-dash), and it is encompassed by space characters. This sequence (space dash space) is treated similar to one space.

As space character: Hyphens and slashes are treated like spaces. Examples: the name Heinz-Dieter (to be alphabetized as: Heinz Dieter), and/or (alphabetized as: and or).
However, there is no rule without exceptions. In some cases hyphens and slashes are treated as blind characters: 1) when two parts are closely connected and could not occur separately (e.g. U-Boot, U-boat), and 2) when a special character is optional (e.g. in the abbreviation B/W = BW = black/white).
In the dictionary of computer abbreviations that I have edited, edb-forkortelser, I made the decision to treat all special characters as blind characters in order to make it easier to find abbreviations and acronyms, when an entry may contain hyphen, slash, or space, or only letters. For instance the dictionary has the entries A&T, A/T, and AT being sorted primarily as AT, and secondarily in the sequence stated; & preceding / according to the sequence in computer alphabets.
After the pronunciation: A special character being pronounced as a word should be alphabetized as the written word. Thus an ampersand, &, is changed to und in German, and in English, och in Swedish, et in French, etc. This rule is seldom used as almost nobody knows about it, and it conflicts with any effort of making international alphabetizing rules. It would be very troublesome to implement this rule in computers, and I think it should be neglected.
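
The treatment of all special characters as blind characters, as chosen for the abbreviation dictionary mentioned above, might be sketched in Python like this. The name blind_key and the sample entries AS and ATM are assumptions for the illustration; the secondary order simply follows the code points of the characters.

```python
def blind_key(entry):
    """Primary key: letters and digits only; secondary key: the entry itself."""
    primary = "".join(ch for ch in entry if ch.isalnum()).upper()
    return (primary, entry)

entries = ["AT", "ATM", "A/T", "AS", "A&T"]
print(sorted(entries, key=blind_key))
# AS, A&T, A/T, AT, ATM
```

The three variants of AT stay together, and within the group & precedes / because it has the lower code point.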

The Concept of Letters

The letters occurring in the alphabets of the countries I have dealt with, can be grouped into three classes, one of which has four subgroups.

Plain Letters

Plain letters are letters occurring in all Latin alphabets, A-Z and a-z. In all countries plain letters are alphabetized in the same sequence relatively to other plain letters. Exceptions are the combinations aa in Danish and Norwegian, and ch and ll in Spanish and Welsh.

Letters With Diacritical Marks

Letters with diacritical marks are alphabetized as the basic letters (plain letters) without regarding the distinction if some other difference occurs in the two text strings that are compared to determine their sequence. The letter é is a letter with diacritical mark, and is treated as e as the primary rule. The German letter ö is considered a letter with diacritical mark if ok precedes öm, but it is not if ö is decomposed into oe, rendering the sequence öm, ok.

Special Letters

1) Special letters without resemblance with plain letters: In this group we have the Icelandic letter þ with its special place in the Icelandic alphabet.
2) Special letters with resemblance with plain letters: These letters look like letters with diacritical marks, but are special because they have their special place in the alphabet according to national rules. Examples are the Nordic letter å and the Swedish ö (opposite to German ö).
3) Dissolvable ligatures: These are letters that may be written as a ligature or as two separate letters, and being alphabetized as separate letters. Example: French œ = oe.
4) Ligatures considered as special letters: Letters written together, having a special name and their special place in the alphabet and considered as special letters in the countries concerned. From an orthoepic point of view these ligatures have to be considered as digraphs as they represent one sound. Example: in Danish and Norwegian æ (after z in the alphabet).

Diacritical marks are used to indicate a distinction between letters, mostly concerning pronunciation (e.g. c/ç, e/é). Some diacritical marks are also called accents (e.g. ´ over é). Most letters with a diacritical sign are sorted as the basis letter as the primary criterion, but others are treated differently having an exceptional position. In the Scandinavian countries the ring º above å is not seen as a diacritical mark, because å in all respects is another letter than a.
A ligature is a typographical type where two or more letters are joined. In some cases this is without importance to alphabetizing (e.g. ff joined, or fl joined). In other cases a ligature is considered a new independent letter that can only be represented as one letter. In the Scandinavian languages æ cannot be replaced by ae.

Establishing a Concept Model

I should like to define some terms of alphabetizing criteria to be used when dealing with alphabetizing, not only lexicographically but especially in connection with computer alphabetizing.

Primary Criterion

The primary criterion is to decide if two letters belong to the same group of a basis letter. The letters E, É, e, é, ê etc. are primarily equal because the basis letter in all cases is E/e. But for instance g and k are primarily unequal because the basis letters are different.
In the two text strings forêt and forez there is a primary difference occurring at the 5th letter of each string, and thus alphabetizing is done according to the primary criterion (the basis letters) without regarding the diacritical mark ˆ.
The two text strings foret and forêt are primarily equal. The primary criterion will not suffice to determine the sequence of these words.

Secondary Criterion

Two letters being primarily equal (same basis letter) are secondarily unequal if there is a difference regarding diacritical marks. For instance the letters e, é, è, ê and ë are primarily equal, but secondarily unequal.
Using the secondary criterion e precedes ê, foret will precede forêt. The two text strings are primarily equal, but secondarily unequal. Considering the two text strings forêt and forez, we find a primary difference (t different from z), and thus the secondary criterion is without importance to the alphabetizing of these words.

Tertiary Criterion

In most cases the two first criteria will lead to a satisfactory result. But sorting algorithms must be improved, if possible. Using lexicographical requirements lowercase and uppercase letters have to be sorted together, i.e. A a B b C c etc., uppercase letters having precedence in most languages. Using only two criterion levels, and the difference between lowercase and uppercase letters being a secondary criterion, we would have sequences like: Foret Forêt foret forêt Gómêt Gòmét gómêt gòmét.
Thus Foret would be separated from foret, Forêt from forêt, and likewise the <gomet> sequence would be sorted in a way letting the case of a letter have greater importance than diacritical marks in a text string.
The tertiary criterion shall tend to keep text strings together when they are equal according to the first criterion (equal basis letters) and the second criterion (equal basis letters with the same diacritical marks), and the only difference is due to upper and lowercase letters.
The second criterion is only needed when the first criterion is not sufficient, and likewise the third criterion is only used when the sort sequence is not determined using the first two criteria.
I agree with the opinion of Dansk Sprognævn (the Danish language advisory committee) that the example words should be alphabetized:
Foret foret Forêt forêt Gómêt gómêt Gòmét gòmét.
If we want to implement this sorting in a computer program we need three alphabetizing criteria.
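
The three criteria can be sketched in Python as a sort key with three parts. This is a minimal illustration under stated assumptions, not the algorithm announced in the introduction: the name three_level_key is hypothetical, the accent ranks follow the French rule quoted earlier, and any other diacritical mark is given rank 9.

```python
import unicodedata

# Rank of combining marks: acute, grave, circumflex, dieresis.
# Other marks get rank 9 (an assumption).
ACCENT_RANK = {"\u0301": 1, "\u0300": 2, "\u0302": 3, "\u0308": 4}

def three_level_key(text, upper_first=True):
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        primary.append(base.lower())                       # basis letter
        secondary.append(tuple(ACCENT_RANK.get(m, 9) for m in marks))
        tertiary.append(0 if base.isupper() == upper_first else 1)
    # Tuples compare level by level, so a later level is consulted only
    # when all earlier levels are equal.
    return (primary, secondary, tertiary)

words = ["gòmét", "forêt", "Gómêt", "Foret", "gómêt", "Forêt", "Gòmét", "foret"]
print(sorted(words, key=three_level_key))
# Foret, foret, Forêt, forêt, Gómêt, gómêt, Gòmét, gòmét
```

Passing upper_first=False gives lowercase letters precedence at the tertiary level, as described for Swedish.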

Look-ahead

Keeping the rules given in the three criteria above depends on a look-ahead ability. When a secondary difference occurs (e.g. e and ê) it still cannot be determined which text string comes first, because the second criterion is only applied when the two text strings being compared are equal according to the first criterion. If a sort algorithm alphabetizes in the sequence foret, forez, forêt, we must conclude that it has no look-ahead ability, because a secondary difference was decisive although a primary difference was present. The correct sequence is: foret, forêt, forez.
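
The difference between sorting with and without look-ahead can be shown with two Python sort keys. The function names are assumptions for this sketch, and an accent is reduced to a simple flag (marked or unmarked) to keep the example short.

```python
import unicodedata

def split_letter(ch):
    """Return (basis letter, 1 if the letter carries a diacritical mark else 0)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0], 1 if len(decomposed) > 1 else 0

def no_look_ahead_key(text):
    # One list of (basis, accent) pairs: the first difference of any kind
    # decides, so e/ê is settled before t/z is ever seen.
    return [split_letter(ch) for ch in unicodedata.normalize("NFC", text)]

def look_ahead_key(text):
    # All basis letters first, accents second: an accent difference counts
    # only if the basis letters of the whole strings are equal.
    pairs = no_look_ahead_key(text)
    return ([base for base, _ in pairs], [accent for _, accent in pairs])

words = ["forêt", "forez", "foret"]
print(sorted(words, key=no_look_ahead_key))  # foret, forez, forêt
print(sorted(words, key=look_ahead_key))     # foret, forêt, forez
```

Both keys contain the same information; only the arrangement differs. Separating the levels is what gives the comparison its look-ahead.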

Alphabetizing Problems

I) Common problems sorting manually and with computers:
A language of precedence must be determined because some letters are sorted differently according to national rules. In particular this concerns special letters and letter compositions with a specific place (digraphs); but also e.g. ö which in German is sorted under o, but in Swedish and Icelandic it is the last letter of the alphabet, and when it occurs in Danish in a proper name it is sorted under ø (in case of equal strings ø has precedence, e.g. Øster preceding Öster).
How to handle foreign characters has to be determined. Characters which occur in only one foreign language could be sorted according to the rules of this country, e.g. ß and œ, but again it is a question where the users of a dictionary would look for them. In a German encyclopaedia the physical unit Å (ångström, having its name after the Swedish physicist Anders Ångström, 1814-74) is found under A and not according to Swedish rules, even if Å in the countries where used is not considered a letter with diacritical mark but a special letter.
One of the optional sort criteria has to be chosen. The sorting can be strictly alphabetical (most frequently used), or with significant space (most simple to implement in computer programs), or according to interdependence (which may make it difficult to find entries, as this principle appears strange to most people).
A uniform sorting principle has to be applied when several rules compete. In one language several alphabetizing rules may be equally valid, or several rules are applied in practice. But different sets of rules must not be intermixed. In every single case one principle must be chosen; when a computer does the sorting there is no other way.
In German we have two competing principles concerning Ä/ä, Ö/ö, Ü/ü, either being basic letters with diacritical marks, or as ligatures to be considered as basic letter + e.
In Danish telephone books aa is alphabetized after two different criteria. When pronounced as å and being a variant of the spelling å it is alphabetized under å which is the last letter of the alphabet, but when pronounced [a] or [a:] it is alphabetized as aa, even if it is not according to the official standard.

II) Special problems when sorting with computers:
Sorting according to pronunciation: For example in the Austrian telephone book of Vienna (Wien) 1970, the letter composition ue is alphabetized according to two different principles. One principle is applied when ue is pronounced like ü, [y], and another when it is pronounced as a long u, [u:]. Similarly, in Danish telephone books aa is sorted as aa when pronounced as long a, [a:], and as å when pronounced as å. It is rather complicated to implement computer sorting according to pronunciation. At least a database with alphabetizing information is needed. A database record may have a display field (how the entry is spelled), and a sort key to determine the alphabetizing sequence (this might be an alternative spelling that would lead to correct alphabetization).
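
A record with a display field and a sort key, as described above, might look like this in Python. The class Entry, the function rank_key and the sample names are assumptions for the illustration; the sort keys would have to be entered by a person who knows the pronunciation.

```python
from dataclasses import dataclass

# Danish alphabet: a to z followed by æ, ø, å.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {letter: i for i, letter in enumerate(DANISH_ALPHABET)}

@dataclass
class Entry:
    display: str    # spelling shown to the reader
    sort_key: str   # alternative spelling that encodes the pronunciation

def rank_key(entry):
    # Characters outside the alphabet sort after all known letters.
    return [RANK.get(ch, len(RANK)) for ch in entry.sort_key.lower()]

entries = [
    Entry("Aagaard", "ågård"),   # aa pronounced å, filed under å
    Entry("Maas", "maas"),       # aa pronounced as a long a, filed as a-a
    Entry("Madsen", "madsen"),
    Entry("Zander", "zander"),
]
print([e.display for e in sorted(entries, key=rank_key)])
# Maas, Madsen, Zander, Aagaard
```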
Digraphs: Some digraphs may be troublesome because of a special ordering sequence, but others are uncomplicated as e.g. ph pronounced as [f], as it is sorted as p-h manually (and lexicographically) and with computers. The troublesome digraphs are the ones being treated in some special way according to national rules.
Computer alphabetizing normally will rely on a table rendering the primary sort order keyed to a native character table of the computer, but this method is not suited to handle the digraph problem.
The first example originates from alphabetizing with the Danish release of WordPerfect version 5.1. When alphabetizing the digraph aa (to be sorted as å), and å, the result can be the sequence: aal, ål, aal, telling us that the algorithm used lets aa be equal to å without precedence. According to Danish orthography å precedes aa. It is possible to let computers distinguish between aa and å, but implementation algorithms will be somewhat complicated.
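
A Python sketch of a key that groups aa with å and still lets å precede aa is given below. The name danish_key is an assumption, and the sketch treats every aa as a digraph, so it would misplace compounds in which the two letters belong to different parts of the word.

```python
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {letter: i for i, letter in enumerate(DANISH_ALPHABET)}

def danish_key(word):
    word = word.lower()
    primary, secondary, i = [], [], 0
    while i < len(word):
        if word[i:i + 2] == "aa":
            primary.append(RANK["å"])   # the digraph is filed as å ...
            secondary.append(1)         # ... but after a genuine å
            i += 2
        else:
            primary.append(RANK.get(word[i], len(RANK)))
            secondary.append(0)
            i += 1
    return (primary, secondary)

print(sorted(["aal", "zebra", "ål", "aal"], key=danish_key))
# zebra, ål, aal, aal
```

Because the secondary list is consulted only when the primary lists are equal, the result no longer depends on the order of the input.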
There are also difficulties with the Spanish (and Welsh) digraphs. The digraph ch is sorted as a special letter between c and d, and ll is between l and m. Often computer alphabetizing does not take this into account. Another problem is that ch and ll are not alphabetized as digraphs in foreign words, but are treated as c-h and l-l. No simple computer algorithm will be capable of determining if these digraphs occur in a native or in a foreign word.

Demonstration of Alphabetizing Criteria

Having analyzed alphabetizing principles in computer programs, I have found that sorting is done differently by different software, and that sort sequences in many cases vary from manual sorting based on lexicographic principles.
In order to determine how many criterion levels a computer program uses when alphabetizing, we will use an input sequence suitable for producing an output sequence that will tell which criteria and criterion levels are used. It is without importance to the principle of alphabetizing whether the text strings involved have any lexical meaning in any language.

Number of Criterion Levels

The number of criterion levels can be determined using this test input:

forêt Foret Forêt forez foret

Three Criterion Levels:
Text strings being equal regarding diacritical marks are sorted together without considering the difference between uppercase and lowercase letters. There is a priority range of letters being variants of the same basic letter. The basic letter without diacritical mark precedes the same letter accented, when no more important differences are found examining the subsequent letters (look-ahead). Normally an uppercase letter will precede a lowercase letter (when being characters of the same basic letter), and thus the three criterion level principle leads to the sequence:

III

Foret foret Forêt forêt forez

When lowercase letters precede uppercase letters (e.g. Swedish sequencing) the sort sequence is:

III

foret Foret forêt Forêt forez

We still have three criterion levels. The determining fact is, that programs with one or two criterion levels are not able to render any of these sequences.

Two Criterion Levels:
With two criterion levels the difference between uppercase and lowercase letters does not differ in nature from any other distinction concerning a basic letter. The first occurring difference is decisive for the alphabetizing order, except when the two characters being compared are predefined to be sorted together (i.e. primarily equal and secondarily unequal), and the sort sequence is determined through some subsequent primary difference (i.e. look-ahead). With two criterion levels, there are two possible sequences. In the first sequence an uppercase letter precedes a lowercase letter; and in the second one lowercase precedes uppercase:

Foret Forêt foret forêt forez

foret forêt Foret Forêt forez

The first sequence satisfies most national rules, but in some countries (e.g. Sweden) lowercase has precedence.
Of decisive importance is that the two sequences above cannot be obtained when sorting with programs having only one criterion level.
At the time of my research (1994) I found two criterion levels using e.g. Paradox, WordPerfect v. 5.1, and MS-Word.
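
A two-level comparison with uppercase precedence can be sketched in Python as follows. The name two_level_key is an assumption for this illustration, and it does not claim to reproduce the internals of any of the programs named above.

```python
import unicodedata

def two_level_key(text):
    primary, variants = [], []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        primary.append(base.lower())
        # Case and accent share one level: upper case before lower case,
        # then unaccented before accented.
        variants.append((0 if base.isupper() else 1, len(decomposed) - 1))
    return (primary, variants)

words = ["forêt", "Foret", "Forêt", "forez", "foret"]
print(sorted(words, key=two_level_key))
# Foret, Forêt, foret, forêt, forez
```

Since case and accent are judged at the same position in the same list, the first variant difference decides, and here that is F/f.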

One Criterion Level
A program only has one criterion level, if it cannot produce any of the sort sequences given above to decide three or two levels. We can describe it as a lack of look-ahead, because e.g. e and ê will always be sorted in some fixed sequence without regarding subsequent characters. An example of one criterion level sorting is this simple ANSI sequence:

Foret Forêt foret forez forêt

The main criterion when having only one criterion level, is that the first difference is decisive; in this case F/f and e/ê (no look-ahead). There also is only one criterion level in the sort sequence: foret forez forêt.
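
The simple one-level sequence can be reproduced in Python, whose default string comparison works code point by code point. For the characters used here the Unicode code points agree with the ANSI table. The sketch assumes precomposed characters, which the normalization step ensures.

```python
import unicodedata

words = ["forêt", "Foret", "Forêt", "forez", "foret"]
# Normalize to precomposed characters so that ê is a single code point.
words = [unicodedata.normalize("NFC", w) for w in words]

# No sort key at all: the first differing code point decides.
print(sorted(words))
# Foret, Forêt, foret, forez, forêt
```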

It is possible to divide one criterion level sorting into three subgroups:

1) simple sequence according to the native character table of the computer (e.g. ANSI table)
2) improved one criterion sorting, grouping together some characters without priority
3) expanded improved one criterion sorting, grouping together some characters with priority.

It is more difficult to determine the one criterion subgroup than to determine the number of criterion levels. The text strings used above will not suffice.
When it is determined that a program has only one criterion level, and some sort sequence varies from the sequence of the native character table (e.g. ANSI), possibility 1) can be excluded.
When a test input like ret ret rét can lead to the output ret rét ret we realize that e and é are treated as one letter. Repeated sorting with the same input can give any sorting of the three text strings. Because there is no look-ahead we have only one criterion level, and because there is no priority between e and é we can tell that the sorting is done according to the one level criterion subgroup 2.
A simple ANSI or ASCII sequence (I-1) is obtained using only the compare function in C, strcmp; the version of Oracle that I investigated had this sorting. The expanded improved one criterion sorting I found using Ingres and Danbase (1994).

Partial Deviation From the Criterion Levels
Ordinarily a sorting program has only one criterion level when you encounter some undefined sequencing of characters sorted together, but some programs having two criterion levels may have only one level in connection with certain characters or character groups. Be sure to have a suitable test input so as not to draw too hasty conclusions. WordPerfect (v. 5.1) in most cases uses two criterion levels, but in the Danish edition å is grouped with aa, however with no predefined order, so a sorting procedure may lead to the sequence: aal ål aal (one criterion level). Sorting the sequence ret rét ret will always give the same output: ret ret rét (two criterion levels). Many programs with two criterion levels will use only one criterion level when sorting ß, Œ and œ (expanded improved one criterion level); this error has remained in WordPerfect version 6.1.

Strictly Alphabetically or With Significant Space Character?

It is seldom that sorting is done according to the interdependence principle which is difficult to implement in computer programs, but when used, the other two optional sorting criteria are not (always) observed. Examples are found in connection with compound words, e.g. German Finger and fingerförmig preceding Fingerei in: DUDEN Das große Wörterbuch der deutschen Sprache in 6 Bänden.
Whether the space character is regarded when alphabetizing can be determined using the test input aba and a ca, with two possible outputs:

Strictly alphabetically: aba, a ca
Significant space character: a ca, aba

All the programs that I have examined sort with significant space character.
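
The test can be automated with a few lines of Python. The functions space_mode and strict_sort are assumptions for this sketch; strict_sort stands for any program that ignores spaces.

```python
def space_mode(sort_function):
    """Classify a sort function by how it orders the test input."""
    result = list(sort_function(["aba", "a ca"]))
    return "significant space" if result[0] == "a ca" else "strictly alphabetical"

def strict_sort(items):
    # Hypothetical strictly alphabetical sort: spaces are ignored.
    return sorted(items, key=lambda s: (s.replace(" ", ""), s))

print(space_mode(sorted))       # significant space
print(space_mode(strict_sort))  # strictly alphabetical
```

Python's own sorted belongs to the group with significant space, because the space has a lower code point than any letter.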

Grouping Special Characters

In the ANSI and ASCII character sets, special characters are dispersed throughout the sequence; thus you will find special characters preceding the numbers (e.g. &, *), between numbers and uppercase letters (e.g. <, ?), between uppercase and lowercase letters (e.g. [, ^), and succeeding lowercase letters (e.g. {, ƒ). Some programs will keep this sequencing (entirely or partially), and others will let the special characters precede or succeed alphanumeric characters. For your first investigation you may use the test input: & * < ? [ ^ { ƒ 3 f F. The special characters may be grouped, or dispersed, but in order to know if the sequence of the native character set of the computer is kept unaltered, you would have to sort all available characters.
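
A first check of this kind might be written in Python as below. The names SPECIAL, TEST_INPUT and specials_grouped are assumptions. The sketch uses Unicode code points, in which ƒ also follows the lowercase letters, although at a different position than in the ANSI table.

```python
SPECIAL = set("&*<?[^{\u0192")   # \u0192 is ƒ
TEST_INPUT = ["&", "*", "<", "?", "[", "^", "{", "\u0192", "3", "f", "F"]

def specials_grouped(ordered):
    """True if all special characters form one contiguous run."""
    flags = [ch in SPECIAL for ch in ordered]
    first = flags.index(True)
    last = len(flags) - 1 - flags[::-1].index(True)
    return all(flags[first:last + 1])

code_point_order = sorted(TEST_INPUT)
print(" ".join(code_point_order))          # & * 3 < ? F [ ^ f { ƒ
print(specials_grouped(code_point_order))  # False
```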

Sequence of Accents

In most programs the accents have a predefined sequencing independent of the basic letter. To determine this sequence, use the test input: a á à â ã ä. But there may be differences depending on the basic letter. To obtain a complete overview, you have to sort all characters of the character set; this also lets you determine the position of other letters with diacritical marks.

Specific National Rules

We might investigate specific rules of any number of languages. The examples given cannot be exhaustive, but will be an indication of problems to be dealt with.

Danish, Norwegian, and Swedish:
When aa and å are grouped together according to Danish and Norwegian rules, a sort sequence will be, e.g.: z ål aal. In cases where aa is not a digraph, but the two letters belong to different parts of a compound word, aa should not be sorted as å, but as two occurrences of a. An example is the Danish word ekstraarbejde (German: Extraarbeit, English: extra work). It is difficult to let a computer decide whether aa is a digraph or not. It can only be done if you have a database with information about all words, but there are many possibilities of making compound words.
The Icelandic letter þ is not sorted according to the Icelandic sequence, but as th.
According to Danish and Norwegian rules the umlaut letters are grouped with national letters having the same pronunciation, thus: Y Ü y ü Z z Æ Ä æ ä Ø Ö ø ö Å å.
The Swedish alphabetizing rules are according to the ANSI sequence: å ä ö.
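The Danish and Norwegian grouping can be sketched in Python as a key with two criterion levels. This is only an illustration under simplifying assumptions: it works on lowercase forms, it treats every aa as a digraph (so it would misplace compounds such as ekstraarbejde), it does not handle þ, and the names ALPHABET, GROUPED and danish_key are hypothetical.

```python
# Danish/Norwegian alphabet: the national letters follow z.
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {letter: position for position, letter in enumerate(ALPHABET)}

# Umlaut letters grouped with the national letter of the same pronunciation.
GROUPED = {"ü": "y", "ä": "æ", "ö": "ø"}

def danish_key(word):
    primary = []    # position of the letter group in the alphabet
    secondary = []  # 0 = the national letter itself, 1 = a grouped variant
    text = word.lower()
    i = 0
    while i < len(text):
        if text.startswith("aa", i):
            # Assumption: aa is always the digraph, sorted after å.
            primary.append(RANK["å"])
            secondary.append(1)
            i += 2
            continue
        ch = text[i]
        if ch in GROUPED:
            primary.append(RANK[GROUPED[ch]])
            secondary.append(1)
        else:
            # Characters outside the alphabet sort first in this sketch.
            primary.append(RANK.get(ch, -1))
            secondary.append(0)
        i += 1
    return (primary, secondary)

print(sorted(["aal", "z", "ål"], key=danish_key))  # ['z', 'ål', 'aal']
```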

German:
The letter ß is alphabetized as ss, but when two text strings are equal, not considering the difference ss/ß, ss will precede ß. Thus the following sequence of words is sorted correctly according to lexicographic rules:

haspeln Haß hassen hassenswert haßerfüllt

Many programs do recognize that ß is to be grouped with ss, but when all preceding characters are equal, all strings with ss will precede the strings with ß, which is wrong (e.g. asst, asstm, aßt, which should be: asst, aßt, asstm).
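The rule for ß can be sketched in Python as an expansion at the primary level with a tie-break at a secondary level. The function name german_eszett_key is hypothetical, and the sketch ignores case and all other criteria.

```python
def german_eszett_key(word):
    text = word.lower()
    # Primary level: ß is compared as ss.
    primary = text.replace("ß", "ss")
    # Secondary level: only consulted when the primary keys are equal;
    # plain letters get 0, the two positions produced by ß get 1.
    secondary = []
    for ch in text:
        secondary.extend([1, 1] if ch == "ß" else [0])
    return (primary, secondary)

print(sorted(["asstm", "aßt", "asst"], key=german_eszett_key))
# ['asst', 'aßt', 'asstm']
```

Because the whole expanded string is compared before the secondary level is consulted, asstm cannot slip in between asst and aßt.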
There are two approved ways to sort the umlaut letters:
1) The most used principle is to sort the umlaut letters as letters with a diacritical mark, rendering this example sequence:

ae aeh äg äi vor vör vore vöre

2) In older dictionaries, in telephone books, and in libraries, the ordering is based on expanding the umlaut letter into the basic letter + e, e.g. ü = ue, which renders this sequence:

ae äg aeh äi vör vöre vor vore
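Both ways of sorting the umlaut letters can be expressed as key functions in Python. This is a minimal sketch for lowercase input; the names UMLAUT_BASE, diacritic_key and expansion_key are hypothetical.

```python
UMLAUT_BASE = {"ä": "a", "ö": "o", "ü": "u"}

def diacritic_key(word):
    """Method 1: an umlaut letter is its basic letter with a diacritical mark."""
    text = word.lower()
    primary = "".join(UMLAUT_BASE.get(ch, ch) for ch in text)
    # The mark only decides between otherwise equal words.
    secondary = [1 if ch in UMLAUT_BASE else 0 for ch in text]
    return (primary, secondary)

def expansion_key(word):
    """Method 2: an umlaut letter is expanded to basic letter + e."""
    text = word.lower()
    primary = "".join(
        UMLAUT_BASE[ch] + "e" if ch in UMLAUT_BASE else ch for ch in text
    )
    secondary = []
    for ch in text:
        secondary.extend([1, 1] if ch in UMLAUT_BASE else [0])
    return (primary, secondary)

words = ["vöre", "vore", "vör", "vor", "äi", "äg", "aeh", "ae"]
print(" ".join(sorted(words, key=diacritic_key)))
# ae aeh äg äi vor vör vore vöre
print(" ".join(sorted(words, key=expansion_key)))
# ae äg aeh äi vör vöre vor vore
```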

Faroese and Icelandic:
The letters Ð, ð (»eth«), and d with a cross, are treated as D/d with a diacritical mark, rendering the sequence: D Ð d [d with cross] ð E e.
The Icelandic Þ/þ (»thorn«) comes after y, but before other national special letters (there is no z in Icelandic words). In pure Icelandic connection the sequence is y þ æ, and in connection with foreign words and proper names the sequence is: Y y Z z Þ þ Æ æ Ö Ø ö ø Å å.

French:
According to the rules of alphabetizing the French letters Œ and œ we have these sort sequences:

and

Depending on priority criteria implemented in the sorting program, these two sequences may be grouped with either lowercase or uppercase letters having priority, but this problem is not a special French one. The result is:

or:

The first row, letting uppercase letters have precedence, is according to the norm.

Dutch:
The Dutch ligatures IJ/ij are no exception; they are treated like the non-ligature letter pairs IJ/ij.

Spanish:
In Spanish words the digraph ch is considered a letter between c and d, and the digraph ll is a letter between l and m, and in Spanish Ñ/ñ is not a letter with a diacritical mark, but a special letter between N/n and O/o. Accordingly we have a sequence such as:

czar chabacanada lynchar llaca nz ña o
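The Spanish sequence can be reproduced with a Python sketch that splits each word into the letters of the Spanish alphabet before comparing. It assumes that every ch and ll in the input is a digraph, handles lowercase only, and ignores accented vowels; the names SPANISH_LETTERS and spanish_key are hypothetical.

```python
# Alphabet with ch, ll and ñ as separate letters.
SPANISH_LETTERS = (
    list("abc") + ["ch"] + list("defghijkl") + ["ll"]
    + list("mn") + ["ñ"] + list("opqrstuvwxyz")
)
RANK = {letter: position for position, letter in enumerate(SPANISH_LETTERS)}

def spanish_key(word):
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])  # digraph counts as one letter
            i += 2
        else:
            # Characters outside the alphabet sort last in this sketch.
            key.append(RANK.get(text[i], len(SPANISH_LETTERS)))
            i += 1
    return key

words = ["o", "ña", "nz", "llaca", "lynchar", "chabacanada", "czar"]
print(" ".join(sorted(words, key=spanish_key)))
# czar chabacanada lynchar llaca nz ña o
```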

Basic Concepts of Computer Alphabetizing

The operating system of a computer has a native character set which is the basis of most sorting procedures. Some computer programs have additional software character sets that can only be used in the programs where they are implemented (e.g. WordPerfect). Fundamentally the succession of characters to be sorted has to be defined. The succession is laid down in a table, and each character is assigned an integer value to define the sort sequence. The simplest sort procedures use the integer values of the native character set; otherwise a redefinition table is created. Different computers have different native character sets, but in all of them the succession of characters has to be redefined in order to achieve correct lexicographic alphabetizing. For the basic considerations in this paper it is unimportant which character set is used. For my examples I have used the ANSI character set that is used in PCs with Windows, in Prime computers, in VAX computers, on the Internet etc. (minor differences may occur). The first 128 characters (#0-127) are the same as in the 7-bit ASCII character set and in the PC DOS character set.
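The difference between sorting by native integer values and sorting by a redefinition table can be sketched in Python as follows. The table shown (A a B b ... Z z) is a hypothetical example, not the table of any particular program, and Python's ord is used as the native value, which coincides with ASCII for the values 0-127.

```python
import string

# Hypothetical redefinition table: A=0, a=1, B=2, b=3, ... Z=50, z=51.
REDEFINED = {}
for position, letter in enumerate(string.ascii_uppercase):
    REDEFINED[letter] = 2 * position
    REDEFINED[letter.lower()] = 2 * position + 1

def native_key(text):
    """Sort by the integer values of the native character set."""
    return [ord(ch) for ch in text]

def redefined_key(text):
    """Sort by the redefinition table; characters missing from the table
    keep their native value, shifted past the end of the table."""
    return [REDEFINED.get(ch, 1000 + ord(ch)) for ch in text]

words = ["bar", "Fuss", "aal", "Bar"]
print(sorted(words, key=native_key))     # ['Bar', 'Fuss', 'aal', 'bar']
print(sorted(words, key=redefined_key))  # ['aal', 'Bar', 'bar', 'Fuss']
```

With the native values all uppercase letters precede all lowercase letters; the redefinition table interleaves them so that the words come out in lexicographic order.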

ANSI Character Set


000     032 [SP] 064 @    096 `    128      160      192 À    224 à

001     033 !    065 A    097 a    129      161 ¡    193 Á    225 á

002     034 "    066 B    098 b    130 ‚    162 ¢    194 Â    226 â

003     035 #    067 C    099 c    131 ƒ    163 £    195 Ã    227 ã

004     036 $    068 D    100 d    132 „    164 ¤    196 Ä    228 ä

005     037 %    069 E    101 e    133 …    165 ¥    197 Å    229 å

006     038 &    070 F    102 f    134 †    166 ¦    198 Æ    230 æ

007     039 '    071 G    103 g    135 ‡    167 §    199 Ç    231 ç

008     040 (    072 H    104 h    136 ˆ    168 ¨    200 È    232 è

009     041 )    073 I    105 i    137 ‰    169 ©    201 É    233 é

010     042 *    074 J    106 j    138 Š    170 ª    202 Ê    234 ê

011     043 +    075 K    107 k    139 ‹    171 «    203 Ë    235 ë

012     044 ,    076 L    108 l    140 Œ    172 ¬    204 Ì    236 ì

013     045 -    077 M    109 m    141      173     205 Í    237 í

014     046 .    078 N    110 n    142      174 ®    206 Î    238 î

015     047 /    079 O    111 o    143      175 ¯    207 Ï    239 ï

016     048 0    080 P    112 p    144      176 °    208 Ð    240 ð

017     049 1    081 Q    113 q    145 ‘    177 ±    209 Ñ    241 ñ

018     050 2    082 R    114 r    146 ’    178 ²    210 Ò    242 ò

019     051 3    083 S    115 s    147 “    179 ³    211 Ó    243 ó

020     052 4    084 T    116 t    148 ”    180 ´    212 Ô    244 ô

021     053 5    085 U    117 u    149 •    181 µ    213 Õ    245 õ

022     054 6    086 V    118 v    150 –    182 ¶    214 Ö    246 ö

023     055 7    087 W    119 w    151 —    183 ·    215 ×/Œ  247 ÷/œ

024     056 8    088 X    120 x    152 ˜    184 ¸    216 Ø    248 ø

025     057 9    089 Y    121 y    153 ™    185 ¹    217 Ù    249 ù

026     058 :    090 Z    122 z    154 š    186 º    218 Ú    250 ú

027     059 ;    091 [    123 {    155 ›    187 »    219 Û    251 û

028     060 <    092 \    124 |    156 œ    188 ¼    220 Ü    252 ü

029     061 =    093 ]    125 }    157      189 ½    221 Ý    253 ý

030     062 >    094 ^    126 ~    158      190 ¾    222 Þ    254 þ

031     063 ?    095 _    127      159 Ÿ    191 ¿    223 ß    255 ÿ

The first 32 characters (0-31) are control characters that are defined in the ASCII table for telecommunications (e.g. teletyping). In the PC DOS character set some of these values are used for special printable characters, and the values 128-255 represent characters other than those of the ANSI character set. In the VAX VMS character set the values 128-159 are occupied by special control characters, and some characters with higher integer values are missing.
Þ and þ are only available in some versions of the VMS operating system.
In VMS Œ has the integer value 215, and œ has the value 247, thus correctly grouped with O/o, but most printers will print × and ÷ according to the ANSI norm.
Text is alphabetized according to the character table used, either as the only criterion or, when there are more criteria, as the primary criterion level. Using the ANSI table (and in this case also the ASCII table) F has the value 70, and b has the value 98. Thus, when sorting with this table, F will precede b because 70 is less than 98.

Principles of Some Sort Procedures


... to be continued!

This article is currently being translated from German.
In the meantime you can read the German version: Alphabetisierung auf Computer, or download the German version alphabet.zip, a compressed file that can be unzipped to a WordPerfect document.
My primary languages are Danish and German. Please, feel free to correct my English, and let me know about any wording or terminology that you do not understand, in order to improve the language of this article.
Other comments are also welcome.

Hans Christophersen