First part of a translation currently being done from German:
Computer Alphabetizing
Principles, Problems, and an Attempt of Improvement
By Hans Christophersen
© Copyright 1996
Introduction
In this article I shall describe some of the problems of computer alphabetizing.
A special problem arises when sorting national characters. Different countries
have different requirements concerning the succession of accented and special
letters.
Software generally cannot satisfy national requirements. This is unsatisfactory
for countries whose national alphabets contain accented and special letters
that must be sorted in an order different from the one produced by software
made in other countries.
I shall put forth an algorithm that can be adapted to reproduce any required
sorting order.
International Alphabetizing
Words and phrases in dictionaries and encyclopaedias are alphabetized because
information should be retrievable as quickly as possible, and thus
text strings should be sorted in a well-known order. But this order may vary from
country to country. Alphabetizing is most complicated when more than one language
is involved. A telephone book contains names originating from many
different languages, and it would be impossible to sort these names by
different criteria.
The sorting order of the letters A-Z, a-z is the same in all the European
languages I have examined, but some national special letters are sorted
differently in different countries. In Germany ö is alphabetized under
o, but according to the Swedish rules it is the last letter of the
alphabet.
We have to give a more explicit formulation of the notion of international
alphabetizing, because national rules conflict. In every case of alphabetizing
one language must be predominant, assuring that all alphabetizing rules of this
language be followed. Foreign letters that do not exist in this language can be
alphabetized following the rules of the foreign language. In a Swedish
encyclopaedia it will be possible to place ß under ss
according to German rules, because ß is not found in any Swedish word.
The special rules in the languages I have investigated may be grouped in four
categories:
1. Special letters with a specific place:
The special letters are sorted as follows in the countries concerned:
1) Danish and Norwegian: ... Z, z, Æ, æ, Ø, ø,
Å, å.
2) Swedish: ... z, Z, å, Å, ä, Ä, ö,
Ö.
3) Icelandic: ... X, x, Y, y, Þ, þ, Æ, æ,
Ö, ö.
4) Spanish: ... N, n, Ñ, ñ, O, o ...
2. Special letters alphabetized as ordinary letters:
I. Some letters are ligatures that are decomposed:
1) French: Œ and œ (to be alphabetized as OE or Oe,
and oe).
2) German: ß (to be alphabetized as ss, but after the sequence
ss, even if it is originally a ligature of s + z).
II. These special letters are treated as D/d with a diacritical mark:
1) Faroese: Ð and ð.
2) Icelandic: Ð, ð and d with a cross (which cannot be
represented here).
3. Letter compositions with a specific place (digraphs):
1) Danish/Norwegian: aa (as å; in Danish before 1948 as aa, in
Norwegian before 1917 as aa).
2) Spanish: ch (between c and d); ll (between l and m).
(In Spanish lynchar precedes llaca, and czar precedes chabacanada).
Note: In 1994 it was decided that the special Spanish sequences should be
abandoned in future. I consider this to be without importance to the
general notion of specific national rules, because the problem does not
concern Spain alone. In Welsh alphabetization "ch" and "ll", among other double
letters, are also treated as separate letters. Thus the Spanish
problem can be seen as just one example of a difficulty to be dealt with.
4. Letters to be sorted in one group:
Letters with diacritical marks are to be sorted as the basis letter (the primary
letter) without a diacritical mark (with the exceptions mentioned above for
individual languages). But when two text strings are equal without regarding the
diacritical marks, the diacritical marks are decisive for the sequence. A letter
without diacritical mark will always precede the same letter with a diacritical
mark. Thus we have a sequence such as Foret, Forêt, Forez.
But a sequence like Foréz, Forèz shows that a rule is
needed for sequencing the accented letters themselves.
In French alphabetization the sequencing rule is:
- ´ (accent aigu, acute accent)
- ` (accent grave, grave accent)
- ^ (accent circonflexe, circumflex)
- ¨ (tréma, dieresis)
According to Danish standard the same sequence is used when such letters occur in
proper names, e.g. é, è, ê, ë. But in Italian the
first two accents have to be interchanged; thus è precedes é in
Italian.
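The national accent orders described above can be written down as small rank tables. The following is a minimal sketch in Python, not part of the original article; the names ACCENT_ORDER, accent_rank and sort_variants are hypothetical, and only the four accents listed above are covered.

```python
import unicodedata

# Combining marks for the four accents discussed above.
ACUTE, GRAVE, CIRCUMFLEX, DIAERESIS = "\u0301", "\u0300", "\u0302", "\u0308"

# Hypothetical rank tables; the unaccented letter always gets rank 0.
ACCENT_ORDER = {
    "french": [ACUTE, GRAVE, CIRCUMFLEX, DIAERESIS],
    "italian": [GRAVE, ACUTE, CIRCUMFLEX, DIAERESIS],
}

def accent_rank(letter, language):
    """Return 0 for a plain letter, else 1 + the position of its accent."""
    decomposed = unicodedata.normalize("NFD", letter)
    marks = [ch for ch in decomposed if unicodedata.combining(ch)]
    if not marks:
        return 0
    return 1 + ACCENT_ORDER[language].index(marks[0])

def sort_variants(letters, language):
    """Sort variants of one basis letter by the national accent order."""
    return sorted(letters, key=lambda ch: accent_rank(ch, language))

print(sort_variants(["ë", "è", "e", "ê", "é"], "french"))   # e é è ê ë
print(sort_variants(["ë", "è", "e", "ê", "é"], "italian"))  # e è é ê ë
```

The table only ranks variants of one basis letter; it is meant to be used as a secondary criterion, after the basis letters themselves have been compared.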
When sorting is done using the sequence of the native character set of a VAX
computer (ANSI with small deviations), the sequence of the diacritical marks is:
` ´ ^ ¨ . Here we also have the tilde
(~). In languages other than Spanish it would be apposite to
treat the tilde as a diacritical mark, but in Spanish ñ and Ñ are
treated as specific letters (as described under 1) special letters with a
specific place).
A colleague of mine thinks that the ANSI sequence matches the Italian
order (especially ` grave accent before ´ acute accent) because of
the greater influence of Italian lobbying in America.
Alternative sorting criteria
The following two examples relate to Danish and German, but they should be seen as
general problems. Analogous conditions may be found in any other language not
dealt with here.
Danish: Previously it was common to sort v and w in one
group, but with v preceding w, e.g.: ver, wer, vet. In the
dictionary Nudansk Ordbog this sorting principle was used until the middle
of the 1980s. Now it is considered obsolete and is hardly used any
more.
After the spelling reform in 1948 the letter å was introduced to
replace the letter combination (digraph) aa corresponding with one phonetic
sound. The letter å had been used to represent this sound in Swedish for
more than 200 years, and has been used in Norwegian since 1917. It was allowed
that aa could remain unchanged in proper names, and most people having
names with the digraph aa in it did not want to change the spelling of
their name. The choice was to write, e.g. Østergaard or
Østergård. According to standardized Danish alphabetizing rules
aa has to be sorted as å, the last letter of the Danish
alphabet. This rule should be applied also when aa is pronounced as a long
a, e.g. in the river name Maas, but this rule is considered foolish by most
people, and that is why it is seldom used. For example telephone books deviate
from this rule.
German: Nowadays the German letters ä, ö, ü are
normally treated as respectively a, o, u with the diacritical mark
¨ (umlaut). There is another way of alphabetizing these letters,
mainly used in older dictionaries, which orders the letters according to
their origin: ae, oe, ue. The umlaut mark is a remnant of an old
German hand-written e, stylized to two strokes and later to two points.
Using the first principle we have a sort order as: ae, aeh, afh, äg, äi.
Using the second principle these syllables will be sorted: ae, äg, aeh,
äi, afh. The first principle is the most common one, and in Austria and in
Switzerland it is the only principle used. The second principle is for example
used in German state libraries and in German telephone books (e.g. Grützner
preceding Gruffke).
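The two German principles differ only in the sort key derived from each word. The following is a minimal sketch in Python under that assumption; the function names are hypothetical, and case and other special letters are handled only as far as the examples require.

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks, e.g. 'ä' -> 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def key_diacritic(word):
    """First principle: ä, ö, ü sort as a, o, u; the original breaks ties."""
    return (strip_marks(word.lower()), word)

# Second principle: expand each umlaut letter to basis letter + e.
EXPANSIONS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})

def key_expanded(word):
    """Second principle: ä, ö, ü sort as ae, oe, ue."""
    composed = unicodedata.normalize("NFC", word).lower()
    return (composed.translate(EXPANSIONS), word)

syllables = ["äi", "afh", "ae", "äg", "aeh"]
print(sorted(syllables, key=key_diacritic))  # ae, aeh, afh, äg, äi
print(sorted(syllables, key=key_expanded))   # ae, äg, aeh, äi, afh
```

With key_expanded, Grützner is compared as gruetzner and therefore precedes Gruffke, as in the telephone book example.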
Three optional sorting criteria
1) With significant space character (»Nothing preceding
Something«)
Example: à condition, a conto, Ackja, Acre.
The space character is considered to be a character preceding all visible
characters. According to Danish standard (Dansk Standard Nr. 377), this criterion
is applied by Danish libraries, and it is used in Denmark as the main rule.
2) Strictly alphabetically (»character by character«)
Example: Ackja, à condition, a conto, Acre.
The space character is considered as being not present. This sequencing rule is
used, e.g. in the Danish orthography dictionary edited by Dansk Sprognævn,
and in the German dictionaries Duden Deutsches Universalwörterbuch and Duden
Rechtschreibung.
3) According to interdependence
Example: Affe, Affengeschwindigkeit, Affenhaus, Affekt.
These words are German words. The first one means 'monkey', and the next two words
are compound words initiated with Affe[n]-, and the last one has no connection to
the others.
Words with similar meanings, derivatives, and compounds are held together, but the
sequencing is a logical one depending on meanings, not on the alphabetical order
of the letters.
A modified version of the interdependence criterion is found in the German
dictionary Duden Das große Wörterbuch der deutschen
Sprache in 6 Bänden. Here the words are found in this sequence: Affe,
Affekt, Affenhaus, Affengeschwindigkeit. 'Affe' has been separated from the
'Affe[n]' compounds. Next the 'Affe[n]' compounds have been divided into two
groups, and Affenhaus is classed with the first group, and Affengeschwindigkeit
with the second one. Thus Affenhaus precedes Affengeschwindigkeit, even if they
had to be interchanged according to the alphabetical order of the letters.
Doing the sorting with a computer the first criterion can be managed very easily,
because two text strings are compared without any intervening changes, i.e. in
their original representation.
Using the second criterion, all space characters have to be removed from the
two text strings being compared in order to determine their alphabetical
sequence. But a record of the original text strings has to be kept, because
two text strings may be equal after the spaces are removed, and in that case
the sequence should be determined from the original character sequences of the
strings. An unambiguous sequence is a logical requirement when computers are
doing the sorting. An example is the two text strings: a ba and aba.
Once the space character has been removed from the first string, the two strings
seem equal, but it would be inappropriate not to be able to determine from the
original strings whether a ba precedes aba, or the reverse.
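The first two criteria can be sketched as sort keys. This is a minimal illustration in Python with hypothetical function names; it assumes that case and diacritical marks are ignored at the first level, and it breaks ties by the original string, which happens to place a ba before aba (the opposite choice would be equally unambiguous).

```python
import unicodedata

def base_letters(text):
    """Lower-case the text and strip diacritical marks ('à' -> 'a')."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def key_significant_space(text):
    """Criterion 1: the space is kept and precedes all letters."""
    return (base_letters(text), text)

def key_strict(text):
    """Criterion 2: spaces are ignored; the original string breaks ties."""
    return (base_letters(text).replace(" ", ""), text)

entries = ["Acre", "a conto", "Ackja", "à condition"]
print(sorted(entries, key=key_significant_space))
# à condition, a conto, Ackja, Acre
print(sorted(entries, key=key_strict))
# Ackja, à condition, a conto, Acre
print(sorted(["aba", "a ba"], key=key_strict))
# a ba, aba
```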
The third criterion, sorting according to interdependence, can not easily be
implemented. It is obvious that a non-alphabetical sequencing cannot depend on
alphabetical order. In some cases a special sort sequence is obtained through a
sort key being stored in a database.
Some English compound words may be written as one word, as two (or more) words, or
with a hyphen. That is a problem because the same word would occur at different
places. For instance, when making a frequency list from texts by different authors,
it is tedious work to gather differently spelled words into one group.
When the words to be sorted come from a text file with word divisions in it, it
may be difficult to decide whether a hyphen should be kept because it is a
component of a compound word, or removed because it only marks a word division.
A human being can make the decision, but it is much more difficult to devise an
algorithm that lets a computer make it. A problem arises in German because of
the rule that words with ck are hyphenated k-k. When divided,
Lücke is written Lük-ke. But a personal name
Lükke would also be hyphenated Lük-ke. A computer would
not be able to decide whether the origin is Lücke or Lükke.
According to my research into how alphabetizing is done in different languages, the
rule is that numerals (0-9) precede all letters. Lower-case and upper-case letters
are sorted without regard to case (e.g. Deb, dek, Del). But when two words are equal
apart from case, upper-case will precede lower-case in most languages (e.g.
Karl preceding karl). In Swedish karl will precede Karl.
Sort sequence of special characters
When computers are doing the sorting of text strings it is required that each
special character has a predefined position in order to determine the sequence.
This requirement is especially important when an entry word (lemma) consists of
nothing else than special characters, e.g. in a computer book the shift operator
<<.
In traditional lexicography there are three ways of treating special
characters:
As blind character: The punctuation marks (e.g. comma and question
mark) are not considered, i.e. the sorting is done as if these characters were not
there. Apostrophes (e.g. in: Peter's Bar) and ellipsis hyphens (e.g. in German:
Männer- und Frauenberufe) are also treated as blind characters.
Blind characters are . , ; : ? ! ( ) [ ] » « "
' -
A special case is the dash. According to typographical rules it is longer than a
hyphen (en-dash or em-dash), and it is surrounded by space
characters. This sequence (space dash space) is treated like one space.
As space character: Hyphens and slashes are treated like spaces.
Examples: the name Heinz-Dieter (to be alphabetized as: Heinz Dieter), and/or
(alphabetized as: and or).
However, there is no rule without exceptions. In some cases hyphens and slashes are
treated as blind characters: 1) when two parts are closely connected and could not
occur separately (e.g. U-Boot, U-boat), and 2) when a special character is
optional (e.g. in the abbreviation B/W = BW = black/white).
In the dictionary of computer abbreviations that I have edited, edb-forkortelser, I made the decision to treat all special
characters as blind characters in order to make it easier to find abbreviations
and acronyms, when an entry may contain hyphen, slash, or space, or only letters.
For instance the dictionary has the entries A&T, A/T, and
AT being sorted primarily as AT, and secondarily in the sequence stated;
& preceding / according to the sequence in computer alphabets.
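Treating special characters as blind characters amounts to a two-part sort key: the entry with the special characters removed, then the entry as written. The following Python sketch is only an illustration of that idea; the name key_blind is hypothetical, and for simplicity every character that is not an ASCII letter or digit is treated as blind.

```python
import re

# Assumption for this sketch: anything but ASCII letters and digits is blind.
BLIND = re.compile(r"[^0-9A-Za-z]")

def key_blind(entry):
    """Sort primarily without special characters, secondarily as written."""
    return (BLIND.sub("", entry).upper(), entry)

entries = ["AT", "ATA", "A/T", "AS", "A&T"]
print(sorted(entries, key=key_blind))  # AS, A&T, A/T, AT, ATA
```

The secondary part compares the raw strings, so & precedes / and both precede the letter T, as in the computer character set.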
After the pronunciation: A special character being pronounced as a
word should be alphabetized as the written word. Thus an ampersand, &,
is changed to und in German, and in English, och in Swedish,
et in French, etc. This rule is seldom used as almost nobody knows about
it, and it conflicts with any effort of making international alphabetizing rules.
It would be very troublesome to implement this rule in computers, and I think it
should be neglected.
The Concept of Letters
The letters occurring in the alphabets of the countries I have dealt with, can be
grouped into three classes, one of which has four subgroups.
Plain Letters
Plain letters are letters occurring in all Latin alphabets, A-Z and a-z. In all
countries plain letters are alphabetized in the same sequence relative to other
plain letters. Exceptions are the combinations aa in Danish and Norwegian,
and ch and ll in Spanish and Welsh.
Letters With Diacritical Marks
Letters with diacritical marks are alphabetized as the basic letters (plain
letters) without regarding the distinction if some other difference occurs in the
two text strings that are compared to determine their sequence. The letter
é is a letter with diacritical mark, and is treated as e as
the primary rule. The German letter ö is considered a letter with
diacritical mark if ok precedes ö, but it is not if
ö is decomposed into oe, rendering the sequence öm,
ok.
Special characters
1) Special letters without resemblance with plain letters: In this group we
have the Icelandic letter þ with its special place in the Icelandic
alphabet.
2) Special letters with resemblance with plain letters: These letters look
like letters with diacritical marks, but are special because they have their
special place in the alphabet according to national rules. Examples are the Nordic
letter å and the Swedish ö (opposite to German
ö).
3) Dissolvable ligatures: These are letters that may be written as a
ligature or as two separate letters, and being alphabetized as separate letters.
Example: French œ = oe.
4) Ligatures considered as special letters: Letters written together,
having a special name and their special place in the alphabet and considered as
special letters in the countries concerned. From an orthoepic point of view these
ligatures have to be considered as digraphs as they represent one sound. Example:
in Danish and Norwegian æ (after z in the alphabet).
Diacritical marks are used to indicate a distinction between letters, mostly
concerning pronunciation (e.g. c/ç, e/é). Some
diacritical marks are also called accents (e.g. ´ over
é). Most letters with a diacritical sign are sorted as the basis
letter as the primary criterion, but others are treated differently having an
exceptional position. In the Scandinavian countries the ring º above
å is not seen as a diacritical mark, because å in all respects
is another letter than a.
A ligature is a typographical type where two or more letters are joined. In some
cases this is without importance to alphabetizing (e.g. ff joined, or fl joined).
In other cases a ligature is considered a new independent letter that can only be
represented as one letter. In the Scandinavian languages æ cannot be
replaced by ae.
Establishing a Concept Model
I should like to define some terms of alphabetizing criteria to be used when
dealing with alphabetizing, not only lexicographically but especially in
connection with computer alphabetizing.
Primary Criterion
The primary criterion is to decide if two letters belong to the same group of a
basis letter. The letters E, É, e, é, ê etc. are
primarily equal because the basis letter in all cases is E/e. But
for instance g and k are primarily unequal because the basis
letters are different.
In the two text strings forêt and forez there is a primary
difference occurring at the 5th letter of each string, and thus alphabetizing is
done according to the primary criterion (the basis letters) without regarding the
diacritical mark ^.
The two text strings foret and forêt are primarily equal. The
primary criterion will not suffice to determine the sequence of these words.
Secondary Criterion
Two letters being primarily equal (same basis letter) are secondarily unequal if
there is a difference regarding diacritical marks. For instance the letters e,
é, è, ê and ë are primarily equal, but
secondarily unequal.
Using the secondary criterion e precedes ê, foret will precede
forêt. The two text strings are primarily equal, but secondarily
unequal. Considering the two text strings forêt and forez, we
find a primary difference (t different from z), and thus the
secondary criterion is without importance to the alphabetizing of these words.
Tertiary Criterion
In most cases the two first criteria will lead to a satisfactory result. But
sorting algorithms must be improved, if possible. Using lexicographical
requirements lowercase and uppercase letters have to be sorted together, i.e. A a
B b C c etc., uppercase letters having precedence in most languages. Using
only two criterion levels, and the difference between lowercase and uppercase
letters being a secondary criterion, we would have sequences like: Foret
Forêt foret forêt Gómêt Gòmét
gómêt gòmét.
Thus Foret would be separated from foret, Forêt from
forêt, and likewise the <gomet> sequence would be sorted
in a way letting the case of a letter have greater importance than diacritical
marks in a text string.
The tertiary criterion shall tend to keep text strings together when they are
equal according to the first criterion (equal basis letters) and the second
criterion (equal basis letters with the same diacritical marks), and the only
difference is due to upper and lowercase letters.
The second criterion is only needed when the first criterion is not sufficient,
and likewise the third criterion is only used when the sort sequence is not
determined using the first two criteria.
I agree with the opinion of Dansk Sprognævn (the Danish language advisory
committee) that the example words should be alphabetized:
Foret foret Forêt forêt Gómêt gómêt
Gòmét gòmét.
If we want to implement this sorting in a computer program we need three
alphabetizing criteria.
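The three criteria can be expressed as one sort key with three parts that are compared in turn. The following is a minimal sketch in Python, not the algorithm of the article itself; the names are hypothetical, the accent ranks follow the French order given earlier, and only letters with at most one of these accents are handled.

```python
import unicodedata

# Assumed accent ranks: none 0, acute 1, grave 2, circumflex 3, diaeresis 4.
ACCENT_RANK = {"\u0301": 1, "\u0300": 2, "\u0302": 3, "\u0308": 4}

def three_level_key(word, upper_first=True):
    """Build (basis letters, accent ranks, case flags) for one word."""
    primary, secondary, tertiary = [], [], []
    for letter in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", letter)
        base, marks = decomposed[0], decomposed[1:]
        primary.append(base.lower())                     # first criterion
        secondary.append(ACCENT_RANK.get(marks[:1], 0))  # second criterion
        tertiary.append(0 if letter.isupper() == upper_first else 1)
    return (primary, secondary, tertiary)

words = ["gòmét", "forêt", "Gómêt", "Foret", "gómêt", "Forêt", "Gòmét", "foret"]
print(sorted(words, key=three_level_key))
# Foret foret Forêt forêt Gómêt gómêt Gòmét gòmét
```

Because the parts are compared in order, a later part is consulted only when all earlier parts are equal. Passing upper_first=False gives lowercase letters precedence.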
Look-ahead
Keeping the rules given in the three criteria above depends on a look-ahead
ability. When a secondary difference occurs (e.g. e and ê) it
can still not be determined which text string comes first, because the second
criterion is only applied when the two text strings being compared are equal
according to the first criterion. When a sort algorithm alphabetizes in the
sequence foret, forez, forêt we must conclude that it has no look-ahead
ability, because a secondary difference was decisive although a primary
difference was present. The correct sequence is: foret, forêt,
forez.
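The effect of look-ahead can be shown with two comparison functions. The following Python sketch is an illustration only, with hypothetical names; it ignores case and ranks accents simply by their code points.

```python
import unicodedata
from functools import cmp_to_key

def split(letter):
    """Split one letter into (basis letter, diacritical marks)."""
    decomposed = unicodedata.normalize("NFD", letter)
    return decomposed[0], decomposed[1:]

def compare_no_lookahead(a, b):
    """The first difference of any kind decides immediately."""
    a, b = (unicodedata.normalize("NFC", s) for s in (a, b))
    for x, y in zip(a, b):
        if x != y:
            return -1 if split(x) < split(y) else 1
    return len(a) - len(b)

def compare_with_lookahead(a, b):
    """An accent difference is remembered; a later primary difference wins."""
    a, b = (unicodedata.normalize("NFC", s) for s in (a, b))
    pending = 0
    for x, y in zip(a, b):
        (base_x, marks_x), (base_y, marks_y) = split(x), split(y)
        if base_x != base_y:
            return -1 if base_x < base_y else 1
        if pending == 0 and marks_x != marks_y:
            pending = -1 if marks_x < marks_y else 1
    if len(a) != len(b):
        return len(a) - len(b)
    return pending

words = ["forez", "forêt", "foret"]
print(sorted(words, key=cmp_to_key(compare_no_lookahead)))    # foret forez forêt
print(sorted(words, key=cmp_to_key(compare_with_lookahead)))  # foret forêt forez
```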
Alphabetizing Problems
I) Common problems sorting manually and with computers:
A language of precedence must be determined because some letters are sorted
differently according to national rules. In particular this concerns special
letters and letter compositions with a specific place (digraphs); but also e.g.
ö which in German is sorted under o, but in Swedish and
Icelandic it is the last letter of the alphabet, and when it occurs in Danish in a
proper name it is sorted under ø (in case of equal strings
ø has precedence, e.g. Øster preceding Öster).
How to handle foreign characters has to be determined. Characters which
occur in only one foreign language could be sorted according to the rules of this
country, e.g. ß and œ, but again it is a question where the users of a
dictionary would look for them. In a German encyclopaedia the physical unit
Å (ångström, having its name after the Swedish physicist Anders
Ångström, 1814-74) is found under A and not according to Swedish
rules, even if Å in the countries where used is not considered a
letter with diacritical mark but a special letter.
One of the optional sort criteria has to be chosen. The sorting can be
strictly alphabetical (most frequently used), or with significant space (most
simple to implement in computer programs), or according to interdependence (which
may make it difficult to find entries, as this principle appears strange to most
people).
A uniform sorting principle has to be applied when several rules compete. In
one language several alphabetizing rules may be equally valid, or several
rules may be applied in practice. But different sets of rules must not be
intermixed. In every single case one principle must be chosen; when a computer
does the sorting there is no other way.
In German we have two competing principles concerning Ä/ä,
Ö/ö, Ü/ü, either being basic letters with diacritical marks,
or as ligatures to be considered as basic letter + e.
In Danish telephone books aa is alphabetized after two different criteria.
When pronounced as å and being a variant of the spelling
å it is alphabetized under å which is the last letter of
the alphabet, but when pronounced [a] or [a:] it is alphabetized as aa,
even if it is not according to the official standard.
II) Special problems when sorting with computers:
Sorting according to pronunciation: For example in the Austrian telephone
book of Vienna (Wien) 1970, the letter composition ue is alphabetized
according to two different principles. One principle is applied when ue is
pronounced like ü, [y], and another when it is pronounced as a long
u, [u:]. Similarly, in Danish telephone books aa is sorted as
aa when pronounced as long a, [a:], and as å when
pronounced as å. It is rather complicated to implement computer
sorting according to pronunciation. At least a database with alphabetizing
information is needed. A database record may have a display field (how the entry
is spelled), and a sort key to determine the alphabetizing sequence (this might be
an alternative spelling that would lead to correct alphabetization).
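A record with a display field and a separate sort key might look as follows. This is a minimal sketch in Python; the class Entry, the function alphabetize and the sample records are hypothetical illustrations, and the sort keys are assumed to contain only lowercase letters of the Danish alphabet.

```python
from dataclasses import dataclass

DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"

@dataclass
class Entry:
    display: str   # how the entry is spelled
    sort_key: str  # alternative spelling used only for ordering

def alphabetize(entries):
    """Order entries by their sort key, using the Danish letter sequence."""
    return sorted(
        entries,
        key=lambda e: [DANISH_ALPHABET.index(ch) for ch in e.sort_key],
    )

# Illustrative records: aa is keyed as å where it is pronounced as å,
# and left as aa where it is pronounced as a long a.
entries = [
    Entry("Aagaard", "ågård"),
    Entry("Østergaard", "østergård"),
    Entry("Maas", "maas"),
    Entry("Madsen", "madsen"),
]
print([e.display for e in alphabetize(entries)])
# Maas, Madsen, Østergaard, Aagaard
```

The decision about pronunciation is made once, by whoever maintains the records, and the sorting routine itself stays simple.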
Digraphs: Some digraphs may be troublesome because of a special ordering
sequence, but others are uncomplicated as e.g. ph pronounced as [f], as it
is sorted as p-h manually (and lexicographically) and with computers. The
troublesome digraphs are the ones being treated in some special way according to
national rules.
Computer alphabetizing normally will rely on a table rendering the primary sort
order keyed to a native character table of the computer, but this method is not
suited to handle the digraph problem.
The first example originates from alphabetizing with the Danish release of
WordPerfect version 5.1. When alphabetizing the digraph aa (to be sorted as
å), and å, the result can be the sequence: aal, ål,
aal, telling us that the algorithm used lets aa be equal to
å without precedence. According to Danish orthography å
precedes aa. It is possible to let computers distinguish between aa
and å, but implementation algorithms will be somewhat
complicated.
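One way of distinguishing aa from å while still sorting them together is a key with two parts. The following Python sketch is an assumption-laden illustration, not the method of any program mentioned here: the name key_danish is hypothetical, words are assumed to contain only letters of the Danish alphabet, and every aa is taken to be the digraph.

```python
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"

def key_danish(word):
    """Primary: aa counts as å. Secondary: å precedes the spelling aa."""
    word = word.lower()
    primary, secondary = [], []
    i = 0
    while i < len(word):
        if word[i:i + 2] == "aa":
            primary.append(DANISH_ALPHABET.index("å"))
            secondary.append(1)  # written as the digraph aa
            i += 2
        else:
            primary.append(DANISH_ALPHABET.index(word[i]))
            secondary.append(0)
            i += 1
    return (primary, secondary)

print(sorted(["aal", "zebra", "ål", "abe"], key=key_danish))
# abe, zebra, ål, aal
```

The sketch cannot recognize an aa that spans the two parts of a compound word; that would require information about the individual words.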
There are also difficulties with the Spanish (and Welsh) digraphs. The digraph
ch is sorted as a special letter between c and d, and
ll is between l and m. Often computer alphabetizing will not
regard this aspect. Another problem is that ch and ll are not
alphabetized as digraphs in foreign words, but are treated as c-h and l-l.
No simple computer algorithm will be capable of determining if these
digraphs occur in a native or in a foreign word.
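The traditional Spanish sequence can be reproduced by splitting each word into letters and digraphs before ranking them. The following is a minimal Python sketch with hypothetical names; it assumes unaccented words and, as described above, it cannot tell native words from foreign ones.

```python
# Traditional Spanish letter sequence with ch and ll as separate letters.
SPANISH = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
           "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
           "v", "w", "x", "y", "z"]
RANK = {letter: position for position, letter in enumerate(SPANISH)}

def tokenize(word):
    """Split a word into letters, preferring the digraphs ch and ll."""
    word = word.lower()
    tokens, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens

def key_spanish(word):
    return [RANK[token] for token in tokenize(word)]

words = ["llaca", "chabacanada", "lynchar", "czar", "dama"]
print(sorted(words, key=key_spanish))
# czar, chabacanada, dama, lynchar, llaca
```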
Demonstration of Alphabetizing Criteria
Having analyzed alphabetizing principles in computer programs, I have found that
sorting is done differently by different software, and that sort sequences in many
cases vary from manual sorting based on lexicographic principles.
In order to determine how many criterion levels a computer program uses when
alphabetizing, we will use an input sequence suitable for producing an output sequence
that will tell which criteria and criterion levels are used. It is without
importance to the principle of alphabetizing whether the implied text strings have
any lexical meaning in any language.
Number of Criterion Levels
The number of criterion levels can be determined using this test input:
forêt Foret Forêt forez foret
Three Criterion Levels:
Text strings being equal regarding diacritical marks are sorted together without
considering the difference between uppercase and lowercase letters. There is a
priority range of letters being variants of the same basic letter. The basic
letter without diacritical mark precedes the same letter accented, when no more
important differences are found examining the sequencing letters (look-ahead).
Normally an uppercase letter will precede a lowercase letter (when being
characters of the same basic letter), and thus the three criterion level principle
leads to the sequence:
III-1) Foret foret Forêt forêt forez
When lowercase letters precede uppercase letters (e.g. Swedish sequencing) the
sort sequence is:
III-2) foret Foret forêt Forêt forez
We still have three criterion levels. The determining fact is, that programs with
one or two criterion levels are not able to render any of these sequences.
Two Criterion Levels:
With two criterion levels the difference between uppercase and lowercase letters
does not differ from the nature of any other distinction concerning a basic
letter. The first occurring difference is decisive to the alphabetizing order,
except when the two characters being compared are predefined to be sorted together
(i.e. primary equal and secondary unequal), and the sort sequence is determined
through some subsequent primary difference (i.e. look-ahead). With two criterion
levels, there are two possible sequences. In the first sequence an uppercase
letter precedes a lowercase letter; and in the second one lowercase precedes
uppercase:
II-1) Foret Forêt foret forêt forez
II-2) foret forêt forez Foret Forêt
The first sequence satisfies most national rules, but in some countries (e.g.
Sweden) lowercase has precedence.
Of decisive importance is that the two sequences above cannot be obtained when
sorting with programs having only one criterion level.
At the time of my research (1994) I found two criterion levels using e.g. Paradox,
WordPerfect v. 5.1, and MS-Word.
One Criterion Level
A program only has one criterion level, if it cannot produce any of the sort
sequences given above to decide three or two levels. We can describe it as a lack
of look-ahead, because e.g. e and ê will always be sorted in
some fixed sequence without regarding subsequent characters. An example of one
criterion level sorting is this simple ANSI sequence:
I Foret Forêt foret forez forêt
The main criterion when having only one criterion level, is that the first
difference is decisive; in this case F/f and e/ê (no look-ahead). There also
is only one criterion level in the sort sequence: foret forez
forêt.
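The simple one criterion sequence is what a plain comparison of character codes produces. As a small illustration, Python's built-in sorted() compares strings by code point, which for these letters agrees with the ANSI table:

```python
import unicodedata

words = ["forêt", "Foret", "Forêt", "forez", "foret"]
# Make sure every accented letter is a single precomposed character.
words = [unicodedata.normalize("NFC", w) for w in words]

one_level = sorted(words)  # plain code point comparison, no look-ahead
print(one_level)           # Foret Forêt foret forez forêt
```

F precedes f, and e precedes z which precedes ê, simply because of their positions in the character table.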
It is possible to divide one criterion level sorting into three subgroups:
- simple sequence according to the native character table of the
computer (e.g. ANSI table)
- improved one criterion sorting, grouping together some characters
without priority
- expanded improved one criterion sorting, grouping together some
characters with priority.
It is more difficult to determine the one criterion subgroup than to
determine the number of criterion levels. The text strings used above
will not suffice.
When it is determined that a program has only one criterion level, and some sort
sequence varies from the sequence of the native character table (e.g. ANSI),
possibility 1) can be excluded.
When a test input like ret ret rét can lead to the output ret
rét ret we realize that e and é are treated as one
letter. Repeated sorting with the same input can give any sorting of the three
text strings. Because there is no look-ahead we have only one criterion level, and
because there is no priority between e and é we can tell that
the sorting is done according to the one level criterion subgroup 2.
A simple ANSI or ASCII sequence (I-1) is obtained using only the compare
function in C, strcmp; the version of Oracle that I investigated had this
sorting. The expanded improved one criterion sorting I found using Ingres
and Danbase (1994).
Partial Deviation From the Criterion Levels
Ordinarily a sorting program has only one criterion level when you encounter some
undefined sequencing of characters sorted together, but some programs having two
criterion levels may have only one level in connection with certain characters or
character groups. Be sure to use a suitable test input so as not to draw too hasty
conclusions. WordPerfect (v. 5.1) in most cases uses two criterion levels, but in
the Danish edition å is grouped with aa, however with no
predefined order, so a sorting procedure may lead to the sequence: aal ål
aal (one criterion level). Sorting the sequence ret rét ret will
always give the same output: ret ret rét (two criterion levels). Many
programs with two criterion levels will use only one criterion level when sorting
ß, and (expanded improved one criterion level); this error has
remained in WordPerfect version 6.1.
Strictly Alphabetically or With Significant Space Character?
It is seldom that sorting is done according to the interdependence principle which
is difficult to implement in computer programs, but when used, the other two
optional sorting criteria are not (always) observed. Examples are found in
connection with compound words, e.g. German Finger and
fingerförmig preceding Fingerei in: DUDEN Das große
Wörterbuch der deutschen Sprache in 6 Bänden.
Whether the space character is regarded when alphabetizing can be decided using the
test input aba and a ca with two possibilities of output:
- Strictly alphabetically: aba, a ca
- Significant space character: a ca, aba
All the programs that I have examined sort with significant space character.
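The test can be automated for any sorting routine that can be called from a program. The following Python sketch uses the hypothetical name space_handling and assumes the routine takes a list of strings and returns them sorted.

```python
def space_handling(sort_function):
    """Classify a sorting routine by the output for the input aba, a ca."""
    result = list(sort_function(["aba", "a ca"]))
    if result[0] == "a ca":
        return "significant space"
    return "strictly alphabetical"

def sort_ignoring_spaces(items):
    """A routine that removes spaces before comparing."""
    return sorted(items, key=lambda s: s.replace(" ", ""))

print(space_handling(sorted))                # significant space
print(space_handling(sort_ignoring_spaces))  # strictly alphabetical
```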
Grouping Special Characters
In the ANSI and ASCII character sets, special characters are dispersed throughout
the sequence; thus you will find special characters preceding the numbers (e.g.
&, *), between numbers and uppercase letters (e.g. <, ?), between uppercase
and lowercase letters (e.g. [, ^), and succeeding lowercase letters (e.g. {,
~). Some programs will keep this sequencing (entirely or partially), and
others will let the special characters precede or succeed alphanumeric characters.
For your first investigation you may use the test input: & * < ? [
^ { 3 f F. The special characters may be grouped, or dispersed,
but in order to know if the sequence of the native character set of the computer
is kept unaltered, you would have to sort all available characters.
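As a rough illustration, the following Python sketch sorts the test input once by native character code and once with the special characters grouped in front. The grouping rule in grouped_key is only one assumed possibility, not the behaviour of any particular program.

```python
# Test input from the text above, one character per element.
chars = ["&", "*", "<", "?", "[", "^", "{", "3", "f", "F"]

# Native sequence: special characters stay dispersed among digits and letters.
by_code = sorted(chars)


def grouped_key(ch):
    """Hypothetical rule: all special characters precede alphanumerics."""
    return (ch.isalnum(), ch)


grouped = sorted(chars, key=grouped_key)

print(by_code)  # ['&', '*', '3', '<', '?', 'F', '[', '^', 'f', '{']
print(grouped)  # ['&', '*', '<', '?', '[', '^', '{', '3', 'F', 'f']
```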
Sequence of Accents
In most programs the accents have a predefined sequencing independent of the basic
letter. To determine this sequence, use the test input: a á à
â ã ä. But there may be differences depending on the basic
letter. To obtain a complete overview, you have to sort all characters of
the character set; this also reveals the position of other letters with
diacritical marks.
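A predefined accent sequence that is independent of the basic letter could be sketched in Python as follows. The order in ACCENT_ORDER is an assumption chosen to match the test input, and the helper names are hypothetical; real programs may use a different order.

```python
import unicodedata

letters = ["ä", "a", "ã", "à", "â", "á"]

# Sorting by native character code gives: a à á â ã ä (codes 97, 224-228).
native = sorted(letters)

# Assumed accent sequence: none, acute, grave, circumflex, tilde, diaeresis.
ACCENT_ORDER = ["", "\u0301", "\u0300", "\u0302", "\u0303", "\u0308"]


def split_accent(ch):
    """Split a letter into its basic letter and its combining accent."""
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0], decomposed[1:]


def accent_key(ch):
    """Primary criterion: basic letter; secondary criterion: accent rank."""
    base, accent = split_accent(ch)
    return (base, ACCENT_ORDER.index(accent))


by_accent = sorted(letters, key=accent_key)
print(native)     # ['a', 'à', 'á', 'â', 'ã', 'ä']
print(by_accent)  # ['a', 'á', 'à', 'â', 'ã', 'ä']
```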
Specific National Rules
We might investigate specific rules of any number of languages. The examples given
cannot be exhaustive, but will be an indication of problems to be dealt with.
Danish, Norwegian, and Swedish:
When aa and å are grouped together according to Danish and
Norwegian rules, a sort sequence will be, e.g.: z ål aal. In cases
where aa is not a digraph but arises where two parts of a compound word
meet, aa should not be sorted as å, but as two occurrences of
a. An example is the Danish word ekstraarbejde (German: Extraarbeit,
English: extra work). It is difficult to let a computer decide whether aa is a
digraph or not. It can only be done with a database containing information about
all words, and even then compound words can be formed in many ways.
In these languages the Icelandic letter þ is not sorted as in the Icelandic
sequence, but as th.
According to Danish and Norwegian rules the umlaut letters are grouped with
the national letters having the same pronunciation, thus: Y Ü y ü Z z
Æ Ä æ ä Ø Ö ø ö Å å.
The Swedish alphabetizing rules follow the ANSI sequence: å ä ö.
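The grouping of aa with å can be sketched in Python as follows. This is a simplified illustration only: the table DANISH and the function danish_key are assumptions, uppercase and the umlaut letters are ignored, and every aa is treated as a digraph, so a compound such as ekstraarbejde would be sorted wrongly.

```python
# Assumed primary sequence: a-z followed by the national letters æ ø å.
DANISH = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(DANISH)}


def danish_key(word):
    """Group aa with å; on a tie, å precedes aa."""
    lowered = word.lower()
    folded = lowered.replace("aa", "å")
    # Unknown characters are placed after all known letters.
    primary = [RANK.get(ch, len(DANISH)) for ch in folded]
    # Secondary criterion: number of aa digraphs in the original spelling.
    secondary = lowered.count("aa")
    return (primary, secondary)


print(sorted(["aal", "z", "ål"], key=danish_key))  # ['z', 'ål', 'aal']
```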
German:
The letter ß is alphabetized as ss, but when two text strings
are equal, not considering the difference ss/ß, ss will precede
ß. Thus the following sequence of words is sorted correctly according
to lexicographic rules:
haspeln Haß hassen hassenswert haßerfüllt
Many programs do recognize that ß is to be grouped with ss, but when
all preceding characters are equal, all strings with ss will precede the strings
with ß, which is wrong (e.g. asst, asstm, aßt, which should be: asst,
aßt, asstm).
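A minimal Python sketch of the correct behaviour: ß is expanded to ss for the primary comparison, and the difference ss/ß is only consulted when the expanded strings are equal. The function name eszett_key is hypothetical.

```python
def eszett_key(word):
    """Primary: ß counts as ss. Secondary: ss precedes ß on a tie."""
    lowered = word.lower()
    primary = lowered.replace("ß", "ss")
    # False sorts before True, so a string with ss at a given position
    # precedes an otherwise equal string with ß at that position.
    secondary = tuple(ch == "ß" for ch in lowered)
    return (primary, secondary)


print(sorted(["asstm", "aßt", "asst"], key=eszett_key))
# ['asst', 'aßt', 'asstm']

words = ["hassenswert", "haßerfüllt", "Haß", "hassen", "haspeln"]
print(sorted(words, key=eszett_key))
# ['haspeln', 'Haß', 'hassen', 'hassenswert', 'haßerfüllt']
```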
There are two approved ways to sort the umlaut letters:
1) The most used principle is to sort the umlaut letters as letters with a
diacritical mark, rendering this example sequence:
ae aeh äg äi vor vör vore vöre
2) In older dictionaries, in telephone books, and in libraries,
the ordering is based on resolving the umlaut letter into the basic letter +
e, e.g. ü = ue, which renders this sequence:
ae äg aeh äi vör vöre vor vore
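Both principles can be expressed as sort keys. The following Python sketch is an illustration under simple assumptions (lowercase comparison, only ä ö ü handled); the names diacritic_key and expansion_key are hypothetical.

```python
UMLAUTS = "äöü"
TO_BASE = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})
TO_E_FORM = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def diacritic_key(word):
    """Principle 1: umlaut letter = basic letter with a diacritical mark."""
    lowered = word.lower()
    marks = tuple(ch in UMLAUTS for ch in lowered)
    return (lowered.translate(TO_BASE), marks)


def expansion_key(word):
    """Principle 2: umlaut letter = basic letter + e."""
    lowered = word.lower()
    marks = tuple(ch in UMLAUTS for ch in lowered)
    return (lowered.translate(TO_E_FORM), marks)


words = "vöre äi ae vor aeh vör äg vore".split()
print(" ".join(sorted(words, key=diacritic_key)))
# ae aeh äg äi vor vör vore vöre
print(" ".join(sorted(words, key=expansion_key)))
# ae äg aeh äi vör vöre vor vore
```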
Faroese and Icelandic:
The letters Ð, ð (»eth«), and d with a cross, are treated as
D/d with a diacritical mark, rendering the sequence: D Ð d [d with cross]
ð E e.
The Icelandic Þ/þ (»thorn«) comes after y, but before other
national special letters (there is no z in Icelandic words). In a purely
Icelandic context the sequence is y þ æ, and where foreign
words and proper names are involved the sequence is: Y y Z z Þ þ Æ æ
Ö Ø ö ø Å å.
French:
According to the rules for alphabetizing the French letters œ and Œ, we
have these sort sequences:
boea boef bœf boeg boez bof
and
BOEA BOEF BŒF BOEG BOEZ BOF
Depending on the priority criteria implemented in the sorting program, these two
sequences may be merged with either uppercase or lowercase letters having
priority, but this problem is not specific to French. The result is:
BOEA boea BOEF boef BŒF bœf BOEG boeg BOEZ boez BOF bof
or:
boea BOEA boef BOEF bœf BŒF boeg BOEG boez BOEZ bof BOF
The first row, letting uppercase letters have precedence, is according to the
norm.
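A sort key with three criterion levels can reproduce the first row. The Python sketch below assumes that the ligature in question is œ/Œ, that it is expanded to oe for the primary comparison, and that the non-ligature spelling precedes the ligature on a tie; the name french_key is hypothetical.

```python
def french_key(word):
    """Three levels: letters, then ligature, then case (uppercase first)."""
    expanded = word.replace("œ", "oe").replace("Œ", "OE")
    primary = expanded.lower()
    # Second level: the spelling without ligature precedes the ligature.
    ligature = tuple(ch in "œŒ" for ch in word)
    # Third level: uppercase precedes lowercase (False sorts before True).
    case = tuple(ch.islower() for ch in word)
    return (primary, ligature, case)


words = ["bof", "BŒF", "boez", "BOEA", "boeg", "BOEF", "bœf",
         "BOF", "boea", "BOEZ", "boef", "BOEG"]
print(" ".join(sorted(words, key=french_key)))
# BOEA boea BOEF boef BŒF bœf BOEG boeg BOEZ boez BOF bof
```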
Dutch:
The Dutch ligatures IJ/ij are no exception, but are treated like the
non-ligatures IJ/ij.
Spanish:
In Spanish words the digraph ch is considered a letter between c and
d, and the digraph ll is a letter between l and m.
In Spanish Ñ/ñ is not a letter with a diacritical mark, but a
special letter between N/n and O/o. Accordingly we have a sequence
such as:
czar chabacanada lynchar llaca nz ña o
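The Spanish rule amounts to treating ch and ll as single units of the alphabet. A minimal Python sketch, with a hypothetical alphabet table SPANISH and function spanish_key, and ignoring accents and case:

```python
# Assumed alphabet with ch, ll and ñ as letters in their own right.
SPANISH = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
           "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
           "v", "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(SPANISH)}


def spanish_key(word):
    """Split the word into letters, reading ch and ll as one letter each."""
    lowered = word.lower()
    key = []
    i = 0
    while i < len(lowered):
        pair = lowered[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])
            i += 2
        else:
            # Unknown characters are placed after all known letters.
            key.append(RANK.get(lowered[i], len(SPANISH)))
            i += 1
    return key


words = ["o", "ña", "llaca", "nz", "chabacanada", "lynchar", "czar"]
print(" ".join(sorted(words, key=spanish_key)))
# czar chabacanada lynchar llaca nz ña o
```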
Basic Concepts of Computer Alphabetizing
The operating system of a computer has a native character set which is the basis
of most sorting procedures. Some computer programs have additional software
character sets that can only be used in the programs where implemented (e.g.
WordPerfect). Fundamentally the succession of characters to be sorted has to be
defined. The succession is laid down in a table, and each character is assigned an
integer value to define the sort sequence. The simplest sort procedures will use
the integer values of the native character set, or a redefinition table is
created. Different computers have different native character sets, but common to
all of them is that the succession of characters has to be redefined in order to
achieve correct lexicographic alphabetizing. For the basic considerations in this
paper it does not matter which character set is used. For my examples I have used the
ANSI character set that is used in PCs with Windows, in Prime computers, in VAX
computers, on the Internet etc. (minor differences may occur). The first 128
characters (#0-127) are the same as in the 7-bit ASCII character set and in the PC
DOS character set.
ANSI Character Set
000 032 [SP] 064 @ 096 ` 128 160 192 À 224 à
001 033 ! 065 A 097 a 129 161 ¡ 193 Á 225 á
002 034 " 066 B 098 b 130 162 ¢ 194 Â 226 â
003 035 # 067 C 099 c 131 163 £ 195 Ã 227 ã
004 036 $ 068 D 100 d 132 164 ¤ 196 Ä 228 ä
005 037 % 069 E 101 e 133 165 ¥ 197 Å 229 å
006 038 & 070 F 102 f 134 166 ¦ 198 Æ 230 æ
007 039 ' 071 G 103 g 135 167 § 199 Ç 231 ç
008 040 ( 072 H 104 h 136 168 ¨ 200 È 232 è
009 041 ) 073 I 105 i 137 169 © 201 É 233 é
010 042 * 074 J 106 j 138 170 ª 202 Ê 234 ê
011 043 + 075 K 107 k 139 171 « 203 Ë 235 ë
012 044 , 076 L 108 l 140 172 ¬ 204 Ì 236 ì
013 045 - 077 M 109 m 141 173 205 Í 237 í
014 046 . 078 N 110 n 142 174 ® 206 Î 238 î
015 047 / 079 O 111 o 143 175 ¯ 207 Ï 239 ï
016 048 0 080 P 112 p 144 176 ° 208 Ð 240 ð
017 049 1 081 Q 113 q 145 177 ± 209 Ñ 241 ñ
018 050 2 082 R 114 r 146 178 ² 210 Ò 242 ò
019 051 3 083 S 115 s 147 179 ³ 211 Ó 243 ó
020 052 4 084 T 116 t 148 180 ´ 212 Ô 244 ô
021 053 5 085 U 117 u 149 181 µ 213 Õ 245 õ
022 054 6 086 V 118 v 150 182 ¶ 214 Ö 246 ö
023 055 7 087 W 119 w 151 183 · 215 ×/Œ 247 ÷/œ
024 056 8 088 X 120 x 152 184 ¸ 216 Ø 248 ø
025 057 9 089 Y 121 y 153 185 ¹ 217 Ù 249 ù
026 058 : 090 Z 122 z 154 186 º 218 Ú 250 ú
027 059 ; 091 [ 123 { 155 187 » 219 Û 251 û
028 060 < 092 \ 124 | 156 188 ¼ 220 Ü 252 ü
029 061 = 093 ] 125 } 157 189 ½ 221 Ý 253 ý
030 062 > 094 ^ 126 ~ 158 190 ¾ 222 Þ 254 þ
031 063 ? 095 _ 127 159 191 ¿ 223 ß 255 ÿ
The first 32 characters (0-31) are control characters that are defined in the
ASCII table for telecommunications (e.g. teletyping). In the PC DOS character set
some of these values are used for special printable characters, and the values
128-255 represent other characters than those of the ANSI character set. In the
VAX VMS character set the values 128-159 are occupied by special control
characters, and some characters with higher integer values are missing.
Þ and þ are only available in some versions of the VMS operating
system.
In VMS Œ has the integer value 215, and œ has the value 247, and they are thus
correctly grouped with O/o, but most printers will print × and ÷
according to the ANSI norm.
A character set is alphabetized according to the character table used, either as
the only criterion or, where more criteria exist, as the primary criterion level.
In the ANSI table (and in this case also the ASCII table) F has the value 70, and
b has the value 98. Thus, when sorting with this table, F will precede b because
70 is less than 98.
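The effect of the native table, and of a redefinition table, can be shown with a short Python sketch. The redefined order below (A a B b C c ...) is only an assumed example, and the name table_key is hypothetical.

```python
import string

# Native table: F (70) precedes b (98).
print(ord("F"), ord("b"))   # 70 98
print(sorted(["b", "F"]))   # ['F', 'b']

# Hypothetical redefinition table: A a B b C c ... Z z.
ORDER = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)
REDEFINED = {ch: value for value, ch in enumerate(ORDER)}


def table_key(text):
    """Replace each character by its value in the redefinition table."""
    # Characters missing from the table keep their native order after it.
    return [REDEFINED.get(ch, len(ORDER) + ord(ch)) for ch in text]


print(sorted(["b", "F"], key=table_key))  # ['b', 'F']
```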
Principles of Some Sort Procedures
... to be continued!
This article is currently being translated from German.
In the meantime you can read the German version:
Alphabetisierung auf Computer, or
download the German version
alphabet.zip, a compressed file that can be unzipped to
a WordPerfect document.
My primary languages are Danish and German. Please, feel free to correct my
English, and let me know about any wording or terminology that you do not
understand, in order to improve the language of this article.
Other comments are also welcome.
Hans Christophersen