
A Method for Compiling a Concordance for a Middle English Text
by
Sidney Berger

Alan Markman's pioneering essay on the creation of "A Computer Concordance to a Middle English Text"[1] was one of the earliest attempts to explain how to make such a concordance. However, he concentrated primarily on the problems he and Barnett Kottler faced in producing their concordance to the works of the Pearl Poet, rather than on a general method of compiling a concordance to a Middle English text. I have now nearly finished compiling a concordance to LaƷamon's Brut, working with an ingenious programmer, George Rompot, and the University of Iowa's IBM 360-65 computer. The results are promising, I think, and some of the methods deserve to be set forth as a practical and practicable means of making a concordance. I have formed my ideas with one principle in view: the person compiling a concordance should keep in mind the many uses to which his labors will be put. If this means "extra" work in categorizing and cross-referencing words, separating homographs, or gathering variant forms of one word, that extra work is not only worth-while but necessary to make a useful concordance.

The first step in preparing a concordance is selecting a copy text. Most Middle English texts worth making a concordance for exist in more than one edition, and if there is no "standard" edition of the work, one should select a single text and stick to it, noting in footnotes or a separate table the variants in the other major editions, if they are worth noting. The quality of the edition and its ancillary material should help to determine which one will be used. I emphasize this because it is necessary for the person making the concordance to adopt a "neutral" stance toward his material, and he can do this most easily if he does not have to worry about editorial matters in the text — matters which the editor of the text should be relied upon to decide. By "neutral" I mean that he should rely on his edition for decisions such as the reading of a carelessly-spelled word in the manuscript or the meaning of a word on which other editors disagree. For example, if an editor whose edition is being used glosses a word which is spelled "gode" as "good," the concordancer[2] should list this word under
this meaning, despite another editor's decision to interpret this word as "god." Warnings that this is the concordancer's policy must appear in the concordance; for if one tries to adopt one reading from one editor, one from another, and yet a third from his own reading of the text, he is no longer a mere concordance-compiler, but an editor, and his concordance will be of little use to anyone trying to find a word. To be useful, a concordance must enable its user to locate quickly the word he wants to find; if the text used for the concordance is edition A, but the concordance lists words under interpretations given in edition B or C, the user will not be able to find the word he seeks.[3]

Markman says that the next major problem faced by the concordancer is to put the text into what he calls a "machine readable" format "so that the computer system can digest it and perform operations with it" (p. 55). By now many people in the humanities have worked enough with computers to consider this a fairly trivial problem; but one important thing must be kept in mind before this phase of the computer project is begun: what will the final output look like, and therefore, what sort of information is it necessary to feed into the computer to arrive at this design? For example, before I actually began the work on my concordance, Professor Stephen Parrish of Cornell University Press suggested that to make my final output more readable and attractive than some of the earlier computer-assisted concordances, I should try to design the final output to be printed in upper and lower case letters, rather than the usual upper case computer printing. So when I put the text onto IBM cards, I coded in a special character which preceded all upper case letters. Eventually the special symbol was deleted, the letter following it was printed in upper case, and the line was back-spaced to fill up the space formerly taken by the special character.
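The capitalization coding lends itself to a simple sketch. The following is a modern reconstruction in Python, not the author's IBM 360 code; the function names are illustrative, and the marker character is the equals sign that the author says he settled on:

```python
def encode_card(line, marker="="):
    """Prepare a line for an upper-case-only key punch: place the
    marker before every capital letter, then fold to upper case."""
    out = []
    for ch in line:
        if ch.isupper():
            out.append(marker)
        out.append(ch.upper())
    return "".join(out)

def decode_card(card, marker="="):
    """Reverse the encoding for the final print-out: delete the marker
    and capitalize the letter that followed it; lower-case the rest."""
    out, capitalize = [], False
    for ch in card:
        if ch == marker:
            capitalize = True
        else:
            out.append(ch if capitalize else ch.lower())
            capitalize = False
    return "".join(out)
```

Deleting the marker and closing up the line reproduces the "back-spacing" the author describes.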

Concordancers who will be separating their texts into grammatical categories such as nouns, verbs, and adjectives may want to indicate in the initial input of data which words fall into which categories. If homographs are to be separated and variant forms of a word gathered, some codes can perhaps be inserted while the text is being fed into the computer, so that these forms can be separated automatically later.[4] The frequently-used modern term for this activity is "pre-editing,"
much of which is necessary to help one anticipate what sort of things he must tell the computer when his text is being put onto cards, tape, or disc.

It was my experience that working with IBM cards was the easiest and least expensive way of handling the material. I found it more convenient to have the cards in order to proofread and make corrections in my own study rather than at a computer terminal. Therefore, I put the entire text into card form, had the cards listed (i.e. printed out by the computer exactly as they had been punched on the cards), and did my proofreading of the text from these lists. If time and finances permit, the proofreading should be done with someone reading the lists aloud, while another person checks this oral reading against the edition used for the basis of the cards. Because of the great variations in spelling which most Middle English texts present, it is advisable to proofread by spelling out most words rather than by trying to pronounce them. It may slow the work considerably, but it will result in a vastly more accurate text, and is thus worth the effort.

While the text is being proofread, some of the corrections can be made and inserted into the deck of cards. It is of course practical to overlap these two tasks if time allows. And again, by dealing directly with the cards, the concordancer remains closer to his text than if he were working with the much more costly computer terminal. Ultimately using cards is more accurate.

As I said earlier, I used a method of "pre-editing" which I found to be most helpful and efficient. While the text is being punched, the key-punch operator is required to do little thinking, and also has little room on the machine for extra papers. But during the proofreading stage the mind is free to "think" and "interpret," and much of the sorting of lexical items which will eventually need to be done can be accomplished then. The proofreading goes slowly enough to allow the concordancer and his assistant (assuming the assistant knows something about the language being read) to think; therefore during the reading homographs can be separated, variant spellings gathered,[5] typographical errors and key-punch errors located, and so on. These simple tasks also help to break the monotony of the proofreading.

For example, I made charts separating 'good' from 'god,' 'idon' ('did') from 'idon' ('excellent, noble'), 'ræd' ('advice, counsel, situation') from 'ræd' ('to advise, tell') and from 'ræd' ('ready' or the past tense of 'ride' or
'read'), and so on. It is important to anticipate all homographs in the text, so that later there will be little trouble separating them.[6]

Also during this proofreading phase, the concordancer can list for his own benefit any irregularities he might encounter which deserve to be glossed in special places in the concordance, which ought to be given special cross-references, or which deserve mention in an introduction.[7] Following the editorial convention of his day, Madden had the printers set his edition of the Brut with all the abbreviations and peculiarities of the manuscript, so I was able to locate all instances of such things as dotted y's [y], unusual abbreviations, interchanging of u's, v's, f's and w's for one another, etc., during the proofreading. My charts were eventually helpful in the commentary, and in the compiling of the final concordance.

The First Program

Concurrently with the proofreading I ran the deck through a special program designed by my programmer, Mr. George Rompot. This clever program checked the data for "illegal" characters, other "illegal" combinations of key-strikes, and sequence. It was easy to tell the computer what keys were "legal" in our punching, and to direct the machine to locate all those key-strikes which were of non-allowable characters. Though this program cannot find other spelling errors, it is a back-up for the oral proofreading. Further, we could tell the computer which combinations of key-strikes were not legal (e.g. consecutive blank spaces, blank spaces on either side of a hyphen, blank spaces at the beginning of a line, consecutive q's, capital letters within a word [except in hyphenated compounds], and so on), and the machine would list for us all occurrences of these. Finally, since most texts will have some consecutive numbering of words, lines, or pages, and since this 'consecutive numbering' information will most likely be fed into the computer along with the initial input of raw data, it is a simple thing to have the machine check for correct 'sequence' (i.e. whether the line numbering is correct). Since the deck I worked with had lines numbered consecutively from 1 to over 32,240, it was easy to locate errors in line numbering in the deck of cards. I think I can safely say that there will be no errors in this area in the final concordance. And the time saved by not having to proofread all those numbers was great.
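In modern terms, such a validation pass might look like the sketch below. This is a Python reconstruction, not Mr. Rompot's program; the legal character set and the particular patterns are assumptions drawn from the examples in the text:

```python
import re

# Assumed legal character set and illegal key-strike patterns,
# reconstructed from the examples given in the text.
LEGAL = set("abcdefghijklmnopqrstuvwxyz0123456789 -='")

ILLEGAL_PATTERNS = [
    ("consecutive blanks", re.compile(r"  ")),
    ("blank beside hyphen", re.compile(r" -|- ")),
    ("leading blank", re.compile(r"^ ")),
    ("consecutive q's", re.compile(r"qq")),
]

def check_deck(cards):
    """cards: list of (line_number, text) pairs. Reports illegal
    characters, illegal combinations of key-strikes, and breaks in
    the consecutive line numbering."""
    errors = []
    expected = None
    for number, text in cards:
        if expected is not None and number != expected:
            errors.append((number, "sequence break"))
        expected = number + 1
        lowered = text.lower()
        for ch in sorted(set(lowered)):
            if ch not in LEGAL:
                errors.append((number, "illegal character %r" % ch))
        for name, pattern in ILLEGAL_PATTERNS:
            if pattern.search(lowered):
                errors.append((number, name))
    return errors
```

As in the original program, this catches only mechanical errors; genuine misspellings must still be found by the oral proofreading.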



The Second Program

Once the proofreading is completed and the corrected cards inserted into the deck, an alphabetical word-list may then be generated. Whereas the first program gave in output only a list of errors which it found, this program could give an alphabetical list of every combination of key-strikes in the work (the machine does not think in terms of 'words'; it recognizes combinations of key-strikes bounded by blank spaces), along with a frequency count of each item. The frequency-list is what Larry D. Benson calls "the normal by-product of a machine-produced concordance,"[8] and is so easy to produce that one should appear in every computer-assisted concordance. A quick proofreading of this list may turn up errors not caught in the oral proofreading.
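A modern equivalent of this word-list-and-frequency pass is easy to sketch (in Python; the function name is illustrative):

```python
from collections import Counter

def frequency_list(lines):
    """Count every blank-delimited combination of key-strikes (what the
    machine sees as a 'word') and return (token, count) pairs in
    alphabetical order, as in the second program's output."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return sorted(counts.items())
```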

The Third Program

The next step one should take is to run the material — now from a computer tape since the material must be on a tape to generate the frequency-list — through the machine to generate the preliminary concordance. This program should give an output of a complete alphabetized list of words in the work with only the line numbers in which the words appear. It may be possible to get a line concordance at this stage (for shorter texts — say under 5,000 lines of verse), but that is not needed. It is from such a list that the concordancer works to separate further homographs which he no doubt will find during the proofreading, to help him refer to the text to verify peculiar spellings, to help in the gathering of variant spellings which will eventually be listed under a single headword, and so on. I found several errors which slipped by me in the other proofreading phases of the project when I had this preliminary concordance.[9]
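The line-number index of this third program can be sketched as follows (again a Python reconstruction, not the original program):

```python
from collections import defaultdict

def preliminary_concordance(cards):
    """cards: (line_number, text) pairs. Map each token to the list of
    line numbers in which it occurs -- line numbers only, no text, as
    in the third program's output."""
    index = defaultdict(list)
    for number, text in cards:
        for token in text.split():
            index[token].append(number)
    return dict(sorted(index.items()))
```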

The Fourth Program

Since the material is now on tape, and since one may very well find errors in the text, there is a simple but necessary program that will allow the concordancer to make up cards with corrections of the errors, and to insert these cards, automatically substituting the corrected readings for the incorrect ones. The output for this program will be a list of the original readings, and the corrected ones which have been substituted. The lines with errors can easily be located since each line has a unique number in the data.
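A modern sketch of this substitution program (Python; the names are illustrative):

```python
def apply_corrections(cards, corrections):
    """Substitute corrected cards for faulty ones by unique line number.
    Returns the corrected deck and a log of (number, old, new) entries,
    mirroring the fourth program's output of original and substituted
    readings."""
    fixed, log = [], []
    for number, text in cards:
        if number in corrections:
            log.append((number, text, corrections[number]))
            fixed.append((number, corrections[number]))
        else:
            fixed.append((number, text))
    return fixed, log
```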

At this point one is at a crossroads: the data, including headwords and
cross-references, could all be put onto computer tapes and the computer could be directed from a terminal how to group and categorize the words, and which to separate from which; or, human hands could manipulate a deck into the final concordance, which could then be printed out from the cards. The former plan would require hundreds of hours at an IBM terminal, feeding information into the computer. The costs of this method are very high (between $2.50 and $3.00 per hour plus the salary of the person sitting at the terminal). Furthermore, it would still be necessary for the person sitting at the terminal to type out all headwords and cross-references, a task which would complicate matters greatly and increase the amount of time considerably. The method I used, however, was inexpensive, and was, I believe, more accurate than the 'terminal method' would have been.

Using the preliminary concordance and the glossary in the text I was dealing with, I punched all headwords and cross-references on IBM cards. In this form they can easily be listed and proofread, and corrections can be made with simple substitutions of cards. One must anticipate what sort of information will be needed for these cards, and how many key-strikes he may allow himself. For example, I designed the page to be six inches in width, so the headword and cross-reference cards could not exceed sixty (or perhaps sixty-two) key-strikes.

As Benson complains (p. 273), the Beowulf concordance does not always separate homographs, and the cross-referencing is somewhat inadequate. I suggest that every form of every word (or at least every potentially unrecognizable form) be cross-referenced. It may take a good deal of trouble (as I so painfully found out), but the ultimate benefit derived is worth-while. Users of the concordance will find it a more versatile and valuable tool if this is done.

And I think Markman and Kottler have set an excellent precedent in their method of handling headwords (a method I have slightly modified): 1) All Middle English words which have close Modern English equivalents in spelling and meaning should be glossed under the Modern English form (perhaps with a few of the basic and unusual Middle English spellings given in parentheses, and certainly with most of the various forms the word takes cross-referenced, directing the reader to the proper headword). Thus, the headword might be as follows:

ABOUT (ABEOT, ABUTE(N), ABOUTEN, IBUTEN, YBOUTEN)
Then, at the various spellings there should be entries as follows:
IBUTEN (see ABOUT)
2) All Middle English words with Modern English equivalents in spelling only, should be glossed under their main or most common Middle English form with definitions given in parentheses, as follows:

LAUERD (LORD, MASTER, HUSBAND)
Listing all forms of this word under 'LORD' is misleading, for the word could also mean "master" or "husband"; the addition of a few basic meanings of the word is extremely helpful for the user of the concordance. 3) All Middle English words with no Modern English equivalents should be glossed under their main or most common Middle English form with definitions in parentheses, as follows:
HENDE (FAIR, COURTEOUS, SKILLED, GOOD)[10]
The cross-reference and headword cards will be simple to punch in one large deck, using the preliminary concordance and the glossary of the text as guidelines.
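Generating the 'see' cards from a table of headwords and their variants can be sketched as follows. This is a Python illustration of the convention described above, not part of the author's programs, and the sample mapping is hypothetical:

```python
def cross_reference_cards(headwords):
    """headwords: mapping of headword -> list of variant spellings.
    Produce one headword card listing the variants, and one 'see' card
    per variant, in the style shown above."""
    cards = []
    for head, variants in headwords.items():
        cards.append("%s (%s)" % (head, ", ".join(variants)))
        for variant in variants:
            cards.append("%s (see %s)" % (variant, head))
    return cards
```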

The Fifth Program

Although it is little trouble to alphabetize these headword and cross-reference cards by hand, it is simpler to let the machine do it. The computer should be used for every mechanical process it is capable of performing. And the computer can errorlessly alphabetize and punch out a new deck of headword and cross-reference cards, which could then be inserted manually into another deck.

The Sixth Program

The next to last program needed by the concordancer is one which has the computer punch out a deck of concordance cards, on which are 1) the word being glossed, 2) the complete line of text which will be concorded (either a single line of verse or the specific context 'surrounding' the word for a KWIC concordance), and 3) the line number or other designation which will identify the location of this line. The first item of these three will eventually be deleted from the final print-out of the concordance, but will be most helpful in the sorting of the cards. This deck of cards — large though it may be (I had nearly fifty boxes of cards) — is much less costly to work with than a computer terminal, and keeps the concordancer closer to his text, because he can see each line individually while organizing the cards, and can easily insert the headword and cross-reference cards and rearrange others to their proper places in the deck. That is, one can carefully consider any form of a word which might not belong in the alphabetical sequence in which it automatically fell. Making the separations and gatherings of forms by hand during this phase of the project is no more difficult than doing it at a terminal.



One thing must be added here. The program which I used to produce the concordance cards was modified by Mr. Rompot so that I could exclude from the deck, before it was punched, any word or words which were to be omitted from the final concordance.[11] Thus, before submitting the program to punch out concordance cards for all the words beginning with the letter a, I inserted into the program a small, alphabetized deck of cards with the words 'a,' 'an,' 'and,' and so on, which were never punched by the computer. The few occurrences of, say, 'æ' which I did want to include in the concordance (e.g. where it meant 'river' or 'ever') I was able to locate and punch myself. And it is not necessary to search through the scores (or hundreds) of occurrences of such words if one makes note of the individual occurrences during the proofreading stage. It is just such a thing which must be anticipated before beginning the proofreading.
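The sixth program's card-punching, with stop-word deletion, might be sketched in modern terms as follows (Python; the stop list shown is only a fragment of what a concordancer would actually use):

```python
# A fragment of the author's deleted words; a real stop list
# would be longer.
STOP_WORDS = {"a", "an", "and", "be", "the"}

def concordance_cards(cards, stop_words=STOP_WORDS):
    """Punch one concordance card per word occurrence: the glossed word,
    the complete line of text, and the line number identifying its
    location -- skipping the stop words, as the modified program did."""
    out = []
    for number, text in cards:
        for token in text.split():
            if token not in stop_words:
                out.append((token, text, number))
    return out
```

The first field of each card exists only for sorting and is deleted from the final print-out, as the author explains.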

During this hand-sorting stage it is advisable to have an extra box for IBM cards nearby in which to place cards the computer had fed out in alphabetical order, but which belong in the deck under a headword which falls later in the alphabet. For example, one will have many past participles beginning with the letter i which should be glossed under their root words; ITAKEN should be given under TAKE, and at ITAKEN there should perhaps be a cross-reference to TAKE. Until one gets to the sorting of the t's, he should store the ITAKEN's apart. This is the method I used, and I found it to be most effective.

The Seventh Program

Once the two decks (of cross-reference and headword cards and of concordance cards) are integrated into one, with the cards all in the desired order, one has the entire concordance in deck form. It is simple, then, to run the deck through the computer to get the final print-out. But it may also be necessary to remove from the IBM cards certain symbols which had
been substituted for other characters.[12] This can be done by means of a fairly simple program. For example, as I mentioned above, I coded into the deck during the key-punching of the text a symbol which preceded capital letters. The special program deleted the symbol, back-spaced the line into the space formerly used by the symbol, and printed the entire line in lower case letters except those letters formerly preceded by that symbol. Any symbol will do, of course, just so long as it is not one of the letters used in the text. (I used an equals sign.) This special program also deletes from the concordance cards the word being glossed, which is no longer needed since both the headword and the line of the text will 'speak for themselves' concerning what word is under consideration at this point in the glossary.

Further, since it is easiest to type the headword and cross-reference cards in all upper case letters (on the key-punch machine that is all one has), rather than insert a 'capitalization symbol' before each letter to be capitalized in the final print-out, this last program can be instructed to recognize all headword and cross-reference cards (because they have no line numbers or text indications on them), and to print them all in upper case. Hence, all my headword and cross-reference cards were printed in upper case, and all my text cards in lower case except those characters preceded by an equals sign. There is only one small exception to this: on the cross-reference cards I did not want the word 'see' to be in capital letters, so the program included a statement to the effect that on all cross-reference cards the word 'see' should be printed in lower case. (I did this, too, for the word 'or' on the headword cards. See the sample pages below.)
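The final-print logic described in these two paragraphs can be sketched as follows. This is a Python reconstruction; the card representation, with None standing in for a missing line number, is an assumption:

```python
def print_card(card, marker="="):
    """Final-print one card. Cards without a line number are headword
    or cross-reference cards: they print in upper case, with 'see' and
    'or' lowered. Text cards print in lower case, except letters
    flagged by the capitalization marker."""
    number, text = card
    if number is None:  # headword or cross-reference card
        def recase(word):
            core = word.strip("()")
            return word.lower() if core.lower() in ("see", "or") else word.upper()
        return " ".join(recase(w) for w in text.split())
    out, capitalize = [], False
    for ch in text:
        if ch == marker:
            capitalize = True
        else:
            out.append(ch.upper() if capitalize else ch.lower())
            capitalize = False
    return "".join(out)
```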

This last program is also the one which must be 'taught' how the page is to be printed. With special IBM charts, I mapped out what I considered to be an aesthetic and legible format, and Mr. Rompot programmed the material to match my design.[13] I suggest (following Mr. Benson's critique) that concordancers keep in mind that their efforts are likely to be used by eye-weary scholars, and thus the page should not scare away potential users of the concordance. If room allows, the text lines should be 'legibly' apart from one another, they should be indented from the headword and
cross-reference lines, and the text citations should be close enough to the text lines to be easily linked with the correct line (without the unsightly row of dots separating them). If one is going to take the trouble to make a concordance — and no small trouble it is! — he should at least make his volume legible and attractive.

The specific problems each concordancer faces concerning homographs, homonyms, variant spellings, compounds, hyphens and prefixes (especially negative prefixes for Middle English verbs), words to be deleted, alphabetization, abbreviations of all sorts (depending, of course, on the type of text being used, a published edition or a manuscript), apparent typographical or scribal errors in the 'copy text' chosen, proper nouns, punctuation, and so on — all these and other specific problems directly related to each text must be dealt with by each concordancer. One could not possibly state any general rules to follow for so varied a set of problems. All one can expect is that the policies adopted will be logical and consistently adhered to, that they will be stated clearly at the beginning of the concordance so that the user of the volume will know how to proceed, and that they will produce an easy-to-use, legible tool. My suggestions about how to arrange and use headwords and cross-references seem to me logical, and were practicable for the text with which I was working, but may not be so for every text. The usefulness of the final volume must be the guiding principle in every phase of the compilation of data, regardless of how much extra work the compiler subjects himself to. What good is a giant volume of data so hard to use and so confusing in its arrangement of material that it hardly pays to struggle with it?

The following are sample sheets from the Brut concordance. This preliminary copy was printed on the University of Iowa's computer, which does not have the proper characters on its print chain; I have therefore substituted the following:

for þ the number 3
for ð the number 6
for Ʒ the number 9
for æ the symbol >
The final print-out will of course contain the Middle English letters, in both upper and lower case.
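The substitution table above reverses mechanically. A modern sketch (in Python; one would of course apply it only to the text of each line, since the same digits serve as numerals elsewhere):

```python
# The stand-in symbols used on the Iowa print chain, as listed above:
# 3 for thorn, 6 for eth, 9 for yogh, > for ash.
TO_SPECIAL = str.maketrans({"3": "þ", "6": "ð", "9": "Ʒ", ">": "æ"})

def restore_characters(line):
    """Convert the stand-in symbols back into the Middle English
    letters, as a print chain with the proper characters would."""
    return line.translate(TO_SPECIAL)
```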

As I stated above, I have 1) listed Middle English words with close Modern English equivalents in spelling and meaning under the Modern English spelling [e.g. FOSTERED]; 2) listed Middle English words with Modern English equivalents in spelling only under their Middle English forms with a few definitions [e.g. FULLE]; and 3) listed Middle English words with no Modern English equivalents in their most common form with definitions [e.g. FULLUHT]. I have further cross-referenced some of the variant forms which can be found under headwords elsewhere in the concordance [e.g. FOT(E) (N), FULLEHT].

Notes

 
[1]

Alan Markman, "A Computer Concordance to a Middle English Text," Studies in Bibliography, 17 (1964), 55-75.

[2]

I use this word for brevity; Professor John McLaughlin, of the Linguistics Department at the University of Iowa, suggested the term to me.

[3]

In my work on the Brut I used the only complete edition of the poem ever published, that of Sir Frederick Madden (Society of Antiquaries, 1847); Professor G. L. Brook, who is preparing a new edition for the Early English Text Society, has concurred with my selection of texts. Though I disagreed with Madden's interpretations of some words, I glossed them under the meanings he gave for them. Thus the word 'bihedde' Madden glosses as 'viewed' for line number 30155; it seems to me that he is in error here, for the word seems to mean 'greeted.' But in my concordance the word may be found under the meaning he gives.

[4]

Later I will suggest a different method for separating homographs and gathering variant forms which worked very well for me. Professor Richard L. Venezky of the University of Wisconsin has other suggestions on a similar level in Computers and Old English Concordances (1970), which is a record of what was said at a conference of the same name held in Toronto in 1969.

[5]

The problems of homographs and variant spellings have been recognized for many years; see for example Stephen Parrish's article "Problems in the Making of Computer Concordances," Studies in Bibliography, 15 (1962), 8-9. Unfortunately, Mr. Parrish's article, like Mr. Markman's, deals too much with the problems they encountered on their respective projects, and not enough with general principles and methods of compiling a concordance.

[6]

A few other Middle English homographs to watch for are: soon/son/sun, here (here)/here (army), Brutus/Bruttes (which could mean Brutus or Britons), aðel(e)(n) (noble)/aðel(e)(n) (chief, elder), nomen (names)/nomen (take), are (form of the verb 'to be')/are (mercy), for (the conjunction)/for (went), sunnen (sons)/sunnen (sins), in (in or within)/in (inn or hostelry), græten (to greet)/græten (great), and so on. A careful perusal of the glossary of the edition will help the concordancer to locate the ones he needs.

[7]

For example, I separated the word 'worse' (the comparative form of 'bad') from the word 'Worse' (a substantive extension of that comparative form, which has come to mean 'the devil').

[8]

Speculum, 45, No. 2 (April 1970), 274.

[9]

Mr. Rompot has assured me that even though it is possible to combine this program with the previous one, it is easier and safer not to do so. Many computer programs take several 'modifications' before they work the way they were designed to, and it is much less costly and much safer to separate the complex functions of alphabetization and counting from the function of compiling the line-number concordance.

[10]

In 1960 when Mr. Parrish delivered the paper mentioned in note 5 above, he outlined three possible ways of handling the format of the headwords, cross-references, and text of the concordance (p. 10); he apparently did not think of the method I have chosen, which seems quite practical and eminently 'usable' from the reader's point of view.

[11]

Professor Venezky has called these "stop-words . . . usually high-frequency function words" (Computers and Old English Concordances, p. 67). Professor John McGalliard, of the Department of English at the University of Iowa, has pointed out to me that it has generally been the policy of concordancers to tend more toward completeness than toward abridgment in their listings; and since with the computer's aid it is no trouble at all to print every word of a text, it seems to me that only the most 'dispensable' words should be deleted from the final concordance. What took scholars of the pre-computer era years to compile takes us literally minutes. There is no reason, therefore, for omitting words like 'after,' 'but,' 'can,' and so on; these words have their interest for linguists, statistical stylists, and others. But of course there is no reason for including words like 'a,' 'an,' 'and,' 'be,' 'the,' or hundreds of personal or relative pronouns. Words of such abundance (I encountered over 7400 'and's' in the Brut), if they are to be studied, can be amassed in great quantities without the need of a concordance. In any case, a complete record of these remains readily available so long as the computer tapes which were used to generate the original concordance are still available.

[12]

The key-punch machine does not have an ash, a thorn, an eth, or a yogh, so I substituted symbols for these. Eventually they will be converted to the proper characters when the cards are run through a computer with a print chain containing them. Philip Smith, at the University of Waterloo, has informed me that his computer now has such a chain, and he will allow me to use it in exchange for some reciprocal labor of key-punching. Hence, the sample pages included here have a 3, 6, 9, and > substituted for the thorn, eth, yogh, and ash respectively.

[13]

I included page numbering in my design. It should be remembered, in case the concordance is to be published, that one should number the even-numbered pages on the left and the odd-numbered pages on the right, as in most printed books, or else number all pages at the center bottom.