University of Virginia Library

II

But it is time to move to more immediately relevant problems. When we began at Cornell in the spring of 1957 to develop a concordance technique on the IBM 704 computer, we had no models to imitate. Neither the Revised Standard Bible concordance (made on a Remington Rand Univac)[1] nor any papers describing electronic indexing of the Dead Sea Scrolls and the works of Thomas Aquinas[2] had yet appeared — nor had the word-index to Dryden, which was made by hand[3] (it took some twenty years) then checked and printed by means
of IBM accounting-machines. As we surveyed the problem, it seemed to us that the indexing process should remain rigidly under computer control to ensure speed and accuracy, yet that correction of errors must somehow be provided for; we felt, moreover, that for economy the computer should be induced to give us a finished page print that might be photographed for publication.

As we ultimately worked out our technique for a pilot run on the poems of Matthew Arnold, the process went roughly like this. The lines of Arnold's verse were punched on IBM cards, one line per card. We used the standard edition of Arnold, edited by Tinker and Lowry, adding to each line card, by an automatic process, the line number and page number shown in that volume. Variant lines, made up from the Tinker and Lowry collations, were also punched (each with an identifying "V"), then grouped at the end of each poem; a separate title card was punched and inserted before each poem. The entire deck of cards (some 17,000) was now "listed" by an IBM printer and proofread. At this stage errors could be corrected by simply pulling and replacing cards. When we were satisfied that the deck was accurately punched, we fed the cards into an IBM Card Reader, which transferred the data on them to magnetic tape.

At this juncture the 704 computer came into play. Since alphabetical sorting is not one of the operations which the 704 was designed to perform, the computer program had to be an innovative piece of research, involving much trial and error. Thanks to the creative ingenuity of our programmer, Mr. James A. Painter of the IBM Corporation, we were ultimately provided with a program that perfectly suited our needs. The program had three distinct steps. In the first, Arnold's words were picked out of his lines of verse and collected on a separate tape; in the second, the words were sorted alphabetically; in the third, they were re-united with their lines (to which titles had now been attached) and prepared for "listing." Before beginning the first step, the machine assigned to each line of verse an arbitrary serial number, thus making what we have called a "line dictionary." The machine then scanned each line word by word, reading from the beginning to the first space, then on to the next space, and so on. As each word was picked up by the computer it was automatically checked against a list of some 150 common, "non-significant" words (that is, words not to be indexed) previously stored in the computer's "memory." If the word proved to be on the list, it was dropped, and the next word on the line picked up; if the word was not on the list, the computer transferred it,
along with the serial number of its source line, to another tape for sorting.
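
For readers who think more easily in a modern programming language, the first step described above can be sketched in a few lines of Python. This is an illustration only, not a reconstruction of Mr. Painter's 704 program: the function names, the sample lines, and the short stop list are assumptions made for the example (the real list held some 150 words, which are not enumerated here).

```python
# Illustrative stop list only; the original list of about 150 words is not given in the text.
STOP_WORDS = {"THE", "AND", "OF", "A", "TO", "IN", "IS", "ON", "UPON"}

def build_line_dictionary(lines):
    """Assign each line of verse an arbitrary serial number (the "line dictionary")."""
    return {serial: line for serial, line in enumerate(lines, start=1)}

def extract_index_words(line_dictionary, stop_words=STOP_WORDS):
    """Scan each line from space to space; drop words on the stop list and
    emit every other word together with the serial number of its source line."""
    pairs = []
    for serial, line in line_dictionary.items():
        for word in line.split(" "):
            if word and word not in stop_words:
                pairs.append((word, serial))
    return pairs

# Sample lines in the pilot-run style: capitals, punctuation removed.
lines = [
    "THE TIDE IS FULL THE MOON LIES FAIR",
    "UPON THE STRAITS ON THE FRENCH COAST THE LIGHT",
]
line_dictionary = build_line_dictionary(lines)
pairs = extract_index_words(line_dictionary)
```

Each pair in `pairs` corresponds to one record written to the tape that is handed on for sorting.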

The second stage of the program began when all "significant" words had been collected. The sorting procedure is too intricate to be described in detail, but briefly it involves a lengthy series of comparisons. As each letter — and of course each word — goes onto magnetic tape from the punch card, it is coded as a series of binary digits on which any of the operations of binary arithmetic can be performed. When two different words are compared, therefore, the one which proves to be the "smaller" is sorted first alphabetically. Since Arnold wrote about 64,000 "significant" words, the number of comparisons required was very large, in spite of some ingenious short-cuts devised by Mr. Painter; although the computer is capable of making approximately 2500 comparisons per second, the sorting took 25 hours. It is fair to add that much of this time was consumed by auxiliary machine operations, including an elaborate checking routine written into the program. While the 704 is an exceptionally reliable machine — which is to say that its error rate is very low — long runs increase the probability of error. To ensure absolute accuracy a sum-check on the numeric operations of the machine was performed automatically about every ten minutes of the Arnold run; if the check failed to clear, the program was rolled back to the last successful check and re-started. One ought further to add that recent refinements of the sorting routine have reduced the time to less than ten hours.
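
The principle that a word, once coded as a number, can be alphabetized by arithmetic comparison may be illustrated as follows. The sketch is in Python and rests on assumptions of my own: a base-27 code (space = 0, A = 1, through Z = 26) and a fixed field of 21 characters stand in for the 704's actual character coding, which is not described here, and Python's built-in sort stands in for the tape-sorting routine.

```python
WIDTH = 21  # field width in characters; 21 is the limit the text reports for Arnold

def word_key(word, width=WIDTH):
    """Pack a word of capital letters into a single integer.
    The word is left-justified in a fixed-width field and padded with
    spaces (coded 0), so that a numerically smaller key sorts first."""
    if len(word) > width:
        raise ValueError(f"{word!r} exceeds the {width}-character field")
    key = 0
    for ch in word.ljust(width):
        key = key * 27 + (0 if ch == " " else ord(ch) - ord("A") + 1)
    return key

def sort_pairs(pairs):
    """Order (word, serial) pairs by numeric key, then by serial number."""
    return sorted(pairs, key=lambda pair: (word_key(pair[0]), pair[1]))

pairs = [("TIDE", 1), ("AARAU", 2), ("ART", 1), ("AAR", 3)]
ordered = sort_pairs(pairs)
```

As a rough check on the figures quoted above, 2500 comparisons per second sustained for 25 hours would come to 2500 × 3600 × 25 = 225,000,000 comparisons at most; the true count was smaller, since much of the time went to auxiliary operations.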

At the end of the second stage of the program we had a tape on which all significant words in Arnold's text were arrayed in alphabetical order, each accompanied by the serial number of the line in which it occurred. All that remained, in the third stage of the program, was to recover the lines of verse themselves from the line-dictionary tape (by means of their serial numbers) and prepare them for listing. Once recovered, the lines were arranged on another tape, divided into pages 90 deep, and indented beneath the index words. The order in which the lines fell under each index word was determined by the order in which the cards had been fed onto the line-dictionary tape; in this case, it was page- and line-order in the Tinker and Lowry edition. On the page tape the identifying information was attached to each line, dots were supplied to fill out short lines, long lines were doubled back where necessary, and the word "CONTINUED" was supplied wherever an entry ran past a page break. The final listing was made directly from this page tape by an IBM Printer running "off-line," that is, not
involving the computer at all. The resulting pages were reproduced by an offset process, and the Arnold volume was published in 1959, the first in a series to be known as the Cornell Concordances.[4]
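
The third stage can likewise be sketched in Python. The page depth of 90 comes from the account above; everything else (the function name, the width of the verse field, the form of the reference, the sample lines) is an assumption for illustration, and the doubling back of long lines is left out. The sketch expects a page depth of at least two lines.

```python
from itertools import groupby

PAGE_DEPTH = 90   # print lines per page, as stated in the text
TEXT_WIDTH = 40   # hypothetical width of the verse field

def layout(sorted_pairs, line_dictionary, page_depth=PAGE_DEPTH, text_width=TEXT_WIDTH):
    """Return a list of pages, each a list of print lines.
    line_dictionary maps a serial number to (verse, reference)."""
    pages, page = [], []

    def emit(text, heading=None):
        # When the page is full, start a new one; if an entry is still
        # running, repeat its index word followed by CONTINUED.
        nonlocal page
        if len(page) >= page_depth:
            pages.append(page)
            page = [f"{heading} CONTINUED"] if heading else []
        page.append(text)

    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        emit(word)
        for _, serial in group:
            verse, reference = line_dictionary[serial]
            # Indent the line beneath its index word; dots fill out short lines.
            emit("  " + verse.ljust(text_width, ".") + " " + reference, heading=word)
    if page:
        pages.append(page)
    return pages

demo_lines = {
    1: ("FIRST LINE WITH SEA", "T 1"),
    2: ("SECOND LINE WITH SEA", "T 2"),
    3: ("THIRD LINE WITH SEA", "T 3"),
}
pages = layout([("SEA", 1), ("SEA", 2), ("SEA", 3)], demo_lines, page_depth=3, text_width=20)
```

With the page depth reduced to three lines for the demonstration, the third line under SEA is carried to a second page headed "SEA CONTINUED".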

I present these details in order to give some sense of the way in which an electronic calculator operates, and I have, of course, passed over a number of textual and programming difficulties.[5] Perhaps a single example will suffice to show how some understanding of the machine's operation is necessary to deal intelligently with editorial problems. There was, for instance, the matter of punctuation. The standard IBM print wheel is equipped with some but not all punctuation symbols. For the pilot run it therefore seemed wisest to dispense with punctuation. Some lines were thus rendered mysterious, or ludicrous; some, especially those stripped of apostrophes, became misleading (without the apostrophe possessives usually become indistinguishable from plurals; moreover, we'd becomes WED, I'll ILL, she'll SHELL, I'd ID, and he'll HELL). But we were pleased to see how little the appearance of most lines was changed for the worse.

Now, we did preserve the hyphen, which made it unnecessary to join or separate words artificially, but which also led us into a dilemma. If we instructed the machine to treat the hyphen as a letter, all hyphenated compounds would show up as index words, but the second portions of the compounds would not. Arnold's liking for compounds made this result seem undesirable (calling my humanist instincts into play, I once counted, by hand, more than 40 compounds in the "Scholar-Gipsy" alone — "green-muffled," "frail-leaf'd," "black-wing'd," "red-fruited," "close-lipp'd," and so on). We took the only alternative open to us and instructed the machine to treat the hyphen as a space. By this means we saved the second half of each compound but lost the whole as an index entry. Somewhat disturbing was the realization that we were causing compounds with both halves on the list of omitted words to vanish entirely. (If Arnold ever used the hyphenated noun "TO-DO," I am afraid we know nothing about it). As a way out of this dilemma, available for forthcoming concordances, we have incorporated a cross-indexing feature in the computer program. The machine is now
directed to treat the hyphen as a letter and thus to print the entire compound as an index entry; it is further directed to list as a separate index entry the second portion of every hyphenated word, followed by the word "SEE" and the whole compound (it was not thought necessary to cross-reference the first portion). Naturally, the lines of verse containing the compound are to be listed under the whole compound, not under the cross-reference.
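
The difference between the pilot treatment of the hyphen and the cross-indexing feature can be shown in a short Python sketch. The function names are hypothetical, and one detail is my own assumption: where a word has more than one hyphen, everything after the first hyphen is taken as the "second portion."

```python
def pilot_entries(word, stop_words):
    """Pilot run: the hyphen is treated as a space, so each half is
    indexed separately and checked against the stop list."""
    return [part for part in word.split("-") if part and part not in stop_words]

def cross_indexed_entries(word):
    """Revised program: the hyphen is treated as a letter. The whole
    compound is an index entry, and its second portion gets a SEE
    cross-reference with no lines of verse listed under it."""
    entries = [("entry", word)]
    if "-" in word:
        second = word.split("-", 1)[1]
        entries.append(("see", f"{second} SEE {word}"))
    return entries

stop = {"TO", "DO"}
lost = pilot_entries("TO-DO", stop)
halves = pilot_entries("GREEN-MUFFLED", stop)
revised = cross_indexed_entries("GREEN-MUFFLED")
```

Under the pilot rule "TO-DO" yields no entries at all, which is the vanishing act described above; under the revised rule every compound survives whole.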

This innovation has one drawback: it lengthens and complicates the sorting routines. For the program has to be expanded to accommodate the longest known word in the text. We felt safe in setting this limit for Arnold at 21. We failed to ask the machine to produce for us a list of index words in order of length — a chore it could readily have performed — so I cannot say how close we came to this limit. I can only offer "inextinguishable," with 16 letters, again discovered by an old-fashioned process. But with hyphenated words to be taken care of we felt obliged to run the allowance up to 30 letters, including the hyphen, and even this may not be enough for Old English texts, or for some of Yeats's remarkable compounds.

I have not even yet finished with the simple matter of punctuation. Desiring to add sophistication, not to speak of intelligibility, to forthcoming concordances, we resolved to acquire a special set of print wheels bearing punctuation. But the design of these wheels was not easy to fix. The 47 positions on the standard wheel provide for 26 letters, 10 digits, and only 11 "special characters," whereas the ordinary typewriter keyboard has, besides letters and digits, some 18 symbols. We had either to sacrifice such useful symbols as brackets, dash, ampersand (which abounds in Blake), asterisk, and the like, or to displace letters or digits on certain of the wheels. Since we wanted to include among the new characters three Old-English letters, we took the latter alternative. We ordered a 120-wide bank of print wheels made up of two designs: the left-hand 80 wheels, to be used for printing index words and lines of text, are of our new design, with full punctuation but no digits; the right-hand 40 wheels, to be used for printing page and line numbers and title abbreviations, are of a standard design, with all the digits but only minimal punctuation. This complex but workable compromise imposes limitations that must be taken account of editorially. No title abbreviation can contain any special characters (such as thorn) because these are present only on the text wheels. Similarly, where digits occur in the text (as they occasionally do, for example, in Blake), they must be spelled out before punching, or spaces must be left for paste-overs on the final print. Unfortunately, the absence of digits on the left side and center of the page prevents us from having the machine print page numbers at the bottom of the finished sheets, as we had once hoped it might do (regretfully, we turned down our programmer's offer to spell the numbers out).
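
The editorial limitations imposed by the two wheel designs amount to a check that could be run on every card before printing. The Python sketch below is hypothetical throughout: the exact characters carried by each design are not listed in this account, so the two sets are stand-ins, and only the field widths (80 and 40) and the rule "no digits on the left, no special characters on the right" are taken from the description above.

```python
import string

LETTERS = set(string.ascii_uppercase)
# Stand-in character sets; the real wheel designs are not enumerated in the text.
TEXT_WHEEL = LETTERS | set(" .,;:!?'()-&*") | {"Þ"}          # punctuation and thorn, no digits
REFERENCE_WHEEL = LETTERS | set(string.digits) | set(" .")   # digits, minimal punctuation

def check_card(text, reference):
    """Return a list of problems that would prevent a card from printing."""
    problems = []
    if len(text) > 80:
        problems.append("text field exceeds 80 columns")
    if len(reference) > 40:
        problems.append("reference field exceeds 40 columns")
    bad_text = sorted(set(text) - TEXT_WHEEL)
    if bad_text:
        problems.append("not on text wheels: " + "".join(bad_text))
    bad_reference = sorted(set(reference) - REFERENCE_WHEEL)
    if bad_reference:
        problems.append("not on reference wheels: " + "".join(bad_reference))
    return problems

digits_in_text = check_card("THE YEAR 1794", "BLAKE 12")
thorn_in_title = check_card("SEA", "ÞRYM 3")
clean = check_card("ÞA WAS", "BEO 11")
```

A line containing digits is flagged so that the editor can spell them out before punching, and a title abbreviation containing thorn is rejected because thorn exists only on the text wheels.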

I hope this one example will suggest how complicated even the simplest editorial problem can become. A number of other minor misadventures occurred during completion of the Arnold concordance, some of them exasperating, some amusing. For instance, my unaccountable failure to list "IT" among the words to be omitted required the removal of ten and a half pages of IT from the final print. And when the first full-scale test of the intricate sorting routine produced as the first two items in Arnold's vocabulary AAR and AARAU, our distracted programmer was driven back to his drawing board — until we convinced him that they were perfectly good Swiss place names. But tempting as it is to share these griefs I shall pass them over in order to get to a more important, indeed an over-riding, problem, one that arose with the Arnold but remains to be faced whenever verbal text is processed by mechanical means.

No machine at the present stage of its development, not even the most advanced electronic computer, is able to recognize anything but the physical characteristics of a word. This means that homographs are indiscriminately thrown together in a concordance. Now in some instances this result is unobjectionable. Most of the hand-made concordances show under a single entry all the occurrences of such mixed items as "rose," "left," "long," and the like. But difficulty arises with certain common words like "art" and "will," which have one important meaning submerged among the occurrences of another, high-frequency but unimportant meaning. In the Arnold program we had no choice but to include all occurrences of these words, leaving it to the reader to find the important meanings. Yet this is wasteful, and it would clearly be desirable to exclude "art" and "will" where they occur as verbs and keep them where they are nouns. Moreover, in concordances which are likely to be used in linguistic or philological research, such as the Old English, users may expect to find homographs discriminated. What we have had to face, therefore, is the general problem of devising means by which discriminations made by the editor can be incorporated economically into a machine program. I am not sure that we have solved this problem to our entire satisfaction, but we have made two tentative solutions, both now being employed in our work in progress on Yeats, Blake, Ben Jonson, Emily Dickinson, and the Old English poetic remains.



The first allows for editorial discrimination of homographs before punching. A homograph discriminant is marked by the editor on the copy and punched into the line card immediately following the appropriate word; the discriminant, which consists of an arbitrary symbol followed by a letter, with different letters signifying different meanings of the word, is used in indexing but suppressed from the print. To employ this technique, the editor must, of course, know where the homographs are in his text and decide before the punch begins which of them are worth his attention. With the second technique, however, discriminations can be made after indexing has been completed and variant meanings have been drawn together under a single index word. This technique involves listing all lines as they are to occur in the finished print but with no page spacing; each line is given an arbitrary serial number. Working with this unpaged list the editor can punch instructions into cards and feed them onto a separate tape, which is played against the main tape to produce the final paged print. The instructions may be of two sorts: (1) delete line X (2) between lines X and Y insert the following line. By combining deletions and insertions the editor can thus either drop all non-significant meanings from an index entry or separate the lines representing one entry into two entries, each with its own index word. If desired, the index words can be followed with grammatical identifications: ROSE (NOUN), and ROSE (VERB).
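
Both techniques can be sketched briefly in Python. The asterisk used as the discriminant symbol, the function names, and the sample lines are assumptions for illustration; the account above specifies only "an arbitrary symbol followed by a letter" and the two kinds of instruction.

```python
import re

MARK = "*"  # hypothetical choice of the "arbitrary symbol"
_DISCRIMINANT = re.compile(re.escape(MARK) + r"[A-Z]")

def index_word(token):
    """First technique, indexing side: split a punched token into
    (word, sense letter) so that ART*N and ART*V are indexed apart."""
    word, _, sense = token.partition(MARK)
    return (word, sense)

def printed(line):
    """First technique, printing side: suppress discriminants from the print."""
    return _DISCRIMINANT.sub("", line)

def apply_instructions(listing, deletions, insertions):
    """Second technique: play the editor's instructions against the unpaged listing.
    listing    -- list of (serial, text) in print order
    deletions  -- set of serials to drop ("delete line X")
    insertions -- dict mapping serial X to lines placed directly after X,
                  that is, between X and the line that follows it"""
    result = []
    for serial, text in listing:
        if serial not in deletions:
            result.append(text)
        result.extend(insertions.get(serial, []))
    return result

listing = [(1, "ROSE"), (2, "  A ROSE IN BLOOM"), (3, "  THE MOON ROSE")]
split_entry = apply_instructions(
    listing,
    deletions={1},
    insertions={1: ["ROSE (NOUN)"], 2: ["ROSE (VERB)"]},
)
```

In the small example one deletion and two insertions turn the single entry ROSE into two entries, each with its own index word.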

The counterpart of the homograph problem, again originating in the machine's incapacity to deal with anything but physical measurements, is what might be called the general problem of variants. Here we are concerned not to separate forms that the machine has undiscriminatingly mixed, but to bring together forms that the machine, blindly following its mechanical routines, has failed to recognize as related. Again, the need for a solution to the problem is frequently slight. Few of the hand-made concordances have attempted to group grammatical variants under a single entry, and when the text being indexed belongs to the 19th or 20th century, or has been modernized, no significant variants of any other sort exist. It is, of course, old-spelling texts that present the challenge here. In an effort to meet it we have begun work at Cornell on a concordance to the poems of Ben Jonson, based on the Herford and Simpson edition. Some of the thinking we have done, and some of our provisional procedures, may be of general interest.

To begin at the most elementary level of this problem, one might observe that there are three possible forms which a concordance of a pre-nineteenth-century text might take. First, both index words and quoted lines of context can be modernized, as in Bartlett's Shakespeare concordance. Second, both index words and text lines can be given in old-spelling. Both these types of concordance could be produced on a computer, but neither type is wholly satisfactory. The first involves all the risks and difficulties of modernizing, sufficiently well-known to need no rehearsing here. The second requires a very considerable cross-reference apparatus to lead the user to all the various spellings of the word he is interested in. Clearly, the optimum concordance to an early text is of a third type, in which the lines of verse are given in their original form but index words are modernized, exactly as in the Spenser concordance, for one example, where, to look up a word, as the editor remarks, "the reader has only to recall its modern spelling."

It is true that this third type of concordance, no less than the first type, would oblige us at one point or another to modernize every "significant" word in the text. But there is a difference. Here we would not be altering the text but only setting down as index words some arbitrary equivalents as a means of locating particular forms. We have here a chance, in other words, to avoid making commitments and to hedge those we do make. In view of the jungle of textual problems that surrounds the act of modernizing, this is surely an advantage. But the important advantage over a modernized concordance lies, quite simply, in the fact that studies of old-spelling forms — grammatical, linguistic, or textual — become possible when the concordance preserves these forms. The scholar interested in comparing Shakespeare's use of "virtue" to Arnold's is satisfied with a modernized text. But the scholar interested in comparing compositors' habits or in tracing dialectal survivals finds no use at all for modernized forms. It may be worth adding that our practice at Cornell will be to keep on file the line tapes of finished concordances. Within 20 or 30 minutes, the computer can, upon demand, search an average tape and produce all occurrences of any specified word, whether on the omitted list or not. If the words are stored on tape in old spelling, it will be possible to look up "nonsignificant" forms which for one reason or another become a matter of interest, whereas if the words are modernized, these forms are irretrievably lost.

Our decision, therefore, has been to produce a Jonson concordance with modernized index words and old-spelling text. The difficulties of producing such a concordance by mechanical means should be obvious. Even before old-spelling text can be punched, a good deal of pre-editing has to be done, for elided words and contractions, if punched as they stand, would grievously foul the index. In later stages a fairly sophisticated
computer program will be required. For at some point in the process we shall have to replace old-spelling index words with modernized forms, at the same time rearranging lines so as to bring together under the modern form all the scattered variant spellings. The substitution of index words will have to be carried out by means of a separate tape on which each old-spelling index word is associated with its modern equivalent; this tape will be used to control the necessary resorting of lines. Where scattered lines show up under the wrong modern entry, owing to ambiguity in the old-spelling form, we expect to re-sort them by means of the technique I have described for handling homographs on the final unpaged listing.
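
Since the Jonson routines are still provisional, the following Python sketch should be read only as one way the substitution of index words might work, not as the program itself. The table of spellings is hypothetical, as are the function name and the rule that a word missing from the table stays under its old spelling.

```python
from collections import defaultdict

# Hypothetical entries; the actual tape would pair every old-spelling
# index word with its modern equivalent.
MODERN_FORM = {"VERTUE": "VIRTUE", "VIRTUE": "VIRTUE", "LOUE": "LOVE", "LOVE": "LOVE"}

def regroup(old_index, modern_form=MODERN_FORM):
    """Bring scattered variant spellings together under the modern form.
    old_index maps an old-spelling index word to the serial numbers of its
    lines; the result maps each modern index word to the merged serials,
    restored to source order. The lines themselves are left in old spelling."""
    merged = defaultdict(list)
    for old_word, serials in old_index.items():
        merged[modern_form.get(old_word, old_word)].extend(serials)
    return {word: sorted(serials) for word, serials in sorted(merged.items())}

old_index = {"VERTUE": [7, 2], "VIRTUE": [5], "LOUE": [1]}
modern_index = regroup(old_index)
```

Lines that land under the wrong modern entry because an old spelling is ambiguous would still have to be moved by hand, by the delete-and-insert technique described for homographs.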

Since the Jonson project is still in an experimental stage, I shall not commit myself to any more detailed specification of its routines, for they may yet change substantially. If I have already gone into closer detail than the ordinary humanist sensibility can bear without anguish, my purpose has been a wholesome one — to reassure the reader that we have not yet been able to reduce all our techniques to automatic routines. Man is still master of the machine. The impatient hopes with which we embarked on the making of concordances have been pretty well checked, as we have learned that special problems are likely for some time yet to multiply, instead of vanishing. The creation of a standard program into which we can pour lines of verse and out of which a finished concordance emerges is still, alas, some way in the future. Before it can come into being we may even have to see the creation of a new breed of editor and critic, one as well versed in binary arithmetic and computer programming as in literary history and the principles of textual criticism.

But machines are easier to program than men, and we must not become visionary. Let us look forward, instead, to consider, finally, one or two of the particular ways in which electronic computers are becoming useful to scholars in the Humanities, and especially to editors and textual critics of the breed that now exists.