University of Virginia Library



Problems in the Making of Computer Concordances
by
S. M. Parrish [*]

I

Most of us have read with private delight Yeats's withering verses about scholars, those "old, learned, respectable bald heads," who "edit and annotate the lines / That young men, tossing on their beds, / Rhymed out in love's despair." Perhaps because we all feel uncomfortably vulnerable to the indictment, we can share a macabre enjoyment at wondering what Yeats would have thought about the electronic computer, the lightning-rapid, passionless, remorseless, soul-less editor and annotator that cannot cough — in ink, or anything else — and wears no shoes to wear the carpet with. What magnificent wrath and scorn would Yeats have let fall upon us for invading his world of symbol and Irish legend to count "gyres" and index the varieties of "love" on an IBM machine!

In thoughts like these, and the fears they represent, lies the first great problem of making computer concordances. For every good humanist feels ambivalent about the intrusion of technology into his domain. While we may, with one part of our minds, accept the fact that electro-mechanical devices must inevitably take over the routine chores of scholarship — collation of texts, for example, and enumerative bibliography — with another part of our minds we warmly commend the Dante Society of America for resisting the help of a computer to complete its monumental, new Dante Concordance now in progress. Members of the Society, it turns out, scattered through the nation, working alone by hand on their assigned blocks of pages, value too highly the sense of community that seals them into one tribe to wish to sacrifice it for the advantages of speed. (What is five years — or twenty-five — in the timeless world of Dante studies?) Here, we like to think, is the embattled humanist courageously holding out against automation, and deserving of our whole-hearted support.

Our psychological resistance to automation in the Humanities is likely to be stiffened by our superb innocence. Delightedly, we indulge ourselves with terrors that are meaningless to people who know anything about computers. If electronic brains can index and edit poetry, we inquire fearfully, how long will it be before they begin to compose poetry? But we rarely stay for an answer, so ardently do we cherish our fancies. We cannot, I suppose, be expected to welcome the arrival of a computer-poet, though it might be extremely interesting to have some of his productions on which to test our critical principles. (Would it be committing the biographical heresy to identify the poet as a computer? If we refuse to take any account of the poet, the better to scrutinize the internal order of the poem, which would prove the more stimulating exercise — the search for irony, or the discrimination of a persona?) It is hardly to our credit, however, that we find "sinister" implications in every technological advance, unshakable in our conviction that literature and technology don't mix — a conviction probably held by the monk in his scriptorium, gloomily contemplating the first moveable type. Could we not be expected to show at least as much maturity and vision as the mathematicians, who see no threat to their own supremacy in the arrival of machines that make thousands of calculations every second? "We can always think of more things to ask the machine to do than it can ever learn to do," they will say confidently, and get on with the business of developing the sensitivity and power of their marvellous tools, knowing that every advance yields them more freedom from drudgery, more opportunity for creative research. We might wonder whether it is these people or the Humanists who are the more dedicated to the human use of human beings.

But our innocence is not the only problem. For even those of us who try to come to terms with automation are likely to be frustrated, owing to our inability to communicate with computer scientists in their own language. When the computer programmer talks of a "word," he means "thirty-six bits," and by "thirty-six bits" he means six "six-bit" elements of the binary number system in which the machine counts. The basic number of bits (an abbreviation of "binary digits") happens to be six not, as a learned humanist friend of mine conjectured, because certain tribes of American Indians developed an effective number system on a base of six, but because six is the smallest number of binary digits that will accommodate the 47 characters on an IBM print wheel.
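The arithmetic behind the six-bit figure can be checked in a few lines. This is only an illustrative sketch: the function name `bits_needed` is invented here, and the only figures taken from the text are the 47 print-wheel characters and the thirty-six-bit word.

```python
def bits_needed(n_symbols: int) -> int:
    """Smallest number of binary digits giving each symbol a distinct code."""
    return (n_symbols - 1).bit_length()

PRINT_WHEEL_CHARACTERS = 47   # figure given in the text
WORD_BITS = 36                # the programmer's "word"

# Five bits give only 32 codes, too few for 47 characters; six give 64.
bits_per_character = bits_needed(PRINT_WHEEL_CHARACTERS)

# A thirty-six-bit word therefore holds six six-bit characters.
characters_per_word = WORD_BITS // bits_per_character
```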

These, of course, are misunderstandings of the simplest kind, involving the transfer of metaphors from one discipline to another. Far more complex and disturbing are the misunderstandings that result when we attempt to parse a technical paper dealing with computer processes. Though the words are clearly English, and often familiar, we are likely to find the concepts beyond our grasp, and the language, somehow, impenetrable.

As a result of our inability to speak the language of computer science, the computer people are obliged to communicate with us in our language, and they have a way of telling us the things we seem so delighted to hear. "This machine," they will say reassuringly, speaking of a new computer, "is fairly stupid. It has only about a second-grade intelligence." "Of course," they add, after a carefully timed pause, "the last machine we had was only in the first grade. . . ." And they will go on thoughtfully to tell us about the "compiler," a new device by means of which the machine can be taught to learn from its own mistakes, and thus in a sense to program itself. The question that immediately rises to haunt our minds — "who confesses the compiler?" or something of the sort — has little meaning for the programmer because he has been using words and metaphors drawn from our world, not his own, and ours is so obviously remote from reality as to be almost a fairy-land.

These two worlds represent, of course, the two cultures so brilliantly portrayed by Sir Charles Snow in his memorable Rede Lecture of 1959, The Two Cultures and the Scientific Revolution. The separation between them is, as Sir Charles declared, one of the critical problems of our age. Its magnitude becomes distressingly clear to anyone who endeavors to apply the processes of computer technology to research in the Humanities.

II

But it is time to move to more immediately relevant problems. When we began at Cornell in the spring of 1957 to develop a concordance technique on the IBM 704 computer, we had no models to imitate. Neither the Revised Standard Bible concordance (made on a Remington Rand Univac)[1] nor any papers describing electronic indexing of the Dead Sea Scrolls and the works of Thomas Aquinas[2] had yet appeared — nor had the word-index to Dryden, which was made by hand[3] (it took some twenty years) then checked and printed by means of IBM accounting-machines. As we surveyed the problem, it seemed to us that the indexing process should remain rigidly under computer control to ensure speed and accuracy, yet that correction of errors must somehow be provided for; we felt, moreover, that for economy the computer should be induced to give us a finished page print that might be photographed for publication.

As we ultimately worked out our technique for a pilot run on the poems of Matthew Arnold, the process went roughly like this. The lines of Arnold's verse were punched on IBM cards, one line per card. We used the standard edition of Arnold, edited by Tinker and Lowry, adding to each line card, by an automatic process, the line number and page number shown in that volume. Variant lines, made up from the Tinker and Lowry collations, were also punched (each with an identifying "V"), then grouped at the end of each poem; a separate title card was punched and inserted before each poem. The entire deck of cards (some 17,000) was now "listed" by an IBM printer and proofread. At this stage errors could be corrected by simply pulling and replacing cards. When we were satisfied that the deck was accurately punched, we fed the cards into an IBM Card Reader, which transferred the data on them to magnetic tape.

At this juncture the 704 computer came into play. Since alphabetical sorting is not one of the operations which the 704 was designed to perform, the computer program had to be an innovative piece of research, involving much trial and error. Thanks to the creative ingenuity of our programmer, Mr. James A. Painter of the IBM Corporation, we were ultimately provided with a program that perfectly suited our needs. The program had three distinct steps. In the first, Arnold's words were picked out of his lines of verse and collected on a separate tape; in the second, the words were sorted alphabetically; in the third, they were re-united with their lines (to which titles had now been attached) and prepared for "listing." Before beginning the first step, the machine assigned to each line of verse an arbitrary serial number, thus making what we have called a "line dictionary." The machine then scanned each line word by word, reading from the beginning to the first space, then on to the next space, and so on. As each word was picked up by the computer it was automatically checked against a list of some 150 common, "non-significant" words (that is, words not to be indexed) previously stored in the computer's "memory." If the word proved to be on the list, it was dropped, and the next word on the line picked up; if the word was not on the list, the computer transferred it, along with the serial number of its source line, to another tape for sorting.
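The first step described above can be sketched in a modern language. What follows is an illustration only, not the 704 program: the function names are invented, the list of omitted words is a tiny stand-in for the list of some 150, and the sample lines are assumed to be already stripped of punctuation, as in the pilot run.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for the stored list of "non-significant" words.
OMITTED_WORDS: Set[str] = {"THE", "A", "AND", "OF", "TO", "IN", "IS"}

def build_line_dictionary(lines: List[str]) -> Dict[int, str]:
    """Assign each line an arbitrary serial number (the "line dictionary")."""
    return {serial: line for serial, line in enumerate(lines, start=1)}

def extract_significant_words(
    line_dictionary: Dict[int, str],
    omitted: Set[str] = OMITTED_WORDS,
) -> List[Tuple[str, int]]:
    """Scan each line from space to space, dropping words on the omitted
    list and pairing every other word with its line's serial number."""
    records: List[Tuple[str, int]] = []
    for serial, line in line_dictionary.items():
        for word in line.split(" "):
            if word and word not in omitted:
                records.append((word, serial))
    return records

lines = ["THE TIDE IS FULL THE MOON LIES FAIR", "UPON THE STRAITS"]
line_dictionary = build_line_dictionary(lines)
records = extract_significant_words(line_dictionary)
```

Each record carries only the word and a serial number, so the lines themselves need not travel through the sort; they are looked up again in the line dictionary at the end.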

The second stage of the program began when all "significant" words had been collected. The sorting procedure is too intricate to be described in detail, but briefly it involves a lengthy series of comparisons. As each letter — and of course each word — goes onto magnetic tape from the punch card, it is coded as a series of binary digits on which any of the operations of binary arithmetic can be performed. When two different words are compared, therefore, the one which proves to be the "smaller" is sorted first alphabetically. Since Arnold wrote about 64,000 "significant" words, the number of comparisons required was very large, in spite of some ingenious short-cuts devised by Mr. Painter; although the computer is capable of making approximately 2500 comparisons per second, the sorting took 25 hours. It is fair to add that much of this time was consumed by auxiliary machine operations, including an elaborate checking routine written into the program. While the 704 is an exceptionally reliable machine — which is to say that its error rate is very low — long runs increase the probability of error. To ensure absolute accuracy a sum-check on the numeric operations of the machine was performed automatically about every ten minutes of the Arnold run; if the check failed to clear, the program was rolled back to the last successful check and re-started. One ought further to add that recent refinements of the sorting routine have reduced the time to less than ten hours.
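The principle that the numerically "smaller" word sorts first can be shown with a small sketch. It assumes a made-up six-bit code table (space = 0, A = 1, and so on), not the actual IBM character code, and it relies on a built-in sort instead of reproducing Mr. Painter's routine, which the text does not describe in detail. The fixed width of 21 letters is the limit mentioned for Arnold.

```python
from typing import List, Tuple

# Hypothetical code table: space = 0, A = 1, ... Z = 26 (not the IBM code).
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"
MAX_LETTERS = 21
BITS_PER_CHARACTER = 6

def word_to_number(word: str) -> int:
    """Pack a word, padded with spaces to a fixed width, into one integer
    built from six-bit character codes."""
    if len(word) > MAX_LETTERS:
        raise ValueError(f"{word!r} exceeds {MAX_LETTERS} letters")
    value = 0
    for ch in word.ljust(MAX_LETTERS):
        # index() raises ValueError for characters outside the code table.
        value = (value << BITS_PER_CHARACTER) | ALPHABET.index(ch)
    return value

def sort_records(records: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Order (word, serial) records so the numerically smaller word comes
    first; ties fall back to the serial number of the source line."""
    return sorted(records, key=lambda rec: (word_to_number(rec[0]), rec[1]))

sorted_records = sort_records([("SEA", 1), ("AAR", 3), ("AARAU", 2), ("CALM", 1)])
```

Because the padding space has the lowest code, a shorter word precedes any longer word that begins with it, which is why AAR comes before AARAU.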

At the end of the second stage of the program we had a tape on which all significant words in Arnold's text were arrayed in alphabetical order, each accompanied by the serial number of the line in which it occurred. All that remained, in the third stage of the program, was to recover the lines of verse themselves from the line-dictionary tape (by means of their serial numbers) and prepare them for listing. Once recovered, the lines were arranged on another tape, divided into pages 90 deep, and indented beneath the index words. The order in which the lines fell under each index word was determined by the order in which the cards had been fed onto the line-dictionary tape; in this case, it was page- and line-order in the Tinker and Lowry edition. On the page tape the identifying information was attached to each line, dots were supplied to fill out short lines, long lines were doubled back where necessary, and the word "CONTINUED" was supplied wherever an entry ran past a page break. The final listing was made directly from this page tape by an IBM Printer running "off-line," that is, not involving the computer at all. The resulting pages were reproduced by an offset process, and the Arnold volume was published in 1959, the first in a series to be known as the Cornell Concordances.[4]
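The third stage can be sketched as follows. This is a simplified illustration with invented function names: it recovers each line by serial number, indents it beneath its index word, and breaks the listing into pages with a "CONTINUED" heading. The dot leaders, the doubling back of long lines, and the page and line references described in the text are left out, and the sample lines are made up.

```python
from itertools import groupby
from typing import Dict, List, Optional, Tuple

PAGE_DEPTH = 90  # lines per page, as in the Arnold run

Row = Tuple[str, Optional[str]]  # (index word, line text or None for a heading)

def assemble_rows(sorted_records: List[Tuple[str, int]],
                  line_dictionary: Dict[int, str]) -> List[Row]:
    """Re-unite each sorted index word with its lines of verse."""
    rows: List[Row] = []
    for word, group in groupby(sorted_records, key=lambda rec: rec[0]):
        rows.append((word, None))                    # the index word itself
        for _, serial in group:
            rows.append((word, line_dictionary[serial]))
    return rows

def paginate(rows: List[Row], depth: int = PAGE_DEPTH) -> List[List[str]]:
    """Cut the listing into pages, repeating the index word with
    CONTINUED when an entry runs past a page break."""
    pages: List[List[str]] = []
    page: List[str] = []
    for word, text in rows:
        if len(page) == depth:
            pages.append(page)
            page = []
            if text is not None:
                page.append(f"{word} CONTINUED")
        page.append(word if text is None else "    " + text)
    if page:
        pages.append(page)
    return pages

line_dictionary = {1: "THE SEA IS FULL", 2: "SEA OF FAITH", 3: "THE SEA MEETS"}
sorted_records = [("FULL", 1), ("SEA", 1), ("SEA", 2), ("SEA", 3)]
pages = paginate(assemble_rows(sorted_records, line_dictionary), depth=3)
```

The page depth is set to 3 in the example only so that the page break is visible in a short listing.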

I present these details in order to give some sense of the way in which an electronic calculator operates, and I have, of course, passed over a number of textual and programming difficulties.[5] Perhaps a single example will suffice to show how some understanding of the machine's operation is necessary to deal intelligently with editorial problems. There was, for instance, the matter of punctuation. The standard IBM print wheel is equipped with some but not all punctuation symbols. For the pilot run it therefore seemed wisest to dispense with punctuation. Some lines were thus rendered mysterious, or ludicrous; some, especially those stripped of apostrophes, became misleading (without the apostrophe possessives usually become indistinguishable from plurals; moreover, we'd becomes WED, I'll ILL, she'll SHELL, I'd ID, and he'll HELL). But we were pleased to see how little the appearance of most lines was changed for the worse.
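The effect of dropping punctuation is easy to reproduce. The sketch below is illustrative only, with an invented function name; it removes every punctuation mark except the hyphen and converts to capitals, which is enough to show how the contractions mentioned above collapse into other words.

```python
import string

def strip_punctuation(line: str) -> str:
    """Upper-case a line and drop all punctuation except the hyphen."""
    return "".join(
        ch for ch in line.upper()
        if ch not in string.punctuation or ch == "-"
    )

collapsed = [strip_punctuation(w) for w in ["we'd", "I'll", "she'll", "I'd", "he'll"]]
```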

Now, we did preserve the hyphen, which made it unnecessary to join or separate words artificially, but which also led us into a dilemma. If we instructed the machine to treat the hyphen as a letter, all hyphenated compounds would show up as index words, but the second portions of the compounds would not. Arnold's liking for compounds made this result seem undesirable (calling my humanist instincts into play, I once counted, by hand, more than 40 compounds in the "Scholar-Gipsy" alone — "green-muffled," "frail-leaf'd," "black-wing'd," "red-fruited," "close-lipp'd," and so on). We took the only alternative open to us and instructed the machine to treat the hyphen as a space. By this means we saved the second half of each compound but lost the whole as an index entry. Somewhat disturbing was the realization that we were causing compounds with both halves on the list of omitted words to vanish entirely. (If Arnold ever used the hyphenated noun "TO-DO," I am afraid we know nothing about it). As a way out of this dilemma, available for forthcoming concordances, we have incorporated a cross-indexing feature in the computer program. The machine is now directed to treat the hyphen as a letter and thus to print the entire compound as an index entry; it is further directed to list as a separate index entry the second portion of every hyphenated word, followed by the word "SEE" and the whole compound (it was not thought necessary to cross-reference the first portion). Naturally, the lines of verse containing the compound are to be listed under the whole compound, not under the cross-reference.
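The two treatments of the hyphen can be set side by side in a short sketch. The function names and the sample omitted-word list are invented for illustration, and one detail is an assumption: for a compound with more than one hyphen, everything after the first hyphen is taken as the "second portion," a case the text does not address.

```python
from typing import List, Set

OMITTED: Set[str] = {"THE", "TO", "DO"}  # hypothetical sample of the omitted list

def index_words_pilot(line: str, omitted: Set[str] = OMITTED) -> List[str]:
    """Pilot run: the hyphen counts as a space, so each half of a
    compound is indexed (or dropped) on its own."""
    return [w for w in line.replace("-", " ").split() if w not in omitted]

def index_words_revised(line: str, omitted: Set[str] = OMITTED) -> List[str]:
    """Revised program: the hyphen counts as a letter, the whole compound
    is an index word, and its second portion yields a SEE cross-reference."""
    entries: List[str] = []
    for word in line.split():
        if "-" in word:
            entries.append(word)
            second = word.split("-", 1)[1]
            entries.append(f"{second} SEE {word}")
        elif word not in omitted:
            entries.append(word)
    return entries

lost = index_words_pilot("THE TO-DO")
kept = index_words_revised("THE TO-DO")
```

Under the pilot treatment "TO-DO" disappears altogether, since both halves are on the omitted list; under the revised treatment it survives as an entry with its cross-reference.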

This innovation has one drawback: it lengthens and complicates the sorting routines. For the program has to be expanded to accommodate the longest known word in the text. We felt safe in setting this limit for Arnold at 21. We failed to ask the machine to produce for us a list of index words in order of length — a chore it could readily have performed — so I cannot say how close we came to this limit. I can only offer "inextinguishable," with 16 letters, again discovered by an old-fashioned process. But with hyphenated words to be taken care of we felt obliged to run the allowance up to 30 letters, including the hyphen, and even this may not be enough for Old English texts, or for some of Yeats's remarkable compounds.

I have not even yet finished with the simple matter of punctuation. Desiring to add sophistication — not to speak of intelligibility — to forthcoming concordances, we resolved to acquire a special set of print wheels bearing punctuation. But the design of these wheels was not easy to fix. The 47 positions on the standard wheel provide for 26 letters, 10 digits, and only 11 "special characters," whereas the ordinary typewriter keyboard has, besides letters and digits, some 18 symbols. We had either to sacrifice such useful symbols as brackets, dash, ampersand (which abounds in Blake), asterisk, and the like, or to displace letters or digits on certain of the wheels. Since we wanted to include among the new characters three Old-English letters, we took the latter alternative. We ordered a 120-wide bank of print wheels made up of two designs: the left-hand 80 wheels, to be used for printing index words and lines of text, are of our new design, with full punctuation but no digits; the right-hand 40 wheels, to be used for printing page and line numbers and title abbreviations, are of a standard design, with all the digits but only minimal punctuation. This complex, but workable, compromise imposes limitations that must be taken account of editorially. No title abbreviation can contain any special characters (such as thorn) because these are present only on the text wheels. Similarly, where digits occur in the text (as they occasionally do, for example, in Blake), they must be spelled out before punching, or spaces must be left for paste-overs on the final print. Unfortunately, the absence of digits on the left side and center of the page prevents us from having the machine print page numbers at the bottom of the finished sheets, as we had once hoped it might do (regretfully, we turned down our programmer's offer to spell the numbers out).

I hope this one example will suggest how complicated even the simplest editorial problem can become. A number of other minor misadventures occurred during completion of the Arnold concordance, some of them exasperating, some amusing. For instance, my unaccountable failure to list "IT" among the words to be omitted required the removal of ten and a half pages of IT from the final print. And when the first full-scale test of the intricate sorting routine produced as the first two items in Arnold's vocabulary AAR and AARAU, our distracted programmer was driven back to his drawing board — until we convinced him that they were perfectly good Swiss place names. But tempting as it is to share these griefs I shall pass them over in order to get to a more important, indeed an over-riding, problem, one that arose with the Arnold but remains to be faced whenever verbal text is processed by mechanical means.

No machine at the present stage of its development, not even the most advanced electronic computer, is able to recognize anything but the physical characteristics of a word. This means that homographs are indiscriminately thrown together in a concordance. Now in some instances this result is unobjectionable. Most of the hand-made concordances show under a single entry all the occurrences of such mixed items as "rose," "left," "long," and the like. But difficulty arises with certain common words like "art" and "will," which have one important meaning submerged among the occurrences of another, high-frequency but unimportant meaning. In the Arnold program we had no choice but to include all occurrences of these words, leaving it to the reader to find the important meanings. Yet this is wasteful, and it would clearly be desirable to exclude "art" and "will" where they occur as verbs and keep them where they are nouns. Moreover, in concordances which are likely to be used in linguistic or philological research, such as the Old English, users may expect to find homographs discriminated. What we have had to face, therefore, is the general problem of devising means by which discriminations made by the editor can be incorporated economically into a machine program. I am not sure that we have solved this problem to our entire satisfaction, but we have made two tentative solutions, both now being employed in our work in progress on Yeats, Blake, Ben Jonson, Emily Dickinson, and the Old English poetic remains.

The first allows for editorial discrimination of homographs before punching. A homograph discriminant is marked by the editor on the copy and punched into the line card immediately following the appropriate word; the discriminant, which consists of an arbitrary symbol followed by a letter, with different letters signifying different meanings of the word, is used in indexing but suppressed from the print. To employ this technique, the editor must, of course, know where the homographs are in his text and decide before the punch begins which of them are worth his attention. With the second technique, however, discriminations can be made after indexing has been completed and variant meanings have been drawn together under a single index word. This technique involves listing all lines as they are to occur in the finished print but with no page spacing; each line is given an arbitrary serial number. Working with this unpaged list the editor can punch instructions into cards and feed them onto a separate tape, which is played against the main tape to produce the final paged print. The instructions may be of two sorts: (1) delete line X (2) between lines X and Y insert the following line. By combining deletions and insertions the editor can thus either drop all non-significant meanings from an index entry or separate the lines representing one entry into two entries, each with its own index word. If desired, the index words can be followed with grammatical identifications: ROSE (NOUN), and ROSE (VERB).
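The second technique, in which editorial instructions are played against an unpaged listing, can be sketched as below. The data layout and function name are invented for illustration; "between lines X and Y insert" is modelled simply as "insert after line X," and the sample lines are made up, not quoted from any poet.

```python
from typing import Dict, List, Set, Tuple

def apply_instructions(
    listing: List[Tuple[int, str]],
    deletions: Set[int],
    insertions: Dict[int, List[str]],
) -> List[str]:
    """Play editorial instructions against a serially numbered listing:
    drop each deleted line, and add new lines after the named serial."""
    output: List[str] = []
    for serial, text in listing:
        if serial not in deletions:
            output.append(text)
        output.extend(insertions.get(serial, []))
    return output

listing = [
    (1, "ROSE"),
    (2, "    A ROSE IN JUNE"),
    (3, "    THE SUN ROSE EARLY"),
    (4, "    RED ROSE BLOWN"),
]
# Split one entry into two: replace the heading, move the verb line.
final = apply_instructions(
    listing,
    deletions={1, 3},
    insertions={1: ["ROSE (NOUN)"], 4: ["ROSE (VERB)", "    THE SUN ROSE EARLY"]},
)
```

Deleting line 3 without re-inserting it would instead drop the non-significant meaning from the entry altogether, which is the other use the text describes.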

The counterpart of the homograph problem, again originating in the machine's incapacity to deal with anything but physical measurements, is what might be called the general problem of variants. Here we are concerned not to separate forms that the machine has undiscriminatingly mixed, but to bring together forms that the machine, blindly following its mechanical routines, has failed to recognize as related. Again, the need for a solution to the problem is frequently slight. Few of the hand-made concordances have attempted to group grammatical variants under a single entry, and when the text being indexed belongs to the 19th or 20th century, or has been modernized, no significant variants of any other sort exist. It is, of course, old-spelling texts that present the challenge here. In an effort to meet it we have begun work at Cornell on a concordance to the poems of Ben Jonson, based on the Herford and Simpson edition. Some of the thinking we have done, and some of our provisional procedures may be of general interest.

To begin at the most elementary level of this problem, one might observe that there are three possible forms which a concordance of a pre-nineteenth-century text might take. First, both index words and quoted lines of context can be modernized, as in Bartlett's Shakespeare concordance. Second, both index words and text lines can be given in old-spelling. Both these types of concordance could be produced on a computer, but neither type is wholly satisfactory. The first involves all the risks and difficulties of modernizing, sufficiently well-known to need no rehearsing here. The second requires a very considerable cross-reference apparatus to lead the user to all the various spellings of the word he is interested in. Clearly, the optimum concordance to an early text is of a third type, in which the lines of verse are given in their original form but index words are modernized, exactly as in the Spenser concordance, for one example, where, to look up a word, as the editor remarks, "the reader has only to recall its modern spelling."

It is true that this third type of concordance, no less than the first type, would oblige us at one point or another to modernize every "significant" word in the text. But there is a difference. Here we would not be altering the text but only setting down as index words some arbitrary equivalents as a means of locating particular forms. We have here a chance, in other words, to avoid making commitments and to hedge those we do make. In view of the jungle of textual problems that surrounds the act of modernizing, this is surely an advantage. But the important advantage over a modernized concordance lies, quite simply, in the fact that studies of old-spelling forms — grammatical, linguistic, or textual — become possible when the concordance preserves these forms. The scholar interested in comparing Shakespeare's use of "virtue" to Arnold's is satisfied with a modernized text. But the scholar interested in comparing compositors' habits or in tracing dialectal survivals finds no use at all for modernized forms. It may be worth adding that our practice at Cornell will be to keep on file the line tapes of finished concordances. Within 20 or 30 minutes, the computer can, upon demand, search an average tape and produce all occurrences of any specified word, whether on the omitted list or not. If the words are stored on tape in old spelling, it will be possible to look up "nonsignificant" forms which for one reason or another become a matter of interest, whereas if the words are modernized, these forms are irretrievably lost.

Our decision, therefore, has been to produce a Jonson concordance with modernized index words and old-spelling text. The difficulties of producing such a concordance by mechanical means should be obvious. Even before old-spelling text can be punched, a good deal of pre-editing has to be done, for elided words and contractions, if punched as they stand, would grievously foul the index. In later stages a fairly sophisticated computer program will be required. For at some point in the process we shall have to replace old-spelling index words with modernized forms, at the same time rearranging lines so as to bring together under the modern form all the scattered variant spellings. The substitution of index words will have to be carried out by means of a separate tape on which each old-spelling index word is associated with its modern equivalent; this tape will be used to control the necessary resorting of lines. Where scattered lines show up under the wrong modern entry, owing to ambiguity in the old-spelling form, we expect to re-sort them by means of the technique I have described for handling homographs on the final unpaged listing.
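The substitution and re-sorting just described might look something like the sketch below. It is offered under clear assumptions: the routines were still experimental, so this is not the Cornell procedure, and the spelling table and function name are hypothetical examples standing in for the separate tape of equivalents.

```python
from typing import Dict, List, Tuple

# Hypothetical equivalents; in the plan described, these would come from a
# separate tape pairing each old-spelling index word with its modern form.
MODERN_FORM: Dict[str, str] = {"VERTUE": "VIRTUE", "VERTU": "VIRTUE", "LOUE": "LOVE"}

def modernize_and_resort(
    records: List[Tuple[str, int]],
    modern_form: Dict[str, str] = MODERN_FORM,
) -> List[Tuple[str, str, int]]:
    """Attach a modern index word to each (old spelling, serial) record,
    then re-sort so variant spellings gather under the modern form.
    Words with no listed equivalent are indexed as they stand."""
    remapped = [(modern_form.get(word, word), word, serial) for word, serial in records]
    return sorted(remapped, key=lambda rec: (rec[0], rec[2]))

resorted = modernize_and_resort([("LOUE", 4), ("VERTU", 9), ("VERTUE", 2), ("VIRTUE", 5)])
```

The old spelling is carried along in each record, so the quoted line and the original form remain untouched; only the heading under which the line is filed changes.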

Since the Jonson project is still in an experimental stage, I shall not commit myself to any more detailed specification of its routines, for they may yet change substantially. If I have already gone into closer detail than the ordinary humanist sensibility can bear without anguish, my purpose has been a wholesome one — to reassure the reader that we have not yet been able to reduce all our techniques to automatic routines. Man is still master of the machine. The impatient hopes with which we embarked on the making of concordances have been pretty well checked, as we have learned that special problems are likely for some time yet to multiply, instead of vanishing. The creation of a standard program into which we can pour lines of verse and out of which a finished concordance emerges is still, alas, some way in the future. Before it can come into being we may even have to see the creation of a new breed of editor and critic, one as well versed in binary arithmetic and computer programming as in literary history and the principles of textual criticism.

But machines are easier to program than men, and we must not become visionary. Let us look forward, instead, to consider, finally, one or two of the particular ways in which electronic computers are becoming useful to scholars in the Humanities, and especially to editors and textual critics of the breed that now exists.

III

One of the Cornell projects now well under way is a concordance to the complete writings of William Blake, edited by David Erdman of the New York Public Library. This concordance is based upon the variorum edition by Sir Geoffrey Keynes, but it will incorporate hundreds of corrected or added readings derived from a fresh collation of all Blake texts carried out over the past year by Dr. Erdman and his world-wide team of devoted Blakeians. The concordance will be, in effect, a new edition of Blake, albeit in somewhat scrambled order. All lines which differ in any significant way from their counterparts in Keynes will be "flagged" in the concordance index and printed up separately as an Appendix to the volume. Since we are establishing as well as indexing a text, it is reasonable to suppose that a study of the concordance in its unpaged form (where changes are still possible) will help us in making final editorial decisions. Conjectural readings, for example, which we have arranged to accommodate, may be materially strengthened, or weakened, by evidence that turns up elsewhere in the text.

This integration of editorial and indexing routines is, I think, an important development, especially in a form which it might take in connection with any edition in progress. If an editor can arrange to finish his collations and read proof on the text before proceeding to the rest of his task, he can be provided with a concordance made from his own text to assist him in composing textual and critical notes, and the introduction to his volume. (Any editor will understand how valuable this assistance might be). The concordance could then be published at about the same time as the text itself, perhaps even as a companion volume. There is no reason why this procedure could not become entirely conventional. When it does, the punching of text on IBM cards, fast and simple as it is, will probably become obsolete. It is already possible to feed text into the computer directly from the perforated tapes produced by an ordinary monotype machine; it may soon be possible to scan print photoelectrically and transfer it directly to magnetic tape for computer processing.

This is one of the directions that computer work will inevitably take within a few years. To illustrate another I shall again present a single example in the hope that wider inferences may be drawn from it. In the field of stylistic analysis a whole new world of possibilities seems about to open up, the shape of it already discernible. A unique feature of the Arnold concordance (and one which we expect to furnish on all the Cornell concordances) is a list of index words in order of frequency, produced for us by the computer. But counting frequencies is only one of many operations a computer might be expected to do by way of analyzing characteristics of a literary text. The computer's insensitivity to anything but physical characteristics is a smaller handicap than one would imagine, for it can still do something like the things we do ourselves when we identify style. Where we may observe that Samuel Johnson wrote in rotund oratorical sentences and used a Latinate vocabulary, the computer would measure the unusually long intervals between spaces and between periods, and would record the high frequency of commas and of letter patterns like "MENT," "TION," "ATE," "PRE," and the like. Where we might notice, and mark as characteristic, certain conceptual relationships in Johnson's thinking, the computer, with its infallible memory, might accumulate definitive evidence of a kind to which we are normally insensitive, such as keyword clusters, syllable frequencies, trigraph patterns, verb-noun ratios, and other concrete properties which every concept takes on when it is given literary form. When the computer is dealing with an unknown text, the properties it measures might be matched against analogous properties of known texts, and a series of weighted scores assigned to show the degree of correspondence. Thus, to become wholly fanciful for a moment, a newspaper sonnet of unknown authorship might yield a score of, say, 35 as Coleridge, 39 as Wordsworth, 28 as William Lisle Bowles, or Charles Lloyd, or Mrs. Mary Robinson, but a score of, say, 71 as Southey. On grounds like these, provided that there is no external evidence that is contradictory, we would be tempted to attribute the sonnet to Southey.
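The kind of measurement and matching imagined here can be sketched in a few lines of Python. Everything in the sketch is an assumption made for illustration: the function names `profile`, `correspondence`, and `rank`, the particular set of features, the unweighted scoring formula (one minus the mean relative difference, scaled to 100), and the sample texts, which are invented and are not drawn from any author. It is not a validated method of attribution.

```python
import re

# Letter patterns named in the text as marks of a Latinate vocabulary.
LETTER_PATTERNS = ("MENT", "TION", "ATE", "PRE")


def profile(text):
    """Measure a few purely physical properties of a text."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words) or 1
    upper = text.upper()
    features = {
        # Interval between spaces: mean word length, punctuation stripped.
        "mean_word_length": sum(len(w.strip(",.;:!?\"'")) for w in words) / n_words,
        # Interval between periods: mean number of words per sentence.
        "mean_sentence_length": n_words / (len(sentences) or 1),
        "commas_per_word": text.count(",") / n_words,
    }
    for pattern in LETTER_PATTERNS:
        features["pattern_" + pattern] = upper.count(pattern) / n_words
    return features


def correspondence(unknown, known):
    """Score two profiles from 0 to 100; identical profiles score 100."""
    total = 0.0
    for name in unknown:
        a, b = unknown[name], known[name]
        largest = max(abs(a), abs(b))
        # Relative difference of each feature, between 0 and 1.
        total += 0.0 if largest == 0 else abs(a - b) / largest
    return 100.0 * (1.0 - total / len(unknown))


def rank(unknown_text, known_texts):
    """Return (label, score) pairs for the known texts, best match first."""
    target = profile(unknown_text)
    scores = [(label, correspondence(target, profile(text)))
              for label, text in known_texts.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Invented sample texts: "A" is plain and clipped, "B" is long-winded and Latinate.
known_samples = {
    "A": "The cat sat. The dog ran. We ate.",
    "B": "The predominant sentiment, upon deliberate examination, "
         "was a lamentation of the situation.",
}
unknown_sample = ("The government, after prolonged agitation, "
                  "made a presentation of the settlement.")

for label, score in rank(unknown_sample, known_samples):
    print(label, round(score))
```

With these invented samples the unknown sentence stands closer to sample "B" than to sample "A". A serious trial would need far larger bodies of text, weights for the separate features, and some test of whether the differences observed are larger than chance would produce.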

Now the critical imagination may shudder at the thought of running enough tests on the sonnets of Bowles, or Charles Lloyd, or Mrs. Robinson to build up the necessary bank of scores. But there is likely to be a far more serious objection to the procedure I have fancifully outlined, even if the results could be made to turn out cleanly. Some readers will perhaps recognize that what I have described resembles a cryptanalytic attack on a piece of cipher text, and will be properly skeptical of its validity as applied to literary text. For as Colonel and Mrs. William Friedman have recently reminded us, in their brilliant and amusing account of the search for cipher in Shakespeare,[6] a cryptanalytic attack is valid only if an underlying system does in fact exist in the text. And who will declare that a system exists in a man's literary style? When we measure the properties of style we measure the man himself, his reason, his logos. Surely what I have proposed is a more terrible thing even than any possible menace to our human supremacy exerted by electronic brains!

Yet I am prepared, I think, to press the idea. When we consider soberly the progress we have made during the last half-century in measuring attributes of the mind, we would be incautious indeed to conclude that we have reached the end of this investigation. More probably, we have only begun. But to put the matter in these terms at all is less realistic, I suggest, than to regard the computer technique as simply an extension of the kind of stylistic analysis now being practiced.

If the computer does most of the things we do ourselves when we analyze style, the crucial difference is that the powers of this sort of analysis can be fantastically multiplied by electronic application, even with machines now available. And we must understand that we are just entering the age of computer technology. Today, hardly more than 15 years from the opening of this age, computers regularly become obsolete as rapidly as they can be built. I have already mentioned input devices that are learning to read photoelectrically; operating speeds are rising astronomically as tubes give way to transistors, and transistors to low-temperature crystals (the new IBM 7074 is twenty times as fast as the 7070, itself a transistorized machine and some three times as fast as our lumbering old 704); computer memories are expanding to accommodate hundreds of thousands of words; programs are becoming sophisticated enough to perform accurate and grammatical translation. All these developments, and others equally breath-taking, suggest that it cannot be long before computers will undertake successfully the most delicate and complex programs of stylistic analysis.

Nor can it be long, I trust, before the research scholar in the Humanities will recognize these developments and learn to turn them to his advantage — to venture more freely across the boundary that, lamentably, separates his culture from that of the scientist. I am not suggesting that we should celebrate the coming of a new god (ex machina, naturally) as Yeats might have done, had he lived into the age of cybernetics and "information theory" — perhaps seeing the electronic brain as a great smooth beast, with "gaze blank and pitiless," rolling evenly towards New York to be born. I am only suggesting that the "scientific revolution," which is being created without much help from us, is probably the greatest single fact of our century; that it will go on expanding whether we recognize it or not; that we have nothing to fear and everything to gain from coming to terms with it; and that if we learn to exploit its potentialities we shall be serving the cause of the Humanities in the best possible way.

Notes

[*] A paper read at the English Institute, Columbia University, September, 1960.

[1] By the Rev. John William Ellison (New York, 1957).

[2] Paul Tasman, "Literary Data Processing," IBM Journal of Research and Development, I (July, 1957), 249-256.

[3] By Guy Montgomery (Berkeley, 1957).

[4] The program used for Arnold was described by Mr. Painter in "Computer Preparation of a Poetry Concordance," Communications of the ACM, III (February, 1960), 91-95. Both Mr. Painter's account of the program and mine, however, now fall in the realm of history, for the techniques have been wholly revised for forthcoming concordances.

[5] One might be mentioned: the surprising necessity of collating texts printed in the Tinker and Lowry Commentary on Arnold (1940) with texts printed in the Tinker and Lowry edition of Arnold (1950).

[6] The Shakespearean Ciphers Examined (Cambridge, 1957).
