University of Virginia Library


Although a few computer-generated concordances to the works of English authors have been published, not much progress has been made with the computation of Shakespeare. Professor Bowers's remarks in On Editing Shakespeare (1966) and Professor Louis Marder's evangelistic comments from time to time in Shakespeare Newsletter have fallen on stony ground. The reason is not hard to seek. On E. K. Chambers's count, the average length of the thirty-seven canonical Shakespearian plays is 2,720 lines. In the typographical lines of the First Folio, the plays range from The Comedy of Errors with 1,920 lines to Hamlet with over 3,900. Such an extent of material imposes serious and often inhibiting demands on even the largest computing systems, while the problems encountered in preparing the texts and in arranging for a properly legible concordance are great. One must admire the selfless industry with which the editors of earlier concordances to Shakespeare worked from manually-transcribed paper slips, despite the considerable deficiencies of these works. Their inadequacies are too familiar to require rehearsal here. The nature of John Bartlett's concordance, a modern spelling concordance to the Old Cambridge Shakespeare in a single alphabet, is only remotely germane to the present subject, but his 1,910 pages set in 6 point, with 110 lines in each column, clearly illustrate how greatly the problem of scale affects the study of Shakespeare's plays by computer.

Had I appreciated properly the extent of the undertaking before I began work on the Old-Spelling Concordances for the Clarendon Press, no doubt my enthusiasm would have been short-lived. However, for some years, study of spellings as evidence for the derivation of printed texts from scribal or holographic copy had reinforced my view that a swifter advance in the thorough analysis of Shakespeare's texts would come, not from the partial study of problems of individual texts, with inadequate understanding of how related evidence from other texts might affect conclusions, but from the detailed study of features common to many texts. There was nothing original in this conclusion: it will be generally agreed that hypotheses which are grounded on a suitably-thorough study of the facts are more likely to survive the examination of time and scholarship than those which are not. The means of securing the general survey of Shakespeare's texts which I thought necessary did not, however, exist; for Bartlett's omissions as well as his dependence on an obsolete modern-spelling edition render his concordance practically useless for the detailed orthographical study which might lead to a better understanding of Shakespeare's text and language.

It was essential to have at least a comprehensive and accurate concordance or concordances to the early texts: 'at least' because, although concordances are projected and received with enthusiasm, they are relatively unsophisticated working tools: to use a computer just to produce a concordance seems an indefensible waste of time and resources. For the work on the scribes and compositors which interested me, more elaborate computational routines were essential. Nevertheless, concordances were the obvious starting point, and if they were to be published for the help of all scholars, so much the better.

For some time I could do little but contemplate the outlines of the work I wished to pursue; for, until Hinman's monumental Printing and Proof-Reading of the First Folio of Shakespeare was published, detailed investigation was hazardous without the aid of his description of the composition and printing of the Folio. When I found that Hinman had not, as I had hoped, been able to assign all the pages of F with certainty to their respective compositors, it seemed inevitable that old-spelling concordances would have to be produced. These would present all the spelling evidence necessary to settle compositor attribution, and would allow me to get on with my work on Ralph Crane and Edward Knight. The problem was how the concomitant expense, too great for a private student, could be borne. This question was satisfactorily resolved after discussion with Dr. Alice Walker, the editor of the Oxford Old-Spelling Shakespeare, when the Delegates of the Clarendon Press agreed to support publication of old-spelling concordances to each of the texts selected by Miss Walker as copy-texts for her edition, and commissioned me as editor.


It may not be clear what information old-spelling concordances can provide that a modern-spelling concordance cannot, and the question will come to mind why the concordances were not made from edited texts. In general terms, an old-spelling text of Shakespeare represents a complex interaction of compositor(s), scribe(s), and, ultimately, author. Each agent of transmission had a function and responsibility partly the same as and partly different from that of the preceding; and each could perform his task with different degrees of accuracy and consistency. Viewed in this light, a Shakespearian text presents a pastiche of textual and orthographical features deriving from the individual links in the transmissional chain. The agents of transmission might vary in importance for consideration of the substantives of the text, but are of equal importance for a consideration of its accidental features, through which alone the substantives can be examined. In a modern-spelling text, or an edited text, a superficial consistency has been given to all textual features, which may or may not have been justified by the distribution and occurrence of the various linguistic forms. The editor works on the basis of variously determined assumptions and conclusions as to what is or what is not 'normal' in a page or text in a particular linguistic environment, and he smooths out what appear to him to be abnormalities, mistakes, or corruptions. The important point here is that abnormalities can only be distinguished by thorough understanding of what is normal, and what is normal for the author, scribe, compositor, or any other hand in the text, is best determined from the whole range of information bearing on the study. 
Hence, it is hoped, old-spelling concordances of unedited, unsophisticated texts, and subsequent detailed analysis, will not only reveal the normal orthographical and textual tone of the milieu, whether line, column, page, scene, act, or text, but will also throw into relief various abnormal forms for special consideration.

Especially interesting are forms which may not demonstrate any particular peculiarity but which are exceptional in the immediate textual environment. Such matters as the creation of anomalous spellings by compositorial justification, or from efforts to make rhymes visual as well as auditory, can be investigated in the concordances. The results of such analysis will be incorporated in the Oxford Old-Spelling Shakespeare, and from this edition will be prepared, one hopes, the definitive single-volume concordance to Shakespeare's plays, in both old- and modern-spelling. In brief, a concordance with pretensions to 'definitiveness' must depend on an edition which can itself be called definitive, and such an edition does not yet exist.


The concordances of the single plays are intended to provide an important aid towards the definitive edition, which one hopes the Oxford edition will be. The techniques chosen for the preparation of the concordances and the arrangement of material in them have been determined not by any premature aspiration to produce a 'definitive' concordance, but by the more humble wish to afford a means towards the definitive edition. Consequently, the discussion which follows centres less on the nature and function of a concordance envisaged as itself a final product of scholarship, than on the old-spelling concordance as a basic working-tool from which scholars can derive the information they need to work towards the edition which we hope to see in our lifetimes.

I begin by discussing the principles and the particular features of the text which determined the structure of the programs for the computer and the organization of the data preparation.

Any concordance should aim to provide the fullest possible array of information for the reader and at the same time present a coherent text: a proliferation of editorial symbols is not likely to be helpful, especially when, as in works of early date, the language and typographical conventions of the text may be unfamiliar to many who will use the concordances. No concordance editor could possibly foresee all the uses to which his work might be put as linguistic technique advances; and, indeed, many possible applications would require different and irreconcilable arrangements. Nevertheless, it is desirable so to arrange the concordance that, even if an enquirer cannot immediately find the information he requires, with a little work he can extract it. No industry on his part, however, can compensate for actual omissions of matter, so the foremost desideratum is that a concordance present all the matter embraced by its subject. A concordance to the dramatic works of Shakespeare should therefore include the entire canon, and the texts should be presented in their entirety. Even so, the principle of comprehensiveness has obvious reservations which must be faced by editor and user alike: I discuss some of these below in passing. There is, too, the obligation of fidelity to the texts, and there are other desiderata, such as providing sufficiently extensive illustrative contexts for each concorded word, and devising a clear system of referring the user from the item in the concordance to the appropriate place in the original text. These requirements are satisfied at two distinct stages: when the texts are being prepared for presentation to the concordance programs, and when the headings and reference-lines are arranged by the main concordance program.

Before I discuss these stages I should note that there are several methods of preparing concordances by computer; the method chosen in any particular case is largely determined by the facilities (including editorial time, and time available on the computer) at the command of the editor. In particular, whether one edits the text before, at an intermediate stage of, or after the main sequence of computer operations is determined by considerations of cost and efficiency. This in its turn determines the extent of editorial intervention off-line, and what is actually carried out by the computer. Usually, one tries to minimize the labours of the editor at any stage, whether the input text is to be pre-edited, or the concorded output is to be post-edited, after the concordance programs have been run, say, by the insertion of special characters which were not available for use in the computer, or by a further arrangement of heading-words. One attempts so to order the procedures that the computer can deal with the text in one passage through the text, without intermediate intervention by the human editor.[1]

For Shakespeare, where considerations of scale are especially pertinent, I decided that intermediate editing should be avoided and textual forms which needed distinctive treatment in the concordance should be identified in the text during the preparation of the texts. The computer on which the programs were run, an English Electric KDF9 at the Oxford University Computing Laboratory, at this time was oriented towards paper-tape input, and this determined my choice of the method by which the texts were prepared for computation. Although the computer's core-store consisted of 32,000 machine words, able to hold 192,000 characters, this was not large enough to hold even one text completely in store during computation. Accordingly, magnetic tape operations figure prominently in the concordance programs. The main concordance program is organized to write a file of messages for subsequent sorting and editing during a single passage through the text. As it was uneconomical to require the program to backtrack through the text, any information the program might require to form the appropriate concordance messages for writing to magnetic tape had either to be present in the text itself or be added by the editor during the data preparation.


In brief, the concordance program writes a file of messages from the edited texts (held on magnetic tape) in the word-order of the text, on magnetic tape. These messages are then sorted into alphabetical order on to another magnetic tape by a standard program, which allows the editor to control such matters as whether numerals precede or follow letters; and are afterwards combined into the final concordance arrangement by another program, the editing program. The result is a concordance on magnetic tape from which the legible output used by the scholar can be prepared. Partly because this final stage may be quite some time after the generation of the concordance, but chiefly to allay the editor's anxiety (for it is some time since he has had any direct control), there are facilities to enable the concordance to be printed directly to the lineprinter, in legible but unpublishable form. At one time it was thought that the concordances would be printed on a Xeronic printer, which has the large array of upper- and lower-case characters necessary for this work, but ultimately it was decided to filmset them. Therefore, another computer run is necessary to convert the concordance in Xeronic form into a coding which makes film-setting possible.
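
For readers more at home with present-day languages than with the KDF9, the three stages just described (writing messages in text order, sorting them, and editing them into the final arrangement) might be sketched in Python as follows. This is an illustrative reconstruction only, not the original programs: the function names, the in-memory lists standing in for magnetic tapes, and the crude word pattern are all assumptions made for the example.

```python
import re
from itertools import groupby

def make_messages(lines):
    """Single pass through the text: one (heading, line number, context) message per word."""
    messages = []
    for line_no, line in enumerate(lines, start=1):
        for word in re.findall(r"[A-Za-z']+", line):   # assumed word pattern
            messages.append((word.upper(), line_no, line))
    return messages

def sort_messages(messages):
    """Stand-in for the standard sort program. Python's sort is stable,
    so messages under one heading remain in text order."""
    return sorted(messages, key=lambda m: m[0])

def edit_messages(sorted_messages):
    """Stand-in for the editing program: gather sorted messages under their headings."""
    concordance = {}
    for heading, group in groupby(sorted_messages, key=lambda m: m[0]):
        concordance[heading] = [(line_no, context) for _, line_no, context in group]
    return concordance

# Invented sample lines, not quotations from any text.
sample = ["the king and the queene", "the queene"]
concordance = edit_messages(sort_messages(make_messages(sample)))
for heading, entries in concordance.items():
    print(heading, entries)
```

Keeping the sort as a separate step mirrors the arrangement described above, in which the main program never has to backtrack through the text.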

It is generally recognized that the preparation of concordances is no longer a novel application of computers; many programs for concordances in many different programming languages are available. However, no program acceptable to the KDF9 computer has been written which does not impose severe constraints on the characteristics of the input texts, and therefore, on the legibility of the output. None of the published programs at the time when this work was being considered could cope with the particular requirements of early modern dramatic texts which show a fair amount of formal inconsistency, and at the same time preserve all the textual and linguistic features which might be of interest. It was necessary to devise a suite of programs which could handle quite inconsistent material and which would at the same time be able to deal with more orthographically-consistent material in modern English. Accordingly, although the pre-editing was restricted to the minimum identification of textual features which a computer could not reasonably be expected to distinguish unaided, the editor was able to use special devices in order to deal with the special characteristics of dramatic texts, such as speech-prefixes.[2] I describe these below.


The texts were edited on the principle that the typographical environment of the individual text-line should be preserved as faithfully as possible, and much of the pre-editing was devoted to ensuring that the characteristics of the various compositors (in so far as they would be studied in the concordances) were preserved. The typists of the input paper-tapes could not distinguish italic from roman type, and were not competent to deal with word spacing, so clear instructions had to be marked on the copy. Usually, even though words in long lines are run together in the source texts, they can be separated with reasonable confidence: there seemed no advantage in preserving portmanteau nonce-words which were simply the result of the compositor's concern to fit his text to his measure. But for forms such as to day, any body, and your selfe, which might be divided or not, the spacing of the text had to be observed, even though there was reason to suspect that the compositor treated them differently in verse lines. With elisions such as i'th, o'th'King, the narrowness of the spaces did not always allow certainty that the text was marked correctly. Where there was legitimate doubt about the compositor's practice for any particular form, I preferred to space or not according to the evidence of similar forms in the immediate area, that is, to regress to the norm rather than risk creating anomalous forms.

There seemed to be no good reason to preserve obvious typographical blemishes such as wrong-fount letters, turned letters, misprints from foul case, and blanks (such as Qu ene) in Ant. The printer's character set did not allow me to preserve compound characters such as tildes, ligatures, digraphs, accents, and the pronominal contractions yᵉ, yᵗ, yᵘ, and wᶜ, occurrences of which have been listed together, I hope not too annoyingly, under Y and W respectively, to avoid confusion with such forms as ye and yt. I vacillated over the conventional abbreviations L., D., S., and M. but finally concluded that the reader would best be served by finding these forms listed under LORD, DUKE, SAINT and so on. I did this by expanding them between parentheses in the text; so, L.(ord). Faced with the choice of expanding M. or Mist. and the like into their variant spellings, I again consulted the evidence of the full spellings in the area. I retained the contraction when there was no reasonable warrant for expansion, secure in the knowledge that improper expansions will be immediately obvious in the concordance. Tilded forms such as cōuerted are similarly expanded in parentheses to co(n)uerted, but I have had to divide digraphs such as Æ silently. I have lowered the superscript r in Mʳ., forming Mr., the point being retained.


For contracted names such as Bulling, which are often subject to considerable spelling variation, it is less easy to find warrant for expansion in the surrounding text, and I tended to leave them alone, particularly in stage-directions. I intend later to analyze the vocabulary and spellings of stage-directions in order to determine, as it seems reasonable to assume, whether certain arrays of variants occur which might lead to the preliminary grouping of texts dependent upon copy from a common source. Contraction of characters' names in the body of the text is relatively infrequent and whether or not they are expanded is of slight importance: I mention this only because McKerrow and most other editors consider expansion desirable.

The proper treatment of misprints gave the greatest concern in the pre-editing stage. The individual volumes contain a list of the particular alterations I have made beyond the general categories mentioned here, but the reader will very likely consider that I have not corrected obvious nonsense assiduously enough, especially when many corruptions are apparent and the corrections are accepted by modern editors. However, my greatest concern in marking up the texts has been to avoid emendation under the guise of correcting compositors' misprints, a task which I did not see as my responsibility. Since misprints which make some sort of sense will be illuminated by the contexts in which they are found under the headings of the concordances, it seems wrong-headed to frustrate more informed emendation by endeavouring to correct them before concording. In general, when a form in the text is a word making any sense at all, I have retained it. The lists printed in each concordance show the sort of correction which I have felt free to make.

The asterisk to preface long lines marks the first printable symbol introduced into the text. It is essential that spellings which might have been affected by the compositor's need to fit the line to the span of his measure should be identified; they have been separately counted, and the line in which they occur has been identified by the preliminary asterisk. Where only line numbers are given, the line number is asterisked if I have marked it as a long line. The asterisk signifies not that spellings have been altered, but that they might have been, since they occur in lines which extend to the full extent of the measure. In some texts such as TN., a compositor has justified many lines without making the text extend right to the margin; although these (and indeed all lines) are 'justified', they have not been marked as such in the concordances. The compositors used many tricks to achieve justification, most of which could have been used with variation of spelling, and I should warn the reader that because a word is not identified as occurring in a long line, it should not be taken for granted that the copy spelling or the compositor's customary spelling has not been varied. I hope to discuss justification more fully in another place. Where the measure varies, as it does in some quartos, this has been noted in the relevant introduction. On the other hand, when run-on prose lines have been justified with a terminal quadrat or space, I have made no special note.

The only other common printable symbol introduced into the text is the end-of-line bar. It is unilluminating to concord short lines, which might be vocatives, passages of stichomythia, or single-word turn-overs, as single context lines. The reader who refers to ANGELO in the concordance to MM. will not be pleased to find 'Duke. Angelo: . . . 32'; it is better that he should find 'Duke. Angelo: | There is a kinde of Character in thy life, . . . 32'. Therefore, in order to provide the most efficient context based on the typographical line, short lines have been joined together, or have been attached to longer lines. The end of each line of text is marked by a vertical bar. This device is particularly useful in prose passages where a word has been divided between the end of one line and the beginning of the next (as at Tmp., l.17: Ma-|ster?). In such cases, the second part of the word has been joined to the first in order to ensure that the whole word is listed, rather than MA and STER, with the point of division marked by the vertical line. For such divided words, the line number given must obviously refer to that part of the context line up to the end-of-line bar, although when a turn-over is found, as at Tmp., l.118, '. . . Sit | (downe,' the line number refers to the whole line. This case is discussed again with reference to line-numbering. It is not always possible to give a sufficient context under each heading to enable the reader to comprehend the precise circumstances in which each word has been used. Although there is little technical objection to the joining of a large number of lines together, it is not usually economical to do so, for too frequent use of this device would expand the concordances excessively. This is one of the points at which the concordance editor who works from slips has an advantage over the computer, for he can exercise his judgment for every case.
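
The effect of the end-of-line bar on short lines can be illustrated with a small Python sketch. It should be stressed that in the work described here the joining was marked by the editor during pre-editing, case by case; the mechanical word-count threshold below (`min_words`) and the function name are assumptions introduced purely to make the example runnable.

```python
def join_short_lines(lines, min_words=3):
    """Attach each short line to the line that follows it, marking each
    original line ending with ' | ' (the end-of-line bar)."""
    joined = []
    pending = ""                      # short lines waiting for a longer one
    for line in lines:
        pending = f"{pending} | {line}" if pending else line
        if len(line.split()) >= min_words:
            joined.append(pending)    # a long enough line closes the context
            pending = ""
    if pending:                       # trailing short line: attach it backwards
        if joined:
            joined[-1] = f"{joined[-1]} | {pending}"
        else:
            joined.append(pending)
    return joined

context = join_short_lines(["Duke. Angelo:",
                            "There is a kinde of Character in thy life,"])
print(context[0])
```

A fixed threshold of this kind is exactly what cannot replace editorial judgment, which is the point made at the end of the paragraph above.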

When the editor has dealt with long lines, italics (marked on paper-tape by a distinctive character before and after the portions of the text which are to be in italic fount in the final output), the linking of short lines together in order to enlarge the context line quoted under the headings, the expansion of contractions, correction of misprints, and the clarification of any matters such as word-division which might trouble the typist, he has still to consider the text in relation to the kind of arrangement he desires to achieve, and his obligation to present all available material.

There are categories of words where the information conveyed by the full contextual reference would hardly warrant the consequent uneconomical expansion of the concordance were they to be included in full. Such words might be a, of, and the. One must balance here the partly conflicting factors of economy and utility. To provide full contextual references is uneconomical and, for many words, not very useful. To omit some words entirely and to represent others partially (Bartlett's expedient) is unsatisfactory. To give representative quotations for the most frequent words in the text increases the complexity of the editing for little real advantage. To represent a certain number of words by a frequency count alone offers merely minimal aid to the scholar while still leaving the concordance (unless the list of words treated in this way was very long) uneconomically large. Finally, to provide for some words the numbers of the lines in which they occur, together with a frequency count (which is in effect the arrangement of Montgomery's Dryden concordance) still offers the reader an uninformative list under such headings as A or THE, while not really effecting economies of space.

The compromise I adopted combines the last two possibilities: the concordances give frequency counts only for the twenty-two most common and less significant words, and line numbers and frequency counts for 199 other spellings. In making up the 'count only' list, I assumed that the reader would more readily pick out the frequently-occurring words by visual inspection than by referring from the concordances. The question of which words to select and on which list they should be put is difficult to settle when one lacks word frequency counts for English of this period. I found that the rank order by magnitude of frequency varied considerably within the first few concordances. I was guided at first by Dewey's table of most-frequently-occurring words, compiled in 1923 from a sample of 100,000 words of modern American English.[3] He found that nine words (the, of, and, to, a, in, it, that, and I in that order) formed over twenty-five percent of the sample. These words were obvious candidates for the 'count only' list in order to keep the size of the concordances within reasonable bounds. The word that was transferred to the 'locations' list, and the 'count only' list now consists of a, am, and, are, at, be, by, he, I, in, is, it, of, on, she, the, they, to, was, we, with, you; in other words, common tenses of the verb 'to be', personal pronouns, articles, conjunctive 'and', and the most frequent prepositions. On Dewey's figures, these words would account for about twenty-eight percent of a sample of modern English. The composition of this list might have been varied from concordance to concordance (which would have presumed knowledge of the most frequently occurring words in the particular text before the concordance program was run) but it seems reasonable, even though the texts are concorded separately, for the reader to have consistent material from concordance to concordance.[4]

The list of heading words to be represented by line-numbers and counts (the 'locations' list) has a more complex composition, if only because it includes a large number of variant spellings and many homographs. The 'locations' list comprises 199 items, mainly act and scene headings, exclamations, foreign language articles and pronouns, other infrequent pronominal forms which otherwise would be concorded with homographs (e.g., heel|hee'l, hell|he'll, ill|I'll, well|we'll), auxiliary verbs, prepositions, negatives, relative pronouns and conjunctions, and a miscellaneous group of &, &c., 1, 2, 3, 4, and sir. Naturally, the composition of these lists would have been improved had I known which were the most frequently occurring words before the first text was run. However, the evidence of the concordances will be useful in determining the composition of the lists for any one-volume concordance to Shakespeare or other authors of his period.
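
The three levels of treatment (count only, line numbers with count, and full context) may be easier to grasp from a small Python sketch. The 'count only' set below reproduces the twenty-two words named above; the 'locations' set is a two-word illustrative subset of the 199 items, and the function name, the punctuation stripping, and the sample lines are assumptions of the example rather than features of the original programs.

```python
from collections import defaultdict

COUNT_ONLY = {"A", "AM", "AND", "ARE", "AT", "BE", "BY", "HE", "I", "IN", "IS",
              "IT", "OF", "ON", "SHE", "THE", "THEY", "TO", "WAS", "WE", "WITH", "YOU"}
LOCATIONS = {"THAT", "SIR"}          # illustrative subset only

def build_entries(lines):
    counts = defaultdict(int)        # every heading gets a frequency count
    locations = defaultdict(list)    # line numbers, unless 'count only'
    contexts = defaultdict(list)     # full context lines for all other words
    for line_no, line in enumerate(lines, start=1):
        for raw in line.split():
            heading = raw.strip(".,;:?!").upper()   # assumed punctuation handling
            if not heading:
                continue
            counts[heading] += 1
            if heading in COUNT_ONLY:
                continue
            locations[heading].append(line_no)
            if heading not in LOCATIONS:
                contexts[heading].append((line_no, line))
    return counts, locations, contexts

# Invented sample lines, not quotations.
counts, locations, contexts = build_entries(["I thanke you sir,", "that is the King"])
print(counts["THE"], locations["SIR"], contexts["KING"])
```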

The pre-editing discussed so far affects the legibility and intelligibility of the context lines, but hardly influences the running of the concordance programs at all. A more complicated part of the pre-editing is the addition of symbols acting as markers or discriminants. The particular character selected to serve this purpose I call a 'tag' (two together, a 'double-tag'): they do not appear in the printed concordances. Tags are convenient to distinguish words in the text from other words, and they are used to mark words which are to be excepted from the usual routines of the program. Tags are used for several purposes, the effect of the tag depending on its relation to the line or word, and its adjacent characters. Since the tag amounts to an exception marker, we should discuss the features of the text which require exception routines.

Speech-prefixes are of little glossarial interest and, since there is small point in listing a large part of the text under the speech-prefixes for the characters in it, or even in giving line-references (since speech-prefixes may more readily be traced directly in the text), I decided to treat them as 'count only' words, with the speech-prefix list varying from play to play. It appeared at first that the obvious test for a speech-prefix was for the program to inspect the beginning of each line for an italics character, and having found one, to check that word against the list of speech-prefixes. However, any list of speech-prefixes contains spellings which are forms of general occurrence (e.g., Off.|off; An.|an) and common at the beginning of italicized lines. Furthermore, not all speech-prefixes start with an italics character (e.g., 1.Sol.) or occur at the beginning of a line of text. To deal with these exceptional cases, tags were inserted during the pre-editing. A single tag before italicized spellings known to be on the speech-prefix list but not speech-prefixes, was used to inhibit comparison with the list of speech-prefixes. A double tag before speech-prefixes which did not satisfy the usual positional test was used to force the program into the speech-prefix routine. (If a spelling is double-tagged but not on the list, it is not treated as a speech-prefix.) Intentional omissions from the lists were forms like All., Both, and the names of actors such as Sinklo which were felt to have exceptional interest. Too late I discovered that speech-prefixes such as And. for Andrew, for which there were equivalent forms on the 'count only' list, could not be included on the speech-prefix list, for there were inadequate means in the main editing program to allow an extra count to be printed under that heading. 
These spellings have been left off the speech-prefix lists, and lines containing them have been printed in full under the appropriate heading, contrary to normal practice.
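
The speech-prefix test and its two tag overrides can be summarized in a short Python sketch. The actual tag and italics characters are not named in this account, so `~` and `_` below are hypothetical stand-ins, as are the function name and the small prefix list.

```python
ITALIC = "_"   # hypothetical italics marker (before and after italic text)
TAG = "~"      # hypothetical tag character

def is_speech_prefix(token, at_line_start, prefix_list):
    """Decide whether a token should enter the speech-prefix routine."""
    if token.startswith(TAG * 2):
        # Double tag: force the routine regardless of position or italics,
        # but only spellings actually on the list count as speech-prefixes.
        return token[2:].strip(ITALIC) in prefix_list
    if token.startswith(TAG):
        # Single tag: inhibit comparison with the speech-prefix list.
        return False
    if at_line_start and token.startswith(ITALIC):
        # Usual positional test: italic word at the beginning of a line.
        return token.strip(ITALIC) in prefix_list
    return False

prefixes = {"Duke.", "Off.", "An.", "1.Sol."}   # illustrative list for one play
print(is_speech_prefix("_Duke._", True, prefixes))
print(is_speech_prefix("~_Off._", True, prefixes))
print(is_speech_prefix("~~1.Sol.", True, prefixes))
```

Checking the double tag before the single tag matters in this sketch, since a double-tagged token also begins with a single tag.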

The versatility of this device can readily be perceived: for a prose text in which there are no speech-prefixes but there are other sporadic words not on the 'count only' or 'locations' lists (e.g., chapter headings, numbers, or foreign language material), the provision of a full context line can be suppressed by putting these forms on the speech-prefix list, and inserting such double tags as are necessary. The list of speech-prefixes is a special exception list in contrast to the other lists which are general exception lists.

The tag has a further important application in connection with compounds. Hyphenated compounds have always presented difficulties in computer-generated concordances. One may give an entry under the whole hyphenated form, or remove the hyphens and list each part of the compound separately, with obvious loss of information. Without an excessively elaborate program, a great deal of pre-editing, and possibly post-editing, one cannot satisfactorily deal with all possible types of hyphenated compound, particularly when the hyphen, by which the program distinguishes the hyphenated compound, also frequently occurs marking the division of a word between lines. In order to avoid separating the components of hyphenated compounds, and to ensure as comprehensive a listing as possible, I used the tag again, inserted before subsequent parts of hyphenated compounds for which a separate entry was desired. This caused the program to treat each tagged component as an additional word. Hence, hyphenated compounds are recorded both under the full form, and again under their parts. When hyphenated compounds have had no tags inserted, they are listed only under the full form.
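
A minimal Python sketch of the hyphenated-compound routine follows. The tag character `~` and the function name are hypothetical, and the particular tagging shown is invented for illustration. The account above does not spell out how an untagged first component is treated, so the sketch lists only the full form and the explicitly tagged components.

```python
TAG = "~"   # hypothetical tag character

def compound_headings(word):
    """Return the headings for a hyphenated compound: always the full form,
    plus one extra heading for each component preceded by a tag."""
    parts = word.split("-")
    full_form = "-".join(part.lstrip(TAG) for part in parts).upper()
    headings = [full_form]
    headings += [part[len(TAG):].upper()
                 for part in parts[1:] if part.startswith(TAG)]
    return headings

print(compound_headings("father-in-law"))      # untagged: full form only
print(compound_headings("father-~in-~law"))    # tagged parts listed as well
```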

Other compounds also yield to use of the tag. Such forms as o'th'King have tags before the grapheme for which a separate entry is desired, but in these cases, because there is no hyphen, the tag has the effect of terminating the first part of the compound and starting a new reference word after the tag. Hence a line containing o' tag th' tag King would be listed under O, TH, and KING. I chose to keep such elisions as i'th' together, and so inserted tags only before the substantial form to which the elision was attached.
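
The two treatments of tagged compounds may be illustrated by the following sketch. It assumes, since the account above does not say otherwise, that in a hyphenated compound only the tagged components receive additional entries, and that apostrophes are dropped in forming headings (as in the listing under O, TH, and KING). The function names are hypothetical.

```python
TAG = "→"  # tag character, as in the coded texts


def normalize(form):
    """Form a heading: drop apostrophes and convert to upper case (an assumption)."""
    return form.replace("'", "").upper()


def headings(word):
    """Return the headings under which a (possibly tagged) compound is entered."""
    if "-" in word:
        # Hyphenated compound: one entry under the full form, plus one
        # additional entry for each tagged component.
        full = word.replace(TAG, "")
        tagged = [part[1:] for part in word.split("-") if part.startswith(TAG)]
        return [normalize(full)] + [normalize(part) for part in tagged]
    # No hyphen: each tag ends one reference word and starts the next.
    return [normalize(part) for part in word.split(TAG) if part]


print(headings("giue-→a-→me-→your"))
print(headings("o'→th'→King"))
print(headings("i'th'→King"))
```

An untagged hyphenated compound such as father-in-law yields only the single heading for the full form.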

In extended passages of prose, there are many words which are divided between lines, with the division indicated by a hyphen in the familiar way. Since it is pointless to concord the two parts of divided words like ob-|scure separately, I have inserted end-of-line bars, as the example shows. To form the heading word in such a case, the program removes the formal characters -|. This point is relevant to hyphenated compounds, for when a tagged hyphenated compound for which entry is desired under all parts occurs divided between two lines, as with Noyse-| tag maker at l.52 of Tmp., the removal of the characters -| converts this word into the second type of compound discussed above. The presence of the tag causes the word to be split into its components when the concordance messages are formed. Hence this line is listed under NOYSE and MAKER, not NOYSE-MAKER as it would have been had the usual hyphenated compound routine applied. There are various expedients which could have been devised to overcome this inconsistency, but they are so laborious and complex, and the gain so little, that I have decided to leave well enough alone, and have
accepted the inconsistency. Had the tag not been inserted, this line would have been listed under NOYSEMAKER. This might have been acceptable in this case, but with other more complicated divided compounds such as cat-a-mountaine or father-in-law, the results would not be pleasant.[5]

Another aspect of hyphenated compounds should be noted. When, as with giue-a-me-your, a hyphenated compound contains a particle such as 'a' for which one might hope to provide a full context line under the heading A (by tagging the 'a' to ensure that it is not treated as a 'count only' word), the tag inserted before a in order to decompose the hyphenated compound does not also except it from being treated as a 'count only' word. One tag, one function. Contrarily, neither does the tag supply a full listing for parts of compounds when they are on the exception lists. Giue-a-me-your will therefore be listed under the full form, and its constituents included in the counts for A, ME, and YOUR. References in the concordances will draw the reader's attention to such cases.

A further application of the tag as an exception marker ensures that homographs of words on the 'count' and 'locations' lists are given full contextual references. A tag at the beginning of a word inhibits the search of these lists, and ensures that ill, for example, is listed in full, while the homographic I'll is given only line references. The use of the tag for rudimentary homograph distinction is restricted to words on the 'count' and 'locations' lists. Homographs most frequently distinguished in this manner are therefore a (of/he, etc.), an/and (if), I (ay), to (too), art n., bee n., but n., di'd, doe n., ha', h'as (he has), hast n. (or has't), heel n., hel/hell n., ill adj., it (its), may n., might n., mine n., of (off), off (of), wast n. (or was't), wer't, wil/will n., and wil't.

The tag was also used to suppress the shorter listing of a form on these lists when the word was used in an interesting or ambiguous way, although the provision of a full context line should not be taken as implying that every occurrence of and, for example, listed in full, should be read as an = if. Such excepted occurrences are those that seemed at the time of marking up the texts to warrant closer consideration when the concordance to the particular text was available. At l.940 of TN., for example, the 'I' in 'Duke. I prethe sing,' has been tagged to provide a full listing because the ambiguity here warrants careful study: it has not been excepted because I am sure that the I is equivalent to ay. I know of no concordance program which can
properly separate homographs without extensive editing, and the insertion of elaborate grammatical routines. The tag has been thus used to preserve potentially useful material which might otherwise have been concealed in the frequency counts or list of line numbers.

In the concordance listings, the first count refers to the occurrences of the form on the 'count' or 'locations' list, and the second count is of the excepted (tagged) occurrences listed in full under the same heading. In most concordances, the entry for A affords an example of this. Where there is a third count, it applies to the spelling used as a speech-prefix.
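
How the first two counts arise may be shown by a short sketch. It is an illustration only: the function name, the data, and the representation of each occurrence as a (spelling, line number) pair are hypothetical, and the third count for speech-prefixes is left out.

```python
TAG = "→"  # tag character, as in the coded texts


def tally(heading, tokens, exception_list):
    """Return (count of untagged listed occurrences, lines to be given in full)."""
    listed = 0
    in_full = []
    for spelling, line_no in tokens:
        tagged = spelling.startswith(TAG)
        if spelling.lstrip(TAG).upper() != heading:
            continue
        if not tagged and heading in exception_list:
            # On the 'count' or 'locations' list: no context line is printed.
            listed += 1
        else:
            # Tagged (excepted) occurrence, or a word on no list: full context.
            in_full.append(line_no)
    return listed, in_full


# Illustrative data only; just the tagged I at l.940 echoes the example above.
tokens = [("I", 10), ("→I", 940), ("I", 12), ("ill", 20)]
print(tally("I", tokens, {"I", "A"}))
```

The tag is consulted before the exception list, which is what allows a single occurrence of a very common form to escape the shorter listing.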

One other printable symbol has been introduced into the texts. Following McKerrow, a standard system of Folio through line numbering has been adopted. For lines in the quarto texts not present in the corresponding Folio texts, a + has been added at the beginning of the line. When this character is encountered by the program, incrementing of the main line counter is suspended and an auxiliary line counter is incremented each time a new line starting with + is processed, until a line is found which does not begin with the + character. At this point the main line count is resumed. Thus, 'Heere comes the Prince and Claudio.' of Q Ado., which is not found in the Folio text, is listed as line 2588+1. This shows that the Q line is not in F, or contrarily, that the F text lacks a line after line 2588. In order to obtain the Folio through line numbering, each typographical line containing text from the first act or scene heading or stage direction, excluding catchwords, has been counted. Turned-over lines have been treated, according to the compositor's manifest intention, as part of the line to which they belong, even when, as after l.1665 in Err. and after l.2018 in IH6, the turn-over is found on a line by itself.[6]
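
The numbering of the additional quarto lines can be sketched as follows. The sketch assumes that the auxiliary counter returns to zero whenever the main count is resumed, which the form 2588+1 implies but the account does not state; it takes no notice of catchwords or turned-over lines, and the function name is hypothetical.

```python
def number_lines(lines):
    """Assign through line numbers; lines beginning with + take the form N+k."""
    main = 0   # main (Folio) through line counter
    extra = 0  # auxiliary counter for lines not present in the Folio
    numbered = []
    for line in lines:
        if line.startswith("+"):
            # Main count suspended; only the auxiliary counter advances.
            extra += 1
            numbered.append((f"{main}+{extra}", line[1:]))
        else:
            # Main count resumed; the auxiliary counter is reset.
            main += 1
            extra = 0
            numbered.append((str(main), line))
    return numbered


for label, text in number_lines(["first", "second", "+added", "+added again", "third"]):
    print(label, text)
```

With this input the labels are 1, 2, 2+1, 2+2, and 3, so that the Folio numbering is undisturbed by the extra lines.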

A final point which the editor had to consider during the pre-editing arises from the order of the alphabetized old-spelling headings, with which readers unaccustomed to the vagaries of early-modern spelling may not be familiar. The reader will find that words containing i/j and u/v have been sorted in one sequence under I and V respectively. Words starting with numerals are listed at the end of the concordances, after A-Z. But there are other difficulties associated with different kinds of spelling variation in the early texts. However, most
of the problems readers are likely to meet in locating particular headings they require can be offset by the references which are incorporated in the concordances, during the running of the editing program. References carry out a number of housekeeping tasks. Where the spelling differs from the modern in the first and/or second letters, a reference has been given from the modern spelling (with which the reader might legitimately begin his search) to the old spelling of the text. When a heading in modern spelling or substantially (that is, initially) in modern spelling is already in the concordance, the reference is a see also reference, directing the reader to additional entries which concern his enquiry. I have not attempted or seen fit to give references from every modern spelling to the old spelling when the reader would naturally begin his search in that area of the concordance; and I have attempted to attach references to headings already in the concordance. Hence, 'AY see I', but 'YIELDE see also yeild'. I have also referred from the names of characters to the speech-prefixes found in the text, e.g., 'KING see also K., Ki. Kin.'; from words on the 'count' or 'locations' list compounded with other forms, e.g., 'AT see also bemockt-at-stabs'; from forms for which the modern and early word division differs, as 'TONIGHT see night', 'HERSELFE see selfe'; from full to elided forms, 'OF see also o''; from numbers to figures, 'FOURE see also 4'; from most homographs even though they are in natural alphabetical order, e.g., 'THAN see also then', 'LOSE see loose', 'OFF see also of', and 'REIN see raigne'; and to forms which are so contracted that the variant spellings might be widely separated or listed with other unrelated forms, as with 'LETTER see also Let.'. 
Infrequently I have provided a reference from a widely accepted emendation to a form which appears corrupted in the text, as for H5, 'HONNEUR see also honeus', but in general corruptions have been listed without such comment.
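
The order of headings described above (i/j and u/v in one sequence, numerals after A-Z) can be expressed as a sort key. This is a sketch under the assumption that J is simply treated as I and U as V for sorting; the exact collation of the original sorting program is not given here, and the function name is hypothetical.

```python
def sort_key(heading):
    """Sort key folding J into I and U into V, with numerals after the letters."""
    folded = heading.upper().replace("J", "I").replace("U", "V")
    starts_with_numeral = heading[:1].isdigit()
    # False sorts before True, so headings beginning with letters come first.
    return (starts_with_numeral, folded)


print(sorted(["JOY", "IUDGE", "IS", "4", "VP", "VNTO"], key=sort_key))
```

Under this key JOY precedes IS, because for sorting purposes it is read as IOY; the heading itself is printed unchanged.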

Most of the references were written during the pre-editing, but a check on their accuracy was provided by another program which printed the headings from the sorted concordance tape. This list, together with the intermediate printing program, allows the editor to be sure that the references are properly distinguished as see or see also, and to add more references as necessary.
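
The distinction between see and see also references lends itself to a mechanical check of the kind just mentioned. The sketch below is an illustration, not the checking program itself: it assumes that a reference is a see also reference exactly when its modern-spelling heading already occurs among the headings of the concordance, and its names are hypothetical.

```python
def format_reference(modern, targets, concordance_headings):
    """Build a reference line, choosing between 'see' and 'see also'."""
    heading = modern.upper()
    # A heading already in the concordance takes 'see also'; otherwise 'see'.
    kind = "see also" if heading in concordance_headings else "see"
    return f"{heading} {kind} {', '.join(targets)}"


present = {"I", "YIELDE", "YEILD"}
print(format_reference("ay", ["I"], present))
print(format_reference("yielde", ["yeild"], present))
```

Run against the list of headings printed from the sorted tape, a test of this sort would show at once any reference whose kind disagreed with the state of the concordance.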

A particularly demanding aspect of this work was the interaction between data preparation and program writing. As progress was made with one or the other, various points came up which made it necessary
to modify either the data preparation conventions or the program specifications. One such modification was the programmer's discovery that every page in the text was preceded by a page-label which had no function in the preparation of the concordances apart from its use as a positioning aid for the insertion of corrections. By the time the programmer had devised a test for page-labels, and had arranged to skip over them, quarto texts with page-labels of somewhat different format had been typed, and had therefore to be corrected. It is often difficult to see the programming implications of conventions adopted for the data, which more often than not are initially established more for the typist's convenience than for suitability for computation. This tension between data and programming might have been reduced had the program specifications been settled in substance before the texts were typed, but the matter is not so simple.

In the first place, this order assumes that all data characteristics with programming implications are known at the beginning, as is often the case in commercial data-processing when the format of the data can be manipulated at the point of generation to facilitate efficient programming and computation. For Shakespeare's texts, which cannot be rearranged to suit programming convenience, various attributes of the text emerged only when the pre-editing was well advanced, and even then were sometimes not recognized as having programming implications. When, for example, one has specified that speech-prefixes may be distinguished by their having a terminal italics character and no internal spaces, the discovery of forms like Old Cou. requires either program or data modification. Similarly, when the [ has been used for ?, the three [ in the text (foul case for the parenthesis?) required substitution. For the Shakespeare concordances, these considerations were not allowed to influence the course of events too greatly, for it was clear from the beginning that the preparation and proofreading of the texts was going to be the main limiting factor. The reader will appreciate the enormity of the task when he reads an account of the processes involved in the preparation of a single text for computation; for thirty-seven texts, this process tends to become laborious.

The typing of the texts to paper-tape was entrusted to the manufacturer's data processing bureau in London. In June, 1964, the first test was run on MM., selected because of its average length and complexity. Suggestions that the error rate would be so minimized as to make proofreading unnecessary if there were double-typing on two differently arranged key-boards proved ill-founded, for the typists made a rather large number of identical misreadings of the unfamiliar material
and tended to introduce archaisms in common. Proofreading was seen to be essential, and the rest of the procedures were organized on the basis that the uncorrected text was to be written to magnetic tape from paper-tape as early as possible, in order to use the computer's unrivalled facilities for accurate fast copying and recopying, with elaborate checks to prevent the introduction of errors.

During the preliminary stage, the introduction of such computational aids as tags, and certain character substitutions, was decided upon. The machine characteristics of conventional character sets differ amongst computers of different manufacture, and for this work the characters in the texts were represented by a selection from the KDF9 paper-tape code, which had no representation of !, ?, ", ', &, and | (end of line bar). For these were substituted the otherwise unassigned characters £, [, =, ↑, z, and ]. Since z does not occur by itself in English, it was used for the ampersand, and as zc, for &c. Other conventions were chosen to avoid obliging the typists to shift from one case to the other unnecessarily. The multiplication sign, and ÷, both in case shift (or lower-case), were typed for the point and hyphen respectively, and →, usually the end message character, was used for the tag. Dashes were represented by three hyphens, thus ÷÷÷. Other characters were represented conventionally, but characters not available on the Friden Flexowriter KDF9 keyboard, such as yt, wc, â, ô, and digraphs (which were expanded) could have no special representation. These were the characters with which the computer worked; for the printed concordances the conventional characters were restored.
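
The restoration of the conventional characters amounts to a table look-up, with one complication: z stands for the ampersand only when it occurs by itself or in zc. The following sketch assumes that a regular-expression word boundary is an adequate test for a z standing by itself; it leaves the tag and the three-hyphen dash unconverted, and the function name is hypothetical.

```python
import re

# Substitutions taken from the account above: coded character -> conventional.
RESTORE = {
    "£": "!", "[": "?", "=": '"', "↑": "'", "]": "|",
    "×": ".",   # multiplication sign typed for the point
    "÷": "-",   # division sign typed for the hyphen
}


def restore(coded):
    """Convert a coded line back to conventional characters."""
    # zc stands for &c and a lone z for the ampersand; z within a word is untouched.
    text = re.sub(r"\bzc\b", "&c", coded)
    text = re.sub(r"\bz\b", "&", text)
    return text.translate(str.maketrans(RESTORE))


print(restore("What[ Ho£"))
print(restore("bread z cheese zc×"))
print(restore("o↑th↑ Noyse÷maker"))
```

The ampersand is dealt with before the table is applied so that the word-boundary test works on the coded text as it was typed.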

The order of events in the data preparation stage was as follows. I edited the texts in the manner I have described, and at the same time prepared lists of speech-prefixes and references for each text. The editing required at least three scans through the text, or about thirty hours. The copy for Folio texts was sheets from the Lee facsimile of the Devonshire copy or, when these were not available (unfortunately, my unbound copy was imperfect), Xerox prints from University Microfilms' reproduction of the Grenville copy in the British Museum. For the quartos I used the Clarendon Press facsimiles edited by Sir Walter Greg and Professor Hinman, otherwise University Microfilms' facsimiles. The particular copy used is noted in the introduction to each concordance so the reader can consult the originals. Since I have followed the readings of the particular copy from which I worked, a note of the press-variants (which are not specially treated in the concordances) recorded by Hinman has been given in the introductions to the several concordances.


After the facsimile copy was marked up, the texts were punched to paper-tape in London. The paper-tape was printed out, and was proofread twice by an independent team of proofreaders, who marked the necessary corrections on the printed copy, and returned it to me for a further check, a re-reading not amounting to a proofreading. This served mainly to check the original editing and the proofreaders' accuracy. In the worst case, only three substantive unmarked errors were found, a figure so low as to suggest that further proofing would not have significantly improved accuracy. Although one's primary concern is to keep the error rate as low as humanly possible, there is an error rate in work of this nature which cannot be avoided. I would be grateful if readers would communicate to me any errors they come across.

After this check reading, the printed copy with the marked errors was returned to London for the preparation of correction messages, and the uncorrected paper-tape text was written to magnetic tape by the text establishment program. The text correction stage uses the facilities of a manufacturer's standard library program, called POST, originally designed to write and correct computer programs on magnetic tape, with complete lines as the unit of entry and correction. This program was modified to permit operations on words inside lines, in order to keep the extent of the corrections as small as possible and to reduce the possibility of introducing errors into text already passed as correct. Because the text was copied from magnetic tape to magnetic tape with automatic checks by the POST program, and could be output to paper-tape or in readable form to the line-printer at any time by another library program (UPDATER) for off-line checking, the possibility of machine corruption of the text was negligible. Occasionally, there were misadventures, but these arose more often than not from human fallibility or misunderstanding of the systems, and were readily detected and corrected.

Preparation of the correction messages for the POST correction program required writing, typing, and proof-reading of messages relating to consecutive lines and words of text, which replaced incorrect lines or words by the proper sequences of characters. This system made it impossible to make false corrections: if the sequence of characters in the text on magnetic tape did not correspond to that on the correction tape, the correction run failed, and a failure message identifying the reason for failure, and the particular faulty correction, was out-put for the editor's information. When this happened, the correction paper-tape had to be corrected, and the correction sequence re-run
until all the corrections succeeded. This would be a laborious operation if it were carried out on a single text, but it is reasonably efficient when a large number of texts is corrected in a single correction run. The maximum number of texts which may be corrected at one time depends only on the number which can be held on magnetic tape. For a large project such as this where accuracy is essential this system is tolerable, but the time needed to get one text corrected makes it infuriatingly slow for anyone intending to concord just one text. A flexible on-line system using a visual display screen would make this process much more efficient.
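
The principle of the verified correction run can be shown in miniature. The sketch below is not a description of POST, whose message format is not given here: it assumes a correction message consisting of a line number, the characters expected in the text, and their replacement, and all its names are hypothetical.

```python
class CorrectionError(Exception):
    """Raised when the text does not contain what a correction expects."""


def apply_corrections(lines, corrections):
    """Apply (line number, old, new) corrections; fail the run on any mismatch."""
    corrected = list(lines)  # work on a copy, so a failed run alters nothing
    for line_no, old, new in corrections:
        current = corrected[line_no - 1]
        if old not in current:
            # The text does not correspond to the correction: stop and report.
            raise CorrectionError(
                f"line {line_no}: expected {old!r}, found {current!r}"
            )
        corrected[line_no - 1] = current.replace(old, new, 1)
    return corrected


text = ["Heere comes the Prinse", "and Claudio."]
print(apply_corrections(text, [(1, "Prinse", "Prince")]))
```

Because every correction names the characters it expects to find, a mistyped correction message stops the run instead of silently damaging text already passed as correct.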

Once all the corrections have been successfully inserted, the corrected text is printed out, and thoroughly proofread by the editor for the last time. This reassures him that the text faithfully represents the original and his intentions, and also enables him to check that it is properly organized for the running of the concordance programs, that is, that all the extra markers necessary to identify homographs, speech-prefixes, extra lines and so forth have been inserted consistently. After any necessary final corrections, and the running of another program to check whether all lines of text are present and that the numeration corresponds to various manual counts, the text becomes available for the main concordance program. As far as the editor is concerned, the main burden has been lifted from his shoulders.

The concordance program is run in three parts. The first performs what is generally understood as 'concording'; that is, it reads the lines of the text, extracts words from them, forms the key-words for sorting, and writes messages containing the key-word, reference line, and line number appropriately for the different kinds of out-put ('count only', 'location only', speech-prefix, or full) for every word in the text on to an out-put magnetic tape. The messages on this tape are naturally in text order, and so are sorted into alphabetical order by another program which produces a sorted magnetic tape as in-put for the editing program. The editing program produces two styles of out-put at the discretion of the editor who has the opportunity of sending the edited concordance to the line-printer, or to another magnetic tape for further editing for film-setting. This editing program takes the sorted messages and organizes them under the heading words, incorporating the references from paper-tape at the same time, to produce pages very much like those of the final concordance. The edited concordance on magnetic tape was organized for printing on a Xeronic printer (after conversion to IBM magnetic tape format) which at the time of writing the out-put specifications was the only available device with the necessary
range of upper- and lower-case characters, and sufficiently fast to make the work economical. However, when at the last moment it became possible to contemplate film-setting, I wrote another program to convert the Xeronic out-put for film-setting. This tape was edited by Computaprint Ltd., London, who delivered a film to the Press for conventional printing, in the normal upper- and lower-case characters. To the film-set concordance it was necessary only to add a short handset introduction listing the line-numbers, corrected misprints, and other useful information, for the printer to print and bind, with no proofreading or press correction.
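
The three parts of the concordance run (concording, sorting, editing) can be imitated on a small scale as follows. This is a much simplified sketch: every word is given a full context line, the exception lists, tags, and references are ignored, and the word pattern and function names are assumptions made for the illustration.

```python
import re
from itertools import groupby


def concord(lines):
    """Part 1: write one message (key-word, line number, line) per word."""
    for line_no, line in enumerate(lines, start=1):
        for word in re.findall(r"[A-Za-z']+", line):
            yield (word.upper(), line_no, line)


def sort_messages(messages):
    """Part 2: bring the messages from text order into alphabetical order."""
    return sorted(messages, key=lambda m: (m[0], m[1]))


def edit(sorted_messages):
    """Part 3: gather the messages under headings, each with its count."""
    out = []
    for heading, group in groupby(sorted_messages, key=lambda m: m[0]):
        entries = list(group)
        out.append(f"{heading} {len(entries)}")
        for _, line_no, line in entries:
            out.append(f"  {line_no:>4} {line}")
    return "\n".join(out)


print(edit(sort_messages(concord(["To be or not to be"]))))
```

Keeping the three parts separate, as in the account above, means that the sorted messages can be edited more than once for different forms of out-put without the text being concorded again.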

The final product in the first instance is thus thirty-seven concordances to the plays of Shakespeare, prepared from the best early texts. The greatest limitation of the present programs is that they were designed to handle only one text at a time. I have been working on programs to concord any number of texts into one concordance, and I hope that these programs will be applied in due course to concordances of the First Folio, and Shakespeare's poems.

It is tempting at this point to consider the task done, but concordances, despite their obvious and essential utility, afford only the simplest ordering of the linguistic evidence of their texts, and, save for the most important authors, this hardly justifies the effort necessary to produce them. For my own work the value of the concordances is small, and parallel with this work I have been developing programs which analyze the orthographical characteristics of the separate pages of each text. These programs throw into relief attributes of the text which remain in relative obscurity in the concordances. (See for example the distribution of capitalized and non-capitalized CHAINE in Err.). Because these programs are more discriminating than any concordance program could be expected to be, they offer easy means for the study of stage-directions, speech-prefixes, italicization, justification, capitalization, hyphenation, elision, rhymes, and the treatment of proper nouns, as well as variant spellings, on different pages of text. The unit may also be the individual column, scene, play, or indeed any other unit of text desired. They also provide for the combination of sections of the texts by different compositors, by date of composition, from different types of copy or in any other combination, and lead ultimately towards the application of multivariate analysis techniques which should remove many questions (particularly in regard to distribution of punctuation) from the realms of speculation. I do not expect that this will be completed overnight, but when it is done, I hope it will be shown to have justified the time and money spent in preparing
Shakespeare's works for computation. From devoted study of the concordances by a host of expert and minute textual scholars, and from the further factual base which will be afforded by additional computational analysis, we can hope to make large strides towards the ultimate edition to which we all look.
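
The kind of page-by-page tabulation mentioned above, such as the distribution of capitalized and non-capitalized forms of a word, might be sketched as follows. The sketch does not represent the analysis programs themselves; the function name is hypothetical, and the page labels and words are invented for the illustration and are not readings from any text.

```python
from collections import Counter


def capitalization_by_page(pages, form):
    """For each page, count capitalized and lower-case occurrences of a form."""
    table = {}
    for page, words in pages.items():
        counts = Counter()
        for word in words:
            if word.lower() == form.lower():
                counts["capitalized" if word[0].isupper() else "lower-case"] += 1
        table[page] = (counts["capitalized"], counts["lower-case"])
    return table


# Invented data for illustration only.
sample = {"page 1": ["the", "Chaine", "chaine"], "page 2": ["Chaine", "Chaine"]}
print(capitalization_by_page(sample, "chaine"))
```

The same tabulation could be made by column, scene, or compositor by changing only the labels under which the words are grouped.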