Computer Assistance

The analysis employed a number of computer programs to quantify membership of word lists, to provide vocabulary listings, and to identify a number of different phrasal patterns. Two word lists were used, the "closed set" and the "core vocabulary." The closed set comprises words which are functional, irrespective of their relative frequency in texts. This list includes some 400 words, headed by the very frequently occurring "the," "of," "to," "a," and "that." It also includes numbers, which can cover several different functions but normally rely on context for their meaning. Occurrences of words on this list account for 60% of all the texts examined in this study. Words not on this list are treated as "lexical," carrying the content of the texts. The core vocabulary is a list of the 600 or so lexical words which occur most frequently in twentieth-century children's literature, and which form a quantifiable subset of the 40% lexical component of the texts under examination. (Its usage is explained below.) Two other specialist terms are used throughout the text. Words which occur only once in a given text are referred to as hapax legomena, and those occurring twice as hapax dislegomena.
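
To make these definitions concrete, the basic bookkeeping can be sketched in a few lines of Python. This is not the software used in the study; the tokenization and the stub closed set are illustrative assumptions (the real closed set runs to some 400 words and also covers numbers).

    from collections import Counter

    # Stand-in for the ~400-word closed set described above; the real
    # list is much longer and also includes numbers.
    CLOSED_SET = {"the", "of", "to", "a", "that", "and", "on", "in", "it"}

    def split_tokens(tokens, closed_set=CLOSED_SET):
        """Partition a token stream into closed-set (function) words
        and lexical (content) words."""
        closed = [t for t in tokens if t in closed_set]
        lexical = [t for t in tokens if t not in closed_set]
        return closed, lexical

    def hapax_sets(tokens):
        """Return the hapax legomena (words occurring once) and the
        hapax dislegomena (words occurring twice) of a token list."""
        freq = Counter(tokens)
        once = {w for w, n in freq.items() if n == 1}
        twice = {w for w, n in freq.items() if n == 2}
        return once, twice

    tokens = "the cat sat on the mat and the dog slept".lower().split()
    closed, lexical = split_tokens(tokens)
    print(len(closed) / len(tokens))   # closed-set share of running words (~60% in the study)
    once, twice = hapax_sets(lexical)  # lexical hapax legomena / dislegomena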

The assessment employed computer programs developed from authorship attribution studies, but designed specifically for forensic linguistic purposes (Woolls and Coulthard 49-56). Three measurements of the vocabulary are used:

  • 1. "Lexical richness" is a calculation based on the number of lexical hapax legomena used by the author of a given text. The score is arrived at by dividing the logarithm of the full text length by the proportion of the vocabulary which is used more than once. This gives a higher score when the divisor is low (i.e. fewer words used more than once), hence the concept of "richness." Transforming the text length to its logarithmic value gives comparable scores for texts of very different lengths, reflecting the fact that hapax legomena are usually spread fairly evenly throughout a text of any length, but obviously represent a smaller proportion of longer texts. The term "lexical richness" refers to a writer's use of the vocabulary in a given text rather than the size of his/her vocabulary. (As a rough guide, a richness level of 700 to 800 is found in the news items of the press and a level of 1200 or more is found in poetry.)
  • 2. The "lexical hapax dislegomena percentage" is simply the number of lexical words which occur twice divided by the total vocabulary.
  • 3. The "core vocabulary percentage" is the total occurrence of words which are on the core vocabulary list divided by the total lexical usage.

All three measurements reflect how writers have used their vocabulary, consciously or unconsciously, with no regard to structure or meaning but simply by levels of occurrence. These features are common to all texts and observations of the quantities of each may provide an objective indicator of similarity or difference, which should then also be observable in structure and meaning. All three measurements have been shown to discriminate between texts at the 5% significance level, which is the statistical limit usually taken to indicate that the results are not obtained by chance but reflect distinct differences.

The first two measurements relate to research reported in Holmes (259-268). The theory underlying them is that comparing the scores produced by the writings of two authors will allow them to be seen in different places on a scattergraph, when the lexical richness scores are plotted against the hapax dislegomena percentages in two-dimensional space. One author will appear on the left and the other on the right, or one at the top and one at the bottom, depending on which discriminator is the stronger. When more than two authors are being examined, they should each appear in a different segment. The division is not always clear-cut, because there is no suggestion that authors always write in exactly the same way, but the theory maintains that the measurements used are likely to reflect general habits, of which an author tends to be unaware during writing, and that observation of these results over a range of texts will reveal a tendency in one direction or another. The third measurement, the core vocabulary, arose from an earlier use of the programs by Woolls in relation to the development of writing skills in children. This set has also been found to be present in substantial quantities, between 25% and 38% of all lexical occurrences, in eighteenth- and nineteenth-century writing, and measuring the occurrence of the set has proved equally valid as a discriminator in writing for adults from the eighteenth century to the present day.[8]
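
The scattergraph Holmes describes is straightforward to visualize; the sketch below uses matplotlib with invented scores purely to illustrate the plot, not data from this study.

    import matplotlib.pyplot as plt

    # Invented (lexical richness, dislegomena %) pairs for two candidate
    # authors, one pair per text; real scores would come from the
    # measurements defined above.
    scores = {
        "Author A": [(780, 11.2), (805, 10.8), (760, 11.9)],
        "Author B": [(920, 14.1), (890, 13.5), (945, 13.9)],
    }
    for author, points in scores.items():
        xs, ys = zip(*points)
        plt.scatter(xs, ys, label=author)
    plt.xlabel("Lexical richness")
    plt.ylabel("Lexical hapax dislegomena %")
    plt.legend()
    plt.show()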

The closed set, though used initially only to identify the lexical words by their absence from it, is not discounted in the analysis, since it can provide indications of authorship habits, and indeed it is used by authorship attribution analysts for precisely that reason where large amounts of data are available. In forensic linguistics the text lengths are normally very much shorter than any of the texts examined here, so separating the two sets allows closer analysis of their interaction. In addition, this separation reveals that in texts of greatly varying lengths the actual number of closed-set hapax legomena and hapax dislegomena is remarkably stable: the texts examined range from 776 to 13,074 words in length, but most have around 50 hapax legomena and around 24 hapax dislegomena from the closed set. This is why, in this study, all scores are calculated on lexical quantities alone, to eliminate the potential distortion caused by the proportionately higher representation of the closed set in the shorter texts.

A further feature of the lexical hapax legomena in particular is that they are usually spread throughout the text, not always evenly but in substantial quantity wherever an examination is made. The degree of regularity may be examined for any given text, to ensure that the discrimination between authors indicated by the lexical richness scores is in fact based on texts which broadly conform to the expected distribution. Where texts manifest different patterns, further investigation may be required. Such an investigation forms the core of this essay, following the initial stage of analysis.
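
A regularity check of the kind just described might look as follows: the text is cut into equal segments and the whole-text lexical hapaxes are counted per segment, a roughly flat profile matching the expected distribution. The segment count of ten is an arbitrary choice for illustration.

    from collections import Counter

    def hapax_profile(tokens, closed_set, n_segments=10):
        """Count, per equal-length segment, the tokens that are lexical
        hapax legomena in the text as a whole."""
        lexical_freq = Counter(t for t in tokens if t not in closed_set)
        hapaxes = {w for w, n in lexical_freq.items() if n == 1}
        seg_len = max(1, len(tokens) // n_segments)
        return [sum(1 for t in tokens[i:i + seg_len] if t in hapaxes)
                for i in range(0, len(tokens), seg_len)]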