Corpus Linguistics

by Tony McEnery and Andrew Wilson

Edinburgh: Edinburgh University Press, 1996


Statement in Response to Review in IJCL 2:2

In writing this response, we would like to make it clear that readers of books are entitled to hold whatever view they wish about a book they have read. When a reader turns those views into a public statement on the merits and demerits of a book, that too is, in our view, an entirely justifiable activity. We do feel, however, that excessive, unfair and unremitting criticism should be responded to. This response to Michael Stubbs's review of our work is made under such circumstances.

Handling Errors in a Community of Scholars

In responding to the review, the first thing we would like to make clear is that our book does contain errors - typographical and in one case attributional. Indeed, every book contains errors. We were aware of that, and when we established our web site for the book, we invited readers to write in with comments on what we had written and also with errors that they had spotted. We have a very open policy on errors.

In putting our heads above the parapet with the first general textbook on corpus linguistics we knew that we would collect the odd brickbat as well as the odd bouquet. Both have indeed been received. However, to take the heat out of any ensuing debate we had adopted the aforementioned policy of openness. This policy has failed in the case of this review; consequently, we must write in response to elements of criticism we feel unfair. As for the fair comment, the review says nothing which has not been said, in an altogether friendlier and more collegial fashion, by other corpus linguists and by interested linguists of the generative camp.

Criticisms

We will now proceed with a point-by-point rebuttal of the criticisms with which we disagree.

Stubbs complains that not all of the words that are in bold in the book are in the glossary. Bold is used for emphasis as well as denoting glossary terms. For this reason also, not all bold terms are in the index. There is no hidden agenda here.

Stubbs complains that we give no concordance program listings. While we do not list concordances, we do provide appendices which detail available concordancers and corpora for use with them. We must also note here that it was never our intention to cover the practical "how to do it" side of corpus analysis in our book. When we conceived the series, we envisaged a "foundation course" of three volumes. Ours was to be an overview and in particular an introduction to the background and achievements of the corpus-based approach; Geoff Barnbrook's volume was seen as covering the how-to-do-it of actually working with text on a computer (including concordancing); and Michael Oakes's recently published volume was intended as a detailed introduction to the statistics of corpus linguistics. Geoff Barnbrook has made an excellent job of discussing the practicalities of concordancing and it would have been pointless to repeat this information within the same series.

Stubbs criticises the student exercises in chapter 2 by claiming that "A quite unrealistic study question, asks students to 'compare two or more' corpus annotation systems (p. 60)". This is a very misleading representation of the question. The question asks the student to look at ONE scheme, and gives a reference to a book in which one may be found if the student does not have easy access to such a scheme. Additionally, at the book's web site (http://www.ling.lancs.ac.uk/staff/andrew/data.htm) we have two annotation schemes ready for downloading. Even if this were not the case, considering the increasing availability of annotation schemes this question hardly seems infeasible.

At the end of the question, we say "If possible, try to compare two or more systems". We do not see how the question we set marries up to that which Stubbs claims we set. Even so, we do not see that the question that Stubbs claims we set is any less realistic than the one we did set, especially since he does not bother to explain why he considers it unrealistic.

Stubbs complains that chapter three is too brief. This is an introductory textbook. We give references to the literature for those students interested in depth. We are trying to provide a fair degree of breadth within limited space.

Stubbs complains that the review of statistics for corpus linguistics given in the book is too brief. Page 67 of our book clearly states that we do not intend to produce a 'how to' guide to statistics in corpus linguistics. We note that this is the issue of a forthcoming book in the series (see above on our conception of the series). Rather our aim for this chapter is that students should be aware of what statistics are used and what people try to claim on the basis of them. "Our recommendation is that the student reads what we have to say here as a brief introduction to how the techniques can be used and then progress to the more detailed treatments in other texts for further explanation". We do not think that this is an unreasonable approach to take in an introductory overview, especially when references are given and a detailed treatment is forthcoming in the same series.

Furthermore, many of the techniques described can now be performed automatically in programs such as SPSS for Windows. Our view is that corpus linguists can and should make use of statistical tests but that they do not need to know the ins and outs of the mathematics, so long as they understand (a) when it is appropriate to use the tests, (b) what data are required, and (c) how to interpret the results. To take an analogy, a medical doctor is typically not trained how to perform the many laboratory analyses that he requests in the course of his work. But he is taught what to test for, what samples he needs to take, and how to interpret the results. This approach is no less valid with corpus statistics, especially since it will empower that not insignificant number of linguists who are inclined to glaze over at even the first paragraph of even the most non-technical guide to statistics.

Stubbs criticises the study question on page 85 by saying "A study question (p.85) asks students to choose an appropriate statistical test for different problems: the chapter does not give enough information to do this". Again the reviewer indulges in a rewording of the question which misleads - we actually say "Which of the statistical tests in this chapter would you use in the following situation". We are not asking the students to do anything other than consider the description of the statistical tests which we have used in the chapter. Given that the situations we then list have clear parallels in the studies covered in the chapter, we have set a simple comprehension test. We also list what we think (on the basis of the chapter) the answers should be and why.

Stubbs criticises the depth of the review of corpus based linguistics given in chapter four. The coverage here is, of necessity, brief, since we were aiming for breadth rather than depth. We state this quite clearly at the beginning of the chapter and refer the student to various collections of papers for further examples. Although we are conscious of not being exhaustive, it seems to us that the main limitations of the chapter are not those which the reviewer suggests. We are much more keenly aware of having omitted the mass of work which is becoming available in the area of learner corpora and corpus based contrastive linguistics. In all honesty, we can say that when, as is planned, we produce a second edition of the book, we will devote space to a brief review of these as a matter of priority rather than anything suggested by the reviewer.

While in the grammar section we could have devoted some space to Cobuild, we thought it more important to devote space to describing the work of the Nijmegen team, whose work (marrying rationalist and empirical approaches to linguistics) is very much in keeping with the spirit of the book and, we think, relatively under-reported. We are well aware of, and value, the work of Burrows - and other scholars (e.g., Fortier 1989, 1991) - in literary computing and stylistics, but we chose to omit that work here because it deals mainly with individual literary texts or oeuvres rather than with corpora in the sense that we use the term within our book (e.g., pp. 87 and 101). Also, literary detective work is a main theme of Statistics for Corpus Linguistics within our series.

Stubbs complains of an absence of references to Cobuild and the work of Birmingham corpus linguists. The work of the Birmingham corpus linguists is touched upon within the book. We appreciate their work and value it - indeed the second book in our series was by a Birmingham corpus linguist, and Lancaster has co-operated fruitfully with corpus linguists in Birmingham in the past. However, we were writing the book from the point of view of a different tradition - the Lancastrian tradition of corpus linguistics. Other reviewers have noted the bias and viewed it as understandable. We did not omit the work of Birmingham on grounds of spite or wilful neglect. The Birmingham team has its own story to tell, and we felt that we should rightly leave them to tell it.

Stubbs criticises the off-putting tone of our introduction to chapter five, where we warn readers that our introduction to corpus based computational linguistics is brief and preliminary. It is curious that the reviewer picks up on the statements which we made explaining to readers how to approach chapter five when similar comments in chapter three were overlooked.

Stubbs makes a series of criticisms of chapter six:

  1. He complains about the use of the phrase "tending towards being finite" on a "something is either finite or it is not" platform. This leaves us with a question. In reality, as is shown in the study, we do not see a sub language as truly finite when observing even a large chunk of it. What we do see, however, is that when we plot particular features, such as lexicon growth, the resulting curve is asymptotic. It does not actually plateau, indicating full closure, but shows a decided tendency to do so. It is this situation we are trying to describe, in a fairly colloquial way, when we use the phrase "tending towards being finite". We dare say that, had we claimed they were finite, the reviewer's criticism would have been that in this case they were not. We were seeking a form of words to describe an expectation we had of the data based upon observed behaviour.

  2. Stubbs complains that "the same corpus is given as 300,000 and 200,000 words" - The full size of the APHB data set is some 300,000 words. However, when it is presented as being 200,000 words in table 6.1, we are addressing only the portion of the corpus which is annotated.

  3. Stubbs complains that "no reference [is made] to the origin of the concept of restricted languages in J.R. Firth's work" - quite right. We appreciate that for a follower of Firth this must be a lamentable oversight, but it is by no means essential to the point: we were not trying to write a potted history of sublanguage research. If we were, there would be many other authors who would deserve a mention.
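
The lexicon-growth behaviour described in point 1 can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the procedure used in the study reported in chapter six: the function name `lexicon_growth`, the `step` parameter and the case-folding of tokens are all hypothetical choices made here for clarity. It counts how many distinct word forms have been seen as more of a text is read; plotting the resulting pairs gives the kind of curve which flattens without ever quite reaching a plateau.

```python
def lexicon_growth(tokens, step=1000):
    """Return (tokens_seen, distinct_types) pairs, sampled every `step` tokens.

    `tokens` is a list of word forms in corpus order. Lower-casing is an
    assumption made for this sketch, not something the book prescribes.
    """
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())
        # Record a point at each sampling interval and at the very end.
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


# Toy usage: the number of new types added per interval shrinks as we read on.
sample = "the cat sat on the mat and the cat sat on the cat".split()
for tokens_seen, types_seen in lexicon_growth(sample, step=4):
    print(tokens_seen, types_seen)
```

A curve whose increments keep shrinking in this way is what we mean by "tending towards being finite": the lexicon is still growing, but ever more slowly.
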

Stubbs complains that chapter seven in discussing future corpora "tells us that potential future developments .... are 'mind-boggling': with no further comment on what these implications may be" - The sentence which introduces 'mind-boggling' actually concludes: "the ability at last to use the same platform to store, analyse and describe data". The whole of the paragraph which concludes with this statement explores multimedia corpora. If the reviewer would like to examine these possibilities further for himself, then contacting teams at Gothenburg, Edinburgh or Lancaster where multimedia corpora are under development could be eye-opening.

Stubbs then lists a series of further errors and promises us he has a store of them of unspecified size. Some of the criticisms are fair, if, as we have said before, acknowledged by us previously. Let us turn now to those fresh 'errors', however.

  1. We are told, categorically but with no explanation of how, that the definition of mutual information we give is incorrect. We have scratched our heads about this one, and concluded that he must be referring to the difference between specific (which we present) and general (which we do not present) mutual information. As mutual information is now almost synonymous with specific mutual information, we thought there was no harm in presenting specific mutual information as simply being mutual information. In a classic piece of vanity reviewing, the reviewer then suggests that we read his interpretation of mutual information in works by Clear and Church et al. Neither the Clear nor the Church et al references are in Stubbs's bibliography. We were much more to the point - we simply told readers to check Church et al (1991) for a fuller discussion and included the reference to the paper in our bibliography.

  2. Considering whether or not mutual information can or cannot be used to determine collocations is an interesting point, and we assume that Stubbs's uncertainties focus, as they have with other statistical tests, around assumptions of normal distribution. Let us be clear, however - we are reviewing the work of others. We say this is a possible use of mutual information in a run up to the discussion of Church et al's (1991) study in which they use mutual information in this way and which got some interesting results. So if there is an error, surely the reviewer should lay it at the feet of the work we are reviewing? We do, incidentally, make reference to the work of Daille (1995), who suggests an alternative - and arguably more effective - measure than mutual information.

  3. Some of the glossary definitions are castigated as vague or impenetrable. With reference to the modal verb observation, what are we to do - write a mini essay on the intricacies of modal usage in English or put up a thumbnail description which will get the naive reader back on the rails? With reference to Baum Welch, the definition of this algorithm is difficult. We define it from a chapter (chapter five) in which we have warned the reader that many of the ideas are complex and that the naive reader should beware. Putting in enough material to explain Markov models would have grossly inflated the chapter. Putting in a glossary definition for those readers aware of Markov models but not used to the application of the Baum Welch algorithm seemed justified in the context of that chapter.

  4. "risible" - Stubbs asks the question why we find Kaeding's methods risible. In reality, what we say is that "A risible variety of workforces were used by early corpus linguists". We do not pick upon Kaeding alone. We find the whole undertaking of employing thousands of analysts to study language data as little short of laughable - if the only response that corpus linguists could give to linguists who were determined to eschew empirical data was that they should wait for a decade, employ thousands of analysts and see what happens, then we would, as happened, be laughed out of court. We do emphasise, however, that the work of linguists such as West and Kaeding was important, and surpassed only recently.

  5. Racism - the charge of racism is cruel, and we believe unfounded. If Stubbs had looked in the BNC he would have found examples of positive connotations for horde. It is a very moot point that using the phrase 'a horde of analysts from the Indian sub-continent' is in any way even remotely racist. Looking at the BNC data, we see "horde of souls", "horde of children", "horde of courtiers", "horde of tiny crablets", "horde of young girls", "horde of young men", "horde of ribald workmen" and "horde of volunteers". Trying to suggest that horde is ineluctably bound to some form of negative, racist meaning is a fanciful claim. This charge really sets the tone for the credibility of this review.
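
The distinction drawn in points 1 and 2 between specific and general mutual information can be sketched in a few lines of Python. This is a minimal illustration rather than the formulation given in the book or in Church et al (1991): it assumes that "specific" mutual information is the per-pair (pointwise) score log2(P(x,y) / (P(x)P(y))), and that "general" mutual information is the probability-weighted average of that score over every pair in the table. The function names and the toy counts are hypothetical.

```python
import math


def specific_mi(pair_count, x_count, y_count, total):
    """Specific (pointwise) mutual information, in bits, for one word pair.

    Assumes all counts are drawn from the same sample of `total` observations.
    """
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


def general_mi(joint_counts):
    """General (average) mutual information, in bits.

    `joint_counts` maps (x, y) pairs to co-occurrence counts; the marginal
    totals are derived from the table itself.
    """
    total = sum(joint_counts.values())
    x_totals, y_totals = {}, {}
    for (x, y), count in joint_counts.items():
        x_totals[x] = x_totals.get(x, 0) + count
        y_totals[y] = y_totals.get(y, 0) + count
    mi = 0.0
    for (x, y), count in joint_counts.items():
        if count:  # zero cells contribute nothing to the sum
            ratio = count * total / (x_totals[x] * y_totals[y])
            mi += (count / total) * math.log2(ratio)
    return mi


# Toy counts: a pair seen 8 times, its words 16 and 32 times, in 1024 observations.
print(specific_mi(8, 16, 32, 1024))
```

The first function scores a single candidate collocation, which is the use reviewed in the book; the second summarises a whole table in one figure and says nothing about any individual pair.
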

Chomsky

Stubbs devotes quite some space in his review to Noam Chomsky and the central position that chapter one gives to his work. The reviewer seems to "find it odd that so much prominence is given to a position, over 30 years old, which is itself highly critical of corpus work". This statement is amazing. It was Chomsky's critical appraisal of corpus data which drove linguists away from its use. To quote Seuren (1998:258) on Chomsky's influence on what is acceptable data in linguistics "Whatever consensus there is has been shaped in large part in the context of the discussions started by Chomsky". The works of Chomsky explain why corpus linguistics became a marginal methodology in linguistics. The fact that the views are so influential after 30 years surely explains our focus.

Stubbs then proceeds to complain that we do not cover the empiricist/empirical distinction raised by Chomsky. We do not acknowledge or use this distinction, and merely use empiricist as the nominal form of empirical. As our book focuses on the empirical side of Chomsky's distinction, and we saw no reason to dwell on Chomsky's leading role in the defeat of behaviourism (which we do not view as being synonymous with early corpus linguistics) we did not introduce his definitions of empiricist and empirical.

Stubbs then argues that we miscast Chomsky as a realist. Regarding the question of whether Chomsky has adopted a realist or an instrumentalist position, it is clearly possible to argue either - as Chomsky has in fact appeared to do. Chomsky states clearly (1975) that he was always a realist, and that even when he made statements which appeared instrumentalist he was at heart a realist. Rather than confuse the narrative by exploring Chomsky's inconsistent position on this question, we decided to take him at his word: in his works "the realist position is taken for granted" (Chomsky, 1975:37) and "linguists must describe reality" (Chomsky, 1975:81). If we base our views of Chomsky's attitudes to realism/instrumentalism on a reading of 1965 only, then it is possible for a reviewer to come to the conclusion, as Stubbs wrongly does, that Chomsky could never be described as a realist.

Stubbs wonders why we do not attack the validity of the competence/performance distinction. With respect to this point, we can see that for those who follow the ideas of Firth (who as we recall saw the distinction to be a false one), such work is essential. We are not followers of Firth, but we do argue that the gulf between I and E language has been exaggerated - we do not dodge the issue.

Stubbs goes on to suggest we should have examined Chomsky's three tier system of adequacy for grammars. In our initial draft of the book we did aim to cover this topic, but dropped it for two reasons. One reason that we were interested in the question was that it enabled us to examine the concept of grammaticality. We decided to do this, however, by a reference and brief review of Aarts (1991) who covers grammar induction from corpora and concepts such as the grammaticality and acceptability of corpus sentences. The second major reason we wanted to avoid the Chomskyan three fold adequacy argument was that it is "confused and unrealistic" (Seuren, 1998:256). To have brought it in openly would have required a lengthy discussion which, given that we had the reference to Aarts, was largely unnecessary for our purpose.

Stubbs finishes with a complaint that we describe corpus linguistics as a methodology rather than giving it some theoretical status. This comment alone shows the clear difference in orientation between the reviewer and the authors. We maintain that corpus linguistics is a methodology. While the fact that it clearly has an impact upon linguistic theory and has been buffeted by theoretical debate is beyond doubt, we do not view it as being ineluctably bound to a particular theory of language, though some will rely on it more or less than others.

Conclusion

We accept that the reviewer, for theoretical reasons perhaps, does not find the view of corpus linguistics presented in our book that which he wants. He is entitled to this position. However, to support it with a series of highly selective quotes and mean-spirited comments is less acceptable.

REFERENCES

Aarts, J. (1991). Intuition-based and observation-based grammars. In: K. Aijmer and B. Altenberg (eds) English Corpus Linguistics, Longman, London, pp. 44-61.

Chomsky, N. (1975). The Logical Structure of Linguistic Theory, Plenum Press, New York & London.

Church, K., Gale, W., Hanks, P. and Hindle, D. (1991). Using statistics in lexical analysis. In: U. Zernik (ed.) Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, Lawrence Erlbaum Associates, New Jersey, pp. 115-164.

Daille, B. (1995). Combined Approach for Terminology Extraction: Lexical Statistics and Linguistic Filtering, UCREL Technical Papers No. 5., UCREL, Lancaster University.

Fortier, P.A. (1989). Some statistics of themes in the French novel. Computers and the Humanities 23: 293-99.

Fortier, P.A. (1991). Theory, methods and applications: some examples in French literature. Literary and Linguistic Computing 6(3): 192-96.

Seuren, P. (1998). A Brief History of Western Linguistics, Blackwell, London.


Last updated 6/9/98