Using Corpora in Discourse Analysis

Baker, P. (2006) Using Corpora in Discourse Analysis. London: Continuum.


Book description

Using Corpora in Discourse Analysis examines approaches to carrying out discourse analysis (DA) using techniques that are grounded in corpus linguistics. In the past much research on critical discourse analysis has focussed on analyses of single texts or small collections of texts. However, researchers working in CDA are beginning to acknowledge the potential of using corpora either to supplement their findings or as a valid methodology in itself. A corpus-based approach helps to provide quantitative evidence of the existence of discourses by enabling researchers to identify repetitive linguistic patterns of language use and to uncover hidden meanings in lexical items e.g. by examining collocations. Corpus linguistics also allows researchers to uncover linguistic evidence for prevailing/majority and resistant/minority discourses as a large corpus is likely to show a range of ideological positions - something which an analysis of a single text may be less likely to reveal.
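As a rough illustration of what examining collocations can involve, the sketch below counts the words that occur within a few tokens of a chosen node word. The function name `collocates`, the `span` parameter and the toy text are all assumptions made for this example; they are not taken from the book, and real studies typically rank collocates with an association statistic rather than raw counts.

```python
import re
from collections import Counter


def collocates(text, node, span=3):
    """Count words occurring within `span` tokens either side of each `node` occurrence."""
    # Crude tokenisation: lowercase alphabetic strings only (an assumption for this sketch).
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Words to the left and right of the node, excluding the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


# Invented toy text, not drawn from any corpus discussed in the book.
toy_text = "The sea was very clean. The sea was calm. A very clean beach by the sea."

result = collocates(toy_text, "sea", span=2)
for word, n in result.most_common():
    print(word, n)
```

With a corpus of millions of words, the same counting step would normally be followed by a significance or association measure, so that very common words such as "the" do not dominate the list.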

Using Corpora in Discourse Analysis does not assume prior knowledge of corpus linguistics. The book examines and evaluates a variety of corpus-based methodologies, including collocations, keyness, concordances and dispersion plots, using a range of examples from different types of corpora. It also considers issues of building and annotating corpora as well as the validity of approaching CDA from a combination of qualitative and quantitative perspectives. The book is illustrated with a number of real-life examples of corpus-based CDA from a range of sources, covering a variety of subjects including:

  • Holiday brochures
  • Parliamentary debates about banning foxhunting
  • Newspaper reports about refugees
  • Representations of the words bachelor and spinster in general corpora

Chapter 1 Introduction
Chapter 2 Corpus Building
Chapter 3 Frequency and Dispersion
Chapter 4 Concordances
Chapter 5 Collocates
Chapter 6 Keyness
Chapter 7 Beyond Collocation
Chapter 8 Conclusion

Excerpt - Chapter 1 Introduction

This book is about a set of techniques of analysing language for a particular purpose. Or more explicitly, it is about using corpora (large bodies of naturally-occurring language data stored on computers) and corpus processes (computational procedures which manipulate this data in various ways) in order to uncover linguistic patterns which can enable us to make sense of the ways that language is used in the construction of discourses (or ways of constructing reality).

It therefore involves the pairing of two areas related to linguistics (corpora and discourse) which have not had a great deal to do with each other for reasons I will try to explain later in this chapter. This book is mainly written for 'linguists who use corpora' (Partington 2003: 257), rather than explicitly for corpus linguists, although hopefully corpus linguists may find something of use in it too.

This chapter serves as an overview for the rest of the book. A problem with writing a book that involves bridge-building between two different disciplines lies in the assumptions that have to be made regarding a fairly disparate target audience. Some people may know a lot about discourse analysis but not a great deal about corpus linguistics. For others the opposite may be the case. For others still, both areas might be equally opaque. So, for the sake of completeness and inclusiveness, I will try to cover as much ground as possible and hope that readers bear with me or can skim through the parts that they are already familiar with. I will begin by giving a quick description of corpus linguistics, followed by one of discourse.

Corpus Linguistics

Corpus linguistics is 'the study of language based on examples of real life language use.' (McEnery & Wilson, 1996: 1). However, unlike purely qualitative approaches to research, corpus linguistics utilises bodies of electronically encoded text, implementing a more quantitative methodology, for example by using frequency information about occurrences of particular linguistic phenomena. As Biber (1998: 4) points out, corpus-based research actually depends on both quantitative and qualitative techniques: 'Association patterns represent quantitative relations, measuring the extent to which features and variants are associated with contextual factors. However functional (qualitative) interpretation is also an essential step in any corpus-based analysis.'

Corpora are generally large (consisting of thousands or even millions of words), representative samples of a particular type of naturally occurring language, so they can be used as a standard reference against which claims about language can be measured. The fact that they are encoded electronically means that complex calculations can be carried out on large amounts of text, revealing linguistic patterns and frequency information that would otherwise take days or months to uncover by hand, and that may run counter to intuition.
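The kind of frequency information described above can be sketched in a few lines. This is a minimal illustration, assuming a very crude tokenisation rule; the names `tokenize` and `frequency_list` and the toy text are invented for the example and do not come from the book or from any particular corpus tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(text, top_n=5):
    """Return the top_n most frequent tokens with raw counts and counts per 1,000 tokens."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, n, 1000 * n / total) for word, n in counts.most_common(top_n)]


# Invented toy text standing in for a corpus; a real corpus would be far larger.
toy_corpus = "The sea was very clean. The beach was very quiet and the town was very small."

for word, raw, per_thousand in frequency_list(toy_corpus):
    print(f"{word:<6} {raw:>3} {per_thousand:8.1f}")
```

Reporting a normalised rate (here per 1,000 tokens) alongside the raw count is what makes frequencies comparable across corpora of different sizes.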

Electronic corpora are often annotated with additional linguistic information, the most common being part of speech information (for example, whether something is a noun or a verb), which allows large-scale grammatical analyses to be carried out. Other types of information can be encoded within corpora - for example, in spoken corpora (containing transcripts of dialogue) attributes such as sex, age, socio-economic group and region can be encoded for each participant. This would allow language comparisons to be made between different types of speakers. For example, Rayson et al (1997) have shown that speakers from economically advantaged groups use adverbs like actually and really more than those from less advantaged groups. On the other hand, people from less advantaged groups are more likely to use words like say, said and saying, as well as numbers and taboo words.
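A comparison of this kind depends on speaker attributes being stored alongside each utterance. The sketch below shows one possible way to do it, assuming each utterance is a small record with a `group` label; the records, the labels "A" and "B" and the function `rate_per_thousand` are hypothetical and are not data or methods from Rayson et al (1997).

```python
from collections import defaultdict

# Invented toy utterances; the "group" labels and texts are hypothetical,
# not data from Rayson et al (1997).
utterances = [
    {"group": "A", "text": "well actually I really think it was fine"},
    {"group": "A", "text": "it is really quite far actually"},
    {"group": "B", "text": "she said it was fine and I said no"},
    {"group": "B", "text": "he was saying it is far"},
]


def rate_per_thousand(utterances, targets):
    """For each group, return occurrences of the target words per 1,000 tokens."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for utt in utterances:
        tokens = utt["text"].lower().split()
        totals[utt["group"]] += len(tokens)
        hits[utt["group"]] += sum(1 for tok in tokens if tok in targets)
    return {group: 1000 * hits[group] / totals[group] for group in totals}


adverbs = rate_per_thousand(utterances, {"actually", "really"})
say_forms = rate_per_thousand(utterances, {"say", "said", "saying"})
for group in sorted(adverbs):
    print(group, round(adverbs[group], 1), round(say_forms[group], 1))
```

Normalising by each group's total token count matters here, because groups rarely contribute the same amount of speech to a corpus.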

Corpus-based or equivalent methods have been used from as early as the nineteenth century. The diary studies of infant language acquisition (Taine 1877, Preyer 1889) and Käding's (1897) frequency distribution of sequences of letters in an 11-million-word corpus of German both drew on large collections of naturally occurring language use (in the absence of computers, the data was painstakingly analysed by hand). However, up until the 1970s, only a small number of studies utilised corpus-based approaches. Quirk's (1960) Survey of English Usage began in 1961, as did Francis and Kučera's work on the Brown Corpus of American English. It was not until the advent of widely available personal computers in the 1980s that corpus linguistics as a methodology became popular. Johansson (1991) shows that the number of such studies doubled in every five-year period between 1976 and 1991.

Corpus linguistics has since been employed in a number of areas of linguistic enquiry, including dictionary creation (Clear et al 1996), as an aid to interpretation of literary texts (Louw 1997), forensic linguistics (Wools and Coulthard 1998), language description (Sinclair 1999), language variation studies (Biber 1988) and language teaching materials (Johns 1997). The aim of this book, however, is to investigate how corpus linguistics can enable the analysis of discourses. With that said, the term discourse has numerous interpretations, so the following section explains what I mean when I use it.

Discourse

The term discourse is problematic, as it is used in social and linguistic research in a number of inter-related yet different ways. In traditional linguistics it is defined as either 'language above the sentence or above the clause' (Stubbs 1983: 1), or 'language in use' (Brown and Yule 1983). We can talk about the discourse structure of particular texts. For example, a recipe will usually begin with the name of the meal to be prepared, then give a list of ingredients, then describe the means of preparation. There may be variants to this, but on the whole we are usually able to recognise the discourse structure of a text like a recipe fairly easily. We would expect certain lexical items or grammatical structures to appear at particular places (for example, numbers and measurements would appear near the beginning of the text, in the list of ingredients, as in '4 15ml spoons of olive oil', whereas imperative sentences would appear in the latter half, as in 'Slice each potato lengthwise.').

The term discourse is also sometimes applied to different types of language use or topics. For example, we can talk about political discourse (Chilton 2004), colonial discourse (Williams and Chrisman 1993), media discourse (Fairclough 1995) and environmental discourse (Hajer 1997). A number of researchers have used corpora to examine discourse styles of people who are learners of English. Ringbom (1998) found a high frequency of lexis that had a high generality (words like people and things) in a corpus of writings produced by learners of English when compared to a similar corpus of native speakers. Ringbom suggests that this results in learner English having a vague style. Similarly, Lorenz (1998) found that learners modify adjectives frequently, giving their discourse a sense of overstatement (for example, 'The sea was very clean'), whereas Flowerdew (2000) showed that learner discourse contained an under-use of hedging devices (words like perhaps and possibly), making their writing appear very direct. So this is a conceptualisation of discourse which is linked to genre, style or text type. And throughout this book we will be examining a range of different discourses: tourist discourse in Chapter 3, news reporting discourse in Chapters 4 and 7, and political discourse in Chapter 6.

However, discourse can also be defined as 'practices which systematically form the objects of which they speak' (Foucault 1972: 49) and it is this meaning of discourse which I intend to focus on in this book (although in practice it is difficult to consider this meaning without taking into account the other meanings as well).

In order to expand upon Foucault's definition, discourse is a 'system of statements which constructs an object' (Parker 1992: 5) or 'language-in-action' (Blommaert 2005: 2). It is further categorised by Burr (1995: 48) as 'a set of meanings, metaphors, representations, images, stories, statements and so on that in some way together produce a particular version of events… Surrounding any one object, event, person etc., there may be a variety of different discourses, each with a different story to tell about the world, a different way of representing it to the world.' Because of Foucault's notion of practices, discourse therefore becomes a countable noun: discourses (Cameron 2001: 15). So around any given object or concept there are likely to be multiple ways of constructing it, reflecting the fact that humans are diverse creatures; we tend to perceive aspects of the world in different ways, depending on a range of factors. In addition, discourses allow for people to be internally inconsistent; they help to explain why people contradict themselves, change position or appear to have ambiguous or conflicting views on the same subject (Potter and Wetherell 1987). We can view cases like this in terms of people holding competing discourses. Therefore, discourses are not valid descriptions of people's 'beliefs' or 'opinions' and they cannot be taken as representing an inner, essential aspect of identity such as personality or attitude. Instead they are connected to practices and structures that are lived out in society from day to day. Discourses can therefore be difficult to pin down or describe - they are constantly changing, interacting with each other, breaking off and merging. As Sunderland (2004) points out, there is no 'dictionary of discourses'. In addition, any act of naming or defining a discourse is going to be an interpretative one. Where I see a discourse, you may see a different discourse, or no discourse.

It is difficult, if not impossible, to step outside discourse. Therefore our labelling of something as a discourse is going to be based upon the discourses that we already (often unconsciously) live with. As Foucault (1972: 146) notes, 'it is not possible for us to describe our own archive, since it is from within these rules that we speak.'
