Workshop: Using WordSmith to Analyse Health Texts

 

Making a wordlist

Follow these steps to make a wordlist:

  1. Start WordSmith and go to the relevant directory. (For today, that is: x:\wsmith\texts\smokewp\.)
  2. Click on the text(s) you want to analyse (try smokewp.txt) and press [OK].
  3. Click on [Tools] -- [Wordlist]
  4. Click on the Green button and then on [Make a wordlist now]. This will create three new windows.
  5. To save all 3 windows together, click on the [Save As] button and save as "smokewp.lst" under your network directory (H: drive).

VERY IMPORTANT: When you have created one wordlist, and want to create a new one, make sure that you hit [CLEAR PREVIOUS] before repeating steps 1-5. Otherwise, your new text(s) will be ADDED to the old one(s) and you will end up with a mess.

You should now have something that looks like this:

[Screenshot: one window of a wordlist]

There are three different wordlist windows. Look at each window one by one. To move between them click on the word Window at the top and choose 1, 2 or 3.

Wordlist (F): frequency list, where words are listed with the most frequent coming first, descending to the least frequent. (Scroll down and see.)

Wordlist (A): alphabetical list. The same list as above but with words in alphabetical order.

Wordlist (S): statistics file
Some important terms in the (S) window:

Look at the screenshot below. Each word in green is a type (a distinct word form). The Freq. column gives the number of tokens (individual occurrences) of that type in the text.

[Screenshot: types and tokens]

The (S) window also tells you, amongst other things, the average sentence length (in words) and the frequencies of words with different lengths (in characters). We will make use of this information in a moment.
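To make the three windows more concrete, here is a minimal Python sketch of the same idea. It is not how WordSmith itself works: the tokenising rule (lowercase runs of letters, digits and apostrophes), the function names and the sample sentence are all assumptions made for illustration, so the counts may differ slightly from WordSmith's.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (an assumed, simplified rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def make_wordlist(text):
    """Return rough equivalents of the (F), (A) and (S) windows."""
    tokens = tokenize(text)
    freq = Counter(tokens)  # one entry per type, value = number of tokens
    # (F): most frequent first; ties are broken alphabetically
    by_frequency = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    # (A): the same entries in alphabetical order
    alphabetical = sorted(freq.items())
    # (S): a few summary statistics
    stats = {"tokens": len(tokens), "types": len(freq)}
    return by_frequency, alphabetical, stats

# Invented sample text, not taken from any of the workshop files
sample = "Smoking kills. Smoking harms children and smoking harms adults."
by_frequency, alphabetical, stats = make_wordlist(sample)
print(by_frequency[:3])
print(alphabetical[:3])
print(stats)
```

In this sample, "smoking" is one type with three tokens, which is the distinction the (S) window relies on.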


Readability

There are several, rather similar, indices of readability available, which are easy to calculate using the information provided by the Wordlist (S) window in WordSmith. Today, we are going to use a slightly modified version of the Gunning readability formula:

(av. sentence length + percentage of words longer than 9 characters) * 0.4

Follow these steps to calculate the Gunning readability for a text:

  1. Look up the average sentence length.
  2. Add up the number of words longer than 9 characters (i.e. the frequencies for words of 10 characters and above, as listed in the (S) window).
  3. Divide the result of (2) by the number of tokens and multiply by 100. This gives the percentage of words longer than 9 characters.
  4. Add the result of (3) to the average sentence length.
  5. Multiply the result of (4) by 0.4.

As a rough guide to interpreting the result:

6 - 8 = a readable text
12 - 15 = a difficult text
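The five steps above can be sketched in Python as follows. The first function takes the figures you read off the (S) window; the second estimates them directly from a text. The function names, the sentence-splitting rule (a sentence ends at ".", "!" or "?") and the figures in the worked example are assumptions for illustration only, so results computed from raw text may not match WordSmith's exactly.

```python
import re

def gunning_from_stats(avg_sentence_length, long_words, tokens):
    """Modified Gunning score from figures read off the (S) window."""
    pct_long = long_words / tokens * 100          # step 3
    return (avg_sentence_length + pct_long) * 0.4  # steps 4 and 5

def gunning_from_text(text, threshold=9):
    """Estimate the same score from raw text (simplified tokenising)."""
    # Assumption: sentences end at '.', '!' or '?'
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not tokens:
        raise ValueError("text contains no sentences or no words")
    avg_sentence_length = len(tokens) / len(sentences)       # step 1
    long_words = sum(1 for t in tokens if len(t) > threshold)  # step 2
    return gunning_from_stats(avg_sentence_length, long_words, len(tokens))

# Hypothetical figures: 20 words per sentence, 50 long words in 1000 tokens
worked = gunning_from_stats(20, 50, 1000)
print(round(worked, 2))  # (20 + 5) * 0.4

# Invented two-sentence sample
sample_score = gunning_from_text("Smoking kills. Advertising encourages children.")
print(round(sample_score, 2))
```

With the hypothetical figures, 50 long words in 1000 tokens is 5 per cent, so the score is (20 + 5) * 0.4 = 10, which falls between the "readable" and "difficult" bands.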

In the x:\wsmith\texts\smokewp directory you will find 4 texts:

  • smokewp.txt - the UK government's White Paper, "Smoking Kills"
  • merck.txt - an extract on smoking from the Merck Manual (a comprehensive medical textbook)
  • chris.txt - the transcript of a BBC online chat session on smoking with Dr. Chris Steele
  • surggen.txt - the content of the US Surgeon General's anti-smoking website for children

Now attempt the following questions:

  • Generate a wordlist for each text and use it to calculate the Gunning readability.
  • Does the readability for each text correspond with your intuitions?
  • In what contexts (esp., but not only, in health) might such a readability index be of use?
  • What other factors, not included in the Gunning formula, might be important in making health information more comprehensible? Could these be quantified in an easy formula similar to Gunning's?

Using WordSmith to Explore Health Policy Texts

As well as calculating readability, you can also use WordSmith as a quick "way in" to the most important features of a longer text (such as "Smoking Kills", the UK Government's White Paper on smoking and health).

Go back and open the wordlist of smokewp.txt which you created at the beginning of the seminar.

If you read through a wordlist like this, you will quickly be able to see what topics, actors etc. are most prominent in the text.

  • Make a list of the main actors in the White Paper
  • Try to identify some of the main topics which occur in the White Paper

These lists give us important information about the text and its contents, but it is decontextualized information. To find out more about what roles these topics and actors have in the text, we need to look more closely at their contexts. As the linguist J.R. Firth once said, "you shall know a word by the company it keeps".

To look at a word in context, you can simply highlight it in your wordlist and click on the C button on the button bar. This will open up a concordance window, showing your word within its textual context.
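For readers who want to see what a concordance does behind the scenes, here is a minimal keyword-in-context sketch in Python. It only illustrates the idea and is not WordSmith's method: the function name, the 30-character context width and the sample sentence are assumptions.

```python
import re

def concordance(text, node, width=30):
    """Return one line per occurrence of `node`, with context on each side."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = r"\b" + re.escape(node) + r"\b"
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the node words line up in a column
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Invented sample text, not taken from the White Paper
sample = ("The Government will act. Children copy smokers, "
          "and the government knows it.")
kwic_lines = concordance(sample, "government")
for line in kwic_lines:
    print(line)
```

Aligning the search word in a central column is what makes recurring patterns to its left and right easy to spot.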

To organize these contexts a little more, we can search for those words which co-occur most often (or "collocate", as linguists say) with the words which interest us.

When you have a concordance screen open, click on the little yellow hand on the button bar. This gives you a table of collocates for the word you have concordanced. It shows which words occur most often near that word - usually within 5 words of context either side.
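The collocate table can be approximated with the short Python sketch below. It counts every word that falls within a given span of the search word. The function name, the tokenising rule and the sample text are assumptions, and the sketch ignores sentence boundaries, so it will not reproduce WordSmith's table exactly.

```python
import re
from collections import Counter

def collocates(text, node, span=5):
    """Count words occurring within `span` words either side of `node`."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    node = node.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts.most_common()  # most frequent collocates first

# Invented sample text; a span of 2 keeps the example easy to check by hand
sample = ("Smoking harms children. Children who smoke become adult smokers. "
          "Protect children from tobacco advertising.")
table = dict(collocates(sample, "children", span=2))
print(table)
```

Note that the search word can appear as its own collocate when two occurrences fall within the same span, as happens with the two adjacent instances of "children" in this sample.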

  • Run concordances and generate collocate tables for some of the more important words in "Smoking Kills". Try, for example, government, NHS, health, smoker(s), children, and advertising.
  • What do these tables tell you about the ways in which these concepts are represented in the text?

Online references about readability:

Stone, LA 2000 A demonstration of different estimations of readability for several forms of the Miranda warnings and associated waivers of rights. Electronic Journal of Forensic Psychonomics 1. URL: http://users.stargate.net/~lastone/mirandareadability.html.

Stone, LA 2001 Readability level of the U.S. government's alcohol beverages warning statements. Electronic Journal of Forensic Psychonomics 2. URL: http://users.stargate.net/~lastone/alcoholwarning.html.

General web page on readability: http://www.timetabler.com/reading.html


Last updated by Andrew Wilson on 21-JAN-2003.