You can purchase copies of the manual by contacting the publisher:
Kingston Press Ltd., 43, Derwent Road, Whitton, Twickenham, Middlesex,
TW2 7HQ, England, U.K.
Tel./ Fax int +44 208 893 3015 E-mail: <sales@kingstonpress.com>
Web page: <http://www.kingstonpress.com>
All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted in any form or by any means,
electronic, mechanical, photocopying or otherwise without the permission
of the copyright holder.
Over the last few decades, the study of bilingual and plurilingual talk has been an important focus for many linguists. Considerable amounts of data have been collected through projects, large and small, in many countries and involving many different languages and dialects. It has become a source of frustration for many who work in this area that there is no basic standard for transcribing data of this kind, nor any central resource to enable researchers to share their data with each other. Researchers in bilingualism at the moment can only share data by making a private arrangement. Meanwhile, researchers in other fields such as language acquisition have both standard ways of transcribing and coding data, and international databases to which they can contribute and on which they can draw for comparative data.
We, the LIPPS group, hope that this Manual will benefit research and researchers in our field in two ways.
For a researcher who is new to the work of transcribing and coding bilingual data, or who has done similar work before but has a new set of data waiting to be transcribed and coded, it describes step by step a way of carrying out the transcription and coding which provides many useful facilities and makes it possible to use existing computer-based analytical tools. For the beginner, we hope that this Manual will provide answers to many of the basic questions relating to transcription of bilingual data. Where it cannot provide answers, it may at least help researchers to consider what they are doing and how their decisions may affect the ways in which their data may be useful to other researchers. We hope that our recommendations in this Manual will be adopted widely as "best practice".
In addition, we hope that the existence of a set of basic standards for transcribing and coding bilingual data will encourage research on language interaction from an interdisciplinary perspective. Thus a special effort is made in this manual to cater for the needs of researchers working in very different fields. The proposals made here are intended equally to help those who are interested in quantitative research and those who are interested in qualitative research. We hope to make the researcher’s task of making sense of the data easier by offering guidelines that propose effective solutions to problems that may arise in the processes of transcription and coding. In addition, some user-friendly computational tools are discussed which provide support in exploring and analysing language interaction data.
The authors of the present manual have worked together and contributed at different stages to its creation. Their names are listed in alphabetical order on the title page. We would like to put on record here our appreciation for the support of many others: Ad Backus, Maria Carmen Domínguez, Pieter Muysken, Roberto Perez and Mukul Saxena, as well as the institutions that have supplied funding and facilities for holding our research meetings: Birkbeck College, Dirección General de Ciencia y Tecnologia (Ministerio de Educación y Ciencia), Lancaster University, Max Planck Institute, The British Council, Tilburg University, Universitat Autònoma de Barcelona, Universitat Pompeu Fabra. For citation purposes, the authors of this manual should be given as "The LIPPS Group".
The manual is still a working document which we plan to improve and refine in successive versions. For where to send comments on the text and further details see Chapter 7 on Practical information. We, the LIPPS group, are happy to offer this Manual to researchers in our field and warmly invite your comments and suggestions.
Chapter 1: A data exchange system for language interaction
Glossary of terms
Appendices
Appendix 1: CHAT Symbol Summary
Appendix 2: Comparison between the transcribing and coding system used by Martin-Jones and Saxena (1992) and the CHAT system
Appendix 3: Comparison of the coding system developed by Selting and others (1997) with the CHAT system
Appendix 4: Overview of all CLAN tools
Appendix 5: Information included in the CHAT depfile and 00depaddfile
Appendix 6: Example of a file transcribed using INTROS as a transcription unit
Appendix 7: An example of a detailed coding scheme
Appendix 8: Extract from the Backus tilbrot.cha file
The LIDES Coding Manual is the outcome of joint work carried out at meetings held in Ljouwert/Leeuwarden (September 1994), London (January 1995), Barcelona (September 1995), Nijmegen (April 1996) and Barcelona (January 1997), which aimed at creating an international language interaction data exchange system (LIDES). Research on language interaction focuses on language practices in multilingual or multidialectal communities from a variety of social and linguistic perspectives. We have adopted the term language interaction rather than the more commonly used terms "code-switching" and "language contact" in order to include all manifestations of language contact whether or not the varieties under study are held to belong to two discrete systems.
At these meetings, the purpose of the LIDES project was discussed and proposals for data transcription and coding were explored. In the field of language interaction we feel there is a rich source both of interesting research questions and relevant data to answer them. However, we believe that many of these questions will not be answerable - and therefore may not even be asked - until researchers working on different language combinations, in different social contexts and drawing on different research traditions, are in a position to exchange and compare data freely with each other.
The idea of setting up a research group for encouraging
the study of language interaction phenomena was first conceived by several
researchers at the Ljouwert/Leeuwarden meeting. Penelope Gardner-Chloros
(Birkbeck College), Roeland van Hout (Tilburg University), Melissa G. Moyer
(Universitat Autònoma de Barcelona) and Mark Sebba (University of
Lancaster) together with Pieter Muysken and François Grosjean committed
themselves to the establishment of an international language interaction
data exchange system. The core group was formally constituted in Barcelona
under the name of LIPPS - Language Interaction in Plurilingual and Plurilectal
Speakers. Participation in the enterprise was organized as follows:
1.2 The goals of LIDES
There is now a wealth of material available on language interaction, both published and unpublished, in the form of Ph.D. theses and similar work. A CD-ROM search of titles in Linguistics Abstracts containing the term code-switching (see glossary) produced over 800 titles. The great majority of these studies involve the collection of new sets of data by individual researchers; so while there is a lot of often painstakingly collected data around, it seems likely that the endless collection of more data for analysis will no longer constitute the most productive application of research efforts.
In spite of the high level of interest, however, no co-ordinated system has yet been developed for researchers to make use of one another's data. On the contrary, the data is only available, if at all, in widely disparate forms, and coding and transcription practices vary widely. Access to the original data (usually audio-recordings) may be necessary in order to verify the significance of the written transcripts, but it is even rarer that anyone except the original researcher gains access to this material.
The CHILDES enterprise (MacWhinney 1995, MacWhinney & Snow 1990) shows us the enormous advantage of extensive databases (see glossary) in research fields where data on spoken, spontaneous language is essential. Clearly, the acronym LIDES is based on CHILDES. We hope that the CHILDES project will inspire researchers to make their multilingual data available to other researchers through LIDES. Also, many researchers would agree that it is a basic scientific responsibility to make data collected in a research project available to the scientific community, especially when the research was supported by public funds. No tradition of exchanging data in this way exists at the moment in research on language interaction. When researchers begin to contribute their data consistently to LIDES, the result will not be just more data on more language combinations. Research methodology will change, not only because dedicated tools for language interaction analysis will become available, but also because the regular occurrence or absolute non-occurrence of specific phenomena will become valuable arguments in scientific disputes (see Sokolov & Snow 1994 for the relationship between CHILDES and research methodology in first language acquisition).
The LIPPS project is conceived as a network of researchers who, in addition to carrying out their own research on language interaction data, are committed to the overall goal of producing a database and developing coding schemes and guidelines. Each researcher is in fact working independently on his/her own data set, but a common set of overall goals will be kept in mind:
(1) To develop standards for transcribing and coding spoken multilingual data in ways that will be of use to the participating teams and other researchers elsewhere. The intention is to produce a set of standards compatible with many different kinds of research and which will allow researchers as much freedom as possible in their transcription and analysis, while making their data compatible with the LIDES database and thus of maximum value to other researchers.
(2) To develop a computerized database (corpus) of multilingual interaction data in standardized form, including data from as wide as possible a range of multilingual situations, as a resource where researchers doing comparative studies of multilingual language behaviour can contribute their own data, and access data contributed by others. A promising aspect is that eventually researchers will see the benefits of adding their data to LIDES. By making data available to the scientific community, research results and analyses can be checked, thus improving the quality of research in the field of language interaction. Openness to mutual scrutiny can only improve research quality and the quality of the data.
(3) To develop user-friendly tools for the transcription and coding of language interaction data and the exploitation of the international database as it develops. This is pioneering work as no specialized corpus or word processing software currently exists for the transcription and analysis of language interaction.
The importance of creating this database lies first in the possibilities which it offers to maximize the use of available data, and secondly in that it allows researchers to make comparisons between different sets of data, this being the only means to provide the answer to some of the essential questions currently asked in the field of language interaction. Every research project will contribute unique data to LIDES. It is important to stress that it is the data that is unique, because researchers tend to confuse their data with the need they feel to develop a unique and personal transcription and coding system.
The reasons why it is desirable to achieve a co-ordinated
approach to this type of research go beyond the advantages of simple data-sharing.
What researchers typically want to know about patterns of language interaction
is to what extent these patterns are dictated by the particular language
combination and/or the context and circumstances which are relevant in
their study, and to what extent they are universal or at least common to
similar language sets or similar combinations of sociolinguistic circumstances.
For example, one major strand of research on code-switching focuses on
grammatical constraints on where a switch can occur within the sentence.
Time after time, constraints proposed on the basis of one data-set, and
often put forward as potentially universal, have been disproved when new
data-sets have emerged (for a recent survey, see Muysken 1995). Furthermore
it is not possible, without making comparisons of the kind we propose,
to establish the relative role of linguistic features as such and sociolinguistic,
psycholinguistic and/or contextual factors in the language interaction
patterns which are observed. Both of these are fundamental problems with
approaches based on a single data-set; they could be compared in medicine,
for example, with studying the pattern and aetiology of a disease through
a single patient, thus making it impossible to disentangle the role of
heredity and environment.
1.3 Why CHILDES?
Having decided that it was desirable to formulate a set of standards for transcribing and coding language interaction data, with the purpose of setting up an international database, we looked around for existing systems that could do what we needed. We identified one strong contender: the CHILDES system (MacWhinney 1995). CHILDES was set up to enable researchers working with child language data to code and share their data. CHILDES has been used successfully for over 10 years and is equipped with an institutional support base, specific detailed guidelines for transcribing and coding data (the CHILDES coding manual) within an existing format (CHAT) and a set of software, the CLAN programs (see glossary), which researchers can use to carry out a large range of automated analyses of the data in the database. These programs for analysis of bilingual data can be obtained by contacting Brian MacWhinney (see Chapter 7 for contact addresses). Furthermore, a number of people associated with the LIPPS group had already used CHILDES for the encoding of their multilingual data.
There were also arguments against using the existing CHILDES system for our purposes. CHILDES was not designed for mainly adult, multilingual, speech data, but for mainly monolingual adult-child interactions. Therefore the CHAT format was not necessarily the most appropriate one for the type of data researchers in this area were collecting, and the CLAN programs were not designed to answer the type of questions which those researchers would want to ask.
However, the CHAT coding scheme and the CLAN tools
in CHILDES are open to further elaborations and additions. Existing tools
can be accommodated to language interaction purposes and new coding schemes
and tools can be developed. In fact, some adaptations necessary to cope
with multilingual data can already be found in CHILDES. First of all, the
CHILDES database contains data from many different languages and the way
transcription problems have been solved for these different languages can
be of help when transcribing language interaction data. More important
in this respect, though, are the bilingual data already available. A separate
chapter of the CHILDES manual (MacWhinney 1995, Chapter 31) is devoted
to data available on bilingual acquisition, which is becoming an important
field of research. The bilingual chapter presents data sets collected by
De Houwer, Deuchar, Guthrie, Hayashi, Serra, Snow and Velasco. A separate
dependent tier (see glossary) is proposed to code information on the language
of the utterance on the main tier (see glossary), the language of the preceding
speaker and the dominant language of the speaker (see MacWhinney 1995:63;
De Houwer 1990 for a good example of this type of system).
Secondly, there is a separate subsection on code-switching
in the CHILDES manual (MacWhinney 1995: subsection 9.4) in which some useful
coding options are proposed. These options will be discussed in section
3.1 of this manual.
Opting for CHILDES does not mean that we believe that this system gives the answer to all future questions on spoken language databases. On the contrary, it is for the moment the most useful system, but it ought to be optimized in the near future. With this in mind, another point in favour of the CHILDES system is the formal way in which the system is set up. One development we can look forward to in the near future is an interface between CHAT and SGML formats. SGML (Standard Generalized Markup Language) is a metalanguage, a standard way of marking up texts (both spoken and written), which is independent of any word processor or computer system. It is already used for coding large monolingual corpora (for example, the British National Corpus of 100 million words). Because SGML is designed to be read by a computer rather than a human, "readability is less of a concern than computational tractability" (Edwards 1995:21), with the result that humans tend to interact with SGML via an interface which hides the complex coding. However, once formal definitions for language interaction data and analytic tools become available within the SGML approach to text-based data, SGML will offer the prospect of a universal set of codes for language interaction data as well as user-friendly interfaces which will allow researchers to encode numerous different scripts and have a choice of how to present the transcribed and encoded data in print and on the computer screen.
Because of its formal properties, there should be no insurmountable problems involved in transforming CHAT-based data to SGML. This prospect in fact matches the view of MacWhinney (1995: 437): "As our work in database development proceeds, we want to think in terms of a more general database of all the varieties of spoken human language".
A rather far-reaching expansion now is the linking
of original sound and video recordings to the transcribed records, which
opens new avenues for spoken language analysis. MacWhinney (1995: 438)
mentions this expansion under the heading of exploratory reality. In several
places, work is being carried out to make sound-linking available to the
language research community, for instance at the Max Planck Institute for
Psycholinguistics (Nijmegen, The Netherlands).
1.4 How different is CHAT?
Using CHAT is an attractive option for those who have not yet started transcribing their data. They need to find a way to transcribe their data anyway, though some may fear that:
As for the latter two objections, you can see from examples 1 and 2 below that transcribing in CHAT is not really that different from the traditional way of transcribing, and that the basic transcribing and coding conventions are not that difficult.
(1) a "traditional" transcription (Moyer 1992)
YVO: Excuse me could we have two coffees and some scones please?
NAT: Yvonne para mí no vayas a pedir scones de esos
     Yvonne for me not go to ask scones of these
     que ahora me estoy tratando de controlar un poquito antes de Pascua.
     that now me are trying to control a little-bit before of Christmas
     'Yvonne, don't order these scones for me because now I am trying not to put on weight before Christmas.'
YVO: si Christmas ya está round the corner mujer.
     yes Christmas already is round the corner woman
     'Mind you, Christmas is already round the corner.'
YVO: yo ya no hago dieta hasta por lo menos enero o febrero y eso con suerte.
     I already not make diet until for the less January or February and that with luck
     'I am not going on a diet until at least January or February and even then with a bit of luck.'
· there is a short introduction giving details
of speakers and languages used;
· each speaker's turn is put in a separate paragraph
following an indication of who is speaking;
· normal and italic fonts are used to indicate
the language of each word/phrase;
· the literal gloss (see glossary) for each word
is placed on the line beneath; and
· there is a free translation (see glossary) of
the conversation provided.
In the following example the same data is given in CHAT format:
(2) a CHAT transcription (Moyer 1992)
*YVO: excuse@1 me@1 could@1 we@1 have@1 two@1 coffees@1 and@1 some@1 scones@1 please@1 ?
*NAT: Yvonne@1 para@2 mí@2 no@2 vayas@2 a@2 pedir@2 scones@1 de@2 esos@2 que@2 ahora@2 me@2 estoy@2 tratando@2 de@2 controlar@2 un@2 poquito@2 antes@2 de@2 Pascua@2 .
%glo: Yvonne for me not go to ask scones of these that now me are trying of control a little-bit before of Christmas .
%tra: Yvonne, don't order these scones for me because now I am trying not to put on weight before Christmas .
*YVO: si@2 Christmas@1 ya@2 está@2 round@1 the@1 corner@1 mujer@2 .
%glo: if Christmas already is round the corner woman
%tra: mind you, Christmas is already round the corner
*YVO: yo@2 ya@2 no@2 hago@2 dieta@2 hasta@2 por@2 lo@2 menos@2 enero@2 o@2 febrero@2 y@2 eso@2 con@2 suerte@2 .
%glo: I already not make diet until for the less January or February and that with luck .
%tra: I am not going on a diet until at least January or February and even then with a bit of luck .
@End
In fact (2) uses the coding system recommended by LIDES,
and as you can see, the data looks somewhat different. However, the same
information is present:
Once your transcription is complete you can print it,
selecting just those tiers you want to appear on the page, from among all
those you have included in your transcription.
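As a rough illustration of what selecting tiers for printing involves, the following Python sketch filters a transcript held as a list of lines, with one tier per line. It is not part of the CLAN programs; the function name `select_tiers` and the one-tier-per-line layout are assumptions made only for this example.

```python
def select_tiers(lines, wanted):
    """Keep main tiers (*SPK:) and header lines (@...), plus only those
    dependent tiers (%xxx:) whose names are listed in `wanted`."""
    kept = []
    for line in lines:
        if line.startswith("%"):
            # Dependent tier: its name is the text between '%' and ':'.
            name = line[1:].split(":", 1)[0]
            if name in wanted:
                kept.append(line)
        else:
            kept.append(line)
    return kept


# Sample lines taken from example (2); the list layout is hypothetical.
transcript = [
    "*YVO: si@2 Christmas@1 ya@2 está@2 round@1 the@1 corner@1 mujer@2 .",
    "%glo: if Christmas already is round the corner woman",
    "%tra: mind you, Christmas is already round the corner",
    "@End",
]

# Print the utterance with its free translation but without the gloss.
for line in select_tiers(transcript, {"tra"}):
    print(line)
```

Passing an empty set for `wanted` would leave only the main tiers and header lines.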
An enormous benefit, obvious to every researcher who
has ever done frequency counts using pen and paper on a somewhat larger
set of data, is the fact that you can use the CLAN programs to search your
data for patterns or provide certain statistics, because all LIDES transcriptions
use plain ASCII characters (see glossary) and a common set of transcription
and coding guidelines. The latter also makes it easier to exchange and
compare data with other researchers, inside or outside the LIDES database.
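To give a sense of the kind of frequency count that consistent coding makes possible, here is a minimal Python sketch that tallies the language tags (`@1`, `@2`) per speaker on the main tiers. It is only an illustration and does not reproduce any CLAN program; the function name `count_language_tags` and the one-tier-per-line input are assumptions made for this example.

```python
import re
from collections import Counter

# A word followed by '@' and a language number, e.g. "scones@1".
TAGGED_WORD = re.compile(r"(\S+)@(\d+)")


def count_language_tags(lines):
    """Return {speaker: Counter({language_tag: word_count})} for main tiers."""
    counts = {}
    for line in lines:
        if not line.startswith("*"):
            continue  # skip dependent tiers (%...) and header lines (@...)
        speaker, _, utterance = line[1:].partition(":")
        tags = Counter(tag for _word, tag in TAGGED_WORD.findall(utterance))
        counts.setdefault(speaker, Counter()).update(tags)
    return counts


# Sample lines taken from example (2); the list layout is hypothetical.
transcript = [
    "*YVO: excuse@1 me@1 could@1 we@1 have@1 two@1 coffees@1 and@1 some@1 scones@1 please@1 ?",
    "*YVO: si@2 Christmas@1 ya@2 está@2 round@1 the@1 corner@1 mujer@2 .",
    "%glo: if Christmas already is round the corner woman",
]

for speaker, tags in count_language_tags(transcript).items():
    for tag, n in sorted(tags.items()):
        print(f"{speaker} language {tag}: {n} words")
```

Because the tags follow one convention throughout, the same few lines would work on any transcript coded this way, which is the practical point of a shared standard.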
For those who have already transcribed (albeit in a different way) and (partially) analyzed their data, the answer to the question "what do I have to do to make my data suitable for the LIDES database or for analysis with the CLAN programs?" is more complex. There are two possibilities:
· You still wish to analyze your data. In
this case it might still be a good idea to put the transcripts in the CHAT
format, especially if your data set is quite large, because it will allow
you to use the CLAN programs to analyze your data. Provided that you have
the transcripts in a computerized version and not just as a hard copy (on
paper), this should not be too much work; you simply put it in plain ASCII
format (most wordprocessors have the ability to do this automatically)
and start putting in the necessary CHAT symbols and codes.
· You have finished transcribing and do
not wish to analyze your data any further. In this case you probably
would not want to go through all the trouble of putting your transcripts
into CHAT format. It is, however, possible to add transcripts that are
not in CHAT format to the LIDES database. Although the LIDES system strongly
recommends the use of a transcription system based on CHAT to transcribe
data, already existing data in a different format will not be excluded from
LIDES. Such data is still interesting and can be used productively by other
language interaction researchers if included in the database. One of our
aims is to develop adequate tools to convert these data sets into the transcription
format utilized by the CLAN automatic analysis programs.
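As a hedged sketch of a first step in such a conversion, the Python fragment below turns speaker turns laid out as in example (1) into main-tier lines laid out as in example (2). It is not one of the conversion tools mentioned above. It assumes three-letter upper-case speaker codes followed by a colon, and the name `to_main_tier` is hypothetical. Language tags and dependent tiers would still have to be added afterwards.

```python
import re

# A speaker turn in the "traditional" layout: three-letter code, colon, text.
TURN = re.compile(r"^([A-Z]{3}):\s*(.*)$")


def to_main_tier(line):
    """Turn 'YVO: text.' into '*YVO: text .'; return None for other lines."""
    match = TURN.match(line)
    if match is None:
        return None  # gloss, translation or continuation line
    speaker, text = match.groups()
    # Separate a final '.', '?' or '!' from the last word, as in example (2).
    text = re.sub(r"\s*([.?!])\s*$", r" \1", text)
    return f"*{speaker}: {text}"


# Sample lines taken from example (1).
traditional = [
    "YVO: si Christmas ya está round the corner mujer.",
    "     yes Christmas already is round the corner woman",
]

for line in traditional:
    converted = to_main_tier(line)
    if converted is not None:
        print(converted)
```

Lines that do not start with a speaker code are skipped here; a fuller converter would need to decide which of them become `%glo` or `%tra` tiers.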
1.5 Why put data in LIDES?
The LIDES system is open to everybody who is interested in language interaction phenomena. This means that there are different ways to participate in this project. Some users will just want to consult the LIDES database because they lack the data they need to carry out research. Others will contribute their own data sets to the database. Some contributors may have used their own transcription and coding system. Other contributors will simply adopt the proposed CHAT transcription and coding system.
All contributors are encouraged to make their audio and video recorded material available. Access to this material enables other investigators to use a given corpus to pursue research on an aspect which may be different from the original research questions for which the data was initially collected. In this way, a corpus which was collected to carry out some sort of quantitative analysis can later be used by researchers working on qualitative analysis. In addition, access to audio and/or video material will allow other researchers to confirm transcription and coding as well as carry out multimedia analysis. All users of the database will be requested to comply with standard research ethics to guarantee the anonymity of informants.
In the previous sections (1.1 through 1.3) much has been
said already about the general reasons for creating a language interaction
database. In summary, such a database will:
Other researchers can also add further coding tiers to a data set, and any researcher can then make use of the new tier. For example, someone who is interested in phonological analysis would not necessarily add a morphology coding tier to his/her data set. Another researcher may wish to use this data set to conduct a morphological analysis, and must therefore add a morphological coding tier. Once this tier has been added to the original data set, it becomes available to other researchers as a further resource.
Some fear that by adding their data to a public database they will lose control over what happens with their data. This is a risk a researcher takes every time he/she publishes or otherwise makes his/her findings public, but the risk may seem somewhat larger when it comes to adding one's data to a database. But as with any publication there is a moral obligation for users to acknowledge the source of the data they use in their study. Careful consideration will be given to the public use of data contributed to the database and to the requirement for anonymity of speakers.
1.6 About this manual
This manual is organized as follows. Chapters 2 and 3 are devoted to the basic steps for preparing language interaction data for analysis. Special attention is given to the steps involved in the preparation and organization of the data and to the most crucial minimal requirements for transcription and tagging. Chapter 4 presents more advanced aspects of data transcription, tagging and coding. Chapter 5 discusses coding schemes which are relevant to multilingual databases. Some proposals for coding data for specific research interests are discussed. Chapter 6 concentrates on the set of programs that automate data analysis and how to use them in the analytical research stages. Finally, Chapter 7 contains practical information about contact addresses, how to obtain the LIDES programs and advice about their use, and a list of databases currently available.
It should also be pointed out that future editions
of this manual will incorporate new proposals. LIDES users are encouraged
to support the development and improvement of the system not just through
contributing their data, but also through making their own proposals for
coding study-specific information and for programming. The LIDES enterprise
does not imply a deterministic view as to the best way of transcribing,
coding and/or analysing bilingual data. The complex nature of the data
demands a flexible and open-minded approach to the theoretical decisions
involved.