The LIDES Coding Manual was published in 2000 as a special issue of the International Journal of Bilingualism (issue 4:2, pp. 131-270). As an indication of the purpose and content of this document, we reproduce here the title page, Foreword, contents page and Chapter 1 of the Manual.

You can purchase copies of the manual by contacting the publisher:
Kingston Press Ltd., 43, Derwent Road, Whitton, Twickenham, Middlesex, TW2 7HQ, England, U.K.
Tel./ Fax int +44 208 893 3015  E-mail: <sales@kingstonpress.com>  Web page: <http://www.kingstonpress.com>




International Journal of Bilingualism
Cross-Disciplinary, Cross-Linguistic Studies of Language Behavior
ISSN 1367-0069 Volume 4  Number 2 June, 2000
Special Issue: ISBN 0-9533353-6-4  
                   Contributors (in alphabetical order)     
© 2000 Kingston Press Ltd.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying or otherwise without the permission of the copyright holder.


  Foreword

Over the last few decades, the study of bilingual and plurilingual talk has been an important focus for many linguists. Considerable amounts of data have been collected through projects, large and small, in many countries and involving many different languages and dialects. It has become a source of frustration for many who work in this area that there is no basic standard for transcribing data of this kind, nor any central resource to enable researchers to share their data with each other. Researchers in bilingualism at the moment can only share data by making a private arrangement. Meanwhile, researchers in other fields such as language acquisition have both standard ways of transcribing and coding data, and international databases to which they can contribute and on which they can draw for comparative data.

We, the LIPPS group, hope that this Manual will benefit research and researchers in our field in two ways.

For a researcher who is new to the work of transcribing and coding bilingual data, or who has done similar work before but has a new set of data waiting to be transcribed and coded, it describes step-by-step a way of carrying out the transcription and coding which provides many useful facilities and makes it possible to use already existing computer-based analytical tools. For the beginner, we hope that this Manual will provide answers to many of the basic questions relating to transcription of bilingual data. Where it cannot provide answers, it may at least help researchers to consider what they are doing and how their decisions may affect the ways in which their data may be useful to other researchers. We hope that our recommendations in this Manual will be adopted widely as "best practice".

In addition, we hope that the existence of a set of basic standards for transcribing and coding bilingual data will encourage research on language interaction from an interdisciplinary perspective. Thus a special effort is made in this manual to cater for the needs of researchers working in very different fields. The proposals made here are intended equally to help those who are interested in quantitative research and those who are interested in qualitative research. We hope to make the researcher's task of making sense of the data easier by means of useful guidelines where effective solutions are proposed for problems that may arise in the processes of transcription and coding. In addition, some user-friendly computational tools are discussed which provide support in exploring and analysing language interaction data.

The authors of the present manual have worked together and contributed at different stages to the creation of this manual. Their names are listed in alphabetical order on the title page. We would like to put on record here our appreciation for the support of many others: Ad Backus, Maria Carmen Domínguez, Pieter Muysken, Roberto Perez, Mukul Saxena as well as the institutions who have supplied funding and facilities for holding our research meetings: Birkbeck College, Dirección General de Ciencia y Tecnologia (Ministerio de Educación y Ciencia), Lancaster University, Max Planck Institute, The British Council, Tilburg University, Universitat Autònoma de Barcelona, Universitat Pompeu Fabra. For citation purposes, the authors of this manual should be given as "The LIPPS Group".

The manual is still a working document which we plan to improve and refine in successive versions. For where to send comments on the text and further details see Chapter 7 on Practical information. We, the LIPPS group, are happy to offer this Manual to researchers in our field and warmly invite your comments and suggestions.



CONTENTS
 

 




Chapter 1: A data exchange system for language interaction
 
 
1.1 The origin of LIPPS and LIDES

The LIDES Coding Manual is the outcome of joint work carried out at meetings held in Ljouwert/Leeuwarden (September 1994), London (January 1995), Barcelona (September 1995), Nijmegen (April 1996) and Barcelona (January 1997), which aimed at creating an international language interaction data exchange system (LIDES). Research on language interaction focuses on language practices in multilingual or multidialectal communities from a variety of social and linguistic perspectives. We have adopted the term language interaction rather than the more commonly used terms "code-switching" and "language contact" in order to include all manifestations of language contact whether or not the varieties under study are held to belong to two discrete systems.

 At these meetings, the purpose of the LIDES project was discussed and proposals for data transcription and coding were explored. In the field of language interaction we feel there is a rich source both of interesting research questions and relevant data to answer them. However, we believe that many of these questions will not be answerable - and therefore may not even be asked - until researchers working on different language combinations, in different social contexts and drawing on different research traditions, are in a position to exchange and compare data freely with each other.

The idea of setting up a research group for encouraging the study of language interaction phenomena was first conceived by several researchers at the Ljouwert/Leeuwarden meeting. Penelope Gardner-Chloros (Birkbeck College), Roeland van Hout (Tilburg University), Melissa G. Moyer (Universitat Autònoma de Barcelona) and Mark Sebba (University of Lancaster) together with Pieter Muysken and François Grosjean committed themselves to the establishment of an international language interaction data exchange system. The core group was formally constituted in Barcelona under the name of LIPPS - Language Interaction in Plurilingual and Plurilectal Speakers. Participation in the enterprise was organized as follows:
 

The LIPPS group agreed that the creation of a language interaction data exchange system (LIDES) would require a common set of transcription and coding guidelines. This meant that there had to be a consensus on a basic minimum of transcription and coding conventions so that people undertaking research in different linguistic disciplines such as discourse analysis or syntax could benefit from LIDES. Likewise, there is a need for providing LIDES users with dedicated computational tools that automate and support data analysis. These are the issues on which this manual concentrates. Using the CHILDES (CHIld Language Data Exchange System) project (MacWhinney 1995) as its starting point, LIDES concentrates on using and developing transcription guidelines, coding systems, and computational tools which are suited to the study of language interaction data (e.g. the marking of such phenomena as calques, borrowings, code-switches, language mixing sequences, discourse patterns).
 

1.2 The goals of LIDES

There is now a wealth of material available on language interaction, both published and unpublished, in the form of Ph.D. theses and similar work. A CD-ROM search of titles in Linguistics Abstracts containing the term code-switching (see glossary) produced over 800 titles. The great majority of these studies involve the collection of new sets of data by individual researchers; so while there is a lot of often painstakingly collected data around, it seems likely that the endless collection of more data for analysis will no longer constitute the most productive application of research efforts.

 In spite of the high level of interest, however, no co-ordinated system has yet been developed for researchers to make use of one another's data. On the contrary, the data is only available, if at all, in widely disparate forms, and coding and transcription practices vary widely. Access to the original data (usually audio-recordings) may be necessary in order to verify the significance of the written transcripts, but it is even rarer that anyone except the original researcher gains access to this material.

The CHILDES enterprise (MacWhinney 1995, MacWhinney & Snow 1990) shows us the enormous advantage of extensive databases (see glossary) in research fields where data on spoken, spontaneous language is essential. Clearly, the acronym of LIDES is based on CHILDES. We hope that the CHILDES project will inspire researchers to make their multilingual data available to other researchers through LIDES. Also, many researchers would agree that it is a basic scientific responsibility to make data collected in a research project available to the scientific community, especially when the research was supported by public funds. No tradition of exchanging data in this way exists at the moment in research on language interaction. When researchers begin to contribute their data consistently to LIDES, the result will not be just more data on more language combinations. Research methodology will change, not only because dedicated tools for language interaction analysis will become available, but also because the regular occurrence or absolute non-occurrence of specific phenomena will become valuable arguments in scientific disputes (see Sokolov & Snow 1994 for the relationship between CHILDES and research methodology in first language acquisition).

The LIPPS project is conceived as a network of researchers who, in addition to carrying out their own research on language interaction data, are committed to the overall goal of producing a database and developing coding schemes and guidelines. Each researcher is in fact working independently on his/her own data set, but a common set of overall goals will be kept in mind:

(1) To develop standards for transcribing and coding spoken multilingual data in ways that will be of use to the participating teams and other researchers elsewhere. The intention is to produce a set of standards compatible with many different kinds of research and which will allow researchers as much freedom as possible in their transcription and analysis, while making their data compatible with the LIDES database and thus of maximum value to other researchers.

(2) To develop a computerized database (corpus) of multilingual interaction data in standardized form, including data from as wide as possible a range of multilingual situations, as a resource where researchers doing comparative studies of multilingual language behaviour can contribute their own data, and access data contributed by others. A promising aspect is that eventually researchers will see the benefits of adding their data to LIDES. By making data available to the scientific community, research results and analyses can be checked, thus improving the quality of research in the field of language interaction. Openness to mutual scrutiny can only improve research quality and the quality of the data.

(3) To develop user-friendly tools for the transcription and coding of language interaction data and the exploitation of the international database as it develops. This is pioneering work as no specialized corpus or word processing software currently exists for the transcription and analysis of language interaction.

The importance of creating this database lies first in the possibilities which it offers to maximize the use of available data, and secondly in that it allows researchers to make comparisons between different sets of data, this being the only means to provide the answer to some of the essential questions currently asked in the field of language interaction. Every research project will contribute unique data to LIDES. It is important to stress that it is the data that is unique, because researchers tend to confuse their data with the need they feel to develop a unique and personal transcription and coding system.

 The reasons why it is desirable to achieve a co-ordinated approach to this type of research go beyond the advantages of simple data-sharing. What researchers typically want to know about patterns of language interaction is to what extent these patterns are dictated by the particular language combination and/or the context and circumstances which are relevant in their study, and to what extent they are universal or at least common to similar language sets or similar combinations of sociolinguistic circumstances. For example, one major strand of research on code-switching focuses on grammatical constraints on where a switch can occur within the sentence. Time after time, constraints proposed on the basis of one data-set, and often put forward as potentially universal, have been disproved when new data-sets have emerged (for a recent survey, see Muysken 1995). Furthermore it is not possible, without making comparisons of the kind we propose, to establish the relative role of linguistic features as such and sociolinguistic, psycholinguistic and/or contextual factors in the language interaction patterns which are observed. Both of these are fundamental problems with approaches based on a single data-set; they could be compared in medicine, for example, with studying the pattern and aetiology of a disease through a single patient, thus making it impossible to disentangle the role of heredity and environment.
 

 1.3 Why CHILDES?

Having decided that it was desirable to formulate a set of standards for transcribing and coding language interaction data, with the purpose of setting up an international database, we looked around for existing systems that could do what we needed. We identified one strong contender: the CHILDES system (MacWhinney 1995). CHILDES was set up to enable researchers working with child language data to code and share their data. CHILDES had been successfully used for over 10 years and is equipped with an institutional support base, specific detailed guidelines for transcribing and coding data (the CHILDES coding manual) within an existing format (CHAT) and a set of software, the CLAN programs (see glossary), which researchers can use to carry out a large range of automated analyses of the data in the database. These programs for analysis of bilingual data can be obtained by contacting Brian MacWhinney (see Chapter 7 for contact addresses). Furthermore, a number of people associated with the LIPPS group had already used CHILDES for the encoding of their multilingual data.

 There were also arguments against using the existing CHILDES system for our purposes. CHILDES was not designed for mainly adult, multilingual, speech data, but for mainly monolingual adult-child interactions. Therefore the CHAT format was not necessarily the most appropriate one for the type of data researchers in this area were collecting, and the CLAN programs were not designed to answer the type of questions which those researchers would want to ask.

 However, the CHAT coding scheme and the CLAN tools in CHILDES are open to further elaborations and additions. Existing tools can be accommodated to language interaction purposes and new coding schemes and tools can be developed. In fact, some adaptations necessary to cope with multilingual data can already be found in CHILDES. First of all, the CHILDES database contains data from many different languages and the way transcription problems have been solved for these different languages can be of help when transcribing language interaction data. More important in this respect, though, are the bilingual data already available. A separate chapter of the CHILDES manual (MacWhinney 1995, Chapter 31) is devoted to data available on bilingual acquisition, which is becoming an important field of research. The bilingual chapter presents data sets collected by De Houwer, Deuchar, Guthrie, Hayashi, Serra, Snow and Velasco. A separate dependent tier (see glossary) is proposed to code information on the language of the utterance on the main tier (see glossary), the language of the preceding speaker and the dominant language of the speaker (see MacWhinney 1995:63; De Houwer 1990 for a good example of this type of system).
 Secondly, there is a separate subsection on code-switching in the CHILDES manual (MacWhinney 1995: subsection 9.4) in which some useful coding options are proposed. These options will be discussed in section 3.1 of this manual.

 Opting for CHILDES does not mean that we believe that this system gives the answer to all future questions on spoken language databases. On the contrary, it is for the moment the most useful system, but it ought to be optimized in the near future. With this in mind, another point in favour of the CHILDES system is the formal way in which the system is set up. One development we can look forward to in the near future is an interface between CHAT and SGML formats. SGML (Standard Generalized Markup Language) is a metalanguage, a standard way of marking up texts (both spoken and written), which is independent of any word processor or computer system. It is already used for coding large monolingual corpora (for example, the British National Corpus of 100 million words).  Because SGML is designed to be read by a computer rather than a human, "readability is less of a concern than computational tractability" (Edwards 1995:21), with the result that humans tend to interact with SGML via an interface which hides the complex coding. However, once formal definitions for language interaction data and analytic tools become available within the SGML approach to text-based data, SGML will offer the prospect of a universal set of codes for language interaction data as well as user-friendly interfaces which will allow researchers to encode numerous different scripts and have a choice of how to present the transcribed and encoded data in print and on the computer screen.

Because of its formal properties, there should be no insurmountable problems involved in transforming CHAT-based data to SGML. This prospect in fact matches the view of MacWhinney (1995: 437): "As our work in database development proceeds, we want to think in terms of a more general database of all the varieties of spoken human language".

A rather far-reaching expansion now is the linking of original sound and video recordings to the transcribed records, which opens new avenues for spoken language analysis. MacWhinney (1995: 438) mentions this expansion under the heading of exploratory reality. In several places, work is being carried out to make sound-linking available to the language research community, for instance at the Max Planck Institute for Psycholinguistics (Nijmegen, The Netherlands).
 

1.4 How different is CHAT?

Using CHAT is an attractive option for those who have not yet started transcribing their data. They need to find a way to transcribe their data anyway, though some may fear that:

As for the first objection, the existing CHAT transcribing and coding conventions are already very flexible, making it possible for the researcher to reflect many kinds of phenomena that occur in natural speech data. We will give some examples of existing CHAT transcribing and coding options in section 2.2. Furthermore, you can add every type of code you want, as long as you use it consistently and define it properly, which is what you would have to do anyway, be it in a more traditional format or the CHAT format.

 As for the latter two objections, you can see from examples 1 and 2 below that transcribing in CHAT is not really that different from the traditional way of transcribing, and that the basic transcribing and coding conventions are not that difficult.


(1) a "traditional" transcription (Moyer, 1992)
 



In this traditional transcription we can see:

· there is a short introduction giving details of speakers and languages used;
· each speaker's turn is put in a separate paragraph following an indication of who is speaking;
· normal and italic fonts are used to indicate the language of each word/phrase;
· the literal gloss (see glossary) for each word is placed on the line beneath; and
· there is a free translation (see glossary) of the conversation provided.

In the following example the same data is given in CHAT format:


(2) a CHAT transcription (Moyer 1992)
 


In fact (2) uses the coding system recommended by LIDES, and as you can see, the data looks somewhat different. However, the same information is present:
 

From these two examples, you can see that the CHAT transcription is actually very similar to the traditional one. Once you are familiar with the conventions, not much extra work is involved in transcribing the data. Admittedly, it takes some time to become familiar with the conventions. Also, some extra work is inevitable, for example tagging each word with a language code, though your word processor or editor may be able to help you with this. However, once you have done all the work the benefits are substantial. To name a few: CHAT is very flexible and there are many possibilities. You can create additional headers which provide information such as the age of the participants, the name of the transcriber and the date of the recording. You can create additional dependent tiers on which you can provide a gloss (%glo), a translation (%tra), or code grammatical, pragmatic or other information. You can devise codes to label problematic words which do not clearly belong to one language or another. You can add as many dependent tiers as you need.
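To make the layout of headers, main tiers and dependent tiers more concrete, here is a minimal Python sketch that reads a CHAT-style transcript into those three parts. It is not one of the CLAN programs. It assumes the usual CHAT line prefixes (`@` for headers, `*` for a speaker's main tier, `%` for dependent tiers, with a tab after the label). The sample utterance, the speaker code `ANA`, the header values and the `word@1` / `word@2` language tags are invented purely for illustration.

```python
# Illustrative sketch only: a tiny reader for CHAT-style lines.
# Assumed prefixes: "@" header, "*" main tier, "%" dependent tier.
# The sample data and the word@N language tags are hypothetical.

SAMPLE = """\
@Begin
@Participants:\tANA Speaker
@Transcriber:\thypothetical
*ANA:\tyo@2 quiero@2 a@1 sandwich@1 .
%glo:\tI want a sandwich
%tra:\tI want a sandwich
@End
"""


def parse_chat(text):
    """Return (headers, utterances) for a CHAT-style transcript."""
    headers = {}
    utterances = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The label ("@Transcriber:", "*ANA:", "%glo:") precedes the first tab.
        label, _, content = line.partition("\t")
        label = label.rstrip(":")
        if label.startswith("@"):
            headers[label[1:]] = content
        elif label.startswith("*"):
            utterances.append({"speaker": label[1:], "main": content, "tiers": {}})
        elif label.startswith("%") and utterances:
            # A dependent tier belongs to the most recent main tier.
            utterances[-1]["tiers"][label[1:]] = content
    return headers, utterances


headers, utterances = parse_chat(SAMPLE)
print(headers["Transcriber"])
print(utterances[0]["tiers"]["glo"])
```

Attaching each dependent tier to the preceding main tier mirrors the way a gloss or translation line refers back to the utterance above it.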

Once your transcription is complete you can print it, selecting just those tiers you want to appear on the page, from among all those you have included in your transcription.
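The idea of printing only selected tiers can be sketched in a few lines of Python. This is an illustration of the principle, not the CLAN facility itself. The transcript lines are invented, and the `%mor` tier name and its content are hypothetical placeholders.

```python
# Illustrative sketch only: keep the main tier plus chosen dependent tiers.
# The transcript lines below are hypothetical examples.

TRANSCRIPT = [
    "*ANA:\tyo@2 quiero@2 a@1 sandwich@1 .",
    "%glo:\tI want a sandwich",
    "%tra:\tI want a sandwich",
    "%mor:\thypothetical morphological coding",
]


def select_tiers(lines, wanted):
    """Drop every dependent tier whose name is not in `wanted`."""
    kept = []
    for line in lines:
        if line.startswith("%"):
            name = line[1:].split(":", 1)[0]
            if name not in wanted:
                continue
        kept.append(line)
    return kept


for line in select_tiers(TRANSCRIPT, {"tra"}):
    print(line)
```

Here only the main tier and the translation tier would be printed; passing a different set of tier names gives a different view of the same file.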
An enormous benefit, obvious to every researcher who has ever done frequency counts using pen and paper on a somewhat larger set of data, is the fact that you can use the CLAN programs to search your data for patterns or provide certain statistics, because all LIDES transcriptions use plain ASCII characters (see glossary) and a common set of transcription and coding guidelines. The latter also makes it easier to exchange and compare data with other researchers, inside or outside the LIDES database.
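As a rough illustration of the kind of count that becomes automatic once every word carries a language code, the Python sketch below tallies word tokens per language on the main tiers. It stands in for, and does not reproduce, the CLAN programs. The two utterances and the `word@1` / `word@2` tag syntax are assumptions made for the example.

```python
# Illustrative sketch only: frequency counts over language-tagged words.
# The utterances and the word@N tag syntax are hypothetical.
from collections import Counter

MAIN_TIERS = [
    "*ANA:\tyo@2 quiero@2 a@1 sandwich@1 .",
    "*BEN:\ta@1 sandwich@1 de@2 queso@2 ?",
]


def count_by_language(main_tiers):
    """Count word tokens per language tag, and tokens per (word, tag) pair."""
    tokens = Counter()
    words = Counter()
    for line in main_tiers:
        _, _, content = line.partition("\t")
        for item in content.split():
            word, sep, tag = item.rpartition("@")
            if not sep:  # punctuation or an untagged item
                continue
            tokens[tag] += 1
            words[(word, tag)] += 1
    return tokens, words


tokens, words = count_by_language(MAIN_TIERS)
print(dict(tokens))
print(words[("sandwich", "1")])
```

Because the counts are keyed on the tag rather than on any particular language, the same function would work unchanged for any language pair coded in the same way.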

 For those who already have transcribed (albeit in a different way) and analyzed (partially) their data the answer to the question "what do I have to do to make my data suitable for the LIDES database or for analysing my data with CLAN programs?" is more complex. There are two possibilities:

· You still wish to analyze your data. In this case it might still be a good idea to put the transcripts in the CHAT format, especially if your data set is quite large, because it will allow you to use the CLAN programs to analyze your data. Provided that you have the transcripts in a computerized version and not just as a hard copy (on paper), this should not be too much work; you simply put it in plain ASCII format (most word processors have the ability to do this automatically) and start putting in the necessary CHAT symbols and codes.
 
· You have finished transcribing and do not wish to analyze your data any further. In this case you probably would not want to go through all the trouble of putting your transcripts into CHAT format. It is, however, possible to add transcripts that are not in CHAT format to the LIDES database. Although the LIDES system strongly recommends the use of a transcription system based on CHAT to transcribe data, already existing data in a different format will not be excluded from LIDES. Such data is still interesting and can be used productively by other language interaction researchers if included in the database. One of our aims is to develop adequate tools to convert these data sets into the transcription format utilized by the CLAN automatic analysis programs.
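For the first of these two possibilities, a useful preliminary when moving an existing transcript to plain ASCII is to find out which characters fall outside ASCII in the first place. The Python sketch below only reports them and leaves the choice of replacement to the transcriber. The sample text is invented, and nothing here is prescribed by LIDES or CHAT.

```python
# Illustrative sketch only: locate characters that are not plain ASCII.
# The sample transcript is hypothetical.

def find_non_ascii(text):
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


sample = "*ANA:\tma\u00f1ana@2 then@1 .\n%tra:\ttomorrow then"
for hit in find_non_ascii(sample):
    print(hit)
```

Reporting rather than silently converting keeps the decision about how to represent each character with the researcher, who can then apply one convention consistently.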
 

1.5 Why put data in LIDES?

The LIDES system is open to everybody who is interested in language interaction phenomena. This means that there are different ways to participate in this project. Some users will just want to consult the LIDES database because they lack the data they need to carry out research. Others will contribute their own data sets to the database. Some contributors may have used their own transcription and coding system. Other contributors will simply adopt the proposed CHAT transcription and coding system.

All contributors are encouraged to make their audio and video recorded material available. Access to this material enables other investigators to use a given corpus to pursue research on an aspect which may be different from the original research questions for which the data was initially collected. In this way, a corpus which was collected to carry out some sort of quantitative analysis can later be used by researchers working on qualitative analysis. In addition, access to audio and/or video material will allow other researchers to confirm transcription and coding as well as carry out multimedia analysis. All users of the database will be requested to comply with standard research ethics to guarantee the anonymity of informants.

In the previous sections (1.1 through 1.3) much has been said already about the general reasons for creating a language interaction database. In summary, such a database will:
 

On a more personal level, one could take the following reasons into consideration: first of all, if a researcher wishes to compare her/his data to the data of other researchers, it is no more than reasonable that he/she, in turn, would add her/his data to the database for others to compare their data with.
 Secondly, when someone uses a data set looking at it from a different angle, his/her results may be an inspiration for the original researcher to look at his/her own data again, and may give her/him new ideas for analysis.

 Also there is the possibility for other researchers of adding additional tiers of coding to the data set. Any researcher could then make use of this new coding tier. For example, someone who is interested in phonological analysis would not necessarily add a morphology coding tier to his/her data set. Another researcher may wish to use this data set to conduct a morphological analysis, and must therefore make a morphological coding tier. Once this tier has been added to the original data set, it becomes available to other researchers as a further resource.
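The following Python sketch shows one way a second researcher's codings could be added to an existing transcript as a new dependent tier, leaving the original tiers untouched. It is an illustration under stated assumptions: the tier names `%pho` and `%mor`, the sample lines and the coding strings are all hypothetical, and exactly one coding per main tier is assumed.

```python
# Illustrative sketch only: add a new dependent tier to an existing transcript.
# Tier names, sample lines and codings are hypothetical.
# Assumes exactly one coding is supplied for each main tier.

def add_tier(lines, name, codings):
    """Append a `%name:` tier at the end of each utterance block."""
    out = []
    pending = None  # coding still to be written for the current utterance
    remaining = iter(codings)
    for line in lines:
        # A new main tier or a header closes the previous utterance block.
        if line.startswith("*") or line.startswith("@"):
            if pending is not None:
                out.append(f"%{name}:\t{pending}")
                pending = None
        if line.startswith("*"):
            pending = next(remaining)
        out.append(line)
    if pending is not None:
        out.append(f"%{name}:\t{pending}")
    return out


original = [
    "@Begin",
    "*ANA:\tyo@2 quiero@2 a@1 sandwich@1 .",
    "%pho:\thypothetical phonological coding",
    "*BEN:\tvale@2 .",
    "@End",
]
extended = add_tier(original, "mor", ["coding for utterance 1", "coding for utterance 2"])
for line in extended:
    print(line)
```

Since the function returns a new list and does not alter the input, the original contributor's version of the data remains available alongside the extended one.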

 Some fear that by adding their data to a public database they will lose control over what happens with their data. This is a risk a researcher takes every time he/she publishes or otherwise makes his/her findings public, but the risk may seem somewhat larger when it comes to adding one's data to a database. But as with any publication there is a moral obligation for users to acknowledge the source of the data they use in their study. Careful consideration will be given to the public use of data contributed to the database and to the requirement for anonymity of speakers.

1.6 About this manual
 
 

This manual is organized as follows. Chapters 2 and 3 are devoted to the basic steps for preparing language interaction data for analysis. Special attention is dedicated to the steps involved in the preparation and organization of the data and to the most crucial minimal requirements for transcription and tagging. Chapter 4 presents more advanced aspects of data transcription, tagging and coding. Chapter 5 discusses coding schemes which are relevant to multilingual databases. Some proposals for coding data for specific research interests are discussed. Chapter 6 concentrates on the set of programs that automate data analysis and how to use them in the analytical research stages. Finally, Chapter 7 contains practical information about contact addresses, how to obtain the LIDES programs and advice about their use, and a list of databases currently available.

 It should also be pointed out that future editions of this manual will incorporate new proposals. LIDES users are encouraged to support the development and improvement of the system not just through contributing their data, but also through making their own proposals for coding study-specific information and for programming. The LIDES enterprise does not imply a deterministic view as to the best way of transcribing, coding and/or analysing bilingual data. The complex nature of the data demands a flexible and open-minded approach to the theoretical decisions involved.
 

 
 
 
