CRG Timetable - Term 2: 17th January-21st March 2011
All meetings for Term 2 are in Meeting Room 1, FASS Building, at 3 pm, unless otherwise stated.
wk 11. (17 January 2011) - Will Simm (School of Computing and Communications, Lancaster University) ViewKi: Exploratory Interactive Navigation of Short Comments
wk 12. (24 January 2011) - Nicholas Groom (University of Birmingham) Corpus Perspective on Turn-Taking in University Seminars
Please note changes in time and venue: wk 13. (31 January 2011, 3-5pm) Faraday Building A205 - Edward J.L. Bell (Lancaster University) R: More than just a letter of the alphabet (Workshop)
wk 14. (7 February 2011) no meeting
wk 15. (14 February 2011) - Cathy Lonngren-Sampaio (University of Hertfordshire) The construction and analysis of a computerised corpus of child bilingual language
Please note change in day: wk 16. (Thursday 24 February 2011) - Richard Xiao (Edge Hill University) Contrastive Corpus Linguistics: Cross-linguistic contrast of English and Chinese presentation
wk 17. (28 February 2011) - Jana Tereick (University of Hamburg) Climate Change in 'Old' and 'New' Media: Towards a Multimodal Corpus-Assisted Discourse Analysis
wk 18. (7 March 2011) - Ghada Mohamed (Lancaster University) Text Classification of the BNC using Corpus and Statistical Methods.
Please note changes in day and venue: wk 19 (1). (16 March 2011, 3-5pm) Faraday Building A036 - Jana Tereick (University of Hamburg) Workshop: Multimodal Data Transcription and Analysis Software
Please note changes in day and venue: wk 19 (2). (18 March 2011) IAS Building Meeting Room 3 - Gisle Andersen (Norwegian School of Economics and Business Administration (NHH)) The spread of English: Evaluating the effect of language policy decisions on spelling adaptation of Norwegian anglicisms
Please note changes in day and venue: wk 20. (23 March 2011) IAS Building Meeting Room 3 - Scott Piao (School of Computing and Communications, Lancaster University) Understanding People’s Interests from Twitter – An Application of Corpus-based NLP Techniques
wk 11 Monday 17 January 2011
ViewKi: Exploratory Interactive Navigation of Short Comments
Will Simm
(School of Computing and Communications, Lancaster University)
The Voice Your View project aims to mobilise the tacit knowledge of a community to transform public spaces to be safer and more inclusive.
Voice Your View will collect real-time information that can then be structured, stored in an online repository, and exchanged with appropriate stakeholders: other users, local community groups, local authorities, etc. Voice Your View has operated a number of trials using emerging technologies to summarise live public commentary.
The ViewKi extends the Voice Your View concept by allowing users to explore the comments received by the system. Before, users were presented with a non-interactive summary of comments received; now users can interact with the data and see similar comments that have been left by others.
wk 12 Monday 24 January 2011
Corpus Perspective on Turn-Taking in University Seminars
Nicholas Groom
(University of Birmingham)
The fundamental aim of seminars and other forms of small-group interaction in higher education is to get learners to talk, and the underlying assumption shared by educational theorists and university teachers alike is that the more learners talk, the more they will learn, and thus the more successful the seminar will be. But what does ‘more talk’ mean? Is it to be measured in terms of the number of words spoken by learners, or by the number of turns that learners take, or by the average length of learners’ turns, or perhaps by some composite of these (and perhaps other) measures? In this talk I will present a ‘work in progress’ report on a study of the British Academic Spoken English Corpus (BASE), in which Oliver Mason and I are investigating learner and teacher contributions to seminars according to each of these three measures. BASE is particularly well suited to our research interests not only because it includes a seminars subcorpus, but also because it is divided into four different ‘knowledge domains’: humanities, social sciences, life sciences and physical sciences. This allows us to ask whether turn-taking patterns in university seminars are subject to any form of systematic disciplinary variation.
Our main finding so far is that different knowledge domains perform better according to different measures. Specifically, if we define talk in terms of total words spoken, we find that students talk the most in seminars in the humanities and social sciences; if on the other hand we quantify talk in terms of number of turns, then students in the physical sciences are found to talk the most; and if we measure talk in terms of average turn length, then students in life sciences disciplines come to the fore.
I will then offer some possible explanations for these trends by taking a closer qualitative look at some examples of seminar interactions in each of these four knowledge domains. Following this, I will argue against the idea that any one of these measures might be inherently better or more desirable than the others. I will suggest instead that each of these different versions of ‘talking more’ carries with it a different set of affordances, each of which is more or less well attuned to the particular epistemological and pedagogic goals of different academic disciplines. I will conclude by considering the implications of this argument for staff development and training programmes in higher education.
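The three measures of talk discussed in this abstract (total words, number of turns, and average turn length) can be illustrated with a minimal Python sketch. The transcript below is an invented toy example, not BASE data, and the function name `talk_measures` is hypothetical; each tuple is treated as one turn and words are split on whitespace, which is a simplifying assumption rather than the authors' procedure.

```python
from collections import defaultdict

# Invented toy transcript as (speaker role, utterance) pairs; not BASE data.
transcript = [
    ("teacher", "What do you make of the second reading"),
    ("student", "I thought the argument was circular"),
    ("teacher", "Why"),
    ("student", "Because the conclusion is assumed in the first premise"),
]


def talk_measures(turns):
    """Return total words, number of turns and mean turn length per role."""
    words = defaultdict(int)
    counts = defaultdict(int)
    for role, utterance in turns:
        # Assumption: a word is any whitespace-separated token.
        words[role] += len(utterance.split())
        counts[role] += 1
    return {
        role: {
            "total_words": words[role],
            "turns": counts[role],
            "mean_turn_length": words[role] / counts[role],
        }
        for role in counts
    }


measures = talk_measures(transcript)
for role, values in measures.items():
    print(role, values)
```

In this toy data both roles take the same number of turns, yet the student produces more words and longer turns, which shows how the ranking of speakers can depend on the measure chosen.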
wk. 13 Monday 31 January 2011
R: More than just a letter of the alphabet
Edward J. L. Bell
(Lancaster University)
I will demonstrate how the statistical software R can be used in corpus linguistics. We will go over the basics of R initially and then proceed to explore topics of interest to linguists such as:
* frequency distributions
* graphs and plotting
* hypothesis testing
* modelling/classification (if we have time)
The presentation will take the form of a tutorial with practical exercises. I will provide data, but members of the audience are welcome to bring their own. The best format for data is a plain text file without annotation.
Most examples will be taken from Baayen's 'Analyzing Linguistic Data' and Gries's books on R for linguists.
wk. 14 Monday 7 February 2011
no meeting
wk 15. Monday 14 February 2011
The construction and analysis of a computerised corpus of child bilingual language
Cathy Lonngren-Sampaio
(University of Hertfordshire)
This paper describes the process of construction and analysis of a computerised corpus of child bilingual language following the transcription and analysis system of the CHILDES (Child Language Data Exchange System) project (MacWhinney, 1991). The corpus is composed of transcriptions of the spoken language of two Brazilian bilingual siblings (M and J), exposed to Portuguese and English from birth. The data comprise recordings of diverse family situations occurring over three years, which were transcribed using the conventions set out by CHAT (Codes for the Human Analysis of Transcripts) (MacWhinney, 2010a). Specific codes were designed and inserted in the corpus to permit the electronic investigation of both grammatical and sociolinguistic aspects of Code-Switching (CS) through the use of the CLAN (Computerized Language Analysis) tool (MacWhinney, 2010b). The effectiveness of the codes was tested through analyses of a small number of files, and the output, original CS data for the language pair Portuguese/English, was analysed both quantitatively and qualitatively (Lonngren, 2004). Methodological considerations relating to both the construction of the corpus and its analysis will be the focus of this presentation.
References:
LONNGREN, C. (2004). A Investigacao da Alternancia de Codigo em um Corpus Eletronico de Linguagem Bilingue Infantil. CROP (Revista da Área de Língua e Literatura Inglesa e Norte-Americana). Departamento de Letras Modernas. USP: São Paulo, Brazil.
MACWHINNEY, B. (1991). The CHILDES Project: Tools for Analyzing Talk. Hillsdale, NJ: Lawrence Erlbaum Associates.
MACWHINNEY, B. (2010a). The CHILDES Project: Tools for Analyzing Talk - Electronic Edition. Part 1: The CHAT Transcription Format. Carnegie Mellon University. Available online: http://childes.psy.cmu.edu/manuals/chat/pdf
MACWHINNEY, B. (2010b). The CHILDES Project: Tools for Analyzing Talk - Electronic Edition. Part 2: The CLAN Programs. Carnegie Mellon University. Available online: http://childes.psy.cmu.edu/manuals/clan/pdf
wk. 16 Thursday 24 February 2011
Contrastive Corpus Linguistics: Cross-linguistic contrast of English and Chinese presentation
Richard Xiao
(Edge Hill University)
The corpus-based approach is inherently comparative in nature. In this presentation, I will introduce a new model of Contrastive Corpus Linguistics proposed in Xiao and McEnery’s new book Corpus-Based Contrastive Studies of English and Chinese (Routledge, 2010), which provides a common research platform for areas including corpus linguistics, contrastive linguistics, translation studies and second language acquisition research. I will also present the major research findings, and discuss the challenge and promise, of corpus-based contrastive studies of two distinctly different languages such as English and Chinese.
wk. 17 Monday 28 February 2011
Climate Change in 'Old' and 'New' Media: Towards a Multimodal Corpus-Assisted Discourse Analysis
Jana Tereick
(University of Hamburg)
In the current "Convergence Culture", discourses spread over very many and very different media channels. When Discourse Analysis tries to address this, theoretical, methodological as well as technical problems occur. This talk discusses some of the challenges when trying to incorporate audio-visual and Web 2.0 data into a corpus-assisted Discourse Analysis. Examples are drawn from my doctoral research on German climate change discourse. The corpus comprises print media articles from six major newspapers (1995 to 2010), TV programmes broadcast around the 2010 UN Climate Change Conference and the 1000 most-viewed YouTube videos on climate change with accompanying comments.
Based on first results from this study, the talk tries to show how an integrative approach might lead to new insights into how discourses unfold in the digital age.
wk. 18 Monday 7 March 2011
Text Classification of the BNC using Corpus and Statistical Methods
Ghada Mohamed
(Lancaster University)
This presentation demonstrates a new statistical methodology for establishing categories within a text typology. Although there exist many different approaches to the classification of text into categories, my study will fill a gap in the literature, as most work on text classification is based on features external to the text such as the text's purpose, the text producer’s intentions, and the medium of communication (see, for instance, Reiss 1976; Werlich 1983). Text categories that have been set up based on some external features are not linguistically defined (Biber 1989). In consequence, texts which belong to the same type are not necessarily similar in their linguistic forms. Even Biber's (1988) linguistically-oriented work was based on externally-defined registers.
In this presentation I will show how a text typology, based on similarities in linguistic forms, can be developed using a multivariate statistical technique, namely cluster analysis. In this study, this technique was implemented using the R statistical package. There are two reasons for using statistical software: (a) the large number of texts to be classified in the BNC (British National Corpus) and (b) the exhaustive list of linguistic features used as underlying variables for classifying the texts. The linguistic features used include personal pronouns, passive constructions, prepositional phrases, nominalization, modal auxiliaries, adverbs, and adjectives.
Computing a cluster analysis based on this data is a complex process with many steps. At each step, several alternative techniques are available. Choosing among the available techniques is a non-trivial decision, as multiple alternatives are in common use by statisticians. I will demonstrate how a process of trial and error was used to test several combinations of clustering methods, in order to determine the most useful / stable clustering combination(s) for use in the classification of texts by their linguistic features. The stable results obtained from this trial and error process were then validated using three validation techniques available in cluster analysis, namely the cophenetic coefficient, the adjusted Rand index, and the AU p-value.
Cluster analysis, if used with caution, is a powerful tool for structuring the data. The way it has been implemented in this study constitutes an advance in the field of text typology.
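One of the validation measures named in this abstract, the adjusted Rand index, compares two clusterings of the same texts while correcting for chance agreement. The study itself used R; the following is only a standard-library Python sketch of the usual pair-counting formula, with a hypothetical function name and invented cluster labels, and it is not the author's implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree about
    # Pairs of items placed in the same cluster by both clusterings.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed in the same cluster by each clustering on its own.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / total_pairs  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. every text in one cluster in both solutions.
        return 1.0
    return (together - expected) / (maximum - expected)


# Invented cluster labels for six texts from two clustering runs.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 2]  # same grouping, different label names
same_grouping = adjusted_rand_index(run_1, run_2)
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A value of 1 means the two solutions group the texts identically regardless of how clusters are named, and values near 0 mean the agreement is no better than chance, which is why the index is a natural check on whether a clustering combination is stable.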
wk. 19 (Event 1) Wednesday 16 March 2011
Multimodal Data Transcription and Analysis Software
Jana Tereick
(University of Hamburg)
In this workshop, we will look at some software available for managing and analysing larger sets of multimodal data. We will explore the functions of EXMARaLDA (a free transcription and corpus analysis software package) and NVivo (a commercial qualitative data analysis tool) as well as some open-source video editing and transcoding software (e.g. Avidemux and ffmpeg) and image management/analysis tools (digiKam, imgSeek) -- always with regard to our respective research interests.
Please register with Jana (jana.tereick@uni-hamburg.de) if you are interested in attending. You can bring your own data to try out the programmes. Suggestions for other software we might want to include are very welcome.
wk. 19 (Event 2) Friday 18 March 2011
The spread of English: Evaluating the effect of language policy decisions on spelling adaptation of Norwegian anglicisms
Gisle Andersen
(Norwegian School of Economics and Business Administration (NHH))
This paper focuses on English influence on Norwegian vocabulary and addresses the orthographic adaptation of import words, such as the change from blog to blogg, or squash to skvåsj. This adaptation can be viewed from a top-down perspective, by considering the effect of standardisation decisions made by the Norwegian Language Council, or from a bottom-up perspective, by considering unsolicited adaptation initiated by the language users themselves. Both types are observable in the 900-million-word Norwegian Newspaper Corpus (NNC). The paper aims to show that the NNC lends itself easily to a large-scale investigation of either top-down or bottom-up adaptation. This is a corpus-based investigation of the extent to which English-based import words are represented with original or adapted orthography. The corpus provides empirical data that may shed significant light on the linguistic aspects of the adaptation process. Through a quantitative and qualitative inspection of the data, I point to a set of linguistic and contextual factors that appear to have an important bearing on the degree to which words undergo spelling adaptation. At the same time, the paper is meant to illustrate the usefulness of some of the facilities of the newly developed Corpuscle search engine and interface and to show its value when applied in a specific empirical research task.
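The extent to which an import word appears with original or adapted orthography can be expressed as a simple proportion. The sketch below is a minimal Python illustration with an invented token list and a hypothetical function name; it is not drawn from the NNC or from the Corpuscle interface, and it ignores inflected forms, which a real study would need to handle.

```python
from collections import Counter


def adapted_share(tokens, original, adapted):
    """Share of adapted spellings among all occurrences of either variant."""
    counts = Counter(token.lower() for token in tokens)
    hits = counts[original] + counts[adapted]
    if hits == 0:
        return None  # neither spelling occurs in the token list
    return counts[adapted] / hits


# Invented toy tokens; not drawn from the Norwegian Newspaper Corpus.
tokens = ["blogg", "blog", "Blogg", "avis", "blogg", "squash"]
blog_share = adapted_share(tokens, "blog", "blogg")
squash_share = adapted_share(tokens, "squash", "skvåsj")
print(blog_share, squash_share)
```

Computing the share per word, and then per year or per newspaper, is one way such counts could be related to the timing of a standardisation decision.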
wk. 20 Wednesday 23 March 2011
Understanding People’s Interests from Twitter – An Application of Corpus-based NLP Techniques
Scott Piao
(School of Computing and Communications, Lancaster University)
In this presentation, I will present our ongoing research on the automatic analysis of Twitter messages to identify people’s interests, which is being carried out in the RCUK SerenA Project. Various corpus-based NLP techniques are employed in this work. Twitter is a popular social media system which has about 190 million users today. People use it for various purposes such as conversation, passing on news, self-promotion etc. The large quantity of Twitter messages (tweets) makes it possible to analyse people’s interests in (nearly) real time. In particular, Twitter has been increasingly used for research promotion, and it might be possible to identify some aspects of people’s academic interests as well.
Unlike many recent studies of Twitter, which have largely focused on metadata such as hashtags, we focus on extracting information from the messages themselves. Extracting reliable information from raw Twitter messages is a challenging task, and corpus-based NLP techniques play an important role in it. We employ a set of NLP tools, including POS taggers, chunkers, an entity detector and a term extractor, to summarise the core interests associated with a given Twitter user. Our initial evaluation shows some interesting results.