LER-BIML




Surveying Existing Resources for the

Indigenous Minority Languages of the British Isles and Ireland


The Department of Linguistics at Lancaster University has been engaged in two recent EPSRC-funded projects that drew attention to the non-indigenous minority language communities in the UK by locating existing resources, investigating end-user needs and wants, examining basic technical issues and beginning to generate appropriate resources. This work identified a corresponding gap for the indigenous minority languages of the British Isles and Ireland, or “BIMLs”: Cornish, (Scottish) Gaelic, Irish, Manx, Scots, Ulster Scots (Ullans) and Welsh. These languages are becoming increasingly widely used in both public and private life, and speech and language technology applications for them are becoming an increasingly urgent need. Developing such applications requires basic language resources.

The LER-BIML project has three primary aims: 

    i)   to survey the existing language engineering resources and tools for the BIMLs in question

    ii)   to obtain information regarding end-user needs and demands in these areas

    iii)  to investigate some of the particular technical issues that these BIMLs raise, principally in view of spoken corpus collection and annotation

This initial workpackage concentrates on the first of these objectives. 


The BIMLs in question fall into two groups: the Celtic branch of Indo-European and Scots. As it has not previously been general practice to treat these two groups as a single cohesive set, there is no central governing body responsible for coordinating their resources and facilities and making them widely available. Consequently, work on these languages is widespread and yet sparsely distributed. This paper is therefore concerned with the manner and extent of this distribution and the level of current activity surrounding these languages.

It was evident from the outset that the Internet would be the main starting point for locating material and suitable contacts. Our main focus was intended to be text corpora and machine-readable texts, while also locating speech databases, term banks, lexicons and language analysis tools such as taggers and parsers. The Internet has the obvious benefit of supplying all texts in electronic form. Search engines proved a useful initial means of gauging the volume of material available for the BIMLs, and from there subsequent leads and contacts were followed up. It was apparent that most of this material would require sifting: whilst there is a wealth of information in English about the BIMLs, the interests of this project lie with resources directly available in the languages themselves.[1]



Current Projects 

Aside from MILLE and EMILLE, there are several other projects in progress that share themes with LER-BIML. These include CELT, an online database of ancient and contemporary Irish literary and historical texts; the National Corpus of Irish, which incorporates 15 million words from a variety of contemporary books, newspapers, periodicals and discourse, marked up in accordance with the PAROLE encoding standards; and MELIN, which has produced dictionaries, grammars, spellcheckers and terminology lists for its initial four EU minority languages, Irish, Welsh, Catalan and Basque. Their sites offer good links to BIML data and other websites. The Oxford Text Archive has a catalogue of several thousand electronic texts and linguistic corpora in a range of languages, including standard reference works and monolingual and bilingual dictionaries. The Universities of Edinburgh and Glasgow have recently begun collaborative work on the SCOTS project, which aims to build a collection of electronic spoken and written texts for the languages of Scotland, incorporating Scottish English, Scots and Gaelic. Previous work in this field has included a one-million-word lexical database and frequency count for Welsh (CEG), developed at Bangor from a broad range of modern Welsh text types, and Briony Williams’s annotated speech database for Welsh and its application in speech technology at CSTR, University of Edinburgh.
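As an illustration of what a frequency count over machine-readable text involves, the following minimal Python sketch tallies the word forms in a short Welsh sentence. The function name, the tokenisation rule and the sample sentence are assumptions made purely for illustration; they are not taken from the CEG project or any other resource named above, and a real lexical database would need a far more careful treatment of mutation, clitics and orthography.

```python
import re
from collections import Counter


def frequency_count(text):
    """Return a Counter mapping lower-cased word forms to their counts."""
    # Assumed tokenisation rule: a token is a run of letters (accented
    # letters included), optionally joined by apostrophes or hyphens,
    # so that a form such as "mae'r" is kept as a single item.
    tokens = re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", text.lower())
    return Counter(tokens)


# Hypothetical sample sentence, not drawn from any corpus in this survey.
sample = "Mae'r gath yn y tŷ ac mae'r ci yn yr ardd"
counts = frequency_count(sample)

# Print each word form with its count, most frequent first.
for word, n in counts.most_common():
    print(word, n)
```

Lower-casing before counting merges sentence-initial and sentence-internal forms; whether that is desirable depends on the intended application.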


In assessing the relative volume of BIML resources, it was clear that Welsh has the most widely available material. This is a direct result of the Welsh Language Act 1993, which states that the public sector must offer its services bilingually, something that much of the private sector has now also adopted. In locating BIML data we therefore took specific note of whether the material appeared solely in the original language or in bilingual format. One particularly good example of Welsh parallel text is to be found on university homepages. Sabhal Mór Ostaig, a Further Education College on Skye, offers a comprehensive index to key resources in Scottish Gaelic and is a primary gateway for BIML data. ACCAC has a parallel site of exhaustive Welsh language and bilingual educational sites for ages 4-18. Dublin City University has a centre, FIONTAR, which administers academic programmes entirely through the medium of Irish and has a bilingual website outlining this. The Centre for Manx Studies obliged by sending an extensive list of Manx resources, which included the main resources page developed by the Manx Language Officer at the Department of Education. This page offers links to short stories, the Manx Language Society’s newsletter Dhooraght and the magazine CARN, dictionaries, grammars and glossaries.

There are interactive and self-teaching language courses available for all the BIMLs, ranging in style and intensity from the colloquial to more grammatically orientated instruction, and catering for all levels of learner. Most sites also supply various vocabulary and phrase lists and glossaries.

The Irish, Ulster Scots and Welsh languages are all well represented by official Language Boards and Agencies whose websites are entirely bilingual and offer useful links. Whilst no such official bodies exist for the remaining BIMLs, there are organisations whose work is invaluable in promoting their respective languages, such as Agan Tavas for Cornish, Cli for Scottish Gaelic and Mannin.Org.Im for Manx. These sites include histories of the language, manuscripts, reference materials, news items and merchandise. Most of these pressure groups offer discussion fora and mailing lists, in English and in the language in question, where reports are archived for public reading. Webzines are particularly popular and are a very good source of BIML resources, as they form the focus of special interest groups with a targeted, loyal audience. Examples are An Gannas for Cornish speakers, Beo for Irish, Wir Leid for Scots and Ullans.com. They offer comment, stories, puzzles, quizzes, jokes, polls and reviews, amongst other things. The Mercator project, which serves as an information network for minority languages of the European Union, profiles Cornish, Scottish Gaelic, Irish and Welsh amongst these languages and directs users towards associated resources.


All Welsh council and parliamentary sites including the National Assembly are legally required to be presented in bilingual format, and the Scottish Parliament and Northern Ireland Executive are following their lead. Health Authorities in Wales that have developed their own websites also present them in this bilingual format. There is obviously much more “official” material available for Welsh than any of the other BIMLs, but Gaelic and Irish are certainly increasing their profile. 


Media resources are rich in BIML data, with various online newspapers, radio broadcasts and television schedules. Newspapers vary from presenting their entire content in the original language, like the Welsh weeklies Y Cymro and Golwg and the Irish weeklies Foinse and Lá, to producing special reports, like An Phoblacht, Ireland’s leading weekly Republican newspaper (archived). BBC Online is available in Welsh, BBC Scotland has pages in Gaelic, BBC Cornwall produces a weekly five-minute audio news bulletin, and the Welsh and Irish stations S4C and TG4 respectively have bilingual sites. The recently launched BBC4 channel broadcasts programmes in Gaelic. RTE, the national Irish radio service, provides bulletins online, and Raidió na Gaeltachta has an extensive bilingual site which supplies audio downloads of its broadcasts 24 hours a day. There is a weekly text and audio review of Manx Radio. Search engines and web browsers can be used in Welsh, Gaelic and Irish: the Opera Web Browser is available in Irish, Scottish Gaelic and Welsh as well as Breton, and a detailed guide to Welsh software can be found at Meddal.

Arts and Literature 

The Internet does not entirely replace the printed text, as much of the literary corpus is reproduced on various sites. Poetry and song lyrics are favourites, some with audio recordings, and are often used alongside language lessons. Short stories have been written by contributors to the webzines. There are several online book inquiry services and bookshops, including the Welsh Books Council, and the National Library of Wales has a bilingual website. Details of and adverts for film and music festivals and local events are often posted in the respective BIML and linked through webzines and special interest groups. The National Museums and Galleries of Wales website is in parallel format.


As regards religious texts, the Cornish Language Board has translated several books of the Bible into Cornish. Various excerpts have also been translated into Manx and there are Manx, Scottish Gaelic and Welsh versions of the Book of Common Prayer. 


Online dictionaries are widely available for all the BIMLs and spellcheckers have been developed for Cornish, Scottish Gaelic, Irish, Manx and Welsh. Canolfan Bedwyr at the University of Bangor has published extensively in the area of specialist terminology dictionaries, most of which are online and include amongst others glossaries of finance and education terms produced for the National Assembly of Wales. 



There is a reasonably healthy volume of BIML resources available on the Internet but, as predicted, Welsh is the most prominent and prolific of these languages. This prominence is a direct result of the Welsh Language Act 1993, which requires all official material to be presented bilingually. As there is no such legislation for the other BIMLs, there is a strong bias amongst these languages towards popular entertainment, particularly webzines, and archaic literature, as they still rely primarily on specialist interest. They are, however, increasing their profile within more official and political contexts. The paucity of Scots and Ulster Scots resources in particular is probably best accounted for by the difficulty of drawing the boundary between what is a dialect of English and what is something entirely recognisable as “Scottish”. This is a much disputed issue.[2] Bilingual text tends to be the most favoured format for presenting BIML data, as it caters for the interests of the native BIML speaker whilst also recognising the need to extend its resources to the non-BIML speaker. Much material exists on the subject of these languages, but in English. The positive results of this survey therefore support the project’s claims of the increasing profile of the BIMLs in the UK today.

[1] To avoid confusion the term “BIML data/material” will refer exclusively to material written in the languages themselves, rather than in English.

[2] In 2001, the UK Government signed and ratified the European Charter for Regional or Minority Languages. As a result, Scots is recognized as a regional language in Scotland and in Northern Ireland, except that in Northern Ireland it is referred to as Ulster-Scots. For most linguists, Scots is the national dialect of Scotland, bound up with present-day English, and as a national dialect it also has regional variants, including the variant in Ulster. For many non-linguists, because Scots was once a language it still is one, on the grounds of both national ideology and the distinctiveness it still retains, which separates it from English. For some politicians and some activists, Scots is officially a regional language, which is a political and not a linguistic concept. However, because of the way the legislation is expressed, some politicians and some activists consider Ulster Scots a separate regional language in Northern Ireland.