Department of Linguistics
A NATIONAL BULLETIN ON ISSUES IN
AUSTRALIAN STYLE AND ENGLISH IN AUSTRALIA
| Volume 16 No 2 | December 2009 |
The Australian National Corpus Initiative
Michael Haugh is a lecturer in the School of Languages and Linguistics, Griffith University. He has convened workshops on the Australian National Corpus in 2008 and 2009¹.
There are now more than 22 million users of language in Australia. Language is used constantly in our daily lives, from face-to-face conversations, reading newspapers and books, through to writing emails and blogging. A variety of different languages are also spoken by these users, including indigenous languages, migrant or community languages, and, of course, English. The latter predominates in public life, and also in private life for a large proportion of the population, but it does not exist in a vacuum, being influenced by and influencing other languages and users in Australia. This complex linguistic landscape forms an important part of what it means to be Australian.
Information technologies – in particular, increasingly powerful computers as well as the Internet – are offering new ways in which to study the Australian linguistic landscape. One such possibility is the establishment of a representative collection of digitised spoken and written language in Australia in all its forms and diversity. The term corpus is generally used to describe such a collection. Many countries have large corpora, including the U.S., the U.K., Germany and Denmark, but Australia’s language data resources remain scattered and relatively inaccessible.
The Australian National Corpus initiative involves a concerted push by linguists, applied linguists, language technologists and those interested in language more generally to establish a massive online database of language in Australia.
In this way, we can take advantage of the capacity of computing technologies to search across large amounts of language data, and also make this corpus easily accessible, not just to researchers and educators, but to everyone who is interested in language and the ways in which it is used in Australia.
In order to be representative of language in Australia, the Australian National Corpus needs to be very large indeed. The English component, in particular, needs to be enormous because there are so many different people in Australia who have used or are using English in multiple ways in spoken, written and, increasingly, computer-mediated forms of communication. At present the largest collections of Australian English we have are the Australian Corpus of English and the Australian component of the International Corpus of English, which are approximately one million words each. While one million words may sound like a lot, there are much larger corpora held in other countries. The British National Corpus is one hundred million words, and the Corpus of Contemporary American English is more than four hundred million words. There is now even a two billion word corpus of English held by Oxford University Press, the Oxford English Corpus.²
The reason other countries have built such large corpora is that many questions about language and language use can only be answered when you have a much larger, more representative collection. We are all familiar in Australia with certain phrases such as taking the piss (or taking the mickey if you will). Yet if we search current corpora of Australian English we cannot find even one example where this phrase is used. This means that while we might assume that taking the piss is something that many Australians enjoy doing, we cannot get a handle on what such an iconic phrase means across the Australian linguistic landscape.
We know from searching the British National Corpus and the Corpus of Contemporary American English that it is a phrase also used frequently by the British, but rarely by Americans. But we do not yet know if Australians mean something different by this phrase to the British. This is just one example of a multitude of questions about language use that could be answered through the establishment of a large Australian National Corpus.
An Australian National Corpus would not only be useful to those studying languages in Australia and seeking to better understand what it means to be Australian. It would serve as a helpful resource for those teaching English and other languages, as it would provide real-life, authentic examples of spoken and written language to use in the classroom. It would also be of assistance to those building human-computer interaction systems. Unless we all want to start speaking like Americans, language technologists will increasingly need access to large collections of data where Australians are speaking English. It might be irritating to be answered by a computer system on the phone, for example, but voice-controlled systems are probably here to stay. The development of more Australian-friendly systems is at least one way to reduce such irritations (albeit not completely).
The Australian National Corpus initiative is a collaborative effort amongst Australian researchers, but we hope to involve Australians more broadly in this project in various ways. After all, ultimately every one of us has a deep investment in how language is used in Australia, as it is through all of us using language in our daily lives that we create the complexity and diversity of the Australian linguistic landscape.
Notes
2. The British National Corpus and Corpus of Contemporary American English are accessible online.


