McGill project wins Digging into Data Challenge

International team developing innovative software to enable researchers to dig into the vast collections of English language speech data worldwide

By Meaghan Thurston

How English sounds are pronounced, especially vowels, is constantly evolving. Throughout history, changes have been observed in several different dialects of English, such as those affecting the vowels of words like “goose,” “dress,” “bath” and “trap,” or the “th” sound in words like “this” and “thin.”

While these ongoing changes have generally been studied independently in each region where they occur, an international or multi-regional view could be obtained by developing tools that allow researchers to study language with much larger sets of data from several regions at once. This is the challenge that an international team of researchers, including McGill’s Professor Morgan Sonderegger of the Department of Linguistics, is undertaking.

The team has recently been awarded $897,000 in funding through the Trans-Atlantic Platform (T-AP) Digging into Data Challenge to develop and apply user-friendly software for large-scale speech analysis, investigating how English speech has changed over time and space, and across dialects. Fourteen projects internationally, including six Canadian projects, were awarded funding on March 31.

The project, SPeech Across Dialects of English (SPADE): large-scale digital analysis of a spoken language across space and time, is led by an international team: Jane Stuart-Smith (University of Glasgow), Sonderegger, and Jeffrey Mielke (North Carolina State University). It will analyze 43 existing datasets of both Old World (British Isles) and New World (North American) English, including many private datasets held by “data guardians.” Together, these datasets include recorded English from speakers born over a 130-year period.

“The analysis of speech data has many applications,” Sonderegger said. “The large-scale analysis of speech data will help linguists better understand spoken English, but has applications beyond linguistics, across the social sciences and humanities. For example, social historians can examine dramatic changes in the pronunciation and intonation of ‘standard’ British English over the 20th century, linked to changes in class structures. Political scientists could mine archives of political speech in the UK and Canada. Our tool will let users examine these patterns.”

Though speech recording became possible in the early 20th century, it has become common practice over the last 60 years, opening the door for analysis by linguists, speech scientists, anthropologists and folklorists. However, analyzing these diverse datasets has presented both ethical and technical challenges. While there is an enormous amount of recorded speech in public and private collections around the world, only a small fraction of this speech is available for research purposes. Unlike with textual data, research ethics dictate that speech recordings cannot be listened to and analyzed by researchers unless explicit permission was granted at the time of the recording.

“Our team asked, ‘How can we develop a tool that can take advantage of the existing speech data in an ethical way?’” said Sonderegger. “The tool to be developed will have the capacity to analyze speech from datasets that already exist, and that are largely private or under ethical restrictions. Using the software we will develop, how English is pronounced across different dialects and over time will be analyzed on a macro scale without requiring that researchers actually listen to the data.”

McGill Department of Linguistics faculty members Charles Boberg and Michael Wagner are also involved as collaborators on this project. Prof. Boberg is a leading expert on North American English and the large-scale regional analysis of English; Prof. Wagner was a PI on a previous Digging into Data project.

The approach proposed by the project has been applied only on a limited scale, because the methods for doing so were underdeveloped, explained Professor Sonderegger. “The proposed software will help linguists scale up the study of speech. The results of the project will hopefully be applicable for both researchers and the public, who hold private speech datasets and are interested in how English sounds across time and space.”

SPADE has received some internal McGill funding, and has been enhanced, Sonderegger said, by a longstanding connection between McGill and the University of Glasgow. Funded cooperatively through the Digging into Data Challenge, SPADE received funding from the Social Sciences and Humanities Research Council (SSHRC) and the Natural Sciences and Engineering Research Council of Canada (NSERC); the Arts and Humanities Research Council and the Economic and Social Research Council (UK); and the National Science Foundation (US). The Digging into Data Challenge promotes innovative humanities and social science research using large-scale data analysis.
