The Big Picture on Big Data

If you feel overwhelmed by the sheer volume of information available for any given subject, know that you are not alone. Philosophers, statisticians and computer scientists have been wrestling with questions about how to derive meaningful insights from large volumes of information for decades. Three McGill researchers showcase the different applications of Big Data and offer insights on how to make sense of it all… or at least some of it.

By Victoria Leenders-Cheng, with files from Zoë Macintosh


In the Reasoning and Learning Lab in the School of Computer Science at McGill, four iRobots (think Roomba vacuums) scoot around, probing the environment and gathering data. Attached to one robot is a Microsoft Kinect camera with a little grip pad, so the robot not only moves through a space but also “looks” at objects and pushes them around.

It’s not just science fiction or a scene from the Jetsons anymore: artificial intelligence (AI) machine learning, in which programs improve automatically with experience, is everywhere. Data-mining programs that detect fraudulent credit card transactions, autonomous vehicles that learn to drive on public highways — these are just a couple of examples of the practical applications arising from this field of research.

Computer science professor Doina Precup, who co-directs the Reasoning and Learning Lab, is investigating both the applied and theoretical dimensions of machine learning. Robots that can map a room or open a door without pre-programmed instructions, or handle a doorknob whose shape is wholly new to the robot’s database, are potential assistants for the elderly or the disabled. Precup and fellow computer science professor Joelle Pineau work on robots, including a wheelchair robot, with these capabilities.

Software agents such as computer games that automatically adapt to a player’s gaming style can maximize the player’s enjoyment and provide an increasing level of challenge (or, as demonstrated in the Oscar-nominated movie Her, operating systems can become almost human, with human wants and feelings…).

On the theoretical side, an examination of the algorithms used in machine learning yields deeper insights into information processing. To begin with, how does one go about training robots to learn on their own?

“You build a bias into the robot so that it likes to receive large numbers, like a cookie for a rat,” says Precup. Then, you reward the robots with large numbers as they navigate new environments. Rather than rely on instructions to map a room or open a door, then, Precup’s robots create their own classifications from the ground up, based on these built-in biases.
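For readers curious about what "rewarding with large numbers" can look like in code, here is a minimal sketch of one textbook approach, an epsilon-greedy learner. It is not the lab's own algorithm: the function name, the three actions and their average rewards are all hypothetical, chosen only to show a program coming to prefer whatever earns it the biggest numbers.

```python
import random

def learn_by_reward(mean_rewards, episodes=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy learner (illustrative only, not the lab's method).

    The agent repeatedly picks an action, receives a noisy numeric reward,
    and keeps a running average of what each action has earned so far.
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(mean_rewards)  # learned value of each action
    counts = [0] * len(mean_rewards)       # how often each action was tried
    for _ in range(episodes):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.randrange(len(mean_rewards))
        else:
            # Exploit: pick the action with the highest estimate so far.
            action = max(range(len(estimates)), key=lambda a: estimates[a])
        # The "cookie": a large number for a good action, plus some noise.
        reward = mean_rewards[action] + rng.gauss(0, 0.5)
        counts[action] += 1
        # Incremental update of the running average for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates

# Hypothetical average rewards for three possible actions.
estimates = learn_by_reward([1.0, 5.0, 2.0])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
```

Nothing tells the program which action is best. It discovers that the second action pays the most simply by trying things and tracking the numbers it gets back.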

Next, what conditions do you set for the data that the robots are gathering? Current software permits Precup’s robots to simultaneously take dozens of sensor readings per second in chunks of time that range from a few hours to a day.

“It’s nice to know, given this infinite amount of data, whether it is likely that we can come up with the right answer,” Precup explains. “Theoretical analysis answers this question and also tells us how fast we can expect to see results: a day, a month, or a year. Then, we use practice applications to see if there is a match with the theoretical algorithms, or, if there isn’t a match, we try to understand why not.”

Precup compares the tasks in machine learning to those needed for an activity such as cooking dinner: there are high-level steps such as choosing a recipe or acquiring the ingredients and low-level steps, such as actually stirring the pot or chopping the vegetables.

The goal in artificial intelligence is to build models capable of both strategic thinking and low-level thinking.

“If you focus disproportionately at the very low level, the data comes in and there is too much of it so you can’t really process it properly. But, at the high level, the task becomes too abstracted and you may not be able to do what you set out to do,” Precup says. “We try to create a balance between the two.”



A Canadian law enforcement official flipped through a five-page printout from a new cyber forensics tool called AuthorMiner. The software, created by McGill School of Information Studies professor Benjamin Fung, analyzes the writing styles of anonymous emails and identifies the humans behind the messages. But the printout in the official’s hands, which contained thousands of permutations of alphabet letters, was virtually incomprehensible.

“We cannot spend days trying to understand this data,” the law enforcement official told Fung. With the digitization of so much of our communication (email, tweets, blog posts), and with the occasionally malicious nature of such communication (spam, phishing, hate mail), the ability to attribute a message directly to a person helps law enforcement officials by providing them with evidence that stands up in a courtroom trial.

“However, our most reliable methods for identifying authors usually involve sophisticated and obscure computational models that are very challenging to interpret,” Fung explains. “As a result, the output from these methods is seldom actually used in real-life lawsuits because the output doesn’t meet the standards of evidential proof.”

Authorship attribution, as this field of research is called, combines linguistics and computer science to generate analyses of writing styles, broken down into components called stylometric features. These features include lexical, syntactic and structural elements of a written document, such as how often a person writes “u” instead of “you” or how often that person misuses “it’s” for “its.”

There are 302 such stylometric features, with 2^302, or roughly 8.15 x 10^90, possible combinations. They are assessed on the basis of their frequency and their recurrence to yield a writeprint (akin to a fingerprint but for writing) that identifies the author of a message with upwards of 90 percent accuracy. These methods of analysis can even provide inferences about an author’s nationality and gender.
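To make the idea of a stylometric feature concrete, here is a small sketch that measures a handful of them on a short message. It is an illustration only, not AuthorMiner's code: the function name, the four features and the sample sentence are assumptions made for the example. The last line checks the arithmetic behind the number of possible feature combinations.

```python
import re

def stylometric_features(text):
    """Compute a few simple writing-style measurements (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    return {
        # Share of distinct words: a rough proxy for vocabulary richness.
        "vocabulary_richness": len(set(words)) / len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # How often the writer uses "u", as a fraction of all words.
        "u_rate": words.count("u") / len(words),
        "avg_sentence_length": len(words) / len(sentences),
    }

# Hypothetical message, invented for this example.
sample = "Can u call me? I think u left it here. See you soon."
features = stylometric_features(sample)

# Each of the 302 features is either in or out of a combination: 2^302 subsets.
feature_subsets = 2 ** 302
```

In this sample, "u" accounts for 2 of the 13 words. A real writeprint would draw on many more features and compare them across large sets of messages.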

Hearing the security official’s feedback convinced Fung of the need to create a visualization component for the software so that the results of the complex permutations would communicate directly with the human eye and thus be comprehensible to the human mind.

Visualization Sample of Writeprint Analysis Results

He went back to the research lab of the National Cyber-Forensics and Training Alliance Canada, and spent a year working together with his student, Honghui Ding, on the programming.

“Our goal was to develop a highly accurate way of measuring this data, and also improve how easily this data can be interpreted and visualized,” Fung says. “We were inspired by fingerprint matching, where you can visually compare the minutiae of the print, and devised a tool that visually represents our hypotheses.”

The result, AuthorMiner 3.0, presented the complex information about stylometric features in colour-coded charts that could be easily and accurately interpreted. In late 2013, Fung again demonstrated the software to law enforcement officers. The eventual goal: use the software to support current cyber forensics investigations, increase social accountability and cut down on cybercrime.

Linguistic Fingerprint

Examples of stylometric features used to generate a writeprint:

  • Vocabulary richness
  • Word length
  • Punctuation
  • Use of function words such as prepositions and pronouns
  • Sentence or paragraph length
  • Spelling and grammatical mistakes




Visualization of a 2-gram and a 3-gram

Western music, from 12th-century Gregorian chant to 21st-century Lady Gaga, is built on the notion of counterpoint. (Quick definition for the uninitiated: counterpoint is the relationship between different voices in a piece of music.)

At McGill, music research professors Julie Cumming and Peter Schubert have figured out how to represent bits of counterpoint with sets of numbers. They lead a group of scholars and students in a project using these data-driven techniques to illuminate the history of musical composition and understand how musical styles evolve.

Called ELVIS, for Electronic Locator of Vertical Interval Successions, the project has created an online database of searchable music scores composed between 1300 and 1900 and provides the software tools to enable analysis of the scores and the musical styles.

“We can look at a two-gram, which is two vertical intervals linked by the melodic motion in the lower voice, and represent it with three numbers (see sidebar). Or we can look at a three-gram, which is three vertical intervals, and represent it with five numbers,” says Cumming. “Once we can represent those relationships with numbers we can search for them with a computer.”
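The numbering Cumming describes can be sketched in a few lines of code. What follows is a simplified illustration, not the ELVIS software: it assumes two voices written as scale-step numbers (C = 0, D = 1, E = 2, and so on), counts intervals the way musicians do (a unison is 1, a third is 3), and uses a minus sign for downward motion in the lower voice. The function names and the three-note example are invented.

```python
def generic_interval(low, high):
    """Interval number between two scale steps (unison = 1, third = 3).

    A negative result means the second note lies below the first.
    """
    steps = high - low
    return steps + 1 if steps >= 0 else steps - 1

def interval_ngrams(upper, lower, n=2):
    """Turn two aligned voices into n-grams of vertical intervals.

    Each n-gram alternates a vertical interval with the melodic motion of
    the lower voice, so a 2-gram has 3 numbers and a 3-gram has 5.
    """
    vertical = [generic_interval(lo, up) for up, lo in zip(upper, lower)]
    motion = [generic_interval(a, b) for a, b in zip(lower, lower[1:])]
    grams = []
    for i in range(len(vertical) - n + 1):
        gram = [vertical[i]]
        for j in range(i, i + n - 1):
            gram += [motion[j], vertical[j + 1]]
        grams.append(tuple(gram))
    return grams

# Invented example: upper voice G-F-E over lower voice C-D-C.
upper = [4, 3, 2]
lower = [0, 1, 0]
two_grams = interval_ngrams(upper, lower, n=2)    # [(5, 2, 3), (3, -2, 3)]
three_grams = interval_ngrams(upper, lower, n=3)  # [(5, 2, 3, -2, 3)]
```

Once a score is reduced to tuples like these, searching thousands of pieces for a given pattern becomes a matter of comparing short lists of numbers.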

The ELVIS database holds more than 6,000 pieces of music with over six million searchable vertical intervals. The project was awarded a “Digging into Data Challenge” grant in 2012 as part of an international initiative to address how big data changes the research landscape for the humanities and social sciences.

Musical style is heavily defined by the vertical intervals considered “allowable” for the period and deviance from the style is sometimes met with outrage, as with the riots provoked by Stravinsky’s The Rite of Spring. Tracing the evolution of style, a long-standing practice in music history, gives scholars insight into the nature of music itself. Now, by digitizing the practice, scholars are able to analyze huge quantities of data — orchestral scores can have up to 40 or 50 voices sounding notes at the same time — much faster and with greater rigor and accuracy.

ELVIS software is available for public use on the web; it will also become part of another music digitization initiative at McGill led by music technology professor Ichiro Fujinaga that aims to create optical music recognition software able to scan and analyze musical documents written in all styles, from hand-written manuscripts to printed scores.

“We are hoping to get more and more data from the optical recognition techniques,” Cumming says, before summing up the larger lesson: “but unless you have good search and analysis tools that actually use that information, it’s useless to have all that data.” ■

More Digging into Data

Andrew Piper of McGill’s Department of Languages, Literatures and Cultures in the Faculty of Arts leads a team of 14 researchers awarded the Digging into Data Challenge for 2014. The project, entitled “Global Currents: Cultures of Literary Networks, 1050-1900,” will undertake a study of literary networks in different cultural epochs and bring a data-driven approach to the study of world literature.


What is Counterpoint?

Examples of counterpoint most obvious to the ear include the back and forth between melodies in Row, row, row your boat, or Pachelbel’s famous Canon, or the harpsichord solo in The Beatles’ song, In My Life.

Basically, in counterpoint, different notes sound at the same time, resulting in what we think of loosely as harmonies; music theorists call these “vertical intervals,” in reference to the distance between the two pitches, and when intervals sound one after another, they call this a “vertical interval succession.”


ELVIS goes Multimedia

ELVIS lends itself to other unusual representations of data. Graduate student Mike Winters created sound files from (or “sonified”) the ELVIS data for 15th-century composer Guillaume Dufay, 16th-century composer Palestrina, and Ludwig van Beethoven. The sonifications paint a vivid aural picture of the three composers’ different styles, and Beethoven’s sound file is colourful… to say the least. Hear these sonifications online at