A Visual Comparison of Te Reo and English Word Structure in Te Ara

Distribution of letters within words across Māori and English Te Ara articles.

The Māori and English languages have fundamentally different structures. One way they differ is in the distribution of vowel and consonant sounds within words. I was curious to explore some of these contrasts through data visualisation. The charts at the top of this post summarise an analysis of letter distributions within words in Te Ara: The Encyclopedia of New Zealand, one of the largest Māori-English bilingual corpora in existence (and available under a Creative Commons Attribution-NonCommercial 3.0 New Zealand Licence).

The main thing that strikes me about this visualisation is the difference between the use of vowels (red) and consonants (blue). I knew that Te Reo is an open syllable language, where every syllable ends with a vowel, but to see such a strong contrast between vowel and consonant shapes across the two languages was a moment of forehead-slapping “oh, well but of course” recognition.

To create the graphic, I first needed every Te Ara entry that exists in both Māori and English, and some helpful Ministry for Culture and Heritage staff assisted me in locating them. I used these pages as a base and ran them through a series of scripts to generate letter frequencies. A few notes on the processing:

  • The visualisation takes into account the number of times a word appears in the text. This means more frequently used words (e.g. “nga” or “the”) influence the chart more strongly than rare words (e.g. “kōtare” or “kingfisher”).
  • For this first pass, I converted all characters with macrons to one of the standard 26 letters of the alphabet. For example, “ā” became “a”.
  • I also handled the Māori consonants ‘ng’ and ‘wh’ in a very simplistic fashion: they are simply broken into their constituent letters and counted. I have some ideas about how to better represent these consonants and welcome all suggestions for a more nuanced handling.
  • Hyphenated words are broken up into separate words.
  • Many Māori words appeared in the English corpus and vice versa. To address this I generated a unique list of “edge case words” that appeared in both sets of text and compared it to a Māori dictionary. If a word appeared in the dictionary, I added it to the set of words that appeared only in the Māori version of Te Ara. Otherwise I allocated it to the set that appeared only in the English corpus. Based on a review of the data generated, this approach seems to work well, with the exception of a small number of placenames that slipped into the English list.
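
The preprocessing rules above can be sketched in a few lines of Python. This is an illustration only, not the code used for the charts: the function names are hypothetical, the Māori dictionary is assumed to be a plain set of macron-free lowercase words, and pooling the counts from both corpora when a shared word is allocated is my assumption (the rules above do not say what happens to the counts). Note that the regular expression also splits on apostrophes, which is a simplification.

```python
import re
import unicodedata
from collections import Counter


def strip_macrons(text):
    """Map macronised vowels to plain letters, e.g. 'ā' -> 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping combining marks removes macrons (and any other diacritics).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def word_counts(text):
    """Count lowercase a-z words; hyphens act as word separators.

    'ng' and 'wh' get no special treatment, so they are counted as
    their constituent letters later on.
    """
    cleaned = strip_macrons(text.lower())
    return Counter(re.findall(r"[a-z]+", cleaned))


def allocate_edge_cases(maori_counts, english_counts, maori_dictionary):
    """Assign every word found in both corpora to exactly one language.

    Assumption: the counts from both corpora are pooled into the
    language the word is allocated to.
    """
    maori, english = Counter(maori_counts), Counter(english_counts)
    for word in set(maori) & set(english):
        if word in maori_dictionary:
            maori[word] += english.pop(word)
        else:
            english[word] += maori.pop(word)
    return maori, english


# Hypothetical usage with toy text and a toy dictionary.
maori_only, english_only = allocate_edge_cases(
    word_counts("Ko te kōtare te manu"),
    word_counts("The kingfisher is te kōtare"),
    maori_dictionary={"te", "kotare", "manu", "ko"},
)
```

Because the dictionary lookup happens after macrons are stripped, the dictionary entries would need the same normalisation applied to them.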

This work extends an interesting series of posts David Taylor published in mid-2014 on graphing the distribution of English letters towards the beginning, middle or end of words. The first post describes how Taylor wrote some code to examine how often each letter of the alphabet occurs at each position within words in a large corpus of English texts (specifically, the Brown Corpus) and plotted the results as a series of area charts.

Taylor provides source code and a detailed methodological explanation in a second post. In Taylor’s own words, the most salient points are:

  • “I used a corpus rather than a dictionary so that the visualization would be weighted towards true usage. In other words, the most common word in English, ‘the’ influences the graphs far more than, for example, ‘theocratic’.”
  • “The ordinal (y) scales are obviously not equal: ‘e’ is used 100-200 times more often than ‘z’, and while I could have fudged everything with log scales, letter frequency is not the point of the graphs. As long as I had to fudge anyway, I did so in a way that, I believe, makes it easiest to understand what the graph shows. Your mileage may, of course, vary. The color coding is a quick guide to help understanding, since letter frequency is of course relevant to the shapes you see.”
  • “There are 15 ‘bins’ of letter positions, as a purely qualitative comparison suggested to me this was about the ideal number to show the underlying trends without under- or overfitting. Therefore the ‘t’ in “the” takes up positions 1 through 5, the ‘h’ 6 through 10, etc. When letters straddle a boundary they are apportioned proportionately.”
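
To make the binning rule concrete, here is a minimal Python sketch of it. It is neither Taylor's code nor my production code: the function names are hypothetical, and bins are numbered 0 to 14 rather than 1 to 15. Each word is stretched over 15 bins, every letter receives an equal slice, and a letter that straddles a bin boundary is split in proportion to the overlap. Word counts act as weights, so common words contribute more.

```python
from collections import defaultdict
from fractions import Fraction

N_BINS = 15


def letter_bin_weights(word, n_bins=N_BINS):
    """Yield (letter, bin_index, weight) for a single word.

    Each letter covers an equal slice of the interval [0, n_bins).
    Fractions keep the boundary arithmetic exact.
    """
    if not word:
        return
    width = Fraction(n_bins, len(word))
    for i, letter in enumerate(word):
        start, end = i * width, (i + 1) * width
        b = int(start)  # first bin this letter touches
        while b < end:
            # Portion of bin b covered by this letter's slice.
            yield letter, b, min(end, b + 1) - max(start, b)
            b += 1


def position_profile(counts, n_bins=N_BINS):
    """Return {letter: [weight per bin]}, weighted by word frequency."""
    profile = defaultdict(lambda: [0.0] * n_bins)
    for word, count in counts.items():
        for letter, b, weight in letter_bin_weights(word, n_bins):
            profile[letter][b] += count * float(weight)
    return dict(profile)


# Hypothetical usage: "the" seen twice, "at" seen once.
profile = position_profile({"the": 2, "at": 1})
```

For “the”, the ‘t’ fills bins 0 to 4, matching Taylor's “positions 1 through 5”. In a two-letter word such as “at”, each letter covers 7.5 bins, so the middle bin is shared half and half.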

Although I ended up writing my own Python implementation for data processing and R scripts for visualisation, I followed the rules that Taylor outlines above.
