Distribution of letters within words across Māori and English Te Ara articles.
To create the graphic, some helpful Ministry for Culture and Heritage staff assisted me in locating all Te Ara entries that exist in both Māori and English. I used these pages as a base and ran them through a series of algorithms to generate letter frequencies.
- The visualisation takes into account the number of times a word appears in the text. This means more frequently used words (e.g. “nga” or “the”) influence the chart more strongly than rare words (e.g. “kōtare” or “kingfisher”).
- For this first pass, I converted all characters with macrons to one of the standard 26 letters of the alphabet. For example, “ā” became “a”.
- I also handled the Māori consonants ‘ng’ and ‘wh’ in a very simplistic fashion: they are simply broken into their constituent letters and counted. I have some ideas about how to better represent these consonants and welcome all suggestions for a more nuanced handling.
- Hyphenated words are broken up into separate words.
- Many Māori words appeared in the English corpus and vice versa. To address this, I generated a unique list of “edge case words” that appeared in both sets of text and compared it to a Māori dictionary. If a word appeared in the dictionary, I added it to the set of words that appeared only in the Māori version of Te Ara. Otherwise, I allocated the word to the set that appeared only in the English corpus. Based on a review of the data generated, this approach seems to work well, with the exception of a small number of placenames that slipped into the English list.
- “I used a corpus rather than a dictionary so that the visualization would be weighted towards true usage. In other words, the most common word in English, ‘the’ influences the graphs far more than, for example, ‘theocratic’.”
- “The ordinal (y) scales are obviously not equal: ‘e’ is used 100-200 times more often than ‘z’, and while I could have fudged everything with log scales, letter frequency is not the point of the graphs. As long as I had to fudge anyway, I did so in a way that, I believe, makes it easiest to understand what the graph shows. Your mileage may, of course, vary. The color coding is a quick guide to help understanding, since letter frequency is of course relevant to the shapes you see.”
- “There are 15 ‘bins’ of letter positions, as a purely qualitative comparison suggested to me this was about the ideal number to show the underlying trends without under- or overfitting. Therefore the ‘t’ in “the” takes up positions 1 through 5, the ‘h’ 6 through 10, etc. When letters straddle a boundary they are apportioned proportionately.”
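The binning rule in the last quoted point can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original implementation: the function names `letter_bin_shares` and `position_profile` are hypothetical, and the sketch assumes each letter occurrence carries a total weight of 1 that is split across the bins it overlaps, then multiplied by how often the word occurs.

```python
from collections import defaultdict

N_BINS = 15  # number of position bins, as in the quoted rules


def letter_bin_shares(word, n_bins=N_BINS):
    """Yield (letter, bin_index, share) for every letter/bin overlap in `word`.

    Assumption: each letter has total weight 1, split across the bins it
    overlaps in proportion to the size of the overlap.
    """
    n = len(word)
    for i, letter in enumerate(word):
        # Measure positions in integer units of 1/(n * n_bins) of the word,
        # so letter i spans [i*n_bins, (i+1)*n_bins) and bin b spans
        # [b*n, (b+1)*n). This keeps the overlap arithmetic exact.
        start, end = i * n_bins, (i + 1) * n_bins
        for b in range(n_bins):
            overlap = min(end, (b + 1) * n) - max(start, b * n)
            if overlap > 0:
                yield letter, b, overlap / n_bins


def position_profile(word_counts, n_bins=N_BINS):
    """Accumulate count-weighted bin shares per letter over a {word: count} map."""
    profile = defaultdict(lambda: [0.0] * n_bins)
    for word, count in word_counts.items():
        for letter, b, share in letter_bin_shares(word, n_bins):
            profile[letter][b] += count * share
    return dict(profile)
```

For “the”, this spreads the ‘t’ evenly over the first five bins, the ‘h’ over the next five and the ‘e’ over the last five. For a four-letter word, the first letter covers bins 1 to 3 fully and three quarters of bin 4, which is where the proportional split comes in.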
Although I ended up writing my own Python implementation for data processing and R scripts for visualisation, I followed the rules that Taylor outlines above.
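The pre-processing notes above (macron removal, splitting hyphenated words, counting word occurrences and allocating “edge case words”) could look roughly like this in Python. It is a sketch only, not the code used for the graphic: all function names are hypothetical, `maori_dictionary` is assumed to be a plain set of macron-free, lower-case words, and the sketch assumes an edge case word is simply dropped from the other corpus, since the notes do not say what happens to its counts there.

```python
import re
import unicodedata
from collections import Counter


def strip_macrons(text):
    """Map macronised vowels to plain letters, e.g. 'ā' -> 'a'.

    Note: this removes every combining mark, so other accents go too.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Lowercase, strip macrons, and keep runs of a-z as words.

    Hyphenated words fall apart into separate words. So do words with
    apostrophes, which is a simplification of this sketch.
    """
    return re.findall(r"[a-z]+", strip_macrons(text.lower()))


def word_counts(text):
    """Count how often each word occurs, so common words weigh more."""
    return Counter(tokenize(text))


def split_edge_cases(maori_counts, english_counts, maori_dictionary):
    """Assign words found in both corpora to exactly one of them.

    Assumption: a shared word found in the dictionary stays Māori-only,
    any other shared word stays English-only.
    """
    maori, english = Counter(maori_counts), Counter(english_counts)
    for word in set(maori) & set(english):
        if word in maori_dictionary:
            del english[word]
        else:
            del maori[word]
    return maori, english
```

Because the counting is done letter by letter further down the line, ‘ng’ and ‘wh’ need no special handling here: they are counted as their constituent letters, as described in the notes.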