# Kris Shaffer

I read, write, edit, publish, listen, hack, make music, educate, reform, worship, husband, and father. CU–Boulder. Hybrid Pedagogy Publishing.

BG photo from darkday on Flickr.

## Praat: doing Phonetics by Computer #digitalhumanities #corpusmusic

This looks really cool!

## Visualizing statistical patterns in Schubert's song cycle, Die schöne Müllerin

After some time away (my wife and I just celebrated the birth of our third child!), I'm diving back into The Lieder Project again. I've been doing some data visualization and exploratory analysis of Die schöne Müllerin. Here's what I've found.

First, I imported the parsed text and music into the statistical analysis package R. There I used the plot tool to visualize potential relationships between different parameters in these songs ― pitch, duration, metric stress (both musical and poetic), vowel sounds, etc. Most relationships in this song cycle are very complex. However, a few of them lead in interesting directions.

First, here is a box graph showing the general pitch height of the notes in DSM.

Pitches are denoted by "diatonicNumber", a property in music21 that strips accidentals and assigns a number to each letter name. Middle C (natural, sharp, or flat) is 36, B3 (natural, sharp, or flat) is 35, D4 (natural, sharp, or flat) is 37. The X axis represents the songs in the cycle (minus the three that are missing), in the order in which they are performed in the cycle. The boxes represent the middle two quartiles of pitches (the 25th to 75th percentiles), with the horizontal line representing the median pitch. The extended dashed lines show the outer quartiles. Dots represent outliers.
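To make the numbering concrete, here is a minimal sketch that reproduces the values quoted above (middle C = 36, B3 = 35, D4 = 37). It is plain Python rather than a call into music21, and the function name `diatonic_number` and the pitch-string format (letter, optional `#`/`b`, octave) are assumptions made for illustration.

```python
import re

# Letter names in diatonic order within an octave, starting on C.
STEP_INDEX = {"C": 1, "D": 2, "E": 3, "F": 4, "G": 5, "A": 6, "B": 7}


def diatonic_number(pitch_name):
    """Return an accidental-free number for a pitch such as 'C#4' or 'Bb3'.

    The offset is chosen so that middle C (C4) comes out as 36, matching
    the numbering described in this post. It is not taken from music21.
    """
    match = re.fullmatch(r"([A-G])([#b]*)(\d+)", pitch_name)
    if match is None:
        raise ValueError("unrecognized pitch name: %r" % pitch_name)
    letter, _accidentals, octave = match.groups()
    # Accidentals are parsed but deliberately ignored.
    return 7 * (int(octave) + 1) + STEP_INDEX[letter]


for name in ["B3", "C4", "C#4", "D4"]:
    print(name, diatonic_number(name))
```

Because sharps and flats are ignored, C4 and C#4 land on the same number, which is what lets the box plots treat register independently of key.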

Note that for most songs, the median pitch is middle C. Though the quartiles vary somewhat, there is a lot of uniformity in pitch register. However, this chart counts each note once, regardless of duration. What happens when we represent notes by length? Does the register go up when those long, high melodic climaxes get extra weight?

A little bit. Here is a re-plot where each note is counted in proportion to its duration.

At first, this doesn't seem to tell us much other than Schubert doesn't alter the register much from song to song. However, it serves as a baseline for other analysis.
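One way to count each note in proportion to its duration is a weighted median. The sketch below is Python rather than the R I actually used, the function name `weighted_median` is hypothetical, and the toy melody is made up purely to show the effect of a long, high note.

```python
def weighted_median(values, weights):
    """Return the smallest value at which the cumulative weight
    reaches half of the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    raise ValueError("values and weights must be non-empty")


# Made-up toy melody: diatonic numbers and durations in quarter notes.
pitches = [33, 36, 36, 40]
durations = [1.0, 1.0, 1.0, 4.0]

unweighted = weighted_median(pitches, [1] * len(pitches))
weighted = weighted_median(pitches, durations)
print(unweighted, weighted)
```

In the toy melody the single long note at 40 pulls the weighted median up, while the unweighted median stays at 36.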

Let's look for some specific words. I asked R to plot the pitch height (weighted for duration) of all notes attached to the syllable grün. (By using the syllable, we get all forms of the word: grünen, grüne, etc.) Here is that plot.

The word mainly occurs in three songs: "Mit dem grünen Lautenbande," "Die liebe Farbe," and "Die böse Farbe." Note the differences in pitch height for grün- in these three songs. It seems that in "Mit dem grünen Lautenbande," the heightened psychological tension surrounding the color green (this is the song in which the green ribbon the narrator gives to his beloved is fading in color) is reflected by the pitch height of the occurrences of the word green in the music. Likewise, the upwards motion in pitch height from "Die liebe Farbe" (the beloved color) to "Die böse Farbe" (the hateful color) seems to suggest a similar increase in tension. But are these observations significant?

That's where the baseline plot comes in. When compared to each other, the overall pitch content and the "grün" pitch content of "Die böse Farbe" match pretty well. The median pitch of "Die liebe Farbe" is 33 (G3) overall and 35 (B3) for grün, but with substantial overlap of the pitch ranges, there doesn't seem to be a substantial divergence from "normal" for "grün" there either. There does, however, seem to be a substantial difference between the overall pitch content of "Mit dem grünen Lautenbande" and the occurrences of grün in that song: every occurrence of grün in that song is at or above the overall median pitch. So the heightened emotional content surrounding the color green in this text seems to be reflected in the high pitch setting of the word "grün."
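The "at or above the overall median" check can be sketched in a few lines. This is Python, not the R session described here, it ignores duration weighting for brevity, and the function name `share_at_or_above_median` and the note list are invented for illustration.

```python
from statistics import median


def share_at_or_above_median(notes, syllable_fragment):
    """notes: list of (syllable, pitch) pairs for one song.

    Returns the fraction of notes whose syllable contains the fragment
    and whose pitch is at or above the song's overall median pitch,
    or None if the fragment never occurs.
    """
    overall = median(pitch for _, pitch in notes)
    hits = [pitch for syllable, pitch in notes if syllable_fragment in syllable]
    if not hits:
        return None
    return sum(1 for pitch in hits if pitch >= overall) / len(hits)


# Made-up notes, not real data from the corpus.
toy_song = [("mit", 33), ("dem", 34), ("grün", 38),
            ("en", 36), ("band", 35), ("grün", 37)]
print(share_at_or_above_median(toy_song, "grün"))
```

A share close to 1.0 is what the pattern described for "Mit dem grünen Lautenbande" would look like; a share near 0.5 would suggest no divergence from the song's normal register.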

There are a number of other questions like these that we can ask the corpus. We can look at other words. We can use topic modeling on the text to suggest broader topics (sets of related words) to search for and gain a more nuanced view of some aspects of the text. We can also compare other feature sets of the song cycle, such as pitch, duration, metric stress, and strength of the beat (strong, weak, offbeat, etc.), both visually and through correlation and other statistical measures. And, of course, as we home in on specific relationships, we can run the appropriate statistical tests to establish the specific associations and whether or not they rise to the level of statistical significance.

If you're into data science or digital humanities, download the corpus, import it into your favorite statistical analysis/visualization tool, and play around with it. Let us know if you find anything interesting!

## Finely parsing the corpus

Now that we have a small collection of song files that combine the IPA text and music in one place, we can start to look in detail at the relationship between the musical and lyrical structures. For that, we created the parseFullData.py script. This script uses a combination of music21 functions and our own custom text-analysis functions to parse each song note-by-note and provide detailed information about each note's musical properties (pitch, duration, beat placement, etc.) and lyrical properties (for now, simply whether or not the syllable is stressed, and where the vowel sound falls on the open–neutral–close scale). This produces a really big table, with a lot of information:

etc.... (It also produces a table for each song in the corpus individually.)

This output is where I'm starting to have some fun. I started learning the statistical analysis program R this summer, and it can do some really powerful things with this output. I can import this data into R with a single line of code, and then ask all kinds of really specific questions ... and get the answers very quickly. Things like:

• What is the most common vowel category? — in the whole corpus? in each song?
• Is there a correlation between a syllable's stress and what part of the beat it falls on? (there is)
• Is there a correlation between the "openness" of a vowel and how high the note is? how long the note is? what beat the note falls on?

...and so on.
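For readers who don't use R, questions like these can be roughed out in Python as well. The sketch below is only an illustration: the rows are made up, the column layout (song, vowel category, stressed flag, beat strength) is an assumption rather than the actual output of parseFullData.py, and the helper names are hypothetical.

```python
from collections import Counter
from math import sqrt

# Made-up rows: (song, vowel category, stressed? 1/0, beat strength).
rows = [
    ("song1", "open", 1, 1.0),
    ("song1", "neutral", 0, 0.25),
    ("song1", "open", 1, 0.5),
    ("song2", "close", 0, 0.25),
    ("song2", "close", 1, 1.0),
    ("song2", "open", 0, 0.5),
]


def most_common_category(rows, song=None):
    """Most frequent vowel category, for one song or the whole table."""
    counts = Counter(cat for s, cat, _, _ in rows if song is None or s == song)
    return counts.most_common(1)[0][0]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


stress = [row[2] for row in rows]
beat = [row[3] for row in rows]
r = pearson(stress, beat)
print(most_common_category(rows), most_common_category(rows, "song2"), r)
```

A coefficient like this only describes the toy rows; on real data it would still need a significance test before anyone reads much into it.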

For now, I've just been asking R random questions about the corpus and getting mixed results. Next, I'll write an R script that will systematically ask these questions of all the songs in the corpus, so we can do some good "distant reading" of the corpus and get a feel for what's going on and which songs have the most interesting things happening.

If you use R, feel free to download the data, import it yourself, and play around. Let us know if you find anything cool!

## Combining music notation with (IPA) text

We've made a lot of progress with our code for The Lieder Project recently. Or, at least since the most recent blog post. :)

One update is a new, short Python script, textMusicCombine.py, which does just what it sounds like it does. It takes poetry from a text file and combines it with music notation from another file — in this case, our music is coming from musicXML files (modified from Leigh VanHandel's amazing **kern database).

There are a couple of tricks to making this work. One is syllabification. I wrote this script to divide the text stream into syllables by looking for new lines, spaces between words, and periods between syllables. We just had to make sure our IPA transcriptions included those periods. So the first two lines of "Am Feierabend"

Hätt ich tausend
Arme zu rühren!

are translated into syllabified IPA like this:

'hɛt Iç 'ta:ʊ.sənt
'aɾ.mə tsu 'ɾy.ɾən
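The splitting rule described above (new lines, spaces between words, periods between syllables) can be sketched like this. It is a simplified stand-in, not the code in textMusicCombine.py, and the function name `syllabify` is hypothetical.

```python
import re


def syllabify(ipa_text):
    """Split IPA text into lines, each a list of syllables.

    Spaces separate words and periods separate syllables within a
    word; both are treated as syllable boundaries here.
    """
    lines = []
    for line in ipa_text.splitlines():
        syllables = [s for s in re.split(r"[ .]+", line.strip()) if s]
        if syllables:
            lines.append(syllables)
    return lines


sample = "'hɛt Iç 'ta:ʊ.sənt\n'aɾ.mə tsu 'ɾy.ɾən"
for line in syllabify(sample):
    print(line)
```

Note that this version throws away the distinction between word boundaries and syllable boundaries, which is fine for lining syllables up with notes but would need extending if hyphens between syllables matter in the printed score.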

Another issue is ties and slurs: cases where multiple notes in the printed music need to be attached to a single syllable. Most of the code in textMusicCombine.py deals with that scenario. I'm pretty sure that 7 if statements and 2 for loops all nested together is bad form for a programmer, but it works. :p (I'll spare you the details, but if you're interested, simply follow the link and see the code on GitHub.)
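Stripped of all the notation details, the idea is roughly the following. This is a deliberately simplified sketch and not the nested logic in textMusicCombine.py: the `continues` flag (true when a note is tied or slurred to the previous one) and the function name `attach_syllables` are assumptions made for illustration, whereas the real script has to work that out from the MusicXML via music21.

```python
def attach_syllables(notes, syllables):
    """Pair each note with a syllable.

    notes: list of dicts with 'pitch' and 'continues' keys, where
    'continues' is True if the note is tied or slurred to the previous
    note and therefore shares its syllable.
    """
    remaining = iter(syllables)
    current = None
    result = []
    for note in notes:
        if not note["continues"]:
            # A fresh note takes the next syllable (None if text runs out).
            current = next(remaining, None)
        result.append((note["pitch"], current))
    return result


toy_notes = [
    {"pitch": "C4", "continues": False},
    {"pitch": "D4", "continues": True},   # slurred: same syllable as C4
    {"pitch": "E4", "continues": False},
]
print(attach_syllables(toy_notes, ["ar", "me"]))
```

Here the slurred D4 inherits the syllable of the note before it, so every note ends up with a syllable for analysis even though only the first note of a slur carries one in print.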

Finally, we had originally encoded the poems as poems. But composers often repeat, even change, text when they set a poem to music, and we had to account for that. So we made a second version of each poem that accounted for those changes, and we used those versions as the source text to get the poetry and music lining up properly.

It took a fair bit of back and forth, as Jordan and I looked at the musical scores being produced, finding errors both in the code and in the IPA text. But ultimately, we cleaned up all of the files and got some usable results. And now, we have a program that anyone (with Python and music21 installed) can use for combining a plain-text poem with music notation. Since our script is based around the musicXML format, which just about every music notation program can import/export, we hope it will be useful to many others. (And we hope that if others use it, they'll notify us of newly found kinks for us to work out!)

Here's a sample of what it produced:

You can find all the output we've created so far, including the entire set of vocal parts for Die schöne Müllerin, here. If any vocalists need an IPA version of those songs, download away!

## Collaborating with GitHub (video tutorials)

Our team is using Git and GitHub to coordinate our development. Git is "version control" software that keeps track of changes made by different team members, as well as older versions of the project. It also does a pretty good job of merging updates made by different people editing the same file at the same time. GitHub is simply an online hub for Git projects (hence the name): it stores the repositories that teams collaborate on, and it helps people find new projects to "fork" (copy) and/or contribute to.

I've been using Git & GitHub for some time (and wrote an article about how to use it to build and clone an online textbook), so I made a couple videos to help others on the team get familiar with the workflow. I'm posting them here in case others find them useful.

These videos assume that you are using a Mac, and that you have Git (and the Mac command-line tools) installed, as well as an account on GitHub already set up. (I'm getting a new computer in a couple weeks, and since I'll need to install all of that on the new machine, I'm waiting until then to make a getting-started-from-scratch video. The article linked above contains links to a lot of resources for getting started with Git, if you want to jump in now.)

So, without further ado, here are the videos!

## It works! — An update on the Lieder Project

We've reached a minor milestone in our coding efforts. While others in the group have been working hard to encode the text for a complete Schubert song cycle in IPA (the International Phonetic Alphabet), I've been writing software that will analyze various aspects of the sounds of that poetry, independent of the music. Today I finished that code. (for now...)

The core program is our poemAnalysis script. This script will take all of the IPA files in the texts folder and spit out several files for each poem, containing different kinds of basic statistical data. For now, we are just looking at vowels and categorizing each of them as open, open-mid, neutral, close-mid, and close — describing both their sound and the physicality of speaking or singing them. Our poemAnalysis script will calculate the probability of occurrence for each of these vowel types, outputting that data song-by-song, stanza-by-stanza, and line-by-line. The script allows us to choose whether we analyze every vowel, or only those that are stressed in the poetic meter. We can also choose whether we want to analyze both vowel phonemes in a diphthong, or skip the second one. (Since singers tend to sustain only the first vowel in a diphthong, we usually prefer the latter option.) The script is currently set up to conduct several of these analyses simultaneously, outputting the results of each set of options into individual files. (These go in the statOutput folder.)
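The two options (stressed vowels only, and skipping the second vowel of a diphthong) might look something like the sketch below. It is not the poemAnalysis code itself. It assumes the text is already split into syllables, that a stressed syllable starts with an apostrophe, and that any vowel after the first in a syllable is the second half of a diphthong; the vowel set and the function name `select_vowels` are likewise assumptions.

```python
# Characters treated as vowels in this sketch (an assumed, partial set).
VOWELS = set("aeiouyIɛəɔøœʊʏ")


def select_vowels(syllables, stressed_only=False, skip_second_in_diphthong=True):
    """Return the vowels to analyze from a list of IPA syllables."""
    selected = []
    for syllable in syllables:
        stressed = syllable.startswith("'")
        if stressed_only and not stressed:
            continue
        vowels = [ch for ch in syllable if ch in VOWELS]
        if skip_second_in_diphthong:
            # Keep only the sustained first vowel of each syllable.
            vowels = vowels[:1]
        selected.extend(vowels)
    return selected


syllables = ["'ta:ʊ", "sənt", "'aɾ", "mə"]
print(select_vowels(syllables))
print(select_vowels(syllables, stressed_only=True))
print(select_vowels(syllables, skip_second_in_diphthong=False))
```

Running the same syllables through each combination of options is the small-scale version of producing one output file per set of options.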

These output files are in a format that makes for easy import into a statistical analysis application. I have recently begun learning the statistical programming language R. After just a couple weeks of work, I'm already amazed at how quickly and simply we can perform some of the statistical analyses we're interested in exploring. For example, a very short R script is also posted in our GitHub repository. This script imports the song-by-song data (produced by poemAnalysis), and it measures the correlation between each pair of songs in the corpus. This can tell us 1) how consistent poets are in their phonemical patterns, and 2) which poems stand out as having the most unique sound. Though the transcriptions have not been fully proofread yet, the test analysis showed something interesting: the most unique song in the corpus so far is the one written by a different poet than all the rest! Of course, we need more (proofread) data before we can actually conclude anything, but it suggests that we might be able to find some differences in the way poets write, as well as the way composers set that poetry to music.
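The pairwise comparison can be sketched in Python as well (the script in the repository is in R). Everything below is illustrative: the song names and probabilities are made up, and "lowest average correlation with the other songs" is just one plausible way to define the song that stands out.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_correlations(profiles):
    """profiles: dict of song -> list of vowel-category probabilities.

    Returns each song's average correlation with every other song.
    Needs at least two songs.
    """
    collected = {title: [] for title in profiles}
    for a, b in combinations(profiles, 2):
        r = pearson(profiles[a], profiles[b])
        collected[a].append(r)
        collected[b].append(r)
    return {title: sum(rs) / len(rs) for title, rs in collected.items()}


# Made-up probabilities for (open, openMid, neutral, closeMid, close).
profiles = {
    "song A": [0.30, 0.20, 0.25, 0.10, 0.15],
    "song B": [0.28, 0.22, 0.24, 0.11, 0.15],
    "song C": [0.10, 0.15, 0.20, 0.25, 0.30],
}
averages = mean_correlations(profiles)
outlier = min(averages, key=averages.get)
print(averages, outlier)
```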

Finally, as our poemAnalysis script runs, it calculates how much the probability of occurrence for each vowel type changes — line-by-line and stanza-by-stanza. Then it flags moments where the change exceeds a certain threshold. (This is flexible, but we're currently looking for places where multiple vowel categories change in excess of two standard deviations.) Though there are some false positives (and probably false negatives), these flags are directing our attention to many interesting moments in the poems. In many cases, the moments where the sound changes substantially are also moments of metrical change or shifts in plot or the narrator's attitude towards something/someone. This is exactly what we want to see, especially if accompanied by musical changes, too. (We'll blog more details once the data is cleaner and more complete.)
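Here is one possible reading of that flagging rule as a Python sketch. It is not the poemAnalysis implementation: treating "in excess of two standard deviations" as a z-score on the line-to-line differences is an assumption, as are the function name `flag_changes`, its parameters, and the made-up probabilities.

```python
from statistics import mean, pstdev


def flag_changes(series_by_category, threshold=2.0, min_categories=2):
    """series_by_category: dict of category -> per-line probabilities.

    Returns the indices of lines whose change from the previous line is
    more than `threshold` standard deviations from the mean change in
    at least `min_categories` categories.
    """
    votes = {}
    for values in series_by_category.values():
        diffs = [b - a for a, b in zip(values, values[1:])]
        if len(diffs) < 2:
            continue
        centre, spread = mean(diffs), pstdev(diffs)
        if spread == 0:
            continue  # no variation in this category, nothing to flag
        for index, diff in enumerate(diffs, start=1):
            if abs(diff - centre) > threshold * spread:
                votes[index] = votes.get(index, 0) + 1
    return sorted(i for i, count in votes.items() if count >= min_categories)


# Made-up line-by-line probabilities with an abrupt shift at the last line.
toy = {
    "open": [0.2] * 10 + [0.8],
    "close": [0.5] * 10 + [0.1],
    "neutral": [0.3] * 11,
}
print(flag_changes(toy))
```

Raising `min_categories` makes the flag stricter, which is one knob for trading false positives against false negatives.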

We'll continue to provide updates as we go. In the meantime, feel free to download our data and scripts, and play around with them. Let us know if you find something cool!

## German-to-IPA Dictionary Builder

We now have a script to build a German-to-IPA dictionary for automatic transcription! Find it on our GitHub page.

All you need is a text file with a German poem (here's a sample), another text file with an IPA translation of that poem (here's a sample), a starter dictionary, and our dictionary scripts (GermanToIPA and DictionaryBuilder). Put them all in the same folder, and then run the DictionaryBuilder script:

`python DictionaryBuilder.py`

(or open DictionaryBuilder.py in a developer-friendly editor like TextMate for Mac, and type command-R).

The script will double-check that the German and IPA files have the same number of words. If so, then it will take the IPA for every German word not already in the dictionary and add it as a new entry in the dictionary. On "Nacht und Träume," it runs in less than 1/10 of a second. Woohoo!
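In outline, the logic is something like the following sketch. It is a simplified illustration rather than the DictionaryBuilder script itself: the function name `build_dictionary`, the in-memory dictionary (the real script reads and writes files), and the punctuation handling are all assumptions.

```python
import string


def clean(word):
    """Lower-case a German word and trim punctuation from its ends."""
    return word.strip(string.punctuation).lower()


def build_dictionary(german_text, ipa_text, dictionary):
    """Add German -> IPA pairs for words not yet in `dictionary`.

    Returns the number of entries added. Raises ValueError if the two
    texts do not contain the same number of words.
    """
    german_words = [clean(word) for word in german_text.split()]
    ipa_words = ipa_text.split()
    if len(german_words) != len(ipa_words):
        raise ValueError("German and IPA texts have different word counts")
    added = 0
    for german, ipa in zip(german_words, ipa_words):
        if german and german not in dictionary:
            dictionary[german] = ipa
            added += 1
    return added


starter = {"nacht": "naχt"}
added = build_dictionary("Kehre wieder, heil'ge Nacht!",
                         "keɾɛ vidəʁ ha:Ilgə naχt", starter)
print(added, sorted(starter))
```

The word-count check matters because the pairing is purely positional: one dropped word in either file would shift every later pair.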

For our project, we'll first build the dictionary from the transcriptions we've already done. Then we'll use that dictionary and the GermanToIPA script to pre-transcribe new poems. Finally, we'll transcribe by hand any words not handled automatically and run the DictionaryBuilder on the result to add those words to the dictionary.

Now every poem we transcribe will, theoretically, make future transcriptions a little bit faster. And once we get a big enough collection of poems done, transcribing new ones should be a breeze.

## Automatically translating German poetry to IPA

By far, the slowest part of this project is encoding the poetry in a way that allows us to analyze its sound computationally. The IPA Unicode tool helps a lot, but translating German text to IPA, and then encoding that IPA as digital text, is a long, slow job.

So we decided to speed it up.

To start, Jordan compiled a list of 50 of the most common words in German, along with their IPA translation. (Leigh has also created an ordered list of the most common words in Schubert’s songs which will serve as the basis for further growth of the dictionary.) Then this morning, Jordan, David, and I wrote a translator script in Python. This script takes a text file containing a German poem, checks each word against Jordan’s German-to-IPA dictionary, and if the word is in the dictionary, it replaces it with its IPA equivalent. It even strips punctuation and accounts for capitalization.

The script is really simple. All you need is a text file with a German poem (here’s “Nacht und Träume” if you want a sample), the German-to-IPA dictionary, and this script. Be sure they are all in the same folder. Then go to the last three lines of the script and update the sourceFile and outputFile names to suit your needs. Finally, run the script. At the terminal, run:

`python GermanToIPA.py`

Or if you use a program like TextMate for Mac to edit the file names in the script, simply save the script and type command-R to run it from within TextMate.

That’s it!

Here’s what the German text for “Nacht und Träume” looks like going into the script:

Heil’ge Nacht, du sinkest nieder;
Nieder wallen auch die Träume
Wie dein Mondlicht durch die Räume,
Durch der Menschen stille Brust.
Die belauschen sie mit Lust;
Rufen, wenn der Tag erwacht:
Kehre wieder, heil’ge Nacht!
Holde Träume, kehret wieder!

And here’s what the output looks like:

ha:Il.gə naχt du zIŋ.kəst ni.dəʁ
ni.dəʁ wallen a:ʊχ di Träume
vi da:In Mondlicht dʊɾχ di Räume
dʊɾχ deʁ Menschen stille Brust
di belauschen zi mIt Lust
Rufen wɛn deʁ Tag erwacht
Kehre wieder ha:Il.gə naχt
Holde Träume kehret wieder

Note that not every word is translated, only those in the dictionary. However, even just getting 20% of the words out of the way will save a good chunk of time. And as the dictionary grows, it will speed up the process even more.
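The lookup itself is only a few lines. The sketch below is a simplified illustration, not the GermanToIPA script: the function name `translate_line` is hypothetical, the dictionary is a three-entry excerpt with IPA values copied from the output above, and the real script reads its input and dictionary from files.

```python
import string


def translate_line(line, dictionary):
    """Replace each word found in `dictionary` with its IPA entry.

    Punctuation at the ends of words is stripped and the lookup is
    case-insensitive; words not in the dictionary are left as they are.
    """
    out = []
    for word in line.split():
        bare = word.strip(string.punctuation)
        if not bare:
            continue  # the token was punctuation only
        out.append(dictionary.get(bare.lower(), bare))
    return " ".join(out)


# A tiny excerpt standing in for the full German-to-IPA dictionary.
dictionary = {"nieder": "ni.dəʁ", "auch": "a:ʊχ", "die": "di"}
print(translate_line("Nieder wallen auch die Träume", dictionary))
```

With only those three entries, the second line of the poem comes out just as it does in the sample output: the three known words are replaced and "wallen" and "Träume" are left for a human.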

For now, we’re adding words to the dictionary manually, focusing on those that are the most frequent in the poems we’re studying. However, in a future stage, we hope to write a dictionary builder — a script that will analyze fully translated songs for German-IPA word pairs and then add them to the dictionary. Then, every time we finish an IPA translation, we run the dictionary builder and add words to the dictionary, speeding up all of our subsequent translations in the process.

There’s one thing that this translator won’t be able to do, though: stress. While multi-syllable words have stress patterns that we can encode in the dictionary, every poem has single-syllable words, and their poetic stress is dependent on the meter of the poem and the arrangement of words within that meter. For example, “und” might be stressed one time in the poem and unstressed the next. So we’ll never be able to simply go straight from a German poem to statistical analysis of phonemical structures that are music- and stress-sensitive. There will always be some human intervention with the IPA text. However, having a translator that automatically generates IPA for a large number of words in each poem will certainly help us move a lot faster as we build our corpus of nineteenth-century German poems.

Feel free to try out the script. If not for computational analysis, maybe for that next German Diction assignment!

And we’ll always welcome new contributions to the dictionary.

## Processing IPA Unicode data with Python

One of the main challenges I anticipated for this project was dealing with our phonetic data. Vocalists typically use the International Phonetic Alphabet (IPA) to guide their pronunciation while singing in a non-native language, and there are many sources of IPA transcriptions of art song texts, so it seemed like a natural place to start. However, my software coding experience has been limited to the processing of numerical data and plain text, and IPA involves a number of "special characters." I thought it would be a big challenge for my initial coding effort.

However, it turned out to be fairly simple. I write my code using the Python scripting language, which — as it turns out — offers good support for Unicode text. We also found a Unicode font designed specifically for IPA. Putting these two together has made the analysis of IPA text fairly straightforward.

First, here is a sample German poem, "Nacht und Träume," and its IPA transcription:

Heil'ge Nacht, du sinkest nieder;
Nieder wallen auch die Träume
Wie dein Mondlicht durch die Räume,
Durch der Menschen stille Brust.
Die belauschen sie mit Lust;
Rufen, wenn der Tag erwacht:
Kehre wieder, heil'ge Nacht!
Holde Träume, kehret wieder!

ha:Ilgə naχt du zIŋkəst nidəʁ
nidəʁ val:lən a:ʊχ di trɔ:ymə
vi da:In montlIçt dʊɾç di ɾɔ:ymə
dʊɾç deʁ mɛnʃən ʃtIl:lɛ bɾʊst
di bɛla:ʊʃən zi mIt lʊst
ɾufən vɛn deʁ tak ɛɾvaχt
keɾɛ vidəʁ ha:Ilgə naχt
hɔldə tɾɔ:ymə keɾət vidəʁ

We began by making a plain text file containing the IPA transcription. Then we used Python's codecs framework to import the text in a usable format.

`import codecs`

`content = [line.rstrip('\n') for line in codecs.open('NachtUndTraume.txt', encoding='utf-8')]`

Analyzing the text takes a little more work, but it's still fairly simple. For example, one thing we're looking at is the relative occurrence of different vowel types, and how that changes poem-to-poem, stanza-to-stanza, line-to-line. That analysis begins with categorizing the vowels in the poem: open, open-mid, close-mid, close, neutral. To do this, we use a Python dictionary, but we have to refer to the underlying Unicode code points to make this work. Using the chart provided with the IPA Keyboard Layout, we identified the Unicode code point for each character. Then we used those to set up the dictionary.

`phonemeCategory = {
    'a': 'open',
    u'\u0061': 'open',
    'e': 'closeMid',
    u'\u025b': 'openMid',
    u'\u0259': 'neutral',
    'i': 'close',
    'I': 'open',
    'o': 'closeMid',
    u'\u0254': 'openMid',
    u'\u00f8': 'closeMid',
    u'\u0153': 'openMid',
    'y': 'close',
    u'\u028f': 'close',
    'u': 'close',
    u'\u028a': 'close',
}`

Note that for regular Roman characters, we can simply type the character. Only the "special characters" need the full Unicode treatment.

With this dictionary defined, we can simply ask the category of each phoneme

`phonemeCategory[phoneme]`

and use the usual tools to calculate probabilities, make comparisons, etc.
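As a rough illustration of those "usual tools," here is a sketch that turns one line of IPA into category probabilities. It is written for Python 3, uses an abbreviated copy of the category table (typed as literal characters rather than escapes), and the names `phoneme_category` and `category_probabilities` are hypothetical rather than taken from our scripts.

```python
from collections import Counter

# Abbreviated category table for this sketch; not the full dictionary.
phoneme_category = {
    "a": "open",
    "e": "closeMid",
    "ɛ": "openMid",
    "ə": "neutral",
    "i": "close",
    "o": "closeMid",
    "ɔ": "openMid",
    "u": "close",
    "y": "close",
}


def category_probabilities(ipa_line, categories):
    """Relative frequency of each vowel category in a line of IPA.

    Characters missing from `categories` (consonants, spaces, length
    marks) are simply skipped.
    """
    counts = Counter(categories[ch] for ch in ipa_line if ch in categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


line = "ɾufən vɛn deʁ tak ɛɾvaχt"
probabilities = category_probabilities(line, phoneme_category)
print(probabilities)
```

Running the same function over every line, stanza, or poem gives the tables that the line-to-line and poem-to-poem comparisons start from.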

Once we had an IPA-friendly Unicode font, processing the IPA text became very simple.

Entering that IPA text is another story...