
Text generation by Markov Chains

text-generation, markov-chains

I was listening to the (great) podcast Idle Thumbs by Chris Remo and others. It is a largely gaming-related podcast, but at the end of every episode Chris reads out some of the more interesting e-mails sent in by fans (the so-called "reader mail" section). At the very end of Episode 273, at a fan's request, they discuss this tumblr post, in which a computer supposedly generated the synopsis of a Batman episode from a large corpus of old episode synopses.

Then, at the end of Episode 278, they discuss another post by the same author, this time a generated Yelp review of the Catacombs of Paris. Besides being funny, both of these stories are quite well structured: there are multiple references to earlier sentences, and they form something resembling a consistent story, or at least they appear to.

(unintelligent) Markov chains...

The Python script behind these posts reveals that generating these texts is actually a very manual process. It simply analyzes a corpus input file, builds a Markov chain of the words in it, and lets the user pick the next word from the 20 best suggestions. After a word is picked, it is added to the output and a new list of 20 suggestions is shown. This allows the user to steer the text towards an interesting or funny story.
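To give a rough idea of what that looks like under the hood, here is my own minimal sketch of a first-order, word-level chain with top-20 suggestions. This is not the actual voicebox.py code; the file name and the build_chain/suggest helpers are made up for illustration.

import re
from collections import Counter, defaultdict

def build_chain(corpus_text):
    """Map each word to a Counter of the words that follow it in the corpus."""
    words = re.findall(r"[\w'-]+|[.,!?]", corpus_text.lower())
    chain = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        chain[current][following] += 1
    return chain

def suggest(chain, current_word, n=20):
    """Return the n words that most often followed current_word in the corpus."""
    return [word for word, _ in chain[current_word].most_common(n)]

with open("idle_thumbs_descriptions.txt") as f:
    chain = build_chain(f.read())
print(suggest(chain, "the"))  # top-20 candidates to continue "the ..."

A minimal word-level Markov chain sketch (not the actual voicebox.py)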

Let me try this myself.

Somewhat disappointed by the lack of intelligence in the script, I decided I should at least give it a go with my own corpus, so I collected the podcast descriptions of Idle Thumbs episodes 200 through 279 (currently the latest released episode) and put them in a text file.

I first tried the randomized mode, in which the script picks words at random from the list of suggestions, but this didn't yield very interesting results. I think this has to do with the small dataset and the large variety of words: the suggestions aren't very good except for the top few. I tried again, this time always picking the first suggestion unless that caused a loop, and ended up with the text quoted below the sketch (after formatting).
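The "first suggestion unless it loops" rule I applied by hand could be automated roughly like this. It builds on the hypothetical chain/suggest helpers sketched above, and the bigram check is my own stand-in for loop avoidance, not something the actual script does.

def generate(chain, start_word, max_words=120):
    """Greedily follow the top suggestion, skipping a pick that would repeat a bigram."""
    output = [start_word]
    used_bigrams = set()
    for _ in range(max_words):
        for candidate in suggest(chain, output[-1]):
            bigram = (output[-1], candidate)
            if bigram not in used_bigrams:
                used_bigrams.add(bigram)
                output.append(candidate)
                break
        else:
            break  # every suggestion would repeat an earlier step; stop early
    return " ".join(output)

print(generate(chain, "this"))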

The time to the left and right. "I'm hungry", says the pilot over text chat.
You see he's named Nick. You're hungry too. Maybe there's somewhere to eat.
Maybe to two of you will be spoiled.

This week we take a look into the creation of this podcast. It is about
Metal Gear Solid V: The Phantom Pain, a game that stirs up memories of something
we loved dearly, long ago. Then we'll let you in on the sad space dad's craze that
has taken the gaming world by storm. If that isn't enough, stick with us for the
video game Downwell. And this is a perfect moment to preserve as a photo. You
try to clear up the shot by setting any enemies to hidden and find yourself
suddenly alone.

voicebox.py on Idle Thumbs podcast descriptions

Note: The word video (italicised in the original post) is the only one for which I picked the second-best suggestion; the first suggestion, time, would have started the entire sequence all over again.

I thought this was a somewhat interesting result (and surprisingly consistent). It could very well pass for the description of an actual Idle Thumbs episode (see an arbitrary episode for comparison). I might give text generation a proper go myself some day.
