Search Query Haikus

Introduction

In the fall of 2009, I studied Applied Machine Learning at CMU under Carolyn Rose. As a final project, I analyzed a leaked AOL search query dataset. After that class, I continued working with the data to identify unintentional haikus in users' search histories.

Machine Learning

My Machine Learning project concerned identifying patterns in search behaviour. While the dataset does not contain personally identifiable information (at least not directly), it does group queries by user. I attempted to build user profiles based on search habits and then use these profiles to attribute further search sessions to the correct user. As one might expect, identifying people from such sparse data is very difficult, and my results supported this.

Identifying Haikus

My brother's immediate response to this project was, "What about haikus?" We could both agree that even nonsense phrases take on extra gravitas when composed in the form of a haiku (one 5-syllable line, one 7-syllable line, one 5-syllable line). Finding unintentional haikus within the dataset is admittedly a far cry from the original goals of the project, but I was interested enough to explore the possibilities on my winter break.

Identifying haikus required counting the syllables in each query. Doing this programmatically is non-trivial, and to my knowledge, there are no publicly available lookup tables. To accomplish the task, I was admittedly a bit rude: I queried the website Haiku with Teeth thousands of times to get syllable counts for the words in each search query. With this data, I was able to identify which sequences of three consecutive queries formed haikus.
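The matching step can be sketched roughly as follows. This is a minimal illustration rather than the original project code: the `counts` dictionary stands in for the per-word syllable counts collected from Haiku with Teeth, and the function names `query_syllables` and `find_haikus` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Syllable totals required for three consecutive queries to form a haiku.
HAIKU_PATTERN = (5, 7, 5)


def query_syllables(query: str, counts: Dict[str, int]) -> Optional[int]:
    """Sum per-word syllable counts; return None if any word is unknown."""
    total = 0
    for word in query.lower().split():
        if word not in counts:
            return None
        total += counts[word]
    return total


def find_haikus(queries: List[str], counts: Dict[str, int]) -> List[Tuple[str, ...]]:
    """Return every run of three consecutive queries with 5, 7, 5 syllables."""
    totals = [query_syllables(q, counts) for q in queries]
    haikus = []
    for i in range(len(queries) - 2):
        if tuple(totals[i:i + 3]) == HAIKU_PATTERN:
            haikus.append(tuple(queries[i:i + 3]))
    return haikus


# Hypothetical lookup table (assumed counts, for illustration only).
counts = {"what": 1, "does": 1, "mean": 1,
          "meekly": 2, "serenading": 4, "halted": 2}
history = [
    "weather",  # unknown word, so this query can never be part of a haiku
    "What does meekly mean",
    "what does serenading mean",
    "what does halted mean",
]
print(find_haikus(history, counts))
```

Queries containing a word with no known count are skipped rather than guessed, which fits the observation that typos and proper nouns are the hardest cases to count.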

Results

I learned, first and foremost, that there is more to funny haikus than correct syllable counts. Many people have a search pattern that includes repeating previous searches. This led to many haikus with identical first and third lines (not funny). Additionally, while Haiku with Teeth does a great job, its syllable counts are not perfect. This is particularly true in cases of typos, contractions, and proper nouns.
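One way to screen out the repeated-search haikus would be a check like the one below. It is only a sketch under the assumption that each candidate haiku is held as a sequence of three query strings; the helper name `is_repetitive` is hypothetical and was not part of the original project.

```python
from typing import List, Sequence


def is_repetitive(haiku: Sequence[str]) -> bool:
    """True if the first and third lines match, ignoring case and outer spaces."""
    first, _, third = (line.strip().lower() for line in haiku)
    return first == third


def drop_repetitive(haikus: List[Sequence[str]]) -> List[Sequence[str]]:
    """Keep only the candidates whose first and third lines differ."""
    return [h for h in haikus if not is_repetitive(h)]


candidates = [
    ("Empire flooring", "incident in a small town", "and justice for all"),
    ("Free music lyrics", "tattoos flowers butterflies", "free music lyrics"),
]
print(drop_repetitive(candidates))
```

This removes only exact repeats; near-duplicates such as a query retyped with a typo would still need to be picked out by hand.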

Despite all that, the results contain some interesting haikus:

  • Stand up tanning bed
    gas prices new york city
    meanings of roses

  • Poems of springtime
    nash community college
    angioplasty

  • Cats urinating
    betty everett lyrics
    patsy cline lyrics

  • Free music lyrics
    tattoos flowers butterflies
    flowers butterflies

  • Empire flooring
    incident in a small town
    and justice for all

  • What does meekly mean
    what does serenading mean
    what does halted mean

Fortunately for Haiku with Teeth, I did not search the entire dataset. I ran the haiku search on a small subset as a proof of concept. While there may be thousands of interesting haikus left to find, for now, I am happy with my hand-picked few.

@ a Glance...

  • Duration: 1 month
  • Teammates: none (independent project)
  • Responsibilities: project definition, coding
  • Technology: Python