Data Science, how hard can it be?
Created on July 2, 2014, 9:16 p.m.
Let's write a poem, and let's do this thing the programmer way. I've got this big juicy dataset of poetry I recently acquired from err, somewhere or other. Let's whip out the statistics and see what we can do with it. Just a little data science. How hard can it be...
But where to start?
How about the length? Let's be honest, no one likes a long poem. Just stick it in a novel if you've that much to say. But we have the data now. Let's find out exactly how long it should be...
Okay, so maybe long poems aren't such a drain. It doesn't seem like there is much correlation between poem length and rating. Instead the rating measure just gets more unstable as the amount of data drops.
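For anyone wanting to poke at this themselves, here is a minimal sketch of how a length-versus-rating check might look. It assumes each poem is a hypothetical (text, rating) pair and uses made-up sample data, so it is not necessarily the analysis behind the result above. The bucket counts are returned alongside the means because a mean over one or two poems is exactly where the rating starts to look unstable.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant data; report 0
    return cov / sqrt(var_x * var_y)

def rating_by_length(poems, bucket_size=100):
    """Bucket poems by word count; return {bucket_start: (mean_rating, count)}."""
    buckets = defaultdict(list)
    for text, rating in poems:
        words = len(text.split())
        buckets[(words // bucket_size) * bucket_size].append(rating)
    return {start: (sum(r) / len(r), len(r)) for start, r in sorted(buckets.items())}

# Made-up sample data: (text, rating) pairs, purely for illustration.
poems = [
    ("purity is obscurity", 6.0),
    ("roses are red violets are blue", 5.0),
    ("word " * 150, 7.0),
    ("word " * 160, 4.0),
    ("word " * 900, 5.5),
]

lengths = [len(text.split()) for text, _ in poems]
ratings = [rating for _, rating in poems]
print(pearson(lengths, ratings))
print(rating_by_length(poems))
```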
Either way, I'm still giving a fat zero to the longest poem in the data set, "Ashtaroth - a Dramatic Lyric" by "Adam Lindsay Gordon", which is an immense 17,680 words long. Literally nobody has time for that.
I'm liking the look of the other side of the spectrum. The shortest poem in our data set is "Reflection On A Wicked World" by "Ogden Nash" consisting of just the three following words:
Purity
Is obscurity.
Nice. Poetic angst at its finest.
So length doesn't seem to matter. We'll keep it short because y'know - can we really be bothered?
How about line length? Weren't we taught all about iambic pentameter in English class, getting a nice rhythm to the poem and all that? How long should each line be, and does being iambic (having a du-DUH rhythm) really matter to how good a poem is?
According to my English teacher, if you're going for it, poems with lines that are 10 syllables long and have a nice iambic rhythm should be pretty sweet.
Okay, flat as a pancake. Screw that. Let's take a different approach.
Let's decide what to write about first. Topic is surely more important than length. Should it be Love? Ambition? Existentialism? Even...Data Science? (one can dream). This data set is tagged with topics for each poem. Let's check out the most popular topics and their average scores to get an idea - any idea.
Wait what? So the best poems are about mentors, moms, chocolate, television, and football? And since when was "concrete" a topic for poetry anyway? I found over 25 poems in the set with that label. I'm no English major, but I've seen some concrete in my time, and it has never inspired me to write a beautiful sonnet.
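A ranking like this can be reproduced with a few lines of counting. The sketch below assumes each poem arrives as a hypothetical (topics, rating) pair, and the sample values are invented. The `min_poems` cutoff is my own addition, there to keep a topic with a single lucky poem from topping the chart.

```python
from collections import defaultdict

def topic_scores(poems, min_poems=1):
    """Return (topic, mean_rating, poem_count) rows, best-rated first."""
    totals = defaultdict(lambda: [0.0, 0])  # topic -> [rating sum, poem count]
    for topics, rating in poems:
        for topic in topics:
            totals[topic][0] += rating
            totals[topic][1] += 1
    rows = [
        (topic, total / count, count)
        for topic, (total, count) in totals.items()
        if count >= min_poems
    ]
    # Sort by mean rating (descending), then alphabetically to break ties.
    return sorted(rows, key=lambda row: (-row[1], row[0]))

# Made-up sample data: (topics, rating) pairs, purely for illustration.
poems = [
    (["mom", "chocolate"], 8.0),
    (["concrete"], 7.0),
    (["concrete", "football"], 9.0),
    (["depression"], 4.0),
]

for topic, mean, count in topic_scores(poems):
    print(f"{topic:12s} {mean:4.1f} ({count} poems)")
```

Calling `topic_scores(poems, min_poems=2)` on this sample leaves only "concrete", which feels about right.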
I'm starting to wonder if this is really a good dataset. Or more specifically, if the rating and topic tagging, which appear to have largely been done by bored American housewives (with a love of concrete), are really going to tell me anything insightful.
In fact I think I've come up with the perfect poem already - a maternal haiku. Let's stop here.
mom eating chocolate
football on television
concrete between us
Pretty deep eh? I know.
A special mention goes out to the topic "depression", the only topic out of several hundred (including "concrete") which scored below the default 5.0 score. I guess you better face it. The people don't want to hear about your troubled childhood. Go back and write something about candy, or, America, or do you just hate freedom?!
So I guess that isn't going anywhere. But perhaps topic is linked to other things. My teacher always told me that iambic pentameter was like a heartbeat. It expressed passion, and power - love and joy. Let's take a look and see if she was right...
Ouch. Okay that was clearly a write-off.
Enough on the topics. What about rhymes? Everyone loves a sweet rhyme. The only thing that beats a sweet rhyme is a sweet pun, and iambic-inning to get sick of this poetry business. Eh. Eh.
Let's take a look at how the rhyme schemes stack up.
Wow. I mean I always liked him, but I didn't expect my main guy ABCAACCADEADEA, the greatest rhyme scheme ever, to be right up there. Not to mention my buddy, ABBACCCCCDDCCC. Five rhymes in a row, that is just something special. That's like - you can't just step up and write that - it takes something special inside.
But seriously, it doesn't really seem like rhyme schemes have any sensible correlation to rating, even when normalized to take into account the number of poems with that scheme. But it is nice to know that people slightly prefer ABCB to ABCC ... I guess.
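In case you're wondering where labels like ABCB come from, here is one very crude way to produce them. It assumes two lines rhyme when their final words share the same last two letters, which is a spelling heuristic of my own choosing and gets plenty of real rhymes wrong ("blue" and "true" match, "blue" and "through" don't). A proper version would compare pronunciations instead.

```python
import string

def rhyme_scheme(lines, suffix_len=2):
    """Label each line by the ending of its last word: same ending, same letter."""
    strip_punctuation = str.maketrans("", "", string.punctuation)
    labels = {}  # word ending -> scheme letter
    scheme = []
    for line in lines:
        words = line.lower().translate(strip_punctuation).split()
        if not words:
            continue  # skip blank lines
        ending = words[-1][-suffix_len:]
        if ending not in labels:
            # Letters wrap around after Z; fine for a sketch, not for epics.
            labels[ending] = string.ascii_uppercase[len(labels) % 26]
        scheme.append(labels[ending])
    return "".join(scheme)

verse = [
    "The cat sat on the mat,",
    "The dog ran far away.",
    "He chased a fat old rat,",
    "And then it was day.",
]
print(rhyme_scheme(verse))  # prints ABAB
```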
Clearly we're not going to learn how to write a super sweet poem this way.
But I have one final awesome idea.
Let's get the data to write the poem for us. Let's roll out the big guns.
It's Markov Chain time.
Oh yeah. This is it. This is the deep-o-tron.
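If you want a deep-o-tron of your own, a minimal sketch looks something like this. It assumes a word-level chain where the next word depends only on the current one, and a tiny made-up corpus. The real thing may well have used a different order or tokenisation, so treat the function names and parameters here as illustrative only.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words seen directly after it."""
    words = text.lower().split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)  # repeats keep the observed frequencies
    return dict(chain)

def generate(chain, length=10, seed=None):
    """Random-walk the chain for up to `length` words."""
    if not chain:
        return ""
    rng = random.Random(seed)
    word = rng.choice(sorted(chain))  # sorted so a fixed seed is reproducible
    output = [word]
    while len(output) < length:
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word only ever appeared last
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

# Tiny made-up corpus; a real run would feed in every poem in the data set.
corpus = "the sun and the moon and the sea"
chain = build_chain(corpus)
print(generate(chain, length=8))
```

Because followers are stored with repeats, common word pairs get picked more often, which is all the "statistics" a first-order chain needs.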
they mow to what was happy folk of that second
there this trivial circumstance to dust mixed up from
over the green field of life is the strife i wronged and aye and not for the cataract
And never for the cataract!
before that i tell were stirred to the image of green the shamrock the flowers
all in furrowed mountains in a violin in a merchant who bore come when we came and the sun i said earth are spray the earth my father will the clash oh music the rickety ferry for many a phantom forest
A personal favourite - this epic tale - of feeling, and questioning.
a tree i see all felt why
And this saucy verse about perfumed monkeys and bridal ass.
we sit and rather one leap out the ass in her bridal bower by a joy and holy ground suddenly she young and baulked his body give medicine on the earth my softest voice ill fares the warm our southern line now mere profession noble the perfumed monkeys on yonder house can it in the moon didst not moonbeams that wretched dead conquers every birth the last has been at whose lucid nothing but that is the blast you think not distant speaking ask ye your languid arms to their gear of
From which I am taking my new exclamation of surprise, "by a joy and holy ground!".
Unfortunately, I'm kinda stuck now. So I'll leave you with my final submission. But it's been fun folks - remember to tune in next time for "Data Science, how hard can it be?".
This poem I call "A Pale English Man on Holiday".
a man your victory is all its fill her sobbing with happiness three days passed and brown