In case my half-done thoughts are useful to anyone looking to build something in this space:
My aim is/was to allow configurable matching, so that you can match, e.g., "XxxXxx / XxxXxx1 / XxxXxx / XxxXxx1". That pattern means four consecutive lines of six syllables each, where X is a stressed syllable and x an unstressed one, and where the last syllables of the 2nd and 4th lines share the same phoneme, denoted "1". There are no phonemic constraints on any other syllables (this allows a crude approach to rhyme).
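To make the pattern notation concrete, here is a minimal Python sketch of a matcher. It assumes the text has already been split into lines of syllables, each given as a (stressed, phoneme) pair; that representation and the names parse_pattern and matches are made up for illustration, not taken from any existing tool.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: (is_stressed, phoneme label) per syllable.
Syllable = Tuple[bool, str]
Slot = Tuple[bool, Optional[str]]


def parse_pattern(pattern: str) -> List[List[Slot]]:
    """Turn 'Xx / Xx1' into per-line lists of (want_stressed, rhyme_label)."""
    lines = []
    for part in pattern.split("/"):
        slots: List[Slot] = []
        for ch in part.strip():
            if ch in "Xx":
                slots.append((ch == "X", None))
            elif ch.isdigit() and slots:
                # A digit tags the syllable just before it with a rhyme label.
                stressed, _ = slots[-1]
                slots[-1] = (stressed, ch)
            else:
                raise ValueError(f"unexpected character {ch!r} in pattern")
        lines.append(slots)
    return lines


def matches(pattern: str, lines: List[List[Syllable]]) -> bool:
    """True if the lines have the required stresses and shared phonemes."""
    spec = parse_pattern(pattern)
    if len(spec) != len(lines):
        return False
    bound: Dict[str, str] = {}  # rhyme label -> phoneme seen first
    for slots, syllables in zip(spec, lines):
        if len(slots) != len(syllables):
            return False
        for (want_stressed, label), (stressed, phoneme) in zip(slots, syllables):
            if want_stressed != stressed:
                return False
            # The first use of a label binds it; later uses must agree.
            if label is not None and bound.setdefault(label, phoneme) != phoneme:
                return False
    return True
```

Producing the (stressed, phoneme) pairs from real text is the hard part, and this sketch does not attempt it.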
I'm not entirely happy with cmudict because, since it works one word at a time, it can't really do much about stress, which can vary depending on the surrounding words. I've been using the output of espeak -x instead, which gives a phonetic rendering of an entire sentence, assigning both phonemes and stress. I'm not sure if it's genuinely an improvement though. Its poorly documented output surely isn't an improvement! And in particular it gives a normal prosaic reading of a sentence, which might be too constraining for poetry-finding, since poems often allow a bit of freedom in moving the stresses around.
To scan large amounts of text, the idea is to compile the configurable pattern into a regex that matches espeak -x output, so for example X gets mapped to a "match any stressed syllable" regex snippet. Alas, that's error-prone, especially since the espeak -x phoneme format is a bit quirky (e.g. no fixed length per syllable and no syllable markers, so you need some per-language rules to figure out which sequences of ASCII constitute which syllables, and I haven't debugged those).
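A rough Python sketch of the pattern-to-regex step follows. It deliberately does not target real espeak -x output. Instead it assumes a made-up, already-syllabified encoding: syllables are lowercase ASCII tokens separated by single spaces, a stressed syllable has a leading apostrophe, and lines are separated by " / ". The function name compile_pattern is also hypothetical.

```python
import re

# Assumed toy encoding: one syllable is a run of lowercase letters.
SYL = r"[a-z]+"


def compile_pattern(pattern: str) -> re.Pattern:
    """Compile e.g. 'Xx / Xx1' into a regex over the toy encoding."""
    seen = set()  # rhyme labels that already have a capturing group
    line_regexes = []
    for part in pattern.split("/"):
        pieces = []
        for m in re.finditer(r"([Xx])(\d?)", part.strip()):
            kind, label = m.group(1), m.group(2)
            if not label:
                body = SYL                      # any syllable
            elif label in seen:
                body = f"(?P=r{label})"         # must repeat the earlier one
            else:
                seen.add(label)
                body = f"(?P<r{label}>{SYL})"   # first use: capture it
            # X requires the stress mark; x forbids it, since SYL has no "'".
            pieces.append(("'" if kind == "X" else "") + body)
        line_regexes.append(" ".join(pieces))
    return re.compile(" / ".join(line_regexes))


rx = compile_pattern("Xx / Xx1 / Xx / Xx1")
hit = rx.fullmatch("'ka ta / 'mo ri / 'su la / 'pe ri")
```

Here the "same phoneme" constraint is approximated by requiring the whole syllable token to repeat, via a named backreference. A real version would need the per-language syllable rules mentioned above before anything like this could run on espeak output.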
One thought I want to explore in a later version is using cmudict's stress patterns for polysyllabic words, but ignoring any stress/meter rules for monosyllabic words. I suspect that'll do pretty well, and it'll be interesting to test it out.
For an example of where it seems weird with monosyllabic words, compare "I WENT to the STORE to BUY some BREAD", which has a sort of poetic rhythm, with "I went TO the STORE to BUY some BREAD", which seems weird, even in a poem. An offhand analysis is that stressing the main verb and then running "to the" together into one unstressed syllable is more natural than making the main verb unstressed and stressing the preposition. Perhaps buried in the code of some text-to-speech engine are heuristics that cover some of these cases? But perhaps they can just be ignored at first, and patched up later in cases where results are too strange.
Anyway, this is just miscellaneous thoughts about future enhancements; the current Nantucket is cool to try out.
The logic is simple: reject a word if it has both stressed and unstressed syllables, with a stressed one falling on an unstressed beat in the meter. Pass anything else.
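A minimal Python sketch of that rule, for anyone who wants to play with it. The input format is an assumption: each word's stress is a string with one character per syllable ('1' stressed, '0' unstressed), and the meter is a string of 'X' (strong beat) and 'x' (weak beat). Secondary stress is not handled, and the function names are made up here.

```python
from typing import List


def word_fits(word_stress: str, beats: str) -> bool:
    """Apply the rule to one word laid over its slice of the meter."""
    if len(word_stress) != len(beats):
        raise ValueError("word and meter slice must have equal length")
    # Words that are all stressed or all unstressed (which includes every
    # monosyllable) always pass.
    if not ("1" in word_stress and "0" in word_stress):
        return True
    # Mixed-stress word: reject if a stressed syllable lands on a weak beat.
    return not any(s == "1" and b == "x" for s, b in zip(word_stress, beats))


def line_fits(word_stresses: List[str], meter: str) -> bool:
    """Lay the words end to end over the meter and check each one."""
    pos = 0
    for stress in word_stresses:
        end = pos + len(stress)
        if end > len(meter) or not word_fits(stress, meter[pos:end]):
            return False
        pos = end
    return pos == len(meter)  # the line must fill the meter exactly


IAMBIC_PENTAMETER = "xXxXxXxXxX"
# "Shall I compare thee to a summer's day", stresses entered by hand.
sonnet_18 = ["1", "1", "01", "1", "1", "1", "10", "1"]
ok = line_fits(sonnet_18, IAMBIC_PENTAMETER)
```

Because monosyllables always pass, this accepts both readings of the "went to the store" example; it only constrains words whose own stress pattern is unambiguous.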
Very neat project, BTW. I've been wanting a good solution to the unknown-word problem. Here's an earlier post on my blank-verse detector: http://darius.livejournal.com/48525.html
On a different note, I read the about section of the blog and saw that the OP, in addition to this great stuff, is a beekeeping, hacking attorney who also spins fire. Amazing!
the labourer's time and that of
his family at the
disposal of the
capitalist for the purpose of
greater quantity of labour
In addition to a measure
of its extension
ie duration
labour now acquires a measure
-Karl Marx