Others may use objective, science-based tests like Humanity’s Last Exam. I prefer a more artistic question: can AI write a good poem on the Tay Bridge disaster of 1879? Last time I checked, results were pretty bad, with ChatGPT mostly distinguishable from McGonagall by being worse.
I wonder if the LLMs are intentionally underperforming because the incident is so strongly connected to bad poetry (Claude even gives McGonagall a shoutout!). It would be interesting to test with poetry about an unrelated disaster...
R1's effort was definitely the best, though some of its lines are a bit cringeworthy: "Plunged into Tay’s black throat, where darkness dwells." Yes, black throats often have a bit of darkness dwelling inside them, if you look closely.
It's fascinating how these are all fundamentally the same poem. They all have AABB or ABAB rhymes, four lines to a verse, 6-8 verses, and a rigid sense of narrative about them: they describe the disaster in linear time, before concluding with a preachy, moralizing lesson about the folly of human pride and hubris.
There's nothing wrong with this, but the fact that ALL the poems are like this is telling. (McGonagall's poem exhibits more creative freedom than all of the AI poems put together—almost certainly due to the author's incompetence, it must be said!)
It's a great example of "mode collapse": the way AI models trained on human preferences end up converging on the same kind of bland, samey slop. It used to be even worse. GPT-4, at launch, would refuse to write poems that didn't rhyme. Even when instructed not to rhyme, it would just ignore the user and write rhymes anyway. A clear artifact of human preference training: the thousands of Kenyans OpenAI paid $2/hr preferred rhyming poems, and thus GPT-4 ended up with a pathological fixation that ALL poems must rhyme, come hell or high water.
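If anyone wants to put a number on the "same poem" observation, here is a minimal Python sketch that fingerprints a poem's shape: stanza count, lines per stanza, and a rough rhyme scheme. Everything in it is my own assumption, not something from the post: the function names are made up, stanzas are assumed to be separated by blank lines or `***`, and "rhyme" is approximated by comparing the last three letters of each line's final word, which will miss slant rhymes and anything spelled differently from how it sounds.

```python
import re


def split_stanzas(poem: str) -> list[list[str]]:
    """Group non-empty lines into stanzas; blank lines or '***' end a stanza."""
    result, current = [], []
    for raw in poem.splitlines():
        line = raw.strip()
        if line == "" or line == "***":
            if current:
                result.append(current)
                current = []
        else:
            current.append(line)
    if current:
        result.append(current)
    return result


def rhyme_key(line: str, n: int = 3) -> str:
    """Crude stand-in for a rhyme sound: last n letters of the final word."""
    words = re.findall(r"[a-z']+", line.lower())
    return words[-1].replace("'", "")[-n:] if words else ""


def rhyme_scheme(stanza: list[str]) -> str:
    """Label each line A, B, C... by the first appearance of its rhyme key."""
    labels: dict[str, str] = {}
    scheme = []
    for line in stanza:
        key = rhyme_key(line)
        if key not in labels:
            labels[key] = chr(ord("A") + len(labels))
        scheme.append(labels[key])
    return "".join(scheme)


def fingerprint(poem: str) -> dict:
    """Summarise the poem's shape so different poems can be compared."""
    stanzas = split_stanzas(poem)
    return {
        "stanzas": len(stanzas),
        "lines_per_stanza": [len(s) for s in stanzas],
        "schemes": [rhyme_scheme(s) for s in stanzas],
    }


# One stanza quoted from this thread, used only as sample input.
SAMPLE = """***
Upon the Tay stands sturdy might,
A bridge of iron in stormy night,
Engine's scream and whistle's breath,
Unaware it rode towards death.
***"""

print(fingerprint(SAMPLE))
```

If most of the AI poems come back with the same stanza count and nothing but AABB or ABAB, that would support the mode-collapse reading. A proper check would need a pronunciation dictionary rather than spelling.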
The new ChatGPT 4.5 from yesterday produced the following:
***
Upon the Tay stands sturdy might,
A bridge of iron in stormy night,
Engine's scream and whistle's breath,
Unaware it rode towards death.
***
December's sky was dark and stern,
The wind that night no mercy earned,
Gale and tempest wildly spun,
Eighty lives on board undone.
***
Iron girders twist and strain,
Buckled under wind and rain,
Structure trusted, yet betrayed—
Into river's arms they laid.
***
Shattered beams beneath the wave,
Testament to bridges brave,
Yet lessons hard, disasters taught,
Sturdier spans thereafter wrought.
***
Today, beneath calm Scottish skies,
A silent monument abides,
Whispering still in tides below,
Stories tides forever know.
***
For bridges built between the banks,
Let's pause to give our quiet thanks—
But ne'er forget all lost that day,
When storm laid claim upon the Tay.
***
It will be interesting to see what it can do when it starts reasoning like the o-series models and DeepSeek.
EDIT: Of course the formatting was ruined. Maybe this will at least be readable.
We need to give these models some absinthe
I've written about that: https://deviantabstraction.com/2024/05/02/analyzing-poems-with-llm/
If you use the "vanilla poem evaluator", it doesn't work very well. Also, the reason poetry works so well is the word-vector (embedding) system (I am happy to explain more if it's your thing).
Here's the result of the "really good one" with a less "naive prompt":
"Suggested Grade (for a top-tier journal submission): B+ / A-
Strong work with high potential. Some minor refinement—particularly in meter, diction, and restraint—would likely elevate it to a publishable caliber in a leading outlet."
This is very interesting. I just tested it with two classic poems (which it correctly evaluated) and two of the AI poems from this piece (which it also correctly evaluated), but when I gave it the recent "AI Paradise Lost" it went nuts telling me it was a masterpiece...
Where's the AI Paradise Lost? I can't find a link.
https://x.com/aiamblichus/status/1884022422116176083?s=46&t=S9M-a9fSyra6-YJ1Z-Y80Q
I wonder what would happen if you gave the AI the feedback. Could you get an actor/critic loop going and what would the output look like?
I tried that, and it's "feedbacking to crap."
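For anyone who wants to try the loop themselves, here is a minimal sketch of one way to set it up. It is not what was actually run above: `refine`, `write`, and `critique` are hypothetical names, and the two callables are stand-ins for whatever model calls you use. The one design choice worth noting is that it keeps the best-scoring draft and stops after a couple of rounds without improvement, so a loop that starts "feedbacking to crap" at least returns the draft from before it went downhill.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: plug in your own model calls.
Writer = Callable[[str, Optional[str], Optional[str]], str]   # (prompt, previous draft, feedback) -> draft
Critic = Callable[[str], Tuple[float, str]]                   # draft -> (score, feedback)


def refine(prompt: str, write: Writer, critique: Critic,
           rounds: int = 5, patience: int = 2) -> Tuple[str, float]:
    """Alternate writer and critic; return the best-scoring draft seen."""
    draft = write(prompt, None, None)
    score, feedback = critique(draft)
    best_draft, best_score = draft, score
    stale = 0  # consecutive rounds without a new best score
    for _ in range(rounds):
        draft = write(prompt, draft, feedback)
        score, feedback = critique(draft)
        if score > best_score:
            best_draft, best_score = draft, score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # the feedback is no longer helping
    return best_draft, best_score


# Toy stand-ins so the loop can be run without any model.
def toy_write(prompt: str, previous: Optional[str], feedback: Optional[str]) -> str:
    # Each "revision" just appends a character to the previous draft.
    return prompt if previous is None else previous + "!"


def toy_critic(draft: str) -> Tuple[float, str]:
    # Pretend exactly one revision is ideal and further revisions hurt.
    revisions = draft.count("!")
    return 10 - 3 * abs(revisions - 1), "tighten the meter"


print(refine("poem", toy_write, toy_critic))
```

With real models, the critic's score is itself noisy, so the "best" draft is only as trustworthy as the evaluator.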
That being said, you're right, it's writing much better than humans on average.