Others may use objective, science-based tests like Humanity’s Last Exam. I prefer a more artistic question: can AI write a good poem on the Tay Bridge disaster of 1879? Last time I checked, the results were pretty bad, with ChatGPT mostly distinguishable from McGonagall by being worse.
If you use the "vanilla poem evaluator", it doesn't work very well. Also, the reason poetry works so well is the word vec system (I am happy to explain more if that's your thing).
Here's the result of the "really good one" with a less "naive prompt":
"Suggested Grade (for a top-tier journal submission): B+ / A-
Strong work with high potential. Some minor refinement—particularly in meter, diction, and restraint—would likely elevate it to a publishable caliber in a leading outlet."
This is very interesting. I just tested it with two classic poems (which it correctly evaluated) and two of the AI poems from this piece (which it also correctly evaluated), but when I gave it the recent "AI Paradise Lost", it went nuts, telling me it was a masterpiece...
I've written about it here: https://deviantabstraction.com/2024/05/02/analyzing-poems-with-llm/
Where's the AI Paradise Lost? I can't find a link.
I wonder what would happen if you gave the AI the feedback. Could you get an actor/critic loop going and what would the output look like?
I tried that, and it's "feedbacking to crap."
That being said, you're right: on average, it writes much better than humans.