Treat it like grading an essay. You can grade pass/fail, or you can grade from 0 to 100, though for the latter you probably want to develop some kind of scoring rubric so the numbers are consistent from one output to the next.
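To make the "scoring rules" idea concrete, here is a minimal sketch of a rubric in Python. The criteria names, the point weights, and the pass threshold of 70 are all made up for illustration; a human (or some other grader) still has to decide how many points each output earns on each criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    max_points: int


# Hypothetical rubric: the criteria and weights are illustrative assumptions.
# The maximum points sum to 100 so the total reads as a 0-100 grade.
RUBRIC = [
    Criterion("answers the question", 40),
    Criterion("factually accurate", 40),
    Criterion("clear and well organized", 20),
]


def score(awarded: dict[str, int]) -> int:
    """Sum the points a grader awarded, checking each against the rubric."""
    total = 0
    for criterion in RUBRIC:
        # A criterion the grader skipped counts as zero points.
        points = awarded.get(criterion.name, 0)
        if not 0 <= points <= criterion.max_points:
            raise ValueError(
                f"{criterion.name}: {points} is outside 0..{criterion.max_points}"
            )
        total += points
    return total


def passes(awarded: dict[str, int], threshold: int = 70) -> bool:
    """Collapse the 0-100 grade to pass/fail (threshold is an assumption)."""
    return score(awarded) >= threshold


if __name__ == "__main__":
    grades = {
        "answers the question": 40,
        "factually accurate": 30,
        "clear and well organized": 10,
    }
    print(score(grades), passes(grades))
```

The same rubric gives you both options: keep the 0 to 100 total, or apply a threshold if you only need pass/fail.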
I know a lot about evaluating classification, but I am seeing people struggle with evaluating text generators; why don’t you look up my profile, send me an email, and we can talk more.