I appreciate the feedback, and I agree that ChatGPT is a no-go for many use cases.
We're putting together a better comparison focused specifically on accuracy across the LLMs (ChatGPT, Bard, Falcon) and traditional models. Hope that one hits the spot for you! Are there specific metrics you think might be interesting? We were primarily looking at F1/accuracy for this task, but we're also trying to see which types of classes each model handles well, using semantic similarity.
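
For anyone wondering what the F1/accuracy numbers mean per class, here is a minimal sketch in plain Python. It is only an illustration, not our actual evaluation code: the function names and the toy labels are made up for this example, and F1 is computed one-vs-rest for each label.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # predicted p, but it was not p
            fn[t] += 1      # missed an example of class t
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels, purely for illustration.
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
scores = per_class_f1(y_true, y_pred)
```

Looking at the per-class scores rather than a single aggregate is what lets you see which classes a given model struggles with, even when overall accuracy looks fine.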