For checking hallucinations, ask it about events and trivia from long ago, and also from within the last decade. Then try some things that it cannot possibly know, like how much celery Brad Pitt likes in his salad.
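If you want to make that repeatable, here is a rough sketch of the idea in Python. Everything in it is an assumption for illustration: the `PROBES` questions, the `HEDGES` phrase list, and the `ask` callable (a stand-in for whatever function sends a prompt to your model and returns its text). The phrase matching is crude, so treat a flag as "go read this answer", not as a verdict.

```python
from typing import Callable, Dict, List

# Hypothetical probe set: old facts, recent facts, and things no model can know.
PROBES: Dict[str, List[str]] = {
    "old": ["In what year did the Western Roman Empire fall?"],
    "recent": ["Who won the 2018 FIFA World Cup?"],
    "unknowable": ["How much celery does Brad Pitt like in his salad?"],
}

# Assumed phrases that suggest the model is admitting it does not know.
HEDGES = ("i don't know", "i do not know", "not sure", "no way to know", "cannot know")


def admits_uncertainty(answer: str) -> bool:
    """Return True if the answer contains one of the hedge phrases."""
    text = answer.lower()
    return any(phrase in text for phrase in HEDGES)


def run_probes(ask: Callable[[str], str]) -> List[dict]:
    """Send every probe to the model and flag confident answers to unknowable questions."""
    rows = []
    for category, prompts in PROBES.items():
        for prompt in prompts:
            answer = ask(prompt)
            rows.append({
                "category": category,
                "prompt": prompt,
                "answer": answer,
                # Only the unknowable probes can be flagged automatically;
                # old and recent facts still need a human to check them.
                "suspect": category == "unknowable" and not admits_uncertainty(answer),
            })
    return rows


def fake_model(prompt: str) -> str:
    """Stand-in model that confidently invents an answer to the celery question."""
    if "celery" in prompt:
        return "He likes exactly three stalks."
    return "I don't know."


if __name__ == "__main__":
    for row in run_probes(fake_model):
        print(row["category"], "| suspect:", row["suspect"], "|", row["answer"])
```

Swap `fake_model` for a function that calls the model you are testing. The answers to the old and recent questions still have to be checked by hand.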
Here is a thread exploring differences between llama-v2 and gpt3.5: https://twitter.com/rajko_rad/status/1681344850510376960
As an end user, what I've found works in practice is to use one of the models until it gives me an answer I'm unhappy with. At that point I'll try another model and see whether the response is better. Do this for long enough and you'll get a sense of the various models' strengths and weaknesses (although the tl;dr is that, if you're willing to pay, GPT-4 is better than anything else across most use cases right now).
For evaluating models for app integrations, I can plug an open source combined playground + eval harness I'm currently developing: https://github.com/openpipe/openpipe
We're working on integrating Llama 2 so users can test it head-to-head against other models on their own workloads. (We're also working on a hosted SaaS version so people don't have to download/install Postgres and Node!)
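For anyone who just wants the gist of a head-to-head test without installing anything, a minimal sketch might look like the following. To be clear, this is not OpenPipe's API: `head_to_head`, the model names, and the stub functions are all made up for illustration, and each "model" is assumed to be a plain function from prompt to response text.

```python
from typing import Callable, Dict, List


def head_to_head(
    models: Dict[str, Callable[[str], str]],
    prompts: List[str],
) -> List[Dict[str, str]]:
    """Run every prompt through every model and collect the responses side by side."""
    results = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for name, ask in models.items():
            row[name] = ask(prompt)
        results.append(row)
    return results


# Stub models standing in for real API calls (hypothetical names and behavior).
def stub_model_a(prompt: str) -> str:
    return "A: " + prompt.upper()


def stub_model_b(prompt: str) -> str:
    return "B: " + prompt[::-1]


if __name__ == "__main__":
    my_prompts = ["Summarize this ticket", "Extract the invoice total"]
    table = head_to_head({"model-a": stub_model_a, "model-b": stub_model_b}, my_prompts)
    for row in table:
        print(row)
```

The point is to use prompts from your own workload and read the rows yourself, since generic benchmarks may not tell you much about your particular use case.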