Hallucination is a big issue. Sometimes the models would not output anything, or they would output paragraphs from 'real' stories that share only one or a few words with the input.
The uncensored models were better at heading in the direction of the correct ideas, but still performed poorly in my tests. But hey, at least the models have the right idea.
I can't share exact examples on HN, for obvious reasons.
When I used "virgin" LLaMA in a chat context, the results were inconsistent and useless. It would output repeating phrases (even with a repeat penalty of 2.1) or lots of emojis.
Vicuna was better, but still had an "OpenAI" personality to it.