A lot of times when I ask for a source, I get broken links. I'm not sure if the links existed at one point, or if the LLM is just hallucinating where it thinks a link should exist. It happens with CDN library URLs, for example, or with sources for specific laws.
For example: "/glossary/love-parade" - There is no mention of this on my website. "/guides/blue-card-germany" has always been at "/guides/blue-card". I don't know what "/guides/cost-of-beer-distribution" even refers to.
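One cheap guard against this kind of made-up path is to compare every cited path with the list of pages that really exist, and to suggest the nearest real one when there is no exact match. The following is only a minimal sketch: `KNOWN_PATHS`, `check_path` and the `0.75` cutoff are hypothetical, and apart from "/guides/blue-card" the known paths are invented for illustration. In practice the list would come from something like a sitemap.

```python
from difflib import get_close_matches

# Hypothetical list of paths that really exist on the site (e.g. from a sitemap).
# Only "/guides/blue-card" is taken from the comment above; the rest are made up.
KNOWN_PATHS = [
    "/guides/blue-card",
    "/guides/health-insurance",
    "/glossary/anmeldung",
]


def check_path(path, known_paths=KNOWN_PATHS, cutoff=0.75):
    """Return (exists, suggestion) for a path cited by a model."""
    if path in known_paths:
        return True, None
    # No exact match: look for a similarly spelled real path.
    close = get_close_matches(path, known_paths, n=1, cutoff=cutoff)
    return False, (close[0] if close else None)


for cited in ["/glossary/love-parade", "/guides/blue-card-germany"]:
    print(cited, check_path(cited))
```

With this toy list, "/guides/blue-card-germany" is flagged as missing with "/guides/blue-card" as the suggested replacement, while "/glossary/love-parade" is flagged with no suggestion at all.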
They'll do pretty much everything you ask of them. So unless the text actually comes from some source (via tool calls, content injected into the context, or some other way) and unless they're prompted otherwise, they'll make up a source rather than do nothing.
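If the sources are injected into the context, you also know exactly which URLs the model was allowed to cite, so anything else in the answer can be flagged after the fact. A rough sketch, assuming citations appear as plain URLs in the answer text (`split_citations` and the example URLs are hypothetical):

```python
import re

# Rough URL matcher: stops at whitespace and common closing brackets.
URL_RE = re.compile(r"https?://[^\s)\]>]+")


def split_citations(answer, context_urls):
    """Split URLs cited in `answer` into (grounded, ungrounded) lists.

    A URL counts as grounded only if it was part of the context
    that was actually given to the model.
    """
    allowed = {u.rstrip("/") for u in context_urls}
    grounded, ungrounded = [], []
    for url in URL_RE.findall(answer):
        url = url.rstrip(".,;")  # drop trailing sentence punctuation
        if url.rstrip("/") in allowed:
            grounded.append(url)
        else:
            ungrounded.append(url)
    return grounded, ungrounded


context = ["https://example.com/guides/blue-card"]
text = ("See https://example.com/guides/blue-card and "
        "https://example.com/guides/blue-card-germany.")
print(split_citations(text, context))
```

This does not prove the grounded URLs support the claim next to them. It only catches links that could not have come from the supplied material.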
For every line of text output, give me a full MLA annotated source. If you cannot, then say your source does not exist; or, if you are generating information based on multiple sources, give me those sources. If you cannot do that, print that you need more information to respond properly.
Every new model I mess with needs a slightly different prompt due to safeguards or source protections. It's interesting when it lists a source that I physically own and I can see how degraded its training data for that source is.
(unless you are Google etc., which are specifically let in so the article gets indexed for search)
How do you make an LLM understand that it must only give factual sources? Just some naive RL with positive reward on the correct sources and negative reward on incorrect sources is not enough -- there are obscenely many more hallucinated sources possible, and the set of correct sources is a set of insanely tiny measure.
The loop "create a research plan, load a few promising search results into context, summarize them with the original question in mind" is vastly superior to "freely associate tokens based on the user's question, and only think about sources once they dig deeper".
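That loop can be sketched in a few lines. This is a toy under loud assumptions: the in-memory `CORPUS` and keyword overlap stand in for a real search backend, and `plan` and `summarize` are placeholders for what would be model calls. All names and URLs are hypothetical. The point is the shape: sources are whatever was loaded into context, never something recalled afterwards.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    text: str


# Hypothetical in-memory corpus standing in for a real search backend.
CORPUS = [
    Doc("https://example.com/guides/blue-card",
        "The Blue Card is a residence permit for skilled workers."),
    Doc("https://example.com/guides/health-insurance",
        "Health insurance is mandatory for residents."),
]


def plan(question):
    """Step 1: turn the question into search terms (a model would do this)."""
    return [w.strip("?.,").lower() for w in question.split() if len(w) > 3]


def search(terms, corpus=CORPUS, limit=2):
    """Step 2: rank documents by keyword overlap and keep the top few."""
    scored = []
    for doc in corpus:
        words = set(doc.text.lower().replace(".", "").split())
        score = len(words & set(terms))
        if score:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:limit]]


def summarize(question, docs):
    """Step 3: placeholder for the model call that answers from the loaded text only."""
    return " ".join(doc.text for doc in docs)


def answer(question):
    docs = search(plan(question))
    if not docs:
        # Nothing retrieved: say so instead of inventing a source.
        return {"answer": "I need more information to respond properly.", "sources": []}
    return {"answer": summarize(question, docs), "sources": [d.url for d in docs]}


print(answer("What is the Blue Card permit?"))
print(answer("How tall is Mount Everest?"))
```

The first question returns the blue-card page as its only source. The second finds nothing in the toy corpus and falls back to asking for more information rather than citing anything.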