> Correct. The basic concept of truth in logic relies on an objective reality, but an a priori expression holds true even in the absence or indistinctness of such a reality. The truthfulness or correctness of a posteriori statements, however, can depend on that reality. An example of the former would be "If A is B, then B is C. A is B, therefore B is C." An example of the latter would be "It is raining outside."
What you're describing is the distinction between what are referred to in philosophy as analytical statements and synthetic statements.
Analytical statements are relations between ideas per se that don't necessarily relate to external reality -- your example of syllogistic reasoning, where relations between symbols with no specific meaning can still be logically "true", is an analytical statement.
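That point about symbols with no specific meaning can be made concrete. A rough sketch: the inference pattern in the example above (modus ponens) can be checked mechanically over every possible truth assignment, with no reference to what the symbols stand for. The code below is just an illustration of that idea, not part of anyone's argument upthread.

```python
from itertools import product

def implies(a, b):
    """Material implication: 'a -> b' is false only when a is true and b is false."""
    return (not a) or b

# Modus ponens: from (P -> Q) and P, infer Q.
# The inference is valid iff the conclusion Q holds under EVERY truth
# assignment in which both premises hold. P and Q are bare symbols;
# no external reality is consulted.
valid = all(
    q
    for p, q in product([True, False], repeat=2)
    if implies(p, q) and p
)
print(valid)  # True: the pattern is valid under every assignment
```

Replace the conclusion with something that doesn't follow (say, `not q`) and the same check returns `False` -- validity here is purely a relation between the symbols.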
Synthetic statements pertain to external reality. They may be expressing direct observations of that reality, or making deductive conclusions based on prior observations, but either way, are proposing something that is empirically testable.
In this case, we're only considering the synthetic statements that the LLM produces. And since the LLM is only ever generating probabilistic inferences, with no direct observation factoring into the generation of the statement and no capacity to empirically test the statement after it is generated, it is only ever "hallucinating".
This is no different from a human brain experiencing hallucinations -- when we hallucinate, our brains are essentially simulating sensory perception wholly endogenously. What we hallucinate might well be informed by our past experience, and be contextually plausible and meaningful to us for that reason, but no specific hallucination is actual sensory perception of the external world.
The LLM only has the capacity to generate endogenous inferences, and entirely lacks the capacity for direct perception of external reality, so it is always hallucinating.
> The LLM has inputs from reality (is it possible not to?); it is trained on a huge corpus of text written by humans who themselves perceive reality.
We're talking about specific outputs generated by the LLM, not the LLM itself. The training data consists of prior expressions of language which in turn may be influenced by human observations of reality, but the LLM is only ever making probabilistic inferences based on that second-order data. The specific expressions it outputs are never generated by reference to the specific reality they represent.
> 1- Novel observations can occur purely from remixing. Einstein locked himself away during a pandemic and developed the theory of relativity without additional experimental input.
Einstein was engaging in a combination of inductive and deductive reasoning in order to generate a theoretical model that could then be empirically tested. That's how science works. There was no novel observation involved, just a theoretical model built on prior data. Observations to test that model come afterwards. And LLMs do not engage in observation.
> 2- LLMs combine their existing data with human input, which is an external source.
Those humans are not using the LLM just to return their input back to them -- they're usually asking the LLM to verify or expand on their input, not supplying observations against which the LLM can test its own claims.
> 3- LLMs can interact with other sources of data whether by injection of data into the prompt, by function calling, RAG, etc..
Yes, they can, and this is where the bulk of the value offered by LLMs comes from. With RAG, the LLM amounts to an advanced NLP engine rather than true generative AI: it is being used only for its ability to produce fluent English, not to infer its own claims about reality. In this configuration, LLMs are sophisticated search engines, which is extremely valuable, and is the only truly reliable use case for LLMs at the present moment.