You give DSPy (1) your free-form code with declarative calls to LMs, (2) a few inputs [labels optional], and (3) some validation metric [e.g., sanity checks].
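To make this concrete, here's a minimal sketch of those three ingredients. The pipeline, example data, and metric below are made up for illustration, and exact DSPy class names can differ slightly across versions:

```python
import dspy

# (1) Free-form code with declarative LM calls: a tiny two-step pipeline.
class AnswerWithContext(dspy.Module):
    def __init__(self):
        super().__init__()
        self.summarize = dspy.ChainOfThought("context, question -> summary")
        self.answer = dspy.ChainOfThought("summary, question -> answer")

    def forward(self, context, question):
        summary = self.summarize(context=context, question=question).summary
        return self.answer(summary=summary, question=question)

# (2) A few inputs; labels are optional, but here we include a final answer.
trainset = [
    dspy.Example(
        context="DSPy compiles declarative LM pipelines.",
        question="What does DSPy compile?",
        answer="declarative LM pipelines",
    ).with_inputs("context", "question"),
    # ...a handful more examples
]

# (3) A validation metric, e.g. a simple sanity check on the final output.
def validate_answer(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()
```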
It simulates your code on the inputs. When there's an LM call, it makes one or more simple zero-shot calls that respect your declarative signature. Think of this as a more general form of "function calling," if you will. It's just trying things out to see what passes your validation logic, but it's a highly constrained search process.
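Concretely, running the uncompiled program just issues one structured zero-shot call per signature. A sketch, assuming a recent DSPy version (the LM client and model name are only examples):

```python
# Configure some LM; the client class and model name vary by DSPy version and provider.
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))

program = AnswerWithContext()
pred = program(context="DSPy compiles declarative LM pipelines.",
               question="What does DSPy compile?")
print(pred.answer)  # zero-shot output, shaped only by the signature's fields
```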
The constraints enforced by the signature (per LM call) and the validation metric allow the compiler [with some metaprogramming tricks] to gather "good" and "bad" examples of execution for every step in which your code calls an LM. This works even if you have no labels for those steps, because you're just exploring different pipelines. (Who has time to label each step?)
For now, we throw away the bad examples. The good examples become potential demonstrations. The compiler can then run an optimization process to find the best combination of these automatically bootstrapped demonstrations in the prompts: maybe the best on average, maybe (in principle) the predicted best for a specific input. There's no magic here; it's just optimizing your metric.
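In DSPy terms, that whole loop (bootstrap traces, keep the ones your metric accepts, search over which kept demonstrations go into each prompt) is a single compile call. A hedged sketch, assuming the BootstrapFewShotWithRandomSearch optimizer; argument names may vary by version:

```python
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

optimizer = BootstrapFewShotWithRandomSearch(
    metric=validate_answer,    # your validation logic decides what counts as "good"
    max_bootstrapped_demos=4,  # cap on metric-validated demos per LM call
)
compiled = optimizer.compile(AnswerWithContext(), trainset=trainset)

# `compiled` now carries bootstrapped demonstrations inside each step's prompt.
pred = compiled(context="DSPy compiles declarative LM pipelines.",
                question="What does DSPy compile?")
```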
The same bootstrapping logic lends itself (with more internal metaprogramming tricks, which you don't need to worry about) to finetuning models for your LM calls, rather than prompting them with demonstrations.
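DSPy exposes this through a finetuning teleprompter. The sketch below assumes the BootstrapFinetune optimizer and elides the training-job details, which depend on your DSPy version and model provider:

```python
from dspy.teleprompt import BootstrapFinetune

finetuner = BootstrapFinetune(metric=validate_answer)
finetuned = finetuner.compile(AnswerWithContext(), trainset=trainset)
# Same pipeline, same metric; the bootstrapped traces now become finetuning data
# for the underlying model(s) instead of few-shot demonstrations in the prompt.
```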
In practice, this works really well because even tiny LMs can do powerful things when they see a few well-selected examples.