A big catch here is that you can't slap an open source license on a bunch of copyrighted training data, and to date no one has created a truly convincing LLM exclusively trained on public domain data. It might happen soon though - there are some convincing efforts in progress.
Many people think these companies are training on unlicensed data, but I think OpenAI does license its data; it just “licenses” it the way one would need to in order to read it.
Come on, that's not reasonable to expect from a company, or useful for indie hackers. Having weights that can be used however you like is enough for most people, even large companies.
Maybe it should be called something else? "Openly-licensed"?
Just because the model weights are not really "source", either as a matter of intuition or, for example, under the OSI "preferred form in which a programmer would modify the program" definition.
Sure, but I don't want to train anyone's model from scratch. Realistically, I can't download all the training data, or run the pipeline, or train the model. Making all of that available to me would be a massive burden on the company too, so they simply won't do it. If I'm able to fine-tune it, that's enough for me, and imo, that fits with the spirit of open/free software. We have to understand that this is fundamentally a different thing than something like the Linux kernel, and closer to something like an industrial project. The output is just a bunch of numbers instead of something physical.