0 points | bayindirh | 2y ago | 0 comments

First of all, "The Stack" is the dataset that models like StarCoder are trained on. I don't know what the data source is for the IBM Granite family.

I know The Stack is not clean, because it includes my fork of GDM's greeter, which is GPL licensed.

My remarks about IBM were general. I can't say anything about their models, because I didn't see any mention of "The Stack", and I don't know what their models are based on.

On the other hand, in my experience IBM doesn't like risk, so they would play it much safer than other companies.

If their data is not clean to begin with, then shame on them, and I hope their AI efforts burn to the ground.

BTW, LLM training is not fair use. For a start, the definition of fair use automatically excludes "for profit" usage. Just because OpenAI has a non-profit arm and the training is done there doesn't make them immune to the consequences of their for-profit operations.
