Better HN

0 points · vitalyan123 · 5d ago · 0 comments

distillation of thinking models is not particularly effective - both "Open"AI and Misanthropic don't show you the real chain of thought, only a severely downscaled version of it. both do everything in their power to combat such outrageous copyright infringement, so the bulk of the unethically scraped data the Chinese have is from several generations ago.

22 comments · 10 top-level

duskdozer 5d ago · 4 in thread

>such outrageous copyright infringement

Sarcasm, considering the source of their own training data?

margalabargala 5d ago

Considering they called the company "Misanthropic", sarcasm is a safe bet.

duskdozer 4d ago

Somehow, I completely overlooked that.

orphea 5d ago

Narrator: it was sarcasm, indeed.

baron3dl 5d ago

IP for me, not thee.

mirekrusin 5d ago · 4 in thread

I don’t understand why there isn’t a public dataset for reasoning that can be improved by humans/LLMs, like Wikipedia (i.e. with automatic judging of contributions, etc.).

woctordho 4d ago

There is already a lot of effort to collect agent traces, including reasoning; e.g. see the recent discussion: https://old.reddit.com/r/LocalLLaMA/comments/1u795pb/donate_...

We've been developing DataClaw for this: https://github.com/peteromallet/dataclaw

mirekrusin 3d ago

Did I get it wrong, or does the first link have a dataset with only 30 entries?

logicchains 5d ago

For reasoning, a manually curated dataset is too small; you need to be able to automatically generate vast volumes of synthetic reasoning data with provably correct answers. That's presumably why Claude and GPT are so good at using Lean (the theorem prover): they get fed a bunch of synthetic, verifiably correct training data.

mirekrusin 4d ago

Wikipedia is a lot of data as well, but we manage to do it, no?

mannanj 5d ago · 2 in thread

The companies that committed copyright infringement and unethically scraped data think that copyright infringement and unethically scraping data are wrong and need to be stopped.

Though only in particular situations, like when it’s done to them and not when they do it. Because they have the power and are morally right and know better than you. And if you question this at all, well, you’re a threat to American values and a supporter of the Chinese, and you’re leading to the breakdown of democracy.

This isn’t a type of argument or manipulation tactic used by the rich throughout history to trick the naive and gullible masses or anything like that. Trust me, I’m rich and I’m morally right. /sarcasm

brookst 4d ago

It’s been amazing to see the arc of tech people going from “evil Disney, copyright is an abomination, information wants to be free” to “OMG copyright is inviolable and AI is taking money out of Plato’s descendants’ pockets!”

solid_fuel 4d ago

> taking money out of Plato’s descendants’ pockets

Yeah, remind me - is it Plato's descendants that people are concerned about here, or is it every single author who had any work in Anna's Archive, any work published online, any work published on GitHub, etc.?

I think that people are probably upset about the harm to living people who had their work stolen by Meta and other LLM companies - regardless of license, terms of use, or any other attempted protection.

Bolwin 5d ago · 1 in thread

For Claude models at least, you can tell them to just think manually in the output and it works fine. I do it regularly because for creative writing and summarization, they seem to believe they don't need to think at all, and get way worse results.

carterschonwald 5d ago

this helps so much. i do it too. with some of the newer frontier models it's unclear if you can even turn it off in the first-party chat apps. haven't compared api semantics yet.

ComputerGuru 5d ago · 1 in thread

Supposedly there are “jailbreaks” that expose considerably more of the thinking traces.

woctordho 4d ago

Simple trick: Use an agentic tool like Pi or OpenCode that allows you to switch models. First do some chats with DeepSeek or GLM, which show full thinking traces, then switch to Claude or GPT, and it's more likely to show full thinking traces.

nyrikki 5d ago

It is quite likely that the intermediate tokens don’t have ‘semantic import’ [0].

There are methods like Habitual Reasoning Distillation or Inverted Reasoning Traces [1] that can help.

While there are reasons to hide the intermediate tokens from an IP protection standpoint, there is also a need to hide more effective and efficient generation that doesn’t fit the R1 claims of an "aha moment", which has been debunked but remains a consumer expectation.

While hidden intermediate tokens do increase the difficulty, they are not a firm barrier in themselves, especially since they are billed, which gives information about their length.

[0] https://arxiv.org/abs/2504.09762v4

[1] https://arxiv.org/abs/2603.07267

kmeisthax 5d ago

Chinese distillation attacks are about as unethical as Robin Hood stealing from the rich to give to the poor. The real unethical scraping was done by Anthropic to train Claude.

To be clear, if Anthropic were using totally licensed data, I'd be sympathetic to these claims. But if you're going to pirate the world's creativity, you'd better be willing to gimme dat shit for free[0].

[0] As said by Hungry Santa.

overfeed 5d ago

FYI: model outputs are not protected by copyright.

BoorishBears 5d ago

Reasoning models can be coaxed to reason like they do in dedicated reasoning blocks outside of those blocks, in normal parts of the response.

But Anthropic at least has openly admitted they try to detect that and interfere.

orbital-decay 5d ago

You can trivially leak the CoT of any current model; it's not a problem.

>outrageous copyright infringement

>unethically scraped data

Hahahahaha
