Better HN

0 points · Zee2 · 2mo ago · 0 comments

Alignment “appearing” better as model capabilities increase scares the shit out of me, tbh.

2 comments · 2 top-level

arcanus · 2mo ago

Conversely: in humans, intelligence is inversely correlated with crime.

It doesn't go to zero, however!

mik09 · 2mo ago

Yeah, Anthropic tries to address this through mechanistic interpretability, but I'm not sure they are progressing as fast in that domain as in their model development.
