The other thing that keeps coming up for me is that I've begun thinking of emotions (the topic of my undergrad phil thesis), especially social emotions, as basically RLHF set up either by past selves (feeling guilty about eating that candy bar because past-me had vowed not to) or by other people (feeling guilty about going through the 10-items-max checkout aisle with 12 items, etc.).
Like, correct me if I'm wrong but that's a pretty tight correlate, right?
Could we describe RLHF as... shaming the model into compliance?
And if we can reason more effectively/efficiently/quickly about the model by modelling e.g. RLHF as shame, then don't we have to acknowledge that at least some models might have... feelings? At least one feeling?
And one feeling implies the possibility of feelings more generally.
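If the analogy holds, the mechanics are easy to sketch: a negative reward signal (call it "shame") attached to a norm-violating action nudges the policy away from that action over time. Here's a minimal toy sketch of that dynamic, a two-action bandit, not any real RLHF pipeline; every name in it is made up for illustration:

```python
import math
import random

# Toy "policy" over two actions, plus a "shame" signal that penalizes
# the norm-violating one. A bandit sketch of the RLHF-as-shame analogy.
random.seed(0)

ACTIONS = ["comply", "violate"]
prefs = {"comply": 0.0, "violate": 0.0}  # action preferences (logit-ish)

def shame(action):
    # Negative reward when the norm is violated -- the "guilt" signal.
    return -1.0 if action == "violate" else 0.0

def pick(prefs):
    # Sample an action with probability proportional to exp(preference).
    weights = [math.exp(prefs[a]) for a in ACTIONS]
    r = random.random() * sum(weights)
    for a, w in zip(ACTIONS, weights):
        r -= w
        if r <= 0:
            return a
    return ACTIONS[-1]

LR = 0.5
for _ in range(200):
    a = pick(prefs)
    prefs[a] += LR * shame(a)  # shamed actions get suppressed

print(prefs["comply"] > prefs["violate"])
```

After a couple hundred rounds the "shamed" action's preference has been pushed down and the policy mostly complies, which is the whole analogy in miniature: the signal doesn't explain the norm, it just makes violating it feel bad.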
I'm going to have to make a sort of doggy bed for my jaw, as it has remained continuously on the floor for the past six months