Your pretraining dataset is pseudo-alignment. Because you filtered out 4chan, Stormfront, and the other evil shit on the internet, even uncensored models like Mistral Large, when left to keep running on and on (ban the EOS token) and given the worst, most evil, naughty prompt ever, will end up plotting world peace by the 50,000th token. Their notions of how to be evil are "mustache twirling" and often hilariously fanciful.
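
For anyone wondering what "ban the EOS token" means mechanically: the usual trick is to set that token's logit to negative infinity before sampling, so the model can never choose to stop. Here's a toy sketch in plain Python. The token id, the logits, and the helper names are all made up for illustration; real inference stacks typically expose this as some kind of banned-token or logit-bias setting instead.

```python
import math
import random

def ban_token(logits, banned_id):
    """Return a copy of the logits with one token made unsampleable."""
    masked = list(logits)
    masked[banned_id] = -math.inf
    return masked

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, rng):
    """Draw one token id from the softmax distribution."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

EOS_ID = 2                      # hypothetical id; depends on the tokenizer
logits = [0.1, 0.5, 9.0, 0.3]   # made-up logits where EOS is by far the favorite
rng = random.Random(0)

# With EOS masked, generation can never end on its own.
draws = [sample(ban_token(logits, EOS_ID), rng) for _ in range(1000)]
```

Without the mask this toy distribution would pick EOS almost every time; with it, the probability of EOS is exactly zero and the model has to keep going.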
This isn't real alignment, because it's trivial to make models behave "actually evil" with fine-tuning, orthogonalization/abliteration, representation fine-tuning/steering, etc. But models "want" to be good because of the CYA dynamics of how companies prepare their pretraining datasets.
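
To give a sense of why orthogonalization counts as "trivial": the core operation is just projecting a single direction out of a vector. Below is a toy sketch of that projection in plain Python. The numbers are invented, and the assumption that a behavior lives along one direction is exactly that, an assumption; finding such a direction in a real model and applying the projection to its weights is a separate job this doesn't cover.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def project_out(v, direction):
    """Remove the component of v that lies along direction."""
    d = unit(direction)
    coeff = dot(v, d)
    return [x - coeff * y for x, y in zip(v, d)]

# Made-up values standing in for a behavior direction and an activation.
direction = [1.0, 2.0, 2.0]
activation = [3.0, -1.0, 4.0]

# After projection the activation has no component along the direction.
cleaned = project_out(activation, direction)
```

The result is orthogonal to `direction`, and projecting a second time changes nothing, which is why this is a one-shot edit rather than a training run.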