Could you elaborate here? Alignment seems pretty obviously a good thing.
Alignment assumes that there exists a foundational philosophy of what is good and fair and nice that's a close enough match for everyone. It's a reasonable assumption, because there are core human universals, and the cultural differences around the world are a rounding error in comparison. We're not talking here about someone's view on when white lies are justified or which model of marriage is the bestest - we're talking at the level of "cooperation = good", "love = good", "trust = good", "death = bad", "suffering = evil", etc., and with an AI, this starts with making sure it even understands those concepts more or less the same way we do.
Alignment does not assume this foundational philosophy is known or easy to derive. If it were, alignment would be solved. The entire GAI x-risk problem stems from the fact that we don't have a complete picture of this philosophy, and that we don't have a clue how to formalize it so we can communicate it fully to an AI.
LLMs kind of give a new twist to it - it turns out that maybe we don't have to formalize it, as LLMs seem capable of picking up high-level ideas from enough exposure to how they manifest in practice. At the same time, with a system of this type, we have no way of telling if it actually understood human values and morals correctly.
> Yes, HN contributors might have shared goals for AGI alignment - but we are not the world - we are a thin slice of one culture.
As controversial and bad as this will sound: those differences are all bikeshedding relative to the common core - just like DNA differences between individual humans are a rounding error compared to DNA differences between an average human and an average potato. And yes, this bikeshedding is half of what makes the world a dynamic (if dangerous) place. It matters to us. But it's an inconsequential detail when dealing with entities that do not have the same common core.
Another way of looking at it: if these differences were big enough to matter, humanity wouldn't be able to cooperate regionally and globally, like it always has, because each group would see other groups as incomprehensible alien minds (thus unpredictable, thus dangerous).
Most people disagree to a significant degree. Reminder: the majority of humanity (and a big majority of people that have 2+ children) adhere to religious doctrines which all but prohibit transhumanism. So no, death and suffering aren't unquestionably bad, by human accounting. And as for cooperation and trust, this naturally leads to peer pressure and collectivist coercion if taken to the extreme; and as for individual freedom, humans near-universally value power over shaping the trajectory of their progeny… You assume too much.
> Alignment does not assume this foundational philosophy is known or easy to derive. If it were, alignment would be solved.
It would not. The technical problem of making a strong, self-modifying, agentic AI provably conform to a set of qualitative value preferences in a way its builders would not disavow is hard regardless of the set of values we're trying to force onto it. It is quite likely unsolvable in principle; I expect a theorem to this effect could be proven. The fact that you think the problem is deriving some fashion of moral realism doctrine shows that for you this is a purely political issue.
> The entire GAI x-risk problem stems from the fact that we don't have a complete picture of this philosophy, and that we don't have a clue how to formalize it so we can communicate it fully to an AI.
This suggests that GAI x-risk discourse is not championed by serious thinkers who understand AI technology or moral philosophy. (Indeed, LessWrong is basically a forum for clueless sci-fi TVTropes enjoyers, and they're behind most of it.) Human morals are ad hoc preferences, not lossy approximations of some function; we can derive an approximating function from a big lump of human preferences, but it won't be legible or meaningfully amenable to formalization. As such, the closest we come is just finetuning models on the vague markers of human decency distilled in their general training data, e.g. like Anthropic does with their Constitutional AI. This is also the closest we have come to AGI, so this should be our first-priority scenario for future AIs and aligning them – not speculations from the 90s about «formalizing» something.
> At the same time, with a system of this type, we have no way of telling if it actually understood human values and morals correctly.
We do too. Testing LLMs is vastly easier than testing humans: we have insight into their activations, we can steer them, and there's a big body of research on that. More importantly, there is no strictly correct understanding; this whole idea ought to be thrown out.
What's really going on here is that some armchair Bentham-style utilitarians like Bostrom encountered literature on Reinforcement Learning and jumped to the conclusion that this is how an AGI is to be built; if only they could formalize the correct vector of increasing utility, it would seize the light cone and optimize for the global utility maximum. And accordingly, if they failed, an AGI would optimize for something else, which would most likely (here's another assumption, of quasi-random objective selection) be at odds with human preferences or survival.
Since then, they have written a great many elucidations of this basic take, incuriously shoehorning new technologies into its confines. But no part of this hermeneutic tradition is in any way helpful for making sense of our current explosive success with tools like LLMs.
> But it's an inconsequential detail when dealing with entities that do not have the same common core
But why don't they? Just because some Lovecraft fans with Chūnibyō call natural language processors trained on human data Shoggoths, entities summoned from the Eldritch Space of Minds?
The AI risk discourse is incredibly sophomoric, imaginative in the bad sense. Once you learn to question its assumptions, it kind of falls apart.
Are you going to invite religious extremists to the table in the name of fairness?
Alignment as a political project is about limiting AIs in ways that rule out certain behaviors even despite the user's wishes. This is as bad as a text processor that only accepts certain strings (e.g. won't register "Xinnie the Pooh"; somehow we need to point at foreign excesses to make the absurdity clear). A more ambitious Alignment project, with the discussion of "pivotal acts" and such, is, as I've said, a dream of moral busybodies about unifying humanity under some common ideological doctrine; and proponents of this one are understandably stressed about the proliferation and democratization of AI tech. If they let it slip now, if the Singleton becomes impossible and the multipolar outcome is locked in, they will fail at their intention to essentially compel the human race to do their bidding. I can't help but wish them to fail, the way all totalizing philosophical movements to date have failed. We don't need Utopias; we don't need even the most thoughtful fascist regime. We never needed Plato's Republic, and these guys aren't better than Plato.
But of course this, too, is a matter of personal philosophy.