Some problems I spot with this approach after skimming it:

(1)

In "The Road Ahead" there is no item about figuring out how to embed a human-written prior about ethics into an AI.

I can vaguely imagine several mechanisms that impact the AI's final form after training: varying the seed of the random number generator, varying some parameters of the noise, hand-designing circuits that easily translate into neural weights, asking GPT4 to write the initial parameters, curating the training data, etc.
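
To make the first and third of these mechanisms concrete, here is a minimal sketch (in PyTorch; the layer sizes and the hand-wired "circuit" are invented purely for illustration, not taken from the proposal):

```python
import torch
import torch.nn as nn

# Mechanism: fix the random seed so the initialization (one input to
# value-formation) is at least reproducible.
torch.manual_seed(42)

model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

# Mechanism: hand-design a tiny "circuit" and write it directly into the
# initial weights. Here the first hidden unit is wired to fire on the sum
# of the first two inputs -- a stand-in for a hand-written prior.
with torch.no_grad():
    first_layer = model[0]
    first_layer.weight.zero_()
    first_layer.bias.zero_()
    first_layer.weight[0, 0] = 1.0
    first_layer.weight[0, 1] = 1.0

# Training then starts from this prior instead of a purely random init;
# whether the prior survives training is exactly the open question.
```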

All the mechanisms I come up with run into the problem that we currently do not understand the inductive biases and the process of value-formation inside neural networks very well. This makes it hard to backchain from a high-level description of the AI we want (the ethical values we figured out by solving moral philosophy) to low-level inputs (the initial weights, the data, etc.) that are then turned into an aligned superintelligence during training.

(2)

I still think it's hard to solve moral philosophy thoroughly enough to arrive at a sufficiently mechanical ethical framework that actually leads to good outcomes, rather than an AI going off and doing valueless things. Quantifying some parts of human morality doesn't scale to a good solution here: a superintelligence will face very weird decisions that break whatever hidden assumptions we made while quantifying those parts of our morality. In essence, Goodhart's law applies, and it suggests we need to robustly figure out where human morality comes from, not just how it operates along measurable axes in the current environment.
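
A toy illustration of that Goodhart failure (the quadratic "true value" and the linear proxy are made up; the only point is that a proxy calibrated on a narrow range of situations breaks under strong optimization):

```python
import numpy as np

# True "value" of an action: good in moderation, bad in the extreme.
def true_value(x):
    return x - 0.1 * x**2

# We "quantify morality" by fitting a linear proxy on the ordinary range
# of actions we actually observed (x in [0, 2]).
xs = np.linspace(0, 2, 50)
coeffs = np.polyfit(xs, true_value(xs), deg=1)
proxy = np.poly1d(coeffs)

# A strong optimizer pushes the action to wherever the proxy is maximal
# within its (much larger) option space.
options = np.linspace(0, 100, 10_000)
best = options[np.argmax(proxy(options))]

print(f"chosen action: {best:.1f}")
print(f"proxy says:    {proxy(best):.1f}")   # looks great
print(f"true value:    {true_value(best):.1f}")  # strongly negative
```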

(3)

I'd expect research on this proposal to maybe make enough progress to create good products that look ethical. But it will still break once the AI becomes smarter and, e.g., reflects on its own values for the first time. If we don't have a complete solution to human values that also covers this new, previously unencountered capability, then the AI won't handle it the way a human would; it will instead rely on inductive biases that were, essentially, still randomly initialized. I'd roughly call the problems here "goal misgeneralization" and, again, a lack of understanding of value-formation. So the counterargument of ethics washing is a valid concern [implicitly assuming there is something else humanity can do instead of working on this proposal].

(4)

Iterative refinement of the "lightweight approximations of complex ethics" fails because we can't refine a deceptive, unaligned AI that protects its values from modification. At that point we can no longer learn fast enough about what is going wrong, and soon after, a superintelligent unaligned AI kills us.

It would be nice if we could build in a prior that prevents deceptive behavior, but that is a very hard problem. Afaict "not wanting to be deceptive" sits very close in capability to "being able to deceive", so unless we widen the gap between those two capability levels, for example by training the AI to be whitebox under our interpretability tools, we just get AIs that become deceptive first and then never learn to avoid deception, because they simply lie about wanting to avoid it.
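
As a sketch of what "training the AI to be whitebox under our interpretability tools" could mean mechanically, here is a toy training step that adds a penalty from a frozen probe on the activations; the probe is a pure placeholder, since no current interpretability tool actually provides a reliable deception signal:

```python
import torch
import torch.nn as nn

# Stand-ins: a small policy model, and a frozen "probe" representing
# whatever signal our interpretability tools can extract from activations.
policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
probe = nn.Linear(32, 1)  # hypothetical deception detector
for p in probe.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
task_loss_fn = nn.CrossEntropyLoss()

def train_step(x, y, transparency_weight=1.0):
    hidden = policy[1](policy[0](x))  # activations the probe inspects
    logits = policy[2](hidden)
    task_loss = task_loss_fn(logits, y)
    # Penalize whatever the probe flags, so "looking deceptive to our tools"
    # is selected against during training, not just checked at evaluation time.
    transparency_loss = torch.sigmoid(probe(hidden)).mean()
    loss = task_loss + transparency_weight * transparency_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# One step on random data, purely to show the shape of the idea.
x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))
train_step(x, y)
```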

(4b)

Also, it sure looks like "expected utility maximization" is the essence of "being good at general problem solving". So in order to refine a non-deceptive AI that's smarter than humanity, we need some access to it and a way to modify it in the direction we want. Gradient-hacking, even without deceptiveness, might be possible (?) and could prevent us from just fine-tuning on some more examples. Evaluating whether the AI is even misaligned might be too costly, and the AI might refuse to help us with that task, so we wouldn't even have data to fine-tune on. Imo something like corrigibility is needed to get the superhuman AI to help us refine the ethical framework that is then used to fine-tune it.
