Fail safe(r) at alignment by channeling reward-hacking into a "spillway"
Motivation
----------
It's plausible that flawed RL processes will select for misaligned AI
motivations.[1] Some misaligned motivations are much more dangerous than
others. So...
