5 min read

Misc Thoughts on Preference Falsification, AI 2027 and Council Spending

I’m not too sure what to write about this week. Instead of trying to write a single long post, I’ll go over a few of the things I’ve read, thought about, etc.

Preference Falsification + Threshold Models = Key Concept

Richard Ngo has a post, roughly on preference falsification cascades generalized. It’s really good. It’s not particularly high value for me, because the model he puts forward is one I already roughly hold, but I think it’s very high value for most people who haven’t internalized this dynamic. I genuinely think preference falsification + threshold effects is kind of a key concept without which it’s very hard to make sense of society. I’m not sure I’d rank it as highly as prices/supply & demand, but it’s somewhat close. Going forward this will be my go-to post when trying to explain the idea to friends.
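For concreteness, here’s a minimal toy sketch of the threshold dynamic (my own illustration in the spirit of Granovetter’s classic example, not anything taken from Ngo’s post; the numbers and society names are made up): two populations with nearly identical private preferences can produce either a full cascade of public dissent or almost none, depending on the distribution of thresholds.

```python
# Toy Granovetter-style threshold model of preference falsification.
# Every agent privately opposes the status quo, but only dissents in
# public once at least `threshold` people are already visibly dissenting.

def run_cascade(thresholds):
    """Iterate best responses until the number of public dissenters stops growing."""
    dissenting = 0
    while True:
        new_dissenting = sum(1 for t in thresholds if t <= dissenting)
        if new_dissenting == dissenting:
            return dissenting
        dissenting = new_dissenting

n = 100
society_a = list(range(n))                  # thresholds 0, 1, 2, ..., 99
society_b = [0, 2] + list(range(2, n))      # almost identical, but no threshold-1 agent

print("Society A: public dissenters =", run_cascade(society_a))  # 100 -> full cascade
print("Society B: public dissenters =", run_cascade(society_b))  # 1   -> cascade stalls
```

The point of the toy: a tiny change in the threshold distribution flips the outcome from universal dissent to near-universal preference falsification, which is why public opinion can look rock solid right up until it suddenly collapses.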

AI Risk

I’ve read a few things here and there on AI risk.

Roko, via Tyler Cowen, argues that many of the initial predictions made by AI safety people are wrong. I’m not too convinced. It’s too easy to cherry-pick bad predictions while ignoring correct ones, and it’s easy to discount how strong a signal it is to make correct predictions about a widely disregarded technology 20 years before it arrives. I also think the whole “aligning AI is possible and not as horrifically hard as we feared” case, even if true, doesn’t actually solve that much. Unaligned AI is only really very dangerous at the point at which it’s much smarter than you are, and if you reach that point (and likely continue beyond it as progress accelerates), even an aligned AI becomes dangerous: at least on the level of nukes, if not greater. But anyway, to zoom in a bit, I think a fair few of the specific claims Roko makes about LW people being wrong are themselves mistaken. E.g.:

Claim: Mindspace is vast, so it’s likely that AIs will be completely alien to us, and therefore dangerous! Truth: Mindspace is vast, but we picked LLMs as the first viable AI paradigm because the abundance of human-generated data made LLMs the easiest choice. LLMs are models of human language, so they are actually not that alien.

I mean, they seem familiar, can understand language and seem to behave as if they have some sort of underlying world model. That doesn’t mean their objective function or way of thinking is anything remotely human. Similar to a parrot.

Claim: AI will be incorrigible, meaning that it will resist creators’ attempts to correct it if something is wrong with the specification. That means if we get anything wrong, the AI will fight us over it! Truth: AIs based on neural nets might in some sense want to resist changes to their minds, but they can’t resist changes to their weights that happen via backpropagation. When AIs misbehave, developers use RLHF and gradient descent to change their minds – literally.

We have examples of alignment faking. As models grow more complex and intelligent, we’ll likely become less and less able to understand their thoughts or catch deception.

Claim: Recursive self-improvement means that a single instance of a threshold-crossing seed AI could reprogram itself and undergo an intelligence explosion in minutes or hours. An AI made overnight in someone’s basement could develop a species-ending superweapon like nanotechnology from first principles and kill us all before we wake up in the morning. Truth: All ML models have strongly diminishing returns to data and compute, typically logarithmic. Today’s rapid AI progress is only possible because the amount of money spent on AI is increasing exponentially. Superintelligence in a basement is information-theoretically impossible – there is no free lunch from recursion, the exponentially large data collection and compute still needs to happen.

True insofar as we won’t get a basement AI going from a program to a literal god turning the solar system into nanobots in a minute. Not really true if your concern is AGI or ASI over a 5–20 year time horizon.

Moving on from Roko, AI 2027 also launched. I personally don’t find it that interesting. The main claims are:

  • Some estimates of compute and graphs of changes in model performance. METR’s time-horizons paper is interesting but not new
  • Race dynamics at both the intra-national and inter-national level will be a thing
  • Automating AI research and development is what causes a medium-speed takeoff

I think the long narrative scenario is interesting, but it’s another entry in the long list of “plausible-sounding doomsday stories about AGI”. It’s a bit more grounded than most, but the specific narrative presented, while fun to read, doesn’t really cause me to update any more than the existing evidence on capability growth trends. I already pretty much assume races will be a thing because we’re in a competitive multi-polar system. As for automating research, I’m not sure how hard it will be to automate various kinds of tasks. Maybe research is a step away from software engineering. Maybe it’s drastically more difficult. Given progress on benchmarks like FrontierMath I suspect it’s not that far, but again, I don’t really update due to AI 2027.

Burning other people’s money and bad equilibria

The Guardian writes about how most British councils spend more money on transportation/taxis for special needs pupils than they do on general road maintenance. Fun follow-on Srdjan fact: British councils spend two-thirds of their budget on social care. There are a few things to take from this and similar stories.

  • Uncapped obligations tend to be bad. British councils often have a statutory duty to do certain things. This is often phrased as “the council must do/provide X” and not as “the council must spend Y% of its funding on X”. These statutes are easy to pass and make people feel good initially, but they basically impose a permanent obligation to accept an infinite tradeoff. Councils cannot easily defund a statutory duty to fund a non-statutory one, even when doing so would be preferred by voters and/or leave them better off. Eventually you end up in an insane situation where a council spends most of its revenue on a relatively low-value activity while important services decay.
  • I’ve been meaning to write a longer article on this, but I feel like there’s a significant reverse-meritocracy in government funding. In schools the lowest sets (worst-performing kids) get the smallest class sizes and the worst pupils get one-to-one support. In healthcare we spend hundreds of thousands per person on the extremely old and chronically ill. In justice we let the worst 0.5% of society eat up most of our police and justice budget rather than imprisoning/executing them. These aren’t necessarily always bad things. Diminishing marginal utility of money is definitely a thing, and you may well get much more utility/$ by giving benefits to an unemployed person than by letting a rich software engineer keep more of their money. But they are sometimes a bad thing, especially when we’re discussing investment rather than redistribution, in domains such as education or providing local services.
  • There’s a serious alignment problem where local/regional governments often don’t benefit from growth. Hence they quite rationally cave in to existing voters and block development where possible. It can be tempting to try to steamroll the problem or moralize, but a better answer is to look at mechanism design and create workable incentive schemes. E.g. if council funding comes almost entirely from central government grants and not from local economic activity (e.g. business rates), we can expect councils to have little incentive to allow new developments or grant licenses. I think the best example of this kind of thinking is the 80k podcast with Sam Bowman.