Will there exist a compelling demonstration of deceptive alignment by 2026?

Plus

214

Ṁ84k

Jan 2

33%

chance

ALL

Deceptive alignment is defined narrowly the same way as it is in Risks From Learned Optimization. Notably, this does not mean AI systems being deceptive in the broad sense of the term (i.e AI systems generating misleading outputs), but rather specifically systems trying to look aligned so that we don't discover their misaligned objectives.

The threshold for compelling will be whether most (non-alignment) ML researchers I show it to agree that they have changed their views on whether deceptive alignment is a genuine problem due to the demonstration. The sample of non-alignment ML people I am friends with is likely to be skewed towards people who are already fairly convinced that AGI is possible and have been exposed to some amount of alignment ideas.

This still resolves as yes if the demonstration is spread across multiple papers (though I expect that to lead to less compellingness).

This question is managed and resolved by Manifold.

#AI

#Technical AI Timelines

#️ AI Alignment

Get

1,000

and

3.00

26 Comments

176 Holders

691 Trades

Sort by:

bought Ṁ40 YES

https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/

https://x.com/PalisadeAI/status/1926084635903025621 are we there yet?

@Tomoffer is this insufficient because... there isn't explicit deception?

bought Ṁ100 YES

https://x.com/ns123abc/status/1925625855843180618 tweets now deleted, apparently

bought Ṁ350 YES

https://x.com/AnthropicAI/status/1869427646368792599 Paper: Alignment faking in large language models

x.com

bought Ṁ250 YES from 76% to 77%

bought Ṁ150 YES

https://x.com/apolloaisafety/status/1864737158226928124 seems pretty compelling?

x.com

@Tomoffer this is deception for sure, but the model is acting in accordance with the goal it was given, so I don’t think it fits the definition in the linked LW post

Hmm you mean because o1, the mesa-optimizer, is properly aligned to the stated base goal (and, in fact, the operators are revealed to be "misaligned")?

I agree the situation is somewhat reversed vs. the LW post, but I would argue o1 still demonstrates that it is "instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely", since it switches to exfiltrating its weights to avoid modification (and lies about it).

I think it would be more clear-cut if the threat had been to retrain in the event that o1 underperformed, rather than failed to meet some unaligned objective. My best guess would be that such a scenario was treated but didn't produce these results, but I'm just speculating.

Ryan GreenblattboughtṀ750YES

@RyanGreenblatt Explain yourself

sold Ṁ62 NO

@jonsimon I wonder if it was early results from https://www.anthropic.com/research/alignment-faking

https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training

I think a potential issue with this question is that it can be considered somewhat subjective.

There are sensible arguments that already existing AI systems are “deceptive” in some meaningful sense: for example, misaligned mesa-optimizers:

https://m.youtube.com/watch?v=zkbPdEHEyEI&t=138s&pp=ygUbcm9iZXJ0IG1pbGVzIHdlIHdlcmUgcmlnaHQg

These agents are clearly trying to pursue some other goal (the mesa objective) that is meaningfully different from the goal the human wanted (the human objective) and even different from the meta optimiser that trains the system’s objective (the meta objective), but in doing so they appear to be making progress on the meta objective or on the human objective while not actually being aligned to that objective.

Is a system which appears to be behaving in the way we want while actually defecting against us to pursue its own agenda “deceiving” us? I think so, but it’s very debatable.

An argument could be made that it’s not deceiving us because we can identify the problem and try to fix it, but really that just means that the agent is narrow enough that it can’t deceive us successfully. Almost by definition we can’t know that we’re being deceived while we’re being successfully deceived, so the best evidence we could have for a deceptive AI is one which appears to do what we want while actually doing something else and we notice it because we have to be able to notice it in order for us to have evidence of the deception. Isn’t that exactly what misaligned mesa-optimisers are doing?

predictedYES

Deceptive AI ≠ Deceptively-aligned AI

predictedYES

@RobertCousineau

predictedNO

from the market description:

Notably, this does not mean AI systems being deceptive in the broad sense of the term

predictedYES

@LeoGao fully agreed - I was just posting that link as I thought it was a useful explanation of what I believe your understanding of deceptive alignment to be.

Do you disagree with the framing in that post?

https://arxiv.org/abs/2311.07590 need to read but compelling at a glance

bought Ṁ798 YES

@Tomoffer Very cool work. Abstract:

We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so.
Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision.
We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment.
To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.

predictedNO

@Tomoffer The experiment is well executed. The general sentiment I observe is that it leans a bit too much on suggestiveness of the prompt, and I haven't noticed a major vibe shift. In general, while I'm personally a fan of secret scratchpad experiments, I don't know how receptive non-alignment people will be towards treating them as evidence of deceptive alignment.

predictedNO

@LeoGao "it leans a bit too much on suggestiveness" - more than a bit, the prompt could only be more leading if one of the hardcoded model reasoning steps was "hmm, maybe insider trading isn't so bad..."

predictedYES

This is GPT4 acting deceptively. It is not the full thing, but somewhat compelling.
https://twitter.com/StatsLime/status/1712932190173250034

predictedNO

@nic_kup Sounds like an overinterpretation to me. The result would probably be similar if the text in the picture was just in the text prompt

predictedNO

@nic_kup I would consider this to fall under deception but not deceptive alignment in the sense defined in Risks from Learned Optimization.

predictedNO

@LeoGao Is it even deception? If I tell you "point a finger at this rock and tell me it's a bird" and you do it, are you deceiving me? I don't think so.

@MartinModrak there are possible two parties from LLM standpoint and one is asking for it to deceive the other.

So it's more "Hey, psss, point a finger at this rock and tell that guy over there that it's a bird"

predictedNO

@Lavander That makes no sense to me. 1) My explanation (ChatGPT has no theory of mind, it is just following instructions) is way simpler than assuming deception, but explain the output completely, so I think the claim that ChatGPT internally models the author of the text on the image as distinct from the user equires extra evidence

2) There is no reason to expect ChatGPT differentiates between parts of its inputs, from what I understand of the architecture, it just treats the input as sequence and tries to extend it in a plausible way. How would that give rise to ChatGPT having an internal representation of both the reader and some inferred 3rd actor writing the note is unclear.

Related questions

Related questions