Will OpenAI release weights to a model designed to be easily interpretable (2024)?
Jan 1 · 10% chance

This market predicts whether OpenAI will provide the weights of a language model specifically designed for easy interpretability (an "Interpretability Model") to any Alignment-focused Organization by December 31, 2024:

Resolves YES if:

  • OpenAI officially announces or confirms that they have shared the weights of an interpretability model with any alignment-focused organization on or before December 31, 2024.

Resolves 50% if:

  • There is credible evidence that OpenAI has shared the weights of an interpretability model with any alignment-focused organization. The market will then be held open for an additional 6 months (until June 1, 2025) and resolved 50%.

Resolves NO if:

  • No alignment-focused organization receives the weights of an interpretability model from OpenAI by December 31, 2024.

Resolves as NA if:

  • OpenAI ceases to exist, or the listed organizations merge, dissolve, or undergo significant restructuring, rendering the original intent of the market unclear or irrelevant.

Definitions:

  • A language model is an algorithm that processes and generates human language by assigning probabilities to sequences of tokens (words, characters, or subword units) based on patterns learned from training data. Such models can then be used for various natural language processing tasks, such as text prediction, text generation, machine translation, sentiment analysis, and more. Language models use statistical or machine learning methods, including deep learning techniques like recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and transformer architectures, to capture the complex relationships between words and phrases in a language. (A minimal illustrative sketch of probability assignment follows these definitions.)

  • "Interpretability model" refers to a language model created by OpenAI with the primary goal of facilitating better understanding of the inner workings, decision-making processes, and learning mechanisms of the model, as opposed to models optimized solely for performance or prediction accuracy. An accompanying article should emphasize interpretability, and an accompanying technical report should report interpretability benchmarks that can judge other language models. Model must not have been released before this market's creation. While performance is not the primary goal, it must be competitive on benchmarks with language models at most 2 years behind it (this excludes the possibility of e.g. a Markov Chain being presented).

  • "OpenAI" mainly means OpenAI, but additionally any executive or heavy investor is allowed to make the announcement. So if Microsoft makes the announcement, it will count. If OpenAI is acquired, the name refers to the acquirer. If OpenAI dissolves, this market resolves NA.

  • "Credible evidence" means journalistic reporting alluding to credible sources, statements by employees or former employees, or similar.

  • "Alignment-focused Organization" refers to Anthropic, Redwood Research, Alignment Research Center, Center for Human Compatible AI, Machine Intelligence Research Institute, or Conjecture. Additional examples may be added - if there is a dispute, a poll may be taken. A public release of weights counts as releasing to these organizations. A leaked release also counts, as long as the model is confirmed to have been developed with the purpose fo being interpretable. "Red-teaming" of capabilities does not count.

  • The market description can be freely adjusted within one week of market creation. After that, I will only refine it toward narrower meanings or add further examples.
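
As a purely illustrative aid to the "language model" definition above, the sketch below shows a minimal bigram model that assigns probabilities to token sequences. The toy corpus, the `sequence_probability` function, and the add-alpha smoothing are assumptions made for illustration; they do not describe any OpenAI model or interpretability benchmark.

```python
# Minimal sketch of a bigram language model: it assigns a probability to a
# token sequence from counts learned on a tiny corpus. Purely illustrative.
from collections import Counter, defaultdict

corpus = [
    "the model assigns probabilities to tokens".split(),
    "the model generates tokens".split(),
]

unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        unigram_counts[prev] += 1
        bigram_counts[prev][curr] += 1

def sequence_probability(tokens, alpha=1.0):
    """P(tokens) under the bigram model, with add-alpha smoothing for unseen pairs."""
    vocab_size = len({t for s in corpus for t in s})
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        count = bigram_counts[prev][curr]
        total = unigram_counts[prev]
        prob *= (count + alpha) / (total + alpha * vocab_size)
    return prob

# A sequence seen in training gets a higher probability than a shuffled one.
print(sequence_probability("the model generates tokens".split()))
print(sequence_probability("tokens generates the model".split()))
```

Modern language models replace these counts with learned neural-network parameters (RNNs, LSTMs, transformers), but the basic interface of assigning a probability to a token sequence is the same.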


https://github.com/openai/transformer-debugger
This doesn't appear to be a model, but rather a tool for debugging models.

Just to confirm, if it happens in 2023, it would count, yes?

[looking for feedback]

For this to resolve YES:
1. There is some way to create a model optimized for interpretability
1.1. That is not low-hanging fruit (since otherwise someone working on interpretability would have found it already)
1.2. OpenAI would have to invest a lot to find this model
2. OpenAI is comfortable releasing more models
2.1. They don't think this would advance others' capabilities in a way that would risk OpenAI's business or AI safety
3. OpenAI keeps focusing on language models

This seems unlikely (but I'm new here, happy for feedback)
