In early 2028, will an AI be able to generate a full high-quality movie to a prompt?
Premium · 3.4k traders · Ṁ7.6m volume · 2028 · 43% chance

EG "make me a 120 minute Star Trek / Star Wars crossover". It should be more or less comparable to a big-budget studio film, although it doesn't have to pass a full Turing Test as long as it's pretty good. The AI doesn't have to be available to the public, as long as it's confirmed to exist.


The first version of Midjourney (which was really bad) launched less than 3 years ago. Now people make good-quality 5–10 minute films/animations in a week (obviously still with a lot of human effort). It feels like we are more than 50% of the way there.

bought Ṁ250 NO at 42%

There will surely be a lot of progress, but last mile problems can be pretty significant

@JimHays I also don't expect anyone to put in the level of resources this would require without having a human at least do some review and editing.

bought Ṁ50 NO

@qumeric that feels like 20% of the way there

They haven’t even put real dialogue in the movies, no?

@AlexanderLeCampbell All the components are already there or almost there. Generate a scenario, split it into scenes, generate a detailed sequence for each scene, split that into prompts, and generate the video. Object permanence will be an issue, but I think we are getting there. Then voice it with ElevenLabs, do lip sync where needed (good lip sync exists), and generate sound effects.

Now we just need scaffolding that will do everything for us, most importantly watching the generated scenes and judging them. It can be done.

I think the main issue (apart from object permanence) is cost. It would be pretty slow and expensive just to generate 2 hours of high-quality video, but you would probably need hundreds of hours of attempts.
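The pipeline described above (script → scenes → clips → judge → voice) can be sketched as a scaffolding loop. This is a toy sketch: every stage is a hypothetical injected callable standing in for a real model (an LLM scriptwriter, a video generator, a judge model, a TTS service), not an actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Scene:
    description: str
    clips: List[str] = field(default_factory=list)
    audio: str = ""

def make_movie(prompt: str,
               write_script: Callable[[str], List[str]],
               render_clip: Callable[[str], str],
               judge_clip: Callable[[str], float],
               voice: Callable[[str], str],
               max_retries: int = 3,
               threshold: float = 0.5) -> List[Scene]:
    """Split a prompt into scenes, render each one, and regenerate clips the judge rejects."""
    scenes = []
    for desc in write_script(prompt):
        scene = Scene(description=desc)
        clip = render_clip(desc)
        for _ in range(max_retries):
            if judge_clip(clip) >= threshold:
                break
            clip = render_clip(desc)  # retry until the judge approves (or we give up)
        scene.clips.append(clip)
        scene.audio = voice(desc)
        scenes.append(scene)
    return scenes

# Toy stand-ins so the sketch runs end to end.
movie = make_movie(
    "Star Trek / Star Wars crossover",
    write_script=lambda p: [f"{p}: scene {i}" for i in range(3)],
    render_clip=lambda d: f"clip({d})",
    judge_clip=lambda c: 1.0,
    voice=lambda d: f"audio({d})",
)
```

The retry-until-approved loop is where the "hundreds of hours of attempts" cost concern bites: each rejected clip is a full (slow, expensive) video generation.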

Can be arbitraged with this market.

This has probably been answered somewhere in the comments, but I can't find it: when the market title says "able", does that mean "able ever" (it has done it at least once, even if the success rate of the resulting movie actually being high quality is <1%), "able normally" (the success rate is >50%), or "able almost always" (the success rate is >99%)?

If the AI requires you to give it access to a bank account containing millions of dollars, and then uses that money to contract Hollywood professionals to make a film for you based on that prompt - would that count?

@johnwhiles Was about to comment something similar. Imagine a world where the AI asks you for a lot of money, and then directs a live-action film.

I'd imagine this is unlikely, but a fun idea nonetheless.

@TiagoChamba There is a movie where an AI directs live actors on a movie set.
The Second Act, by Quentin Dupieux (so, expect something weird)

https://www.rottentomatoes.com/m/the_second_act/reviews

Just to be sure: The AI could, for example, use other tools to make the movie, right? Like using Adobe products

@TiagoChamba it can do whatever it wants agentically as long as it's from a single prompt as in the description

Newbie question:

Is there something akin to Test Time Compute / CoT for video where the model can continually refine each frame to ensure temporal & aesthetic coherence across longer durations?

And if so, would that basically be the final missing piece that would get this to resolve as Yes?

My understanding is that currently we’re pretty close to photorealistic for still images but lack the aforementioned coherence when it comes to video gen

@elf I don't think this is the last missing piece, but I do think the AIs will generate images and then refine them over chains of thought.

@Bayesian what else do you think is lacking?

opened a Ṁ5,000 YES at 39% order

@elf Longer context windows, better tokenized representation of images so that they don't take up so much context, video input, more intelligence and better creative writing, better tool use, much better long-term coherence

@elf also my guess is that doing the chain of thought / reasoning itself in the image/video domain, or in some surrogate latent domain (instead of just in the text domain) is going to be a big deal that significantly accelerates capabilities towards this goal.

Will be expensive to run though

@elf also there's still a limitation right now where, for example, a diffusion model can generate a video but cannot really "see" that something in what it produced doesn't make sense. There are ways to explore multiple denoising paths, but these still bump into the limitations of the encoder branch. There may also be a missing piece where inference-time scaling is applied in some manner to the encoder, similar to how a human can stare at an image for a few seconds before realising what's displayed. Currently, inference-time scaling methods mostly focus on boosting performance in the decoding task.

@elf Diffusion is already an iterative progressive-refinement technique. The secret sauce of CoT is that it extends reinforcement learning to reasoning tasks, allowing training data to be of the form "this is a long train of thought, already natural to the model's mode of thinking, which gives a correct answer" rather than just "this is the correct answer". Diffusion has some natural inference-time scalings (i.e. more denoising steps, or generating several images/videos and choosing the best one; more are described in the comments of the market @MalachiteEagle linked above, but those are the two I already knew). That being said, subjectively, CoT creates a qualitative difference in text-model outputs, and none of these scaling methods create qualitative differences in diffusion-model outputs. I can't think of any plausible generalizations of existing CoT methods to diffusion models, which doesn't mean they don't exist.
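The "generate several and choose the best one" scaling mentioned above can be sketched as a best-of-N loop. The generator and scorer here are toy stand-ins (a random draw and a hand-written score), not a real diffusion sampler or a learned verifier:

```python
import random

def best_of_n(generate, score, n=8, seed=0):
    """Sample n candidates from `generate` and keep the one `score` likes most."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

sample = best_of_n(
    generate=lambda rng: rng.gauss(0.0, 1.0),  # stand-in for a sampled image/video
    score=lambda x: -abs(x),                   # stand-in verifier: prefers values near 0
)
```

This is the weakest form of inference-time scaling: quality improves only as fast as the verifier can distinguish candidates, which connects to the later point about unreliable verifiers for aesthetic quality.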

I also don't think that aesthetic consistency will fix the fact that AI writing is terribly, horribly bad. It has plateaued at the same place that untalented but hardworking humans naturally plateau without intense guidance, and unless we figure out how to give LLMs the equivalent of a screenwriting degree, I'm skeptical that we'll ever get past that point. Aesthetic cohesion comes second to aesthetic expertise, that is, taste, and I don't think there's been any progress on giving LLMs better taste.

@speck I think what makes CoT work well in NLP models is that they can effectively leverage in-context learning using the Transformer architecture. As of early 2025 there is limited evidence for diffusion model variants that can do something similar when trained to generate image data. The denoising step itself is designed such that it only takes as input one noisy image sample (the noise of which can be deterministically generated from the timestep), and not a sequence of denoised images. It's possible that there will be new extensions to this method which make it much more amenable to inference-time scaling approaches in the same vein as o1/o3. But for now, it appears that autoregressive models + VAEs trained on image tokens are a much more credible candidate for tackling this challenge in the short/medium term. Diffusion models may go the same way as GANs in the near future.
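The point that "the noise can be deterministically generated from the timestep" refers to the standard DDPM forward process, where the noise level at step t is fixed by the schedule, so a noisy sample can be drawn directly from the clean one in closed form: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε. A minimal sketch with a linear β schedule (an illustration, not anyone's production code):

```python
import math
import random

def linear_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Return the cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    betas = [beta_min + (beta_max - beta_min) * t / (T - 1) for t in range(T)]
    abar, abars = 1.0, []
    for b in betas:
        abar *= (1.0 - b)
        abars.append(abar)
    return abars

def noise_at(x0, t, abars, rng):
    """Closed-form forward sample: the noise level depends only on t."""
    abar = abars[t]
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(abar) * x0 + math.sqrt(1.0 - abar) * eps

abars = linear_schedule()
rng = random.Random(0)
x_late = noise_at(1.0, 999, abars, rng)   # nearly pure noise
x_early = noise_at(1.0, 0, abars, rng)    # nearly the clean sample
```

Because ᾱ_t is a deterministic function of t, the denoiser only ever sees one noisy sample plus a timestep; it never conditions on a sequence of earlier denoised states, which is the contrast with in-context reasoning drawn above.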

The main bottleneck for creative writing in the new inference-time scaling paradigm appears to be the lack of reliable verifiers for aesthetic quality. It's not possible currently to replicate the same methodology used to boost coding and math skills in LLMs where hard verification is relatively straightforward to implement. I believe, however, that this is not a binary problem. There are ways to bootstrap creative writing quality with self-play, even if the assessment of aesthetic quality doesn't perfectly align with human preferences.

@elf A subset of this problem would be writing a full high-quality book from a prompt, which we are still far from achieving. There are a few unanswered questions regarding long-form generation. The existing data might be insufficient if there isn't enough transfer from shorter-form text (i.e. paragraphs, of which we have hundreds of billions of meaningfully different ones to train on) to longer-form text (i.e. full books, of which we have maybe a dozen million). If the transfer from short text to much longer text isn't seamless, we end up with a much more complicated problem and a lot less data to work with. Context windows have been getting much better for understanding, but I've not seen much to imply that long coherent generation is anywhere close.

@MalachiteEagle Partial agree on future directions for image and especially video generation - in-context learning is part of the puzzle, but training on reasoning tokens seems legitimately incredibly important from my perspective, so I don't think that in-context reasoning is the whole point. (Part of this could be bias: in my field, problem-solving amounts to thinking for a very long time and trying many different approaches, but expertise comes from refining instinctive train of thought by training on past problem-solving attempts, so I'm primed to interpret CoT in that framework). I'm not very knowledgeable on alternate approaches to image/video generation so will take it on faith that alternate methods make CoT-flavored approaches more feasible.

Agree that verification of aesthetic quality is the hard part of developing it, but I'm much less skeptical that this can be fixed. The experience of human creatives is that new aesthetic ideals often take even longer to evaluate than they do to develop, and that developing taste as a creator without occasionally trying to do something new (and therefore hard-to-evaluate) is basically impossible. Evaluation being harder than generation seems to make CoT-flavored approaches impossible, at least to me. It's possible that humans are actually bad at creativity and this process can be fixed, but I'm not very confident.

Would be interested what you mean re: "bootstrapping creative writing quality". I'm not aware of effective approaches to this, but it seems likely that (a) you have more knowledge than I do on this topic and so will be aware of more things, (b) we will disagree about whether particular improvements actually represent meaningful progress on the qualitative front, and (c) you are probably more optimistic than I am that there's still low-to-medium hanging fruit here.

The experience of human creatives is that new aesthetic ideals often take even longer to evaluate than they do to develop

Could you clarify what you mean by this @speck ? I take it to mean that truly original/radical work usually isn't appreciated immediately (e.g. many films like Kubrick's 2001 weren't beloved until years/decades after release)

Doesn't RLHF demonstrate we can encode human preferences @MalachiteEagle ? It's certainly fuzzier than a Math or Code verifier but I imagine it'd still be beneficial even if it's incremental improvement

I, perhaps naively, assumed that LLMs would solve the taste issue through scale.

If they're compressing almost all of human knowledge, it's difficult to have much space left over to encode other things, but with enough weights a model could eventually encode the throughline of enough 'great' works from many disparate artistic fields, which would effectively amount to 'taste'. But I guess you'd run out of high-quality data before then?

Have you played around with deep research @LuEE ?
It's not book-length yet, but I would have assumed that it demonstrates future o-series models will produce progressively lengthier output while remaining high quality until we hit book-length.

Granted, this is just me naively extrapolating, but would be interested in knowing your reasoning if you disagree

@elf That's a big part of it, yeah. I think the crucial thing is that this is also true on smaller scales - sometimes a radically genre-shifting movie takes 5 years to make but isn't understood in artistic context for 20 years, but sometimes you draw something in 5 minutes and it takes you 20 minutes to decide whether you like it. Both are major problems for this approach as far as I can tell.

(This part wasn't addressed at me, but my understanding is that RLHF kills creativity in some measurable ways and seems to also do it in some non-measurable ways. My guess is that inoffensiveness projects to one dimension much more easily than artistic quality and that low/mid quality human feedback fails to distinguish between the two, so RLHF tends towards an equilibrium of not being hated by anybody rather than being actively liked by somebody.)
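The RLHF preference signal being discussed is typically trained with the Bradley–Terry pairwise objective: given scalar rewards for the preferred and rejected response, minimize −log σ(r_chosen − r_rejected). A minimal sketch with toy reward numbers (not outputs of any trained model):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the margin between chosen and rejected grows.
assert preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0)
```

Note the objective only rewards the *margin* between two responses, which is consistent with the worry above: a reward model can converge on "not hated by anybody" whenever raters reliably prefer inoffensive outputs, without ever encoding what makes work actively good.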

© Manifold Markets, Inc. · Terms + Mana-only Terms · Privacy · Rules