Editor's Note: In the months since we published this article on Nov. 5, 2022, AI technology has advanced dramatically (as we predicted it would in the story), particularly via mind-boggling image generation. So rapid has been that advancement that on Jan. 13, 2023, the New York Times published a story with the same title, "This Film Does Not Exist," which features the eye-popping imagery of a fictional mashup of the movie "Tron" and the visual style of Chilean director Alejandro Jodorowsky, as generated by the engine Midjourney from a prompt by artist Johnny Darrell. We're linking to the Times article so readers can compare Loka's admittedly rudimentary AI experiments with Darrell's beautifully rendered images.
From spam filtering to global navigation to bank loan default prediction, artificial intelligence has invisibly influenced our daily lives for years. In fact, AI’s out-of-sight, out-of-mind status might be the reason This Person Does Not Exist recently set the tech world buzzing. Using an AI-driven, style-based generative adversarial network, or GAN, this public-facing experiment built on Nvidia's StyleGAN research generated photorealistic images of fictional humans. Encountering the “face” of a neural network was startling, even unnerving, for many of us. Since then, GANs have created realistic-looking images of pretty much everything, and the uncanny valley has only widened. (Or narrowed?)
At the same time, AI-enabled natural-language processing, or NLP, generates the predictive text that pops up as we type our email messages and allows Alexa to reply to spoken queries with coherent spoken answers. Chatbots have become so adept at carrying on sophisticated conversations–complete with humor, irony and original insight–that in July a former Google engineer was convinced that the AI he was messaging with had gained sentience.
The AI team at Loka was curious about this emerging tech, and as obsessive cinephiles, we were inspired to use it to make movies. Not full-length films–not yet!–but the seeds of them: plot outlines and movie posters, plus an associated director, all automatically generated via a few user-supplied keywords. Our method pairs a natural-language model with a text-to-image generator, both available as open source. After many months of iterating, we’ve arrived at a level of oddity/quality that we’re compelled to share. Spoiler alert: It gets weird.
In the months since our Hollywoodbot experiment began, the technology behind open-source, AI-generated imagery and narrative text has advanced dramatically, as has the human artistry that drives it. Some might say our humble Hollywoodbot already feels dated; we like to think of it as “first gen.” The capabilities of AI are going supernova at this very moment. As you’ll see here, the results are unpredictable–and potentially profound. Starting with the latest installment of Iron Man…
Step 1. Gathering Data
The typical movie-making process incorporates dozens, maybe hundreds, of component pieces and systems. But at its core, the whole thing starts with an idea, a concept, a plot. For that we turned to GPT.
GPT, short for Generative Pre-trained Transformer, is a natural-language model developed by OpenAI. It uses an attention mechanism that focuses on the previous words most relevant to the context of a written prompt and learns to predict the next ones. Even a small amount of input text is enough for GPT to create articles, poetry, short stories, news reports and dialogue. The model was trained on a variety of data derived from Common Crawl, WebText, books and Wikipedia. It was up to us to tailor it to come up with a movie plot.
Searching for data was not as easy as we thought it would be. Believe it or not, there are more resources for movie reviews than there are for movie descriptions and plot summaries. Thanks to D. Bamman, B. O’Connor and N. Smith, as well as Kaggle user Samruddhi Mhatre, we managed to obtain data for thousands of movies, including title, release year, rating, runtime, genre, plot, character information and other attributes. This data fed every stage of the project: summary statistics, exploratory data analysis (EDA) and the NLP modeling itself.
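For readers who want to poke at the data themselves, here's a rough sketch of what loading the CMU Movie Summary Corpus might look like in pandas. The file names and column positions follow the corpus's public release; they aren't necessarily how our own pipeline is wired.

```python
import csv
import pandas as pd

# Plot summaries: tab-separated pairs of (Wikipedia movie ID, plot text)
plots = pd.read_csv(
    "plot_summaries.txt",
    sep="\t",
    header=None,
    names=["wiki_id", "plot"],
    quoting=csv.QUOTE_NONE,
)

# Movie metadata: ID, title, release date, runtime, genres and more
meta = pd.read_csv("movie.metadata.tsv", sep="\t", header=None)
meta = meta[[0, 2, 3, 8]]
meta.columns = ["wiki_id", "title", "release_date", "genres"]

# Join the two on the Wikipedia ID to get one row per movie
movies = meta.merge(plots, on="wiki_id")
print(movies.shape)             # tens of thousands of movies with both metadata and a plot
print(movies["genres"].head())  # raw genre labels, ready for EDA
```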
Step 2. Cleaning Up and Tokenization
With such a vast dataset, we had to take several steps before tackling the predictive model. Alfred Hitchcock said, “Drama is life with the dull bits cut out,” which is maybe why most of the movies in this dataset belong to the drama genre. Comedy is second, followed by thriller and crime. Biography, documentary, adventure, horror and animation also register, with another 10 or so genres barely showing up. Maybe we could expect some Guy Ritchie storylines? 🤔
Our goal was to generate a movie title and plot by using a simple theme of a few words–some of them wildly ambiguous–such as "friends robbing a bank." But a model can’t work with raw text; we first have to transform it into a machine-legible form, which we do through tokenization. Tokenization separates a piece of text into smaller units, or tokens, which are the building blocks of natural language. For our purposes they can be either words, characters or subwords.
GPT uses byte-level Byte Pair Encoding (BPE) tokenization. This means that the "words" in its vocabulary are not necessarily full words but frequently occurring groups of characters (or, since the encoding works at the byte level, groups of bytes).
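Here's a quick illustration of what that looks like in practice, using the publicly available GPT-2 tokenizer from Hugging Face's transformers library as a stand-in:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

theme = "friends robbing a bank"
tokens = tokenizer.tokenize(theme)   # subword pieces; "Ġ" marks a token that begins with a space
ids = tokenizer.encode(theme)        # the integer IDs the model actually consumes

print(tokens)
print(ids)
```

Each ID points into a fixed vocabulary of roughly 50,000 byte-level BPE entries, which is why arbitrary text can always be encoded, even if the model has never seen a particular word before.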
Step 3. Fine-tuning
The text sequences we used to fine-tune GPT were constructed using the synopsis, title and full plot for every movie in the dataset. Like many generative transformer models, GPT is pretrained on the task of predicting the most probable next token in the sequence. So if we input text such as “Mike was cooking ___,” GPT would try to predict the most probable word to fill the blank from a huge vocabulary of English-language words. In this case, “dinner” would be a much more probable option for GPT than, say, “chair.”
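Here's a tiny sketch of that objective in code, using the public GPT-2 checkpoint as a stand-in for the model we fine-tuned:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Mike was cooking"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Probability distribution over the vocabulary for the very next token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

for word in [" dinner", " chair"]:
    token_id = tokenizer.encode(word)[0]
    print(word, float(next_token_probs[token_id]))
# "dinner" should come out far more probable than "chair"
```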
In our case, we fed GPT the synopses and hoped that it would generate titles and plots. GPT tries to generate the next token in the sequence until it reaches its defined limit of maximum length or its special STOP token.
We ran the samples from our dataset a couple of times through GPT, expecting that it would recognize the patterns hidden inside movie plots and discover the knowledge it needs for generating its own plots. Aaaand… it worked! Sort of.
The first fine-tuned models generated a lot of nonsense. Turns out coming up with a coherent plot is hard! Let alone writing the screenplay for a feature film. But the stuff sounded pretty funny, so we were encouraged to continue. After a few sleepless nights, a lot of coffee, and $$$ spent on hosted GPU-equipped servers, we finally found the super-secret combination of hyperparameters that allowed us to fine-tune a higher-quality model. (Without getting overly technical, it has to do with the very specific way we worded the input prompt that trains the model.)
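To give a flavor of the setup (with placeholder data, an assumed prompt template and deliberately generic hyperparameters; the super-secret combination stays secret), here's roughly what the fine-tuning loop looks like with Hugging Face's Trainer:

```python
from datasets import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

# In practice, one record per movie from the dataset assembled in Step 1
movie_records = [
    {"synopsis": "Friends plan a bank robbery.", "title": "One Last Job",
     "plot": "Four old friends in over their heads decide to rob a bank..."},
    # ...thousands more
]

# One training sequence per movie: synopsis, then title, then plot,
# ending with the model's STOP (end-of-text) token
def build_sequence(row):
    return (
        f"Synopsis: {row['synopsis']}\n"
        f"Title: {row['title']}\n"
        f"Plot: {row['plot']}{tokenizer.eos_token}"
    )

texts = [build_sequence(row) for row in movie_records]

dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="hollywoodbot-gpt2",
        num_train_epochs=3,              # illustrative values, not the "secret" ones
        per_device_train_batch_size=4,
        learning_rate=5e-5,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Once trained, generation is just a matter of handing the model a synopsis-style prompt and letting it continue, sampling token by token, until it emits the STOP token or hits the maximum length.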
Step 4. Posterize
To make our Hollywoodbot experience even more robust, we decided to complement our movie plot with an official poster. Unlike This Person Does Not Exist, which relies on a GAN, we used Stable Diffusion, an open-source, text-to-image latent diffusion model created by researchers and engineers from CompVis, Stability AI and LAION.
This model is trained on 512 x 512 pixel images from a subset of the LAION-5B database, which itself contains 5.85 billion image-text pairs. It uses a text encoder to condition the model on text prompts, and because it's relatively lightweight, it can create quality images quickly. Additionally, specific magic words such as "highly detailed," "surrealism" or "movie" direct the AI to produce better and more relevant images.
We input around 50 image tags for Stable Diffusion to choose from, such as “sharp focus” to make the images clearer, “vibrant colors,” “fantasy,” etc., and ten are randomly selected for each prompt. We also added “movie,” “film” and “movie poster” to refine it further. Usually a single prompt yields a decent result, the kind you see here. These images weren’t amended or altered in any way from the original Stable Diffusion creations. We’re using the model as-is, without significant retraining, because it’s pretty new and produces excellent–or at least interesting–results. Apparently Stable Diffusion isn’t so great at generating legible or sensible text within an image, but it does seem to understand the general design principles that most movie posters adhere to.
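Here's approximately what that prompt-building and generation step looks like with the diffusers library; the tag pool below is a small illustrative subset of the roughly 50 we feed in, and the title is pulled from the plot step:

```python
import random
import torch
from diffusers import StableDiffusionPipeline

# A handful of the style tags; the real pool is much larger
style_tags = [
    "sharp focus", "vibrant colors", "fantasy", "highly detailed",
    "surrealism", "dramatic lighting", "cinematic composition",
    "concept art", "oil painting", "golden hour", "wide angle", "epic scale",
]

title = "I Saw Him Coming"  # generated by the fine-tuned GPT model in Step 3
prompt = ", ".join(
    [title] + random.sample(style_tags, 10) + ["movie", "film", "movie poster"]
)

# Load the public Stable Diffusion weights (a GPU is strongly recommended)
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(prompt, height=512, width=512).images[0]
image.save("poster.png")
```

Because the tags are sampled at random, rerunning the same title produces a different poster every time, which is half the fun.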
Criminal underworld, guns, Miami… I Saw Him Coming sounds like the next feature from A24! Nice work, Hollywoodbot!
Coming Attractions: Life Imitates Art
Our experiment in AI microcinema is a fun playground with serious implications. Much of the technology we used powers other models that are currently revolutionizing real-world fields, particularly healthtech. AlphaFold, for instance, uses attention mechanisms to predict a protein’s 3D structure from its amino acid sequence, a process that helps researchers better understand the biological function of the protein and thereby harness or hinder its effects. And Nvidia, whose StyleGAN research powers This Person Does Not Exist, recently launched BioNeMo, a framework designed to simplify the process of training massive neural networks on biomolecular data. The goal is to make it easier for researchers to discover new patterns in biological sequences, eventually leading to new medications and therapies to improve human health.
The more Loka’s AI engineers experiment in this field, the more we master the technology in all its forms and applications. And in the meantime, we’re all looking forward to watching Denis Villeneuve's upcoming blockbuster, the aptly titled Robot!