Analysis
Is OpenAI's text-to-video AI model Sora a blessing or a curse?
The AI pioneer has unveiled Sora, a text-to-video model that enables the creation of spectacular videos using a simple text command; The fear: the new technology will be used to create fake videos featuring real people
The images on the screen are stunning. In one video, a woman strolls through the rain-soaked streets of Tokyo, colorful neon signs glowing in the background. In another, a herd of woolly mammoths charges toward the camera, the snow kicked up by their footsteps rising into the sky like a thick white cloud. There’s a trailer for a sci-fi movie in which a handsome but mysterious man walks toward a spaceship. There is a battle of pirate ships in a cup of coffee, a young man relaxing on a cloud while reading a book, a historical photo of a town during the California gold rush, a cartoon kangaroo disco dancing in what looks like a scene from a Pixar movie, and much more.
What do all these videos have in common? First, none of them are real, even those that include human characters that look very real. But this is nothing new; we already know the capabilities of Hollywood's special-effects artists. The second common element is far more significant: they were all created using a simple text command, sometimes a short, concise sentence ("Trailer for a movie starring a 30-year-old space adventurer wearing a motorcycle helmet covered in red mesh, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"), and sometimes just a few words ("a cartoon kangaroo dancing disco"). And they all foreshadow the next stage in the generative artificial intelligence (GenAI) revolution, the disruption it may bring to various industries, and the dangers it may entail.
OpenAI, the (unofficial) mother of the GenAI revolution, is responsible for one of the most significant changes in the world of technology in recent years, thanks to two pioneering models it launched - the text-to-image model Dall-E and the large language model (LLM) that many meet in the form of the chatbot ChatGPT. But the company has no intention of stopping, and on Thursday it unveiled the next step in its revolution: Sora (Japanese for sky), a text-to-video model that will allow users to create exciting and captivating videos with a simple text command.
The company is not the first to reveal a text-to-video model. A New York startup called Runway AI demonstrated such a model last April, showing how text commands could produce videos such as a dog talking on a smartphone or a cow celebrating a birthday. Those videos were short, only about four seconds long, blurry and distorted, but they demonstrated the capabilities of the technology. Meta unveiled its own model in September, and Google in April. These, too, delivered results of limited quality: videos lasting a few seconds, with jumpy frames and distorted, unconvincing characters. They were a nice proof of concept, but no one would mistake them for the real thing.
Sora's results, revealed by OpenAI last week, are a significant leap forward. Some of them look like they were taken out of a big-budget Hollywood movie; others, as if they were created by a top-level animation studio. Only a real expert will be able to recognize that a video was created entirely by a machine from a short text command, and even that might not hold for long. It is safe to assume the competitors are not too far behind, either.
The immediate concern is that the new technology will be used to create fake videos featuring real people, which could disrupt democratic processes. "I'm just terrified that something like this will bias an upcoming election," Prof. Oren Etzioni of the University of Washington, who specializes in artificial intelligence, told the New York Times. One scenario: a party could exploit the system to produce an incriminating video of a candidate in this year's U.S. presidential election and distribute it at a crucial moment among voters in a state or district where a few votes could sway the results one way or the other.
OpenAI is well aware of the potential for abuse, and this is the main reason that at this stage the model is open to access only to a limited list of experimenters, mainly academics and independent researchers, selected by the company. Their mission: to identify ways in which the new capabilities can be abused. "The goal is to give a preview of what's on the horizon, so people can see the capabilities of this technology, and for us to get feedback," Dr. Tim Brooks, a member of Sora's development team, told the New York Times.
The company did not say how long it intends to test Sora before providing wide access to the model. GPT-4 was tested for six months before it was opened to the public. A similar schedule would mean that Sora becomes accessible in August, just in time for the crucial moments of the U.S. election campaign. One can hope that the company will choose not to risk such a powerful and unpredictable tool being used to create videos that could influence the election results, and will wait with the model's public launch at least until after election day.
Another concern relates to the data used to train the model. OpenAI does not disclose the number of videos used for training or their source, stating only that it used videos publicly available on the web, as well as videos it licensed from copyright holders. The data a model is trained on can affect the results it produces and, for example, promote stereotypes against minorities or the creation of false content. It is therefore very important to know which sources OpenAI is using, if only so that it is possible to verify that they are diverse and representative.
Beyond these concerns, there is also the question of which industries the new technology will disrupt. Hollywood is in the crosshairs, especially professions such as photographers, special-effects experts, actors and the staff that surrounds them (make-up artists, hair stylists, etc.). Sora can currently produce only short videos, a few tens of seconds long at most, and without sound. But given the tremendous advances the technology has made in such a short period, the ability to produce more complex videos with sound, perhaps even dialogue, does not seem like a far-fetched scenario. It might not happen this year or next, but within five years? It doesn't seem like a good idea to bet against it.
And once that happens, all that will be required to create a film is a good screenwriter, and perhaps also a director of prompts who knows how to break down the script into written scenes that can be fed into the model and receive a complete film at the end of the process. Given the progress in the capabilities of large language models, it is possible that within five years even these two roles will not be needed, and with a few well-formulated sentences it will be possible to instruct a descendant of ChatGPT to produce a complete script ready to feed into a text-to-video model.
Such a scenario is still a few years in the future. But one industry that is in the crosshairs now, or at least as soon as Sora opens up to the public, is advertising photography. Often, an advertisement is nothing more than a concept: atmosphere, style and lifestyle. This is exactly the type of video that Sora already excels at making. A talented copywriter, in an hour of work and a few rounds of trial and error, can find the prompt that will make the model produce the 30 seconds they need for a commercial. Add some music (there is a model for that), and maybe some narration (that too), and you have a complete, original advertisement, produced by a single person.
There is no reason for these to be the only disrupted industries. Any field that uses video - studio news broadcasts, training videos, cooking shows and more - could face change once Sora and similar models reach maturity and widespread use. Many people may not like the result, but it is doubtful that this is what will stop the technology.