Augmentation Lab

The Imagination Machine: multimodal human-AI storytelling

Updated: May 20

Note: this system was built before the release of DALL-E 2 and ChatGPT, and thus uses older models.


Machine learning (ML) for creative storytelling has emerged with the development of powerful generative models. While writers, artists, and musicians have experimented with integrating various generative systems into their creative work, they typically only explore one modality at a time. Storytelling, however, is becoming increasingly multimodal in the digital age. The Imagination Machine is an AI-assisted multimodal storytelling platform that integrates text, audio, and visual generative ML models. Users interface with The Imagination Machine via a web application that facilitates a bi-directional workflow in which the human creator and AI models alternate in generating content. This novel software system, comprising a database, server, and frontend interface, was developed, iteratively improved, and pilot tested in creative experiments. The Imagination Machine enables easy and fast multimodal content generation, allowing creatives and laymen alike to tell any immersive story they can imagine.


Sometimes referred to as the “final frontier,” computational creativity is key to bridging the gap between human and computer modes of understanding, communicating, and generating meaning. By mimicking and enhancing the creative process, we can better understand and embed human systems of meaning into machine algorithms to promote alignment and human-centric technological progress.

Recent developments in generative ML, such as large-scale language models like GPT3 [1] and visual/audio generative adversarial networks like VQGAN [4], have enabled massive advancements in generated content. Yet existing industry and academic endeavors using these technologies focus on progress within a single medium, e.g. increasing the fluency of generated text or the aesthetic appeal of generated art and music. Few efforts have been made to synergize technologies across different mediums to create cohesive multi-medium stories.

To address this gap, this work develops software infrastructure and creative workflows for AI-assisted multi-medium storytelling.


This research develops software infrastructure and creative workflows for AI-assisted multi-medium storytelling through an iterative process of system development and creative experimentation. First, methodologies for human + AI creative collaboration were established. Then, software was developed based on these frameworks. Creative experimentation with this software revealed shortcomings, which fed back into further iterations of development.


Fig. 1 shows the creative workflow and software architecture for The Imagination Machine.

4.1 Software Development

The software system comprises a database, server, and frontend interface. Google Sheets was selected to provide an accessible database for a general user audience. The web application was developed using the Streamlit framework and is hosted on Streamlit Share, but can also be hosted on a local computer using the open-source codebase.
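As a sketch of how this persistence layer might look (the column layout, sheet name, and helper names below are illustrative assumptions, not the project's actual schema), each story paragraph can be stored as one spreadsheet row:

```python
# Hypothetical sketch of a Google Sheets persistence layer.
# Column order (user, project, paragraph index, text) is an assumption.

def paragraph_row(user, project, index, text):
    """Flatten one story paragraph into a spreadsheet row."""
    return [user, project, index, text]

def rows_to_story(rows):
    """Reassemble a project's rows into ordered paragraph text."""
    return [text for _, _, _, text in sorted(rows, key=lambda r: r[2])]

# Writing a row would use gspread (requires service-account credentials):
# import gspread
# gc = gspread.service_account()
# ws = gc.open("imagination_machine_db").sheet1  # placeholder sheet name
# ws.append_row(paragraph_row("alice", "three_bears", 1, "The king lay..."))
```

Treating the sheet as a simple append-only table keeps the database inspectable by non-technical users, which fits the accessibility goal stated above.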

Text generation is implemented through a simple API call to OpenAI’s GPT3 API. Users can adjust text generation parameters through the user interface as desired.
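A minimal sketch of such a call, using the pre-ChatGPT-era `openai.Completion` endpoint (the engine choice, default parameters, and helper name are illustrative assumptions, not the project's actual configuration):

```python
# Hypothetical sketch: continue a story with GPT3's completions API.
# Parameter defaults here are placeholders exposed to the user via the UI.

def build_completion_request(story_so_far, temperature=0.7, max_tokens=120):
    """Assemble keyword arguments for openai.Completion.create."""
    return {
        "engine": "davinci",
        "prompt": story_so_far.strip() + "\n",
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stop": ["\n\n"],  # stop at a paragraph break
    }

# The actual call requires an API key (shown for illustration):
# import openai
# resp = openai.Completion.create(**build_completion_request(text))
# continuation = resp["choices"][0]["text"]
```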

For visual generation, The Imagination Machine uses GPT3, primed by few-shot examples, to generate a descriptive caption from narrative text that is then fed into a generative server. The diffusion + CLIP server is hosted using an AWS ml.g4dn.xlarge SageMaker notebook instance. This generative system was selected after experimentation with a number of other systems, including VQGAN + CLIP variations. Among other research, [6] suggests that diffusion models beat GANs in image synthesis. Thus, this server uses code from this CLIP Guided Diffusion open source notebook. The server is called at a static endpoint configured using localtunnel, with inputs specifying the illustration caption, user name, project name, and paragraph number. The server uses CLIP + diffusion to generate images and saves these images in an AWS S3 bucket. The frontend Streamlit app then fetches and displays these cloud-hosted images to the user.
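The server call can be sketched as a simple GET request carrying the four inputs described above (the endpoint URL and parameter names here are placeholders, not the actual localtunnel configuration):

```python
from urllib.parse import urlencode

def build_illustration_url(base, caption, user, project, paragraph):
    """Compose a generation-server request from the four inputs:
    illustration caption, user name, project name, and paragraph number."""
    query = urlencode({
        "caption": caption,
        "user": user,
        "project": project,
        "paragraph": paragraph,
    })
    return f"{base}?{query}"

# e.g. (endpoint is a placeholder):
# import requests
# requests.get(build_illustration_url(
#     "https://example.loca.lt/generate",
#     "a bear king on his deathbed", "alice", "three_bears", 1))
```

The user/project/paragraph triple doubles as the S3 key prefix, so the frontend can later fetch the finished images without the server having to respond synchronously.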

Music generation is implemented based on Gonsalves’ pipeline using GPT3 fine-tuned on the OpenEWLD database of over 500 songs. The songs were pre-processed to convert MusicXML format to ABC format and then transposed to C major to improve GPT3’s learning of melodic relationships. This fine-tuned model takes song and band names as input and outputs a sequence of notes in text format. The Imagination Machine uses GPT3, primed by few-shot examples, to generate relevant song and band names from narrative text, which are then used to query the fine-tuned model to generate music notes. These music notes are finally converted into .mp3 format, uploaded to an AWS S3 bucket, and displayed as a playable widget to the user on the frontend.
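The transpose-to-C step can be illustrated with standard interval arithmetic (this helper is a sketch under assumed inputs of major-key tonic names like "G" or "Bb"; it is not the project's preprocessing code):

```python
# Semitone offsets of the natural note letters above C.
PITCH_CLASS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def interval_to_c(key_tonic):
    """Semitones to shift a major key so its tonic lands on C,
    choosing whichever direction moves the shorter distance."""
    offset = PITCH_CLASS[key_tonic[0]]
    offset += key_tonic.count("#") - key_tonic.count("b")  # accidentals
    down = -(offset % 12)
    up = down + 12
    return down if abs(down) <= abs(up) else up
```

Normalizing every training song to one key means the model sees consistent pitch relationships, which is the stated motivation for the transposition.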

Each of these generative functions is available for the user to initiate for each paragraph of their story. When the user is finished with the entire story, an integration function compiles all the human- and AI-generated text, illustrations, and music from all sections of the story into a final video. The story text is overlaid onto videos of the illustrations being generated and is converted to speech via Google Cloud Text-to-Speech. The voiceover and music are integrated using moviepy.
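The assembly step might be sketched as follows; the timing helper is runnable, while the moviepy calls are shown commented because they need the rendered media files (file names here are placeholders, not the project's layout):

```python
def clip_start_times(durations):
    """Start offset of each paragraph's clip when played back to back,
    used to align per-paragraph voiceover and music with the video."""
    starts, t = [], 0.0
    for d in durations:
        starts.append(t)
        t += d
    return starts

# Assembly with moviepy (illustrative):
# from moviepy.editor import (VideoFileClip, AudioFileClip,
#                             CompositeAudioClip, concatenate_videoclips)
# clips = [VideoFileClip(f"para_{i}.mp4") for i in range(n)]
# video = concatenate_videoclips(clips)
# voiceover = AudioFileClip("tts_voiceover.mp3")  # Google Cloud TTS output
# music = AudioFileClip("generated_song.mp3")
# video = video.set_audio(CompositeAudioClip([voiceover, music]))
# video.write_videofile("final_story.mp4")
```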

4.2 Few-Shot Learning - Prompt Exploration

Few-shot learning was used to prime GPT3 to translate narrative text into input formats appropriate for the generative visual and audio systems. Thus, the quality of the overall software system depended heavily on the effectiveness of this translation module. Prompt design exploration was conducted to find an optimal set of few-shot learning examples for GPT3 priming, as shown below.

Fig. 2 shows the few-shot learning examples ultimately selected for the translation modules for visual generation and audio generation, respectively.
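A few-shot priming prompt of this general shape can be assembled programmatically (the example pair and the Paragraph/Caption delimiters below are placeholders for illustration, not the prompt actually selected in Fig. 2):

```python
def build_fewshot_prompt(examples, paragraph):
    """Turn (paragraph, caption) example pairs plus a new paragraph into
    a completion prompt that coaxes GPT3 into emitting a caption."""
    parts = []
    for para, caption in examples:
        parts.append(f"Paragraph: {para}\nCaption: {caption}\n")
    parts.append(f"Paragraph: {paragraph}\nCaption:")
    return "\n".join(parts)

# Hypothetical usage:
# prompt = build_fewshot_prompt(
#     [("A storm raged over the cliffs.", "a stormy sea at night")],
#     "The king of the Bear Kingdom lay in his royal bed...")
```

Ending the prompt at "Caption:" leaves GPT3 to complete exactly the field the downstream visual or audio system consumes.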

4.3 User Interface and Experience

The user experience for the software system developed is as follows:

  1. Select an existing story, or create a new one.

  2. Begin or continue writing the story in the appropriate paragraph box.

  3. Adjust the generative parameters on the sidebar as desired.

  4. Click the “Write” button to have AI continue writing your story. Click the “Illustrate” button to have AI generate an illustration for your paragraph. Click the “Compose” button to have AI generate a song for your paragraph. Regenerate as many times as desired.

  5. Click “Export Project” when you are finished and want to integrate the text, visual, and audio components of your project into a cohesive product.
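In Streamlit terms, the steps above map onto a handful of widgets; the sketch below is illustrative only, and the handler names in the table are placeholders, not the app's real functions:

```python
# Button labels from the workflow above, mapped to placeholder handlers.
ACTIONS = {
    "Write": "generate_text",
    "Illustrate": "generate_illustration",
    "Compose": "generate_music",
    "Export Project": "export_video",
}

# Illustrative Streamlit layout (run with `streamlit run app.py`):
# import streamlit as st
# paragraph = st.text_area("Continue writing your story")
# temperature = st.sidebar.slider("Temperature", 0.0, 1.0, 0.7)
# for label in ACTIONS:
#     if st.button(label):
#         ...  # dispatch to the corresponding generative function
```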

Fig. 3 shows two screenshots of the user interface for The Imagination Machine. The first screenshot shows an example of GPT3-generated text based on a paragraph of my example story, “The Three Bears.” Below it, the music player widget displays the generated song. The second screenshot shows the GPT3-generated caption for a paragraph of the story and the illustration generated using this caption.


Pilot experimentation with this software was conducted for a number of different genres of story, including romance, science fiction, and fantasy. While the system produced relevant and appropriate outputs for science fiction and fantasy stories, romance stories suffered from explicit content, likely due to the dataset on which the visual generative model was trained. Interestingly, the system seemed to perform better at generating illustrations for science fiction stories, likely because hyperrealism is correlated with perceived quality and the dataset the visual generative model was trained on contains many real photos. However, these evaluations are entirely subjective and more sophisticated evaluation metrics and methods must be developed before coming to any conclusions.

Another notable observation was the difference in both function and effectiveness between the text, visual, and audio generative models. The AI-generated text served as inspiration for continuing the story, often adding unexpected details or plot points that would change the storyline. The generated illustrations and music, however, served more supplementary functions, as accompaniments to the central narrative. This could be due in part to the nature of storytelling as inherently language-based, and in part to the slower turnaround of the generative visual and audio processes compared to the generative text process.

Example visual outputs from a fantasy story, titled “The Three Bears,” are shown below. Briefly, “The Three Bears” is about three bear princes fighting for the throne after their father’s death. They recruit armies of fire, water, and ice and fight in a never-ending war. It is a short story adaptation loosely based on Crowned: The Legend of the Three Bears.

Each illustration represents a paragraph of the story. For example, the first illustration was generated from the paragraph:

“The king of the Bear Kingdom lay in his royal bed half covered by his red embroidered blankets, his face frozen in pale death. The court physician and the king's personal physician were at the king's bedside, and a few servants were hovering by the door, but they knew it was too late.”

This paragraph itself was partly generated by AI: the bolded text was generated by GPT3 and then integrated into the human-written non-bolded text. Each of these paragraphs was also used to produce relevant music. Together, this multimodal storytelling system was able to capture the latent space of the imagined story in various mediums, creating a richer and more complex experience for the audience. Ultimately, the narrative text, illustrations, and music were integrated into a cohesive video. The text, audio, visual, and integrated outputs can be found in this data folder. The final output can be found here.


This work was inspired by The Imagination Machine developed by the PI, Matt Fisher, and Jack Lewis at the MIT Reality Hackathon. It builds on previous work by the PI, generously supported by Harvard College Research Program funding under the guidance of Jeffrey Schnapp.
