text2video

The first batch of AI experiments at Scroll lives in the text2video repository.

Tech

  • RunPod for GPU compute.
  • Streamlit for dev UI.
  • JupyterLab for research.
  • langchain for GovGPT’s agent framework.
  • Wav2Lip for lip-syncing of AI avatars.
  • facebook/musicgen-small for Music generation.
  • Google AI for text to speech and speech to text (subtitles).
  • Bark for text to speech.
  • Coqui for text to speech.
  • Bhashini for text to speech.
  • ElevenLabs for text-to-speech.
  • OpenAI for text summaries and text-to-speech.
  • Local Whisper for speech-to-text (subtitles).
  • WhisperAPI for speech-to-text (subtitles).
  • Translations by Google and Bing.
  • MoviePy and other media manipulation libraries.
  • Flask for Slack integration.

Experiments

GovGPT

Uses sayanarijit/datagovindia (which searches data.gov.in) to answer questions about government data, built on the langchain agent framework.

This wasn’t used much other than some test runs.

AI Avatar

Uses the MoviePy Python library, along with other media manipulation libraries, to:

  • Generate a short summary from a Scroll.in article.
  • Translate the summary into multiple languages.
  • Dub the text summary into speech.
  • Generate timestamped subtitles from the generated speech.
  • Generate a talking avatar with lip-syncing (via Wav2Lip).
  • Compose the talking avatar, synced subtitles with dedicated fonts, and the article cover image onto a template background.
  • Add low-volume background music.
  • Add a zooming effect, a blurred background effect, film grain, etc.
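The timestamped subtitles produced in the pipeline above can be serialized to the standard SRT format before compositing. A minimal sketch, assuming segments arrive as (start, end, text) tuples in seconds (this is illustrative, not the repo's actual code):

```python
def fmt_ts(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Render (start, end, text) segments as an SRT document."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{fmt_ts(start)} --> {fmt_ts(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

MoviePy can then render each segment as a timed TextClip over the template background.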

See the flow here

No avatar videos

Later, the avatar was removed from the videos and the template was updated accordingly.

Slack commands

This repo also contains a Slackbot implementation that allowed the team to run experiments and demos quickly using Slack commands.

For example:

  • /dub text to generate speech for the given text.
  • /summarize https://scroll.in/articles/123 to summarise the given article.
  • /compose https://scroll.in/articles/123 to generate the AI Avatar video of the given article.
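Routing commands like these comes down to splitting the message into a command and an argument and dispatching to a handler. A minimal registry sketch (handler bodies are placeholders, not the actual bot code):

```python
HANDLERS = {}

def command(name):
    """Decorator that registers a handler for a slash command."""
    def wrap(fn):
        HANDLERS[name] = fn
        return fn
    return wrap

@command("/dub")
def dub(arg):
    # Placeholder: the real handler would call the TTS pipeline.
    return f"Generating speech for: {arg}"

@command("/summarize")
def summarize(arg):
    # Placeholder: the real handler would fetch and summarise the article.
    return f"Summarising: {arg}"

def dispatch(message):
    """Split a Slack message into command and argument, then route it."""
    cmd, _, arg = message.partition(" ")
    handler = HANDLERS.get(cmd)
    if handler is None:
        return f"Unknown command: {cmd}"
    return handler(arg)
```

In the real bot, Flask receives the slash-command POST from Slack and the dispatcher's return value becomes the reply.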

Learnings

Accurate subtitles

For news videos, displaying the correct subtitles is a must; a mistaken word can be costly. But accurate speech-to-subtitle generation is not easy: no matter how large the model is, it can mistake one word for another or get confused by uncommon words. So it's best to display the actual summary instead of the generated subtitles. For that, we need to transfer the timestamps from the generated subtitles onto the actual summary. We created a Python library to do exactly that: matchingsplit. It takes the summary as text and the generated subtitles as a reference list of phrases, then splits the summary text into phrases that match the phrases of the reference subtitles.
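The core idea can be sketched with a simplified word-count alignment: split the summary so each chunk has as many words as the corresponding reference phrase, then reuse that phrase's timestamps. This is a naive stand-in, not the actual matchingsplit implementation (which matches phrases more robustly):

```python
def transfer_timestamps(summary, reference):
    """Split `summary` into phrases aligned with `reference`, a list of
    (start, end, text) subtitle entries, by matching word counts.
    Returns (start, end, summary_phrase) tuples."""
    words = summary.split()
    out = []
    i = 0
    for start, end, ref_text in reference:
        n = len(ref_text.split())
        out.append((start, end, " ".join(words[i:i + n])))
        i += n
    # Append any leftover summary words to the last phrase.
    if i < len(words) and out:
        start, end, text = out[-1]
        out[-1] = (start, end, (text + " " + " ".join(words[i:])).strip())
    return out
```

With this, the video shows the editorially correct summary text while keeping the timing recovered from Whisper's transcription.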

Placing animated video into a template

We used myvision.ai to annotate the polygon (in VGG JSON format) where the animated video would be placed on each template background. MoviePy, a convenient framework on top of ffmpeg and ImageMagick, can then easily place the video into the annotated polygon.
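Placing the clip starts with reading the annotated polygon's extent out of the VGG JSON. A minimal sketch, assuming a VGG-style region dict (the helper name is illustrative):

```python
def polygon_bbox(region):
    """Compute the bounding box (x, y, w, h) of a VGG-style polygon region:
    {"shape_attributes": {"all_points_x": [...], "all_points_y": [...]}}"""
    xs = region["shape_attributes"]["all_points_x"]
    ys = region["shape_attributes"]["all_points_y"]
    x, y = min(xs), min(ys)
    return x, y, max(xs) - x, max(ys) - y
```

The bounding box then gives MoviePy the size to resize the video to and the position to composite it at over the template background.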

Running GPU

Instead of buying expensive GPUs, it's best to spin up a temporary RunPod instance. There are many templates to choose from and the pricing is reasonable. Still, for local development, having a GPU adds some convenience.