text2video
The first batch of AI experiments at Scroll lives in the text2video repository.
Tech
- RunPod for GPU compute.
- Streamlit for dev UI.
- JupyterLab for research.
- langchain for GovGPT’s agent framework.
- Wav2Lip for lip-syncing of AI avatars.
- facebook/musicgen-small for music generation.
- Google AI for text to speech and speech to text (subtitles).
- Bark for text to speech.
- Coqui for text to speech.
- Bhashini for text to speech.
- ElevenLabs for text-to-speech.
- OpenAI for text summaries and text-to-speech.
- Local Whisper for speech-to-text (subtitles).
- WhisperAPI for speech-to-text (subtitles).
- Translations by Google and Bing.
- MoviePy and other media manipulation libraries.
- Flask for Slack integration.
Experiments
GovGPT
Uses sayanarijit/datagovindia (which searches data.gov.in) to answer questions about government data with the langchain framework.
This wasn’t used much other than some test runs.
AI Avatar
Uses the MoviePy Python library along with other media manipulation libraries to:
- Generate a short summary from a Scroll.in article.
- Translate the summary into multiple languages.
- Dub the text summary into speech.
- Generate timestamped subtitles from the generated speech.
- Generate a talking avatar with lip-syncing (Wav2Lip).
- Compose the talking avatar, synced subtitles (with dedicated fonts), and the article cover image into a template background.
- Add low-volume background music.
- Add zooming effects, a blurred background effect, film grain, etc.
See the flow here
No avatar videos
Later the avatar was removed from videos and the template was updated accordingly.
Slack commands
This repo also contains a Slackbot implementation that allowed the team to run experiments and demos quickly using Slack commands.
For example:
- /dub text to generate speech for the given text.
- /summarize https://scroll.in/articles/123 to summarise the given article.
- /compose https://scroll.in/articles/123 to generate the AI Avatar video of the given article.
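For illustration, a slash command like /dub could be handled with a small Flask route. This is a hypothetical sketch, assuming a Flask-based Slack integration; the route path, payload fields, and response text are illustrative, not the actual bot code.

```python
# Hypothetical Flask handler for the /dub slash command. Slack POSTs the
# command text as form data; the handler must acknowledge quickly, so the
# heavy text-to-speech work would be queued, not done inline.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/slack/dub", methods=["POST"])
def dub():
    text = request.form.get("text", "")
    # ... enqueue the actual text-to-speech job here ...
    return jsonify({
        "response_type": "ephemeral",
        "text": f"Generating speech for: {text}",
    })
```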
Learnings
Accurate subtitles
For news videos, displaying the correct subtitle is a must; mistaken words can cost big. But accurate speech-to-subtitles generation is not easy: no matter how big the model is, it can always mistake one word for another, or get confused by uncommon words. So it's best to display the actual summary instead of the generated subtitles. For that, we need to transfer the timestamps from the generated subtitles to the actual summary. We created a Python library to do exactly that: matchingsplit. It takes the summary as text and the generated subtitles as a reference list of phrases, then splits the summary text into phrases that match the phrases of the reference subtitles.
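The core idea can be sketched as a word-count-proportional split. This is a simplified illustration of the technique, not the actual matchingsplit implementation.

```python
# Split the real summary into phrases whose lengths mirror the reference
# phrases from the generated subtitles, so the subtitle timestamps can be
# reused for the correct text.
def split_like(summary: str, reference_phrases: list[str]) -> list[str]:
    words = summary.split()
    ref_counts = [len(p.split()) for p in reference_phrases]
    total_ref = sum(ref_counts)
    out, start = [], 0
    for i, count in enumerate(ref_counts):
        if i == len(ref_counts) - 1:
            end = len(words)  # last phrase takes the remainder
        else:
            # allocate words proportionally to the reference phrase length
            end = start + round(count * len(words) / total_ref)
        out.append(" ".join(words[start:end]))
        start = end
    return out

phrases = split_like(
    "The government announced new tax rules on Friday",
    ["The government announced", "new tax rules on Friday"],
)
```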
Placing animated video into a template
We used myvision.ai to annotate the polygon (VGG JSON format) into each template background where the animated video would be placed. MoviePy, a super convenient framework on top of ffmpeg and ImageMagick, can easily place the video into the annotated polygon.
Running GPU
Instead of buying expensive GPUs, it's best to start a temporary RunPod instance. There are many templates to choose from, and the pricing is justifiable. That said, having a local GPU adds some convenience for development.
Sibyl and Genie
Later, text2video's AI Avatar part was refactored into Sibyl (backend) and Genie (frontend) and turned into a product with many improvements, new features and a revamped video design.
The project was tracked here.
Multiple news organizations showed interest in the product, so dedicated per-organization customizations were made in the product (hard coded).
Tech
- Some tech from the text2video project above.
- Sarvam AI for text-to-speech and translations.
- Anthropic Claude for text generation.
- ULCA for translations.
Experiments
Auto-generation workflow
- Auto-generate summaries for each new article and post them to Slack for human review.
- After review, generate short videos and post them to Slack for final review and posting to social media.
Model comparisons
- Ran experiments with multiple service providers and local models for summarization, text-to-speech, speech-to-subtitles, translations etc.
- Tried multiple regional languages.
Prompt engineering
- Experiments to optimize the summarize prompt.
- Ran Slack experiments with multiple article formats like listicles, atoms, FAQs etc.
Learnings
Dates in prompts
As a best practice, always include the article's publish date and the current date (with weekday) alongside the article in the prompt. This helps the model avoid getting confused by the tenses and dates mentioned in the article.
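For example, a prompt builder along these lines (the template wording is hypothetical, not the production prompt):

```python
from datetime import date

# Include both the article's publish date and the current date (with weekday)
# in the prompt, so the model can resolve relative dates and tenses.
def build_prompt(article_text: str, published: date) -> str:
    today = date.today()
    return (
        f"Current date: {today.strftime('%A, %d %B %Y')}\n"
        f"Article published: {published.strftime('%A, %d %B %Y')}\n\n"
        f"Summarise the following article:\n{article_text}"
    )

prompt = build_prompt("The cabinet approved the bill yesterday.", date(2024, 1, 5))
```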
Storing human reviews for future training
It’s best to store the human reviews in a structured format for future training of the model. This can be used to fine-tune the model or to create a feedback loop for continuous improvement.
Pronunciation Glossary
For accurate pronunciation of uncommon words, it's best to feed the model a modified version of the text, with commonly mispronounced words replaced by their phonetic pronunciations. This involves human review to identify and modify the mispronounced words. These replacements should also be stored for future training/tweaking of the model or model params.
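In code, this can be as simple as a replacement pass over the text before it reaches the TTS engine. The glossary entries below are illustrative examples, not the actual reviewed glossary.

```python
# Hypothetical pronunciation glossary: map commonly mispronounced words to
# phonetic spellings, applied to the text before text-to-speech.
GLOSSARY = {
    "Scroll.in": "Scroll dot in",
    "Lok Sabha": "Loke Sub-haa",  # illustrative phonetic spelling
}

def apply_glossary(text: str, glossary: dict[str, str]) -> str:
    for word, phonetic in glossary.items():
        text = text.replace(word, phonetic)
    return text

tts_text = apply_glossary("Read Scroll.in for Lok Sabha updates", GLOSSARY)
```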
Getty image search using search engine API
This was a quick experiment to search for Getty images using their search engine API.
Learnings are documented here and here.
Scroll Vector Database
sense.scroll.team is a vector database that stores the vector embeddings of all the latest articles published by Scroll, adding new articles as they are published.
Tech:
- Typesense for search engine and vector database.
- ts/all-MiniLM-L12-v2 model for generating vector embeddings.
- FastAPI for backend API.
- Vue.js with vue-instantsearch/vue3/es for frontend.
Learnings:
- It's pretty easy to generate embeddings locally, and semantic search is pretty accurate and fast with Typesense. It scales really well with the number of articles, even on limited hardware.
- Along with semantic search, Typesense also provides really good support for keyword search, filtering, faceting, etc. Don't forget to set facet=True on eligible fields in the schema before populating data.
- Other powerful features like hybrid search, geo search, image search, natural-language/conversational search, and AI agent modes are also available in Typesense.
- For semantic search, long paragraphs are better than short phrases.
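For illustration, a collection schema along these lines enables both faceting and auto-embedding. Field names here are hypothetical, not the actual Scroll schema; the embed block follows Typesense's built-in-model syntax.

```python
# Hypothetical Typesense collection schema: facet-enabled fields plus an
# auto-embedding field generated from title and body with the MiniLM model.
schema = {
    "name": "articles",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "body", "type": "string"},
        {"name": "section", "type": "string", "facet": True},
        {"name": "author", "type": "string", "facet": True},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["title", "body"],
                "model_config": {"model_name": "ts/all-MiniLM-L12-v2"},
            },
        },
    ],
}
```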
Scroll Lo-Fi
scroll.in/lofi or lofi.scroll.team is a calm, slow-paced news digest for the day, with a lo-fi music background.
It’s one of the formats we came up with while experimenting with different formats for news digests.
Get the backend logic here, and here is the fully vibe-coded frontend, which you can open directly in a browser (it's a plain .html file).
Tech
- BAML for type-safe text generation and summarization.
- FastAPI for backend API.
- Vue.js for frontend.
- ElevenLabs for text-to-speech.
- Suno for music generation.
Learnings
Multiple Languages
We experimented with Hindi and Bengali, but the quality of the generated audio was not good enough. Also, ElevenLabs' pricing makes it expensive to generate audio for all the articles in multiple languages. So we decided to stick with English for now. There are currently no good-enough alternatives to ElevenLabs for text-to-speech.
BAML as agentic framework
While the concept of BAML is unique and theoretically powerful, it comes with minor inconveniences, like having to pass API keys via environment variables, dealing with compatibility issues between different versions of the library, forgetting to compile, etc.
I (Arijit) personally found PydanticAI much more practical and easier to work with.
Atomic
Along with Scroll Lo-Fi, we also explored multiple formats in which we can represent the same article to end users, compiled them into the Atomic repo, and hosted the demos at scroll.in/ai.
It's a fast-paced, fully experimental, disorganised repo containing both live and abandoned experiments.
Tech
- Scroll Vector Database for semantic search on latest articles.
- BAML and PydanticAI for agentic framework.
- FastAPI for backend API.
- Vue.js for frontend.
Experiments
Representing a single article in multiple formats
- Detail Slider - Reveal original paragraphs by importance: TL;DR -> Tell me more -> Tell me everything.
- Complexity Slider - 6 levels of complexity: Original -> Beginner -> Semi-familiar -> Aware of topic -> Domain-aware -> Expert.
- Facts - Need to Know, Good to Know.
- Calculator - e.g. calculate personal tax with full breakdown based on the article about new tax rules.
- Mindmap - Knowledge Graph mapping entities and their relationships mentioned in the article.
- Expander - Expand highlighted phrases to reveal more details about them.
- Impact - Decision tree based UI to explore how the news directly or indirectly impacts you.
- Number - Story in numbers - i.e. just the numbers extracted from the article in tabular format.
- FAQs - Nested, expandable, frequently asked questions about the article, with answers.
Formats for cross-article storylines
Learnings
BAML as agentic framework
See Scroll Lo-Fi for learnings on BAML as an agentic framework.
PydanticAI as agentic framework
Reduces a lot of boilerplate when the input type (deps_type) and output type (output_type) are used along with dynamic instructions (@agent.instructions).
Example:

    from typing import TypedDict

    from pydantic import BaseModel
    from pydantic_ai import Agent, RunContext

    PROMPT = """\
    {input[foo]}
    """

    class Input(TypedDict):  # TypedDict is convenient for direct input.
        foo: str

    class Output(BaseModel):  # BaseModel is convenient for validation and parsing.
        result: str

    agent = Agent(
        name="Temporal Events Extractor",
        model=provider.gpt_5_1,  # project-specific provider object
        output_type=Output,
        deps_type=Input,
    )

    @agent.instructions
    def prompt(ctx: RunContext[Input]) -> str:
        return PROMPT.format(input=ctx.deps)

    result = await agent.run(deps={"foo": "bar"})  # inside an async context
    output = result.output
Dig here for a complete example.
OpenAPI spec driven frontend generation
Being primarily a backend developer, I found it really convenient to generate the frontend directly from the OpenAPI spec of the backend API. This way, I can focus on building the backend logic and let the frontend be generated automatically.
For that, I had to name the API endpoints and their input/output models in a way that makes sense for the frontend.
Example input schema:
    class EventsByArticleIDsRequest(BaseModel):
        article_ids: list[int]
Example output schema:
    class ComplexitySliderResponse(BaseModel):
        title: str
        heading: str
        complexity_0_original_html: str
        complexity_1_beginner_html: str
        complexity_2_semi_familiar_html: str
        complexity_3_aware_of_topic_html: str
        complexity_4_domain_aware_html: str
        complexity_5_expert_html: str
Temporal Timeline
Taking the learnings from Atomic, we wanted to dig deeper into atomising articles and see how well it helps us render a story in different formats, starting with formats like timeline and heatmap.
Here we first split each article into one or more events, each of which is a self-contained unit, i.e. an Atom, and then try to append each atom to the timeline of its corresponding story. The timeline is a sequence of events ordered by their temporal information, such as date and time.
See demo here.
See the backend logic here.
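A minimal sketch of what such an atom and timeline might look like (field names are illustrative, not the actual project schema):

```python
from datetime import date

from pydantic import BaseModel

class Event(BaseModel):
    summary: str             # self-contained description of one event (the Atom)
    occurred_on: date        # temporal anchor used for ordering
    article_ids: list[int]   # articles that reference this event

class Timeline(BaseModel):
    topic: str
    events: list[Event] = []

    def append(self, event: Event) -> None:
        # keep the timeline ordered by temporal information
        self.events.append(event)
        self.events.sort(key=lambda e: e.occurred_on)
```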
Tech
- Scroll Vector Database for semantic search on latest articles.
- PydanticAI for agentic framework.
- FastAPI for backend API.
- Vue.js for frontend.
Experiments
Atomic Events from full Adani coverage
We parsed events from all the articles related to Adani (keyword match) and assigned them to different storylines based on their semantic similarity. This way, the primary timelines were created. We also tried clustering the events using the hcluster algorithm, and got good intuition via real-time visualisation of the clusters while adjusting the distance threshold. This helped us identify related events and group them together in the timeline.
Learnings
The News Atom and Timelines
The News Atom concept is powerful for structuring news articles into self-contained units of information. This allows us to create timelines that are more granular and easier to navigate.
However, with more data, it might get expensive to keep appending atoms to the timeline. So we need to be mindful of the cost and optimize the process accordingly.
Duplicate events and updates
Multiple articles on the same topic may reference the same event. With each event being one atom, such references need to be detected, and the article ID needs to be added to the event's list of referencing article IDs. But not all mentions of an event are the same; sometimes a mention is an update to an existing related event. In that case, we need to detect this accordingly and add the new event as an update to the existing one.
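One simple way to sketch the duplicate check is a cosine-similarity threshold over event embeddings. The threshold and the toy vectors below are illustrative; the real pipeline works against the vector database.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # cosine similarity between two embedding vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_duplicate(new_vec: np.ndarray, existing: list[np.ndarray],
                   threshold: float = 0.9):
    """Return the index of the existing event the new one duplicates, else None."""
    best_i, best_s = None, threshold
    for i, vec in enumerate(existing):
        s = cosine(new_vec, vec)
        if s >= best_s:
            best_i, best_s = i, s
    return best_i
```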
Matching events vs matching articles
Because articles of the same format belonging to the same topic usually share many common references (the same events), they tend to be closer in cosine similarity. Articles belonging to the same topic but of different formats may sometimes be farther apart in cosine similarity than some same-format, different-topic articles. But since we fetch topics by matching events, not articles, the format doesn't actually matter in our case.
Duplicate topics
One hard-to-solve problem is:
- After extracting events from a new article, we find the most similar past events, then the topics those events belong to, and then append the new events to the appropriate topic.
- But if the most similar past events belong to a different topic and we don't fetch enough past events, the system might think it's a new topic and create one, resulting in duplicate topics for the same story. To mitigate this, we need to fetch enough past events to cover all the topics, but that also increases the cost. So we need to find a balance between fetching enough past events and keeping the cost in check.
Quick clustering using hcluster algorithm
We can use the hcluster algorithm to quickly cluster articles based on their semantic similarity. We can get a real-time visual representation of the clusters by adjusting the distance threshold. This can help us identify related articles and group them together in the timeline.
See the experiment in notebook, and web demo for more details.
We observed that around a 44% cluster-radius value, we get a good view of the clusters without too much fragmentation or merging.
This method of forming clusters is much faster and more cost-effective than the traditional method of searching and matching relevant topics.
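The approach can be sketched with SciPy's hierarchical clustering, which the hcluster package builds on. Random unit vectors stand in for real article embeddings here, and the 0.44 threshold mirrors the cluster-radius value mentioned above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Stand-in embeddings: 10 random unit vectors (real ones would come from the
# sentence-embedding model used by the vector database).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 384))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Agglomerative clustering on cosine distance.
Z = linkage(embeddings, method="average", metric="cosine")

# The distance threshold plays the role of the "cluster radius": lower values
# fragment the clusters, higher values merge them.
labels = fcluster(Z, t=0.44, criterion="distance")
```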