Temporal Timeline
Building on the learnings from Atomic, we wanted to dig deeper into atomising articles and see how well this helps us render a story in different formats, starting with formats such as the timeline and the heatmap.
Here we first split each article into one or more events, each of which is a self-contained unit (an Atom), and then try to append each atom to the timeline of its corresponding story. A timeline is a sequence of events ordered by their temporal information, such as date and time.
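To make the atom and timeline idea concrete, here is a minimal sketch of the two structures. The class and field names (`Atom`, `Timeline`, `occurred_at`, `article_ids`) are illustrative assumptions, not the names used in the actual backend.

```python
from bisect import insort
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(order=True)
class Atom:
    # Hypothetical shape of an atom: only `occurred_at` is used for ordering.
    occurred_at: datetime
    summary: str = field(compare=False)
    article_ids: list[str] = field(default_factory=list, compare=False)


@dataclass
class Timeline:
    story: str
    atoms: list[Atom] = field(default_factory=list)

    def append(self, atom: Atom) -> None:
        # insort keeps the list sorted by time even when atoms arrive out of order.
        insort(self.atoms, atom)


timeline = Timeline(story="example-story")
timeline.append(Atom(datetime(2024, 3, 2), "Second event", ["a2"]))
timeline.append(Atom(datetime(2024, 1, 15), "First event", ["a1"]))
print([atom.summary for atom in timeline.atoms])
```

Keeping the list sorted on insert means the frontend can render the timeline without re-sorting on every request.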
See demo here.
See the backend logic here.
Tech
- Scroll Vector Database for semantic search on latest articles.
- PydanticAI for agentic framework.
- FastAPI for backend API.
- Vue.js for frontend.
Experiments
Atomic Events from full Adani coverage
We parsed events from all the articles related to Adani (selected by keyword match) and assigned them to storylines based on their semantic similarity. This is how the primary timelines were created. We also tried clustering the events with the hcluster algorithm, and built a good intuition for the data by adjusting the distance threshold and watching the clusters change in real time. This helped us identify related events and group them together in the timeline.
Learnings
The News Atom and Timelines
The News Atom concept is powerful for structuring news articles into self-contained units of information. This allows us to create timelines that are more granular and easier to navigate.
However, with more data, it might get expensive to keep appending atoms to the timeline. So we need to be mindful of the cost and optimize the process accordingly.
Duplicate events and updates
Multiple articles on the same topic may reference the same event. Since each event is one atom, a repeated mention needs to be detected as a duplicate, and the new article id added to the event's list of referencing article ids. But not every mention of an event is identical: sometimes it is an update to an existing, related event. In that case we need to detect it as such and attach the new event as an update to the existing one.
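One possible way to sketch this three-way decision (duplicate, update, or new event) is with embedding similarity and two cut-offs. This is an assumption for illustration only: the thresholds `DUPLICATE_AT` and `UPDATE_AT`, the `Event` class and the `merge_event` helper are hypothetical, and in practice the decision could just as well be made by an LLM agent.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Event:
    text: str
    embedding: list[float]
    article_ids: set[str] = field(default_factory=set)
    updates: list["Event"] = field(default_factory=list)


# Hypothetical thresholds; real values would need tuning on actual data.
DUPLICATE_AT = 0.95
UPDATE_AT = 0.80


def merge_event(new: Event, existing: list[Event]) -> str:
    """Fold `new` into `existing` and report what happened."""
    best = max(existing, key=lambda e: cosine(new.embedding, e.embedding), default=None)
    score = cosine(new.embedding, best.embedding) if best else 0.0
    if best and score >= DUPLICATE_AT:
        # Same event mentioned again: only record the referencing articles.
        best.article_ids |= new.article_ids
        return "duplicate"
    if best and score >= UPDATE_AT:
        # Related but not identical: attach as an update to the existing event.
        best.updates.append(new)
        return "update"
    existing.append(new)
    return "new"
```

For example, calling `merge_event` with an event whose embedding matches an existing one returns `"duplicate"` and merges the article ids instead of creating a second atom.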
Matching events vs matching articles
Articles of the same format on the same topic usually share many references to the same events, so they tend to be close in cosine similarity. Articles on the same topic but in different formats can sometimes be farther apart than articles of the same format on different topics. Since we find the topic by matching events rather than whole articles, the format does not matter in our case.
Duplicate topics
One hard-to-solve problem is:
- After extracting events from a new article, the pipeline finds the most similar past events, looks up the topics those events belong to, and appends the new events to the appropriate topic.
- But if the most similar past events belong to a different topic and we don't fetch enough past events, the pipeline may conclude that this is a new story and create a new topic for it, which results in duplicate topics for the same story. To mitigate this, we need to fetch enough past events to cover all the candidate topics, but that also increases the cost. So we need to find a balance between fetching enough past events and keeping the cost in check.
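The trade-off above can be illustrated with a small sketch. It assumes past events are stored as `(embedding, topic_id)` pairs and that topic candidates come from the `k` nearest past events. The helper `candidate_topics`, the topic ids and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_topics(new_embedding, past_events, k):
    """Return the distinct topics of the k most similar past events, best first."""
    ranked = sorted(past_events, key=lambda ev: cosine(new_embedding, ev[0]), reverse=True)
    topics = []
    for _, topic in ranked[:k]:
        if topic not in topics:
            topics.append(topic)
    return topics


# Toy data: two very close events from topic-b crowd out the one from topic-a.
past_events = [
    ([0.99, 0.10], "topic-b"),
    ([0.98, 0.15], "topic-b"),
    ([0.90, 0.40], "topic-a"),
]
new_event = [1.0, 0.0]

print(candidate_topics(new_event, past_events, k=2))
print(candidate_topics(new_event, past_events, k=3))
```

With `k=2` only `topic-b` is ever considered, so if the new event really belongs to `topic-a`, whatever step makes the final decision never sees the right topic and may create a duplicate. Raising `k` surfaces `topic-a` at the cost of fetching and evaluating more events.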
Quick clustering using hcluster algorithm
We can use the hcluster algorithm to quickly cluster articles based on their semantic similarity. We can get a real-time visual representation of the clusters by adjusting the distance threshold. This can help us identify related articles and group them together in the timeline.
See the experiment in the notebook and the web demo for more details.
We observed that a cluster radius of around 44% gives a good view of the clusters without too much fragmentation or merging.
This method of forming clusters is much faster and more cost-effective than searching past events and matching each new event to a relevant topic.
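As a rough illustration of threshold-based clustering, the sketch below assumes that hcluster refers to agglomerative hierarchical clustering and uses single linkage over cosine distance. Neither the linkage nor the metric is stated in the experiment, so both are assumptions. Cutting a single-linkage hierarchy at a distance threshold gives the same groups as linking every pair of items closer than that threshold, which is what the union-find loop does.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)


def cluster(embeddings, threshold):
    """Group item indices whose chain of pairwise distances stays within threshold."""
    parent = list(range(len(embeddings)))

    def find(i):
        # Walk up to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_distance(embeddings[i], embeddings[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
for threshold in (0.001, 0.05, 0.8):
    print(threshold, cluster(vectors, threshold))
```

Sweeping the threshold shows the behaviour described above: a very small value fragments everything into singletons, a moderate value recovers the two natural groups, and a large value merges them into one. The pairwise loop is quadratic in the number of items, so for larger collections a library implementation such as SciPy's `linkage` and `fcluster` would be a more practical choice.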