Adventures in Sourcing the Global Database of Events, Language and Tone (GDELT) Data
How discursus.io revamped its approach to sourcing and processing GDELT data for the monitoring of protest movements.
The GDELT Project is at the heart of discursus.io. It’s a massive data project that scrapes world media to surface events as they are being reported. The kind of data that allows us to monitor protest movements in close to real-time.
I love GDELT! It’s the most fascinating, deep and rich data project out there. I’ve been using it for many years now, and it remains the most thrilling source of social data I know of.
There are quite a few access points to its data (online tools, raw CSV files, and BigQuery datasets), and many datasets to choose from. So when starting to work with GDELT, you’re confronted with two important questions:
- What are the specific events I’m interested in?
- From where do I want to consume those data points?
Before talking about our approach, let’s have a look at GDELT’s data production process.
GDELT Data Production Process
Let’s take this Québec article that describes a march that occurred in Montréal on May 22, 2023, to celebrate La journée nationale des Patriotes (while the rest of Canada celebrated Victoria Day).
Below is a very naive, high-level abstraction of what that article’s journey through GDELT would look like. It’s only partially informed by what’s publicly available: even though GDELT data is free and open to use, the actual process of generating that data is private and proprietary. But we can make assumptions and come up with the following diagram.
Our article is first scraped, and although we cannot see the actual output of that data collection, we can at least consult the metadata that was captured in that initial processing step.
(Be careful when querying GDELT data from BigQuery as those are massive tables. You could significantly hike up your costs. It’s important to consider the table partitioning and incorporate it into your ‘where’ clause. I’m not showcasing this in the raw queries below, but the sketch after the first query shows the general pattern.)
select
date,
title,
lang,
metatags
from `gdelt-bq.gdeltv2.gemg`
where url = 'https://www.ledevoir.com/societe/791505/une-marche-pour-la-journee-des-patriotes'
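For reference, here’s a minimal sketch of how such a query could be run from Python with the google-cloud-bigquery client, with an extra date predicate to limit the scan. Treat the date filter as an assumption to adapt: the partition column and its type vary by table, and GDELT also publishes partitioned variants of its main tables.
# A minimal, cost-conscious version of the query above, run through the
# google-cloud-bigquery Python client. The date predicate is an assumption
# about how the table is partitioned/clustered; adapt it to the table you
# are actually querying.
from google.cloud import bigquery

client = bigquery.Client()

ARTICLE_URL = (
    "https://www.ledevoir.com/societe/791505/"
    "une-marche-pour-la-journee-des-patriotes"
)

query = f"""
    select date, title, lang, metatags
    from `gdelt-bq.gdeltv2.gemg`
    where date >= timestamp('2023-05-22')
      and url = '{ARTICLE_URL}'
"""

for row in client.query(query).result():
    print(dict(row))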
Translation and NLP processing of the article follows. The Mentions table provides information on where events and actors were identified in the article.
select
GLOBALEVENTID,
EventTimeDate,
SentenceID,
Actor1CharOffset,
Actor2CharOffset,
Confidence,
MentionDocTone,
MentionDocTranslationInfo
from `gdelt-bq.gdeltv2.eventmentions`
where MentionIdentifier = 'https://www.ledevoir.com/societe/791505/une-marche-pour-la-journee-des-patriotes';
That NLP processing allows the construction of a rich representation of events, as we see from this very small subset of the available fields in the Events table.
select
GLOBALEVENTID,
DATEADDED,
EventCode,
ActionGeo_FullName,
Actor1Name,
Actor2Name,
GoldsteinScale,
NumArticles,
AvgTone
from `gdelt-bq.gdeltv2.events`
where GLOBALEVENTID = 1104069925
Finally, we have another layer of processing which extracts additional information from articles, such as covered themes, identified persons and organizations, locations, etc.
select
GKGRECORDID,
DATE,
Themes,
Locations,
Persons,
Organizations
from `gdelt-bq.gdeltv2.gkg`
where DocumentIdentifier = 'https://www.ledevoir.com/societe/791505/une-marche-pour-la-journee-des-patriotes';
Now that we better understand that processing flow and some of the data available, let’s go back to how we use that data at discursus.io.
Our Old Approach To Consuming GDELT Data
Our old approach was to ingest raw CSV files of Events and Mentions. We read a txt file from a URL, then regex and unzip our way to CSV files that gave us access to an enormous amount of data.
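For illustration, here’s a rough sketch of that pattern (not our exact implementation): read GDELT’s lastupdate.txt file, regex out the zipped CSV URLs for the latest 15-minute increment, then download and unzip them.
# A rough sketch of the old ingestion pattern: read GDELT's lastupdate.txt,
# extract the zipped CSV URLs for the latest 15-minute increment, then
# download and unzip them in memory.
import io
import re
import zipfile

import requests

LASTUPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"

lastupdate = requests.get(LASTUPDATE_URL, timeout=30).text

# Each line of lastupdate.txt ends with the URL of a zipped CSV
# (export, mentions and gkg files for the latest 15-minute window).
zip_urls = re.findall(r"https?://\S+\.(?:CSV|csv)\.zip", lastupdate)

for zip_url in zip_urls:
    archive = zipfile.ZipFile(io.BytesIO(requests.get(zip_url, timeout=60).content))
    for csv_name in archive.namelist():
        print(f"Extracted {csv_name} from {zip_url}")
        csv_bytes = archive.read(csv_name)  # raw CSV, ready for downstream loading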
Below is a representation of our data assets production graph. GDELT’s events were sourced every 15 minutes and were then used downstream to source only the relevant GDELT mentions (articles), which were then further processed.
The volume of data GDELT publishes every 15 minutes can be massive, so filtering is very important!
Let’s say you’re interested in protest events in North America. You need to rely on the processing done by GDELT to apply event type and location filters on the Events feed. But those processing rules are proprietary to GDELT and hidden from us; it might be, for example, that our definition of a protest differs from theirs.
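To make that concrete, the filter we’re talking about looks roughly like the sketch below. Note that it leans entirely on fields GDELT itself computed (EventRootCode, where CAMEO root code 14 covers protests, and ActionGeo_CountryCode, which holds FIPS country codes), which is exactly the dependency described above; the DataFrame setup is an assumption about how the export was loaded.
# A simplified illustration of filtering the raw events export down to protest
# events in North America, assuming the CSV has been loaded into a pandas
# DataFrame with GDELT's documented column names.
import pandas as pd

NORTH_AMERICA = {"US", "CA", "MX"}  # FIPS country codes

def filter_protest_events(events: pd.DataFrame) -> pd.DataFrame:
    is_protest = events["EventRootCode"].astype(str) == "14"
    in_north_america = events["ActionGeo_CountryCode"].isin(NORTH_AMERICA)
    return events[is_protest & in_north_america]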
So we end up with opportunities for both false positives and missing data points. That means data issues can creep in:
- Events that are not protest events.
- Events that occurred outside of North America.
- Events that don’t surface in the data at all.
I’m definitely not implying that GDELT is wrong, simply that GDELT’s processing rules might differ from how you would codify events, locations and actors.
But fortunately, GDELT is open enough to provide a lot of the raw data for us to apply our own processing rules, which is what we’ve started to do at discursus.io.
Our New Approach To Consuming GDELT Data
Before we get into how our approach changed, here is, for reference, an Entity Relationship Diagram of the data schema we expose at discursus.io.
To be able to serve that semantic layer, we are refactoring how we consume and process the GDELT data:
- First off, we now only consume a single feed: the GKG feed. The advantage of that feed is that it carries all the attributes we require to enhance and process the data: for each article, we have its themes, locations, people, organizations, etc.
- We now use BigQuery to access the data, as we can query it more effectively (again, being very mindful of table partitioning), as well as backfill historical data.
- We do our own scraping of the articles to extract their content and metadata.
- We use that scraped content to generate a summary of each article using OpenAI’s LLM (Large Language Model).
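For the last two points, here’s a simplified sketch of what the scraping and summarization steps could look like. The helper name, model and prompt are placeholders rather than our exact implementation, and the OpenAI client interface depends on the SDK version you use.
# A simplified sketch of the scraping + summarization steps: fetch the article,
# extract metadata and text with BeautifulSoup, then ask an OpenAI model for a
# short summary. Helper name, model and prompt are placeholders.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

def enhance_and_summarize(article_url: str) -> dict:
    html = requests.get(article_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Collect meta tags (og:title, description, etc.) as the article's metadata.
    metadata = {
        meta.get("property") or meta.get("name"): meta.get("content")
        for meta in soup.find_all("meta")
        if meta.get("content")
    }
    # Crude content extraction: concatenate the page's paragraphs.
    content = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": f"Summarize this article in 3 sentences:\n\n{content[:8000]}"}
        ],
    )

    return {
        "url": article_url,
        "metadata": metadata,
        "content": content,
        "summary": response.choices[0].message.content,
    }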
With that raw data and all those enhanced attributes in place, we now have better control over the processing rules, and can generate data that more accurately represents protest movements in North America.
Our new approach is reflected in this high-level diagram of the discursus.io data process.
Implementing Our New Approach
The discursus.io data platform is supported by Dagster, an orchestrator for the production of data assets. Below is a presentation of the materialization flow for the subset of data assets that relates to how we ingest, enhance and process GDELT data (a simplified code sketch of this asset graph follows the list below).
What is happening here:
- We are sourcing the GDELT GKG feed first. The partitions are the 15-minute increments that are inherited from how GDELT processes its data and makes it available.
- The enhancement step is where we scrape each article’s content and metadata.
- The summaries are generated using an OpenAI LLM on top of that scraped content.
- And finally, we see only a small subset of the dbt models that actually process that raw and enhanced data into the semantic representation of protest movements we discussed above.
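To ground that a little, here’s a highly simplified sketch of such an asset graph using Dagster’s @asset API. The asset names and helper stubs are placeholders (the stubs stand in for the BigQuery, scraping and summarization logic sketched earlier), and the real assets are partitioned into those 15-minute increments.
# A highly simplified sketch of the asset graph with Dagster's @asset API.
# Asset and helper names are placeholders; the stubs stand in for the real logic.
from dagster import Definitions, asset

def fetch_gkg_increment() -> list[dict]:
    # Stand-in for the BigQuery pull of one 15-minute GKG increment.
    return []

def scrape_article(record: dict) -> dict:
    # Stand-in for scraping the article's content and metadata.
    return record

def summarize_article(record: dict) -> dict:
    # Stand-in for the OpenAI summarization step.
    return record

@asset
def gdelt_gkg_articles() -> list[dict]:
    # Raw GKG records for the latest increment.
    return fetch_gkg_increment()

@asset
def gdelt_articles_enhanced(gdelt_gkg_articles: list[dict]) -> list[dict]:
    # Scraped content and metadata for each article in the increment.
    return [scrape_article(record) for record in gdelt_gkg_articles]

@asset
def gdelt_article_summaries(gdelt_articles_enhanced: list[dict]) -> list[dict]:
    # LLM summaries generated on top of the scraped content.
    return [summarize_article(record) for record in gdelt_articles_enhanced]

# Downstream dbt models (also orchestrated by Dagster) turn these raw and
# enhanced assets into the semantic representation of protest movements.
defs = Definitions(assets=[gdelt_gkg_articles, gdelt_articles_enhanced, gdelt_article_summaries])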
The Results
Now that we’ve gone through all that data-processing fun, what is the output? It’s still too early to review the overall data quality, which we’ll do at a later date, but I can at least compare the output for a single article.
Let’s take this article, which covers an upcoming protest by environmentalists against a provision in the debt-ceiling agreement that gives the green light to the construction of a pipeline.
The raw data from GDELT includes all the attributes we covered in our last post. Below is a subset of those attributes we ingested.
We then scrape the metadata and content.
And finally, we can generate a summary of the article with the help of an LLM.
Now with all those attributes, we can recreate the semantic representation of protest movements we discussed above, but this time around using our own processing rules.
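As a rough illustration, our own rules can be as simple as the sketch below. In practice the equivalent logic lives in our dbt models as SQL, and the field handling here is an assumption about the record shape.
# A rough illustration of applying our own processing rules on top of the
# enhanced GKG records. Field names and parsing are assumptions; the real
# rules live in dbt models.
NORTH_AMERICA_FIPS = {"US", "CA", "MX"}

def is_north_american_protest(record: dict) -> bool:
    themes = record.get("Themes") or ""
    locations = record.get("Locations") or ""
    has_protest_theme = "PROTEST" in themes
    # GKG locations are '#'-delimited blocks; this is a crude country check.
    in_north_america = any(f"#{code}#" in locations for code in NORTH_AMERICA_FIPS)
    return has_protest_theme and in_north_america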
And we can see the result in our publicly available monitoring dashboard.
Conclusion
At this point, we’ve taken a significant step towards having more control over the processing of GDELT data, which should give us better levers to increase data quality.
The next steps are to make various improvements to those processing rules (for example, improving event locations and excluding articles that are editorials, opinion pieces, etc.). We will then be in a better position to review the overall quality of our data. But that’s for a future post.