Consuming the GDELT Project’s Events Database

Olivier Dupuis
May 31, 2019


I recently published an Introduction to the discursus Project, an open-source OSINT project on protest events. That project mines GDELT data and includes an up-to-date miner and ETL scripts for consuming, transforming, and exposing it.

You can visit the Github repo here 👉 https://github.com/discursus-io/discursus_core

Image from the GDELT Project (https://www.gdeltproject.org/)

This post is a bit different: instead of talking about product analytics, I decided to share a few scripts that allow me to mine the GDELT Project’s data.

The GitHub repository linked above contains a few scripts that will let you easily consume new GDELT events as they are published every 15 minutes.

A little bit on GDELT

Don’t know what GDELT is? It’s one of the most ambitious projects I’ve encountered in the past few years: it scrapes news articles from around the world, stores them in a massive database and provides an impressive number of descriptive fields to analyze that stream of data.

And that’s just the start, as they are constantly improving the database. For example, they now let you scan the day’s top trending topics compiled from national television stations’ closed captioning. Amazing!

It’s massive, and Google makes it available on their BigQuery platform.

The technical requirements

So I’m working on a project with a friend that aims to monitor worldwide protests in real time. And of course, one of the main sources of data for our project is the GDELT Event database.

We could use Google BigQuery, which provides a really cheap (or maybe even free, I can’t remember) way to query that massive database. But that doesn’t really fit the requirements of the discursus.io project: we want to do live monitoring, so we’d rather plug directly into the stream.

And GDELT, in all its awesomeness, easily lets you do that.

So our requirement is to use the GDELT 2 dataset and download all new events from its “stream”, which is really a CSV file published every 15 minutes.
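
To make that concrete, here’s a minimal Python sketch of the polling step (the repository itself does this with the bash miner script, so treat this as an illustration rather than the actual code). It reads GDELT 2’s lastupdate.txt manifest, which lists the latest 15-minute files, and downloads the newest event export:

    import io
    import urllib.request
    import zipfile

    # GDELT 2's "last update" manifest lists the latest 15-minute files,
    # one per line, as: <size> <md5> <url>
    LASTUPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"

    def download_latest_events():
        manifest = urllib.request.urlopen(LASTUPDATE_URL).read().decode("utf-8")
        export_url = next(
            line.split()[-1]
            for line in manifest.splitlines()
            if line.strip().endswith(".export.CSV.zip")
        )

        # Fetch the zipped CSV of the latest events and extract it in memory
        archive = urllib.request.urlopen(export_url).read()
        with zipfile.ZipFile(io.BytesIO(archive)) as zf:
            return zf.read(zf.namelist()[0]).decode("utf-8", errors="replace")

    if __name__ == "__main__":
        csv_data = download_latest_events()
        print(csv_data.splitlines()[0])  # peek at the first event row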

Once we’ve downloaded those new events, we want to keep only the ones we care about and then commit them to our database. It’s worth mentioning that each event’s encoding is very detailed; just look at the GDELT Event codebook for a description of all fields.
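
As an illustration of that filtering step, here’s a rough Python sketch. The protest filter (CAMEO root code 14), the column indices and the gdelt_events table with its columns are assumptions for the example; check the codebook and your own schema before relying on them:

    import csv
    import io

    import mysql.connector  # from the mysql-connector-python package

    # CAMEO root code 14 covers protest events. In the GDELT 2 event export,
    # EventRootCode should be column 28 and SOURCEURL the last column (60);
    # double-check these indices against the codebook.
    PROTEST_ROOT_CODE = "14"

    def commit_protest_events(csv_data, connection):
        # The export is tab-delimited, one event per line
        rows = []
        for record in csv.reader(io.StringIO(csv_data), delimiter="\t"):
            if record[28] == PROTEST_ROOT_CODE:
                # Keep a few illustrative fields: id, date, country, source URL
                rows.append((record[0], record[1], record[53], record[60]))

        cursor = connection.cursor()
        cursor.executemany(
            "INSERT IGNORE INTO gdelt_events "
            "(global_event_id, event_date, country_code, source_url) "
            "VALUES (%s, %s, %s, %s)",
            rows,
        )
        connection.commit()

    # Usage, assuming csv_data from the download step above:
    # conn = mysql.connector.connect(host="localhost", user="gdelt",
    #                                password="...", database="gdelt")
    # commit_protest_events(csv_data, conn)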

The scripts

So, with all that being said, we now have 3 scripts in our GDELT Mining repository.

  • The gdelt.sql script is only provided to give you an idea of how I structured my own MySQL table to store the events I’m interested in.
  • gdelt_miner.sh is a bash script that simply downloads the latest CSV file from the GDELT 2 Event database.
  • gdelt_transfer.py is a Python script that reads the latest CSV file, filters the events, and commits the ones I’m interested in to my database.

For the miner and transfer scripts, I’ve added cron jobs that trigger them every 15 minutes (of course, the transfer one is triggered 2–3 minutes after the miner one).
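
For illustration, the crontab entries could look something like this (the paths are placeholders, not the repository’s actual layout):

    # Pull the latest GDELT export every 15 minutes
    */15 * * * *    /opt/gdelt/gdelt_miner.sh
    # Transfer the filtered events 3 minutes later
    3-59/15 * * * * /usr/bin/python /opt/gdelt/gdelt_transfer.py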

Happy GDELTing

Easy! And that’s part of what makes the GDELT Project so great: how easy it is to use for your own requirements.

If you have any suggestions on how to make those scripts better, please send them my way.
