Today’s guest blog comes from Kalev H. Leetaru, a fellow and adjunct faculty in the Edmund A. Walsh School of Foreign Service at Georgetown University in Washington DC. His award-winning work centers on the application of high-performance computing and “big data” to grand challenge problems.

The entire quarter-billion-record GDELT Event Database is now available as a public dataset in Google BigQuery.

BigQuery is Google’s powerful cloud-based analytical database service, designed for the largest datasets on the planet. It allows users to run fast, SQL-like queries against multi-terabyte datasets in seconds. Scalable and easy to use, BigQuery gives you real-time insights about your data. With GDELT available in BigQuery, you can now get real-time insights about global human society and the planet itself!

You can take it for a spin here. (If it's your first time, you'll need to sign up and create a Google project, but no credit card or commitment is needed.)

The GDELT Project pushes the boundaries of “big data,” weighing in at over a quarter-billion rows with 59 fields for each record, spanning the geography of the entire planet, and covering a time horizon of more than 35 years. The GDELT Project is the largest open-access database on human society in existence. Its archives contain nearly 400M latitude/longitude geographic coordinates spanning over 12,900 days, making it one of the largest open-access spatio-temporal datasets as well.

From the very beginning, one of the greatest challenges in working with GDELT has been how to interact with a dataset of this magnitude. Few traditional relational database servers offer real-time querying or analytics on data of this complexity, and even simple queries would normally require enormous attention to data access patterns and intricate multi-column indexing to make them possible. Traditional database servers require the creation of indexes over the most-accessed columns to speed queries, meaning one has to anticipate a priori how users are going to interact with a dataset.

One of the things we’ve learned from working with GDELT users is just how differently each of you needs to query and analyze GDELT. The sheer variety of access patterns and the number of permutations of fields that are collected together into queries makes the traditional model of creating a small set of indexes impossible. One of the most exciting aspects of having GDELT available in BigQuery is that it doesn’t have the concept of creating explicit indexes over specific columns – instead you can bring together any ad-hoc combination of columns and query complexity and it still returns in just a few seconds. This means that no matter how you access GDELT, what columns you look across, what kinds of operators you use, or the complexity of your query, you will still see results in near real time.
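To make that concrete, here is a sketch of the kind of ad-hoc, multi-column query that would normally demand careful indexing but that BigQuery handles as-is. The specific thresholds are arbitrary illustrations, and `[]` stands in for the GDELT events table reference, as in the queries later in this post:

```sql
-- Illustrative ad-hoc filter across several unrelated columns:
-- heavily-covered, strongly negative-tone events from 2013.
SELECT SQLDATE, Actor1Name, Actor2Name, EventCode, NumArticles, AvgTone
FROM []
WHERE Year = 2013
  AND NumArticles > 100
  AND AvgTone < -8
ORDER BY NumArticles DESC
LIMIT 20
```

No index on NumArticles or AvgTone has to exist ahead of time for a query like this to come back in seconds.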

For us, the most groundbreaking part of having GDELT in BigQuery is that it not only opens the door to fast, complex querying and extraction of data, but also, for the first time, allows real-world analyses to be run entirely in the database. Imagine computing the most significant conflict interaction in the world by month over the past 35 years, or performing cross-tabbed correlation over different classes of relationships between a set of countries. Such queries can be run entirely inside of BigQuery and return in just a handful of seconds. This enables you to test “what if” hypotheses on global-scale trends in near real time.
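As a rough sketch of the first example, a query along the following lines ranks country pairs by their volume of material-conflict events (GDELT’s QuadClass 4) and keeps the top pair for each month; again, `[]` stands in for the GDELT events table reference, as in the queries later in this post:

```sql
-- Top material-conflict country pairing per month (sketch).
SELECT MonthYear, Actor1CountryCode, Actor2CountryCode, Count FROM (
  SELECT MonthYear, Actor1CountryCode, Actor2CountryCode, COUNT(1) Count,
         RANK() OVER(PARTITION BY MonthYear ORDER BY Count DESC) rank
  FROM []
  WHERE QuadClass = 4
    AND Actor1CountryCode != '' AND Actor2CountryCode != ''
    AND Actor1CountryCode != Actor2CountryCode
  GROUP BY MonthYear, Actor1CountryCode, Actor2CountryCode
)
WHERE rank = 1
ORDER BY MonthYear
```

The aggregation, windowed ranking, and filtering all happen inside BigQuery itself – only the final monthly winners come back to you.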

On the technical side, BigQuery is completely turnkey: you just hand it your data and start querying that data – that’s all there is to it. While you could spin up a whole cluster of virtual machines somewhere in the cloud to run your own distributed clustered database service, you would end up spending a good deal of your time being a systems administrator to keep the cluster working, and it wouldn’t support BigQuery’s unique capabilities. BigQuery eliminates all of this, so you can focus on using your data rather than spending your days running servers.

We automatically update the public dataset copy of GDELT in BigQuery every morning by 5AM ET, so you don’t even have to worry about updates – the BigQuery copy always has the latest global events. In a few weeks when GDELT unveils its move from daily updates to updating every 15 minutes, we’ll be taking advantage of BigQuery’s new stream updating capability to ensure the data reflects the state of the world moment-by-moment.

Check out the GDELT blog for future posts where we will showcase how to harness some of BigQuery’s power to perform some pretty incredible analyses, all of them running entirely in the database system itself. For example, we’re particularly excited about the ability to use features like BigQuery’s new Pearson correlation support to search for patterns across the entire quarter-billion-record dataset in just seconds. And we can’t wait to see what you do with it.
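As a hedged sketch of what that might look like, legacy SQL’s CORR() aggregate can compare, say, the monthly protest counts (CAMEO root code 14) of two countries. The country codes here are just examples, and `[]` again stands in for the GDELT events table reference, as in the queries below:

```sql
-- Pearson correlation between two countries' monthly protest counts (sketch).
SELECT CORR(a.c, b.c) protest_correlation
FROM (
  SELECT MonthYear, COUNT(1) c FROM []
  WHERE ActionGeo_CountryCode = 'UP' AND EventRootCode = '14'
  GROUP BY MonthYear
) a
JOIN (
  SELECT MonthYear, COUNT(1) c FROM []
  WHERE ActionGeo_CountryCode = 'RS' AND EventRootCode = '14'
  GROUP BY MonthYear
) b
ON a.MonthYear = b.MonthYear
```

A single number comes back summarizing how closely the two countries’ protest timelines track each other – with no data ever leaving the database.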

Give it a try here. (If it's your first time, you'll need to sign up and create a Google project, but no credit card or commitment is needed.)

Then copy and paste the query below into the query box and click the bright red “Run Query” button:

SELECT Year, Actor1Name, Actor2Name, Count FROM (
SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count, RANK() OVER(PARTITION BY Year ORDER BY Count DESC) rank
FROM
(SELECT Actor1Name, Actor2Name, Year FROM [] WHERE Actor1Name < Actor2Name AND Actor1CountryCode != '' AND Actor2CountryCode != '' AND Actor1CountryCode != Actor2CountryCode),
(SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [] WHERE Actor1Name > Actor2Name AND Actor1CountryCode != '' AND Actor2CountryCode != '' AND Actor1CountryCode != Actor2CountryCode),
WHERE Actor1Name IS NOT NULL
AND Actor2Name IS NOT NULL
GROUP EACH BY 1, 2, 3
HAVING Count > 100
)
WHERE rank=1
ORDER BY Year

You just processed more than 250 million records detailing worldwide events from the last 35 years and discovered the top defining relationship for each year – all of this in around 6 seconds and for free (BigQuery gives its users a free monthly querying quota). Even more excitingly, you did all of this entirely in BigQuery, showcasing how you can run real analyses entirely in the database itself!

Now try copying and pasting this query and click the “Run Query” button:

SELECT MonthYear MonthYear, INTEGER(norm*100000)/1000 Percent
FROM (
SELECT ActionGeo_CountryCode, EventRootCode, MonthYear, COUNT(1) AS c, RATIO_TO_REPORT(c) OVER(PARTITION BY MonthYear ORDER BY c DESC) norm FROM []
GROUP BY ActionGeo_CountryCode, EventRootCode, MonthYear
)
WHERE ActionGeo_CountryCode='UP' AND EventRootCode='14'
ORDER BY ActionGeo_CountryCode, EventRootCode, MonthYear;

Congratulations, you just scanned more than a quarter-billion records to compile every protest in Ukraine that GDELT found in the world’s news media, by month, from 1979 to the present. Even more powerfully, BigQuery has converted the raw count of protests per month into a normalized “intensity” measure that accounts for the fact that there is far more news media today in 2014 than there was in 1979 and that machine processing of news isn’t perfect.

You can click “Download as CSV” at the top right of the query results table, open the file in your favorite spreadsheet program, and graph it as a timeline. You can instantly see the “Revolutions of 1989,” the violent “Ukraine without Kuchma” protests of March 2001, the “Orange Revolution” of November 2004, and the Euromaidan protests running from November 2013 to the present.

Let me know if you come up with a particularly cool query you want to share with the world and we might just add it to the GDELT Blog. We can’t wait to see what you do with GDELT and BigQuery.