Fun with Quandl

Quandl collects numerical data from hundreds of sources around the world and makes them accessible for free via a uniform GUI, format and API. The API makes it possible to use the same small set of powerful idioms for selecting and summarizing data across different sources and domains. This dramatically lowers the barrier to casual exploration, as you do not need to trawl through several instances of bad web design, format conversions and one-off cleanup scripts to process data before starting to make sense of it.

Pigshell takes a similar tack by providing a uniform file abstraction for web resources. The pigshell CLI lets the user create ad-hoc tools by composing pipelines of simple commands. Since the Quandl API supports CORS, we can grab data in pigshell, process it and produce images like these:

India GDP

India's GDP quadrupled in the decade from 2002 to 2012. In spite of living and working in India during that entire period, with ready access to newspapers and the Internet, this fact completely escaped my notice. The high frequency noise of news drowned out even such a gross signal as India recording its best growth decade ever.

GDP

Looking a little further, we see that it wasn't just India. There was a rising tide led by China, which lifted many boats into 300%-400% growth per decade. Frankly, it looks pretty damn scary.

We can also plot country data as a choropleth map:

Life expectancy

Let's walk through a blow-by-blow account of how to acquire and process this data in pigshell.

First, we need to get GDP data from Quandl. This requires a URL as shown at the bottom of this page. Instead of cutting and pasting this huge URL, we compute it from a list of three-letter country codes.

sh -s /usr/share/misc/countries.sh

The countries variable now contains a list of country-codes like AUS, IND, etc.

auth_token='&auth_token=dsahFHUiewjjd'
url='http://www.quandl.com/api/v1/multisets.csv?columns='^$(sjoin , ODA.^$countries^_NGDPD.1)^'&trim_end=2012-12-31'^$"auth_token

auth_token must be replaced by your auth token from Quandl. You could simply skip the auth token (set it to '') and try an anonymous query - up to 50 anon queries a day are allowed.

The concatenation operator ^ has been used everywhere, though it is strictly needed only to resolve ambiguity. sjoin (string join) is a little function defined in /etc/profile which uses its first argument to join the remaining arguments.

Now to actually fetch the data.

cat $url > world-gdp.csv

If this fails, you might need an access token.

Convert the CSV into Javascript objects, rename fields of the form "ODA.IND_NGDPD - Value" into "IND" and collect the results into a variable:

cat world-gdp.csv | csv2js | rename -f $(sjoin , ODA.$countries^"_NGDPD - Value") -t $(sjoin , $countries) | read gdp

The variable gdp contains a list of objects, each corresponding to a row of the original CSV, with property names corresponding to (renamed) column names.

Charting the GDP progress of a single country through time is very easy:

echo $gdp | cut -f "Date,IND" | chart -t tseries

India GDP

cut selects the given fields from the input object and produces an object containing only those fields.

chart produces an SVG displayed in the terminal. Redirecting the output results in a PNG file. So:

echo $gdp | cut -f "Date,IND" | chart -t tseries >/downloads/india-gdp.png

will convert the chart into a PNG and save it in the browser's download folder. You could save it to any other mounted filesystem.

Charting multiple countries is equally simple:

echo $gdp | cut -f "Date,BRA,RUS,IND,CHN,ZAF" | chart -t tseries -o ytitle="GDP (billion USD)",title="BRICS GDP"

BRICS GDP

Now let's take a snapshot of 2012 and print the top 10 economies by GDP:

gdp12=$(echo $gdp | grep -f Date 2012 | reshape -c $(sjoin , $countries) -f country -v GDP)

This command reshapes the data from "wide" to "long" format. We expect a single object to emerge after grep, with one property per country. After the reshape, this single object is converted to a list of objects, one per country, with the country property containing the three-letter code and the value property containing the GDP value.

echo $gdp12 | sort -rnf GDP | head | printf "%(country)s %(GDP)s\n"

USA 16244.575
CHN 8229.381
JPN 5937.767
DEU 3427.853
FRA 2612.667
GBR 2484.445
BRA 2247.745
ITA 2014.382
RUS 2004.252
IND 1858.748

Finally, we could display a choropleth of GDP on a world map:

echo $gdp12 | iframe -o colors=Greens -g /usr/share/template/d3-worldmap1

GDP world map

iframe displays the given HTML page in a sandboxed iframe, and passes it data from standard input using postMessage. This enables easy reuse of publicly available visualization examples.

But yeeeugh, that looks awful! Lesson learned: choropleths are not good for variables which have a high range of values. The US sets a mean grade curve. A vertical bar chart is often much more useful in such cases:

echo $gdp12 | sort -rnf GDP | head -n 15 | chart -t vertbar -o key=country,field=GDP

GDP world map

To see the above steps in action, a pigshell widget has been embedded at the end of this post. Note that this will work on modern versions of Chrome, Firefox, Safari. Mobile browsers, IE and others will not work to varying degrees.

You can start your own quandling right on this page, or start a new shell at http://pigshell.com.

Pigshell does not intend to replace existing tools in the data scientist's arsenal, any more than shell scripts compete with mainstream programming languages. But the shell is invaluable as a low-barrier-to-entry canvas for doodling, exploring and hacking. In addition:

  • Pigshell runs in the browser as a static web app, no setup required.
  • It is free software. You can run a local copy of the entire stack with no SaaS dependencies.
  • You can share gists of data visualizations, like GDP time series and Life expectancy map which encourage readers to try out their own variations, without having to start from scratch.

Future posts will expand on some of these aspects in more detail.

Happy quandling!

Leave a Reply

Your email address will not be published. Required fields are marked *