Quandl collects numerical data from hundreds of sources around the world and makes them accessible for free via a uniform GUI, format and API. The API makes it possible to use the same small set of powerful idioms for selecting and summarizing data across different sources and domains. This dramatically lowers the barrier to casual exploration, as you do not need to trawl through several instances of bad web design, format conversions and one-off cleanup scripts to process data before starting to make sense of it.
Pigshell takes a similar tack by providing a uniform file abstraction for web resources. The pigshell CLI lets the user create ad-hoc tools by composing pipelines of simple commands. Since the Quandl API supports CORS, we can grab data in pigshell, process it and produce images like these:
India's GDP quadrupled in the decade from 2002 to 2012. In spite of living and working in India during that entire period, with ready access to newspapers and the Internet, this fact completely escaped my notice. The high frequency noise of news drowned out even such a gross signal as India recording its best growth decade ever.
Looking a little further, we see that it wasn't just India. There was a rising tide led by China, which lifted many boats into 300%-400% growth per decade. Frankly, it looks pretty damn scary.
We can also plot country data as a choropleth map:
Let's walk through a blow-by-blow account of how to acquire and process this data in pigshell.
First, we need to get GDP data from Quandl. This requires a URL as shown at the bottom of this page. Instead of cutting and pasting this huge URL, we compute it from a list of three-letter country codes.
sh -s /usr/share/misc/countries.sh
countries variable now contains a list of country-codes like
auth_token='&auth_token=dsahFHUiewjjd' url='http://www.quandl.com/api/v1/multisets.csv?columns='^$(sjoin , ODA.^$countries^_NGDPD.1)^'&trim_end=2012-12-31'^$"auth_token
auth_token must be replaced by your auth token from Quandl. You could simply
skip the auth token (set it to '') and try an anonymous query - up to 50 anon
queries a day are allowed.
The concatenation operator
^ has been used everywhere, though it is strictly
needed only to resolve ambiguity.
sjoin (string join) is a little function
/etc/profile which uses its first argument to join the remaining
Now to actually fetch the data.
cat $url > world-gdp.csv
If this fails, you might need an access token.
cat world-gdp.csv | csv2js | rename -f $(sjoin , ODA.$countries^"_NGDPD - Value") -t $(sjoin , $countries) | read gdp
gdp contains a list of objects, each corresponding to a row of
the original CSV, with property names corresponding to (renamed) column names.
Charting the GDP progress of a single country through time is very easy:
echo $gdp | cut -f "Date,IND" | chart -t tseries
cut selects the given fields from the input object and produces an
object containing only those fields.
chart produces an SVG displayed in the terminal. Redirecting the
output results in a PNG file. So:
echo $gdp | cut -f "Date,IND" | chart -t tseries >/downloads/india-gdp.png
will convert the chart into a PNG and save it in the browser's download folder. You could save it to any other mounted filesystem.
Charting multiple countries is equally simple:
echo $gdp | cut -f "Date,BRA,RUS,IND,CHN,ZAF" | chart -t tseries -o ytitle="GDP (billion USD)",title="BRICS GDP"
Now let's take a snapshot of 2012 and print the top 10 economies by GDP:
gdp12=$(echo $gdp | grep -f Date 2012 | reshape -c $(sjoin , $countries) -f country -v GDP)
This command reshapes the data from "wide" to "long" format. We expect a single
object to emerge after
grep, with one property per country. After the
reshape, this single object is converted to a list of objects, one per
country, with the
country property containing the three-letter code and the
value property containing the GDP value.
echo $gdp12 | sort -rnf GDP | head | printf "%(country)s %(GDP)s\n" USA 16244.575 CHN 8229.381 JPN 5937.767 DEU 3427.853 FRA 2612.667 GBR 2484.445 BRA 2247.745 ITA 2014.382 RUS 2004.252 IND 1858.748
Finally, we could display a choropleth of GDP on a world map:
echo $gdp12 | iframe -o colors=Greens -g /usr/share/template/d3-worldmap1
iframe displays the given HTML page in a sandboxed iframe, and passes it
data from standard input using
postMessage. This enables easy reuse of
publicly available visualization examples.
But yeeeugh, that looks awful! Lesson learned: choropleths are not good for variables which have a high range of values. The US sets a mean grade curve. A vertical bar chart is often much more useful in such cases:
echo $gdp12 | sort -rnf GDP | head -n 15 | chart -t vertbar -o key=country,field=GDP
To see the above steps in action, a pigshell widget has been embedded at the end of this post. Note that this will work on modern versions of Chrome, Firefox, Safari. Mobile browsers, IE and others will not work to varying degrees.
You can start your own quandling right on this page, or start a new shell at http://pigshell.com.
Pigshell does not intend to replace existing tools in the data scientist's arsenal, any more than shell scripts compete with mainstream programming languages. But the shell is invaluable as a low-barrier-to-entry canvas for doodling, exploring and hacking. In addition:
- Pigshell runs in the browser as a static web app, no setup required.
- It is free software. You can run a local copy of the entire stack with no SaaS dependencies.
- You can share gists of data visualizations, like GDP time series and Life expectancy map which encourage readers to try out their own variations, without having to start from scratch.
Future posts will expand on some of these aspects in more detail.