Google spreadsheets are a commonly used way of making data publicly available – for example a lot of the data from the Guardian Datablog is exposed in this way. They can be viewed using a standard web browser – but suppose you wish to do something more with the data than just view it? For example, you may wish to analyse or visualise the data using R. One possibility is to save the spreadsheet as a csv file and read it into R. However, there is a quicker way: access the Google Spreadsheet data directly from the R command line. Not only is this quicker, but it means data access can be automated in R scripts and functions.
Firstly note that a typical Google spreadsheet has a URL looking something like:

https://spreadsheets.google.com/ccc?key=0AonYZs4MzlZbdHl1SThHUGc4aUlRUmJJQ2tYY1JJVkE&hl=en#gid=3
This points to a spreadsheet about waste recycling in the UK.
Usually (in my experience) if you replace the ccc?key part with tq?tqx=out:csv&key and change https to http, giving

http://spreadsheets.google.com/tq?tqx=out:csv&key=0AonYZs4MzlZbdHl1SThHUGc4aUlRUmJJQ2tYY1JJVkE&hl=en#gid=3

the URL corresponds to a csv file – and pointing your browser at it causes the file to be downloaded. However, a really useful feature of R is that many of its file-reading commands, for example the read.csv function, work with URLs as well as local files. Thus, you can enter
waste <- read.csv("http://spreadsheets.google.com/tq?tqx=out:csv&key=0AonYZs4MzlZbdHl1SThHUGc4aUlRUmJJQ2tYY1JJVkE&hl=en#gid=3", stringsAsFactors=FALSE)
which will read the data into a data frame called ‘waste’. Ideally you could then go ahead and analyse (or visualise) the data. In reality you need to do a bit more work.
One issue with spreadsheets is that there is some ambiguity as to whether they are best used as data storage or data presentation devices. In this situation we really just want to use them for data storage. However, spreadsheets are often compiled to present the data for browsing – and so cells contain information about such things as how the data was compiled, limitations on its use, regional summaries and so on, in addition to the ‘raw’ data. When you download the spreadsheet as a csv, you get all this stuff as well. This additional information is important, but it would be nice if it wasn’t scrambled in with the core data! The implication of this for R users is that data usually needs a little post-processing before it can be used – and the precise nature of this will vary from data set to data set. Here, I’ll give an example of what needs to be done for this particular data set.
Actually, this one isn’t too bad. The key problems are:
1) Very long column names
2) The last 4 rows contain almost all nulls – not actual data – because some explanatory text was inserted in the first cell of a row four rows after the data had finished.
3) Some of the numbers have been formatted to contain commas – e.g. 1,102 for 1102 – which has made them download as character strings. This occurs in columns 5 to 12.
Let’s address these in turn:
colnames(waste) <- c("GOR","JPP.ord","Authority","AuthType","MunWaste","HHWaste", "HHCompost", "NonHHCompost","MunCompost","HHReject","NonHHReject","MunReject")
This gives shorter column names. Note that this isn’t strictly necessary, and arguably the longer names are more informative; however, these are easier to use in code…
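If you would rather not type replacement names by hand, base R’s abbreviate() can shorten long headers automatically while keeping them unique. A minimal sketch – the long names below are invented stand-ins, not the spreadsheet’s actual headers:

```r
# Invented stand-ins for the long spreadsheet headers (not the real ones).
long_names <- c("Government.Office.Region",
                "Total.Municipal.Waste.Tonnes",
                "Household.Waste.Sent.For.Composting.Tonnes")

# abbreviate() shortens each name to at least `minlength` characters,
# extending where needed so the results stay unique.
short_names <- abbreviate(long_names, minlength = 8)
```

You would then assign the result with colnames(waste) <- short_names – though hand-picked names like the ones above are arguably more readable.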
waste <- waste[1:(nrow(waste)-4),]
This ‘chops off’ the final 4 rows.
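An equivalent, slightly more readable way to drop trailing rows is head() with a negative n, which for a data frame returns all but the last few rows. A quick sketch on a toy data frame standing in for the real one:

```r
# Toy data frame standing in for the downloaded spreadsheet.
df <- data.frame(x = 1:10)

# head() with a negative n drops that many rows from the end:
trimmed <- head(df, -4)
nrow(trimmed)   # 6
```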
for (i in 5:ncol(waste)) waste[,i] <- as.numeric(gsub(",","",waste[,i]))
This takes columns 5-12, removes all of the commas from the data in string form, and then converts it to numeric form.
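A quick demonstration of the same gsub()/as.numeric() step on made-up values:

```r
# Made-up strings in the comma-formatted style found in columns 5 to 12.
x <- c("1,102", "27", "3,500,000")

# Strip the commas, then convert the cleaned strings to numbers.
cleaned <- as.numeric(gsub(",", "", x))
cleaned   # 1102 27 3500000
```

If a column contains genuinely non-numeric entries, as.numeric() will turn them into NA with a warning – a useful check that the clean-up worked.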
That pretty much leaves the data frame ready for analysis. For example,
hist(100*waste$MunCompost/waste$MunWaste,col='red',main="Municipal Waste", xlab="Percent Recycled")
will draw a histogram of percent municipal waste recycled by local authority.
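The same percentage can also be kept as a vector for numeric summaries rather than a plot. A sketch with invented figures standing in for the real MunCompost and MunWaste columns:

```r
# Invented figures standing in for the spreadsheet's MunCompost / MunWaste columns.
MunCompost <- c(10, 25, 40)
MunWaste   <- c(100, 100, 160)

# Percent of municipal waste recycled, per authority.
pct_recycled <- 100 * MunCompost / MunWaste
pct_recycled   # 10 25 25
```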
You can now go on to do more useful analyses with the same data…