Data BC: what's working and what's broken
So Data BC has been live for a short while now and they've clearly been listening and making improvements to the service. So here's a quick update to my prior post:
First, what works.
They've added some new high-value datasets -- new fiscal data, payment card datasets, park trails, etc. Really useful stuff for app developers. Unlike the federal government's attempt at opendata, this one seems to have real potential to get a meaningful amount of BC's high-value data out to the public. So on that front they must be commended; they're doing a really good job. I met with Herb Lainchbury (of opendatabc) this week and he's going to be adding some data request tracking to the opendatabc website shortly, so we'll be able to get an even better picture of the request-to-fulfilment landscape going forward -- and not just for provincial data, but for federal and municipal requests too.
Next, Data BC has added some usability features: top datasets, most downloaded, editors' picks, etc. Excellent stuff.
The search engine seems markedly improved from days prior, and I seem to be able to find data I'm searching for.
But with the good comes the bad. Here's what doesn't work:
Data still doesn't appear to be locally hosted, or mirrored, and frequently download links take you off to some other system that requests personal information and presents other licensing terms. The @data_bc twitter folks stress that the good opendata license does apply to linked data, which seems to suggest that we can disregard the sub-licenses, but the process seems largely in transition. What opendata folks want: data on an endpoint. Links that actually download files directly.
Next, they've made a lot of good geospatial information available. Obstacles to fish passage is a huge one for environmentalists, and they've also released bathymetric mapping for lakes in BC. Sounds awesome, but the KML data files provided are mere links to WMS servers. I must stress, this is awesome data, but WMS is a -service- not -data-. Today's iOS and Android apps need vector data available to work offline. KML would be awesome for this, but a KML frontend on a WMS server means it won't work.
The spatial data also seems to have a lot of 'collection' data, that is, data files that are largely links to other data files, many of which actually end up at a WMS server. These files need to be split, dumped and hosted at data.gov.bc.ca as individual data sets on an endpoint -- then we can use them. If you're curious whether a KML file is data or a link to data, try searching for it on maps.google.ca. A well-formed KML file will be displayable on a Google map; a pointer to something else won't. (Just put the URL of the KML right in the search box on maps.google.ca and hit query.)
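The same data-or-pointer question can be asked of the file itself. Here's a rough sketch in Python, assuming that a KML file with Placemark elements carries real vector features, while one with only NetworkLink or GroundOverlay elements is just pointing at another file or a map service. The function name classify_kml and the three labels are mine, not anything Data BC provides, and real files may well be messier than this heuristic allows.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}Placemark' -> 'Placemark'."""
    return tag.rsplit("}", 1)[-1]


def classify_kml(kml_text):
    """Heuristic guess at whether a KML document holds data or only links.

    Returns 'data' if any Placemark is present, 'pointer' if the file only
    contains NetworkLink / GroundOverlay elements, and 'empty' otherwise.
    """
    root = ET.fromstring(kml_text)
    names = [local_name(el.tag) for el in root.iter()]
    if "Placemark" in names:
        return "data"
    if "NetworkLink" in names or "GroundOverlay" in names:
        return "pointer"
    return "empty"
```

Feed it the contents of a downloaded KML file. A 'pointer' result means the file needs to be dumped and re-hosted before an offline app can use it.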
We also need a change-notification service, RSS feeds, or a 'watch this dataset' feature so that we can tell when data we rely on has been updated. Again, opendatabc may have the answer with a crawler in the near future. However, any such crawler will depend on the data being on good endpoints.
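To show why good endpoints matter for this, here's a minimal sketch of the comparison step such a crawler might run. This is my own illustration, not opendatabc's design: it assumes the crawler has already downloaded each file's bytes from a stable URL, and the names fingerprint and changed_datasets are hypothetical.

```python
import hashlib


def fingerprint(content):
    """Return a SHA-256 hex digest of a downloaded file's bytes."""
    return hashlib.sha256(content).hexdigest()


def changed_datasets(previous, current):
    """List the URLs whose content is new or differs from the last crawl.

    previous: dict mapping URL -> fingerprint stored from the last crawl
    current:  dict mapping URL -> bytes downloaded in this crawl
    """
    changed = []
    for url, content in current.items():
        if previous.get(url) != fingerprint(content):
            changed.append(url)
    return sorted(changed)
```

If a download link bounces through a form or another system, there are no stable bytes to hash, and a check like this has nothing to work with.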
I've also noticed that some of the CSVs provided (which are on nice endpoints) have some minor encoding issues, like CSVs with multiple headers, or character sets that are neither ASCII nor UTF-8 (em-dash being the biggest pain), and the server serving the data does not present the correct character set. Take, for example, the excellent "Members of the Legislative Assembly Members' Compensation Paid in the Fiscal Year Ended March 31, 2011", which you can find here: http://www.data.gov.bc.ca/local/dbc/docs/fin/FYE11_Members_Compensation.csv It's a very cool dataset. The server says it's en_US, but the charset isn't readable by any of my standard tools: file, enca, and iconv can't make heads or tails of it without dropping those em-dash characters, it seems. Because it's encoded strangely, PostgreSQL can't import the data directly and I have to manually go through the file and clean it up. There's an empty column, and data that probably should be 0's is null. (For those non-programmer types, there's a huge distinction between null and 0 to an app developer.) So, good data, but it's clear this data is coming from a human-readable format and needs work to become machine-readable CSV. My cleaned-up datasets that run proactivedisclosure.ca are available at proactivedisclosure.ca/data.
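For the curious, the cleanup looks roughly like the sketch below. It rests on a guess of mine that the file is really Windows-1252 (where the em dash is the single byte 0x97); I haven't confirmed that is what this particular file uses. The function names are hypothetical. Empty cells are kept as None (null) rather than turned into 0, because deciding that a blank really means zero is a call for whoever owns the data.

```python
import csv
import io


def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in turn; fall back to latin-1, which never fails."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")


def clean_rows(text):
    """Parse CSV text, drop columns that are empty in every row,
    and turn empty cells into None so null stays distinct from 0."""
    rows = [r for r in csv.reader(io.StringIO(text, newline="")) if r]
    width = max((len(r) for r in rows), default=0)
    keep = [
        i for i in range(width)
        if any(i < len(r) and r[i].strip() for r in rows)
    ]
    return [
        [(r[i].strip() or None) if i < len(r) else None for i in keep]
        for r in rows
    ]
```

From there, writing the rows back out as UTF-8 should give PostgreSQL something it can import directly.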
That said, believe it or not, despite the growing pains this is the right way to release open data... this post might sound like a bit of a gripe but it is not. Really. See, by letting developers poke around in the data we can identify issues with machine parsability, with data-that-isn't-really-data, with change management, etc. Without this public review, well, the data just sits unused somewhere, and because everyone knows that the data is human-readable and not machine-readable, people just won't try to use it. This is the power of a publish-first-fix-second approach, and it will make for better government data in the long run. That means when an internal government user goes to actually do something with this data, it will be pristine and ready to parse, and that civil servant will be just that much more efficient. Rather than focusing on encoding issues, they'll only be focusing on the subject-matter issues of their project. Which in my books is a pretty big opendata win.
So congrats to Data BC on a great launch. Keep up the good work and keep releasing data sets. Also, since there should be lots of requests and bugs to track, I might suggest that a Bugzilla instance would be a good addition to the service.