Data BC: what's working and what's broken

Data BC has been live for a short while now, and they've clearly been listening and making improvements to the service. So here's a quick update to my prior post:

First, what works.

They've added some new high-value datasets: new fiscal data, payment card datasets, park trails, etc. Really useful stuff for app developers. Unlike the federal government's attempt at opendata, this one seems to have real potential to get a meaningful amount of BC's high-value data out to the public, so on that front they must be commended; they're doing a really good job. I met with Herb Lainchbury (of opendatabc) this week, and he's going to be adding some data request tracking to the opendatabc website shortly, so we'll be able to get an even better picture of the request-to-fulfilment landscape going forward, and not just for provincial data, but for federal and municipal requests too.

Next, Data BC has added some usability features: top datasets, most downloaded, editors' picks, etc. Excellent stuff.

The search engine seems markedly improved from the early days, and I can now actually find the data I'm searching for.

But with the good comes the bad. Here's what doesn't work:

Data still doesn't appear to be locally hosted or mirrored, and download links frequently take you off to some other system that requests personal information and presents other licensing terms. The @data_bc Twitter folks stress that the good opendata license does apply to linked data, which seems to suggest we can disregard the sub-licenses, but the process seems to be largely in transition. What opendata folks want: data on an endpoint, with links that actually download files directly.

Next, they've made a lot of good geospatial information available. Obstacles to fish passage is a huge one for environmentalists, and they've also released bathymetric mapping for lakes in BC. Sounds awesome, but the KML data files provided are mere links to WMS servers. I must stress: this is awesome data, but WMS is a -service-, not -data-. Today's iOS and Android apps need vector data available to work offline. KML would be awesome for this, but a KML frontend on a WMS server means it won't work.
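To make the service-vs-data distinction concrete: a WMS GetMap call only ever returns a rendered image of one bounding box at one size; the vector features never leave the server. A minimal Python sketch of what such a request looks like (the base URL and layer name below are made up for illustration, not real DataBC endpoints):

```python
from urllib.parse import urlencode

def wms_getmap_url(base_url, layer, bbox, width=512, height=512):
    """Build a WMS 1.1.1 GetMap request URL.

    The server answers with a PNG (or JPEG) picture of the data for this
    one bounding box; there is no geometry in the response, so there is
    nothing an offline app can cache and re-query.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(c) for c in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",  # an image comes back, never vector data
    }
    return base_url + "?" + urlencode(params)

# Hypothetical fish-passage layer covering roughly BC's extent.
url = wms_getmap_url("http://example.gov.bc.ca/wms", "fish_passage",
                     (-139.0, 48.0, -114.0, 60.0))
```

Every pan or zoom in a client means another round trip for another picture, which is exactly why WMS-backed KML can't serve the offline vector use case.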

The spatial data also seems to have a lot of 'collection' data, that is, data files that are largely links to other data files, many of which actually end up at a WMS server. These files need to be split, dumped, and hosted at data.gov.bc.ca as individual datasets on an endpoint; then we can use them. If you're curious whether a KML file is data or a link to data, try searching for it on maps.google.ca: a well-formed KML file will be displayable on a Google map, while a pointer to something else won't. (Just put the URL to the KML right in the search box on maps.google.ca and hit query.)
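A quick programmatic version of the same check: a Python sketch that looks for NetworkLink elements (a pointer to some other server) versus Placemark elements (actual, self-contained geometry). The function name and sample documents are mine, not anything Data BC publishes:

```python
import xml.etree.ElementTree as ET

KML_NS = "{http://www.opengis.net/kml/2.2}"

def kml_kind(kml_text):
    """Classify a KML document: real data, a link elsewhere, or empty."""
    root = ET.fromstring(kml_text)
    if any(True for _ in root.iter(KML_NS + "NetworkLink")):
        return "link"   # pointer to a WMS/other server, not portable data
    if any(True for _ in root.iter(KML_NS + "Placemark")):
        return "data"   # self-contained geometry a map client can render
    return "empty"

# Self-contained KML: a placemark with its own coordinates.
data_kml = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark><name>Lake</name>
    <Point><coordinates>-123.3,48.4</coordinates></Point>
  </Placemark>
</kml>"""

# 'Collection'-style KML: nothing but a link to another server.
link_kml = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <NetworkLink>
    <Link><href>http://example.com/ows?service=WMS</href></Link>
  </NetworkLink>
</kml>"""
```

Only files in the first category are useful as downloadable open data; the second category is the 'data-that-isn't-really-data' problem described above.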

We also need a change-notification service, RSS feeds, or a 'watch this dataset' feature so that we can tell when data we rely on has been updated. Again, opendatabc may have the answer with a crawler in the near future; however, any such crawler will depend on the data being on good endpoints.

I've also noticed that some of the CSVs provided (which are on nice endpoints) have minor encoding issues: CSVs with multiple headers, or non-ASCII/UTF character sets (the em-dash being the biggest pain) where the server serving the data does not present the correct character set. Take, for example, the excellent dataset "Members of the Legislative Assembly Members' Compensation Paid in the Fiscal Year Ended March 31, 2011", which you can find here: http://www.data.gov.bc.ca/local/dbc/docs/fin/FYE11_Members_Compensation.csv. It's a very cool dataset, but the server says it's en_US, and the charset isn't readable by any of my standard tools; file, enca, and iconv can't make heads or tails of it without dropping those em-dash characters, it seems. Because it's encoded strangely, PostgreSQL can't import the data directly, and I have to go through the file manually and clean it up. There's also an empty column, and data that probably should be 0s is null. (For the non-programmer types: there's a huge distinction between null and 0 to an app developer.) So, good data, but it's clear this data is coming from a human-readable format and needs work to become machine-readable CSV. My cleaned-up datasets that run proactivedisclosure.ca are available at proactivedisclosure.ca/data.
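For anyone hitting the same wall, here's roughly the cleanup I end up scripting: decode on a guessed 8-bit encoding, drop blank lines and repeated header blocks, and normalize the dashes to plain ASCII so PostgreSQL's COPY will take the file. This is a sketch; the cp1252 guess and all the names are mine, so adjust for what your tools report:

```python
import csv
import io

def clean_csv(raw_bytes, header_row):
    """Normalize a messy 8-bit CSV download to clean UTF-8 text.

    cp1252 is a guess that covers the em-dash bytes (0x96/0x97) seen in
    files like the MLA compensation CSV; swap in another codec if your
    detection tools suggest one.
    """
    text = raw_bytes.decode("cp1252", errors="replace")
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not any(cell.strip() for cell in row):
            continue                    # drop blank separator lines
        if row == header_row and rows:
            continue                    # drop repeated header blocks
        # Replace en/em dashes with plain ASCII hyphens.
        rows.append([cell.replace("\u2013", "-").replace("\u2014", "-")
                     for cell in row])
    out = io.StringIO()
    csv.writer(out).writerows(rows)
    return out.getvalue()
```

Deciding whether an empty cell means 0 or genuinely-unknown is the one step that can't be automated; that distinction has to come from the publisher.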

That said, believe it or not, despite the growing pains this is the right way to release open data. This post might sound like a bit of a gripe, but it is not. Really. By letting developers poke around in the data, we can identify issues with machine parsability, with data-that-isn't-really-data, with change management, and so on. Without this public review, the data just sits unused somewhere, and because everyone knows the data is human-readable and not machine-readable, people just won't try to use it. This is the power of a publish-first, fix-second approach, and it will make for better government data in the long run. When an internal government user goes to actually do something with this data, it will be pristine and ready to parse, and that civil servant will be that much more efficient: rather than wrestling with encoding issues, they'll be focusing only on the subject-matter issues of their project. Which, in my books, is a pretty big opendata win.

So congrats to Data BC on a great launch; keep up the good work and keep releasing datasets. Also, since there should be lots of requests and bugs to track, I might suggest a Bugzilla instance would be a good addition to the service.

Comments

Hi Kevin,
Thank you for your post.
 
With regard to your comment about data not being hosted locally and taking you off to another system: just to clarify, can we assume that you aren't particularly concerned what system or sub-domain the dataset is hosted on, as long as it is available as a direct download from the link provided?
 
We are looking at direct downloads, but we face a variety of different issues in distributing spatial data:

- Some datasets change frequently (hourly/daily).
- Some datasets are very large in terms of spatial extent or volume, or carry so much data (e.g. a high density of vertices) that they are not well suited to presentation via Google Earth or web mapping platforms.
- Many of our datasets do not have good versioning or time/date information at the feature level, so it is difficult to know exactly what changes within a dataset. The providing agency might have changed only one small feature, but because of our replication process we need to drop the entire dataset and replace the whole thing.
- Clients require datasets in different spatial projections. For example, someone might want the exact boundary as recorded in UTM projection, because positional accuracy issues crop up with data re-projection and they need the data to remain as close as possible to the original representation; others have little knowledge of how to re-project datasets and want data in the format and projection required by their platform.
 
Next, your comment about our KML being mere links to WMS servers...
 
We know that WMS is not suitable for many of the purposes people have for open data. We are looking at a number of options for improving our distribution of spatial datasets, and we welcome suggestions on how this may be achieved. Many datasets are small enough for us to make available as vector or placemark KML; given the considerations noted above, we are seeing what we can do about that right now.
 
We have found KML Regions not to be a valid solution. With enhancements made in the latest MapServer build (MS6), we are now able to stream dynamic KML placemarks directly from the database. We understand the importance of having raw vector data available for developers, and we are working hard on making this available in a sustainable manner.
 
(*Note that several datasets are only made available as WMS and are not downloadable. This is because the custodians have not made the vector data available for download, e.g. the Integrated Cadastral Fabric. We feel that having accessible images is better than not being able to supply anything.)
 
We truly appreciate your comments and recommendations. We are striving to improve our offerings and services, so please keep the comments coming.
 
All the best, The DataBC Team

Hey Data BC!

"Just to clarify, can we assume that you aren't particularly concerned what system or sub-domain the dataset is hosted on as long as it is available as a direct download from the link provided?"

Correct: any endpoint is fine, but the link needs to download the file directly.

To the questions about the spatial data: it's perfectly valid from an opendata perspective to release a periodically updated database export file as a 'dataset'. For example, if you have rapidly changing data that cannot be made available in realtime, the correct solution would simply be to set up a 'cron job' to re-export the data on a specific schedule. Publishing the export schedule would be really helpful, as developers can then set up matching import scripts to load the latest data.
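A sketch of what such a scheduled export script could look like, using sqlite3 as a stand-in for whatever database actually holds the data (the table and file names here are invented for illustration):

```python
import csv
import sqlite3

def export_snapshot(db_path, table, out_path):
    """Dump one table to a UTF-8 CSV snapshot.

    Run this from cron on the published schedule, e.g. a crontab line
    like `0 3 * * * python export.py` for a nightly 3 a.m. snapshot,
    so consumers know exactly when a fresh file appears.
    """
    conn = sqlite3.connect(db_path)
    cur = conn.execute(f"SELECT * FROM {table}")
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Header row taken straight from the table schema.
        writer.writerow([col[0] for col in cur.description])
        writer.writerows(cur)
    conn.close()
    return out_path
```

The point is less the code than the contract: a fixed schedule plus a stable output URL lets every consumer run a matching import job, with no change-notification machinery needed.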

As for realtime data, the government may want to look at technologies like GeoRSS. This won't be applicable for, say, fish passage data, but for other datasets that are largely placemarks, where timeliness is the priority, it can be highly useful.

Most opendata apps will not be read/write on government-published opendata (some will); the majority will likely be simple consumers with local value-add.

You may also want to look at how other agencies distribute vector open data. For example, in the US, NOAA has developed some excellent open data systems that allow their datasets to be downloaded. The snapshot is created when the dataset is downloaded, and it's up to the user how often to download it. Some of these datasets are very large. A good example is: http://www.nauticalcharts.noaa.gov/mcd/enc/index.htm

To the issue of projections, this is where it becomes crucial to follow the success of the BC Government Social Media Guidelines. Obviously, no agency can predict exactly what formats people want data in; each user is going to have slightly different wishes. One solution to this problem is to empower the data custodians to respond directly to data users' requests. Better yet, publish a default dataset and provide the custodian's contact info with something like 'for other projections, formats, and updates contact john.doe@gov.bc.ca'. Empowering individual civil servants to publish data can solve a lot of the 'what format do we publish' questions.

I would also argue that size isn't a terribly relevant measure anymore. For example, some of the datasets I work with in GIS are in the terabyte class. Where it becomes prohibitive to transfer via the internet, Data BC could follow in the City of Nanaimo's footsteps by bringing portable hard drives to hackathons. For example, I was able to pick up the full lidar and ortho sets for the City of Nanaimo at their last hackathon.

Thanks for responding, and making an awesome opendata experience for BC! The Vancouver hackathon should be awesome!

--

Kevin
