How To Improve Data.gov.uk

OK, so the Data.gov.uk stuff hit a raw nerve with me yesterday. In itself, it was pretty disappointing and the reporting also touched on one of my pet hates: Why do a lot of journalists not ask questions any more? They just seem to repost what they’re told. But, I’m past all that now, so here are a few thoughts on how a government data website could be better implemented.

The W3C put a bit of effort into a data browser extension for Firefox called Tabulator. It’s nice and deserves to be better. If I look at a page on Wikipedia with Firefox, say for example

http://en.wikipedia.org/wiki/Ystrad_Meurig

Then, if there is data I am interested in as a starting point, I can go to DBpedia, and at http://dbpedia.org/resource/Ystrad_Meurig my data browser extension will find a graph of data that it can understand.

This is really nice.

The Data.gov.uk site has a query page for looking at its semantic data, but no clues as to what I can ask for. DBpedia has a query page too, and from my Ystrad Meurig example I already know a lot about how DBpedia might be storing its data. I know that I can ask for latitudes and longitudes using the W3C geo positioning predicates (using Tabulator I can browse from the data page to those predicates and find out more about them – it’s all linked). I know DBpedia has a thing called ‘distance to Cardiff’, so I could query DBpedia for all things with a distance to Cardiff that have latitudes and longitudes, and then I could plot them on a map.
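Here is a rough sketch of what that query might look like, built in Python. It assumes DBpedia's public SPARQL endpoint at http://dbpedia.org/sparql and the W3C Basic Geo vocabulary for latitude and longitude. The predicate URI used for ‘distance to Cardiff’ is a guess for illustration only; browse the resource's data page to find the real one. The code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint.
ENDPOINT = "http://dbpedia.org/sparql"

# ASSUMPTION: hypothetical predicate URI for 'distance to Cardiff'.
# Check the DBpedia data page for the predicate actually in use.
DISTANCE_PREDICATE = "http://dbpedia.org/property/distanceToCardiff"


def build_query(distance_predicate: str = DISTANCE_PREDICATE, limit: int = 100) -> str:
    """Return a SPARQL query for things that have a distance, a lat and a long."""
    return f"""
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?place ?distance ?lat ?long WHERE {{
  ?place <{distance_predicate}> ?distance ;
         geo:lat ?lat ;
         geo:long ?long .
}}
LIMIT {limit}
""".strip()


def build_url(query: str) -> str:
    """Encode the query as a GET request URL for the endpoint."""
    return f"{ENDPOINT}?{urlencode({'query': query})}"


if __name__ == "__main__":
    print(build_url(build_query(limit=10)))
```

Each row that comes back would carry a place, its distance and a coordinate pair, which is all a map plot needs.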

This is properly linked data. This is what a government data site should be like.

I mentioned the Ordnance Survey ontology in yesterday’s rant. I like it, but it could be better. It has a solid structure for an administrative geography of the UK (not including Northern Ireland). However, the current version, version 2, is already out of date. A number of the unitary authorities were merged last year. This information is already reflected on the corresponding pages in DBpedia, along with nice tagging to link the new authorities to the data on the authorities they have replaced.

The OS version 2 ontology replaced version 1 in a fairly unhelpful way, but that was OK because they said they were still playing around with how to work with ontologies. Will the next version play well with the current one?

The DBpedia way of doing things means that not only do we end up with an up-to-date administrative structure, but we also maintain a history of that structure. That history can be useful if we have to consider people and not administrations. A person might get around to acknowledging a change in the administrative make-up of his area, eventually, but it won’t happen immediately. The online structures need to be able to link them from old knowledge to new concepts. Here is the advantage in all these new ideas: nothing leads to a dead end, and everything is given more meaning by its connections.

The other Tim


It’s The Data, Stupid.

Data.gov.uk


Oh, the excitement of it! Tim Berners-Lee is getting governments to listen to his cry, to set data free. Oh, the disappointment of our first look at the UK’s efforts! Where is the semantic data? Where are the ontologies to link concepts across datasets?

[For those of you not interested in the technical side of things skip over the next paragraph if you like - it's just technical ranting...]

This being a first pass, the semantic data and the ontologies may be in there, but if they are they’re well hidden. There is a SPARQL page but no indication of which values are searchable. All the data sets I looked for were available in CSV and XLS; hardly linked. Turning one of the CSV data sets into RDF using well-known namespaces took me about 30 minutes, so it shouldn’t be too hard for the site to get better, and quickly. Will it?
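To give a feel for how little work that conversion is, here is a minimal Python sketch that turns CSV rows into N-Triples using rdfs:label and the W3C Basic Geo lat/long predicates. It is not the conversion I actually did: the base URI, the column names (id, name, lat, long) and the sample row are all made up for illustration.

```python
import csv
import io

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"

# ASSUMPTION: hypothetical base URI for the things described by each row.
BASE = "http://example.org/place/"


def literal(value: str, datatype: str = "") -> str:
    """Format an N-Triples literal, escaping backslashes, quotes and newlines."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"^^<{datatype}>' if datatype else f'"{escaped}"'


def csv_to_ntriples(text: str) -> list:
    """Turn CSV text with id, name, lat and long columns into N-Triples lines."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{BASE}{row['id']}>"
        triples.append(f"{subject} <{RDFS_LABEL}> {literal(row['name'])} .")
        triples.append(f"{subject} <{GEO}lat> {literal(row['lat'], XSD_DECIMAL)} .")
        triples.append(f"{subject} <{GEO}long> {literal(row['long'], XSD_DECIMAL)} .")
    return triples


if __name__ == "__main__":
    # Illustrative row only; these are not real coordinates.
    sample = "id,name,lat,long\n1,Example Town,52.0,-4.0\n"
    print("\n".join(csv_to_ntriples(sample)))
```

Because the predicates are shared, well-known URIs rather than arbitrary column headings, any other data set that uses them can be joined to this one without guesswork.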

OK, that bit aside, the point is that the launch of this site seems to have been a deadline-achieving exercise rather than an announcement of anything actually being ready. That being the case, somebody needs to put up their hand and say “That’s rubbish”.

It’s especially hard for me because I’m as excited about the possibilities of the semantic web as my more illustrious namesake, but this ain’t those dreams, not even close. I was hoping to be able to complain loudly in the pub about my own useless government here in Ireland and how they weren’t doing anything to make their data available. “Look at how good the UK is”, I could have said. Oh well, another day.

So why am I so disappointed? What is this stuff anyway?
Well, the semantic web can be a way to connect… well, everything.

When I talk about a thing, say Sutton, I can link it to a description and eliminate any possibility that I am talking about a different Sutton. “Which Sutton?” you ask. “Sutton, Peterborough; Sutton, Craven etc…” and I answer “Exactly my point”. I can link it though:

http://www.ordnancesurvey.co.uk/ontology/AdministrativeGeography/v2.0/AdministrativeGeography.rdf#osr7000000000001643

[N.B. Don't go to the above URL unless you want to download a massive file!]

which is a link to the Ordnance Survey’s ontology for administrative places in the UK. If this excellent data (if a little limited, and a little out of date) is on the new Data.gov.uk site, then it’s hard to find. It would be a good way to tie geographic data sets together. An arbitrarily named field in an XLS file is, on the other hand, not a good way to link data together.

Nice idea Tim, now you need them to actually do the work.

The other Tim


Getting It Right

Following on from my ramble about addresses (see Return To Sender) I’m going to complain in general about some blocks of information on the web.

Say you work as a professional, and your professional body sends around a form for you to fill out. They tell you it will be published on the web, say to help people find their nearest qualified aerospace engineer, of which you are one of course. Now, if you fill in an email address on the form and it gets published you will get emails. It is surprising how many people think this will not happen.

Which brings me to what my real point is: a whole load of data that gets thrown out there on the web is published without considering either its intention or its purpose.

If you make your phone number public, is it decipherable? The bottom line is that I don’t want phone numbers like (0)-555/767.768/9 appearing anywhere. [OK, below is a long post-amble about what I think a well formed number might be. Read it to see how finikity (which may or may not be a word, making it all the more embiggening) I am.]

If you write your number like this, people will probably still be able to reach you, but you’re making it hard for them, and if you’re inputting the number on the web somewhere, you’ve probably caused a few people to scratch their heads and wonder if they should bother coding solutions for numbers of such opacity. (They should not.)

In case you’re getting angry at coders at this stage, the flip-side is the dumb entry field that insists you enter your number in some bizarre format chosen by a coder through ignorance/laziness. (Yes, Bord Gáis, a + is valid in a telephone number; in fact it is the single best starting character for a number to have, but enough about that… )

Where was all this going?

Oh yes, why is all this information going on the web in the first place?

A lot of official data now seems to find a place on the web in PDF format. The motivations behind this may be laudable; the original may be a Microsoft Word document and PDF is a more open format. However, PDFs tend to be designed for human rather than machine legibility and lack the semantic structures that are increasingly available (if underused) in HTML/XML formats.

If the official Canadian list of area dialling codes is in a PDF that lacks a readable table structure then someone (or worse still, more than one person) will create unofficial lists using wikis or web pages. These will only maintain some level of compatibility with each other and the original source, so any further sources of information that verify themselves against these pages become less and less reliable. This is all a long way from Tim Berners-Lee’s vision for a semantic web.

If governments are responsible for making data they hold accessible, and this is good in any number of ways, then they also have a responsibility for the level of accessibility. A locked filing cabinet in a basement that is accessible by asking the staff member responsible, available the first Monday of the month, between 9:00 and 9:30am, is a level of accessibility…

This is not a purely esoteric complaint. Somebody pays for that list of dialling codes, those emergency service numbers and a whole host of other data that has been decreed web worthy, but very little thought seems to go into how that information appears on the web or how its value could be maximized.

ISO, the International Organization for Standardization, publishes a lot of information on the web. Take their currency pages for example, which have nice tables of data. Well, their HTML is invalid; the doctype does not match the data contained. Their entity names are arbitrarily chosen forms of country names, even though they themselves define unique identifiers for countries, so the table must be interpreted for it to have value. I guess if ISO can’t get this right there is little hope for other folks.

Is there a point to all this? Well the ISO pages are better than a PDF, and PDFs are better than MS Word documents, and all are better than no information at all. There is at least a chance that something like linked data will be widely used and provide a more useful web, but, in the meantime, just a little thought about the quality of information and reasons for it to be placed on the web will make me a lot happier…

*Post-amble

What does a dash (-) in a phone number mean, and why was it put there? Probably it was just put in to separate the number into human digestible chunks: 555-767-768. Whitespace would have been as good, but no problems so far.

What does a slash (/, virgule) mean in a phone number? In a sentence it might have a dash type meaning but I would usually use it to designate a substitution or a logical OR type statement. I would avoid using a – for OR and also avoid using a / for a join.

The number 555-767-768/9 then means that either 555-767-768 or 555-767-769 will get you through.

The number 555/767/768-9 would be ambiguous to me: it could mean the same as above but I might try 555767768 wait then press 9, or some such.

The use of a full-stop as a visual separator seems dubious; it has other meanings both in number sequences and in textual contexts.
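The reading argued for above can be sketched in a few lines of Python. This is only one possible interpretation, not a standard: dashes, full-stops and whitespace are treated as visual separators, a single trailing slash means the last digits may be substituted, a leading + is kept, and anything else is rejected rather than guessed at. The function name expand_number is made up for this example.

```python
import re

# Characters treated purely as visual separators in this sketch.
SEPARATORS = re.compile(r"[-.\s]")


def expand_number(written: str) -> list:
    """Return every digit string a written phone number may denote.

    A trailing '/X' is read as 'or': the last len(X) digits may be
    replaced by X. More than one '/' is rejected as ambiguous.
    """
    parts = written.split("/")
    if len(parts) > 2:
        raise ValueError("more than one '/' is ambiguous")
    base = SEPARATORS.sub("", parts[0])
    prefix = "+" if base.startswith("+") else ""
    digits = base[len(prefix):]
    if not digits.isdigit():
        raise ValueError(f"unexpected characters in {written!r}")
    numbers = [prefix + digits]
    if len(parts) == 2:
        alt = SEPARATORS.sub("", parts[1])
        if not alt.isdigit() or len(alt) > len(digits):
            raise ValueError(f"cannot interpret alternative in {written!r}")
        # Substitute the alternative for the same number of trailing digits.
        numbers.append(prefix + digits[: len(digits) - len(alt)] + alt)
    return numbers


if __name__ == "__main__":
    print(expand_number("555-767-768/9"))
```

With this reading, 555-767-768/9 expands to two numbers, while 555/767/768-9 raises an error instead of producing a guess.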

© 2010 WhatClinic.com Blog