Tim Hawkins

Author


Ordnance Survey UK Improve Their Open Data

When the UK’s data sharing website data.gov.uk launched I was pretty unimpressed. I mentioned a few things that annoyed me: Where were the examples? What were the ontologies used? Without this information the provision of a sparql endpoint is fairly meaningless.

Well it turns out that one section of the government is getting stuck in. Maybe I should have remembered that the marketeers love a launch without a product, and that the people doing the real work are up late, slaving away cursing their managers, trying to get the stuff out the door. Just saying; it’s not like I’ve ever seen anything like that in my job :)

Anyway…, I already liked the efforts the UK’s Ordnance Survey were making and, defying the normal stereotype of public sector computing, they have not been content with their first or even their second stab at presenting a linked data interface to their info-sets.

http://data.ordnancesurvey.co.uk/ presents examples, a sparql endpoint, and the ontologies used, including the use of standard ontologies like foaf.  Nice!

Now what can you do with any of this?

Well last week I was in the UK, in Kingham. If I create a sparql query like this:

CONSTRUCT {
?Place a <http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/NamedPlace> .
?Place a ?Type .

?Place <http://www.w3.org/2004/02/skos/core#broader> ?BiggerPlace .
?BiggerPlace a <http://data.ordnancesurvey.co.uk/ontology/50kGazetteer/NamedPlace> .
?BiggerPlace <http://www.w3.org/2000/01/rdf-schema#label> ?BiggerPlaceName .
?BiggerPlace <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/contains> ?Place .
}
WHERE
{
?Place <http://www.w3.org/2000/01/rdf-schema#label> 'Kingham' .
?Place <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?Type .
?BiggerPlace <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/contains> ?Place .
?BiggerPlace <http://www.w3.org/2000/01/rdf-schema#label> ?BiggerPlaceName
}

and enter it on the endpoint page (http://api.talis.com/stores/ordnance-survey/services/sparql) then I get back a graph of information about the places that say they contain Kingham. I also get a URL for Kingham (http://data.ordnancesurvey.co.uk/id/7000000000008699*) which I can use from now on as a unique identifier in my code for the Civil Parish that is Kingham.
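For anyone who would rather call the endpoint from code than paste into the web form, here is a minimal Python sketch. It assumes the endpoint follows the standard SPARQL protocol and accepts the query text in a `query` parameter on a GET request. The function name and the cut-down query are illustrative only and are not part of the Ordnance Survey service.

```python
from urllib.parse import urlencode

# Endpoint URL as given in the post; this sketch does not check that it still answers.
ENDPOINT = "http://api.talis.com/stores/ordnance-survey/services/sparql"

# A cut-down, illustrative query: find things labelled 'Kingham' and their types.
QUERY = """
SELECT ?Place ?Type
WHERE {
  ?Place <http://www.w3.org/2000/01/rdf-schema#label> 'Kingham' .
  ?Place <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?Type .
}
"""


def build_query_url(endpoint, query):
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Prints a URL that could be fetched with any HTTP client.
    print(build_query_url(ENDPOINT, QUERY))
```

Building the URL separately from fetching it keeps the sketch testable without a network connection.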

This to me is exactly what government should be lending to the data world. The administrative levels in the Ordnance Survey data can be linked through to election results, the provision of services, etc. A commitment on the part of an authority to maintain a high level of integrity for such data can provide a genuinely valuable resource.

Technology and governments do not usually go well together. The thing about data though is that it really isn’t about technology. The only criterion for success is availability.

It’s the business of governments to supply services to all their citizens and with a fair degree of equality (hopefully). To assess the success of governance requires a lot of categorization and correlation: the number of doctors per 1,000 people; the average wealth in a given district; employment levels, etc. So the work is already being done. Making it open means we get more value for our taxes, accountability increases and we get a data set that allows us to talk authoritatively of entities within a state.

http://data.ordnancesurvey.co.uk/id/7000000000008699 refers to the Civil Parish of Kingham, not the village, or some other nebulous form. The Kingham link describes the nature of this relationship by describing its type as a Civil Parish. Another graph might describe the village and also form a relationship such as:

<http://example.com/ukplaces/villages/Kingham> <http://www.w3.org/2004/02/skos/core#broader> <http://data.ordnancesurvey.co.uk/id/7000000000008699> .

letting us know that the village and the civil parish have a strong relationship.

Sure there are things wrong with the OS data but, bucking my usual nature, I’m not going to complain about them. Why? Because I trust them to make their data even better in future. That’s a rare enough thing for me to expect in a commercial product and almost unheard of in the public sector.

Tim (not the other Tim)

*If you don’t have an rdf plugin look at these by prefixing the rdf URLs with http://demo.openlinksw.com/ode/?uri=


Labelling the World

Composite image of the Earth at night.

I’ve been looking a lot at the way we locate things, through maps and through the naming of things. Our addresses are weird things to begin with. We usually start by naming the place we are in and then move outwards in increments into the world around us. It shows some of the evolution of our idea of position.

In the past we wrote our address and most of the people who read it would know within the first few words who and where we were: “Oh that’s Joe who lives in the next village over” they would say, and that would be an end of it. They wouldn’t have to think about Counties, States, or Countries. Who talked to people in other countries anyway, except kings and their like?

Nowadays the world has grown smaller, and in many ways it would be preferable to have an address that narrowed with each phrase, each increment bringing the sender closer to the addressee. In the online world, our lovely folk-evolved addresses seem more trouble than they’re worth. Wouldn’t it be nice to have a little code we could parse?

Well, of course, some such mechanisms exist, but many of them are old things that evolved out of archaic postal systems and are still maintained for those systems. What should it matter to me how a postman is routed from his supply depot?

Something like Geohash on the other hand provides a code for all places on the earth, much like latitudes and longitudes, and to a high degree of accuracy – it’s a sphere, it’s not too hard. So, we can easily choose one system for everywhere on earth, and not have different systems for the UK, USA, China, India, etc. Just one common system.
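To give a feel for how little machinery is involved, here is a minimal sketch of the Geohash encoding as it is commonly described: bits are taken alternately from longitude and latitude by repeatedly halving the range, and every five bits become one base-32 character. The function name and the default precision are choices made for this illustration.

```python
# The Geohash base-32 alphabet (digits plus letters, without a, i, l and o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a Geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # the first bit describes longitude, then they alternate

    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the range
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the range
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744, precision=3))  # → u4p
```

A longer code only ever extends a shorter one, so trimming characters from the end gives a coarser location rather than a different one.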

We can then leave it up to each postal service to translate that system to their old postal code thingy, and we have no barrier to entry to the finding places game. The resulting addresses are nice and simple too, but even though the code-to-place procedure becomes more open and accessible, it would still be nice not to have to type a code into a phone to find out you are going just round the corner. Well, a Wikipedia type thing (not Wikipedia though, we’re not all worth an encyclopedia entry) could easily map locations to names, and using semantic web structures we could get to know a lot about each place we bother to name.

My own country (Ireland) has always been blessed by the ability of putting things off for a long time. Sometimes this enables it, when it finally gets round to doing things, to adopt systems unencumbered by legacy. So, if we choose to have an encoding for places, why not use Geohash or an equivalent?

If the Irish postal system wants to connect their own system to this it should be relatively simple, and we won’t be left with some debacle like those poor people in the UK, whose once public, now private(-ish) postal system had its postcode system adopted as a de facto identifier standard, even though it’s not a particularly good identifier, except for delivering mail. And just to make matters worse, their government, who are supposedly interested in unlocking innovation through free access to public sector data and information, continue to allow the postal system to charge for access to the postcode system!

Thank god my government is bound to:

  • Learn from this, and not spend millions on some new weird standard.
  • Choose, or make, a freely available system.
  • Quickly reap the benefits.

That’s what we’re going to do, right?…

Tim.


How To Improve Data.gov.uk

OK, so the Data.gov.uk stuff hit a raw nerve with me yesterday. In itself, it was pretty disappointing and the reporting also touched on one of my pet hates: Why do a lot of journalists not ask questions any more? They just seem to repost what they’re told. But, I’m past all that now, so here are a few thoughts on how a government data website could be better implemented.

The W3 put a bit of effort into a data browser extension for Firefox called Tabulator. It’s nice and deserves to be better. If I look at a page on Wikipedia with Firefox, say for example

http://en.wikipedia.org/wiki/Ystrad_Meurig

Then if there is data I am interested in as a starting point, I can go to Dbpedia and at http://dbpedia.org/resource/Ystrad_Meurig my data browser extension will find a graph of data that it can understand.

This is really nice.

The Data.gov.uk site has a query page for looking at its semantic data, but no clues as to what I can ask for. Dbpedia has a query page and from my Ystrad Meurig example I already know a lot about how Dbpedia might be storing its data. I know that I can ask for latitudes and longitudes using the W3 positioning predicates (using Tabulator I can browse from the data page to those predicates and find out more about them – it’s all linked). I know Dbpedia has a thing called ‘distance to Cardiff‘ so I could query Dbpedia for all things with a distance to Cardiff that have latitudes and longitudes and then I could plot them on a map.
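As a rough sketch of the kind of query described above, the Python below builds a SPARQL SELECT asking for everything that has a given property together with a latitude and a longitude. The `geo:` prefix is the W3C WGS84 vocabulary. The property URI in the usage example is a made-up placeholder, since the exact URI behind ‘distance to Cardiff’ is not given here, and the function name is likewise hypothetical.

```python
# Namespace of the W3C WGS84 latitude/longitude vocabulary.
GEO_NS = "http://www.w3.org/2003/01/geo/wgs84_pos#"


def places_with_property_query(property_uri, limit=100):
    """Build a SPARQL query for things that have property_uri plus a lat and long."""
    return f"""PREFIX geo: <{GEO_NS}>
SELECT ?thing ?lat ?long ?value
WHERE {{
  ?thing <{property_uri}> ?value ;
         geo:lat ?lat ;
         geo:long ?long .
}}
LIMIT {int(limit)}"""


if __name__ == "__main__":
    # Placeholder URI: substitute whatever predicate the data browser shows.
    print(places_with_property_query("http://example.com/property/distanceToCardiff"))
```

The resulting text can be pasted into a query page, and the `?lat` and `?long` columns are what a mapping tool would need for plotting.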

This is properly linked data. This is what a government data site should be like.

I mentioned the Ordnance Survey ontology in yesterday’s rant. I like it, but it could be better. It has a solid structure for an administrative geography of the UK (not including Northern Ireland). However, the current version 2 is already out of date. A number of the unitary authorities were merged last year. This information is already reflected on the corresponding pages in Dbpedia, along with nice tagging to link the new authorities to the data on the authorities they have replaced.

The OS version 2 ontology replaced version 1 in a fairly unhelpful way, but that was OK because they said they were still playing around with how to work with ontologies. Will the next version play well with the current one?

The Dbpedia way of doing things means that not only do we end up with an up-to-date administrative structure but we also maintain a history of that structure. That history can be useful if we have to consider people and not administrations. A person might get around to acknowledging a change in the administrative make up of his area – eventually – but it won’t happen immediately. The online structures need to be able to link them from old knowledge to new concepts. Here is the advantage in all these new ideas: Nothing leads to a dead end, everything is given more meaning by its connections.

The other Tim


It’s The Data, Stupid.

Data.gov.uk


Oh, the excitement of it! Tim Berners-Lee is getting governments to listen to his cry, to set data free. Oh, the disappointment of our first look at the UK’s efforts! Where is the semantic data? Where are the ontologies to link concepts across datasets?

[For those of you not interested in the technical side of things skip over the next paragraph if you like - it's just technical ranting...]

This being a first pass, the semantic data and the ontologies may be in there, but if they are they’re well hidden. There is a sparql page but no indication of which values are searchable. All the data sets I looked for were available in CSV and XLS; hardly linked. Turning one of the CSV data sets into RDF using well known namespaces took me about 30 minutes, so it shouldn’t be too hard for the site to get better, and quickly. Will it?
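To show roughly what such a CSV-to-RDF conversion involves, here is a minimal Python sketch that turns rows into N-Triples lines. It is not the conversion described above. The base URI, the column names and the sample row are all made up for illustration, and only `rdfs:label` is a real, well known predicate.

```python
import csv
import io
from urllib.parse import quote

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BASE = "http://example.com/dataset/"  # hypothetical namespace for this sketch


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def csv_to_ntriples(csv_text, id_column, label_column):
    """Return one N-Triples line per populated cell, keyed on id_column."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{quote(row[id_column], safe='')}>"
        label = escape_literal(row[label_column])
        lines.append(f'{subject} <{RDFS_LABEL}> "{label}" .')
        for column, value in row.items():
            # Skip the id and label columns, empty cells and overflow fields.
            if column is None or column in (id_column, label_column) or not value:
                continue
            predicate = f"<{BASE}property/{quote(column, safe='')}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


if __name__ == "__main__":
    sample = "code,name,category\nE1,Kingham,parish\n"  # made-up sample row
    print("\n".join(csv_to_ntriples(sample, "code", "name")))
```

The real effort lies in choosing shared predicates and identifiers instead of the placeholder `property/` URIs used here, because that is what makes the result linkable across data sets.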

OK, that bit aside, the point is that the launch of this site seems to have been a deadline achieving exercise rather than an announcement of anything actually being ready. That being the case, somebody needs to put up their hand and say “That’s rubbish”.

It’s especially hard for me because I’m as excited about the possibilities of the semantic web as my more illustrious name-sake, but this ain’t those dreams, not even close. I was hoping to be able to complain loudly in the pub about my own useless government here in Ireland and how they weren’t doing anything to make their data available. “Look at how good the UK is”, I could have said. Oh well, another day.

So why am I so disappointed? What is this stuff anyway?
Well the semantic web can be a way to connect… well everything.

When I talk about a thing, say Sutton, I can link it to a description and eliminate any possibility that I am talking about a different Sutton. “Which Sutton?” you ask. “Sutton, Peterborough; Sutton, Craven etc…” and I answer “Exactly my point”. I can link it though:

http://www.ordnancesurvey.co.uk/ontology/AdministrativeGeography/v2.0/AdministrativeGeography.rdf#osr7000000000001643

[N.B. Don't go to the above URL unless you want to download a massive file!]

which is a link to the Ordnance Survey UK’s ontology for Administrative places in the UK. If this excellent data (if a little limited, and a little out of date) is on the new data.gov.uk site, then it’s hard to find. It would be a good way to tie geographic data sets together. An arbitrarily named field in an xls file is, on the other hand, not a good way to link data together.

Nice idea Tim, now you need them to actually do the work.

The other Tim


Getting It Right

Following on from my ramble about addresses (see Return To Sender) I’m going to complain in general about some blocks of information on the web.

Say you work as a professional, and your professional body sends around a form for you to fill out. They tell you it will be published on the web, say to help people find their nearest qualified aerospace engineer, of which you are one of course. Now, if you fill in an email address on the form and it gets published you will get emails. It is surprising how many people think this will not happen.

Which brings me to what my real point is: a whole load of data that gets thrown out there on the web is published without considering either its intention or its purpose.

If you make your phone number public, is it decipherable? The bottom line is that I don’t want phone numbers like (0)-555/767.768/9 appearing anywhere. [OK, below is a long post-amble about what I think a well formed number might be. Read it to see how finikity (which may or may not be a word, making it all the more embiggening) I am.]

If you write your number like this people will probably still be able to reach you, but you’re making it hard for them, and if you’re inputting the number on the web somewhere, you’ve probably caused a few people to scratch their heads and wonder if they should bother coding solutions for numbers of such opacity. (They should not.)

In case you’re getting angry at coders at this stage, the flip-side is the dumb entry field that insists you enter your number in some bizarre format chosen by a coder through ignorance/laziness. (Yes Bord Gáis a + is valid in a telephone number, in fact it is the single best starting character for a number to have, but enough about that… )

Where was all this going?

Oh yes, why is all this information going on the web in the first place?

A lot of official data now seems to find a place on the web in PDF format. The motivations behind this may be laudable; the original may be a Microsoft Word document and PDF is a more open format. However, PDFs tend to be designed for human rather than machine legibility and lack the semantic structures that are increasingly available (if underused) in html/xml formats.

If the official Canadian list of area dialling codes is in a PDF that lacks a readable table structure then someone (or worse still, more than one person) will create unofficial lists using wikis or web pages. These will only maintain some level of compatibility with each other and the original source, so any further sources of information that verify themselves against these pages become less and less reliable. This is all a long way from Tim Berners-Lee’s vision for a semantic web.

If governments are responsible for making data they hold accessible, and this is good in any number of ways, then they also have a responsibility for the level of accessibility. A locked filing cabinet in a basement that is accessible by asking the staff member responsible, available the first Monday of the month, between 9:00 and 9:30am, is a level of accessibility…

This is not a purely esoteric complaint. Somebody pays for that list of dialling codes, and those emergency service numbers and a whole host of data that has been decreed web worthy, but very little thought seems to go into how that information appears on the web or how its value could be maximized.

ISO, the International Organization for Standardization, publish a lot of information on the web. Take their currency pages for example, which have nice tables of data. Well, their HTML is invalid; the doctype does not match the data contained. Their entity names are arbitrarily chosen forms of country names, even though they themselves define unique identifiers for countries, so the table must be interpreted for it to have value. I guess if ISO can’t get this right there is little hope for other folks.

Is there a point to all this? Well the ISO pages are better than a PDF, and PDFs are better than MS Word documents, and all are better than no information at all. There is at least a chance that something like linked data will be widely used and provide a more useful web, but, in the meantime, just a little thought about the quality of information and reasons for it to be placed on the web will make me a lot happier…

*Post-amble

What does a dash (-) in a phone number mean, why was it put there? Probably it was just put in to separate the number into human digestible chunks 555-767-768. Whitespace would have been as good but no problems so far.

What does a slash (/, virgule) mean in a phone number? In a sentence it might have a dash type meaning but I would usually use it to designate a substitution or a logical OR type statement. I would avoid using a – for OR and also avoid using a / for a join.

The number 555-767-768/9 then means that either 555-767-768 or 555-767-769 will get you through.

The number 555/767/768-9 would be ambiguous to me: it could mean the same as above but I might try 555767768 wait then press 9, or some such.

The use of a full-stop as a visual separator seems dubious; it has other meanings both in number sequences and in textual contexts.
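The reading argued for above (dashes and whitespace are only visual separators, a slash means "or") can be sketched in a few lines of Python. This is one possible interpretation and not a standard. The function name is hypothetical, and the sketch assumes each alternative after a slash replaces the same number of trailing digits.

```python
import re


def expand_alternatives(number):
    """Expand '555-767-768/9' into the full digit strings it stands for."""
    base, *alternatives = number.split("/")
    base_digits = re.sub(r"\D", "", base)  # drop dashes, spaces and other separators
    results = [base_digits]
    for alternative in alternatives:
        alt_digits = re.sub(r"\D", "", alternative)
        if not alt_digits or len(alt_digits) > len(base_digits):
            raise ValueError(f"cannot interpret alternative {alternative!r}")
        # Swap the trailing digits of the base number for the alternative.
        results.append(base_digits[: -len(alt_digits)] + alt_digits)
    return results


if __name__ == "__main__":
    print(expand_alternatives("555-767-768/9"))  # → ['555767768', '555767769']
```

The ambiguous 555/767/768-9 form would expand into nonsense under this reading, which is a fair reflection of how a machine would experience it.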


Return To Sender

One summer, in college, my friend received a postcard. It was addressed: (his surname), Offaly, Ireland.

This is a classic example of where humans deal with ambiguity a lot better than computers. Addresses are tricky things for non-humans to understand. The London Road is unlikely to be in London but is likely to be in any number of places on the way to London.

If we want to look for something these days we are likely to type it into a search engine. If what we want to look for is a place, then that search will be interpreted against a mapping service and, in case you hadn’t noticed, most of these are awful. Well, I mean, they are great – mostly, but you would not want to rely on them.

The problem runs something like this:

Rathfarnham is a place in Dublin, Ireland. It is a place which people quite happily put in their addresses and then receive replies to.

The Google Maps product has a geocoder. If you send it an address, it sends back where it thinks that is. If you type in ‘Grange Road, Rathfarnham’, it will tell you where that is. However, it doesn’t really know a Grange Road in Rathfarnham, only a Grange Road, and because of the crazy old world country I live in there is another Grange Road a few clicks away.

Even worse, it doesn’t really know Rathfarnham is a locality. If I try ‘Silverwood Drive, Rathfarnham’, then I get back the correct ‘Silverwood Drive’, but if you look deep inside the response it thinks the real address is Silverwood Drive, Ballyboden.

Now until I used the Google geocoder I did not know where Ballyboden was, despite having lived close to it throughout my childhood years. Somebody, somewhere down the line has made the seemingly unimportant decision that Ballyboden is the main locality and that Rathfarnham is a place and not a locality.

Ok, so this stuff is just annoying, nothing more, right? Well not really. There’s money in search results. How many people who have a business think about the appearance of their physical address to search engines? I’m guessing very few. For instance it is common for people from the same business to write their business address in quite different ways. Even if you have a standard address and never deviate from it, is it one that is mapping data friendly? You might be missing out to a competitor with a less ambiguous address (machine-interpretable).

Ambiguity and commonality are the enemies of correct identification. Our ‘Grange Road’ issue becomes a more common problem if our address is Main Street or High Street. Now we definitely need another matchable identifier within the address to have any chance of finding our High Street and not a different one on the other side of the county, state, or country.

The challenge is to provide a solution that is both machine-interpretable and human-readable. This has been solved by several countries such as Singapore where the postcodes have been honed to building level accuracy, and from there a standardised floor/suite syntax completes the address.

As we use more machine based location services it makes sense to use increasingly machine readable addresses. This is not to say we should give up the addresses of our forefathers to an alpha-numeric string or abstract machine interpretable symbol but there is a good case for including a unique identifier in physical addresses as standard.

At one stage ad hoc delivery/location solutions were fine: a package would cross an international border and a large purpose built organization would take responsibility for its correct delivery. However this approach is unreliable and archaic when applied to different automated systems controlled by an expanding number of companies. The international situation makes life even more complicated because of a network of varying standards and methodologies across the world. In the absence of international standards it is up to individual governments to assemble solutions, each requiring a separate approach.

Zip codes and other unique identifiers are one solution. However, these were often designed to handle a different problem and may not be sufficiently accurate, relying finally on a level of human interpretation. Some of these are further tied to commercial organizations making their availability unreliable. For many applications any charges will create an unacceptable barrier to entry. Location based services aren’t just going to be “find a pizza joint”; public information from health screening services to disaster emergency updates are part of the story too.

Some countries, notably for me Ireland, are falling worryingly behind. Systems are being developed for public services that rely on commercial organizations with varying levels of commitment and skill. Where a government should see a responsibility to provide equal service support across its territory, current mapping and geocoding vendors will naturally apply their efforts to high return sectors first.

Many countries recognized early on the importance of a mail system to their societies, enacting legislation to ensure universal accessibility and harsh penalties for interference in its operation. As the transference of information and its type changes, the willingness to approach the associated opportunities at a societal or governmental level appears particularly moribund.

© 2010 WhatClinic.com Blog