By way of background to this post – and because I think it’s quite interesting in itself – here’s a short paper I gave last year at this conference (great company, shame about the catering). It was co-written with my colleagues Judith Aldridge and Karen Clarke. I don’t stand by everything in it – as I’ve got deeper into the project I’ve moved further away from Clay’s scepticism and closer towards people like Carole Goble and Keith Cole – but I think it still sets out an argument worth having.
Mind the gap: Metadata in e-social science
1. Towards the final turtle
It’s said that Bertrand Russell once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the centre of our galaxy. At the end of the lecture, a little old lady at the back of the room got up and said: “What you have told us is rubbish. The world is really a flat plate supported on the back of a giant tortoise.”
Russell smiled and replied, “What is the tortoise standing on?”
“You’re very clever, young man, very clever,” said the old lady. “But it’s turtles all the way down.”
The Russell story is emblematic of the logical fallacy of infinite regress: proposing an explanation which is just as much in need of explanation as the original fact being explained. The solution, for philosophers (and astronomers), is to find a foundation on which the entire argument can be built: a body of known facts, or a set of acceptable assumptions, from which the argument can follow.
But what if infinite regress is a problem for people who want to build systems as well as arguments? What if we find we’re dealing with a tower of turtles, not when we’re working backwards to a foundation, but when we’re working forwards to a solution?
WSDL [Web Services Description Language] lets a provider describe a service in XML [Extensible Markup Language]. [...] to get a particular provider’s WSDL document, you must know where to find them. Enter another layer in the stack, Universal Description, Discovery, and Integration (UDDI), which is meant to aggregate WSDL documents. But UDDI does nothing more than register existing capabilities [...] there is no guarantee that an entity looking for a Web Service will be able to specify its needs clearly enough that its inquiry will match the descriptions in the UDDI database. Even the UDDI layer does not ensure that the two parties are in sync. Shared context has to come from somewhere, it can’t simply be defined into existence. [...] This attempt to define the problem at successively higher layers is doomed to fail because it’s turtles all the way up: there will always be another layer above whatever can be described, a layer which contains the ambiguity of two-party communication that can never be entirely defined away. No matter how carefully a language is described, the range of askable questions and offerable answers make it impossible to create an ontology that’s at once rich enough to express even a large subset of possible interests while also being restricted enough to ensure interoperability between any two arbitrary parties.
Clay Shirky is a longstanding critic of the Semantic Web project, an initiative which aims to extend Web technology to encompass machine-readable semantic content. The ultimate goal is the codification of meaning, to the point where understanding can be automated. In commercial terms, this suggests software agents capable of conducting a transaction with all the flexibility of a human being. In terms of research, it offers the prospect of a search engine which understands the searches it is asked to run and is capable of pulling in further relevant material unprompted.
This type of development is fundamental to e-social science: a set of initiatives aiming to enable social scientists to access large and widely-distributed databases using ‘grid computing’ techniques.
A Computational Grid performs the illusion of a single virtual computer, created and maintained dynamically in the absence of predetermined service agreements or centralised control. A Data Grid performs the illusion of a single virtual database. Hence, a Knowledge Grid should perform the illusion of a single virtual knowledge base to better enable computers and people to work in cooperation.
(Keith Cole et al)
Is Shirky’s final turtle a valid critique of the visions of the Semantic Web and the Knowledge Grid? Alternatively, is the final turtle really a Babel fish — an instantaneous universal translator — and hence (excuse the mixed metaphors) a straw person: is Shirky setting the bar impossibly high, posing goals which no ‘semantic’ project could ever achieve? To answer these questions, it’s worth reviewing the promise of automated semantic processing, and setting this in the broader context of programming and rule-governed behaviour.
2. Words and rules
We can identify five levels of rule-governed behaviour. In rule-driven behaviour, firstly, ‘everything that is not compulsory is forbidden’: the only actions which can be taken are those dictated by a rule. In practice, this means that instructions must be framed in precise and non-contradictory terms, with thresholds and limits explicitly laid down to cover all situations which can be anticipated. This is the type of behaviour represented by conventional task-oriented computer programming.
A higher level of autonomy is given by rule-bound behaviour: rules must be followed, but there is some latitude in how they are applied. A set of discrete and potentially contradictory rules is applied to whatever situation is encountered. Higher-order rules or instructions are used to determine the relative priority of different rules and resolve any contradiction.
Rule-modifying behaviour builds on this level of autonomy, by making it possible to ‘learn’ how and when different rules should be applied. In practice, this means that priority between different rules is decided using relative weightings rather than absolute definitions, and that these weightings can be modified over time, depending on the quality of the results obtained. Neither rule-bound nor rule-modifying behaviour poses any fundamental problems in terms of automation.
Rule-discovering behaviour, in addition, allows the existing body of rules to be extended in the light of previously unknown regularities which are encountered in practice (“it turns out that many Xs are also Y; when looking for Xs, it is appropriate to extend the search to include Ys”). This level of autonomy — combining rule observance with reflexive feedback — is fairly difficult to envisage in the context of artificial intelligence, but not impossible.
The level of autonomy assumed by human agents, however, is still higher, consisting of rule-interpreting behaviour. Rule-discovery allows us to develop an internalised body of rules which corresponds ever more closely to the shape of the data surrounding us. Rule-interpreting behaviour, however, enables us to continually and provisionally reshape that body of rules, highlighting or downgrading particular rules according to the demands of different situations. This is the type of behaviour which tells us whether a ban is worth challenging, whether a sales pitch is to be taken literally, whether a supplier is worth doing business with, whether a survey’s results are likely to be useful to us. This, in short, is the level of Shirky’s situational “shared context” — and of the final turtle.
We believe that there is a genuine semantic gap between the visions of Semantic Web advocates and the most basic applications of rule-interpreting human intelligence. Situational information is always local, experiential and contingent; consequently, the data of the social sciences require interpretation as well as measurement. Any purely technical solution to the problem of matching one body of social data to another is liable to suppress or exclude much of the information which makes it valuable.
We cannot endorse comments from e-social science advocates such as this:
variable A and variable B might both be tagged as indicating the sex of the respondent where sex of the respondent is a well defined concept in a separate classification. If Grid-hosted datasets were to be tagged according to an agreed classification of social science concepts this would make the identification of comparable resources extremely easy.
(Keith Cole et al)
work has been undertaken to assert the meaning of Web resources in a common data model (RDF) using consensually agreed ontologies expressed in a common language [...] Efforts have concentrated on the languages and software infrastructure needed for the metadata and ontologies, and these technologies are ready to be adopted.
(Carole Goble and David de Roure; emphasis added)
Statements like these suggest that semantics are being treated as a technical or administrative matter, rather than a problem in its own right; in short, that meaning is being treated as an add-on.
3. Google with Craig
To clarify these reservations, let’s look at a ‘semantic’ success story.
The service, called “Craigslist-GoogleMaps combo site” by its creator, Paul Rademacher, marries the innovative Google Maps interface with the classifieds of Craigslist to produce what is an amazing look into the properties available for rent or purchase in a given area. [...] This is the future….this is exactly the type of thing that the Semantic Web promised
‘This’ is is an application which calculates the location of properties advertised on the ‘Craigslist’ site and then displays them on a map generated from Google Maps. In other words, it takes two sources of public-domain information and matches them up, automatically and reliably.
That’s certainly intelligent. But it’s also highly specialised, and there are reasons to be sceptical about how far this approach can be generalised. On one hand, the geographical base of the application obviates the issue of granularity. Granularity is the question of the ‘level’ at which an observation is taken: a town, an age cohort, a household, a family, an individual? a longitudinal study, a series of observations, a single survey? These issues are less problematic in a geographical context: in geography, nobody asks what the meaning of ‘is’ is. A parliamentary constituency; a census enumeration district; a health authority area; the distribution area of a free newspaper; a parliamentary constituency (1832 boundaries) — these are different ways of defining space, but they are all reducible to a collection of identifiable physical locations. Matching one to another, as in the CONVERTGRID application (Keith Cole et al) — or mapping any one onto a uniform geographical representation — is a finite and rule-bound task. At this level, geography is a physical rather than a social science.
The issue of trust is also potentially problematic. The Craigslist element of the Rademacher application brings the social element to bear, but does so in a way which minimises the risks of error (unintentional or intentional). There is a twofold verification mechanism at work. On one hand, advertisers — particularly content-heavy advertisers, like those who use the ‘classifieds’ and Craigslist — are motivated to provide a (reasonably) accurate description of what they are offering, and to use terms which match the terms used by would be buyers. On the other hand, offering living space over Craigslist is not like offering video games over eBay: Craigslist users are not likely to rely on the accuracy of listings, but will subject them to in-person verification. In many disciplines, there is no possibility of this kind of ‘real-world’ verification; nor is there necessarily any motivation for a writer to use researchers’ vocabularies, or conform to their standards of accuracy.
In practice, the issues of granularity and trust both pose problems for social science researchers using multiple data sources, as concepts, classifications and units differ between datasets. This is not just an accident that could have been prevented with more careful planning; it is inherent in the nature of social science concepts, which are often inextricably contingent on social practice and cannot unproblematically be recorded as ‘facts’. The broad range covered by a concept like ‘anti-social behaviour’ means that coming up with a single definition would be highly problematic — and would ultimately be counter-productive, as in practice the concept would continue to be used to cover a broad range. On the other hand, concepts such as ‘anti-social behaviour’ cannot simply be discarded, as they are clearly produced within real — and continuing — social practices.
The meaning of a concept like this — and consequently the meaning of a fact such as the recorded incidence of anti-social behaviour — cannot be established by rule-bound or even rule-discovering behaviour. The challenge is to record both social ‘facts’ and the circumstances of their production, tracing recorded data back to its underlying topic area; to the claims and interactions which produced the data; and to the associations and exclusions which were effectively written into it.
4. Even better than the real thing
As an approach to this problem, we propose a repository of content-oriented metadata on social science datasets. The repository will encompass two distinct types of classification. Firstly, those used within the sources themselves; following Barney Glaser, we refer to these as ‘In Vivo Concepts’. Secondly, those brought to the data by researchers (including ourselves); we refer to these as ‘Organising Concepts’. The repository will include:
• relationships between Organising Concepts
‘theft from the person’ is a type of ‘theft’
• associations between In-Vivo Concepts and data sources
the classification of ‘Mugging’ appears in ‘British Crime Survey 2003’
• relationships between In-Vivo Concepts
‘Snatch theft’ is a subtype of the classification of ‘Mugging’
• relationships between Organising Concepts and In-Vivo Concepts
the classification of ‘Snatch theft’ corresponds to the concept of ‘theft from the person’
The combination of these relationships will make it possible to represent, within a database structure, a statement such as
Sources of information on Theft from the person include editions of the British Crime Survey between 1996 and the present; headings under which it is recorded in this source include Snatch theft, which is a subtype of Mugging
The structure of the proposed repository has three significant features. Firstly, while the relationships between concepts are hierarchical, they are also multiple. In English law, the crime of Robbery implies assault (if there is no physical contact, the crime is recorded as Theft). The In-Vivo Concept of Robbery would therefore correspond both to the Organising Concept of Theft from the person and that of Personal violence. Since different sources may share categories but classify them differently, multiple relationships between In-Vivo Concepts will also be supported. Secondly, relationships between concepts will be meaningful: it will be possible to record that two concepts are associated as synonyms or antonyms, for example, as well as recording one as a sub-type of the other. Thirdly, the repository will not be delivered as an immutable finished product, but as an open and extensible framework. We shall investigate ways to enable qualified users to modify both the developed hierarchy of Organising Concepts and the relationships between these and In-Vivo Concepts.
In the context of the earlier discussion of semantic processing and rule-governed behaviour, this repository will demonstrate the ubiquity of rule-interpreting behaviour in the social world by exposing and ‘freezing’ the data which it produces. In other words, the repository will encode shifting patterns of correspondence, equivalence, negation and exclusion, demonstrating how the apparently rule-bound process of constructing meaning is continually determined by ‘shared context’.
The repository will thus expose and map the ways in which social data is structured by patterns of situational information. The extensible and modifiable structure of the repository will facilitate further work along these lines: the further development of the repository will itself be an example of rule-interpreting behaviour. The repository will not — and cannot — provide a seamless technological bridge over the semantic gap; it can and will facilitate the work of bridging the gap, but without substituting for the role of applied human intelligence.