Cloudbuilding (1)

This one’s about work.

I’m currently documenting the concepts underlying the 2005 Mixmag Drug Survey using Protégé. Here’s why:

The documentation of social science datasets on a conceptual level, so as to make multiple datasets comprehensible within a shared conceptual framework, is inherently problematic: the concepts on which the data of the social sciences are constructed are imprecise, contested and mutable, with key concepts defined differently by different sources. When a major survey release is published, for example, the accompanying metadata often includes not only a definition of key terms, but discussion of how and why the definitions have changed since the previous release. This information is of crucial importance to the social scientist, both as a framework for understanding statistical data and as a body of social data in its own right.

It follows that we cannot think in terms of ironing out inconsistencies between social science datasets and resolving ambiguities. Rather, documenting the datasets must include documenting the definitions of the conceptual framework on which the datasets are built, however imprecise or inappropriate these concepts might appear in retrospect. This will also involve preserving – and exposing – the variations between different sources, or successive releases from a single source.

There are currently two main approaches to conceptually-oriented data documentation. A ‘top down’ approach is exemplified by the European Language Social Science Thesaurus (ELSST). The Madiera portal allows researchers to explore ELSST and access European survey data which has been linked to ELSST keywords. The limitations of the top-down approach can be gauged from ELSST’s concepts relating to drug use. Drug Abuse, Drug Addiction, Illegal Drugs and Drug Effects are all ‘leaf’ concepts – headings which have no subheadings under them. However, they are in different parts of the overall ELSST tree: for example, Drug Abuse is under Social Problems->Abuse, while Drug Effects is under Biology->Pharmacology. Although the hierarchy is augmented by a list of ‘related’ concepts, to some extent facilitating horizontal as well as vertical navigation, the hierarchy inevitably makes some types of enquiry easier than others. Anyone using the ELSST ‘tree’ will be visually reminded of the affinities identified by ELSST’s authors between Pharmacology and Physiology, or between Drug Abuse and Child Abuse. These problems follow from the initial design choice of a single conceptual hierarchy.
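As a rough illustration of how a single hierarchy builds in some affinities and not others, here is a minimal Python sketch. It uses only the placements mentioned above; putting Child Abuse under Abuse and Physiology under Biology is an assumption drawn from the ‘affinities’ remark, and the `ROOT` node and function names are invented for the example. It is not ELSST’s actual data model.

```python
# Parent links for a small fragment of the tree. "ROOT" is a hypothetical
# top node; the Child Abuse and Physiology placements are assumptions.
PARENT = {
    "Social Problems": "ROOT",
    "Biology": "ROOT",
    "Abuse": "Social Problems",
    "Pharmacology": "Biology",
    "Physiology": "Biology",       # assumed placement
    "Drug Abuse": "Abuse",
    "Child Abuse": "Abuse",        # assumed placement
    "Drug Effects": "Pharmacology",
}


def path_to_root(concept):
    """Return the concepts from `concept` up to ROOT, inclusive."""
    path = [concept]
    while path[-1] != "ROOT":
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a, b):
    """Count the edges on the path between two concepts in the tree."""
    up_a = path_to_root(a)
    up_b = path_to_root(b)
    # First concept on a's upward path that also lies on b's upward path.
    common = next(c for c in up_a if c in up_b)
    return up_a.index(common) + up_b.index(common)


print(tree_distance("Drug Abuse", "Child Abuse"))
print(tree_distance("Drug Abuse", "Drug Effects"))
```

In this fragment Drug Abuse sits closer to Child Abuse than to Drug Effects, which is exactly the kind of built-in affinity the hierarchy imposes on anyone navigating it.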

This approach to classification has recently come under criticism. Advocates of ‘bottom-up’ approaches argue that top-down taxonomies like the Dewey Decimal System or ELSST are an artificial imposition on the world of knowledge, which is better represented as a set of individual acts of labelling or ‘tagging’. It is argued that the ‘trees’ of hierarchical taxonomies can be replaced with a pile of ‘leaves’.

One successful ‘bottom-up’ approach is the framework for documenting survey data developed by the Data Documentation Initiative (DDI). The DDI standard makes it possible to search on keywords associated with surveys, sections of surveys and individual questions; the short text of individual questions is also searchable. Searches of DDI metadata can also be run from the Madiera portal: a search on ‘marijuana’, for instance, brings back short text items including the following:

– Health Behaviour in School-Aged Children (Switzerland, 1990)
Smoking cannabis should be legal? Q2.31 – Scottish Social Attitudes Survey (Scotland, 2001)
– Eurobarometer 37.0 (EU-wide, 1992)

Clearly, this way in to the data makes it easy for a well-prepared researcher to track the use of particular concepts ‘in the wild’ (in vivo concepts). However, this gain comes at the cost of some information. There is wide variation both in the terminology used in the surveys and in the concepts to which they refer. In one survey smoking cannabis might be a type of petty crime; in others it might figure as a type of leisure activity or a potential health risk. These conceptual differences are reflected in the vocabulary used by data sources – and by researchers. Depending on context, three researchers using ‘marijuana’, ‘hashish’ and ‘cannabis’ as search terms may be asking for the same data or for three different sets of data.

Neither the ‘top-down’ nor the ‘bottom-up’ approach articulates the conceptual assumptions which underlie the construction of a dataset – assumptions expressed both in the definition of in vivo concepts and in relationships between them. Rather than leaving much of this conceptual information undocumented (the DDI approach) or encoding one ‘correct’ set of assumptions while excluding or sidelining others (the ELSST approach), we propose to offer a coherent hierarchy of in vivo concepts for each individual source, based on the definitions (explicit and implicit) used in each source. Comparing the in vivo conceptual hierarchies used in multiple datasets will enable researchers both to see where concepts are directly comparable and to see where – and how – their definitions diverge and overlap.
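To give a feel for what comparing definitions across sources might involve, here is a small Python sketch. Everything in it is hypothetical: the source names, the contents of each definition and the `compare` function are invented for illustration, and a real comparison would work on much richer definitions than flat sets of terms.

```python
# Hypothetical definitions: which substances each source counts under its
# own in vivo concept "cannabis". Source names and contents are invented.
DEFINITIONS = {
    "source_a": {"marijuana", "hashish"},
    "source_b": {"marijuana"},
}


def compare(def_a, def_b):
    """Summarise how two definitions of a concept overlap and diverge."""
    return {
        "shared": def_a & def_b,            # covered by both sources
        "only_first": def_a - def_b,        # covered by the first only
        "only_second": def_b - def_a,       # covered by the second only
        "directly_comparable": def_a == def_b,
    }


result = compare(DEFINITIONS["source_a"], DEFINITIONS["source_b"])
print(result)
```

The point of the sketch is only that the divergence is kept and reported, not ironed out: the two sources stay distinct, and the researcher can see where they meet.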

To document hierarchies of in vivo concepts, we shall use description logic and the Semantic Web language OWL-DL (Web Ontology Language – Description Logic). OWL-DL makes it possible to formulate a precise logical specification of concepts such as

– use of cannabis (either marijuana or hashish) in the month prior to the survey
– use of either Valium or temazepam, at any time
– seizures of Class A drugs by HM Customs in the financial year 2004/5
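The first two concepts above can be mimicked in a few lines of Python, as a sketch of the kind of union and intersection that a description logic expresses. This is not OWL-DL and involves no reasoner; it is plain Python predicates. The `UseReport` record, the helper names and the reading of ‘month’ as 30 days are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseReport:
    """One reported instance of drug use (hypothetical record shape)."""
    substance: str
    days_before_survey: int


def any_of(*substances):
    """Union: true if the report involves any of the named substances."""
    names = frozenset(substances)
    return lambda report: report.substance in names


def within_days(limit):
    """True if the use fell within `limit` days before the survey."""
    return lambda report: 0 <= report.days_before_survey <= limit


def all_of(*concepts):
    """Intersection: true only if every component concept holds."""
    return lambda report: all(concept(report) for concept in concepts)


# "Use of cannabis (either marijuana or hashish) in the month prior to the
# survey"; treating a month as 30 days is an assumption.
cannabis = any_of("marijuana", "hashish")
cannabis_last_month = all_of(cannabis, within_days(30))

# "Use of either Valium or temazepam, at any time": no time restriction.
valium_or_temazepam_ever = any_of("Valium", "temazepam")

reports = [
    UseReport("hashish", 10),
    UseReport("marijuana", 200),
    UseReport("temazepam", 900),
]
matches = [r.substance for r in reports if cannabis_last_month(r)]
print(matches)
```

The third concept (seizures by HM Customs in a financial year) would need a different kind of record and is left out of the sketch.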

At least, that’s the idea. Now wait for part 2…

