Category Archives: taxonomy

The Liberal Democrat Party: a concluding unscientific postscript

Unlike leftish fiction-writer Ian McEwan, I am disinclined to extend much goodwill in the direction of the coalition government. In fact, anyone capable of judging this government – and the Lib Dems’ role in making it possible – as positively as McEwan strikes me as having something important missing from their own political makeup. It’s a bit like hearing it seriously argued that apartheid was good for the South African economy, or that Mussolini did in fact make the trains run on time: you just know that you’re not going to agree with this person on anything. (Not that I’ve agreed with old Leftie McEwan for quite a while.) Tory government is bad; if you join a Tory government, or (even worse) make a Tory government possible, you and your party are off the political roll-call forever.

This position seems pretty fundamental to me. But can I justify it on the basis of anything other than what McEwan refers to as “deep tribal reasons”?

Not one of us

Nick Cohen in Standpoint (via):

a significant part of British Islam has been caught up in a theocratic version of the faith that is anti-feminist, anti-homosexual, anti-democratic and has difficulties with Jews, to put the case for the prosecution mildly. Needless to add, the first and foremost victims of the lure of conspiracy theory and the dismissal of Enlightenment values are British Muslims seeking assimilation and a better life, particularly Muslim women.

It’s the word ‘significant’ that leaps out at me – that, and Cohen’s evident enthusiasm to extend the War on Terror into a full-blown Kulturkampf. I think what’s wrong with Cohen’s writing here is a question of perspective, or more specifically of scale. You’ve got 1.6 million British Muslims, as of 2001. Then you’ve got the fraction who take their faith seriously & probably have a fairly socially conservative starting-point with regard to politics (call it fraction A). We don’t really know what this fraction is, but anecdotal evidence suggests that it’s biggish (60%? 70%?) – certainly bigger than the corresponding fraction of Catholics, let alone Anglicans. Then there’s fraction B, the fraction of the A group who sign up for the full anti-semitic theocratic blah; it’s pretty clear that fraction B is tiny, probably below 1% (i.e. a few thousand people). Finally, you’ve got fraction C, the proportion of the B group who are actually prepared to blow people up or help other people to do so – almost certainly 10% or less, i.e. a few hundred people, and most of them almost certainly known to Special Branch.
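The arithmetic above can be sketched in a few lines, using the post's own illustrative numbers (the percentages are guesses, as the post itself says; the midpoints chosen here are my own assumptions):

```python
# Back-of-the-envelope sketch of fractions A, B and C.
# All percentages are the post's guesses, not data.
muslims = 1_600_000                      # British Muslims, 2001 census
frac_a = 0.65                            # "biggish" -- 60-70%, midpoint taken
group_a = int(muslims * frac_a)          # socially conservative believers
group_b = int(group_a * 0.005)           # "below 1%" of A: a few thousand
group_c = int(group_b * 0.10)            # "10% or less" of B: a few hundred

print(group_a, group_b, group_c)         # roughly 1,000,000 / 5,000 / 500
```

Even on these rough assumptions, the orders of magnitude come out as the post describes: a million-plus in A, a few thousand in B, a few hundred in C.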

I think we can and should be fairly relaxed about fraction A; we should argue with the blighters when they come out with stuff that needs arguing with, but we shouldn’t be afraid to stand with them when they’re raising just demands. (Same as any other group, really.) Fraction B is not a good thing, and if it grows to the point of getting on the mainstream political agenda then it will need to be exposed and challenged. But it hasn’t reached that level yet, and I see no sign that it’s anywhere near doing so. (Nigel Farage gets on Question Time, for goodness’ sake. Compare and contrast.) The real counter-terrorist action, it seems to me, is or should be around fraction C. Let’s say there are 5,000 believers in armed jihad out there – 500 serious would-be jihadis and 4,500 armchair jihadis, who buy the whole caliphate programme but whose own political activism doesn’t go beyond watching the Martyrdom Channel. What’s more important – eroding the 5,000 or altering the balance of the 500/4,500 split? In terms of actually stopping people getting killed, the answer seems pretty obvious to me.

Nick Cohen and his co-thinkers, such as the Policy Exchange crowd, focus on fraction B rather than fraction A. In itself this is fair enough – I think it’s mistaken, but it’s a mistake a reasonable person can make. What isn’t so understandable is the urgency – and frequency – with which they raise the alarm against this tiny, insignificant group of people, despite the lack of evidence that they’re any sort of threat. “A small minority of British Muslims believe in the Caliphate” is on a par with “A small minority of British Conservatives would bring back the birch tomorrow” or “A small minority of British Greens believe in Social Credit”. It’s an advance warning of possible weird nastiness just over the horizon; it’s scary, but it’s not that scary.

What explains the tone of these articles, I think, is an additional and unacknowledged slippage, from fraction B back out to fraction A. What’s really worrying Cohen, in other words, isn’t the lure of conspiracy theory and the dismissal of Enlightenment values so much as the lure of Islam (in any form) and the dismissal of secularism. (What are these Enlightenment values, anyway? Nobody ever seems to specify which values they’re referring to. Somebody should make a list). Hence this sense of a rising tide of theocratic bigotry, and of the need for a proper battle of values to combat it. This seems alarmingly wrongheaded. Let’s say that there’s a correlation between religious devotion and socially conservative views (which isn’t always the case) – then what? A British Muslim who advocates banning homosexuality needs to be dealt with in exactly the same way as a British Catholic who advocates banning abortion – by arguing with their ideas. (Their ideas are rooted in their identities – but then, so are mine and yours.) And hence, too, that odd reference to British Muslims seeking assimilation and a better life, as if stepping out of the dark ages must mean abandoning your faith – or, at least, holding it lightly, in a proper spirit of worldly Anglican irony. Here, in fact, Cohen is a hop and a skip from forgetting about all the fractions and identifying the problem as Muslims tout court. Have a care, Nick – that way madness lies.

An eerie sight

David introduces a new feature at Librarything:

“tagmashes,” which are (in essence) searches on two or more tags. So, you could ask to see all the books tagged “france” and “wwii.” But the fact that you’re asking for that particular conjunction of tags indicates that those tags go together, at least in your mind and at least at this moment. Librarything turns that tagmash into a page with a persistent URL.

I like everything about this, apart from the horrible name. As somebody points out in comments, it’s not a new idea – a large part of David’s post could have been summed up in the words “Librarything have implemented faceted tagging”. But I think this is still something worth shouting about, for two reasons. Firstly, they have implemented it – it’s there now to be played with, even if it’s got a silly name. Secondly and more importantly, they’ve implemented ground-up faceted tagging: the facets are created by the act of searching for particular combinations of tags. At a stroke this addresses the disadvantages I identified in my post; rather than being imposed beforehand, the dimensions into which the tags are organised emerge from the ways people want to combine tags. Arguably, what Librarything have ended up with is something like a cross between faceted tagging and Flickr-style tag clusters (in which dimensions emerge from an aggregate of past searches).
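Mechanically, a tagmash is nothing more exotic than an intersection of tag sets, memoised under a stable key. A minimal sketch (hypothetical names throughout; this is not Librarything's actual implementation):

```python
# Hypothetical sketch: a "tagmash" as a persistent, named intersection
# of tag sets -- the mechanics, not Librarything's code.
from collections import defaultdict

books_by_tag = defaultdict(set)   # tag -> set of book ids
tagmashes = {}                    # canonical tag tuple -> result set

def tag(book_id, *tags):
    for t in tags:
        books_by_tag[t].add(book_id)

def tagmash(*tags):
    key = tuple(sorted(tags))     # "france, wwii" == "wwii, france"
    # the act of searching records the association and gives it a stable key
    result = set.intersection(*(books_by_tag[t] for t in key))
    tagmashes[key] = result
    return key, result

tag(1, "france", "wwii")
tag(2, "france", "cookery")
tag(3, "wwii", "pacific")

key, hits = tagmash("france", "wwii")
print(key, hits)   # ('france', 'wwii') {1}
```

The point of the sketch is the `tagmashes` dict: the facet isn't defined up front, it comes into existence (with a persistent key, hence a persistent URL) the first time somebody asks for that conjunction.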

What’s more, the ability to record an association between two tags addresses a question I raised way back here. If, to quote Tom Evslin, “we think in terms of associations” (rather than conceptual hierarchies); and if “the relationship between documents is actually dynamic … open tagging and hyperlinking are both ways to impose particular relationships on documents to meet the need of some subset of readers”; then it’s curious, to say the least, that it’s been so hard until now to use tagging to say this is like that (as distinct from this has frequently been applied to resources which have also been classified as that). From del.icio.us on, tagging has been a simple naming operation, hitching up things to names (stuff-for-classifying to tags), but not allowing any connection between those names. The implication is that the higher-order knowledge of what went with what would only emerge – could only emerge – from the aggregate of everyone else’s naming acts.
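The distinction the paragraph draws can be put concretely: ordinary tagging stores only item-to-tag pairs, while saying "this is like that" requires storing tag-to-tag links as first-class data. A hypothetical sketch of the two stores side by side:

```python
# Plain tagging records item->tag pairs only; an association records
# tag->tag directly. Hypothetical API, for illustration.
item_tags = {}          # item -> set of tags   (ordinary tagging)
tag_links = set()       # frozenset({a, b})     (explicit "this is like that")

def apply_tag(item, t):
    item_tags.setdefault(item, set()).add(t)

def associate(a, b):
    # undirected link: associate(a, b) == associate(b, a)
    tag_links.add(frozenset((a, b)))

apply_tag("doc1", "folksonomy")
apply_tag("doc1", "tagging")
# co-occurrence on doc1 is only implicit; the association makes it explicit:
associate("folksonomy", "tagging")
print(frozenset(("tagging", "folksonomy")) in tag_links)  # True
```

In the del.icio.us model only `item_tags` exists, and anything like `tag_links` has to be inferred, after the fact, from the aggregate of everyone's naming acts.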

The ‘tagmash’ reminds us that (pace David) everything is not miscellaneous: yes, we think in associations and we apply our own labels and classifying schemes to the world, but as we do so we’re also connecting A to B and treating D as a sub-type of C. When we talk, we don’t just spray names around; we’re always adding a bit of structure to the conversational cloud, making a few more connections. It’s the connections, not the nodes, that map out the shape of a cloud of knowing.

Update Changed post title. I spent a good five minutes this afternoon thinking of an appropriate lyrical reference (eventually settling on this one); I don’t know how I missed the obvious choice.

Alright, yeah

Stephen Lewis (via Dave) has a good and troubling post about the limits of the Web as a repository of knowledge.

while the web might theoretically have the potential of providing more shelf space than all libraries combined, in reality it is quite far from being as well stocked. Indeed, only a small portion of the world’s knowledge is available online. The danger is that as people come to believe that the web is the be-all and end-all source of information, the less they will consult or be willing to pay for the off-line materials that continue to comprise the bulk of the world’s knowledge, intellectual achievement, and cultural heritage. The outcome: the active base of knowledge used by students, experts, and ordinary people will shrink as a limited volume of information, mostly culled from older secondary sources, is recycled and recombined over and again online, leading to an intellectual dark-age of sorts. In this scenario, Wikipedia entries will continue to grow uncontrolled and unverified while specialized books, scholarly journals and the world’s treasure troves of still-barely-explored primary sources will gather dust. Present-day librarians, experts in the mining of information and the guidance of researchers, will disappear. Scholarly discourse will slow to a crawl while the rest of us leave our misconceptions unquestioned and the gaps in our knowledge unfilled.

The challenge is either – or both – to get more books, periodicals, and original source materials online or to prompt people to return to libraries while at the same time ensuring that libraries remain (or become) accessible. Both tasks are dauntingly expensive and, in the end, must be paid for, whether through taxes, grants, memberships, donations, or market-level or publicly-subsidized fees.

Lewis goes on to talk about the destruction of the National and University Library in Sarajevo, among other things. Read the whole thing.

But what particularly struck me was the first comment below the post.

I think you’re undervaluing the new primary sources going up online, and you’re undervaluing the new connections that are possible which parchment can’t compete with like this post I’m making to you. I definitely agree that there is a ton of great knowledge stored up in books and other offline sources, but people solve problems with the information they have, and in many communities – especially rural third world communities, offline sources are just as unreachable, if not more, than online sources.

This is a textbook example of how enthusiasts deal with criticism. (I’m not going to name the commenter, because I’m not picking on him personally.) It’s a reaction I’ve seen a lot in debates around Wikipedia, but I’m sure it goes back a lot further. I call it the “your criticism may be valid but” approach – it starts by formally conceding the criticism, thus avoiding the need to refute or even address it. Counter-arguments can then be deployed at will, giving the rhetorical effect of debate without necessarily addressing the original point. It’s a very persuasive style of argument. In this case there are three main strategies. The criticism may be valid…

I think you’re undervaluing the new primary sources going up online

but (#1) things are getting better all the time, and soon it won’t be valid any more! (This is a very common argument among ‘social software’ fans. Say something critical about Wikipedia on a public forum, then start your stopwatch. See also Charlie Stross’s ‘High Frontier’ megathread.)

you’re undervaluing the new connections that are possible which parchment can’t compete with like this post I’m making to you. … in many communities – especially rural third world communities, offline sources are just as unreachable, if not more, than online sources

but (#2) you’re just looking at the negatives and ignoring the positives, and that’s wrong! Look at the positives, never mind the negatives! (Also very common out on the Web 2.0 frontier.)

I definitely agree that there is a ton of great knowledge stored up in books and other offline sources, but people solve problems with the information they have

but (#3) …hey, we get by, don’t we? Does it really matter all that much?

I’m not a fan of Richard Rorty, but I believe that communities have conversations, and that knowledge lives in those conversations (even if some of them are very slow conversations that have been serialised to paper over the decades). I also believe that knowledge comes in domains, and that each domain follows the shape of the overall cloud of knowledge constituted by a conversation. But I’ve been in enough specialised communities (Unix geeks, criminologists, folk singers, journalists…) to know that there’s a wall of ignorance and indifference around each domain; there probably has to be, if we’re not to keel over from too much perspective. Your stuff, you know about and you know that you don’t know all that much; you know you’re not an expert. Their stuff, well, you know enough; you know all you need to know, and anyway how complicated can it be?

Enthusiasts are good people to have around; they hoard the knowledge and keep the conversation going, even when there’s a bit of a lull. The trouble is, they tend to keep the wall of ignorance and apathy in place while they’re doing it. The moral is, if your question is about something just outside a particular domain of knowledge, don’t ask an enthusiast – they’ll tell you there’s nothing there. (Or: there’s something there now, but it won’t be there for long. Or: there’s something there, but look at all the great stuff we’ve got here!)

A taxonomy of terror

I attended part of a very interesting conference on terrorism last week. The organisers intend to launch a network and a journal devoted to ‘critical terrorism studies’, a project which I strongly support. As the previous blog entry suggests, I’ve studied a bit of terrorism in my time – and I’m very much in favour of people being encouraged to approach the phenomenon critically, which is to say without necessarily endorsing the definitions and interpretive frameworks offered by official sources.

However, it seems to me that the nature of the object of study still needs to be defined – and defined at once more precisely and more loosely. In other words, I don’t believe there’s much common ground between someone who thinks of terrorism in terms of gathering intelligence on the IRA, and someone who maintains that George W. Bush is a bigger terrorist than Osama bin Laden; I don’t think it’s particularly productive to try to find common ground between those two images of terrorism, or to simply allow them to coexist without defining the differences between them. On the other hand, I don’t see much mileage in a ‘purist’ Terrorism Studies which would focus solely on groups akin to the IRA – or in an alternative purism which would concentrate on terror attacks by Western governments.

A third approach offers to resolve the gap between these two – although I should say straight away that I don’t believe it does so. This approach is that of terrorism as an object of discourse: what is under analysis is not so much an identifiable set of actions, or types of action, as the texts and utterances which purport to analyse and describe terrorism. The effect is to turn the analytical gaze back on the governmental discourse of terrorism, which in turn makes it possible to contrast the official image of the terrorist threat with data from other sources; an interesting example of this approach in practice is Richard Jackson’s paper Religion, Politics and Terrorism: A Critical Analysis of Narratives of “Islamic Terrorism” (DOC file available from here).

I think this is a powerful and constructive approach – my own thesis (as yet unpublished) includes some quite similar work on Italian left-wing armed groups of the 1970s, whose presentation in both the mainstream and the Communist press was heavily shaped by differing ideological assumptions. But I think it should be recognised that it’s an approach of a different order from the other two. To combine them would be to mix ontological and epistemological arguments – to say, in other words, That’s what is officially labelled terrorism, but this is real terrorism. (Or: That’s what they call terrorism, but this is what we know to be the reality of terrorism.) The problem with this is that it implies a commitment to a particular idea of real terrorism, without actually suggesting a candidate. At best, this formulation frees the analyst to retain his or her prior commitments, bolstered with added ontological certitude. At worst, it suggests that real terrorism is the inverse of officially labelled terrorism – or at least that there is no possible overlap between officially labelled terrorism and real terrorism. This is surely inadequate: a critical approach should be able to do more with the official version than simply reverse it.

I believe that the study of terrorism must include all of these elements, and recognise that they may overlap but don’t coincide. In other words, it must include the following:

  1. Organised political violence by non-state actors: ‘terrorism’ as a political intervention (call it T1)
  2. Indiscriminate large-scale attacks on civilians: terror as a tactic, in warfare or otherwise (T2)
  3. The constructed antagonist of the War on Terror: ‘Terrorism’ as object of discourse (T3)

We can think of it as a three-circle Venn diagram, with areas of intersection between each pair of circles and a triple intersection in the middle.

What is immediately apparent about this list is how little of the field of terrorism falls into all three categories. The (white) triple intersect – mass killing of civilians by a non-state political actor, officially labelled (and denounced) as terrorism – is represented by a relatively small number of horrific events, chief among them September 11th. By contrast, much of what students of terrorism – myself included – would like to be able to look at under that name falls into only two categories, or even one. The (red) intersect of T1 and T3, most obviously, is represented by those acts by armed groups which are officially denounced but don’t involve mass killing of civilians: the ‘execution’ of Aldo Moro and the IRA’s Brighton bomb, for example. The use of terror tactics by non-governmental death squads, such as the Nicaraguan Contras and the Salvadorean ORDEN militia, falls into the blue intersect of T1 and T2. The use of state terror by official enemies and ‘rogue states’ – such as the Syrian Hama massacre or Saddam Hussein’s gassing of the people of Halabja – falls into the green intersect of T2 and T3. And this is without considering all those activities which fall into only one category: T1 (magenta) alone, activities by armed groups which fall below the radar of the discourse of ‘terrorism’ (a large and interesting category); T2 (cyan) alone, terror tactics used by states and not denounced as terrorism; and T3 (yellow) alone, officially-denounced ‘terrorism’ which involves neither an organised armed group nor a mass attack on civilians.
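The seven regions of the diagram can be written down as a simple lookup: classify an event by which of T1/T2/T3 apply, and read off the colour used above. (The example events are the post's own; the function names are mine.)

```python
# The three-circle taxonomy as a lookup table: which of T1/T2/T3 apply,
# mapped to the colour of the corresponding Venn region in the post.
REGIONS = {
    frozenset("123"): "white",    # all three: e.g. September 11th
    frozenset("13"):  "red",      # Moro 'execution', Brighton bomb
    frozenset("12"):  "blue",     # Contras, ORDEN
    frozenset("23"):  "green",    # Hama, Halabja
    frozenset("1"):   "magenta",  # armed groups below the 'terrorism' radar
    frozenset("2"):   "cyan",     # state terror not denounced as terrorism
    frozenset("3"):   "yellow",   # labelled 'terrorism', neither T1 nor T2
}

def region(t1: bool, t2: bool, t3: bool) -> str:
    key = frozenset(c for c, flag in zip("123", (t1, t2, t3)) if flag)
    return REGIONS.get(key, "outside")

print(region(True, False, True))   # Brighton bomb -> 'red'
```

Seven populated regions, one empty set: nothing about the scheme forces an event into the triple intersection, which is precisely the point.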

I don’t, myself, see any problem with studying all three of these categories – or rather, all seven. I hope the remit of the new Critical Terrorism Studies is broad enough to encompass all of these without imposing an artificial unity on them. Paramilitary fundraising in Northern Ireland cannot be studied in the same way as the attack on Fallujah or press reporting of the ‘ricin plot’; each of these deserves to be studied, however, and the different approaches appropriate to studying them can only strengthen the field.

The people with the answers

Nick:

Larry Sanger, the controversial online encyclopedia’s cofounder and leading apostate, announced yesterday, at a conference in Berlin, that he is spearheading the launch of a competitor to Wikipedia called The Citizendium. Sanger describes it as “an experimental new wiki project that combines public participation with gentle expert guidance.” The Citizendium will begin as a “fork” of Wikipedia, taking all of Wikipedia’s current articles and then editing them under a new model that differs substantially from the model used by what Sanger calls the “arguably dysfunctional” Wikipedia community. “First,” says Sanger, in explaining the primary differences, “the project will invite experts to serve as editors, who will be able to make content decisions in their areas of specialization, but otherwise working shoulder-to-shoulder with ordinary authors. Second, the project will require that contributors be logged in under their own real names, and work according to a community charter. Third, the project will halt and actually reverse some of the ‘feature creep’ that has developed in Wikipedia.”

I’ve been thinking about Wikipedia, and about what makes a bad Wikipedia article so bad, for some time – this March 2005 post took off from some earlier remarks by Larry Sanger. I’m not attempting to pass judgment on Wikipedia as a whole – there are plenty of good Wikipedia articles out there, and some of them are very good indeed. But some of them are bad. Picking on an old favourite of mine, here’s the first paragraph of the Wikipedia article on the Red Brigades, with my comments.

The Red Brigades (Brigate Rosse in Italian, often abbreviated as BR) are

The word is ‘were’. The BR dissolved in 1981; its last successor group gave up the ghost in 1988. There’s a small and highly violent group out there somewhere which calls itself “Nuove Brigate Rosse” – the New Red Brigades – but its continuity with the original BR is zero. This is a significant disagreement, to put it mildly.

a militant leftist group located in Italy. Formed in 1970, the Marxist Red Brigades

‘Marxist’ is a bizarre choice of epithet. Most of the Italian radical left was Marxist, and almost all of it declined to follow the BR’s lead. Come to that, the Italian Communist Party (one of the BR’s staunchest enemies) was Marxist. Terry Eagleton’s a Marxist; Jeremy Hardy’s a Marxist; I’m a Marxist myself, pretty much. The BR had a highly unusual set of political beliefs, somewhere between Maoism, old-school Stalinism and pro-Tupamaro insurrectionism. ‘Maoist’ would do for a one-word summary. ‘Marxist’ is both over-broad and misleading.

sought to create a revolutionary state through armed struggle

Well, yes. And no. I mean, I don’t think it’s possible to make any sense of the BR without acknowledging that, while they did have a famous slogan about portare l’attacco al cuore dello stato (‘attacking at the heart of the state’), their anti-state actions were only a fairly small element of what they did. To begin with they were a factory-based group, who took action against foremen and personnel managers; in their later years – which were also their peak years – the BR, like other armed groups, got drawn into what was effectively a vendetta with the police, prioritising revenge attacks over any kind of ‘revolutionary’ programme. You could say that the BR were a revolutionary organisation & consequently had a revolutionary programme throughout, even if their actions didn’t always match it – but how useful would this be?

and to separate Italy from the Western Alliance

Whoa. I don’t think the BR were particularly in favour of Italy’s NATO membership, but the idea that this was one of their key goals is absurd. If the BR had been a catspaw for the KGB, intent on fomenting subversion so as to destabilise Italy, then this probably would have been high on their list. But they weren’t, and it wasn’t.

In 1978, they kidnapped and killed former Prime Minister Aldo Moro under obscure circumstances.

Remarkably well-documented circumstances, I’d have said.

After 1984′s scission

This is just wrong – following growing and unresolvable factionalism, the BR formally dissolved in October 1981.

Red Brigades managed with difficulty to survive the official end of the Cold War in 1989

This is both confused and wrong. Given that there was a split, how would the BR have survived beyond 1981 (or 1984), let alone 1989? As for the BR’s successor groups, the last one to pack it in was last heard from in 1988.

even though it is now a fragile group with no original members.

Or rather, even though the name is now used by a small group about which very little is known, but which is not believed to have any connection to the original group (whose members are after all knocking on a bit by now).

Throughout the 1970’s the Red Brigades were credited with 14,000 acts of violence.

Good grief. Credited by whom? According to the sources I’ve seen, between 1970 and 1981 Italian armed struggle groups were responsible for a total of 3,258 actions, including 110 killings; the BR’s share of the total came to 472 actions, including 58 killings. (Most ‘actions’ consisted of criminal damage and did not involve personal violence.) I’d be the first to admit that the precision of these figures is almost certainly spurious, but even if we doubled that figure of 472 we’d be an awful long way short of 14,000.

I’m not even going to look at the body of the article.

I think there are two main problems here; the good news is that Larry’s proposals for the neo-Wikipedia (Nupedia? maybe not) would address both of them.

Firstly, first mover advantage. The structure of Wikipedia creates an odd imbalance between writers and editors. Writing a new article is easy: the writer can use whatever framework he or she chooses, in terms both of categories used to structure the entry and of the overall argument of the piece. Making minor edits to an article is easy: mutter 1984? no way, it was 1981!, log on, a bit of typing and it’s done. But making major edits is hard – you can see from the comments above just how much work would be needed to make that BR article acceptable, starting from what’s there now. It would literally be easier to write a new article. What’s more, making edits stick is hard; I deleted one particularly ignorant falsehood from the BR article myself a few months ago, only to find my edit reverted the next day. (Of course, I re-reverted it. So there!)

Larry’s suggestion of getting experts on board is very much to the point here. Slap my face and call me a credentialled academic, but I don’t believe that everyone is equally qualified to write an encyclopedia article about their favourite topic – and I do think it matters who gets the first go.

Secondly, gaming the system. Wikipedia is a community as well as an encyclopedia. I’ll pass over Larry’s suggestion that Wikipedia is dysfunctional as a community, but I do think it’s arguable that some behaviours which work well for Wikipedia-the-community are dysfunctional for Wikipedia-the-resource. It’s been suggested, for instance, that what really makes Wikipedia special is the ‘history’ pages, which take the lid off the debate behind the encyclopedia and let us see knowledge in the process of formation. It follows from this that to show the world a single, ‘definitive’ version of an article on a subject would actually be a step backwards: The discussion tab on Wikipedia is a great place to point to your favorite version … Does the world need a Wikipedia for stick-in-the-muds? W. A. Gerrard objects:

Of what value is publicly documenting the change history of an encyclopedia entry? How can something that purports to be authoritative allow the creation of alternative versions which readers can adopt as favorites? If an attempt to craft a wiki that strives for accuracy, even via a flawed model, is considered something for “stick-in-the-muds”, then it’s apparent that many of Wikipedia’s supporters value the dynamics of its community more than the credibility of the product they deliver.

I think this is exactly right: the history pages are worth much more to members of the Wikipedia community than to Wikipedia users. People like to form communities and communities like to chat – and edits and votes are the currency of Wikipedia chat. And gaming the system is fun (hence the word ‘game’). Aaron Swartz quotes comments about Wikipedia regulars who delete your newly[-]create[d] article without hesitation, or revert your changes and accuse you of vandalis[m] without even checking the changes you made, or who “edited” thousands of articles … [mostly] to remove material that they found unsuitable. This clearly suggests the emergence of behaviours which are driven more by social expectations than by a concern for Wikipedia. The second writer quoted above continues: Indeed, some of the people-history pages contained little “awards” that people gave each other — for removing content from Wikipedia.

Now, all systems can be gamed, and all communities chat. The question is whether the chatting and the gaming can be harnessed for the good of the encyclopedia – or, failing that, minimised. I’m not optimistic about the first possibility, and I suspect Larry Sanger isn’t either. Larry does, however, suggest a very simple hack which would help with the second: get everyone to use their real name. This would, among other things, make it obvious when a writer had authority in a given area. I don’t entirely agree with Aaron’s conclusion:

Larry Sanger famously suggested that Wikipedia must jettison its anti-elitism so that experts could feel more comfortable contributing. I think the real solution is the opposite: Wikipedians must jettison their elitism and welcome the newbie masses as genuine contributors to the project, as people to respect, not filter out.

This is half right: Wikipedia-the-community has produced an elite of ‘regulars’, whose influence over Wikipedia-the-resource derives from their standing in the community rather than from any kind of claim to expertise. I agree with Aaron that this is an unhealthy situation, but I think Larry was right as well. The artificial elitism of the Wikipedia community doesn’t only marginalise the ‘masses’ who contribute most of the original content; it also sidelines the subject-area experts who, within certain limited domains, have a genuine claim to be regarded as an elite.

I don’t know if the Citizendium is going to address these problems in practice; I don’t know if the Citizendium is going anywhere full stop. But I think Larry Sanger is asking the right questions. It’s increasingly clear that Wikipedia isn’t just facing in two directions at once, it’s actually two different things – and what’s good for Wikipedia-the-community isn’t necessarily good for Wikipedia-the-resource.

So much that hides

Alex points to this piece by Rashmi Sinha on ‘Findability with tags’: the vexed question of using tags to find the material that you’ve tagged, rather than as an elaborate way of building a mind-map.

I should stress, parenthetically, that that last bit wasn’t meant as a putdown – it actually describes my own use of Simpy. I regularly tag pages, but almost never use tags to actually retrieve them. Sometimes – quite rarely – I do pull up all the pages I’ve tagged with a generic “write something about this” tag. Apart from that, I only ever ask Simpy two questions: one is “what was that page I tagged the other day?” (for which, obviously, meaningful tags aren’t required); the other is “what does my tag cloud look like?”.

Now, you could say that the answer to the second question isn’t strictly speaking information; it’s certainly not information I use, unless you count the time I spend grooming the cloud by splitting, merging and deleting stray tags. I like tag clouds and don’t agree with Jeffrey Zeldman’s anathema, but I do agree with Alex that they’re not the last word in retrieving information from tags. Which is where Rashmi’s article comes in.

Rashmi identifies three ways of layering additional information on top of the basic item/tag pairing, all of which hinge on partitioning the tag universe in different ways. This is most obvious in the case of faceted tagging: here, the field of information is partitioned before any tags are applied. Rashmi cites the familiar example of wine, where a ‘region’ tag would carry a different kind of information from ‘grape variety’, ‘price’ or for that matter ‘taste’. Similar distinctions can be made in other areas: a news story tagged ‘New Labour’, ‘racism’ and ‘to blog about’ is implicitly carrying information in the domains ‘subject (political philosophy)’, ‘subject (social issue)’ and ‘action to take’.

There are two related problems here. A unique tag, in this model, can only exist within one dimension: if I want separate tags for New Labour (the people) and New Labour (the philosophy), I’ll either have to make an artificial distinction between the two (New_Labour vs New_Labour_philosophy) or add a dimension layer to my tags (political_party.New_Labour vs political_philosophy.New_Labour). Both solutions are pretty horrible. More broadly, you can’t invoke a taxonomist’s standby like the wine example without setting folksonomic backs up, and with some reason: part of the appeal of tagging is precisely that you start with a blank sheet and let the domains of knowledge emerge as they may.
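For what it's worth, the dimension-layer workaround is easy enough to mimic in code: instead of a flat string, each tag becomes a (facet, value) pair, and the facet does the disambiguating. A sketch, with invented facet names:

```python
from collections import defaultdict

# A flat tag is just a string; a faceted tag carries its dimension with it.
index = defaultdict(set)               # (facet, value) -> tagged items

def add_tag(item, facet, value):
    index[(facet, value)].add(item)

add_tag("news-story-1", "political_party", "New Labour")
add_tag("news-story-1", "social_issue", "racism")
add_tag("essay-2", "political_philosophy", "New Labour")

# The same value now lives in two dimensions: the facet, not an
# artificial suffix, does the disambiguating.
party_items = index[("political_party", "New Labour")]
```

It works, but it quietly reintroduces taxonomy: somebody has to decide the facet names up front, which is exactly the blank-sheet objection.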

Clustered tagging (a new one on me) addresses both of these problems, as well as answering the much-evaded question of how those domains are supposed to emerge. A tag cluster – as seen on Flickr – consists of a group of tags which consistently appear together, suggesting an implicit ‘domain’. Crucially, a single tag can occur in multiple clusters. The clusters for the Flickr ‘election’ tag, for example, are easy to interpret:

vote, politics, kerry, bush, voting, ballot, poster, cameraphone, democrat, president

wahl, germany, deutschland, berlin, cdu, spd, bundestagswahl

canada, ndp, liberal, toronto, jacklayton, federalelection

and, rather anticlimactically,

england, uk

Clustering, I’d argue, represents a pretty good stab at building emergent domains. The downside is that it only becomes possible when there are huge numbers of tagging operations.
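As a rough illustration of how those emergent domains might be computed, here's a naive sketch: treat two tags as related when they appear on enough of the same items (Jaccard similarity over the sets of items they tag), then take the connected components. This is a crude stand-in of my own devising, not a description of what Flickr actually does:

```python
from itertools import combinations
from collections import defaultdict

def clusters(taggings, threshold=0.5):
    """Group tags that consistently co-occur. `taggings` is one tag-set
    per tagged item; two tags join the same cluster when their
    co-occurrence rate (Jaccard) clears the threshold."""
    items = defaultdict(set)           # tag -> indices of items carrying it
    for i, tags in enumerate(taggings):
        for t in tags:
            items[t].add(i)
    parent = {t: t for t in items}     # union-find, minus the refinements
    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t
    for a, b in combinations(list(items), 2):
        jaccard = len(items[a] & items[b]) / len(items[a] | items[b])
        if jaccard >= threshold:
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for t in items:
        groups[find(t)].add(t)
    return sorted(groups.values(), key=len, reverse=True)

# A toy version of the election data
result = clusters([
    {"vote", "politics", "kerry"}, {"vote", "politics", "bush"},
    {"wahl", "germany", "spd"},    {"wahl", "germany", "cdu"},
])
```

On the toy data this separates a US cluster from a German one; at Flickr scale you would obviously need something less quadratic, which is one more way of making the point that clustering only pays off with huge volumes of tagging.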

The third enhancement to tagging Rashmi describes is the use of tags as pivots:

When everything (tag, username, number of people who have bookmarked an item) is a link, you can use any of those links to look around you. You can change direction at any moment.

Lurking behind this, I think, is Thomas‘s original tripartite definition of ‘folksonomy’:

the three needed data points in a folksonomy tool [are]: 1) the person tagging; 2) the object being tagged as its own entity; and 3) the tag being used on that object. Flattening the three layers in a tool in any way makes that tool far less valuable for finding information. But keeping the three data elements you can use two of the elements to find a third element, which has value. If you know the object (in del.icio.us it is the web page being tagged) and the tag you can find other individuals who use the same tag on that object, which may lead (if a little more investigation) to somebody who has the same interest and vocabulary as you do. That person can become a filter for items on which they use that tag.

This, I think, is pivoting in action: from the object and its tags, to the person tagging and the tags they use, to the person using particular tags and the objects they tag. (There’s a more concrete description here.)
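Pivoting falls out almost for free once you store Thomas's three data points as bare triples; fix any two of the three and you recover the third. A sketch (the data is invented):

```python
# Thomas's three data points as plain (person, object, tag) triples.
triples = [
    ("me",    "example.com/page",  "folksonomy"),
    ("alice", "example.com/page",  "folksonomy"),
    ("alice", "example.com/other", "folksonomy"),
    ("bob",   "example.com/page",  "tagging"),
]

def pivot(triples, person=None, obj=None, tag=None):
    """Filter on any subset of the three elements: the 'change
    direction at any moment' move."""
    return {(p, o, t) for p, o, t in triples
            if (person is None or p == person)
            and (obj is None or o == obj)
            and (tag is None or t == tag)}

# Who else tagged this page 'folksonomy'? A potential human filter...
same_taste = {p for p, o, t in pivot(triples, obj="example.com/page",
                                     tag="folksonomy")}
# ...then pivot again: everything alice has tagged 'folksonomy'
alices_items = {o for p, o, t in pivot(triples, person="alice",
                                       tag="folksonomy")}
```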

Alex suggests that using tags as pivots could also be considered a subset of faceted browsing. I’d go further, and suggest that facets, clusters and pivots are all subsets of a larger set of solutions, which we can call domain-based tagging. If you use facets, the domains are imposed: this approach is a good fit to relatively closed domains of knowledge and finite groups of taggers. If you’ve got an epistemological blank sheet and a limitless supply of taggers, you can allow the domains to emerge: this is where clusters come into their own. And if what you’re primarily interested in is people – and, specifically, who‘s saying what about what – then you don’t want multiple content-based domains but only the information which derives directly from human activity: the objects and their taggers. Or rather, you want the objects and the taggers, plus the ability to pivot into a kind of multi-dimensional space: instead of tags existing within domains, each tag is a domain in its own right, and what you can find within each tag-domain is the objects and their taggers.

What all of this suggests is that, unsurprisingly, there is no ‘one size fits all’ solution. I suggested some time ago that

If ‘cloudiness’ is a universal condition, del.icio.us and Flickr and tag clouds and so forth don’t enable us to do anything new; what they are giving us is a live demonstration of how the social mind works.

All knowledge is cloudy; all knowledge is constructed through conversation; conversation is a way of dealing with cloudiness and building usable clouds; social software lets us see knowledge clouds form in real time. I think that’s fine as far as it goes; what it doesn’t say is that, as well as having conversations about different things, we’re having different kinds of conversations and dealing with the cloud of knowing in different ways. Ontology is not, necessarily, overrated; neither is folksonomy.

Living in the thick of it

Chris and Rob have been finding different kinds of fault in the classic left/right political spectrum: Chris prefers two criteria which (he argues) are more or less orthogonal (pro- and anti-state, pro- and anti-poor people), while Rob opts for ‘conservative’ and ‘liberal’ as fundamental alternatives.

The trouble with all these discussions is that so many different oppositions end up being overlaid. In comments on Chris’s post, for example, Tim Worstall makes a pretty good fist of locating himself on the Left. Speaking as a Marxist, I’m not fooled for a minute – but I have to admit that I often feel closer to the Worstall Right than to the Euston Manifesto Left.

I gave some thought to this stuff some time ago, in an attempt to work out why I counted at least one Tory among my trusted friends while finding many genuine socialists hard to be around. I dismissed the thought that I was moving Right with age, partly because it was uncomfortable and partly because I knew that my position on Chris’s rich-or-poor scale hadn’t budged; I don’t think there are many right-wingers who enjoy singing along to “The Blackleg Miner“, put it that way. I also dismissed the thought that the difference between my Tory friend and my irritating socialist acquaintances was that the former was a thoughtful and intelligent bloke; there was no a priori reason for this exclusion, you understand, it was just a bit too obvious.

Anyway, what I came up with was a two-part scale, covering both your views on human nature and your views on political change (the greatest flaw of Rob’s liberal/conservative scale, in my view, is that it tends to conflate these). Each of these two breaks down into two elements, giving a total of sixteen distinct positions. Where human nature is concerned, we look at whether people should be controlled or liberated and at who should be doing the controlling or liberating. As for political change, we ask both whether we believe change should be welcomed or resisted and how we relate this change to the present.


Human nature first. The most fundamental question: are people good or bad? In other words, if left to themselves would people destroy social order or create a new and better society? For this part of the scale I’ll borrow from Church history.

An Augustinian believes that, ultimately, people are sinful; politics is, or should be, concerned with establishing laws and institutions which enable sinful people to coexist without tearing one another apart.

A Pelagian believes that, ultimately, people are good; politics is, or should be, concerned with enabling people to work together, play together and generally enjoy life in ways which have hitherto not been possible.

Now for the location of control or liberation: central or local? government or community? ruler or family?

A Jacobin believes that all politics worthy of the name happens in government; left to their own devices, communities tend to stagnate or run wild

A Digger believes that politics happens in affective communities and in everyday life; left to government, politics becomes managerial and sterile

An Augustinian Jacobin is an Authoritarian: people need to be governed, and who better to govern than the government?
An Augustinian Digger is a Communitarian: what we want isn’t law-abiding individuals but communities of respect
A Pelagian Jacobin is a Liberal: the government can help people realise their potential, either by freeing them from oppressive conditions or simply by getting out of the way
A Pelagian Digger is a Hippie (sorry Paul): isn’t it great when people get together and do stuff, without waiting for politicians to tell them what to do?

A Liberal is the opposite of a Communitarian; an Authoritarian is the opposite of a Hippie.

Now for attitudes to political change.

A Whig believes that change should, all things being equal, be embraced: that the risk of regression and lost opportunities is greater than the risk that change will destroy something worth preserving

A Tory believes that change should, all things being equal, be resisted: that the risk of losing valuable cultural and political resources outweighs the risk of failing to grasp opportunities for progress

Finally, let’s look at how change relates to the present. For this part of the act I’ll need a volunteer from the history of Western philosophy; specifically, G.W.F. Hegel. Hegel believed that historical change had an immanent meliorist teleology – in other words, that things were getting better and better, and would eventually reach a point where they couldn’t get any better. He also believed that this point had in fact been reached (cf. Francis Fukuyama, who rather amusingly trotted out precisely the same argument the best part of two centuries down the line). Marx adopted the Hegelian framework, but with the crucial modification of placing the end of history on the far side of a future revolution. We can call these two positions Right-Hegelianism and Left-Hegelianism.

A Right-Hegelian believes that, to the extent that it makes sense to talk of a good society, the good society is an extension of trends which have a visible and increasingly dominant influence on society as it is now

A Left-Hegelian believes that it emphatically does make sense to talk of a good society, and that such a society will in important senses require the reversal or overthrow of society as it is now

A Right-Hegelian Whig is a Reformer: things have changed, things will continue to change, there has been progress and there will be more progress

A Right-Hegelian Tory is a Conservative: our existing institutions are valuable and should not be put at risk for the sake of speculative benefits

A Left-Hegelian Whig is a Revolutionary: things could be much better, and things can be much better if we push a bit harder

A Left-Hegelian Tory is a Historian: things could be much better, but our main task is to keep alive the resources of that hope

The opposite of a Revolutionary is a Conservative.
The opposite of a Reformer is a Historian.

Liberal, Authoritarian, Communitarian, Hippie; Conservative, Reformer, Revolutionary, Historian. That gives us a total of sixteen hats to try on, and to fit to our various political rivals. See how you get on.

Me, I’m PDLT, a Hippie Historian (who’d have thought it?); this makes me the polar opposite of an AJRW, an Authoritarian Reformer. (Like, for instance, Charles Clarke.) Works for me.

I have spotted one potential weakness of this scale. It gets in most of the points made by Rob, Chris and their commenters, including Matt and Tim, but with one obvious gap: Chris’s rich/poor scale, which (as I’ve said) is fairly fundamental to my own sense of political identity. Can this be fitted into the model, and if so where? Or is this a different kind of question?

Update 30th April

Jamie, the only other Hippie Historian to have surfaced so far (if anyone can think of a better label than ‘Hippie’ for the Pelagian/Digger combination, by the way, I’ll be all ears), writes

I’m also, incidentally, mildly annoyed at having to qualify libertarian with left wing. Hayekianism is not a libertarian doctrine.

I think this is an important point & goes some way to addressing my point about the rich/poor axis, just above. Consider: if I believe in freedom of action, I must necessarily believe in freedom of action for everyone, to be curtailed only by provisions which have a similarly universal reach. But equality of opportunity and constraint for rich and poor is no equality at all – in Anatole France’s formulation, The law, in its majestic equality, forbids the rich as well as the poor to sleep under bridges, to beg in the streets and to steal bread. Inequalities of wealth are, in effect, inequalities of constraint and opportunity; any consistent libertarianism would begin by establishing whether these inequalities follow any consistent pattern, and would oppose them if so. The alternative would be to take the current distribution of wealth and power (and hence of effective liberty) as given, and accept it as a more-or-less immutable starting-point. I don’t understand why anyone would do that – but then, I’m a Left-Hegelian (see also my posts on Euston).

Cloudbuilding (3)

By way of background to this post – and because I think it’s quite interesting in itself – here’s a short paper I gave last year at this conference (great company, shame about the catering). It was co-written with my colleagues Judith Aldridge and Karen Clarke. I don’t stand by everything in it – as I’ve got deeper into the project I’ve moved further away from Clay’s scepticism and closer towards people like Carole Goble and Keith Cole – but I think it still sets out an argument worth having.

Mind the gap: Metadata in e-social science

1. Towards the final turtle

It’s said that Bertrand Russell once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the centre of our galaxy. At the end of the lecture, a little old lady at the back of the room got up and said: “What you have told us is rubbish. The world is really a flat plate supported on the back of a giant tortoise.”

Russell smiled and replied, “What is the tortoise standing on?”

“You’re very clever, young man, very clever,” said the old lady. “But it’s turtles all the way down.”

The Russell story is emblematic of the logical fallacy of infinite regress: proposing an explanation which is just as much in need of explanation as the original fact being explained. The solution, for philosophers (and astronomers), is to find a foundation on which the entire argument can be built: a body of known facts, or a set of acceptable assumptions, from which the argument can follow.

But what if infinite regress is a problem for people who want to build systems as well as arguments? What if we find we’re dealing with a tower of turtles, not when we’re working backwards to a foundation, but when we’re working forwards to a solution?

WSDL [Web Services Description Language] lets a provider describe a service in XML [Extensible Markup Language]. [...] to get a particular provider’s WSDL document, you must know where to find them. Enter another layer in the stack, Universal Description, Discovery, and Integration (UDDI), which is meant to aggregate WSDL documents. But UDDI does nothing more than register existing capabilities [...] there is no guarantee that an entity looking for a Web Service will be able to specify its needs clearly enough that its inquiry will match the descriptions in the UDDI database. Even the UDDI layer does not ensure that the two parties are in sync. Shared context has to come from somewhere, it can’t simply be defined into existence. [...] This attempt to define the problem at successively higher layers is doomed to fail because it’s turtles all the way up: there will always be another layer above whatever can be described, a layer which contains the ambiguity of two-party communication that can never be entirely defined away. No matter how carefully a language is described, the range of askable questions and offerable answers make it impossible to create an ontology that’s at once rich enough to express even a large subset of possible interests while also being restricted enough to ensure interoperability between any two arbitrary parties.
(Clay Shirky)

Clay Shirky is a longstanding critic of the Semantic Web project, an initiative which aims to extend Web technology to encompass machine-readable semantic content. The ultimate goal is the codification of meaning, to the point where understanding can be automated. In commercial terms, this suggests software agents capable of conducting a transaction with all the flexibility of a human being. In terms of research, it offers the prospect of a search engine which understands the searches it is asked to run and is capable of pulling in further relevant material unprompted.

This type of development is fundamental to e-social science: a set of initiatives aiming to enable social scientists to access large and widely-distributed databases using ‘grid computing’ techniques.

A Computational Grid performs the illusion of a single virtual computer, created and maintained dynamically in the absence of predetermined service agreements or centralised control. A Data Grid performs the illusion of a single virtual database. Hence, a Knowledge Grid should perform the illusion of a single virtual knowledge base to better enable computers and people to work in cooperation.
(Keith Cole et al)

Is Shirky’s final turtle a valid critique of the visions of the Semantic Web and the Knowledge Grid? Alternatively, is the final turtle really a Babel fish — an instantaneous universal translator — and hence (excuse the mixed metaphors) a straw person: is Shirky setting the bar impossibly high, posing goals which no ‘semantic’ project could ever achieve? To answer these questions, it’s worth reviewing the promise of automated semantic processing, and setting this in the broader context of programming and rule-governed behaviour.

2. Words and rules

We can identify five levels of rule-governed behaviour. In rule-driven behaviour, firstly, ‘everything that is not compulsory is forbidden’: the only actions which can be taken are those dictated by a rule. In practice, this means that instructions must be framed in precise and non-contradictory terms, with thresholds and limits explicitly laid down to cover all situations which can be anticipated. This is the type of behaviour represented by conventional task-oriented computer programming.

A higher level of autonomy is given by rule-bound behaviour: rules must be followed, but there is some latitude in how they are applied. A set of discrete and potentially contradictory rules is applied to whatever situation is encountered. Higher-order rules or instructions are used to determine the relative priority of different rules and resolve any contradiction.

Rule-modifying behaviour builds on this level of autonomy, by making it possible to ‘learn’ how and when different rules should be applied. In practice, this means that priority between different rules is decided using relative weightings rather than absolute definitions, and that these weightings can be modified over time, depending on the quality of the results obtained. Neither rule-bound nor rule-modifying behaviour poses any fundamental problems in terms of automation.
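Those first levels are straightforward to make concrete. In the sketch below, the rule-bound step is a weighted choice between two deliberately contradictory rules, and the rule-modifying step is the feedback that adjusts the weights afterwards (rule names, weights and the learning rate are all illustrative):

```python
def choose(rules, situation):
    """Rule-bound step: of the rules whose condition matches the
    situation, apply the one with the highest current weight."""
    matching = [r for r in rules if r["when"](situation)]
    return max(matching, key=lambda r: r["weight"]) if matching else None

def feedback(rule, success, lr=0.2):
    """Rule-modifying step: nudge a rule's weight up or down according
    to how well its last application worked out."""
    rule["weight"] += lr if success else -lr

# Two contradictory rules; relative weight, not an absolute definition,
# decides between them.
rules = [
    {"name": "block", "weight": 0.5, "when": lambda s: s["risk"] > 0.3},
    {"name": "allow", "weight": 0.6, "when": lambda s: True},
]
picked = choose(rules, {"risk": 0.8})    # 'allow' wins on weight alone
feedback(picked, success=False)          # ...and is penalised for it
repicked = choose(rules, {"risk": 0.8})  # now 'block' outweighs it
```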

Rule-discovering behaviour, in addition, allows the existing body of rules to be extended in the light of previously unknown regularities which are encountered in practice (“it turns out that many Xs are also Y; when looking for Xs, it is appropriate to extend the search to include Ys”). This level of autonomy — combining rule observance with reflexive feedback — is fairly difficult to envisage in the context of artificial intelligence, but not impossible.

The level of autonomy assumed by human agents, however, is still higher, consisting of rule-interpreting behaviour. Rule-discovery allows us to develop an internalised body of rules which corresponds ever more closely to the shape of the data surrounding us. Rule-interpreting behaviour, however, enables us to continually and provisionally reshape that body of rules, highlighting or downgrading particular rules according to the demands of different situations. This is the type of behaviour which tells us whether a ban is worth challenging, whether a sales pitch is to be taken literally, whether a supplier is worth doing business with, whether a survey’s results are likely to be useful to us. This, in short, is the level of Shirky’s situational “shared context” — and of the final turtle.

We believe that there is a genuine semantic gap between the visions of Semantic Web advocates and the most basic applications of rule-interpreting human intelligence. Situational information is always local, experiential and contingent; consequently, the data of the social sciences require interpretation as well as measurement. Any purely technical solution to the problem of matching one body of social data to another is liable to suppress or exclude much of the information which makes it valuable.

We cannot endorse comments from e-social science advocates such as this:

variable A and variable B might both be tagged as indicating the sex of the respondent where sex of the respondent is a well defined concept in a separate classification. If Grid-hosted datasets were to be tagged according to an agreed classification of social science concepts this would make the identification of comparable resources extremely easy.
(Keith Cole et al)

Or this:

work has been undertaken to assert the meaning of Web resources in a common data model (RDF) using consensually agreed ontologies expressed in a common language [...] Efforts have concentrated on the languages and software infrastructure needed for the metadata and ontologies, and these technologies are ready to be adopted.
(Carole Goble and David de Roure; emphasis added)

Statements like these suggest that semantics are being treated as a technical or administrative matter, rather than a problem in its own right; in short, that meaning is being treated as an add-on.

3. Google with Craig

To clarify these reservations, let’s look at a ‘semantic’ success story.

The service, called “Craigslist-GoogleMaps combo site” by its creator, Paul Rademacher, marries the innovative Google Maps interface with the classifieds of Craigslist to produce what is an amazing look into the properties available for rent or purchase in a given area. [...] This is the future….this is exactly the type of thing that the Semantic Web promised
(Joshua Porter)

‘This’ is an application which calculates the location of properties advertised on the ‘Craigslist’ site and then displays them on a map generated from Google Maps. In other words, it takes two sources of public-domain information and matches them up, automatically and reliably.

That’s certainly intelligent. But it’s also highly specialised, and there are reasons to be sceptical about how far this approach can be generalised. On one hand, the geographical base of the application obviates the issue of granularity. Granularity is the question of the ‘level’ at which an observation is taken: a town, an age cohort, a household, a family, an individual? a longitudinal study, a series of observations, a single survey? These issues are less problematic in a geographical context: in geography, nobody asks what the meaning of ‘is’ is. A parliamentary constituency; a census enumeration district; a health authority area; the distribution area of a free newspaper; a parliamentary constituency (1832 boundaries) — these are different ways of defining space, but they are all reducible to a collection of identifiable physical locations. Matching one to another, as in the CONVERTGRID application (Keith Cole et al) — or mapping any one onto a uniform geographical representation — is a finite and rule-bound task. At this level, geography is a physical rather than a social science.

The issue of trust is also potentially problematic. The Craigslist element of the Rademacher application brings the social dimension to bear, but does so in a way which minimises the risks of error (unintentional or intentional). There is a twofold verification mechanism at work. On one hand, advertisers — particularly content-heavy advertisers, like those who use the ‘classifieds’ and Craigslist — are motivated to provide a (reasonably) accurate description of what they are offering, and to use terms which match the terms used by would-be buyers. On the other hand, offering living space over Craigslist is not like offering video games over eBay: Craigslist users are not likely to rely on the accuracy of listings, but will subject them to in-person verification. In many disciplines, there is no possibility of this kind of ‘real-world’ verification; nor is there necessarily any motivation for a writer to use researchers’ vocabularies, or conform to their standards of accuracy.

In practice, the issues of granularity and trust both pose problems for social science researchers using multiple data sources, as concepts, classifications and units differ between datasets. This is not just an accident that could have been prevented with more careful planning; it is inherent in the nature of social science concepts, which are often inextricably contingent on social practice and cannot unproblematically be recorded as ‘facts’. The broad range covered by a concept like ‘anti-social behaviour’ means that coming up with a single definition would be highly problematic — and would ultimately be counter-productive, as in practice the concept would continue to be used to cover a broad range. On the other hand, concepts such as ‘anti-social behaviour’ cannot simply be discarded, as they are clearly produced within real — and continuing — social practices.

The meaning of a concept like this — and consequently the meaning of a fact such as the recorded incidence of anti-social behaviour — cannot be established by rule-bound or even rule-discovering behaviour. The challenge is to record both social ‘facts’ and the circumstances of their production, tracing recorded data back to its underlying topic area; to the claims and interactions which produced the data; and to the associations and exclusions which were effectively written into it.

4. Even better than the real thing

As an approach to this problem, we propose a repository of content-oriented metadata on social science datasets. The repository will encompass two distinct types of classification. Firstly, those used within the sources themselves; following Barney Glaser, we refer to these as ‘In Vivo Concepts’. Secondly, those brought to the data by researchers (including ourselves); we refer to these as ‘Organising Concepts’. The repository will include:

• relationships between Organising Concepts
‘theft from the person’ is a type of ‘theft’

• associations between In-Vivo Concepts and data sources
the classification of ‘Mugging’ appears in ‘British Crime Survey 2003’

• relationships between In-Vivo Concepts
‘Snatch theft’ is a subtype of the classification of ‘Mugging’

• relationships between Organising Concepts and In-Vivo Concepts
the classification of ‘Snatch theft’ corresponds to the concept of ‘theft from the person’

The combination of these relationships will make it possible to represent, within a database structure, a statement such as

Sources of information on Theft from the person include editions of the British Crime Survey between 1996 and the present; headings under which it is recorded in this source include Snatch theft, which is a subtype of Mugging

The structure of the proposed repository has three significant features. Firstly, while the relationships between concepts are hierarchical, they are also multiple. In English law, the crime of Robbery implies assault (if there is no physical contact, the crime is recorded as Theft). The In-Vivo Concept of Robbery would therefore correspond both to the Organising Concept of Theft from the person and that of Personal violence. Since different sources may share categories but classify them differently, multiple relationships between In-Vivo Concepts will also be supported. Secondly, relationships between concepts will be meaningful: it will be possible to record that two concepts are associated as synonyms or antonyms, for example, as well as recording one as a sub-type of the other. Thirdly, the repository will not be delivered as an immutable finished product, but as an open and extensible framework. We shall investigate ways to enable qualified users to modify both the developed hierarchy of Organising Concepts and the relationships between these and In-Vivo Concepts.
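A minimal sketch of how such a repository might be stored, using the paper's own examples; the relation names and the storage layout are my own guesses at an implementation, not part of the proposal:

```python
from collections import defaultdict

class Repository:
    """Concepts plus typed, multiple relationships between them."""
    def __init__(self):
        self.links = defaultdict(set)      # (relation, source) -> targets

    def relate(self, source, relation, target):
        self.links[(relation, source)].add(target)

    def targets(self, source, relation):
        return self.links[(relation, source)]

repo = Repository()
# an In-Vivo Concept appears in a data source
repo.relate("Mugging", "appears_in", "British Crime Survey 2003")
# relationships within the In-Vivo hierarchy
repo.relate("Snatch theft", "subtype_of", "Mugging")
# one In-Vivo Concept corresponding to TWO Organising Concepts
repo.relate("Robbery", "corresponds_to", "Theft from the person")
repo.relate("Robbery", "corresponds_to", "Personal violence")
```

The Robbery lines are the multiple-hierarchy point in miniature: nothing in the structure forces a concept to have exactly one parent or one correspondence.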

In the context of the earlier discussion of semantic processing and rule-governed behaviour, this repository will demonstrate the ubiquity of rule-interpreting behaviour in the social world by exposing and ‘freezing’ the data which it produces. In other words, the repository will encode shifting patterns of correspondence, equivalence, negation and exclusion, demonstrating how the apparently rule-bound process of constructing meaning is continually determined by ‘shared context’.

The repository will thus expose and map the ways in which social data is structured by patterns of situational information. The extensible and modifiable structure of the repository will facilitate further work along these lines: the further development of the repository will itself be an example of rule-interpreting behaviour. The repository will not — and cannot — provide a seamless technological bridge over the semantic gap; it can and will facilitate the work of bridging the gap, but without substituting for the role of applied human intelligence.

Cloudbuilding (2)

Here’s a problem I ran into, halfway through building my first ontology, and some thoughts on what the solution might be.

Question 47 of the Mixmag survey reads:

Have you ever had an instance[sic] where your drug use caused you to:
Get arrested?
Lose a job?
Fail an exam?
Crash a car/bike?
Be kicked out of a club?

What this tells us is that one of the things the Mixmag questionnaire is ‘about’ – one of the in vivo concepts (or groups of in vivo concepts) that we need to record – is misadventures consequent on drug use. The question is how we define this concept logically – and this isn’t just an abstract question, as the way that we define it will affect how people can access the information. There are three main possibilities.

1. Model the world
We could say that to have a job is to be a party to a contract of employment, which is a type of agreement between two parties, which is agreed on a set occasion and covers a set timespan. Hence to lose a job is to cease to be a party to a previously-agreed contract of employment; this may occur as a consequence of drug use (defined, in the Mixmag context, as the use of a psychoactive substance other than alcohol and tobacco).

This is all highly logical and would make it explicit that the Mixmag data contains some information on terminations of contracts of employment (as well as on drug-related stuff). However, the Mixmag survey isn’t actually about contracts of employment, and doesn’t mandate the definitional assumptions I made above. So this isn’t really legitimate. (It would also be incredibly laborious, particularly when we turn our attention away from the relatively succinct Mixmag survey and look at more typical social survey data: surveys of physical capacity, for example, routinely ask people whether they can (a) walk to the shops (b) walk to the Post Office (c) walk to the nearest bus stop, and so on down to (j) or (k). All, in theory, capable of being modelled logically – but perhaps only in theory.)

2. Stick to the theme
Alternatively, we could begin by taking a view as to the key concepts which a data source is about – in this case, psychoactive consumption, feelings about psychoactive consumption, consequences of psychoactive consumption, and sexual behaviour – and draw the line at anything beyond those concepts. On this assumption the fact that the survey covers misadventures consequent on drug use would be within scope, but the list of misadventures given above wouldn’t be: that’s part of the data that researchers will find when they look at the data source itself, not part of the conceptual ‘catalogue’ that we’re building. The advantage of this is that it’s conceptually very ‘clean’ and makes it that much clearer what a source is about; the disadvantage is obviously that it cuts off some ways in to the data and hides some information.

3. Include black boxes
What I’ve got at the moment – following the principle of using the definitions supplied by the source – is an ontology in which some concepts are defined and others are undefined (black boxes). For instance, I’ve got a concept of Job loss, but all that OWL ‘knows’ about it is that it’s a type of Misadventure (which may be consequent on drug use) – which is in turn a type of Life event (which is a type of event that happens to one person). This would allow anyone searching for events consequent on drug use to get to job loss as a type of misadventure, but wouldn’t let them get to drug-related misadventure from job loss – unless they happened to enter the exact name of the ‘job loss’ concept. I’m coming to believe that this is unsatisfactory: we should define the model in terms of what a data source is about. This means that we’ve got to either take a narrow, domain-specific view or take the view that each source gives us one piece of a much larger picture – in which case we’re inevitably committed to modelling the world. But the ‘black box’ option isn’t really sustainable.
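To see the asymmetry concretely, here’s a minimal sketch in Python (the concept names mirror the ones above, but the structure is purely illustrative – the real ontology is in OWL, not a dictionary):

```python
# Hypothetical sketch of the 'black box' hierarchy: each concept knows
# only its parent, and 'Job loss' itself carries no further definition.
parents = {
    "Job loss": "Misadventure",
    "Misadventure": "Life event",
    "Life event": "Event",
}

def ancestors(concept):
    """Walk upwards from a concept to the root of the hierarchy."""
    result = []
    while concept in parents:
        concept = parents[concept]
        result.append(concept)
    return result

# Top-down navigation works: from 'Misadventure' we can enumerate 'Job loss'.
print([c for c, p in parents.items() if p == "Misadventure"])  # ['Job loss']

# Bottom-up access depends on typing the exact concept name:
print(ancestors("Job loss"))      # ['Misadventure', 'Life event', 'Event']
print(ancestors("Losing a job"))  # [] - the black box matches nothing else
```

Anything other than the exact label draws a blank, which is precisely the problem with leaving concepts undefined.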

Cloudbuilding (1)

This one’s about work.

I’m currently documenting the concepts underlying the 2005 Mixmag Drug Survey using Protege. Here’s why:

The documentation of social science datasets on a conceptual level, so as to make multiple datasets comprehensible within a shared conceptual framework, is inherently problematic: the concepts on which the data of the social sciences are constructed are imprecise, contested and mutable, with key concepts defined differently by different sources. When a major survey release is published, for example, the accompanying metadata often includes not only a definition of key terms, but discussion of how and why the definitions have changed since the previous release. This information is of crucial importance to the social scientist, both as a framework for understanding statistical data and as a body of social data in its own right.

It follows that we cannot think in terms of ironing out inconsistencies between social science datasets and resolving ambiguities. Rather, documenting the datasets must include documenting the definitions of the conceptual framework on which the datasets are built, however imprecise or inappropriate these concepts might appear in retrospect. This will also involve preserving – and exposing – the variations between different sources, or successive releases from a single source.

There are currently two main approaches to conceptually-oriented data documentation. A ‘top down’ approach is exemplified by the European Language Social Science Thesaurus (ELSST). The Madiera portal allows researchers to explore ELSST and access European survey data which has been linked to ELSST keywords. The limitations of the top-down approach can be gauged from ELSST’s concepts relating to drug use. Drug Abuse, Drug Addiction, Illegal Drugs and Drug Effects are all ‘leaf’ concepts – headings which have no subheadings under them. However, they are in different parts of the overall ELSST tree: for example, Drug Abuse is under Social Problems->Abuse, while Drug Effects is under Biology->Pharmacology. Although the hierarchy is augmented by a list of ‘related’ concepts, to some extent facilitating horizontal as well as vertical navigation, it inevitably makes some types of enquiry easier than others. Anyone using the ELSST ‘tree’ will be visually reminded of the affinities identified by ELSST’s authors between Pharmacology and Physiology, or between Drug Abuse and Child Abuse. These problems follow from the initial design choice of a single conceptual hierarchy.

This approach to classification has recently come under criticism. Advocates of ‘bottom-up’ approaches argue that top-down taxonomies like the Dewey Decimal System or ELSST are an artificial imposition on the world of knowledge, which is better represented as a set of individual acts of labelling or ‘tagging’. It is argued that the ‘trees’ of hierarchical taxonomies can be replaced with a pile of ‘leaves’.

One successful ‘bottom-up’ approach is the framework for documenting survey data developed by the Data Documentation Initiative (DDI). The DDI standard makes it possible to search on keywords associated with surveys, sections of surveys and individual questions; the short text of individual questions is also searchable. Searches of DDI metadata can also be run from the Madiera portal: a search on ‘marijuana’, for instance, brings back short text items including the following:

CONSUMED HASHISH,MARIJUANA
- Health Behaviour in School-Aged Children (Switzerland, 1990)

Smoking cannabis should be legal? Q2.31
- Scottish Social Attitudes Survey (Scotland, 2001)

Q92C DRUGS EV B OFFERED – MARIJUANA
- Eurobarometer 37.0 (EU-wide, 1992)

Clearly, this way in to the data makes it easy for a well-prepared researcher to track the use of particular concepts ‘in the wild’ (in vivo concepts). However, this gain comes at the cost of some information. There is wide variation both in the terminology used in the surveys and in the concepts to which they refer. In one survey smoking cannabis might be a type of petty crime; in others it might figure as a type of leisure activity or a potential health risk. These conceptual differences are reflected in the vocabulary used by data sources – and by researchers. Depending on context, three researchers using ‘marijuana’, ‘hashish’ and ‘cannabis’ as search terms may be asking for the same data or for three different sets of data.

Neither the ‘top-down’ nor the ‘bottom-up’ approach articulates the conceptual assumptions which underlie the construction of a dataset – assumptions expressed both in the definition of in vivo concepts and in relationships between them. Rather than leaving much of this conceptual information undocumented (the DDI approach) or encoding one ‘correct’ set of assumptions while excluding or sidelining others (the ELSST approach), we propose to offer a coherent hierarchy of in vivo concepts for each individual source, based on the definitions (explicit and implicit) used in each source. Comparing the in vivo conceptual hierarchies used in multiple datasets will enable researchers both to see where concepts are directly comparable and to see where – and how – their definitions diverge and overlap.

To document hierarchies of in vivo concepts, we shall use description logic and the Semantic Web language OWL-DL (Web Ontology Language – Description Logic). OWL-DL makes it possible to formulate a precise logical specification of concepts such as

- use of cannabis (either marijuana or hashish) in the month prior to the survey
- use of either Valium or temazepam, at any time
- seizures of Class A drugs by HM Customs in the financial year 2004/5
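In description-logic notation, the first of these might be sketched as follows (the concept and role names here are purely illustrative, not the ones in the actual ontology):

```latex
\mathit{CannabisUseLastMonth} \equiv \mathit{DrugUse}
  \sqcap \exists\,\mathit{involvesSubstance}.(\mathit{Marijuana} \sqcup \mathit{Hashish})
  \sqcap \exists\,\mathit{occursDuring}.\mathit{MonthPriorToSurvey}
```

That is: an instance of drug use involving either of two substances, falling within a specific period – every term of which can itself be given a source-specific definition.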

At least, that’s the idea. Now wait for part 2…

We climbed and we climbed

I don’t trust Yahoo!, for reasons which have nothing to do with my dislike of misused punctuation marks (although the bang certainly doesn’t help); I don’t trust Google either. Maybe it’s because I’m old enough to remember when MicroSoft [sic] were new and exciting and a major attractor of geek goodwill; maybe it’s just because I’m an incurable pinko and don’t trust anyone who’s making a profit out of me. Anyway, I don’t trust Yahoo!, or like them particularly; I switched to Simpy when Yahoo! bought del.icio.us, and I’ve felt a bit differently about Tom – hitherto one of my favourite bloggers anywhere – since he joined Yahoo!.

Still. This (PDF) is Tom’s presentation to the Future of Web Apps conference, and it’s good stuff – both useful and beautiful, to use William Morris’s criteria. The fourth rule (precept? guideline? maxim?) spoke to me particularly clearly:

Identify your first order objects and make them addressable

Start with the data, in other words; then work out what the data is; then make sure that people (and programs) can get at it. (Rule 5: “Use readable, reliable and hackable URLs”.) It’s a simple idea, but surprisingly radical when you consider its implications – and it’s already meeting resistance, as radical ideas do (see Guy Carberry’s comments here).

More or less in passing, Tom’s presentation also shows why the Shirkyan attempt to counterpose taxonomy to folksonomy is wrongheaded. If you’re going to let people play with your data (including conceptual data), then it needs to be exposed – but if you’re going to expose data in ways that people can get at, you need structure. And it doesn’t matter if it’s not the right structure, not least because there is no right structure (librarians have always known this); what matters is that it’s consistent and logical enough to give people a way in to what they want to find. To put it another way, what matters is that the structure is consistent and logical enough to represent a set of propositions about the data (or concepts). Once you’ve climbed that scaffolding, you can start slinging your own links. But ethnoclassification builds on classification: on its own, it won’t get you the stuff you’re looking for – unless what you’re looking for isn’t so much the stuff as what people are saying about stuff. (Which is why new-media journalists and researchers like tagging, of course.)

Anyway – very nice presentation by the man Coates. Check it out.

Home again

So, I’m a researcher. (At least until the money runs out next year; hopefully I’ll have something similar lined up by then.) Before I was a researcher I was a freelance journalist for about six years, while I did my doctorate; before that I was a full-time journalist for three years; and before that I worked in IT. Which is a whole other dark and backward abysm of time – I was a Unix sysadmin, and before that I was an Oracle DBA, and before that… database design, data analysis, Codasyl[1] database admin, a ghastly period running a PC support team, and before that systems analysis and if you go back far enough you get to programming, and frankly I still don’t trust any IT person who didn’t start in programming. (I’m getting better – at one time I didn’t trust anyone who didn’t start in programming.)

Now, there’s an odd kind of intellectual revelation which you sometimes get, when you’re a little way into a new field. It’s not so much a Eureka moment as a homecoming moment: you get it, but it feels as if you’re getting it because you knew it already. You feel that you understand what you’ve learnt so fully that you don’t need to think about it, and that everything that’s left to learn is going to follow on just as easily. Which usually turns out to be the case. The way it feels is that the structures you’re exploring are how your mind worked all along – or, perhaps, how your mind would have been working all along if you’d had these tools to play with. (Or: “It’s Unix! I know this!”)

I had that feeling a few times in my geek days – once back at the start, when I was loading BASIC programs off a cassette onto an Acorn Atom (why else would I have carried on?); once when I was introduced to Codasyl databases; and once (of course) when I met Unix, or rather when I understood piping and redirection. But the strongest homecoming moment was when, after being trained in data analysis, I saw a corporate information architecture chart (developed by my employer’s then parent company, with a bit of help from IBM). Data analysis hadn’t come naturally, but once I’d got it it was there – and, now that I had got it, just look what you could do with it! It was a sheet of A3 covered with lines and boxes, expressing propositions such as “a commercial transaction takes place between two parties, one of which is an organisational unit while the other may be an individual or an organisational unit”; propositions like that, but mostly rather more complex. I thought it was wonderful.

Fast forward again: database design, DBA, sysadmin, journalism, freelancing, PhD, research. Research which, for the last month or so, has involved using OWL (the ontology language formerly known as DAML+OIL) and the Protege logical modelling tool – which has enabled me to produce stuff like this.

It’s not finished – boy, is it not finished. But it is rather lovely. (Perhaps I just like lines and boxes…)

[1] If you don’t know what this means, don’t worry about it. (And if you do, Hi!)

Started slow, long ago

The other day my son asked to borrow the booklet from my CD of Smile, because he wanted to check the lyrics of “Good Vibrations”; I’d burned it to a CD that I’d given him for Christmas, along with a bunch of other stuff (Oasis, the Shins, Iggy Pop…). I told him the lyrics of the song on Smile were different from the original, and (once I’d persuaded him I wasn’t winding him up) that was that. But it started me thinking.

In the next post I’ll say some more about Smile – which I’d classify as a glorious failure for a number of reasons, some of them (interestingly) out of the control of anyone involved. In this one I’m going to talk about vinyl. Then with any luck there’ll be a third post which will tie it all together, although exactly how as yet I know not. But hey, enough of my yakkin’.

We’re all normal and we want our freedom.
Freedom?
Freedom. Freedom. Freedom. Freedom…
All o’ God’s children gotta have their freedom!

The first time I heard Forever Changes, I got up instinctively at the end of “The red telephone” to turn the record over. Then I sat down again, because the first time I heard Forever Changes was about eighteen months ago and I was listening to it on CD. All the same, I knew the end of side one when I heard it.

Like Dan, I miss the LP format. What I miss isn’t the LPs, which haven’t entirely gone – I’ve got a functioning turntable & still occasionally buy new music on vinyl – but the album format which they gave us. Consider: a heavy cardboard sleeve twelve inches square. There are the visuals, for a start: 12″ x 12″ is a handy format for artwork, not to mention the 24″ x 12″ of the gatefold. There are track listings, production credits, details of who played what and who wrote what; then there’s an inner sleeve, which may have more artwork and may have more information, or perhaps song lyrics. The whole thing is a rich, dense artifact, in a similar scale to a glossy magazine – a handy size to hold and contemplate, whether you’re listening to the music or anticipating it as you come home on the bus. At the same time, as packaging it’s deeply functional: it’s wrapped – fairly closely in some cases – around the record itself.

The record, let’s not forget, embodies the music. The record has a certain irreducible fetishistic appeal, which alternative recording technologies (partly because they were alternatives) never really acquired. You hold a black vinyl disc in your hands, and you’re holding crystallised music: that physical object is your only way of hearing that music. It’s divided into two sides – two sets of tracks. In itself, the way the tracks divide up tells you something about the music you were going to hear – particularly if there’s only one track on each side (or, for that matter, if there are ten). Either way, there’s an inevitable pause between side one and side two, giving you the chance to gather your thoughts and renew your attention.

Maintenant c’est joué. There’s a lot that’s good about the music-listening experiences which have effectively supplanted the LP. Indeed, their sheer convenience makes it slightly academic to talk about their drawbacks: as George Orwell said, travelling from London to Brighton by walking alongside a mule would certainly give you a better experience of the country than taking the train, but that didn’t mean anybody would actually do it. Still, something has been lost. I think there are three main factors. There’s the sheer length of the 80-minute CD format; it may suit Beethoven, but it’s death to the album format. The tracks multiply or sprawl; without the minimal structure provided by that end-of-side-one breather, the album turns into a big bag of tracks, inviting the listener to skip or resequence. (Top tip for Beck’s Sea change: 1,2,3,5,4,6, then 8-12. Try it.)

This technological erosion of the album format has both followed and reinforced the rise of a radio-like, track-based way of listening to music and thinking about music. In the piece I referenced earlier (which really is superb, by the way – if you’re going to follow any of these links, follow this one) Dan laments the poverty of musical metadata offered by iTunes: you can put names to the track, the album and the artist, and, er, that’s it. For me this suggests a conceptual shift rather than (or as well as) skimpy work by software engineers. On the radio, after all, you never expected to hear the name of the producer or the dates of the session or the jokey credit for the studio runner. You heard the music, you got the basic information and you could go out and track it down – it in this case being the vehicle of the real musical experience, the graspable object of beauty and store of information that was the album. That extra stage has more or less vanished now: what you hear is what you get, there is nothing else. Which means that the album becomes less important than the mix – your own mix, mined from the music you’ve accumulated. (Last year CD album sales actually rose in the UK – but sales of compilation albums fell sharply.) Malcolm McLaren, of all people, saw most of this coming years ago.

The third big change is one that Uncle Mal didn’t foresee, and it’s the clincher. With a portable cassette player – by which I don’t necessarily mean a Walkman (although it certainly made life easier when you didn’t have to tote around something the size of a shoulderbag) – with a portable tape player, anyway, your music could become as portable and as ubiquitous as music on the radio, at the cost of also becoming as light, disembodied and information-poor as music on the radio. But there was another cost, suggested by the lyrics to that song:

hit it, pause it, record it and play
turn it, rewind, and rub it away

A spool of tape is an extraordinarily inefficient medium for storing a series of separate tracks – more so than a vinyl LP, in some ways. You get into storage/retrieval tradeoffs straight away: the easiest tape to play is the one that consists solely of stuff you’re into right now, but that’s also the hardest tape to maintain. Of course, you could take the tape out and put another one in, but that only delays dealing with the problem – after all, how many tapes are you going to want to carry around?

So the third big change has been the rise of digital players. Sure, Malcolm had Annabella sing

now I don’t need no album rack
I carry my collection over my back

but I think the significant word in that is ‘back’: if you seriously intended to carry your entire album collection around on tape you’d need a sizeable rucksack. (I once taped one track from every album I owned – a project of quite Shandean irrationality. I’ve still got the tapes – they take up a whole drawer of my cassette box.) Now you can get your entire CD collection into a box the size of a fag packet in considerably less time than it takes to play them; you don’t get the same economy of time with vinyl albums, but it’s still perfectly doable. And then that’s it – that is your music collection. You can plug it into your audio system, you can plug it into your car stereo, you can hang it round your neck and go rollerblading if you’ve a mind to. You don’t need no album rack – wherever you are, your music is there.

This effect is exacerbated by the way music now seems to stay in fashion indefinitely: the Led Zeppelin of the fourth album and the Pink Floyd of Dark Side are still there – still our contemporaries. This is a very recent phenomenon; at the time of Dark Side it would have seemed absurd to talk in this way about music that was 33 years old, or 13 years old for that matter. ‘Progressive’ rock wasn’t a genre to us then – it represented rock that had progressed, had left the past behind. (A few years later, many of us had similar views about the New Wave.) Most pop music from before the late 1960s is still over the horizon – there’s no appetite for replica reissues of Herman’s Hermits albums, and very little appetite for anything by the likes of Vince Taylor or Cliff Bennett – but once you get to about 1967 the clock has effectively stopped. (The Smile sessions were in 1966.)

You can have your music anywhere; not only that, but you can have all the music there is. On Desert Island Discs recently, John Sutherland chose his eight records to salvage from a shipwreck, but spoiled the effect rather with his choice of luxury item: a 60GB iPod, which would potentially give him the choice of another 14,992 pieces of music. Everyone’s a librarian – but it’s not a library of albums, with all their freight of musical metadata and artwork and in-jokes and period design and misprints; it’s a collection of tracks, each one as light and bodiless as the stuff they play on the radio, each interchangeable with any other. (Ultimate realisation of this vision: the iPod Shuffle, which picks its own random route from track to track, and doesn’t even deign to tell you which track you’re listening to.)

All of which makes this a strange time to be hearing Smile for the first time… But I’ll come on to that.

Postscript I was halfway through writing this post when I discovered I wasn’t the first person to think along these lines: in the February 2004 Salon article quoted here, Paul Williams of Crawdaddy comments, “It’s ironic that we’re talking about the [first] great album that never was at a time that the very form of the pop album is itself falling on hard times.” Spooky. Another one of those reverse premonitions, I guess.

Put your head back in the clouds

OK, let’s talk about the Long Tail.

I’ve been promising a series of posts on the Long Tail myth for, um, quite a while. (What’s a month in blog time? A few of those.) The Long Tail posts begin here.

Here’s what we’re talking about, courtesy of our man Shirky:

We are all so used to bell curve distributions that power law distributions can seem odd. The shape of Figure #1, several hundred blogs ranked by number of inbound links, is roughly a power law distribution. Of the 433 listed blogs, the top two sites accounted for fully 5% of the inbound links between them. (They were InstaPundit and Andrew Sullivan, unsurprisingly.) The top dozen (less than 3% of the total) accounted for 20% of the inbound links, and the top 50 blogs (not quite 12%) accounted for 50% of such links.


Figure #1: 433 weblogs arranged in rank order by number of inbound links.

It’s a popular meme, or it would be if there were any such thing as a meme (maybe I’ll tackle that one another time). Here’s one echo:

many web statistics don’t follow a normal distribution (the infamous bell curve), but a power law distribution. A few items have a significant percentage of the total resource (e.g., inbound links, unique visitors, etc.), and many items with a modest percentage of the resources form a long “tail” in a plot of the distribution. For example, a few websites have millions of links, more have hundreds of thousands, even more have hundreds or thousands, and a huge number of sites have just one, two, or a few.

Another:

if we measure the connectivity of a sample of 1000 web sites, (i.e. the number of other web sites that point to them), we might find a bell curve distribution, with an “average” of X and a standard deviation of Y. If, however, that sample happened to contain google.com, then things would be off the chart for the “outlier” and normal for every other one. If we back off to see the whole web’s connectivity, we find a very few highly connected sites, and very many nearly unconnected sites, a power law distribution whose curve is very high to the left of the graph with the highly connected sites, with a long “tail” to the right of the unconnected sites. This is completely different than the bell curve that folks normally assume

And another:

The Web, like most networks, has a peculiar behavior: it doesn’t follow standard bell curve distributions where most people’s activities are very similar (for example if you plot out people’s heights you get a bell curve with lots of five- and six-foot people and no 20-foot giants). The Web, on the other hand, follows a power law distribution where you get one or two sites with a ton of traffic (like MSN or Yahoo!), and then 10 or 20 sites each with one tenth the traffic of those two, and 100 or 200 sites each with 100th of the traffic, etc. In such a curve the distribution tapers off slowly into the sunset, and is called a tail. What is most intriguing about this long tail is that if you add up all the traffic at the end of it, you get a lot of traffic

All familiar, intuitive stuff. It’s entered the language, after all – we all know what the ‘long tail’ is. And when, for example, Ross writes about somebody who started blogging about cooking at the end of the tail and is now part of the fat head and has become a pro, we all know what the ‘fat head’ is, too – and we know what (and who) is and isn’t part of it.

Unfortunately, the Long Tail doesn’t exist.

To back up that assertion, I’m going to have to go into basic statistics – and trust me, I do mean ‘basic’. In statistics there are three levels of measurement, which is to say that there are three types of variable. You can measure by dividing the field of measurement into discrete partitions, none of which is inherently ranked higher than any other. This car is blue (could have been red or green); this conference speaker is male (could have been female); this browser is running under OS X (could have been Win XP). These are nominal variables. You can code up nominals like this as numbers – 01=blue, 02=red; 1=male, 2=female – but it won’t help you with the analysis. The numbers can’t be used as numbers: there’s no sense in which red is greater than blue, female is greater than male or OS X is – OK, bad example. Since nominals don’t have numerical value, you can’t calculate a mean or a median with them; the most you can derive is a mode (the most frequent value).

Then there are ordinal variables. You derive ordinal variables by dividing the field of measurement into discrete and ordered partitions: 1st, 2nd, 3rd; very probable, quite probable, not very probable, improbable; large, extra-large, XXL, SuperSize. As this last example suggests, the range covered by values of an ordinal variable doesn’t have to exhaust all the possibilities; all that matters is that the different values are distinct and can be ranked in order. Numeric coding starts to come into its own with ordinals. Give ‘large’ (etc) codes 1, 2, 3 and 4, and a statement that (say) ‘50% of size observations are less than 3’ actually makes sense, in a way that it wouldn’t have made sense if we were talking about car colour observations. In slightly more technical language, you can calculate a mode with ordinal variables, but you can also calculate a median: the value which is at the numerical mid-point of the sample, when the entire sample is ordered low to high.

Finally, we have interval/ratio or I/R variables. You derive an I/R variable by measuring against a standard scale, with a zero point and equal units. As the name implies, an I/R variable can be an interval (ten hours, five metres) or a ratio (30 decibels, 30% probability). All that matters is that different values are arithmetically consistent: 3 units minus 2 units is the same as 5 minus 4; there’s a 6:5 ratio between 6 units and 5 units. Statistics starts to take off when you introduce I/R variables. We can still calculate a mode (the most common value) and a median (the midpoint of the distribution), but now we can also calculate a mean: the arithmetic average of all values. (You could calculate a mean for ordinals or even nominals, but the resulting number wouldn’t tell you anything: you can’t take an average of ‘first’, ‘second’ and ‘third’.)
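The three levels can be sketched in a few lines of Python, using the standard library’s statistics module (the data here is invented, purely for illustration):

```python
from statistics import mean, median, mode

# Nominal: car colours. Only the mode (most frequent value) is meaningful.
colours = ["blue", "red", "blue", "green", "blue"]
print(mode(colours))  # 'blue'

# Ordinal: degree classes coded 1=first, 2=upper second, 3=lower second,
# 4=third. The codes can be ranked, so a median makes sense; a mean of
# the codes would be arithmetic applied to labels.
degrees = [2, 3, 2, 1, 3, 2, 4]
print(median(degrees))  # 2

# Interval/ratio: heights in cm. Mode, median and mean all make sense.
heights = [172, 180, 168, 190, 175]
print(mean(heights))  # 177
```

Note that Python will happily compute `mean(degrees)` for you – the point is that the result wouldn’t mean anything, which is a statistical judgment rather than a syntactic one.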

You can visualise the difference between nominals, ordinals and I/R variables by imagining you’re laying out a simple bar chart. It’s very simple: you’ve got two columns, a long one and a short one. We’ll also assume that you’re doing this by hand, with two rectangular pieces of paper that you’ve cut out – perhaps you’re designing a poster, or decorating a float for the Statistical Parade. Now: where are you going to place those two columns? If they’re nominals (‘red cars’ vs ‘blue cars’), it’s entirely up to you: you can put the short one on the left or the right, you can space them out or push them together, you can do what you like. If they’re ordinals (‘second class degree awards’ vs ‘third class’) you don’t have such a free rein: spacing is still up to you, but you will be expected to put the ‘third’ column to the right of the ‘second’. If they’re I/R variables, finally – ‘180 cm’, ‘190 cm’ – you’ll have no discretion at all: the 180 column needs to go at the 180 point on the X axis, and similarly for the 190.

Almost finished. Now let’s talk curves. The ‘normal distribution’ – the ‘bell curve’ – is a very common distribution of I/R variables: not very many low values on the left, lots of values in the middle, not very many high values on the right. The breadth and steepness of the ‘hump’ varies, but all bell curves are characterised by relatively steep rising and falling curves, contrasting with the relative flatness of the two tails and the central plateau. The ‘power law distribution’ is a less common family of distributions, in which the number of values is inversely proportionate to the value itself or a power of the value. For example, deriving Y values from the inverse of the cube of X:

X value    Y formula        Y value
1          1000 / (1^3)     1000
2          1000 / (2^3)     125
3          1000 / (3^3)     37.037
4          1000 / (4^3)     15.625
5          1000 / (5^3)     8
6          1000 / (6^3)     4.63

As you can see, a power law curve begins high, declines steeply then ‘levels out’ and declines ever more shallowly (it tends towards zero without ever reaching it, in fact).
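The table can be reproduced in a couple of lines of Python (a minimal sketch – the 1000 is just a scaling constant, chosen to give a readable starting value):

```python
# Power-law curve: Y is inversely proportional to the cube of X,
# so it starts high and decays steeply, then ever more shallowly.
for x in range(1, 7):
    print(x, round(1000 / x**3, 3))
```

Extend the range and you can watch the curve tend towards zero without ever reaching it.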

Got all that? Right. Quick question: how do you tell a normal distribution from a power-law distribution? It’s simple, really. In one case both low and high values have low numbers of occurrences, while most occurrences are in the central plateau of values around the mean. In the other, the lowest values have the highest numbers of occurrences; most values have low occurrence counts, and high values have the lowest counts of all. In both cases, though, what you’re looking at is the distribution of interval/ratio variables. The peaks and tails of those distribution curves can be located precisely, because they’re determined by the relative counts (Y axis) of different values (X axis) – just as in the case of our imaginary bar chart.

Back to a real bar chart.

Figure #1: 433 weblogs arranged in rank order by number of inbound links.

The shape of Figure #1, several hundred blogs ranked by number of inbound links, is roughly a power law distribution.

As you can see, this actually isn’t a power law distribution – roughly or otherwise. It’s just a list. These aren’t I/R variables; they aren’t even ordinals. What we’ve got here is a graphical representation of a list of nominal variables (look along the X axis), ranked in descending order of occurrences. We can do a lot better than that – but it will mean forgetting all about the idea that low-link-count sites are in a ‘long tail’, while the sites with heavy traffic are in the ‘head’.

[Next post: how we could save the Long Tail, and why we shouldn't try.]

A place for everything

Or: what ethnoclassification is, and what folksonomy isn’t.

When it comes to tagging, I’m facing both ways. I think it’s fascinating and powerful and new – qualitatively new, that is: it’s worth writing about not just because it’s shiny, but because there’s still work to be done on understanding it. At the same time, I think it’s been massively oversold, often on the back of rhetorical framings which only have a glancing relationship with evidence or logic. Tagging is fascinating and powerful and new, but a lot of the talk about tagging has me tearing my hair.

I’ll pick on a recent post by Dave Weinberger. (Personal to DW: sorry, Dave. I’m emphatically not (is that emphatic enough?) suggesting that you’re the worst offender in this area.)

Let’s say you type in “africa,” “agriculture” and “grains” because that’s what you’re researching. You’ll get lots of results, but you may miss pages about “couscous” because Google is searching for the word “grain” and doesn’t know that that’s what couscous is made of. Google knows the words on the pages, but doesn’t know what the pages are about. That’s much harder for computers because what something is about really depends on what you’re looking for. That same page on couscous that to you is about economics could be about healthy eating to me or about words that repeat syllables to someone else. And that’s the problem with all attempts by experts and authorities to come up with neat organizations of knowledge: What something is about depends on who’s looking.

Let’s say you come across the Moroccan couscous web page and you want to remember it. So you upload its Web address to your free page at del.icio.us that lists all the pages you’ve saved. Then del.icio.us asks you to enter a word or two as tags so you can find the Moroccan page later. You might tag it with Morocco, recipe, couscous, and main course, and then later you can see all the pages you’ve tagged with any of those words. That’s a handy way to organize a large list of pages, but tagging at del.icio.us really took off because it’s a social activity: Everyone can see all the pages anyone has tagged with say, Morocco or main course or agriculture. This is a great research tool because just by checking the tag “agriculture” now and then, you’ll see every page everyone else at delicious has tagged that way. Some of those pages will be irrelevant to you, of course, but many won’t be. It’s like having the world of people who care about a topic tell you everything they’ve found of interest. And unlike at Google, you’ll find the pages that other humans have decided are ABOUT your topic.

What strikes me about this passage is that Dave changes scenarios in mid-stream: Let’s say you come across the Moroccan couscous web page… How? Google couldn’t find it. Let’s compare like with like, and say that you’re still looking for your couscous page: what do you do then, if not go to del.icio.us and type in “africa,” “agriculture” and “grains”? Once again, assuming that whole-site searches aren’t timing out, you’ll get lots of results (particularly since del.icio.us doesn’t seem to allow ANDing of search terms) but you may miss pages about “couscous” – and checking the tag “agriculture” now and then won’t necessarily help. Google will miss the page if the term ‘couscous’ doesn’t appear in the source (which doesn’t necessarily mean ‘appear on screen’, of course); del.icio.us will miss it if the term hasn’t been used to tag it (even if it is in the source).

Google vs del.icio.us is an odd comparison, in other words, and it’s not at all clear to me that the comparison favours del.icio.us. It’s great to get classificatory(?) input from the users of a document, of course – as I said above, tagging is fascinating and powerful and new – but in terms of information retrieval it can only score over a full-text search if

1. the page has been purposefully tagged by a user
2. the page has been tagged with a term which doesn’t appear in the page source
3. a second user is searching for information which is contained in the page, using the term with which the first user tagged it
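The three conditions can be put as a toy model – hypothetical pages and tags, invented for illustration, not any real search engine’s behaviour:

```python
# One page whose text never mentions 'grains'.
pages = {
    "couscous-page": {"text": "moroccan couscous recipe", "tags": set()},
}

def fulltext_search(term):
    """Google-style: match against the words in the page source."""
    return [p for p, d in pages.items() if term in d["text"].split()]

def tag_search(term):
    """del.icio.us-style: match against user-applied tags only."""
    return [p for p, d in pages.items() if term in d["tags"]]

print(fulltext_search("grains"))  # [] -- the word isn't in the source
print(tag_search("grains"))       # [] -- nobody has tagged the page yet

# Conditions 1 and 2: a user tags the page with a term absent from the source...
pages["couscous-page"]["tags"].add("grains")
# Condition 3: ...and a second user searches on exactly that term.
print(tag_search("grains"))       # ['couscous-page']
```

Until all three conditions line up, the tag search does no better than the full-text search – and if the term is in the source, it does worse.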

I don’t think tagging advocates think enough about what those conditions imply. For example, at present I’m the only del.icio.us user to have tagged Mr Chichimichi’s Tags are not a panacea; I tagged it with ‘tagging’, ‘search’ and ‘ethnoclassification’. Until I did so, anyone looking for it would have been out of luck. Even Google wouldn’t be much help – the word ‘ethnoclassification’ doesn’t appear anywhere in the text. No, until a couple of days ago your only way of stumbling on that post would have been to run a clumsy, counter-intuitive Google search on terms like ‘tagging’, ‘tags’, ‘folksonomies’ and ‘social software’. (Google even knows that ‘folksonomies’ is the plural of ‘folksonomy’, so searching on the singular form would work just as well. That’s just not fair.)

Dave also contrasts the world of collective knowledge through distributed tagging with attempts by experts and authorities to come up with neat organizations of knowledge. Further along in the same piece, he writes:

This takes classification and about-ness out of the hands of authors and experts. Now it’s up to us readers to decide what something is about. Not only does this let us organize stuff in ways that make more sense to us, but we no longer have to act as if there’s only one right way of understanding everything, or that authors and other authorities are the best judges of what things are about.

One question: who ever said that there was only one right way of understanding everything? OK, too easy. I’ll rephrase that: before tagging came along, who was saying there was one right way, etc? Who are the tagging advocates actually arguing against? (It certainly isn’t librarians (context here).)

There’s a difference between classifications which have a single pre-determined set of definitions and classifications which are user-defined and user-extensible. But that’s not the same as the difference between having an underlying ontology and not having one, or the difference between hierarchical and flat organisations of knowledge, or the difference between single and multiple sets of classifications. A closed, expert-defined, locked-down controlled vocabulary may contain multiple sets of overlapping terms; it may be a flat list of categories rather than a ‘tree’; it may even be innocent of ontology. (Thanks to Jay for pointing this out, in comments here.) If tagging is better than top-down classification, it’s better because it’s user-defined and user-extensible – not because it’s free of the vices of ontology, hierarchy and uniformity. The idea that tagging – and only tagging – stands in opposition to a classifying universe built on hierarchical uniformity is a straw man. (But the librarians get it both ways – if a top-down classifying system is shown to be flat and plural, this can be put forward as a sign of the weakness of top-down systems; the fact that bottom-up systems are more, not less, vulnerable to Chinese Encyclopedia Syndrome is passed over.)

So, tagging systems make lousy search engines, and they don’t mark a qualitative leap in the organisation of human knowledge. What they’re really good for – and what makes them fascinating and powerful – is conversation. Tagging, I’m suggesting, isn’t there to tell us about stuff: it’s there to tell us about what people say about stuff. As such, it performs rather poorly when you’re asking “where is X?” or “what is X?”, and it comes into its own when you’re asking “what are people saying about X?” (Of course, much tag-advocacy is driven by the tacit belief that there’s no fundamental difference between what people say about X and expert knowledge of X – and that an aggregate of what people say would be equivalent, if not superior, to expert knowledge. But that’s an argument for another post.)

Tagging is good for telling us what people say about stuff, anyway – and when it’s good, it’s very good. To see what I’m talking about, have a look at Reader2 (via Thomas). It’s a book recommendation site, implemented on the basis of a del.icio.us-like user/tag system. It’s powerful stuff already, and it’s still being developed. Does it tell me what books are really like? No – but it tells me what people are saying about them, which is precisely what I want to know. And it couldn’t do this nearly as well, it seems to me, without tags – and tag clouds in particular. This, for me, is what tagging’s all about. Ethnoclassification: classification as an open-ended collective activity, as one element of the continual construction of social reality.

Not available before

Thanks to a couple of links posted by Thomas, I’ve just read Bryan Boyer’s Correspondance Romano (Corriere Romano, surely? never mind) closely followed by this post from February by Tom Evslin. Tom:

People don’t think hierarchically – at least most people don’t. We think in terms of associations. Our dreams give this away as they hyperlink through experiences of the day and memories of the distant past. A conversation meanders horizontally from one topic to the next.

Hierarchies like Lotus Notes or the Dewey Decimal System were necessary when computing power was non-existent or very expensive. As computing power has become relentlessly cheaper thanks to Moore’s law, hierarchies of information have become unnecessary. … So long as Google or its competitors can index almost everything I might ever want to find, why should any arbitrary order be imposed on information?

Once we didn’t need hierarchies to organize our approach to information, they became an impediment. It is very hard for one person to figure out which node in which folder tree another person would have put a particular piece of information. A document may be relevant to one researcher for entirely different reasons than it is relevant to another researcher.

The relationship between documents is actually dynamic depending on the needs of the reader. Not incidentally, open tagging and hyperlinking are both ways to impose particular relationships on documents to meet the need of some subset of readers.

In passing, this suggests that the contribution of tagging to the grunt work of actually finding stuff may not be all that significant. After all, “a document may be relevant to one researcher for entirely different reasons than it is relevant to another researcher”: in this respect the same strictures apply to tags as to folders, with the proviso that tagging does at least give you multiple chances to get it right. I’ve found useful and interesting stuff by browsing del.icio.us, but I’ve also found useful and interesting stuff by browsing library catalogues, running partial name searches on booksellers’ sites, googling common phrases and going to the eighth page of results, and so forth. But then, I’m a catalogue-hound and I like being surprised. If you’re looking for something specific, Tom’s argument (inadvertently?) suggests, you’re probably better off with Google.

Bryan’s post doesn’t discuss taxonomies, ontologies or search engines, largely because it’s a series of emails from 2002. But it does contain this beautiful piece of ethnoclassification:

Italy is about all of these things: cured meats, standing up to drink your coffee, stiffling heat, mid-day naps, skulls in churches, hot men in suits on scooters, Ananas, and cheap groceries.

This is very much the kind of freewheeling associational approach to knowledge that Tom describes – and very much the kind of ground-up, non-exclusive, plural, open-ended classifying process which has become known as ‘folksonomy’.

But what happens if we take that sentence and map it onto the current ‘folksonomic’ toolset? Is there an ‘Italy’ resource somewhere – a really really authoritative Web page, say – that we can tag with ‘curedmeat’, ‘coffeestandingup’, ‘stifflingheat’ and so on? (Never mind the problem of cross-matching with the tags ‘meat.cured’, ‘coffee.standing’ and ‘heat.stiffling’ – let alone ‘heat.stifling’.) Or are we going to use an ‘italy’ tag and apply it to single identifiable resources on ‘cured meat’, ‘hot men in suits on scooters’, etc? If so, did all those resources exist before we tried to tag them – and if not, are we going to have to create them?

The kind of association described by Tom – and exemplified by Bryan’s old mails – is actually a very bad fit for the Technorati/del.icio.us style of document tagging, for two reasons. One is that it’s two-way: if ‘Italy’ is associated with ‘skulls in churches’ then ‘skulls in churches’ is necessarily associated with ‘Italy’. (In the case of document-based tagging, the relationship is asymmetrical and the inverse relationship is weaker: Document 1 ‘is about’ T1, T2, T3; Topic 1 ‘has some relevant information in’ D1, D2, D3.) The other is that it’s descriptive rather than annotative: we’re not tagging stuff-about-stuff, we’re tagging… well, stuff, and tagging it with other stuff. These bi-directional relationships between concepts can be approximated by the associations between tags which emerge out of the cumulative process of document tagging, but this seems like going a very long way round. “We think in terms of associations”: should we have to say

this has been applied to resources which have also been classified as that

when what we want to say is

this is like that ?

There’s one glaring exception to this argument: Flickr. It’s easy to imagine an ‘italy’ photoset including images which were also tagged with ‘curedmeat’, ‘churchskull’ and so forth. Descriptive tagging, bi-directional associations, it’s all there – job done. This is deceptive, however. Flickr runs on discrete objects – individual images – and the relationships between Flickr tags really describe the images themselves, or at most the universe of Flickr images. If we didn’t have any images of stifling heat in Italy, that association wouldn’t exist; if we had three salami pictures and only one of a skull in a church, the ‘curedmeat’/‘italy’ association would automatically be three times as strong as ‘churchskull’/‘italy’. Once again, we’d have to go to considerable lengths in order to represent the associations which Bryan effortlessly set out in 32 hastily-composed words.
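The count-driven asymmetry is easy to demonstrate. Here’s a sketch of how tag associations emerge from co-occurrence counts in a Flickr-style system (the photo tag sets are invented for illustration – three salami pictures, one church skull):

```python
from collections import Counter
from itertools import combinations

photos = [
    {"italy", "curedmeat"},
    {"italy", "curedmeat"},
    {"italy", "curedmeat"},
    {"italy", "churchskull"},
]

# Association strength = number of photos on which the two tags co-occur.
pair_counts = Counter()
for tags in photos:
    for a, b in combinations(sorted(tags), 2):
        pair_counts[(a, b)] += 1

print(pair_counts[("curedmeat", "italy")])    # 3
print(pair_counts[("churchskull", "italy")])  # 1
```

The ‘curedmeat’/‘italy’ association comes out three times as strong as ‘churchskull’/‘italy’ – a fact about the photo collection, not about Italy.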

Ethnoclassification: do we have the technology?

Tag tag tag

Tom Coates’ interesting post Two cultures of fauxonomies collide has been getting a lot of attention lately, mainly thanks to Dave. There’s a particularly interesting discussion running at Many-to-Many. The discussion has progressed quite rapidly, with several bright and articulate people pitching in to illustrate how Tom’s original insight can be developed. My problem is that I’m not sure what the discussion’s based on. For example, Emil Sotirov writes:

Seemingly, given the freedom of folksonomy, people tend to move from hierarchical “folder” modes of tag interpretation (one-to-many) towards more open “keyword” modes (many-to-many).

Keywords are flat, many-to-many, open; folders are hierarchical, one-to-many, closed. (In short, folders are bad, m’kay?) But what does this really mean? If I think that tags are ‘like’ keywords or that tags are ‘like’ folders, what difference does it actually make?

From Tom’s original piece:
Matt’s concept was quite close to the way tagging is used in del.icio.us – with an individual the only person who could tag their stuff and with an understanding that the act of tagging was kind of an act of filing. My understanding was heavily influenced by Flickr’s approach – which I think is radically different – you can tag other people’s photos for a start, and you’re clearly challenged to tag up a photo with any words that make sense to you. It’s less of a filing model than an annotative one.

Incidentally, “an individual the only person who could tag their stuff”? That’s Technorati rather than del.icio.us, surely?

But anyway – the main question is, what are you actually doing differently if you use a tag as an ‘annotative’ keyword rather than a ‘classifying’ folder? In either case, it seems to me, you’re pulling out a couple of characteristics of an object and using them to lay a trail back to it. The only real difference I can see is that you’d expect to have more ‘keywords’ than objects and fewer ‘folders’ than objects, but I can’t see how this changes the way you actually interact with the tags or the tag-holder services – or the objects, for that matter.

Perhaps I’m just not getting something – all enlightenment is welcome. But I suspect that, in practice, Flickr and del.icio.us and… er, all those other social tagging services… are converging on a model somewhere between ‘keyword’ and ‘folder’. The tag cloud is crucial here. Flickr may start by enabling you to “tag up a photo with any words that make sense to you”, but the tag cloud display “conceals the less popular [tags] and lets recurrence form emergent patterns” (as Tom notes here); it also prompts users to select from previously-used tags if possible. Conversely, the (more rudimentary) tag-cloud display in del.icio.us gives less-used tags more prominence than they had when they were left to scroll off the screen, prompting users to select more widely from previously-used tags. In effect, the tag cloud draws del.icio.us users away from big-tree-of-folders thinking, while also drawing Flickr users away from the keyword-pebbledash approach.
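The mechanics behind a tag cloud are simple enough to sketch. Log scaling is one common choice for mapping counts to display sizes – I’m not claiming it’s what Flickr or del.icio.us actually use, and the tags and counts here are invented:

```python
import math

tag_counts = {"italy": 120, "couscous": 3, "tagging": 45, "folksonomy": 9}

def cloud_sizes(counts, min_pt=10, max_pt=32):
    """Map tag frequencies onto font sizes, log-scaled between min_pt and max_pt."""
    lo = math.log(min(counts.values()))
    hi = math.log(max(counts.values()))
    span = (hi - lo) or 1
    return {t: round(min_pt + (math.log(n) - lo) / span * (max_pt - min_pt))
            for t, n in counts.items()}

print(cloud_sizes(tag_counts))
# {'italy': 32, 'couscous': 10, 'tagging': 26, 'folksonomy': 17}
```

The log scale is the point: a tag used 120 times isn’t rendered forty times larger than one used three times, so rare tags stay legible while recurrence still forms the visible pattern.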

[No, that wasn't my promised post about the Long Tail. (It doesn't exist, you know.) Yes, I will get round to it, some time.]

Semiological, or almost entirely?

Mike Harper:

Semiotics, which is clearly older than the semantic web, tells us you can’t always map signs to real world objects. You can do it for things like, say, the Taj Mahal, but not for things like democracy, justice etc. So they map to concepts. Trouble is, you’re talking really about what’s inside someone else’s head. And you can’t really be sure what that is. So, the argument goes, stuff like RDF is just “syntactic sugar”. It’s neatly structured but can’t escape the fact that the tags, urns etc have to have an agreed meaning … I can’t bring myself to agree with this completely. In practice people seem to get by. I think there must be a feedback loop involved. If you interpret a statement about X and act on it, and your interpretation is wrong, and the interpretation matters in this case, something bad will probably happen. You will then revise your understanding of what is meant by X.

This is all good phenomenological stuff – see the Schutz quote above. One of Schutz’s great arguments was that there is no definitional God’s eye view – there is only human social experience, including the experience of making and using signs.

So surely the semantic web can work in small ways where all parties are agreed on the meaning of the vocabulary.

The trouble is – as Clay pointed out back here – that if you’ve got that level of agreement among all participants you don’t need the semantics. If you’re all using the same schema anyway, your respective schemas don’t need to describe themselves – and if they do need to describe themselves, there needs to be a common language they can do it in, and hence a higher level of shared context.

What you can do is say “I’m using [x] to mean $FOO, which is a subtype of $BAR but does not overlap with $BAZ; how about you?” Or rather, “On 2005-06-03, writing in Manchester (England/UK/EU), I used [x] to mean $FOO…” and so on. That, to me, is (or rather will be) where it gets interesting – the point is not to encode semiotics but to encode semantics in such a way that the semiotics can be inferred.

Or rather, in such a way that the semiotics can’t not be inferred. Which they need to be. Once you get away from the physical sciences and their geek spinoffs, it’s very, very hard to reach a final level of granularity. You can map the physical contours of France in exactly the same way that you can map Britain – and with enough data you could map Britain 100 years ago and map France 100 years ago in exactly the same way. What you can’t do is chart the number of suicides or street thefts or families in poverty or users of illegal drugs or asylum applications or hospital admissions in Britain and compare them with the figures for Britain 100 years ago, let alone with French figures. This is not because the data isn’t there, but (in all those cases) because it’s the product of a complex set of social interactions – and, as such, it doesn’t have a stable meaning, in time or in space.

This is what I mean about inferring semiotics: figures on ‘drug use’, to take the most obvious example, are produced in particular ways and classified using particular criteria, which correspond to patterns of public health and law enforcement activity as well as to broader social attitudes. The data doesn’t contain or express those attitudes and patterns of activity – but if you don’t know about them it’s effectively meaningless. (“Hey, look, there are twice as many people using drugs! Oh, wait, there are twice as many substances classified as drugs. Never mind.”) The only way forward, it seems to me, is to (as it were) factory-stamp data with the conditions of its production, as far as they can be established: “this source on ‘drugs’ covers this period in this jurisdiction, and consequently uses definitions derived from this legislation, including this but excluding this and this“.
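A factory-stamped record of that kind might look like this – a sketch, with field names and figures invented purely for illustration:

```python
import copy

# A hypothetical 'drug use' statistic stamped with the conditions of its production.
record = {
    "measure": "recorded drug offences",
    "value": 12873,  # illustrative number, not a real statistic
    "provenance": {
        "jurisdiction": "England and Wales",
        "period": "2003",
        "defining_legislation": "Misuse of Drugs Act 1971",
        "includes": ["Class A", "Class B", "Class C"],
        "excludes": ["alcohol", "tobacco"],
    },
}

def comparable(a, b):
    """Two figures are only worth comparing if their conditions of production match."""
    return a["provenance"] == b["provenance"]

# A figure from a different period (hence a different classificatory regime)
# should refuse comparison, however similar the headline numbers look.
other = copy.deepcopy(record)
other["provenance"]["period"] = "1903"
print(comparable(record, other))  # False
```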

That’s what I’d like to do, anyway.

Greetings and salutations (and anomie)

I’ve started this blog as a place to collect my thoughts on user-centred ontologies, ethnoclassification, folksonomies, emergent semantics and so on. I’m looking at this area as part of a project for a repository of social science data sources at Manchester University. In my spare time I run another blog, Actually Existing; Chris at qwghlm has the rare distinction of being on the blogroll in both places. (Hi Chris!)

I wasn’t going to post anything else, but what do you know. I’ve just spent half an hour composing a comment on this entry on Many-to-Many – or rather, a reply to this comment – only to find that comments were closed. Not that it says so anywhere on the page. H’mph. Oh well, their loss is our gain.

Larry Sanger wrote: “There is nothing magical about how Wikipedia does things; it is just one system. I have every confidence that another system will arise, probably quite soon, that will blow Wikipedia out of the water, in terms of quality, while being equally productive and nearly as open”.

Does Wikipedia have a big (potentially insuperable) deficit? If so, what is it?

It seems to me that Wikipedia (or any collective, open repository) has two problems. One is enabling contributions to be challenged, debated and refined; the other is getting the articles to be written by, and the debates to be conducted among, people who know stuff (‘actual philosophers’ in Larry’s example). As I understand it, Wikipedia does the first of these very well, but can’t guarantee – and, more to the point, doesn’t necessarily promote – the second. The ideal is that, thanks to the process of open debate, good stuff will go into the repository and stay there, while errors are weeded out, weak entries are improved and gaps are filled. It may take a while for some of the more obscure areas to get populated adequately, but we trust that more topic area experts will come on board over time. I wonder, firstly, if that’s enough – whether the quality of Wikipedia is ever likely to be uniform, or (ware straw-man) near enough to uniform for the value of a Wikipedia citation to be fairly consistent.

Secondly, I wonder if there’s a risk of mistaking the goal (a repository of near-enough uniform quality) for what exists (a highly uneven repository with a few local areas of uniformly high quality). In other words, even for those people who believe that the goal is realisable and the Wikipedia framework will ultimately allow it to be realised, it’s important to bear in mind – and to make it known – that we’re not there yet.

Thirdly, I wonder (having read Danah’s comments and looked at the ‘anomie’ page discussed there) if the quality problem, in some areas, is more fundamental – if the problem isn’t that Wikipedia’s got ground to make up but that it’s facing the wrong way. For what I’d want to know about a concept like that, that page is pretty dreadful. It veers wildly between essentialism (there is a thing called ‘anomie’ and we know what it is, across time and space) and nominalism (different people have used this combination of letters to mean different things, who knew?). What’s not there is any sense of the history of the concept: it derives from a Greek word meaning [x] (if there was such a word – the wording is unclear); it was coined by Durkheim (or it already existed and was given more prominence – again, it’s unclear); he used it to mean [y]; Hayek later got hold of it (from Durkheim? from the Greeks?) and used it to mean [z]; that’s different from Durkheim’s conceptualisation in this way and this way; and it’s since entered common parlance, probably because of [well, how? I'm not sure].

I’m not saying this to slag off some Wikipedia contributors I don’t even know. My point – or rather, my tentative suspicion – is that Wikipedia may actually lend itself to problems like these, by starting from the question “What does $FOO mean?” For any value of $FOO there are two easy ways to answer that question – essentialist (‘here’s what it really means’) and nominalist (‘here are all the different ways people use it’) – and if you’re asking about the OSI 7-layer model, say, that’s precisely what you want. (Essentialist answer: “Level 1 is defined as [a]…” Nominalist answer: “‘OSI’ also stands for ‘Open Source Initiative’…”) Philosophy, and many of the social sciences, need a very different approach – which I’ll try to describe another day, maybe.
