A trick of the eye

A long time ago on a Web site far, far away, Clay Shirky wrote:

“We are all so used to bell curve distributions that power law distributions can seem odd.”

He then traced Pareto-like ‘power law’ curves operating in a number of domains where large numbers of people make unconstrained choices – most memorably, inbound link counts for blogs. The inverse ‘power law’ curve dives steeply, then levels out, glides downwards almost to zero and peters out slowly. And thus was born the ‘Long Tail’.

As I wrote here, there’s a problem with this article, and hence with the ‘Long Tail’ image itself. Despite repeated references to ‘power law distributions’, none of the curves Clay presented were distributions. They were histograms representing ranked lists: in other words, series of numbers ordered from high to low.

What’s the difference? A short answer is that the data Clay presents makes his own comparison with ‘bell curve’ (normal) distributions unsustainable: sort any set of numbers from high to low and you will only ever get a downward curve, whatever shape the underlying distribution has.

For a longer answer, you’ll have to look at some numbers. Here are some x, y values which would give you a normal distribution. (For anyone in danger of glazing over, that’s ‘x’ as in the horizontal axis, with values running low to high from left to right; ‘y’ values are on the vertical axis, running low to high from bottom to top.)

1 1
2 30
3 100
4 240
5 400
6 600
7 750
8 900
9 960
10 1000
11 1000
12 960
13 900
14 750
15 600
16 400
17 240
18 100
19 30
20 1

OK? And here are some co-ordinates which would give you an inverse power-law distribution:

1 1000
2 444
3 250
4 160
5 111
6 82
7 63
8 49
9 40
10 33
11 28
12 24
13 20
14 18
15 16
16 14
17 12
18 11
19 10
20 9
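
For anyone who wants to check that these co-ordinates really do follow a power law, here is a minimal Python sketch. The formula y = 4000 / (x + 1)^2 is inferred from the listed values and is not stated anywhere in the post; the function names are likewise made up for the example. Rounding to the nearest whole number (halves rounded up) reproduces the table.

```python
import math

# The y values from the table above, for x = 1..20.
listed = [1000, 444, 250, 160, 111, 82, 63, 49, 40, 33,
          28, 24, 20, 18, 16, 14, 12, 11, 10, 9]

def inverse_power_law(x):
    """Assumed formula (inferred from the table, not given in the post)."""
    return 4000 / (x + 1) ** 2

def round_half_up(value):
    """Round to the nearest integer, with .5 going up (Python's round() would give 62 for 62.5)."""
    return math.floor(value + 0.5)

generated = [round_half_up(inverse_power_law(x)) for x in range(1, 21)]

print(generated == listed)  # True
```

The point is that here each y is determined by its x: the data itself dictates where every value sits along the horizontal axis.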

Just for the hell of it, here are some numbers that would give you a direct (ascending) power law distribution:

1 9
2 10
3 11
4 12
5 14
6 16
7 18
8 20
9 24
10 28
11 33
12 40
13 49
14 63
15 82
16 111
17 160
18 250
19 444
20 1000

Finally, by way of contrast, here’s a series of numbers.

1000
444
250
160
111
82
63
49
40
33
28
24
20
18
16
14
12
11
10
9

I’ve sorted these numbers high to low, but – unlike the other three examples – there’s nothing in the data that told me to do that. You could arrange them that way; you could sort them low to high instead; you could even hack them about manually to produce a rather lumpy and uneven bell curve. It’s up to you.
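
A minimal Python sketch of those three options, using the same twenty numbers. The `bell_arrangement` helper is a made-up name and just one of many ways to hack the values into a hump; nothing in the data favours it over the other two orderings.

```python
values = [1000, 444, 250, 160, 111, 82, 63, 49, 40, 33,
          28, 24, 20, 18, 16, 14, 12, 11, 10, 9]

def bell_arrangement(numbers):
    """Deal the sorted values alternately to a left and a right half,
    then reverse the right half so the largest values meet in the middle."""
    ordered = sorted(numbers)
    left = ordered[0::2]    # 1st, 3rd, 5th smallest, ...
    right = ordered[1::2]   # 2nd, 4th, 6th smallest, ...
    return left + right[::-1]

high_to_low = sorted(values, reverse=True)  # the 'Long Tail' shape
low_to_high = sorted(values)                # the same shape, mirrored
bell = bell_arrangement(values)             # a lumpy, uneven bell

print(bell)
# [9, 11, 14, 18, 24, 33, 49, 82, 160, 444, 1000, 250, 111, 63, 40, 28, 20, 16, 12, 10]
```

All three lists contain exactly the same numbers; only the arrangement, which is the reader's choice, differs.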

I’m not saying that a ranked listing – arranging numbers like these high to low – is meaningless. The ranked histogram is quite a good graphic – it’s informative (within limits) and easy to grasp. What I am saying is that it’s an arbitrary ordering rather than a distribution. Which is to say, it’s not the best way of representing this data – let alone the only way. It’s a relatively information-poor representation, and one which tends to promote perverse and unproductive ways of thinking about the data.
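
To make the contrast concrete, here is a minimal Python sketch, offered purely as an illustration and not as the suggestion promised for the next post. It counts how many of the twenty numbers fall into each band of values. The band width of 100 and the name `frequency_by_band` are assumptions chosen for the example.

```python
from collections import Counter

values = [1000, 444, 250, 160, 111, 82, 63, 49, 40, 33,
          28, 24, 20, 18, 16, 14, 12, 11, 10, 9]

def frequency_by_band(numbers, width=100):
    """Count how many numbers fall in each band [k*width, (k+1)*width).
    The width is an arbitrary choice made for this illustration."""
    counts = Counter((n // width) * width for n in numbers)
    return dict(sorted(counts.items()))

distribution = frequency_by_band(values)

for band_start, count in distribution.items():
    print(band_start, count)
# 0 15
# 100 2
# 200 1
# 400 1
# 1000 1
```

Unlike the ranked list, the horizontal axis here (the value band) comes with its own low-to-high order, so the shape of the curve is not something the person drawing it gets to pick.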

More about this – and a couple of constructive suggestions – next time I post.
