Patterns in the World Wide Web
By Bernardo Huberman
Riding on top of this transformation, and actually accelerating its pace, is the phenomenon of the Web, the medium through which much of the information carried by the Internet is accessed and displayed. Originally conceived in a physics laboratory in Geneva as a mechanism for distributing information among physicists trying to unravel the ultimate structure of matter, the Web quickly spread across the globe to all sorts of people and businesses, making it the de facto interactive medium of the information age. This spread has been nothing short of phenomenal: from a few pages available in 1993 and several million in 1998 to two billion pages indexed by Google last year. The user population has grown in equally astonishing fashion. Whereas in 1996 there were 61 million users, at the close of 1998 over 147 million people had Internet access worldwide, and as of May of this year the user population was estimated at 580 million.
While a casual examination of a few web pages, or of traffic along the Internet, would confirm one’s intuition that such a gigantic informational web is essentially random, a systematic and scientific study of the nature of the Web has revealed the existence of order in the midst of this chaos. The discovery of this order is surprising because no one would have anticipated finding anything regular given the chaotic fashion in which the Web grows. No central planner tells its users how to design their sites, how to forage for information, or even how to organize the links that allow users to surf from one page or site to another. The structure and content of the Web are the result of actions by millions of people who seldom, if ever, think of the global implications of adding an extra page or a link to their sites. That such actions can lead to an organized whole in spite of their independent nature was articulated in the context of societies by Hayek in his paper “The Use of Knowledge in Society”, which he published in 1945 in the American Economic Review.
The order that one finds in the Web manifests itself in the form of lawful patterns that persist over different geographical locations, millions of pages or even the nature of electronic commerce throughout the world. I describe these lawful patterns in my book The Laws of the Web: Patterns in the Ecology of Information, published by MIT Press last year. It turns out that there are many small elements contained within the Web, but few large ones. A few sites consist of millions of pages, but millions of sites contain only a handful of pages. Likewise, a few sites contain millions of links, but many sites have one or two. Millions of users flock to a few select sites, giving little attention to millions of others. In many of these cases, the patterns can be expressed mathematically as a so-called power law, meaning that the probability of attaining a certain size x is inversely proportional to x raised to some power, whose numerical value is greater than or equal to 1.
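Written out, the power law described above takes the following form. The symbols C (a normalizing constant) and β (the exponent) are notation introduced here for illustration and do not appear in the original text. Taking logarithms shows why such data fall on a straight line of slope −β when plotted on log-log axes.

```latex
P(x) = C\,x^{-\beta}, \qquad \beta \ge 1
\quad\Longrightarrow\quad
\log P(x) = \log C - \beta \log x
```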
The reason that power laws are interesting is that, unlike the more familiar bell-shaped Gaussian distribution, a power law distribution has no ‘typical’ scale and is hence frequently called ‘scale-free’. To understand the notion of scale-free, imagine for a moment that the order found in the Web was described by a Gaussian, or normal, distribution rather than a power law. In that case most of the sites on the Web, for example, would be close to one particular size, set by the peak of the bell-shaped curve, and that size, being the most common one found among all sites, would set the ‘scale’ of the distribution. But a power law distribution, which is the one that accurately describes the properties of the Web, does not have a peak, and therefore most of the sites do not have a given size but come in all sorts of sizes, with few having many pages and many having few. That is why power law distributions are called scale-free: if one were to look at the distribution of site sizes over one arbitrary range, say between 10,000 and 20,000 pages, that distribution would look the same as that for a different range, say between 10 and 100 pages. In other words, zooming in or out in the scale at which one studies the Web, one keeps obtaining the same result, i.e. an inverse power law in the probability of finding a given feature. It also means that if one can determine how something is distributed over a given range, one can then predict what the distribution will be for another range.
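The scale-free property can be checked numerically. The sketch below is only an illustration: the exponent of 2 and the Gaussian mean and standard deviation are arbitrary assumptions, not values measured on the Web. It asks how much less likely a site of size 2x is than a site of size x. For the power law the answer is the same at every x; for the Gaussian it depends strongly on where one looks.

```python
import math

def power_law(x, beta=2.0):
    """Unnormalized power-law density, proportional to x ** -beta (beta is assumed)."""
    return x ** -beta

def gaussian(x, mean=100.0, sd=20.0):
    """Normal density with an illustrative (assumed) mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def doubling_ratio(density, x):
    """Ratio density(2x) / density(x): relative likelihood of doubling the size."""
    return density(2 * x) / density(x)

# Compare the two distributions at a few sizes.
for x in (50, 100, 150):
    print(x, doubling_ratio(power_law, x), doubling_ratio(gaussian, x))
```

The power-law column stays at 2 ** -beta regardless of x, while the Gaussian column changes by many orders of magnitude across the same three sizes.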
A power law also gives a non-negligible probability to very large elements, whereas the exponential tail of a normal, or Gaussian, distribution makes elements much larger than the mean extremely unlikely. Another way of saying this is that power law distributions have very long tails, which means that there is an appreciable probability of finding sites that are extremely large compared to the average. How striking occurrences many times larger than the average are can be illustrated by the heights of individuals, which follow the familiar normal distribution. It would be very surprising to find someone measuring 2 or 3 times the average U.S. male height of 5’10”. On the other hand, a power law distribution makes it quite possible to find a site many times larger than average.
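To put rough numbers on this contrast, the sketch below compares the chance of exceeding 2 or 3 times the mean under each distribution. The mean of 70 inches is the 5’10” figure from the text; the 3-inch standard deviation and the Pareto tail exponent of 1.5 are assumptions chosen purely for illustration.

```python
import math

def normal_tail(k, mean=70.0, sd=3.0):
    """P(X > k * mean) for a normal distribution (heights in inches; sd is assumed)."""
    z = (k * mean - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

def pareto_tail(k, alpha=1.5):
    """P(X > k * mean) for a Pareto distribution with tail exponent alpha > 1, k >= 1."""
    # The Pareto mean is alpha * x_min / (alpha - 1) and P(X > x) = (x_min / x) ** alpha,
    # so the minimum size x_min cancels out of the result.
    return ((alpha - 1) / (alpha * k)) ** alpha

# The normal tail is so small that it may underflow to 0.0 for k = 3.
for k in (2, 3):
    print(k, normal_tail(k), pareto_tail(k))
```

Under these assumptions the power-law tail probabilities are a few percent, while the normal ones are vanishingly small.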
The implications of this finding are far reaching. If one concentrates on the number of visitors to sites, a proxy for their commercial value, it turns out that the top 0.1% of all sites in the World Wide Web capture a whopping 32.36% of user volume. Moreover, the top 1% of sites capture more than half of the total volume.
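A concentration figure of this kind is easy to compute once per-site visitor counts are available. The sketch below is a hypothetical illustration, not the analysis behind the numbers quoted above: the function name top_share and the synthetic traffic, in which the site of rank r receives visits proportional to 1/r, are assumptions, so its output should not be expected to match the measured percentages.

```python
def top_share(visits, fraction):
    """Fraction of total visits captured by the top `fraction` of sites."""
    ranked = sorted(visits, reverse=True)
    # Always count at least one site, even for very small fractions.
    n_top = max(1, round(len(ranked) * fraction))
    return sum(ranked[:n_top]) / sum(ranked)

# Synthetic, Zipf-like traffic (assumed): visits proportional to 1 / rank.
synthetic_visits = [1.0 / rank for rank in range(1, 100_001)]

for fraction in (0.001, 0.01):
    print(fraction, top_share(synthetic_visits, fraction))
```

Replacing synthetic_visits with real per-site visitor counts would give the corresponding shares for an actual sample.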
This concentration of visitors into a few sites cannot be due solely to the fact that people find some types of sites more interesting than others. Together with my colleague Lada Adamic, I verified this very skewed distribution by performing the same analysis for two categories of sites: adult sites and sites within the .edu domain.
Adult sites were assumed to offer a selection of images and optionally video and chat. Educational domain sites were assumed to contain information about academics and research as well as personal homepages of students, staff, and faculty, which could cover any range of human interest. Again, the distribution of visits among sites was unequal. We sampled 6,615 adult sites by keywords in their names. The top site captured 1.4% of the volume to adult sites, while the top 10% accounted for 60% of the volume. Similarly, among the .edu sites, the top site, umich.edu, held 2.81% of the volume, while the top 5% accounted for over 60% of the visitor traffic.
This result is interesting both to the economist studying the efficiency of markets in electronic commerce and to providers contemplating the number of customers their business will attract. From an economics point of view, such a disproportionate distribution of user volume among sites is characteristic of winner-take-all markets, wherein the top few contenders capture a significant part of the market share. In a winner-take-all market the rewards are proportional to relative rather than absolute performance, which implies a very skewed distribution of income among those participating in the market.
The ubiquitous patterns that appear in the informational ecosystem of the Web are manifestations of the underlying dynamics through which people interact with information and each other. While the details of these interactions are complicated and incorporate a number of intentional factors that are often hard to identify, they give rise to large-scale phenomena that are regular and reproducible on many scales. This is yet another example of the fact that a social system, while complicated and diverse in appearance, can display orderly patterns when observed with the proper tools and time scales.
The Laws of the Web: Patterns in the Ecology of Information, by Bernardo A. Huberman, MIT Press, 2001.