

Characteristics of Wikipedia

"Wikipedia mining" is the novel research area we are proposing. Wikipedia has several notable characteristics: a huge number of articles, live updates, a dense link structure, brief link texts, and URL identification for concepts. A number of early experiments have strongly confirmed our conviction that Wikipedia is a notable Web corpus for knowledge extraction.

Dense link structure

The "dense link structure" is one of the most interesting characteristics of Wikipedia. "Dense" means that it has many "inner links," that is, links from pages in Wikipedia to other pages in Wikipedia. As a result, articles are strongly connected by many hyperlinks. We believe that, because of this dense link structure, Wikipedia has topic locality and the connectivity among its articles is much stronger than on ordinary Web sites.

Let us show some results of our link structure analysis of Wikipedia (Sept. 2006). The statistics revealed that Wikipedia, and especially the distribution of its backward links, follows a typical "power-law" distribution: a few nodes have a very high degree and many nodes have a low degree. Specifically, 196 pages have more than 10,000 backward links each, 3,198 pages have more than 1,000 backward links each, and 67,515 pages have more than 100 backward links each.

This characteristic, the dense link structure, shows us the potential of Wikipedia mining and we believe that it is possible to extract valuable knowledge by analyzing the link structure.
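
The backward-link statistics above can be illustrated with a minimal sketch. It assumes the link graph is available as a list of (source, target) page pairs; the function names and the toy data are hypothetical and are not taken from the original analysis.

```python
from collections import Counter

def backward_link_counts(links):
    """Count backward (incoming) links per page from (source, target) pairs."""
    # Use a set so that repeated links between the same two pages count once.
    return Counter(target for _source, target in set(links))

def pages_above(counts, threshold):
    """Number of pages with more than `threshold` backward links."""
    return sum(1 for n in counts.values() if n > threshold)

# Toy link graph (hypothetical data, not the 2006 Wikipedia snapshot).
links = [
    ("Golden Delicious", "Apple"),
    ("Fuji (apple)", "Apple"),
    ("Cider", "Apple"),
    ("Apple", "Fruit"),
    ("Cider", "Fruit"),
    ("Apple", "Cider"),
]

counts = backward_link_counts(links)
for threshold in (0, 1, 2):
    print(threshold, pages_above(counts, threshold))
```

On a real dump the same counting would be run with thresholds such as 100, 1,000 and 10,000; a steep drop in the number of pages as the threshold grows is what a power-law-like distribution would look like.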

URL as an Identifier

As a Web corpus for knowledge extraction, one of the most notable characteristics of Wikipedia is that concepts are identified by URLs. Ordinary (electronic) dictionaries have indexes for finding the concepts the user wants to know about. In most cases, however, several concepts are placed under a single index entry, so the different meanings of an ambiguous term are listed in one article. This is no problem for humans because the article is human readable, but it is not machine understandable.

For example, if the sentence "Golden delicious is a kind of apple" appears in a dictionary article, humans immediately understand that "apple" means a fruit. For a machine, however, the sentence is difficult to analyze because "apple" is an ambiguous term and nothing identifies which meaning is intended. To make this sentence machine understandable, we need some identifier.

On Wikipedia, almost every concept (article/page) has its own URL as an identifier. In other words, each concept can be identified individually by its URL. This makes it possible to analyze term relations while avoiding the problems caused by ambiguous terms or context.
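
As a minimal sketch of this idea, the snippet below shows the same surface term "apple" resolving to two different identifiers once each mention carries the title of the article it links to. The base URL, the title-to-URL rule (spaces replaced by underscores), and the helper name concept_url are assumptions made for illustration.

```python
from urllib.parse import quote

# Assumed base URL of English Wikipedia articles.
BASE = "https://en.wikipedia.org/wiki/"

def concept_url(title):
    """Build an article URL from a title (spaces become underscores)."""
    return BASE + quote(title.replace(" ", "_"))

# The same surface term linked to two different concepts.
# Each mention is (surface text, title of the linked article).
mentions = [
    ("apple", "Apple"),       # the fruit
    ("apple", "Apple Inc."),  # the company
]

for text, title in mentions:
    print(text, "->", concept_url(title))
```

Because the two mentions map to different URLs, a program can treat them as different concepts even though the link text is identical.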

Brief Link Texts

Link texts in Wikipedia are brief, clear and simple compared with those of ordinary Web sites. Link texts on Web sites often contain wordy phrases such as "Click here for more detailed information." Sometimes the link text carries no important information about the linked page at all, which is one of the common causes of accuracy problems in thesaurus construction based on Web mining. In contrast, link texts on Wikipedia are very well refined.

Among the authors of Wikipedia, it is common practice to use the title of an article as the link text, but users can also give other link texts to an article. This leads to another important characteristic, the "variety of link text," which can be used to extract valuable information. Interestingly, even these alternative link texts rarely contain wordy information.
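
The "variety of link text" can be collected with a short sketch like the one below. It assumes the articles are available as wiki markup in which links are written as [[Target]] or [[Target|link text]]; the regular expression, the function name link_texts, and the sample sentence are illustrative assumptions, and the pattern does not cover every markup case (templates and nested links are ignored).

```python
import re
from collections import defaultdict

# Matches [[Target]] and [[Target|link text]]; drops any section anchor (#...).
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")

def link_texts(wikitext):
    """Map each link target to the set of link texts used for it."""
    texts = defaultdict(set)
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without an explicit link text, the title itself is displayed.
        text = (match.group(2) or target).strip()
        texts[target].add(text)
    return dict(texts)

# Hypothetical sample markup.
sample = (
    "The [[iPod]] is made by [[Apple Inc.|Apple]]. "
    "[[Apple Inc.]] also sells the [[iPod nano|nano]]."
)
print(link_texts(sample))
```

Targets that accumulate several distinct link texts are candidates for synonym or abbreviation extraction.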

Live Update

The content management workflow of ordinary, human-made dictionaries follows a top-down approach in most cases. The top-down approach benefits quality but limits topic coverage: general concepts are covered first, and domain-specific terms and new concepts are covered later (or never). For instance, almost no paper-based dictionary has an entry for iPod, whereas Wikipedia has several entries for the iPod nano with detailed information and pictures.

The workflow of Wikipedia is based entirely on a bottom-up approach. Since Wikipedia is based on Wiki, users can edit articles easily and promptly. This feature leads to several advantages: wide-ranging concept coverage, coverage of new concepts, and collaborative modification.

As an example, after a new product is announced, an article on the product with detailed information usually appears on Wikipedia much sooner than an entry appears in ordinary paper-based dictionaries. One of the most difficult issues in thesaurus construction is the coverage of new terms, and this characteristic shows that Wikipedia has the potential to overcome the problem.

Ultimately, ease of use was the dominant factor in achieving wide-ranging topic coverage. Wikipedia allows users to edit the content via Web browsers. Since authorities on specific topics are not always good at using complicated computer systems, this feature was critical in attracting so many contributors and covering such a wide range of topics.

Impact of Wikipedia Mining

As described above, Wikipedia is an invaluable Web corpus for knowledge extraction, which is why we propose "Wikipedia mining" as a novel research area. To prove this conviction, we launched Wikipedia Lab., a Web site for the special interest group on Wikipedia mining.


The detailed characteristics of Wikipedia are described in our research papers.

  • M. Erdmann, K. Nakayama, T. Hara, and S. Nishio: An Approach for Extracting Bilingual Terminology from Wikipedia, Proc. of International Conference on Database Systems for Advanced Applications (DASFAA), (Mar. 2008).
  • K. Nakayama, T. Hara, and S. Nishio: Wikipedia Mining for An Association Web Thesaurus Construction, Proc. of International Conference on Web Information Systems Engineering (WISE), pp. 322-334 (Dec. 2007).
  • K. Nakayama, T. Hara, and S. Nishio: A Thesaurus Construction Method from Large Scale Web Dictionaries, Proc. of International Conference on Advanced Information Networking and Applications (IEEE AINA), pp. 932-939 (May 2007).

