Wikipedia Bilingual Dictionary

Bilingual dictionaries hold great potential for research areas such as machine translation and human-aided translation. Unfortunately, constructing bilingual dictionaries manually is expensive, and new or domain-specific terminology is difficult to cover. Much research has therefore been conducted on the automatic extraction of bilingual dictionaries. Extraction from large parallel corpora (bitexts) in particular has achieved impressive results. However, parallel corpora are available only for selected text domains and language pairs. For that reason, the potential of other resources is being explored as well.

We propose extracting bilingual terminology from large multilingual encyclopedias such as Wikipedia in order to complement bilingual dictionaries with accurate term-translation pairs for languages and text domains where no parallel corpora exist. Wikipedia is a very promising resource: the continuously growing encyclopedia already contains more than 16 million articles in over 270 languages, has a dense link structure, and covers a wide variety of topics.

In Wikipedia, there are many links between articles in different languages. If we regard the titles of Wikipedia articles as terminology, it is easy to extract translation relations by analyzing the interlanguage links, assuming that if two articles are connected by an interlanguage link, their titles are translations of each other.
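The interlanguage-link assumption above can be illustrated with a minimal Python sketch. It assumes that article source text is available as a dictionary mapping titles to wikitext and that interlanguage links appear in the legacy in-text form `[[ja:...]]`. The function name `extract_translation_pairs`, the regular expression, and the sample articles are hypothetical and are not part of the project's actual implementation.

```python
import re

# Matches legacy in-text interlanguage links such as [[ja:辞書]].
# Assumption: language codes are 2-3 lowercase letters, optionally followed by
# hyphenated suffixes. A real extractor would check against a list of known codes.
INTERLANG_LINK = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)*):([^\]|]+)\]\]")


def extract_translation_pairs(articles, target_lang):
    """Return (source title, target title) pairs for one target language.

    `articles` maps a source-language article title to its wikitext.
    Two titles are treated as translations of each other when the source
    article carries an interlanguage link to the target-language article.
    """
    pairs = []
    for title, wikitext in articles.items():
        for lang, target_title in INTERLANG_LINK.findall(wikitext):
            if lang == target_lang:
                pairs.append((title, target_title.strip()))
    return pairs


# Hypothetical sample data for illustration only.
sample_articles = {
    "Dictionary": "A dictionary is a collection of words. [[de:Wörterbuch]] [[ja:辞書]]",
    "Tokyo": "Tokyo is the capital of Japan. [[ja:東京都]]",
}

pairs = extract_translation_pairs(sample_articles, "ja")
for source_term, target_term in pairs:
    print(source_term, "->", target_term)
```

Each article contributes at most one pair per target language in this sketch, which is why interlanguage links alone yield a limited number of term-translation pairs.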

We have developed a method that analyzes not only interlanguage links in Wikipedia but also redirect pages and anchor texts to extend the number of term-translation pairs in the dictionary while maintaining a relatively high accuracy. Since not all term-translation pairs extracted by our method are correct, we use supervised learning to analyze the correctness of each extracted term-translation pair based on various characteristics (features).
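One way to picture the extension step is the following Python sketch, which expands each interlanguage-link pair with the redirect titles known for both articles. Anchor texts could be added through the same kind of lookup. The function `expand_with_redirects`, the dictionary layout, and the two boolean flags are assumptions made for illustration. The flags merely stand in for the kind of characteristics a supervised classifier could use and are not the feature set of the published method.

```python
from itertools import product


def expand_with_redirects(base_pairs, src_redirects, tgt_redirects):
    """Build candidate term-translation pairs from titles and redirect titles.

    `base_pairs` holds (source title, target title) pairs obtained from
    interlanguage links. `src_redirects` and `tgt_redirects` map an article
    title to the titles of redirect pages pointing at it (hypothetical layout).
    """
    candidates = []
    for src_title, tgt_title in base_pairs:
        src_terms = [src_title] + src_redirects.get(src_title, [])
        tgt_terms = [tgt_title] + tgt_redirects.get(tgt_title, [])
        # Every combination of a source term and a target term is a candidate.
        for src_term, tgt_term in product(src_terms, tgt_terms):
            candidates.append({
                "source": src_term,
                "target": tgt_term,
                # Illustrative features: was the term the article title itself?
                "source_is_title": src_term == src_title,
                "target_is_title": tgt_term == tgt_title,
            })
    return candidates


# Hypothetical sample data for illustration only.
base_pairs = [("United States", "アメリカ合衆国")]
src_redirects = {"United States": ["USA", "United States of America"]}
tgt_redirects = {"アメリカ合衆国": ["米国"]}

candidates = expand_with_redirects(base_pairs, src_redirects, tgt_redirects)
for candidate in candidates:
    print(candidate["source"], "->", candidate["target"])
```

Because every combination is generated, some candidates will be wrong, which is where a classifier trained on labeled pairs would come in to accept or reject each one.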

References

  • K.Nakayama, M.Ito, M.Erdmann, M.Shirakawa, T.Michishita, T.Hara, S.Nishio: Wikipedia Mining - A Survey on Wikipedia Research (Japanese), IPSJ Journal of Information Processing (Dec. 2009)
  • M.Erdmann, K.Nakayama, T.Hara, S.Nishio: Improving the Extraction of Bilingual Terminology from Wikipedia, ACM Transactions on Multimedia Computing, Communications and Applications (Oct. 2009)
  • K.Nakayama, M.Ito, M.Erdmann, M.Shirakawa, T.Michishita, T.Hara, S.Nishio: Wikipedia Mining - Challenge for Realizing Early Profits (Japanese), JSAI Journal (Oct. 2009)
  • M.Erdmann, K.Nakayama, T.Hara, S.Nishio: Using an SVM Classifier to Improve the Extraction of Bilingual Terminology from Wikipedia, Proc. of IJCAI workshop (Jul. 2009)
  • M.Erdmann, K.Nakayama, T.Hara, S.Nishio: Extraction of Bilingual Terminology from a Multilingual Web-based Encyclopedia, The Information Processing Society of Japan (IPSJ) Journal (Jul. 2008)
  • K.Nakayama, M.Pei, M.Erdmann, M.Ito, M.Shirakawa, T.Hara, S.Nishio: Wikipedia Mining - Wikipedia as a Corpus for Knowledge Extraction, Proc. of Wikimania (Jul. 2008)
