Final Thesis: Improving the Wikipedia Parser

Abstract: The MediaWiki software has a built-in parser for Wikitext; however, its capabilities are limited, and it only converts Wikitext to HTML [12]. The OSR group has developed an alternative Wikitext parser called Sweble, which parses Wikitext into a high-level (well-defined, machine-readable) representation, e.g. an Abstract Syntax Tree (AST). In addition, this parser lets users check Wikitext for sloppy syntax and automatically correct these errors. It is therefore practical to integrate the parser into MediaWiki so that users can check their Wikitext for sloppy syntax. In this way, useful data can be logged from user-edited Wikitext and analyzed to help us understand Wikitext as well as how users work with it. Based on this understanding, an improved Wikitext grammar can be proposed. To achieve this goal, the new parser will be provided as a web-based service, and a MediaWiki extension will be created through which users can request this service. The primary objective of this thesis is the integration of the alternative parser into the MediaWiki infrastructure to support research on Wikitext and the improvement of the Sweble parser.

Keywords: MediaWiki, Extension, Parser, Wikitext, Sloppy syntax, REST API, Spring, Dependency Injection, Hibernate, Wicket

PDFs: Diplomarbeit (in English)

Reference: Jing Tang. Improving the Wikipedia Parser. Diplomarbeit, Friedrich-Alexander University of Erlangen-Nürnberg: 2011.