Monday, 12 May 2008

"First Post!"

This page is the worklog for a GSoC '08 project that aims to add a set of dependency parsers to the NLTK. The sole purpose of this page is really just to provide some documentation of the progress I'm making, and to discuss some of the hurdles and design decisions openly with my Mentors (Sebastian Riedel & Jason Baldridge) and anyone in the development community that's interested in this project and happens to find their way here. Here's the project's official listing from Google, for a slightly more in-depth explanation of what this project is about:

The Natural Language ToolKit (NLTK) is a popular, open source collection of tools, resources, and corpora for the development and education of NLP techniques. While the toolkit features a large selection of parsers (recursive descent, shift-reduce, chunk, chart, feature-based, etc), it lacks any form of dependency parsing. Dependency grammar is an increasingly popular syntactic framework which lacks the constituent nodes or phrases found in competing theories. A derivation tree in DG is an acyclic graph rooted in the sentence head (typically the main verb) and arcing recursively down to each dependent. Projective dependency grammar makes the additional constraint that these arcs cannot cross, while non-projective dependency grammar does not.

I propose the development of a suite of four dependency parsers for the NLTK comprising statistical and rule-based variants of both projective and non-projective parsing models. This would also include the necessary tools to interface between the parsers and the corpora included in the current NLTK release.

This project will be based primarily on the work of Eisner (1996) and McDonald (2005/2006). I will build the projective parsers using a modified version of a traditional chart parsing algorithm, and will likely use Eisner's "Model C" probability model for the probabilistic parser. The non-projective parsers will be based on McDonald's work and traverse a directed weighted graph data structure to build the most likely parse via the Chu-Liu-Edmonds MST algorithm. This should yield parsers that are O(n^3) projective, and O(n^2) non-projective.

No comments: