Saturday, 24 May 2008

Input on input?

I've been thinking a lot lately about different user needs and different views of dependency grammar in regards to the rule-based parsing. Here are the current issues to resolve (probably through non-blog means):
1. Dependency grammar syntax / somewhat conflicting views of DG
2. Arity of children in dependency rules
3. Arrow direction
4. Start Symbols

Issue 1: Syntax ------------------------------------------
I've seen two basic forms of dependency grammar, and I'm torn on which to support in the parser, or how to support. In the computational literature, words are often marked with their part-of-speech tags, and the arcs between head words are actually arcs between their head tags. And then there's the view where the tags are omitted, and the arcs link the words directly. The latter view seems to be the predominant one, and I think most of the people inclined to use a rule-based dependency parser want to emphasize the power of word-to-word relations that would be more salient in the latter model.
Of course, the statistical model conditions on tags and words, so presumably the underlying grammar should be compatible with both views. The typical constituency grammar (CFG style) is not far off:

>>> grammar = nltk.parse_cfg("""
... S -> NP V NP
... NP -> NP Sbar
... Sbar -> NP V
... NP -> 'fish'
... V -> 'fish'
... """)

the syntax for a dependency grammar could be similar, making the distinction between tags and words with single quotes:

User 1: CFG-style lexicon distinction
>>>> grammar = nltk.parse_depgrammar("""
.... VBD -> NN
.... NN -> DT
.... VBD -> 'fell'
.... NN -> 'price'
.... DT -> 'the'
"""")

User 2: Words only
>>>> grammar = nltk.parse_depgrammar("""
.... 'fell' -> 'price'
.... 'price' -> 'the'
""")

User 3: Crazy user
>>>> grammar = nltk.parse_depgrammar("""
.... 'fell' -> NN
.... NN -> 'price'
.... 'price' -> DT
.... DT -> 'the'
""")

Issue 2: Arity ------------------------------------------
Another issue is arity, and whether explicitly defining the number of children a word can take is something we want to support:
User 1: piece-wise
>>>> grammar = nltk.parse_depgrammar("""
.... VBD -> NN
.... NN -> DT
.... NN -> IN
""")

User 2: explicit arity
>>>> grammar = nltk.parse_depgrammar("""
.... VBD -> NN
.... NN -> DT IN # where the right-hand side is assumed to be an unordered set
# though I suppose duplicates should be allowed
""")

Issue 3: Arrow direction -------------------------------------
I've been drawing the arrows in a top-down manner, that I think is more
consistent with the CFG syntax. The literature isn't consistent on it, and
graphically I can understand the desire to draw them bottom-up, but is there
any objective reason why one would be preferred over the other here?

Issue 4: Start Symbols -------------------------------------
I can't say I've really thought this one through, but there's also the issue
of whether or not the start-symbol, or root word/const, should be marked in
the grammar definition. I think I can get by without it, but as I've said,
I haven't thought it through, and maybe there are contexts where the
user has one specific root in mind.

Tuesday, 20 May 2008

So far, so good

Here's the tentative schedule for this project:
05/26: Research review, setup subversion, check for reusable code
06/10: Establish base classes (graph, spans, etc)
06/25: Finish Rule-based Projective Parser
07/04: Make statistical augmentations to parser / classifier interface
07/18: Finish Rule-based Non-projective Parser
07/30: Statistical Non-projective Parser, including MaxEnt tie-ins
08/09: Work-ins with contrib.dependency & corpora formatting
08/11: Evaluation
08/11: to Submission: Clean and package code

Monday, 12 May 2008

"First Post!"

This page is the worklog for a GSoC '08 project that aims to add a set of dependency parsers to the NLTK. The sole purpose of this page is really just to provide some documentation of the progress I'm making, and to discuss some of the hurdles and design decisions openly with my Mentors (Sebastian Riedel & Jason Baldridge) and anyone in the development community that's interested in this project and happens to find their way here. Here's the project's official listing from Google, for a slightly more in-depth explanation of what this project is about:

The Natural Language ToolKit (NLTK) is a popular, open source collection of tools, resources, and corpora for the development and education of NLP techniques. While the toolkit features a large selection of parsers (recursive descent, shift-reduce, chunk, chart, feature-based, etc), it lacks any form of dependency parsing. Dependency grammar is an increasingly popular syntactic framework which lacks the constituent nodes or phrases found in competing theories. A derivation tree in DG is an acyclic graph rooted in the sentence head (typically the main verb) and arcing recursively down to each dependent. Projective dependency grammar makes the additional constraint that these arcs cannot cross, while non-projective dependency grammar does not.

I propose the development of a suite of four dependency parsers for the NLTK comprising statistical and rule-based variants of both projective and non-projective parsing models. This would also include the necessary tools to interface between the parsers and the corpora included in the current NLTK release.

This project will be based primarily on the work of Eisner (1996) and McDonald (2005/2006). I will build the projective parsers using a modified version of a traditional chart parsing algorithm, and will likely use Eisner's "Model C" probability model for the probabilistic parser. The non-projective parsers will be based on McDonald's work and traverse a directed weighted graph data structure to build the most likely parse via the Chu-Liu-Edmonds MST algorithm. This should yield parsers that are O(n^3) projective, and O(n^2) non-projective.