The Span Plan

I'm well into the word-to-word projective parser now; All of the base classes used in this particular parser, like the dependency grammar, productions, spans, and chart entries are done and tested. I spend most of my time debating over a few design issues, the most pertinent of which is how to construct the larger spans from the smaller ones. This is really simple at first glance, but there are some aspects of concatenating together the larger spans that require some special attention (vs. typical CYK w/ CFG chart parsing) and I'm having trouble coming to terms with a solution to them that I'm content with from an aesthetic perspective.

1. Null spans
Eisner defines a span (paraphrasing - believe at your own risk) as two or more words or nodes, connected via directed arcs. The problem with this is when you're filling up the initial chart entries along the axis where each cell only spans itself, there obviously wouldn't be any spans at all in these cells. I could fill them with something equivalent to a 'word' or 'node' class, add methods for concatenating these single points onto a span, and have mixed data structures in the cells of the chart, but I don't really like that approach.

What I was originally planning on was using an alternate interpretation of a span, where I'll allow single word spans with a 'null' entry for the head_index. Chart entries would then consist of a set of spans and the corresponding "parent" span/cell-indices that created them (for the backtracing). I suppose this is really just arguing semantics (when I should be arguing syntax!), but it'll irk me to do it in the former method. On mulling this idea over further, there were somethings I didn't like about this as well.
Like at the third level of construction (spans of length 3 being constructed), and using Eisner's idea of concatenation as creating up to three possible new spans (rightward cover, leftward cover, no cover), it would be possible to concat a null span with a 2-word span via no cover and have this unusual dangling node considered in the span, when it's existence there is really only a placeholder for future constructions. I didn't like that. I also didn't like in some instances of construction via no-cover, spans would have multiple heads at the "highest" level. i.e. If there's been no cover added between two adjacent covered spans of two or more words. In this case there really isn't a single concise grammatical entity being represented by the span, but it's necessary for the chart parsing approach. I would have to change my dependency span class away from Eisner's simplifying assumption (well, more of an observation, really) of grammatical inertness, and I would have to store a list of all the arcs in use, and iterate over them all the time during span concatenation. It'd just be gross.

2. Chart Spans vs. Dependency Spans & the notion of grammatical inertness
It then occurred to me that I think a lot of my difficulty in making these decisions comes from not making a distinction between the syntactically-motivated dependency spans, and the purely algorithmic span-construction of the chart-filling. So I think I'm going to change my notion of a chart entry to consist of an ordered list of dependency spans, which will be allowed to overlap on one word. Dependency spans can then uphold Eisner's idea of grammatical inertness - that all the sub arcs under the highest covering arc are irrelevant to the construction of larger spans - and I can keep the concatenation code pretty concise and efficient. I think it might also allow for a better future tie-in with the already-present parsing interfaces. So dependency spans are subject to all the typical syntactic constraints, whereas the chart spans are really just markers for what part of the input string has been covered so far, and everything that was aesthetically annoying would been cleanly separated out.

I may also forgo the use of back pointers to recover the parse/smaller spans, because unlike CFG parsing it's really simple to track all the links that are currently in play. I might just have a int array of n size in each span instance, where n is the number of words in the span, and each cell is simply the index of its head, and say -1 for no head. Well, it's Python, I suppose I should have said 'list'. The difference between this is and what I was whining about in section 1 is that this list would not be necessary for the construction of spans, just obtaining the final parse(s). Any span in the final cell holds an array with the parse.

3. This isn't as much of an open problem as a distinction I wanted to make a formal note of. Chart parsing for dependency parsing needs a modified window length in order to create spans that are not only adjacent to each other, but also overlapping on one word. I'll have to see in the future whether this will prevent the code I'm building up now from being smoothly integrated into an implementation of the chart parsing interface already in the NLTK.

GSoC 08 - NLTK Worklog

Monday, 9 June 2008

The Span Plan

No comments:

Resources

Blog Archive

About Me