GSoC 08 - NLTK Worklog

Wednesday, 6 August 2008

O'Terror International

I've been silent the past few days because I've had a really wicked flight into NY (via Chicago) which included a disasterous sequence in which a guy broke his overhead luggage bin trying to jam in an oversized backpack, which caused us to sit in the plane while a crew came to fix it, which postponed us long enough for a thunderstorm to settle overhead and delay the flight. Said thunderstorm then struck our plane with a bolt of lightning (with us in it!) and left our plane "fried" as the captain so eloquently put it. So I had to stay the night there, had my luggage lost, and all-in-all a lot of time was wasted.

The good news is that the MST parser appears to be working fine and correctly, which just leaves the nonprojective rule-based parser and the classifier interface as far as meeting the goals outlined in the proposal. I suppose a corpus reader would also be included in that, depending on how you want to interpret some of what I said in the proposal. That said, the code and the way that users interface to it is still a little unpolished, and there's a lot I would like to clean up. Since this, right now, is supposed to be time for review, I'm going to check in what I have as soon as this post is done, and basically open the floor for comments and suggestions. Because I haven't had time yet to address all these concerns (was just trying to get everything functional first), there are obviously a lot of issues to address that have already been raised. I'm going to enumerate them here, and edit them once I've fixed/investigated each. I think most of the higher-level suggestions have been raised, but if you see some bit of code that could be replaced with a more elegant way or could take advantage of some of python's features that I'm probably not familiar with, please let me know.

1. No need to use set.Set()
- Done.

2. Accessor functions shouldn't have a "get_" prefix (in regards to get_prob)
- changed it to compute_prob. As it takes a DepGraph object and computes its probability, would it still be refered to as an accessor function? Either way, I probably made this mistake in a few areas elsewhere.

3. Is the ChartCell class motivated?
- Not really. but on second thought if the statistical projective parser were to support filtering
out unlikely parses during the parsing process this would be the class that would take place in. It seems it'd be worth keeping around simply because it does not interfere with anything else and there's pretty low overhead involved, but could be extended to do interesting things.

4. Change contains() to __contains__() in DependencyGrammar
- Added a __contains__() variant, though I'm still not sure how to call it using the "in"
syntax since it takes two arguments.

5. Consider indexing the rhs of productions
- good point. Possibly address after GSoC

6. Is DependencyGrammar.contains_exactly() correct?
- No, it was at one point. Will be fixed with reintroduction of arity-based filtering

7. DependencyGrammar superclass?
- On second thought, they only share a list of productions, and the contains() function
which is the only function that operates on them works in a different way. Would this
be useful for anything besides showing a conceptual link?

8. Replace dict with defaultdict, has_key() with in
- I used them everywhere, so this will take a while

9. Include a dependency-parsed corpus?
- Have it, just don't know where to put it

10. Corpus reader?
- The code's there, can package it properly when the corpus is commited

11. Spaces for tabs
- I have a quick script to fix this, will run it before the move to nltk

12. Standard demo code wrapper
- Done.

In addition to that there are some interfaces to write, and some commenting to do. I'm hoping to have all that down by the 11th, and use the remaining time for doctests. Also, I've noticed some files (cfg.py is a good example) contain a lot of different classes. It's hard for me to develop like that so I've split my main parser file into two, but I would typically divide even further if I was coding in Java etc. Is there a motivation for having it all in one file?

Another issue I wanted to quickly address was why I'm choosing to implement the rule-based nonprojective. I've had some discussions about this, and it's my own opinion as well, that the rule-based nonprojective parser is..not exactly in high demand. While I thought this time would be better spent on visualization for the current parsers, I don't think there's enough time to actually deliver anything in the way of a polished GUI in the remaining time. I decided I'd rather implement the nonprojective rule-based simply so that the proposal goals are entirely met, and some obvious progress is actually made with my remaining GSoC time. I sometimes forget that there's nothing stopping me from contributing that after GSoC, and if it would be pedagogically useful (as it seems it would be), I would certainly spend some off-the-clock, non-stressed time on visualization and user-guides post-GSoC.
Cheers,
Jason

Tuesday, 29 July 2008

Retractions...

On my last commit I had stated that the MST parser was, in at least some sense, "functional", and I'm afraid I have to make a small retraction to that. The problem is that I'm following Keith Hall's 'k-best Spanning Tree Parsing' paper, and I quickly noticed what Sebastian and I both agree is probably a typo (on the edge scoring moving from S1 to S2 in fig. 2). There was a second discrepancy between my parser's output and the final graph presented in this paper, where one edge score was calculated to be -8 instead of -7 as presented in this paper. I had chalked this up to another typo, and I was a bit presumptuous in doing so - Sebastian pointed out an alternate explanation for this mismatch and I'll have to adjust the algorithm to correct the edge re-scoring procedure.

In better news, I have the training procedure working, and I'm using the MaxentClassifier to handle the initial edge scoring. Thanks to Edward to pointing out how to make use of that class properly. This means that, aside from the previously-mentioned mistake, the only thing that needs to be done is to map that graph output into a true tree structure, and the parser will be finished. Well, that and I have to do a quick tie-in to tag the input sequence before beginning the graph constructions, but I think all the difficult sections are behind me. At least, I hope so, because I'd really like to produce an exceptional MSc thesis!

Also, our plan is to submit more code for public review on the 3rd or so, and hope to have feedback by the 7th or 8th, so that there will be time to make adjustments by the early pencil's down date - Aug 11th.
Cheers!

Monday, 16 June 2008

Narcolepsy

The electrical system in Mylne's Court, the dorm hall where I happen to reside, is getting a complete overhaul today and as a result all tenants have been kicked out and not allowed to return before 8am. As a result, I'm currently hauled up in Appleton Tower at 6 am, where I've been for the past 11 hours, tired, hungry, and quite possibly the only person in this entire building.

What does that mean to you, the avid NLTK GSoC project follower? It means that dependency grammars now support explicit arity/arguments, and that the parser behaves accordingly. This wasn't actually in the proposal so I didn't want to throw too much time at it, but I think that it's an important feature to have in this particular parser.
The ChartCell has been changed to a set, and the duplicate parses that existed in the previous commit are no longer a problem. Until I figure out to hash a list, there are still problems that could occur in more complex examples due to different spans being deemed equal by __eq__, but it's not tough fix, and I haven't even tested to see if this error actually exists - I'm just going on a hunch. There's also grammar parsing features similar to CFG, where the user can specify the grammar in the form of a string like " 'a' --> 'b' 'c' ", and get the set of productions automatically.

So I don't want to say that this parser is "done", but I think it's safe now to move on to the statistical projective parser for the moment, before coming back to put the finishing touches on this one. I still haven't done tests with more elaborate test sentences, but I'll do that shortly. I'll commit this once I get my laptop back on the internet, probably after a really long nap.

Thursday, 12 June 2008

NJ > MIT?

Whelp, the first iteration of the "Worse is Better" design philosophy is complete. That is to say, I have a working projective parser that *at the very least* parses my trivial five-word example sentence. While I put a lot of thought into the design of the classes and how the algorithm would proceed, during the coding of this I was surprised to find it working without some of the pieces I planned on using. For instance, a data structure to keep track of multiple dependency spans per set of chart indices, not to mention that my span concatenation algorithm only accounts for left and rightward covered concatenation between adjacent spans, but accidentally performs concatenation amongst spans centered at one point as well. The only problem with this is it introduces some duplicates into the chart, but I always wanted to make the chart cell a set-based data structure anyway, so that'll be fixed shortly. Now that something's working, I'm going to try and go through tonight, comment everything nicely, and commit my first real commit, before attempting to tackle the problem of arity by Monday, the discussion day. As it stands, integration with the current chart parsing interface looks hopeful.

Monday, 9 June 2008

The Span Plan

I'm well into the word-to-word projective parser now; All of the base classes used in this particular parser, like the dependency grammar, productions, spans, and chart entries are done and tested. I spend most of my time debating over a few design issues, the most pertinent of which is how to construct the larger spans from the smaller ones. This is really simple at first glance, but there are some aspects of concatenating together the larger spans that require some special attention (vs. typical CYK w/ CFG chart parsing) and I'm having trouble coming to terms with a solution to them that I'm content with from an aesthetic perspective.

1. Null spans
Eisner defines a span (paraphrasing - believe at your own risk) as two or more words or nodes, connected via directed arcs. The problem with this is when you're filling up the initial chart entries along the axis where each cell only spans itself, there obviously wouldn't be any spans at all in these cells. I could fill them with something equivalent to a 'word' or 'node' class, add methods for concatenating these single points onto a span, and have mixed data structures in the cells of the chart, but I don't really like that approach.

What I was originally planning on was using an alternate interpretation of a span, where I'll allow single word spans with a 'null' entry for the head_index. Chart entries would then consist of a set of spans and the corresponding "parent" span/cell-indices that created them (for the backtracing). I suppose this is really just arguing semantics (when I should be arguing syntax!), but it'll irk me to do it in the former method. On mulling this idea over further, there were somethings I didn't like about this as well.
Like at the third level of construction (spans of length 3 being constructed), and using Eisner's idea of concatenation as creating up to three possible new spans (rightward cover, leftward cover, no cover), it would be possible to concat a null span with a 2-word span via no cover and have this unusual dangling node considered in the span, when it's existence there is really only a placeholder for future constructions. I didn't like that. I also didn't like in some instances of construction via no-cover, spans would have multiple heads at the "highest" level. i.e. If there's been no cover added between two adjacent covered spans of two or more words. In this case there really isn't a single concise grammatical entity being represented by the span, but it's necessary for the chart parsing approach. I would have to change my dependency span class away from Eisner's simplifying assumption (well, more of an observation, really) of grammatical inertness, and I would have to store a list of all the arcs in use, and iterate over them all the time during span concatenation. It'd just be gross.

2. Chart Spans vs. Dependency Spans & the notion of grammatical inertness
It then occurred to me that I think a lot of my difficulty in making these decisions comes from not making a distinction between the syntactically-motivated dependency spans, and the purely algorithmic span-construction of the chart-filling. So I think I'm going to change my notion of a chart entry to consist of an ordered list of dependency spans, which will be allowed to overlap on one word. Dependency spans can then uphold Eisner's idea of grammatical inertness - that all the sub arcs under the highest covering arc are irrelevant to the construction of larger spans - and I can keep the concatenation code pretty concise and efficient. I think it might also allow for a better future tie-in with the already-present parsing interfaces. So dependency spans are subject to all the typical syntactic constraints, whereas the chart spans are really just markers for what part of the input string has been covered so far, and everything that was aesthetically annoying would been cleanly separated out.

I may also forgo the use of back pointers to recover the parse/smaller spans, because unlike CFG parsing it's really simple to track all the links that are currently in play. I might just have a int array of n size in each span instance, where n is the number of words in the span, and each cell is simply the index of its head, and say -1 for no head. Well, it's Python, I suppose I should have said 'list'. The difference between this is and what I was whining about in section 1 is that this list would not be necessary for the construction of spans, just obtaining the final parse(s). Any span in the final cell holds an array with the parse.

3. This isn't as much of an open problem as a distinction I wanted to make a formal note of. Chart parsing for dependency parsing needs a modified window length in order to create spans that are not only adjacent to each other, but also overlapping on one word. I'll have to see in the future whether this will prevent the code I'm building up now from being smoothly integrated into an implementation of the chart parsing interface already in the NLTK.

Saturday, 24 May 2008

Input on input?

I've been thinking a lot lately about different user needs and different views of dependency grammar in regards to the rule-based parsing. Here are the current issues to resolve (probably through non-blog means):
1. Dependency grammar syntax / somewhat conflicting views of DG
2. Arity of children in dependency rules
3. Arrow direction
4. Start Symbols

Issue 1: Syntax ------------------------------------------
I've seen two basic forms of dependency grammar, and I'm torn on which to support in the parser, or how to support. In the computational literature, words are often marked with their part-of-speech tags, and the arcs between head words are actually arcs between their head tags. And then there's the view where the tags are omitted, and the arcs link the words directly. The latter view seems to be the predominant one, and I think most of the people inclined to use a rule-based dependency parser want to emphasize the power of word-to-word relations that would be more salient in the latter model.
Of course, the statistical model conditions on tags and words, so presumably the underlying grammar should be compatible with both views. The typical constituency grammar (CFG style) is not far off:

>>> grammar = nltk.parse_cfg("""
... S -> NP V NP
... NP -> NP Sbar
... Sbar -> NP V
... NP -> 'fish'
... V -> 'fish'
... """)

the syntax for a dependency grammar could be similar, making the distinction between tags and words with single quotes:

User 1: CFG-style lexicon distinction
>>>> grammar = nltk.parse_depgrammar("""
.... VBD -> NN
.... NN -> DT
.... VBD -> 'fell'
.... NN -> 'price'
.... DT -> 'the'
"""")

User 2: Words only
>>>> grammar = nltk.parse_depgrammar("""
.... 'fell' -> 'price'
.... 'price' -> 'the'
""")

User 3: Crazy user
>>>> grammar = nltk.parse_depgrammar("""
.... 'fell' -> NN
.... NN -> 'price'
.... 'price' -> DT
.... DT -> 'the'
""")

Issue 2: Arity ------------------------------------------
Another issue is arity, and whether explicitly defining the number of children a word can take is something we want to support:
User 1: piece-wise
>>>> grammar = nltk.parse_depgrammar("""
.... VBD -> NN
.... NN -> DT
.... NN -> IN
""")

User 2: explicit arity
>>>> grammar = nltk.parse_depgrammar("""
.... VBD -> NN
.... NN -> DT IN # where the right-hand side is assumed to be an unordered set
# though I suppose duplicates should be allowed
""")

Issue 3: Arrow direction -------------------------------------
I've been drawing the arrows in a top-down manner, that I think is more
consistent with the CFG syntax. The literature isn't consistent on it, and
graphically I can understand the desire to draw them bottom-up, but is there
any objective reason why one would be preferred over the other here?

Issue 4: Start Symbols -------------------------------------
I can't say I've really thought this one through, but there's also the issue
of whether or not the start-symbol, or root word/const, should be marked in
the grammar definition. I think I can get by without it, but as I've said,
I haven't thought it through, and maybe there are contexts where the
user has one specific root in mind.

Tuesday, 20 May 2008

So far, so good

Here's the tentative schedule for this project:
05/26: Research review, setup subversion, check for reusable code
06/10: Establish base classes (graph, spans, etc)
06/25: Finish Rule-based Projective Parser
07/04: Make statistical augmentations to parser / classiÔ¨Åer interface
07/18: Finish Rule-based Non-projective Parser
07/30: Statistical Non-projective Parser, including MaxEnt tie-ins
08/09: Work-ins with contrib.dependency & corpora formatting
08/11: Evaluation
08/11: to Submission: Clean and package code