O'Terror International

I've been silent the past few days because I had a really wicked flight into NY (via Chicago). It included a disastrous sequence in which a guy broke his overhead luggage bin trying to jam in an oversized backpack, so we sat in the plane while a crew came to fix it, which held us up long enough for a thunderstorm to settle overhead and delay the flight. Said thunderstorm then struck our plane with a bolt of lightning (with us in it!) and left our plane "fried," as the captain so eloquently put it. So I had to stay the night there, my luggage was lost, and all in all a lot of time was wasted.

The good news is that the MST parser appears to be working correctly, which leaves just the nonprojective rule-based parser and the classifier interface to finish in order to meet the goals outlined in the proposal. I suppose a corpus reader would also be included in that, depending on how you want to interpret some of what I said in the proposal. That said, the code and the way users interface with it are still a little unpolished, and there's a lot I would like to clean up. Since right now is supposed to be review time, I'm going to check in what I have as soon as this post is done and open the floor for comments and suggestions. A lot of issues have already been raised that I haven't had time to address yet (I was just trying to get everything functional first). I'm going to enumerate them here and update each entry once I've fixed or investigated it. I think most of the higher-level suggestions have been raised, but if you see a bit of code that could be written more elegantly or could take advantage of Python features that I'm probably not familiar with, please let me know.

1. No need to use sets.Set()
- Done.

2. Accessor functions shouldn't have a "get_" prefix (in regards to get_prob)
- Changed it to compute_prob. Since it takes a DepGraph object and computes its probability, would it still be referred to as an accessor function? Either way, I probably made this mistake in a few other places.

3. Is the ChartCell class motivated?
- Not really. On second thought, though, if the statistical projective parser were to support filtering out unlikely parses during the parsing process, this is the class where that would take place. It seems worth keeping around because it doesn't interfere with anything else, the overhead is pretty low, and it could be extended to do interesting things.

4. Change contains() to __contains__() in DependencyGrammar
- Added a __contains__() variant, though I'm still not sure how to call it using the "in" syntax since it takes two arguments.

5. Consider indexing the rhs of productions
- Good point. Possibly address after GSoC.

6. Is DependencyGrammar.contains_exactly() correct?
- No, though it was at one point. Will be fixed with the reintroduction of arity-based filtering.

7. DependencyGrammar superclass?
- On second thought, they only share a list of productions, and the contains() function, which is the only function that operates on them, works in a different way in each. Would this be useful for anything besides showing a conceptual link?

8. Replace dict with defaultdict, has_key() with in
- I used them everywhere, so this will take a while

9. Include a dependency-parsed corpus?
- Have it, just don't know where to put it

10. Corpus reader?
- The code's there; I can package it properly when the corpus is committed

11. Spaces for tabs
- I have a quick script to fix this, will run it before the move to nltk

12. Standard demo code wrapper
- Done.
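
For items 4 and 8, here is a minimal sketch of one possible approach, not the actual NLTK code. The expression `x in y` calls `y.__contains__(x)` with exactly one argument, so a two-argument check can be passed as a single (head, modifier) tuple. The class name ToyDependencyGrammar, the tuple layout, and the example productions are all hypothetical and only there for illustration. The same sketch uses defaultdict for the index and `in` where has_key() would have been.

```python
from collections import defaultdict


class ToyDependencyGrammar:
    """Hypothetical stand-in for illustration; not NLTK's DependencyGrammar."""

    def __init__(self, productions):
        # Map each head word to the set of modifiers it may take.
        # defaultdict(set) removes the "create the entry if missing" boilerplate.
        self._modifiers = defaultdict(set)
        for head, mod in productions:
            self._modifiers[head].add(mod)

    def contains(self, head, mod):
        # Two-argument form. .get() is used so that a failed lookup
        # does not silently insert an empty entry into the defaultdict.
        return mod in self._modifiers.get(head, ())

    def __contains__(self, pair):
        # "in" passes a single object, so unpack the (head, mod) tuple here.
        head, mod = pair
        return self.contains(head, mod)


# Hypothetical productions, written as (head, modifier) pairs.
grammar = ToyDependencyGrammar(
    [("shot", "I"), ("shot", "elephant"), ("elephant", "an")]
)

has_known = ("shot", "elephant") in grammar
has_unknown = ("elephant", "shot") in grammar
```

With this layout the call site reads `(head, mod) in grammar`, and the original two-argument contains() can stay around for callers that prefer it.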

In addition to that, there are some interfaces to write and some commenting to do. I'm hoping to have all that done by the 11th, and to use the remaining time for doctests. Also, I've noticed that some files (cfg.py is a good example) contain a lot of different classes. It's hard for me to develop like that, so I've split my main parser file into two, but I would typically divide even further if I were coding in Java etc. Is there a motivation for having it all in one file?

Another issue I wanted to quickly address is why I'm choosing to implement the rule-based nonprojective parser. I've had some discussions about this, and it's my own opinion as well, that the rule-based nonprojective parser is... not exactly in high demand. While I thought this time would be better spent on visualization for the current parsers, I don't think there's enough time left to deliver anything in the way of a polished GUI. I decided I'd rather implement the nonprojective rule-based parser so that the proposal goals are entirely met and some obvious progress is made with my remaining GSoC time. I sometimes forget that there's nothing stopping me from contributing the visualization work after GSoC, and if it would be pedagogically useful (as it seems it would be), I would certainly spend some off-the-clock, non-stressed time on visualization and user guides post-GSoC.
Cheers,
Jason

GSoC 08 - NLTK Worklog

Wednesday, 6 August 2008
