Input on input?

Saturday, 24 May 2008

I've been thinking a lot lately about different user needs and different views of dependency grammar with regard to rule-based parsing. Here are the current issues to resolve (probably through non-blog means):
1. Dependency grammar syntax / somewhat conflicting views of DG
2. Arity of children in dependency rules
3. Arrow direction
4. Start Symbols

Issue 1: Syntax ------------------------------------------
I've seen two basic forms of dependency grammar, and I'm torn on which to support in the parser, or how to support both. In the computational literature, words are often marked with their part-of-speech tags, and the arcs between head words are actually arcs between their tags. Then there's the view where the tags are omitted and the arcs link the words directly. The latter view seems to be the predominant one, and I think most people inclined to use a rule-based dependency parser want to emphasize the power of word-to-word relations, which are more salient in the latter model.
Of course, the statistical model conditions on tags and words, so presumably the underlying grammar should be compatible with both views. The typical constituency grammar (CFG style) is not far off:

>>> grammar = nltk.parse_cfg("""
... S -> NP V NP
... NP -> NP Sbar
... Sbar -> NP V
... NP -> 'fish'
... V -> 'fish'
... """)

The syntax for a dependency grammar could be similar, distinguishing tags from words with single quotes:

User 1: CFG-style lexicon distinction
>>> grammar = nltk.parse_depgrammar("""
... VBD -> NN
... NN -> DT
... VBD -> 'fell'
... NN -> 'price'
... DT -> 'the'
... """)

User 2: Words only
>>> grammar = nltk.parse_depgrammar("""
... 'fell' -> 'price'
... 'price' -> 'the'
... """)

User 3: Crazy user
>>> grammar = nltk.parse_depgrammar("""
... 'fell' -> NN
... NN -> 'price'
... 'price' -> DT
... DT -> 'the'
... """)

Issue 2: Arity ------------------------------------------
Another issue is arity, and whether explicitly defining the number of children a word can take is something we want to support:
User 1: piece-wise
>>> grammar = nltk.parse_depgrammar("""
... VBD -> NN
... NN -> DT
... NN -> IN
... """)

User 2: explicit arity
>>> grammar = nltk.parse_depgrammar("""
... VBD -> NN
... NN -> DT IN  # where the right-hand side is assumed to be an unordered set,
...              # though I suppose duplicates should be allowed
... """)

Issue 3: Arrow direction -------------------------------------
I've been drawing the arrows in a top-down manner, which I think is more
consistent with the CFG syntax. The literature isn't consistent on this, and
graphically I can understand the desire to draw them bottom-up, but is there
any objective reason why one would be preferred over the other here?

Issue 4: Start Symbols -------------------------------------
I can't say I've really thought this one through, but there's also the issue
of whether or not the start-symbol, or root word/const, should be marked in
the grammar definition. I think I can get by without it, but as I've said,
I haven't thought it through, and maybe there are contexts where the
user has one specific root in mind.

3 comments:

Jason Narad said...: Based on some email banter with Sebastian, the definitive answer to issue #3 will be that the arrow direction is head -> modifier, based on a CoNLL 06 vote. It seems appropriate to stick with that decision.; 25 May 2008 at 08:28
Unknown said...: Issue 1: I don't really have any experience as a user of pure rule based dependency parsers so I don't have a proper answer, rather some questions:

a) Will people use this to build small toy grammars or for actual systems?
b) Will they typically have the grammar automatically extracted?

If I was to develop a small toy grammar then just having word to word arcs would be fine for me and it would emphasize the power of word-to-word relations as you said. If I was to manually design a grammar for some actual systems I would find it very tiring to enumerate all possible word-to-word arcs if many of them could be generalized using tags. If the grammar was to be automatically extracted then it wouldn't really matter to me.

It seems to me that having something for User 3 would be the safe solution (would it also be possible to use word-to-word relations?). However, there is one thing that might be a bit confusing: in your examples for User 1 and User 3 the arrow has different semantics depending on whether both sides are POS tags, only the left side, or only the right side. And how would I say that, for example, "the" modifies a NN? NN -> "the" would mean that NN is expanded to "the". Would we need something like "the" <- NN?

Issue 2: I think being able to define the arity would be nice although not mandatory. How would the parsing algorithm take this into account/would it be much work to take it into account? Or would it be syntactic sugar and the rules are just gonna be expanded into multiple rules in a preprocessing step?

Issue 3: See Jason's comment above.

Issue 4: I think that if it's not strictly necessary and doesn't make life easier for the grammar developer (and it looks that way), then it shouldn't be done.; 25 May 2008 at 09:55
Jason said...: The statistical dependency parsers all use both lexical items and POS tags in their (implicit) rules. Perhaps something like:

likes:V -> John:NP * apples:N

I put in the * to indicate the linear position of the head (I recall seeing this somewhere). Look at Joakim Nivre's overview on DG:

http://www.vxu.se/msi/~nivre/papers/05133.pdf

(Can't check now -- somehow the site seems to be down...)

I wouldn't worry about supporting anything other than toy grammar development, so it's okay if there is lots of redundancy and such in the rules. So, I think it would be fine just to do the word to word based rules. Supporting word:tag format as suggested above would allow the grammar writer to make sure you get the "book:V" and not the "book:N" in a sentence like "I saw John book the flight". The statistical parser will support word:tag, so it might make sense to build this in from the start, allowing the tag to be left unspecified in rules.

Jason B; 6 June 2008 at 05:55
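
The word:tag notation suggested in the last comment could be read along the following lines. This is only a sketch under assumptions: the `*` marks the head's linear position among its modifiers, a missing `:tag` means the tag is unspecified, and `Item`, `parse_item`, `parse_rule`, and `matches` are hypothetical names rather than any existing NLTK API.

```python
from typing import List, NamedTuple, Optional, Tuple


class Item(NamedTuple):
    word: str
    tag: Optional[str]  # None when the rule leaves the tag unspecified


def parse_item(text: str) -> Item:
    """Split 'word:tag' into its parts; the tag is optional."""
    word, sep, tag = text.strip().partition(":")
    return Item(word, tag if sep else None)


def parse_rule(rule: str) -> Tuple[Item, List[Item], List[Item]]:
    """Return (head, left modifiers, right modifiers) for one rule."""
    head_text, rhs = rule.split("->", 1)
    left_text, star, right_text = rhs.partition("*")
    if not star:
        raise ValueError("rule needs a * marking the head position")
    left = [parse_item(tok) for tok in left_text.split()]
    right = [parse_item(tok) for tok in right_text.split()]
    return parse_item(head_text), left, right


def matches(item: Item, word: str, tag: str) -> bool:
    """An untagged item matches any tag; a tagged one must agree."""
    return item.word == word and (item.tag is None or item.tag == tag)


head, left, right = parse_rule("likes:V -> John:NP * apples:N")
```

With this reading, a rule item written as `book:V` would accept the verb but reject `book` tagged as a noun, while a bare `book` would accept either.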

GSoC 08 - NLTK Worklog
