ACL Wiki - User contributions [en]

Hypergraph Format

2010-11-11T22:54:47Z

David Chiang: /* Proposed extensions (yea or nay?) / Open questions */

= Overall goal =

Make it easy to share packed representations across NLP applications.
Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages.
A memory efficient and fast representation is also useful.

= Serialization library options =

== JSON ==

[http://www.json.org/ JSON Description]

Pro:
* Implementations in every language (often packaged with language).
* Human readable
* Already used in CDec for forest output

Con:
* Space inefficient
* Requires custom parser for speed
* Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
* Some languages (e.g., Python) do not natively support event-driven parsers for JSON, meaning it's hard to process JSON files without first loading the entire thing. Since parse forests can be '''big''' in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() command takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry this won't be usable for real applications.
** For Python, yajl + ijson (http://pypi.python.org/pypi/ijson/) or yajl-py (http://pykler.github.com/yajl-py/) might address these concerns.

Proposed schema:

A Forest object has the following required fields:
* '''nodes''': a list of Node objects
* '''edges''': a list of Edge objects
* '''root''': a node id, which is an integer index into the '''nodes''' list

An Edge object has the following required fields:
* '''head''': a node id
* '''tails''': a (possibly empty) list of node ids

A Node or Edge object has the following optional fields:
* '''label''': string
* '''features''': a FeatureVector object
and any other application-specific fields.

A FeatureVector object has arbitrary fields with float values.

Example: http://www.isi.edu/~chiang/software/forest/example
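Since JSON has no schema, a reader has to check the required fields itself. The following is a minimal sketch of such a check in Python, assuming the forest has already been loaded into plain dicts and lists; the function name check_forest and the wording of the messages are made up for illustration and are not part of the proposal. It only checks the required fields, not labels or features.

```python
def check_forest(forest):
    """Return a list of problems; an empty list means the required fields look fine."""
    problems = []
    for field in ("nodes", "edges", "root"):
        if field not in forest:
            problems.append("missing required field: " + field)
    if problems:
        return problems

    n = len(forest["nodes"])

    def is_node_id(x):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(x, int) and not isinstance(x, bool) and 0 <= x < n

    if not is_node_id(forest["root"]):
        problems.append("root is not a valid node id")
    for i, edge in enumerate(forest["edges"]):
        if not is_node_id(edge.get("head")):
            problems.append("edge %d: bad head" % i)
        tails = edge.get("tails")
        if not isinstance(tails, list):
            problems.append("edge %d: tails missing or not a list" % i)
        elif not all(is_node_id(t) for t in tails):
            problems.append("edge %d: bad tail id" % i)
    return problems


good = {"nodes": [{"label": "S"}, {"label": "a"}],
        "edges": [{"head": 0, "tails": [1]}],
        "root": 0}
bad = {"nodes": [{}], "edges": [{"head": 3, "tails": []}], "root": 0}
print(check_forest(good))
print(check_forest(bad))
```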

=== Proposed extensions (yea or nay?) / Open questions ===

* (ChrisD) ''question'': some libraries don't represent node-level features internally (at least Joshua & cdec), so these would need to denormalize node-level features to either all incoming or outgoing edges of the node in question. This may not be completely straightforward to do. Should we possibly consider just edge-level features?
** David: Since edges are not shared, it should always be easy to propagate node features down to its edges (i.e., to the edges which it is the head of). I would favor, however, eliminating node features.
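A rough sketch of the propagation David describes, in Python. It assumes that feature values combine by addition (as in a log-linear model), which the proposal does not state; the function name is hypothetical.

```python
def push_node_features_to_edges(forest):
    """Move each node's features onto the edges it heads, adding values per feature.

    Assumption: features are additive. Modifies the forest in place.
    """
    nodes = forest["nodes"]
    for edge in forest["edges"]:
        node_features = nodes[edge["head"]].get("features", {})
        if node_features:
            edge_features = edge.setdefault("features", {})
            for name, value in node_features.items():
                edge_features[name] = edge_features.get(name, 0.0) + value
    # Drop the node-level features now that the edges carry them.
    for node in nodes:
        node.pop("features", None)
    return forest


forest = {"nodes": [{"label": "S", "features": {"lm": 1.5}}, {"label": "a"}],
          "edges": [{"head": 0, "tails": [1], "features": {"lm": 0.5, "tm": 2.0}},
                    {"head": 0, "tails": []}],
          "root": 0}
push_node_features_to_edges(forest)
```

Note that a node which is not the head of any edge would simply lose its features here, so a real converter would have to decide what to do in that case.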

*When a hypergraph represents a set of trees, the Node.'''label'''s will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.'''tails''', a string '''"'''''a'''''"''' would be shorthand for '''{label: "'''''a'''''"}'''
** David: yea
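If the shorthand were adopted, a reader could expand it back into ordinary nodes on load. A possible sketch in Python follows; it assumes every string occurrence gets its own fresh node (sharing nodes between identical labels would be another option), and the function name is made up.

```python
def expand_tail_shorthand(forest):
    """Replace each string in an edge's tails with the id of a new node with that label."""
    nodes = forest["nodes"]
    for edge in forest["edges"]:
        expanded = []
        for tail in edge["tails"]:
            if isinstance(tail, str):
                # "a" is shorthand for {label: "a"}: append the node, refer to it by index.
                nodes.append({"label": tail})
                expanded.append(len(nodes) - 1)
            else:
                expanded.append(tail)
        edge["tails"] = expanded
    return forest


forest = {"nodes": [{"label": "NP"}],
          "edges": [{"head": 0, "tails": ["the", "dog"]}],
          "root": 0}
expand_tail_shorthand(forest)
```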

*In Node.'''label''', a value of '''null''' means that the tree node is labeled with epsilon, the empty string. This is not the same as '''""''': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
** David: nay, this should be left to the application. An empty Edge.'''tails''' list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (''t'', PRO, pro, etc.).
** ChrisD: nay. agree with David.

*When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued '''id''' field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a '''nodealiases''' field which is an object mapping from symbolic names to numeric ids.
** ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.
** David: on the fence about this one
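One way the '''nodealiases''' variant could work in practice: edges keep numeric ids, and the aliases are used only when displaying productions. This is just a sketch in Python under that assumption; format_productions is a hypothetical helper, and nodes without an alias fall back to their numeric id.

```python
def format_productions(forest):
    """Render each edge as 'head -> tails', using symbolic names where an alias exists."""
    # nodealiases maps symbolic name -> numeric id; invert it for display.
    names = {node_id: name for name, node_id in forest.get("nodealiases", {}).items()}

    def show(node_id):
        return names.get(node_id, str(node_id))

    return [show(edge["head"]) + " -> " + " ".join(show(t) for t in edge["tails"])
            for edge in forest["edges"]]


forest = {"nodes": [{}, {}, {}],
          "edges": [{"head": 0, "tails": [1, 2]}],
          "root": 0,
          "nodealiases": {"S": 0, "NP": 1}}
print(format_productions(forest))
```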

*Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want to have an optional alt_tails? or a vector of tails (for multiple languages?)?
** David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
<pre>
{ head: 123, tails: [456, 789],
french: ["le", 456, "que", 789],
english: ["the", 456, "that", 789],
chinese: [789, "de", 456] }
</pre>
I don't think the standard needs to specify exactly how this is done.

*What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
** David: yes, I don't think the format should care
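Even if the format does not care, an application that does can find the unreachable elements itself. A small sketch in Python, assuming the forest is held as plain dicts and lists (reachable_nodes is an illustrative name, not part of the proposal):

```python
def reachable_nodes(forest):
    """Return the set of node ids reachable from the root by following edges head -> tails."""
    incoming = {}
    for edge in forest["edges"]:
        incoming.setdefault(edge["head"], []).append(edge)

    seen = {forest["root"]}
    stack = [forest["root"]]
    while stack:
        node_id = stack.pop()
        for edge in incoming.get(node_id, []):
            for tail in edge["tails"]:
                if tail not in seen:
                    seen.add(tail)
                    stack.append(tail)
    return seen


forest = {"nodes": [{}, {}, {}, {}],
          "edges": [{"head": 0, "tails": [1, 2]},
                    {"head": 3, "tails": [1]}],  # node 3 is never reached from the root
          "root": 0}
print(reachable_nodes(forest))
```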

* Should Edge have an optional '''weight''' field? '''logweight'''?
** David: yea, I think it should be called '''weight''' and the weight of the forest is the sum-product.
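To make "sum-product" concrete: the weight of a node is the sum, over the edges it heads, of the edge weight times the weights of the edge's tails. A sketch in Python follows. It makes several assumptions that the proposal does not settle: the forest is acyclic, a missing '''weight''' counts as 1.0, and a node that heads no edge counts as a leaf with weight 1.0. The function name is made up.

```python
def inside_weight(forest):
    """Sum over all derivations of the product of edge weights (acyclic forests only)."""
    incoming = {}
    for edge in forest["edges"]:
        incoming.setdefault(edge["head"], []).append(edge)

    memo = {}

    def inside(node_id):
        if node_id in memo:
            return memo[node_id]
        edges = incoming.get(node_id)
        if not edges:
            total = 1.0  # assumption: a node heading no edge is a leaf
        else:
            total = 0.0
            for edge in edges:
                product = edge.get("weight", 1.0)  # assumption: default weight 1.0
                for tail in edge["tails"]:
                    product *= inside(tail)
                total += product
        memo[node_id] = total
        return total

    return inside(forest["root"])


forest = {"nodes": [{}, {}, {}],
          "edges": [{"head": 0, "tails": [1, 2], "weight": 0.5},
                    {"head": 0, "tails": [1], "weight": 0.25},
                    {"head": 1, "tails": [], "weight": 0.5}],
          "root": 0}
print(inside_weight(forest))
```

The recursion is kept for brevity; a very deep forest would need an explicit topological order instead.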

* Should an Edge with empty '''tails''' be allowed? If so, should the following two forests be considered equivalent:
<pre>
{ nodes: [ { label: "a" } ],
edges: [ ] }
{ nodes : [ { label: "a" } ],
edges: [ { head: 0, tails: [ ] } ] }
</pre>
** David: yes, tailless edges should be allowed, otherwise it's not nice to represent the set of trees { (a b) , (a (b c)) }. But the two example forests above should be considered equivalent since they generate the same set of trees.

* A tree can be represented as a Forest where every node has only one incoming edge, but is there any desire for a more concise representation of a tree belonging to an existing Forest? Like a list of Edge ids?

* Can/should we require that '''edges''' come after '''nodes'''?

== Protocol Buffers ==

[http://code.google.com/p/protobuf/ Protocol Buffer Description]

[http://github.com/srush/hypergraph Implementation Sketch]

Pro:
* Conversion to and from JSON ([http://code.google.com/p/protobuf-json/ protobuf-json])
* Very fast to read (particularly in C++ and Java, hopefully soon in Python)
* Very space efficient
* Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:
* No implementations for Perl, C#, or other languages commonly used by NLP folks
* Requires a separate library; adds an external dependency to spec
* "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on [http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html this page].
* "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

== Variation of SLF (Standard Lattice Format) ==

[http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]

Pro:
* Blindingly fast.
* Could be implemented to work lazy/streaming.

Con:
* Requires a custom format
* Probably need specialized language bindings.

== Tiburon Format ==

[http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]

== Bindings/Libraries/Software ==

Python
* (David) this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.

C++
* (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.

Software
* (David) I've written a web app for visually exploring forests that I will move over to the new format if it's JSON (currently it uses XML). Please e-mail me for the link if you want to play; it uses a lot of bandwidth.

= See also =

* [[Machine translation]]
* [[Machine translation software]]

[[Category:Machine translation]]

Hypergraph Format

2010-11-11T21:37:17Z

David Chiang: /* Proposed extensions (yea or nay?) / Open questions */

= Overall goal =

Make it easy to share packed representations across NLP applications.
Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages.
A memory efficient and fast representation is also useful.

= Serialization library options =

== JSON ==

[http://www.json.org/ JSON Description]

Pro:
* Implementations in every language (often packaged with language).
* Human readable
* Already used in CDec for forest output

Con:
* Space inefficient
* Requires custom parser for speed
* Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
* Some languages (e.g., Python) do not natively support event-driven parsers for JSON, meaning it's hard to process JSON files without first loading the entire thing. Since parse forests can be '''big''' in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() command takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry this won't be usable for real applications.
** For Python, yajl + ijson (http://pypi.python.org/pypi/ijson/) or yajl-py (http://pykler.github.com/yajl-py/) might address these concerns.

Proposed schema:

A Forest object has the following required fields:
* '''nodes''': a list of Node objects
* '''edges''': a list of Edge objects
* '''root''': a node id, which is an integer index into the '''nodes''' list

An Edge object has the following required fields:
* '''head''': a node id
* '''tails''': a (possibly empty) list of node ids

A Node or Edge object has the following optional fields:
* '''label''': string
* '''features''': a FeatureVector object
and any other application-specific fields.

A FeatureVector object has arbitrary fields with float values.

Example: http://www.isi.edu/~chiang/software/forest/example

=== Proposed extensions (yea or nay?) / Open questions ===

* (ChrisD) ''question'': some libraries don't represent node-level features internally (at least Joshua & cdec), so these would need to denormalize node-level features to either all incoming or outgoing edges of the node in question. This may not be completely straightforward to do. Should we possibly consider just edge-level features?
** David: Since edges are not shared, it should always be easy to propagate node features down to its edges (i.e., to the edges which it is the head of). I would favor, however, eliminating node features.

*When a hypergraph represents a set of trees, the Node.'''label'''s will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.'''tails''', a string '''"'''''a'''''"''' would be shorthand for '''{label: "'''''a'''''"}'''
** David: yea

*In Node.'''label''', a value of '''null''' means that the tree node is labeled with epsilon, the empty string. This is not the same as '''""''': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
** David: nay, this should be left to the application. An empty Edge.'''tails''' list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (''t'', PRO, pro, etc.).
** ChrisD: nay. agree with David.

*When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued '''id''' field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a '''nodealiases''' field which is an object mapping from symbolic names to numeric ids.
** ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.
** David: on the fence about this one

*Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want to have an optional alt_tails? or a vector of tails (for multiple languages?)?
** David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
<pre>
{ head: 123, tails: [456, 789],
french: ["le", 456, "que", 789],
english: ["the", 456, "that", 789],
chinese: [789, "de", 456] }
</pre>
I don't think the standard needs to specify exactly how this is done.

*What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
** David: yes, I don't think the format should care

* Should Edge have an optional '''weight''' field? '''logweight'''?
** David: yea, I think it should be called '''weight''' and the weight of the forest is the sum-product.

* Should an Edge with empty '''tails''' be allowed? If so, should the following two forests be considered equivalent:
<pre>
{ nodes: [ { label: "a" } ],
edges: [ ] }
{ nodes : [ { label: "a" } ],
edges: [ { head: 0, tails: [ ] } ] }
</pre>
** David: yes, tailless edges should be allowed, otherwise it's not nice to represent the set of trees { (a b) , (a (b c)) }. But the two example forests above should be considered equivalent since they generate the same set of trees.

* A tree can be represented as a Forest where every node has only one incoming edge, but is there any desire for a more concise representation of a tree belonging to an existing Forest? Like a list of Edge ids?

== Protocol Buffers ==

[http://code.google.com/p/protobuf/ Protocol Buffer Description]

[http://github.com/srush/hypergraph Implementation Sketch]

Pro:
* Conversion to and from JSON ([http://code.google.com/p/protobuf-json/ protobuf-json])
* Very fast to read (particularly in C++ and Java, hopefully soon in Python)
* Very space efficient
* Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:
* No implementations for Perl, C#, or other languages commonly used by NLP folks
* Requires a separate library; adds an external dependency to spec
* "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on [http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html this page].
* "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

== Variation of SLF (Standard Lattice Format) ==

[http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]

Pro:
* Blindingly fast.
* Could be implemented to work lazy/streaming.

Con:
* Requires a custom format
* Probably need specialized language bindings.

== Tiburon Format ==

[http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]

== Bindings/Libraries/Software ==

Python
* (David) this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.

C++
* (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.

Software
* (David) I've written a web app for visually exploring forests that I will move over to the new format if it's JSON (currently it uses XML). Please e-mail me for the link if you want to play; it uses a lot of bandwidth.

= See also =

* [[Machine translation]]
* [[Machine translation software]]

[[Category:Machine translation]]

Hypergraph Format

2010-11-11T21:37:03Z

David Chiang: /* JSON */

= Overall goal =

Make it easy to share packed representations across NLP applications.
Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages.
A memory efficient and fast representation is also useful.

= Serialization library options =

== JSON ==

[http://www.json.org/ JSON Description]

Pro:
* Implementations in every language (often packaged with language).
* Human readable
* Already used in CDec for forest output

Con:
* Space inefficient
* Requires custom parser for speed
* Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
* Some languages (e.g., Python) do not natively support event-driven parsers for JSON, meaning it's hard to process JSON files without first loading the entire thing. Since parse forests can be '''big''' in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() command takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry this won't be usable for real applications.
** For Python, yajl + ijson (http://pypi.python.org/pypi/ijson/) or yajl-py (http://pykler.github.com/yajl-py/) might address these concerns.

Proposed schema:

A Forest object has the following required fields:
* '''nodes''': a list of Node objects
* '''edges''': a list of Edge objects
* '''root''': a node id, which is an integer index into the '''nodes''' list

An Edge object has the following required fields:
* '''head''': a node id
* '''tails''': a (possibly empty) list of node ids

A Node or Edge object has the following optional fields:
* '''label''': string
* '''features''': a FeatureVector object
and any other application-specific fields.

A FeatureVector object has arbitrary fields with float values.

Example: http://www.isi.edu/~chiang/software/forest/example

=== Proposed extensions (yea or nay?) / Open questions ===

*When a hypergraph represents a set of trees, the Node.'''label'''s will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.'''tails''', a string '''"'''''a'''''"''' would be shorthand for '''{label: "'''''a'''''"}'''
** David: yea

*In Node.'''label''', a value of '''null''' means that the tree node is labeled with epsilon, the empty string. This is not the same as '''""''': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
** David: nay, this should be left to the application. An empty Edge.'''tails''' list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (''t'', PRO, pro, etc.).
** ChrisD: nay. agree with David.

*When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued '''id''' field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a '''nodealiases''' field which is an object mapping from symbolic names to numeric ids.
** ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.
** David: on the fence about this one

*Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want to have an optional alt_tails? or a vector of tails (for multiple languages?)?
** David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
<pre>
{ head: 123, tails: [456, 789],
french: ["le", 456, "que", 789],
english: ["the", 456, "that", 789],
chinese: [789, "de", 456] }
</pre>
I don't think the standard needs to specify exactly how this is done.

*What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
** David: yes, I don't think the format should care

* Should Edge have an optional '''weight''' field? '''logweight'''?
** David: yea, I think it should be called '''weight''' and the weight of the forest is the sum-product.

* Should an Edge with empty '''tails''' be allowed? If so, should the following two forests be considered equivalent:
<pre>
{ nodes: [ { label: "a" } ],
edges: [ ] }
{ nodes : [ { label: "a" } ],
edges: [ { head: 0, tails: [ ] } ] }
</pre>
** David: yes, tailless edges should be allowed, otherwise it's not nice to represent the set of trees { (a b) , (a (b c)) }. But the two example forests above should be considered equivalent since they generate the same set of trees.

* A tree can be represented as a Forest where every node has only one incoming edge, but is there any desire for a more concise representation of a tree belonging to an existing Forest? Like a list of Edge ids?

== Protocol Buffers ==

[http://code.google.com/p/protobuf/ Protocol Buffer Description]

[http://github.com/srush/hypergraph Implementation Sketch]

Pro:
* Conversion to and from JSON ([http://code.google.com/p/protobuf-json/ protobuf-json])
* Very fast to read (particularly in C++ and Java, hopefully soon in Python)
* Very space efficient
* Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:
* No implementations for Perl, C#, or other languages commonly used by NLP folks
* Requires a separate library; adds an external dependency to spec
* "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on [http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html this page].
* "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

== Variation of SLF (Standard Lattice Format) ==

[http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]

Pro:
* Blindingly fast.
* Could be implemented to work lazy/streaming.

Con:
* Requires a custom format
* Probably need specialized language bindings.

== Tiburon Format ==

[http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]

== Bindings/Libraries/Software ==

Python
* (David) this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.

C++
* (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.

Software
* (David) I've written a web app for visually exploring forests that I will move over to the new format if it's JSON (currently it uses XML). Please e-mail me for the link if you want to play; it uses a lot of bandwidth.

= See also =

* [[Machine translation]]
* [[Machine translation software]]

[[Category:Machine translation]]

Hypergraph Format

2010-11-11T21:36:30Z

David Chiang: /* Proposed extensions (yea or nay?) / Open questions */

= Overall goal =

Make it easy to share packed representations across NLP applications.
Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages.
A memory efficient and fast representation is also useful.

= Serialization library options =

== JSON ==

[http://www.json.org/ JSON Description]

Pro:
* Implementations in every language (often packaged with language).
* Human readable
* Already used in CDec for forest output

Con:
* Space inefficient
* Requires custom parser for speed
* Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
* Some languages (e.g., Python) do not natively support event-driven parsers for JSON, meaning it's hard to process JSON files without first loading the entire thing. Since parse forests can be '''big''' in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() command takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry this won't be usable for real applications.
** For Python, yajl + ijson (http://pypi.python.org/pypi/ijson/) or yajl-py (http://pykler.github.com/yajl-py/) might address these concerns.

Proposed schema:

A Forest object has the following required fields:
* '''nodes''': a list of Node objects
* '''edges''': a list of Edge objects
* '''root''': a node id, which is an integer index into the '''nodes''' list

An Edge object has the following required fields:
* '''head''': a node id
* '''tails''': a (possibly empty) list of node ids

A Node or Edge object has the following optional fields:
* '''label''': string
* '''features''': a FeatureVector object
and any other application-specific fields.
* (ChrisD) ''question'': some libraries don't represent node-level features internally (at least Joshua & cdec), so these would need to denormalize node-level features to either all incoming or outgoing edges of the node in question. This may not be completely straightforward to do. Should we possibly consider just edge-level features?
** David: Since edges are not shared, it should always be easy to propagate node features down to its edges (i.e., to the edges which it is the head of). I would favor, however, eliminating node features.

A FeatureVector object has arbitrary fields with float values.

Example: http://www.isi.edu/~chiang/software/forest/example

=== Proposed extensions (yea or nay?) / Open questions ===

*When a hypergraph represents a set of trees, the Node.'''label'''s will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.'''tails''', a string '''"'''''a'''''"''' would be shorthand for '''{label: "'''''a'''''"}'''
** David: yea

*In Node.'''label''', a value of '''null''' means that the tree node is labeled with epsilon, the empty string. This is not the same as '''""''': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
** David: nay, this should be left to the application. An empty Edge.'''tails''' list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (''t'', PRO, pro, etc.).
** ChrisD: nay. agree with David.

*When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued '''id''' field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a '''nodealiases''' field which is an object mapping from symbolic names to numeric ids.
** ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.
** David: on the fence about this one

*Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want to have an optional alt_tails? or a vector of tails (for multiple languages?)?
** David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
<pre>
{ head: 123, tails: [456, 789],
french: ["le", 456, "que", 789],
english: ["the", 456, "that", 789],
chinese: [789, "de", 456] }
</pre>
I don't think the standard needs to specify exactly how this is done.

*What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
** David: yes, I don't think the format should care

* Should Edge have an optional '''weight''' field? '''logweight'''?
** David: yea, I think it should be called '''weight''' and the weight of the forest is the sum-product.

* Should an Edge with empty '''tails''' be allowed? If so, should the following two forests be considered equivalent:
<pre>
{ nodes: [ { label: "a" } ],
edges: [ ] }
{ nodes : [ { label: "a" } ],
edges: [ { head: 0, tails: [ ] } ] }
</pre>
** David: yes, tailless edges should be allowed, otherwise it's not nice to represent the set of trees { (a b) , (a (b c)) }. But the two example forests above should be considered equivalent since they generate the same set of trees.

* A tree can be represented as a Forest where every node has only one incoming edge, but is there any desire for a more concise representation of a tree belonging to an existing Forest? Like a list of Edge ids?

== Protocol Buffers ==

[http://code.google.com/p/protobuf/ Protocol Buffer Description]

[http://github.com/srush/hypergraph Implementation Sketch]

Pro:
* Conversion to and from JSON ([http://code.google.com/p/protobuf-json/ protobuf-json])
* Very fast to read (particularly in C++ and Java, hopefully soon in Python)
* Very space efficient
* Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:
* No implementations for Perl, C#, or other languages commonly used by NLP folks
* Requires a separate library; adds an external dependency to spec
* "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on [http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html this page].
* "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

== Variation of SLF (Standard Lattice Format) ==

[http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]

Pro:
* Blindingly fast.
* Could be implemented to work lazy/streaming.

Con:
* Requires a custom format
* Probably need specialized language bindings.

== Tiburon Format ==

[http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]

== Bindings/Libraries/Software ==

Python
* (David) this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.

C++
* (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.

Software
* (David) I've written a web app for visually exploring forests that I will move over to the new format if it's JSON (currently it uses XML). Please e-mail me for the link if you want to play; it uses a lot of bandwidth.

= See also =

* [[Machine translation]]
* [[Machine translation software]]

[[Category:Machine translation]]

Hypergraph Format

2010-11-11T21:33:15Z

David Chiang: /* Proposed extensions (yea or nay?) / Open questions */

= Overall goal =

Make it easy to share packed representations across NLP applications.
Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages.
A memory efficient and fast representation is also useful.

= Serialization library options =

== JSON ==

[http://www.json.org/ JSON Description]

Pro:
* Implementations in every language (often packaged with language).
* Human readable
* Already used in CDec for forest output

Con:
* Space inefficient
* Requires custom parser for speed
* Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
* Some languages (e.g., Python) do not natively support event-driven parsers for JSON, meaning it's hard to process JSON files without first loading the entire thing. Since parse forests can be '''big''' in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() command takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry this won't be usable for real applications.
** For Python, yajl + ijson (http://pypi.python.org/pypi/ijson/) or yajl-py (http://pykler.github.com/yajl-py/) might address these concerns.

Proposed schema:

A Forest object has the following required fields:
* '''nodes''': a list of Node objects
* '''edges''': a list of Edge objects
* '''root''': a node id, which is an integer index into the '''nodes''' list

An Edge object has the following required fields:
* '''head''': a node id
* '''tails''': a (possibly empty) list of node ids

A Node or Edge object has the following optional fields:
* '''label''': string
* '''features''': a FeatureVector object
and any other application-specific fields.
* (ChrisD) ''question'': some libraries don't represent node-level features internally (at least Joshua & cdec), so these would need to denormalize node-level features to either all incoming or outgoing edges of the node in question. This may not be completely straightforward to do. Should we possibly consider just edge-level features?
** David: Since edges are not shared, it should always be easy to propagate node features down to its edges (i.e., to the edges which it is the head of). I would favor, however, eliminating node features.
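
A minimal Python sketch of the propagation David describes, for libraries that only keep edge-level features. The function name is hypothetical, and it assumes that features combine additively (as in a linear model); neither is part of the proposal. Note that a node which is the head of no edge would simply lose its features here.

```python
def push_node_features_to_edges(forest):
    """Return a copy of the forest in which every node's features have been
    added to the features of each edge that the node is the head of."""
    nodes = [dict(node) for node in forest["nodes"]]
    edges = []
    for edge in forest["edges"]:
        new_edge = dict(edge)
        merged = dict(edge.get("features", {}))
        head_features = nodes[edge["head"]].get("features", {})
        for name, value in head_features.items():
            # Assumption: feature values are summed when both levels define them.
            merged[name] = merged.get(name, 0.0) + value
        if merged:
            new_edge["features"] = merged
        edges.append(new_edge)
    # Drop node-level features only after all edges have been processed.
    for node in nodes:
        node.pop("features", None)
    result = dict(forest)
    result["nodes"] = nodes
    result["edges"] = edges
    return result
```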

A FeatureVector object has arbitrary fields with float values.

Example: http://www.isi.edu/~chiang/software/forest/example
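
Since JSON has no schema, each reader needs some well-formedness check of its own. The following Python sketch checks only the required fields listed above plus the float-valued features; the function name and the error messages are made up for illustration, and it assumes the whole forest has already been loaded into memory.

```python
import json

def check_forest(forest):
    """Return a list of problems; an empty list means no problem was found."""
    missing = [f for f in ("nodes", "edges", "root") if f not in forest]
    if missing:
        return ["missing required field: " + f for f in missing]
    problems = []
    num_nodes = len(forest["nodes"])

    def is_node_id(x):
        # bool is a subclass of int in Python, so rule it out explicitly.
        return isinstance(x, int) and not isinstance(x, bool) and 0 <= x < num_nodes

    if not is_node_id(forest["root"]):
        problems.append("root is not a valid node id")
    for i, edge in enumerate(forest["edges"]):
        if not is_node_id(edge.get("head")):
            problems.append("edge %d: head is not a valid node id" % i)
        tails = edge.get("tails")
        if not isinstance(tails, list):
            problems.append("edge %d: tails is missing or not a list" % i)
        elif not all(is_node_id(t) for t in tails):
            problems.append("edge %d: tails contains an invalid node id" % i)
    for kind in ("nodes", "edges"):
        for i, obj in enumerate(forest[kind]):
            for name, value in obj.get("features", {}).items():
                if isinstance(value, bool) or not isinstance(value, (int, float)):
                    problems.append("%s %d: feature %s is not a number" % (kind, i, name))
    return problems

text = '{"nodes": [{"label": "S"}, {"label": "a"}], "edges": [{"head": 0, "tails": [1]}], "root": 0}'
print(check_forest(json.loads(text)))
```

Integer feature values are accepted because a JSON parser may return 1 rather than 1.0.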

=== Proposed extensions (yea or nay?) / Open questions ===

*When a hypergraph represents a set of trees, the Node.'''label'''s will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.'''tails''', a string '''"'''''a'''''"''' would be shorthand for '''{label: "'''''a'''''"}'''
** David: yea
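
A reader that does not want to deal with the shorthand internally could expand it on load. This Python sketch is one possible reading of the proposal: every string in '''tails''' becomes a fresh Node (terminal nodes are not shared between edges, which is an assumption), and the function name is hypothetical.

```python
def expand_terminal_shorthand(forest):
    """Replace every string in Edge.tails by the id of a new node {label: string}."""
    nodes = list(forest["nodes"])
    edges = []
    for edge in forest["edges"]:
        new_tails = []
        for tail in edge["tails"]:
            if isinstance(tail, str):
                # Assumption: each occurrence gets its own node.
                nodes.append({"label": tail})
                new_tails.append(len(nodes) - 1)
            else:
                new_tails.append(tail)
        new_edge = dict(edge)
        new_edge["tails"] = new_tails
        edges.append(new_edge)
    result = dict(forest)
    result["nodes"] = nodes
    result["edges"] = edges
    return result
```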

*In Node.'''label''', a value of '''null''' means that the tree node is labeled with epsilon, the empty string. This is not the same as '''""''': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
** David: nay, this should be left to the application. An empty Edge.'''tails''' list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (''t'', PRO, pro, etc.).
** ChrisD: nay. agree with David.

*When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued '''id''' field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a '''nodealiases''' field which is an object mapping from symbolic names to numeric ids.
** ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.
** David: on the fence about this one

*Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want to have an optional alt_tails? or a vector of tails (for multiple languages?)?
** David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
<pre>
{ head: 123, tails: [456, 789],
french: ["le", 456, "que", 789],
english: ["the", 456, "that", 789],
chinese: [789, "de", 456] }
</pre>
I don't think the standard needs to specify exactly how this is done.

*What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
** David: yes, I don't think the format should care

* Should Edge have an optional '''weight''' field? '''logweight'''?
** David: yea, I think it should be called '''weight''' and the weight of the forest is the sum-product.
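
To make "sum-product" concrete: the weight of a tree is the product of its edge weights, and the weight of the forest is the sum over all trees rooted at '''root'''. The Python sketch below computes this with the usual inside recursion. It assumes an acyclic forest, treats a missing '''weight''' as 1.0, and gives weight 1.0 to a node with no incoming edges; all three choices are assumptions, not part of the proposal.

```python
def forest_weight(forest):
    """Sum over all trees under the root of the product of their edge weights."""
    incoming = {}
    for edge in forest["edges"]:
        incoming.setdefault(edge["head"], []).append(edge)
    inside = {}
    in_progress = set()

    def visit(node):
        if node in inside:
            return inside[node]
        if node in in_progress:
            raise ValueError("cycle through node %d" % node)
        if node not in incoming:
            inside[node] = 1.0  # assumption: a node heading no edge is a leaf
            return 1.0
        in_progress.add(node)
        total = 0.0
        for edge in incoming[node]:
            weight = edge.get("weight", 1.0)
            for tail in edge["tails"]:
                weight *= visit(tail)  # product over the tails
            total += weight            # sum over alternative edges
        in_progress.discard(node)
        inside[node] = total
        return total

    return visit(forest["root"])
```

The recursion is only meant for small examples; a large forest would need an explicit topological order, and a cyclic one a fixed-point computation.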

* Should an Edge with empty '''tails''' be allowed? If so, should the following two forests be considered equivalent:
<pre>
{ nodes: [ { label: "a" } ],
edges: [ ] }
{ nodes : [ { label: "a" } ],
edges: [ { head: 0, tails: [ ] } ] }
</pre>
** David: yes, tailless edges should be allowed, otherwise it's not nice to represent the set of trees { (a b) , (a (b c)) }. But the two example forests above should be considered equivalent since they generate the same set of trees.
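
The equivalence can be checked mechanically by enumerating the trees of each forest. The Python sketch below represents a tree as a nested tuple (label, child, ...) and treats a node with no incoming edges as a leaf. It assumes an acyclic forest and, because the two examples above omit '''root''', takes node 0 as the root by default; both are assumptions made for illustration.

```python
from itertools import product

def trees(forest, root=0):
    """Return the set of trees generated from the given root node."""
    incoming = {}
    for edge in forest["edges"]:
        incoming.setdefault(edge["head"], []).append(edge)

    def expand(node):
        label = forest["nodes"][node].get("label")
        if node not in incoming:
            return {(label,)}  # no incoming edge: a bare leaf
        result = set()
        for edge in incoming[node]:
            child_sets = [expand(tail) for tail in edge["tails"]]
            # A tailless edge yields exactly one combination, the empty one,
            # so it also produces the bare leaf (label,).
            for children in product(*child_sets):
                result.add((label,) + children)
        return result

    return expand(root)

without_edge = {"nodes": [{"label": "a"}], "edges": []}
with_edge = {"nodes": [{"label": "a"}], "edges": [{"head": 0, "tails": []}]}
print(trees(without_edge) == trees(with_edge))
```

Under the same reading, the set { (a b) , (a (b c)) } needs only three nodes a, b, c and the edges a → [b], b → [], b → [c].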

== Protocol Buffers ==

[http://code.google.com/p/protobuf/ Protocol Buffer Description]

[http://github.com/srush/hypergraph Implementation Sketch]

Pro:
* Conversion to and from JSON ([http://code.google.com/p/protobuf-json/ protobuf-json])
* Very fast to read (particularly in C++ and Java, hopefully soon in Python)
* Very space efficient
* Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:
* No implementations for Perl, C#, or other languages commonly used by NLP folks
* Requires a separate library; adds an external dependency to spec
* "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on [http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html this page].
* "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

== Variation of SLF (Standard Lattice Format) ==

[http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]

Pro:
* Blindingly fast.
* Could be implemented to work lazy/streaming.

Con:
* Requires a custom format
* Probably need specialized language bindings.

== Tiburon Format ==

[http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]

== Bindings/Libraries/Software ==

Python
* (David) this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.

C++
* (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.

Software
* (David) I've written a web app for visually exploring forests that I will move over to the new format if it's JSON (currently it uses XML). Please e-mail me for the link if you want to play; it uses a lot of bandwidth.

= See also =

* [[Machine translation]]
* [[Machine translation software]]

[[Category:Machine translation]]

Hypergraph Format

2010-11-09T21:53:14Z

David Chiang: /* JSON */

= Overall goal =

Make it easy to share packed representations across NLP applications.
Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages.
A memory efficient and fast representation is also useful.

= Serialization library options =

== JSON ==

[http://www.json.org/ JSON Description]

Pro:
* Implementations in every language (often packaged with language).
* Human readable
* Already used in CDec for forest output

Con:
* Space inefficient
* Requires custom parser for speed
* Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
* Some languages (e.g., Python) do not natively support event-driven parsers for JSON, meaning it's hard to process JSON files without first loading the entire thing. Since parse forests can be '''big''' in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() command takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry this won't be usable for real applications.
** For Python, yajl + ijson (http://pypi.python.org/pypi/ijson/) or yajl-py (http://pykler.github.com/yajl-py/) might address these concerns.

Proposed schema:

A Forest object has the following required fields:
* '''nodes''': a list of Node objects
* '''edges''': a list of Edge objects
* '''root''': a node id, which is an integer index into the '''nodes''' list

An Edge object has the following required fields:
* '''head''': a node id
* '''tails''': a (possibly empty) list of node ids

A Node or Edge object has the following optional fields:
* '''label''': string
* '''features''': a FeatureVector object
and any other application-specific fields.
* (ChrisD) ''question'': some libraries don't represent node-level features internally (at least Joshua & cdec), so these would need to denormalize node-level features to either all incoming or outgoing edges of the node in question. This may not be completely straightforward to do. Should we possibly consider just edge-level features?
** David: Since edges are not shared, it should always be easy to propagate node features down to its edges (i.e., to the edges which it is the head of). I would favor, however, eliminating node features.

A FeatureVector object has arbitrary fields with float values.

Example: http://www.isi.edu/~chiang/software/forest/example

=== Proposed extensions (yea or nay?) / Open questions ===

*When a hypergraph represents a set of trees, the Node.'''label'''s will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.'''tails''', a string '''"'''''a'''''"''' would be shorthand for '''{label: "'''''a'''''"}'''

*In Node.'''label''', a value of '''null''' means that the tree node is labeled with epsilon, the empty string. This is not the same as '''""''': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
** David Chiang: nay, this should be left to the application. An empty Edge.'''tails''' list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (''t'', PRO, pro, etc.).
** ChrisD: nay. agree with David.
*When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued '''id''' field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a '''nodealiases''' field which is an object mapping from symbolic names to numeric ids.
** ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.

*Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want to have an optional alt_tails? or a vector of tails (for multiple languages?)?
** David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
<pre>
{ head: 123, tails: [456, 789],
french: ["le", 456, "que", 789],
english: ["the", 456, "that", 789],
chinese: [789, "de", 456] }
</pre>
I don't think the standard needs to specify exactly how this is done.

*What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
** David: yes, I don't think the format should care

* Should Edge have an optional '''weight''' field? '''logweight'''?

== Protocol Buffers ==

[http://code.google.com/p/protobuf/ Protocol Buffer Description]

[http://github.com/srush/hypergraph Implementation Sketch]

Pro:
* Conversion to and from JSON ([http://code.google.com/p/protobuf-json/ protobuf-json])
* Very fast to read (particularly in C++ and Java, hopefully soon in Python)
* Very space efficient
* Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:
* No implementations for Perl, C#, or other languages commonly used by NLP folks
* Requires a separate library; adds an external dependency to spec
* "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on [http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html this page].
* "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

== Variation of SLF (Standard Lattice Format) ==

[http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]

Pro:
* Blindingly fast.
* Could be implemented to work lazy/streaming.

Con:
* Requires a custom format
* Probably need specialized language bindings.

== Tiburon Format ==

[http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]

== Bindings/Libraries/Software ==

Python
* (David) this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.

C++
* (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.

Software
* (David) I've written a web app for visually exploring forests that I will move over to the new format if it's JSON (currently it uses XML). Please e-mail me for the link if you want to play; it uses a lot of bandwidth.

= See also =

* [[Machine translation]]
* [[Machine translation software]]

[[Category:Machine translation]]

Hypergraph Format

2010-11-09T21:52:32Z

David Chiang: /* JSON */

= Overall goal =

Make it easy to share packed representations across NLP applications.
Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages.
A memory efficient and fast representation is also useful.

= Serialization library options =

== JSON ==

[http://www.json.org/ JSON Description]

Pro:
* Implementations in every language (often packaged with language).
* Human readable
* Already used in CDec for forest output

Con:
* Space inefficient
* Requires custom parser for speed
* Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
* Some languages (e.g., Python) do not natively support event-driven parsers for JSON, meaning it's hard to process JSON files without first loading the entire thing. Since parse forests can be '''big''' in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() command takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry this won't be usable for real applications.
** For Python, yajl + ([http://pypi.python.org/pypi/ijson/ ijson] or [http://pykler.github.com/yajl-py/ yajl-py]) might address these concerns.

Proposed schema:

A Forest object has the following required fields:
* '''nodes''': a list of Node objects
* '''edges''': a list of Edge objects
* '''root''': a node id, which is an integer index into the '''nodes''' list

An Edge object has the following required fields:
* '''head''': a node id
* '''tails''': a (possibly empty) list of node ids

A Node or Edge object has the following optional fields:
* '''label''': string
* '''features''': a FeatureVector object
and any other application-specific fields.
* (ChrisD) ''question'': some libraries don't represent node-level features internally (at least Joshua & cdec), so these would need to denormalize node-level features to either all incoming or outgoing edges of the node in question. This may not be completely straightforward to do. Should we possibly consider just edge-level features?
** David: Since edges are not shared, it should always be easy to propagate node features down to its edges (i.e., to the edges which it is the head of). I would favor, however, eliminating node features.

A FeatureVector object has arbitrary fields with float values.

Example: http://www.isi.edu/~chiang/software/forest/example

=== Proposed extensions (yea or nay?) / Open questions ===

*When a hypergraph represents a set of trees, the Node.'''label'''s will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.'''tails''', a string '''"'''''a'''''"''' would be shorthand for '''{label: "'''''a'''''"}'''

*In Node.'''label''', a value of '''null''' means that the tree node is labeled with epsilon, the empty string. This is not the same as '''""''': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
** David Chiang: nay, this should be left to the application. An empty Edge.'''tails''' list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (''t'', PRO, pro, etc.).
** ChrisD: nay. agree with David.
*When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued '''id''' field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a '''nodealiases''' field which is an object mapping from symbolic names to numeric ids.
** ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.

*Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want to have an optional alt_tails? or a vector of tails (for multiple languages?)?
** David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
<pre>
{ head: 123, tails: [456, 789],
french: ["le", 456, "que", 789],
english: ["the", 456, "that", 789],
chinese: [789, "de", 456] }
</pre>
I don't think the standard needs to specify exactly how this is done.

*What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
** David: yes, I don't think the format should care

* Should Edge have an optional '''weight''' field? '''logweight'''?

== Protocol Buffers ==

[http://code.google.com/p/protobuf/ Protocol Buffer Description]

[http://github.com/srush/hypergraph Implementation Sketch]

Pro:
* Conversion to and from JSON ([http://code.google.com/p/protobuf-json/ protobuf-json])
* Very fast to read (particularly in C++ and Java, hopefully soon in Python)
* Very space efficient
* Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:
* No implementations for Perl, C#, or other languages commonly used by NLP folks
* Requires a separate library; adds an external dependency to spec
* "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on [http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html this page].
* "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

== Variation of SLF (Standard Lattice Format) ==

[http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]

Pro:
* Blindingly fast.
* Could be implemented to work lazy/streaming.

Con:
* Requires a custom format
* Probably need specialized language bindings.

== Tiburon Format ==

[http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]

== Bindings/Libraries/Software ==

Python
* (David) this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.

C++
* (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.

Software
* (David) I've written a web app for visually exploring forests that I will move over to the new format if it's JSON (currently it uses XML). Please e-mail me for the link if you want to play; it uses a lot of bandwidth.

= See also =

* [[Machine translation]]
* [[Machine translation software]]

[[Category:Machine translation]]

Hypergraph Format

2010-11-09T21:37:45Z

David Chiang: /* Serialization library options */

= Overall goal =

Make it easy to share packed representations across NLP applications.
Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages.
A memory efficient and fast representation is also useful.

= Serialization library options =

== JSON ==

[http://www.json.org/ JSON Description]

Pro:
* Implementations in every language (often packaged with language).
* Human readable
* Already used in CDec for forest output

Con:
* Space inefficient
* Requires custom parser for speed
* Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
* Some languages (e.g., Python) do not natively support event-driven parsers for JSON, meaning it's hard to process JSON files without first loading the entire thing. Since parse forests can be '''big''' in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() command takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry this won't be usable for real applications.

Proposed schema:

A Forest object has the following required fields:
* '''nodes''': a list of Node objects
* '''edges''': a list of Edge objects
* '''root''': a node id, which is an integer index into the '''nodes''' list

An Edge object has the following required fields:
* '''head''': a node id
* '''tails''': a (possibly empty) list of node ids

A Node or Edge object has the following optional fields:
* '''label''': string
* '''features''': a FeatureVector object
and any other application-specific fields.
* (ChrisD) ''question'': some libraries don't represent node-level features internally (at least Joshua & cdec), so these would need to denormalize node-level features to either all incoming or outgoing edges of the node in question. This may not be completely straightforward to do. Should we possibly consider just edge-level features?
** David: Since edges are not shared, it should always be easy to propagate node features down to its edges (i.e., to the edges which it is the head of). I would favor, however, eliminating node features.

A FeatureVector object has arbitrary fields with float values.

Example: http://www.isi.edu/~chiang/software/forest/example

=== Proposed extensions (yea or nay?) / Open questions ===

*When a hypergraph represents a set of trees, the Node.'''label'''s will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.'''tails''', a string '''"'''''a'''''"''' would be shorthand for '''{label: "'''''a'''''"}'''

*In Node.'''label''', a value of '''null''' means that the tree node is labeled with epsilon, the empty string. This is not the same as '''""''': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
** David Chiang: nay, this should be left to the application. An empty Edge.'''tails''' list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (''t'', PRO, pro, etc.).
** ChrisD: nay. agree with David.
*When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued '''id''' field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a '''nodealiases''' field which is an object mapping from symbolic names to numeric ids.
** ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.

*Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want to have an optional alt_tails? or a vector of tails (for multiple languages?)?
** David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
<pre>
{ head: 123, tails: [456, 789],
french: ["le", 456, "que", 789],
english: ["the", 456, "that", 789],
chinese: [789, "de", 456] }
</pre>
I don't think the standard needs to specify exactly how this is done.

*What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
** David: yes, I don't think the format should care

* Should Edge have an optional '''weight''' field? '''logweight'''?

== Protocol Buffers ==

[http://code.google.com/p/protobuf/ Protocol Buffer Description]

[http://github.com/srush/hypergraph Implementation Sketch]

Pro:
* Conversion to and from JSON ([http://code.google.com/p/protobuf-json/ protobuf-json])
* Very fast to read (particularly in C++ and Java, hopefully soon in Python)
* Very space efficient
* Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:
* No implementations for Perl, C#, or other languages commonly used by NLP folks
* Requires a separate library; adds an external dependency to spec
* "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on [http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html this page].
* "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

== Variation of SLF (Standard Lattice Format) ==

[http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]

Pro:
* Blindingly fast.
* Could be implemented to work lazy/streaming.

Con:
* Requires a custom format
* Probably need specialized language bindings.

== Tiburon Format ==

[http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]

== Bindings/Libraries/Software ==

Python
* (David) this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.

C++
* (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.

Software
* (David) I've written a web app for visually exploring forests that I will move over to the new format if it's JSON (currently it uses XML). Please e-mail me for the link if you want to play; it uses a lot of bandwidth.

= See also =

* [[Machine translation]]
* [[Machine translation software]]

[[Category:Machine translation]]

Hypergraph Format

2010-11-09T21:36:13Z

David Chiang: /* Serialization library options */

= Overall goal =

Make it easy to share packed representations across NLP applications.
Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages.
A memory efficient and fast representation is also useful.

= Serialization library options =

* Interesting study under Python: http://metaoptimize.com/blog/2009/03/22/fast-deserialization-in-python/

== JSON ==

[http://www.json.org/ JSON Description]

Pro:
* Implementations in every language (often packaged with language).
* Human readable
* Already used in CDec for forest output

Con:
* Space inefficient
* Requires custom parser for speed
* Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
* Some languages (e.g., Python) do not natively support event-driven parsers for JSON, meaning it's hard to process JSON files without first loading the entire file. Since parse forests can be '''big''' in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() function takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry this won't be usable for real applications.

Proposed schema:

A Forest object has the following required fields:
* '''nodes''': a list of Node objects
* '''edges''': a list of Edge objects
* '''root''': a node id, which is an integer index into the '''nodes''' list

An Edge object has the following required fields:
* '''head''': a node id
* '''tails''': a (possibly empty) list of node ids

A Node or Edge object has the following optional fields:
* '''label''': string
* '''features''': a FeatureVector object
and any other application-specific fields.
* (ChrisD) ''question'': some libraries don't represent node-level features internally (at least Joshua & cdec), so these would need to denormalize node-level features to either all incoming or outgoing edges of the node in question. This may not be completely straightforward to do. Should we possibly consider just edge-level features?
** David: Since edges are not shared, it should always be easy to propagate node features down to its edges (i.e., to the edges which it is the head of). I would favor, however, eliminating node features.
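A minimal sketch of the propagation David describes, assuming a Forest has already been loaded into Python dicts and lists following the proposed schema. The function name <code>push_node_features_to_edges</code> is hypothetical, and summing a node's feature values into the edge's values is an assumption (it suits linear models); a node that heads no edge keeps its features, since there is nowhere to move them.

```python
def push_node_features_to_edges(forest):
    """Move each node's features onto the edges it heads (modifies forest in place)."""
    nodes = forest["nodes"]
    heads = set()
    for edge in forest["edges"]:
        head = edge["head"]
        heads.add(head)
        node_feats = nodes[head].get("features")
        if node_feats:
            edge_feats = edge.setdefault("features", {})
            for name, value in node_feats.items():
                # Assumption: feature values combine by addition.
                edge_feats[name] = edge_feats.get(name, 0.0) + value
    # Only strip features from nodes that head at least one edge.
    for head in heads:
        nodes[head].pop("features", None)
    return forest
```

Because each edge has exactly one head, every node feature is copied to a well-defined set of edges and no edge receives features from two nodes.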

A FeatureVector object has arbitrary fields with float values.

Example: http://www.isi.edu/~chiang/software/forest/example
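Since JSON has no schema, a reader has to check well-formedness itself. The following is a rough sketch of such a check against the required fields listed above; the function name <code>check_forest</code> and the three-node <code>EXAMPLE</code> forest are made up for illustration and are not taken from the linked example file.

```python
import json

def check_forest(forest):
    """Return a list of problems found in a Forest dict (empty list if none)."""
    problems = []
    for field in ("nodes", "edges", "root"):
        if field not in forest:
            problems.append("missing required field: " + field)
    if problems:
        return problems
    n = len(forest["nodes"])

    def is_node_id(x):
        # A node id is an integer index into the nodes list.
        return isinstance(x, int) and not isinstance(x, bool) and 0 <= x < n

    if not is_node_id(forest["root"]):
        problems.append("root is not a valid node id")
    for i, edge in enumerate(forest["edges"]):
        if "head" not in edge or not is_node_id(edge["head"]):
            problems.append("edge %d: bad or missing head" % i)
        tails = edge.get("tails")
        if not isinstance(tails, list):
            problems.append("edge %d: tails must be a list" % i)
        elif not all(is_node_id(t) for t in tails):
            problems.append("edge %d: tail is not a valid node id" % i)
    return problems

# A tiny hypothetical forest: S -> NP VP, with NP and VP derived by empty-tailed edges.
EXAMPLE = json.loads("""
{"nodes": [{"label": "S"}, {"label": "NP"}, {"label": "VP"}],
 "edges": [{"head": 0, "tails": [1, 2], "features": {"lm": -1.5}},
           {"head": 1, "tails": []},
           {"head": 2, "tails": []}],
 "root": 0}
""")
```

Calling <code>check_forest(EXAMPLE)</code> returns an empty list; optional fields such as '''label''' and '''features''' are deliberately not checked here.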

=== Proposed extensions (yea or nay?) / Open questions ===

*When a hypergraph represents a set of trees, the Node.'''label'''s will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.'''tails''', a string '''"'''''a'''''"''' would be shorthand for '''{label: "'''''a'''''"}'''
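If the shorthand were adopted, a reader could expand it back into the plain schema on load. A possible sketch follows, under the assumption that every string occurrence becomes its own fresh Node (the proposal does not say whether equal strings should share a node); the function name <code>expand_terminal_shorthand</code> is hypothetical.

```python
def expand_terminal_shorthand(forest):
    """Return a copy of forest in which string tails are replaced by new labeled nodes."""
    nodes = list(forest["nodes"])
    edges = []
    for edge in forest["edges"]:
        tails = []
        for t in edge["tails"]:
            if isinstance(t, str):
                # Shorthand: "a" stands for a node {label: "a"}.
                nodes.append({"label": t})
                tails.append(len(nodes) - 1)
            else:
                tails.append(t)
        edges.append(dict(edge, tails=tails))
    return dict(forest, nodes=nodes, edges=edges)
```

New nodes are appended at the end of the '''nodes''' list, so existing numeric ids stay valid.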

*In Node.'''label''', a value of '''null''' means that the tree node is labeled with epsilon, the empty string. This is not the same as '''""''': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
** David Chiang: nay, this should be left to the application. An empty Edge.'''tails''' list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (''t'', PRO, pro, etc.).
** ChrisD: nay. agree with David.
*When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued '''id''' field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a '''nodealiases''' field which is an object mapping from symbolic names to numeric ids.
** ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.
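One way the '''nodealiases''' option could work in practice is sketched below: symbolic names are resolved to numeric ids once at load time, so the rest of the library only ever sees integers. The function name <code>resolve_aliases</code> is hypothetical, and the sketch assumes the string shorthand for terminal labels in '''tails''' is not in use, since both proposals give a meaning to strings in that position.

```python
def resolve_aliases(forest):
    """Return a copy of forest with symbolic node names replaced by numeric ids."""
    aliases = forest.get("nodealiases", {})

    def resolve(ref):
        # Strings are looked up in nodealiases; integers pass through unchanged.
        return aliases[ref] if isinstance(ref, str) else ref

    resolved = dict(forest)
    resolved["root"] = resolve(forest["root"])
    resolved["edges"] = [
        dict(edge, head=resolve(edge["head"]),
             tails=[resolve(t) for t in edge["tails"]])
        for edge in forest["edges"]
    ]
    return resolved
```

Because '''nodealiases''' is a JSON object, its keys are unique by construction, which sidesteps the question of who guarantees unique names; an unknown name raises <code>KeyError</code>.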

*Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want to have an optional alt_tails? or a vector of tails (for multiple languages?)?
** David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
<pre>
{ "head": 123, "tails": [456, 789],
  "french": ["le", 456, "que", 789],
  "english": ["the", 456, "that", 789],
  "chinese": [789, "de", 456] }
</pre>
I don't think the standard needs to specify exactly how this is done.

*What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
** David: yes, I don't think the format should care
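Even if the format does not care, an application that does can find such elements itself. A small sketch, assuming the Forest is held as Python dicts and lists per the proposed schema and that node ids are plain integers; the function name <code>unreachable_nodes</code> is hypothetical.

```python
def unreachable_nodes(forest):
    """Return the ids of nodes that cannot be reached from the root."""
    edges_by_head = {}
    for edge in forest["edges"]:
        edges_by_head.setdefault(edge["head"], []).append(edge)
    seen = {forest["root"]}
    stack = [forest["root"]]
    while stack:
        node = stack.pop()
        # Follow every edge headed by this node down to its tails.
        for edge in edges_by_head.get(node, []):
            for tail in edge["tails"]:
                if tail not in seen:
                    seen.add(tail)
                    stack.append(tail)
    return [i for i in range(len(forest["nodes"])) if i not in seen]
```

The traversal uses an explicit stack and a visited set, so it terminates on cyclic forests as well.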

* Should Edge have an optional '''weight''' field? '''logweight'''?

== Protocol Buffers ==

[http://code.google.com/p/protobuf/ Protocol Buffer Description]

[http://github.com/srush/hypergraph Implementation Sketch]

Pro:
* Conversion to and from JSON ([http://code.google.com/p/protobuf-json/ protobuf-json])
* Very fast to read (particularly in C++ and Java, hopefully soon in Python)
* Very space efficient
* Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:
* No implementations for Perl, C#, or other languages commonly used by NLP folks
* Requires a separate library; adds an external dependency to spec
* "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on [http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html this page].
* "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

== Variation of SLF (Standard Lattice Format) ==

[http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]

Pro:
* Blindingly fast.
* Could be implemented to work lazy/streaming.

Con:
* Requires a custom format
* Probably need specialized language bindings.

== Tiburon Format ==

[http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]

== Bindings/Libraries/Software ==

Python
* (David) this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.

C++
* (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.

Software
* (David) I've written a web app for visually exploring forests that I will move over to the new format if it's JSON (currently it uses XML). Please e-mail me for the link if you want to play; it uses a lot of bandwidth.

= See also =

* [[Machine translation]]
* [[Machine translation software]]

[[Category:Machine translation]]

Hypergraph Format

2010-11-09T17:43:38Z

David Chiang: /* JSON */

= Overall goal =

Make it easy to share packed representations across NLP applications.
Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages.
A memory efficient and fast representation is also useful.

= Serialization library options =

== JSON ==

[http://www.json.org/ JSON Description]

Pro:
* Implementations in every language (often packaged with language).
* Human readable
* Already used in CDec for forest output

Con:
* Space inefficient
* Requires custom parser for speed
* Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
* Some languages (e.g., Python) do not natively support event-driven parsers for JSON, meaning it's hard to process JSON files without first loading the entire file. Since parse forests can be '''big''' in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() function takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry this won't be usable for real applications.

Proposed schema:

A Forest object has the following required fields:
* '''nodes''': a list of Node objects
* '''edges''': a list of Edge objects
* '''root''': a node id, which is an integer index into the '''nodes''' list

An Edge object has the following required fields:
* '''head''': a node id
* '''tails''': a (possibly empty) list of node ids

A Node or Edge object has the following optional fields:
* '''label''': string
* '''features''': a FeatureVector object
and any other application-specific fields.
* (ChrisD) ''question'': some libraries don't represent node-level features internally (at least Joshua & cdec), so these would need to denormalize node-level features to either all incoming or outgoing edges of the node in question. This may not be completely straightforward to do. Should we possibly consider just edge-level features?
** David: Since edges are not shared, it should always be easy to propagate node features down to its edges (i.e., to the edges which it is the head of). I would favor, however, eliminating node features.

A FeatureVector object has arbitrary fields with float values.

Example: http://www.isi.edu/~chiang/software/forest/example

=== Proposed extensions (yea or nay?) / Open questions ===

*When a hypergraph represents a set of trees, the Node.'''label'''s will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.'''tails''', a string '''"'''''a'''''"''' would be shorthand for '''{label: "'''''a'''''"}'''

*In Node.'''label''', a value of '''null''' means that the tree node is labeled with epsilon, the empty string. This is not the same as '''""''': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
** David Chiang: nay, this should be left to the application. An empty Edge.'''tails''' list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (''t'', PRO, pro, etc.).
** ChrisD: nay. agree with David.
*When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued '''id''' field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a '''nodealiases''' field which is an object mapping from symbolic names to numeric ids.
** ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.

*Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want to have an optional alt_tails? or a vector of tails (for multiple languages?)?
** David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
<pre>
{ "head": 123, "tails": [456, 789],
  "french": ["le", 456, "que", 789],
  "english": ["the", 456, "that", 789],
  "chinese": [789, "de", 456] }
</pre>
I don't think the standard needs to specify exactly how this is done.

*What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
** David: yes, I don't think the format should care

* Should Edge have an optional '''weight''' field? '''logweight'''?

== Protocol Buffers ==

[http://code.google.com/p/protobuf/ Protocol Buffer Description]

[http://github.com/srush/hypergraph Implementation Sketch]

Pro:
* Conversion to and from JSON ([http://code.google.com/p/protobuf-json/ protobuf-json])
* Very fast to read (particularly in C++ and Java, hopefully soon in Python)
* Very space efficient
* Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:
* No implementations for Perl, C#, or other languages commonly used by NLP folks
* Requires a separate library; adds an external dependency to spec
* "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on [http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html this page].
* "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

== Variation of SLF (Standard Lattice Format) ==

[http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]

Pro:
* Blindingly fast.
* Could be implemented to work lazy/streaming.

Con:
* Requires a custom format
* Probably need specialized language bindings.

== Tiburon Format ==

[http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]

== Bindings/Libraries/Software ==

Python
* (David) this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.

C++
* (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.

Software
* (David) I've written a web app for visually exploring forests that I will move over to the new format if it's JSON (currently it uses XML). Please e-mail me for the link if you want to play; it uses a lot of bandwidth.

= See also =

* [[Machine translation]]
* [[Machine translation software]]

[[Category:Machine translation]]

Hypergraph Format

2010-11-09T17:40:36Z

David Chiang: /* Bindings/Libraries */

= Overall goal =

Make it easy to share packed representations across NLP applications.
Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages.
A memory efficient and fast representation is also useful.

= Serialization library options =

== JSON ==

[http://www.json.org/ JSON Description]

Pro:
* Implementations in every language (often packaged with language).
* Human readable
* Already used in CDec for forest output

Con:
* Space inefficient
* Requires custom parser for speed
* Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
* Some languages (e.g., Python) do not natively support event-driven parsers for JSON, meaning it's hard to process JSON files without first loading the entire file. Since parse forests can be '''big''' in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() function takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry this won't be usable for real applications.

Proposed schema:

A Forest object has the following required fields:
* '''nodes''': a list of Node objects
* '''edges''': a list of Edge objects
* '''root''': a node id, which is an integer index into the '''nodes''' list

An Edge object has the following required fields:
* '''head''': a node id
* '''tails''': a (possibly empty) list of node ids

A Node or Edge object has the following optional fields:
* '''label''': string
* '''features''': a FeatureVector object
and any other application-specific fields.
* (ChrisD) ''question'': some libraries don't represent node-level features internally (at least Joshua & cdec), so these would need to denormalize node-level features to either all incoming or outgoing edges of the node in question. This may not be completely straightforward to do. Should we possibly consider just edge-level features?
** David: Since edges are not shared, it should always be easy to propagate node features down to its edges (i.e., to the edges which it is the head of). I would favor, however, eliminating node features.

A FeatureVector object has arbitrary fields with float values.

Example (40M): http://www.isi.edu/~chiang/software/forest/example.gz

=== Proposed extensions (yea or nay?) / Open questions ===

*When a hypergraph represents a set of trees, the Node.'''label'''s will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.'''tails''', a string '''"'''''a'''''"''' would be shorthand for '''{label: "'''''a'''''"}'''

*In Node.'''label''', a value of '''null''' means that the tree node is labeled with epsilon, the empty string. This is not the same as '''""''': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
** David Chiang: nay, this should be left to the application. An empty Edge.'''tails''' list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (''t'', PRO, pro, etc.).
** ChrisD: nay. agree with David.
*When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued '''id''' field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a '''nodealiases''' field which is an object mapping from symbolic names to numeric ids.
** ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.

*Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want to have an optional alt_tails? or a vector of tails (for multiple languages?)?
** David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
<pre>
{ "head": 123, "tails": [456, 789],
  "french": ["le", 456, "que", 789],
  "english": ["the", 456, "that", 789],
  "chinese": [789, "de", 456] }
</pre>
I don't think the standard needs to specify exactly how this is done.

*What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
** David: yes, I don't think the format should care

* Should Edge have an optional '''weight''' field? '''logweight'''?

== Protocol Buffers ==

[http://code.google.com/p/protobuf/ Protocol Buffer Description]

[http://github.com/srush/hypergraph Implementation Sketch]

Pro:
* Conversion to and from JSON ([http://code.google.com/p/protobuf-json/ protobuf-json])
* Very fast to read (particularly in C++ and Java, hopefully soon in Python)
* Very space efficient
* Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:
* No implementations for Perl, C#, or other languages commonly used by NLP folks
* Requires a separate library; adds an external dependency to spec
* "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on [http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html this page].
* "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

== Variation of SLF (Standard Lattice Format) ==

[http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]

Pro:
* Blindingly fast.
* Could be implemented to work lazy/streaming.

Con:
* Requires a custom format
* Probably need specialized language bindings.

== Tiburon Format ==

[http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]

== Bindings/Libraries/Software ==

Python
* (David) this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.

C++
* (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.

Software
* (David) I've written a web app for visually exploring forests that I will move over to the new format if it's JSON (currently it uses XML). Please e-mail me for the link if you want to play; it uses a lot of bandwidth.

= See also =

* [[Machine translation]]
* [[Machine translation software]]

[[Category:Machine translation]]

Hypergraph Format

2010-11-09T17:37:00Z

David Chiang: /* JSON */

= Overall goal =

Make it easy to share packed representations across NLP applications.
Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages.
A memory efficient and fast representation is also useful.

= Serialization library options =

== JSON ==

[http://www.json.org/ JSON Description]

Pro:
* Implementations in every language (often packaged with language).
* Human readable
* Already used in CDec for forest output

Con:
* Space inefficient
* Requires custom parser for speed
* Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
* Some languages (e.g., Python) do not natively support event-driven parsers for JSON, meaning it's hard to process JSON files without first loading the entire file. Since parse forests can be '''big''' in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() function takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry this won't be usable for real applications.

Proposed schema:

A Forest object has the following required fields:
* '''nodes''': a list of Node objects
* '''edges''': a list of Edge objects
* '''root''': a node id, which is an integer index into the '''nodes''' list

An Edge object has the following required fields:
* '''head''': a node id
* '''tails''': a (possibly empty) list of node ids

A Node or Edge object has the following optional fields:
* '''label''': string
* '''features''': a FeatureVector object
and any other application-specific fields.
* (ChrisD) ''question'': some libraries don't represent node-level features internally (at least Joshua & cdec), so these would need to denormalize node-level features to either all incoming or outgoing edges of the node in question. This may not be completely straightforward to do. Should we possibly consider just edge-level features?
** David: Since edges are not shared, it should always be easy to propagate node features down to its edges (i.e., to the edges which it is the head of). I would favor, however, eliminating node features.

A FeatureVector object has arbitrary fields with float values.

Example (40M): http://www.isi.edu/~chiang/software/forest/example.gz

=== Proposed extensions (yea or nay?) / Open questions ===

*When a hypergraph represents a set of trees, the Node.'''label'''s will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.'''tails''', a string '''"'''''a'''''"''' would be shorthand for '''{label: "'''''a'''''"}'''

*In Node.'''label''', a value of '''null''' means that the tree node is labeled with epsilon, the empty string. This is not the same as '''""''': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
** David Chiang: nay, this should be left to the application. An empty Edge.'''tails''' list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (''t'', PRO, pro, etc.).
** ChrisD: nay. agree with David.
*When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued '''id''' field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a '''nodealiases''' field which is an object mapping from symbolic names to numeric ids.
** ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.

*Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want to have an optional alt_tails? or a vector of tails (for multiple languages?)?
** David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
<pre>
{ "head": 123, "tails": [456, 789],
  "french": ["le", 456, "que", 789],
  "english": ["the", 456, "that", 789],
  "chinese": [789, "de", 456] }
</pre>
I don't think the standard needs to specify exactly how this is done.

*What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
** David: yes, I don't think the format should care

* Should Edge have an optional '''weight''' field? '''logweight'''?

== Protocol Buffers ==

[http://code.google.com/p/protobuf/ Protocol Buffer Description]

[http://github.com/srush/hypergraph Implementation Sketch]

Pro:
* Conversion to and from JSON ([http://code.google.com/p/protobuf-json/ protobuf-json])
* Very fast to read (particularly in C++ and Java, hopefully soon in Python)
* Very space efficient
* Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:
* No implementations for Perl, C#, or other languages commonly used by NLP folks
* Requires a separate library; adds an external dependency to spec
* "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on [http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html this page].
* "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

== Variation of SLF (Standard Lattice Format) ==

[http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]

Pro:
* Blindingly fast.
* Could be implemented to work lazy/streaming.

Con:
* Requires a custom format
* Probably need specialized language bindings.

== Tiburon Format ==

[http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]

== Bindings/Libraries ==

Python
* this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.

C++
* (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.

= See also =

* [[Machine translation]]
* [[Machine translation software]]

[[Category:Machine translation]]

Hypergraph Format

2010-11-09T17:32:38Z

David Chiang: /* Proposed extensions (yea or nay?) / Open questions */

= Overall goal =

Make it easy to share packed representations across NLP applications.
Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages.
A memory efficient and fast representation is also useful.

= Serialization library options =

== JSON ==

[http://www.json.org/ JSON Description]

Pro:
* Implementations in every language (often packaged with language).
* Human readable
* Already used in CDec for forest output

Con:
* Space inefficient
* Requires custom parser for speed
* Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
* Some languages (e.g., Python) do not natively support event-driven parsers for JSON, meaning it's hard to process JSON files without first loading the entire file. Since parse forests can be '''big''' in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() function takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry this won't be usable for real applications.

Proposed schema:

A Forest object has the following required fields:
* '''nodes''': a list of Node objects
* '''edges''': a list of Edge objects
* '''root''': a node id, which is an integer index into the '''nodes''' list

An Edge object has the following required fields:
* '''head''': a node id
* '''tails''': a (possibly empty) list of node ids

A Node or Edge object has the following optional fields:
* '''label''': string
* '''features''': a FeatureVector object
and any other application-specific fields.
* (ChrisD) ''question'': some libraries don't represent node-level features internally (at least Joshua & cdec), so these would need to denormalize node-level features to either all incoming or outgoing edges of the node in question. This may not be completely straightforward to do. Should we possibly consider just edge-level features?

A FeatureVector object has arbitrary fields with float values.

Example (40M): http://www.isi.edu/~chiang/software/forest/example.gz

=== Proposed extensions (yea or nay?) / Open questions ===

*When a hypergraph represents a set of trees, the Node.'''label'''s will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.'''tails''', a string '''"'''''a'''''"''' would be shorthand for '''{label: "'''''a'''''"}'''

*In Node.'''label''', a value of '''null''' means that the tree node is labeled with epsilon, the empty string. This is not the same as '''""''': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
** David Chiang: nay, this should be left to the application. An empty Edge.'''tails''' list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (''t'', PRO, pro, etc.).
** ChrisD: nay. agree with David.
*When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued '''id''' field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a '''nodealiases''' field which is an object mapping from symbolic names to numeric ids.
** ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.
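A sketch of how the '''nodealiases''' variant might be consumed, under the assumption (not stated in the proposal) that '''head''', '''tails''' and '''root''' may then contain alias strings in place of numeric ids. The function name is hypothetical. Note that this reading would clash with the string shorthand for terminal labels proposed above, so at most one of the two could use bare strings in '''tails'''.

```python
def resolve_aliases(forest):
    """Return a copy of `forest` with alias strings replaced by numeric node ids."""
    aliases = forest.get("nodealiases", {})

    def resolve(ref):
        if isinstance(ref, str):
            if ref not in aliases:
                raise KeyError(f"unknown node alias: {ref!r}")
            return aliases[ref]
        return ref  # already a numeric id

    edges = [{**edge,
              "head": resolve(edge["head"]),
              "tails": [resolve(t) for t in edge["tails"]]}
             for edge in forest["edges"]]
    return {**forest, "edges": edges, "root": resolve(forest["root"])}
```

Keeping the aliases in a single mapping means name uniqueness comes for free from the mapping's keys, which addresses the concern raised for the per-node '''id''' field.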

*Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want an optional alt_tails, or a vector of tails (for multiple languages)?
** David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
<pre>
{ head: 123, tails: [456, 789],
french: ["le", 456, "que", 789],
english: ["the", 456, "that", 789],
chinese: [789, "de", 456] }
</pre>
I don't think the standard needs to specify exactly how this is done.

*What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
** David: yes, I don't think the format should care
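Even if the format itself does not care, an application can find such elements cheaply. A minimal sketch, assuming the forest has already been loaded into Python dicts and lists following the proposed schema (function names are illustrative only):

```python
from collections import defaultdict


def reachable_from_root(forest):
    """Return the set of node ids reachable from the root by following edges head -> tails."""
    by_head = defaultdict(list)
    for edge in forest["edges"]:
        by_head[edge["head"]].append(edge)
    seen = {forest["root"]}
    stack = [forest["root"]]
    while stack:
        node = stack.pop()
        for edge in by_head[node]:
            for tail in edge["tails"]:
                if tail not in seen:  # the seen-set also makes cycles harmless
                    seen.add(tail)
                    stack.append(tail)
    return seen


def unreachable_nodes(forest):
    """Return the sorted ids of nodes that cannot be reached from the root."""
    all_ids = set(range(len(forest["nodes"])))
    return sorted(all_ids - reachable_from_root(forest))
```

An edge is unreachable exactly when its head is, so the same set can be used to filter '''edges''' as well.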

== Protocol Buffers ==

[http://code.google.com/p/protobuf/ Protocol Buffer Description]

[http://github.com/srush/hypergraph Implementation Sketch]

Pro:
* Conversion to and from JSON ([http://code.google.com/p/protobuf-json/ protobuf-json])
* Very fast to read (particularly in C++ and Java, hopefully soon in Python)
* Very space efficient
* Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:
* No implementations for Perl, C#, or other languages commonly used by NLP folks
* Requires a separate library; adds an external dependency to spec
* "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on [http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html this page].
* "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

== Variation of SLF (Standard Lattice Format) ==

[http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]

Pro:
* Blindingly fast.
* Could be implemented to work lazily, in a streaming fashion.

Con:
* Requires a custom format
* Probably need specialized language bindings.

== Tiburon Format ==

[http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]

== Bindings/Libraries ==

Python
* this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.

C++
* (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.

= See also =

* [[Machine translation]]
* [[Machine translation software]]

[[Category:Machine translation]]

Hypergraph Format

2010-11-08T19:43:20Z

David Chiang: /* JSON */

Hypergraph Format

2010-11-08T19:43:03Z

David Chiang: /* Bindings/Libraries */

Hypergraph Format

2010-11-08T19:42:52Z

David Chiang: /* JSON */

Hypergraph Format

2010-11-08T18:53:55Z

David Chiang:

Hypergraph Format

2010-11-08T18:42:19Z

David Chiang: /* JSON */

Hypergraph Format

2010-11-08T18:40:52Z

David Chiang: /* JSON */

Hypergraph Format

2010-11-08T18:40:13Z

David Chiang: /* JSON */

Hypergraph Format

2010-11-08T18:17:11Z

David Chiang: proposed schema for JSON format

People

2007-04-13T21:12:47Z

David Chiang: /* C */

This is a list of homepages of researchers in Computational Linguistics, in the form '''last name, first name - affiliation'''. Most of the early additions have been moved here from the [http://www.aclweb.org/universe ACL NLP/CL Universe]. For more information about people in Computational Linguistics and Artificial Intelligence, explore their [[Academic genealogy|academic genealogy]].

== A ==

*[http://littera.deusto.es/prof/abaitua Abaitua, Joseba] - Universidad de Deusto
*[http://tony.abou-assaleh.net Abou-Assaleh, Tony] - Dalhousie University
*[http://www-personal.umich.edu/~ladamic/ Adamic, Lada] - University of Michigan
*[http://www.cond.org/ Adar, Eytan] - University of Washington
*[http://www.dfki.de/~janal/ Alexandersson, Jan] - German Research Center for Artificial Intelligence
*[http://www-scf.usc.edu/~alcazar/ Alcázar, Asier] - University of Southern California
*[http://www.cs.rochester.edu/u/james/ Allen, James] - University of Rochester
*[http://www.dc.fi.udc.es/~alonso/ Alonso, Miguel A.]
*[http://www.linguist.jussieu.fr/~amsili/ Amsili, Pascal] - University of Paris 7 - Denis Diderot
*[http://www.aueb.gr/users/ion/ Androutsopoulos, Ion] - Athens University of Economics and Business
*[http://www.carleton.ca/~asudeh/ Asudeh, Ash] - Carleton University
*[http://www.coli.uni-sb.de/~tania/ Avgustinova, Tania] - Universität des Saarlandes

== B ==

*[http://www.cs.cmu.edu/~klb Baker, Kathryn] - Carnegie Mellon University
*[http://uts.cc.utexas.edu/~jbaldrid/ Baldridge, Jason] - University of Texas at Austin
*[http://www.cs.mu.oz.au/~tim/ Baldwin, Timothy] - University of Melbourne
*[http://www.georgetown.edu/cball/cball.html Ball, Catherine] - Georgetown University
*[http://www.lsi.upc.es/~batalla Batalla, Jordi Atserias] - UPC, Spain
*[http://www5.informatik.uni-erlangen.de/Personen/batliner/ Batliner, Anton] - Friedrich-Alexander-Universität Erlangen-Nürnberg
*[http://www.dfki.de/~becker Becker, Tilman] - DFKI Saarbruecken, Germany
*[http://faculty.washington.edu/ebender Bender, Emily] - University of Washington
*[http://www.cs.mu.oz.au/~sb/ Bird, Steven] - University of Melbourne
*[http://seneca.uab.es/filfrirom/Blanco.html Blanco, Xavier] - Autonomous University of Barcelona
*[http://www.pdg.cnb.uam.es/blaschke/personalPage.html Blaschke, Christian]
*[http://www.dcs.shef.ac.uk/~kalina/ Boncheva, Kalina] - Univ. of Sheffield
*[http://www.kecl.ntt.co.jp/icl/mtg/members/bond/ Bond, Francis] - NTT Communication Science Laboratories
*[http://homepages.inf.ed.ac.uk/jbos/ Bos, Johan] - University of Rome "La Sapienza"
*[http://www.iro.umontreal.ca/~boufaden/ Boufaden, Narjès] - University of Montreal
*[http://www.let.rug.nl/~gosse Bouma, Gosse] - RU Groningen
*[http://www.di.fc.ul.pt/~ahb/ Branco, Antonio] - University of Lisbon
*[http://www.karlbranting.net Branting, Karl]
*[http://coli.uni-sb.de/~thorsten Brants, Thorsten] - University of Saarland
*[http://www.coli.uni-sb.de/~brawer Brawer, Sascha] - University of the Saarland
*[http://www.cs.cornell.edu/~ebreck Breck, Eric]
*[http://www.csse.monash.edu.au/~jwb/ Breen, Jim] - Monash University
*[http://www.informatik.uni-leipzig.de/~brewka/ Brewka, Gerhard] - University of Leipzig
*[http://research.microsoft.com/%7Ebrill/ Brill, Eric]
*[http://www.cl.cam.ac.uk/users/ejb/ Briscoe, Ted] - University of Cambridge
*[http://www.dfki.de/~paulb Buitelaar, Paul] - DFKI
*[http://www.cs.utexas.edu/users/razvan/ Bunescu, Razvan] - University of Texas at Austin

== C ==

*[http://www.hinocatv.ne.jp/~price/ Caldwell, Price] - Meisei University
*[http://www.acs.ilstu.edu/faculty/mecalif/calif.htm Califf, Mary Elaine] - Illinois State University
*[http://ilk.uvt.nl/~sander/ Canisius, Sander] - Tilburg University
*[http://www.cis.upenn.edu/~cliff-group/94/carberry.html Carberry, Sandra] - Univ. of Delaware, Univ. of Pennsylvania
*[http://www.cs.cornell.edu/Info/Faculty/Claire_Cardie.html Cardie, Claire] - Cornell University
*[http://www.cogs.susx.ac.uk/lab/nlp/carroll/carroll.html Carroll, John] - University of Sussex
*[http://jones.ling.indiana.edu/~dcavar Cavar, Damir] - Indiana University, Bloomington
*[http://tantek.com/map.html Celik, Tantek] - Technorati
*[http://cer.freeshell.org Cer, Daniel] - University of Colorado at Boulder
*[http://www.cs.uleth.ca/~chali Chali, Yllias] - University of Lethbridge
*[http://www.cs.brown.edu/people/ec/home.html Charniak, Eugene] - Brown University
*[http://www.ciscl.unisi.it/persone/chesi.htm Chesi, Cristiano] - CISCL, University of Siena
*[http://www.isi.edu/~chiang Chiang, David] - USC Information Sciences Institute
*[http://www.alphabit.net/Docente/docente_eng.htm Chiari, Isabella] - University "La Sapienza" of Rome
*[http://korterm.kaist.ac.kr/kschoi/ Choi, Key-Sun] - Korea Advanced Institute of Science and Technology
*[http://web.mit.edu/afs/athena.mit.edu/org/l/linguistics/www/chomsky.home.html Chomsky, Noam] - MIT
*[http://research.microsoft.com/users/church/ Church, Kenneth] - Microsoft
*[http://www.dcs.shef.ac.uk/~fabio/ Ciravegna, Fabio] - University of Sheffield
*[http://www.iccs.informatics.ed.ac.uk/~stephenc Clark, Stephen] - University of Edinburgh
*[http://compbio.uchsc.edu/Hunter_lab/Cohen Cohen, Kevin Bretonnel] - U. Colorado School of Medicine
*[http://people.csail.mit.edu/u/m/mcollins/public_html/ Collins, Michael] - MIT Computer Science and Artificial Intelligence Laboratory
*[http://www.cl.cam.ac.uk/users/aac10/ Copestake, Ann] - University of Cambridge
*[http://lands.let.kun.nl/TSpublic/coppen Coppen, Peter-Arno] - University of Nijmegen, The Netherlands
*[http://plg.uwaterloo.ca/~gvcormac/ Cormack, Gordon] - University of Waterloo
*[http://www.psych.qub.ac.uk/staff/teaching/cowie/index.aspx Cowie, Roddy] - Queen's University, Belfast
*[http://www.biostat.wisc.edu/~craven/ Craven, Mark] - University of Wisconsin
*[http://www2.ulster.ac.uk/staff/n.creaney.html Creaney, Norman] - University of Ulster
*[http://www.dia.uniroma3.it/~crescenz/ Crescenzi, Valter] - Università Roma Tre
*[http://www.harlequin.com/ Crowe, Jeremy] - Harlequin Ltd.
*[http://www.dcs.shef.ac.uk/~hamish Cunningham, Hamish] - University of Sheffield
*[http://www-users.cs.york.ac.uk/~jc/ Cussens, James] - University of York

== D ==

*[http://conversational-technologies.com Dahl, Deborah] - Conversational Technologies
*[http://stl.recherche.univ-lille3.fr/sitespersonnels/dal/index.html Dal, Georgette] - Universite de Lille
*[http://www.ics.mq.edu.au/~rdale Dale, Robert] - Centre for Language Technology, Macquarie University
*[http://www.cs.utah.edu/~hal/ Daumé III, Hal] - University of Utah
*[http://davies-linguistics.byu.edu Davies, Mark] - Brigham Young University
*[http://www.csi.uottawa.ca/~delannoy Delannoy, Jean-Francois] - University of Ottawa
*[http://comp.ling.utexas.edu/denis Denis, Pascal] - University of Texas at Austin
*[http://www.math.bas.bg/~iad/ Derzhanski, Ivan] - Bulgarian Academy of Sciences
*[http://www.ling.ohio-state.edu/~dm/ Detmar Meurers, Walt] - The Ohio State University Linguistics Dept.
*[http://www.limsi.fr/Individu/devil/ Devillers, Laurence] - LIMSI
*[http://www.cs.umd.edu/users/bonnie/ Dorr, Bonnie] - University of Maryland
*[http://www.nyu.edu/pages/linguistics/doughert.html Dougherty, Ray] - New York University
*[http://www.ai.sri.com/~dowding Dowding, John] - SRI

== E ==

*[http://www.uni-bielefeld.de/lili/personen/cebert/ Ebert, Christian] - University of Bielefeld
*[http://www.ims.uni-stuttgart.de/~eckle/ Eckle-Kohler, Judith]
*[http://www.philipedmonds.com/ Edmonds, Philip] - University of Toronto
*[http://www.cs.bgu.ac.il/~elhadad/ Elhadad, Michael] - Ben-Gurion University of the Negev
*[http://www.cogsci.ed.ac.uk/~marke/ Ellison, T. Mark] - University of Edinburgh
*[http://www.sciences.univ-nantes.fr/info/perso/permanents/enguehard/ Enguehard, Chantal] - Laboratoire d'Informatique de Nantes Atlantique
*[http://coli.uni-sb.de/~erbach/ Erbach, Gregor] - Universität des Saarlandes
*[http://nl.ijs.si/et/ Erjavec, Tomaz]
*[http://comp.ling.utexas.edu/erk/ Erk, Katrin] - University of Texas at Austin
*[http://www.cogsci.uni-osnabrueck.de/~severt/ Evert, Stefan] - University of Osnabrück

== F ==

*[http://slt.wcl.ee.upatras.gr/Fakotakis/personal.htm Fakotakis, Nikos] - University of Patras
*[http://www.phon.ucl.ac.uk/home/alex/home.htm Fang, Alex Chengyu] - University College London
*[http://wordnet.princeton.edu/~fellbaum/ Fellbaum, Christiane] - Princeton University
*[http://ling.cuc.edu.cn/htliu/feng/feng.htm Feng, Zhiwei] - IAL of China
*[http://www.cs.umbc.edu/~finin/ Finin, Tim] - University of Maryland
*[http://lingo.stanford.edu/dan/ Flickinger, Dan] - CSLI, Stanford University
*[http://www.icsi.berkeley.edu/~fosler Fosler, Eric] - ICSI, University of California at Berkeley
*[http://www.coli.uni-saarland.de/~fouvry/ Fouvry, Frederik]
*[http://www.cs.technion.ac.il/~francez Francez, Nissim] - Technion, Israel
*[http://www.cs.cmu.edu/~ref/ Frederking, Robert] - Carnegie-Mellon University
*[http://www.ee.ust.hk/~pascale/ Fung, Pascale] - Hong Kong University of Science and Technology

== G ==

*[http://www.cs.technion.ac.il/~gabr Gabrilovich, Evgeniy]
*[http://www.dcs.shef.ac.uk/~robertg/ Gaizauskas, Rob] - University of Sheffield
*[http://www.sics.se/~gamback/ Gamback, Bjorn] - Swedish Institute of Computer Science
*[http://www.gelbukh.com/ Gelbukh, Alexander] - CIC-IPN
*[http://www.isi.edu/natural-language/people/germann/ Germann, Ulrich] - ISI
*[https://netfiles.uiuc.edu/girju/index.html Girju, Roxana] - University of Illinois, Urbana-Champaign
*[http://tcc.itc.it/people/giuliano.html Giuliano, Claudio] - ITC-irst
*[http://www.uni-salzburg.at/portal/page?_pageid=425,405845&_dad=portal&_schema=PORTAL Goebl, Hans] - Universität Salzburg
*[http://www.esi.uem.es/~jmgomez Gomez-Hidalgo, Jose-Maria] - UEM
*[http://www.linguistics.ucsb.edu/faculty/stgries/ Gries, Stefan Th.] - UCSB
*[http://cs.nyu.edu/cs/faculty/grishman/ Grishman, Ralph] - New York University
*[http://das-www.harvard.edu/users/faculty/Barbara_Grosz/Barbara_Grosz.html Grosz, Barbara] - Harvard University
*[http://www-ksl.stanford.edu/people/gruber/ Gruber, Tom] - Stanford University
*[http://www.cs.duke.edu/~cig Guinn, Curry I.] - Duke U.
*[http://www.ukp.tu-darmstadt.de/ Gurevych, Iryna] - Darmstadt University of Technology
*[http://www.cs.bilkent.edu.tr/~guvenir/guvenir.html Guvenir, Altay] - Bilkent University

== H ==

*[http://www.swan.ac.uk/french/web-content/staff/p-ten-hacken.html Hacken, Pius ten] - Swansea University
*[http://www.coling.uni-freiburg.de/~hahn/hahn.html Hahn, Udo] - University of Freiburg
*[http://www.comp.nus.edu.sg/~cuihang Hang, Cui] - National University of Singapore
*[http://www.coli.uni-sb.de/~hansen Hansen-Schirra, Silvia] - Universität des Saarlandes
*[http://www.cognia.com/ Harkema, Henk] - Cognia EU
*[http://pi7.fernuni-hagen.de/hartrumpf/ Hartrumpf, Sven] - University of Hagen, Germany
*[http://www.cis.udel.edu/~harvey/ Harvey, Terry]
*[http://www.linguistik.uni-erlangen.de/~rrh/ Hausser, Roland] - University of Erlangen, Germany
*[http://www.sims.berkeley.edu/~hearst Hearst, Marti] - UC Berkeley
*[http://www.cse.ogi.edu/~heeman Heeman, Peter] - OGI
*[http://homepages.inf.ed.ac.uk/jhender6/ Henderson, James] - University of Edinburgh
*[http://www.asp.ogi.edu/~hynek/ Hermansky, Hynek] - Oregon Graduate Institute of Science and Technology
*[http://www.isi.edu/~ulf/ Hermjakob, Ulf] - USC/ISI
*[http://www.esi.uem.es/~jmgomez/ Hidalgo, José María Gómez] - Universidad Europea de Madrid
*[http://www.ifi.unizh.ch/staff/hess.html Hess, Michael] - Univ. of Zurich, Switzerland
*[http://www.cs.toronto.edu/~gh Hirst, Graeme] - University of Toronto
*[http://www.isi.edu/~hobbs/ Hobbs, Jerry] - USC/ISI
*[http://www.cs.cmu.edu/~chogan Hogan, Christopher] - Carnegie-Mellon University
*[http://www.isi.edu/natural-language/people/hovy.html Hovy, Eduard] - ISI
*[http://ist-socrates.berkeley.edu/~jcl2/churen.htm Huang, Chu-Ren] - Academia Sinica
*[http://www.cs.ucf.edu/~hull Hull, Richard] - University of Central Florida
*[http://compbio.uchsc.edu/Hunter_lab/Hunter Hunter, Larry] - U. Colorado School of Medicine
*[http://datamining.typepad.com/data_mining/ Hurst, Matthew] - BuzzMetrics
*[http://ourworld.compuserve.com/homepages/WJHutchins/ Hutchins, John]

== I ==

== J ==

*[http://www.cis.upenn.edu/~cliff-group/94/pjacobs.html Jacobs, Paul] - General Electric
*[http://www.stanford.edu/~tiflo Jaeger, T. Florian] - Stanford University
*[http://ist.psu.edu/faculty_pages/jjansen/ Jansen, Jim] - Penn State
*[http://www.ida.liu.se/~arnjo/ Jönsson, Arne] - Linkoping University
*[http://www.cog.brown.edu/~mj Johnson, Mark] - Brown University

== K ==

*[http://www.ai.sri.com/~megumi Kameyama, Megumi] - SRI International
*[http://www.comp.nus.edu.sg/~kanmy Kan, Min-Yen] - National University of Singapore
*[http://users.utu.fi/karhumak/ Karhumaki, Juhani] - University of Turku
*[http://www.sics.se/~jussi/ Karlgren, Jussi] - SICS, Sweden
*[http://www2.parc.com/istl/members/karttune/ Karttunen, Lauri]
*[http://elex.amu.edu.pl/ifa/staff/kaszubski.html Kaszubski, Przemysław] - Adam Mickiewicz University
*[http://www.cs.utexas.edu/users/rjkate/ Kate, Rohit J.] - University of Texas at Austin
*[http://www-users.cs.york.ac.uk/~kazakov/ Kazakov, Dimitar] - University of York
*[http://homepages.inf.ed.ac.uk/keller/ Keller, Frank] - University of Edinburgh
*[http://www.mabidkhan.com/ Khan, Abid] - University of Peshawar, Pakistan
*[http://www.itri.bton.ac.uk/~Adam.Kilgarriff Kilgarriff, Adam] - University of Brighton
*[http://www.cs.wisc.edu/~sklein/sklein.html Klein, Sheldon] - University of Wisconsin
*[http://www.isi.edu/~knight/ Knight, Kevin] - ISI
*[http://www.iccs.inf.ed.ac.uk/~pkoehn/ Koehn, Philipp] - University of Edinburgh
*[http://svenska.gu.se/~svedk Kokkinakis, Dimitrios] - Göteborg University
*[http://www.coli.uni-saarland.de/~kordoni/ Kordoni, Valia] - Universität des Saarlandes
*[http://www.kornai.com/ Kornai, Andras]
*[http://www.ling.helsinki.fi/~koskenni/ Koskenniemi, Kimmo] - University of Helsinki
*[http://users.encs.concordia.ca/~kosseim/ Kosseim, Leila] - Concordia University, Montreal
*[http://www.dlsi.ua.es/~zkozareva/ Kozareva, Zornitsa] - University of Alicante
*[http://dis.tpd.tno.nl/mmts/wessel_kraaij.html Kraaij, Wessel] - TNO
*[http://www-sk.let.uu.nl Krauwer, Steven] - ELSNET, Utrecht University
*[http://www.peter-kuehnlein.net/ Kuehnlein, Peter] - Bielefeld University
*[http://jones.ling.indiana.edu/~skuebler/ Kuebler, Sandra] - Indiana University, Bloomington
*[http://www.cs.ucd.ie/staff/nick/ Kushmerick, Nicholas] - University College, Dublin

== L ==

*[http://www.ling.gu.se/~lager/ Lager, Torbjörn] - Göteborg University
*[http://www.ict.csiro.au/staff/Andrew.Lampert/ Lampert, Andrew] - CSIRO ICT Centre / Macquarie University
*[http://tcc.itc.it/people/lavelli/ Lavelli, Alberto] - ITC-IRST
*[http://www-personal.umich.edu/~jlawler/index.html Lawler, John] - University of Michigan
*[http://nlp.postech.ac.kr/~gblee Lee, Geunbae] - POSTECH
*[http://www.cs.bham.ac.uk/~mgl Lee, Mark] - University of Birmingham
*[http://www.ling.lancs.ac.uk/staff/geoff/geoff.htm Leech, Geoffrey] - Professor LAMEL, Lancaster University, UK
*[http://www.iccs.inf.ed.ac.uk/~s0239229/ Leidner, Jochen L.]
*[http://homepages.inf.ed.ac.uk/olemon Lemon, Oliver]
*[http://www.ilc.cnr.it/~lenci/ Lenci, Alessandro] - Università di Pisa
*[http://people.cs.uchicago.edu/~levow/ Levow, Gina-Anne] - University of Chicago
*[http://www.ling.upenn.edu/~myl/ Liberman, Mark] - University of Pennsylvania
*[http://www.cs.cornell.edu/home/llee Lee, Lillian] - Cornell University
*[http://www.cs.umanitoba.ca/~lindek/ Lin, Dekang] - University of Manitoba
*[http://htliu.yeah.net/ Liu, Haitao] - Communication University of China
*[http://www.langnat.com/~loupy/index-en.html Loupy, Claude de] - Universite de Paris X Nanterre
*[http://www.personal.psu.edu/xxl13 Lu, Xiaofei] - Pennsylvania State University
*[http://mtgroup.ict.ac.cn/~liuyang/ Liu, Yang] - Institute of Computing Technology, CAS

== M ==

*[http://www.soi.city.ac.uk/~andym/ MacFarlane, Andrew] - City University of London
*[http://www-cs-students.Stanford.EDU/~magerman Magerman, David] - Stanford University
*[http://tcc.itc.it/people/magnini.html Magnini, Bernardo] - ITC-IRST
*[http://www.karacaymalkar.com Malkar, Karacay] - Webportal
*[http://www.rohan.sdsu.edu/~malouf Malouf, Rob] - San Diego State University
*[http://www.sultry.arts.usyd.edu.au/ Manning, Christopher] - University of Sydney
*[http://www.isi.edu/~marcu/ Marcu, Daniel] - USC/ISI
*[http://overstated.net/about Marlow, Cameron] - Yahoo! Research
*[http://www.limsi.fr/Individu/martin/ Martin, Jean-Claude] - LIMSI
*[http://www.let.rug.nl/~begona/ Moirón, Begoña Villada] - University of Groningen
*[http://www.ics.mq.edu.au/~mpawel Mazur, Pawel] - Wroclaw University of Technology and Macquarie University
*[http://www.informatics.susx.ac.uk/research/nlp/mccarthy/mccarthy.html McCarthy, Diana] - University of Sussex
*[http://homepages.inf.ed.ac.uk/mmcconvi McConville, Mark] - University of Edinburgh
*[http://www.eecis.udel.edu/~mccoy/ McCoy, Kathy] - University of Delaware
*[http://alum.mit.edu/www/davidmcdonald/ McDonald, David] - BBN Technologies
*[http://stp.lingfil.uu.se/~bea Megyesi, B. Beata] - Uppsala University
*[http://cs.nyu.edu/~melamed Melamed, I. Dan] - New York University
*[http://www.cs.unt.edu/~rada Mihalcea, Rada] - University of North Texas
*[http://www.cis.upenn.edu/~elenimi/ Miltsakaki, Eleni] - University of Pennsylvania
*[http://imaginarycartography.com/work.html Minor, Joshua T.] - Cataphora, Inc.
*[http://staff.science.uva.nl/~gilad/ Mishne, Gilad] - University of Amsterdam
*[http://www.wlv.ac.uk/~le1825/main.html Mitkov, Ruslan] - University of Wolverhampton
*[http://www.ifi.unizh.ch/~molla/ Molla-Aliod, Diego] - University of Zurich
*[http://www.dcs.qmul.ac.uk/~christof/ Monz, Christof] - University of Amsterdam (ILLC)
*[http://www.cs.utexas.edu/users/mooney/ Mooney, Raymond J.] - University of Texas at Austin
*[http://www.signiform.com/erik/ Mueller, Erik] - IBM Research
*[http://www.xn--stefan-mller-klb.net/ Müller, Stefan] - Universität Bremen
*[http://www.dlsi.ua.es/eines/membre.cgi?id=eng&nom=rafael&tipus=pdi Muñoz, Rafael] - University of Alicante

== N ==

*[http://www.cs.utexas.edu/users/ai-lab/people/grad/nahm.html Nahm, Un Yong] - University of Texas, Austin
*[http://www.univ-nancy2.fr/pers/namer/ Namer, Fiammetta] - University of Nancy
*[http://www.lr.pi.titech.ac.jp/~nanno/index.cgi?page=Tomoyuki+NANNO Nanno, Tomoyuki] - Tokyo Institute of Technology
*[http://www.dlsi.ua.es/~borja/ Navarro, Borja] - University of Alicante, Spain
*[http://tcc.itc.it/people/negri.html Negri, Matteo] - ITC-irst
*[http://www.let.rug.nl/~nerbonne Nerbonne, John] - RU Groningen
*[http://cl-www.dfki.uni-sb.de/~neumann Neumann, Guenter] - DFKI, Saarbrücken
*[http://www.comp.nus.edu.sg/~nght Ng, Hwee Tou] - National University of Singapore
*[http://www.slt.atr.co.jp/~night/ Nightingale, Stephen] - ATR Institute International
*[http://homepages.inf.ed.ac.uk/mnissim/ Nissim, Malvina] - University of Bologna
*[http://www.comp.nus.edu.sg/~niuzheng Niu, Zheng-Yu] - NU Singapore
*[http://w3.msi.vxu.se/~nivre/ Nivre, Joakim] - Växjö University
*[http://www.cs.berkeley.edu/~russell/norvig.html Norvig, Peter]

== O ==

*[http://www.ltg.ed.ac.uk/~jon/ Oberlander, Jon] - U. Edinburgh
*[http://people.sabanciuniv.edu/oflazer/ Oflazer, Kemal] - Sabanci University, Istanbul, Turkey
*[http://www.loa-cnr.it/oltramari.html Oltramari, Alessandro] - Laboratory for Applied Ontology, Italian National Research Council
*[http://www.wlv.ac.uk/~in6093/ Orasan, Constantin] - University of Wolverhampton
*[http://www.bultreebank.org/petya/OsenovaPub.html Osenova, Petya] - Bulgarian Academy of Sciences

== P ==

*[http://cst.dk/patrizia/ Paggio, Patrizia] - University of Copenhagen
*[http://www.slt.atr.co.jp/~kpaik/ Paik, Kyonghee] - ATR Spoken Language Translation Research Laboratories
*[http://www.cs.cornell.edu/People/pabo Pang, Bo] - Cornell University
*[http://verbs.colorado.edu/~mpalmer/ Palmer, Martha] - University of Colorado
*[http://www.isi.edu/~pantel/ Pantel, Patrick] - ISI/University of Southern California
*[http://www.l2f.inesc-id.pt/~joana/english.html Paulo Pardal, Joana] - L²F, INESC-ID
*[http://www.d.umn.edu/~tpederse Pedersen, Ted] - University of Minnesota, Duluth
*[http://ai-nlp.info.uniroma2.it/pennacchiotti Pennacchiotti, Marco] - University of Roma Tor Vergata
*[http://www.perry.com/ Perry, John] - UCLA
*[http://tcc.itc.it/people/pianesi.html Pianesi, Fabio] - ITC-irst
*[http://www.resegone.com/mapb/ Piccolino Boniforti, Marco Aldo] - Rovira i Virgili University
*[http://cswww.essex.ac.uk/staff/poesio Poesio, Massimo] - University of Essex
*[http://www.fas.umontreal.ca/ling/olst/polguereE Polguere, Alain] - Université de Montréal
*[http://fas.sfu.ca/0h/cs/people/Faculty/Popowich/popowich Popowich, Fred] - Simon Fraser University
*[http://nlp.ipipan.waw.pl/~adamp/ Przepiórkowski, Adam] - Polish Academy of Sciences, Warsaw
*[http://www.ling-phil.ox.ac.uk/people/staff/pulman/ Pulman, Stephen] - Oxford University
*[http://www.cs.brandeis.edu/~jamesp Pustejovsky, James] - Brandeis University

== Q ==

== R ==

*[http://www.eecs.umich.edu/~radev/ Radev, Dragomir] - University of Michigan
*[http://www.fask.uni-mainz.de/user/rapp Rapp, Reinhard] - Johannes Gutenberg-Universitaet Mainz
*[http://www.cs.buffalo.edu/pub/WWW/faculty/rapaport/rapaport.html Rapaport, William J.] - SUNY Buffalo
*[http://www.cis.upenn.edu/~cliff-group/94/lrau.html Rau, Lisa]
*[http://www.comp.lancs.ac.uk/computing/users/paul/ Rayson, Paul] - Lancaster University
*[http://www.csd.abdn.ac.uk/~ereiter Reiter, Ehud] - University of Aberdeen
*[http://www.dfki.uni-sb.de/~bert Reithinger, Norbert] - Universität des Saarlandes
*[http://www.reitter-it-media.de/ Reitter, David] - University of Edinburgh
*[http://www.ai.mit.edu/~jrennie/ Rennie, Jason] - MIT
*[http://umiacs.umd.edu/~resnik Resnik, Philip] - University of Maryland, College Park
*[http://www.cs.utah.edu/~riloff/ Riloff, Ellen] - University of Utah
*[http://www.cs.rochester.edu/u/ringger/ Ringger, Eric] - University of Rochester
*[http://www.di.ufpe.br/~jr Robin, Jacques] - Federal University of Pernambuco, Brazil
*[http://www.univ-ab.pt/~vjr/ Rocio, Vitor] - Open University, Lisbon
*[http://jones.ling.indiana.edu/~prrodrig/ Rodrigues, Paul] - Indiana University, Bloomington
*[http://www.cs.uiuc.edu/directory/directory.php?name=roth Roth, Dan] - University of Illinois, Urbana-Champaign
*[http://www.public.asu.edu/~droussi/ Roussinov, Dmitri] - Arizona State University
*[http://www.hi.is/~eirikur/ Rögnvaldsson, Eiríkur] - University of Iceland
*[http://www.uteroemer.de/ Römer, Ute] - University of Hanover

== S ==

*[http://www.cis.upenn.edu/~anoop/ Sarkar, Anoop] - University of Pennsylvania
*[http://personalpages.manchester.ac.uk/staff/yutaka.sasaki/ Sasaki, Yutaka] - University of Manchester
*[http://www.cog.jhu.edu/~savova/ Savova, Virginia] - MIT
*[http://www.dfki.de/~uschaefer Schaefer, Ulrich] - German Research Center for Artificial Intelligence
*[http://www7.informatik.tu-muenchen.de/~scheler Scheler, Gabriele] - TU München
*[http://www.ims.uni-stuttgart.de/~mike/ Schiehlen, Michael] - University of Stuttgart
*[http://www.ims.uni-stuttgart.de/~schmid/ Schmid, Helmut] - University of Stuttgart
*[http://www.kde.cs.uni-kassel.de/schmitz Schmitz, Christoph] - Universität Kassel
*[http://www.schulteimwalde.de/ Schulte im Walde, Sabine]
*[http://www.ics.mq.edu.au/~rolfs Schwitter, Rolf] - Macquarie University
*[http://mcs.open.ac.uk/ds5473/ Scott, Donia] - The Open University
*[http://nlp.cs.nyu.edu/sekine Sekine, Satoshi] - New York University
*[http://www.eecs.harvard.edu/~shieber/ Shieber, Stuart] - Harvard University
*[http://www.cs.rochester.edu/u/sikorski/ Sikorski, Teresa] - University of Rochester
*[http://www.lingsoft.fi/~silvonen/ Silvonen, Mikko] - Lingsoft, Inc.
*[http://www.bultreebank.org/kivs/ Simov, Kiril] - Bulgarian Academy of Sciences
*[http://www.utexas.edu/cola/centers/lrc/general/facultyhomes/jonathan.html Slocum, Jonathan] - The University of Texas at Austin
*[http://www.cog.jhu.edu/faculty/smolensky.html Smolensky, Paul] - Johns Hopkins University
*[http://www.ece.uiuc.edu/faculty/faculty.asp?rws Sproat, Richard] - University of Illinois, Urbana-Champaign
*[http://www.coling.uni-freiburg.de/~staab/staab.html Staab, Steffen] - Freiburg University
*[http://www.humnet.ucla.edu/humnet/linguistics/people/stabler/stabler.htm Stabler, Edward] - UCLA
*[http://slt.wcl.ee.upatras.gr/stamatatos/personal.html Stamatatos, Efstathios] - University of Patras
*[http://www.cs.toronto.edu/~suzanne/ Stevenson, Suzanne] - University of Toronto
*[http://isl.ira.uka.de/~stiefel Stiefelhagen, Rainer] - Universität Karlsruhe
*[http://www.coling.uni-freiburg.de/~strube/strube.html Strube, Michael] - University of Freiburg
*[http://www.csi.uottawa.ca/~szpak/ Szpakowicz, Stan] - University of Ottawa

== T ==

*[http://www.sfu.ca/~mtaboada Taboada, Maite] - Simon Fraser University
*[http://hnk.ffzg.hr/mt/ Tadic, Marko] - Faculty of Philosophy, University of Zagreb
*[http://www.ling.helsinki.fi/~tapanain Tapanainen, Pasi] - University of Helsinki
*[http://www8.informatik.uni-erlangen.de/inf8/en/thabet.html Thabet, Iman] - University of Erlangen-Nuremberg
*[http://www.siit.tu.ac.th/dirctory/ft_fac/thanaruk.html Theeramunkong, Thanaruk] - Sirindhorn International Institute of Technology, Thammasat University
*[http://www.objs.com/thompson.htm Thompson, Craig] - Object Services and Consulting, Inc.
*[http://www.let.rug.nl/~tiedeman/blog/index.php?category=1 Tiedemann, Jörg] - University of Groningen
*[http://tecfa.unige.ch/tecfa-people/traum.html Traum, David] - TECFA, Universite de Geneve
*[http://www.hum.uit.no/a/trond/ Trosterud, Trond] - University of Tromsø
*[http://www.racai.ro/~tufis/ Tufis, Dan] - Research Institute for Artificial Intelligence, Romanian Academy
*[http://www.apperceptual.com/ Turney, Peter] - National Research Council of Canada

== U ==

*[http://www.coli.uni-sb.de/~hansu Uszkoreit, Hans] - University of the Saarland and DFKI Saarbrücken

== V ==

*[http://www.q-go.com/ van de Burgt, Stan P.] - Q-go.com
*[http://ilk.uvt.nl/~antalb/ van den Bosch, Antal] - Tilburg University
*[http://www.media.mit.edu/~nwv/ Van Dyke, Neil] - MIT Media Lab
*[http://www.ua.es/personal/chelo.vargas Vargas, Chelo Sierra] - Universidad de Alicante
*[http://www.cs.brandeis.edu/~marc/home.html Verhagen, Marc] - Brandeis University
*[http://www.up.univ-mrs.fr/veronis/ Véronis, Jean] - Université de Provence
*[http://www.dlsi.ua.es/~vicedo/vicedo_en.html Vicedo, Jose Luis] - Alicante University
*[http://www.inf.unisinos.br/~renata/ Vieira, Renata] - Universidade do Vale do Rio dos Sinos, Brazil
*[http://www.cl.cam.ac.uk/~av208/ Villavicencio, Aline] - Federal University of Rio Grande do Sul, Brazil
*[http://www.ling.helsinki.fi/~avoutila/ Voutilainen, Atro] - University of Helsinki

== W ==

*[http://www.dfki.de/~wahlster/ Wahlster, Wolfgang] - Universität des Saarlandes
*[http://www.uindy.gr/faculty/cv/wallace_manolis/ Wallace, Manolis] - National Technical University of Athens
*[http://www.nigelward.com/ Ward, Nigel]
*[http://www.nick-webb.net Webb, Nick] - SUNY Albany
*[http://www.pages.drexel.edu/~rw37/ Weber, Rosina] - Drexel University
*[http://www.ucsc.cmb.ac.lk/people/arw Weerasinghe, Ruvan] - University of Colombo School of Computing
*[http://www.cs.tu-berlin.de/~ww/ Weisweber, Wilhelm] - Technical University of Berlin
*[http://www.ukp.tu-darmstadt.de Weimer, Markus] - University of Technology Darmstadt
*[http://www.cis.upenn.edu/~bonnie Webber, Bonnie Lynn] - University of Pennsylvania
*[http://www.dcs.shef.ac.uk/~yorick Wilks, Yorick] - University of Sheffield
*[http://cs.haifa.ac.il/~shuly Wintner, Shuly] - University of Haifa, Israel
*[http://www.se.cuhk.edu.hk/~kfwong/ Wong, Kam-Fai] - Chinese University of Hong Kong
*[http://www.cs.utexas.edu/users/ywwong/ Wong, Yuk Wah] - University of Texas at Austin
*[http://www.cs.man.ac.uk/~wroec/ Wroe, Chris] - University of Manchester
*[http://www.cs.ust.hk/faculty/dekai/bio.html Wu, Dekai] - HKUST

== X ==

== Y ==

*[http://www.cs.helsinki.fi/u/yangarbe/ Yangarber, Roman] - University of Helsinki
*[http://www.cis.upenn.edu/~cliff-group/94/yarowsky.html Yarowsky, David] - University of Pennsylvania
*[http://www.ai.mit.edu/people/deniz Yuret, Deniz] - MIT Artificial Intelligence Laboratory

== Z ==

*[http://ai-nlp.info.uniroma2.it/zanzotto Zanzotto, Fabio Massimo] - University of Roma Tor Vergata
*[http://www.ukp.tu-darmstadt.de/ Zesch, Torsten] - Darmstadt University of Technology
*[http://www.csse.monash.edu.au/~ingrid/ Zukerman, Ingrid] - Monash University