Difference between revisions of "Hypergraph Format"

From ACL Wiki
(proposed schema for JSON format)

Revision as of 12:17, 8 November 2010

Overall goal

Make it easy to share packed representations across NLP applications. The spec should therefore be, above all, easy to use from a variety of platforms and languages. A memory-efficient and fast representation is also useful.

Serialization library options

JSON

JSON Description

Pro:

  • Implementations in nearly every language (often bundled with the language itself).
  • Human readable
  • Already used in CDec for forest output

Con:

  • Space inefficient
  • Requires a custom parser for speed
  • Needs additional code to check that hypergraphs are well formed, since there is no schema for JSON objects

Proposed schema:

A Forest object has the following required fields:

  • nodes: a list of Node objects
  • edges: a list of Edge objects
  • root: a node id, which is an integer index into the nodes list

A Node object has the following optional fields:

  • label: string
  • features: a FeatureVector object

An Edge object has the following required fields:

  • head: a node id
  • tails: a list of node ids

and the following optional fields:

  • label: string
  • features: a FeatureVector object

A FeatureVector object has arbitrary fields with float values.
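
The proposed schema can be illustrated with a small Python sketch that parses a Forest and checks it against the field lists above. This is only one possible reading of the proposal: the function name `validate_forest`, the example labels, and the feature name `logprob` are hypothetical, and the sketch assumes that integer feature values are acceptable wherever a float is expected.

```python
import json

def validate_forest(forest):
    """Return a list of problems; an empty list means the forest looks well formed."""
    if not isinstance(forest, dict):
        return ["forest must be a JSON object"]
    problems = []
    for field in ("nodes", "edges"):
        if not isinstance(forest.get(field), list):
            problems.append(f"{field}: required list is missing")
    if "root" not in forest:
        problems.append("root: required field is missing")
    if problems:
        return problems

    num_nodes = len(forest["nodes"])

    def is_node_id(value):
        # A node id is an integer index into the nodes list (bool is excluded).
        return (isinstance(value, int) and not isinstance(value, bool)
                and 0 <= value < num_nodes)

    def check_optional(obj, where):
        # Optional fields shared by Node and Edge: label and features.
        if not isinstance(obj, dict):
            problems.append(f"{where}: must be a JSON object")
            return
        if "label" in obj and not isinstance(obj["label"], str):
            problems.append(f"{where}: label must be a string")
        features = obj.get("features", {})
        if not isinstance(features, dict):
            problems.append(f"{where}: features must be a JSON object")
            return
        for name, value in features.items():
            # Assumption: integers are accepted in place of floats.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{where}: feature {name!r} must be a number")

    if not is_node_id(forest["root"]):
        problems.append("root: not a valid node id")
    for i, node in enumerate(forest["nodes"]):
        check_optional(node, f"nodes[{i}]")
    for i, edge in enumerate(forest["edges"]):
        where = f"edges[{i}]"
        check_optional(edge, where)
        if not isinstance(edge, dict):
            continue
        if not is_node_id(edge.get("head")):
            problems.append(f"{where}: head is not a valid node id")
        tails = edge.get("tails")
        if not isinstance(tails, list) or not all(is_node_id(t) for t in tails):
            problems.append(f"{where}: tails must be a list of valid node ids")
    return problems

# A hypothetical three-node forest; labels and feature names are made up.
EXAMPLE = """
{
  "nodes": [{"label": "S"}, {"label": "NP"}, {"label": "VP"}],
  "edges": [
    {"head": 0, "tails": [1, 2], "label": "S -> NP VP",
     "features": {"logprob": -1.5}},
    {"head": 1, "tails": []},
    {"head": 2, "tails": []}
  ],
  "root": 0
}
"""

forest = json.loads(EXAMPLE)
problems = validate_forest(forest)
```

A check of this kind is the "additional code" listed among the cons above: the JSON parser only guarantees syntactically valid JSON, so dangling node ids or mistyped fields have to be caught separately.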

Protocol Buffers

Protocol Buffer Description

Implementation Sketch

Pro:

  • Conversion to and from JSON (protobuf-json)
  • Very fast to read (particularly in C++ and Java, hopefully soon in Python)
  • Very space efficient
  • Implementations in Java, C++ and Python; generates typed stubs in those languages

Con:

  • No implementations for Perl, C#, or other languages commonly used by NLP folks
  • Requires a separate library; adds an external dependency to the spec
  • "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on this page.
  • "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

Variation of SLF (Standard Lattice Format)

SLF Specification

Pro:

  • Blindingly fast.
  • Could be implemented to work lazily, in a streaming fashion.

Con:

  • Requires a custom format
  • Probably needs specialized language bindings.

Tiburon Format

Tiburon Specification

See also