Wednesday, December 21, 2011

The Freebase node

In the [Freebase] quad dump, every line is a Freebase node. Think of a node as an atomic fact. A fact in this context is a single proposition that relates an entity (an object) to a predicate.
The entity is represented by a mid, a machine-generated id, which is in the first column (the "source" column). The source column always contains a value.
The predicate is represented by at least two, and sometimes three, of the remaining columns.
The second column is the "property". Values in this column are names like "/type/object/name". These represent a particular kind of quality of the entity mentioned in the "source" column. Like the source column, the property column always contains a value.
The value of the property is held in the remaining columns ("destination" and "value"). Depending on the kind of property, either or both of the destination and value columns hold a value. Also depending on the kind of property, a single property name can appear multiple times for a particular mid in the source column. In this latter case, the property is multivalued or represents a 1:m relationship with a set of other entities.
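To make the four-column layout concrete, here is a minimal sketch of reading such lines. It assumes the dump is tab-separated in the order source, property, destination, value; the mids and sample rows are invented for illustration, and the names `Quad`, `parse_quad` and `group_properties` are hypothetical, not part of any Freebase tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Quad(NamedTuple):
    source: str                  # the mid; always present
    prop: str                    # the property name; always present
    destination: Optional[str]   # may be empty, depending on the property
    value: Optional[str]         # may be empty, depending on the property


def parse_quad(line: str) -> Quad:
    """Split one dump line into its four columns (assumed tab-separated)."""
    parts = line.rstrip("\n").split("\t")
    parts += [""] * (4 - len(parts))  # tolerate missing trailing columns
    source, prop, destination, value = parts[:4]
    if not source or not prop:
        raise ValueError("source and property columns must both hold a value")
    return Quad(source, prop, destination or None, value or None)


def group_properties(
    lines: Iterable[str],
) -> Dict[Tuple[str, str], List[Tuple[Optional[str], Optional[str]]]]:
    """Collect every (destination, value) pair seen for each (mid, property)."""
    grouped = defaultdict(list)
    for line in lines:
        q = parse_quad(line)
        grouped[(q.source, q.prop)].append((q.destination, q.value))
    return grouped


# Invented sample rows: one name, and one property that repeats for the same mid.
SAMPLE = [
    "/m/0abc\t/type/object/name\t/lang/en\tExample",
    "/m/0abc\t/type/object/type\t/people/person\t",
    "/m/0abc\t/type/object/type\t/film/actor\t",
]

if __name__ == "__main__":
    for key, entries in group_properties(SAMPLE).items():
        print(key, entries)
```

A property that shows up more than once under the same mid, like `/type/object/type` in the sample, ends up with several entries in its list, which is the multivalued or 1:m case described above.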

Well, I can tell what's going on: they don't have a real graph layer. The subsegment bounds are hidden in the node values and have to be extracted before they can be used. Without a graph layer, they will have to adopt triplets or scrap the technology. You have to have a real graph layer or you are going to get lost.

Here are some general instructions. The node has to expose a node pointer so the graph machine can get at node graphs without opening up any strings. The predicate is split between the graph layer and the BSON layer, with two bytes available at the moment. The graph layer never opens the key value. So we have key, predicate, pointer. The general triplet is more useful to the graph layer than to humans, and engine designers have to think about machines talking to each other in triplets. Any compound object, such as a name/value pair, should be composed of triplet sequences; the predicates will indicate what's going on. Do not have two variable triplet formats; that really screws up the graph layer. Generally the graph layer does not look up symbols, but with the named schema that might change.
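A rough sketch of what that could look like follows, under several assumptions of mine: the triplets sit in a flat sequence, the pointer holds the index one past the end of the node's subsegment, and the two predicate bytes are split one per layer. The names `Triplet`, `children`, `GROUP` and `LEAF` are made up for illustration; only the BSON type codes 0x02 (string) and 0x03 (document) come from the BSON spec.

```python
from typing import Iterator, List, NamedTuple


class Triplet(NamedTuple):
    key: str        # opaque to the graph layer; only the BSON layer reads it
    graph_op: int   # predicate byte for the graph layer (assumed split)
    bson_type: int  # predicate byte for the BSON layer (assumed split)
    pointer: int    # index one past the end of this node's subsegment (assumed)


# Hypothetical graph-layer codes.
LEAF = 0
GROUP = 1

# BSON element type codes.
BSON_STRING = 0x02
BSON_DOCUMENT = 0x03

# A compound object written as one fixed-format triplet sequence:
# person { name: "Ada", city: "London" }
SEQUENCE: List[Triplet] = [
    Triplet("person", GROUP, BSON_DOCUMENT, 5),  # subsegment covers 0..4
    Triplet("name", GROUP, BSON_DOCUMENT, 3),    # subsegment covers 1..2
    Triplet("Ada", LEAF, BSON_STRING, 3),
    Triplet("city", GROUP, BSON_DOCUMENT, 5),    # subsegment covers 3..4
    Triplet("London", LEAF, BSON_STRING, 5),
]


def children(seq: List[Triplet], index: int) -> Iterator[int]:
    """Yield the direct children of a node using pointers only.

    The key is never read: each hop jumps over a whole child subsegment.
    """
    end = seq[index].pointer
    child = index + 1
    while child < end:
        yield child
        child = seq[child].pointer


if __name__ == "__main__":
    for i in children(SEQUENCE, 0):
        # Only this print, standing in for the BSON layer, opens the key.
        print(i, SEQUENCE[i].key)
```

The name/value pair is just two triplets of the same shape, and the walk in `children` gets from one pair to the next without touching a string, which is the point of exposing the pointer.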
