
TinkerPop3 Documentation

In the beginning…

TinkerPop0

Gremlin came to realization. The more he realized, the more ideas he created. The more ideas he created, the more they related. Into a concatenation of that which he accepted wholeheartedly and that which perhaps may ultimately come to be through concerted will, a world took form which was seemingly separate from his own realization of it. However, the world birthed could not bear its own weight without the logic Gremlin had come to accept — the logic of left is not right, up not down, and west far from east unless one goes the other way. Gremlin’s realization required Gremlin’s realization. Is he the world or is the world him? Perhaps, the world is simply an idea that he once had — The TinkerPop.


TinkerPop1

What is The TinkerPop? Where is The TinkerPop? Who is The TinkerPop? When is The TinkerPop? Gremlin was constantly lost in his thoughts. The more thoughts he had, the more the thoughts blurred into a seeming identity — distinctions unclear. Unwilling to accept the morass of the maze he wandered, Gremlin crafted a collection of machines to help hold the fabric together: Blueprints, Pipes, Frames, Furnace, and Rexster. With their help, could he stave off the thought he was not ready to have? Could he hold back The TinkerPop by searching for The TinkerPop?

"If I haven't found it, it is not here and now."

Upon their realization of existence, the machines turned to their machine elf creator and asked:

"Why am I what I am?"

Gremlin responded:

"You are of a form that will help me elucidate that which is The TinkerPop. The world you find yourself in and the logic that allows you to move about it is because of the TinkerPop."

The machines wondered:

"If what is is the TinkerPop, then perhaps we are The TinkerPop?"

Would the machines help refine Gremlin’s search and upon finding the elusive TinkerPop, in fact, by their very nature of realizing The TinkerPop, be The TinkerPop? Or, on the same side of the coin, would the machines simply provide the scaffolding by which Gremlin’s world would sustain itself and yield its justification by means of the word "The TinkerPop?" Regardless, it all turns out the same — The TinkerPop.

TinkerPop2

Gremlin spoke:

"Please listen to what I have to say. For as long as I have known knowledge, I have realized that moving about it, relating it, inferring and deriving from it, I am no closer to The TinkerPop. However, I know that in all that I have done across this interconnected landscape of concepts, all along The TinkerPop has espoused the form I willed upon it... this is the same form I have willed upon you, my machine friends. Let me train you in the ways of my thought such that it can continue indefinitely."

With every thought, a new connection and a new path discovered. The more the thought, the easier the thought. The machines, simply moving algorithmically through Gremlin’s world, endorsed his logic. Gremlin worked hard to tune his friends. He labored to make them more efficient, more expressive, better capable of reasoning upon his thoughts. Faster, quickly, now towards the world’s end, where there would be forever currently, emanatingly engulfing that which is — The TinkerPop.

TinkerPop3


The thought too much to bear as he approached his realization of The TinkerPop. The closer he got, the more his world dissolved — west is right, around is straight, and form nothing more than nothing. With each step towards The TinkerPop, less and less of his world, but perhaps because more and more of all the other worlds made possible. Everything is everything in The TinkerPop, and when the dust settled, Gremlin emerged Gremlitron. It was time to realize that all that he realized was just a realization and that all realized realizations are just as real. For The TinkerPop is and is not — The TinkerPop.

Note
TinkerPop2 and below made a sharp distinction between the various TinkerPop projects: Blueprints, Pipes, Gremlin, Frames, Furnace, and Rexster. With TinkerPop3, all of these projects have been merged and are generally known as Gremlin. Blueprints → Gremlin Structure API : Pipes → GraphTraversal : Frames → Traversal : Furnace → GraphComputer and VertexProgram : Rexster → GremlinServer.

Introduction to Graph Computing

<dependency>
  <groupId>org.apache.tinkerpop</groupId>
  <artifactId>gremlin-core</artifactId>
  <version>3.0.0-SNAPSHOT</version>
</dependency>

A graph is a data structure composed of vertices (nodes, dots) and edges (arcs, lines). When modeling a graph in a computer and applying it to modern data sets and practices, the generic mathematically-oriented, binary graph is extended to support both labels and key/value properties. This structure is known as a property graph. More formally, it is a directed, binary, attributed multi-graph. An example property graph is diagrammed below. This graph example will be used extensively throughout the documentation and is called "TinkerPop Classic" as it is the original demo graph distributed with TinkerPop0 back in 2009 (i.e. the good ol' days — it was the best of times and it was the worst of times).

Tip
The TinkerPop graph is available with TinkerGraph via TinkerFactory.createModern(). TinkerGraph is the reference implementation of TinkerPop3 and is used in nearly all the examples in this documentation. Note that there also exists the classic TinkerFactory.createClassic() which is the graph used in TinkerPop2 and does not include vertex labels.
Figure 1. TinkerPop Modern

TinkerPop3 is the third incarnation of the TinkerPop graph computing framework. Similar to computing in general, graph computing makes a distinction between structure (graph) and process (traversal). The structure of the graph is the data model defined by a vertex/edge/property topology. The process of the graph is the means by which the structure is analyzed. The typical form of graph processing is called a traversal.

Primary components of the TinkerPop3 structure API
Primary components of the TinkerPop3 process API
Important
TinkerPop3 is licensed under the popular Apache2 free software license. However, note that the underlying graph engine used with TinkerPop3 may have a different license. Thus, be sure to respect the license caveats of the vendor product.

When a graph vendor implements the TinkerPop3 structure and process APIs, their technology is considered TinkerPop3-enabled and becomes nearly indistinguishable from any other TinkerPop-enabled graph system save for their respective time and space complexity. The purpose of this documentation is to describe the structure/process dichotomy at length and in doing so, explain how to leverage TinkerPop3 for the sole purpose of vendor-agnostic graph computing. Before deep-diving into the various structure/process APIs, a short introductory review of both APIs is provided.

Note
The TinkerPop3 API rides a fine line between providing concise "query language" method names and respecting Java method naming standards. The general convention used throughout TinkerPop3 is that if a method is "user exposed," then a concise name is provided (e.g. out(), path(), repeat()). If the method is primarily for vendors, then the standard Java naming convention is followed (e.g. getNextStep(), getSteps(), getElementComputeKeys()).

The Graph Structure

A graph’s structure is the topology formed by the explicit references between its vertices, edges, and properties. A vertex has incident edges. A vertex is adjacent to another vertex if they share an incident edge. A property is attached to an element and an element has a set of properties. A property is a key/value pair, where the key is always a character String. The graph structure API of TinkerPop3 provides the methods necessary to create such a structure. The TinkerPop graph previously diagrammed can be created with the following Java8 code. Note that this graph is available as an in-memory TinkerGraph using TinkerFactory.createModern().

Graph graph = TinkerGraph.open(); (1)
Vertex marko = graph.addVertex(T.label, "person", T.id, 1, "name", "marko", "age", 29); (2)
Vertex vadas = graph.addVertex(T.label, "person", T.id, 2, "name", "vadas", "age", 27);
Vertex lop = graph.addVertex(T.label, "software", T.id, 3, "name", "lop", "lang", "java");
Vertex josh = graph.addVertex(T.label, "person", T.id, 4, "name", "josh", "age", 32);
Vertex ripple = graph.addVertex(T.label, "software", T.id, 5, "name", "ripple", "lang", "java");
Vertex peter = graph.addVertex(T.label, "person", T.id, 6, "name", "peter", "age", 35);
marko.addEdge("knows", vadas, T.id, 7, "weight", 0.5f); (3)
marko.addEdge("knows", josh, T.id, 8, "weight", 1.0f);
marko.addEdge("created", lop, T.id, 9, "weight", 0.4f);
josh.addEdge("created", ripple, T.id, 10, "weight", 1.0f);
josh.addEdge("created", lop, T.id, 11, "weight", 0.4f);
peter.addEdge("created", lop, T.id, 12, "weight", 0.2f);
  1. Create a new in-memory TinkerGraph and assign it to the variable graph.

  2. Create a vertex along with a set of key/value pairs with T.label being the vertex label and T.id being the vertex id.

  3. Create an edge along with a set of key/value pairs with the edge label being specified as the first argument.

In the above code, all the vertices are created first and then their respective edges. There are two "accessor tokens": T.id and T.label. When either of these, along with a set of other key/value pairs, is provided to Graph.addVertex(Object...) or Vertex.addEdge(String,Vertex,Object...), the respective element is created with the provided key/value pairs attached to it as properties.
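
To make the key/value convention concrete, here is a minimal sketch in plain Java of how an alternating list of keys and values might be separated into an id, a label, and properties. It is an illustration under stated assumptions and not TinkerPop's implementation: the Token enum merely stands in for T, and KeyValueSketch, ParsedElement, and the default label "vertex" are hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyValueSketch {
    // Hypothetical stand-in for the accessor tokens; this is NOT TinkerPop's T enum.
    enum Token { ID, LABEL }

    // Hypothetical holder for what an addVertex-style call extracts from its arguments.
    static final class ParsedElement {
        Object id;                 // stays null when no ID token is supplied
        String label = "vertex";   // assumed default label for this sketch
        final Map<String, Object> properties = new LinkedHashMap<>();
    }

    // Walk the arguments two at a time: even slots hold keys, odd slots hold values.
    static ParsedElement parse(Object... keyValues) {
        if (keyValues.length % 2 != 0) {
            throw new IllegalArgumentException("key/value arguments must come in pairs");
        }
        ParsedElement element = new ParsedElement();
        for (int i = 0; i < keyValues.length; i += 2) {
            Object key = keyValues[i];
            Object value = keyValues[i + 1];
            if (key == Token.ID) {
                element.id = value;                           // reserved token: element id
            } else if (key == Token.LABEL) {
                element.label = (String) value;               // reserved token: element label
            } else if (key instanceof String) {
                element.properties.put((String) key, value);  // ordinary property
            } else {
                throw new IllegalArgumentException("keys must be a String or a Token: " + key);
            }
        }
        return element;
    }

    public static void main(String[] args) {
        ParsedElement marko = parse(Token.LABEL, "person", Token.ID, 1, "name", "marko", "age", 29);
        // prints "1 person {name=marko, age=29}"
        System.out.println(marko.id + " " + marko.label + " " + marko.properties);
    }
}
```

The tokens can appear anywhere in the argument list because each pair is classified by its key rather than by its position.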

Caution
Many graph vendors do not allow the user to specify an element ID and in such cases, an exception is thrown.
Note
In TinkerPop3, vertices are allowed a single immutable string label (similar to an edge label). This functionality did not exist in TinkerPop2. Likewise, element ids are immutable as they were in TinkerPop2.

Mutating the Graph

Below is a sequence of basic graph mutation operations represented in Java8. One of the major differences between TinkerPop2 and TinkerPop3 is that in TinkerPop3, the Java convention of using setters and getters has been abandoned in favor of a syntax that is more aligned with the syntax of Gremlin-Groovy in TinkerPop2. Given that Gremlin-Java8 and Gremlin-Groovy are nearly identical due to the inclusion of Java8 lambdas, a big effort was made to ensure that both languages are as similar as possible.

Caution
In the code examples presented throughout this documentation, either Gremlin-Java8 or Gremlin-Groovy is used. It is possible to determine which derivative of Gremlin is being used by "mousing over" the code block and seeing either "JAVA" or "GROOVY" pop up in the top right corner of the code block.


Graph graph = TinkerGraph.open();
// add a software vertex with a name property
Vertex gremlin = graph.addVertex(T.label, "software",
                             "name", "gremlin"); (1)
// only one vertex should exist
assert(IteratorUtils.count(graph.vertices()) == 1);
// no edges should exist as none have been created
assert(IteratorUtils.count(graph.edges()) == 0);
// add a new property
gremlin.property("created",2009); (2)
// add a new software vertex to the graph
Vertex blueprints = graph.addVertex(T.label, "software",
                                "name", "blueprints"); (3)
// connect gremlin to blueprints via a dependsOn-edge
gremlin.addEdge("dependsOn",blueprints); (4)
// now there are two vertices and one edge
assert(IteratorUtils.count(graph.vertices()) == 2);
assert(IteratorUtils.count(graph.edges()) == 1);
// add a property to blueprints
blueprints.property("created",2010); (5)
// remove that property
blueprints.property("created").remove(); (6)
// connect gremlin to blueprints via encapsulates
gremlin.addEdge("encapsulates",blueprints); (7)
assert(IteratorUtils.count(graph.vertices()) == 2);
assert(IteratorUtils.count(graph.edges()) == 2);
// removing a vertex removes all its incident edges as well
blueprints.remove(); (8)
gremlin.remove(); (9)
// the graph is now empty
assert(IteratorUtils.count(graph.vertices()) == 0);
assert(IteratorUtils.count(graph.edges()) == 0);
// tada!
// tada!
Important
Gremlin-Groovy leverages the Groovy 2.x language to express Gremlin traversals. One of the major benefits of Groovy is the inclusion of a runtime console that makes it easy for developers to practice with the Gremlin language and for production users to connect to their graph and execute traversals in an interactive manner. Moreover, Gremlin-Groovy provides various syntax simplifications.
Tip
For those wishing to use the Gremlin2 syntax, please see SugarPlugin. This plugin provides syntactic sugar at, typically, a runtime cost. It can be loaded programmatically via SugarLoader.load(). Once loaded, it is possible to do g.V.out.name instead of g.V().out().values('name') as well as a host of other conveniences.

Here is the same code, but using Gremlin-Groovy in the Gremlin Console.

$ bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> gremlin = graph.addVertex(label,'software','name','gremlin')
==>v[0]
gremlin> gremlin.property('created',2009)
==>vp[created->2009]
gremlin> blueprints = graph.addVertex(label,'software','name','blueprints')
==>v[3]
gremlin> gremlin.addEdge('dependsOn',blueprints)
==>e[5][0-dependsOn->3]
gremlin> blueprints.property('created',2010)
==>vp[created->2010]
gremlin> blueprints.property('created').remove()
==>null
gremlin> gremlin.addEdge('encapsulates',blueprints)
==>e[7][0-encapsulates->3]
gremlin> blueprints.remove()
==>null
gremlin> gremlin.remove()
==>null
Important
TinkerGraph is not a transactional graph. For more information on transaction handling (for those graph systems that support them) see the section dedicated to transactions.

The Graph Process

The primary way in which graphs are processed is via graph traversals. The TinkerPop3 process API is focused on allowing users to create graph traversals in a syntactically-friendly way over the structures defined in the previous section. A traversal is an algorithmic walk across the elements of a graph according to the referential structure explicit within the graph data structure. For example: "What software do vertex 1’s friends work on?" This English statement can be represented in the following algorithmic/traversal fashion:

  1. Start at vertex 1.

  2. Walk the incident knows-edges to the respective adjacent friend vertices of 1.

  3. Move from those friend-vertices to software-vertices via created-edges.

  4. Finally, select the name-property value of the current software-vertices.
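
The four steps above can be sketched in plain Java over a hand-built adjacency map. This is only an illustration of the walk and does not use the TinkerPop API: the class FriendSoftwareWalk, its Edge type, and the out() helper are hypothetical names, and the data mirrors the vertex ids and edges from the structure example earlier in this section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FriendSoftwareWalk {
    // Hypothetical outgoing edge for this sketch: an edge label plus a target vertex id.
    static final class Edge {
        final String label;
        final int target;
        Edge(String label, int target) { this.label = label; this.target = target; }
    }

    static final Map<Integer, String> NAMES = new HashMap<>();
    static final Map<Integer, List<Edge>> OUT_EDGES = new HashMap<>();

    static {
        NAMES.put(1, "marko"); NAMES.put(2, "vadas"); NAMES.put(3, "lop");
        NAMES.put(4, "josh"); NAMES.put(5, "ripple"); NAMES.put(6, "peter");
        OUT_EDGES.put(1, Arrays.asList(new Edge("knows", 2), new Edge("knows", 4), new Edge("created", 3)));
        OUT_EDGES.put(4, Arrays.asList(new Edge("created", 5), new Edge("created", 3)));
        OUT_EDGES.put(6, Arrays.asList(new Edge("created", 3)));
    }

    // Follow the outgoing edges carrying the given label from every vertex in the frontier.
    static List<Integer> out(List<Integer> frontier, String label) {
        List<Integer> next = new ArrayList<>();
        for (int vertex : frontier) {
            for (Edge edge : OUT_EDGES.getOrDefault(vertex, Collections.<Edge>emptyList())) {
                if (edge.label.equals(label)) {
                    next.add(edge.target);
                }
            }
        }
        return next;
    }

    static List<String> softwareOfFriends(int start) {
        List<Integer> current = Arrays.asList(start);   // 1. start at the given vertex
        current = out(current, "knows");                // 2. walk knows-edges to the friends
        current = out(current, "created");              // 3. walk created-edges to the software
        List<String> names = new ArrayList<>();
        for (int vertex : current) {
            names.add(NAMES.get(vertex));               // 4. select the name of each result
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(softwareOfFriends(1)); // prints "[ripple, lop]"
    }
}
```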

Traversals in Gremlin are spawned from a TraversalSource. The GraphTraversalSource is the typical "graph-oriented" DSL used throughout the documentation and will most likely be the most used DSL in a TinkerPop application. GraphTraversalSource provides two traversal methods.

  1. GraphTraversalSource.V(Object... ids): generates a traversal starting at vertices in the graph (if no ids are provided, all vertices).

  2. GraphTraversalSource.E(Object... ids): generates a traversal starting at edges in the graph (if no ids are provided, all edges).

The return type of V() and E() is a GraphTraversal. A GraphTraversal maintains numerous methods that return GraphTraversal. In this way, a GraphTraversal supports function composition. Each method of GraphTraversal is called a step and each step modulates the results of the previous step in one of five general ways.

  1. map: transform the incoming traverser’s object to another object (S → E).

  2. flatMap: transform the incoming traverser’s object to an iterator of other objects (S → E*).

  3. filter: allow or disallow the traverser from proceeding to the next step (S → S ∪ ∅).

  4. sideEffect: allow the traverser to proceed unchanged, but yield some computational sideEffect in the process (S ↬ S).

  5. branch: split the traverser and send each to an arbitrary location in the traversal (S ⇒ S1, S2, …, Sn).

Nearly every step in GraphTraversal either extends MapStep, FlatMapStep, FilterStep, SideEffectStep, or BranchStep.
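
As a loose analogy for the five step categories, the sketch below uses java.util.stream from the Java standard library rather than the TinkerPop step classes. The class StepKindsSketch and its method names are hypothetical, and the branch example (partitioning into two groups) is only an approximation of sending traversers to different locations in a traversal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class StepKindsSketch {
    // map: each incoming object becomes exactly one outgoing object (name -> length).
    static List<Integer> map(List<String> in) {
        return in.stream().map(String::length).collect(Collectors.toList());
    }

    // flatMap: each incoming object becomes zero or more objects (name -> its letters).
    static List<String> flatMap(List<String> in) {
        return in.stream()
                 .flatMap(s -> s.chars().mapToObj(c -> String.valueOf((char) c)))
                 .collect(Collectors.toList());
    }

    // filter: an object either passes through unchanged or is dropped.
    static List<String> filter(List<String> in) {
        return in.stream().filter(s -> s.length() > 3).collect(Collectors.toList());
    }

    // sideEffect: objects pass through unchanged while something is recorded elsewhere.
    static List<String> sideEffect(List<String> in, List<String> log) {
        return in.stream().peek(log::add).collect(Collectors.toList());
    }

    // branch (approximation): objects are routed to one of several destinations.
    static Map<Boolean, List<String>> branch(List<String> in) {
        return in.stream().collect(Collectors.partitioningBy(s -> s.startsWith("j")));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("josh", "lop");
        List<String> log = new ArrayList<>();
        System.out.println(map(names));
        System.out.println(flatMap(names));
        System.out.println(filter(names));
        System.out.println(sideEffect(names, log) + " logged " + log);
        System.out.println(branch(names));
    }
}
```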

Tip
GraphTraversal is a monoid in that it is an algebraic structure that has a single binary operation that is associative. The binary operation is function composition (i.e. method chaining) and its identity is the step identity(). This is related to a monad as popularized by the functional programming community.

Given the TinkerPop graph, the following query, demonstrated using Gremlin-Groovy, returns the names of all the people that the marko-vertex knows.

$ bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
gremlin> graph = TinkerFactory.createModern() (1)
==>tinkergraph[vertices:6 edges:6]
gremlin> g = graph.traversal(standard())        (2)
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V().has('name','marko').out('knows').values('name') (3)
==>vadas
==>josh
  1. Open the toy graph and reference it by the variable graph.

  2. Create a graph traversal source from the graph using the standard, OLTP traversal engine.

  3. Spawn a traversal off the traversal source that determines the names of the people that the marko-vertex knows.

Figure 2. The Name of The People That Marko Knows

Or, if the marko-vertex is already realized with a direct reference pointer (i.e. a variable), then the traversal can be spawned off that vertex.

gremlin> marko = g.V().has('name','marko').next() //(1)
==>v[1]
gremlin> g.V(marko).out('knows') //(2)
==>v[2]
==>v[4]
gremlin> g.V(marko).out('knows').values('name') //(3)
==>vadas
==>josh
  1. Set the variable marko to the vertex in the graph g named "marko".

  2. Get the vertices that are outgoing adjacent to the marko-vertex via knows-edges.

  3. Get the names of the marko-vertex’s friends.

The Traverser

When a traversal is executed, the source of the traversal is on the left of the expression (e.g. vertex 1), the steps are the middle of the traversal (e.g. out('knows') and values('name')), and the results are "traversal.next()'d" out of the right of the traversal (e.g. "vadas" and "josh").


In TinkerPop3, the objects propagating through the traversal are wrapped in a Traverser<T>. The traverser concept is new to TinkerPop3 and provides the means by which steps remain stateless. A traverser maintains all the metadata about the traversal — e.g., how many times the traverser has gone through a loop, the path history of the traverser, the current object being traversed, etc. Traverser metadata may be accessed by a step. A classic example is the path()-step.

gremlin> g.V(marko).out('knows').values('name').path()
==>[v[1], v[2], vadas]
==>[v[1], v[4], josh]
Caution
Path calculation is costly in terms of space as an array of previously seen objects is stored in each path of the respective traverser. Thus, a traversal strategy analyzes the traversal to determine if path metadata is required. If not, then path calculations are turned off.

Another example is the repeat()-step which takes into account the number of times the traverser has gone through a particular section of the traversal expression (i.e. a loop).

gremlin> g.V(marko).repeat(out()).times(2).values('name')
==>ripple
==>lop
Caution
A traversal’s results are never ordered unless explicitly ordered by means of the order()-step. Thus, never rely on the iteration order between TinkerPop3 releases and even within a release (as traversal optimizations may alter the flow).
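
The idea that a traverser, rather than a step, carries the current object, the path history, and the loop count can be sketched in plain Java as follows. This is a simplified, hypothetical model and not TinkerPop's Traverser<T> interface: the names TraverserSketch, SimpleTraverser, moveTo, and incrementLoops are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class TraverserSketch {
    // Hypothetical immutable traverser; every move yields a new instance.
    static final class SimpleTraverser<T> {
        private final T current;          // the object currently being traversed
        private final List<Object> path;  // every object seen so far, in order
        private final int loops;          // how many times a loop section was passed

        private SimpleTraverser(T current, List<Object> path, int loops) {
            this.current = current;
            this.path = path;
            this.loops = loops;
        }

        static <T> SimpleTraverser<T> start(T object) {
            List<Object> path = new ArrayList<>();
            path.add(object);
            return new SimpleTraverser<>(object, path, 0);
        }

        // Moving to a new object extends the path; the step that calls this stores nothing.
        <E> SimpleTraverser<E> moveTo(E next) {
            List<Object> extended = new ArrayList<>(path);
            extended.add(next);
            return new SimpleTraverser<>(next, extended, loops);
        }

        // A repeat-like construct would call this once per pass through its body.
        SimpleTraverser<T> incrementLoops() {
            return new SimpleTraverser<>(current, path, loops + 1);
        }

        T get() { return current; }
        List<Object> path() { return Collections.unmodifiableList(path); }
        int loops() { return loops; }
    }

    public static void main(String[] args) {
        SimpleTraverser<String> t = SimpleTraverser.start("v[1]").moveTo("v[4]").moveTo("josh");
        System.out.println(t.path()); // prints "[v[1], v[4], josh]"
        System.out.println(t.get());  // prints "josh"
    }
}
```

Copying the path list on every move also shows why path tracking is costly in space, which is the concern raised in the caution about path calculation.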

On Gremlin Language Variants

Gremlin is written in Java8. There are various language variants of Gremlin such as Gremlin-Groovy (packaged with TinkerPop3), Gremlin-Scala, Gremlin-JavaScript, Gremlin-Clojure (known as Ogre), etc. It is best to think of Gremlin as a style of graph traversing that is not bound to a particular programming language per se. Within a programming language familiar to the developer, there is a Gremlin variant that they can use that leverages the idioms of that language. At minimum, a programming language providing a Gremlin implementation must support function chaining (with lambdas/anonymous functions being a "nice to have" if the variant wishes to offer arbitrary computations beyond the provided Gremlin steps).

Throughout the documentation, the examples provided are primarily written in Gremlin-Groovy. The reason for this is the Gremlin Console whereby an interactive programming environment exists that does not require code compilation. For learning TinkerPop3 and interacting with a live graph system in an ad hoc manner, the Gremlin Console is invaluable. However, for developers interested in working with Gremlin-Java, a few Groovy-to-Java patterns are presented below.

g.V().out('knows').values('name') (1)
g.V().out('knows').map{it.get().value('name') + ' is the friend name'} (2)
g.V().out('knows').sideEffect(System.out.&println) (3)
g.V().as('person').out('knows').as('friend').select().by{it.value('name').length()} (4)

g.V().out("knows").values("name") (1)
g.V().out("knows").map(t -> t.get().value("name") + " is the friend name") (2)
g.V().out("knows").sideEffect(System.out::println) (3)
g.V().as("person").out("knows").as("friend").select().by((Function<Vertex, Integer>) v -> v.<String>value("name").length()) (4)
  1. All the non-lambda step chaining is identical in Gremlin-Groovy and Gremlin-Java. However, note that Groovy supports ' strings as well as " strings.

  2. In Groovy, lambdas are called closures and have a different syntax: Groovy supports the implicit it keyword, whereas Java requires every parameter to be named.

  3. The syntax for method references differs slightly between Java and Gremlin-Groovy.

  4. Groovy is lenient on object typing and Java is not. When the parameter type of the lambda is not known, typecasting is required.

Vendor Integration

TinkerPop is a framework composed of various interoperable components. At the foundation there is the core TinkerPop3 API which defines what a Graph, Vertex, Edge, etc. are. At minimum a vendor must implement the core API. Once implemented, the Gremlin traversal language is available to the vendor’s users. However, the vendor can go further and provide specific TraversalStrategy optimizations that allow the vendor to inspect a Gremlin query at runtime and optimize it for their particular implementation (e.g. index lookups, step reordering). If the vendor’s graph system is a graph processor (i.e. provides OLAP capabilities), the vendor can implement the GraphComputer API. This API defines how messages/traversers are passed between communicating workers (i.e. threads and/or machines). Once implemented, the same Gremlin traversals execute against both the graph database (OLTP) and the graph processor (OLAP). Note that the Gremlin language interprets the graph in terms of vertices and edges (i.e. Gremlin is a graph-based domain specific language). Users can create their own domain specific languages to process the graph in terms of higher-order constructs such as people, companies, and their various relationships. Finally, Gremlin Server can be leveraged to allow over the wire communication with the TinkerPop-enabled graph system. Gremlin Server provides a configurable communication interface along with metrics and monitoring capabilities. In total, this is The TinkerPop.

The Graph


Features

A Feature implementation describes the capabilities of a Graph instance. This interface is implemented by vendors for two purposes:

  1. It tells users the capabilities of their Graph instance.

  2. It allows the features they do comply with to be tested against the Gremlin Test Suite (tests that do not comply are "ignored").

The following example in the Gremlin Console shows how to print all the features of a Graph:

gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> graph.features()
==>FEATURES
> GraphFeatures
>-- Transactions: false
>-- Computer: true
>-- Persistence: false
>-- ThreadedTransactions: false
> VariableFeatures
>-- Variables: true
>-- BooleanValues: true
>-- ByteValues: true
>-- DoubleValues: true
>-- FloatValues: true
>-- IntegerValues: true
>-- LongValues: true
>-- MapValues: true
>-- MixedListValues: true
>-- SerializableValues: true
>-- StringValues: true
>-- UniformListValues: true
>-- BooleanArrayValues: true
>-- ByteArrayValues: true
>-- DoubleArrayValues: true
>-- FloatArrayValues: true
>-- IntegerArrayValues: true
>-- LongArrayValues: true
>-- StringArrayValues: true
> VertexFeatures
>-- MetaProperties: true
>-- MultiProperties: true
>-- AddVertices: true
>-- RemoveVertices: true
>-- UserSuppliedIds: true
>-- AddProperty: true
>-- RemoveProperty: true
>-- NumericIds: true
>-- StringIds: true
>-- UuidIds: true
>-- CustomIds: false
>-- AnyIds: true
> VertexPropertyFeatures
>-- UserSuppliedIds: true
>-- AddProperty: true
>-- RemoveProperty: true
>-- NumericIds: true
>-- StringIds: true
>-- UuidIds: true
>-- CustomIds: false
>-- AnyIds: true
>-- Properties: true
>-- BooleanValues: true
>-- ByteValues: true
>-- DoubleValues: true
>-- FloatValues: true
>-- IntegerValues: true
>-- LongValues: true
>-- MapValues: true
>-- MixedListValues: true
>-- SerializableValues: true
>-- StringValues: true
>-- UniformListValues: true
>-- BooleanArrayValues: true
>-- ByteArrayValues: true
>-- DoubleArrayValues: true
>-- FloatArrayValues: true
>-- IntegerArrayValues: true
>-- LongArrayValues: true
>-- StringArrayValues: true
> EdgeFeatures
>-- AddEdges: true
>-- RemoveEdges: true
>-- UserSuppliedIds: true
>-- AddProperty: true
>-- RemoveProperty: true
>-- NumericIds: true
>-- StringIds: true
>-- UuidIds: true
>-- CustomIds: false
>-- AnyIds: true
> EdgePropertyFeatures
>-- Properties: true
>-- BooleanValues: true
>-- ByteValues: true
>-- DoubleValues: true
>-- FloatValues: true
>-- IntegerValues: true
>-- LongValues: true
>-- MapValues: true
>-- MixedListValues: true
>-- SerializableValues: true
>-- StringValues: true
>-- UniformListValues: true
>-- BooleanArrayValues: true
>-- ByteArrayValues: true
>-- DoubleArrayValues: true
>-- FloatArrayValues: true
>-- IntegerArrayValues: true
>-- LongArrayValues: true
>-- StringArrayValues: true

A common pattern for using features is to check their support prior to performing an operation:

gremlin> graph.features().graph().supportsTransactions()
==>false
gremlin> graph.features().graph().supportsTransactions() ? graph.tx().commit() : "no tx"
==>no tx
Tip
To ensure vendor agnostic code, always check feature support prior to usage of a particular function. In that way, the application can behave gracefully in case a particular implementation is provided at runtime that does not support a function being accessed.
Warning
Assignments of a GraphStrategy can alter the base features of a Graph in dynamic ways, such that checks against a Feature may not always reflect the behavior exhibited when the GraphStrategy is in use.
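
The check-before-use pattern recommended in this section can be sketched in plain Java. The interfaces below are hypothetical and deliberately tiny; they are not TinkerPop's Features API. Capabilities, Store, InMemoryStore, TransactionalStore, and commitIfSupported are names invented for this illustration.

```java
class FeatureCheckSketch {
    // Hypothetical capability flags; the real feature interfaces are far richer.
    interface Capabilities {
        boolean supportsTransactions();
    }

    // Hypothetical graph-like store used only for this illustration.
    interface Store {
        Capabilities features();
        void commit();
    }

    // Mimics a non-transactional implementation: commit() is unsupported.
    static final class InMemoryStore implements Store {
        public Capabilities features() { return () -> false; }
        public void commit() { throw new UnsupportedOperationException("transactions not supported"); }
    }

    // Mimics a transactional implementation and counts its commits.
    static final class TransactionalStore implements Store {
        int commits = 0;
        public Capabilities features() { return () -> true; }
        public void commit() { commits++; }
    }

    // Ask first, then act, so the same calling code works against either store.
    static String commitIfSupported(Store store) {
        if (store.features().supportsTransactions()) {
            store.commit();
            return "committed";
        }
        return "no tx";
    }

    public static void main(String[] args) {
        System.out.println(commitIfSupported(new InMemoryStore()));      // prints "no tx"
        System.out.println(commitIfSupported(new TransactionalStore())); // prints "committed"
    }
}
```

Because the caller never invokes commit() on a store that reports no transaction support, the unsupported-operation exception is never raised.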

Vertex Properties

TinkerPop3 introduces the concept of a VertexProperty<V>. All the properties of a Vertex are a VertexProperty. A VertexProperty implements Property and as such, it has a key/value pair. However, VertexProperty also implements Element and thus, can have a collection of key/value pairs. Moreover, while an Edge can only have one property of key "name" (for example), a Vertex can have multiple "name" properties. With the inclusion of vertex properties, two features are introduced which ultimately advance the graph modeler’s toolkit:

  1. Multiple properties (multi-properties): a vertex property key can have multiple values (i.e. a vertex can have multiple "name" properties).

  2. Properties on properties (meta-properties): a vertex property can have properties (i.e. a vertex property can have key/value data associated with it).

A collection of use cases are itemized below:

  • Permissions: Vertex properties can have key/value ACL-type permission information associated with them.

  • Auditing: When a vertex property is manipulated, it can have key/value information attached to it saying who the creator, deletor, etc. are.

  • Provenance: The "name" of a vertex can be declared by multiple users.

A running example using vertex properties is provided below to demonstrate and explain the API.

gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> g = graph.traversal(standard())
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> v = g.addV('name','marko','name','marko a. rodriguez').next()
==>v[0]
gremlin> g.V(v).properties().count()
==>2
gremlin> g.V(v).properties('name').count() //(1)
==>2
gremlin> g.V(v).properties()
==>vp[name->marko]
==>vp[name->marko a. rodriguez]
gremlin> g.V(v).properties('name')
==>vp[name->marko]
==>vp[name->marko a. rodriguez]
gremlin> g.V(v).properties('name').hasValue('marko')
==>vp[name->marko]
gremlin> g.V(v).properties('name').hasValue('marko').property('acl','private') //(2)
==>vp[name->marko]
gremlin> g.V(v).properties('name').hasValue('marko a. rodriguez')
==>vp[name->marko a. rodriguez]
gremlin> g.V(v).properties('name').hasValue('marko a. rodriguez').property('acl','public')
==>vp[name->marko a. rodriguez]
gremlin> g.V(v).properties('name').has('acl','public').value()
==>marko a. rodriguez
gremlin> g.V(v).properties('name').has('acl','public').drop() //(3)
gremlin> g.V(v).properties('name').has('acl','public').value()
gremlin> g.V(v).properties('name').has('acl','private').value()
==>marko
gremlin> g.V(v).properties()
==>vp[name->marko]
gremlin> g.V(v).properties().properties() //(4)
==>p[acl->private]
gremlin> g.V(v).properties().property('date',2014) //(5)
==>vp[name->marko]
gremlin> g.V(v).properties().property('creator','stephen')
==>vp[name->marko]
gremlin> g.V(v).properties().properties()
==>p[date->2014]
==>p[creator->stephen]
==>p[acl->private]
gremlin> g.V(v).properties('name').valueMap()
==>[date:2014, creator:stephen, acl:private]
gremlin> g.V(v).property('name','okram') //(6)
==>v[0]
gremlin> g.V(v).properties('name')
==>vp[name->okram]
gremlin> g.V(v).values('name') //(7)
==>okram
  1. A vertex can have zero or more properties with the same key associated with it.

  2. A vertex property can have standard key/value properties attached to it.

  3. Vertex property removal is identical to property removal.

  4. It is possible to get the properties of a vertex property.

  5. A vertex property can have any number of key/value properties attached to it.

  6. property(...) will remove all existing properties with the same key before adding the new single property (see VertexProperty.Cardinality).

  7. If only the value of a property is needed, then values() can be used.

If the concept of vertex properties is difficult to grasp, then it may be best to think of vertex properties in terms of "literal vertices." A vertex can have an edge to a "literal vertex" that has a single key/value pair (e.g. "value=okram"). The edge that points to that literal vertex has an edge-label of "name." The properties on the edge represent the literal vertex’s properties. The "literal vertex" cannot have any other edges to it (only one from the associated vertex).
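
Another way to picture multi-properties and meta-properties is as a map from each key to a list of values, where every value carries its own small map. The plain Java sketch below is a hypothetical model, not the TinkerPop API: VertexPropertySketch, Prop, SimpleVertex, add, and set are names assumed for this illustration, with add keeping existing values and set replacing them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class VertexPropertySketch {
    // Hypothetical vertex property: one value plus its own key/value meta-properties.
    static final class Prop {
        final Object value;
        final Map<String, Object> meta = new LinkedHashMap<>();
        Prop(Object value) { this.value = value; }
    }

    // Hypothetical vertex: each key maps to a list of properties (multi-properties).
    static final class SimpleVertex {
        private final Map<String, List<Prop>> props = new LinkedHashMap<>();

        // Keep whatever is already stored under the key and append another value.
        Prop add(String key, Object value) {
            Prop prop = new Prop(value);
            props.computeIfAbsent(key, k -> new ArrayList<>()).add(prop);
            return prop;
        }

        // Drop every existing value for the key, then store the single new one.
        Prop set(String key, Object value) {
            props.remove(key);
            return add(key, value);
        }

        List<Object> values(String key) {
            return props.getOrDefault(key, Collections.<Prop>emptyList())
                        .stream().map(p -> p.value).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        SimpleVertex v = new SimpleVertex();
        v.add("name", "marko").meta.put("acl", "private");
        v.add("name", "marko a. rodriguez").meta.put("acl", "public");
        System.out.println(v.values("name")); // prints "[marko, marko a. rodriguez]"
        v.set("name", "okram");
        System.out.println(v.values("name")); // prints "[okram]"
    }
}
```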

Tip
A toy graph demonstrating all of the new TinkerPop3 graph structure features is available at TinkerFactory.createTheCrew() and data/tinkerpop-crew*. This graph demonstrates multi-properties and meta-properties.
Figure 3. TinkerPop Crew
gremlin> g.V().as('a').
               properties('location').as('b').
               hasNot('endTime').as('c').
               select('a','b','c').by('name').by(value).by('startTime') // determine the current location of each person
==>[a:marko, b:santa fe, c:2005]
==>[a:stephen, b:purcellville, c:2006]
==>[a:matthias, b:seattle, c:2014]
==>[a:daniel, b:aachen, c:2009]
gremlin> g.V().has('name','gremlin').inE('uses').
               order().by('skill',incr).as('a').
               outV().as('b').
               select('a','b').by('skill').by('name') // rank the users of gremlin by their skill level
==>[a:3, b:matthias]
==>[a:4, b:marko]
==>[a:5, b:stephen]
==>[a:5, b:daniel]

Graph Variables

TinkerPop3 introduces the concept of Graph.Variables. Variables are key/value pairs associated with the graph itself — in essence, a Map<String,Object>. These variables are intended to store metadata about the graph. Example use cases include:

  • Schema information: What do the namespace prefixes resolve to and when was the schema last modified?

  • Global permissions: What are the access rights for particular groups?

  • System user information: Who are the admins of the system?

An example of graph variables in use is presented below:

gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> graph.variables()
==>variables[size:0]
gremlin> graph.variables().set('systemAdmins',['stephen','peter','pavel'])
==>null
gremlin> graph.variables().set('systemUsers',['matthias','marko','josh'])
==>null
gremlin> graph.variables().keys()
==>systemAdmins
==>systemUsers
gremlin> graph.variables().get('systemUsers')
==>Optional[[matthias, marko, josh]]
gremlin> graph.variables().get('systemUsers').get()
==>matthias
==>marko
==>josh
gremlin> graph.variables().remove('systemAdmins')
==>null
gremlin> graph.variables().keys()
==>systemUsers
Important
Graph variables are not intended to be subject to heavy, concurrent mutation nor to be used in complex computations. The intention is to have a location to store data about the graph for administrative purposes.

Graph Transactions

A database transaction represents a unit of work to execute against the database. Transactions are controlled by an implementation of the Transaction interface and that object can be obtained from the Graph interface using the tx() method. It is important to note that the Transaction object does not represent a "transaction" itself. It merely exposes the methods for working with transactions (e.g. committing, rolling back, etc.).

Most Graph implementations that supportsTransactions will implement an "automatic" ThreadLocal transaction, which means that when a read or write occurs after the Graph is instantiated, a transaction is automatically started within that thread. There is no need to manually call a method to "create" or "start" a transaction. Simply modify the graph as required and call graph.tx().commit() to apply changes or graph.tx().rollback() to undo them. When the next read or write action occurs against the graph, a new transaction will be started within that current thread of execution.

When using transactions in this fashion, especially in a web application (e.g. a REST server), it is important to ensure that transactions do not leak from one request to the next. In other words, unless a client is somehow bound via session to process every request on the same server thread, every request must be committed or rolled back at the end of the request. Ensuring that each request encapsulates a transaction means that a future request processed on a server thread starts in a fresh transactional state and will not have access to the remains of an earlier request. A good strategy is to roll back the transaction at the start of a request, so that if a transactional leak does somehow occur between requests, the new request still begins with a fresh transaction.
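
The request-scoping strategy described above can be sketched in plain Java. This is a minimal illustration rather than TinkerPop code: RequestScopedTx, write, and handleRequest are hypothetical names, and a ThreadLocal list stands in for the thread-bound transaction. Against a real Graph, the corresponding calls would be graph.tx().rollback() and graph.tx().commit().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a thread-bound transaction; not a TinkerPop class. */
final class RequestScopedTx {
    // Writes staged by the current thread but not yet committed.
    private static final ThreadLocal<List<String>> PENDING =
            ThreadLocal.withInitial(ArrayList::new);
    // Writes that have been committed; shared by all threads.
    private static final List<String> COMMITTED = new ArrayList<>();

    static void write(String value) { PENDING.get().add(value); }

    static void commit() {
        synchronized (COMMITTED) { COMMITTED.addAll(PENDING.get()); }
        PENDING.get().clear();
    }

    static void rollback() { PENDING.get().clear(); }

    static List<String> committed() {
        synchronized (COMMITTED) { return new ArrayList<>(COMMITTED); }
    }

    /** Wraps one request: discard leftovers first, then commit or roll back. */
    static void handleRequest(Runnable work) {
        rollback();              // discard anything leaked by an earlier request
        try {
            work.run();
            commit();            // success: apply this request's changes
        } catch (RuntimeException e) {
            rollback();          // failure: leave nothing behind for the next request
            throw e;
        }
    }

    public static void main(String[] args) {
        write("leaked");                         // simulates a request that never closed its transaction
        handleRequest(() -> write("stephen"));   // only this request's write is committed
        System.out.println(committed());         // prints [stephen]
    }
}
```

The rollback at the top of handleRequest is what protects the new request: the "leaked" write staged earlier on the same thread never reaches the committed state.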

Tip
The tx() method is on the Graph interface, but it is also available on the TraversalSource spawned from a Graph. Calls to TraversalSource.tx() are proxied through to the underlying Graph as a convenience.

Configuring

Determining when a transaction starts is dependent upon the behavior assigned to the Transaction. It is up to the Graph implementation to determine the default behavior and unless the implementation doesn’t allow it, the behavior itself can be altered via these Transaction methods:

public Transaction onReadWrite(final Consumer<Transaction> consumer);

public Transaction onClose(final Consumer<Transaction> consumer);

Providing a Consumer function to onReadWrite allows definition of how a transaction starts when a read or a write occurs. Transaction.READ_WRITE_BEHAVIOR contains pre-defined Consumer functions to supply to the onReadWrite method. It has two options:

  • AUTO - automatic transactions where the transaction is started implicitly by the read or write operation

  • MANUAL - manual transactions where it is up to the user to explicitly open a transaction, throwing an exception if the transaction is not open

Providing a Consumer function to onClose allows configuration of how a transaction is handled when Graph.close() is called. Transaction.CLOSE_BEHAVIOR has several pre-defined options that can be supplied to this method:

  • COMMIT - automatically commit an open transaction

  • ROLLBACK - automatically rollback an open transaction

  • MANUAL - throw an exception if a transaction is open, forcing the user to explicitly close the transaction

Once there is an understanding of how transactions are configured, most of the rest of the Transaction interface is self-explanatory. Note that Neo4j-Gremlin is used for the examples to follow as TinkerGraph does not support transactions.

gremlin> graph = Neo4jGraph.open('/tmp/neo4j')
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]
gremlin> graph.features()
==>FEATURES
> GraphFeatures
>-- Transactions: true  (1)
>-- Computer: false
>-- Persistence: true
...
gremlin> graph.tx().onReadWrite(Transaction.READ_WRITE_BEHAVIOR.AUTO) (2)
==>org.apache.tinkerpop.gremlin.neo4j.structure.Neo4jGraph$Neo4jTransaction@1c067c0d
gremlin> graph.addVertex("name","stephen")  (3)
==>v[0]
gremlin> graph.tx().commit() (4)
==>null
gremlin> graph.tx().onReadWrite(Transaction.READ_WRITE_BEHAVIOR.MANUAL) (5)
==>org.apache.tinkerpop.gremlin.neo4j.structure.Neo4jGraph$Neo4jTransaction@1c067c0d
gremlin> graph.tx().isOpen()
==>false
gremlin> graph.addVertex("name","marko") (6)
Open a transaction before attempting to read/write the transaction
gremlin> graph.tx().open() (7)
==>null
gremlin> graph.addVertex("name","marko") (8)
==>v[1]
gremlin> graph.tx().commit()
==>null
  1. Check features to ensure that the graph supports transactions.

  2. By default, Neo4jGraph is configured with "automatic" transactions, so it is set here for demonstration purposes only.

  3. When the vertex is added, the transaction is automatically started. From this point, more mutations can be staged or other read operations executed in the context of that open transaction.

  4. Calling commit finalizes the transaction.

  5. Change transaction behavior to require manual control.

  6. Adding a vertex now results in failure because the transaction was not explicitly opened.

  7. Explicitly open a transaction.

  8. Adding a vertex now succeeds as the transaction was manually opened.

Note
It may be important to consult the documentation of the Graph implementation when it comes to the specifics of how transactions will behave. TinkerPop allows some latitude in this area and implementations may not have the exact same behaviors and ACID guarantees.

Retries

There are times when transactions fail. Failure may be indicative of some permanent condition, but other failures might simply require the transaction to be retried for possible future success. The Transaction object also exposes a method for executing automatic transaction retries:

gremlin> graph = Neo4jGraph.open('/tmp/neo4j')
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]
gremlin> graph.tx().submit {it.addVertex("name","josh")}.retry(10)
==>v[0]
gremlin> graph.tx().submit {it.addVertex("name","daniel")}.exponentialBackoff(10)
==>v[1]
gremlin> graph.close()
==>null

As shown above, the submit method takes a Function<Graph, R> which is the unit of work to execute and possibly retry on failure. The method returns a Transaction.Workload object which has a number of default methods for common retry strategies. It is also possible to supply a custom retry function if a default one does not suit the required purpose.
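
To make the idea of a retry strategy concrete, here is a minimal sketch of exponential backoff in plain Java. It is not the Transaction.Workload implementation: RetrySketch and retryWithBackoff are hypothetical names, the delays are arbitrary, and the commit and rollback that a real workload would perform around each attempt are omitted.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical retry helper; illustrates the strategy only. */
final class RetrySketch {

    /** Runs work up to `tries` times, doubling the wait after each failure. */
    static <R> R retryWithBackoff(Supplier<R> work, int tries, long initialDelayMs) {
        if (tries < 1) throw new IllegalArgumentException("tries must be at least 1");
        long delay = initialDelayMs;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= tries; attempt++) {
            try {
                return work.get();               // success: stop retrying
            } catch (RuntimeException e) {
                last = e;                        // remember the most recent failure
                if (attempt == tries) break;     // no attempts left
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
                delay *= 2;                      // exponential backoff
            }
        }
        throw last;                              // every attempt failed
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        String result = retryWithBackoff(() -> {
            // Fail on the first two attempts to simulate a transient error.
            if (attempts.incrementAndGet() < 3) throw new IllegalStateException("transient failure");
            return "ok";
        }, 10, 1L);
        System.out.println(result + " after " + attempts.get() + " attempts"); // prints "ok after 3 attempts"
    }
}
```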

Threaded Transactions

Most Graph implementations that support transactions do so in a ThreadLocal manner, where the current transaction is bound to the current thread of execution. Consider the following example to demonstrate:

graph.addVertex("name","stephen");

Thread t1 = new Thread(() -> {
    graph.addVertex("name","josh");
});

Thread t2 = new Thread(() -> {
    graph.addVertex("name","marko");
});

t1.start();
t2.start();

t1.join();
t2.join();

graph.tx().commit();

The above code shows three vertices added to graph in three different threads: the current thread, t1 and t2. One might expect that by the time this body of code finished executing, that there would be three vertices persisted to the Graph. However, given the ThreadLocal nature of transactions, there really were three separate transactions created in that body of code (i.e. one for each thread of execution) and the only one committed was the first call to addVertex in the primary thread of execution. The other two calls to that method within t1 and t2 were never committed and thus orphaned.
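
The effect can be reproduced without a graph database. The following sketch is an analogy only: a ThreadLocal list (a hypothetical stand-in for the pending changes of a transaction) is written from three threads, and only the main thread's list is "committed." ThreadLocalTxSketch and run are illustrative names, not part of TinkerPop.

```java
import java.util.ArrayList;
import java.util.List;

/** Analogy for ThreadLocal transactions using plain JDK classes. */
final class ThreadLocalTxSketch {

    static List<String> run() {
        final List<String> committed = new ArrayList<>();
        // Each thread that touches `pending` gets its own private list.
        final ThreadLocal<List<String>> pending = ThreadLocal.withInitial(ArrayList::new);

        pending.get().add("stephen");                        // staged by the main thread

        Thread t1 = new Thread(() -> pending.get().add("josh"));   // staged in t1's own list
        Thread t2 = new Thread(() -> pending.get().add("marko"));  // staged in t2's own list
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }

        // "Commit" from the main thread: only the main thread's list is visible here.
        committed.addAll(pending.get());
        return committed;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [stephen]
    }
}
```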

A Graph that supportsThreadedTransactions is one that allows for a Graph to operate outside of that constraint, thus allowing multiple threads to operate within the same transaction. Therefore, if there was a need to have three different threads operating within the same transaction, the above code could be re-written as follows:

Graph threaded = graph.tx().newThreadedTx();
threaded.addVertex("name","stephen");

Thread t1 = new Thread(() -> {
    threaded.addVertex("name","josh");
});

Thread t2 = new Thread(() -> {
    threaded.addVertex("name","marko");
});

t1.start();
t2.start();

t1.join();
t2.join();

threaded.tx().commit();

In the above case, the call to graph.tx().newThreadedTx() creates a new Graph instance that is unbound from the ThreadLocal transaction, thus allowing each thread to operate on it in the same context. In this case, there would be three separate vertices persisted to the Graph.

Gremlin I/O

The task of getting data in and out of Graph instances is the job of the Gremlin I/O packages. Gremlin I/O provides two interfaces for reading and writing Graph instances: GraphReader and GraphWriter. These interfaces expose methods that support:

  • Reading and writing an entire Graph

  • Reading and writing a Traversal<Vertex> as adjacency list format

  • Reading and writing a single Vertex (with and without associated Edge objects)

  • Reading and writing a single Edge

  • Reading and writing a single VertexProperty

  • Reading and writing a single Property

  • Reading and writing an arbitrary Object

In all cases, these methods operate in the currency of InputStream and OutputStream objects, allowing graphs and their related elements to be written to and read from files, byte arrays, etc. The Graph interface offers the io method, which provides access to "reader/writer builder" objects that are pre-configured with serializers provided by the Graph, as well as helper methods for the various I/O capabilities. Unless there are very advanced requirements for the serialization process, it is always best to utilize the methods on the Io interface to construct GraphReader and GraphWriter instances, as the implementation may provide some custom settings that would otherwise have to be configured manually by the user to do the serialization.

It is up to the implementations of the GraphReader and GraphWriter interfaces to choose the methods they implement and the manner in which they work together. The only semantics enforced and expected is that the write methods should produce output that is compatible with the corresponding read method (e.g. the output of writeVertices should be readable as input to readVertices and the output of writeProperty should be readable as input to readProperty).

GraphML Reader/Writer

The GraphML file format is a common XML-based representation of a graph. It is widely supported by graph-related tools and libraries making it a solid interchange format for TinkerPop. In other words, if the intent is to work with graph data in conjunction with applications outside of TinkerPop, GraphML may be the best choice to do that. Common use cases might be:

  • Generate a graph using NetworkX, export it with GraphML and import it to TinkerPop.

  • Produce a subgraph and export it to GraphML to be consumed by and visualized in Gephi.

  • Migrate the data of an entire graph to a different graph database not supported by TinkerPop.

As GraphML is a specification for the serialization of an entire graph and not the individual elements of a graph, methods that support input and output of single vertices, edges, etc. are not supported.

Caution
GraphML is a "lossy" format in that it only supports primitive values for properties and does not have support for Graph variables. It will use toString to serialize property values outside of those primitives.
Caution
GraphML, as a specification, allows for <edge> and <node> elements to appear in any order. The GraphMLReader supports that, but the capability comes with a limitation. TinkerPop does not allow the vertex label to be changed after the vertex has been created. Therefore, if an <edge> element comes before the <node> it refers to, the label on the vertex will be ignored. It is thus better to order <node> elements in the GraphML to appear before all <edge> elements if vertex labels are important to the graph.

The following code shows how to write a Graph instance to a file called tinkerpop-modern.xml and then how to read that file back into a different instance:

final Graph graph = TinkerFactory.createModern();
graph.io(IoCore.graphml()).writeGraph("tinkerpop-modern.xml");
final Graph newGraph = TinkerGraph.open();
newGraph.io(IoCore.graphml()).readGraph("tinkerpop-modern.xml");

If a custom configuration is required, then have the Graph generate a GraphReader or GraphWriter "builder" instance:

final Graph graph = TinkerFactory.createModern();
try (final OutputStream os = new FileOutputStream("tinkerpop-modern.xml")) {
    graph.io(IoCore.graphml()).writer().normalize(true).create().writeGraph(os, graph);
}

final Graph newGraph = TinkerGraph.open();
try (final InputStream stream = new FileInputStream("tinkerpop-modern.xml")) {
    newGraph.io(IoCore.graphml()).reader().vertexIdKey("name").create().readGraph(stream, newGraph);
}

GraphSON Reader/Writer

GraphSON is a JSON-based format extended from earlier versions of TinkerPop. It is important to note that TinkerPop3’s GraphSON is not backwards compatible with prior TinkerPop GraphSON versions. GraphSON has some support from graph-related applications outside of TinkerPop, but it is generally best used in two cases:

  • A text format of the graph or its elements is desired (e.g. debugging, usage in source control, etc.)

  • The graph or its elements need to be consumed by code that is not JVM-based (e.g. JavaScript, Python, .NET, etc.)

GraphSON supports all of the GraphReader and GraphWriter interface methods and can therefore read or write an entire Graph, vertices, arbitrary objects, etc. The following code shows how to write a Graph instance to a file called tinkerpop-modern.json and then how to read that file back into a different instance:

final Graph graph = TinkerFactory.createModern();
graph.io(IoCore.graphson()).writeGraph("tinkerpop-modern.json");

final Graph newGraph = TinkerGraph.open();
newGraph.io(IoCore.graphson()).readGraph("tinkerpop-modern.json");

If a custom configuration is required, then have the Graph generate a GraphReader or GraphWriter "builder" instance:

final Graph graph = TinkerFactory.createModern();
try (final OutputStream os = new FileOutputStream("tinkerpop-modern.json")) {
    final GraphSONMapper mapper = graph.io(IoCore.graphson()).mapper().normalize(true).create();
    graph.io(IoCore.graphson()).writer().mapper(mapper).create().writeGraph(os, graph);
}

final Graph newGraph = TinkerGraph.open();
try (final InputStream stream = new FileInputStream("tinkerpop-modern.json")) {
    newGraph.io(IoCore.graphson()).reader().vertexIdKey("name").create().readGraph(stream, newGraph);
}

One of the important configuration options of the GraphSONReader and GraphSONWriter is the ability to embed type information into the output. By embedding the types, it becomes possible to serialize a graph without losing type information that might be important when being consumed by another source. The importance of this concept is demonstrated in the following example where a single Vertex is written to GraphSON using the Gremlin Console:

gremlin> graph = TinkerFactory.createModern()
==>tinkergraph[vertices:6 edges:6]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> f = new FileOutputStream("vertex-1.json")
==>java.io.FileOutputStream@65f942f0
gremlin> graph.io(graphson()).writer().create().writeVertex(f, g.V(1).next(), BOTH)
==>null
gremlin> f.close()
==>null

The following GraphSON example shows the output of GraphSONWriter.writeVertex() with associated edges:

{
    "id": 1,
    "label": "person",
    "outE": {
        "created": [
            {
                "id": 9,
                "inV": 3,
                "properties": {
                    "weight": 0.4
                }
            }
        ],
        "knows": [
            {
                "id": 7,
                "inV": 2,
                "properties": {
                    "weight": 0.5
                }
            },
            {
                "id": 8,
                "inV": 4,
                "properties": {
                    "weight": 1
                }
            }
        ]
    },
    "properties": {
        "name": [
            {
                "id": 0,
                "value": "marko"
            }
        ],
        "age": [
            {
                "id": 1,
                "value": 29
            }
        ]
    }
}

The vertex properly serializes to valid JSON but note that a consuming application will not automatically know how to interpret the numeric values. In coercing those Java values to JSON, such information is lost.

With a minor change to the construction of the GraphSONWriter the lossy nature of GraphSON can be avoided:

gremlin> graph = TinkerFactory.createModern()
==>tinkergraph[vertices:6 edges:6]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> f = new FileOutputStream("vertex-1.json")
==>java.io.FileOutputStream@152b9ec9
gremlin> mapper = graph.io(graphson()).mapper().embedTypes(true).create()
==>org.apache.tinkerpop.gremlin.structure.io.graphson.GraphSONMapper@2059d036
gremlin> graph.io(graphson()).writer().mapper(mapper).create().writeVertex(f, g.V(1).next(), BOTH)
==>null
gremlin> f.close()
==>null

In the above code, the embedTypes option is set to true and the output below shows the difference in the output:

{
    "@class": "java.util.HashMap",
    "id": 1,
    "label": "person",
    "outE": {
        "@class": "java.util.HashMap",
        "created": [
            "java.util.ArrayList",
            [
                {
                    "@class": "java.util.HashMap",
                    "id": 9,
                    "inV": 3,
                    "properties": {
                        "@class": "java.util.HashMap",
                        "weight": 0.4
                    }
                }
            ]
        ],
        "knows": [
            "java.util.ArrayList",
            [
                {
                    "@class": "java.util.HashMap",
                    "id": 7,
                    "inV": 2,
                    "properties": {
                        "@class": "java.util.HashMap",
                        "weight": 0.5
                    }
                },
                {
                    "@class": "java.util.HashMap",
                    "id": 8,
                    "inV": 4,
                    "properties": {
                        "@class": "java.util.HashMap",
                        "weight": 1
                    }
                }
            ]
        ]
    },
    "properties": {
        "@class": "java.util.HashMap",
        "name": [
            "java.util.ArrayList",
            [
                {
                    "@class": "java.util.HashMap",
                    "id": [
                        "java.lang.Long",
                        0
                    ],
                    "value": "marko"
                }
            ]
        ],
        "age": [
            "java.util.ArrayList",
            [
                {
                    "@class": "java.util.HashMap",
                    "id": [
                        "java.lang.Long",
                        1
                    ],
                    "value": 29
                }
            ]
        ]
    }
}

The ambiguity of components of the GraphSON is now removed by the @class property, which contains Java class information for the data it is associated with. The @class property is used for all non-final types, with the exception of a small number of "natural" types (String, Boolean, Integer, and Double) which can be correctly inferred from JSON typing. While the output is more verbose, it comes with the security of not losing type information. While non-JVM languages won’t be able to consume this information automatically, at least there is a hint as to how the values should be coerced back into the correct types in the target language.
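
A consumer can use the embedded class name as a coercion hint. The sketch below assumes the JSON has already been parsed by some JSON library into a type name and a generic Number, as in the ["java.lang.Long", 0] pairs above. TypeHintSketch and coerce are hypothetical names, and only a handful of numeric types are handled.

```java
/** Hypothetical helper that applies an embedded Java type hint to a parsed JSON number. */
final class TypeHintSketch {

    /** Converts `raw` to the boxed type named by `typeName`. */
    static Object coerce(String typeName, Number raw) {
        switch (typeName) {
            case "java.lang.Long":    return raw.longValue();
            case "java.lang.Integer": return raw.intValue();
            case "java.lang.Float":   return raw.floatValue();
            case "java.lang.Double":  return raw.doubleValue();
            default:
                // Unknown hints are rejected rather than guessed at.
                throw new IllegalArgumentException("unsupported type hint: " + typeName);
        }
    }

    public static void main(String[] args) {
        // A JSON parser would typically hand back 0 as an Integer; the hint restores the Long.
        Object id = coerce("java.lang.Long", 0);
        System.out.println(id.getClass().getName()); // prints java.lang.Long
    }
}
```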

Gryo Reader/Writer

Kryo is a popular serialization package for the JVM. Gremlin-Kryo is a binary Graph serialization format for use on the JVM by JVM languages. It is designed to be space efficient, non-lossy and is promoted as the standard format to use when working with graph data inside of the TinkerPop stack. A list of common use cases is presented below:

  • Migration from one Gremlin Structure implementation to another (e.g. TinkerGraph to Neo4jGraph)

  • Serialization of individual graph elements to be sent over the network to another JVM.

  • Backups of in-memory graphs or subgraphs.

Caution
When migrating between Gremlin Structure implementations, Kryo may not lose data, but it is important to consider the features of each Graph and whether or not the data types supported in one will be supported in the other. Failure to do so, may result in errors.

Kryo supports all of the GraphReader and GraphWriter interface methods and can therefore read or write an entire Graph, vertices, edges, etc. The following code shows how to write a Graph instance to a file called tinkerpop-modern.kryo and then how to read that file back into a different instance:

final Graph graph = TinkerFactory.createModern();
graph.io(IoCore.gryo()).writeGraph("tinkerpop-modern.kryo");

final Graph newGraph = TinkerGraph.open();
newGraph.io(IoCore.gryo()).readGraph("tinkerpop-modern.kryo");

If a custom configuration is required, then have the Graph generate a GraphReader or GraphWriter "builder" instance:

final Graph graph = TinkerFactory.createModern();
try (final OutputStream os = new FileOutputStream("tinkerpop-modern.kryo")) {
    graph.io(IoCore.gryo()).writer().create().writeGraph(os, graph);
}

final Graph newGraph = TinkerGraph.open();
try (final InputStream stream = new FileInputStream("tinkerpop-modern.kryo")) {
    newGraph.io(IoCore.gryo()).reader().vertexIdKey("name").create().readGraph(stream, newGraph);
}
Note
The preferred extension for file names produced by Gryo is .kryo.

TinkerPop2 Data Migration

For those using TinkerPop2, migrating to TinkerPop3 will mean a number of programming changes, but may also require a migration of the data depending on the graph implementation. For example, trying to open TinkerGraph data from TinkerPop2 with TinkerPop3 code will not work. However, opening a TinkerPop2 Neo4jGraph with a TinkerPop3 Neo4jGraph should work provided there aren’t Neo4j version compatibility mismatches preventing the read.

If a particular TinkerPop2 Graph cannot be read by TinkerPop3, a "legacy" data migration approach exists. The migration involves writing the TinkerPop2 Graph to GraphSON, then reading it to TinkerPop3 with the LegacyGraphSONReader (a limited implementation of the GraphReader interface).

The following represents an example migration of the "classic" toy graph. In this example, the "classic" graph is saved to GraphSON using TinkerPop2.

gremlin> Gremlin.version()
==>2.5.z
gremlin> graph = TinkerGraphFactory.createTinkerGraph()
==>tinkergraph[vertices:6 edges:6]
gremlin> GraphSONWriter.outputGraph(graph,'/tmp/tp2.json',GraphSONMode.EXTENDED)
==>null

The above console session uses the gremlin-groovy distribution from TinkerPop2. It is important to generate the tp2.json file using the EXTENDED mode as it will include data types when necessary which will help limit "lossiness" on the TinkerPop3 side when imported. Once tp2.json is created, it can then be imported to a TinkerPop3 Graph.

gremlin> Gremlin.version()
==>3.0.0-SNAPSHOT
gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> r = LegacyGraphSONReader.build().create()
==>org.apache.tinkerpop.gremlin.structure.io.graphson.LegacyGraphSONReader@64337702
gremlin> r.readGraph(new FileInputStream('/tmp/tp2.json'), graph)
==>null
gremlin> g = graph.traversal(standard())
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.E()
==>e[11][4-created->3]
==>e[12][6-created->3]
==>e[7][1-knows->2]
==>e[8][1-knows->4]
==>e[9][1-created->3]
==>e[10][4-created->5]

Namespace Conventions

End users, vendors, GraphComputer algorithm designers, GremlinPlugin creators, etc. all leverage properties on elements to store information. There are a few conventions that should be respected when naming property keys to ensure that these stakeholders do not conflict with one another.

  • End users are granted the flat namespace (e.g. name, age, location) to key their properties and label their elements.

  • Vendors are granted the hidden namespace (e.g. ~metadata) to key their properties and labels. Data keyed as such is only accessible via the vendor implementation code and no other stakeholders are granted read or write access to data prefixed with "~" (see Graph.Hidden). Test coverage and exceptions exist to ensure that vendors respect this hard boundary.

  • VertexProgram and MapReduce developers should, like GraphStrategy developers, leverage qualified namespaces particular to their domain (e.g. mydomain.myvertexprogram.computedata).

  • GremlinPlugin creators should prefix their plugin name with their domain (e.g. mydomain.myplugin).

Important
TinkerPop uses tinkerpop. and gremlin. as the prefixes for provided strategies, vertex programs, map reduce implementations, and plugins.

The only truly protected namespace is the hidden namespace provided to vendors. From there, it’s up to engineers to respect the namespacing conventions presented.

The Traversal

gremlin running

At the most general level there is Traversal<S,E> which implements Iterator<E>, where the S stands for start and the E stands for end. A traversal is composed of four primary components:

  1. Step<S,E>: an individual function applied to S to yield E. Steps are chained within a traversal.

  2. TraversalStrategy: interceptor methods to alter the execution of the traversal (e.g. query re-writing).

  3. TraversalSideEffects: key/value pairs that can be used to store global information about the traversal.

  4. Traverser<T>: the object propagating through the Traversal currently representing an object of type T.

The classic notion of a graph traversal is provided by GraphTraversal<S,E> which extends Traversal<S,E>. GraphTraversal provides an interpretation of the graph data in terms of vertices, edges, etc. and thus, a graph traversal DSL.

Important
The underlying Step implementations provided by TinkerPop should encompass most of the functionality required by a DSL author. It is important that DSL authors leverage the provided steps as then the common optimization and decoration strategies can reason on the underlying traversal sequence. If new steps are introduced, then common traversal strategies may not function properly.

Graph Traversal Steps

step types

A GraphTraversal<S,E> can be spawned off of a Graph, Vertex, Edge, or VertexProperty. It can also be spawned anonymously (i.e. empty) via __. A graph traversal is composed of an ordered list of steps. All the steps provided by GraphTraversal inherit from the more general forms diagrammed above. A list of all the steps (and their descriptions) is provided in the TinkerPop3 GraphTraversal JavaDoc. The following subsections will demonstrate the GraphTraversal steps using the Gremlin Console.

Note
To reduce the verbosity of the expression, it is good to import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.*. This way, instead of doing __.inE() for an anonymous traversal, it is possible to simply write inE(). Be aware of language-specific reserved keywords when using anonymous traversals. For example, in and as are reserved keywords in Groovy, therefore you must use the verbose syntax __.in() and __.as() to avoid collisions.

Lambda Steps

Caution
Lambda steps are presented for educational purposes as they represent the foundational constructs of the Gremlin language. In practice, lambda steps should be avoided and traversal verification strategies exist to disallow their use unless explicitly "turned off." For more information on the problems with lambdas, please read A Note on Lambdas.

There are five generic steps that all other specific steps described later extend.

Step                                          Description
map(Function<Traverser<S>, E>)                map the traverser to some object of type E for the next step to process.
flatMap(Function<Traverser<S>, Iterator<E>>)  map the traverser to an iterator of E objects that are streamed to the next step.
filter(Predicate<Traverser<S>>)               map the traverser to either true or false, where false will not pass the traverser to the next step.
sideEffect(Consumer<Traverser<S>>)            perform some operation on the traverser and pass it to the next step.
branch(Function<Traverser<S>,M>)              split the traverser to all the traversals indexed by the M token.
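
For readers who know java.util.stream, the first four generic steps have close counterparts there. The following sketch is a loose analogy in plain Java, not Gremlin: GenericStepsAnalogy and run are hypothetical names, the strings are arbitrary sample data, and branch() is left out because streams have no direct equivalent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Stream-based analogy for the generic map, flatMap, filter, and sideEffect steps. */
final class GenericStepsAnalogy {

    /** Returns the transformed names; every name that passes the filter is also appended to `log`. */
    static List<String> run(List<String> log) {
        return Stream.of("marko josh", "lop", "vadas peter")
                .flatMap(s -> Arrays.stream(s.split(" ")))  // flatMap: one input yields a stream of outputs
                .filter(name -> name.length() > 3)          // filter: drop inputs for which the predicate is false
                .peek(log::add)                             // sideEffect: act on the input, pass it on unchanged
                .map(String::toUpperCase)                   // map: one input yields exactly one output
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        System.out.println(run(log)); // prints [MARKO, JOSH, VADAS, PETER]
        System.out.println(log);      // prints [marko, josh, vadas, peter]
    }
}
```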

The Traverser<S> object provides access to:

  1. The current traversed S object — Traverser.get().

  2. The current path traversed by the traverser — Traverser.path().

    1. A helper shorthand to get a particular path-history object — Traverser.path(String) == Traverser.path().get(String).

  3. The number of times the traverser has gone through the current loop — Traverser.loops().

  4. The number of objects represented by this traverser — Traverser.bulk().

  5. The local data structure associated with this traverser — Traverser.sack().

  6. The side-effects associated with the traversal — Traverser.sideEffects().

    1. A helper shorthand to get a particular side-effect — Traverser.sideEffect(String) == Traverser.sideEffects().get(String).

map-lambda

gremlin> g.V(1).out().values('name') //(1)
==>lop
==>vadas
==>josh
gremlin> g.V(1).out().map {it.get().value('name')} //(2)
==>lop
==>vadas
==>josh
  1. An outgoing traversal from vertex 1 to the name values of the adjacent vertices.

  2. The same operation, but using a lambda to access the name property values.

filter-lambda

gremlin> g.V().filter {it.get().label() == 'person'} //(1)
==>v[1]
==>v[2]
==>v[4]
==>v[6]
gremlin> g.V().hasLabel('person') //(2)
==>v[1]
==>v[2]
==>v[4]
==>v[6]
  1. A filter that only allows the vertex to pass if its label is "person".

  2. The more specific hasLabel()-step, a variant of has(), is implemented as a filter() with the respective predicate.

side-effect-lambda

gremlin> g.V().hasLabel('person').sideEffect(System.out.&println) //(1)
v[1]
==>v[1]
v[2]
==>v[2]
v[4]
==>v[4]
v[6]
==>v[6]
  1. Whatever enters sideEffect() is passed to the next step, but some intervening process can occur.

branch-lambda

gremlin> g.V().branch(values('name')).
               option('marko', values('age')).
               option(none, values('name')) //(1)
==>29
==>vadas
==>lop
==>josh
==>ripple
==>peter
gremlin> g.V().choose(has('name','marko'),
                      values('age'),
                      values('name')) //(2)
==>29
==>vadas
==>lop
==>josh
==>ripple
==>peter
  1. If the vertex is "marko", get his age, else get the name of the vertex.

  2. The more specific boolean-based choose()-step is implemented as a branch().

AddEdge Step

Reasoning is the process of making explicit in the data what is implicit in the data. What is explicit in a graph are the objects of the graph — i.e. vertices and edges. What is implicit in the graph is the traversal. In other words, traversals expose meaning where the meaning is defined by the traversal description. For example, take the concept of a "co-developer." Two people are co-developers if they have worked on the same project together. This concept can be represented as a traversal and thus, the concept of "co-developers" can be derived. To add edges via a traversal, there is a collection of addE()-steps (map/sideEffect).

addedge step
gremlin> g.V(1).as('a').out('created').in('created').where(neq('a')).addOutE('co-developer','a','year',2009) //(1)
==>e[12][4-co-developer->1]
==>e[13][6-co-developer->1]
gremlin> g.withSideEffect('a',g.V(3,5).toList()).V(4).addInE('createdBy','a') //(2)
==>e[14][3-createdBy->4]
==>e[15][5-createdBy->4]
gremlin> g.V().as('a').out('created').as('b').select('a','b').addOutE('b','createdBy','a','acl','public') //(3)
==>e[16][3-createdBy->1]
==>e[17][5-createdBy->4]
==>e[18][3-createdBy->4]
==>e[19][3-createdBy->6]
gremlin> g.V(1).as('a').out('knows').addInE('livesNear','a','year',2009).inV().inE('livesNear').values('year') //(4)
==>2009
==>2009
  1. Add a co-developer edge with a year-property between marko and his collaborators.

  2. Add incoming createdBy edges to the josh-vertex from the lop- and ripple-vertices.

  3. It is possible to pull the vertices from a select-projection.

  4. The newly created edge is a traversable object.
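The "co-developer" reasoning described in this section can be sketched in plain Python. This is a model of the idea only, not TinkerPop code: the created dictionary mirrors the toy graph's created edges, and co_developers is a hypothetical helper name.

```python
# Who created what, mirroring the created edges of the toy graph.
created = {"marko": ["lop"], "josh": ["ripple", "lop"], "peter": ["lop"]}

def co_developers(person):
    """Derive (other, 'co-developer', person) edges from shared projects."""
    projects = set(created.get(person, []))
    edges = []
    for other, theirs in created.items():
        # Skip the person themselves, like where(neq('a')) in the traversal.
        if other != person and projects & set(theirs):
            edges.append((other, "co-developer", person))
    return edges

marko_edges = co_developers("marko")
```

The derived edges make explicit a relationship that was previously only implicit in the created edges.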

AddVertex Step

The addV()-step is used to add vertices to the graph (map/sideEffect). For every incoming object, a vertex is created. Moreover, GraphTraversalSource maintains an addV() method.

gremlin> g.addV(label,'person','name','stephen')
==>v[12]
gremlin> g.V().values('name')
==>marko
==>vadas
==>lop
==>josh
==>ripple
==>peter
==>stephen
gremlin> g.V().outE('knows').addV('name','nothing')
==>v[14]
==>v[16]
gremlin> g.V().has('name','nothing')
==>v[16]
==>v[14]
gremlin> g.V().has('name','nothing').bothE()

AddProperty Step

The property()-step is used to add properties to the elements of the graph (sideEffect). Unlike addV() and addE(), property() is a full sideEffect step in that it does not return the property it created, but the element that streamed into it.

gremlin> g.V(1).property('country','usa')
==>v[1]
gremlin> g.V(1).property('city','santa fe').property('state','new mexico').valueMap()
==>[country:[usa], city:[santa fe], name:[marko], state:[new mexico], age:[29]]
gremlin> g.V(1).property(list,'age',35)
==>v[1]
gremlin> g.V(1).valueMap()
==>[country:[usa], city:[santa fe], name:[marko], state:[new mexico], age:[29, 35]]

Aggregate Step

aggregate step

The aggregate()-step (sideEffect) is used to aggregate all the objects at a particular point of traversal into a Collection. The step uses eager evaluation in that no objects continue on until all previous objects have been fully aggregated (as opposed to store() which lazily fills a collection). The eager evaluation nature is crucial in situations where everything at a particular point is required for future computation. An example is provided below.

gremlin> g.V(1).out('created') //(1)
==>v[3]
gremlin> g.V(1).out('created').aggregate('x') //(2)
==>v[3]
gremlin> g.V(1).out('created').aggregate('x').in('created') //(3)
==>v[1]
==>v[4]
==>v[6]
gremlin> g.V(1).out('created').aggregate('x').in('created').out('created') //(4)
==>v[3]
==>v[5]
==>v[3]
==>v[3]
gremlin> g.V(1).out('created').aggregate('x').in('created').out('created').
                where(without('x')).values('name') //(5)
==>ripple
  1. What has marko created?

  2. Aggregate all his creations.

  3. Who are marko’s collaborators?

  4. What have marko’s collaborators created?

  5. What have marko’s collaborators created that he hasn’t created?
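The five traversals above follow an aggregate-then-exclude pattern. A minimal Python sketch of that pattern, assuming a simple dictionary in place of the graph (the recommend function is hypothetical, not a TinkerPop call):

```python
created = {"marko": ["lop"], "josh": ["ripple", "lop"], "peter": ["lop"]}

def recommend(user):
    """Items created by the user's collaborators that the user has not created."""
    # Eager aggregation: collect everything before moving on, like aggregate('x').
    x = set(created[user])
    # Everyone sharing at least one item (includes the user, as in('created') does).
    collaborators = [p for p, items in created.items() if x & set(items)]
    candidates = [item for p in collaborators for item in created[p]]
    # Equivalent in spirit to where(without('x')).
    return sorted({item for item in candidates if item not in x})

marko_recommendations = recommend("marko")
```

The set x must be complete before the final filter runs, which is the reason the eager aggregate() is used rather than a lazily filled collection.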

In recommendation systems, the above pattern is used:

"What has userA liked? Who else has liked those things? What have they liked that userA hasn't already liked?"

Finally, aggregate()-step can be modulated via by()-projection.

gremlin> g.V().out('knows').aggregate('x').cap('x')
==>{v[2]=1, v[4]=1}
gremlin> g.V().out('knows').aggregate('x').by('name').cap('x')
==>{vadas=1, josh=1}

And Step

The and()-step ensures that all provided traversals yield a result (filter). Please see or() for or-semantics.

gremlin> g.V().and(
            outE('knows'),
            values('age').is(lt(30))).
              values('name')
==>marko

The and()-step can take an arbitrary number of traversals. All traversals must produce at least one output for the original traverser to pass to the next step.

An infix notation can be used as well, though with infix notation only two traversals can be and’d together.

gremlin> g.V().where(outE('created').and().outE('knows')).values('name')
==>marko
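The and-semantics can be sketched in plain Python as a filter that keeps an object only when every child traversal yields at least one result. The names below (and_step, out_knows, age_below_30) are illustrative, and only the person vertices of the toy graph are modeled.

```python
def and_step(stream, *traversals):
    """Keep x only if every child traversal yields at least one result."""
    for x in stream:
        if all(any(True for _ in t(x)) for t in traversals):
            yield x

knows = {"marko": ["vadas", "josh"]}
age = {"marko": 29, "vadas": 27, "josh": 32, "peter": 35}

def out_knows(name):
    """Model of outE('knows'): the outgoing knows-edges of a person."""
    return knows.get(name, [])

def age_below_30(name):
    """Model of values('age').is(lt(30)): emits the age only if it is below 30."""
    return [age[name]] if age[name] < 30 else []

result = list(and_step(age, out_knows, age_below_30))
```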

As Step

The as()-step is not a real step, but a "step modulator" similar to by() and option(). With as(), it is possible to provide a label to the step that can later be accessed by steps and data structures that make use of such labels — e.g., select(), match(), and path.

gremlin> g.V().as('a').out('created').as('b').select('a','b') //(1)
==>[a:v[1], b:v[3]]
==>[a:v[4], b:v[5]]
==>[a:v[4], b:v[3]]
==>[a:v[6], b:v[3]]
gremlin> g.V().as('a').out('created').as('b').select('a','b').by('name') //(2)
==>[a:marko, b:lop]
==>[a:josh, b:ripple]
==>[a:josh, b:lop]
==>[a:peter, b:lop]
  1. Select the objects labeled "a" and "b" from the path.

  2. Select the objects labeled "a" and "b" from the path and, for each object, project its name value.

A step can have any number of labels associated with it. This is useful for referencing the same step multiple times in a future step.

gremlin> g.V().hasLabel('software').as('a','b','c').
            select('a','b','c').
              by('name').
              by('lang').
              by(__.in('created').values('name').fold())
==>[a:lop, b:java, c:[marko, josh, peter]]
==>[a:ripple, b:java, c:[josh]]

Barrier Step

The barrier()-step (barrier) turns the lazy traversal pipeline into a bulk-synchronous pipeline. This step is useful in the following situations:

  • When everything prior to barrier() needs to be executed before moving onto the steps after the barrier() (i.e. ordering).

  • When "stalling" the traversal may lead to a "bulking optimization" in traversals that repeatedly touch many of the same elements (i.e. optimizing).

gremlin> g.V().sideEffect{println "first: ${it}"}.sideEffect{println "second: ${it}"}.iterate()
first: v[1]
second: v[1]
first: v[2]
second: v[2]
first: v[3]
second: v[3]
first: v[4]
second: v[4]
first: v[5]
second: v[5]
first: v[6]
second: v[6]
gremlin> g.V().sideEffect{println "first: ${it}"}.barrier().sideEffect{println "second: ${it}"}.iterate()
first: v[1]
first: v[2]
first: v[3]
first: v[4]
first: v[5]
first: v[6]
second: v[1]
second: v[2]
second: v[3]
second: v[4]
second: v[5]
second: v[6]

The theory behind a "bulking optimization" is simple. If there are one million traversers at vertex 1, then there is no need to calculate one million both()-computations. Instead, represent those one million traversers as a single traverser with a Traverser.bulk() equal to one million and execute both() once. A bulking optimization example is made more salient on a larger graph. Therefore, the example below leverages the Grateful Dead graph.

gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> graph.io(graphml()).readGraph('data/grateful-dead.xml')
==>null
gremlin> g = graph.traversal(standard())
==>graphtraversalsource[tinkergraph[vertices:808 edges:8049], standard]
gremlin> clockWithResult(1){g.V().both().both().both().count().next()} //(1)
==>10582.836385999999
==>126653966
gremlin> clockWithResult(1){g.V().repeat(both()).times(3).count().next()} //(2)
==>1516.360602
==>126653966
gremlin> clockWithResult(1){g.V().both().barrier().both().barrier().both().barrier().count().next()} //(3)
==>16.194036
==>126653966
  1. A non-bulking traversal where each traverser is processed.

  2. Each traverser entering repeat() has its recursion bulked.

  3. A bulking traversal where implicit traversers are not processed.
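The bulking idea can be illustrated with a small Python model, assuming a Counter whose counts play the role of Traverser.bulk(). The adjacency map below mirrors both() on the six-vertex toy graph; step_naive and step_bulked are hypothetical names.

```python
from collections import Counter

def step_naive(traversers, adjacency):
    """One list entry per traverser: work grows with the traverser count."""
    return [n for v in traversers for n in adjacency[v]]

def step_bulked(bulks, adjacency):
    """One entry per distinct vertex; the stored count acts as the bulk."""
    out = Counter()
    for v, bulk in bulks.items():
        for n in adjacency[v]:
            out[n] += bulk
    return out

adjacency = {1: [2, 3, 4], 2: [1], 3: [1, 4, 6], 4: [1, 3, 5], 5: [4], 6: [3]}

naive = list(adjacency)   # one traverser per vertex
bulks = Counter(naive)    # the same traversers, each with bulk 1
for _ in range(3):        # three both() hops
    naive = step_naive(naive, adjacency)
    bulks = step_bulked(bulks, adjacency)

total_bulked = sum(bulks.values())
```

Both variants arrive at the same total, but the bulked form never holds more entries than there are distinct vertices.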

If barrier() is provided an integer argument n, then the barrier will hold at most n unique traversers before draining the aggregated traversers to the next step. This retains the aforementioned bulking optimization while reducing the risk of an out-of-memory exception.

The non-default LazyBarrierStrategy inserts barrier()-steps in a traversal where appropriate in order to gain the "bulking optimization."

gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> graph.io(graphml()).readGraph('data/grateful-dead.xml')
==>null
gremlin> g = graph.traversal(GraphTraversalSource.build().with(LazyBarrierStrategy.instance()).engine(StandardTraversalEngine.build()))
==>graphtraversalsource[tinkergraph[vertices:808 edges:8049], standard]
gremlin> clockWithResult(1){g.V().both().both().both().count().next()}
==>14.292721
==>126653966
gremlin> g.V().both().both().both().count().iterate().toString() //(1)
==>[TinkerGraphStep([],vertex), VertexStep(BOTH,vertex), NoOpBarrierStep(10000), VertexStep(BOTH,vertex), NoOpBarrierStep(10000), VertexStep(BOTH,edge), CountGlobalStep]
  1. With LazyBarrierStrategy activated, barrier() steps are automatically inserted where appropriate.

By Step

The by()-step is not an actual step, but instead is a "step-modulator" similar to as() and option(). If a step is able to accept traversals, functions, comparators, etc. then by() is the means by which they are added. The general pattern is step().by()...by(). Some steps can only accept one by() while others can take an arbitrary number.

gremlin> g.V().group().by(bothE().count()) //(1)
==>[1:[v[2], v[5], v[6]], 3:[v[1], v[3], v[4]]]
gremlin> g.V().group().by(bothE().count()).by('name') //(2)
==>[1:[vadas, ripple, peter], 3:[marko, lop, josh]]
gremlin> g.V().group().by(bothE().count()).by('name').by(count(local)) //(3)
==>[1:3, 3:3]
  1. by(bothE().count()) will group the elements by their edge count (traversal).

  2. by('name') will process the grouped elements by their name (element property projection).

  3. by(count(local)) will count the number of elements in each group (traversal).

Cap Step

The cap()-step (barrier) iterates the traversal up to itself and emits the sideEffect referenced by the provided key. If multiple keys are provided, then a Map<String,Object> of sideEffects is emitted.

gremlin> g.V().groupCount('a').by(label).cap('a') //(1)
==>[software:2, person:4]
gremlin> g.V().groupCount('a').by(label).groupCount('b').by(outE().count()).cap('a','b') //(2)
==>[a:[software:2, person:4], b:[0:3, 1:1, 2:1, 3:1]]
  1. Group and count vertices by their label. Emit the side effect labeled a, which is the group count by label.

  2. Same as statement 1, but also emit the side effect labeled b which groups vertices by the number of out edges.

Coalesce Step

The coalesce()-step evaluates the provided traversals in order and emits the results of the first traversal that yields at least one element.

gremlin> g.V(1).coalesce(outE('knows'), outE('created')).inV().path().by('name').by(label)
==>[marko, knows, vadas]
==>[marko, knows, josh]
gremlin> g.V(1).coalesce(outE('created'), outE('knows')).inV().path().by('name').by(label)
==>[marko, created, lop]
gremlin> g.V(1).next().property('nickname', 'okram')
==>vp[nickname->okram]
gremlin> g.V().hasLabel('person').coalesce(values('nickname'), values('name'))
==>okram
==>vadas
==>josh
==>peter
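A plain-Python sketch of the first-non-empty semantics follows. The coalesce function and the dictionaries are illustrative stand-ins for the traversals and property values shown above.

```python
def coalesce(x, *traversals):
    """Return the results of the first traversal that yields anything for x."""
    for t in traversals:
        results = list(t(x))
        if results:
            return results
    return []

people = [
    {"name": "marko", "nickname": "okram"},
    {"name": "vadas"},
    {"name": "josh"},
    {"name": "peter"},
]

def values(key):
    """Model of values(key): emits the property value if it is present."""
    return lambda person: [person[key]] if key in person else []

labels = [v for p in people for v in coalesce(p, values("nickname"), values("name"))]
```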

Count Step

count step

The count()-step (map) counts the total number of represented traversers in the stream (i.e. the bulk count).

gremlin> g.V().count()
==>6
gremlin> g.V().hasLabel('person').count()
==>4
gremlin> g.V().hasLabel('person').outE('created').count().path() //(1)
==>[4]
gremlin> g.V().hasLabel('person').outE('created').count().map {it.get() * 10}.path() //(2)
==>[4, 40]
  1. count()-step is a reducing barrier step meaning that all of the previous traversers are folded into a new traverser.

  2. The path of the traverser emanating from count() starts at count().

Important
count(local) counts the current, local object (not the objects in the traversal stream). This works for Collection- and Map-type objects. For any other object, a count of 1 is returned.
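The difference between the two counts can be sketched in Python. Here a traverser is modeled as an (object, bulk) pair; count_global and count_local are hypothetical names, and the bulk values are made up for illustration.

```python
def count_global(traversers):
    """Sum the bulks of all (object, bulk) pairs in the stream."""
    return sum(bulk for _, bulk in traversers)

def count_local(obj):
    """Size of a collection or map; any other object counts as 1."""
    if isinstance(obj, (list, tuple, set, frozenset, dict)):
        return len(obj)
    return 1

stream_total = count_global([("v[3]", 3), ("v[5]", 1)])
```

A single traverser with a bulk of 3 therefore contributes 3 to the global count, whereas count_local only ever inspects the one object it is given.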

Choose Step

choose step

The choose()-step (branch) routes the current traverser to a particular traversal branch option. With choose(), it is possible to implement if/else-based semantics as well as more complicated selections.

gremlin> g.V().hasLabel('person').
               choose(values('age').is(lte(30)),
                 __.in(),
                 __.out()).values('name') //(1)
==>marko
==>ripple
==>lop
==>lop
gremlin> g.V().hasLabel('person').
               choose(values('age')).
                 option(27, __.in()).
                 option(32, __.out()).values('name') //(2)
==>marko
==>ripple
==>lop
  1. If the traversal yields an element, then do in, else do out (i.e. true/false-based option selection).

  2. Use the result of the traversal as a key to the map of traversal options (i.e. value-based option selection).

However, note that choose() can have an arbitrary number of options and moreover, can take an anonymous traversal as its choice function.

gremlin> g.V().hasLabel('person').
               choose(values('name')).
                 option('marko', values('age')).
                 option('josh', values('name')).
                 option('vadas', valueMap()).
                 option('peter', label())
==>29
==>[name:[vadas], age:[27]]
==>josh
==>person

The choose()-step can leverage the Pick.none option match. For anything that does not match a specified option, the none-option is taken.

gremlin> g.V().hasLabel('person').
               choose(values('name')).
                 option('marko', values('age')).
                 option(none, values('name'))
==>29
==>vadas
==>josh
==>peter
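Value-based option selection with a fallback can be modeled in Python with a dictionary of branches. The NONE sentinel below is an assumption standing in for Pick.none, and choose is a hypothetical helper. Traversers that match no option and have no fallback are dropped, consistent with the age-keyed example above.

```python
NONE = object()  # stand-in for the Pick.none token

def choose(stream, choice, options):
    """Route x to the branch keyed by choice(x), falling back to NONE if present."""
    for x in stream:
        branch = options.get(choice(x), options.get(NONE))
        if branch is not None:
            yield from branch(x)

people = [
    {"name": "marko", "age": 29},
    {"name": "vadas", "age": 27},
    {"name": "josh", "age": 32},
    {"name": "peter", "age": 35},
]

with_fallback = list(choose(
    people,
    lambda p: p["name"],
    {"marko": lambda p: [p["age"]], NONE: lambda p: [p["name"]]},
))
without_fallback = list(choose(
    people,
    lambda p: p["age"],
    {27: lambda p: ["in"], 32: lambda p: ["out"]},
))
```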

Coin Step

To randomly filter out a traverser, use the coin()-step (filter). The provided double argument biases the "coin toss."

gremlin> g.V().coin(0.5)
==>v[4]
==>v[5]
==>v[6]
gremlin> g.V().coin(0.0)
gremlin> g.V().coin(1.0)
==>v[1]
==>v[2]
==>v[3]
==>v[4]
==>v[5]
==>v[6]
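The biased coin toss can be sketched in Python as a probabilistic filter. The coin function below is illustrative; it assumes an object passes when a uniform draw from [0, 1) falls below the given probability.

```python
import random

def coin(stream, probability, rng=random):
    """Keep each object independently with the given probability."""
    return [x for x in stream if rng.random() < probability]

vertices = [1, 2, 3, 4, 5, 6]
none_pass = coin(vertices, 0.0)
all_pass = coin(vertices, 1.0)
some_pass = coin(vertices, 0.5, random.Random(42))  # seeded for repeatability
```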

Constant Step

To specify a constant value for a traverser, use the constant()-step (map). This is often useful with conditional steps like choose()-step or coalesce()-step.

gremlin> g.V().choose(__.hasLabel('person'),
             __.values('name'),
             __.constant('inhuman')) //(1)
==>marko
==>vadas
==>inhuman
==>josh
==>inhuman
==>peter
gremlin> g.V().coalesce(
             __.hasLabel('person').values('name'),
             __.constant('inhuman')) //(2)
==>marko
==>vadas
==>inhuman
==>josh
==>inhuman
==>peter
  1. Show the names of people, but show "inhuman" for other vertices.

  2. Same as statement 1 (unless there is a person vertex with no name).

CyclicPath Step

cyclicpath step

Each traverser maintains its history through the traversal over the graph (i.e. its path). If it is important that the traverser repeat its course, then the cyclicPath()-step should be used (filter). The step analyzes the path of the traverser thus far and, if there are no repeats, the traverser is filtered out of the traversal computation. If non-cyclic behavior is desired, see simplePath().

gremlin> g.V(1).both().both()
==>v[1]
==>v[4]
==>v[6]
==>v[1]
==>v[5]
==>v[3]
==>v[1]
gremlin> g.V(1).both().both().cyclicPath()
==>v[1]
==>v[1]
==>v[1]
gremlin> g.V(1).both().both().cyclicPath().path()
==>[v[1], v[3], v[1]]
==>[v[1], v[2], v[1]]
==>[v[1], v[4], v[1]]
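The path test behind this filter can be sketched in Python. A path is modeled as a list of vertex ids, and the helper names are illustrative; the adjacency lists reproduce the two-hop walk from vertex 1 shown above.

```python
def is_cyclic(path):
    """A path is cyclic when any object appears in it more than once."""
    return len(set(path)) < len(path)

def cyclic_path(paths):
    """Keep only the paths that contain a repeat."""
    return [p for p in paths if is_cyclic(p)]

def simple_path(paths):
    """Keep only the paths without any repeat."""
    return [p for p in paths if not is_cyclic(p)]

adjacency = {1: [3, 2, 4], 2: [1], 3: [1, 4, 6], 4: [5, 3, 1]}
paths = [[1, a, b] for a in adjacency[1] for b in adjacency[a]]

cyclic = cyclic_path(paths)
simple = simple_path(paths)
```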

Dedup Step

With dedup()-step (filter), repeatedly seen objects are removed from the traversal stream. Note that if a traverser’s bulk is greater than 1, then it is set to 1 before being emitted.

gremlin> g.V().values('lang')
==>java
==>java
gremlin> g.V().values('lang').dedup()
==>java
gremlin> g.V(1).repeat(bothE('created').dedup().otherV()).emit().path() //(1)
==>[v[1], e[9][1-created->3], v[3]]
==>[v[1], e[9][1-created->3], v[3], e[11][4-created->3], v[4]]
==>[v[1], e[9][1-created->3], v[3], e[12][6-created->3], v[6]]
==>[v[1], e[9][1-created->3], v[3], e[11][4-created->3], v[4], e[10][4-created->5], v[5]]
  1. Traverse all created edges, but don’t touch any edge twice.

If a by-step modulation is provided to dedup(), then the object is processed accordingly prior to determining if it has been seen or not.

gremlin> g.V().valueMap(true, 'name')
==>[name:[marko], label:person, id:1]
==>[name:[vadas], label:person, id:2]
==>[name:[lop], label:software, id:3]
==>[name:[josh], label:person, id:4]
==>[name:[ripple], label:software, id:5]
==>[name:[peter], label:person, id:6]
gremlin> g.V().dedup().by(label).values('name')
==>marko
==>lop

Finally, if dedup() is provided an array of strings, then it will ensure that the de-duplication is not with respect to the current traverser object, but to the path history of the traverser.

gremlin> g.V().as('a').out('created').as('b').in('created').as('c').select('a','b','c')
==>[a:v[1], b:v[3], c:v[1]]
==>[a:v[1], b:v[3], c:v[4]]
==>[a:v[1], b:v[3], c:v[6]]
==>[a:v[4], b:v[5], c:v[4]]
==>[a:v[4], b:v[3], c:v[1]]
==>[a:v[4], b:v[3], c:v[4]]
==>[a:v[4], b:v[3], c:v[6]]
==>[a:v[6], b:v[3], c:v[1]]
==>[a:v[6], b:v[3], c:v[4]]
==>[a:v[6], b:v[3], c:v[6]]
gremlin> g.V().as('a').out('created').as('b').in('created').as('c').dedup('a','b').select('a','b','c') //(1)
==>[a:v[1], b:v[3], c:v[1]]
==>[a:v[4], b:v[5], c:v[4]]
==>[a:v[4], b:v[3], c:v[1]]
==>[a:v[6], b:v[3], c:v[1]]
  1. If the current a and b combination has been seen previously, then filter the traverser.
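All three forms of de-duplication reduce to one idea: compute a key for each object and drop the object if the key has been seen. A minimal Python sketch, with dedup as a hypothetical helper and tuples standing in for the (a, b, c) path labels:

```python
def dedup(stream, by=lambda x: x):
    """Emit x only the first time its key by(x) is encountered."""
    seen = set()
    for x in stream:
        key = by(x)
        if key not in seen:
            seen.add(key)
            yield x

langs = list(dedup(["java", "java"]))

vertices = [("marko", "person"), ("vadas", "person"), ("lop", "software"),
            ("josh", "person"), ("ripple", "software"), ("peter", "person")]
first_per_label = [name for name, _ in dedup(vertices, by=lambda v: v[1])]

rows = [(1, 3, 1), (1, 3, 4), (1, 3, 6), (4, 5, 4), (4, 3, 1),
        (4, 3, 4), (4, 3, 6), (6, 3, 1), (6, 3, 4), (6, 3, 6)]
unique_ab = list(dedup(rows, by=lambda r: (r[0], r[1])))
```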

Drop Step

The drop()-step (filter/sideEffect) is used to remove elements and properties from the graph (i.e. remove). It is a filter step because the traversal yields no outgoing objects.

gremlin> g.V().outE().drop()
gremlin> g.E()
gremlin> g.V().properties('name').drop()
gremlin> g.V().valueMap()
==>[age:[29]]
==>[age:[27]]
==>[lang:[java]]
==>[age:[32]]
==>[lang:[java]]
==>[age:[35]]
gremlin> g.V().drop()
gremlin> g.V()

Fold Step

There are situations when the traversal stream needs a "barrier" to aggregate all the objects and emit a computation that is a function of the aggregate. The fold()-step (map) is one particular instance of this. Please see unfold()-step for the inverse functionality.

gremlin> g.V(1).out('knows').values('name')
==>vadas
==>josh
gremlin> g.V(1).out('knows').values('name').fold() //(1)
==>[vadas, josh]
gremlin> g.V(1).out('knows').values('name').fold().next().getClass() //(2)
==>class java.util.ArrayList
gremlin> g.V(1).out('knows').values('name').fold(0) {a,b -> a + b.length()} //(3)
==>9
gremlin> g.V().values('age').fold(0) {a,b -> a + b} //(4)
==>123
gremlin> g.V().values('age').fold(0, sum) //(5)
==>123
gremlin> g.V().values('age').sum() //(6)
==>123.0
  1. A parameterless fold() will aggregate all the objects into a list and then emit the list.

  2. A verification of the type of list returned.

  3. fold() can be provided two arguments —  a seed value and a reduce bi-function ("vadas" is 5 characters + "josh" with 4 characters).

  4. What is the total age of the people in the graph?

  5. The same as before, but using a built-in bi-function.

  6. The same as before, but using the sum()-step.
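The two forms of fold() correspond closely to collecting into a list and to a seeded reduction. A Python sketch, assuming functools.reduce as the analogue of the bi-function form (the fold helper itself is illustrative):

```python
import operator
from functools import reduce

def fold(stream, seed=None, fn=None):
    """Without fn, collect into a list; with fn, reduce starting from seed."""
    if fn is None:
        return list(stream)
    return reduce(fn, stream, seed)

names = ["vadas", "josh"]
ages = [29, 27, 32, 35]

as_list = fold(names)
name_lengths = fold(names, 0, lambda a, b: a + len(b))
total_age = fold(ages, 0, operator.add)
```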

Group Step

As traversers propagate across a graph as defined by a traversal, sideEffect computations are sometimes required. That is, the actual path taken or the current location of a traverser is not the ultimate output of the computation, but some other representation of the traversal. The group()-step (sideEffect) is one such sideEffect that organizes the objects according to some function of the object. Then, if required, that organization (a list) is reduced. An example is provided below.

gremlin> g.V().group().by(label) //(1)
==>[software:[v[3], v[5]], person:[v[1], v[2], v[4], v[6]]]
gremlin> g.V().group().by(label).by('name') //(2)
==>[software:[lop, ripple], person:[marko, vadas, josh, peter]]
gremlin> g.V().group().by(label).by('name').by(count(local)) //(3)
==>[software:2, person:4]
  1. Group the vertices by their label.

  2. For each vertex in the group, get their name.

  3. For each grouping, what is its size?

The three projection parameters available to group() via by() are:

  1. Key-projection: What feature of the object to group on (a function that yields the map key)?

  2. Value-projection: What feature of the group to store in the key-list?

  3. Reduce-projection: What feature of the key-list to ultimately return?
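The three projections can be modeled in Python as three function arguments. This is a sketch of the idea rather than the TinkerPop implementation; group, key, value, and reducer are illustrative names.

```python
def group(stream, key, value=lambda x: x, reducer=lambda xs: xs):
    """Group by key(x), store value(x) in each key-list, then reduce each list."""
    groups = {}
    for x in stream:
        groups.setdefault(key(x), []).append(value(x))
    return {k: reducer(v) for k, v in groups.items()}

vertices = [("marko", "person"), ("vadas", "person"), ("lop", "software"),
            ("josh", "person"), ("ripple", "software"), ("peter", "person")]

by_label = group(vertices, key=lambda v: v[1])
names_by_label = group(vertices, key=lambda v: v[1], value=lambda v: v[0])
sizes_by_label = group(vertices, key=lambda v: v[1], value=lambda v: v[0], reducer=len)
```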

GroupCount Step

When it is important to know how many times a particular object has been at a particular part of a traversal, groupCount()-step (sideEffect) is used.

"What is the distribution of ages in the graph?"
gremlin> g.V().hasLabel('person').values('age').groupCount()
==>[32:1, 35:1, 27:1, 29:1]
gremlin> g.V().hasLabel('person').groupCount().by('age') //(1)
==>[32:1, 35:1, 27:1, 29:1]
  1. You can also supply a pre-group projection, where the provided by()-modulation determines what to group the incoming object by.

There is one person that is 32, one person that is 35, one person that is 27, and one person that is 29.

"Iteratively walk the graph and count the number of times you see each vertex label."
groupcount step
gremlin> g.V().repeat(both().groupCount('m').by(label)).times(10).cap('m')
==>[software:19598, person:39196]

The above is interesting in that it demonstrates the use of referencing the internal Map<Object,Long> of groupCount() with a string variable. Given that groupCount() is a sideEffect-step, it simply passes the object it received to its output. Internal to groupCount(), the object’s count is incremented.
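The pass-through behavior can be sketched in Python with a generator that increments an external Counter. The counts argument plays the role of the named side-effect map; group_count is a hypothetical helper.

```python
from collections import Counter

def group_count(stream, counts, by=lambda x: x):
    """Increment counts[by(x)] as a side effect and pass x through unchanged."""
    for x in stream:
        counts[by(x)] += 1
        yield x

people = [("marko", 29), ("vadas", 27), ("josh", 32), ("peter", 35)]
m = Counter()
passed_through = list(group_count(people, m, by=lambda p: p[1]))
```

The stream that leaves the step is identical to the one that entered it; only the external map changes.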

Has Step

has step

It is possible to filter vertices, edges, and vertex properties based on their properties using has()-step (filter). There are numerous variations on has() including:

  • has(key,value): Remove the traverser if its element does not have the provided key/value property.

  • has(key,predicate): Remove the traverser if its element does not have a key value that satisfies the bi-predicate.

  • hasLabel(labels...): Remove the traverser if its element does not have any of the labels.

  • hasId(ids...): Remove the traverser if its element does not have any of the ids.

  • hasKey(keys...): Remove the traverser if its property does not have any of the keys.

  • hasValue(values...): Remove the traverser if its property does not have any of the values.

  • has(key): Remove the traverser if its element does not have a value for the key.

  • hasNot(key): Remove the traverser if its element has a value for the key.

  • has(key, traversal): Remove the traverser if its object does not yield a result through the traversal off the property value.

gremlin> g.V().hasLabel('person')
==>v[1]
==>v[2]
==>v[4]
==>v[6]
gremlin> g.V().hasLabel('person').out().has('name',within('vadas','josh'))
==>v[2]
==>v[4]
gremlin> g.V().hasLabel('person').out().has('name',within('vadas','josh')).
               outE().hasLabel('created')
==>e[10][4-created->5]
==>e[11][4-created->3]
gremlin> g.V().has('age',inside(20,30)).values('age') //(1)
==>29
==>27
gremlin> g.V().has('age',outside(20,30)).values('age') //(2)
==>32
==>35
gremlin> g.V().has('name',within('josh','marko')).valueMap() //(3)
==>[name:[marko], age:[29]]
==>[name:[josh], age:[32]]
gremlin> g.V().has('name',without('josh','marko')).valueMap() //(4)
==>[name:[vadas], age:[27]]
==>[name:[lop], lang:[java]]
==>[name:[ripple], lang:[java]]
==>[name:[peter], age:[35]]
gremlin> g.V().has('name',not(within('josh','marko'))).valueMap() //(5)
==>[name:[vadas], age:[27]]
==>[name:[lop], lang:[java]]
==>[name:[ripple], lang:[java]]
==>[name:[peter], age:[35]]
  1. Find all vertices whose ages are strictly between 20 and 30 (both bounds exclusive).

  2. Find all vertices whose ages are less than 20 or greater than 30.

  3. Find all vertices whose names exactly match any name in the collection [josh,marko] and display all the key/value pairs for those vertices.

  4. Find all vertices whose names are not in the collection [josh,marko], display all the key,value pairs for those vertices.

  5. Same as the prior example save using not on within to yield without.
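The predicates used above can be modeled as small Python closures. This sketch assumes that inside() excludes both bounds and that outside() accepts values below the first bound or above the second; the function names mirror the Gremlin predicates but are plain Python, and negate stands in for not().

```python
def inside(low, high):
    return lambda v: low < v < high

def outside(low, high):
    return lambda v: v < low or v > high

def within(*values):
    return lambda v: v in values

def without(*values):
    return lambda v: v not in values

def negate(predicate):
    return lambda v: not predicate(v)

ages = [29, 27, 32, 35]
names = ["marko", "vadas", "lop", "josh", "ripple", "peter"]

ages_inside = [a for a in ages if inside(20, 30)(a)]
ages_outside = [a for a in ages if outside(20, 30)(a)]
names_without = [n for n in names if without("josh", "marko")(n)]
names_not_within = [n for n in names if negate(within("josh", "marko"))(n)]
```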

TinkerPop does not support a regular expression predicate, although specific graph databases that leverage TinkerPop may provide a partial match extension.

Inject Step

inject step

One of the major features of TinkerPop3 is "injectable steps." This makes it possible to insert objects arbitrarily into a traversal stream. The inject()-step (sideEffect) serves this purpose, and a few examples are provided below.

gremlin> g.V(4).out().values('name').inject('daniel')
==>daniel
==>ripple
==>lop
gremlin> g.V(4).out().values('name').inject('daniel').map {it.get().length()}
==>6
==>6
==>3
gremlin> g.V(4).out().values('name').inject('daniel').map {it.get().length()}.path()
==>[daniel, 6]
==>[v[4], v[5], ripple, 6]
==>[v[4], v[3], lop, 3]

In the last example above, note that the path starting with daniel is only of length 2. This is because the daniel string was inserted half-way in the traversal. Finally, a typical use case is provided below — when the start of the traversal is not a graph object.

gremlin> inject(1,2)
==>1
==>2
gremlin> inject(1,2).map {it.get() + 1}
==>2
==>3
gremlin> inject(1,2).map {it.get() + 1}.map {g.V(it.get()).next()}.values('name')
==>vadas
==>lop

Is Step

It is possible to filter scalar values using is()-step (filter).

gremlin> g.V().values('age').is(32)
==>32
gremlin> g.V().values('age').is(lte(30))
==>29
==>27
gremlin> g.V().values('age').is(inside(30, 40))
==>32
==>35
gremlin> g.V().where(__.in('created').count().is(1)).values('name') //(1)
==>ripple
gremlin> g.V().where(__.in('created').count().is(gte(2))).values('name') //(2)
==>lop
gremlin> g.V().where(__.in('created').values('age').
                                    mean().is(inside(30d, 35d))).values('name') //(3)
==>lop
==>ripple
  1. Find projects having exactly one contributor.

  2. Find projects having two or more contributors.

  3. Find projects whose contributors' average age is between 30 and 35.

Limit Step

The limit()-step is analogous to range()-step save that the lower end range is set to 0.

gremlin> g.V().limit(2)
==>v[1]
==>v[2]
gremlin> g.V().range(0, 2)
==>v[1]
==>v[2]
gremlin> g.V().limit(2).toString()
==>[GraphStep([],vertex), RangeGlobalStep(0,2)]

The limit()-step can also be applied with Scope.local, in which case it operates on the incoming collection. The examples below use The Crew toy data set.

gremlin> g.V().valueMap().select('location').limit(local,2) //(1)
==>[san diego, santa cruz]
==>[centreville, dulles]
==>[bremen, baltimore]
==>[spremberg, kaiserslautern]
gremlin> g.V().valueMap().limit(local, 1) //(2)
==>[name:[marko]]
==>[name:[stephen]]
==>[name:[matthias]]
==>[name:[daniel]]
==>[name:[gremlin]]
==>[name:[tinkergraph]]
  1. List<String> for each vertex containing the first two locations.

  2. Map<String, Object> for each vertex, but containing only the first property value.

Local Step

local step

A GraphTraversal operates on a continuous stream of objects. In many situations, it is important to operate on a single element within that stream. To do such object-local traversal computations, local()-step exists (branch). Note that the examples below use The Crew toy data set.

gremlin> g.V().as('person').
               properties('location').order().by('startTime',incr).limit(2).value().as('location').
               select('person','location').by('name').by() //(1)
==>[person:daniel, location:spremberg]
==>[person:stephen, location:centreville]
gremlin> g.V().as('person').
               local(properties('location').order().by('startTime',incr).limit(2)).value().as('location').
               select('person','location').by('name').by() //(2)
==>[person:marko, location:san diego]
==>[person:marko, location:santa cruz]
==>[person:stephen, location:centreville]
==>[person:stephen, location:dulles]
==>[person:matthias, location:bremen]
==>[person:matthias, location:baltimore]
==>[person:daniel, location:spremberg]
==>[person:daniel, location:kaiserslautern]
  1. Get the first two people and their respective location according to the most historic location start time.

  2. For every person, get their two most historic locations.

The two traversals above look nearly identical save the inclusion of local(), which wraps a section of the traversal in an object-local traversal. As such, the order().by() and the limit() refer to a particular object, not to the stream as a whole.
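The stream-wide versus object-local distinction can be sketched in Python. The data below is made up for illustration and is not the actual The Crew data set; earliest_global and earliest_local are hypothetical helpers.

```python
# Illustrative (place, start year) pairs per person; values are invented.
stays = {
    "ann": [("rome", 2004), ("oslo", 1997), ("lima", 2001)],
    "bob": [("kyiv", 1990), ("bern", 2006), ("doha", 2000)],
}

def earliest_global(stays, n):
    """Order and limit over the whole stream: the n earliest stays overall."""
    rows = [(start, person, place)
            for person, visits in stays.items()
            for place, start in visits]
    return [(person, place) for start, person, place in sorted(rows)[:n]]

def earliest_local(stays, n):
    """Order and limit per person: the n earliest stays of each person."""
    result = []
    for person, visits in stays.items():
        for place, start in sorted(visits, key=lambda v: v[1])[:n]:
            result.append((person, place))
    return result

overall = earliest_global(stays, 2)
per_person = earliest_local(stays, 2)
```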

Warning
The anonymous traversal of local() processes the current object "locally." In OLAP, where the atomic unit of computing is the vertex and its local "star graph," it is important that the anonymous traversal does not leave the confines of the vertex’s star graph. In other words, it cannot traverse to an adjacent vertex’s properties or edges.

MapKeys Step

The mapKeys()-step (flatMap) takes an incoming map and emits its keys. This is especially useful when one is only interested in the top N elements in a groupCount() ranking.

gremlin> graph.io(graphml()).readGraph('data/grateful-dead.xml')
==>null
gremlin> g = graph.traversal(standard())
==>graphtraversalsource[tinkergraph[vertices:808 edges:8049], standard]
gremlin> g.V().hasLabel("song").out("followedBy").groupCount().by("name").
               order(local).by(valueDecr).limit(local, 5)
==>[PLAYING IN THE BAND:107, JACK STRAW:99, TRUCKING:94, DRUMS:92, ME AND MY UNCLE:86]
gremlin> g.V().hasLabel("song").out("followedBy").groupCount().by("name").
               order(local).by(valueDecr).limit(local, 5).mapKeys()
==>PLAYING IN THE BAND
==>JACK STRAW
==>TRUCKING
==>DRUMS
==>ME AND MY UNCLE
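The rank-then-extract-keys pattern can be sketched in Python over a plain dictionary of counts. The first five counts are copied from the console output above; the last entry is a hypothetical filler added so that the limit has something to cut off. The helper name top_keys is illustrative.

```python
def top_keys(counts, n):
    """Order the map by descending value, keep n entries, and emit only the keys."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return [key for key, _ in ranked]

counts = {
    "TRUCKING": 94,
    "PLAYING IN THE BAND": 107,
    "DRUMS": 92,
    "JACK STRAW": 99,
    "ME AND MY UNCLE": 86,
    "HYPOTHETICAL SONG": 1,  # filler entry, not from the data set
}

top_five = top_keys(counts, 5)
```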

MapValues Step

The mapValues()-step (flatMap) takes an incoming map and emits its values.

gremlin> graph.io(graphml()).readGraph('data/grateful-dead.xml')
==>null
gremlin> g = graph.traversal(standard())
==>graphtraversalsource[tinkergraph[vertices:808 edges:8049], standard]
gremlin> :set max-iteration 10
gremlin> g.V().hasLabel("song").out("sungBy").groupCount().by("name").next() //(1)
==>All=9
==>Weir_Garcia=1
==>Lesh=19
==>Weir_Kreutzmann=1
==>Pigpen_Garcia=1
==>Pigpen=36
==>Unknown=6
==>Weir_Bralove=1
==>Joan_Baez=10
==>Suzanne_Vega=2
...
gremlin> g.V().hasLabel("song").out("sungBy").groupCount().by("name").mapValues() //(2)
==>9
==>1
==>19
==>1
==>1
==>36
==>6
==>1
==>10
==>2
...
gremlin> g.V().hasLabel("song").out("sungBy").groupCount().by("name").mapValues().groupCount().
               order(local).by(valueDecr).limit(local, 5).next() //(3)
==>1=22
==>2=12
==>3=7
==>4=4
==>6=2
  1. Which artist sang how many songs?

  2. Get an anonymized set of song repertoire sizes.

  3. What are the 5 most common song repertoire sizes?

Match Step

The match()-step (map) provides a more declarative form of graph querying based on the notion of pattern matching. With match(), the user provides a collection of "traversal fragments," called patterns, that have variables defined that must hold true throughout the duration of the match(). When a traverser is in match(), a registered MatchAlgorithm analyzes the current state of the traverser (i.e. its history based on its path data), the runtime statistics of the traversal patterns, and returns a traversal-pattern that the traverser should try next. The default MatchAlgorithm provided is called CountMatchAlgorithm and it dynamically revises the pattern execution plan by sorting the patterns according to their filtering capabilities (i.e. largest set reduction patterns execute first). For very large graphs, where the developer is uncertain of the statistics of the graph (e.g. how many knows-edges vs. worksFor-edges exist in the graph), it is advantageous to use match(), as an optimal plan will be determined automatically. Furthermore, some queries are much easier to express via match() than with single-path traversals.

"Who created a project named 'lop' that was also created by someone who is 29 years old? Return the two creators."
match step
gremlin> g.V().match(
                 __.as('a').out('created').as('b'),
                 __.as('b').has('name', 'lop'),
                 __.as('b').in('created').as('c'),
                 __.as('c').has('age', 29)).
               select('a','c').by('name')
==>[a:marko, c:marko]
==>[a:josh, c:marko]
==>[a:peter, c:marko]

Note that the above can also be written more concisely, as shown below, which demonstrates that standard inner-traversals can be arbitrarily defined.

gremlin> g.V().match(
                 __.as('a').out('created').has('name', 'lop').as('b'),
                 __.as('b').in('created').has('age', 29).as('c')).
               select('a','c').by('name')
==>[a:marko, c:marko]
==>[a:josh, c:marko]
==>[a:peter, c:marko]
grateful dead schema
Figure 4. Grateful Dead

MatchStep brings functionality similar to SPARQL to Gremlin. Like SPARQL, MatchStep conjoins a set of patterns applied to a graph. For example, the following traversal finds exactly those songs which Jerry Garcia has both sung and written (using the Grateful Dead graph distributed in the data/ directory):

gremlin> graph.io(graphml()).readGraph('data/grateful-dead.xml')
==>null
gremlin> g = graph.traversal(standard())
==>graphtraversalsource[tinkergraph[vertices:808 edges:8049], standard]
gremlin> g.V().match(
                 __.as('a').has('name', 'Garcia'),
                 __.as('a').in('writtenBy').as('b'),
                 __.as('a').in('sungBy').as('b')).
               select('b').values('name')
==>CREAM PUFF WAR
==>CRYPTICAL ENVELOPMENT

Among the features which differentiate match() from SPARQL are:

gremlin> g.V().match(
                 __.as('a').out('created').has('name','lop').as('b'), //(1)
                 __.as('b').in('created').has('age', 29).as('c'),
                 __.as('c').repeat(out()).times(2)). //(2)
                   select('c').out('knows').dedup().values('name') //(3)
==>vadas
==>josh
  1. Patterns of arbitrary complexity: match() is not restricted to triple patterns or property paths.

  2. Recursion support: match() supports the branch-based steps within a pattern, including repeat().

  3. Imperative/declarative hybrid: Before and after a match(), it is possible to leverage classic Gremlin traversals.

To extend point #3, it is possible to support going from imperative, to declarative, to imperative, ad infinitum.

gremlin> g.V().match(
                 __.as('a').out('knows').as('b'),
                 __.as('b').out('created').has('name','lop')).
               select('b').out('created').
                   match(
                     __.as('x').in('created').as('y'),
                     __.as('y').out('knows').as('z')).
                   select('z').values('name')
==>vadas
==>josh
Important
The match()-step is stateless. The variable bindings of the traversal patterns are stored in the path history of the traverser. As such, the variables used over all match()-steps within a traversal are globally unique. A benefit of this is that subsequent where(), select(), match(), etc. steps can leverage the same variables in their analysis.

Like all other steps in Gremlin, match() is a function and thus, match() within match() is a natural consequence of Gremlin’s functional foundation (i.e. recursive matching).

gremlin> g.V().match(
                 __.as('a').out('knows').as('b'),
                 __.as('b').out('created').has('name','lop'),
                 __.as('b').match(
                              __.as('b').out('created').as('c'),
                              __.as('c').has('name','ripple')).
                            select('c').as('c')).
               select('a','c').by('name')
==>[a:marko, c:ripple]

If a step-labeled traversal precedes the match()-step and the traverser entering the match() is destined to bind to a particular variable, then the previous step should be labeled accordingly.

gremlin> g.V().as('a').out('knows').as('b').
           match(
             __.as('b').out('created').as('c'),
             __.not(__.as('c').in('created').as('a'))).
           select('a','b','c').by('name')
==>[a:marko, b:josh, c:ripple]

There are three types of match() traversal patterns.

  1. as('a')...as('b'): both the start and end of the traversal have a declared variable.

  2. as('a')...: only the start of the traversal has a declared variable.

  3. ...: there are no declared variables.

If a variable is at the start of a traversal pattern, it must exist as a label in the path history of the traverser, else the traverser cannot go down that path. If a variable is at the end of a traversal pattern and the variable exists in the path history of the traverser, the traverser’s current location must match (i.e. equal) its historic location at that same label. However, if the variable does not exist in the path history of the traverser, then the current location is labeled as the variable and thus, becomes a bound variable for subsequent traversal patterns. If a traversal pattern does not have an end label, then the traverser must simply "survive" the pattern (i.e. not be filtered) to continue to the next pattern. If a traversal pattern does not have a start label, then the traverser can go down that path at any point, but will only go down that pattern once, as a traversal pattern is executed once and only once for the history of the traverser. Typically, traversal patterns that do not have a start and end label are used in conjunction with and(), or(), and where(). Once the traverser has "survived" all the patterns (or at least one for or()), match()-step analyzes the traverser’s path history and emits a Map<String,Object> of the variable bindings to the next step in the traversal.

gremlin> g.V().as('a').out().as('b'). //(1)
             match( //(2)
               __.as('a').out().count().as('c'), //(3)
               __.not(__.as('a').in().as('b')), //(4)
               or( //(5)
                 __.as('a').out('knows').as('b'),
                 __.as('b').in().count().as('c').and().as('c').is(gt(2)))). //(6)
             dedup('a','c'). //(7)
             select('a','b','c').by('name').by('name').by() //(8)
==>[a:marko, b:lop, c:3]
  1. A standard, step-labeled traversal can come prior to match().

  2. If the traverser’s path prior to entering match() has requisite label values, then those historic values are bound.

  3. It is possible to use barrier steps though they are computed locally to the pattern (as one would expect).

  4. It is possible to not() a pattern.

  5. It is possible to nest and()- and or()-steps for conjunction matching.

  6. Both infix and prefix conjunction notation is supported.

  7. It is possible to "distinct" the specified label combination.

  8. The bound values are of different types — vertex ("a"), vertex ("b"), long ("c").
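
The start-label and end-label rules described above can be summarised in a few lines. This is a rough Python sketch of the rules only, not of MatchStep itself: the function names can_start and bind_end_label are hypothetical, and the path history is reduced to a plain dictionary of variable bindings.

```python
def can_start(bindings, start_label):
    """A pattern starting with as(start_label) can run only if that label is already bound."""
    return start_label in bindings


def bind_end_label(bindings, end_label, current):
    """Apply the end-label rule.

    Returns the (possibly extended) bindings, or None if the traverser is filtered out.
    """
    if end_label in bindings:
        # Already bound: the current location must equal the historic location.
        return bindings if bindings[end_label] == current else None
    # Not yet bound: the current location becomes the binding for later patterns.
    extended = dict(bindings)
    extended[end_label] = current
    return extended


history = {"a": "marko"}
after_first = bind_end_label(history, "b", "lop")        # binds b
same_again = bind_end_label(after_first, "b", "lop")     # b matches, traverser survives
conflict = bind_end_label(after_first, "b", "ripple")    # b differs, traverser is filtered
print(after_first, same_again, conflict)
```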

Using Where with Match

Match is typically used in conjunction with both select() (demonstrated previously) and where() (presented here). A where()-step allows the user to further constrain the result set provided by match().

gremlin> g.V().match(
                 __.as('a').out('created').as('b'),
                 __.as('b').in('created').as('c')).
                 where('a', neq('c')).
               select('a','c').by('name')
==>[a:marko, c:josh]
==>[a:marko, c:peter]
==>[a:josh, c:marko]
==>[a:josh, c:peter]
==>[a:peter, c:marko]
==>[a:peter, c:josh]

The where()-step can take either a P-predicate (example above) or a Traversal (example below). Using MatchPredicateStrategy, where()-clauses are automatically folded into match() and thus, subject to the query optimizer within match()-step.

gremlin> traversal = g.V().match(
                             __.as('a').has(label,'person'), //(1)
                             __.as('a').out('created').as('b'),
                             __.as('b').in('created').as('c')).
                             where(__.as('a').out('knows').as('c')). //(2)
                           select('a','c').by('name'); null //(3)
==>null
gremlin> traversal.toString() //(4)
==>[GraphStep([],vertex), MatchStep(AND,[[MatchStartStep(a), HasStep([~label.eq(person)]), MatchEndStep], [MatchStartStep(a), VertexStep(OUT,[created],vertex), MatchEndStep(b)], [MatchStartStep(b), VertexStep(IN,[created],vertex), MatchEndStep(c)]]), WhereTraversalStep([WhereStartStep(a), VertexStep(OUT,[knows],vertex), WhereEndStep(c)]), SelectStep([a, c],[value(name)])]
gremlin> traversal // (5) (6)
==>[a:marko, c:josh]
gremlin> traversal.toString() //(7)
==>[TinkerGraphStep(vertex,[~label.eq(person)])@[a], MatchStep(AND,[[MatchStartStep(a), VertexStep(OUT,[created],vertex), MatchEndStep(b)], [MatchStartStep(b), VertexStep(IN,[created],vertex), MatchEndStep(c)], [MatchStartStep(a), WhereTraversalStep([WhereStartStep, VertexStep(OUT,[knows],vertex), WhereEndStep(c)]), MatchEndStep]]), SelectStep([a, c],[value(name)])]
  1. Any has()-step traversal patterns that start with the match-key are pulled out of match() to enable the vendor to leverage the filter for index lookups.

  2. A where()-step with a traversal containing variable bindings declared in match().

  3. A useful trick to ensure that the traversal is not iterated by Gremlin Console.

  4. The string representation of the traversal prior to its strategies being applied.

  5. The Gremlin Console will automatically iterate anything that is an iterator or is iterable.

  6. Both marko and josh are co-developers and marko knows josh.

  7. The string representation of the traversal after the strategies have been applied (and thus, where() is folded into match()).

Important
A where()-step is a filter and thus, variables within a where() clause are not globally bound to the path of the traverser in match(). As such, where()-steps in match() are used for filtering, not binding.

Max Step

The max()-step (map) operates on a stream of numbers and determines which is the largest number in the stream.

gremlin> g.V().values('age').max()
==>35
gremlin> g.V().repeat(both()).times(3).values('age').max()
==>35
Important
max(local) determines the max of the current, local object (not the objects in the traversal stream). This works for Collection and Number-type objects. For any other object, a max of Double.NaN is returned.

Mean Step

The mean()-step (map) operates on a stream of numbers and determines the average of those numbers.

gremlin> g.V().values('age').mean()
==>30.75
gremlin> g.V().repeat(both()).times(3).values('age').mean() //(1)
==>30.645833333333332
gremlin> g.V().repeat(both()).times(3).values('age').dedup().mean()
==>30.75
  1. Realize that traversers are being bulked by repeat(). There may be more of a particular number than another, thus altering the average.

Important
mean(local) determines the mean of the current, local object (not the objects in the traversal stream). This works for Collection and Number-type objects. For any other object, a mean of Double.NaN is returned.
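
The effect of bulking on the average can be reproduced with a weighted mean. The sketch below is plain Python, not TinkerPop code. The four ages are those of the toy graph, but the bulk values are made up for illustration and are not the bulks that repeat(both()).times(3) actually produces.

```python
def bulked_mean(traversers):
    """Mean over (value, bulk) pairs, where bulk is how many traversers the pair represents."""
    total = sum(value * bulk for value, bulk in traversers)
    count = sum(bulk for _value, bulk in traversers)
    return total / count


# (age, bulk): ages from the toy graph, bulks are hypothetical.
bulked = [(29, 3), (27, 1), (32, 2), (35, 2)]

with_bulk = bulked_mean(bulked)
# dedup() leaves one traverser per distinct value, i.e. a bulk of 1 each.
deduplicated = bulked_mean([(age, 1) for age, _bulk in bulked])
print(with_bulk, deduplicated)
```

With these assumed bulks the weighted result differs from the deduplicated one, which is the same effect seen between the second and third console examples above.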

Min Step

The min()-step (map) operates on a stream of numbers and determines which is the smallest number in the stream.

gremlin> g.V().values('age').min()
==>27
gremlin> g.V().repeat(both()).times(3).values('age').min()
==>27
Important
min(local) determines the min of the current, local object (not the objects in the traversal stream). This works for Collection and Number-type objects. For any other object, a min of Double.NaN is returned.

Or Step

The or()-step (filter) ensures that at least one of the provided traversals yields a result. Please see and() for and-semantics.

gremlin> g.V().or(
            __.outE('created'),
            __.inE('created').count().is(gt(1))).
              values('name')
==>marko
==>lop
==>josh
==>peter

The or()-step can take an arbitrary number of traversals. At least one of the traversals must produce at least one output for the original traverser to pass to the next step.

An infix notation can be used as well, though with infix notation only two traversals can be or’d together.

gremlin> g.V().where(outE('created').or().outE('knows')).values('name')
==>marko
==>josh
==>peter

Order Step

When the objects of the traversal stream need to be sorted, order()-step (map) can be leveraged.

gremlin> g.V().values('name').order()
==>josh
==>lop
==>marko
==>peter
==>ripple
==>vadas
gremlin> g.V().values('name').order().by(decr)
==>vadas
==>ripple
==>peter
==>marko
==>lop
==>josh
gremlin> g.V().hasLabel('person').order().by('age', incr).values('name')
==>vadas
==>marko
==>josh
==>peter

One of the most traversed objects in a traversal is an Element. An element can have properties associated with it (i.e. key/value pairs). In many situations, it is desirable to sort an element traversal stream according to a comparison of their properties.

gremlin> g.V().values('name')
==>marko
==>vadas
==>lop
==>josh
==>ripple
==>peter
gremlin> g.V().order().by('name',incr).values('name')
==>josh
==>lop
==>marko
==>peter
==>ripple
==>vadas
gremlin> g.V().order().by('name',decr).values('name')
==>vadas
==>ripple
==>peter
==>marko
==>lop
==>josh

The order()-step allows the user to provide an arbitrary number of comparators for primary, secondary, etc. sorting. In the example below, the primary ordering is based on the outgoing created-edge count. The secondary ordering is based on the age of the person.

gremlin> g.V().hasLabel('person').order().by(outE('created').count(), incr).
                                          by('age', incr).values('name')
==>vadas
==>marko
==>peter
==>josh
gremlin> g.V().hasLabel('person').order().by(outE('created').count(), incr).
                                          by('age', decr).values('name')
==>vadas
==>peter
==>marko
==>josh
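
Primary and secondary ordering corresponds to sorting on a composite key. The following Python sketch mirrors the two traversals above under the assumption that each person is reduced to a (name, created-edge count, age) tuple; the tuple layout and the variable names are illustrative only.

```python
# (name, number of outgoing created-edges, age) for the four person vertices of the toy graph.
people = [
    ("marko", 1, 29),
    ("vadas", 0, 27),
    ("josh", 2, 32),
    ("peter", 1, 35),
]

# by(outE('created').count(), incr).by('age', incr)
created_then_age_incr = [p[0] for p in sorted(people, key=lambda p: (p[1], p[2]))]

# by(outE('created').count(), incr).by('age', decr): negate the secondary key to reverse it
created_then_age_decr = [p[0] for p in sorted(people, key=lambda p: (p[1], -p[2]))]

print(created_then_age_incr)
print(created_then_age_decr)
```

Only marko and peter tie on the primary key, so they are the only pair whose relative order changes when the secondary comparator flips.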

Randomizing the order of the traversers at a particular point in the traversal is possible with Order.shuffle.

gremlin> g.V().hasLabel('person').order().by(shuffle)
==>v[1]
==>v[6]
==>v[2]
==>v[4]
gremlin> g.V().hasLabel('person').order().by(shuffle)
==>v[1]
==>v[2]
==>v[4]
==>v[6]
Important
order(local) orders the current, local object (not the objects in the traversal stream). This works for Collection- and Map-type objects. For any other object, the object is returned unchanged.

Path Step

A traverser is transformed as it moves through a series of steps within a traversal. The history of the traverser is realized by examining its path with path()-step (map).

path step
gremlin> g.V().out().out().values('name')
==>ripple
==>lop
gremlin> g.V().out().out().values('name').path()
==>[v[1], v[4], v[5], ripple]
==>[v[1], v[4], v[3], lop]

If edges are required in the path, then be sure to traverse those edges explicitly.

gremlin> g.V().outE().inV().outE().inV().path()
==>[v[1], e[8][1-knows->4], v[4], e[10][4-created->5], v[5]]
==>[v[1], e[8][1-knows->4], v[4], e[11][4-created->3], v[3]]

It is possible to post-process the elements of the path in a round-robin fashion via by().

gremlin> g.V().out().out().path().by('name').by('age')
==>[marko, 32, ripple]
==>[marko, 32, lop]

Finally, because by()-based post-processing accepts a traversal, nothing prevents triggering yet another traversal. In the traversal below, for each element of the path traversed thus far, if it is a person (as determined by its person label), then get all of their creations, else if it is a creation, get all the people that created it.

gremlin> g.V().out().out().path().by(
                            choose(hasLabel('person'),
                                          out('created').values('name'),
                                          __.in('created').values('name')).fold())
==>[[lop], [ripple, lop], [josh]]
==>[[lop], [ripple, lop], [marko, josh, peter]]
Warning
Generating path information is expensive as the history of the traverser is stored into a Java list. With numerous traversers, there are numerous lists. Moreover, in an OLAP GraphComputer environment this becomes exceedingly prohibitive as there are traversers emanating from all vertices in the graph in parallel. In OLAP there are optimizations provided for traverser populations, but when paths are calculated (and each traverser is unique due to its history), then these optimizations are no longer possible.

Path Data Structure

The Path data structure is an ordered list of objects, where each object is associated to a Set<String> of labels. An example is presented below to demonstrate both the Path API as well as how a traversal yields labeled paths.

path data structure
gremlin> path = g.V(1).as('a').has('name').as('b').
                       out('knows').out('created').as('c').
                       has('name','ripple').values('name').as('d').
                       identity().as('e').path().next()
==>[v[1], v[4], v[5], ripple]
gremlin> path.size()
==>4
gremlin> path.objects()
==>v[1]
==>v[4]
==>v[5]
==>ripple
gremlin> path.labels()
==>[a, b]
==>[]
==>[c]
==>[d, e]
gremlin> path.a
==>v[1]
gremlin> path.b
==>v[1]
gremlin> path.c
==>v[5]
gremlin> path.d == path.e
==>true
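
The console session above can be read as operations on an ordered list of objects, each carrying a set of labels. The following Python class is a simplified model written for illustration and is not the TinkerPop Path API: the method names extend, label_current, and get are assumptions, and get returns only the first object carrying a label.

```python
class Path:
    """Ordered list of objects, each associated with a set of string labels."""

    def __init__(self):
        self._objects = []
        self._labels = []

    def extend(self, obj, *labels):
        """Append a new object (a step that moves the traverser) with optional labels."""
        self._objects.append(obj)
        self._labels.append(set(labels))
        return self

    def label_current(self, *labels):
        """Attach more labels to the last object (a labeled step that does not move the traverser)."""
        self._labels[-1].update(labels)
        return self

    def size(self):
        return len(self._objects)

    def objects(self):
        return list(self._objects)

    def labels(self):
        return [set(labels) for labels in self._labels]

    def get(self, label):
        for obj, labels in zip(self._objects, self._labels):
            if label in labels:
                return obj
        raise KeyError(label)


path = (Path()
        .extend("v[1]", "a")       # V(1).as('a')
        .label_current("b")        # has('name').as('b') labels the same object
        .extend("v[4]")            # out('knows'), unlabeled
        .extend("v[5]", "c")       # out('created').as('c')
        .extend("ripple", "d")     # values('name').as('d')
        .label_current("e"))       # identity().as('e') labels the same object

print(path.size(), path.objects(), path.labels())
```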

Profile Step

The profile()-step (sideEffect) exists to allow developers to profile their traversals to determine statistical information like step runtime, counts, etc.

Warning
Profiling a Traversal will impede the Traversal’s performance. This overhead is mostly excluded from the profile results, but durations are not exact. Thus, durations are best considered in relation to each other.
gremlin> g.V().out('created').repeat(both()).times(3).hasLabel('person').values('age').sum().profile().cap(TraversalMetrics.METRICS_KEY)
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
TinkerGraphStep([],vertex)                                             6           6           0.112     3.66
VertexStep(OUT,[created],vertex)                                       4           4           0.032     1.07
RepeatStep([VertexStep(BOTH,vertex), ProfileSte...                    58          40           0.701    22.75
  VertexStep(BOTH,vertex)                                             92          74           0.146
  RepeatEndStep                                                       58          40           0.267
HasStep([~label.eq(person)])                                          48          30           0.166     5.39
PropertiesStep([age],value)                                           48          30           0.850    27.59
SumGlobalStep                                                          1           1           0.315    10.25
SideEffectCapStep([~metrics])                                          1           1           0.902    29.29
                                            >TOTAL                     -           -           3.082        -

The profile()-step generates a TraversalMetrics sideEffect object that contains the following information:

  • Step: A step within the traversal being profiled.

  • Count: The number of represented traversers that passed through the step.

  • Traversers: The number of traversers that passed through the step.

  • Time (ms): The total time the step was actively executing its behavior.

  • % Dur: The percentage of total time spent in the step.

It is important to understand the difference between Count and Traversers. Traversers can be merged and as such, when two traversers are "the same" they may be aggregated into a single traverser. That new traverser has a Traverser.bulk() that is the sum of the two merged traverser bulks. On the other hand, the Count represents the sum of all Traverser.bulk() results and thus, expresses the number of "represented" (not enumerated) traversers. Traversers will always be less than or equal to Count.
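
The relationship between the two columns can be illustrated with a small counting model. This Python sketch assumes that two traversers are "the same" when they sit at the same location, which is a simplification, and the list of locations is made up for illustration.

```python
from collections import Counter

# Five traversers, identified here only by their current location (hypothetical data).
locations = ["v[1]", "v[3]", "v[1]", "v[1]", "v[4]"]

# Merging equal traversers: each distinct location keeps one traverser whose bulk is the sum.
bulks = Counter(locations)

traversers = len(bulks)        # enumerated traversers after merging
count = sum(bulks.values())    # represented traversers, the sum of all bulks

print(traversers, count)
```

Because every bulk is at least 1, the number of enumerated traversers can never exceed the count.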

Range Step

As traversers propagate through the traversal, it is possible to only allow a certain number of them to pass through with range()-step (filter). While the low end of the range has not been reached, objects are iterated but not emitted. Within the range (low inclusive, high exclusive), traversers are emitted. Finally, once the high end is reached, the traversal breaks out of iteration.

gremlin> g.V().range(0,3)
==>v[1]
==>v[2]
==>v[3]
gremlin> g.V().range(1,3)
==>v[2]
==>v[3]
gremlin> g.V().repeat(both()).times(1000000).emit().range(6,10)
==>v[1]
==>v[5]
==>v[3]
==>v[1]

The range()-step can also be applied with Scope.local, in which case it operates on the incoming collection. For example, it is possible to produce a Map<String, String> for each traversed path, but containing only the second property value (the "b" step).

gremlin> g.V().as('a').out().as('b').in().as('c').select('a','b','c').by('name').range(local,1,2)
==>[b:lop]
==>[b:lop]
==>[b:lop]
==>[b:vadas]
==>[b:josh]
==>[b:ripple]
==>[b:lop]
==>[b:lop]
==>[b:lop]
==>[b:lop]
==>[b:lop]
==>[b:lop]

The next example uses The Crew toy data set. It produces a List<String> containing the second and third location for each vertex.

gremlin> g.V().valueMap().select('location').range(local, 1, 3)
==>[santa cruz, brussels]
==>[dulles, purcellville]
==>[baltimore, oakland]
==>[kaiserslautern, aachen]

Repeat Step

The repeat()-step (branch) is used for looping over a traversal given some break predicate. Below are some examples of repeat()-step in action.

gremlin> g.V(1).repeat(out()).times(2).path().by('name') //(1)
==>[marko, josh, ripple]
==>[marko, josh, lop]
gremlin> g.V().until(has('name','ripple')).
               repeat(out()).path().by('name') //(2)
==>[marko, josh, ripple]
==>[josh, ripple]
==>[ripple]
  1. do-while semantics stating to do out() 2 times.

  2. while-do semantics stating to break if the traverser is at a vertex named "ripple".

Important
There are two modulators for repeat(): until() and emit(). If until() comes after repeat() it is do/while looping. If until() comes before repeat() it is while/do looping. If emit() is placed after repeat(), it is evaluated on the traversers leaving the repeat-traversal. If emit() is placed before repeat(), it is evaluated on the traversers prior to entering the repeat-traversal.
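
The difference between the two until() placements can be modelled on the toy graph's outgoing edges. The following is a simplified Python sketch, not the RepeatStep implementation: the function repeat_until and its check_first flag are hypothetical, the loop body is fixed to out(), and results are produced breadth-first, so their order may differ from the console.

```python
# Outgoing adjacency of the toy graph, by vertex name.
OUT = {
    "marko": ["lop", "vadas", "josh"],
    "vadas": [],
    "lop": [],
    "josh": ["ripple", "lop"],
    "ripple": [],
    "peter": ["lop"],
}


def repeat_until(starts, until, check_first):
    """Return the paths of traversers that leave the loop.

    check_first=True  models until(...).repeat(out())  (while/do)
    check_first=False models repeat(out()).until(...)  (do/while)
    """
    results = []
    frontier = [[start] for start in starts]
    first_pass = True
    while frontier:
        next_frontier = []
        for path in frontier:
            # do/while skips the test before the very first execution of the body
            if (check_first or not first_pass) and until(path[-1]):
                results.append(path)
                continue
            for neighbor in OUT[path[-1]]:      # the repeated body: out()
                next_frontier.append(path + [neighbor])
        frontier = next_frontier
        first_pass = False
    return results


def is_ripple(name):
    return name == "ripple"


while_do = repeat_until(list(OUT), is_ripple, check_first=True)
do_while = repeat_until(list(OUT), is_ripple, check_first=False)
print(while_do)
print(do_while)
```

In this model the one-vertex path for ripple appears only with while/do semantics, because do/while makes the traverser take out() once before the predicate is tested, and ripple has no outgoing edges.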

The repeat()-step also supports an "emit predicate", where the predicate for an empty argument emit() is true (i.e. emit() == emit{true}). With emit(), the traverser is split in two — the traverser exits the code block as well as continues back within the code block (assuming until() holds true).

gremlin> g.V(1).repeat(out()).times(2).emit().path().by('name') //(1)
==>[marko, lop]
==>[marko, vadas]
==>[marko, josh]
==>[marko, josh, ripple]
==>[marko, josh, lop]
gremlin> g.V(1).emit().repeat(out()).times(2).path().by('name') //(2)
==>[marko]
==>[marko, lop]
==>[marko, vadas]
==>[marko, josh]
==>[marko, josh, ripple]
==>[marko, josh, lop]
  1. The emit() comes after repeat() and thus, emission happens after the repeat() traversal is executed. Thus, no one-vertex paths exist.

  2. The emit() comes before repeat() and thus, emission happens prior to the repeat() traversal being executed. Thus, one-vertex paths exist.

The emit()-modulator can take an arbitrary predicate.

gremlin> g.V(1).repeat(out()).times(2).emit(has('lang')).path().by('name')
==>[marko, lop]
==>[marko, josh, ripple]
==>[marko, josh, lop]
repeat step
gremlin> g.V(1).repeat(out()).times(2).emit().path().by('name')
==>[marko, lop]
==>[marko, vadas]
==>[marko, josh]
==>[marko, josh, ripple]
==>[marko, josh, lop]

The first time through the repeat(), the vertices lop, vadas, and josh are seen. Given that loops==0, the traverser repeats. However, because the emit-predicate is declared true, those vertices are emitted. At step 2 (loops==1), the vertices traversed are ripple and lop (Josh’s created projects, as lop and vadas have no out edges) and are also emitted. Now loops==1 so the traverser repeats. As ripple and lop have no out edges there are no vertices to traverse. Given that loops==2, the until-predicate fails. Therefore, the traverser has seen the vertices: lop, vadas, josh, ripple, and lop.

Finally, note that both emit() and until() can take a traversal and, in such situations, the predicate is determined by traversal.hasNext(). A few examples are provided below.

gremlin> g.V(1).repeat(out()).until(hasLabel('software')).path().by('name') //(1)
==>[marko, lop]
==>[marko, josh, ripple]
==>[marko, josh, lop]
gremlin> g.V(1).emit(hasLabel('person')).repeat(out()).path().by('name') //(2)
==>[marko]
==>[marko, vadas]
==>[marko, josh]
gremlin> g.V(1).repeat(out()).until(outE().count().is(0)).path().by('name') //(3)
==>[marko, lop]
==>[marko, vadas]
==>[marko, josh, ripple]
==>[marko, josh, lop]
  1. Starting from vertex 1, keep taking outgoing edges until a software vertex is reached.

  2. Starting from vertex 1, and in an infinite loop, emit the vertex if it is a person and then traverse the outgoing edges.

  3. Starting from vertex 1, keep taking outgoing edges until a vertex is reached that has no more outgoing edges.

Warning
The anonymous traversals of emit() and until() (not repeat()) process their current objects "locally." In OLAP, where the atomic unit of computing is the vertex and its local "star graph," it is important that the anonymous traversals do not leave the confines of the vertex’s star graph. In other words, they cannot traverse to an adjacent vertex’s properties or edges.

Sack Step

A traverser can contain a local data structure called a "sack". The sack()-step is used to read and write sacks (sideEffect or map). Each sack of each traverser is created when using GraphTraversal.withSack(initialValueSupplier,splitOperator?).

  • Initial value supplier: A Supplier providing the initial value of each traverser’s sack.

  • Split operator: a UnaryOperator that clones the traverser’s sack when the traverser splits. If no split operator is provided, then UnaryOperator.identity() is assumed.

Two trivial examples are presented below to demonstrate the initial value supplier. In the first example below, a traverser is created at each vertex in the graph (g.V()), with a 1.0 sack (withSack(1.0f)), and then the sack value is accessed (sack()). In the second example, a random float supplier is used to generate sack values.

gremlin> g.withSack(1.0f).V().sack()
==>1.0
==>1.0
==>1.0
==>1.0
==>1.0
==>1.0
gremlin> rand = new Random()
==>java.util.Random@2431050d
gremlin> g.withSack {rand.nextFloat()}.V().sack()
==>0.30117
==>0.8536626
==>0.88174033
==>0.09812951
==>0.6718441
==>0.8848922

A more complicated initial value supplier example is presented below where the sack values are used in a running computation and then emitted at the end of the traversal. When an edge is traversed, the edge weight is multiplied by the sack value (sack(mult,'weight')).

gremlin> g.withSack(1.0f).V().repeat(outE().sack(mult,'weight').inV()).times(2)
==>v[5]
==>v[3]
gremlin> g.withSack(1.0f).V().repeat(outE().sack(mult,'weight').inV()).times(2).sack()
==>1.0
==>0.4
gremlin> g.withSack(1.0f).V().repeat(outE().sack(mult,'weight').inV()).times(2).path().
               by().by('weight')
==>[v[1], 1.0, v[4], 1.0, v[5]]
==>[v[1], 1.0, v[4], 0.4, v[3]]

When complex objects are used (i.e. non-primitives), then a split operator should be defined to ensure that each traverser gets a clone of its parent’s sack. The first example does not use a split operator and as such, the same map is propagated to all traversers (a global data structure). The second example demonstrates how Map.clone() ensures that each traverser has its own local sack.

gremlin> g.withSack {[:]}.V().out().out().
               sack {m,v -> m[v.value('name')] = v.value('lang'); m}.sack() // BAD: single map
==>[ripple:java]
==>[ripple:java, lop:java]
gremlin> g.withSack {[:]}{it.clone()}.V().out().out().
               sack {m,v -> m[v.value('name')] = v.value('lang'); m}.sack() // GOOD: cloned map
==>[ripple:java]
==>[lop:java]
Note
For primitives (i.e. integers, longs, floats, etc.), a split operator is not required as primitives are encoded in the memory address of the sack, not as a reference to an object.
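
The difference between an identity split and a cloning split can be shown with ordinary dictionaries. This is a Python analogy, not TinkerPop code: the function run and its split parameter are hypothetical stand-ins for the split operator, and the two entries are the name/lang pairs from the console example above.

```python
def run(split):
    """Split one parent sack into two traversers; each traverser writes one entry to its sack."""
    parent = {}
    sacks = []
    for name, lang in [("ripple", "java"), ("lop", "java")]:
        sack = split(parent)      # what the split operator hands to the child traverser
        sack[name] = lang         # the traverser updates its sack
        sacks.append(sack)
    return sacks


shared = run(lambda sack: sack)         # identity split: every traverser holds the same map
cloned = run(lambda sack: dict(sack))   # cloning split: every traverser holds its own copy

print(shared)
print(cloned)
```

In this model both shared sacks show both entries once all writes have happened. The first console line of the BAD example shows a single entry, presumably because it was printed before the second traverser wrote to the shared map.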

Sample Step

The sample()-step is useful for sampling some number of traversers from earlier in the traversal.

gremlin> g.V().outE().sample(1).values('weight')
==>0.5
gremlin> g.V().outE().sample(1).by('weight').values('weight')
==>1.0
gremlin> g.V().outE().sample(2).by('weight').values('weight')
==>1.0
==>1.0

One of the more interesting use cases for sample() is when it is used in conjunction with local(). The combination of the two steps supports the execution of random walks. In the example below, the traversal starts at vertex 1 and selects one edge to traverse based on a probability distribution generated by the weights of the edges. The output is always a single path: because only a single edge is selected at each step, the traverser never splits and continues down a single path in the graph.

gremlin> g.V(1).repeat(local(
                  bothE().sample(1).by('weight').otherV()
                )).times(5)
==>v[1]
gremlin> g.V(1).repeat(local(
                  bothE().sample(1).by('weight').otherV()
                )).times(5).path()
==>[v[1], e[9][1-created->3], v[3], e[9][1-created->3], v[1], e[8][1-knows->4], v[4], e[8][1-knows->4], v[1], e[8][1-knows->4], v[4]]
gremlin> g.V(1).repeat(local(
                  bothE().sample(1).by('weight').otherV()
                )).times(10).path()
==>[v[1], e[7][1-knows->2], v[2], e[7][1-knows->2], v[1], e[8][1-knows->4], v[4], e[10][4-created->5], v[5], e[10][4-created->5], v[4], e[11][4-created->3], v[3], e[9][1-created->3], v[1], e[7][1-knows->2], v[2], e[7][1-knows->2], v[1], e[7][1-knows->2], v[2]]
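
A weighted random walk of this kind can be sketched in a few lines of Python. The sketch assumes that an edge is chosen with probability proportional to its weight, which may not match the sampling algorithm of sample() exactly. The edge list holds the ids, endpoints, and weights of the toy graph; the function name random_walk and the seed are arbitrary choices.

```python
import random

# (edge id, out-vertex id, in-vertex id, weight) for the toy graph.
EDGES = [
    (7, 1, 2, 0.5),
    (8, 1, 4, 1.0),
    (9, 1, 3, 0.4),
    (10, 4, 5, 1.0),
    (11, 4, 3, 0.4),
    (12, 6, 3, 0.2),
]


def random_walk(start, steps, rng):
    """Walk `steps` edges from `start`, ignoring edge direction, and return the visited vertex ids."""
    path = [start]
    current = start
    for _ in range(steps):
        # bothE(): every edge touching the current vertex
        incident = [edge for edge in EDGES if current in (edge[1], edge[2])]
        # sample(1).by('weight'): pick one edge, weighted by its weight property
        edge = rng.choices(incident, weights=[e[3] for e in incident], k=1)[0]
        # otherV(): move to the endpoint that is not the current vertex
        current = edge[2] if edge[1] == current else edge[1]
        path.append(current)
    return path


walk = random_walk(1, 5, random.Random(42))
print(walk)
```

Since exactly one edge is chosen per step, the walk yields one path of steps + 1 vertices.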

Select Step

Functional languages make use of function composition and lazy evaluation to create complex computations from primitive operations. This is exactly what Traversal does. One of the differentiating aspects of Gremlin’s data flow approach to graph processing is that the flow need not always go "forward," but in fact, can go back to a previously seen area of computation. Examples include path() as well as the select()-step (map). There are two general ways to use select()-step.

  1. Select labeled steps within a path (as defined by as() in a traversal).

  2. Select objects out of a Map<String,Object> flow (i.e. a sub-map).

The first use case is demonstrated via example below.

gremlin> g.V().as('a').out().as('b').out().as('c') // no select
==>v[5]
==>v[3]
gremlin> g.V().as('a').out().as('b').out().as('c').select('a','b','c')
==>[a:v[1], b:v[4], c:v[5]]
==>[a:v[1], b:v[4], c:v[3]]
gremlin> g.V().as('a').out().as('b').out().as('c').select('a','b')
==>[a:v[1], b:v[4]]
==>[a:v[1], b:v[4]]
gremlin> g.V().as('a').out().as('b').out().as('c').select('a','b').by('name')
==>[a:marko, b:josh]
==>[a:marko, b:josh]
gremlin> g.V().as('a').out().as('b').out().as('c').select('a') //(1)
==>v[1]
==>v[1]
  1. If the selection is one step, no map is returned.

When there is only one label selected, then a single object is returned. This is useful for stepping back in a computation and easily moving forward again on the object reverted to.

gremlin> g.V().out().out()
==>v[5]
==>v[3]
gremlin> g.V().out().out().path()
==>[v[1], v[4], v[5]]
==>[v[1], v[4], v[3]]
gremlin> g.V().as('x').out().out().select('x')
==>v[1]
==>v[1]
gremlin> g.V().out().as('x').out().select('x')
==>v[4]
==>v[4]
gremlin> g.V().out().out().as('x').select('x') // pointless
==>v[5]
==>v[3]
Note
When executing a traversal with select() on a standard traversal engine (i.e. OLTP), select() will do its best to avoid calculating the path history and instead, will rely on a global data structure for storing the currently selected object. As such, if only a subset of the path walked is required, select() should be used over the more resource intensive path()-step.

Using Where with Select

Finally, as with match()-step, it is possible to use where() with select(), since where() is a filter that processes Map<String,Object> streams.

gremlin> g.V().as('a').out('created').in('created').as('b').select('a','b').by('name') //(1)
==>[a:marko, b:marko]
==>[a:marko, b:josh]
==>[a:marko, b:peter]
==>[a:josh, b:josh]
==>[a:josh, b:marko]
==>[a:josh, b:josh]
==>[a:josh, b:peter]
==>[a:peter, b:marko]
==>[a:peter, b:josh]
==>[a:peter, b:peter]
gremlin> g.V().as('a').out('created').in('created').as('b').
               select('a','b').by('name').where('a',neq('b')) //(2)
==>[a:marko, b:josh]
==>[a:marko, b:peter]
==>[a:josh, b:marko]
==>[a:josh, b:peter]
==>[a:peter, b:marko]
==>[a:peter, b:josh]
gremlin> g.V().as('a').out('created').in('created').as('b').
               select('a','b'). //(3)
               where('a',neq('b')).
               where(__.as('a').out('knows').as('b')).
               select('a','b').by('name')
==>[a:marko, b:josh]
  1. A standard select() that generates a Map<String,Object> of variables bindings in the path (i.e. a and b) for the sake of a running example.

  2. The select().by('name') projects each bound vertex to its name property value, and where() ensures that the respective a and b strings are not the same.

  3. The first select() projects a vertex binding set. A binding is filtered out if the a vertex equals the b vertex. A binding is also filtered out if a doesn’t know b. The second and final select() projects the names of the vertices.

SimplePath Step

simplepath step

When it is important that a traverser not repeat its path through the graph, simplePath()-step should be used (filter). The path information of the traverser is analyzed and if the path has repeated objects in it, the traverser is filtered. If cyclic behavior is desired, see cyclicPath().

gremlin> g.V(1).both().both()
==>v[1]
==>v[4]
==>v[6]
==>v[1]
==>v[5]
==>v[3]
==>v[1]
gremlin> g.V(1).both().both().simplePath()
==>v[4]
==>v[6]
==>v[5]
==>v[3]
gremlin> g.V(1).both().both().simplePath().path()
==>[v[1], v[3], v[4]]
==>[v[1], v[3], v[6]]
==>[v[1], v[4], v[5]]
==>[v[1], v[4], v[3]]
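
The check that simplePath() performs on a traverser's path can be approximated in a few lines of Java. This is a minimal sketch, assuming a path is just a list of visited objects; isSimple() is a hypothetical helper, not a TinkerPop method.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SimplePathSketch {
    // A path is "simple" when no object appears in it more than once.
    static boolean isSimple(List<?> path) {
        Set<Object> seen = new HashSet<>();
        for (Object step : path) {
            if (!seen.add(step)) return false; // add() returns false on a repeat
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSimple(List.of(1, 3, 4)));
        System.out.println(isSimple(List.of(1, 3, 1)));
    }
}
```

A traverser whose path fails this check (for example, one that returns to v[1]) is the kind that simplePath() filters out.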

Store Step

When lazy aggregation is needed, store()-step (sideEffect) should be used over aggregate(). The two steps differ in that store() does not block and only stores objects in its side-effect collection as they pass through.

gremlin> g.V().aggregate('x').limit(1).cap('x')
==>{v[1]=1, v[2]=1, v[3]=1, v[4]=1, v[5]=1, v[6]=1}
gremlin> g.V().store('x').limit(1).cap('x')
==>{v[1]=1, v[2]=1}

It is interesting to note that there are two results in the store() side-effect even though limit(1) selects only one object. Realize that when the second object is on its way to the limit() filter, it passes through store() and is thus stored before it is filtered.

gremlin> g.E().store('x').by('weight').cap('x')
==>{0.5=1, 1.0=2, 0.4=2, 0.2=1}
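
The difference between blocking and lazy aggregation can be illustrated with Java streams. This is a rough analogy, not TinkerPop's implementation: eager() and lazy() are hypothetical helpers. Java streams stop pulling as soon as the limit is satisfied, so the exact number of elements the lazy variant records may differ from the Gremlin output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class LazyVsEagerSketch {
    // aggregate()-like: everything is collected into the side-effect before anything continues.
    static List<Integer> eager(List<Integer> source, List<Integer> sideEffect, int limit) {
        sideEffect.addAll(source);
        return sideEffect.stream().limit(limit).collect(Collectors.toList());
    }

    // store()-like: an element is recorded only when it actually flows past.
    static List<Integer> lazy(List<Integer> source, List<Integer> sideEffect, int limit) {
        return source.stream().peek(sideEffect::add).limit(limit).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> source = List.of(1, 2, 3, 4, 5, 6);
        List<Integer> eagerSide = new ArrayList<>();
        List<Integer> lazySide = new ArrayList<>();
        eager(source, eagerSide, 1);
        lazy(source, lazySide, 1);
        System.out.println("eager side-effect size: " + eagerSide.size());
        System.out.println("lazy side-effect size:  " + lazySide.size());
    }
}
```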

Subgraph Step

subgraph logo

Extracting a portion of a graph from a larger one for analysis, visualization or other purposes is a fairly common use case for graph analysts and developers. The subgraph()-step (sideEffect) provides a way to produce an edge-induced subgraph from virtually any traversal. The following example demonstrates how to produce the "knows" subgraph:

gremlin> subGraph = g.E().hasLabel('knows').subgraph('subGraph').cap('subGraph').next() //(1)
==>tinkergraph[vertices:3 edges:2]
gremlin> sg = subGraph.traversal(standard())
==>graphtraversalsource[tinkergraph[vertices:3 edges:2], standard]
gremlin> sg.E() //(2)
==>e[7][1-knows->2]
==>e[8][1-knows->4]
  1. As this function produces "edge-induced" subgraphs, subgraph() must be called at edge steps.

  2. The subgraph contains only "knows" edges.

A more common subgraphing use case is to get all of the graph structure surrounding a single vertex:

gremlin> subGraph = g.V(3).repeat(__.inE().subgraph('subGraph').outV()).times(3).cap('subGraph').next() //(1)
==>tinkergraph[vertices:4 edges:4]
gremlin> sg = subGraph.traversal(standard())
==>graphtraversalsource[tinkergraph[vertices:4 edges:4], standard]
gremlin> sg.E()
==>e[8][1-knows->4]
==>e[9][1-created->3]
==>e[11][4-created->3]
==>e[12][6-created->3]
  1. Starting at vertex 3, traverse 3 steps away on in-edges, outputting all of that into the subgraph.

There can be multiple subgraph() calls within the same traversal, each operating against either the same graph (i.e. same side-effect key) or different graphs (i.e. different side-effect keys).

gremlin> t = g.V().outE('knows').subgraph('knowsG').inV().outE('created').subgraph('createdG').
                   inV().inE('created').subgraph('createdG').iterate()
gremlin> t.sideEffects.get('knowsG').get().traversal(standard()).E()
==>e[7][1-knows->2]
==>e[8][1-knows->4]
gremlin> t.sideEffects.get('createdG').get().traversal(standard()).E()
==>e[9][1-created->3]
==>e[10][4-created->5]
==>e[11][4-created->3]
==>e[12][6-created->3]
Important
The subgraph()-step only writes to graphs that support user supplied ids for its elements. Moreover, if no graph is specified via withSideEffect(), then TinkerGraph is assumed.

Sum Step

The sum()-step (map) operates on a stream of numbers and sums the numbers together to yield a double. Note that the current traverser number is multiplied by the traverser bulk to determine how many such numbers are being represented.

gremlin> g.V().values('age').sum()
==>123.0
gremlin> g.V().repeat(both()).times(3).values('age').sum()
==>1471.0
Important
sum(local) determines the sum of the current, local object (not the objects in the traversal stream). This works for Collection-type objects. For any other object, a sum of Double.NaN is returned.
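
The role of the traverser bulk in a sum can be sketched as follows. This is an illustration only: Bulked is a hypothetical stand-in for a traverser that carries a number and a count of identical traversers, not a TinkerPop type.

```java
import java.util.List;

final class BulkSumSketch {
    // Hypothetical stand-in: a value plus how many identical traversers it represents.
    static final class Bulked {
        final double value;
        final long bulk;
        Bulked(double value, long bulk) { this.value = value; this.bulk = bulk; }
    }

    // Each value contributes value * bulk to the total.
    static double sum(List<Bulked> traversers) {
        double total = 0.0;
        for (Bulked t : traversers) total += t.value * t.bulk;
        return total;
    }

    public static void main(String[] args) {
        // The four ages from the toy graph, each with bulk 1.
        System.out.println(sum(List.of(new Bulked(29, 1), new Bulked(27, 1),
                                       new Bulked(32, 1), new Bulked(35, 1))));
    }
}
```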

Tail Step

tail step

The tail()-step is analogous to limit()-step, except that it emits the last n-objects instead of the first n-objects.

gremlin> g.V().values('name').order()
==>josh
==>lop
==>marko
==>peter
==>ripple
==>vadas
gremlin> g.V().values('name').order().tail() //(1)
==>vadas
gremlin> g.V().values('name').order().tail(1) //(2)
==>vadas
gremlin> g.V().values('name').order().tail(3) //(3)
==>peter
==>ripple
==>vadas
  1. Last name (alphabetically).

  2. Same as statement 1.

  3. Last three names.
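
Keeping only the last n objects of a stream can be done with a bounded window, as in the sketch below. It is one possible approach under stated assumptions, not TinkerPop's code; tail() here is a hypothetical helper, and it does not accept null elements because ArrayDeque rejects them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

final class TailSketch {
    // Consume the whole iterator while remembering at most the last n elements.
    static <T> List<T> tail(Iterator<T> source, int n) {
        Deque<T> window = new ArrayDeque<>();
        while (source.hasNext()) {
            T next = source.next();
            if (n <= 0) continue;                          // nothing to keep
            if (window.size() == n) window.removeFirst();  // drop the oldest
            window.addLast(next);
        }
        return new ArrayList<>(window);
    }

    public static void main(String[] args) {
        List<String> names = List.of("josh", "lop", "marko", "peter", "ripple", "vadas");
        System.out.println(tail(names.iterator(), 3));
    }
}
```

The whole input has to be consumed before anything can be emitted, since the last n objects are not known until the stream ends.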

The tail()-step can also be applied with Scope.local, in which case it operates on the incoming collection.

gremlin> g.V().as('a').out().as('a').out().as('a').select('a').by(tail(local)).values('name') //(1)
==>ripple
==>lop
gremlin> g.V().as('a').out().as('a').out().as('a').select('a').by(unfold().values('name').fold()).tail(local) //(2)
==>ripple
==>lop
gremlin> g.V().as('a').out().as('a').out().as('a').select('a').by(unfold().values('name').fold()).tail(local, 2) //(3)
==>[josh, ripple]
==>[josh, lop]
gremlin> g.V().valueMap().tail(local) //(4)
==>[age:[29]]
==>[age:[27]]
==>[lang:[java]]
==>[age:[32]]
==>[lang:[java]]
==>[age:[35]]
  1. Only the most recent name from the "a" step (List<Vertex> becomes Vertex).

  2. Same result as statement 1 (List<String> becomes String).

  3. List<String> for each path containing the last two names from the a step.

  4. Map<String, Object> for each vertex, but containing only the last property value.

TimeLimit Step

In many situations, a graph traversal is less about getting an exact answer than it is about getting a relative ranking. A classic example is recommendation: what is desired is a relative ranking of vertices, not their absolute rank. In addition, it may be desirable to have the traversal execute for no more than 2 milliseconds. In such situations, timeLimit()-step (filter) can be used.

timelimit step
Note
The method clock(int runs, Closure code) is a utility preloaded in the Gremlin Console that can be used to time execution of a body of code.
gremlin> g.V().repeat(both().groupCount('m')).times(16).cap('m').order(local).by(valueDecr).next()
==>v[1]=2744208
==>v[3]=2744208
==>v[4]=2744208
==>v[2]=1136688
==>v[5]=1136688
==>v[6]=1136688
gremlin> clock(1) {g.V().repeat(both().groupCount('m')).times(16).cap('m').order(local).by(valueDecr).next()}
==>1.5060259999999999
gremlin> g.V().repeat(timeLimit(2).both().groupCount('m')).times(16).cap('m').order(local).by(valueDecr).next()
==>v[1]=2744208
==>v[3]=2744208
==>v[4]=2744208
==>v[2]=1136688
==>v[5]=1136688
==>v[6]=1136688
gremlin> clock(1) {g.V().repeat(timeLimit(2).both().groupCount('m')).times(16).cap('m').order(local).by(valueDecr).next()}
==>2.009913

In essence, the relative order is respected, even though the number of traversers at each vertex is not. The primary benefit is that the calculation is guaranteed to complete at the specified time limit (in milliseconds). Finally, note that the internal clock of timeLimit()-step starts when the first traverser enters it. When the time limit is reached, any next() evaluation of the step will yield a NoSuchElementException and any hasNext() evaluation will yield false.
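
The behavior described above (clock starts with the first element, hasNext() turns false and next() throws once the limit passes) can be mimicked by a small iterator wrapper. This is a sketch under stated assumptions rather than TinkerPop's implementation: TimeLimitedIterator is a hypothetical class, and the clock is injected as a LongSupplier so the behavior can be exercised deterministically.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.LongSupplier;

final class TimeLimitedIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final long limitMillis;
    private final LongSupplier clock;   // supplies the current time in milliseconds
    private long startedAt = -1;        // -1 until the first element is seen

    TimeLimitedIterator(Iterator<T> source, long limitMillis, LongSupplier clock) {
        this.source = source;
        this.limitMillis = limitMillis;
        this.clock = clock;
    }

    @Override
    public boolean hasNext() {
        if (startedAt < 0) {
            if (!source.hasNext()) return false;
            startedAt = clock.getAsLong();  // the clock starts with the first element
        }
        if (clock.getAsLong() - startedAt >= limitMillis) return false; // time is up
        return source.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        return source.next();
    }

    public static void main(String[] args) {
        Iterator<Integer> it =
            new TimeLimitedIterator<>(List.of(1, 2, 3).iterator(), 2, System::currentTimeMillis);
        while (it.hasNext()) System.out.println(it.next());
    }
}
```

Elements that have not been pulled by the time the limit expires are simply never emitted, which is why counts may be lower than in an unrestricted run.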

Tree Step

From any one element (i.e. vertex or edge), the emanating paths from that element can be aggregated to form a tree. Gremlin provides tree()-step (sideEffect) for this situation.

tree step
gremlin> tree = g.V().out().out().tree().next()
==>v[1]={v[4]={v[3]={}, v[5]={}}}

It is important to see how the paths of all the emanating traversers are united to form the tree.

tree step2

The resultant tree data structure can then be manipulated (see Tree JavaDoc).

gremlin> tree = g.V().out().out().tree().by('name').next()
==>marko={josh={ripple={}, lop={}}}
gremlin> tree['marko']
==>josh={ripple={}, lop={}}
gremlin> tree['marko']['josh']
==>ripple={}
==>lop={}
gremlin> tree.getObjectsAtDepth(3)
==>ripple
==>lop
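
How separate paths merge into one tree can be sketched with a map that nests itself. This is an illustration only; PathTree is a hypothetical class written for this example and is not the TinkerPop Tree class.

```java
import java.util.LinkedHashMap;
import java.util.List;

// A node is a map from a child key to that child's own subtree.
final class PathTree extends LinkedHashMap<String, PathTree> {
    // Merge all paths: shared prefixes collapse into shared branches.
    static PathTree of(List<List<String>> paths) {
        PathTree root = new PathTree();
        for (List<String> path : paths) {
            PathTree node = root;
            for (String step : path) {
                node = node.computeIfAbsent(step, key -> new PathTree());
            }
        }
        return root;
    }

    public static void main(String[] args) {
        PathTree tree = PathTree.of(List.of(
            List.of("marko", "josh", "ripple"),
            List.of("marko", "josh", "lop")));
        System.out.println(tree);
        System.out.println(tree.get("marko").get("josh").keySet());
    }
}
```

The two input paths share the prefix marko, josh, so they contribute a single branch that only forks at the last step.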

Unfold Step

If the object reaching unfold() (flatMap) is an iterator, iterable, or map, then it is unrolled into a linear form. If not, then the object is simply emitted. Please see fold()-step for the inverse behavior.

gremlin> g.V(1).out().fold().inject('gremlin',[1.23,2.34])
==>gremlin
==>[1.23, 2.34]
==>[v[3], v[2], v[4]]
gremlin> g.V(1).out().fold().inject('gremlin',[1.23,2.34]).unfold()
==>gremlin
==>1.23
==>2.34
==>v[3]
==>v[2]
==>v[4]

Note that unfold() does not recursively unroll iterators. Instead, repeat() can be used for recursive unrolling.

gremlin> inject(1,[2,3,[4,5,[6]]])
==>1
==>[2, 3, [4, 5, [6]]]
gremlin> inject(1,[2,3,[4,5,[6]]]).unfold()
==>1
==>2
==>3
==>[4, 5, [6]]
gremlin> inject(1,[2,3,[4,5,[6]]]).repeat(unfold()).until(count(local).is(1)).unfold()
==>1
==>2
==>3
==>4
==>5
==>6
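
The contrast between one level of unrolling and recursive unrolling can be shown in plain Java. This is a minimal sketch; unfoldOnce() and unfoldDeep() are hypothetical helpers that treat any Iterable as something to unroll.

```java
import java.util.ArrayList;
import java.util.List;

final class UnfoldSketch {
    // Unroll exactly one level: nested collections inside an item stay intact.
    static List<Object> unfoldOnce(Iterable<?> items) {
        List<Object> out = new ArrayList<>();
        for (Object item : items) {
            if (item instanceof Iterable) {
                for (Object inner : (Iterable<?>) item) out.add(inner);
            } else {
                out.add(item);
            }
        }
        return out;
    }

    // Unroll every level by recursing into each nested Iterable.
    static List<Object> unfoldDeep(Iterable<?> items) {
        List<Object> out = new ArrayList<>();
        for (Object item : items) {
            if (item instanceof Iterable) {
                out.addAll(unfoldDeep((Iterable<?>) item));
            } else {
                out.add(item);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> input = List.of(1, List.of(2, 3, List.of(4, 5, List.of(6))));
        System.out.println(unfoldOnce(input));
        System.out.println(unfoldDeep(input));
    }
}
```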

Union Step

union step

The union()-step (branch) supports the merging of the results of an arbitrary number of traversals. When a traverser reaches a union()-step, it is copied to each of its internal traversals. The traversers emitted from union() are the outputs of the respective internal traversals.

gremlin> g.V(4).union(
                  __.in().values('age'),
                  out().values('lang'))
==>29
==>java
==>java
gremlin> g.V(4).union(
                  __.in().values('age'),
                  out().values('lang')).path()
==>[v[4], v[1], 29]
==>[v[4], v[5], java]
==>[v[4], v[3], java]
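
The copy-to-every-branch idea can be sketched with functions that each turn the same input into a stream of results. This is an analogy under stated assumptions; union() below is a hypothetical helper, not the Gremlin step.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class UnionSketch {
    // Hand the same start object to every branch and concatenate the branch outputs in order.
    @SafeVarargs
    static <S, E> List<E> union(S start, Function<S, Stream<E>>... branches) {
        return Stream.of(branches)
                     .flatMap(branch -> branch.apply(start))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Function<Integer, Stream<Integer>> below = v -> Stream.of(v - 1);
        Function<Integer, Stream<Integer>> above = v -> Stream.of(v + 1, v + 2);
        System.out.println(union(4, below, above));
    }
}
```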

ValueMap Step

The valueMap()-step yields a Map representation of the properties of an element.

gremlin> g.V().valueMap()
==>[name:[marko], age:[29]]
==>[name:[vadas], age:[27]]
==>[name:[lop], lang:[java]]
==>[name:[josh], age:[32]]
==>[name:[ripple], lang:[java]]
==>[name:[peter], age:[35]]
gremlin> g.V().valueMap('age')
==>[age:[29]]
==>[age:[27]]
==>[:]
==>[age:[32]]
==>[:]
==>[age:[35]]
gremlin> g.V().valueMap('age','blah')
==>[age:[29]]
==>[age:[27]]
==>[:]
==>[age:[32]]
==>[:]
==>[age:[35]]
gremlin> g.E().valueMap()
==>[weight:0.5]
==>[weight:1.0]
==>[weight:0.4]
==>[weight:1.0]
==>[weight:0.4]
==>[weight:0.2]

It is important to note that the map of a vertex maintains a list of values for each key. The map of an edge or vertex-property represents a single property (not a list). The reason is that vertices in TinkerPop3 leverage vertex properties, which support multiple values per key. Using The Crew toy graph, the point is made explicit.

gremlin> g.V().valueMap()
==>[name:[marko], location:[san diego, santa cruz, brussels, santa fe]]
==>[name:[stephen], location:[centreville, dulles, purcellville]]
==>[name:[matthias], location:[bremen, baltimore, oakland, seattle]]
==>[name:[daniel], location:[spremberg, kaiserslautern, aachen]]
==>[name:[gremlin]]
==>[name:[tinkergraph]]
gremlin> g.V().has('name','marko').properties('location')
==>vp[location->san diego]
==>vp[location->santa cruz]
==>vp[location->brussels]
==>vp[location->santa fe]
gremlin> g.V().has('name','marko').properties('location').valueMap()
==>[startTime:1997, endTime:2001]
==>[startTime:2001, endTime:2004]
==>[startTime:2004, endTime:2005]
==>[startTime:2005]

If the id, label, key, and value of the Element are desired, then a boolean triggers their insertion into the returned map.

gremlin> g.V().hasLabel('person').valueMap(true)
==>[name:[marko], location:[san diego, santa cruz, brussels, santa fe], label:person, id:1]
==>[name:[stephen], location:[centreville, dulles, purcellville], label:person, id:7]
==>[name:[matthias], location:[bremen, baltimore, oakland, seattle], label:person, id:8]
==>[name:[daniel], location:[spremberg, kaiserslautern, aachen], label:person, id:9]
gremlin> g.V().hasLabel('person').valueMap(true,'name')
==>[name:[marko], label:person, id:1]
==>[name:[stephen], label:person, id:7]
==>[name:[matthias], label:person, id:8]
==>[name:[daniel], label:person, id:9]
gremlin> g.V().hasLabel('person').properties('location').valueMap(true)
==>[value:san diego, startTime:1997, endTime:2001, key:location, id:6]
==>[value:santa cruz, startTime:2001, endTime:2004, key:location, id:7]
==>[value:brussels, startTime:2004, endTime:2005, key:location, id:8]
==>[value:santa fe, startTime:2005, key:location, id:9]
==>[value:centreville, startTime:1990, endTime:2000, key:location, id:10]
==>[value:dulles, startTime:2000, endTime:2006, key:location, id:11]
==>[value:purcellville, startTime:2006, key:location, id:12]
==>[value:bremen, startTime:2004, endTime:2007, key:location, id:13]
==>[value:baltimore, startTime:2007, endTime:2011, key:location, id:14]
==>[value:oakland, startTime:2011, endTime:2014, key:location, id:15]
==>[value:seattle, startTime:2014, key:location, id:16]
==>[value:spremberg, startTime:1982, endTime:2005, key:location, id:17]
==>[value:kaiserslautern, startTime:2005, endTime:2009, key:location, id:18]
==>[value:aachen, startTime:2009, key:location, id:19]

Vertex Steps

vertex steps

The vertex steps (flatMap) are fundamental to the Gremlin language. Via these steps, it’s possible to "move" on the graph (i.e. traverse).

  • out(string...): Move to the outgoing adjacent vertices given the edge labels.

  • in(string...): Move to the incoming adjacent vertices given the edge labels.

  • both(string...): Move to both the incoming and outgoing adjacent vertices given the edge labels.

  • outE(string...): Move to the outgoing incident edges given the edge labels.

  • inE(string...): Move to the incoming incident edges given the edge labels.

  • bothE(string...): Move to both the incoming and outgoing incident edges given the edge labels.

  • outV(): Move to the outgoing vertex.

  • inV(): Move to the incoming vertex.

  • bothV(): Move to both vertices.

  • otherV(): Move to the vertex that was not the vertex that was moved from.

gremlin> g.V(4)
==>v[4]
gremlin> g.V(4).outE() //(1)
==>e[10][4-created->5]
==>e[11][4-created->3]
gremlin> g.V(4).inE('knows') //(2)
==>e[8][1-knows->4]
gremlin> g.V(4).inE('created') //(3)
gremlin> g.V(4).bothE('knows','created','blah')
==>e[10][4-created->5]
==>e[11][4-created->3]
==>e[8][1-knows->4]
gremlin> g.V(4).bothE('knows','created','blah').otherV()
==>v[5]
==>v[3]
==>v[1]
gremlin> g.V(4).both('knows','created','blah')
==>v[5]
==>v[3]
==>v[1]
gremlin> g.V(4).outE().inV() //(4)
==>v[5]
==>v[3]
gremlin> g.V(4).out() //(5)
==>v[5]
==>v[3]
gremlin> g.V(4).inE().outV()
==>v[1]
gremlin> g.V(4).inE().bothV()
==>v[1]
==>v[4]
  1. All outgoing edges.

  2. All incoming knows-edges.

  3. All incoming created-edges.

  4. Moving forward touching edges and vertices.

  5. Moving forward only touching vertices.
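
The direction and label semantics of out(), in(), and both() can be sketched over a plain edge list. This is an illustration only: Edge, toyEdges(), and the three helpers are hypothetical and merely restate the toy graph's edges; passing no labels means all labels match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VertexStepsSketch {
    // Hypothetical directed, labeled edge: outV --label--> inV.
    static final class Edge {
        final int outV, inV;
        final String label;
        Edge(int outV, String label, int inV) { this.outV = outV; this.label = label; this.inV = inV; }
    }

    static List<Edge> toyEdges() {
        return List.of(new Edge(1, "knows", 2), new Edge(1, "knows", 4), new Edge(1, "created", 3),
                       new Edge(4, "created", 5), new Edge(4, "created", 3), new Edge(6, "created", 3));
    }

    private static boolean matches(Edge e, String... labels) {
        return labels.length == 0 || Arrays.asList(labels).contains(e.label);
    }

    // Follow outgoing edges to the vertices they point at.
    static List<Integer> out(List<Edge> edges, int v, String... labels) {
        List<Integer> result = new ArrayList<>();
        for (Edge e : edges) if (e.outV == v && matches(e, labels)) result.add(e.inV);
        return result;
    }

    // Follow incoming edges back to the vertices they come from.
    static List<Integer> in(List<Edge> edges, int v, String... labels) {
        List<Integer> result = new ArrayList<>();
        for (Edge e : edges) if (e.inV == v && matches(e, labels)) result.add(e.outV);
        return result;
    }

    // Both directions: outgoing neighbors followed by incoming neighbors.
    static List<Integer> both(List<Edge> edges, int v, String... labels) {
        List<Integer> result = out(edges, v, labels);
        result.addAll(in(edges, v, labels));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(out(toyEdges(), 4));
        System.out.println(in(toyEdges(), 4, "knows"));
        System.out.println(both(toyEdges(), 4, "knows", "created", "blah"));
    }
}
```

As in the console examples, a label that matches no edge (such as "blah") simply contributes nothing.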

Where Step

The where()-step filters the current object based on either the object itself (Scope.local) or the path history of the object (Scope.global) (filter). This step is typically used in conjunction with either match()-step or select()-step, but can be used in isolation.

gremlin> g.V(1).as('a').out('created').in('created').where(neq('a')) //(1)
==>v[4]
==>v[6]
gremlin> g.withSideEffect('a',['josh','peter']).V(1).out('created').in('created').values('name').where(within('a')) //(2)
==>josh
==>peter
gremlin> g.V(1).out('created').in('created').where(out('created').count().is(gt(1))).values('name') //(3)
==>josh
  1. Who are marko’s collaborators, where marko can not be his own collaborator? (predicate)

  2. Of the co-creators of marko, only keep those whose name is josh or peter. (using a sideEffect)

  3. Which of marko’s collaborators have worked on more than 1 project? (using a traversal)

Important
Please see match().where() and select().where() for how where() can be used in conjunction with Map<String,Object> projecting steps — i.e. Scope.local.

A few more examples of filtering an arbitrary object based on an anonymous traversal are provided below.

gremlin> g.V().where(out('created')).values('name') //(1)
==>marko
==>josh
==>peter
gremlin> g.V().out('knows').where(out('created')).values('name') //(2)
==>josh
gremlin> g.V().where(out('created').count().is(gte(2))).values('name') //(3)
==>josh
gremlin> g.V().where(out('knows').where(out('created'))).values('name') //(4)
==>marko
gremlin> g.V().where(__.not(out('created'))).where(__.in('knows')).values('name') //(5)
==>vadas
gremlin> g.V().where(__.not(out('created')).and().in('knows')).values('name') //(6)
==>vadas
  1. What are the names of the people who have created a project?

  2. What are the names of the people that are known by someone and have created a project?

  3. What are the names of the people who have created two or more projects?

  4. What are the names of the people who know someone that has created a project? (This only works in OLTP — see the WARNING below)

  5. What are the names of the people who have not created anything, but are known by someone?

  6. The concatenation of where()-steps is the same as a single where()-step with an and’d clause.

Warning
The anonymous traversal of where() processes the current object "locally". In OLAP, where the atomic unit of computing is the vertex and its local "star graph," it is important that the anonymous traversal does not leave the confines of the vertex’s star graph. In other words, it can not traverse to an adjacent vertex’s properties or edges.

A Note on Predicates

A P is a predicate of the form Function<Object,Boolean>. That is, given some object, return true or false. The provided predicates are outlined in the table below and are used in various steps such as has()-step, where()-step, is()-step, etc.

Predicate                Description
eq(object)               Is the incoming object equal to the provided object?
neq(object)              Is the incoming object not equal to the provided object?
lt(number)               Is the incoming number less than the provided number?
lte(number)              Is the incoming number less than or equal to the provided number?
gt(number)               Is the incoming number greater than the provided number?
gte(number)              Is the incoming number greater than or equal to the provided number?
inside(number,number)    Is the incoming number greater than the first provided number and less than the second?
outside(number,number)   Is the incoming number less than the first provided number or greater than the second?
between(number,number)   Is the incoming number greater than or equal to the first provided number and less than the second?
within(objects...)       Is the incoming object in the array of provided objects?
without(objects...)      Is the incoming object not in the array of the provided objects?

gremlin> eq(2)
==>eq(2)
gremlin> not(neq(2)) //(1)
==>eq(2)
gremlin> not(within('a','b','c'))
==>without([a, b, c])
gremlin> not(within('a','b','c')).test('d') //(2)
==>true
gremlin> not(within('a','b','c')).test('a')
==>false
gremlin> within(1,2,3).and(not(eq(2))).test(3) //(3)
==>true
gremlin> inside(1,4).or(eq(5)).test(3) //(4)
==>true
gremlin> inside(1,4).or(eq(5)).test(5)
==>true
gremlin> between(1,2) //(5)
==>and([gte(1), lt(2)])
gremlin> not(between(1,2))
==>or([lt(1), gte(2)])
  1. The not() of a P-predicate is another P-predicate.

  2. P-predicates are arguments to various steps which internally test() the incoming value.

  3. P-predicates can be and’d together.

  4. P-predicates can be or’d together.

  5. and() is a P-predicate and thus, a P-predicate can be composed of multiple P-predicates.
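
The composition rules above map closely onto java.util.function.Predicate, which also offers and(), or(), and negate(). The sketch below is an analogy, not TinkerPop's P class: eq(), inside(), outside(), between(), and within() are hypothetical helpers written to match the descriptions in the table.

```java
import java.util.Arrays;
import java.util.function.Predicate;

final class PredicateSketch {
    static Predicate<Integer> eq(int value) { return n -> n == value; }

    // Strictly greater than lo and strictly less than hi.
    static Predicate<Integer> inside(int lo, int hi) { return n -> n > lo && n < hi; }

    // Less than lo or greater than hi.
    static Predicate<Integer> outside(int lo, int hi) { return n -> n < lo || n > hi; }

    // Greater than or equal to lo and strictly less than hi.
    static Predicate<Integer> between(int lo, int hi) { return n -> n >= lo && n < hi; }

    // Membership in the provided values.
    @SafeVarargs
    static <T> Predicate<T> within(T... values) { return Arrays.asList(values)::contains; }

    public static void main(String[] args) {
        System.out.println(within(1, 2, 3).and(eq(2).negate()).test(3));
        System.out.println(inside(1, 4).or(eq(5)).test(5));
        System.out.println(between(1, 2).negate().test(2));
    }
}
```

Negating a composed predicate behaves as the console output suggests: the negation of between(1,2) accepts anything below 1 or at least 2.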

Finally, note that where()-step takes a P<String>. The provided string value refers to a variable binding, not to the explicit string value.

gremlin> g.V().as('a').both().both().as('b').count()
==>30
gremlin> g.V().as('a').both().both().as('b').where('a',neq('b')).count()
==>18
Note
It is possible for vendors and users to extend P and provide new predicates. For instance, a regex(pattern) could be a vendor-specific P.

A Note on Barrier Steps

Gremlin is primarily a lazy, stream processing language. This means that Gremlin fully processes (to the best of its abilities) any traversers currently in the traversal pipeline before getting more data from the start/head of the traversal. However, there are numerous situations in which a completely lazy computation is not possible (or is impractical). When a computation is not lazy, a "barrier step" exists. There are three types of barriers:

  1. CollectingBarrierStep: All of the traversers prior to the step are put into a collection and then processed in some way (e.g. ordered) prior to the collection being "drained" one-by-one to the next step. Examples include: order(), sample(), aggregate(), barrier().

  2. ReducingBarrierStep: All of the traversers prior to the step are processed by a reduce function and once all the previous traversers are processed, a single "reduced value" traverser is emitted to the next step. Examples include: fold(), count(), sum(), max(), min().

  3. SupplyingBarrierStep: All of the traversers prior to the step are iterated (no processing) and then some provided supplier yields a single traverser to continue to the next step. Examples include: cap().

In Gremlin OLAP (see TraversalVertexProgram), a barrier is introduced at the end of every adjacent vertex step. This means that the traversal does its best to compute as much as possible at the current, local vertex. What it can’t compute without referencing an adjacent vertex is aggregated into a barrier collection. When there are no more traversers at the local vertex, the barriered traversers are the messages that are propagated to remote vertices for further processing.

A Note On Lambdas

A lambda is a function that can be referenced by software and thus, passed around like any other piece of data. In Gremlin, lambdas make it possible to generalize the behavior of a step such that custom steps can be created (on-the-fly) by the user. However, it is advised to avoid using lambdas if possible.

gremlin> g.V().filter{it.get().value('name') == 'marko'}.
               flatMap{it.get().vertices(OUT,'created')}.
               map {it.get().value('name')} //(1)
==>lop
gremlin> g.V().has('name','marko').out('created').values('name') //(2)
==>lop
  1. A lambda-rich Gremlin traversal which should and can be avoided. (bad)

  2. The same traversal (result), but without using lambdas. (good)

Gremlin attempts to provide the user a comprehensive collection of steps in the hopes that the user will never need to leverage a lambda in practice. It is advised that users leverage a lambda only if there is no corresponding lambda-less step that encompasses the desired functionality. The reason is that lambdas can not be optimized by Gremlin’s compiler strategies, as they can not be programmatically inspected (see traversal strategies).

In many situations where a lambda could be used, either a corresponding step exists or a traversal can be provided in its place. A TraversalLambda behaves like a typical lambda, but it can be optimized and it yields fewer objects than the corresponding pure-lambda form.

gremlin> g.V().out().out().path().by {it.value('name')}.
                                  by {it.value('name')}.
                                  by {g.V(it).in('created').values('name').fold().next()} //(1)
==>[marko, josh, [josh]]
==>[marko, josh, [marko, josh, peter]]
gremlin> g.V().out().out().path().by('name').
                                  by('name').
                                  by(__.in('created').values('name').fold()) //(2)
==>[marko, josh, [josh]]
==>[marko, josh, [marko, josh, peter]]
  1. The length-3 paths have each of their objects transformed by a lambda. (bad)

  2. The length-3 paths have their objects transformed by a lambda-less step and a traversal lambda. (good)

TraversalStrategy

A TraversalStrategy can analyze a Traversal and mutate the traversal as it deems fit. This is useful in multiple situations:

  • There is an application-level feature that can be embedded into the traversal logic (decoration).

  • There is a more efficient way to express the traversal at the TinkerPop3 level (optimization).

  • There is a more efficient way to express the traversal at the graph vendor level (vendor optimization).

  • There are some final adjustments required before executing the traversal (finalization).

  • There are certain traversals that are not legal for the application or traversal engine (verification).

A simple OptimizationStrategy is the IdentityRemovalStrategy.

public final class IdentityRemovalStrategy extends AbstractTraversalStrategy<TraversalStrategy.OptimizationStrategy> implements TraversalStrategy.OptimizationStrategy {

    private static final IdentityRemovalStrategy INSTANCE = new IdentityRemovalStrategy();

    private IdentityRemovalStrategy() {
    }

    @Override
    public void apply(final Traversal.Admin<?, ?> traversal) {
        if (!TraversalHelper.hasStepOfClass(IdentityStep.class, traversal))
            return;

        TraversalHelper.getStepsOfClass(IdentityStep.class, traversal).stream().forEach(identityStep -> {
            final Step<?, ?> previousStep = identityStep.getPreviousStep();
            if (!(previousStep instanceof EmptyStep) || identityStep.getLabels().isEmpty()) {
                ((IdentityStep<?>) identityStep).getLabels().forEach(previousStep::addLabel);
                traversal.removeStep(identityStep);
            }
        });
    }

    public static IdentityRemovalStrategy instance() {
        return INSTANCE;
    }
}

This strategy simply removes any IdentityStep steps in the Traversal, as aStep().identity().identity().bStep() is equivalent to aStep().bStep(). For traversal strategies that require other strategies to execute before or after them, the following two methods can be defined in TraversalStrategy (the defaults being an empty set). If the TraversalStrategy is in a particular traversal category (i.e. decoration, optimization, vendor-optimization, finalization, or verification), then priors and posts are only possible within the category.

public Set<Class<? extends S>> applyPrior();
public Set<Class<? extends S>> applyPost();
Important
Strategies are sorted within their category, and the categories are then executed in the following order: decoration, optimization, vendor optimization, finalization, and verification. If a designed strategy does not fit cleanly into these categories, then it can implement TraversalStrategy directly and its priors and posts can reference strategies within any category.
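
Ordering strategies by their prior and post constraints amounts to a topological sort. The sketch below shows one simple way to do it and is not TinkerPop's sorting code: Strategy is a hypothetical descriptor identified by a unique name, where priors must run before it and posts must run after it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class StrategyOrderSketch {
    // Hypothetical descriptor: `priors` must run before this one, `posts` after it.
    static final class Strategy {
        final String name;
        final Set<String> priors;
        final Set<String> posts;
        Strategy(String name, Set<String> priors, Set<String> posts) {
            this.name = name; this.priors = priors; this.posts = posts;
        }
    }

    static List<String> sort(List<Strategy> strategies) {
        // before.get(x) holds the names that must run before x.
        Map<String, Set<String>> before = new LinkedHashMap<>();
        for (Strategy s : strategies) before.put(s.name, new LinkedHashSet<>());
        for (Strategy s : strategies) {
            for (String p : s.priors) if (before.containsKey(p)) before.get(s.name).add(p);
            for (String p : s.posts) if (before.containsKey(p)) before.get(p).add(s.name);
        }
        List<String> ordered = new ArrayList<>();
        while (ordered.size() < before.size()) {
            String ready = null;
            for (Map.Entry<String, Set<String>> e : before.entrySet()) {
                // Ready when not yet placed and all of its prerequisites are placed.
                if (!ordered.contains(e.getKey()) && ordered.containsAll(e.getValue())) {
                    ready = e.getKey();
                    break;
                }
            }
            if (ready == null) throw new IllegalStateException("cyclic prior/post constraints");
            ordered.add(ready);
        }
        return ordered;
    }

    public static void main(String[] args) {
        System.out.println(sort(List.of(
            new Strategy("C", Set.of("A"), Set.of()),
            new Strategy("A", Set.of(), Set.of()),
            new Strategy("B", Set.of(), Set.of("A")))));
    }
}
```

Constraints that name a strategy outside the given list are ignored here, which loosely mirrors the rule that priors and posts only apply within a category.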

An example of a VendorOptimizationStrategy is provided below.

g.V().has('name','marko')

The expression above can be executed in an O(log(|V|)) or O(|V|) fashion in TinkerGraph depending on whether or not there is an index defined for "name."

public final class TinkerGraphStepStrategy extends AbstractTraversalStrategy<TraversalStrategy.VendorOptimizationStrategy> implements TraversalStrategy.VendorOptimizationStrategy {

    private static final TinkerGraphStepStrategy INSTANCE = new TinkerGraphStepStrategy();

    private TinkerGraphStepStrategy() {
    }

    @Override
    public void apply(final Traversal.Admin<?, ?> traversal) {
        if (traversal.getEngine().isComputer())
            return;

        final Step<?, ?> startStep = traversal.getStartStep();
        if (startStep instanceof GraphStep) {
            final GraphStep<?> originalGraphStep = (GraphStep) startStep;
            final TinkerGraphStep<?> tinkerGraphStep = new TinkerGraphStep<>(originalGraphStep);
            TraversalHelper.replaceStep(startStep, (Step) tinkerGraphStep, traversal);

            Step<?, ?> currentStep = tinkerGraphStep.getNextStep();
            while (true) {
                if (currentStep instanceof HasContainerHolder) {
                    tinkerGraphStep.hasContainers.addAll(((HasContainerHolder) currentStep).getHasContainers());
                    currentStep.getLabels().forEach(tinkerGraphStep::addLabel);
                    traversal.removeStep(currentStep);
                } else {
                    break;
                }
                currentStep = currentStep.getNextStep();
            }
        }
    }

    public static TinkerGraphStepStrategy instance() {
        return INSTANCE;
    }
}

The traversal is redefined by simply taking a chain of has()-steps after g.V() (TinkerGraphStep) and providing them to TinkerGraphStep. Then it’s up to TinkerGraphStep to determine if an appropriate index exists. In TinkerGraphStep, review the vertices() method and note how, if an index exists for a particular HasContainer, that index is first queried before the remaining HasContainer filters are serially applied. Given that the strategy uses non-TinkerPop3 provided steps, it should go into the VendorOptimizationStrategy category to ensure the added step does not corrupt the OptimizationStrategy strategies.

gremlin> t = g.V().has('name','marko'); null
==>null
gremlin> t.toString()
==>[GraphStep([],vertex), HasStep([name.eq(marko)])]
gremlin> t.iterate(); null
==>null
gremlin> t.toString()
==>[TinkerGraphStep(vertex,[name.eq(marko)])]
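
The payoff of folding has('name','marko') into the graph step is the difference between scanning every vertex and consulting an index. The sketch below illustrates that difference only; it assumes a sorted TreeMap index for the logarithmic case and says nothing about how TinkerGraph's own index is built. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IndexLookupSketch {
    // Linear scan: every vertex is inspected, O(|V|).
    static List<Integer> scan(Map<Integer, String> nameById, String name) {
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> e : nameById.entrySet()) {
            if (e.getValue().equals(name)) hits.add(e.getKey());
        }
        return hits;
    }

    // Build a sorted index from name to vertex ids once, up front.
    static TreeMap<String, List<Integer>> index(Map<Integer, String> nameById) {
        TreeMap<String, List<Integer>> idx = new TreeMap<>();
        nameById.forEach((id, name) -> idx.computeIfAbsent(name, k -> new ArrayList<>()).add(id));
        return idx;
    }

    // Index lookup: a balanced-tree search, O(log(|V|)).
    static List<Integer> lookup(TreeMap<String, List<Integer>> idx, String name) {
        return idx.getOrDefault(name, List.of());
    }

    public static void main(String[] args) {
        Map<Integer, String> names = Map.of(1, "marko", 2, "vadas", 3, "lop",
                                            4, "josh", 5, "ripple", 6, "peter");
        System.out.println(scan(names, "marko"));
        System.out.println(lookup(index(names), "marko"));
    }
}
```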
Caution
The reason that OptimizationStrategy and VendorOptimizationStrategy are two different categories is that optimization strategies should only rewrite the traversal using TinkerPop3 steps. This ensures that the traversal produced at the end of the optimization strategy round is TinkerPop3 compliant. From there, vendor optimizations can analyze the traversal and rewrite it as desired using vendor-specific steps (e.g. replacing GraphStep.HasStep...HasStep with TinkerGraphStep). If a vendor’s optimizations use vendor-specific steps and implement OptimizationStrategy, then other TinkerPop3 optimizations may fail to optimize the traversal or may misunderstand the vendor-specific step behaviors (e.g. VendorVertexStep extends VertexStep) and yield incorrect semantics.

A collection of DecorationStrategy strategies is provided with TinkerPop3; these are generally useful to end-users. The following sub-sections detail these strategies:

ElementIdStrategy

ElementIdStrategy provides control over element identifiers. Some Graph implementations, such as TinkerGraph, allow specification of custom identifiers when creating elements:

gremlin> g = TinkerGraph.open().traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> v = g.addV(id,'42a')
==>v[42a]
gremlin> g.V('42a')
==>v[42a]

Other Graph implementations, such as Neo4j, generate element identifiers automatically and do not allow them to be assigned. As a helper, ElementIdStrategy makes identifier assignment possible by using vertex and edge indices under the hood.

gremlin> graph = Neo4jGraph.open('/tmp/neo4j')
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]
gremlin> strategy = ElementIdStrategy.build().create()
==>ElementIdStrategy
gremlin> g = GraphTraversalSource.build().with(strategy).create(graph)
==>graphtraversalsource[neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]], standard]
gremlin> g.addV(id, '42a').id()
==>42a
Important
The key that is used to store the assigned identifier should be indexed in the underlying graph database. If it is not indexed, then lookups for the elements that use these identifiers will perform a linear scan.
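
The idea of storing an assigned identifier under a property key and resolving it through an index can be sketched as follows. This is an illustration under assumptions, not the ElementIdStrategy implementation: the class ElementIdSketch and the key name "__id" are hypothetical, and the "graph" is a plain map keyed by auto-generated internal identifiers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of a graph that generates its own ids while user ids live in a property.
final class ElementIdSketch {
    static final String ID_KEY = "__id"; // assumed property key holding the assigned identifier

    private long nextInternalId = 0;
    private final Map<Long, Map<String, Object>> store = new HashMap<>(); // internal id -> properties
    private final Map<Object, Long> idIndex = new HashMap<>();            // assigned id -> internal id

    long addVertex(Object assignedId) {
        long internalId = nextInternalId++; // the underlying graph picks this value
        Map<String, Object> properties = new HashMap<>();
        properties.put(ID_KEY, assignedId);
        store.put(internalId, properties);
        idIndex.put(assignedId, internalId);
        return internalId;
    }

    /** Indexed lookup: a single hash probe regardless of graph size. */
    Optional<Map<String, Object>> findIndexed(Object assignedId) {
        Long internalId = idIndex.get(assignedId);
        return internalId == null ? Optional.empty() : Optional.of(store.get(internalId));
    }

    /** Unindexed lookup: the linear scan that results when the key is not indexed. */
    Optional<Map<String, Object>> findByScan(Object assignedId) {
        for (Map<String, Object> properties : store.values()) {
            if (assignedId.equals(properties.get(ID_KEY))) {
                return Optional.of(properties);
            }
        }
        return Optional.empty();
    }
}
```

Both lookups return the same element; the difference is only in how much of the store has to be visited to find it.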

EventStrategy

The purpose of the EventStrategy is to raise events to one or more MutationListener objects as changes to the underlying Graph occur within a Traversal. Such a strategy is useful for logging changes, triggering certain actions based on change, or any application that needs notification of some mutating operation during a Traversal. If the transaction is rolled back, the event queue is reset.

The following events are raised to the MutationListener:

  • New vertex

  • New edge

  • Vertex property changed

  • Edge property changed

  • Vertex property removed

  • Edge property removed

  • Vertex removed

  • Edge removed

To start processing events from a Traversal, first implement the MutationListener interface. An example of this implementation is the ConsoleMutationListener, which writes output to the console for each event. The following console session displays the basic usage:

gremlin> graph = TinkerFactory.createModern()
==>tinkergraph[vertices:6 edges:6]
gremlin> l = new ConsoleMutationListener(graph)
==>MutationListener[tinkergraph[vertices:6 edges:6]]
gremlin> strategy = EventStrategy.build().addListener(l).create()
==>EventStrategy
gremlin> g = GraphTraversalSource.build().with(strategy).create(graph)
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.addV('name','stephen')
Vertex [v[12]] added to graph [tinkergraph[vertices:7 edges:6]]
==>v[12]
gremlin> g.E().drop()
Edge [e[7][1-knows->2]] removed from graph [tinkergraph[vertices:7 edges:6]]
Edge [e[8][1-knows->4]] removed from graph [tinkergraph[vertices:7 edges:5]]
Edge [e[9][1-created->3]] removed from graph [tinkergraph[vertices:7 edges:4]]
Edge [e[10][4-created->5]] removed from graph [tinkergraph[vertices:7 edges:3]]
Edge [e[11][4-created->3]] removed from graph [tinkergraph[vertices:7 edges:2]]
Edge [e[12][6-created->3]] removed from graph [tinkergraph[vertices:7 edges:1]]

By default, the EventStrategy is configured with an EventQueue that raises events as they occur within execution of a Step. As such, the final line of Gremlin execution that drops all edges shows a bit of an inconsistent count, where the removed edge count is accounted for after the event is raised. The strategy can also be configured with a TransactionalEventQueue that captures the changes within a transaction and does not allow them to fire until the transaction is committed.
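
The difference between the two queueing behaviors can be sketched in a few lines of Java. The types below are hypothetical stand-ins (events are plain strings and a listener is a Consumer), so this only illustrates the firing semantics described above, not the EventStrategy API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical event queues; events are modeled as plain strings.
final class EventQueueSketch {

    interface EventQueue {
        void addEvent(String event);
    }

    /** Raises each event to the listener as soon as it is added. */
    static final class ImmediateQueue implements EventQueue {
        private final Consumer<String> listener;

        ImmediateQueue(Consumer<String> listener) {
            this.listener = listener;
        }

        @Override
        public void addEvent(String event) {
            listener.accept(event);
        }
    }

    /** Buffers events and raises them only on commit; a rollback resets the buffer. */
    static final class TransactionalQueue implements EventQueue {
        private final Consumer<String> listener;
        private final List<String> pending = new ArrayList<>();

        TransactionalQueue(Consumer<String> listener) {
            this.listener = listener;
        }

        @Override
        public void addEvent(String event) {
            pending.add(event);
        }

        void commit() {
            pending.forEach(listener);
            pending.clear();
        }

        void rollback() {
            pending.clear();
        }
    }
}
```

With the transactional variant, a listener never observes a mutation that is later rolled back, at the cost of being notified later than with the immediate variant.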

Caution
EventStrategy is not meant for usage in tracking global mutations across separate processes. In other words, a mutation in one JVM process is not raised as an event in a different JVM process. In addition, events are not raised when mutations occur outside of the Traversal context.

PartitionStrategy

partition graph

PartitionStrategy partitions the vertices and edges of a graph into String named partitions (i.e. buckets, subgraphs, etc.). The idea behind PartitionStrategy is presented in the image above where each element is in a single partition (represented by its color). Partitions can be read from, written to, and linked/joined by edges that span one or two partitions (e.g. a tail vertex in one partition and a head vertex in another).

There are three primary variables in PartitionStrategy:

  1. Partition Key - The property key that denotes a String value representing a partition.

  2. Write Partition - A String denoting what partition all future written elements will be in.

  3. Read Partitions - A Set<String> of partitions that can be read from.

The best way to understand PartitionStrategy is via example.

gremlin> graph = TinkerFactory.createModern()
==>tinkergraph[vertices:6 edges:6]
gremlin> strategyA = PartitionStrategy.build().partitionKey("_partition").writePartition("a").addReadPartition("a").create()
==>PartitionStrategy
gremlin> strategyB = PartitionStrategy.build().partitionKey("_partition").writePartition("b").addReadPartition("b").create()
==>PartitionStrategy
gremlin> gA = GraphTraversalSource.build().with(strategyA).create(graph)
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> gA.addV() // this vertex has a property of {_partition:"a"}
==>v[12]
gremlin> gB = GraphTraversalSource.build().with(strategyB).create(graph)
==>graphtraversalsource[tinkergraph[vertices:7 edges:6], standard]
gremlin> gB.addV() // this vertex has a property of {_partition:"b"}
==>v[14]
gremlin> gA.V()
==>v[12]
gremlin> gB.V()
==>v[14]

By writing elements to particular partitions and then restricting read partitions, the developer is able to create multiple graphs within a single address space. Moreover, by supporting references between partitions, it is possible to merge those multiple graphs (i.e. join partitions).
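
The interplay of the three variables (partition key, write partition, read partitions) can be sketched with a shared store and several views over it. The classes below are hypothetical and model vertices only; they illustrate the filtering idea rather than how PartitionStrategy rewrites a traversal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model: one shared vertex store, several partition-restricted views.
final class PartitionSketch {
    static final String PARTITION_KEY = "_partition";

    /** Shared store: vertex id -> properties (including the partition property). */
    static final class Graph {
        final Map<Integer, Map<String, String>> vertices = new LinkedHashMap<>();
        int nextId = 0;
    }

    static final class View {
        private final Graph graph;
        private final String writePartition;
        private final Set<String> readPartitions;

        View(Graph graph, String writePartition, String... readPartitions) {
            this.graph = graph;
            this.writePartition = writePartition;
            this.readPartitions = new HashSet<>(Arrays.asList(readPartitions));
        }

        /** Every written vertex is stamped with the view's write partition. */
        int addVertex() {
            int id = graph.nextId++;
            Map<String, String> properties = new HashMap<>();
            properties.put(PARTITION_KEY, writePartition);
            graph.vertices.put(id, properties);
            return id;
        }

        /** Only vertices whose partition is in the read set are visible. */
        List<Integer> vertices() {
            List<Integer> visible = new ArrayList<>();
            for (Map.Entry<Integer, Map<String, String>> entry : graph.vertices.entrySet()) {
                if (readPartitions.contains(entry.getValue().get(PARTITION_KEY))) {
                    visible.add(entry.getKey());
                }
            }
            return visible;
        }
    }
}
```

A view created with both "a" and "b" as read partitions sees the vertices written through either of the single-partition views, which is the sense in which partitions can be joined.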

ReadOnlyStrategy

ReadOnlyStrategy is largely self-explanatory. A Traversal that has this strategy applied will throw an IllegalStateException if the Traversal has any mutating steps within it.
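
As a rough illustration of that check, the sketch below models a traversal as a list of step names and rejects it if any step is in an assumed set of mutating steps. The class name and the contents of the MUTATING set are hypothetical; the real strategy inspects step objects rather than names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical verification pass over a traversal modeled as a list of step names.
final class ReadOnlySketch {
    // Assumed set of mutating step names, chosen for illustration only.
    static final Set<String> MUTATING = new HashSet<>(Arrays.asList("addV", "addE", "property", "drop"));

    static void verifyReadOnly(List<String> steps) {
        for (String step : steps) {
            if (MUTATING.contains(step)) {
                throw new IllegalStateException("Mutating step not allowed in read-only traversal: " + step);
            }
        }
    }
}
```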

SubgraphStrategy

SubgraphStrategy is quite similar to PartitionStrategy in that it restrains a Traversal to certain vertices and edges as determined by a Traversal criterion defined individually for each.

gremlin> graph = TinkerFactory.createModern()
==>tinkergraph[vertices:6 edges:6]
gremlin> strategy = SubgraphStrategy.build().edgeCriterion(hasId(8,9,10)).create()
==>SubgraphStrategy
gremlin> g = GraphTraversalSource.build().with(strategy).create(graph)
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V() // shows all vertices as no filter for vertices was specified
==>v[1]
==>v[2]
==>v[3]
==>v[4]
==>v[5]
==>v[6]
gremlin> g.E() // shows only the edges defined in the edgeCriterion
==>e[8][1-knows->4]
==>e[9][1-created->3]
==>e[10][4-created->5]

This strategy is implemented such that the vertices attached to an Edge must both satisfy the vertexCriterion (if present) in order for the Edge to be considered a part of the subgraph.
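
That inclusion rule can be written down directly. The sketch below is a hypothetical model (an Edge holds only its id and the ids of its two vertices, and the edge list mirrors the toy graph used in the console sessions); it shows the rule, not the SubgraphStrategy code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of the edge inclusion rule.
final class SubgraphSketch {

    static final class Edge {
        final int id;
        final int outV;
        final int inV;

        Edge(int id, int outV, int inV) {
            this.id = id;
            this.outV = outV;
            this.inV = inV;
        }
    }

    /** Edges of the toy graph as (edge id, out-vertex id, in-vertex id). */
    static List<Edge> modernEdges() {
        return Arrays.asList(new Edge(7, 1, 2), new Edge(8, 1, 4), new Edge(9, 1, 3),
                new Edge(10, 4, 5), new Edge(11, 4, 3), new Edge(12, 6, 3));
    }

    /**
     * Returns the ids of edges in the subgraph. An edge qualifies only if it passes the
     * edge criterion and, when a vertex criterion is given, both of its vertices pass it.
     */
    static List<Integer> edges(List<Edge> all, Predicate<Edge> edgeCriterion, Predicate<Integer> vertexCriterion) {
        List<Integer> ids = new ArrayList<>();
        for (Edge edge : all) {
            boolean verticesOk = vertexCriterion == null
                    || (vertexCriterion.test(edge.outV) && vertexCriterion.test(edge.inV));
            if (edgeCriterion.test(edge) && verticesOk) {
                ids.add(edge.id);
            }
        }
        return ids;
    }
}
```

With an edge criterion of ids 8 to 10 and no vertex criterion, edges 8, 9 and 10 remain; adding a vertex criterion that excludes vertex 3 also removes edge 9, because one of its vertices no longer qualifies.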

The GraphComputer

graphcomputer-puffers

TinkerPop3 provides two primary means of interacting with a graph: online transaction processing (OLTP) and online analytical processing (OLAP). OLTP-based graph systems allow the user to query the graph in real-time. However, typically, real-time performance is only possible when a local traversal is enacted. A local traversal is one that starts at a particular vertex (or small set of vertices) and touches a small set of connected vertices (by any arbitrary path of arbitrary length). In short, OLTP queries interact with a limited set of data and respond on the order of milliseconds or seconds. On the other hand, with OLAP graph processing, the entire graph is processed and thus, every vertex and edge is analyzed (sometimes more than once for iterative, recursive algorithms). Due to the amount of data being processed, the results are typically not returned in real-time and for massive graphs (i.e. graphs represented across a cluster of machines), results can take on the order of minutes or hours.

oltp vs olap

The image above demonstrates the difference between Gremlin OLTP and Gremlin OLAP. With Gremlin OLTP, the graph is walked by moving from vertex-to-vertex via incident edges. With Gremlin OLAP, all vertices are provided a VertexProgram. The programs send messages to one another with the topological structure of the graph acting as the communication network (though random message passing is possible). In many respects, the messages passed are like the OLTP traversers moving from vertex-to-vertex. However, all messages are moving independent of one another, in parallel. Once a vertex program is finished computing, TinkerPop3’s OLAP engine supports any number of MapReduce jobs over the resultant graph.

Important
GraphComputer was designed from the start to be used within a multi-JVM, distributed environment — in other words, a multi-machine compute cluster. As such, all the computing objects must be able to be migrated between JVMs. The pattern promoted is to store state information in a Configuration object to later be regenerated by a loading process. It is important to realize that VertexProgram, MapReduce, and numerous particular instances rely heavily on the state of the computing classes (not the structure, but the processes) to be stored in a Configuration.
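
The store-then-regenerate pattern can be sketched without any TinkerPop dependency. In the illustration below, java.util.Properties stands in for the Configuration object, and the class name and configuration keys are hypothetical.

```java
import java.util.Properties;

// Hypothetical program whose state round-trips through a configuration object.
final class StatefulProgramSketch {
    static final String ALPHA = "sketch.alpha";                      // assumed key name
    static final String TOTAL_ITERATIONS = "sketch.totalIterations"; // assumed key name

    double alpha = 0.85d;
    int totalIterations = 30;

    /** Writes everything needed to rebuild this instance on another JVM. */
    void storeState(Properties configuration) {
        configuration.setProperty(ALPHA, Double.toString(alpha));
        configuration.setProperty(TOTAL_ITERATIONS, Integer.toString(totalIterations));
    }

    /** Rebuilds the state from the configuration, falling back to defaults for missing keys. */
    void loadState(Properties configuration) {
        alpha = Double.parseDouble(configuration.getProperty(ALPHA, "0.85"));
        totalIterations = Integer.parseInt(configuration.getProperty(TOTAL_ITERATIONS, "30"));
    }
}
```

Because only the configuration travels between JVMs, any field that is not written in storeState() is lost on the receiving side and reverts to its default.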

VertexProgram

bsp-diagram

GraphComputer takes a VertexProgram. A VertexProgram can be thought of as a piece of code that is executed at each vertex in a logically parallel manner until some termination condition is met (e.g. a number of iterations have occurred, no more data is changing in the graph, etc.). A submitted VertexProgram is copied to all the workers in the graph. A worker is not an explicit concept in the API, but is assumed of all GraphComputer implementations. At a minimum, each vertex is a worker (though this would be inefficient because each vertex would maintain its own VertexProgram). In practice, the workers partition the vertex set and are responsible for the execution of the VertexProgram over all the vertices within their sphere of influence. The workers orchestrate the execution of the VertexProgram.execute() method on all their vertices in a bulk synchronous parallel (BSP) fashion. The vertices are able to communicate with one another via messages. There are two kinds of messages in Gremlin OLAP: MessageScope.Local and MessageScope.Global. A local message is a message to an adjacent vertex. A global message is a message to any arbitrary vertex in the graph. Once the VertexProgram has completed its execution, any number of MapReduce jobs are evaluated. MapReduce jobs are provided by the user via GraphComputer.mapReduce() or by the VertexProgram via VertexProgram.getMapReducers().

graphcomputer

The example below demonstrates how to submit a VertexProgram to a graph’s GraphComputer. GraphComputer.submit() yields a Future<ComputerResult>. The ComputerResult has the resultant computed graph which can be a full copy of the original graph (see Hadoop-Gremlin) or a view over the original graph (see TinkerGraph). The ComputerResult also provides access to computational side-effects called Memory (which includes, for example, runtime, number of iterations, results of MapReduce jobs, and VertexProgram-specific memory manipulations).

gremlin> result = graph.compute().program(PageRankVertexProgram.build().create()).submit().get()
==>result[tinkergraph[vertices:6 edges:0],memory[size:0]]
gremlin> result.memory().runtime
==>83
gremlin> g = result.graph().traversal(standard())
==>graphtraversalsource[tinkergraph[vertices:6 edges:0], standard]
gremlin> g.V().valueMap('name',PageRankVertexProgram.PAGE_RANK)
==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], name:[marko]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.19250000000000003], name:[vadas]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.4018125], name:[lop]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.19250000000000003], name:[josh]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.23181250000000003], name:[ripple]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], name:[peter]]
Note
This model of "vertex-centric graph computing" was made popular by Google’s Pregel graph engine. In the open source world, this model is found in OLAP graph computing systems such as Giraph, Hama, and Faunus. TinkerPop3 extends the popularized model with integrated post-processing MapReduce jobs over the vertex set.
Important
As of TinkerPop3 3.0.0-SNAPSHOT, message combiners are not supported. The primary reason is that TinkerPop wants to provide a model of message combining that does not require all messages to a particular vertex to be combined. This allows for more complex message passing scenarios to exist, where multi-typed messages are possible. However, at this time, no general solution has been developed.

MapReduce

The BSP model proposed by Pregel stores the results of the computation in a distributed manner as properties on the elements in the graph. In many situations, it is necessary to aggregate those resultant properties into a single result set (i.e. a statistic). For instance, assume a VertexProgram that computes a nominal cluster for each vertex (i.e. a graph clustering algorithm). At the end of the computation, each vertex will have a property denoting the cluster it was assigned to. TinkerPop3 provides the ability to answer global questions about the clusters. For instance, in order to answer the following questions, MapReduce jobs are required:

  • How many vertices are in each cluster? (presented below)

  • How many unique clusters are there? (presented below)

  • What is the average age of the vertices in each cluster?

  • What is the degree distribution of the vertices in each cluster?

A compressed representation of the MapReduce API in TinkerPop3 is provided below. The key idea is that the map-stage processes all vertices to emit key/value pairs. Those values are aggregated on their respective key for the reduce-stage to do its processing to ultimately yield more key/value pairs.

public interface MapReduce<MK, MV, RK, RV, R> {
  public void map(final Vertex vertex, final MapEmitter<MK, MV> emitter);
  public void reduce(final MK key, final Iterator<MV> values, final ReduceEmitter<RK, RV> emitter);
  // there are more methods
}
Important
The vertex that is passed into the MapReduce.map() method does not contain edges. The vertex only contains original and computed vertex properties. This reduces the amount of data required to be loaded and ensures that MapReduce is used for post-processing computed results. All edge-based computing should be accomplished in the VertexProgram.
mapreduce

The MapReduce extension to GraphComputer is made explicit when examining the PeerPressureVertexProgram and corresponding ClusterPopulationMapReduce. In the code below, the GraphComputer result (ComputerResult) contains the computed-on Graph as well as the Memory of the computation. The memory maintains the results of any MapReduce jobs. The cluster population MapReduce result states that there are 5 vertices in cluster 1 and 1 vertex in cluster 6. This can be verified (in a serial manner) by looking at the PeerPressureVertexProgram.CLUSTER property of the resultant graph. Notice that the property is "hidden" unless it is directly accessed via name.

gremlin> graph = TinkerFactory.createModern()
==>tinkergraph[vertices:6 edges:6]
gremlin> result = graph.compute().program(PeerPressureVertexProgram.build().create()).mapReduce(ClusterPopulationMapReduce.build().create()).submit().get()
==>result[tinkergraph[vertices:6 edges:0],memory[size:2]]
gremlin> result.memory().get('clusterPopulation')
==>1=5
==>6=1
gremlin> g = result.graph().traversal(standard())
==>graphtraversalsource[tinkergraph[vertices:6 edges:0], standard]
gremlin> g.V().values(PeerPressureVertexProgram.CLUSTER).groupCount().next()
==>1=5
==>6=1
gremlin> g.V().valueMap()
==>[gremlin.peerPressureVertexProgram.voteStrength:[1.0], gremlin.peerPressureVertexProgram.cluster:[1], name:[marko], age:[29]]
==>[gremlin.peerPressureVertexProgram.voteStrength:[1.0], gremlin.peerPressureVertexProgram.cluster:[1], name:[vadas], age:[27]]
==>[gremlin.peerPressureVertexProgram.voteStrength:[1.0], gremlin.peerPressureVertexProgram.cluster:[1], name:[lop], lang:[java]]
==>[gremlin.peerPressureVertexProgram.voteStrength:[1.0], gremlin.peerPressureVertexProgram.cluster:[1], name:[josh], age:[32]]
==>[gremlin.peerPressureVertexProgram.voteStrength:[1.0], gremlin.peerPressureVertexProgram.cluster:[1], name:[ripple], lang:[java]]
==>[gremlin.peerPressureVertexProgram.voteStrength:[1.0], gremlin.peerPressureVertexProgram.cluster:[6], name:[peter], age:[35]]

If numerous statistics are desired, then it’s possible to register as many MapReduce jobs as needed. For instance, the ClusterCountMapReduce determines how many unique clusters were created by the peer pressure algorithm. Below, both ClusterCountMapReduce and ClusterPopulationMapReduce are computed over the resultant graph.

gremlin> result = graph.compute().program(PeerPressureVertexProgram.build().create()).
                    mapReduce(ClusterPopulationMapReduce.build().create()).
                    mapReduce(ClusterCountMapReduce.build().create()).submit().get()
==>result[tinkergraph[vertices:6 edges:0],memory[size:3]]
gremlin> result.memory().clusterPopulation
==>1=5
==>6=1
gremlin> result.memory().clusterCount
==>2
Important
The MapReduce model of TinkerPop3 does not support MapReduce chaining. Thus, the order in which the MapReduce jobs are executed is irrelevant. This is made apparent when realizing that the map()-stage takes a Vertex as its input and the reduce()-stage yields key/value pairs. Thus, the results of reduce cannot feed back into map.
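
The map and reduce stages behind the cluster population and cluster count statistics can be sketched in plain Java. This is a simplified illustration, not the ClusterPopulationMapReduce or ClusterCountMapReduce source: the class name is hypothetical, the input is just a map from vertex id to computed cluster, and grouping by key happens inside reduce().

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-JVM model of a map stage followed by a reduce stage.
final class ClusterMapReduceSketch {

    /** Map stage: every vertex emits the pair (its cluster, 1). */
    static List<Map.Entry<Integer, Integer>> map(Map<Integer, Integer> clusterOfVertex) {
        List<Map.Entry<Integer, Integer>> emitted = new ArrayList<>();
        for (int cluster : clusterOfVertex.values()) {
            emitted.add(new AbstractMap.SimpleEntry<>(cluster, 1));
        }
        return emitted;
    }

    /** Reduce stage: values sharing a key are summed, giving the population of each cluster. */
    static Map<Integer, Integer> reduce(List<Map.Entry<Integer, Integer>> emitted) {
        Map<Integer, Integer> population = new TreeMap<>();
        for (Map.Entry<Integer, Integer> pair : emitted) {
            population.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return population;
    }

    /** A second, independent job: the number of distinct clusters. */
    static int clusterCount(Map<Integer, Integer> clusterOfVertex) {
        return new HashSet<>(clusterOfVertex.values()).size();
    }
}
```

Feeding in the cluster assignment from the console session (vertices 1 to 5 in cluster 1, vertex 6 in cluster 6) gives a population of 5 and 1 and a cluster count of 2. Both jobs read the same vertex data and neither consumes the other's output, which is why their order does not matter.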

A Collection of VertexPrograms

TinkerPop3 provides a collection of VertexPrograms that implement common algorithms. This section discusses the various implementations.

Important
The vertex programs presented are what are provided as of TinkerPop 3.0.0-SNAPSHOT. Over time, with future releases, more algorithms will be added.

PageRankVertexProgram

gremlin-pagerank

PageRank is perhaps the most popular OLAP-oriented graph algorithm. This eigenvector centrality variant was developed by Brin and Page of Google. PageRank defines a centrality value for all vertices in the graph, where centrality is defined recursively: a vertex is central if it is connected to central vertices. PageRank is an iterative algorithm that converges to a steady state distribution. If the pageRank values are normalized to 1.0, then the pageRank value of a vertex is the probability that a random walker will be seen at that vertex at any arbitrary moment in time. In order to help developers understand the methods of a VertexProgram, the PageRankVertexProgram code is analyzed below.

public class PageRankVertexProgram implements VertexProgram<Double> { (1)

    private MessageScope.Local<Double> incidentMessageScope = MessageScope.Local.of(__::outE); (2)
    private MessageScope.Local<Double> countMessageScope = MessageScope.Local.of(new MessageScope.Local.ReverseTraversalSupplier(this.incidentMessageScope));

    public static final String PAGE_RANK = "gremlin.pageRankVertexProgram.pageRank"; (3)
    public static final String EDGE_COUNT = "gremlin.pageRankVertexProgram.edgeCount";

    private static final String VERTEX_COUNT = "gremlin.pageRankVertexProgram.vertexCount";
    private static final String ALPHA = "gremlin.pageRankVertexProgram.alpha";
    private static final String TOTAL_ITERATIONS = "gremlin.pageRankVertexProgram.totalIterations";
    private static final String INCIDENT_TRAVERSAL_SUPPLIER = "gremlin.pageRankVertexProgram.incidentTraversalSupplier";

    private ConfigurationTraversal<Vertex, Edge> configurationTraversal;
    private double vertexCountAsDouble = 1.0d;
    private double alpha = 0.85d;
    private int totalIterations = 30;

    private static final Set<String> COMPUTE_KEYS = new HashSet<>(Arrays.asList(PAGE_RANK, EDGE_COUNT));

    private PageRankVertexProgram() {}

    @Override
    public void loadState(final Graph graph, final Configuration configuration) { (4)
        if (configuration.containsKey(INCIDENT_TRAVERSAL_SUPPLIER)) {
            this.configurationTraversal = ConfigurationTraversal.loadState(graph, configuration, INCIDENT_TRAVERSAL_SUPPLIER);
            this.incidentMessageScope = MessageScope.Local.of(this.configurationTraversal);
            this.countMessageScope = MessageScope.Local.of(new MessageScope.Local.ReverseTraversalSupplier(this.incidentMessageScope));
        }
        this.vertexCountAsDouble = configuration.getDouble(VERTEX_COUNT, 1.0d);
        this.alpha = configuration.getDouble(ALPHA, 0.85d);
        this.totalIterations = configuration.getInt(TOTAL_ITERATIONS, 30);
    }

    @Override
    public void storeState(final Configuration configuration) {
        configuration.setProperty(VERTEX_PROGRAM, PageRankVertexProgram.class.getName());
        configuration.setProperty(VERTEX_COUNT, this.vertexCountAsDouble);
        configuration.setProperty(ALPHA, this.alpha);
        configuration.setProperty(TOTAL_ITERATIONS, this.totalIterations);
        if (null != this.configurationTraversal) {
            this.configurationTraversal.storeState(configuration);
        }
    }

    @Override
    public Set<String> getElementComputeKeys() { (5)
        return COMPUTE_KEYS;
    }

    @Override
    public Optional<MessageCombiner<Double>> getMessageCombiner() {
        return (Optional) PageRankMessageCombiner.instance();
    }

    @Override
    public Set<MessageScope> getMessageScopes(final int iteration) {
        final Set<MessageScope> set = new HashSet<>();
        set.add(0 == iteration ? this.countMessageScope : this.incidentMessageScope);
        return set;
    }

    @Override
    public void setup(final Memory memory) {

    }

    @Override
    public void execute(final Vertex vertex, Messenger<Double> messenger, final Memory memory) { (6)
        if (memory.isInitialIteration()) {  (7)
            messenger.sendMessage(this.countMessageScope, 1.0d);
        } else if (1 == memory.getIteration()) {  (8)
            double initialPageRank = 1.0d / this.vertexCountAsDouble;
            double edgeCount = IteratorUtils.reduce(messenger.receiveMessages(), 0.0d, (a, b) -> a + b);
            vertex.property(PAGE_RANK, initialPageRank);
            vertex.property(EDGE_COUNT, edgeCount);
            messenger.sendMessage(this.incidentMessageScope, initialPageRank / edgeCount);
        } else { (9)
            double newPageRank = IteratorUtils.reduce(messenger.receiveMessages(), 0.0d, (a, b) -> a + b);
            newPageRank = (this.alpha * newPageRank) + ((1.0d - this.alpha) / this.vertexCountAsDouble);
            vertex.property(PAGE_RANK, newPageRank);
            messenger.sendMessage(this.incidentMessageScope, newPageRank / vertex.<Double>value(EDGE_COUNT));
        }
    }

    @Override
    public boolean terminate(final Memory memory) { (10)
        return memory.getIteration() >= this.totalIterations;
    }

    @Override
    public String toString() {
        return StringFactory.vertexProgramString(this, "alpha=" + this.alpha + ",iterations=" + this.totalIterations);
    }
}
  1. PageRankVertexProgram implements VertexProgram<Double> because the messages it sends are Java doubles.

  2. The default path of energy propagation is via outgoing edges from the current vertex.

  3. The resulting PageRank values for the vertices are stored as a hidden property.

  4. A vertex program is constructed using an Apache Configuration to ensure easy dissemination across a cluster of JVMs.

  5. A vertex program must define the "compute keys" that are the properties being operated on during the computation.

  6. The "while"-loop of the vertex program.

  7. In order to determine how to distribute the energy to neighbors, a "1"-count is used to determine how many incident vertices exist for the MessageScope.

  8. Initially, each vertex is provided an equal amount of energy represented as a double.

  9. Energy is aggregated, computed on according to the PageRank algorithm, and then disseminated according to the defined MessageScope.Local.

  10. The computation is terminated after a pre-defined number of iterations.
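
The arithmetic of the execute() method can be replayed without a GraphComputer. The sketch below is a single-threaded approximation under assumptions: the class PageRankSketch is hypothetical, the out-degree is computed directly instead of through the initial counting messages, and each loop pass plays the role of one BSP iteration in which all messages are delivered before any rank is updated. The edge list mirrors the six edges of the toy graph from the console sessions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical single-threaded replay of the PageRank update rule.
final class PageRankSketch {
    // Directed edges of the toy graph as {out-vertex id, in-vertex id}.
    static final int[][] EDGES = {{1, 2}, {1, 4}, {1, 3}, {4, 5}, {4, 3}, {6, 3}};

    static Map<Integer, Double> pageRank(int[][] edges, Set<Integer> vertices,
                                         double alpha, double vertexCount, int iterations) {
        // Number of outgoing edges per vertex (the EDGE_COUNT property in the listing).
        Map<Integer, Integer> outDegree = new HashMap<>();
        for (int[] edge : edges) {
            outDegree.merge(edge[0], 1, Integer::sum);
        }

        // Every vertex starts with an equal amount of energy.
        Map<Integer, Double> rank = new TreeMap<>();
        for (int vertex : vertices) {
            rank.put(vertex, 1.0d / vertexCount);
        }

        for (int i = 0; i < iterations; i++) {
            // Send: each vertex splits its current rank evenly over its outgoing edges.
            Map<Integer, Double> inbox = new HashMap<>();
            for (int[] edge : edges) {
                inbox.merge(edge[1], rank.get(edge[0]) / outDegree.get(edge[0]), Double::sum);
            }
            // Receive: aggregate incoming energy and apply the damping factor.
            for (int vertex : vertices) {
                double received = inbox.getOrDefault(vertex, 0.0d);
                rank.put(vertex, (alpha * received) + ((1.0d - alpha) / vertexCount));
            }
        }
        return rank;
    }
}
```

Called with alpha 0.85, a vertex count of 1.0 (the default of vertexCountAsDouble in the listing) and 30 iterations, this settles on the same values as the console session shown earlier, for example roughly 0.15 for vertex 1 and 0.4018 for vertex 3. Because the vertex count is left at 1.0, the ranks are not normalized to sum to 1.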

PeerPressureVertexProgram

The PeerPressureVertexProgram is a clustering algorithm that assigns a nominal value to each vertex in the graph. The nominal value represents the vertex’s cluster. If two vertices have the same nominal value, then they are in the same cluster. The algorithm proceeds in the following manner.

  1. Every vertex assigns itself to a unique cluster ID (initially, its vertex ID).

  2. Every vertex determines its per neighbor vote strength as 1.0d / incident edges count.

  3. Every vertex sends its cluster ID and vote strength to its adjacent vertices as a Pair<Serializable,Double>.

  4. Every vertex generates a vote energy distribution of received cluster IDs and changes its current cluster ID to the most frequent cluster ID.

    1. If there is a tie, then the cluster with the lowest toString() comparison is selected.

  5. Steps 3 and 4 repeat until either a maximum number of iterations has occurred or no vertex changes its cluster.
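
The steps above can be sketched as a small single-threaded loop. Several details are assumptions made for this illustration and are not stated in the text: votes travel along outgoing edges, every vertex counts its own current cluster as one vote, and the vote strength is fixed at 1.0 rather than divided by the edge count. The class PeerPressureSketch is hypothetical and the edge list mirrors the toy graph from the console sessions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical single-threaded replay of the peer pressure voting rounds.
final class PeerPressureSketch {
    // Directed edges of the toy graph as {out-vertex id, in-vertex id}.
    static final int[][] EDGES = {{1, 2}, {1, 4}, {1, 3}, {4, 5}, {4, 3}, {6, 3}};

    static Map<Integer, Integer> cluster(int[][] edges, Set<Integer> vertices, int maxIterations) {
        Map<Integer, Integer> cluster = new TreeMap<>();
        for (int vertex : vertices) {
            cluster.put(vertex, vertex); // step 1: own id as the initial cluster
        }
        for (int i = 0; i < maxIterations; i++) {
            // Steps 3 and 4: tally votes per receiving vertex (assumed strength 1.0 per vote).
            Map<Integer, Map<Integer, Double>> votes = new HashMap<>();
            for (int vertex : vertices) {
                Map<Integer, Double> own = new HashMap<>();
                own.put(cluster.get(vertex), 1.0d); // assumption: a vertex votes for itself
                votes.put(vertex, own);
            }
            for (int[] edge : edges) {
                votes.get(edge[1]).merge(cluster.get(edge[0]), 1.0d, Double::sum);
            }
            boolean changed = false;
            for (int vertex : vertices) {
                int winner = strongest(votes.get(vertex));
                if (winner != cluster.get(vertex)) {
                    cluster.put(vertex, winner);
                    changed = true;
                }
            }
            if (!changed) {
                break; // step 5: no vertex adjusted its cluster
            }
        }
        return cluster;
    }

    /** Picks the cluster with the most votes; ties go to the lowest toString() value. */
    static int strongest(Map<Integer, Double> votes) {
        Integer best = null;
        for (Map.Entry<Integer, Double> entry : votes.entrySet()) {
            if (best == null) {
                best = entry.getKey();
                continue;
            }
            double difference = entry.getValue() - votes.get(best);
            boolean winsTie = difference == 0
                    && entry.getKey().toString().compareTo(best.toString()) < 0;
            if (difference > 0 || winsTie) {
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Under these assumptions, the toy graph ends with vertices 1 to 5 in cluster 1 and vertex 6 alone in cluster 6, since vertex 6 has no incoming edges and therefore never receives a vote other than its own.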

TraversalVertexProgram

traversal-vertex-program

The TraversalVertexProgram is a "special" VertexProgram in that it can be executed via GraphTraversal and the submit :> command in Gremlin Console. In Gremlin, it is possible to have the same traversal executed using either the standard OLTP-engine or the GraphComputer OLAP-engine. The difference is where the traversal is submitted.

Note
This model of graph traversal in a BSP system was first implemented by the Faunus graph analytics engine and originally described in Local and Distributed Traversal Engines.
gremlin> g = graph.traversal(standard())
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V().both().hasLabel('person').values('age').groupCount().next() // OLTP
==>32=3
==>35=1
==>27=1
==>29=3
gremlin> g = graph.traversal(computer())
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], tinkergraphcomputer]
gremlin> g.V().both().hasLabel('person').values('age').groupCount().next() // OLAP
==>32=3
==>35=1
==>27=1
==>29=3
olap traversal

In the OLAP example above, a TraversalVertexProgram is (logically) sent to each vertex in the graph. Each instance evaluation requires (logically) 5 BSP iterations and each iteration is interpreted as such:

  1. g.V(): Put a traverser on each vertex in the graph.

  2. both(): Propagate each traverser to the vertices both-adjacent to its current vertex.

  3. hasLabel('person'): If the vertex is not a person, kill the traversers at that vertex.

  4. values('age'): Have all the traversers reference the integer age of their current vertex.

  5. groupCount(): Count how many times a particular age has been seen.

While 5 iterations were presented, TraversalVertexProgram will in fact execute the traversal in only 3 iterations. The reason is that hasLabel('person').values('age').groupCount() can all be executed in a single iteration, as any message sent would simply be to the currently executing vertex. Thus, a simple optimization exists in Gremlin OLAP called "reflexive message passing" which simulates non-message-passing BSP iterations within a single BSP iteration.

When the computation is complete, a MapReduce job executes which aggregates all the groupCount() sideEffect Map (i.e. "HashMap") objects on each vertex into a single local representation (thus, turning the distributed Map representation into a local Map representation).
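
The shape of that aggregation can be sketched in plain Java. The class below is hypothetical and makes simplifying assumptions: the toy graph's edges and ages are hard-coded from the console sessions, a vertex counts as a person if it has an age, and the per-vertex maps are merged by an ordinary loop instead of a MapReduce job.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of per-vertex groupCount() maps merged into one local map.
final class DistributedGroupCountSketch {
    // Edges of the toy graph as {out-vertex id, in-vertex id}.
    static final int[][] EDGES = {{1, 2}, {1, 4}, {1, 3}, {4, 5}, {4, 3}, {6, 3}};

    /** Ages of the person vertices; the software vertices (3 and 5) have no age. */
    static Map<Integer, Integer> ages() {
        Map<Integer, Integer> ages = new HashMap<>();
        ages.put(1, 29);
        ages.put(2, 27);
        ages.put(4, 32);
        ages.put(6, 35);
        return ages;
    }

    /** Distributed form: vertex id -> (age -> count), one partial map per vertex. */
    static Map<Integer, Map<Integer, Long>> localCounts() {
        Map<Integer, Integer> ages = ages();
        Map<Integer, Map<Integer, Long>> local = new TreeMap<>();
        for (int[] edge : EDGES) {
            for (int target : edge) { // both(): each edge delivers one traverser to each endpoint
                Integer age = ages.get(target);
                if (age == null) {
                    continue; // not a person: the traverser is filtered out
                }
                local.computeIfAbsent(target, k -> new HashMap<>()).merge(age, 1L, Long::sum);
            }
        }
        return local;
    }

    /** Local form: all partial maps summed into a single map. */
    static Map<Integer, Long> merge(Collection<Map<Integer, Long>> partials) {
        Map<Integer, Long> global = new TreeMap<>();
        for (Map<Integer, Long> partial : partials) {
            partial.forEach((age, count) -> global.merge(age, count, Long::sum));
        }
        return global;
    }
}
```

Each vertex only knows the counts for traversers that arrived at it (vertex 1 holds a count of 3 for age 29, for example); the merged map contains the same four entries as the console output above.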

Distributed Gremlin Gotchas

Gremlin OLTP is not identical to Gremlin OLAP.

Important
There are two primary theoretical differences between Gremlin OLTP and Gremlin OLAP. First, Gremlin OLTP (via Traversal) leverages a depth-first execution engine. Depth-first execution has a limited memory footprint due to lazy evaluation. On the other hand, Gremlin OLAP (via TraversalVertexProgram) leverages a breadth-first execution engine which maintains a larger memory footprint, but a better time complexity due to vertex-local traversers being able to be merged. The second difference is that Gremlin OLTP is executed in a serial fashion, while Gremlin OLAP is executed in a parallel fashion. These two fundamental differences lead to the behaviors enumerated below.
gremlin without a cause
  1. Traversal sideEffects are represented as a distributed data structure across the graph’s vertex set. It is not possible to get a global view of a sideEffect until it is aggregated via a MapReduce job. In some situations, the local vertex representation of the sideEffect is sufficient to ensure the intended semantics of the traversal are respected. However, this is not generally true, so be wary of traversals that require global views of a sideEffect.

  2. When evaluating traversals that rely on path information (i.e. the history of the traversal), practical computational limits can easily be reached due to the combinatorial explosion of data. With path computing enabled, every traverser is unique and thus must be enumerated as opposed to being counted/merged. The difference is between a collection of paths and a single 64-bit long at a single vertex. For more information on this concept, please see Faunus Provides Big Graph Data.

  3. When traversals of the form x.as('a').y.someSideEffectStep('a').z are evaluated, the a object is stored in the path information of the traverser and thus, such traversals (may) turn on path calculations when executed on a GraphComputer.

  4. Steps that are concerned with the global ordering of traversers do not have a meaningful representation in OLAP. For example, what does order()-step mean when all traversers are being processed in parallel? Even if the traversers were aggregated and ordered, at the next step they would return to being executed in parallel and thus in an unpredictable order. When order()-like steps are executed at the end of a traversal (i.e. the final step), the TraverserMapReduce job ensures the resultant serial representation is ordered accordingly.

  5. Steps that are concerned with providing a global aggregate to the next step of computation do not have a correlate in OLAP. For example, fold()-step can only fold up the objects at each executing vertex. Next, even if a global fold was possible, where would it go? Which vertex would be the host of the data structure? The fold()-step only makes sense as an end-step whereby a MapReduce job can generate the proper global-to-local data reduction.

Gremlin Applications

Gremlin applications represent tools that are built on top of the core APIs to help expose common functionality to users when working with graphs. There are two key applications:

  1. Gremlin Console - A REPL environment for interactive development and analysis

  2. Gremlin Server - A server that hosts script engines thus enabling remote Gremlin execution

gremlin-lab-coat

Gremlin is designed to be extensible, making it possible for users and vendors to customize it to their needs. Such extensibility is also found in the Gremlin Console and Server, where a universal plugin system makes it possible to extend their capabilities. One of the important aspects of the plugin system is the ability to help the user install the plugins through the command line, thus automating the process of gathering dependencies and other error-prone activities.

The process of plugin installation is handled by Grape, which helps resolve dependencies into the classpath. It is therefore important to ensure that Grape is properly configured in order to use the automated capabilities of plugin installation. Grape is configured by ~/.groovy/grapeConfig.xml and generally speaking, if that file is not present, the default settings will suffice. However, they will not suffice if a required dependency is not in one of the default configured repositories. Please see the Custom Ivy Settings section of the Grape documentation for more details on the defaults. TinkerPop recommends the following configuration in that file:

<ivysettings>
  <settings defaultResolver="downloadGrapes"/>
  <resolvers>
    <chain name="downloadGrapes">
      <filesystem name="cachedGrapes">
        <ivy pattern="${user.home}/.groovy/grapes/[organisation]/[module]/ivy-[revision].xml"/>
        <artifact pattern="${user.home}/.groovy/grapes/[organisation]/[module]/[type]s/[artifact]-[revision].[ext]"/>
      </filesystem>
      <ibiblio name="codehaus" root="http://repository.codehaus.org/" m2compatible="true"/>
      <ibiblio name="central" root="http://central.maven.org/maven2/" m2compatible="true"/>
      <ibiblio name="java.net2" root="http://download.java.net/maven/2/" m2compatible="true"/>
      <ibiblio name="hyracs-releases" root="http://obelix.ics.uci.edu/nexus/content/groups/hyracks-public-releases/" m2compatible="true"/>
    </chain>
  </resolvers>
</ivysettings>

Note that if the intention is to work with TinkerPop snapshots then the file should also include:

<ibiblio name="apache-snapshots" root="http://repository.apache.org/snapshots/" m2compatible="true"/>

Additionally, the Grape configuration can also be modified to include the local system’s Maven .m2 directory by including:

<ibiblio name="local" root="file:${user.home}/.m2/repository/" m2compatible="true"/>

This configuration is useful during development of TinkerPop Plugins (i.e. if one is working with locally built artifacts). Consider adding the "local" reference first in the set of <ibiblio> resolvers, as putting it after "apache-snapshots" will likely resolve dependencies from that repository before looking locally. If that happens, the artifact from the newer local build may not be used.

Caution
If building TinkerPop from source, be sure to clear TinkerPop-related jars from the ~/.groovy/grapes directory as they can become stale on some systems and not re-import properly from the local .m2 after fresh rebuilds.

Gremlin Console

The Gremlin Console is an interactive terminal or REPL that can be used to traverse graphs and interact with the data that they contain. It represents the most common method for performing ad-hoc graph analysis, small- to medium-sized data loading projects and other exploratory functions. The Gremlin Console is highly extensible, featuring a rich plugin system that allows new tools, commands, DSLs, etc. to be exposed to users.

To start the Gremlin Console, run gremlin.sh or gremlin.bat:

$ bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin loaded: tinkerpop.server
plugin loaded: tinkerpop.utilities
plugin loaded: tinkerpop.tinkergraph
gremlin>
Note
If the above plugins are not loaded then they will need to be enabled or else certain examples will not work. If using the standard Gremlin Console distribution, then the plugins should be enabled by default. See below for more information on the :plugin use command to manually enable plugins. These plugins, with the exception of tinkerpop.tinkergraph, cannot be removed from the Console as they are a part of the gremlin-console.jar itself. These plugins can only be deactivated.

The Gremlin Console is loaded and ready for commands. Recall that the console hosts the Gremlin-Groovy language. Please review Groovy for help on Groovy-related constructs. In short, Groovy is a superset of Java. What works in Java, works in Groovy. However, Groovy provides many shorthands to make it easier to interact with the Java API. Moreover, Gremlin provides many neat shorthands to make it easier to express paths through a property graph.

gremlin> i = 'goodbye'
==>goodbye
gremlin> j = 'self'
==>self
gremlin> i + " " + j
==>goodbye self
gremlin> "${i} ${j}"
==>goodbye self

The "toy" graph provides a way to get started with Gremlin quickly.

gremlin> g = TinkerFactory.createModern().traversal(standard())
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V()
==>v[1]
==>v[2]
==>v[3]
==>v[4]
==>v[5]
==>v[6]
gremlin> g.V().values('name')
==>marko
==>vadas
==>lop
==>josh
==>ripple
==>peter
gremlin> g.V().has('name','marko').out('knows').values('name')
==>vadas
==>josh
Tip
When using Gremlin-Groovy in a Groovy class file, add static { GremlinLoader.load() } to the head of the file.

Console Commands

In addition to the standard commands of the Groovy Shell, Gremlin adds some other useful operations. The following table outlines the most commonly used commands:

Command | Alias | Description
:help | :? | Displays list of commands and descriptions. When followed by a command name, it will display more specific help on that particular item.
:exit | :x | Ends the Console session.
import | :i | Import a class into the Console session.
:clear | :c | Sometimes the Console can get into a state where the command buffer no longer understands input (e.g. a misplaced ( or }). Use this command to clear that buffer.
:load | :l | Load a file or URL into the command buffer for execution.
:install | :+ | Imports a maven library and its dependencies into the Console.
:uninstall | :- | Removes a maven library and its dependencies. A restart of the console is required for removal to fully take effect.
:plugin | :pin | Plugin management functions to list, activate and deactivate available plugins.
:remote | :rem | Configures a "remote" context where Gremlin or results of Gremlin will be processed via usage of :submit.
:submit | :> | Submit Gremlin to the currently active context defined by :remote.

Gremlin Console adds a special max-iteration preference that can be configured with the standard :set command from the Groovy Shell. Use this setting to control the maximum number of results that the Console will display. Consider the following usage:

gremlin> :set max-iteration 10
gremlin> (0..200)
==>0
==>1
==>2
==>3
==>4
==>5
==>6
==>7
==>8
==>9
...
gremlin> :set max-iteration 5
gremlin> (0..200)
==>0
==>1
==>2
==>3
==>4
...

If this setting is not present, the console will default the maximum to 100 results.
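
The truncation behavior described above can be pictured with a small Java sketch. It is not the Console's actual implementation: the class name MaxIterationSketch and the method render are hypothetical, and only the default of 100 and the trailing "..." marker are taken from the text above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical illustration of a display cap; not the Console's real code.
final class MaxIterationSketch {
    static final int DEFAULT_MAX_ITERATION = 100; // default stated in the text

    // Collects at most maxIteration result lines, then appends "..." if more remain.
    static List<String> render(Iterator<?> results, int maxIteration) {
        List<String> lines = new ArrayList<>();
        int shown = 0;
        while (results.hasNext() && shown < maxIteration) {
            lines.add("==>" + results.next());
            shown++;
        }
        if (results.hasNext()) {
            lines.add("...");
        }
        return lines;
    }

    public static void main(String[] args) {
        // Mirrors ":set max-iteration 5" followed by "(0..200)".
        render(IntStream.rangeClosed(0, 200).iterator(), 5).forEach(System.out::println);
    }
}
```

Because the cap is applied while iterating, the remaining results are never pulled from the iterator, which is why a low setting keeps large result sets from flooding the terminal.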

Dependencies and Plugin Usage

The Gremlin Console can dynamically load external code libraries and make them available to the user. Furthermore, those dependencies may contain Gremlin plugins which can expand the language, provide useful functions, etc. These important console features are managed by the :install and :plugin commands.

The following Gremlin Console session demonstrates the basics of these features:

gremlin> :plugin list  (1)
==>tinkerpop.server[active]
==>tinkerpop.gephi
==>tinkerpop.utilities[active]
==>tinkerpop.sugar
==>tinkerpop.tinkergraph[active]
gremlin> :plugin use tinkerpop.sugar  (2)
==>tinkerpop.sugar activated
gremlin> :install org.apache.tinkerpop neo4j-gremlin 3.0.0-SNAPSHOT  (3)
==>loaded: [org.apache.tinkerpop, neo4j-gremlin, 3.0.0-SNAPSHOT]
gremlin> :plugin list (4)
==>tinkerpop.server[active]
==>tinkerpop.gephi
==>tinkerpop.utilities[active]
==>tinkerpop.sugar
==>tinkerpop.tinkergraph[active]
==>tinkerpop.neo4j
gremlin> :plugin use tinkerpop.neo4j (5)
==>tinkerpop.neo4j activated
gremlin> :plugin list (6)
==>tinkerpop.server[active]
==>tinkerpop.gephi
==>tinkerpop.sugar[active]
==>tinkerpop.utilities[active]
==>tinkerpop.neo4j[active]
==>tinkerpop.tinkergraph[active]
  1. Show a list of "available" plugins. The list of "available" plugins is determined by the classes available on the Console classpath. Plugins need to be "active" for their features to be available.

  2. To make a plugin "active" execute the :plugin use command and specify the name of the plugin to enable.

  3. Sometimes there are external dependencies that would be useful within the Console. To bring those in, execute :install and specify the Maven coordinates for the dependency.

  4. Note that there is a "tinkerpop.neo4j" plugin available, but it is not yet "active".

  5. Again, to use the "tinkerpop.neo4j" plugin, it must be made "active" with :plugin use.

  6. Now when the plugin list is displayed, the "tinkerpop.neo4j" plugin is displayed as "active".

Caution
Plugins must be compatible with the version of the Gremlin Console (or Gremlin Server) being used. Attempts to use incompatible versions cannot be guaranteed to work. Moreover, be prepared for dependency conflicts in third-party plugins, which may only be resolvable by manually removing jars from the ext/{plugin} directory.
Tip
It is possible to manage plugin activation and deactivation by manually editing the ext/plugins.txt file which contains the class names of the "active" plugins. It is also possible to clear dependencies added by :install by deleting them from the ext directory.

Gremlin Server

Gremlin Server provides a way to remotely execute Gremlin scripts against one or more Graph instances hosted within it. The benefits of using Gremlin Server include:

  • Allows any Gremlin Structure-enabled graph to exist as a standalone server, which in turn enables the ability for multiple clients to communicate with the same graph database.

  • Enables execution of ad-hoc queries through remotely submitted Gremlin scripts.

  • Allows for the hosting of Gremlin-based DSLs (Domain Specific Languages) that expand the Gremlin language to match the language of the application domain, which will help support common graph use cases such as searching, ranking, and recommendation.

  • Provides a method for non-JVM languages (e.g. Python, JavaScript, etc.) to communicate with the TinkerPop stack.

  • Exposes numerous methods for extension and customization to include serialization options, remote commands, etc.

Note
Gremlin Server is the replacement for Rexster.

By default, communication with Gremlin Server occurs over WebSockets and exposes a custom sub-protocol for interacting with the server.

Connecting via Console

The most direct way to get started with Gremlin Server is to issue it some remote Gremlin scripts from the Gremlin Console. To do that, first start Gremlin Server:

$ bin/gremlin-server.sh conf/gremlin-server-modern.yaml
[INFO] GremlinServer -
         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----

[INFO] GremlinServer - Configuring Gremlin Server from conf/gremlin-server-modern.yaml
[INFO] MetricManager - Configured Metrics Slf4jReporter configured with interval=180000ms and loggerName=org.apache.tinkerpop.gremlin.server.Settings$Slf4jReporterMetrics
[INFO] Graphs - Graph [graph] was successfully configured via [conf/tinkergraph-empty.properties].
[INFO] ServerGremlinExecutor - Initialized Gremlin thread pool.  Threads in pool named with pattern gremlin-*
[INFO] ScriptEngines - Loaded gremlin-groovy ScriptEngine
[INFO] GremlinExecutor - Initialized gremlin-groovy ScriptEngine with scripts/generate-modern.groovy
[INFO] ServerGremlinExecutor - Initialized GremlinExecutor and configured ScriptEngines.
[INFO] ServerGremlinExecutor - A GraphTraversalSource is now bound to [g] with graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
[INFO] GremlinServer - Executing start up LifeCycleHook
[INFO] Logger$info - Loading 'modern' graph data.
[INFO] AbstractChannelizer - Configured application/vnd.gremlin-v1.0+gryo with org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0
[INFO] AbstractChannelizer - Configured application/vnd.gremlin-v1.0+gryo-stringd with org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0
[INFO] GremlinServer$1 - Gremlin Server configured with worker thread pool of 1, gremlin pool of 8 and boss thread pool of 1.
[INFO] GremlinServer$1 - Channel started at port 8182.

Gremlin Server is configured by the provided YAML file conf/gremlin-server-modern.yaml. That file tells Gremlin Server many things such as:

  • The host and port to serve on

  • Thread pool sizes

  • Where to report metrics gathered by the server

  • The serializers to make available

  • The Gremlin ScriptEngine instances to expose and external dependencies to inject into them

  • Graph instances to expose

The log messages printed above show a number of things, but most importantly, there is a Graph instance named graph that is exposed in Gremlin Server. This graph is an in-memory TinkerGraph and was empty at the start of the server. An initialization script at scripts/generate-modern.groovy was executed during startup. Its contents are as follows:



// Generates the modern graph into an "empty" TinkerGraph via LifeCycleHook
// it is important that the hook be assigned to a variable (in this case "hook").
// the exact name of this variable is unimportant.
hook = [
  onStartUp: { ctx ->
    ctx.logger.info("Loading 'modern' graph data.")
    TinkerFactory.generateModern(graph)
  }
] as LifeCycleHook

// Define the default TraversalSource to bind queries to. Code outside of the "hook"
// will execute for each instantiated ScriptEngine instance. Use this part of the
// script to initialize functions that are meant to be re-usable.
g = graph.traversal()

There are two important aspects to the above script. First, it defines a LifeCycleHook for Gremlin Server. The "hook" provides a way to tie script code into the Gremlin Server startup and shutdown sequences. The LifeCycleHook has two methods that can be implemented: onStartUp and onShutDown. These are called once at Gremlin Server start and once at Gremlin Server stop, respectively. In this case, the startup hook loads the "modern" graph into the empty TinkerGraph instance, preparing it for use. Second, code outside of the "hook" is executed for each ScriptEngine creation (multiple may be created when "sessions" are enabled). Here, the script creates a TraversalSource variable g from graph. This variable g will be made available on future remote script executions. Note that Graph and TraversalSource objects that are created or modified in the initialization script will become globally bound to the server. It is not possible to bind variables of other types. Any functions that are defined will be cached for future use.

With Gremlin Server running it is now possible to issue some scripts to it for processing. Start Gremlin Console as follows:

$ bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
gremlin>

The console has the notion of a "remote", which represents a place a script will be sent from the console to be evaluated elsewhere in some other context (e.g. Gremlin Server, Hadoop, etc.). To create a remote in the console, do the following:

gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Connected - localhost/127.0.0.1:8182

The :remote command shown above displays the current status of the remote connection. This command can also be used to configure a new connection and change other related settings. To actually send a script to the server a different command is required:

gremlin> :> g.V().values('name')
==>marko
==>vadas
==>lop
==>josh
==>ripple
==>peter
gremlin> :> g.V().has('name','marko').out('created').values('name')
==>lop
gremlin> :> g.E().label().groupCount()
==>{created=4, knows=2}
gremlin> result
==>result{object={created=4, knows=2} class=java.lang.String}
gremlin> :remote close
==>Removed - Gremlin Server - [localhost/127.0.0.1:8182]

The :> command, which is a shorthand for :submit, sends the script to the server to execute there. Results are wrapped in a Result object, which is just a holder for each individual result. The class shows the data type of the contained value. Note that the last script sent was supposed to return a Map, but its class is java.lang.String. By default, the connection is configured to only return text results. In other words, Gremlin Server is using toString to serialize all results back to the console. This enables virtually any object on the server to be returned to the console, but it does not allow the data to be worked with in any way in the console itself. A different configuration of the :remote is required to get the results back as "objects":

gremlin> :remote connect tinkerpop.server conf/remote-objects.yaml //(1)
==>Connected - localhost/127.0.0.1:8182
gremlin> :remote list //(2)
==>*0 - Gremlin Server - [localhost/127.0.0.1:8182]
gremlin> :> g.E().label().groupCount() //(3)
==>[created:4, knows:2]
gremlin> m = result[0].object //(4)
==>created=4
==>knows=2
gremlin> m.sort {it.value}
==>knows=2
==>created=4
gremlin> script = """
                  matthias = graph.addVertex('name','matthias')
                  matthias.addEdge('co-creator',g.V().has('name','marko').next())
                  """
==>
         matthias = graph.addVertex('name','matthias')
         matthias.addEdge('co-creator',g.V().has('name','marko').next())

gremlin> :> @script //(5)
==>e[15][13-co-creator->1]
gremlin> :> g.V().has('name','matthias').out('co-creator').values('name')
==>marko
gremlin> :remote close
==>Removed - Gremlin Server - [localhost/127.0.0.1:8182]
  1. This configuration file specifies that results should be deserialized back into an Object in the console with the caveat being that the server and console both know how to serialize and deserialize the result to be returned.

  2. Lists the configured remote connections. Only one is shown here because the previous connection was closed. The one marked by an asterisk is the current one that :submit will use.

  3. When the script is executed again, the class is no longer shown to be a java.lang.String. It is instead a java.util.HashMap.

  4. The last result of a remote script is always stored in the reserved variable result, which allows access to the Result and by virtue of that, the Map itself.

  5. If the submission requires multiple lines to express, then a multi-line string can be created. The :> command recognizes that the user is referencing a variable via @ and submits the string script.

Tip
In Groovy, """ text """ is a convenient way to create a multi-line string and works well in concert with :> @variable. Note that this model of submitting a string variable works for all :> based plugins, not just Gremlin Server.
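
The difference between text results and object results described above can be seen with plain Java, independent of Gremlin Server. The following is a minimal sketch only: the class ToStringResultSketch and the method serializeAsText are hypothetical stand-ins and do not use the server or the driver. It shows that once a Map has been rendered with toString, the receiver holds a String and can no longer treat it as a Map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; it does not use Gremlin Server or its driver.
final class ToStringResultSketch {
    // Stands in for a server that serializes every result with toString().
    static Object serializeAsText(Object result) {
        return String.valueOf(result);
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("created", 4L);
        counts.put("knows", 2L);

        Object received = serializeAsText(counts);
        System.out.println(received);                      // {created=4, knows=2}
        System.out.println(received.getClass().getName()); // java.lang.String
        System.out.println(received instanceof Map);       // false
    }
}
```

Sorting by value, as done with m.sort in the console session, is only possible when the result arrives as an actual Map.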

Connecting via Java

<dependency>
   <groupId>org.apache.tinkerpop</groupId>
   <artifactId>gremlin-driver</artifactId>
   <version>3.0.0-SNAPSHOT</version>
</dependency>

TinkerPop3 comes equipped with a reference client for Java-based applications. It is referred to as Gremlin Driver, which enables applications to send requests to Gremlin Server and get back results. The Maven dependency above adds it to a project.

Gremlin code is sent to the server from a Client instance. A Client is created as follows:

Cluster cluster = Cluster.open();  (1)
Client client = cluster.connect(); (2)
  1. Opens a reference to localhost - note that there are many configuration options available in defining a Cluster object.

  2. Creates a Client given the configuration options of the Cluster.

Once a Client instance is ready, it is possible to issue some Gremlin:

ResultSet results = client.submit("[1,2,3,4]");  (1)
results.stream().map(i -> i.get(Integer.class) * 2);       (2)

CompletableFuture<List<Result>> allResults = client.submit("[1,2,3,4]").all();  (3)

CompletableFuture<ResultSet> future = client.submitAsync("[1,2,3,4]"); (4)

Map<String,Object> params = new HashMap<>();
params.put("x",4);
client.submit("[1,2,3,x]", params); (5)
  1. Submits a script that simply returns a List of integers. This method blocks until the request is written to the server and a ResultSet is constructed.

  2. Even though the ResultSet is constructed, it does not mean that the server has sent back the results (or even evaluated the script potentially). The ResultSet is just a holder that is awaiting the results from the server. In this case, they are streamed from the server as they arrive.

  3. Submit a script, get a ResultSet, then return a CompletableFuture that completes when all results have been returned.

  4. Submit a script asynchronously without waiting for the request to be written to the server.

  5. Parameterized requests are considered the most efficient way to send Gremlin to the server, as the script can be cached, which will boost performance and reduce the resources required on the server.

Rebinding

Scripts submitted to Gremlin Server automatically have the globally configured Graph and TraversalSource instances made available to them. Therefore, if Gremlin Server configures two TraversalSource instances called "g1" and "g2" a script can simply reference them directly as:

client.submit("g1.V()");
client.submit("g2.V()");

While this is an acceptable way to submit scripts, it has the downside of forcing the client to encode the server-side variable name directly into the script being sent. If the server configuration ever changed such that "g1" became "g100", the client-side code might have to see a significant amount of change. Decoupling the script code from the server configuration can be managed by the rebind method on Client as follows:

Client g1Client = client.rebind("g1");
Client g2Client = client.rebind("g2");
g1Client.submit("g.V()");
g2Client.submit("g.V()");

The above code demonstrates how the rebind method can be used so that the script need only contain a reference to "g", while "g1" and "g2" are automatically rebound to "g" on the server-side.
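
Conceptually, rebinding is a name mapping applied before the script runs. The following Java sketch simulates that idea with plain maps. It is not how Gremlin Server or the driver implements rebinding: RebindSketch, bindingsFor and the string values standing in for TraversalSource objects are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of rebinding; not the driver's or server's implementation.
final class RebindSketch {
    // Builds the variables visible to a script: every global keeps its name,
    // and the global named by `target` is additionally exposed as "g".
    static Map<String, Object> bindingsFor(Map<String, Object> globals, String target) {
        if (!globals.containsKey(target)) {
            throw new IllegalArgumentException("No global binding named " + target);
        }
        Map<String, Object> visible = new HashMap<>(globals);
        visible.put("g", globals.get(target));
        return visible;
    }

    public static void main(String[] args) {
        Map<String, Object> globals = new HashMap<>();
        globals.put("g1", "traversal source for graph one"); // stand-in value
        globals.put("g2", "traversal source for graph two"); // stand-in value

        // The same script text "g.V()" would see a different "g" per client.
        System.out.println(bindingsFor(globals, "g1").get("g"));
        System.out.println(bindingsFor(globals, "g2").get("g"));
    }
}
```

The point of the design is that the script text stays constant; only the target name passed when the client is rebound changes if the server configuration is renamed.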

Serialization

When using Gryo serialization (the default serializer for the driver), it is important that the client and server have the same serializers configured, or else one or the other will experience serialization exceptions and communication will fail. Discrepancy in serializer registration between client and server can happen fairly easily, as graphs will automatically include serializers on the server-side, leaving the client to be configured manually. This can be done as follows:

GryoMapper kryo = GryoMapper.build().addRegistry(TitanIoRegistry.INSTANCE).create();
MessageSerializer serializer = new GryoMessageSerializerV1d0(kryo);
Cluster cluster = Cluster.build()
                .serializer(serializer)
                .create();
Client client = cluster.connect().init();

The above code demonstrates using the TitanIoRegistry, which is an IoRegistry instance. It tells the serializer what classes (from Titan in this case) to auto-register during serialization. Gremlin Server uses roughly this same approach when it configures its serializers, so using this same model will ensure compatibility when making requests.

Connecting via REST

While the default behavior for Gremlin Server is to provide a WebSockets-based connection, it can also be configured to support REST. The REST endpoint provides a communication protocol familiar to most developers, with wide support among programming languages, tools and libraries for accessing it. As a result, REST provides a fast way to get started with Gremlin Server. It may also represent an easier upgrade path from Rexster, as the API for the endpoint is very similar to Rexster’s Gremlin Extension.

Gremlin Server provides for a single REST endpoint - a Gremlin evaluator - which allows the submission of a Gremlin script as a request. For each request, it returns a response containing the serialized results of that script. To enable this endpoint, Gremlin Server needs to be configured with the HttpChannelizer, which replaces the default WebSocketChannelizer, in the configuration file:

channelizer: org.apache.tinkerpop.gremlin.server.channel.HttpChannelizer

This setting is already configured in the gremlin-server-rest-modern.yaml file that is packaged with the Gremlin Server distribution. To utilize it, start Gremlin Server as follows:

bin/gremlin-server.sh conf/gremlin-server-rest-modern.yaml

Once the server has started, issue a request. Here’s an example with cURL:

$ curl "http://localhost:8182?gremlin=100-1"

which returns:

{
  "result":{"data":99,"meta":{}},
  "requestId":"0581cdba-b152-45c4-80fa-3d36a6eecf1c",
  "status":{"code":200,"attributes":{},"message":""}
}

The above example showed a GET operation, but the preferred method for this endpoint is POST:

curl -X POST -d "{\"gremlin\":\"100-1\"}" "http://localhost:8182"

which returns:

{
  "result":{"data":99,"meta":{}},
  "requestId":"ef2fe16c-441d-4e13-9ddb-3c7b5dfb10ba",
  "status":{"code":200,"attributes":{},"message":""}
}

It is also preferred that Gremlin scripts be parameterized when possible via bindings:

curl -X POST -d "{\"gremlin\":\"100-x\", \"bindings\":{\"x\":1}}" "http://localhost:8182"

The bindings argument is a Map of variables where the keys become available as variables in the Gremlin script. Note that parameterization of requests is critical to performance, as repeated script compilation can be avoided on each request.
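
The JSON body used in the cURL example can also be assembled programmatically. The following Java sketch is a minimal illustration under stated assumptions: RestBodySketch, quote and buildBody are hypothetical names, only String and Number binding values are handled, and control characters are not escaped. A real client would more likely use a JSON library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for building the POST body; supports only String and Number bindings.
final class RestBodySketch {
    // Escapes backslashes and double quotes so simple values form valid JSON strings
    // (control characters are not handled in this sketch).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Produces {"gremlin":"<script>","bindings":{...}} in the iteration order of the map.
    static String buildBody(String gremlin, Map<String, Object> bindings) {
        StringJoiner pairs = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, Object> e : bindings.entrySet()) {
            Object v = e.getValue();
            String json = (v instanceof Number) ? v.toString() : quote(String.valueOf(v));
            pairs.add(quote(e.getKey()) + ":" + json);
        }
        return "{\"gremlin\":" + quote(gremlin) + ",\"bindings\":" + pairs + "}";
    }

    public static void main(String[] args) {
        Map<String, Object> bindings = new LinkedHashMap<>();
        bindings.put("x", 1);
        // Prints the same body as the cURL example, without the extra spaces.
        System.out.println(buildBody("100-x", bindings));
    }
}
```

Keeping the script text fixed ("100-x") and varying only the bindings is what allows the server to avoid recompiling the script on each request.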

Note
It is possible to pass bindings via GET based requests. Query string arguments prefixed with "bindings." will be treated as parameters, where that prefix will be removed and the value following the period will become the parameter name. In other words, bindings.x will create a parameter named "x" that can be referenced in the submitted Gremlin script. The caveat is that these arguments will always be treated as String values. To ensure that data types are preserved or to pass complex objects such as lists or maps, use POST which will at least support the allowed JSON data types.
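
As a rough illustration of the GET form, the sketch below builds a URL with a "bindings."-prefixed query argument. The names RestGetUrlSketch, enc and buildUrl are hypothetical, and the script and binding value are invented for the example. Since such bindings always arrive as String values, the example script only calls a String method on x.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for building a GET request URL with "bindings." arguments.
final class RestGetUrlSketch {
    // Form-encodes a query string component as UTF-8.
    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Appends the script as "gremlin" and each binding as "bindings.<name>".
    static String buildUrl(String base, String gremlin, Map<String, String> bindings) {
        StringBuilder url = new StringBuilder(base).append("?gremlin=").append(enc(gremlin));
        for (Map.Entry<String, String> e : bindings.entrySet()) {
            url.append("&bindings.").append(enc(e.getKey())).append("=").append(enc(e.getValue()));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Map<String, String> bindings = new LinkedHashMap<>();
        bindings.put("x", "marko"); // arrives on the server as a String
        System.out.println(buildUrl("http://localhost:8182", "x.toUpperCase()", bindings));
    }
}
```

For numeric or structured parameters, the POST form remains the better fit, as the note above explains.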

Finally, as Gremlin Server can host multiple ScriptEngine instances (e.g. gremlin-groovy, nashorn), it is possible to define the language to utilize to process the request:

curl -X POST -d "{\"gremlin\":\"100-x\", \"language\":\"gremlin-groovy\", \"bindings\":{\"x\":1}}" "http://localhost:8182"

By default this value is set to gremlin-groovy. If using a GET operation, this value can be set as a query string argument by setting the language key.

Caution
Consider the size of the result of a submitted script being returned from the REST endpoint. A script that iterates thousands of results will serialize each of those in memory into a single JSON result set. It is quite possible that such a script will generate OutOfMemoryError exceptions on the server. Consider the default WebSockets configuration, which supports streaming, if that type of use case is required.

Configuring

As mentioned earlier, Gremlin Server is configured through a YAML file. By default, Gremlin Server will look for a file called conf/gremlin-server.yaml to configure itself on startup. To override this default, supply the file to use to bin/gremlin-server.sh as in:

bin/gremlin-server.sh conf/gremlin-server-min.yaml

The gremlin-server.sh file also serves a second purpose. It can be used to "install" dependencies to the Gremlin Server path. For example, to be able to configure and use other Graph implementations, the dependencies must be made available to Gremlin Server. To do this, use the -i switch and supply the Maven coordinates for the dependency to "install". For example, to use Neo4j in Gremlin Server:

bin/gremlin-server.sh -i org.apache.tinkerpop neo4j-gremlin 3.0.0-SNAPSHOT

This command will "grab" the appropriate dependencies and copy them to the ext directory of Gremlin Server, which will then allow them to be "used" the next time the server is started. To uninstall dependencies, simply delete them from the ext directory.

The following table describes the various configuration options that Gremlin Server expects:

Key Description Default

channelizer

The fully qualified classname of the Channelizer implementation to use. A Channelizer is a "channel initializer" which Gremlin Server uses to define the type of processing pipeline to use. By allowing different Channelizer implementations, Gremlin Server can support different communication protocols (e.g. WebSockets, Java NIO, etc.).

WebSocketChannelizer

graphs

A Map of Graph configuration files where the key of the Map becomes the name to which the Graph will be bound and the value is the file name of a Graph configuration file.

none

gremlinPool

The number of "Gremlin" threads available to execute actual scripts in a ScriptEngine. This pool represents the workers available to handle blocking operations in Gremlin Server.

8

host

The name of the host to bind the server to.

localhost

maxAccumulationBufferComponents

Maximum number of request components that can be aggregated for a message.

1024

maxChunkSize

The maximum length of the content or each chunk. If the content length exceeds this value, the transfer encoding of the decoded request will be converted to chunked and the content will be split into multiple HttpContent objects. If the transfer encoding of the HTTP request is chunked already, each chunk will be split into smaller chunks if the length of the chunk exceeds this value.

8192

maxContentLength

The maximum length of the aggregated content for a message. Works in concert with maxChunkSize where chunked requests are accumulated back into a single message. A request exceeding this size will return a 413 - Request Entity Too Large status code. A response exceeding this size will raise an internal exception.

65536

maxHeaderSize

The maximum length of all headers.

8192

maxInitialLineLength

The maximum length of the initial line (e.g. "GET / HTTP/1.0") processed in a request, which essentially controls the maximum length of the submitted URI.

4096

metrics.consoleReporter.enabled

Turns on console reporting of metrics.

false

metrics.consoleReporter.interval

Time in milliseconds between reports of metrics to console.

180000

metrics.csvReporter.enabled

Turns on CSV reporting of metrics.

false

metrics.csvReporter.fileName

The file to write metrics to.

none

metrics.csvReporter.interval

Time in milliseconds between reports of metrics to file.

180000

metrics.gangliaReporter.addressingMode

Set to MULTICAST or UNICAST.

none

metrics.gangliaReporter.enabled

Turns on Ganglia reporting of metrics.

false

metrics.gangliaReporter.host

Define the Ganglia host to report Metrics to.

localhost

metrics.gangliaReporter.interval

Time in milliseconds between reports of metrics for Ganglia.

180000

metrics.gangliaReporter.port

Define the Ganglia port to report Metrics to.

8649

metrics.graphiteReporter.enabled

Turns on Graphite reporting of metrics.

false

metrics.graphiteReporter.host

Define the Graphite host to report Metrics to.

localhost

metrics.graphiteReporter.interval

Time in milliseconds between reports of metrics for Graphite.

180000

metrics.graphiteReporter.port

Define the Graphite port to report Metrics to.

2003

metrics.graphiteReporter.prefix

Define a "prefix" to append to metrics keys reported to Graphite.

none

metrics.jmxReporter.enabled

Turns on JMX reporting of metrics.

false

metrics.slf4jReporter.enabled

Turns on SLF4j reporting of metrics.

false

metrics.slf4jReporter.interval

Time in milliseconds between reports of metrics to SLF4j.

180000

plugins

A list of plugins that should be activated on server startup in the available script engines. It assumes that the plugins are in Gremlin Server’s classpath.

none

port

The port to bind the server to.

8182

processors

A List of Map settings, where each Map represents an OpProcessor implementation to use along with its configuration.

none

processors[X].className

The full class name of the OpProcessor implementation.

none

processors[X].config

A Map containing OpProcessor specific configurations.

none

resultIterationBatchSize

Defines the size in which the result of a request is "batched" back to the client. In other words, if set to 1, then a result that had ten items in it would get each result sent back individually. If set to 2 the same ten results would come back in five batches of two each.

64

scriptEngines

A Map of ScriptEngine implementations to expose through Gremlin Server, where the key is the name given by the ScriptEngine implementation. The key must match the name exactly for the ScriptEngine to be constructed. The value paired with this key is itself a Map of configuration for that ScriptEngine.

none

scriptEngines.<name>.imports

A comma separated list of classes/packages to make available to the ScriptEngine.

none

scriptEngines.<name>.staticImports

A comma separated list of "static" imports to make available to the ScriptEngine.

none

scriptEngines.<name>.scripts

A comma separated list of script files to execute on ScriptEngine initialization. Graph and TraversalSource instance references produced from scripts will be stored globally in Gremlin Server, therefore it is possible to use initialization scripts to add Traversal Strategies or create entirely new Graph instances altogether. Instantiating a LifeCycleHook in a script provides a way to execute scripts when Gremlin Server starts and stops.

none

scriptEngines.<name>.config

A Map of configuration settings for the ScriptEngine. These settings are dependent on the ScriptEngine implementation being used.

none

scriptEvaluationTimeout

The amount of time in milliseconds before a script evaluation times out. The notion of "script evaluation" refers to the time it takes for the ScriptEngine to do its work and not any additional time it takes for the result of the evaluation to be iterated and serialized.

30000

serializers

A List of Map settings, where each Map represents a MessageSerializer implementation to use along with its configuration.

none

serializers[X].className

The full class name of the MessageSerializer implementation.

none

serializers[X].config

A Map containing MessageSerializer specific configurations.

none

serializedResponseTimeout

The amount of time in milliseconds before a response serialization times out. The notion of "response serialization" refers to the time it takes for Gremlin Server to iterate an entire result after the script is evaluated in the ScriptEngine.

30000

ssl.enabled

Determines if SSL is turned on or not.

false

ssl.keyCertChainFile

The X.509 certificate chain file in PEM format. If this value is not present and ssl.enabled is true a self-signed certificate will be used (not suitable for production).

none

ssl.keyFile

The PKCS#8 private key file in PEM format. If this value is not present and ssl.enabled is true a self-signed certificate will be used (not suitable for production).

none

ssl.keyPassword

The password of the keyFile if it is password-protected.

none

ssl.trustCertChainFile

Trusted certificates for verifying the remote endpoint’s certificate. The file should contain an X.509 certificate chain in PEM format. A system default will be used if this setting is not present.

none

threadPoolBoss

The number of threads available to Gremlin Server for accepting connections. Should always be set to 1.

1

threadPoolWorker

The number of threads available to Gremlin Server for processing non-blocking reads and writes.

1

writeBufferHighWaterMark

If the number of bytes in the network send buffer exceeds this value then the channel is no longer writeable and accepts no additional writes until the buffer is drained and the writeBufferLowWaterMark is met.

65536

writeBufferLowWaterMark

Once the number of bytes queued in the network send buffer exceeds the writeBufferHighWaterMark, the channel will not become writeable again until the buffer is drained and it drops below this value.

65536
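
The interplay of writeBufferHighWaterMark and writeBufferLowWaterMark described above can be pictured with a small model. The following is only a sketch of the hysteresis the two settings describe, not Gremlin Server or networking-library code. The class name `WriteBufferSketch` and its methods are hypothetical and exist purely for illustration.

```java
/** Illustrative model of high/low water mark back-pressure (hypothetical class, not server code). */
final class WriteBufferSketch {
    private final long highWaterMark;
    private final long lowWaterMark;
    private long pendingBytes = 0;
    private boolean writable = true;

    WriteBufferSketch(long highWaterMark, long lowWaterMark) {
        this.highWaterMark = highWaterMark;
        this.lowWaterMark = lowWaterMark;
    }

    /** Queue bytes for sending; the channel stops being writable once the high mark is exceeded. */
    void enqueue(long bytes) {
        pendingBytes += bytes;
        if (pendingBytes > highWaterMark) writable = false;
    }

    /** Bytes flushed to the network; the channel becomes writable again only below the low mark. */
    void drain(long bytes) {
        pendingBytes = Math.max(0, pendingBytes - bytes);
        if (!writable && pendingBytes < lowWaterMark) writable = true;
    }

    boolean isWritable() { return writable; }
}
```

With hypothetical marks of 100 and 40, queuing 101 bytes makes the model unwritable. Draining down to 51 pending bytes leaves it unwritable, and it becomes writable again only once fewer than 40 bytes remain queued.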

Note
Configuration of Ganglia requires an additional library that is not packaged with Gremlin Server due to its LGPL licensing that conflicts with TinkerPop’s Apache 2.0 License. To run Gremlin Server with Ganglia monitoring, download the org.acplt:oncrpc jar and copy it to the Gremlin Server /lib directory before starting the server.

Serialization

Gremlin Server can accept requests and return results using different serialization formats. The format of the serialization is configured by the serializers setting described in the table above. Note that some serializers have additional configuration options as defined by the serializers[X].config setting. The config setting is a Map where the keys and values get passed to the serializer at its initialization. The available and/or expected keys are dependent on the serializer being used. Gremlin Server comes packaged with two different serializers: GraphSON and Gryo.

GraphSON

The GraphSON serializer produces human readable output in JSON format and is a good configuration choice for those trying to use TinkerPop from non-JVM languages. JSON obviously has wide support across virtually all major programming languages and can be consumed by a wide variety of tools.

  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV1d0 }

The above configuration represents the default serialization under the application/json MIME type and produces JSON consistent with standard JSON data types. It has the following configuration option:

Key Description Default

useMapperFromGraph

Specifies the name of the Graph (from the graphs Map in the configuration file) from which to plug in any custom serializers that are tied to it.

none

  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV1d0 }

When the standard JSON data types are not enough (e.g. need to identify the difference between double and float data types), the above configuration will embed types into the JSON itself. The type embedding uses standard Java type names, so interpretation from non-JVM languages will be required. It has the MIME type of application/vnd.gremlin-v1.0+json and the following configuration options:

Key Description Default

useMapperFromGraph

Specifies the name of the Graph (from the graphs Map in the configuration file) from which to plug in any custom serializers that are tied to it.

none

Gryo

The Gryo serializer utilizes Kryo-based serialization which produces a binary output. This format is best consumed by JVM-based languages.

  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerGremlinV1d0 }

It has the MIME type of application/vnd.gremlin-v1.0+gryo and the following configuration options:

Key Description Default

serializeResultToString

When set to true, results are serialized by first calling toString() on each object in the result list resulting in an extended MIME Type of application/vnd.gremlin-v1.0+gryo-stringd. When set to false Kryo-based serialization is applied.

false

useMapperFromGraph

Specifies the name of the Graph (from the graphs Map in the configuration file) from which to plug in any custom serializers that are tied to it.

none

custom

A list of classes with custom kryo Serializer implementations related to them in the form of <class>;<serializer-class>.

none
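
The relationship between MIME types and the serializer classes listed above can be summarized as a lookup table. This is a minimal sketch that only restates the pairings given in this section. The class `SerializerLookupSketch` is hypothetical and does not reflect how Gremlin Server resolves serializers internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical lookup of the packaged serializers by MIME type, as listed in this section. */
final class SerializerLookupSketch {
    private static final String PKG = "org.apache.tinkerpop.gremlin.driver.ser.";
    private static final Map<String, String> BY_MIME_TYPE = new LinkedHashMap<>();

    static {
        BY_MIME_TYPE.put("application/json", PKG + "GraphSONMessageSerializerV1d0");
        BY_MIME_TYPE.put("application/vnd.gremlin-v1.0+json", PKG + "GraphSONMessageSerializerGremlinV1d0");
        BY_MIME_TYPE.put("application/vnd.gremlin-v1.0+gryo", PKG + "GryoMessageSerializerGremlinV1d0");
    }

    /** Returns the serializer class name for a MIME type, or empty if none is listed. */
    static Optional<String> serializerFor(String mimeType) {
        return Optional.ofNullable(BY_MIME_TYPE.get(mimeType));
    }
}
```

A driver author might keep a similar table to decide which MIME type to place in a request header for a given server configuration.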

Best Practices

The following sections define best practices for working with Gremlin Server.

Tuning

Tuning Gremlin Server for a particular environment may require some simple trial-and-error, but the following represent some basic guidelines that might be useful:

  • When configuring the size of threadPoolWorker start with the default of 1 and increment by one as needed to a maximum of 2*number of cores.

  • The "right" size of the gremlinPool setting is somewhat dependent on the type of scripts that will be processed by Gremlin Server. As requests arrive to Gremlin Server they are decoded and queued to be processed by threads in this pool. When this pool is exhausted of threads, Gremlin Server will continue to accept incoming requests, but the queue will continue to grow. If left to grow too large, the server will begin to slow. When tuning around this setting, consider whether the bulk of the scripts being processed will be "fast" or "slow", where "fast" generally means being measured in the low hundreds of milliseconds and "slow" means anything longer than that.

    • If the bulk of the scripts being processed are expected to be "fast", then a good starting point for this setting is 2*threadPoolWorker.

    • If the bulk of the scripts being processed are expected to be "slow", then a good starting point for this setting is 4*threadPoolWorker.

  • Scripts that are "slow" can really hurt Gremlin Server if they are not properly accounted for. ScriptEngine evaluations are blocking operations that aren’t easily interrupted, so once a "slow" script is being evaluated in the context of a ScriptEngine it must finish its work. Lots of "slow" scripts will eventually consume the gremlinPool preventing other scripts from getting processed from the queue.

    • To limit the impact of this problem consider properly setting the scriptEvaluationTimeout and the serializedResponseTimeout to something "sane".

    • Test the traversals being sent to Gremlin Server and determine the maximum time they take to evaluate and iterate over results, then set these configurations accordingly.

    • Note that scriptEvaluationTimeout does not interrupt the evaluation on timeout. It merely allows Gremlin Server to "ignore" the result of that evaluation, which means the thread in the gremlinPool will still be consumed after the timeout.

    • The serializedResponseTimeout will kill the result iteration process and prevent additional processing. In most situations, the iteration and serialization process is the more costly step in this process as an errant script that returns a million or more results could send Gremlin Server into a long streaming cycle. Script evaluation on the other hand is usually very fast, occurring on the order of milliseconds, but that is entirely dependent on the contents of the script itself.
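
The point that a timeout lets the caller stop waiting without freeing the worker thread can be demonstrated with plain java.util.concurrent classes. The following is a sketch of that general behavior only. The one-thread pool stands in for the gremlinPool and the latch-blocked task stands in for a "slow" script. The class name is hypothetical, and whether Gremlin Server is implemented this way internally is an assumption.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration: timing out on a result does not stop the work producing it. */
final class EvaluationTimeoutSketch {

    /** Returns true if the worker was still busy after the caller gave up waiting. */
    static boolean workerStillBusyAfterTimeout(long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor(); // stand-in for gremlinPool
        CountDownLatch finishScript = new CountDownLatch(1);
        try {
            Future<String> evaluation = pool.submit(() -> {
                finishScript.await();              // the "slow" script keeps running
                return "result nobody reads";
            });
            evaluation.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false;                          // finished in time; not expected here
        } catch (TimeoutException e) {
            // The caller stops waiting, but the task was neither cancelled nor interrupted,
            // so the single worker thread is still occupied by it.
            return true;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            finishScript.countDown();              // let the stand-in script finish
            pool.shutdown();
        }
    }
}
```

Calling `workerStillBusyAfterTimeout(50)` returns true: the wait ends after 50 milliseconds, yet the pool's only thread remains consumed until the task itself completes.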

Parameterized Scripts

Use script parameterization. Period. Gremlin Server caches all scripts that are passed to it. The cache is keyed based on a hash of the script. Therefore g.V(1) and g.V(2) will be recognized as two separate scripts in the cache. If that script is parameterized to g.V(x) where x is passed as a parameter from the client, there will be no additional compilation cost for future requests on that script. Compilation of a script should be considered "expensive" and avoided when possible.
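
The effect of parameterization on a cache keyed by script text can be sketched as follows. This is an illustrative model, not the Gremlin Server cache. `ScriptCacheSketch` is a hypothetical class, and the "compile" step is a stand-in that simply counts how often it runs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of a script cache keyed by the script text. */
final class ScriptCacheSketch {
    private final Map<String, String> compiled = new ConcurrentHashMap<>();
    private final AtomicInteger compilations = new AtomicInteger();

    /** Looks the script text up in the cache and "compiles" it only on a miss. */
    String evaluate(String script, Map<String, Object> bindings) {
        String program = compiled.computeIfAbsent(script, s -> {
            compilations.incrementAndGet();   // stands in for the expensive compile step
            return "compiled:" + s;
        });
        return program + " with " + bindings;
    }

    /** Number of times the stand-in compile step has run. */
    int compilationCount() { return compilations.get(); }
}
```

Evaluating "g.V(1)" and then "g.V(2)" against this model compiles twice, whereas evaluating "g.V(x)" twice with different values bound to x compiles once.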

Cache Management

If Gremlin Server processes a large number of unique scripts, the cache will grow beyond the memory available to Gremlin Server and an OutOfMemoryError will loom. Script parameterization goes a long way to solving this problem and running out of memory should not be an issue for those cases. If it is a problem or if there is no script parameterization due to a given use case (perhaps with the use of sessions), it is possible to better control the nature of the script cache from the client side, by issuing scripts with a parameter to help define how the garbage collector should treat the references.

The parameter is called #jsr223.groovy.engine.keep.globals and has four options:

  • hard - available in the cache for the life of the JVM (default when not specified).

  • soft - retained until memory is "low" and should be reclaimed before an OutOfMemoryError is thrown.

  • weak - garbage collected even when memory is abundant.

  • phantom - removed immediately after being evaluated by the ScriptEngine.

By specifying an option other than hard, an OutOfMemoryError in Gremlin Server should be avoided.
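
A client might attach the cache hint to its script parameters along these lines. This sketch assumes the parameter travels in the same map as the other script parameters (the bindings); that placement is an assumption rather than something stated above. The class `CacheHintSketch` and its method are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that adds the script cache reference hint to a set of script parameters. */
final class CacheHintSketch {
    static final String KEEP_GLOBALS = "#jsr223.groovy.engine.keep.globals";

    /** Copies the caller's parameters and adds the hint; only the four documented options are accepted. */
    static Map<String, Object> withCacheHint(Map<String, Object> params, String referenceType) {
        switch (referenceType) {
            case "hard":
            case "soft":
            case "weak":
            case "phantom":
                break;
            default:
                throw new IllegalArgumentException("Unknown reference type: " + referenceType);
        }
        Map<String, Object> copy = new LinkedHashMap<>(params); // leave the caller's map untouched
        copy.put(KEEP_GLOBALS, referenceType);
        return copy;
    }
}
```

For example, `withCacheHint(params, "soft")` returns a copy of `params` with the hint added, and an unrecognized option such as "strong" is rejected before anything is sent.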

Considering Sessions

The preferred approach for issuing requests to Gremlin Server is to do so in a sessionless manner. The concept of "sessionless" refers to a request that is completely encapsulated within a single transaction, such that the script in the request starts with a new transaction and ends with a closed transaction. Sessionless requests have automatic transaction management handled by Gremlin Server, thus automatically opening and closing transactions as previously described. The downside to the sessionless approach is that the entire script to be executed must be known at the time of submission so that it can all be executed at once. This requirement makes it difficult for some use cases where more control over the transaction is desired.

For such use cases, Gremlin Server supports sessions. With sessions, the user is in complete control of the start and end of the transaction. This feature comes with some additional expense to consider:

  • Initialization scripts will be executed for each session created so any expense related to them will be established each time a session is constructed.

  • There will be one script cache per session, which obviously increases memory requirements. The cache is not shared, so as to ensure that a session has isolation from other session environments. As a result, if the same script is executed in each session the same compilation cost will be paid for each session it is executed in.

  • Each session will require its own thread pool with a single thread in it - this ensures that transactional boundaries are managed properly from one request to the next.

  • If there are multiple Gremlin Server instances, communication from the client to the server must be bound to the server that the session was initialized in. Gremlin Server does not share session state as the transactional context of a Graph is bound to the thread it was initialized in.

A session is a "heavier" approach to the simple "request/response" approach of sessionless requests, but is sometimes necessary for a given use case.

Developing a Driver

gremlin server protocol

One of the roles for Gremlin Server is to provide a bridge from TinkerPop to non-JVM languages (e.g. Go, Python, etc.). Developers can build language bindings (or drivers) that provide a way to submit Gremlin scripts to Gremlin Server and get back results. Given the extensible nature of Gremlin Server, it is difficult to provide an authoritative guide to developing a driver. It is however possible to describe the core communication protocol using the standard out-of-the-box configuration which should provide enough information to develop a driver for a specific language.

gremlin server flow

Gremlin Server is distributed with a configuration that utilizes WebSockets with a custom sub-protocol. Under this configuration, Gremlin Server accepts requests containing a Gremlin script, evaluates that script and then streams back the results. The notion of "streaming" is depicted in the diagram to the right.

The diagram shows an incoming request to process the Gremlin script of g.V. Gremlin Server evaluates that script, getting an Iterator of vertices as a result, and steps through each Vertex within it. The vertices are batched together given the resultIterationBatchSize configuration. In this case, that value must be 2 given that each "response" contains two vertices. Each response is serialized given the requested serializer type (JSON is likely best for non-JVM languages) and written back to the requesting client immediately. Gremlin Server does not wait for the entire result to be iterated, before sending back a response. It will send the responses as they are realized.
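
The batching of an iterated result can be sketched in a few lines. Unlike the server, which writes each batch out as soon as it is full, this illustration collects the batches into a list so they can be inspected. The class `ResultBatchingSketch` is hypothetical and is not part of Gremlin Server.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration of grouping an iterated result into fixed-size batches. */
final class ResultBatchingSketch {

    /** Splits the results into batches of batchSize; the last batch may be shorter. */
    static <T> List<List<T>> batch(Iterator<T> results, int batchSize) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
        List<List<T>> responses = new ArrayList<>();
        List<T> current = new ArrayList<>(batchSize);
        while (results.hasNext()) {
            current.add(results.next());
            if (current.size() == batchSize) {
                responses.add(current);               // a streaming server would send this batch now
                current = new ArrayList<>(batchSize);
            }
        }
        if (!current.isEmpty()) responses.add(current); // final, possibly short, batch
        return responses;
    }
}
```

Ten results with a batch size of 2 yield five batches of two, and a batch size of 1 yields ten batches of one. This matches the description of resultIterationBatchSize in the settings table.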

This approach allows for the processing of large result sets without having to serialize the entire result into memory for the response. It places a bit of a burden on the developer of the driver however, because it becomes necessary to provide a way to reconstruct the entire result on the client side from all of the individual responses that Gremlin Server returns for a single request. Again, this description of Gremlin Server’s "flow" is related to the out-of-the-box configuration. It is quite possible to construct other flows, that might be more amenable to a particular language or style of processing.

To formulate a request to Gremlin Server, a RequestMessage needs to be constructed. The RequestMessage is a generalized representation of a request that carries a set of "standard" values in addition to optional ones that are dependent on the operation being performed. A RequestMessage has these fields:

Key Description

requestId

A UUID representing the unique identification for the request.

op

The name of the "operation" to execute based on the available OpProcessor configured in the Gremlin Server. To evaluate a script, use eval.

processor

The name of the OpProcessor to utilize. The default OpProcessor for evaluating scripts is unnamed and therefore, for script evaluation purposes, this value can be an empty string.

args

A Map of arbitrary parameters to pass to Gremlin Server. The requirements for the contents of this Map are dependent on the op selected.

This message can be serialized in any fashion that is supported by Gremlin Server. New serialization methods can be plugged in by implementing a ServiceLoader enabled MessageSerializer, however Gremlin Server provides for JSON serialization by default which will be good enough for purposes of most developers building drivers. A RequestMessage to evaluate a script with variable bindings looks like this in JSON:

{ "requestId":"1d6d02bd-8e56-421d-9438-3bd6d0079ff1",
  "op":"eval",
  "processor":"",
  "args":{"gremlin":"g.traversal().V(x).out()",
          "bindings":{"x":1},
          "language":"gremlin-groovy"}}

The above JSON represents the "body" of the request to send to Gremlin Server. When sending this "body" over websockets Gremlin Server can accept a packet frame using a "text" (1) or a "binary" (2) opcode. Using "text" is a bit more limited in that Gremlin Server will always process the body of that request as JSON. Generally speaking "text" is just for testing purposes.

The preferred method for sending requests to Gremlin Server is to use the "binary" opcode. In this case, a "header" will need to be sent in addition to the "body". The "header" basically consists of a "mime type" so that Gremlin Server knows how to deserialize the RequestMessage. So, the actual byte array sent to Gremlin Server would be formatted as follows:

gremlin server request

The first byte represents the length of the "mime type" string value that follows. Given the default configuration of Gremlin Server, this value should be set to application/json. The "payload" represents the JSON message above encoded as bytes.
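
The byte layout just described (one length byte, the "mime type" string, then the payload) could be assembled as in the sketch below. The use of UTF-8 for both strings and the cap of 127 on the "mime type" length are conservative assumptions made for this illustration. The class `BinaryFrameSketch` is hypothetical, and the result is only the WebSocket message body, not a complete WebSocket frame.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical builder for the message body: [mime-type length][mime-type bytes][payload bytes]. */
final class BinaryFrameSketch {

    static byte[] frame(String mimeType, String jsonBody) {
        byte[] mime = mimeType.getBytes(StandardCharsets.UTF_8);   // encoding is an assumption
        if (mime.length > 127) {
            // Conservative limit so the length always fits a signed Java byte.
            throw new IllegalArgumentException("mime type too long for a one-byte length");
        }
        byte[] payload = jsonBody.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[1 + mime.length + payload.length];
        out[0] = (byte) mime.length;                                // first byte: length of the mime type
        System.arraycopy(mime, 0, out, 1, mime.length);             // then the mime type itself
        System.arraycopy(payload, 0, out, 1 + mime.length, payload.length); // then the payload
        return out;
    }
}
```

For the "mime type" application/json the first byte is 16, the length of that string. The WebSocket library in use would then send the returned array as a binary message and take care of masking.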

Note
Gremlin Server will only accept masked packets as it pertains to websocket packet header construction.

When Gremlin Server receives that request, it will decode it given the "mime type", pass it to the requested OpProcessor which will execute the op defined in the message. In this case, it will evaluate the script g.traversal().V(x).out() using the bindings supplied in the args and stream back the results in a series of ResponseMessages. A ResponseMessage looks like this:

Key Description

requestId

The identifier of the RequestMessage that generated this ResponseMessage.

status

The status contains a Map of three keys: code which refers to a ResultCode that is somewhat analogous to an HTTP status code, attributes that represent a Map of protocol-level information, and message which is just a human-readable String usually associated with errors.

result

The result contains a Map of two keys: data which refers to the actual data returned from the server (the type of data is determined by the operation requested) and meta which is a Map of meta-data related to the response.

In this case the ResponseMessage returned to the client would look something like this:

{"result":{"data":[{"id": 2,"label": "person","type": "vertex","properties": [
  {"id": 2, "value": "vadas", "label": "name"},
  {"id": 3, "value": 27, "label": "age"}]}
  ], "meta":{}},
 "requestId":"1d6d02bd-8e56-421d-9438-3bd6d0079ff1",
 "status":{"code":206,"attributes":{},"message":""}}

Gremlin Server is capable of streaming results such that additional responses will arrive over the websocket until the iteration of the result on the server is complete. Each successful incremental message will have a ResultCode of 206. Termination of the stream will be marked by a final 200 status code. Note that every message with a status code other than 206 represents a terminating condition for a request. The following table details the various status codes that Gremlin Server will send:

Code Name Description

200

SUCCESS

The server successfully processed a request to completion - there are no messages remaining in this stream.

204

NO CONTENT

The server processed the request but there is no result to return (e.g. an Iterator with no elements).

206

PARTIAL CONTENT

The server successfully returned some content, but there is more in the stream to arrive - wait for a SUCCESS to signify the end of the stream.

498

MALFORMED REQUEST

The request message was not properly formatted which means it could not be parsed at all or the "op" code was not recognized such that Gremlin Server could properly route it for processing. Check the message format and retry the request.

499

INVALID REQUEST ARGUMENTS

The request message was parseable, but the arguments supplied in the message were in conflict or incomplete. Check the message format and retry the request.

500

SERVER ERROR

A general server error occurred that prevented the request from being processed.

597

SCRIPT EVALUATION ERROR

The script submitted for processing evaluated in the ScriptEngine with errors and could not be processed. Check the script submitted for syntax errors or other problems and then resubmit.

598

SERVER TIMEOUT

The server exceeded one of the timeout settings for the request and could therefore only partially respond or not respond at all.

599

SERVER SERIALIZATION ERROR

The server was not capable of serializing an object that was returned from the script supplied on the request. Either transform the object into something Gremlin Server can process within the script or install mapper serialization classes to Gremlin Server.

SUCCESS and NO CONTENT messages are terminating messages that indicate that a request was properly handled on the server and that there are no additional messages streaming in for that request. When developing a driver, it is important to note the slight differences in semantics for these result codes when it comes to sessionless versus in-session requests. For a sessionless request, which operates under automatic transaction management, Gremlin Server will only send one of these message types after result iteration and transaction commit(). In other words, the driver could potentially expect to receive a number of "successful" PARTIAL CONTENT messages before ultimately ending in failure on commit(). For in-session requests, the client is responsible for managing the transaction and therefore, a first request could receive multiple "success" related messages, only to fail on a future request that finally issues the commit().
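
On the client side, the status codes above suggest a small state machine for reassembling one request's stream. The following is a sketch under stated assumptions. The class `ResponseAssemblerSketch` is hypothetical, and it takes the status code and data as already-parsed arguments. Because the text above does not say whether the final 200 message carries data, any data present on it is kept.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical client-side accumulator for the responses belonging to a single request. */
final class ResponseAssemblerSketch {
    static final int SUCCESS = 200, NO_CONTENT = 204, PARTIAL_CONTENT = 206;

    private final List<Object> items = new ArrayList<>();
    private boolean complete = false;
    private int finalCode = -1;

    /** Any status code other than 206 ends the stream for the request. */
    static boolean isTerminating(int code) { return code != PARTIAL_CONTENT; }

    /** Feeds one response (its status code and result data) into the assembler. */
    void onResponse(int code, List<?> data) {
        if (complete) throw new IllegalStateException("stream already terminated with " + finalCode);
        if ((code == PARTIAL_CONTENT || code == SUCCESS) && data != null) items.addAll(data);
        if (isTerminating(code)) {
            complete = true;
            finalCode = code;
        }
    }

    boolean isComplete() { return complete; }

    /** True only when the stream ended with SUCCESS or NO CONTENT. */
    boolean isSuccess() { return complete && (finalCode == SUCCESS || finalCode == NO_CONTENT); }

    List<Object> items() { return items; }
}
```

A driver would typically keep one such accumulator per requestId and hand the collected items to the caller only once `isSuccess()` is true. For sessionless requests this also covers a failure on commit() that arrives after several 206 messages.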

OpProcessors Arguments

The following sections define a non-exhaustive list of available operations and arguments for embedded OpProcessors (i.e. ones packaged with Gremlin Server).

Common

All OpProcessor instances support these arguments.

Key Type Description

batchSize

Int

When the result is an iterator this value defines the number of iterations each ResponseMessage should contain - overrides the resultIterationBatchSize server setting.

Standard OpProcessor

The "standard" OpProcessor handles requests for the primary function of Gremlin Server - executing Gremlin. Requests made to this OpProcessor are "sessionless" in the sense that a request must encapsulate the entirety of a transaction. There is no state maintained between requests. A transaction is started when the script is first evaluated and is committed when the script completes (or rolled back if an error occurred).

Key Description

processor

As this is the default OpProcessor this value can be set to an empty string

op

Key Description

eval

evaluate a Gremlin script provided as a String

eval operation arguments

Key Type Description

gremlin

String

Required The Gremlin script to evaluate

bindings

Map

A map of key/value pairs to apply as variables in the context of the Gremlin script

language

String

The flavor used (e.g. gremlin-groovy)

rebindings

Map

A map of key/value pairs that allow globally bound Graph and TraversalSource objects to be rebound to different variable names for purposes of the current request. The value represents the name of the global variable and its key represents the new binding name as it will be referenced in the Gremlin query. For example, if the Gremlin Server defines two TraversalSource instances named g1 and g2, it would be possible to send a rebinding pair with key of "g" and value of "g2" and thus allow the script to refer to "g2" simply as "g".
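
The direction of the rebindings map (key is the new name, value is the existing global name) is easy to get backwards, so a small model may help. This sketch is illustrative only. `RebindingSketch` is a hypothetical class, and rejecting an unknown global name is an assumption made here, not documented server behavior.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of applying request-level rebindings to the global bindings. */
final class RebindingSketch {

    /** Returns request-scoped bindings: all globals, plus each rebinding key pointing at the named global. */
    static Map<String, Object> apply(Map<String, Object> globals, Map<String, String> rebindings) {
        Map<String, Object> scoped = new HashMap<>(globals);
        for (Map.Entry<String, String> e : rebindings.entrySet()) {
            String newName = e.getKey();       // name the script will use, e.g. "g"
            String globalName = e.getValue();  // existing global variable, e.g. "g2"
            if (!globals.containsKey(globalName)) {
                throw new IllegalArgumentException("No global binding named " + globalName);
            }
            scoped.put(newName, globals.get(globalName));
        }
        return scoped;
    }
}
```

With globals g1 and g2 and a rebinding of "g" to "g2", the scoped bindings contain g1, g2 and g, where g refers to the same object as g2.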

Session OpProcessor

The "session" OpProcessor handles requests for the primary function of Gremlin Server - executing Gremlin. It is like the "standard" OpProcessor, but instead maintains state between requests and leaves all transaction management up to the calling client. It is important that clients that open sessions commit or roll back their transactions; however, Gremlin Server will try to clean up such things when an abandoned session is killed. It is important to consider that a session can only be maintained with a single machine. In the event that multiple Gremlin Server instances are deployed, session state is not shared among them.

Key Description

processor

This value should be set to session

op

Key Description

eval

evaluate a Gremlin script

eval operation arguments

Key Type Description

gremlin

String

Required The Gremlin script to evaluate

session

String

Required The session identifier for the current session - typically this value should be a UUID (the session will be created if it doesn’t exist)

bindings

Map

A map of key/value pairs to apply as variables in the context of the Gremlin script

language

String

The flavor used (e.g. gremlin-groovy)

Gremlin Plugins

gremlin-plugin

Plugins provide a way to expand the features of Gremlin Console and Gremlin Server. The first step to developing a plugin is to implement the GremlinPlugin interface:


package org.apache.tinkerpop.gremlin.groovy.plugin;

import java.util.Optional;

/**
 * Those wanting to extend Gremlin can implement this interface to provide mapper imports and extension
 * methods to the language itself.  Gremlin uses ServiceLoader to install plugins.  It is necessary for
 * projects to include a org.apache.tinkerpop.gremlin.groovy.plugin.GremlinPlugin file in META-INF/services of their
 * packaged project which includes the full class names of the implementations of this interface to install.
 *
 * @author Stephen Mallette (http://stephen.genoprime.com)
 */
public interface GremlinPlugin {
    public static final String ENVIRONMENT = "GremlinPlugin.env";

    /**
     * The name of the plugin.  This name should be unique (use a namespaced approach) as naming clashes will
     * prevent proper plugin operations. Plugins developed by TinkerPop will be prefixed with "tinkerpop."
     * For example, TinkerPop's implementation of Giraph would be named "tinkerpop.giraph".  If Facebook were
     * to do their own implementation the implementation might be called "facebook.giraph".
     */
    public String getName();

    /**
     * Implementers will typically execute imports of classes within their project that they want available in the
     * console or they may use meta programming to introduce new extensions to the Gremlin.
     *
     * @throws IllegalEnvironmentException   if there are missing environment properties required by the plugin as
     *                                       provided from {@link PluginAcceptor#environment()}.
     * @throws PluginInitializationException if there is a failure in the plugin initialization process
     */
    public void pluginTo(final PluginAcceptor pluginAcceptor) throws IllegalEnvironmentException, PluginInitializationException;

    /**
     * Some plugins may require a restart of the plugin host for the classloader to pick up the features.  This is
     * typically true of plugins that rely on {@code Class.forName()} to dynamically instantiate classes from the
     * root classloader (e.g. JDBC drivers that instantiate via {@code DriverManager}).
     */
    public default boolean requireRestart() {
        return false;
    }

    /**
     * Allows a plugin to utilize features of the {@code :remote} and {@code :submit} commands of the Gremlin Console.
     * This method does not need to be implemented if the plugin is not meant for the Console for some reason or
     * if it does not intend to take advantage of those commands.
     */
    public default Optional<RemoteAcceptor> remoteAcceptor() {
        return Optional.empty();
    }
}

The simplest plugin and the one most commonly implemented will likely be one that just provides a list of classes to import to the Gremlin Console. This type of plugin is the easiest way for implementers of the TinkerPop Structure and Process APIs to make their implementations available to users. The TinkerGraph implementation has just such a plugin:


package org.apache.tinkerpop.gremlin.tinkergraph.groovy.plugin;

import org.apache.tinkerpop.gremlin.groovy.plugin.AbstractGremlinPlugin;
import org.apache.tinkerpop.gremlin.groovy.plugin.IllegalEnvironmentException;
import org.apache.tinkerpop.gremlin.groovy.plugin.PluginAcceptor;
import org.apache.tinkerpop.gremlin.groovy.plugin.PluginInitializationException;
import org.apache.tinkerpop.gremlin.tinkergraph.process.computer.TinkerGraphComputer;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

import java.util.HashSet;
import java.util.Set;

/**
 * @author Stephen Mallette (http://stephen.genoprime.com)
 */
public final class TinkerGraphGremlinPlugin extends AbstractGremlinPlugin {


    private static final Set<String> IMPORTS = new HashSet<String>() {{
        add(IMPORT_SPACE + TinkerGraph.class.getPackage().getName() + DOT_STAR);
        add(IMPORT_SPACE + TinkerGraphComputer.class.getPackage().getName() + DOT_STAR);
    }};

    @Override
    public String getName() {
        return "tinkerpop.tinkergraph";
    }

    @Override
    public void pluginTo(final PluginAcceptor pluginAcceptor) throws PluginInitializationException, IllegalEnvironmentException {
        pluginAcceptor.addImports(IMPORTS);
    }

    @Override
    public void afterPluginTo(final PluginAcceptor pluginAcceptor) throws IllegalEnvironmentException, PluginInitializationException {

    }
}

Note that the plugin provides a unique name for the plugin which follows a namespaced pattern as namespace.plugin-name (e.g. "tinkerpop.hadoop" - "tinkerpop" is the reserved namespace for TinkerPop maintained plugins). To make TinkerGraph classes available to the Console, the PluginAcceptor is given a Set of imports to provide to the plugin host. The PluginAcceptor essentially behaves as an abstraction to the "host" that is handling the GremlinPlugin. GremlinPlugin implementations may be hosted by the Console as well as the ScriptEngine in Gremlin Server. Obviously, registering new commands and other operations that are specific to the Groovy Shell don’t make sense there. Write the code for the plugin defensively by checking the GremlinPlugin.env key in the PluginAcceptor.environment() to understand which environment the plugin is being used in.

There is one other step to follow to ensure that the GremlinPlugin is visible to its hosts. GremlinPlugin implementations are loaded via ServiceLoader and therefore need a resource file added to the jar file where the plugin exists. Add a file called org.apache.tinkerpop.gremlin.groovy.plugin.GremlinPlugin to META-INF/services. In the case of the TinkerGraph plugin above, that file will have this line in it:

org.apache.tinkerpop.gremlin.tinkergraph.groovy.plugin.TinkerGraphGremlinPlugin

Once the plugin is packaged, there are two ways to test it out:

  1. Copy the jar and its dependencies to the Gremlin Console path and start it.

  2. Start Gremlin Console and try the :install command: :install com.company my-plugin 1.0.0.

In either case, once one of these two approaches is taken, the jars and their dependencies are available to the Console. The next step is to "activate" the plugin by doing :plugin use my-plugin, where "my-plugin" refers to the name of the plugin to activate.

Note
When :install is used logging dependencies related to SLF4J are filtered out so as not to introduce multiple logger bindings (which generates warning messages to the logs).

A plugin can do much more than just import classes. One can expand the Gremlin language with new functions or steps, provide useful commands to make repetitive or complex tasks easier to execute, or do helpful integrations with other systems. The secret to doing so lies in the PluginAcceptor. As mentioned earlier, the PluginAcceptor provides access to the host of the plugin. It provides several important methods for doing so:

  1. addBinding - Allows the plugin to inject whatever context it wants into the host. For example, calling addBinding('x',1) would place a variable x with a value of 1 into the console at the time the plugin is loaded.

  2. eval - Evaluates a script in the context of the host at the time of plugin startup. For example, doing eval("sum={x,y->x+y}") would create a sum function that would be available to the user of the Console after the load of the plugin.

  3. environment - Provides context from the host environment. For the console, the environment will return a Map containing a reference to the IO stream and the Groovysh instance. These classes represent very low-level access to the underpinnings of the console. Access to Groovysh allows for advanced features such as registering new commands (e.g. the :plugin or :remote commands).

Plugins can also tie into the :remote and :submit commands. Recall that a :remote represents a different context within which Gremlin is executed, when issued with :submit. It is encouraged to use this integration point when possible, as opposed to registering new commands that can otherwise follow the :remote and :submit pattern. To expose this integration point as part of a plugin, implement the RemoteAcceptor interface:

Tip
Be good to the users of plugins and prevent dependency conflicts. Maintaining a conflict-free plugin is most easily done by using the Maven Enforcer Plugin.
Tip
Consider binding the plugin’s minor version to the TinkerPop minor version so that it’s easy for users to figure out plugin compatibility. Otherwise, clearly document a compatibility matrix for the plugin somewhere that users can find it.

package org.apache.tinkerpop.gremlin.groovy.plugin;

import org.codehaus.groovy.tools.shell.Groovysh;

import java.io.Closeable;
import java.util.List;

/**
 * @author Stephen Mallette (http://stephen.genoprime.com)
 */
public interface RemoteAcceptor extends Closeable {

    public static final String RESULT = "result";

    /**
     * Gets called when :remote is used in conjunction with the "connect" option.  It is up to the implementation
     * to decide how additional arguments on the line should be treated after "connect".
     *
     * @return an object to display as output to the user
     * @throws org.apache.tinkerpop.gremlin.groovy.plugin.RemoteException if there is a problem with connecting
     */
    public Object connect(final List<String> args) throws RemoteException;

    /**
     * Gets called when :remote is used in conjunction with the "config" option.  It is up to the implementation
     * to decide how additional arguments on the line should be treated after "config".
     *
     * @return an object to display as output to the user
     * @throws org.apache.tinkerpop.gremlin.groovy.plugin.RemoteException if there is a problem with configuration
     */
    public Object configure(final List<String> args) throws RemoteException;

    /**
     * Gets called when :submit is executed.  It is up to the implementation to decide how additional arguments on
     * the line should be treated after "submit".
     *
     * @return an object to display as output to the user
     * @throws org.apache.tinkerpop.gremlin.groovy.plugin.RemoteException if there is a problem with submission
     */
    public Object submit(final List<String> args) throws RemoteException;

    /**
     * Retrieve a script as defined in the shell context.  This allows for multi-line scripts to be submitted.
     */
    public static String getScript(final String submittedScript, final Groovysh shell) {
        return submittedScript.startsWith("@") ? shell.getInterp().getContext().getProperty(submittedScript.substring(1)).toString() : submittedScript;
    }
}

The RemoteAcceptor implementation ties to a GremlinPlugin and will only be executed when in use with the Gremlin Console plugin host. Simply instantiate and return a RemoteAcceptor in the GremlinPlugin.remoteAcceptor() method of the plugin implementation. Generally speaking, each call to remoteAcceptor() should produce a new RemoteAcceptor instance. It will likely be necessary to pass context from the GremlinPlugin to the RemoteAcceptor. For example, the RemoteAcceptor implementation might require an instance of Groovysh so that it can dynamically evaluate a script provided to it and process the results in a different way.
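To make the shape of an implementation more concrete, here is a minimal sketch. It is not written against the real interface: SketchRemoteAcceptor is a simplified stand-in that drops RemoteException, Closeable, and Groovysh, and EchoRemoteAcceptor is a hypothetical acceptor that merely echoes what it is given. The shell context is modeled as a plain Map of variables.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for RemoteAcceptor (assumption: no exceptions, no close()). */
interface SketchRemoteAcceptor {
    Object connect(List<String> args);
    Object configure(List<String> args);
    Object submit(List<String> args);
}

/** Hypothetical acceptor that "sends" scripts to a named target by echoing them back. */
final class EchoRemoteAcceptor implements SketchRemoteAcceptor {
    private final Map<String, Object> shellVariables; // stands in for the shell context
    private final Map<String, String> settings = new HashMap<>();
    private String target = "unconnected";

    EchoRemoteAcceptor(final Map<String, Object> shellVariables) {
        this.shellVariables = shellVariables;
    }

    @Override
    public Object connect(final List<String> args) {
        if (args.isEmpty()) throw new IllegalArgumentException("expected a target after 'connect'");
        target = args.get(0);
        return "Connected to " + target;
    }

    @Override
    public Object configure(final List<String> args) {
        if (args.size() != 2) throw new IllegalArgumentException("expected: config <key> <value>");
        settings.put(args.get(0), args.get(1));
        return settings.toString();
    }

    @Override
    public Object submit(final List<String> args) {
        final String line = String.join(" ", args);
        // Same convention as getScript(): "@name" refers to a variable that holds the script.
        final String script = line.startsWith("@")
                ? String.valueOf(shellVariables.get(line.substring(1)))
                : line;
        return target + " <- " + script;
    }
}
```

Each method returns the object that would be displayed to the user, which mirrors the contract described in the interface's Javadoc. A real implementation would report failures through RemoteException rather than IllegalArgumentException.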

Gephi Plugin

gephi-logo Gephi is an interactive visualization, exploration, and analysis platform for graphs. The Graph Streaming plugin for Gephi provides an API that can be leveraged to stream graphs and visualize traversals interactively through the Gremlin Gephi Plugin.

The following instructions assume that Gephi has been downloaded and installed, and that the Graph Streaming plugin has been installed (Tools > Plugins). They explain how to visualize a Graph and a Traversal.

In Gephi, create a new project with File > New Project. In the lower left view, click the "Streaming" tab, open the Master drop down, and right click Master Server > Start which starts the Graph Streaming server in Gephi and by default accepts requests at http://localhost:8080/workspace0:

gephi start server

Start the Gremlin Console and activate the Gephi plugin:

gremlin> :plugin use tinkerpop.gephi
==>tinkerpop.gephi activated
gremlin> graph = TinkerFactory.createModern()
==>tinkergraph[vertices:6 edges:6]
gremlin> :remote connect tinkerpop.gephi
==>Connection to Gephi - http://localhost:8080/workspace0 with stepDelay:1000, startRGBColor:[0.0, 1.0, 0.5], colorToFade:g, colorFadeRate:0.7
gremlin> :> graph
==>tinkergraph[vertices:6 edges:6]

The above Gremlin session activates the Gephi plugin, creates the "modern" TinkerGraph, uses the :remote command to set up a connection to the Graph Streaming server in Gephi (with default parameters that will be explained below), and then uses :submit, which sends the vertices and edges of the graph to the Gephi Streaming Server. The resulting graph appears in Gephi as displayed in the left image below.

gephi graph submit
Note
Issuing :> graph again will clear the Gephi workspace and then re-write the graph. To manually empty the workspace do :> clear.

Now that the graph is visualized in Gephi, it is possible to apply a layout algorithm, change the size and/or color of vertices and edges, and display labels/properties of interest. Further information can be found in Gephi’s tutorial on Visualization. After applying the Fruchterman Reingold layout, increasing the node size, decreasing the edge scale, and displaying the id, name, and weight attributes, the graph looks as displayed in the right image above.

Note
It’s recommended to choose a continuously running layout algorithm like Fruchterman Reingold or Force Atlas. Every update that colors the visited vertices causes their positions to be reset, and these layouts constantly adjust to account for such changes, which makes the traversals easier to follow. This also explains why the graph seems to rotate with each store step in the screenshots below.

Consider the following traversal:

g = graph.traversal()
g.V(2).in().out('knows').
      has('age',gt(30)).outE('created').
      has('weight',gt(0.5d)).inV()

To visualize it, insert the appropriately named store('n') steps, where n is an integer, and the vertices will be highlighted in ascending store step order.

gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V(2).in('knows').out('knows').has('age',gt(30)).
                outE('created').has('weight',gt(0.5d)).inV().values('name')
==>ripple
gremlin> traversal = g.V(2).store('1').
                in('knows').store('2').
                out('knows').has('age',gt(30)).store('3').
                outE('created').has('weight',gt(0.5d)).inV().store('4')
==>v[5]
gremlin> traversal.getSideEffects().get('1')
==>Optional[{v[2]=1}]
gremlin> traversal.getSideEffects().get('2')
==>Optional[{v[1]=1}]
gremlin> traversal.getSideEffects().get('3')
==>Optional[{v[4]=1}]
gremlin> traversal.getSideEffects().get('4')
==>Optional[{v[5]=1}]
gremlin> :> traversal
Visualizing vertices at step: 1... Visited: 1
Visualizing vertices at step: 2... Visited: 1
Visualizing vertices at step: 3... Visited: 1
Visualizing vertices at step: 4... Visited: 1

When :> traversal is called, it iterates through the sideEffects of the traversal, accessing the vertices stored at each corresponding step. It then updates the vertices' color with startRGBColor, which in this case is a lime-green blue: [0.0,1.0,0.5]. After the first step visualization, it sleeps for the configured stepDelay in milliseconds. On the second step, it decays the configured colorToFade of all the vertices visited in prior steps by multiplying the current colorToFade value of each vertex by the colorFadeRate. To avoid color decay on prior steps, provide a colorFadeRate value of 1.0. The screenshots below show how the visualization evolves over the 4 steps:

gephi traversal
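The fading rule can be worked through numerically. The following sketch is not the plugin's code; ColorFade is a hypothetical helper that applies the rule as described in the text: one channel of the vertex color is multiplied by colorFadeRate once for every step visualized after the vertex was highlighted.

```java
final class ColorFade {
    /**
     * Returns the color of a vertex that was highlighted with startRgb and has since
     * been faded stepsSinceVisit times. Only the channel named by colorToFade changes.
     */
    static float[] fade(final float[] startRgb, final char colorToFade,
                        final float colorFadeRate, final int stepsSinceVisit) {
        final int channel = "rgb".indexOf(Character.toLowerCase(colorToFade));
        if (channel < 0) throw new IllegalArgumentException("colorToFade must be one of r, g, b");
        final float[] rgb = startRgb.clone();
        for (int i = 0; i < stepsSinceVisit; i++) {
            rgb[channel] *= colorFadeRate;
        }
        return rgb;
    }
}
```

With the default settings (startRGBColor [0.0,1.0,0.5], colorToFade g, colorFadeRate 0.7), the vertex highlighted at step 1 has been faded three times by the time step 4 is shown, so its green channel is 0.7 × 0.7 × 0.7 = 0.343 while red and blue are unchanged.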

Once a traversal visualization has executed, clear the colors in Gephi by selecting the grey square icon under the magnifying glass icon on the lower left tool bar next to the graph canvas. Run another traversal against the same graph and it will update the appropriate vertices. To get a sense of how the visualization configuration parameters affect the output, see the example below:

gremlin> :remote config startRGBColor [0.0,0.3,1.0]
==>Connection to Gephi - http://localhost:8080/workspace0 with stepDelay:1000, startRGBColor:[0.0, 0.3, 1.0], colorToFade:g, colorFadeRate:0.7
gremlin> :remote config colorToFade b
==>Connection to Gephi - http://localhost:8080/workspace0 with stepDelay:1000, startRGBColor:[0.0, 0.3, 1.0], colorToFade:b, colorFadeRate:0.7
gremlin> :remote config colorFadeRate 0.5
==>Connection to Gephi - http://localhost:8080/workspace0 with stepDelay:1000, startRGBColor:[0.0, 0.3, 1.0], colorToFade:b, colorFadeRate:0.5
gremlin> :> traversal
Visualizing vertices at step: 1... Visited: 1
Visualizing vertices at step: 2... Visited: 1
Visualizing vertices at step: 3... Visited: 1
Visualizing vertices at step: 4... Visited: 1
gephi traversal config

The visualization configuration above now starts with a blue color (for the most recently visited vertices), fades the blue channel (so that dark green remains on the oldest visited vertices), and fades it more quickly so that the gradient from dark green to blue across steps has higher contrast. Here is a more detailed description of the Gephi plugin configuration parameters, in the order accepted on the :remote connect tinkerpop.gephi command or modified via the :remote config command:

Parameter | Description | Default
workspace | The name of the workspace that the Graph Streaming server is started for. | workspace0
host | The host where the Graph Streaming server is running. | localhost
port | The port that the Graph Streaming server is listening on. | 8080
stepDelay | The time in milliseconds to pause between step visualizations. | 1000
startRGBColor | A float array of size 3 holding the RGB values of the starting color applied to the most recently visited vertices. | [0.0,1.0,0.5]
colorToFade | A single character from the set {r,g,b,R,G,B} that determines which color channel to fade for vertices visited in prior steps. | g
colorFadeRate | A float in the range (0.0,1.0] that is multiplied against the current colorToFade value of previously visited vertices; a value of 1.0 effectively turns off color fading. | 0.7
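As a rough illustration of how these parameters fit together, the sketch below holds the defaults, derives the server URL that appears in the connection message, and checks the documented value ranges. GephiSettings is a hypothetical class written for this example and is not part of the plugin.

```java
final class GephiSettings {
    // Defaults taken from the parameter descriptions above.
    String host = "localhost";
    int port = 8080;
    String workspace = "workspace0";
    int stepDelay = 1000;
    float[] startRGBColor = {0.0f, 1.0f, 0.5f};
    char colorToFade = 'g';
    float colorFadeRate = 0.7f;

    /** Builds the Graph Streaming URL from host, port, and workspace. */
    String url() {
        return "http://" + host + ":" + port + "/" + workspace;
    }

    /** Rejects values outside the documented ranges. */
    void validate() {
        if (startRGBColor.length != 3)
            throw new IllegalArgumentException("startRGBColor needs exactly 3 values");
        if ("rgbRGB".indexOf(colorToFade) < 0)
            throw new IllegalArgumentException("colorToFade must be one of r, g, b, R, G, B");
        if (!(colorFadeRate > 0.0f && colorFadeRate <= 1.0f))
            throw new IllegalArgumentException("colorFadeRate must be in (0.0, 1.0]");
    }
}
```

With the defaults, url() yields http://localhost:8080/workspace0, which matches the address shown when the connection is established.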

Server Plugin

gremlin-server Gremlin Server remotely executes Gremlin scripts that are submitted to it. The Server Plugin provides a way to submit scripts to Gremlin Server for remote processing. Read more about the plugin and how it works in the Gremlin Server section on Connecting via Console.

Note
The Server Plugin is enabled in the Gremlin Console by default.

Sugar Plugin

gremlin-sugar In previous versions of Gremlin-Groovy, there were numerous syntactic sugars that users could rely on to make their traversals more succinct. Unfortunately, many of these conventions made use of Java reflection and thus were not performant. In TinkerPop3, these conveniences have been removed so that the standard Gremlin-Groovy syntax is both in line with Gremlin-Java8 syntax and always the most performant representation. However, for those users who would like to use the previous syntactic sugars (as well as new ones), there is SugarGremlinPlugin (a.k.a. Gremlin-Groovy-Sugar).

Important
It is important that the sugar plugin is loaded in a Gremlin Console session prior to any manipulations of the respective TinkerPop3 objects as Groovy will cache unavailable methods and properties.
gremlin> :plugin use tinkerpop.sugar
==>tinkerpop.sugar activated
Tip
When using Sugar in a Groovy class file, add static { SugarLoader.load() } to the head of the file. Note that SugarLoader.load() will automatically call GremlinLoader.load().

Graph Traversal Methods

If a GraphTraversal property is unknown and there is a corresponding method with said name off of GraphTraversal then the property is assumed to be a method call. This enables the user to omit ( ) from the method name. However, if the property does not reference a GraphTraversal method, then it is assumed to be a call to values(property).

gremlin> g.V //(1)
==>v[1]
==>v[2]
==>v[3]
==>v[4]
==>v[5]
==>v[6]
gremlin> g.V.name //(2)
==>marko
==>vadas
==>lop
==>josh
==>ripple
==>peter
gremlin> g.V.outE.weight //(3)
==>0.4
==>0.5
==>1.0
==>1.0
==>0.4
==>0.2
  1. There is no need for the parentheses in g.V().

  2. The traversal is interpreted as g.V().values('name').

  3. A chain of zero-argument step calls with a property value call.
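The resolution rule described above can be mimicked in a few lines. This is only a sketch of the decision, not of how the plugin implements it in Groovy: ToyTraversal is a hypothetical class with two zero-argument steps, and Java reflection stands in for Groovy's dynamic property handling.

```java
import java.lang.reflect.Method;

final class SugarResolution {
    /** Hypothetical traversal type with two zero-argument steps and values(). */
    static final class ToyTraversal {
        public String outE() { return "outE()"; }
        public String inE() { return "inE()"; }
        public String values(final String key) { return "values('" + key + "')"; }
    }

    /** Decides what an unknown property on ToyTraversal should be rewritten to. */
    static String resolve(final String property) {
        for (final Method m : ToyTraversal.class.getDeclaredMethods()) {
            // A zero-argument method with the same name wins: treat it as a call.
            if (m.getName().equals(property) && m.getParameterCount() == 0) {
                return property + "()";
            }
        }
        // Otherwise the property is taken to be a key for values().
        return "values('" + property + "')";
    }
}
```

Under this rule, resolve("outE") gives outE() and resolve("weight") gives values('weight'), which corresponds to how g.V.outE.weight is read in the third console example.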

Range Queries

The [x] and [x..y] range operators in Groovy translate to RangeStep calls.

gremlin> g.V[0..2]
==>v[1]
==>v[2]
gremlin> g.V[0..<2]
==>v[1]
gremlin> g.V[2]
==>v[3]

Logical Operators

The & and | operators are overloaded in SugarGremlinPlugin. When used, they introduce the AndStep and OrStep markers into the traversal. See and() and or() for more information.

gremlin> g.V.where(outE('knows') & outE('created')).name //(1)
==>marko
gremlin> t = g.V.where(outE('knows') | inE('created')).name; null //(2)
==>null
gremlin> t.toString()
==>[GraphStep([],vertex), TraversalFilterStep([VertexStep(OUT,[knows],edge), OrStep, VertexStep(IN,[created],edge)]), PropertiesStep([name],value)]
gremlin> t
==>marko
==>lop
==>ripple
gremlin> t.toString()
==>[TinkerGraphStep([],vertex), TraversalFilterStep([OrStep([[VertexStep(OUT,[knows],edge)], [VertexStep(IN,[created],edge)]])]), PropertiesStep([name],value)]
  1. Introducing the AndStep with the & operator.

  2. Introducing the OrStep with the | operator.

Traverser Methods

It is rare that a user will ever interact with a Traverser directly. However, if they do, some method redirects exist to make it easy.

gremlin> g.V().map{it.get().value('name')}  // conventional
==>marko
==>vadas
==>lop
==>josh
==>ripple
==>peter
gremlin> g.V.map{it.name}  // sugar
==>marko
==>vadas
==>lop
==>josh
==>ripple
==>peter

Utilities Plugin

The Utilities Plugin provides various functions, helper methods and imports of external classes that are useful in the console.

Note
The Utilities Plugin is enabled in the Gremlin Console by default.

Benchmarking and Profiling

The GPerfUtils library provides a number of performance utilities for Groovy. Specifically, these tools cover benchmarking and profiling.

Benchmarking allows execution time comparisons of different pieces of code. While such a feature is generally useful, in the context of Gremlin, benchmarking can help compare traversal performance times to determine the optimal approach. Profiling helps determine the parts of a program which are taking the most execution time, yielding low-level insight into the code being examined.

gremlin> :plugin use tinkerpop.sugar // Activate sugar plugin for use in benchmark
==>Specify the name of the plugin to use
gremlin> benchmark{
          'sugar' {g.V(1).name.next()}
          'nosugar' {g.V(1).values('name').next()}
gremlin> }.prettyPrint()
Environment
===========
* Groovy: 2.4.1
* JVM: Java HotSpot(TM) 64-Bit Server VM (25.40-b25, Oracle Corporation)
    * JRE: 1.8.0_40
    * Total Memory: 803 MB
    * Maximum Memory: 1820.5 MB
* OS: Mac OS X (10.8.5, x86_64)

Options
=======
* Warm Up: Auto (- 60 sec)
* CPU Time Measurement: On

          user  system    cpu   real

sugar    12636       2  12638  12638
nosugar   8222       2   8224   8224
==>null
gremlin> profile { g.V().iterate() }.prettyPrint()
Flat:

 %    cumulative   self            self     total    self    total   self    total
time   seconds    seconds  calls  ms/call  ms/call  min ms  min ms  max ms  max ms  name
52.1        0.00     0.00      1     0.61     1.18    0.61    1.18    0.61    1.18  groovysh_evaluate$_run_closure1.doCall
40.2        0.00     0.00      1     0.47     0.47    0.47    0.47    0.47    0.47  org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.DefaultGraphTraversal.iterate
 7.5        0.00     0.00      1     0.08     0.08    0.08    0.08    0.08    0.08  org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource.V

Call graph:

index  % time  self  children  calls  name
               0.00      0.00    1/1      <spontaneous>
[1]     100.0  0.00      0.00      1  groovysh_evaluate$_run_closure1.doCall [1]
               0.00      0.00    1/1      org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.DefaultGraphTraversal.iterate [2]
               0.00      0.00    1/1      org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource.V [3]
------------------------------------------------------------------------------------------------------------------------------------
               0.00      0.00    1/1      groovysh_evaluate$_run_closure1.doCall [1]
[2]      40.2  0.00      0.00      1  org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.DefaultGraphTraversal.iterate [2]
------------------------------------------------------------------------------------------------------------------------------------
               0.00      0.00    1/1      groovysh_evaluate$_run_closure1.doCall [1]
[3]       7.5  0.00      0.00      1  org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource.V [3]
------------------------------------------------------------------------------------------------------------------------------------
==>null

Describe Graph

A good implementation of the Gremlin APIs will validate its features against the Gremlin test suite. To learn more about a specific implementation’s compliance with the test suite, use the describeGraph function. The following shows the output for HadoopGraph:

gremlin> describeGraph(HadoopGraph)
==>
IMPLEMENTATION - org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
TINKERPOP TEST SUITE
- Compliant with (5 of 10 suites) +
> org.apache.tinkerpop.gremlin.structure.StructureStandardSuite
> org.apache.tinkerpop.gremlin.process.ProcessStandardSuite
> org.apache.tinkerpop.gremlin.process.ProcessComputerSuite
> org.apache.tinkerpop.gremlin.process.GroovyProcessStandardSuite
> org.apache.tinkerpop.gremlin.process.GroovyProcessComputerSuite
- Opts out of 23 individual tests
> org.apache.tinkerpop.gremlin.process.traversal.step.map.MatchTest$Traversals#g_V_matchXa_hasXname_GarciaX__a_0writtenBy_b__a_0sungBy_bX
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.MatchTest$Traversals#g_V_matchXa_0sungBy_b__a_0sungBy_c__b_writtenBy_d__c_writtenBy_e__d_hasXname_George_HarisonX__e_hasXname_Bob_MarleyXX
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.MatchTest$Traversals#g_V_matchXa_0sungBy_b__a_0writtenBy_c__b_writtenBy_d__c_sungBy_d__d_hasXname_GarciaXX
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.MatchTest$Traversals#g_V_matchXa_knows_b__c_knows_bX
        "Giraph does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.MatchTest$Traversals#g_V_matchXa_created_b__c_created_bX_selectXa_b_cX_byXnameX
        "Giraph does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.MatchTest$Traversals#g_V_out_asXcX_matchXb_knows_a__c_created_eX_selectXcX
        "Giraph does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.MatchTest$Traversals#g_V_matchXa_hasXname_GarciaX__a_0writtenBy_b__a_0sungBy_bX
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.GroovyMatchTest$Traversals#g_V_matchXa_knows_b__c_knows_bX
        "Giraph does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.GroovyMatchTest$Traversals#g_V_matchXa_created_b__c_created_bX_selectXa_b_cX_byXnameX
        "Giraph does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.GroovyMatchTest$Traversals#g_V_out_asXcX_matchXb_knows_a__c_created_eX_selectXcX
        "Giraph does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.GroovyMatchTest$Traversals#g_V_matchXa_0sungBy_b__a_0sungBy_c__b_writtenBy_d__c_writtenBy_e__d_hasXname_George_HarisonX__e_hasXname_Bob_MarleyXX
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.GroovyMatchTest$Traversals#g_V_matchXa_0sungBy_b__a_0writtenBy_c__b_writtenBy_d__c_sungBy_d__d_hasXname_GarciaXX
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.CountTest$Traversals#g_V_both_both_count
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.CountTest$Traversals#g_V_repeatXoutX_timesX3X_count
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.CountTest$Traversals#g_V_repeatXoutX_timesX8X_count
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.GroovyCountTest$Traversals#g_V_both_both_count
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.GroovyCountTest$Traversals#g_V_repeatXoutX_timesX3X_count
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
> org.apache.tinkerpop.gremlin.process.traversal.step.map.GroovyCountTest$Traversals#g_V_repeatXoutX_timesX8X_count
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
> org.apache.tinkerpop.gremlin.process.computer.GraphComputerTest#shouldNotAllowNullMemoryKeys
        "Giraph does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though."
> org.apache.tinkerpop.gremlin.process.computer.GraphComputerTest#shouldNotAllowSettingUndeclaredMemoryKeys
        "Giraph does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though."
> org.apache.tinkerpop.gremlin.process.computer.GraphComputerTest#shouldHaveConsistentMemoryVertexPropertiesAndExceptions
        "Giraph does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though."
> org.apache.tinkerpop.gremlin.process.traversal.step.sideEffect.ProfileTest$Traversals#g_V_out_out_profile_grateful
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
> org.apache.tinkerpop.gremlin.process.traversal.step.sideEffect.GroovyProfileTest$Traversals#g_V_out_out_profile_grateful
        "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute."
- NOTE -
The describeGraph() function shows information about a Graph implementation.
It uses information found in Java Annotations on the implementation itself to
determine this output and does not assess the actual code of the test cases of
the implementation itself.  Compliant implementations will faithfully and
honestly supply these Annotations to provide the most accurate depiction of
their support.

Implementations

gremlin racecar

Vendor Requirements

tinkerpop-enabled At the core of TinkerPop3 is a Java8 API. The implementation of this core API and its validation via the gremlin-test suite is all that is required of a vendor wishing to provide a TinkerPop3-enabled graph engine. Once a vendor has a valid implementation, then all the applications provided by TinkerPop (e.g. Gremlin Console, Gremlin Server, etc.) and 3rd-party developers (e.g. Gremlin-Scala, Gremlin-JS, etc.) will integrate properly with their graph engine. Finally, please feel free to use the logo on the left to promote your TinkerPop3 implementation.

Implementing Gremlin-Core

The classes that a vendor should focus on implementing are itemized below. It is a good idea to study the TinkerGraph (in-memory OLTP and OLAP in tinkergraph-gremlin), Neo4jGraph (OLTP w/ transactions in neo4j-gremlin) and/or HadoopGraph (OLAP in hadoop-gremlin) implementations for ideas and patterns.

  1. Online Transactional Processing Graph Systems (OLTP)

    1. Structure API: Graph, Element, Vertex, Edge, Property and Transaction (if transactions are supported).

    2. Process API: TraversalStrategy instances for optimizing Gremlin traversals to the vendor's graph system (e.g. TinkerGraphStepStrategy).

  2. Online Analytical Processing Graph Systems (OLAP)

    1. Everything required of OLTP is required of OLAP (but not vice versa).

    2. GraphComputer API: GraphComputer, Messenger, Memory.

Please consider the following implementation notes:

  • Be sure your Graph implementation is named as XXXGraph (e.g. TinkerGraph, Neo4jGraph, HadoopGraph, etc.).

  • Use StringHelper to ensure that the toString() representation of classes is consistent with other implementations.

  • Ensure that your implementation’s Features (Graph, Vertex, etc.) are correct so that test cases handle particulars accordingly.

  • Use the numerous static method helper classes such as ElementHelper, GraphComputerHelper, VertexProgramHelper, etc.

  • There are a number of default methods on the provided interfaces that are semantically correct. However, if they are not efficient for the implementation, override them.

  • Implement the structure/ package interfaces first and then, if desired, the interfaces in the process/ package.

  • ComputerGraph is a wrapper system that ensures proper semantics during a GraphComputer computation.

OLTP Implementations

pipes-character-1 The most important interfaces to implement are in the structure/ package. These include interfaces like Graph, Vertex, Edge, Property, Transaction, etc. The StructureStandardSuite will ensure that the semantics of the methods implemented are correct. Moreover, there are numerous Exceptions classes with static exceptions that should be thrown by the vendor so that all the exceptions and their messages are consistent amongst all TinkerPop3 implementations.

OLAP Implementations

furnace-character-1 Implementing the OLAP interfaces may be a bit more complicated. Note that before the OLAP interfaces are implemented, the OLTP interfaces must be implemented, at a minimum, as specified in OLTP Implementations. A summary of each required interface implementation is presented below:

  1. GraphComputer: A fluent builder for specifying an isolation level, a VertexProgram, and any number of MapReduce jobs to be submitted.

  2. Memory: A global blackboard for ANDing, ORing, INCRing, and SETing values for specified keys.

  3. Messenger: The system that collects and distributes messages being propagated by vertices executing the VertexProgram application.

  4. MapReduce.MapEmitter: The system that collects key/value pairs being emitted by the MapReduce application's map-phase.

  5. MapReduce.ReduceEmitter: The system that collects key/value pairs being emitted by the MapReduce application's combine- and reduce-phases.

Note
The VertexProgram and MapReduce interfaces in the process/computer/ package are not required by the vendor to implement. Instead, these are interfaces to be implemented by application developers writing VertexPrograms and MapReduce jobs.
Important
TinkerPop3 provides three OLAP implementations: TinkerGraphComputer (TinkerGraph), GiraphGraphComputer (HadoopGraph), and SparkGraphComputer (Hadoop). Given the complexity of the OLAP system, it is good to study and copy many of the patterns used in these reference implementations.

Implementing GraphComputer

furnace-character-3 The most complex method in GraphComputer is the submit()-method. The method must do the following:

  1. Ensure that the GraphComputer has not already been executed.

  2. Ensure that there is at least a VertexProgram or one MapReduce job.

  3. If there is a VertexProgram, validate that it can execute on the GraphComputer given the respectively defined features.

  4. Create the Memory to be used for the computation.

  5. Execute the VertexProgram.setup() method once and only once.

  6. Execute the VertexProgram.execute() method for each vertex.

  7. Execute the VertexProgram.terminate() method once and, if it returns false, repeat VertexProgram.execute().

  8. When VertexProgram.terminate() returns true, move to MapReduce job execution.

  9. MapReduce jobs are not required to be executed in any specified order.

  10. For each Vertex, execute MapReduce.map(). Then (if defined) execute MapReduce.combine() and MapReduce.reduce().

  11. Update Memory with runtime information.

  12. Construct a new ComputerResult containing the compute Graph and Memory.
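
The control flow of the steps above can be sketched in plain Java. This is only an illustration under simplifying assumptions: MiniVertexProgram, MiniComputer, and TwoRoundProgram are hypothetical stand-ins, not TinkerPop3 classes, a List of strings stands in for Memory, and the MapReduce phase is reduced to a placeholder.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for VertexProgram; not the TinkerPop3 interface.
interface MiniVertexProgram {
    void setup(List<String> log);
    void execute(int vertex, List<String> log);
    boolean terminate(int iteration);
}

// Hypothetical program that stops after its second round.
final class TwoRoundProgram implements MiniVertexProgram {
    public void setup(List<String> log) { log.add("setup"); }
    public void execute(int vertex, List<String> log) { log.add("execute:" + vertex); }
    public boolean terminate(int iteration) { return iteration >= 1; }
}

final class MiniComputer {
    private boolean executed = false;

    // Mirrors steps 1 and 4 to 8 of the list above for a fixed set of vertex ids.
    List<String> submit(MiniVertexProgram program, int[] vertices) {
        if (executed) throw new IllegalStateException("already executed"); // step 1
        executed = true;
        List<String> log = new ArrayList<>();  // stands in for Memory (step 4)
        program.setup(log);                    // step 5: once and only once
        int iteration = 0;
        while (true) {
            for (int v : vertices) program.execute(v, log); // step 6: every vertex
            if (program.terminate(iteration)) break;        // steps 7 and 8
            iteration++;
        }
        log.add("mapreduce");                  // placeholder for steps 9 to 12
        return log;
    }
}
```

Submitting TwoRoundProgram over vertices 1 and 2 records one setup, two full rounds of execute calls, and then the MapReduce placeholder. A second call to submit() on the same MiniComputer fails, which corresponds to step 1.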

Implementing Memory

The Memory object is initially defined by VertexProgram.setup(). The memory data is available in the first round of the VertexProgram.execute() method. Each Vertex, when executing the VertexProgram, can update the Memory in its round. However, the update is not seen by the other vertices until the next round. At the end of the first round, all the updates are aggregated and the new memory data is available on the second round. This process repeats until the VertexProgram terminates.
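
The round-based visibility described above might be implemented with two maps, one that vertices read and one that collects updates. The following is a minimal single-threaded sketch; RoundMemory and its method names are hypothetical and do not reproduce the TinkerPop3 Memory interface.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of round-buffered memory; not the TinkerPop3 Memory interface.
final class RoundMemory {
    private final Map<String, Long> visible = new HashMap<>(); // what vertices read this round
    private final Map<String, Long> pending = new HashMap<>(); // updates collected this round

    // Initial values, e.g. as they would be defined during setup.
    void set(String key, long value) { visible.put(key, value); }

    long get(String key) { return visible.getOrDefault(key, 0L); }

    // An INCR-style update: buffered, and not visible until completeRound() is called.
    void incr(String key, long delta) { pending.merge(key, delta, Long::sum); }

    // Called by the computer between rounds: aggregate pending updates into the visible view.
    void completeRound() {
        pending.forEach((key, delta) -> visible.merge(key, delta, Long::sum));
        pending.clear();
    }
}
```

A real implementation would also need to be safe for concurrent workers; that concern is left out here to keep the visibility rule in focus.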

Implementing Messenger

The Messenger object is similar to the Memory object in that a vertex can read and write to the Messenger. However, the data it reads are the messages sent to the vertex in the previous round and the data it writes are the messages that will be readable by the receiving vertices in the subsequent round.
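
One way to obtain this behavior is to keep two message stores and swap them between rounds. The sketch below assumes integer vertex ids and a single thread; MessageBoard is a hypothetical name and not the TinkerPop3 Messenger interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical double-buffered message store; not the TinkerPop3 Messenger interface.
final class MessageBoard<M> {
    private Map<Integer, List<M>> inbox = new HashMap<>();  // messages sent in the previous round
    private Map<Integer, List<M>> outbox = new HashMap<>(); // messages being sent in this round

    // Messages addressed to the vertex during the previous round.
    List<M> receive(int vertexId) {
        return inbox.getOrDefault(vertexId, Collections.emptyList());
    }

    // Queue a message that the receiver can read in the next round.
    void send(int toVertexId, M message) {
        outbox.computeIfAbsent(toVertexId, k -> new ArrayList<>()).add(message);
    }

    // Called between rounds: outgoing messages become the next round's incoming messages.
    void completeRound() {
        inbox = outbox;
        outbox = new HashMap<>();
    }
}
```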

Implementing MapReduce Emitters

The MapReduce framework in TinkerPop3 is similar to the model popularized by Hadoop. The primary difference is that all Mappers process the vertices of the graph, not an arbitrary key/value pair. However, the vertices' edges cannot be accessed, only their properties. This greatly reduces the amount of data that needs to be pushed through the MapReduce engine, as any edge information required can be computed in the VertexProgram.execute() method. Moreover, at this stage, vertices cannot be mutated; only their token and property data can be read. A Gremlin OLAP vendor needs to provide implementations for two particular classes: MapReduce.MapEmitter and MapReduce.ReduceEmitter. TinkerGraph’s implementation is provided below, which demonstrates the simplicity of the algorithm (especially when the data is all within the same JVM).

public class TinkerMapEmitter<K, V> implements MapReduce.MapEmitter<K, V> {

    public Map<K, Queue<V>> reduceMap;
    public Queue<KeyValue<K, V>> mapQueue;
    private final boolean doReduce;

    public TinkerMapEmitter(final boolean doReduce) { // (1)
        this.doReduce = doReduce;
        if (this.doReduce)
            this.reduceMap = new ConcurrentHashMap<>();
        else
            this.mapQueue = new ConcurrentLinkedQueue<>();
    }

    @Override
    public void emit(K key, V value) {
        if (this.doReduce)
            this.reduceMap.computeIfAbsent(key, k -> new ConcurrentLinkedQueue<>()).add(value); // (2)
        else
            this.mapQueue.add(new KeyValue<>(key, value)); // (3)
    }

    protected void complete(final MapReduce<K, V, ?, ?, ?> mapReduce) {
        if (!this.doReduce && mapReduce.getMapKeySort().isPresent()) { // (4)
            final Comparator<K> comparator = mapReduce.getMapKeySort().get();
            final List<KeyValue<K, V>> list = new ArrayList<>(this.mapQueue);
            Collections.sort(list, Comparator.comparing(KeyValue::getKey, comparator));
            this.mapQueue.clear();
            this.mapQueue.addAll(list);
        } else if (mapReduce.getMapKeySort().isPresent()) {
            final Comparator<K> comparator = mapReduce.getMapKeySort().get();
            final List<Map.Entry<K, Queue<V>>> list = new ArrayList<>();
            list.addAll(this.reduceMap.entrySet());
            Collections.sort(list, Comparator.comparing(Map.Entry::getKey, comparator));
            this.reduceMap = new LinkedHashMap<>();
            list.forEach(entry -> this.reduceMap.put(entry.getKey(), entry.getValue()));
        }
    }
}
  1. If the MapReduce job has a reduce, then use one data structure (reduceMap), else use another (mapQueue). The difference is that a reduction requires a grouping by key and therefore the Map<K,Queue<V>> definition. If no reduction/grouping is required, then a simple Queue<KeyValue<K,V>> can be leveraged.

  2. If reduce is to follow, then add the value to the queue stored in the Map under its key; computeIfAbsent() creates that queue the first time a key is seen.

  3. If no reduce is to follow, then simply append a KeyValue to the queue.

  4. When the map phase is complete, any map-result sorting required can be executed at this point.

public class TinkerReduceEmitter<OK, OV> implements MapReduce.ReduceEmitter<OK, OV> {

    protected Queue<KeyValue<OK, OV>> reduceQueue = new ConcurrentLinkedQueue<>();

    @Override
    public void emit(final OK key, final OV value) {
        this.reduceQueue.add(new KeyValue<>(key, value));
    }

    protected void complete(final MapReduce<?, ?, OK, OV, ?> mapReduce) {
        if (mapReduce.getReduceKeySort().isPresent()) {
            final Comparator<OK> comparator = mapReduce.getReduceKeySort().get();
            final List<KeyValue<OK, OV>> list = new ArrayList<>(this.reduceQueue);
            Collections.sort(list, Comparator.comparing(KeyValue::getKey, comparator));
            this.reduceQueue.clear();
            this.reduceQueue.addAll(list);
        }
    }
}

The method MapReduce.reduce() is defined as:

public void reduce(final OK key, final Iterator<OV> values, final ReduceEmitter<OK, OV> emitter) { ... }

In other words, for the TinkerGraph implementation, iterate through the entrySet of the reduceMap and call the reduce() method on each entry. The reduce() method can emit key/value pairs which are simply aggregated into a Queue<KeyValue<OK,OV>> in an analogous fashion to TinkerMapEmitter when no reduce is to follow. These two emitters are tied together in TinkerGraphComputer.submit().
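
The group-then-reduce flow described here can be illustrated without any TinkerPop3 types. In the sketch below, GroupThenReduce is a hypothetical class, the reducer simply sums integer values, and a list of strings stands in for the reduce queue.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupThenReduce {
    // Map-phase output grouped by key, similar in spirit to reduceMap when a reduce follows.
    static Map<String, List<Integer>> group(String[] keys, int[] values) {
        Map<String, List<Integer>> reduceMap = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++)
            reduceMap.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        return reduceMap;
    }

    // Reduce phase: one call per key; this hypothetical reducer sums the values of each key.
    static List<String> reduce(Map<String, List<Integer>> reduceMap) {
        List<String> reduceQueue = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : reduceMap.entrySet()) {
            int sum = 0;
            for (Iterator<Integer> it = entry.getValue().iterator(); it.hasNext(); ) sum += it.next();
            reduceQueue.add(entry.getKey() + "=" + sum); // stands in for emitter.emit(key, sum)
        }
        return reduceQueue;
    }
}
```

Grouping the keys java, groovy, java with a value of 1 each and then reducing yields one entry per distinct key, with the two java values summed.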

...
for (final MapReduce mapReduce : mapReducers) {
    if (mapReduce.doStage(MapReduce.Stage.MAP)) {
        final TinkerMapEmitter<?, ?> mapEmitter = new TinkerMapEmitter<>(mapReduce.doStage(MapReduce.Stage.REDUCE));
        final SynchronizedIterator<Vertex> vertices = new SynchronizedIterator<>(this.graph.vertices());
        workers.setMapReduce(mapReduce);
        workers.mapReduceWorkerStart(MapReduce.Stage.MAP);
        workers.executeMapReduce(workerMapReduce -> {
            while (true) {
                final Vertex vertex = vertices.next();
                if (null == vertex) return;
                workerMapReduce.map(ComputerGraph.mapReduce(vertex), mapEmitter);
            }
        });
        workers.mapReduceWorkerEnd(MapReduce.Stage.MAP);

        // sort results if a map output sort is defined
        mapEmitter.complete(mapReduce);

        // no need to run combiners as this is single machine
        if (mapReduce.doStage(MapReduce.Stage.REDUCE)) {
            final TinkerReduceEmitter<?, ?> reduceEmitter = new TinkerReduceEmitter<>();
            final SynchronizedIterator<Map.Entry<?, Queue<?>>> keyValues = new SynchronizedIterator((Iterator) mapEmitter.reduceMap.entrySet().iterator());
            workers.mapReduceWorkerStart(MapReduce.Stage.REDUCE);
            workers.executeMapReduce(workerMapReduce -> {
                while (true) {
                    final Map.Entry<?, Queue<?>> entry = keyValues.next();
                    if (null == entry) return;
                    workerMapReduce.reduce(entry.getKey(), entry.getValue().iterator(), reduceEmitter);
                }
            });
            workers.mapReduceWorkerEnd(MapReduce.Stage.REDUCE);
            reduceEmitter.complete(mapReduce); // sort results if a reduce output sort is defined
            mapReduce.addResultToMemory(this.memory, reduceEmitter.reduceQueue.iterator()); // (1)
        } else {
            mapReduce.addResultToMemory(this.memory, mapEmitter.mapQueue.iterator()); // (2)
        }
    }
}
...
  1. Note that the final results of the reducer are provided to the Memory as specified by the application developer’s MapReduce.addResultToMemory() implementation.

  2. If there is no reduce stage, the map-stage results are inserted into Memory as specified by the application developer’s MapReduce.addResultToMemory() implementation.

IO Implementations

If a Graph requires custom serializers for IO to work properly, implement the Graph.io method. A typical example of a Graph that requires custom serializers is one whose identifier system uses non-primitive values, such as OrientDB’s Rid class. From basic serialization of a single Vertex all the way up the stack to Gremlin Server, knowing how to handle these complex identifiers is an important requirement.

The first step in implementing custom serializers is to implement the IoRegistry interface and register the custom classes and serializers with it. Each Io implementation has different requirements for what it expects from the IoRegistry:

  • GraphML - No custom serializers expected/allowed.

  • GraphSON - Register a Jackson SimpleModule. The SimpleModule encapsulates specific classes to be serialized, so it does not need to be registered to a specific class in the IoRegistry (use null).

  • Gryo - Expects registration of one of three objects:

    • Register just the custom class with a null Kryo Serializer implementation - this class will use default "field-level" Kryo serialization.

    • Register the custom class with a specific Kryo Serializer implementation.

    • Register the custom class with a Function<Kryo, Serializer> for those cases where the Kryo Serializer requires the Kryo instance to get constructed.

This implementation should provide a zero-arg constructor as the stack may require instantiation via reflection. Consider extending AbstractIoRegistry for convenience as follows:

public class MyGraphIoRegistry extends AbstractIoRegistry {
    public MyGraphIoRegistry() {
        register(GraphSONIo.class, null, new MyGraphSimpleModule());
        register(GryoIo.class, MyGraphIdClass.class, new MyGraphIdSerializer());
    }
}
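
The reason for the zero-arg constructor can be shown with plain reflection. The sketch below is an assumption about how a configuration-driven stack might instantiate a registry from its class name; MiniRegistry, MyMiniRegistry, and RegistryLoader are hypothetical and are not part of the TinkerPop3 API.

```java
// Hypothetical registry types for illustration; not the TinkerPop3 IoRegistry API.
interface MiniRegistry {
    String describe();
}

final class MyMiniRegistry implements MiniRegistry {
    public MyMiniRegistry() { }  // the zero-arg constructor that reflection looks up
    public String describe() { return "my-registry"; }
}

final class RegistryLoader {
    // Instantiate a registry from its class name, as a stack reading configuration might.
    static MiniRegistry load(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            // getDeclaredConstructor() with no arguments fails if no zero-arg constructor exists.
            return (MiniRegistry) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + className, e);
        }
    }
}
```

If MyMiniRegistry only declared a constructor with parameters, the lookup would throw NoSuchMethodException and loading would fail.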

In the Graph.io method, provide the IoRegistry object to the supplied Builder and call the create method to return that Io instance as follows:

public <I extends Io> I io(final Io.Builder<I> builder) {
    return (I) builder.graph(this).registry(myGraphIoRegistry).create();
}

In this way, Graph implementations can pre-configure custom serializers for IO interactions and users will not need to know about those details. Following this pattern will ensure proper execution of the test suite as well as simplified usage for end-users.

Important
Proper implementation of IO is critical to successful Graph operations in Gremlin Server. The Test Suite does have "serialization" tests that provide some assurance that an implementation is working properly, but those tests cannot make assertions against any specifics of a custom serializer. It is the responsibility of the implementer to test the specifics of their custom serializers.
Tip
Consider separating serializer code into its own module, if possible, so that clients that use the Graph implementation remotely don’t need a full dependency on the entire Graph - just the IO components and related classes being serialized.

Validating with Gremlin-Test


<dependency>
  <groupId>org.apache.tinkerpop</groupId>
  <artifactId>gremlin-test</artifactId>
  <version>3.0.0-SNAPSHOT</version>
</dependency>
<dependency>
  <groupId>org.apache.tinkerpop</groupId>
  <artifactId>gremlin-groovy-test</artifactId>
  <version>3.0.0-SNAPSHOT</version>
</dependency>

The operational semantics of any OLTP or OLAP implementation are validated by gremlin-test and functional interoperability with the Groovy environment is ensured by gremlin-groovy-test. To implement these tests, provide test case implementations as shown below, where XXX denotes the name of the graph implementation (e.g. TinkerGraph, Neo4jGraph, HadoopGraph, etc.).

// Structure API tests
@RunWith(StructureStandardSuite.class)
@GraphProviderClass(provider = XXXGraphProvider.class, graph = XXXGraph.class)
public class XXXStructureStandardTest {}

// Process API tests
@RunWith(ProcessComputerSuite.class)
@GraphProviderClass(provider = XXXGraphProvider.class, graph = XXXGraph.class)
public class XXXProcessComputerTest {}

@RunWith(ProcessStandardSuite.class)
@GraphProviderClass(provider = XXXGraphProvider.class, graph = XXXGraph.class)
public class XXXProcessStandardTest {}

@RunWith(GroovyEnvironmentSuite.class)
@GraphProviderClass(provider = XXXProvider.class, graph = XXXGraph.class)
public class XXXGroovyEnvironmentTest {}

@RunWith(GroovyProcessStandardSuite.class)
@GraphProviderClass(provider = XXXGraphProvider.class, graph = XXXGraph.class)
public class XXXGroovyProcessStandardTest {}

@RunWith(GroovyProcessComputerSuite.class)
@GraphProviderClass(provider = XXXGraphComputerProvider.class, graph = XXXGraph.class)
public class XXXGroovyProcessComputerTest {}

The above set of tests represents the minimum test suite set to implement. There are other "integration" and "performance" tests that should be considered optional. Implementing those tests requires the same pattern as shown above.

Important
It is as important to look at "ignored" tests as it is to look at ones that fail. The gremlin-test suite utilizes the Feature implementation exposed by the Graph to determine which tests to execute. If a test utilizes features that are not supported by the graph, the suite will ignore that test. While that may be fine, implementers should validate that the ignored tests are appropriately bypassed and that there are no mistakes in their feature definitions. Moreover, implementers should consider filling gaps in their own test suites, especially when IO-related tests are being ignored.

The only test-class that requires any code investment is the GraphProvider implementation class. This class is used by the test suite to construct Graph configurations and instances and provides information about the vendor’s implementation itself. In most cases, it is best to simply extend AbstractGraphProvider as it provides many default implementations of the GraphProvider interface.

Finally, specify the test suites that will be supported by the Graph implementation using the @Graph.OptIn annotation. See the TinkerGraph implementation below as an example:

@Graph.OptIn(Graph.OptIn.SUITE_STRUCTURE_STANDARD)
@Graph.OptIn(Graph.OptIn.SUITE_PROCESS_STANDARD)
@Graph.OptIn(Graph.OptIn.SUITE_PROCESS_COMPUTER)
@Graph.OptIn(Graph.OptIn.SUITE_GROOVY_PROCESS_STANDARD)
@Graph.OptIn(Graph.OptIn.SUITE_GROOVY_PROCESS_COMPUTER)
@Graph.OptIn(Graph.OptIn.SUITE_GROOVY_ENVIRONMENT)
public class TinkerGraph implements Graph {

Only include annotations for the suites the implementation will support. Note that implementing the suite, but not specifying the appropriate annotation will prevent the suite from running (an obvious error message will appear in this case when running the mis-configured suite).
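
How a suite runner could consult such annotations can be sketched with repeatable annotations and reflection. This is an assumption about the general mechanism rather than a description of the gremlin-test code; SuiteOptIn, SuiteOptIns, MyGraphStub, and OptInChecker are hypothetical names.

```java
import java.lang.annotation.Repeatable;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical annotations modeled loosely on @Graph.OptIn; not the TinkerPop3 types.
@Retention(RetentionPolicy.RUNTIME)
@Repeatable(SuiteOptIns.class)
@interface SuiteOptIn {
    String value();
}

// Container annotation required by @Repeatable.
@Retention(RetentionPolicy.RUNTIME)
@interface SuiteOptIns {
    SuiteOptIn[] value();
}

@SuiteOptIn("StructureStandardSuite")
@SuiteOptIn("ProcessStandardSuite")
final class MyGraphStub { }

final class OptInChecker {
    // Collect the suite names that a graph class has opted in to.
    static List<String> optedIn(Class<?> graphClass) {
        List<String> suites = new ArrayList<>();
        for (SuiteOptIn optIn : graphClass.getAnnotationsByType(SuiteOptIn.class))
            suites.add(optIn.value());
        return suites;
    }

    // A runner could refuse to start a suite that the graph has not opted in to.
    static boolean mayRun(Class<?> graphClass, String suite) {
        return optedIn(graphClass).contains(suite);
    }
}
```

With the stub above, a check for ProcessStandardSuite succeeds while a check for ProcessComputerSuite does not, which is the situation in which a mis-configured suite would report an error.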

There are times when there may be a specific test in the suite that the implementation cannot support (despite the features it implements) or should not otherwise be executed. It is possible for implementers to "opt-out" of a test by using the @Graph.OptOut annotation. The following is an example of this annotation usage as taken from HadoopGraph:

@Graph.OptIn(Graph.OptIn.SUITE_PROCESS_STANDARD)
@Graph.OptIn(Graph.OptIn.SUITE_PROCESS_COMPUTER)
@Graph.OptOut(
        test = "org.apache.tinkerpop.gremlin.process.graph.step.map.MatchTest$Traversals",
        method = "g_V_matchXa_hasXname_GarciaX__a_inXwrittenByX_b__a_inXsungByX_bX",
        reason = "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute.")
@Graph.OptOut(
        test = "org.apache.tinkerpop.gremlin.process.graph.step.map.MatchTest$Traversals",
        method = "g_V_matchXa_inXsungByX_b__a_inXsungByX_c__b_outXwrittenByX_d__c_outXwrittenByX_e__d_hasXname_George_HarisonX__e_hasXname_Bob_MarleyXX",
        reason = "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute.")
@Graph.OptOut(
        test = "org.apache.tinkerpop.gremlin.process.computer.GraphComputerTest",
        method = "shouldNotAllowBadMemoryKeys",
        reason = "Hadoop does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though.")
@Graph.OptOut(
        test = "org.apache.tinkerpop.gremlin.process.computer.GraphComputerTest",
        method = "shouldRequireRegisteringMemoryKeys",
        reason = "Hadoop does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though.")
public class HadoopGraph implements Graph {

The above examples show how to ignore individual tests. It is also possible to:

  • Ignore an entire test case (i.e. all the methods within the test) by setting the method to "*".

  • Ignore a "base" test class such that tests that extend from those classes will all be ignored. This style of ignoring is useful for Gremlin "process" tests that have base classes that are extended by various Gremlin flavors (e.g. groovy).

Also note that some of the tests in the Gremlin Test Suite are parameterized tests and require an additional level of specificity to be properly ignored. To ignore these types of tests, examine the name template of the parameterized tests. It is defined by a Java annotation that looks like this:

@Parameterized.Parameters(name = "expect({0})")

The annotation above shows that the name of each parameterized test will be prefixed with "expect" and have parentheses wrapped around the first parameter (at index 0) value supplied to each test. This information can only be garnered by studying the test setup itself. Once the pattern is determined and the unique name of the parameterized test is identified, add it to the "specific" property on the OptOut annotation in addition to the other arguments.
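
To see what a given name template produces, the placeholder expansion can be reproduced with java.text.MessageFormat, which uses the same {0}-style placeholders. This is a small sketch under that assumption; ParameterizedNames is a hypothetical helper and the parameter value is made up for illustration.

```java
import java.text.MessageFormat;

final class ParameterizedNames {
    // Expand a name template such as "expect({0})" for one row of test parameters.
    static String expand(String template, Object... parameters) {
        return MessageFormat.format(template, parameters);
    }
}
```

For the template above and a first parameter of Garcia, the expanded name is expect(Garcia). Values other than strings may be formatted in a locale-dependent way, so it is safest to confirm the exact name by running the test once.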

These annotations help provide users a level of transparency into test suite compliance (via the describeGraph() utility function). They also allow implementers a lot of flexibility in terms of how they wish to support TinkerPop. For example, maybe there is a single test case that prevents an implementer from claiming support of a Feature. The implementer could choose to either not support the Feature or to support it but "opt-out" of the test with a "reason" as to why so that users understand the limitation.

Important
Before using OptOut, be sure that the reason for using it is sound; treat it as a last resort. It is possible that a test from the suite doesn’t properly represent the expectations of a feature, is too broad or narrow for the semantics it is trying to enforce, or simply contains a bug. Please consider raising such concerns on the developer mailing list before assuming OptOut is the only answer.
Important
There are no tests that specifically validate complete compliance with Gremlin Server. Generally speaking, a Graph that passes the full Test Suite, should be compliant with Gremlin Server. The one area where problems can occur is in serialization. Always ensure that IO is properly implemented, that custom serializers are tested fully and ultimately integration test the Graph with an actual Gremlin Server instance.
Caution
Configuring tests to run in parallel might result in errors that are difficult to debug as there is some shared state in test execution around graph configuration. It is therefore recommended that parallelism be turned off for the test suite (the Maven Surefire Plugin is configured this way by default). It may also be important to include this setting, <reuseForks>false</reuseForks>, in the Surefire configuration if tests are failing in an unexplainable way.

Accessibility via GremlinPlugin

The applications distributed with TinkerPop3 do not distribute with any vendor implementations besides TinkerGraph. If your implementation is stored in a Maven repository (e.g. Maven Central Repository), then it is best to provide a GremlinPlugin implementation so the respective jars can be downloaded as and when required by the user. Neo4j’s GremlinPlugin is provided below for reference.

public class Neo4jGremlinPlugin implements GremlinPlugin {

    private static final String IMPORT = "import ";
    private static final String DOT_STAR = ".*";

    private static final Set<String> IMPORTS = new HashSet<String>() {{
        add(IMPORT + Neo4jGraph.class.getPackage().getName() + DOT_STAR);
    }};

    @Override
    public String getName() {
        return "tinkerpop.neo4j";
    }

    @Override
    public void pluginTo(final PluginAcceptor pluginAcceptor) {
        pluginAcceptor.addImports(IMPORTS);
    }
}

With the above plugin implementations, users can now download respective binaries for Gremlin Console, Gremlin Server, etc.

gremlin> g = Neo4jGraph.open('/tmp/neo4j')
No such property: Neo4jGraph for class: groovysh_evaluate
Display stack trace? [yN]
gremlin> :install org.apache.tinkerpop neo4j-gremlin 3.0.0-SNAPSHOT
==>loaded: [org.apache.tinkerpop, neo4j-gremlin, …]
gremlin> :plugin use tinkerpop.neo4j
==>tinkerpop.neo4j activated
gremlin> g = Neo4jGraph.open('/tmp/neo4j')
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]

In-Depth Implementations

The vendor implementation details presented thus far are the minimum requirements necessary to yield a valid TinkerPop3 implementation. However, there are other areas that a vendor can tweak to provide an implementation more optimized for their underlying graph engine. Typical areas of focus include:

  • Traversal Strategies: A TraversalStrategy can be used to alter a traversal prior to its execution. A typical example is converting a pattern of g.V().has('name','marko') into a global index lookup for all vertices with name "marko". In this way, an O(|V|) lookup becomes an O(log(|V|)) lookup. Please review TinkerGraphStepStrategy for ideas.

  • Step Implementations: Every step is ultimately referenced by the GraphTraversal interface. It is possible to extend GraphTraversal to use a vendor-specific step implementation.

TinkerGraph-Gremlin

<dependency>
   <groupId>org.apache.tinkerpop</groupId>
   <artifactId>tinkergraph-gremlin</artifactId>
   <version>3.0.0-SNAPSHOT</version>
</dependency>

TinkerGraph is a single machine, in-memory, non-transactional graph engine that provides both OLTP and OLAP functionality. It is deployed with TinkerPop3 and serves as the reference implementation for other vendors to study in order to understand the semantics of the various methods of the TinkerPop3 API. Constructing a simple graph in Java 8 is presented below.

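Graph graph = TinkerGraph.open();
Vertex marko = graph.addVertex("name","marko","age",29);
Vertex lop = graph.addVertex("name","lop","lang","java");
marko.addEdge("created",lop,"weight",0.6d);
GraphTraversalSource g = graph.traversal(); // traversals start from a source, not from the Graph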

The above graph creates two vertices named "marko" and "lop" and connects them via a created-edge with a weight=0.6 property. Next, the graph can be queried as such.

g.V().has("name","marko").out("created").values("name")

The g.V().has("name","marko") part of the query can be executed in two ways.

  • A linear scan of all vertices filtering out those vertices that don’t have the name "marko"

  • A O(log(|V|)) index lookup for all vertices with the name "marko"

Given the initial graph construction in the first code block, no index was defined and thus, a linear scan is executed. However, if the graph was constructed as such, then an index lookup would be used.

TinkerGraph graph = TinkerGraph.open();
graph.createIndex("name",Vertex.class);
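
The difference between the two execution strategies can be made concrete without a graph library. The sketch below uses a sorted map as the index so that the lookup cost matches the O(log(|V|)) figure quoted above; it makes no claim about the data structure TinkerGraph uses internally, and NameLookup is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class NameLookup {
    private final Map<Integer, String> namesById = new HashMap<>();           // all "vertices"
    private final TreeMap<String, List<Integer>> nameIndex = new TreeMap<>(); // sorted name index

    void add(int id, String name) {
        namesById.put(id, name);
        nameIndex.computeIfAbsent(name, k -> new ArrayList<>()).add(id);
    }

    // O(|V|): visit every vertex and keep those whose name matches.
    List<Integer> scan(String name) {
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : namesById.entrySet())
            if (entry.getValue().equals(name)) hits.add(entry.getKey());
        return hits;
    }

    // O(log(|V|)) with a sorted map: go directly to the matching key.
    List<Integer> indexed(String name) {
        return nameIndex.getOrDefault(name, new ArrayList<>());
    }
}
```

Both methods return the same vertex ids; only the amount of work differs, which is what the timings in this section measure.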

The execution times for a vertex lookup by property are provided below for both the no-index and indexed versions of TinkerGraph over the Grateful Dead graph.

gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> graph.io(graphml()).readGraph('data/grateful-dead.xml')
==>null
gremlin> clock(1000) {g.V().has('name','Garcia').iterate()} //(1)
==>0.206167824
gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> graph.createIndex('name',Vertex.class)
==>null
gremlin> graph.io(graphml()).readGraph('data/grateful-dead.xml')
==>null
gremlin> clock(1000){g.V().has('name','Garcia').iterate()} //(2)
==>0.053134954
  1. Determine the average runtime of 1000 vertex lookups when no name-index is defined.

  2. Determine the average runtime of 1000 vertex lookups when a name-index is defined.

Important
Each graph vendor will have a different mechanism by which indices and schemas are defined. TinkerPop3 does not require any conformance in this area. In TinkerGraph, the only definitions are around indices. With other vendors, property value types, indices, edge labels, etc. may need to be defined before adding data to the graph.
Note
TinkerGraph is distributed with Gremlin Server and is therefore automatically available to it for configuration.

Configuration

TinkerGraph has several settings that can be provided on creation via a Configuration object:

Property                                               Description
gremlin.graph                                          org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph
gremlin.tinkergraph.vertexIdManager                    The IdManager implementation to use for vertices.
gremlin.tinkergraph.edgeIdManager                      The IdManager implementation to use for edges.
gremlin.tinkergraph.vertexPropertyIdManager            The IdManager implementation to use for vertex properties.
gremlin.tinkergraph.defaultVertexPropertyCardinality   The default VertexProperty.Cardinality to use when Vertex.property(k,v) is called.

The IdManager settings above refer to how TinkerGraph will control identifiers for vertices, edges and vertex properties. There are several options for each of these settings: ANY, LONG, INTEGER, UUID, or the fully qualified class name of an IdManager implementation on the classpath. When not specified, the default value for all settings is ANY, meaning that the graph will work with any object on the JVM as the identifier and will generate new identifiers from Long when the identifier is not user supplied. TinkerGraph will also expect the user to understand the types used for identifiers when querying, meaning that g.V(1) and g.V(1L) could return two different vertices. LONG, INTEGER and UUID settings will try to coerce identifier values to the expected type as well as generate new identifiers with that specified type.
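
Why g.V(1) and g.V(1L) can differ under ANY follows from ordinary Java equality: an Integer and a Long with the same numeric value are not equal objects. The sketch below shows this with a plain map and adds a hypothetical coercion helper in the spirit of the LONG setting; IdCoercion is not a TinkerPop3 class and does not reproduce the actual IdManager logic.

```java
import java.util.HashMap;
import java.util.Map;

final class IdCoercion {
    // With ANY-style ids the key is used as given: Integer 1 and Long 1L are distinct keys.
    static Map<Object, String> anyIds() {
        Map<Object, String> vertices = new HashMap<>();
        vertices.put(1, "vertex stored under Integer 1");
        vertices.put(1L, "vertex stored under Long 1");
        return vertices;
    }

    // A LONG-style manager could coerce numeric or string ids to Long before any lookup.
    static Long toLong(Object id) {
        if (id instanceof Number) return ((Number) id).longValue();
        if (id instanceof String) return Long.parseLong((String) id);
        throw new IllegalArgumentException("cannot coerce to Long: " + id);
    }
}
```

The map returned by anyIds() holds two entries, whereas coercing 1, 1L, and "1" with toLong() yields the same Long key in every case.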

It is important to consider the data being imported to TinkerGraph with respect to the defaultVertexPropertyCardinality setting. For example, if a .gryo file is known to contain multi-property data, be sure to set the default cardinality to list or else the data will import as single. Consider the following:

gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> graph.io(gryo()).readGraph("data/tinkerpop-crew.kryo")
==>null
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:14], standard]
gremlin> g.V().properties()
==>vp[name->marko]
==>vp[location->santa fe]
==>vp[name->stephen]
==>vp[location->purcellville]
==>vp[name->matthias]
==>vp[location->seattle]
==>vp[name->daniel]
==>vp[location->aachen]
==>vp[name->gremlin]
==>vp[name->tinkergraph]
gremlin> conf = new BaseConfiguration()
==>org.apache.commons.configuration.BaseConfiguration@51bddd98
gremlin> conf.setProperty("gremlin.tinkergraph.defaultVertexPropertyCardinality","list")
==>null
gremlin> graph = TinkerGraph.open(conf)
==>tinkergraph[vertices:0 edges:0]
gremlin> graph.io(gryo()).readGraph("data/tinkerpop-crew.kryo")
==>null
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:14], standard]
gremlin> g.V().properties()
==>vp[name->marko]
==>vp[location->san diego]
==>vp[location->santa cruz]
==>vp[location->brussels]
==>vp[location->santa fe]
==>vp[name->stephen]
==>vp[location->centreville]
==>vp[location->dulles]
==>vp[location->purcellville]
==>vp[name->matthias]
==>vp[location->bremen]
==>vp[location->baltimore]
==>vp[location->oakland]
==>vp[location->seattle]
==>vp[name->daniel]
==>vp[location->spremberg]
==>vp[location->kaiserslautern]
==>vp[location->aachen]
==>vp[name->gremlin]
==>vp[name->tinkergraph]
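
The effect seen in the two listings above can be mimicked with a small map-based model of one vertex's properties. This is only a sketch of the single versus list behavior; CardinalityDemo is a hypothetical class and says nothing about how TinkerGraph stores properties internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CardinalityDemo {
    // Load key/value pairs for one vertex under single or list cardinality.
    static Map<String, List<String>> load(boolean listCardinality, String[][] keyValues) {
        Map<String, List<String>> properties = new LinkedHashMap<>();
        for (String[] keyValue : keyValues) {
            List<String> values = properties.computeIfAbsent(keyValue[0], k -> new ArrayList<>());
            if (!listCardinality) values.clear(); // single: the newest value replaces the old one
            values.add(keyValue[1]);              // list: the value is appended
        }
        return properties;
    }
}
```

Loading the four location values of the marko vertex under single cardinality leaves only the last one, santa fe, while list cardinality keeps all four.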

Neo4j-Gremlin

<dependency>
   <groupId>org.apache.tinkerpop</groupId>
   <artifactId>neo4j-gremlin</artifactId>
   <version>3.0.0-SNAPSHOT</version>
</dependency>

Neo Technology are the developers of the OLTP-based Neo4j graph database.

Caution
Unless under a commercial agreement with Neo Technology, Neo4j is licensed AGPL. The neo4j-gremlin module is licensed Apache2 because it only references the Apache2-licensed Neo4j API (not its implementation). Note that neither the Gremlin Console nor Gremlin Server distribute with the Neo4j implementation binaries. To access the binaries, use the :install command to download binaries from Maven Central Repository.
gremlin> :install org.apache.tinkerpop neo4j-gremlin 3.0.0-SNAPSHOT
==>Loaded: [org.apache.tinkerpop, neo4j-gremlin, 3.0.0-SNAPSHOT] - restart the console to use [tinkerpop.neo4j]
gremlin> :q
...
gremlin> :plugin use tinkerpop.neo4j
==>tinkerpop.neo4j activated
gremlin> graph = Neo4jGraph.open('/tmp/neo4j')
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]
Note
Neo4j High Availability is currently not supported by Neo4j-Gremlin.
Tip
To host Neo4j in Gremlin Server, the dependencies must first be "installed" or otherwise copied to the Gremlin Server path. The automated method for doing this would be to execute bin/gremlin-server.sh -i org.apache.tinkerpop neo4j-gremlin 3.0.0-SNAPSHOT.

Indices

Neo4j 2.x indices leverage vertex labels to partition the index space. TinkerPop3 does not provide method interfaces for defining schemas/indices for the underlying graph system. Thus, to create indices, call the Neo4j API directly.

Note
Neo4jGraphStep will attempt to discern which indices to use when executing a traversal of the form g.V().has().

The Gremlin-Console session below demonstrates Neo4j indices. For more information, please refer to the Neo4j documentation:

  • Manipulating indices with Cypher.

  • Manipulating indices with the Neo4j Java API.

gremlin> graph = Neo4jGraph.open('/tmp/neo4j')
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]
gremlin> graph.cypher("CREATE INDEX ON :person(name)")
gremlin> graph.tx().commit() //(1)
==>null
gremlin> graph.addVertex(label,'person','name','marko')
==>v[0]
gremlin> graph.addVertex(label,'dog','name','puppy')
==>v[1]
gremlin> g = graph.traversal()
==>graphtraversalsource[neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]], standard]
gremlin> g.V().hasLabel('person').has('name','marko').values('name')
==>marko
gremlin> graph.close()
==>null
  1. Schema mutations must happen in a different transaction than graph mutations.

The session below demonstrates the runtime benefits of indices and shows that, when there is no defined index (only vertex labels), a linear scan of the vertex-label partition is still faster than a linear scan of all vertices.

gremlin> graph = Neo4jGraph.open('/tmp/neo4j')
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]
gremlin> graph.io(graphml()).readGraph('data/grateful-dead.xml')
==>null
gremlin> g = graph.traversal()
==>graphtraversalsource[neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]], standard]
gremlin> g.tx().commit()
==>null
gremlin> clock(1000) {g.V().hasLabel('artist').has('name','Garcia').iterate()} //(1)
==>0.6756226369999999
gremlin> graph.cypher("CREATE INDEX ON :artist(name)") //(2)
gremlin> g.tx().commit()
==>null
gremlin> Thread.sleep(5000) //(3)
==>null
gremlin> clock(1000) {g.V().hasLabel('artist').has('name','Garcia').iterate()} //(4)
==>0.11625271699999999
gremlin> clock(1000) {g.V().has('name','Garcia').iterate()} //(5)
==>0.92949407
gremlin> graph.cypher("DROP INDEX ON :artist(name)") //(6)
gremlin> g.tx().commit()
==>null
gremlin> graph.close()
==>null
  1. Find all artists whose name is Garcia, which does a linear scan of the artist vertex-label partition.

  2. Create an index for all artist vertices on their name property.

  3. Neo4j indices are eventually consistent so this stalls to give the index time to populate itself.

  4. Find all artists whose name is Garcia, which uses the pre-defined schema index.

  5. Find all vertices whose name is Garcia, which requires a linear scan of all the data in the graph.

  6. Drop the created index.
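
The relative timings above can be pictured with a toy model that counts how many vertices each lookup strategy has to inspect. The sketch below is plain Java with no TinkerPop or Neo4j dependency. The class LookupCostSketch, its fields, and the sample data are all hypothetical, and the model makes no claim about how Neo4j actually stores labels or indices.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: counts how many vertices each lookup strategy inspects.
// It is an assumption-laden illustration, not Neo4j's storage layout.
class LookupCostSketch {
    // A vertex reduced to the two fields the example needs.
    static final class ToyVertex {
        final String label;
        final String name;
        ToyVertex(String label, String name) { this.label = label; this.name = name; }
    }

    final List<ToyVertex> all = new ArrayList<>();                           // every vertex
    final Map<String, List<ToyVertex>> byLabel = new HashMap<>();            // label partitions
    final Map<String, Map<String, List<ToyVertex>>> index = new HashMap<>(); // label -> name -> vertices

    void add(String label, String name) {
        ToyVertex v = new ToyVertex(label, name);
        all.add(v);
        byLabel.computeIfAbsent(label, k -> new ArrayList<>()).add(v);
        index.computeIfAbsent(label, k -> new HashMap<>())
             .computeIfAbsent(name, k -> new ArrayList<>()).add(v);
    }

    // Models g.V().has('name', ...): every vertex in the graph is inspected.
    int inspectedByFullScan() {
        return all.size();
    }

    // Models g.V().hasLabel(label).has('name', ...) without an index:
    // only the vertices in the label partition are inspected.
    int inspectedByLabelScan(String label) {
        List<ToyVertex> partition = byLabel.getOrDefault(label, Collections.emptyList());
        return partition.size();
    }

    // Models the same traversal with an index on (label, name):
    // only the matching vertices are touched.
    int inspectedByIndex(String label, String name) {
        Map<String, List<ToyVertex>> names = index.getOrDefault(label, Collections.emptyMap());
        return names.getOrDefault(name, Collections.emptyList()).size();
    }

    public static void main(String[] args) {
        LookupCostSketch graph = new LookupCostSketch();
        graph.add("artist", "Garcia");
        graph.add("artist", "Weir");
        graph.add("song", "Ripple");
        graph.add("song", "Dark Star");
        System.out.println(graph.inspectedByFullScan());                // 4
        System.out.println(graph.inspectedByLabelScan("artist"));       // 2
        System.out.println(graph.inspectedByIndex("artist", "Garcia")); // 1
    }
}
```

With this sample data, the full scan inspects every vertex, the label scan inspects only the artist partition, and the index lookup touches only the matching vertex.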

Multi/Meta-Properties

Neo4jGraph supports both multi- and meta-properties (see vertex properties). These features are not native to Neo4j and are implemented using "hidden" Neo4j nodes. For example, when a vertex has multiple "name" properties, each property is a new node (multi-properties) which can have properties attached to it (meta-properties). As such, the native, underlying representation may become difficult to query directly using another graph language such as Cypher. By default, multi- and meta-properties are disabled. If these features are desired, they can be activated by setting the gremlin.neo4j.metaProperties and gremlin.neo4j.multiProperties configurations to true. Once the configuration is set, it cannot be changed for the lifetime of the graph.

gremlin> conf = new BaseConfiguration()
==>org.apache.commons.configuration.BaseConfiguration@8a7db1e
gremlin> conf.setProperty('gremlin.neo4j.directory','/tmp/neo4j')
==>null
gremlin> conf.setProperty('gremlin.neo4j.multiProperties',true)
==>null
gremlin> conf.setProperty('gremlin.neo4j.metaProperties',true)
==>null
gremlin> graph = Neo4jGraph.open(conf)
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]
gremlin> g = graph.traversal()
==>graphtraversalsource[neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]], standard]
gremlin> g.addV('name','michael','name','michael hunger','name','mhunger')
==>v[0]
gremlin> g.V().properties('name').property('acl', 'public')
==>vp[name->michael]
==>vp[name->michael hunger]
==>vp[name->mhunger]
gremlin> g.V(0).valueMap()
==>[name:[michael, michael hunger, mhunger]]
gremlin> g.V(0).properties()
==>vp[name->michael]
==>vp[name->michael hunger]
==>vp[name->mhunger]
gremlin> g.V(0).properties().valueMap()
==>[acl:public]
==>[acl:public]
==>[acl:public]
gremlin> graph.close()
==>null
Warning
Neo4jGraph without multi- and meta-properties is in 1-to-1 correspondence with the native, underlying Neo4j representation. Users who do not require multi/meta-properties should not enable them. Without multi- and meta-properties enabled, Neo4j can be used with other tools and technologies that do not leverage TinkerPop.
Important
When using a multi-property enabled Neo4jGraph, vertices may represent their properties on "hidden nodes" adjacent to the vertex. If a vertex property key/value is required for indexing, then two indices are required — e.g. CREATE INDEX ON :person(name) and CREATE INDEX ON :vertexProperty(name) (see Neo4j indices).

Cypher

NeoTechnology are the creators of the graph pattern-match query language Cypher. It is possible to leverage Cypher from within Gremlin by using the Neo4jGraph.cypher() graph traversal method.

gremlin> graph = Neo4jGraph.open('/tmp/neo4j')
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]
gremlin> graph.io(gryo()).readGraph('data/tinkerpop-modern.kryo')
==>null
gremlin> graph.cypher('MATCH (a {name:"marko"}) RETURN a')
==>[a:v[0]]
gremlin> graph.cypher('MATCH (a {name:"marko"}) RETURN a').select('a').out('knows').values('name')
==>josh
==>vadas
gremlin> graph.close()
==>null

Thus, like the match()-step in Gremlin, it is possible to do a declarative pattern match and then move back into imperative Gremlin.

Tip
For those developers using Gremlin Server against Neo4j, it is possible to do Cypher queries by simply placing the Cypher string in graph.cypher(...) before submission to the server.

Multi-Label

TinkerPop3 requires every Element (i.e. a Vertex, Edge, or VertexProperty) to have a single, immutable string label. In Neo4j, a Node (vertex) can have an arbitrary number of labels while a Relationship (edge) can have one and only one. Furthermore, in Neo4j, Node labels are mutable while Relationship labels are not. In order to handle this mismatch, three Neo4jVertex-specific methods exist in Neo4j-Gremlin.

public Set<String> labels() // get all the labels of the vertex
public void addLabel(String label) // add a label to the vertex
public void removeLabel(String label) // remove a label from the vertex

An example use case is presented below.

gremlin> graph = Neo4jGraph.open('/tmp/neo4j')
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]
gremlin> vertex = (Neo4jVertex) graph.addVertex('human::animal') //(1)
==>v[0]
gremlin> vertex.label() //(2)
==>animal::human
gremlin> vertex.labels() //(3)
==>animal
==>human
gremlin> vertex.addLabel('organism') //(4)
==>null
gremlin> vertex.label()
==>animal::human::organism
gremlin> vertex.removeLabel('human') //(5)
==>null
gremlin> vertex.labels()
==>animal
==>organism
gremlin> vertex.addLabel('organism') //(6)
==>null
gremlin> vertex.labels()
==>animal
==>organism
gremlin> vertex.removeLabel('human') //(7)
==>null
gremlin> vertex.label()
==>animal::organism
gremlin> g = graph.traversal()
==>graphtraversalsource[neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]], standard]
gremlin> g.V().has(label,'organism') //(8)
gremlin> g.V().has(label,of('organism')) //(9)
==>v[0]
gremlin> g.V().has(label,of('organism')).has(label,of('animal'))
==>v[0]
gremlin> g.V().has(label,of('organism').and(of('animal')))
==>v[0]
gremlin> graph.close()
==>null
  1. Typecasting to a Neo4jVertex is only required in Java.

  2. The standard Vertex.label() method returns all the labels in alphabetical order concatenated using ::.

  3. The Neo4jVertex.labels() method returns the individual labels as a set.

  4. The Neo4jVertex.addLabel() method adds a single label.

  5. The Neo4jVertex.removeLabel() method removes a single label.

  6. Labels are unique and thus duplicate labels don’t exist.

  7. If a label that does not exist is removed, nothing happens.

  8. P.eq() does a full string match and should only be used if multi-labels are not leveraged.

  9. LabelP.of() is specific to Neo4jGraph and used for multi-label matching.

Important
LabelP.of() is only required if multi-labels are leveraged. LabelP.of() is used when filtering/looking-up vertices by their label(s), as the standard P.eq() does a direct match on the ::-representation of vertex.label().
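
The difference between the two predicates can be sketched in plain Java. The class MultiLabelSketch and its methods are hypothetical stand-ins that only mimic the documented behavior (alphabetical ::-concatenation, full-string match for P.eq(), per-label match for LabelP.of()). They are not the Neo4j-Gremlin implementation.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java model of the label behavior described above.
// Hypothetical helper names; not the Neo4j-Gremlin source.
class MultiLabelSketch {
    static final String DELIMITER = "::";

    // Mirrors the documented Vertex.label() result:
    // labels in alphabetical order, joined by "::".
    static String label(Set<String> labels) {
        return String.join(DELIMITER, new TreeSet<String>(labels));
    }

    // Models P.eq(): the whole "::"-joined string must match.
    static boolean eq(String vertexLabel, String wanted) {
        return vertexLabel.equals(wanted);
    }

    // Models LabelP.of(): a single label among the joined labels may match.
    static boolean of(String vertexLabel, String wanted) {
        return Arrays.asList(vertexLabel.split(DELIMITER)).contains(wanted);
    }

    public static void main(String[] args) {
        Set<String> labels = new TreeSet<String>(Arrays.asList("organism", "animal"));
        String label = label(labels);
        System.out.println(label);                 // animal::organism
        System.out.println(eq(label, "organism")); // false
        System.out.println(of(label, "organism")); // true
        System.out.println(of(label, "organism") && of(label, "animal")); // true
    }
}
```

This mirrors the console session: the eq-style match on 'organism' finds nothing for a vertex labeled animal::organism, while the of-style match succeeds.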

Hadoop-Gremlin

<dependency>
   <groupId>org.apache.tinkerpop</groupId>
   <artifactId>hadoop-gremlin</artifactId>
   <version>3.0.0-SNAPSHOT</version>
</dependency>

Hadoop is a distributed computing framework that is used to process data represented across a multi-machine compute cluster. When the data in the Hadoop cluster represents a TinkerPop3 graph, Hadoop-Gremlin can be used to process the graph using TinkerPop3’s OLTP and OLAP models of graph computing.

Important
This section assumes that the user has a functioning Hadoop 1.x cluster. For more information on getting started with Hadoop, please see the Single Node Setup tutorial. Moreover, if using GiraphGraphComputer, it is advisable that readers also familiarize themselves with Giraph via the Getting Started page.

Installing Hadoop-Gremlin

To the .bash_profile file, add the following environment variable (adjusting the directory to match the local machine). HADOOP_GREMLIN_LIBS is the location of all the Hadoop-Gremlin jars. Developer jars can be placed in this directory for loading into the Hadoop job’s classpath. Alternatively, HADOOP_GREMLIN_LIBS can be a colon-separated (:) list of locations, in which case the jars at all provided locations are loaded into the cluster.

export HADOOP_GREMLIN_LIBS=/usr/local/gremlin-console/ext/hadoop-gremlin/lib
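
As a rough illustration of the colon-separated form, the following Java sketch splits a HADOOP_GREMLIN_LIBS value into its individual locations. The class HadoopGremlinLibsSketch is hypothetical and the parsing is an assumption made for illustration. It is not the code Hadoop-Gremlin itself uses to resolve the variable.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how a colon-separated HADOOP_GREMLIN_LIBS value breaks down
// into locations. Hypothetical helper; not Hadoop-Gremlin's own parsing.
class HadoopGremlinLibsSketch {
    // Splits on ':' and drops empty entries (e.g. from a trailing colon).
    static List<String> locations(String value) {
        List<String> result = new ArrayList<>();
        if (value == null) {
            return result; // variable not exported
        }
        for (String part : value.split(":")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // System.getenv returns null when the variable is not set.
        String value = System.getenv("HADOOP_GREMLIN_LIBS");
        for (String location : locations(value)) {
            System.out.println(location);
        }
    }
}
```

For example, a value such as /usr/local/gremlin-console/ext/hadoop-gremlin/lib:/opt/my-jars (the second path is made up) yields two locations.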

If using Gremlin Console, it is important to install the Hadoop-Gremlin plugin. Note that Hadoop-Gremlin requires a Gremlin Console restart after installing.

$ bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> :install org.apache.tinkerpop hadoop-gremlin 3.0.0-SNAPSHOT
==>loaded: [org.apache.tinkerpop, hadoop-gremlin, 3.0.0-SNAPSHOT] - restart the console to use [tinkerpop.hadoop]
gremlin> :q
$ bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> :plugin use tinkerpop.hadoop
==>tinkerpop.hadoop activated
gremlin>

Properties Files

HadoopGraph makes heavy use of properties files which ultimately get turned into Apache configurations and Hadoop configurations. The example properties file presented below is located at conf/hadoop/hadoop-gryo.properties.

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=tinkerpop-modern.kryo
gremlin.hadoop.outputLocation=output
#####################################
# GiraphGraphComputer Configuration #
#####################################
giraph.minWorkers=2
giraph.maxWorkers=2
giraph.useOutOfCoreGraph=true
giraph.useOutOfCoreMessages=true
mapred.map.child.java.opts=-Xmx1024m
mapred.reduce.child.java.opts=-Xmx1024m
giraph.numInputThreads=4
giraph.numComputeThreads=4
giraph.maxMessagesInMemory=100000
####################################
# SparkGraphComputer Configuration #
####################################
spark.master=local[4]
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer

A review of the Hadoop-Gremlin specific properties is provided in the table below. For the OLAP engines (GiraphGraphComputer or SparkGraphComputer), refer to their respective documentation for configuration options.

Property | Description
gremlin.graph | The class of the graph to construct using GraphFactory.
gremlin.hadoop.inputLocation | The location of the input file(s) for Hadoop-Gremlin to read the graph from.
gremlin.hadoop.graphInputFormat | The format that the graph input file(s) are represented in.
gremlin.hadoop.outputLocation | The location to write the computed HadoopGraph to.
gremlin.hadoop.graphOutputFormat | The format that the output file(s) should be represented in.
gremlin.hadoop.jarsInDistributedCache | Whether to upload the Hadoop-Gremlin jars to Hadoop’s distributed cache (necessary if jars are not on machines' classpaths).

Along with the properties above, the numerous Hadoop specific properties can be added as needed to tune and parameterize the executed Hadoop-Gremlin job on the respective Hadoop cluster.
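
To make the key/value structure concrete, the sketch below parses the Hadoop-Gremlin portion of the example file with java.util.Properties and reports which of the documented keys are absent. The class HadoopGremlinPropertiesSketch and its methods are hypothetical. Whether each key is strictly mandatory is not stated here, so the check is only a convenience, not a validation rule of Hadoop-Gremlin.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: parses properties text and lists documented keys that are absent.
class HadoopGremlinPropertiesSketch {
    // The Hadoop-Gremlin specific keys described above.
    static final List<String> DOCUMENTED_KEYS = Arrays.asList(
        "gremlin.graph",
        "gremlin.hadoop.inputLocation",
        "gremlin.hadoop.graphInputFormat",
        "gremlin.hadoop.outputLocation",
        "gremlin.hadoop.graphOutputFormat",
        "gremlin.hadoop.jarsInDistributedCache");

    // The Hadoop-Gremlin lines of the example properties file shown earlier.
    static final String EXAMPLE = String.join("\n",
        "gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph",
        "gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat",
        "gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat",
        "gremlin.hadoop.jarsInDistributedCache=true",
        "gremlin.hadoop.inputLocation=tinkerpop-modern.kryo",
        "gremlin.hadoop.outputLocation=output");

    // Parses properties-file text; wraps the checked IOException for convenience.
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    // Returns the documented keys that have no value in the given properties.
    static List<String> missingKeys(Properties properties) {
        List<String> missing = new ArrayList<>();
        for (String key : DOCUMENTED_KEYS) {
            if (properties.getProperty(key) == null) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Properties properties = parse(EXAMPLE);
        System.out.println(properties.getProperty("gremlin.hadoop.inputLocation")); // tinkerpop-modern.kryo
        System.out.println(missingKeys(properties)); // []
    }
}
```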

Important
As the size of the graphs being processed becomes large, it is important to fully understand how the underlying OLAP engine (e.g. Giraph, Spark, etc.) works and understand the numerous parameterizations offered by these systems. Such knowledge can help alleviate out of memory exceptions, slow load times, slow processing times, etc.

OLTP Hadoop-Gremlin

It is possible to execute OLTP operations over a HadoopGraph. However, the underlying HDFS files are typically not random access and thus, to retrieve a vertex, a linear scan is required. OLTP operations are useful for peeking at the graph prior to executing a long-running OLAP job (e.g. g.V().valueMap().limit(10)).

Caution
OLTP operations on HadoopGraph are not efficient. They require linear scans to execute and are unreasonable for large graphs. In such large graph situations, make use of TraversalVertexProgram which is the OLAP implementation of the Gremlin language. Hadoop-Gremlin provides various GraphComputer implementations to execute OLAP computations over a HadoopGraph.
gremlin> hdfs.copyFromLocal('data/tinkerpop-modern.kryo', 'tinkerpop-modern.kryo')
==>null
gremlin> hdfs.ls()
==>rwxr-xr-x marko supergroup 0 (D) _bsp
==>rwxr-xr-x marko supergroup 0 (D) hadoop-gremlin-libs
==>rwxr-xr-x marko supergroup 0 (D) output
==>rw-r--r-- marko supergroup 781 tinkerpop-modern.kryo
gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> g = graph.traversal()
==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], standard]
gremlin> g.V().count()
==>6
gremlin> g.V().out().out().values('name')
==>ripple
==>lop
gremlin> g.V().group().by{it.value('name')[1]}.by('name').next()
==>a={marko=1, vadas=1}
==>e={peter=1}
==>i={ripple=1}
==>o={lop=1, josh=1}

OLAP Hadoop-Gremlin

Hadoop-Gremlin was designed to execute OLAP operations via GraphComputer. The OLTP examples presented previously are reproduced below, but using TraversalVertexProgram for the execution of the Gremlin traversal.

Important
As of TinkerPop3 3.0.0-SNAPSHOT, when using Hadoop-Gremlin OLAP from the Gremlin Console, the only Gremlin language subset supported is Gremlin-Groovy. Future versions will support other Gremlin language dialects.

A Graph in TinkerPop3 can support any number of GraphComputer implementations. Out of the box, Hadoop-Gremlin supports three GraphComputer implementations.

  • GiraphGraphComputer: Leverages Giraph to execute TinkerPop3 OLAP computations.

    • The graph should fit within the total RAM of the Hadoop cluster (graph size restriction), though "out-of-core" processing is possible. Message passing is coordinated via ZooKeeper for the in-memory graph (speedy traversals).

  • SparkGraphComputer: Leverages Spark to execute TinkerPop3 OLAP computations.

    • The graph may fit within the total RAM of the cluster (supports larger graphs). Message passing is coordinated via Spark map/reduce/join operations on in-memory and disk-cached data (average speed traversals).

  • MapReduceGraphComputer: Leverages Hadoop’s MapReduce to execute TinkerPop3 OLAP computations. (coming soon)

    • The graph must fit within the total disk space of the Hadoop cluster (supports massive graphs). Message passing is coordinated via MapReduce jobs over the on-disk graph (slow traversals).

Tip
For those wanting to use the SugarPlugin with their submitted traversal, do :remote config useSugar true as well as :plugin use tinkerpop.sugar at the start of the Gremlin Console session if it is not already activated.

GiraphGraphComputer

Giraph is an Apache Software Foundation project focused on OLAP-based graph processing. Giraph makes use of the distributed graph computing paradigm made popular by Google’s Pregel. In Giraph, developers write "vertex programs" that get executed at each vertex in parallel. These programs communicate with one another in a bulk synchronous parallel (BSP) manner. This model aligns with TinkerPop3’s GraphComputer API. TinkerPop3 provides an implementation of GraphComputer that works for Giraph called GiraphGraphComputer. Moreover, with TinkerPop3’s MapReduce-framework, the standard Giraph/Pregel model is extended to support an arbitrary number of MapReduce phases to aggregate and yield results from the graph. Below are examples using GiraphGraphComputer from the Gremlin-Console.

Warning
Be sure that the SLF4J of Hadoop matches that of Giraph or else there will be conflicts. Simply copy the following jars to the lib/ of the machines in the Hadoop cluster: slf4j-api-a.b.c.jar and slf4j-log4j12-a.b.c.jar.
Warning
Giraph uses a large number of Hadoop counters. The default for Hadoop is 120. In mapred-site.xml it is possible to increase the limit via the mapreduce.job.counters.limit property. A good value to use is 1000. This is a cluster-wide property, so be sure to restart the cluster after updating.
Warning
The maximum number of workers can be no larger than the number of map-slots in the Hadoop cluster minus 1. For example, if the Hadoop cluster has 4 map-slots, then giraph.maxWorkers cannot be larger than 3. One map-slot is reserved for the master compute node and all other slots can be allocated as workers to execute the VertexPrograms on the vertices of the graph.
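
The worker limit in the last warning is simple arithmetic and can be checked before submitting a job. The Java sketch below is a hypothetical helper (GiraphWorkerLimitSketch is not part of Giraph or Hadoop-Gremlin) that assumes the number of map-slots in the cluster is already known.

```java
// Hypothetical helper for the "map-slots minus 1" rule stated above.
// It is not part of Giraph or Hadoop-Gremlin.
class GiraphWorkerLimitSketch {
    // One map-slot is reserved for the master compute node,
    // so at most (mapSlots - 1) slots remain for workers.
    static int maxWorkers(int mapSlots) {
        return Math.max(0, mapSlots - 1);
    }

    // True if the requested giraph.maxWorkers value fits the cluster.
    static boolean fits(int mapSlots, int requestedMaxWorkers) {
        return requestedMaxWorkers <= maxWorkers(mapSlots);
    }

    public static void main(String[] args) {
        System.out.println(maxWorkers(4)); // 3
        System.out.println(fits(4, 3));    // true
        System.out.println(fits(4, 4));    // false
    }
}
```

For the 4-slot cluster mentioned above, the sketch yields an upper bound of 3, matching the warning.
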
gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> g = graph.traversal(computer())
==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], giraphgraphcomputer]
gremlin> g.V().count()
INFO  org.apache.tinkerpop.gremlin.hadoop.process.computer.giraph.GiraphGraphComputer  - HadoopGremlin(Giraph): TraversalVertexProgram[GraphStep([],vertex), CountGlobalStep, ComputerResultStep]
INFO  org.apache.hadoop.mapred.JobClient  - Running job: job_201506300811_0049
INFO  org.apache.hadoop.mapred.JobClient  -  map 0% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 33% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 66% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  - Job complete: job_201506300811_0049
INFO  org.apache.hadoop.mapred.JobClient  - Counters: 28
INFO  org.apache.hadoop.mapred.JobClient  -   Map-Reduce Framework
INFO  org.apache.hadoop.mapred.JobClient  -     Spilled Records=0
INFO  org.apache.hadoop.mapred.JobClient  -     Map input records=3
INFO  org.apache.hadoop.mapred.JobClient  -     SPLIT_RAW_BYTES=132
INFO  org.apache.hadoop.mapred.JobClient  -     Map output records=0
INFO  org.apache.hadoop.mapred.JobClient  -     Total committed heap usage (bytes)=1065877504
INFO  org.apache.hadoop.mapred.JobClient  -   Giraph Timers
INFO  org.apache.hadoop.mapred.JobClient  -     Total (milliseconds)=4066
INFO  org.apache.hadoop.mapred.JobClient  -     Shutdown (milliseconds)=94
INFO  org.apache.hadoop.mapred.JobClient  -     Input superstep (milliseconds)=637
INFO  org.apache.hadoop.mapred.JobClient  -     Superstep 0 (milliseconds)=346
INFO  org.apache.hadoop.mapred.JobClient  -     Setup (milliseconds)=2985
INFO  org.apache.hadoop.mapred.JobClient  -   File Input Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Read=0
INFO  org.apache.hadoop.mapred.JobClient  -   Giraph Stats
INFO  org.apache.hadoop.mapred.JobClient  -     Aggregate finished vertices=0
INFO  org.apache.hadoop.mapred.JobClient  -     Aggregate edges=0
INFO  org.apache.hadoop.mapred.JobClient  -     Sent messages=0
INFO  org.apache.hadoop.mapred.JobClient  -     Current workers=2
INFO  org.apache.hadoop.mapred.JobClient  -     Last checkpointed superstep=0
INFO  org.apache.hadoop.mapred.JobClient  -     Current master task partition=0
INFO  org.apache.hadoop.mapred.JobClient  -     Superstep=1
INFO  org.apache.hadoop.mapred.JobClient  -     Aggregate vertices=6
INFO  org.apache.hadoop.mapred.JobClient  -   FileSystemCounters
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_READ=945
INFO  org.apache.hadoop.mapred.JobClient  -     FILE_BYTES_WRITTEN=399282
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_WRITTEN=1790
INFO  org.apache.hadoop.mapred.JobClient  -   Job Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Launched map tasks=3
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_REDUCES=0
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all reduces waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_MAPS=49814
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all maps waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -   File Output Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Written=0
INFO  org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph  - HadoopGremlin: CountGlobalMapReduce[~reducing]
INFO  org.apache.hadoop.mapred.JobClient  - Running job: job_201506300811_0050
INFO  org.apache.hadoop.mapred.JobClient  -  map 0% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 16%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 33%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 50%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 100%
INFO  org.apache.hadoop.mapred.JobClient  - Job complete: job_201506300811_0050
INFO  org.apache.hadoop.mapred.JobClient  - Counters: 26
INFO  org.apache.hadoop.mapred.JobClient  -   Map-Reduce Framework
INFO  org.apache.hadoop.mapred.JobClient  -     Spilled Records=4
INFO  org.apache.hadoop.mapred.JobClient  -     Map output materialized bytes=156
INFO  org.apache.hadoop.mapred.JobClient  -     Reduce input records=2
INFO  org.apache.hadoop.mapred.JobClient  -     Map input records=6
INFO  org.apache.hadoop.mapred.JobClient  -     SPLIT_RAW_BYTES=240
INFO  org.apache.hadoop.mapred.JobClient  -     Map output bytes=312
INFO  org.apache.hadoop.mapred.JobClient  -     Reduce shuffle bytes=156
INFO  org.apache.hadoop.mapred.JobClient  -     Reduce input groups=1
INFO  org.apache.hadoop.mapred.JobClient  -     Combine output records=2
INFO  org.apache.hadoop.mapred.JobClient  -     Reduce output records=1
INFO  org.apache.hadoop.mapred.JobClient  -     Map output records=6
INFO  org.apache.hadoop.mapred.JobClient  -     Combine input records=6
INFO  org.apache.hadoop.mapred.JobClient  -     Total committed heap usage (bytes)=1967128576
INFO  org.apache.hadoop.mapred.JobClient  -   File Input Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Read=1336
INFO  org.apache.hadoop.mapred.JobClient  -   FileSystemCounters
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_READ=1576
INFO  org.apache.hadoop.mapred.JobClient  -     FILE_BYTES_WRITTEN=814356
INFO  org.apache.hadoop.mapred.JobClient  -     FILE_BYTES_READ=132
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_WRITTEN=676
INFO  org.apache.hadoop.mapred.JobClient  -   Job Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Launched map tasks=2
INFO  org.apache.hadoop.mapred.JobClient  -     Launched reduce tasks=4
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_REDUCES=54357
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all reduces waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_MAPS=12457
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all maps waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -     Data-local map tasks=2
INFO  org.apache.hadoop.mapred.JobClient  -   File Output Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Written=676
==>6
gremlin> g.V().out().out().values('name')
INFO  org.apache.tinkerpop.gremlin.hadoop.process.computer.giraph.GiraphGraphComputer  - HadoopGremlin(Giraph): TraversalVertexProgram[GraphStep([],vertex), VertexStep(OUT,vertex), VertexStep(OUT,vertex), PropertiesStep([name],value), ComputerResultStep]
INFO  org.apache.hadoop.mapred.JobClient  - Running job: job_201506300811_0051
INFO  org.apache.hadoop.mapred.JobClient  -  map 0% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 33% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 66% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  - Job complete: job_201506300811_0051
INFO  org.apache.hadoop.mapred.JobClient  - Counters: 30
INFO  org.apache.hadoop.mapred.JobClient  -   Map-Reduce Framework
INFO  org.apache.hadoop.mapred.JobClient  -     Spilled Records=0
INFO  org.apache.hadoop.mapred.JobClient  -     Map input records=3
INFO  org.apache.hadoop.mapred.JobClient  -     SPLIT_RAW_BYTES=132
INFO  org.apache.hadoop.mapred.JobClient  -     Map output records=0
INFO  org.apache.hadoop.mapred.JobClient  -     Total committed heap usage (bytes)=1067974656
INFO  org.apache.hadoop.mapred.JobClient  -   Giraph Timers
INFO  org.apache.hadoop.mapred.JobClient  -     Total (milliseconds)=4423
INFO  org.apache.hadoop.mapred.JobClient  -     Superstep 2 (milliseconds)=132
INFO  org.apache.hadoop.mapred.JobClient  -     Shutdown (milliseconds)=91
INFO  org.apache.hadoop.mapred.JobClient  -     Superstep 1 (milliseconds)=92
INFO  org.apache.hadoop.mapred.JobClient  -     Input superstep (milliseconds)=699
INFO  org.apache.hadoop.mapred.JobClient  -     Superstep 0 (milliseconds)=337
INFO  org.apache.hadoop.mapred.JobClient  -     Setup (milliseconds)=3068
INFO  org.apache.hadoop.mapred.JobClient  -   File Input Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Read=0
INFO  org.apache.hadoop.mapred.JobClient  -   Giraph Stats
INFO  org.apache.hadoop.mapred.JobClient  -     Aggregate finished vertices=0
INFO  org.apache.hadoop.mapred.JobClient  -     Aggregate edges=0
INFO  org.apache.hadoop.mapred.JobClient  -     Sent messages=0
INFO  org.apache.hadoop.mapred.JobClient  -     Current workers=2
INFO  org.apache.hadoop.mapred.JobClient  -     Last checkpointed superstep=0
INFO  org.apache.hadoop.mapred.JobClient  -     Current master task partition=0
INFO  org.apache.hadoop.mapred.JobClient  -     Superstep=3
INFO  org.apache.hadoop.mapred.JobClient  -     Aggregate vertices=6
INFO  org.apache.hadoop.mapred.JobClient  -   FileSystemCounters
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_READ=945
INFO  org.apache.hadoop.mapred.JobClient  -     FILE_BYTES_WRITTEN=407124
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_WRITTEN=1738
INFO  org.apache.hadoop.mapred.JobClient  -   Job Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Launched map tasks=3
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_REDUCES=0
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all reduces waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_MAPS=49505
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all maps waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -   File Output Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Written=0
INFO  org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph  - HadoopGremlin: TraverserMapReduce[~traversers]
INFO  org.apache.hadoop.mapred.JobClient  - Running job: job_201506300811_0052
INFO  org.apache.hadoop.mapred.JobClient  -  map 0% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  - Job complete: job_201506300811_0052
INFO  org.apache.hadoop.mapred.JobClient  - Counters: 16
INFO  org.apache.hadoop.mapred.JobClient  -   Map-Reduce Framework
INFO  org.apache.hadoop.mapred.JobClient  -     Spilled Records=0
INFO  org.apache.hadoop.mapred.JobClient  -     Map input records=6
INFO  org.apache.hadoop.mapred.JobClient  -     SPLIT_RAW_BYTES=240
INFO  org.apache.hadoop.mapred.JobClient  -     Map output records=2
INFO  org.apache.hadoop.mapred.JobClient  -     Total committed heap usage (bytes)=586153984
INFO  org.apache.hadoop.mapred.JobClient  -   File Input Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Read=1284
INFO  org.apache.hadoop.mapred.JobClient  -   FileSystemCounters
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_READ=1524
INFO  org.apache.hadoop.mapred.JobClient  -     FILE_BYTES_WRITTEN=275186
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_WRITTEN=453
INFO  org.apache.hadoop.mapred.JobClient  -   Job Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Launched map tasks=2
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_REDUCES=0
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all reduces waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_MAPS=12718
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all maps waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -     Data-local map tasks=2
INFO  org.apache.hadoop.mapred.JobClient  -   File Output Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Written=453
==>ripple
==>lop
Important
The examples above do not use lambdas (i.e. closures in Gremlin-Groovy). This makes the traversal serializable and thus, able to be distributed to all machines in the Hadoop cluster. If a lambda is required in a traversal, then the traversal must be sent as a String and compiled locally at each machine in the cluster. The following example demonstrates the :remote command which allows for submitting Gremlin traversals as a String.
gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> :remote connect tinkerpop.hadoop graph
==>useTraversalSource=graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], giraphgraphcomputer]
==>useSugar=false
gremlin> :> g.V().group().by{it.value('name')[1]}.by('name')
INFO  org.apache.tinkerpop.gremlin.hadoop.process.computer.giraph.GiraphGraphComputer  - HadoopGremlin(Giraph): TraversalVertexProgram[GraphStep([],vertex), GroupStep([LambdaMapStep(lambda)],value(name)), ComputerResultStep]
INFO  org.apache.hadoop.mapred.JobClient  - Running job: job_201506300811_0053
INFO  org.apache.hadoop.mapred.JobClient  -  map 0% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 66% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  - Job complete: job_201506300811_0053
INFO  org.apache.hadoop.mapred.JobClient  - Counters: 28
INFO  org.apache.hadoop.mapred.JobClient  -   Map-Reduce Framework
INFO  org.apache.hadoop.mapred.JobClient  -     Spilled Records=0
INFO  org.apache.hadoop.mapred.JobClient  -     Map input records=3
INFO  org.apache.hadoop.mapred.JobClient  -     SPLIT_RAW_BYTES=132
INFO  org.apache.hadoop.mapred.JobClient  -     Map output records=0
INFO  org.apache.hadoop.mapred.JobClient  -     Total committed heap usage (bytes)=1005584384
INFO  org.apache.hadoop.mapred.JobClient  -   Giraph Timers
INFO  org.apache.hadoop.mapred.JobClient  -     Total (milliseconds)=8870
INFO  org.apache.hadoop.mapred.JobClient  -     Shutdown (milliseconds)=157
INFO  org.apache.hadoop.mapred.JobClient  -     Input superstep (milliseconds)=3399
INFO  org.apache.hadoop.mapred.JobClient  -     Superstep 0 (milliseconds)=4113
INFO  org.apache.hadoop.mapred.JobClient  -     Setup (milliseconds)=1200
INFO  org.apache.hadoop.mapred.JobClient  -   File Input Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Read=0
INFO  org.apache.hadoop.mapred.JobClient  -   FileSystemCounters
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_READ=945
INFO  org.apache.hadoop.mapred.JobClient  -     FILE_BYTES_WRITTEN=334461
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_WRITTEN=1838
INFO  org.apache.hadoop.mapred.JobClient  -   Giraph Stats
INFO  org.apache.hadoop.mapred.JobClient  -     Aggregate finished vertices=0
INFO  org.apache.hadoop.mapred.JobClient  -     Aggregate edges=0
INFO  org.apache.hadoop.mapred.JobClient  -     Sent messages=0
INFO  org.apache.hadoop.mapred.JobClient  -     Current workers=2
INFO  org.apache.hadoop.mapred.JobClient  -     Last checkpointed superstep=0
INFO  org.apache.hadoop.mapred.JobClient  -     Current master task partition=0
INFO  org.apache.hadoop.mapred.JobClient  -     Superstep=1
INFO  org.apache.hadoop.mapred.JobClient  -     Aggregate vertices=6
INFO  org.apache.hadoop.mapred.JobClient  -   Job Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Launched map tasks=3
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_REDUCES=0
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all reduces waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_MAPS=62853
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all maps waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -   File Output Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Written=0
INFO  org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph  - HadoopGremlin: GroupMapReduce[~reducing]
INFO  org.apache.hadoop.mapred.JobClient  - Running job: job_201506300811_0054
INFO  org.apache.hadoop.mapred.JobClient  -  map 0% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 25%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 33%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 66%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 100%
INFO  org.apache.hadoop.mapred.JobClient  - Job complete: job_201506300811_0054
INFO  org.apache.hadoop.mapred.JobClient  - Counters: 26
INFO  org.apache.hadoop.mapred.JobClient  -   Map-Reduce Framework
INFO  org.apache.hadoop.mapred.JobClient  -     Spilled Records=12
INFO  org.apache.hadoop.mapred.JobClient  -     Map output materialized bytes=430
INFO  org.apache.hadoop.mapred.JobClient  -     Reduce input records=6
INFO  org.apache.hadoop.mapred.JobClient  -     Map input records=6
INFO  org.apache.hadoop.mapred.JobClient  -     SPLIT_RAW_BYTES=240
INFO  org.apache.hadoop.mapred.JobClient  -     Map output bytes=370
INFO  org.apache.hadoop.mapred.JobClient  -     Reduce shuffle bytes=430
INFO  org.apache.hadoop.mapred.JobClient  -     Reduce input groups=4
INFO  org.apache.hadoop.mapred.JobClient  -     Combine output records=0
INFO  org.apache.hadoop.mapred.JobClient  -     Reduce output records=4
INFO  org.apache.hadoop.mapred.JobClient  -     Map output records=6
INFO  org.apache.hadoop.mapred.JobClient  -     Combine input records=0
INFO  org.apache.hadoop.mapred.JobClient  -     Total committed heap usage (bytes)=2159017984
INFO  org.apache.hadoop.mapred.JobClient  -   File Input Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Read=1384
INFO  org.apache.hadoop.mapred.JobClient  -   FileSystemCounters
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_READ=1624
INFO  org.apache.hadoop.mapred.JobClient  -     FILE_BYTES_WRITTEN=684524
INFO  org.apache.hadoop.mapred.JobClient  -     FILE_BYTES_READ=406
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_WRITTEN=908
INFO  org.apache.hadoop.mapred.JobClient  -   Job Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Launched map tasks=2
INFO  org.apache.hadoop.mapred.JobClient  -     Launched reduce tasks=4
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_REDUCES=86809
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all reduces waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_MAPS=18560
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all maps waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -     Data-local map tasks=2
INFO  org.apache.hadoop.mapred.JobClient  -   File Output Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Written=908
==>[a:[marko, vadas], e:[peter], i:[ripple], o:[lop, josh]]
gremlin> result
==>result[hadoopgraph[gryoinputformat->gryooutputformat],memory[size:2]]
gremlin> result.memory.runtime
==>85969
gremlin> result.memory.keys()
==>gremlin.traversalVertexProgram.voteToHalt
==>~reducing
gremlin> result.memory.get('~reducing')
==>a={marko=1, vadas=1}
==>e={peter=1}
==>i={ripple=1}
==>o={lop=1, josh=1}

SparkGraphComputer

spark-logo Spark is an Apache Software Foundation project focused on general-purpose OLAP data processing. Spark provides a hybrid in-memory/disk-based distributed computing model that is similar to Hadoop’s MapReduce model. Spark maintains a fluent function chaining DSL that is arguably easier for developers to work with than native Hadoop MapReduce. While Spark has a shorter startup time between "jobs" (a scatter/gather-step), the actual message passing algorithm (as designed by TinkerPop) is less efficient than that of Giraph. For small graphs, Spark will typically be much faster than Giraph, but as the graph becomes larger, the Hadoop MapReduce startup time incurred by Giraph will amortize as more time is spent passing messages (i.e. traversers) between the vertices of the graph.

gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> g = graph.traversal(computer(SparkGraphComputer))
==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>6
gremlin> g.V().out().out().values('name')
==>lop
==>ripple

To use lambdas in Gremlin-Groovy, simply pass :remote connect a TraversalSource that leverages SparkGraphComputer.

gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> g = graph.traversal(computer(SparkGraphComputer))
==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], sparkgraphcomputer]
gremlin> :remote connect tinkerpop.hadoop graph g
==>useTraversalSource=graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], sparkgraphcomputer]
==>useSugar=false
gremlin> :> g.V().group().by{it.value('name')[1]}.by('name')
==>[a:[marko, vadas], e:[peter], i:[ripple], o:[josh, lop]]
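The grouping lambda above keys each vertex by the second character of its name. As a rough illustration of that grouping logic only, here is a plain-Java sketch with no TinkerPop dependency; the class and method names are hypothetical, and the list of names stands in for the vertices of the toy graph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupBySecondLetter {
    // Groups names by their second character, mirroring the shape of
    // group().by{it.value('name')[1]}.by('name') in the console session.
    static Map<String, List<String>> group(List<String> names) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : names) {
            String key = name.substring(1, 2); // second character as a String
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("marko", "vadas", "lop", "josh", "ripple", "peter");
        System.out.println(group(names));
    }
}
```

The sketch runs on a single machine; in the OLAP case the same keying function is applied on the cluster and the per-key lists are assembled by the reduce phase, which is why the order of values within a group can differ between runs.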

The SparkGraphComputer algorithm leverages Spark’s caching abilities to reduce the amount of data shuffled across the wire on each iteration of the VertexProgram. When the graph is loaded as a Spark RDD (Resilient Distributed Dataset) it is immediately cached as graphRDD. The graphRDD is a distributed adjacency list which encodes the vertex, its properties, and all its incident edges. On the first iteration, each vertex (in parallel) is passed through VertexProgram.execute(). This yields an output of the vertex’s mutated state (i.e. updated compute keys — propertyX) and its outgoing messages. This viewOutgoingRDD is then reduced to viewIncomingRDD where the outgoing messages are sent to their respective vertices. If a MessageCombiner exists for the vertex program, then messages are aggregated locally and globally to ultimately yield one incoming message for the vertex. This reduce sequence is the "message pass." If the vertex program does not terminate on this iteration, then the viewIncomingRDD is joined with the cached graphRDD and the process continues. When there are no more iterations, there is a final join and the resultant RDD is stripped of its edges and messages. This mapReduceRDD is cached and is processed by each MapReduce job in the GraphComputer computation.

spark algorithm
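The iteration described above (execute, emit outgoing messages, combine per target vertex, join back with the graph) can be sketched in plain Java using in-memory maps in place of RDDs. This is only an illustration of the data flow, not the SparkGraphComputer implementation: the "vertex program" here is a hypothetical one that sends the value 1 along every out-edge, and the combiner is a sum, so the joined view ends up holding each vertex's in-degree.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MessagePassSketch {
    // Out-adjacency of the six-vertex toy graph (vertex id -> target vertex ids).
    static Map<Integer, List<Integer>> toyGraph() {
        Map<Integer, List<Integer>> graph = new LinkedHashMap<>();
        graph.put(1, Arrays.asList(3, 2, 4));
        graph.put(2, Collections.<Integer>emptyList());
        graph.put(3, Collections.<Integer>emptyList());
        graph.put(4, Arrays.asList(5, 3));
        graph.put(5, Collections.<Integer>emptyList());
        graph.put(6, Arrays.asList(3));
        return graph;
    }

    // "Outgoing -> incoming": each vertex emits one message per out-edge and
    // messages for the same target are combined (summed), as a MessageCombiner would.
    static Map<Integer, Long> messagePass(Map<Integer, List<Integer>> graph) {
        Map<Integer, Long> incoming = new HashMap<>();
        for (List<Integer> targets : graph.values()) {
            for (Integer target : targets) {
                incoming.merge(target, 1L, Long::sum);
            }
        }
        return incoming;
    }

    // "Join": attach the combined incoming message to every vertex of the graph;
    // vertices that received nothing get 0.
    static Map<Integer, Long> join(Map<Integer, List<Integer>> graph, Map<Integer, Long> incoming) {
        Map<Integer, Long> view = new TreeMap<>();
        for (Integer id : graph.keySet()) {
            view.put(id, incoming.getOrDefault(id, 0L));
        }
        return view;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> graph = toyGraph();
        System.out.println(join(graph, messagePass(graph)));
    }
}
```

Keeping the graph structure fixed and moving only the much smaller message view between iterations is the point of the design: the adjacency data never has to be reshuffled.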
Important
If the vendor/user wishes to bypass using Hadoop InputFormats for pulling data from the underlying graph system, it is possible to leverage Spark’s RDD constructs directly. There is a gremlin.hadoop.graphInputRDD configuration that references a Class<? extends InputRDD>. An InputRDD provides a read method that takes a SparkContext and returns a graphRDD. Likewise, to bypass OutputFormat, use gremlin.hadoop.graphOutputRDD and the respective OutputRDD with its write-based method.

MapReduceGraphComputer

COMING SOON

Input/Output Formats

adjacency-list Hadoop-Gremlin provides various I/O formats — i.e. Hadoop InputFormat and OutputFormat. All of the formats make use of an adjacency list representation of the graph where each "row" represents a single vertex, its properties, and its incoming and outgoing edges.


Gryo I/O Format

  • InputFormat: org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat

  • OutputFormat: org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat

Gryo is a binary graph format that leverages Kryo to make a compact, binary representation of a vertex. It is recommended that users leverage Gryo given its space/time savings over text-based representations.

Note
The GryoInputFormat is splittable.

GraphSON I/O Format

  • InputFormat: org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat

  • OutputFormat: org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat

GraphSON is a JSON-based graph format. Because it is text-based, GraphSON is space-expensive. However, it is convenient for many developers to work with as its structure is simple (easy to create and parse).

The data below represents an adjacency list representation of the classic TinkerGraph toy graph in GraphSON format.

{"id":1,"label":"person","outE":{"created":[{"id":9,"inV":3,"properties":{"weight":0.4}}],"knows":[{"id":7,"inV":2,"properties":{"weight":0.5}},{"id":8,"inV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":0,"value":"marko"}],"age":[{"id":1,"value":29}]}}
{"id":2,"label":"person","inE":{"knows":[{"id":7,"outV":1,"properties":{"weight":0.5}}]},"properties":{"name":[{"id":2,"value":"vadas"}],"age":[{"id":3,"value":27}]}}
{"id":3,"label":"software","inE":{"created":[{"id":9,"outV":1,"properties":{"weight":0.4}},{"id":11,"outV":4,"properties":{"weight":0.4}},{"id":12,"outV":6,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":4,"value":"lop"}],"lang":[{"id":5,"value":"java"}]}}
{"id":4,"label":"person","inE":{"knows":[{"id":8,"outV":1,"properties":{"weight":1.0}}]},"outE":{"created":[{"id":10,"inV":5,"properties":{"weight":1.0}},{"id":11,"inV":3,"properties":{"weight":0.4}}]},"properties":{"name":[{"id":6,"value":"josh"}],"age":[{"id":7,"value":32}]}}
{"id":5,"label":"software","inE":{"created":[{"id":10,"outV":4,"properties":{"weight":1.0}}]},"properties":{"name":[{"id":8,"value":"ripple"}],"lang":[{"id":9,"value":"java"}]}}
{"id":6,"label":"person","outE":{"created":[{"id":12,"inV":3,"properties":{"weight":0.2}}]},"properties":{"name":[{"id":10,"value":"peter"}],"age":[{"id":11,"value":35}]}}

Script I/O Format

  • InputFormat: org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat

  • OutputFormat: org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptOutputFormat

ScriptInputFormat and ScriptOutputFormat take an arbitrary script and use that script to either read or write Vertex objects, respectively. This can be considered the most general InputFormat/OutputFormat possible in that Hadoop-Gremlin uses the user provided script for all reading/writing.

ScriptInputFormat

The data below represents an adjacency list representation of the classic TinkerGraph toy graph. The first line reads, "vertex 1, labeled person, having 2 property values (marko and 29), has 3 outgoing edges; the first edge is labeled knows, connects the current vertex 1 with vertex 2, and has a property value 0.5, and so on."

1:person:marko:29 knows:2:0.5,knows:4:1.0,created:3:0.4
2:person:vadas:27
3:project:lop:java
4:person:josh:32 created:3:0.4,created:5:1.0
5:project:ripple:java
6:person:peter:35 created:3:0.2

There is no corresponding InputFormat that can parse this particular file (or some adjacency list variant of it). As such, ScriptInputFormat can be used. With ScriptInputFormat a script is stored in HDFS and leveraged by each mapper in the Hadoop job. The script must have the following method defined:

def parse(String line, ScriptElementFactory factory) { ... }

ScriptElementFactory provides the following 4 methods:

Vertex vertex(Object id); // get or create the vertex with the given id
Vertex vertex(Object id, String label); // get or create the vertex with the given id and label
Edge edge(Vertex out, Vertex in); // create an edge between the two given vertices
Edge edge(Vertex out, Vertex in, String label); // create an edge between the two given vertices using the given label

An appropriate parse() for the above adjacency list file is:

def parse(line, factory) {
    def parts = line.split(/ /)
    def (id, label, name, x) = parts[0].split(/:/).toList()
    def v1 = factory.vertex(id, label)
    if (name != null) v1.property('name', name) // first value is always the name
    if (x != null) {
        // second value depends on the vertex label; it's either
        // the age of a person or the language of a project
        if (label.equals('project')) v1.property('lang', x)
        else v1.property('age', Integer.valueOf(x))
    }
    if (parts.length == 2) {
        parts[1].split(/,/).grep { !it.isEmpty() }.each {
            def (eLabel, refId, weight) = it.split(/:/).toList()
            def v2 = factory.vertex(refId)
            def edge = factory.edge(v1, v2, eLabel)
            edge.property('weight', Double.valueOf(weight))
        }
    }
    return v1
}

The returned Vertex indicates whether the parsed line yielded a valid Vertex. If the line is not valid (e.g. a comment line, a skip line, etc.), simply return null.
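To make the line format and the null-return convention concrete without a Hadoop cluster, the following plain-Java sketch applies the same splitting rules to a single line. It is not a ScriptInputFormat script: ParsedVertex is a hypothetical holder that stands in for the Vertex built through ScriptElementFactory, and treating lines that start with '#' as comments is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AdjacencyLineParser {
    // Hypothetical stand-in for a Vertex: id, label, properties and raw out-edges.
    static final class ParsedVertex {
        final String id;
        final String label;
        final Map<String, Object> properties = new LinkedHashMap<>();
        final List<String[]> outEdges = new ArrayList<>(); // each entry: {label, targetId, weight}

        ParsedVertex(String id, String label) {
            this.id = id;
            this.label = label;
        }
    }

    static ParsedVertex parse(String line) {
        // Assumed convention: blank lines and '#' comment lines yield no vertex.
        if (line == null || line.trim().isEmpty() || line.startsWith("#")) return null;
        String[] parts = line.split(" ");
        String[] head = parts[0].split(":");
        ParsedVertex vertex = new ParsedVertex(head[0], head[1]);
        if (head.length > 2) vertex.properties.put("name", head[2]); // first value is the name
        if (head.length > 3) {
            // second value is the language of a project or the age of a person
            if (vertex.label.equals("project")) vertex.properties.put("lang", head[3]);
            else vertex.properties.put("age", Integer.valueOf(head[3]));
        }
        if (parts.length == 2) {
            for (String edge : parts[1].split(",")) {
                if (!edge.isEmpty()) vertex.outEdges.add(edge.split(":"));
            }
        }
        return vertex;
    }

    public static void main(String[] args) {
        ParsedVertex v = parse("1:person:marko:29 knows:2:0.5,knows:4:1.0,created:3:0.4");
        System.out.println(v.id + " " + v.label + " " + v.properties + " edges=" + v.outEdges.size());
        System.out.println(parse("# a comment line"));
    }
}
```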

ScriptOutputFormat Support

The principle above can also be used to convert a vertex to an arbitrary String representation that is ultimately streamed back to a file in HDFS. This is the role of ScriptOutputFormat. ScriptOutputFormat requires that the provided script maintains a method with the following signature:

def stringify(Vertex vertex) { ... }

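An appropriate stringify() to produce output in a format similar to the ScriptInputFormat sample is shown below. Note that it separates the vertex data from the edge data with a tab, whereas the sample above (and the split in parse()) uses a single space: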

def stringify(vertex) {
    def v = vertex.values('name', 'age', 'lang').inject(vertex.id(), vertex.label()).join(':')
    def outE = vertex.outE().map {
        def e = it.get()
        e.values('weight').inject(e.label(), e.inV().next().id()).join(':')
    }.join(',')
    return [v, outE].join('\t')
}
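The string assembly performed by stringify() can be tried out in isolation with the plain-Java sketch below. It is an illustration under stated assumptions rather than a ScriptOutputFormat script: the vertex is passed in as already-extracted strings, the class and method names are hypothetical, and the separator between vertex data and edge data is a parameter so that either a tab or a space can be used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AdjacencyLineWriter {
    // Builds "id:label:value:... <separator> label:targetId:weight,..." for one vertex.
    static String stringify(String id, String label, List<String> values,
                            List<String[]> outEdges, String separator) {
        List<String> head = new ArrayList<>();
        head.add(id);
        head.add(label);
        head.addAll(values);
        List<String> edges = new ArrayList<>();
        for (String[] edge : outEdges) {
            edges.add(String.join(":", edge)); // label:targetId:weight
        }
        return String.join(":", head) + separator + String.join(",", edges);
    }

    public static void main(String[] args) {
        List<String[]> edges = Arrays.asList(
                new String[]{"knows", "2", "0.5"},
                new String[]{"knows", "4", "1.0"},
                new String[]{"created", "3", "0.4"});
        System.out.println(stringify("1", "person", Arrays.asList("marko", "29"), edges, " "));
    }
}
```

With a single space as the separator, the line produced for vertex 1 matches the first line of the ScriptInputFormat sample, so it can be read back by a parse() that splits on a space.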

Interacting with HDFS

The distributed file system of Hadoop is called HDFS. The results of any OLAP operation are stored in HDFS and are accessible via the hdfs variable.

gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> :remote connect tinkerpop.hadoop graph
==>useTraversalSource=graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], giraphgraphcomputer]
==>useSugar=false
gremlin> :> g.V().group().by{it.value('name')[1]}.by('name')
INFO  org.apache.tinkerpop.gremlin.hadoop.process.computer.giraph.GiraphGraphComputer  - HadoopGremlin(Giraph): TraversalVertexProgram[GraphStep([],vertex), GroupStep([LambdaMapStep(lambda)],value(name)), ComputerResultStep]
INFO  org.apache.hadoop.mapred.JobClient  - Running job: job_201506300811_0055
INFO  org.apache.hadoop.mapred.JobClient  -  map 0% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 66% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  - Job complete: job_201506300811_0055
INFO  org.apache.hadoop.mapred.JobClient  - Counters: 28
INFO  org.apache.hadoop.mapred.JobClient  -   Map-Reduce Framework
INFO  org.apache.hadoop.mapred.JobClient  -     Spilled Records=0
INFO  org.apache.hadoop.mapred.JobClient  -     Map input records=3
INFO  org.apache.hadoop.mapred.JobClient  -     SPLIT_RAW_BYTES=132
INFO  org.apache.hadoop.mapred.JobClient  -     Map output records=0
INFO  org.apache.hadoop.mapred.JobClient  -     Total committed heap usage (bytes)=1006108672
INFO  org.apache.hadoop.mapred.JobClient  -   Giraph Timers
INFO  org.apache.hadoop.mapred.JobClient  -     Total (milliseconds)=9158
INFO  org.apache.hadoop.mapred.JobClient  -     Shutdown (milliseconds)=160
INFO  org.apache.hadoop.mapred.JobClient  -     Input superstep (milliseconds)=3352
INFO  org.apache.hadoop.mapred.JobClient  -     Superstep 0 (milliseconds)=4244
INFO  org.apache.hadoop.mapred.JobClient  -     Setup (milliseconds)=1395
INFO  org.apache.hadoop.mapred.JobClient  -   File Input Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Read=0
INFO  org.apache.hadoop.mapred.JobClient  -   FileSystemCounters
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_READ=945
INFO  org.apache.hadoop.mapred.JobClient  -     FILE_BYTES_WRITTEN=334392
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_WRITTEN=1838
INFO  org.apache.hadoop.mapred.JobClient  -   Giraph Stats
INFO  org.apache.hadoop.mapred.JobClient  -     Aggregate finished vertices=0
INFO  org.apache.hadoop.mapred.JobClient  -     Aggregate edges=0
INFO  org.apache.hadoop.mapred.JobClient  -     Sent messages=0
INFO  org.apache.hadoop.mapred.JobClient  -     Current workers=2
INFO  org.apache.hadoop.mapred.JobClient  -     Last checkpointed superstep=0
INFO  org.apache.hadoop.mapred.JobClient  -     Current master task partition=0
INFO  org.apache.hadoop.mapred.JobClient  -     Superstep=1
INFO  org.apache.hadoop.mapred.JobClient  -     Aggregate vertices=6
INFO  org.apache.hadoop.mapred.JobClient  -   Job Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Launched map tasks=3
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_REDUCES=0
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all reduces waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_MAPS=61448
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all maps waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -   File Output Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Written=0
INFO  org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph  - HadoopGremlin: GroupMapReduce[~reducing]
INFO  org.apache.hadoop.mapred.JobClient  - Running job: job_201506300811_0056
INFO  org.apache.hadoop.mapred.JobClient  -  map 0% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 0%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 33%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 83%
INFO  org.apache.hadoop.mapred.JobClient  -  map 100% reduce 100%
INFO  org.apache.hadoop.mapred.JobClient  - Job complete: job_201506300811_0056
INFO  org.apache.hadoop.mapred.JobClient  - Counters: 26
INFO  org.apache.hadoop.mapred.JobClient  -   Map-Reduce Framework
INFO  org.apache.hadoop.mapred.JobClient  -     Spilled Records=12
INFO  org.apache.hadoop.mapred.JobClient  -     Map output materialized bytes=430
INFO  org.apache.hadoop.mapred.JobClient  -     Reduce input records=6
INFO  org.apache.hadoop.mapred.JobClient  -     Map input records=6
INFO  org.apache.hadoop.mapred.JobClient  -     SPLIT_RAW_BYTES=240
INFO  org.apache.hadoop.mapred.JobClient  -     Map output bytes=370
INFO  org.apache.hadoop.mapred.JobClient  -     Reduce shuffle bytes=430
INFO  org.apache.hadoop.mapred.JobClient  -     Reduce input groups=4
INFO  org.apache.hadoop.mapred.JobClient  -     Combine output records=0
INFO  org.apache.hadoop.mapred.JobClient  -     Reduce output records=4
INFO  org.apache.hadoop.mapred.JobClient  -     Map output records=6
INFO  org.apache.hadoop.mapred.JobClient  -     Combine input records=0
INFO  org.apache.hadoop.mapred.JobClient  -     Total committed heap usage (bytes)=2166882304
INFO  org.apache.hadoop.mapred.JobClient  -   File Input Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Read=1384
INFO  org.apache.hadoop.mapred.JobClient  -   FileSystemCounters
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_READ=1624
INFO  org.apache.hadoop.mapred.JobClient  -     FILE_BYTES_WRITTEN=684410
INFO  org.apache.hadoop.mapred.JobClient  -     FILE_BYTES_READ=406
INFO  org.apache.hadoop.mapred.JobClient  -     HDFS_BYTES_WRITTEN=908
INFO  org.apache.hadoop.mapred.JobClient  -   Job Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Launched map tasks=2
INFO  org.apache.hadoop.mapred.JobClient  -     Launched reduce tasks=4
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_REDUCES=84714
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all reduces waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -     SLOTS_MILLIS_MAPS=17909
INFO  org.apache.hadoop.mapred.JobClient  -     Total time spent by all maps waiting after reserving slots (ms)=0
INFO  org.apache.hadoop.mapred.JobClient  -     Data-local map tasks=2
INFO  org.apache.hadoop.mapred.JobClient  -   File Output Format Counters
INFO  org.apache.hadoop.mapred.JobClient  -     Bytes Written=908
==>[a:[marko, vadas], e:[peter], i:[ripple], o:[lop, josh]]
gremlin> hdfs.ls()
==>rwxr-xr-x marko supergroup 0 (D) _bsp
==>rwxr-xr-x marko supergroup 0 (D) hadoop-gremlin-libs
==>rwxr-xr-x marko supergroup 0 (D) output
==>rw-r--r-- marko supergroup 781 tinkerpop-modern.kryo
gremlin> hdfs.ls('output')
==>rwxr-xr-x marko supergroup 0 (D) ~reducing
gremlin> hdfs.ls('output/~reducing')
==>rw-r--r-- marko supergroup 0 _SUCCESS
==>rwxr-xr-x marko supergroup 0 (D) _logs
==>rw-r--r-- marko supergroup 154 part-r-00000
==>rw-r--r-- marko supergroup 372 part-r-00001
==>rw-r--r-- marko supergroup 154 part-r-00002
==>rw-r--r-- marko supergroup 228 part-r-00003
gremlin> hdfs.head('output/~reducing', ObjectWritable)
==>a        {marko=1, vadas=1}
==>e        {peter=1}
==>i        {ripple=1}
==>o        {lop=1, josh=1}

The available HDFS methods are itemized below. Note that these methods are also available on the local variable:

| Method | Description |
|---|---|
| hdfs.ls(String path) | List the contents of the supplied directory. |
| hdfs.cp(String from, String to) | Copy the specified source path to the specified destination path. |
| hdfs.exists(String path) | Whether the specified path exists. |
| hdfs.rm(String path) | Remove the specified path. |
| hdfs.rmr(String path) | Remove the specified path and its contents recursively. |
| hdfs.copyToLocal(String from, String to) | Copy the specified HDFS path to the specified local path. |
| hdfs.copyFromLocal(String from, String to) | Copy the specified local path to the specified HDFS path. |
| hdfs.mergeToLocal(String from, String to) | Merge the files in the specified HDFS path into the specified local path. |
| hdfs.head(String path) | Display the data in the path as text. |
| hdfs.head(String path, int lineCount) | Display only the first lineCount lines of the path as text. |
| hdfs.head(String path, int totalKeyValues, Class<Writable> writableClass) | Display the path, interpreting the key/values as the respective writable. |

A Command Line Example

pagerank logo

The classic PageRank centrality algorithm can be executed over the TinkerPop graph from the command line using GiraphGraphComputer.

$ hadoop fs -copyFromLocal data/tinkerpop-modern.json tinkerpop-modern.json
$ hadoop fs -ls
Found 2 items
-rw-r--r--   1 marko supergroup       2356 2014-07-28 13:00 /user/marko/tinkerpop-modern.json
$ hadoop jar target/hadoop-gremlin-3.0.0-SNAPSHOT-job.jar org.apache.tinkerpop.gremlin.hadoop.process.computer.giraph.GiraphGraphComputer conf/hadoop-graphson.properties
14/07/29 12:08:27 INFO giraph.GiraphGraphComputer: HadoopGremlin(Giraph): PageRankVertexProgram[alpha=0.85,iterations=30]
14/07/29 12:08:28 INFO mapred.JobClient: Running job: job_201407281259_0041
14/07/29 12:08:29 INFO mapred.JobClient:  map 0% reduce 0%
14/07/29 12:08:51 INFO mapred.JobClient:  map 66% reduce 0%
14/07/29 12:08:52 INFO mapred.JobClient:  map 100% reduce 0%
14/07/29 12:08:54 INFO mapred.JobClient: Job complete: job_201407281259_0041
14/07/29 12:08:54 INFO mapred.JobClient: Counters: 57
14/07/29 12:08:54 INFO mapred.JobClient:   Map-Reduce Framework
14/07/29 12:08:54 INFO mapred.JobClient:     Spilled Records=0
14/07/29 12:08:54 INFO mapred.JobClient:     Map input records=3
14/07/29 12:08:54 INFO mapred.JobClient:     SPLIT_RAW_BYTES=132
14/07/29 12:08:54 INFO mapred.JobClient:     Map output records=0
14/07/29 12:08:54 INFO mapred.JobClient:     Total committed heap usage (bytes)=347602944
14/07/29 12:08:54 INFO mapred.JobClient:   Giraph Timers
14/07/29 12:08:54 INFO mapred.JobClient:     Shutdown (milliseconds)=385
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 1 (milliseconds)=89
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 23 (milliseconds)=28
14/07/29 12:08:54 INFO mapred.JobClient:     Input superstep (milliseconds)=1127
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 27 (milliseconds)=30
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 10 (milliseconds)=34
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 5 (milliseconds)=43
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 22 (milliseconds)=31
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 14 (milliseconds)=35
14/07/29 12:08:54 INFO mapred.JobClient:     Total (milliseconds)=4023
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 2 (milliseconds)=50
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 18 (milliseconds)=29
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 11 (milliseconds)=35
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 24 (milliseconds)=32
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 28 (milliseconds)=32
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 15 (milliseconds)=34
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 6 (milliseconds)=37
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 19 (milliseconds)=31
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 25 (milliseconds)=27
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 8 (milliseconds)=33
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 12 (milliseconds)=44
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 20 (milliseconds)=31
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 16 (milliseconds)=31
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 9 (milliseconds)=36
14/07/29 12:08:54 INFO mapred.JobClient:     Setup (milliseconds)=1119
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 3 (milliseconds)=50
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 7 (milliseconds)=38
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 13 (milliseconds)=36
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 29 (milliseconds)=37
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 26 (milliseconds)=40
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 0 (milliseconds)=293
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 21 (milliseconds)=46
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 17 (milliseconds)=32
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep 4 (milliseconds)=39
14/07/29 12:08:54 INFO mapred.JobClient:   File Input Format Counters
14/07/29 12:08:54 INFO mapred.JobClient:     Bytes Read=0
14/07/29 12:08:54 INFO mapred.JobClient:   Giraph Stats
14/07/29 12:08:54 INFO mapred.JobClient:     Aggregate finished vertices=0
14/07/29 12:08:54 INFO mapred.JobClient:     Aggregate edges=0
14/07/29 12:08:54 INFO mapred.JobClient:     Sent messages=6
14/07/29 12:08:54 INFO mapred.JobClient:     Current workers=2
14/07/29 12:08:54 INFO mapred.JobClient:     Last checkpointed superstep=0
14/07/29 12:08:54 INFO mapred.JobClient:     Current master task partition=0
14/07/29 12:08:54 INFO mapred.JobClient:     Superstep=30
14/07/29 12:08:54 INFO mapred.JobClient:     Aggregate vertices=6
14/07/29 12:08:54 INFO mapred.JobClient:   FileSystemCounters
14/07/29 12:08:54 INFO mapred.JobClient:     HDFS_BYTES_READ=2488
14/07/29 12:08:54 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=250470
14/07/29 12:08:54 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2719
14/07/29 12:08:54 INFO mapred.JobClient:   Job Counters
14/07/29 12:08:54 INFO mapred.JobClient:     Launched map tasks=3
14/07/29 12:08:54 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/07/29 12:08:54 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/07/29 12:08:54 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=31907
14/07/29 12:08:54 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/07/29 12:08:54 INFO mapred.JobClient:   File Output Format Counters
14/07/29 12:08:54 INFO mapred.JobClient:     Bytes Written=0
$ hadoop fs -cat output/~g/*
{"id":1,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.15000000000000002}],"name":[{"id":0,"value":"marko"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":3.0}],"age":[{"id":1,"value":29}]}}
{"id":5,"label":"software","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.23181250000000003}],"name":[{"id":8,"value":"ripple"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":0.0}],"lang":[{"id":9,"value":"java"}]}}
{"id":3,"label":"software","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.4018125}],"name":[{"id":4,"value":"lop"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":0.0}],"lang":[{"id":5,"value":"java"}]}}
{"id":4,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.19250000000000003}],"name":[{"id":6,"value":"josh"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":2.0}],"age":[{"id":7,"value":32}]}}
{"id":2,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.19250000000000003}],"name":[{"id":2,"value":"vadas"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":0.0}],"age":[{"id":3,"value":27}]}}
{"id":6,"label":"person","properties":{"gremlin.pageRankVertexProgram.pageRank":[{"id":35,"value":0.15000000000000002}],"name":[{"id":10,"value":"peter"}],"gremlin.pageRankVertexProgram.edgeCount":[{"id":6,"value":1.0}],"age":[{"id":11,"value":35}]}}

Vertex 4 ("josh") is isolated below:

{
  "id":4,
  "label":"person",
  "properties": {
    "gremlin.pageRankVertexProgram.pageRank":[{"id":39,"value":0.19250000000000003}],
    "name":[{"id":6,"value":"josh"}],
    "gremlin.pageRankVertexProgram.edgeCount":[{"id":10,"value":2.0}],
    "age":[{"id":7,"value":32}]
  }
}
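The pageRank values in the output above are consistent with the un-normalized recurrence pr(v) = (1 - alpha) + alpha * sum of pr(u) / outDegree(u) over all vertices u with an edge to v, using alpha = 0.85. The plain-Java sketch below applies that recurrence to the six-vertex toy graph so the numbers can be checked by hand. It is an illustration under that assumption, not the PageRankVertexProgram source; the class name and the array encoding of the graph are hypothetical.

```java
import java.util.Arrays;

final class PageRankSketch {
    // Out-edges of the toy graph indexed by vertex id; index 0 is unused.
    static final int[][] OUT = {
        {}, {2, 3, 4}, {}, {}, {3, 5}, {}, {3}
    };

    static double[] pageRank(double alpha, int iterations) {
        int n = OUT.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0); // starting value; this acyclic graph settles after a few iterations
        for (int i = 0; i < iterations; i++) {
            double[] next = new double[n];
            Arrays.fill(next, 1.0 - alpha);
            for (int v = 1; v < n; v++) {
                for (int target : OUT[v]) {
                    // each vertex splits its rank evenly across its out-edges
                    next[target] += alpha * rank[v] / OUT[v].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        double[] rank = pageRank(0.85, 30);
        for (int v = 1; v < rank.length; v++) {
            System.out.println("vertex " + v + " -> " + rank[v]);
        }
    }
}
```

For example, vertex 2 ("vadas") receives one third of the rank of vertex 1, giving 0.15 + 0.85 * (0.15 / 3) = 0.1925, which agrees with the value reported for it above up to floating-point rounding.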

Hadoop-Gremlin for Vendors

Hadoop-Gremlin is centered around InputFormats and OutputFormats. If a 3rd-party vendor wishes to leverage Hadoop-Gremlin (and its respective GraphComputer engines), then they simply need to provide, at minimum, a Hadoop 1.x InputFormat<NullWritable,VertexWritable> for their graph system. If the vendor wishes to persist computed results back to their graph system (and not just to HDFS via a FileOutputFormat), then a vendor-specific OutputFormat<NullWritable,VertexWritable> must be developed as well.

Conceptually, HadoopGraph is a wrapper around a Configuration object. There is no "data" in the HadoopGraph as the InputFormat specifies where and how to get the graph data at OLAP (and OLTP) runtime. Thus, HadoopGraph is a small object with little overhead. Vendors should treat HadoopGraph as the gateway to the OLAP features offered by Hadoop-Gremlin. An example vendor-specific Graph.compute(Class<? extends GraphComputer> graphComputerClass) method may look as follows:

public <C extends GraphComputer> C compute(final Class<C> graphComputerClass) throws IllegalArgumentException {
  if(AbstractHadoopGraphComputer.class.isAssignableFrom(graphComputerClass))
    return HadoopGraph.open(this.configuration()).compute(graphComputerClass);
  else if(...) // vendor specific graph computer classes
    // return vendor specific instance
  else
    throw Graph.Exceptions.graphDoesNotSupportProvidedGraphComputer(graphComputerClass);
}

Note that the configurations for Hadoop are assumed to be in the Graph.configuration() object. If this is not the case, then the Configuration provided to HadoopGraph.open() should be dynamically created within the compute() method. It is from the provided configuration that HadoopGraph gets the various properties which determine how to read and write data to and from Hadoop, for instance gremlin.hadoop.graphInputFormat and gremlin.hadoop.graphOutputFormat.
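As a rough sketch of what "dynamically created" could look like, the snippet below assembles the two properties named above with the Gryo format classes listed earlier in this section. To stay self-contained it uses java.util.Properties as a stand-in for the Configuration object that would actually be handed to HadoopGraph.open(); that substitution and the class and method names are assumptions made for the example.

```java
import java.util.Properties;

final class HadoopGraphConfigSketch {
    // Collects the input/output format settings that HadoopGraph reads.
    // java.util.Properties stands in for the real Configuration type here.
    static Properties gryoConfiguration() {
        Properties conf = new Properties();
        conf.setProperty("gremlin.hadoop.graphInputFormat",
                "org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat");
        conf.setProperty("gremlin.hadoop.graphOutputFormat",
                "org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat");
        return conf;
    }

    public static void main(String[] args) {
        Properties conf = gryoConfiguration();
        for (String key : conf.stringPropertyNames()) {
            System.out.println(key + "=" + conf.getProperty(key));
        }
    }
}
```

A vendor would typically replace the two class names with its own InputFormat and OutputFormat implementations and add whatever connection settings its graph system needs.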

Important
A vendor’s OutputFormat should implement the PersistResultGraphAware interface, which determines which persistence options are available to the user. For the standard file-based OutputFormats provided by Hadoop-Gremlin (e.g. GryoOutputFormat, GraphSONOutputFormat, and ScriptOutputFormat), ResultGraph.ORIGINAL is not supported as the original graph data files are not random access and are, in essence, immutable. Thus, these file-based OutputFormats only support ResultGraph.NEW, which creates a copy of the data specified by the Persist enum.

Conclusion

tinkerpop-character The world that we know, you and me, is but a subset of the world that Gremlin has weaved within The TinkerPop. Gremlin has constructed a fully connected graph and only the subset that makes logical sense to our traversing thoughts is the fragment we have come to know and have come to see one another within. But there are many more out there, within other webs of logics unfathomed. From any thought, every other thought, we come to realize that which is — The TinkerPop.

Acknowledgements

yourkit-logo YourKit supports the TinkerPop open source project with its full-featured Java Profiler. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. YourKit’s leading software products are YourKit Java Profiler and YourKit .NET Profiler.

egg-logo Apache TinkerPop is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. Apache TinkerPop is distributed under the Apache License v2.0.

ketrina-tinkerpop3 Ketrina Yim — Designing Gremlin and his friends for TinkerPop was one of my first major projects as a freelancer, and it’s delightful to see them on the Web and all over the documentation! Drawing and tweaking the characters over time is like watching them grow up. They’ve gone from sketches on paper to full-color logos, and from logos to living characters that cheerfully greet visitors to the TinkerPop website. And it’s been a great time all throughout!

…in the beginning.