Tutorial

Imagine that you have a workflow made up of three tasks “A”, “B”, and “C”, and the tasks must always be performed in the right order, because task “C” depends on the outputs of both task “A” and task “B”. Further, imagine that the individual tasks are time-consuming, so that you don’t want to execute a task unless it’s really necessary: if something has changed that only affects task “C”, and tasks “A” and “B” have already been completed, then you should only need to redo task “C”. Over time, keeping track of which tasks need to be executed can become extremely complex as your workflow grows, branches, and merges.

Graphcat is a tool that allows you to explicitly capture a workflow in a computational graph, managing the details of executing each task in the proper order and at the proper time, no matter the state of the tasks or the complexity of the workflow. Graphcat doesn’t care what kind of data your graph manages, doesn’t dictate how you name the entities in the graph, provides advanced functionality like expression-based tasks, and is easy to learn.

Intrigued? Let’s look at some code!

The Basics

First, we import graphcat, which includes all of the functionality for managing computational graphs. If you’re using Graphcat in your scripts, this will likely be all you need. For this tutorial we also import graphcat.notebook, so we can see the state of our graphs as we work.

[1]:
import graphcat
import graphcat.notebook

Next, let’s reproduce the example workflow from above, starting with an (initially empty) computational graph:

[2]:
graph = graphcat.StaticGraph()

Next, we will add tasks to the graph, identified using unique string names:

[3]:
graph.add_task("A")
graph.add_task("B")
graph.add_task("C")

Note that a task name can be any hashable object, not just a string - we used strings in this case because they map well to our particular problem.

Now, we can define the links that determine which tasks depend on previous tasks:

[4]:
graph.add_links(source="A", targets="C")
graph.add_links(source="B", targets="C")

There are two ways to think about links. One way is to picture data “flowing” through the links from the source tasks to the target tasks, which is why we sometimes call the sources “upstream” and the targets “downstream”. Alternatively, you can say that the target of a link “depends on” the source - any time the source changes, the target needs to change, along with all of its own targets, and so on. Both viewpoints are completely valid, and you will find that both are useful, depending on the context.
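To make the “depends on” viewpoint concrete, here is a small standalone sketch that finds every task downstream of a given task. It does not use Graphcat at all: the `links` list and the `downstream()` helper are hypothetical names invented for this illustration, and Graphcat does this bookkeeping for you.

```python
from collections import defaultdict, deque

# Links from the tutorial's example workflow, as (source, target) pairs.
links = [("A", "C"), ("B", "C")]

def downstream(links, start):
    """Return every task that directly or indirectly depends on `start`."""
    targets = defaultdict(set)
    for source, target in links:
        targets[source].add(target)

    found = set()
    queue = deque([start])
    while queue:
        task = queue.popleft()
        for target in targets[task]:
            if target not in found:
                found.add(target)
                queue.append(target)
    return found

print(downstream(links, "A"))  # {'C'}
print(downstream(links, "C"))  # set()
```

A change to “A” affects “C”, while a change to “C” affects nothing else, because no links leave “C”.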

Finally, because a picture is worth \(1\times10^3\) words, let’s see what the graph looks like so far:

[5]:
graphcat.notebook.display(graph)
_images/tutorial_11_0.svg

Notice that each task is drawn as a box, labelled with the task name, and the links are drawn as arrows that point from sources to targets, i.e. the arrows point in the direction of data flow.

Of course, all we’ve done so far is define how our tasks relate to one another - we haven’t actually executed any of them. Before we do so, let’s introduce some logging so we can see what Graphcat is doing under the hood. We’ll import the standard Python logging module and configure it to log informational messages. Then, we create a special graphcat.Logger object that will watch the computational graph and log events as they happen:

[6]:
import logging
logging.basicConfig(level=logging.INFO)
logger = graphcat.Logger(graph)

By default, newly-created tasks are considered unfinished, because they haven’t been executed yet. Let’s finish task “A” by updating it:

[7]:
graph.update("A")
graphcat.notebook.display(graph)
INFO:graphcat.common:Task A updating.
INFO:graphcat.common:Task A executing. Inputs: {}
INFO:graphcat.common:Task A finished. Output: None
_images/tutorial_15_1.svg

The call to update executes the unfinished task, which we see in the second line of logging; once the task has been executed, the third line in the log shows that its state is now finished. (Ignore the “Inputs: …” and “Output: …” text in the log for now; we will explain their meaning shortly.) Note that in our visualization, task “A” is now rendered with a black background to show that the task is finished.

Continuing on, let’s update task “C” and see what happens:

[8]:
graph.update("C")
graphcat.notebook.display(graph)
INFO:graphcat.common:Task A updating.
INFO:graphcat.common:Task B updating.
INFO:graphcat.common:Task B executing. Inputs: {}
INFO:graphcat.common:Task B finished. Output: None
INFO:graphcat.common:Task C updating.
INFO:graphcat.common:Task C executing. Inputs: {None: None, None: None}
INFO:graphcat.common:Task C finished. Output: None
_images/tutorial_17_1.svg

Looking closely at the log, we see that task “C” is executed, but only after task “B”. Task “A” isn’t executed, because it was already finished before update was called. Note that this conforms to our original goals for our workflow: tasks “A” and “B” must be completed before task “C”, and we never re-execute tasks that are already finished.

To reinforce this point, let’s look at what happens if a task becomes unfinished again. Imagine that some outside change has made the results of task “A” obsolete. We can notify Graphcat that this has happened using mark_unfinished:

[9]:
graph.mark_unfinished("A")
graphcat.notebook.display(graph)
_images/tutorial_19_0.svg

Notice that both “A” and “C” have become unfinished: because “A” is unfinished and “C” depends on “A”, “C” becomes unfinished too. “B” is unaffected because it doesn’t depend on “A”. Let’s update “C” again:

[10]:
graph.update("C")
graphcat.notebook.display(graph)
INFO:graphcat.common:Task A updating.
INFO:graphcat.common:Task A executing. Inputs: {}
INFO:graphcat.common:Task A finished. Output: None
INFO:graphcat.common:Task B updating.
INFO:graphcat.common:Task C updating.
INFO:graphcat.common:Task C executing. Inputs: {None: None, None: None}
INFO:graphcat.common:Task C finished. Output: None
_images/tutorial_21_1.svg

This time “C” is executed, but only after “A”. As expected, “B” isn’t executed because it was already finished.
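If you are curious how this kind of bookkeeping can work, the following toy model reproduces the behavior we just observed using nothing but the standard library. It is a sketch for illustration only, not Graphcat's implementation: the `upstream` dict, the `finished` set, and both functions are hypothetical stand-ins.

```python
# Toy model only: this is not how Graphcat is implemented.
upstream = {"A": [], "B": [], "C": ["A", "B"]}  # task -> tasks it depends on
finished = set()
executed = []  # order in which tasks actually ran

def update(task):
    """Bring `task` up to date, visiting its dependencies first."""
    for dependency in upstream[task]:
        update(dependency)
    if task not in finished:
        executed.append(task)  # stand-in for calling the task function
        finished.add(task)

def mark_unfinished(task):
    """Mark `task` and everything downstream of it as unfinished."""
    finished.discard(task)
    for other, dependencies in upstream.items():
        if task in dependencies:
            mark_unfinished(other)

update("A")
update("C")
print(executed)  # ['A', 'B', 'C']

executed.clear()
mark_unfinished("A")
update("C")
print(executed)  # ['A', 'C']
```

The second run skips “B” for the same reason the real graph does: nothing upstream of it changed, so it is still finished.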

Hopefully, we’ve convinced you that Graphcat always knows which tasks to execute, and in what order. This is true no matter how complex your computational graph becomes. In the next section, we will explore how to configure the graph to perform real work.

Task Functions

In the previous section, we learned how to represent our workflow using tasks and links, but the tasks themselves didn’t actually do anything when executed. To rectify this, we will assign task functions that define what a task does when executed. A task function is simply a Python function (technically: a Python callable) that is called when a task is executed, returning a value that is stored as the output for the task. When downstream tasks are executed, their task functions have access to the outputs from their upstream dependencies. Thus, upstream task function outputs become downstream task function inputs.

Let’s turn our current example into a simple calculator. Tasks “A” and “B” will have task functions that return numbers, and task “C” will return the sum of its inputs. First, we define the task functions for each task:

[11]:
def task_a(graph, name, inputs):
    return 2

def task_b(graph, name, inputs):
    return 3

def add(graph, name, inputs):
    return sum([value() for value in inputs.values()])

Note that every task function must accept three keyword arguments: graph, name and inputs. The graph argument is the graph that this task is a part of; name is the name of the task being executed, and is useful for logging or changing the function’s behavior based on the task’s identity; inputs is an object that behaves like a Python dict and contains the outputs from upstream tasks.

Don’t worry too much about how add() is implemented; we’ll discuss that in detail in a bit. Let’s assign our task functions to each task in the graph:

[12]:
graph.set_task("A", task_a)
graph.set_task("B", task_b)
graph.set_task("C", add)
graphcat.notebook.display(graph)
_images/tutorial_26_0.svg

Notice that changing the task functions with set_task also marks the tasks as unfinished. This is an example of how Graphcat always ensures that changes to the graph will propagate to its results. Let’s update the graph and see what happens:

[13]:
graph.update("C")
graphcat.notebook.display(graph)
INFO:graphcat.common:Task A updating.
INFO:graphcat.common:Task A executing. Inputs: {}
INFO:graphcat.common:Task A finished. Output: 2
INFO:graphcat.common:Task B updating.
INFO:graphcat.common:Task B executing. Inputs: {}
INFO:graphcat.common:Task B finished. Output: 3
INFO:graphcat.common:Task C updating.
INFO:graphcat.common:Task C executing. Inputs: {None: 2, None: 3}
INFO:graphcat.common:Task C finished. Output: 5
_images/tutorial_28_1.svg

Now, the full meaning of the log messages should be clearer - tasks “A” and “B” have no inputs when they execute, returning the values 2 and 3 respectively as their outputs. Those outputs become the inputs to “C” when it executes, where they are summed, so that the output of “C” is 5, as expected.

Of course, you normally want to retrieve the outputs from your graph so you can do something with them. So far, all we’ve seen are log messages. To retrieve the most recent output for a task, use output instead of update:

[14]:
print("Result:", graph.output("C"))
INFO:graphcat.common:Task A updating.
INFO:graphcat.common:Task B updating.
INFO:graphcat.common:Task C updating.
Result: 5

Note that output implicitly calls update for you, so you can simply use it whenever you need to execute your graph and retrieve an output.

Now that our graph is performing a real (albeit trivial) task, let’s look at some ways to simplify setting it up:

First, it is extremely common for a graph to have “parameter” tasks that simply return a value, as tasks “A” and “B” do in our example. Having to create a separate function for every parameter would be perverse. Fortunately, Graphcat provides a helper function, graphcat.constant, that you can use instead:

[15]:
graph.set_task("A", graphcat.constant(4))
graph.set_task("B", graphcat.constant(5))
print("Result:", graph.output("C"))
INFO:graphcat.common:Task A updating.
INFO:graphcat.common:Task A executing. Inputs: {}
INFO:graphcat.common:Task A finished. Output: 4
INFO:graphcat.common:Task B updating.
INFO:graphcat.common:Task B executing. Inputs: {}
INFO:graphcat.common:Task B finished. Output: 5
INFO:graphcat.common:Task C updating.
INFO:graphcat.common:Task C executing. Inputs: {None: 4, None: 5}
INFO:graphcat.common:Task C finished. Output: 9
Result: 9

graphcat.constant is a factory for task functions that always return a value you provide, eliminating the need to create dedicated task functions of your own for parameters. Use graphcat.constant with set_task any time you need to change the parameters in your workflow, whether due to user input, changes in the environment, network traffic, or any other externality that affects your workflow outputs.
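To see what “a factory for task functions” means, here is a minimal sketch of how such a helper could be written as a closure. The name `constant_sketch` is hypothetical, and this is not Graphcat's source code, only an illustration of the idea.

```python
def constant_sketch(value):
    """Return a task function that ignores its inputs and returns `value`."""
    def implementation(graph, name, inputs):
        return value
    return implementation

# The factory returns an ordinary task function with the usual three arguments.
task_fn = constant_sketch(4)
print(task_fn(graph=None, name="A", inputs={}))  # 4
```

Because the returned function has the same (graph, name, inputs) signature as any other task function, the graph doesn’t need to treat parameters as a special case.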

Next, you may wonder why we had to call both add_task and set_task just to create a working task. In fact, we didn’t - either method can create a task and assign its function in a single step:

[16]:
graph.set_task("D", graphcat.constant(6))

The difference between add_task and set_task is that the former will fail if a task with the given name already exists, while the latter will quietly overwrite it.

Let’s connect our newly created task “D” to “C”, and see that it integrates nicely with the rest of the computation:

[17]:
graph.set_links(source="D", targets="C")
print("Result:", graph.output("C"))
graphcat.notebook.display(graph)
INFO:graphcat.common:Task A updating.
INFO:graphcat.common:Task B updating.
INFO:graphcat.common:Task D updating.
INFO:graphcat.common:Task D executing. Inputs: {}
INFO:graphcat.common:Task D finished. Output: 6
INFO:graphcat.common:Task C updating.
INFO:graphcat.common:Task C executing. Inputs: {None: 4, None: 5, None: 6}
INFO:graphcat.common:Task C finished. Output: 15
Result: 15
_images/tutorial_36_2.svg

Named Inputs

By now, you should have questions about the way inputs are passed to task functions. From the log message in the preceding example - {None: 4, None: 5, None: 6} - it’s obvious that the results from “A”, “B”, and “D” are passed to “C” using something that looks like a dict, but what’s with the key None, and why does it appear multiple times (something that can’t happen with an actual dict)?

What’s happening is that when you create a link between a source and a target, you also - implicitly or explicitly - specify a named input on the target. When the target task function is executed, the named inputs become the keys used to access the corresponding values. This makes it possible for task functions with multiple inputs to tell those inputs apart. If you don’t specify a named input when you create a link, the name defaults to None.

Let’s look back at the implementation of the add() function:

def add(graph, name, inputs):
    return sum([value() for value in inputs.values()])

Here, the function doesn’t need to know the names of its inputs, since all it does is add them together. That is why it uses the values() method of the inputs object - like a normal Python dict, values() provides access to just the values, ignoring the input names. Note though, that unlike a Python dict, the objects returned by values() aren’t the values themselves - they are callables that have to be executed to return the values - which is why the code is sum([value() ... instead of sum([value ....
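One way to convince yourself of this is to call add() directly, outside of any graph. In the sketch below we pass a plain dict whose values are zero-argument callables; this dict is only a hypothetical stand-in for the real inputs object (a plain dict cannot repeat a key, for example), and passing None for graph works only because add() never uses it.

```python
def add(graph, name, inputs):
    return sum([value() for value in inputs.values()])

# Stand-in for the inputs object: each value must be called to get the number.
fake_inputs = {"x": lambda: 2, "y": lambda: 3}

print(add(graph=None, name="C", inputs=fake_inputs))  # 5
```

Calling a task function by hand like this can also be a convenient way to test it before wiring it into a graph.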

Let’s modify our current example to access inputs by name. Instead of adding values, we’ll create a new task function that generates a familiar greeting:

[18]:
def greeting(graph, name, inputs):
    return f"{inputs.getone('greeting')}, {inputs.getone('subject')}!"

Note that the greeting() task function uses two inputs named "greeting" and "subject". Each call to inputs.getone(<name>) will return the value of the named input. If there isn’t an input with the given name, or there’s more than one, the call will fail.

Now we can set up the parameter and greeting task functions for our existing graph:

[19]:
graph.set_task("A", graphcat.constant("Hello"))
graph.set_task("B", graphcat.constant("World"))
graph.set_task("C", greeting)

And we’ll replace our existing links with links that connect to the named inputs required by the greeting() function (note that set_links replaces all of the outgoing links for a given source, unlike add_links, which adds new links alongside the existing ones):

[20]:
graph.set_links(source="A", targets=("C", "greeting"))
graph.set_links(source="B", targets=("C", "subject"))

Here, instead of passing just a task name as the target for set_links, we pass a (task name, input name) tuple. Like task names, input names don’t have to be strings - they can be any hashable object. Let’s see the result:

[21]:
print("Result:", graph.output("C"))
graphcat.notebook.display(graph)
INFO:graphcat.common:Task D updating.
INFO:graphcat.common:Task A updating.
INFO:graphcat.common:Task A executing. Inputs: {}
INFO:graphcat.common:Task A finished. Output: Hello
INFO:graphcat.common:Task B updating.
INFO:graphcat.common:Task B executing. Inputs: {}
INFO:graphcat.common:Task B finished. Output: World
INFO:graphcat.common:Task C updating.
INFO:graphcat.common:Task C executing. Inputs: {None: 6, greeting: Hello, subject: World}
INFO:graphcat.common:Task C finished. Output: Hello, World!
Result: Hello, World!
_images/tutorial_45_2.svg

Note that in the notebook diagram, links are labelled when they’re connected to inputs with names other than None.

Now, the input dict for “C” printed to the log should make more sense - it contains all of the named inputs and corresponding upstream outputs for the task. Note that task “D” is still connected to input None, but it’s ignored by the greeting() implementation.

It should also be clear now why a name can appear more than once in a task’s inputs: you can connect multiple tasks to a single input, one task to multiple inputs, or any combination of the two.
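A structure that allows repeated names behaves like a multi-map rather than a dict. The toy class below sketches that idea with a list of (name, callable) pairs. `InputsSketch` and its `getall()` method are hypothetical names for this illustration, and raising KeyError from `getone()` is our assumption; the real inputs object may differ in its details.

```python
class InputsSketch:
    """Toy multi-map: a list of (name, callable) pairs, so names may repeat."""

    def __init__(self, pairs):
        self._pairs = list(pairs)

    def values(self):
        # Like the tutorial's inputs, values are callables, not raw values.
        return [fn for _, fn in self._pairs]

    def getall(self, name):
        # Every value connected to `name`, in link order.
        return [fn() for key, fn in self._pairs if key == name]

    def getone(self, name):
        # Fail unless exactly one value is connected to `name`.
        matches = self.getall(name)
        if len(matches) != 1:
            raise KeyError(f"expected one input named {name!r}, found {len(matches)}")
        return matches[0]

inputs = InputsSketch([
    (None, lambda: 6),
    ("greeting", lambda: "Hello"),
    ("subject", lambda: "World"),
])
print(f"{inputs.getone('greeting')}, {inputs.getone('subject')}!")  # Hello, World!
print(len(inputs.values()))  # 3
```

With two values connected to the same name, `getone()` in this sketch would fail, which mirrors the rule described earlier for inputs.getone.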

By examining the input object, a task function can implement any desired behavior, from very strict (failing unless the input contains a specific set of names, numbers, and types of values) to very permissive (adjusting functionality based on names, numbers, and types of values in the input dict), or anywhere in-between.

Errors

What happens when things go wrong and your task function fails? Let’s find out, using a special Graphcat helper function for generating task functions that throw exceptions:

[22]:
graph.set_task("D", graphcat.raise_exception(RuntimeError("Whoops!")))

(In case you’re wondering, we use this for testing and debugging.)
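A helper like this can be pictured as a factory whose task function raises instead of returning. The sketch below uses the hypothetical name `raise_exception_sketch` and is not Graphcat's source; it only shows that a failing task function is still an ordinary callable with the usual three arguments.

```python
def raise_exception_sketch(exception):
    """Return a task function that raises `exception` whenever it is executed."""
    def implementation(graph, name, inputs):
        raise exception
    return implementation

failing = raise_exception_sketch(RuntimeError("Whoops!"))
try:
    failing(graph=None, name="D", inputs={})
except RuntimeError as e:
    print(f"Exception: {e!r}")  # Exception: RuntimeError('Whoops!')
```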

[23]:
try:
    print("Result:", graph.output("C"))
except Exception as e:
    print(f"Exception: {e!r}")
graphcat.notebook.display(graph)
INFO:graphcat.common:Task D updating.
INFO:graphcat.common:Task D executing. Inputs: {}
ERROR:graphcat.common:Task D failed. Exception: Whoops!
INFO:graphcat.common:Task A updating.
INFO:graphcat.common:Task B updating.
INFO:graphcat.common:Task C updating.
Exception: RuntimeError('Whoops!')
_images/tutorial_50_2.svg

As always, Graphcat ensures that task states remain consistent - when a task function fails (“D” in this case), execution stops, the task and its dependents are marked as being in the “error” state, and the update or output method that initiated the update re-raises the exception. This will keep happening as long as the error condition persists:

[24]:
try:
    print("Result:", graph.output("C"))
except Exception as e:
    print(f"Exception: {e!r}")
graphcat.notebook.display(graph)
INFO:graphcat.common:Task D updating.
INFO:graphcat.common:Task D executing. Inputs: {}
ERROR:graphcat.common:Task D failed. Exception: Whoops!
INFO:graphcat.common:Task A updating.
INFO:graphcat.common:Task B updating.
INFO:graphcat.common:Task C updating.
Exception: RuntimeError('Whoops!')
_images/tutorial_52_2.svg

Once the error is cleared up, things return to normal:

[25]:
graph.set_task("D", graphcat.constant(42))
print("Result:", graph.output("C"))
graphcat.notebook.display(graph)
INFO:graphcat.common:Task D updating.
INFO:graphcat.common:Task D executing. Inputs: {}
INFO:graphcat.common:Task D finished. Output: 42
INFO:graphcat.common:Task A updating.
INFO:graphcat.common:Task B updating.
INFO:graphcat.common:Task C updating.
INFO:graphcat.common:Task C executing. Inputs: {None: 42, greeting: Hello, subject: World}
INFO:graphcat.common:Task C finished. Output: Hello, World!
Result: Hello, World!
_images/tutorial_54_2.svg