Data Modeling - lenards/axos-star GitHub Wiki

Beginner's Mind

Shoshin is a Zen concept that is commonly called "Beginner's Mind", sometimes called "Don't Know Mind".

Wikipedia's entry summarizes it as:

It refers to having an attitude of openness, eagerness, and lack of preconceptions when studying a subject, even when studying at an advanced level, just as a beginner in that subject would.

The reason I mention this is that I've been told in everything I've read and watched (McFadin video series) that you need to jettison all your relation viewpoints and come to data modeling in Cassandra with a fresh mindset or a clear perspective.

There is even a recent article discussing the mindsets that need to be changed (or abandoned) as you approach data modeling with Cassandra.

How am I going to do that?

Start with the questions
Write what you wish was there
Backfill as needed

Quick Background (A Bit About the Use Case...)

Interesting, the schema for the TARDIS was created with any thought about the queries or usage of the results from those queries. So, I'm going to just start abstraction with the sorts of data that will be sent and how someone might request that back.

First, we're building a "provenance" system. It's similar to, maybe, how you'd think about web analytics (like the data tracked via Google Analytics, KISSmetrics, Keen.io, etc). But provenance is a whole area of research in academics. It is mainly about the ability to re-build all the operations that have been taken on a piece of data (or a data product of a software analysis). Heck, W3C has a whole specification for a data model (PROV-DM).

The unique contribution to the TARDIS project was the schema, so I'm going to ignore other data models and ontologies for provenance for now and just focus on that existing representation.

Start with the questions

The data we'll represent...

We're going to have services that will produce events about objects and those objects may have relationships with each other, and the events done by a person or a service that is, well, an actor. It's worth noting that objects might be shared between people.

Some initial thoughts about questions...

What are the last 20 events done by an actor (person or service)?
What are the last 20 objects acted (given a person or service)?
What are the objects related to a given object?
Or: Who are the ancestors/descendants for a given object?
Who is the parent for a given object?
What services has a person used? (or What are the last 5 services used by a person?)

I'm sure there are more questions, but this is more than enough to start with.

Working through each of the questions...

Let's take the first question, and see what information we think the client making the query will have:

What are the last 20 events done by an actor (person or service)?

We expect the client to have the person or service - so either a unique text or ID for those actors.

A simple CQL query might be:

SELECT * FROM events WHERE created_by = :person_id LIMIT 20;

Implications for our schema

We're ordering the events in a way that we can easily get the last 20, so we're going to want doing ORDER BY on the event's creation time. This means it will be called out in the CLUSTERING ORDER BY for the CREATE TABLE, and that it will be part of a compound PRIMARY KEY.

If we consider some data that might be provided, an initial sketch for events might be:

CREATE TABLE events (
  event_id uuid,
  event_name text,
  event_type text,
  created_at timestamp,
  created_by text,
  actor_id int,
  -- ... 
  PRIMARY KEY (event_id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

We don't know if we'll have an actor_id or some unique text, like created_by. But this question helps us see the need for the time component, created_at, to be part of the PRIMARY KEY and the query results indicate the need for ordering.

There is a problem though. We can't include created_by in the WHERE clause. We want to be about to query "events by person", but the person (the value of created_by) is not included as a primary key and not part of the cluster index. We can match by a primary, we do range queries on a clustering index, or we can fetch a slice of data using (again) the clustering index.

What do we do?

Perhaps we scratch out that first sketch of our schema and try something else. We still want our query to look something like "... WHERE created_by = person". So we can name or structure the table to support the query for that desire:

CREATE TABLE events_by_creator (
  event_id uuid,
  event_name text,
  event_type text,
  created_at timestamp,
  created_by text,
  actor_id int,
  -- ... 
  PRIMARY KEY (created_by, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

SELECT * FROM events_by_creator WHERE created_by = :person_id LIMIT 20;

Also, there is an unresolved bit here is how are the events representing the objects? That's not clear to me yet. We could have object(s) included as dynamic columns as we might have one to many. We could have objects represented as a collection, say a set of object-ids. I'm not sure how I want to deal with that yet, so I'm going to embrace the unknown and leave that unresolved for a bit.

What are the last 20 objects acted (given a person or service)?

Again, we expect the client to have the some identifying information about a person or service to provide.

A simple CQL query might be:

SELECT * FROM objects WHERE created_by = :person_id LIMIT 20;

This is pretty close to our query for (original) events table. But I'm guessing that objects will have different bits in it. So the object-id, a name, a description, a parent object, a created by, and a created at.

Implications for our schema

Again, we want the last N, where N is 20 for our query. And, we want to use the person in the WHERE, so that will change how we define the PRIMARY KEY. So, we have pretty much the same implications as events_by_creator.

CREATE TABLE objects_by_creator (
  object_id uuid,
  object_name text,
  object_desc desc,
  parent uuid, 
  created_at timestamp,
  created_by text,
  actor_id int,
  -- ...
  PRIMARY KEY (created_by, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

One thing that I can't shake is that parent fells like a foreign key and I'm not sure if there is a better way to handle this in the Cassandra Data Modeling world. We'll see.

What are the objects related to a given object? (or Who are the ancestors for a given object?)

This question brings up another challenge. We'd like to do something like:

SELECT * FROM objects WHERE parent = :some_object_id;
-- we'd *like* do to this, but can't...

The reason why is the use parent in the WHERE clause. It is not a partition key, and it is not a cluster key. An instinct might be to say, "oh - secondary index". But I believe the cardinality of a uuid is going to quite high and for Secondary Indexes, we want low cardinality things like "States name in the United States".

We'll want to consider other approaches to tackle this questions. Let's pass on it for now.

Next question for now...

Who is the parent for a given object?

We would have had a seemingly simple answer to this query if we'd done objects table similar to our originals thoughts on events. It's not clear if we should go that way, but we know objects_by_creator will not allow object_id as a field in the WHERE clause. We might also consider how this question relates to object ancestors & descendants questions we just passed on.

Again, we cannot just do this:

-- **won't work**
SELECT parent FROM objects_by_creator WHERE object_id = :obj_id;

Without object_id being part of the definition of a PRIMARY KEY (either as a partition key or a clustering key), it should be used in the WHERE clause.

We may want to group the questions around objects and their relations to each other and consider those as a subset. Let's defer this question like we did the other ones about object lineage (ancestors/descendants).

Our final question (for now).

It's not clear how we're capturing the services, the programs that are reporting the activity in the way of events by actors. But, there is not anything in events nor objects that provide us with a way to answer this question. We'd like need to have something like a materialized view, and it's cool if we duplicate some data.

What services has a person used? (or What are the last 5 services used by a person?)

SELECT * FROM service_activity WHERE used_by = :person_id;
SELECT * FROM service_activity WHERE used_by = :person_id LIMIT 5;

We're not going to just have services as our rows. So we're keying off of the person and then we're going to have want to order again so that the latest activity is easily fetchable.

CREATE TABLE service_activity (
  used_by text,
  service_name text,
  occurrence timestamp,
  PRIMARY KEY (used_by, occurrence)
) WITH CLUSTERING ORDER BY (occurrence DESC);

Now, the services that are used_by someone will be dynamically inserted as columns (we've created a wide row - it's just like "Time Series Pattern 1" discussed in Patrick McFadin's blog post).

Let's kick the tires on this table a bit more:

cqlsh> CREATE TABLE service_activity (
   ...   used_by text,
   ...   service_name text,
   ...   occurrence timestamp,
   ...   PRIMARY KEY (used_by, occurrence)
   ... ) WITH CLUSTERING ORDER BY (occurrence DESC);
cqlsh> INSERT INTO service_activity (used_by, service_name, occurrence)  VALUES ('lenards', 'file-upload', unixTimestampOf(now()));
... 
cqlsh> SELECT * FROM service_activity WHERE used_by = 'lenards'; 

 used_by | occurrence               | service_name
---------+--------------------------+--------------
 lenards | 2014-08-22 14:59:13-0700 |  file-upload
 lenards | 2014-08-22 14:59:13-0700 |  file-upload
 lenards | 2014-08-22 14:59:12-0700 |  file-upload
 lenards | 2014-08-22 14:59:11-0700 |  file-upload
 lenards | 2014-08-22 14:59:11-0700 |  file-upload
 lenards | 2014-08-22 14:59:10-0700 |  file-upload
 lenards | 2014-08-22 14:59:09-0700 |  file-upload
 lenards | 2014-08-22 14:59:09-0700 |  file-upload
 lenards | 2014-08-22 14:59:07-0700 |  file-upload

(9 rows)

Even though the visual output shows this growth dimension as going down, it really is growing within the row (which is what McFadin shows for "Time Series Pattern 1").

used_by  | ... | occurrence:2014-08-22 14:59:13-0700 | occurrence:2014-08-22 14:59:13-0700 | ... |
---------+-----+-------------------------------------+-------------------------------------+-----+
 lenards | ... | service_name: file-upload           | service_name: file-upload           | ... |

It took me long time to overcome this. I was told that wide rows grow horizontally, but approaching from CQL I was really never quite sure when I was dealing with a wide row or not.

Also, We could use "Time Series Pattern 2" to break up the wide row and split it into segments by, say, year. You might call that partitioning, and that'd be right - but it seems partition is overloaded and I see it more clearly as "segments" (might just be me).

Our answer to the question...

With the final question of this section,

What services has a person used? (or What are the last 5 services used by a person?)

we can see that our query,

SELECT * FROM service_activity WHERE used_by = 'lenards' LIMIT 5;

actually does answer our question:

cqlsh> SELECT service_name, occurrence FROM service_activity WHERE used_by = 'lenards' LIMIT 5;

 service_name | occurrence
--------------+--------------------------
  file-upload | 2014-08-22 15:04:33-0700
  file-upload | 2014-08-22 15:04:33-0700
  file-upload | 2014-08-22 15:04:32-0700
  file-upload | 2014-08-22 14:59:13-0700
  file-upload | 2014-08-22 14:59:13-0700

(5 rows)

And another thing...

Since we have a cluster key, occurrence, we can also do range queries.

cqlsh> SELECT * FROM service_activity WHERE used_by = 'lenards'
   ... AND occurrence > '2014-08-22 14:59:10'
   ... AND occurrence < '2014-08-22 15:00:00';

 used_by | occurrence               | service_name
---------+--------------------------+--------------
 lenards | 2014-08-22 14:59:13-0700 |  file-upload
 lenards | 2014-08-22 14:59:13-0700 |  file-upload
 lenards | 2014-08-22 14:59:12-0700 |  file-upload
 lenards | 2014-08-22 14:59:11-0700 |  file-upload
 lenards | 2014-08-22 14:59:11-0700 |  file-upload
 lenards | 2014-08-22 14:59:10-0700 |  file-upload

(6 rows)

cqlsh> SELECT * FROM service_activity WHERE used_by = 'lenards' and occurrence > '2014-08-22 15:00:00';

 used_by | occurrence               | service_name
---------+--------------------------+--------------
 lenards | 2014-08-22 15:04:33-0700 |  file-upload
 lenards | 2014-08-22 15:04:33-0700 |  file-upload
 lenards | 2014-08-22 15:04:32-0700 |  file-upload

(3 rows)

... below be placeholders

Resolving the connection between `events` & `objects`

Resolve object/parent relationships

What are the objects related to a given object? ** Or: Who are the ancestors/descendants for a given object?