Notes & Ideas - akvo/akvo-core-services GitHub Wiki
---------------------------------------- NOTES 29 Oct: Paul, Iván, Mark --------------------------
Links:
- http://zotonic.com/docs/0.9/manuals/datamodel/example.html
- http://clojureneo4j.info/
- http://www.neo4j.org/
- http://www.neo4j.org/learn
- http://www.youtube.com/watch?v=mxdpqr-loyA&feature=youtu.be
Triple store database structures
business logic expressed in a metadata-driven way
domain experts: Adrian, Laura, Lissy, etc. We need to provide a framework that empowers these types of users: model-driven development. In the same application, you describe entities, relationships, etc. Based on that information, you generate the UI, etc.
When doing an upgrade, the business logic / domain expertise remains intact; only the implementation changes.
It gets hard to understand RSR's data. The relationships are too complex for developers to reason about. Bottleneck: the domain experts can't manipulate the data structures, and the developers can't reason about them because of the complexity involved.
Solving the meta problem: build something that solves the problem at hand, but also handles other types of data. Don't only store data; also store meaning.
Distinction of raw data and views on the data (functions of the data)
Can we store data about FLOW / RSR in a single way?
neo4j - graph database: an edge corresponds to a predicate
Can we use the attribute approach of Datomic in Postgres? Somewhere in between relational and NoSQL.
Database as a value. Just a pointer to a transaction, from which you can derive the current state of the database.
opendatom - mimic the Datomic way of storing and handling data, in an open source way; one node will be a datom: entity, value, time
library that presents you an immutable facade, with the API of datomic
First transaction in database: what are the attributes that will be used.
name, cardinality, type, uniqueness, index, docstring.
An attribute needs to exist before you can use it in a transaction.
Example: Force.com technical description.
Datomic datom: entity (id), attribute (id), value (string), transaction(id)
Only a few tables are required in such an approach
A naive implementation of the Datomic approach can be performant enough for our demands
Force.com is a multitenant system.
The first part of the identifier is the tenant to which the data belongs
For example: UUID with first part replaced by consecutive tenant id.
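The tenant-prefix idea could be sketched like this (the 4-byte layout and the function names are assumptions; the notes only say "first part replaced by consecutive tenant id"):

```python
import uuid

def tenant_uuid(tenant_id: int) -> uuid.UUID:
    """Generate a UUID whose first 4 bytes encode the tenant id.

    The exact layout is an assumption for illustration.
    """
    raw = bytearray(uuid.uuid4().bytes)
    raw[0:4] = tenant_id.to_bytes(4, "big")  # overwrite prefix with tenant id
    return uuid.UUID(bytes=bytes(raw))

def tenant_of(u: uuid.UUID) -> int:
    """Recover the tenant id from the identifier prefix."""
    return int.from_bytes(u.bytes[0:4], "big")
```

Every row then carries its tenant in the identifier itself, which makes per-tenant filtering and sharding straightforward.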
Postgres also has PostGIS - might be useful
Postgres might have scalability problems
Concrete steps: Iván to model the Datomic way of storing data in Postgres
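A naive version of the datom table could look like the following minimal sketch. It uses sqlite3 so the snippet is self-contained; in practice this would be a Postgres table, and all names here are illustrative, not a real Datomic API:

```python
import sqlite3

# In-memory stand-in; in practice this would be a Postgres table.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE datoms (
        entity    INTEGER NOT NULL,
        attribute TEXT    NOT NULL,
        value     TEXT    NOT NULL,
        tx        INTEGER NOT NULL,
        added     INTEGER NOT NULL   -- 1 = assert, 0 = retract
    )
""")

def transact(tx, facts):
    """Append datoms; rows are never UPDATEd or DELETEd."""
    db.executemany(
        "INSERT INTO datoms VALUES (?, ?, ?, ?, ?)",
        [(e, a, v, tx, added) for (e, a, v, added) in facts],
    )

def current(entity, attribute):
    """Latest asserted value that has not since been retracted."""
    row = db.execute(
        """SELECT value, added FROM datoms
           WHERE entity = ? AND attribute = ?
           ORDER BY tx DESC LIMIT 1""",
        (entity, attribute),
    ).fetchone()
    return row[0] if row and row[1] else None

transact(1, [(17, "project/title", "Water point survey", 1)])
transact(2, [(17, "project/title", "Water point survey v2", 1)])
print(current(17, "project/title"))  # → Water point survey v2
```

Because every change is an append, the full history remains queryable, which is the property the notes ask for.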
--------------------------------------- NOTES 3 Oct ---------------------------
- goal: find a homogeneous architecture that minimises complexity, allows sharing skills and knowledge, and allows building products with less effort
- we now have Python stack + JVM stack + Apache/Nginx/Google App Engine etc
- June Amsterdam meeting: more service-oriented architecture.
- Needs to be a team effort
- Needs to be able to handle scale. We don't have a big data problem now, or a year from now; Postgres, for example, would be sufficient
First problem: storing data and retrieving
- Paul - Data is the lowest common denominator: FLOW + RSR are data capture and reporting services.
- Carl - The service breakdown is most important; each service could have its own datastore.
- Iván - Try to achieve a datastore stack that is easy for devops to maintain.
- Paul - Geo consideration: for example Couch + geo, a JSON document store.
Services we run are mostly read-focussed: retrieve data in a transformed way
-------------------------------------- Architecture description -------------- Edward Tufte on Dashboards: "When thinking about dashboards, you should start by considering the intellectual problems that the displays are supposed to help with. The point of information displays is to assist thinking; therefore, ask first of all: What are the thinking tasks that the displays are supposed to help with?"
#General considerations
A data system answers questions based on information that was acquired in the past. You answer questions on your data by running functions that take data as input.
####Definitions
- Information - is the general collection of knowledge relevant to your Data System. It is synonymous with the colloquial usage of the word "data".
- Data - will refer to the information that can't be derived from anything else. Data serves as the axioms from which everything else derives.
- Queries - are questions you ask of your data. For example, you query your financial transaction history to determine your current bank account balance.
- Views - are information that has been derived from your base data. They are built to assist with answering specific types of queries.
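The "queries are functions of data" idea from the bank-account example above might look like this (illustrative data and names):

```python
# Base data: the transaction history - facts that cannot be derived
# from anything else.
transactions = [
    {"account": "akvo", "amount": 100},
    {"account": "akvo", "amount": -30},
    {"account": "akvo", "amount": 55},
]

# A query is just a function of the data. The balance is a view
# derived from the raw transactions; it is never stored as base data.
def balance(txns, account):
    return sum(t["amount"] for t in txns if t["account"] == account)

print(balance(transactions, "akvo"))  # → 125
```

If the balance were ever wrong, it could simply be recomputed from the base data, which is what makes views disposable and the data human fault-tolerant.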
What do the tools do
- Capture - devices, web interfaces, external sources (IATI)
- Analyse - aggregate by properties, trends
- Delivery - dashboards, widgets, websites, reports, other export formats
What data do we have
- Open Aid: 70.000 projects = ~ 700.000 transactions
- RSR: 1000 projects, 1000 orgs, 3000 updates
- FLOW: 250.000 surveyInstances = ~5.000.000 facts
If we scale up by a factor of 10, we will have something like 50.000.000 "facts".
Possible shared services
authentication, logging, workflow, storage, memory, API
Sources and inspiration:
Components
The overall aim is to combine simple components that:
- transform
- move (queue)
- route
- record/remember
- display
Authentication
Datastore
The data system should manage:
- the storage and querying of data with a lifetime measured in years,
- encompassing every version of the application to ever exist,
- every hardware failure,
- and every human mistake ever made.
####Properties of data
- rawness: the data we care about are the facts, not the aggregates, which are views on data
- immutability / perpetuity: a fact has a time, and does not change: it is true forever
####Desired properties of the data system
- Robust and human fault-tolerant - immutable data, possibility of recomputation
- Low latency reads and updates - people expect changes to data to propagate immediately
- Scalable
- General - compute arbitrary views on datasets
- Extensible - adding a new view should be easy
- Allows ad hoc queries
- Easy to generate reports
- Minimal maintenance
- Debuggable
Small data (projects, surveys, question definitions) versus large data (facts, sensor data); these need different approaches
Paul - HTTP API preferably. Prefer not to build what has already been built.
Common goals:
- not to reinvent the wheel, use existing tech where available
- reusable core services that can be shared
- favour simplicity & composability
- ability to create reports from data
- find tech that can reduce sys admin needs
FLOW requirements & goals:
- to not use GAE, Amazon S3 (prefer open tools & systems instead) (current architecture makes it difficult to scale)
- data immutability & versioning so that survey changes can be tracked & audited
- API for reading FLOW data
RSR requirements & goals:
- RESTful API: CRUD (Create, Read, Update, Delete)
- Immutable API: CR only
Question - are all of these possible to achieve at the same time? Simplicity is at the core.
Question: is immutability exposed in the API? One option: immutability is an implementation detail, and things are exposed as CRUD.
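Treating immutability as an implementation detail could be sketched as follows (purely illustrative names; the point is that an "update" appends a retraction plus a new assertion rather than overwriting anything):

```python
# Hypothetical sketch: the REST layer exposes ordinary CRUD, but every
# operation is recorded as an append to an immutable log underneath.
log = []    # append-only list of (op, entity, field, value) events
state = {}  # mutable current view, derived from the log

def update(entity, field, value):
    # An "update" never overwrites history: it retracts the old fact,
    # asserts the new one, then refreshes the derived view.
    if (entity, field) in state:
        log.append(("retract", entity, field, state[(entity, field)]))
    log.append(("assert", entity, field, value))
    state[(entity, field)] = value

update("project-1", "title", "Old title")
update("project-1", "title", "New title")
# The client sees one current value; the log keeps both versions.
```

Clients get familiar CRUD semantics, while auditing and versioning fall out of the log for free.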
Reports - often a combination of systems that work together. Akvo Dashboard might need a batch + real-time system.
Carl, Gabriel - how to go from abstract to practical; keep users in mind. How big is the data? Big data is not only about volume, but also about the nature of the data (especially when the structure of that data is unknown) [do we really have data for which we don't know the structure?]. The views and analysis needed are quite unknown.
- Iñigo - FLOW needs better APIs to retrieve data.
- Lynn - simplicity is key, and speed of data display.
- Oliver - functional requirements should come first; composable systems + layers.
- Mark - we need to find some tools to think this through.
- Carl - http://12factor.net/, http://logstash.net/; we will do analysis from user requirements as well.
- Iván - need to think it through as a joint team.
- Thomas - think of what we do next / next steps; keep everyone involved.
Next steps: identify logical candidates for shared services: authentication & authorisation / roles & responsibilities, identity / trust / social graph, logging, notifications / email messaging, queuing / coordination. See http://logstash.net/
####Advantages of using immutable data
- simplicity: the dataset is an ever-growing list of facts
- the dataset is queryable at any time in its history
- the data is human fault tolerant
- the data storage and query processing layers are separate
- simplicity: no need for index / update, only append.
####Desired datastore capabilities - based on Datomic
- immutability - only assert/retract facts, no update / delete
- transactionality
- consistency
Our systems are heavier in reads than in writes, so high write scalability is not necessary.
As we would like to have the properties of immutability, transactionality, consistency, and also want to use open source tools, we might need a database layer that exposes an existing open source (mutable) database as an append-only, immutable datastore.
Possible candidates: HDFS, Postgres + database access layer?
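The "database as a value" idea from the 29 Oct notes (a pointer to a transaction, from which you derive the state) could be sketched like this on top of such an append-only layer (illustrative only, not a real Datomic API):

```python
# A "database value" is just a transaction id; any past state can be
# derived by replaying facts up to that transaction.
facts = []  # append-only: (tx, entity, attribute, value)
_tx = 0

def assert_fact(entity, attribute, value):
    global _tx
    _tx += 1
    facts.append((_tx, entity, attribute, value))
    return _tx  # the returned tx id *is* the database value

def as_of(tx, entity, attribute):
    """Value of (entity, attribute) as of transaction tx."""
    value = None
    for t, e, a, v in facts:
        if t <= tx and e == entity and a == attribute:
            value = v
    return value

t1 = assert_fact("survey-9", "status", "draft")
t2 = assert_fact("survey-9", "status", "published")
print(as_of(t1, "survey-9", "status"))  # → draft
print(as_of(t2, "survey-9", "status"))  # → published
```

This is what makes the dataset "queryable at any time in its history": old database values are never invalidated, only superseded.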
Queueing
Messaging makes applications loosely coupled by communicating asynchronously, which also makes the communication more reliable because the two applications do not have to be running at the same time
Reasons:
- Performance - improve response times by doing some tasks asynchronously
- Decoupling - reduce complexity by decoupling and isolating applications
- Scalability - build smaller apps that are easier to develop, debug, test, and scale, and distribute tasks across machines based on load
- High availability - get robustness and reliability through message queue persistence, potentially with zero-downtime redeploys
Possible candidates: HornetQ (used by Datomic), RabbitMQ, ActiveMQ (older), Apollo, Qpid, ZeroMQ, ZooKeeper.
Largest use: RabbitMQ (Erlang), followed by ActiveMQ (http://www.google.de/trends/explore?hl=en-US#q=activemq,+hornetq,+rabbitmq,+qpid&cmpt=q)
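The decoupling argument can be illustrated in-process with Python's stdlib `queue` as a stand-in for a real broker such as RabbitMQ (names and payloads are invented for the example; producer and consumer share only the queue, never calling each other directly):

```python
import queue
import threading

# In-process stand-in for a message broker: producer and consumer
# are decoupled and communicate asynchronously through the queue.
tasks = queue.Queue()

def producer():
    for n in range(3):
        tasks.put({"report_id": n})  # enqueue and return immediately
    tasks.put(None)                  # sentinel: no more work

def consumer(results):
    while True:
        msg = tasks.get()
        if msg is None:
            break
        results.append(f"report {msg['report_id']} generated")

results = []
t = threading.Thread(target=consumer, args=(results,))
t.start()
producer()
t.join()
print(results)  # three generated reports, in order
```

With a real broker the queue would additionally be persistent, so the two sides need not even be running at the same time, which is the reliability point made above.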
Caching
Possible candidates: Redis, memcached