Notes & Ideas - akvo/akvo-core-services GitHub Wiki
---------------------------------------- NOTES 29 Oct: Paul, Iván, Mark --------------------------
Links:
- http://zotonic.com/docs/0.9/manuals/datamodel/example.html
- http://clojureneo4j.info/
- http://www.neo4j.org/
- http://www.neo4j.org/learn
- http://www.youtube.com/watch?v=mxdpqr-loyA&feature=youtu.be
Triple store database structures
business logic expressed in a metadata-driven way
domain experts: Adrian, Laura, Lissy, etc. We need to provide a framework that empowers these types of users: model-driven development. In the same application, you describe entities, relationships, etc. Based on that information, you generate the UI, etc.
When doing an upgrade, the business logic / domain expertise remains intact; only the implementation changes.
It gets hard to understand RSR's data. The relationships are too complex for developers to reason about. Bottleneck: the domain experts can't manipulate the data structures, and the developers can't reason about them because of the complexity involved.
Solving the meta problem: build something that solves the problem at hand, but also handles other types of data. Don't only store data; also store meaning.
Distinction of raw data and views on the data (functions of the data)
Can we store data about FLOW / RSR in a single way?
neo4j - graph database: an edge corresponds to a predicate
Can we use the attribute approach of Datomic in Postgres? Somewhere in between relational and NoSQL.
Database as a value. Just a pointer to a transaction, from which you can derive the current state of the database.
opendatom - mimic the Datomic way of storing and handling data, in an open source way; one node will be a datom: entity, value, time
library that presents you an immutable facade, with the API of datomic
First transaction in database: what are the attributes that will be used.
name, cardinality, type, uniqueness, index, docstring.
An attribute needs to exist before you can use it in a transaction.
Example: Force.com technical description.
Datomic datom: entity (id), attribute (id), value (string), transaction(id)
Only a few tables are required in such an approach
A naive implementation of the Datomic approach can be performant enough for our demands
Force.com is a multitenant system.
The first part of the identifier is the tenant to which the data belongs
For example: UUID with first part replaced by consecutive tenant id.
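The tenant-prefix idea could be sketched like this (the 4-byte layout and the function names are assumptions; the notes only say "first part replaced by consecutive tenant id"):

```python
import uuid

def tenant_uuid(tenant_id: int) -> uuid.UUID:
    """Generate a UUID whose first 4 bytes encode the tenant id.

    The exact layout is an assumption for illustration.
    """
    raw = bytearray(uuid.uuid4().bytes)
    raw[0:4] = tenant_id.to_bytes(4, "big")  # overwrite prefix with tenant id
    return uuid.UUID(bytes=bytes(raw))

def tenant_of(u: uuid.UUID) -> int:
    """Recover the tenant id from the identifier prefix."""
    return int.from_bytes(u.bytes[0:4], "big")
```

Every row then carries its tenant in the identifier itself, which makes per-tenant filtering and sharding straightforward.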
Postgres also has PostGIS - might be useful
Postgres might have scalability problems
Concrete steps: Iván to model the Datomic way of storing data in Postgres
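A naive version of the datom table could look like the following minimal sketch. It uses sqlite3 so the snippet is self-contained; in practice this would be a Postgres table, and all names here are illustrative, not a real Datomic API:

```python
import sqlite3

# In-memory stand-in; in practice this would be a Postgres table.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE datoms (
        entity    INTEGER NOT NULL,
        attribute TEXT    NOT NULL,
        value     TEXT    NOT NULL,
        tx        INTEGER NOT NULL,
        added     INTEGER NOT NULL   -- 1 = assert, 0 = retract
    )
""")

def transact(tx, facts):
    """Append datoms; rows are never UPDATEd or DELETEd."""
    db.executemany(
        "INSERT INTO datoms VALUES (?, ?, ?, ?, ?)",
        [(e, a, v, tx, added) for (e, a, v, added) in facts],
    )

def current(entity, attribute):
    """Latest asserted value that has not since been retracted."""
    row = db.execute(
        """SELECT value, added FROM datoms
           WHERE entity = ? AND attribute = ?
           ORDER BY tx DESC LIMIT 1""",
        (entity, attribute),
    ).fetchone()
    return row[0] if row and row[1] else None

transact(1, [(17, "project/title", "Water point survey", 1)])
transact(2, [(17, "project/title", "Water point survey v2", 1)])
print(current(17, "project/title"))  # → Water point survey v2
```

Because every change is an append, the full history remains queryable, which is the property the notes ask for.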
--------------------------------------- NOTES 3 Oct ---------------------------
- goal: find a homogeneous architecture that minimises complexity, allows sharing skills and knowledge, and allows building products with less effort
- we now have Python stack + JVM stack + Apache/Nginx/Google App Engine etc
- June Amsterdam meeting: more service-oriented architecture.
- Needs to be a team effort
- Needs to be able to handle scale. We don't have a big data problem now, or a year from now; Postgres, for example, would be sufficient
First problem: storing data and retrieving
- Paul - Data is the lowest common denominator: FLOW + RSR are data capture and reporting services.
- Carl - The service breakdown is most important; each service could have its own datastore.
- Iván - Try to achieve a datastore stack that is easy for devops to maintain.
- Paul - Geo consideration: for example Couch + geo, a JSON document store.
Services we run are mostly read-focussed: retrieve data in a transformed way
-------------------------------------- Architecture description -------------- Edward Tufte on Dashboards: "When thinking about dashboards, you should start by considering the intellectual problems that the displays are supposed to help with. The point of information displays is to assist thinking; therefore, ask first of all: What are the thinking tasks that the displays are supposed to help with?"
#General considerations
A data system answers questions based on information that was acquired in the past. You answer questions on your data by running functions that take data as input.
####Definitions
- Information - is the general collection of knowledge relevant to your Data System. It is synonymous with the colloquial usage of the word "data".
- Data - will refer to the information that can't be derived from anything else. Data serves as the axioms from which everything else derives.
- Queries - are questions you ask of your data. For example, you query your financial transaction history to determine your current bank account balance.
- Views - are information that has been derived from your base data. They are built to assist with answering specific types of queries.
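The "queries are functions of data" idea from the bank-account example above might look like this (illustrative data and names):

```python
# Base data: the transaction history - facts that cannot be derived
# from anything else.
transactions = [
    {"account": "akvo", "amount": 100},
    {"account": "akvo", "amount": -30},
    {"account": "akvo", "amount": 55},
]

# A query is just a function of the data. The balance is a view
# derived from the raw transactions; it is never stored as base data.
def balance(txns, account):
    return sum(t["amount"] for t in txns if t["account"] == account)

print(balance(transactions, "akvo"))  # → 125
```

If the balance were ever wrong, it could simply be recomputed from the base data, which is what makes views disposable and the data human fault-tolerant.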
What do the tools do
- Capture - devices, web interfaces, external sources (IATI)
- Analyse - aggregate by properties, trends
- Delivery - dashboards, widgets, websites, reports, other export formats
What data do we have
- Open Aid: 70.000 projects = ~ 700.000 transactions
- RSR: 1000 projects, 1000 orgs, 3000 updates
- FLOW: 250.000 surveyInstances = ~5.000.000 facts
If we scale up by a factor of 10, we will have something like 50.000.000 "facts".
Possible shared services
authentication, logging, workflow, storage, memory, API
Sources and inspiration:
Components
The overall aim is to combine simple components that:
- transform
- move (queue)
- route
- record/remember
- display
Authentication
Datastore
The data system should manage:
- the storage and querying of data with a lifetime measured in years,
- encompassing every version of the application to ever exist,
- every hardware failure,
- and every human mistake ever made.
####Properties of data
- rawness: the data we care about are the facts, not the aggregates, which are views on data
- immutability / perpetuity: a fact has a time, and does not change: it is true forever
####Desired properties of the data system
- Robust and human fault-tolerant - immutable data, possibility of recomputation
- Low latency reads and updates - people expect changes to data to propagate immediately
- Scalable
- General - compute arbitrary views on datasets
- Extensible - adding a new view should be easy
- Allows ad hoc queries
- Easy to generate reports
- Minimal maintenance
- Debuggable
Small data (projects, surveys, question definitions) versus large data (facts, sensor data); these need different approaches
Paul - HTTP API preferably. Prefer not to build what has already been built.
Common goals:
- not to reinvent the wheel, use existing tech where available
- reusable core services that can be shared
- favour simplicity & composability
- ability to create reports from data
- find tech that can reduce sys admin needs
FLOW requirements & goals:
- to not use GAE, Amazon S3 (prefer open tools & systems instead) (current architecture makes it difficult to scale)
- data immutability & versioning so that survey changes can be tracked & audited
- API for reading FLOW data
RSR requirements & goals:
- RESTful API: CRUD (Create, Read, Update, Delete)
- Immutable API: CR only
Question - are all of these possible to achieve at the same time? Simplicity is at the core.
Question: is immutability exposed in the API? One option: immutability is an implementation detail, and things are exposed as CRUD.
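Treating immutability as an implementation detail could be sketched as follows (purely illustrative names; the point is that an "update" appends a retraction plus a new assertion rather than overwriting anything):

```python
# Hypothetical sketch: the REST layer exposes ordinary CRUD, but every
# operation is recorded as an append to an immutable log underneath.
log = []    # append-only list of (op, entity, field, value) events
state = {}  # mutable current view, derived from the log

def update(entity, field, value):
    # An "update" never overwrites history: it retracts the old fact,
    # asserts the new one, then refreshes the derived view.
    if (entity, field) in state:
        log.append(("retract", entity, field, state[(entity, field)]))
    log.append(("assert", entity, field, value))
    state[(entity, field)] = value

update("project-1", "title", "Old title")
update("project-1", "title", "New title")
# The client sees one current value; the log keeps both versions.
```

Clients get familiar CRUD semantics, while auditing and versioning fall out of the log for free.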
Reports - often a combination of systems that work together. Akvo Dashboard might need a batch + real-time system.
Carl, Gabriel - how to go from abstract to practical; keep users in mind. How big is the data? Big data is not only about volume, but also about the nature of the data (especially when the structure of that data is unknown) [do we really have data for which we don't know the structure?]. The views and analysis needed are quite unknown.
- Iñigo - FLOW needs better APIs to retrieve data.
- Lynn - simplicity is key, and speed of data display.
- Oliver - functional requirements should come first; composable systems + layers.
- Mark - we need to find some tools to think this through.
- Carl - http://12factor.net/, http://logstash.net/; we will do analysis from user requirements as well.
- Iván - need to think it through as a joint team.
- Thomas - think of what we do next / next steps; keep everyone involved.
Next steps: identify logical candidates for shared services: authentication & authorisation / roles & responsibilities, identity / trust / social graph, logging, notifications / email messaging, queuing / coordination. See http://logstash.net/
####Advantages of using immutable data
- simplicity: the dataset is an ever-growing list of facts
- the dataset is queryable at any time in its history
- the data is human fault tolerant
- the data storage and query processing layers are separate
- simplicity: no need for index / update, only append.
####Desired datastore capabilities - based on Datomic
- immutability - only assert/retract facts, no update / delete
- transactionality
- consistency
Our systems are heavier in reads than in writes, so high write scalability is not necessary.
As we would like to have the properties of immutability, transactionality, consistency, and also want to use open source tools, we might need a database layer that exposes an existing open source (mutable) database as an append-only, immutable datastore.
Possible candidates: HDFS, Postgres + database access layer?
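The "database as a value" idea from the 29 Oct notes (a pointer to a transaction, from which you derive the state) could be sketched like this on top of such an append-only layer (illustrative only, not a real Datomic API):

```python
# A "database value" is just a transaction id; any past state can be
# derived by replaying facts up to that transaction.
facts = []  # append-only: (tx, entity, attribute, value)
_tx = 0

def assert_fact(entity, attribute, value):
    global _tx
    _tx += 1
    facts.append((_tx, entity, attribute, value))
    return _tx  # the returned tx id *is* the database value

def as_of(tx, entity, attribute):
    """Value of (entity, attribute) as of transaction tx."""
    value = None
    for t, e, a, v in facts:
        if t <= tx and e == entity and a == attribute:
            value = v
    return value

t1 = assert_fact("survey-9", "status", "draft")
t2 = assert_fact("survey-9", "status", "published")
print(as_of(t1, "survey-9", "status"))  # → draft
print(as_of(t2, "survey-9", "status"))  # → published
```

This is what makes the dataset "queryable at any time in its history": old database values are never invalidated, only superseded.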
Queueing
Messaging makes applications loosely coupled by communicating asynchronously, which also makes the communication more reliable because the two applications do not have to be running at the same time
Reasons:
- Performance - improve response times by doing some tasks asynchronously
- Decoupling - reduce complexity by decoupling and isolating applications
- Scalability - build smaller apps that are easier to develop, debug, test, and scale, and distribute tasks across machines based on load
- High availability - get robustness and reliability through message queue persistence, potentially with zero-downtime redeploys
Possible candidates: HornetQ (used by Datomic), RabbitMQ, ActiveMQ (older), Apollo, Qpid, ZeroMQ, ZooKeeper.
Largest use: RabbitMQ (Erlang), followed by ActiveMQ (http://www.google.de/trends/explore?hl=en-US#q=activemq,+hornetq,+rabbitmq,+qpid&cmpt=q)
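The decoupling argument can be illustrated in-process with Python's stdlib `queue` as a stand-in for a real broker such as RabbitMQ (names and payloads are invented for the example; producer and consumer share only the queue, never calling each other directly):

```python
import queue
import threading

# In-process stand-in for a message broker: producer and consumer
# are decoupled and communicate asynchronously through the queue.
tasks = queue.Queue()

def producer():
    for n in range(3):
        tasks.put({"report_id": n})  # enqueue and return immediately
    tasks.put(None)                  # sentinel: no more work

def consumer(results):
    while True:
        msg = tasks.get()
        if msg is None:
            break
        results.append(f"report {msg['report_id']} generated")

results = []
t = threading.Thread(target=consumer, args=(results,))
t.start()
producer()
t.join()
print(results)  # three generated reports, in order
```

With a real broker the queue would additionally be persistent, so the two sides need not even be running at the same time, which is the reliability point made above.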
Caching
Possible candidates: Redis, memcached