
Break out QA into a separate service #65

Open
rossjones opened this issue Jun 23, 2014 · 29 comments

@rossjones
Contributor

What?

Unlike the datastore, I think QA is a viable candidate for breaking out into a separate service, as there is no hard requirement for it to be performant (not a real word, I know) and the bindings to CKAN could be conceptually looser than the current implementations imply.

Why?

The ckanext-qa/ckanext-archiver combo is hard to set up and can consume a rather large amount of resources. More value could be provided to the wider open data community by opening up the QA processing.

How?

The QA service should be easier to set up, with a simpler interface for the caller. Having a separate service that responds to POST requests with asynchronous responses would provide value to others not using CKAN as well. QA/Archiver doesn't have any hard dependencies on CKAN (other than various branches writing directly to the DB, or using the API) and could be modified to just POST a response in a known format.

The QA extension itself would become a thin extension that POSTs resource changes to the service, plus an API endpoint for receiving responses and writing them to known extra fields on the resource.
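A minimal sketch of the thin extension's side of that, assuming a hypothetical service endpoint and callback URL (none of these names are an agreed API):

import requests

QA_SERVICE_URL = "https://qa.example.org/check"  # hypothetical service endpoint

def notify_qa_service(resource):
    # POST the changed resource to the QA service; the service replies
    # asynchronously by POSTing its results to our callback endpoint.
    payload = {
        "resource_id": resource["id"],
        "url": resource["url"],
        # Illustrative: where the service should POST its response.
        "callback_url": "https://ckan.example.org/api/qa/results",
    }
    requests.post(QA_SERVICE_URL, json=payload, timeout=10).raise_for_status()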

@davidread

This is a good aim, to provide QA on all CKANs everywhere.

It might be worth breaking it down a bit to decide what you want from the Archiver/QA features as it would be simpler to only have some:

  • broken link checks (archiver)
  • cache of the data (archiver)
  • file hash / length (archiver)
  • format detection (qa)
  • five stars of openness (qa)
  • feeding the file into other processes, e.g. to check particular CSVs against a schema etc. (archiver)

Doing the QA of a file is pretty quick - you can do 50k resources in a couple of hours or so, since it is just a matter of running file/magic on it or doing other analysis on the first 1k. The slow bits are downloading the file (actually done by the archiver) and, if you start going over the API, reading the resource and writing it back.

There are really two different use cases - one-off checks and the whole of CKAN in bulk. For the latter, you could read the resources in one batch by getting hold of the CKAN dump, and being able to write the results back in batch would be good too.

One more complication that could usefully be dealt with is files that change every time you read them, meaning you write to the resource every time you check (hash, file length etc.). That makes no sense: it creates excessive revisions/history and false alerts for watchers.

@nigelbabu

Do you mean it'd be a separate service like datapusher is right now?

@rossjones
Contributor Author

Perhaps, but I was really thinking about something you can install independently of CKAN with a nice simple API/interface, that had no dependency on CKAN itself.

There's nothing in QA that needs CKAN, and I think it is valuable to others - like csvlint.io but more complete.

@rufuspollock
Member

It might be good to start by thinking independently of the exact implementation and focus on getting a good list of user stories that we're trying to solve and then work from that to the simplest/best solution.

@rossjones
Contributor Author

My first pass, off the top of my head.

QA User Stories

As a user
I want to see the quality of datasets
So that I don't need to download/preview it and check it.

As a user
I want to be able to see that the data is in the correct format
So that I don't get disappointed when the CSV is actually a photo, of a browser, on a laptop, inside a PDF file.

As a user
I want to know that the data is valid
So that I don't have to find someone to munge it.

As a user
I want to be able to see the five star rating for data
So that I know how useful the data is.

As a user, developer, sysadmin
I want to be able to see the full history of a data file
So that I can see how it has changed over time.

As a data-portal developer
I want to be able to host a QA/Archive service
So that my data is archived and checked for quality

As a data-portal developer
I want to have a low-impact experience when obtaining quality info
So that I don't have to spend a lot of time upgrading/testing/changing the software

As a data-portal developer
I want the option to use someone else's service
So that I don't have to run my own

As a data-portal developer
I want the service to be able to POST back results
So that my integration point is simple.

As a data-portal developer
I want the service to be able to use webhooks
So that I can trigger another process once the file is archived and checked.

As a data-portal developer/sysadmin
I want to receive notifications when files become 404s
So that I can do something about it.

As a developer
I want to have a simple API to submit data for checking
So that I don't have to maintain a lot of code around QA

As a system administrator
I want to know my datasets are archived
So that I can ensure they are always available

As a system administrator
I want to know when data files are now generating 404s
So that I can take steps to fix it, or amend my metadata.

As a system administrator, data-portal developer
I want the service to pre-generate a JSON-based preview of the data (if tabular)
So that I don't need to depend on regenerating it each time.

@rossjones
Contributor Author

I think what I'm suggesting is that ckanext-qa becomes a very simple extension, delegating the work out to a centralised service (although obviously it should be possible to host your own). Something like this:

[diagram: CKAN POSTs a "please check this link" request (with a token) to the QA service, which later POSTs the result back to a callback URL]

The 'magic' section is the archiving, the preview pre-generation, and the QA checking, although it could incorporate things like csvlint.io. Also possibly s/token/apikey/.

Although the more I think about it, the more this really isn't very CKAN-specific, and I wonder if this is really the right place to discuss it.

@rufuspollock
Member

@rossjones I think this is spot-on and totally agree with the separation of concerns. In fact, the original QA design had this, and we were focusing on the "microformat" for storing QA results on datasets and resources.

I would emphasize KISS here in terms of next steps - whilst it is true that one could make something very generic (and very powerful), it might be worth focusing on just a few use cases (e.g. just QA, leaving out previewing) and trying to implement that.

@wardi
Contributor

wardi commented Sep 11, 2014

If the services are always calling back to a ckan instance then auth on the services could be quite simple.

Each service would have a list of CKAN instances (API URLs) it's willing to do work for. Every time a job is started from the instance it would contact the service and prove that the work request came from this instance by having the service "phone back" to the instance API to verify the request and establish a token for the job. This makes it possible to handle all user authorization on the ckan side, keeping the services simple. We probably don't even need API keys for the services.
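A rough sketch of that flow from the service's side, with a hypothetical verification endpoint on the CKAN instance:

import requests

# CKAN instances (API URLs) this service is willing to do work for.
ALLOWED_INSTANCES = {"https://demo.ckan.org"}

def verify_job(instance_url, job_id):
    # Refuse instances we don't know about.
    if instance_url not in ALLOWED_INSTANCES:
        return None
    # "Phone back" to the instance to confirm it really issued this job
    # and to establish a token for it (the endpoint name is made up here).
    response = requests.post(
        instance_url + "/api/action/qa_verify_job",
        json={"job_id": job_id},
        timeout=10,
    )
    return response.json().get("token") if response.ok else None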

@rossjones
Contributor Author

That would be nice and clean, and would make it easy to generalise a lot of the CKAN side for handling those tokens.

@seanh

seanh commented Sep 12, 2014

I've started work on a broken link checking "service" for CKAN here: https://github.com/seanh/ckanext-deadoralive that's similar to Ross's idea. Feedback welcome.

(Many of the design decisions are Ross and Ian's ideas from IRC...)

It's just for checking broken links at the moment, not doing five-star ratings or detecting file formats etc. That's all we need for the project I'm doing this for. It could easily be extended to do the other things as well (but I wonder if each should be a separate service?)

I hope this'll be much simpler and more reliable than ckanext-qa.

It's implemented in two parts: a CKAN plugin (trivial to install, with no dependencies or setup) and a CKAN API client designed to be run manually in a terminal or by cron (also no dependencies or setup - just copy-paste a line into crontab).

I'm doing it as a cron job rather than a web service because it's much quicker and simpler to implement, gets you 90% of the way there, and is very instructive for how a web service could be designed. As long as it communicates with CKAN purely via the API and can be run on a different machine from where CKAN is installed (so not a paster command), then the code should be completely reusable if you were to wrap it in a web service.

The way it works is:

  1. Cron or a human runs deadoralive.py, passing it a CKAN site's URL and an API key.
  2. deadoralive.py makes an HTTP request to the CKAN API: "Give me N resources to check for broken links" (this is a custom CKAN API endpoint provided by the ckanext-deadoralive extension).
  3. CKAN sends back N resource IDs.
  4. deadoralive.py loops over the resource IDs. For each one it calls CKAN's resource_show() API to get the resource URL, checks whether the URL is working or not, and posts the result to CKAN (another custom API endpoint).
  5. The CKAN plugin saves all the results in a database table, and uses them to display broken link reports.
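Condensed into code, one run of the checker might look like this (a sketch: the custom action names are illustrative, not necessarily the real ones):

import requests

CKAN_URL = "https://demo.ckan.org"
API_KEY = "<api key>"

def url_is_alive(url):
    # Treat any non-error HTTP status as a working link.
    try:
        return requests.get(url, timeout=30).status_code < 400
    except requests.RequestException:
        return False

def run_once(n=50):
    headers = {"Authorization": API_KEY}
    # Steps 2-3: ask the plugin's custom endpoint for resources to check.
    ids = requests.get(
        CKAN_URL + "/api/action/deadoralive_get_resources_to_check",
        headers=headers, params={"n": n}, timeout=30,
    ).json()["result"]
    for resource_id in ids:
        # Step 4: fetch the current URL right before checking it.
        resource = requests.get(
            CKAN_URL + "/api/action/resource_show",
            headers=headers, params={"id": resource_id}, timeout=30,
        ).json()["result"]
        # Post the result back to the plugin (another custom endpoint).
        requests.post(
            CKAN_URL + "/api/action/deadoralive_upsert_result",
            headers=headers,
            json={"resource_id": resource_id, "alive": url_is_alive(resource["url"])},
            timeout=30,
        )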

The link checker is completely dumb and has no state.

The CKAN plugin is quite careful about which resource IDs it returns. It's designed so that you can potentially have many deadoralive.py link checker processes running at the same time, either on the same machine or distributed, but neither the link checker services nor CKAN need to know how many link checkers there are. Each one just asks for as many resource IDs as it wants to process, and CKAN sends different resource IDs to each of them.

It first sends resources that have never been checked (ordered oldest resources first), then if there are no more of those it starts re-sending resources that were last checked more than 24 hours ago. After those it will resend resources that have a pending check (i.e. CKAN hasn't received the result back from the link checker yet) from more than 2 hours ago. (It has to keep track of pending checks so that if multiple link checkers are requesting links to check simultaneously CKAN doesn't send the same link to two different checkers at the same time.)
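That prioritisation boils down to three buckets served in order. A simplified sketch (the field names are assumptions about the plugin's table, not its actual schema):

import datetime

RECHECK_AFTER = datetime.timedelta(hours=24)
PENDING_TIMEOUT = datetime.timedelta(hours=2)

def resources_to_check(rows, now, n):
    # rows: dicts with 'created', 'last_checked' and 'last_sent' timestamps,
    # where last_sent marks a pending check whose result hasn't come back yet.
    never_checked = sorted(
        (r for r in rows if r["last_checked"] is None and r["last_sent"] is None),
        key=lambda r: r["created"],  # oldest resources first
    )
    due_recheck = [
        r for r in rows
        if r["last_checked"] and now - r["last_checked"] > RECHECK_AFTER
    ]
    stale_pending = [
        r for r in rows
        if r["last_checked"] is None and r["last_sent"]
        and now - r["last_sent"] > PENDING_TIMEOUT
    ]
    return (never_checked + due_recheck + stale_pending)[:n]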

Eventually it will send back empty lists, once all the resources have been checked recently. But then after a while it'll start sending back resource IDs again, once it's time to re-check them. The checkers are dumb: they just keep periodically asking for more links to check.

So this is resilient to link checker service failures and can handle an unknown number of link checkers running simultaneously.

This is rough but implemented and working now. The major piece that isn't done is the display layer: I need to add code to the CKAN plugin that builds on top of the link check results it has in the db and provides "this resource is broken" notices on resource pages, a report page listing datasets with broken resources grouped by organization, etc.

That should be quite straightforward although it may have to query multiple tables of the CKAN db directly to build reports like a summary of datasets with broken links grouped by organization.

Thoughts

In Ross's diagram he has CKAN posting "please check this link" to the QA service (presumably CKAN will do this each time a new resource is created, or a resource URL is changed?), and then the QA service calling a callback URL with the check result. I've got it the other way round: the service calls CKAN to say "please give me some links to check." I think my way avoids a few problems with Ross's version:

  • What if someone creates a new resource and CKAN sends a "please check this link" request, and then CKAN never gets a result back? You can't assume that the link checker service is 100% reliable. CKAN would have to periodically check for datasets that it doesn't have results for and request them again. But we want to avoid having any kind of periodic task running on the CKAN end (we don't want to have to set up and maintain Celery or anything like that).
  • We don't just want to check links when resources are created or updated, we want to check them periodically. A link may be working when the resource is created, then go down, then come up again. We want to keep checking each link every day or week, but again we don't want any periodic tasks on the CKAN end.

With my approach, the CKAN plugin has no dependencies and isn't running any long-running, periodic or asynchronous tasks. The link checker service runs periodically.


In the current implementation, the link checker gets a list of resource IDs from CKAN and has to call resource_show() on each of them to get the URL. Why doesn't CKAN just send the URLs along with the resource IDs in the first place? The reason is that it may be sending hundreds of resource IDs to a link checker at once and there may be a period of time between a resource ID being sent, and the link checker actually checking it. The resource's URL may change during this time. So this way the link checker always gets the current URL right before checking it.

This also avoids a problem with private resources. If there was a bug in the deadoralive plugin that made it send the IDs of private resources to link checkers (and currently, there is such a bug!) then it's only leaking the resource ID and not the URL. The link checker will get refused when it tries to call resource_show() and will move on to the next resource.

Still, this does mean that each link checker has to call the resource_show() API before checking each link. But how much do we really care about checking lots of links very fast?


Being coupled to CKAN: The deadoralive.py link checker API script is CKAN-specific. It's calling the API of the deadoralive CKAN plugin, using CKAN resource IDs, and calling the CKAN core API.

Particularly if we were going to make this into a web service, could we change the design to make it not tied to CKAN?

Would it even be worthwhile to do so?

Are there likely to be any non-CKAN users of a QA service?

It seems any QA microservice is going to have to define an API of its own, and then we're gonna have to write a CKAN plugin to make CKAN use that API, and any non-CKAN applications are gonna need their own plugins to use it as well. Is this likely to happen? Or realistically are we just reimplementing ckanext-qa for CKAN in a nicer way?


What are the pros of extending this to become a web service?

As far as I can think, the only advantage of a web service over a cron job is that it can support a "check this resource / dataset / organization for broken links now" button in the CKAN HTML interface for sysadmins, kicking off ad-hoc link checks in response to user interaction (because CKAN can send a request to the link checker service to do the check).

The primary mode of operation would still be automatic periodic checks, but admins could request ad-hoc checks as well.

That is an advantage and I do think a web service would ultimately be better than a cron job.

But I think the effort-benefit ratio is quite bad because a web service is a lot more work.

@seanh

seanh commented Sep 12, 2014

Another pro of doing a web service is that CKAN can automatically request an ad-hoc check when a user creates or updates a resource, so the user can see their link check result right away. With the cron version, they have to wait until the next time a link checker cron job runs.

@seanh

seanh commented Sep 12, 2014

Each service would have a list of CKAN instances (API URLs) it's willing to do work for. Every time a job is started from the instance it would contact the service and prove that the work request came from this instance by having the service "phone back" to the instance API to verify the request and establish a token for the job. This makes it possible to handle all user authorization on the ckan side, keeping the services simple. We probably don't even need API keys for the services.

This sounds good.

Any thoughts on how we'd implement the web service? I'm thinking Flask (very nice for quick, small web apps with JSON APIs) and Celery with its Periodic Tasks feature. I haven't done much Celery coding though, and it does require you to run a celery worker in a terminal in development, and to set up e.g. supervisord to run celery in production. Something with less setup would be nice, but it needs to be able to run a "check for new links to check or re-check" task periodically.

Maybe it should be one link checker script that can be called by a web service for ad-hoc tasks, but can also be called by a cron job for periodic tasks!

But it'd be nice if all you had to do was install the Flask app, not install the Flask app and setup a cron job.
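The ad-hoc side of such a Flask app could be tiny. A sketch (enqueue_check is a stand-in for whatever actually runs the check, e.g. a Celery task's .delay()):

from flask import Flask, jsonify, request

app = Flask(__name__)

def enqueue_check(ckan_url, resource_id):
    # Stand-in: in a Celery version this might be
    # check_resource.delay(ckan_url, resource_id).
    return "some-task-id"

@app.route("/check", methods=["POST"])
def check():
    # A CKAN site asks for an ad-hoc check of one resource.
    payload = request.get_json()
    task_id = enqueue_check(payload["ckan_url"], payload["resource_id"])
    return jsonify({"queued": True, "task_id": task_id}), 202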

@davidread

Cool! What do the results look like?

@seanh

seanh commented Sep 15, 2014

A link checker result (as posted back to CKAN by the link checker service) currently looks like:

{
  "resource_id": "<resource id>",
  "alive": true|false
}

That goes into a db table with columns:

  • id (autoincrementing unique id)
  • resource_id
  • alive
  • datetime

I may add a "reason" string but so far I don't think it's strictly necessary.

@rossjones
Contributor Author

It'd be handy to have a reason I think, for when alive=False, so we could have "Authentication required", or "Resource does not exist" or "An error occurred on the server".

@rufuspollock
Member

BTW is it worth checking back at our original spec re Data Quality info structure: http://web.archive.org/web/20111221052527/http://wiki.ckan.org/Data_Quality

In particular, the fields were:

last_checked: timestamp of last checked time
check_status: status code (see below)
last_good_date = MAX(resource_availability.checked_date) WHERE resource_status = ok
failure_count = COUNT(1) WHERE resource_status != ok
cached_date - date of resource that we have successfully downloaded and cached (somewhere)
validation: tbd (need to specify error info)

@davidread

What about the web UI? I agree with Ross and Rufus that it is not good enough just to give one result.
Trust me, before we had detailed reports we constantly got pushback from managers at publishers, e.g. messages forwarded back to us saying "No it's not broken" or "It must be a one-off" or "What is the problem?" or "And when did you do the test?" or "We'll look at it sometime soon".

It is much more robust, and indeed helpful, to be able to respond like this: "It had been working on the weekly tests we've done since 4th March, and then the last two tests, the latest of which was yesterday at 4pm, gave the error "504 Server Error". The full results are on web page X and there is a public message on the dataset page to warn people before they try your link, which will be removed as soon as you fix it. Your publisher is below average in the public rankings, so let's sort it out before someone notices."

See example: http://data.gov.uk/dataset/financial-transactions-data-advantage-west-midlands/resource/9c1b2fc3-aff5-4ba6-b77c-9690b54d41a7

@adamamyl

It'd be handy to have a reason I think, for when alive=False, so we could have "Authentication required", or "Resource does not exist" or "An error occurred on the server".

Maybe store the HTTP status code?

@seanh

seanh commented Sep 15, 2014

I've put some thought into defining a non-CKAN specific protocol for this: ckan/ckanext-deadoralive#1 feedback welcome

@seanh

seanh commented Sep 15, 2014

I'm happy to add a reason string.

@seanh

seanh commented Sep 15, 2014

What about the web UI? I agree with Ross and Rufus that it is not good enough just to give one result. Trust me, before we had detailed reports we constantly got pushback from managers at publishers, e.g. messages forwarded back to us saying "No it's not broken" or "It must be a one-off" or "What is the problem?" or "And when did you do the test?" or "We'll look at it sometime soon".

The way I'm planning to implement the web UI is that if a link has been down for at least 3 checks in a row over a period of at least 3 days (both numbers configurable) then we mark it as broken. That's why the CKAN plugin is storing a history of link checker results, not just the most recent one.
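A sketch of that rule, assuming a newest-first list of (alive, checked_at) results for the resource:

import datetime

MIN_CONSECUTIVE_FAILS = 3                        # configurable
MIN_BROKEN_PERIOD = datetime.timedelta(days=3)   # configurable

def is_broken(results):
    # results: newest-first list of (alive, checked_at) tuples.
    recent = results[:MIN_CONSECUTIVE_FAILS]
    if len(recent) < MIN_CONSECUTIVE_FAILS or any(alive for alive, _ in recent):
        return False
    # The consecutive failures must also span at least the minimum period.
    return recent[0][1] - recent[-1][1] >= MIN_BROKEN_PERIOD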

Since we're keeping a history of results we could potentially show detailed info about the history of checks like in the example you link to.

The only thing is that I currently have it deleting results that are more than 1 week (configurable) old to stop the db table from growing indefinitely.

@seanh

seanh commented Sep 15, 2014

last_checked: timestamp of last checked time
check_status: status code (see below)
last_good_date = MAX(resource_availability.checked_date) WHERE resource_status = ok
failure_count = COUNT(1) WHERE resource_status != ok
cached_date - date of resource that we have successfully downloaded and cached (somewhere)
validation: tbd (need to specify error info)

Right. So I still need to define my "reporting" APIs that will be used to put the link checking info into CKAN's HTML. That's what this seems to apply to. So note that I don't consider this to be part of the link checker web service protocol - this is just part of what the CKAN plugin does with the results.

I can compute last_good_date and failure_count from what I have in the db, just like you're doing here. But my deleting of db rows over a week old would break them. I wonder if it's worth adding these two as actual db columns and computing them at write time?
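Computing those two from the stored rows is only a few lines (a sketch over (alive, checked_at) tuples):

def summarise(results):
    # results: all stored (alive, checked_at) tuples for one resource.
    good = [checked_at for alive, checked_at in results if alive]
    return {
        "last_good_date": max(good) if good else None,
        "failure_count": sum(1 for alive, _ in results if not alive),
    }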

I don't have cached_date or validation either. Caching isn't part of what this extension is doing, imo. Not sure what validation is supposed to be for.

@rossjones
Contributor Author

Actually, I think CKAN should do the reporting of what has happened - but the link-checker needs to supply enough information for it to do that. IMO keep the link checker simple and neutral (no CKAN specific code).

I'd suggest you need to add:

  • status code
  • a reason (a textual explanation of the status code)
  • the timestamp for when the check was made.

CKAN can then (if it wishes) store this information somewhere and calculate the other information based on the timestamps and status codes.
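So a failed check might come back as something like this (field names illustrative, in the same style as the result payload above):

{
  "resource_id": "<resource id>",
  "alive": false,
  "status": 504,
  "reason": "An error occurred on the server",
  "checked": "2014-09-15T16:00:00Z"
}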

@davidread

Good, so you have the test history covered.

BTW we're really happy with just storing what gets reported, rather than the full details of every old test, i.e. just storing the date of the last successful test and the number of fails, rather than the dates of all the tests. I didn't see the justification for storing the full details of older tests, and it would be another order of magnitude of data to store. But maybe you have the use cases.

@seanh

seanh commented Sep 15, 2014

One more thought about the protocol between the CKAN plugin and the link checker service: one of the diagrams above has the web service supporting an optional "check status of a link checker task" API (where the CKAN site makes a request to the web service to get the status of a task, e.g. queued, succeeded, failed etc).

I don't think we should support this:

  • It requires CKAN to know which link checker service it gave which task to (ok this is not difficult but currently it's not required)
  • It requires the link checker service to keep track of what tasks it has done, so it needs storage / state. I'm trying to keep it stateless.
  • It requires CKAN to do an async / periodic task: Presumably it's going to periodically ask the link checker for the status of a task, until it gets a final result?

I'm trying to get all the storage/state into the CKAN plugin, and all the async and periodic stuff into the service. I think this keeps both things simpler: CKAN already provides a db to the plugin, so it can have storage easily. CKAN doesn't provide a good, easy way to do async or periodic tasks, so that's what the service should do - but to keep the service as simple as possible, it shouldn't do anything else.

@seanh

seanh commented Sep 15, 2014

Actually, I think CKAN should do the reporting of what has happened - but the link-checker needs to supply enough information for it to do that. IMO keep the link checker simple and neutral (no CKAN specific code).

I'd suggest you need to add:

  • status code
  • a reason (a textual explanation of the status code)
  • the timestamp for when the check was made.

I'm happy to add status and reason.

I think CKAN can easily compute the timestamp itself though (it just does datetime.datetime.utcnow() when it receives a result), and this means we don't have to add a timestamp to the protocol (which could get complicated: we'd need to define a format and validate it).

@seanh

seanh commented Sep 15, 2014

BTW we're really happy with just storing what gets reported, rather than the full details of every old test, i.e. just storing the date of the last successful test and the number of fails, rather than the dates of all the tests. I didn't see the justification for storing the full details of older tests, and it would be another order of magnitude of data to store. But maybe you have the use cases.

I think you might be right here. Currently I store every result for each resource, and delete the ones older than a week. But it might be much simpler to store only the latest result for each resource, plus "last successful" and "number of fails" columns that get recomputed on each write, instead of computing these at read time.
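A sketch of that write-time update, assuming a single stored row per resource with the two extra columns:

def record_result(row, alive, now):
    # row: the one stored result row for this resource.
    row["alive"] = alive
    row["datetime"] = now
    if alive:
        row["last_successful"] = now
        row["failure_count"] = 0  # fails counted since the last success
    else:
        row["failure_count"] = row.get("failure_count", 0) + 1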

@seanh

seanh commented Oct 14, 2014

This is what became of the link checker in the end:

https://github.com/ckan/deadoralive
https://github.com/ckan/ckanext-deadoralive
http://seanh.cc/posts/ckanext-deadoralive/
http://seanh.cc/posts/background-tasks-as-simple-web-services/

It's "finished for now" (fulfils our particular client's needs, bugs notwithstanding) but still plenty of potential enhancements in the github issues.

@the42

the42 commented Mar 29, 2015

Wanted to let you know that I am running a broken link checker against CKAN. It works by harvesting some metadata, fetching all sorts of URIs (not just resources but also contact points, etc.) and reporting into PostgreSQL. As I am not good at UI, I stopped at aggregating statistics into Redis.

It's available here https://github.com/the42/ogdat/tree/master/ogdatwatcher
