
Make harvester implementation simpler/more reliable #80

Open · wardi opened this issue Sep 10, 2014 · 9 comments

Comments

@wardi (Contributor) commented Sep 10, 2014

The harvester extension has good interfaces for viewing job progress and submitting jobs, but implementing reliable harvesting jobs is tricky, and when a job stops it can be hard to debug the cause.

Can we fix or replace the current harvester backend to make it easier to extend and debug?

@rufuspollock (Member) commented:

As a general point, would it not make sense to move the harvesting code itself (a.k.a. scraping) out of CKAN core? One could keep the UI and the task reports in, but keep the scripts elsewhere.

To do this I imagine we need to define a good service interface and agree on how to report the results of a given harvest in a useful way.

@rossjones (Contributor) commented:

I don't think any of the harvester is in core. It's pretty much all in ckanext-harvest.

There was some discussion this afternoon with @seanh and @wardi about how CKAN could/should display and handle out-of-band/async-style tasks more generally, not just for the harvester. It would be good to get the ideas down in a doc/issue somewhere.

Perhaps it might be simpler to allow a single-stage harvester in addition to the more complex 3-stage ones?

@wardi (Contributor, Author) commented Sep 10, 2014

I was trying to champion turning the harvester web interface into just a dashboard that can trigger calls, via some newly defined API, to a set of user-chosen services. Those services do whatever they need to, in however much time it takes. The services need to be able to accept jobs, be polled to check their status, and call back when something is completed or has failed permanently.

I don't know if that's too loosely defined to be useful, but it means people could implement their long-running services in whatever they like. And those services could be reliable, or not (leaving it up to the client to resubmit).
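
To make that contract concrete, here is a minimal sketch of such a service, assuming an HTTP interface; the endpoint names, payload fields and in-memory job store are all invented for illustration:

```python
# Hypothetical sketch of the service contract described above: accept a job,
# allow status polling, and call back when the job completes or fails.
# Endpoint names, payload fields and the in-memory store are invented.
import threading
import uuid

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
jobs = {}  # job_id -> {"status": ..., "callback_url": ...}


def run_job(job_id, params):
    """Do the long-running work, then notify the caller if asked to."""
    jobs[job_id]["status"] = "running"
    try:
        # ... the actual harvest work, driven by `params`, goes here ...
        jobs[job_id]["status"] = "completed"
    except Exception:
        jobs[job_id]["status"] = "failed"
    callback_url = jobs[job_id].get("callback_url")
    if callback_url:
        # Call back when the job has completed or failed permanently.
        requests.post(callback_url, json={"job_id": job_id,
                                          "status": jobs[job_id]["status"]})


@app.route("/job", methods=["POST"])
def accept_job():
    # Accept a job submission and start it in the background.
    data = request.get_json() or {}
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending",
                    "callback_url": data.get("callback_url")}
    threading.Thread(target=run_job, args=(job_id, data)).start()
    return jsonify({"job_id": job_id}), 202


@app.route("/job/<job_id>", methods=["GET"])
def poll_job(job_id):
    # Let the dashboard poll for the current status of a job.
    job = jobs.get(job_id)
    if job is None:
        return jsonify({"error": "not found"}), 404
    return jsonify({"job_id": job_id, "status": job["status"]})
```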

@rossjones (Contributor) commented:

I came to the same conclusion in #65.

I think, and I hesitate to write it, that 'microservices' might be a good way to go. It'd certainly make a lot of extensions more lightweight and possibly even easier to configure (at the expense of managing another process).

I kind of like this idea for harvesting too, as it would be reasonably straightforward (perhaps) to delegate the work to morph.io (for instance).

@davidread commented:

I think the harvester basically suits its job of harvesting datasets. It offers lots of useful functionality, e.g. harvester configuration, harvest job configuration, per-job and per-dataset error display, progress display, robustness for large jobs, a list of datasets harvested, and stats on how many datasets were added/updated per job. These functions are all centralized, and when you write a harvester you get them for free, which I think is good.

However, I agree that the writing of harvesters is painful due to the two queues and the run/gather/fetch/import sequence. It needs simplification. I'd suggest a single queue, with one task doing one whole harvest job, and we get rid of the need to do a 'run' to get it started. This sacrifices some robustness: if you stop the harvester during a job that takes hours, you have to restart the whole job. But I think it is worth it to simplify the code and the development of harvesters.

@amercader (Member) commented:

I've been working a lot with the harvesting extension, so I thought I'd give my view on this.

Some general comments (answers in a separate comment).

I see different aspects of harvesting being mentioned in this thread (UI, internal implementation, robustness, extensibility...). These are all obviously related, but each has different implications and different ways it could be improved.

General architecture / complexity

Let's start with what the ckanext-harvest extension aims to support. The main use case is importing metadata from other catalogs periodically and efficiently, while offering some sort of UI to manage it. In many cases we are talking about big volumes of metadata, and many different metadata formats.

This of course might be overkill in some cases, and I suspect that those who want "simpler" might do better with a script that uses the API to import stuff.

The harvest extension provides a framework for developing harvesters for different metadata sources (I guess we all agree that we cannot support all sources by default). This framework provides access to an async queue (RabbitMQ or Redis, the latter being preferred), internal tables and logic to store sources, jobs and objects (remote documents), a plugin interface to write the actual harvesters, and the UI to manage it all.
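
For reference, a harvester plugin built on this framework has roughly the following shape (simplified from the IHarvester interface; the real interface has a few more optional hooks, such as validate_config, and the comments describe what each stage is expected to do):

```python
# Rough shape of a harvester plugin as supported by ckanext-harvest
# (simplified; treat the details beyond the three stage methods as
# illustrative).
import ckan.plugins as p
from ckanext.harvest.interfaces import IHarvester


class MyHarvester(p.SingletonPlugin):
    p.implements(IHarvester)

    def info(self):
        # Metadata shown when configuring a harvest source in the UI.
        return {'name': 'my-harvester',
                'title': 'My Harvester',
                'description': 'Harvests datasets from an example source'}

    def gather_stage(self, harvest_job):
        # Query the remote source, create one HarvestObject per remote
        # document, and return the list of HarvestObject ids to queue.
        return []

    def fetch_stage(self, harvest_object):
        # Download the remote document, store it on harvest_object.content,
        # and return True on success.
        return True

    def import_stage(self, harvest_object):
        # Parse harvest_object.content and create/update the corresponding
        # CKAN dataset; return True on success.
        return True
```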

I think that moving all this to an external service à la DataPusher would be complicated, but I'd be interested to see ways in which you could combine "traditional" harvesters (i.e. current harvest plugins) with external services under a common management interface.

Of course, if we had a generic way for CKAN to interact with async tasks, the harvest extension should totally use that rather than its current implementation.

Writing custom harvesters

I agree with @davidread that people wanting to develop a harvester will find some or all of the previously mentioned features helpful at some point. Things like recording errors and showing them in reports, scheduling jobs, etc. would need to be rewritten if an external service were used (which, again, I'm not opposed to integrating with if people don't need this kind of functionality).

@amercader (Member) commented:

Some answers to previous comments:

> As a general point, would it not make sense to move the harvesting code itself (a.k.a. scraping) out of CKAN core? One could keep the UI and the task reports in, but keep the scripts elsewhere.

All the harvesting stuff lives in ckanext-harvest.

> There was some discussion this afternoon with @seanh and @wardi about how CKAN could/should display and handle out-of-band/async-style tasks more generally, not just for the harvester. It would be good to get the ideas down in a doc/issue somewhere.

Agreed. I think it would be incredibly useful to have access to a generic queue. Once we have this defined, we can see how we would migrate harvesting to it.

> Perhaps it might be simpler to allow a single-stage harvester in addition to the more complex 3-stage ones?

Happy to consider it as long as we keep supporting the existing ones. (BTW, it's a bit of a hack, but you can do whatever you want in the first stage, including your whole process, and return False or [] to cancel the other two.)
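
A single-stage harvester along those lines might look roughly like this; both helper functions are hypothetical:

```python
# Sketch of the "hack" mentioned above: do the entire harvest inside
# gather_stage so the fetch and import stages never run. The helpers
# fetch_all_records and create_or_update_dataset are hypothetical.
def gather_stage(self, harvest_job):
    for record in fetch_all_records(harvest_job.source.url):
        create_or_update_dataset(record)
    # Returning [] (or False) leaves nothing to queue, cancelling the
    # remaining two stages.
    return []
```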

> I was trying to champion turning the harvester web interface into just a dashboard that can trigger calls, via some newly defined API, to a set of user-chosen services. Those services do whatever they need to, in however much time it takes. The services need to be able to accept jobs, be polled to check their status, and call back when something is completed or has failed permanently.

See above for my thoughts on separate services. It could be good to combine both current and external harvesters under the same UI.

> I think, and I hesitate to write it, that 'microservices' might be a good way to go. It'd certainly make a lot of extensions more lightweight and possibly even easier to configure (at the expense of managing another process).

With regards to external (micro)services, that was the whole point of the CKAN Service Provider, which is currently used by the DataPusher. It provides a common interface for registering and running jobs and checking their status. The idea was to offload tasks like DataStore importing (DataPusher), link checking, and potentially harvesting to external services that share a common API for interacting with CKAN. It hasn't seen much movement, but perhaps we can use it as a base for something more up to date.
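
For illustration, a client interaction with such a service might look roughly like this; it is modelled on how CKAN submits jobs to the DataPusher, so treat the exact endpoints, field names and service URL as assumptions and check the service provider docs (linked in a later comment):

```python
# Sketch of a client talking to a CKAN Service Provider instance. The
# payload mirrors what CKAN sends to the DataPusher; the exact fields,
# endpoints and the service URL are assumptions.
import requests

service_url = 'http://localhost:8800'  # hypothetical service address

# Submit a job.
resp = requests.post(service_url + '/job', json={
    'job_type': 'push_to_datastore',
    'api_key': 'my-ckan-api-key',
    'result_url': 'http://ckan.example.com/my-callback',  # called on completion
    'metadata': {'resource_id': 'abc123'},
})
job_id = resp.json()['job_id']

# Poll the job's status.
status = requests.get('%s/job/%s' % (service_url, job_id)).json()
print(status['status'])
```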

> I think the harvester basically suits its job of harvesting datasets. It offers lots of useful functionality, e.g. harvester configuration, harvest job configuration, per-job and per-dataset error display, progress display, robustness for large jobs, a list of datasets harvested, and stats on how many datasets were added/updated per job. These functions are all centralized, and when you write a harvester you get them for free, which I think is good.

I agree; for many users these are nice to have and not have to worry about.

> However, I agree that the writing of harvesters is painful due to the two queues and the run/gather/fetch/import sequence. It needs simplification. I'd suggest a single queue, with one task doing one whole harvest job, and we get rid of the need to do a 'run' to get it started. This sacrifices some robustness: if you stop the harvester during a job that takes hours, you have to restart the whole job. But I think it is worth it to simplify the code and the development of harvesters.

I disagree in some respects: when dealing with large or even moderately big numbers of remote datasets, depending on the source, it makes sense to split the gathering of documents from the importing, and the more granular the process is, the less time it takes to reproduce and debug if something goes wrong. The run command is an inherent part of an async process (we need to start and finish jobs somewhere), but maybe I'm missing something. As I mentioned in the previous comment, I agree 100% that writing harvesters is hard, but I think we can improve it in other ways.

@rossjones (Contributor) commented:

Point taken about the CKAN Service Provider, but I think it is the wrong level of abstraction for defining how services should interact with core CKAN, which is part of the reason I want to discuss the services at such a high level without discussing existing Python code.

I think we should be discussing what the interface itself looks like, not how you write your Python code to make use of a specific library. Defining interfaces (perhaps at the HTTP level) means more flexibility in your choice of platform when writing services, and possibly more contributions from outside the CKAN/Python community.

@amercader (Member) commented:

> Point taken about the CKAN Service Provider, but I think it is the wrong level of abstraction for defining how services should interact with core CKAN.

Sorry, I meant to link to this page, which documents the HTTP API for interacting with the service provider:

http://ckan-service-provider.readthedocs.org/en/latest/

I agree that we should not focus on code.
