
Celery for background tasks? #66

Closed
davidread opened this issue Jul 1, 2014 · 44 comments

@davidread

CKAN has long had some integration with Celery for performing background tasks. However, there have been issues, and some people say we should do something else. It would be good to resolve this, as we need it for ckan/ckan#1796 among other things.

Celery for:

  • it is the most widely used task queue for Python, and the default choice for Django.
  • it has many features you might want, such as robustness and retries.

Celery against:

  • past problems with using RabbitMQ as a backend. @kindly and @davidread have seen it occasionally just lock up and not run new tasks on the queue until it is restarted. This appeared to be tied to the RabbitMQ version (2.7.1, the one that ships with Ubuntu 12.04) and may be fixed in more recent versions. In the meantime, @kindly switched data.gov to harvest using the Redis backend and has not looked back. @davidread wants to switch DGU.
  • it seems a bit difficult to see what's on the queue, what's going on, and to control it. Maybe we just need to read the docs and learn it properly, but fundamentally it is quite complicated.

Alternatives:

  • Pika - an AMQP library written in Python, which might make it easier to use
  • RQ (python-rq) - a thin wrapper around Redis that makes it work as a queue

Tryggvi suggested these Celery tips: https://denibertovic.com/posts/celery-best-practices/, such as using Flower to monitor Celery.

@davidread
Author

Tryggvi mentioned that in one of his projects there was a benefit in separating the front-end and back-end task code (and that Pika suited that better than Celery), but I think that in CKAN, having a separate install process for back-end tasks is going to be more hassle than it's worth.

@wardi
Contributor

wardi commented Jul 2, 2014

Anyone have any feelings about Skytools/PgQ?

@rossjones
Contributor

I don't like the idea of using a database as a queue. I like the idea of a Redis-backed queue (because we can also use Redis for sessions).

While we're throwing out suggestions: it's not technically a queue, but what about http://gearman.org/? It would be nice to open up background processing to other languages, and you get a choice about where you persist data (memcached, Postgres, etc.).

@wardi
Contributor

wardi commented Jul 2, 2014

Does this really count as "using a database as a queue"? It's custom queueing code used by Skype that just happens to be available via SQL commands on a db we already have.

@rossjones
Contributor

Perhaps. I'm just nervous about things I've never heard of before, as that often means they aren't very widely used. Maybe I'm just being pessimistic :) It does seem to be reasonably active though - https://github.com/markokr/skytools

What's the setup/install like?

@wardi
Contributor

wardi commented Jul 2, 2014

apt-get install skytools seems to be an option for installation

@nickstenning

Cross-posted from ckan/ckan#1796

I'm in favour of having a mechanism for processing delayed jobs in CKAN core. Celery is the go-to for such a system in a Python application, so unless there are clear and well-argued reasons for doing anything else, let's use that.

As for the backend, Redis is certainly simpler to deploy and manage than Rabbit, and it can be configured with appropriate persistence properties for a queue (you should use AOF mode when using Redis as a queue).

(In a perfect world, I'd also kill ckan-service-provider and datapusher in favour of such a system, but I think that's a different discussion).
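For reference, a minimal sketch of turning on AOF persistence from Python with redis-py (setting appendonly in redis.conf is the durable way to do this; the connection defaults here are assumptions):

    from redis import Redis

    r = Redis()  # assumes a local Redis on the default port
    # switch persistence to the append-only file so queued jobs
    # survive a Redis restart
    r.config_set('appendonly', 'yes')
    # fsync at most once per second: a common durability/speed trade-off
    r.config_set('appendfsync', 'everysec')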

@davidread
Author

@wardi Redis works mostly in memory, which is better suited to frequently adding and removing items from a queue than a more disk-intensive relational database. But I imagine you had your reasons for suggesting Skytools, so let's hear them.

@nickstenning I'm very happy to encourage plenty of partially-formed reasons and gut reactions in any debate, so let's keep this open. And I think we're settled on Redis: there is no proposal to change back to RabbitMQ. Good tip on AOF; we can add that when we write the docs for background tasks.

@davidread
Author

btw what's ckan-service-provider? And what does datapusher use for a queue?

@wardi
Contributor

wardi commented Jul 3, 2014

@davidread celery and redis are new things for me, and I'm an extremely lazy person.

skytools is also new for me, but seems less scary because it's based on something I do know.

I understand how to scale out WSGI processes, and I can set up replication and fail-over with Postgres. Solr doesn't seem to have any distributed options, so I just rebuild it if it goes away (but no data is lost, so no big deal). What's the best way to run Redis so that we don't lose jobs?

@nickstenning

btw what's ckan-service-provider? And what does datapusher use for a queue?

Datapusher uses its own queue, which it stores (by default) in a SQLite database, built on top of APScheduler. I'm sure it's fine as far as it goes, but it smells strongly of NIH to me and could easily be replaced with a short Celery task.

skytools is also new for me, but seems less scary because it's based on something I do know.

Absolutely, but there's a huge amount of code you'd need to write if you want to use this. As I understand it, skytools is a thin Python wrapper over a bunch of PL/pgSQL and C, and exposes a generic consumer/producer queue API. That's a long way from being a complete job runner, which I would expect to provide features such as:

  • log collection and archival
  • job timeouts
  • retries
  • periodic and repeat scheduling

Celery provides all of these and more, whereas skytools provides approximately none (which is fine, as it's not trying to fill the same space -- it's a much lower-level tool).
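To make the "short Celery task" point above concrete, here is a minimal sketch of what a datapusher-style task with retries and a timeout could look like (the task, helper, and broker URL are illustrative, not CKAN's actual code):

    from celery import Celery

    app = Celery('tasks', broker='redis://localhost:6379/0')

    def fetch_and_load(url):
        # hypothetical helper: download the resource and load it
        # into the datastore
        pass

    # a task with a ten-minute timeout and automatic retries
    @app.task(bind=True, max_retries=3, soft_time_limit=600)
    def push_resource(self, resource_url):
        try:
            fetch_and_load(resource_url)
        except IOError as exc:
            # retry up to 3 times, waiting a minute between attempts
            raise self.retry(exc=exc, countdown=60)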

@wardi
Contributor

wardi commented Jul 3, 2014

Distributed operation, timeouts, and retries sound good for the sort of thing datapusher does, and also for what the qa extension does.

I was thinking of background tasks like "update the organization information for 10K datasets in a local SOLR core". To me that calls for something simpler.

@nickstenning

What's the best way to run redis so that we don't lose jobs?

It rather depends on what scenario you're imagining. Probably the most common failure mode will be a celeryd crash. To protect against this you need a protocol that supports message acknowledgements, such as AMQP: hence Rabbit. With Redis in AOF mode and CELERYD_PREFETCH_MULTIPLIER=1, a celeryd crash will lose at most N jobs, where N is the number of celery workers. (As I understand it, although it's quite possible that there are worse scenarios.)
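In Celery 3.x settings, that's only a couple of lines; a minimal celeryconfig sketch (the broker URL is an example, and CELERY_ACKS_LATE is an addition that pairs naturally with a prefetch multiplier of 1):

    # example celeryconfig.py for the Redis-backed setup described above
    BROKER_URL = 'redis://localhost:6379/0'
    # each worker process reserves only one message at a time, so a
    # crash takes as few unstarted jobs with it as possible
    CELERYD_PREFETCH_MULTIPLIER = 1
    # acknowledge a message only after the task completes, not on receipt
    CELERY_ACKS_LATE = True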

Other possible failure modes:

  • Redis craps all over its own database: no idea how likely this is; I've never seen it happen
  • Hardware failure underneath Redis: you'll need to look into a distributed Redis setup (see the Sentinel documentation), but there are all kinds of exciting crevices to fall into here, and I would freely admit that I'd rather use Postgres in this scenario.

Unfortunately, as far as I'm aware, there just isn't a decent background job library for Python that works with Postgres yet. (Although with NOTIFY and LISTEN there's no particular reason you couldn't implement a passable queue on top of it.)
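For the curious, a toy sketch of the consumer side of such a queue using psycopg2's LISTEN/NOTIFY support (the channel name and connection string are made up, and this is nowhere near a full job runner):

    import select

    import psycopg2

    conn = psycopg2.connect('dbname=ckan')
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute('LISTEN new_job;')  # hypothetical notification channel

    while True:
        # block for up to 5 seconds waiting for a notification
        if select.select([conn], [], [], 5) != ([], [], []):
            conn.poll()
            while conn.notifies:
                notify = conn.notifies.pop(0)
                # a real runner would claim a row from a jobs table here
                print('job available:', notify.payload)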

@nickstenning

I was thinking of background tasks like "update the organization information for 10K datasets in a local SOLR core". To me that calls for something simpler.

Well, maybe, but one task scheduler is probably simpler than two, and you're certainly not obliged to use all of Celery's features!

@wardi
Contributor

wardi commented Jul 3, 2014

full disclosure of my biases:

  1. For data.gc.ca we don't (and likely won't) use datapusher or the qa extension, so we don't need celery's advanced features
  2. Getting new software/services approved for use on our servers is like pulling teeth, only much slower

So, I probably shouldn't participate in this discussion :-)

@nigelbabu

I haven't used celery enough to comment. I only have one point to make: whatever we pick, let's please use it consistently for background tasks across CKAN, which will make things less of a pain.

@davidread
Author

@wardi Celery is just Python code, so would it need approval from your organization? Redis is pretty mainstream, so getting approval shouldn't be any tougher than for anything else, I imagine. And I guess you could use Postgres as a backend for Celery. But it's surely a good reason to avoid chopping and changing in the future.

Since we're going with queues in core CKAN, I think we should embrace them for indexing all packages. This would be better than running a paster command that takes an hour or so to return when you restore a database. And we could even put a progress bar in the package search UI, for a sysadmin to keep tabs on the indexing and to explain a low package count. It's not strictly necessary, but it would ensure the queue software gets installed correctly and give devs a clear view of how it works.

@wardi
Contributor

wardi commented Apr 28, 2015

@brew Here's the ticket mentioned at the meeting this morning. As discussed above let's settle on Celery + Redis (non-distributed) as the standard approach for queues in ckan. I'm planning to build in that direction with my docker stuff.

@rossjones
Contributor

I know this seems like it has already been decided, but having looked deeper at it, http://python-rq.org looks very interesting. It's easy to install and configure, seems widely used (Heroku suggests it), and is actively developed (https://github.com/nvie/rq).

@davidread
Author

RQ has a small code-base, which is good, and we don't make use of the Celery features it leaves out: AMQP routing/delivery rules and tasks written in non-Python languages. However, install, task setup, and running tasks all seem very similar to Celery (particularly in versions newer than the one we're on at DGU), so on the face of it I can't see much advantage in switching. But if you do get a chance to convert archiver across to it and see if it is any simpler in reality, then great!

@TkTech
Member

TkTech commented Oct 7, 2015

I use both rq and celery on a variety of projects. Both have their places, and celery is significantly more fully featured than rq.

For CKAN's use case, rq is completely sufficient and easy to integrate. Its code complexity is far below celery's, and debugging it is downright enjoyable by comparison. Its performance is also excellent (mostly because it binds itself tightly to redis instead of trying to support a wide variety of brokers and result stores).

In cases where you need tens of thousands of workers across thousands of cores, extremely complex routing, and highly scaled queues, I would definitely recommend celery.

For CKAN, where the general usage will likely be periods of heavy bulk loading followed by periodic bulk updates and individual record updates, I would just go with rq and keep it as simple as possible: probably just two queues, queue-default (for all tasks) and queue-ui (for user-triggered events such as reindexing a single dataset).
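A sketch of that two-queue layout with rq (the connection details and job function are illustrative, not CKAN code; in practice the function must live in an importable module so workers can find it):

    from redis import Redis
    from rq import Queue

    redis_conn = Redis()
    default_q = Queue('queue-default', connection=redis_conn)  # all tasks
    ui_q = Queue('queue-ui', connection=redis_conn)  # user-triggered events

    def reindex_dataset(dataset_id):
        # hypothetical job body: reindex a single dataset
        pass

    # bulk work goes on the default queue; a user clicking "reindex" on a
    # single dataset goes on the ui queue so it isn't stuck behind a bulk load
    default_q.enqueue(reindex_dataset, 'dataset-1')
    ui_q.enqueue(reindex_dataset, 'dataset-2')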

@davidread
Author

@TkTech thanks v. much for weighing in on this. It sounds very much like we should give it a shot with rq.

@wardi
Contributor

wardi commented Oct 7, 2015

@TkTech If I want to schedule jobs like I would with cron, how would I do that with rq? I haven't found a nice way to run the cron daemon in the foreground (for use in docker) and I was hoping there would be a solution for periodic jobs in our queue of choice.

@TkTech
Member

TkTech commented Oct 7, 2015

@wardi You would typically do that with cron (in the case of rq) or with beat (in the case of celery). In both cases, a separate process needs to be run to start the jobs (you can technically run beat inside of a worker, but you would never do this except for local development).

There is also the third-party rq-scheduler project, which is stable, popular, and extremely easy to use.
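A minimal rq-scheduler sketch (the job function is hypothetical, and a worker still has to be running to execute what the scheduler enqueues):

    from datetime import datetime, timedelta

    from redis import Redis
    from rq_scheduler import Scheduler

    def cleanup():
        # hypothetical periodic maintenance job
        pass

    scheduler = Scheduler(connection=Redis())
    # run once, an hour from now
    scheduler.enqueue_in(timedelta(hours=1), cleanup)
    # run every hour, starting now (the scheduler process itself is
    # started separately with the rqscheduler command)
    scheduler.schedule(scheduled_time=datetime.utcnow(), func=cleanup,
                       interval=3600)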

For integration, it's easy to embed both rq-scheduler and rq workers into a paster command (or some other convenience). For example, here is how I run workers using the same command line as I use for most general tasks, while picking up the configuration from a Flask app:

    # assumes: from redis import Redis; from rq import Connection, Queue, Worker
    if args['worker']:
        # run inside the Flask app context so jobs can read the app config
        with app.app_context():
            # bind rq to the Redis instance named in the app's configuration
            with Connection(Redis.from_url(app.config['BROKER_URL'])):
                # listen on the queues named on the command line, falling
                # back to rq's default queue if none were given
                qs = [Queue(n) for n in args['--names']] or [Queue()]
                w = Worker(qs)
                w.work()

synd-cli worker --names=queue-default --names=queue-ui

@deniszgonjanin

Good to see rq has a scheduler; I think it's extremely important to have one for CKAN. Cron jobs are too simple, and their configuration lives outside CKAN's source, db, and config, introducing state where state shouldn't be. They have to be set up manually, and they have to be migrated manually as well. Cron really becomes inadequate for anything but the simplest CKAN deployments.

It sounds like we could give rq a shot - it seems like a better alternative, but before it's decided we should address:

  • Celery works well enough right now with Redis as a backend. Is rq better by enough of a margin to justify the work involved in switching?
  • Assuming we don't want to keep celery support around as a legacy option, switching to rq is a backwards-incompatible change, and extensions will need to be updated. As with the previous point, is this worth the hassle?

@TkTech
Member

TkTech commented Oct 7, 2015

Celery is hardly used at the moment and I could find no extension that depends on it. The change should have no API impact.

@thriuin

thriuin commented Oct 7, 2015

+1 for rq

@deniszgonjanin

It's used in a few extensions that I know of:

We can migrate those easily enough, but celery is also used in at least a few large CKAN projects by orgs and governments that don't always release their code publicly. Don't assume that this change won't impact anybody; that's a terrible way to build a good open source project.

@rossjones
Contributor

I agree, perhaps abstracting it out might be best.

But if govs are not releasing their code related to ckan, they are breaching the licence :(

@thriuin

thriuin commented Oct 7, 2015

Just for the record (and I know no one said otherwise) Cdn open data does release all of our code to GitHub - unless there is something Ian isn't telling me ;-)

@deniszgonjanin

@rossjones @thriuin that's a good point. We could send out a notice to ckan-dev asking if anybody knows where celery is being used, and to point us to the code. If we don't find (m)any cases, we can move to rq?

@rossjones
Contributor

We can always implement rq alongside, move the core extensions across, and let people know they should move before release 2.x.y if they're depending on celery?

@amercader
Member

Came here to talk about schedulers; I'm glad rq has that covered :)

In terms of deprecating celery, I don't think that is a major issue. Celery is not even a requirement for CKAN, so if somebody or some extension is using it, they will already be taking care of installing it. We can announce deprecation (once rq support is implemented and tested!) and keep the celery code in core for a release (essentially what @rossjones said). Maybe write a short guide about how to migrate jobs from celery to rq, if necessary.

@CarlQLange

CarlQLange commented Apr 19, 2016

Just dropping in to ask if there's a definitive doc for using celery with CKAN? All I've been able to find has been cobbled together from the readmes of a few extensions. Cheers!

@torfsen

torfsen commented Apr 19, 2016

@CarlQLange: There is a section about background tasks in the CKAN documentation.

@CarlQLange

@torfsen Aha! Thank you so much!

@rossjones
Contributor

This idea now has a Bountysource

@torfsen

torfsen commented Jul 14, 2016

There is now a new PR for this, see ckan/ckan#3165.

@amercader
Member

Background jobs are now merged to master, thanks to the brilliant work by @torfsen. Check the docs for more details:

http://docs.ckan.org/en/latest/maintaining/background-tasks.html
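As a taste, a minimal sketch of enqueueing a job with the new system (the job function is illustrative, and in practice it must live in an importable module so workers can find it):

    import ckan.plugins.toolkit as toolkit

    def log_message(message):
        # hypothetical job body; runs later in a background worker
        # started with: paster jobs worker
        print(message)

    toolkit.enqueue_job(log_message, ['Reindexing finished'])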

@CarlQLange

Wow, that looks fantastic. Great job @torfsen.

@jqnatividad
Contributor

Hi @amercader,
I know this issue has been closed, but just wondering if https://github.com/datacats/ckanext-webhooks is still usable given that it uses celery?

Also, the background-tasks doc uses webhooks as the first example of what background tasks are useful for. Is that a "for example" an aspirational or concrete example :)

@torfsen

torfsen commented Feb 16, 2017

@jqnatividad, the old Celery system is deprecated but still available, so anything that is working now should continue to work. AFAIK there is currently no timeline for removing the Celery system.

@amercader
Member

@jqnatividad, @torfsen wrote a great section on migrating to the new queue framework and on how to support both systems, so it should be really easy to update ckanext-webhooks to support it.

@torfsen

torfsen commented Feb 16, 2017

@jqnatividad, the documentation @amercader is talking about is here: Migrating from CKAN’s previous background job system
