
Failed timers and due timers causing login and restart problems

Question asked by jflukkien on Aug 11, 2016
Latest reply on Aug 12, 2016 by warper
Hi all,

For the last three days I've been working on an issue we ran into with Activiti, and I'm wondering how to make our setup more resilient so that it doesn't happen again.

Last Saturday, 30 processes failed because the data our Activiti instance got back did not match what it expected. The processes all ended up in a state where they were no longer being retried, and our entire application became unresponsive: trying to log in would just hang. When we inspected the host the application was running on, we saw a lot of futex calls, and the number of locks on the Postgres instances behind Activiti was getting high enough to worry us.

At first we thought it might be an infrastructure problem. But this morning we were able to get our application up and running again. I did this by manually moving the due dates of the job timers for the failed processes into the future, so that Activiti wouldn't try to execute them on startup anymore.
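The logic of that workaround can be sketched roughly as follows. This is plain Java and only a model of the idea: `TimerJob`, `postpone`, the job ids and the 30-day delay are all hypothetical names and values made up for illustration. It is not the Activiti API and not the actual update that was run against the database.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PostponeFailedTimers {

    /** Hypothetical stand-in for one timer job row; not an Activiti class. */
    static final class TimerJob {
        final String id;
        final Instant dueDate;

        TimerJob(String id, Instant dueDate) {
            this.id = id;
            this.dueDate = dueDate;
        }
    }

    /**
     * Returns a new list in which every job that is listed in failedJobIds
     * and is already due at 'now' gets the due date now + delay.
     * All other jobs are passed through unchanged.
     */
    static List<TimerJob> postpone(List<TimerJob> jobs, Set<String> failedJobIds,
                                   Instant now, Duration delay) {
        List<TimerJob> result = new ArrayList<>();
        for (TimerJob job : jobs) {
            boolean alreadyDue = !job.dueDate.isAfter(now);
            if (failedJobIds.contains(job.id) && alreadyDue) {
                // Would be picked up on startup: push it into the future.
                result.add(new TimerJob(job.id, now.plus(delay)));
            } else {
                result.add(job);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Arbitrary example timestamps and ids, chosen only for the demo.
        Instant now = Instant.parse("2016-08-11T08:00:00Z");
        List<TimerJob> jobs = List.of(
                new TimerJob("failed-1", now.minus(Duration.ofDays(5))),
                new TimerJob("healthy-1", now.minus(Duration.ofMinutes(5))));

        for (TimerJob job : postpone(jobs, Set.of("failed-1"), now, Duration.ofDays(30))) {
            System.out.println(job.id + " due " + job.dueDate);
        }
    }
}
```

Only the jobs explicitly marked as failed are moved, so healthy timers that happen to be overdue still run on startup. The postponed jobs are not fixed by this; they are parked until the underlying data mismatch is resolved.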

We run Activiti on two application nodes with a replicated Postgres database behind them, and we use Tomcat to serve the Activiti web interface. The only things we develop ourselves are the BPMN files and the Java classes used in them.

It seems our problem was caused by Activiti's guarantee that it runs jobs whose due dates have passed, combined with our Activiti application going belly up because of the exceptions it hit while running our processes.

Now I'm left wondering how we can prevent this from happening again. Our approach so far has been not to develop anything directly in Activiti, not to touch the database behind it, and to do most of our work in Java delegates. But of course something can always go wrong, which is why I'm wondering whether other people have encountered something like this and how they solved it.

Thanks in advance for reading this long post :P

Kind regards,