I recently came across a scenario where I needed to implement a long-running job that could take more than 8 hours to finish its task. After a string of unpleasant experiences with Web Jobs, I made Hangfire my go-to library for long-running jobs. But this isn’t about Hangfire, or any specific library. This approach will work for any alternative that does not readily provide batch continuation (Hangfire Pro does, but it comes at a price $$$).
Here’s the scenario: I am consuming a web service that provides a list of notifications that my app has missed. Each notification object contains the URL of the content that needs to be retrieved and analyzed.
Solution I: One job to rule them all
I think the diagram below is self-explanatory.
Executing a long-running job on a single thread like this is very time-consuming. A job this long is bound to run into problems, and even with re-entrant code a single failure means restarting from the beginning, wasting hours of CPU time.
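A minimal sketch of this naive approach in Python (the article targets Hangfire/.NET, but the pattern is language-agnostic). `fetch_notifications` and `fetch_and_analyze` are hypothetical stand-ins for the real web-service calls:

```python
def fetch_notifications():
    """Hypothetical call to the upstream web service."""
    return [{"content_url": f"https://example.com/content/{i}"} for i in range(3)]

def fetch_and_analyze(url):
    """Hypothetical download-and-analyze step; the real one is network-bound."""
    return len(url)  # placeholder for the actual analysis result

def job_a():
    results = []
    for notification in fetch_notifications():
        # Every iteration runs on the same thread; a crash at item
        # 900,000 throws away all the work done so far.
        results.append(fetch_and_analyze(notification["content_url"]))
    return results
```

Everything happens inside one job invocation, which is exactly what makes a late failure so expensive.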
Solution II: Divide and conquer
We can improve the original solution by passing each notification object to a new job (Job B) that fetches the content object and analyzes it, thus splitting Job A into multiple jobs. This gives us the benefit of multiple threads, speeding up the execution of Job A.
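The fan-out can be sketched like this, again with hypothetical stand-in functions. `ThreadPoolExecutor` plays the role of the background-job queue here; in Hangfire you would enqueue each Job B with `BackgroundJob.Enqueue` instead:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_notifications():
    """Hypothetical call to the upstream web service."""
    return [{"content_url": f"https://example.com/content/{i}"} for i in range(5)]

def job_b(content_url):
    """Fetch the content behind the URL and analyze it (placeholder)."""
    return len(content_url)

def job_a():
    notifications = fetch_notifications()
    # Each notification becomes its own unit of work, so Job B
    # instances run in parallel instead of on one long-lived thread.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(job_b, [n["content_url"] for n in notifications]))
```

Job A now only lists the work; the slow fetch-and-analyze step is spread across workers.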
Hangfire jobs have to be re-entrant (or idempotent, depending on your taste in programming theory). If Job A fails to retrieve the notifications because of a network error or service unavailability, Hangfire’s transient-fault-tolerance mechanism kicks in and triggers the job again after a pre-configured delay. This behavior puts us at risk of creating duplicate jobs. A simple solution would be to make Job B re-entrant so that duplicates do not interfere with business logic, but this comes at the cost of CPU time, and in most cases bandwidth and storage as well. Imagine this happening for a million incoming notification objects. Back to the whiteboard.
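A re-entrant Job B might guard itself with a processed-set check like the sketch below (the `processed` store is a hypothetical stand-in for whatever persistent record you keep). Note that the guard only avoids redoing the work; the duplicate job still gets enqueued and dispatched, which is the waste we want to eliminate:

```python
processed = set()  # stand-in for a persistent "already handled" record

def job_b(content_url):
    # Re-entrant guard: skip work already done for this URL. Without
    # this check, a duplicate enqueue would cost a full download and
    # analysis; with it, it still costs a dispatch per duplicate.
    if content_url in processed:
        return False
    processed.add(content_url)
    return True  # real fetch-and-analyze would happen here
```

Multiply that per-duplicate dispatch by a million notifications and the overhead is no longer negligible.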
Solution III: New world order
We were pretty close to solving the problem with the previous solution; we only need to modify it a bit.
- In Job A, we create a unique ID (a GUID) on every invocation and use it as a Batch ID.
- Instead of triggering Job B directly, we store the Content URL (the Job description) in a database along with the Batch ID.
- We create a new job definition (Job X) that accepts a Batch ID and fetches all Job descriptions with that Batch ID.
- Job X iterates through the Job descriptions and enqueues an instance of Job B for each Content URL.
- After each successful enqueue, we mark the Job description as done.
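The steps above can be sketched end to end. This is a minimal illustration, not the article's actual implementation: it uses an in-memory SQLite table for the Job descriptions, a plain list (`enqueued`) as a stand-in for the background-job queue, and a hypothetical `fetch_notifications`:

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE job_descriptions (
    batch_id TEXT, content_url TEXT, done INTEGER DEFAULT 0)""")

def fetch_notifications():
    """Hypothetical call to the upstream web service."""
    return [{"content_url": f"https://example.com/content/{i}"} for i in range(4)]

enqueued = []  # stand-in for the background-job queue

def enqueue_job_b(content_url):
    enqueued.append(content_url)

def job_a():
    batch_id = str(uuid.uuid4())  # fresh Batch ID on every invocation
    for n in fetch_notifications():
        # Store the Job description instead of enqueueing Job B directly.
        db.execute(
            "INSERT INTO job_descriptions (batch_id, content_url) VALUES (?, ?)",
            (batch_id, n["content_url"]))
    db.commit()
    return batch_id

def job_x(batch_id):
    rows = db.execute(
        "SELECT rowid, content_url FROM job_descriptions "
        "WHERE batch_id = ? AND done = 0", (batch_id,)).fetchall()
    for rowid, content_url in rows:
        enqueue_job_b(content_url)
        # Mark done right after each successful enqueue, so a respawned
        # Job X only picks up descriptions not yet queued.
        db.execute("UPDATE job_descriptions SET done = 1 WHERE rowid = ?",
                   (rowid,))
        db.commit()

batch = job_a()
job_x(batch)
```

Re-running `job_x(batch)` enqueues nothing new, because every description in the batch is already marked done.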
If Job A fails while iterating over notifications, no duplicate Job B instances are created, because we have only stored their Job descriptions. On the next successful respawn of Job A, a new Batch ID is created. If Job X fails while iterating through Job descriptions, only the descriptions not yet marked as done will be picked up on the next respawn, since we mark each one as done right after enqueueing it. Order restored.
There is certainly room for improvement, such as making Job A better at handling its own state when it talks to a more powerful API such as OData. Still, this approach works in all scenarios, irrespective of the flexibility of the service Job A depends on. Got suggestions or improvements? Feel free to buzz me or leave a comment below.