Thursday 19 March 2015

Don’t win the race, to lose it all to a race!

It was 3 in the morning but the office was buzzing! Absolute mayhem! You would think we were at war. We were faced with an issue like never before, and we were no closer to the root cause than Leonardo di Caprio was to winning an Oscar! Sure, the issue was simple enough.  On two separate instances, customers had been charged twice for the same transaction, and if that weren’t enough, the attachment logging payment attempts was not reflecting the number of payment attempts. To cut a long story short, the customers’ money was simply vanishing in a black-hole with absolutely no traces on the Salesforce instance whatsoever! Yeah, THAT!

Now, it’s not for nothing that we are called the ‘fire-fighting’ unit internally but our reputation was facing some serious threats! We were at it for almost 20 hours by now, and the only possible explanation we had figured out so far was the one hiding behind two daemon (demon for us really) batches overlapping within the same time frame. And then, over a cup of magical coffee, we had our ‘Eureka’ moment! “What if the batch is reading what they are not supposed to?” “What if another batch has processed a particular record and the first batch still sees the unprocessed record?” In this case, the batch was triggered at the same time each day, processing records for that day. Quite a common, every-day use case. What was different, however, was that the batch started on day 1, extended its execution on day 2, and by then the second batch had already been triggered, resulting in two batch threads executing simultaneously. The answer lied on the thin edge of newline (\n) character between the start() and execute() Batch class methods. A lot more than what met the eye was happening between the executions of these methods!!!

The code was developed based on the incorrect assumption that a set of batch triggered will ALWAYS finish before another batch set. The records were being processed twice because Salesforce caches records queried in start( ) across start and execute invocations. In other words, Batch 2 dirty reads Record 1 state as it was during the time Batch2.start( ) was executing, and not the current database state after Batch1.execute( ), which has already processed the record.



How we got our Mojo back

Before I tell you how we fixed it, here’s a thumb rule for future reference, which I learnt the hard way: Never, ever depend on the state of records in Batch.start() method, but query IDs in start() and generate the QueryLocator accordingly.

In Batch.execute(), we re-queried the database on the set of IDs retrieved in Batch.start() including the where clauses. Including the where clause again in Batch.execute() was very important because the same records which passed the where clause in Batch.start() may fail to satisfy the where condition(s) in Batch.execute(), as another batch transaction may have modified the records in the interim.


Sample Code Snippet



The final word…

As programmers, we are often guilty of linear thinking, building batches after batches on the same track, without anticipating runtime interaction between multiple batch threads. Comprehensive testing is a key factor, although it’s not given that you will be successful in uncovering all issues/bugs during the testing phase itself. The best possible way to tackle such scenarios is a two-way magic mantra: 1) to never assume the order of execution of the batches and 2) to follow defensive programming techniques as mentioned above.

Over the next few weeks, I will discuss a few other race scenarios like locking records in the same transaction to avoid race conditions, trigger race conditions, all aimed at making the job of developers a tad easier.


Written by Swapnil Shrikhande, Salesforce Champion at Eternus Solutions

2 comments: