Over the course of the last few months building our backend, we’ve learned a lot about working with Akka to build a stable, resilient cluster. We went down some dead-ends and most definitely had some hair-raising moments of frustration, especially when it came to cluster stability. This post will briefly cover some of the lessons we learned in the hope that others don’t experience our exasperation.
(Shameless plug: To join Conspire and get a personalized weekly update about your email relationships, sign up at goconspire.com.)
Cap the amount of processing caused by each message
Our original implementation simply processed all of a user’s message headers at the same time. Generally this isn’t a problem, but some of our users have millions of messages. When those users hit (or worse, when multiple users of that scale hit at the same time), our nodes would become unresponsive and be marked as unreachable by the rest of our Akka cluster, despite eventually completing the work successfully. This issue only manifested itself on AWS. Our first attempts to solve the problem involved tweaking Akka’s configuration settings: increasing the threshold for failure detection and moving Akka’s internal clustering and remoting onto their own dispatchers. Sometimes this helped, but we still ran into too many failures. Major problems occurred whenever a stop-the-world garbage collection was triggered: the entire node would become unresponsive. Given that we generated huge amounts of data, much of which managed to graduate to the tenured generation, a full garbage collection could take quite some time. Tweaking JVM settings helped but wasn’t a long-term solution. If you aren’t familiar with garbage collection on the JVM, I highly recommend reading Oracle’s documentation.
Ultimately we had to move to incremental processing and cap the amount of work done concurrently. We chunked our message headers into blocks of 60,000, equivalent to roughly 30 megabytes of uncompressed JSON. At this size any given chunk would be processed fast enough (and cleaned up quickly enough) to avoid rendering the worker node unresponsive. This significantly ameliorated our stability issues without seriously compromising performance.
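As a rough illustration, the chunking looks something like the sketch below. This is simplified and not our production code; the message types (MessageHeader, ProcessHeaders, ProcessChunk) are invented for the example.

```scala
import akka.actor.{Actor, ActorRef}

// Hypothetical message types, purely for illustration.
case class MessageHeader(id: String, sizeBytes: Int)
case class ProcessHeaders(headers: Seq[MessageHeader])
case class ProcessChunk(chunk: Seq[MessageHeader])

class ChunkingSupervisor(worker: ActorRef) extends Actor {
  // ~60,000 headers works out to roughly 30 MB of uncompressed JSON for us.
  val ChunkSize = 60000

  def receive = {
    case ProcessHeaders(headers) =>
      // Hand the worker bounded chunks instead of one enormous batch,
      // so each chunk can be processed and garbage-collected quickly.
      headers.grouped(ChunkSize).foreach(chunk => worker ! ProcessChunk(chunk))
  }
}
```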
Lesson: Don’t try to do too much work at once, especially on a VPS.
Tune your cluster dispatcher
This recommendation is quite common on the Akka mailing list; we’re repeating it here for emphasis: move clustering onto its own dispatcher. This won’t solve every problem and won’t survive a major garbage collection, but it will improve clustering stability in a cloud environment.
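For reference, configuration along these lines is what we mean; the values here are purely illustrative, and the setting names should be verified against the Akka version you’re running:

```hocon
akka.cluster {
  # Run cluster internals on a dedicated dispatcher so heavy work and GC
  # pauses on the default dispatcher don't starve cluster heartbeats.
  use-dispatcher = cluster-dispatcher

  # Loosen failure detection somewhat for noisy cloud networks.
  failure-detector {
    threshold = 12.0
    acceptable-heartbeat-pause = 5 s
  }
}

cluster-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 2
    parallelism-max = 4
  }
}
```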
Lesson: The cloud is a scary place, adjust accordingly.
Improve testability by not creating actors directly
Having actors directly create the actors that interface with external systems isn’t conducive to testing. If your processing actor creates its own persistence actor, you’ll have to jump through hoops in your unit tests to keep that external system out of the test.
We favor two approaches. First, pass such actors in at creation time. If you’re dealing with a SQL database, this approach also makes it easier to keep your persistence actors on their own dispatcher (so they don’t block the default dispatcher) and to manage how many actors hold database connections at once. In testing you can pass in mock actors, while in production you pass in the concrete implementations.
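A minimal sketch of the first approach, with hypothetical message and actor names:

```scala
import akka.actor.{Actor, ActorRef, Props}

// Hypothetical messages; our real protocol differs.
case class Persist(payload: String)
case class Process(payload: String)

// The persistence actor is handed in rather than created internally, so tests
// can substitute a TestProbe and production can supply an actor running on a
// dedicated (possibly blocking) dispatcher.
class ProcessingActor(persistence: ActorRef) extends Actor {
  def receive = {
    case Process(payload) =>
      // ... do the real work here ...
      persistence ! Persist(payload)
  }
}

object ProcessingActor {
  def props(persistence: ActorRef): Props = Props(new ProcessingActor(persistence))
}

// In production (PersistenceActor is hypothetical):
//   val persistence = system.actorOf(
//     PersistenceActor.props.withDispatcher("db-dispatcher"))
//   val processor = system.actorOf(ProcessingActor.props(persistence))
// In tests, pass a TestProbe's ref instead.
```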
The second approach is to use the cake pattern. We aren’t using this at the moment but are considering moving parts of our codebase to it in the future.
Lesson: Avoid patterns that prohibit or complicate testability.
Make sure your nodes can spin themselves back up
JVMs crash and cloud environments can be unreliable and noisy. We tuned our failure-detection settings to allow some leeway in this regard, but we didn’t want to be so lax that we missed legitimate failures. We mitigate this problem with a two-pronged approach: ensure your nodes can restart themselves and monitor the cluster for failures. In our case, nodes that fall out of the cluster (that is, mark every other node as unreachable) kill their actor systems and shut down their JVM. Ubuntu’s Upstart daemon monitors the process and restarts the JVM if it exits with a non-zero code. This allows us to accept and be comfortable with unexpected failures: the node simply rejoins, and any lost work is marked for reprocessing by the supervisor.
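The mechanics will vary with your setup, but the core idea looks roughly like the following sketch (not our exact implementation), built on Akka’s cluster membership events:

```scala
import akka.actor.{Actor, Address}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

// If this node has marked every other member unreachable, assume we're the
// one that fell out, shut the actor system down, and exit non-zero so Upstart
// restarts the JVM and the node rejoins the cluster.
class ClusterWatchdog extends Actor {
  val cluster = Cluster(context.system)
  var unreachable = Set.empty[Address]

  override def preStart(): Unit =
    cluster.subscribe(self, classOf[UnreachableMember], classOf[ReachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case UnreachableMember(member) =>
      unreachable += member.address
      val others = cluster.state.members.toList.map(_.address).toSet - cluster.selfAddress
      if (others.nonEmpty && others.subsetOf(unreachable)) {
        context.system.terminate() // on older Akka versions: system.shutdown()
        sys.exit(1)                // Upstart sees the non-zero exit and respawns us
      }
    case ReachableMember(member) =>
      unreachable -= member.address
    case _: CurrentClusterState => // initial snapshot; ignored in this sketch
  }
}
```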
Lesson: Make sure your system can heal itself.
Automated deployment is a necessity
Spinning up multiple EC2 instances and provisioning them manually isn’t feasible. Mistakes will be made; you must have automated deployment if you want to save your sanity. We use Chef and Vagrant to provision and set up new EC2 instances. We can’t say we’re huge fans of Chef; it definitely has its warts, but for the moment we’re content. Writing the recipes and getting everything configured correctly took us quite some time, but once we reached a working state the setup proved quite stable.
Lesson: Tying in with the previous lesson, automated deployment will make it easier to both scale and heal your cluster. We can add a new worker node with one command via Vagrant.
The FSM is your friend
One of Akka’s best features is its finite-state machine helper. Strictly speaking, the FSM isn’t needed, but in certain situations it lends itself to far more readable code and guides your thinking toward more predictable, maintainable designs. We also found that the FSM helps us catch corner cases we didn’t initially anticipate. It isn’t a magic bullet, but writing a complex process with the FSM mixin forces you to clearly enunciate each state and transition within the process, and to find potential issues or holes in your logic more quickly.
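To make the shape concrete, here is a toy FSM with invented states, data, and messages; it isn’t from our codebase, just an illustration of how the mixin makes each state and transition explicit:

```scala
import akka.actor.{ActorRef, FSM}

// States, data, and messages invented for this example.
sealed trait JobState
case object Idle extends JobState
case object Fetching extends JobState
case object Processing extends JobState

sealed trait JobData
case object Empty extends JobData
case class Job(requester: ActorRef, userId: String) extends JobData

case class StartJob(userId: String)
case object DataFetched
case object WorkFinished

class JobFsm extends FSM[JobState, JobData] {
  startWith(Idle, Empty)

  when(Idle) {
    case Event(StartJob(userId), Empty) =>
      // kick off the asynchronous fetch here...
      goto(Fetching) using Job(sender(), userId)
  }

  when(Fetching) {
    case Event(DataFetched, job: Job) =>
      goto(Processing) using job
  }

  when(Processing) {
    case Event(WorkFinished, Job(requester, _)) =>
      requester ! WorkFinished
      goto(Idle) using Empty
  }

  whenUnhandled {
    // Funneling every unexpected state/message pair through here surfaces the
    // corner cases you didn't think of up front.
    case Event(msg, _) =>
      log.warning("Unexpected message {} in state {}", msg, stateName)
      stay()
  }

  initialize()
}
```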
Lesson: Your mental model of your system is very, very important and using helpers like the FSM can help keep everything straight when you translate from thought to code.
Rogue concurrency: the overreaching actor (aka be very very careful mixing actors and futures)
One of our biggest problems was what we term “rogue concurrency.” This commonly occurs when intensive work is done in the callback of a future. Recall that an actor processes one message at a time and immediately moves on to the next message upon completing the current one. Also recall that a future does its work on a separate execution context, and an actor that triggers a future within its receive returns immediately rather than waiting for it to complete. These are good things for scalability and performance, but they also form a potential trap.
Imagine the following scenario: a request for CPU intensive work is sent to a worker node. In order to begin, the worker node sends an asynchronous request to the persistence layer to retrieve a large amount of data (say, 50 megabytes); when this future is fulfilled, work begins and can take quite some time.
Now imagine that 100 requests for such work come in at roughly the same time. Because each request is non-blocking, the actor immediately starts working on all 100 requests at once. All 100 requests for data are sent off, leading to five gigabytes of data coming back from your persistence service. Even if you don’t get an OutOfMemoryError (we’ll assume you’re using a beefy machine with 32 GB of RAM), processing that much data at once causes your node to become unresponsive or run out of threads (which also causes an OutOfMemoryError). In either case, your node either dies unexpectedly or is marked unreachable by the rest of the cluster. We sometimes found on AWS m.medium instances that the JVM would die silently with no error logged; ultimately, rogue concurrency turned out to be the culprit.
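A stripped-down sketch of the trap (the message types, persistence actor, and crunch step are all hypothetical):

```scala
import akka.actor.{Actor, ActorRef}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.duration._

// Hypothetical messages for illustration.
case class ProcessUser(userId: String)
case class FetchData(userId: String)
case class UserData(userId: String, blob: Array[Byte]) // ~50 MB each

class NaiveWorker(persistence: ActorRef) extends Actor {
  import context.dispatcher
  implicit val timeout: Timeout = Timeout(30.seconds)

  def receive = {
    case ProcessUser(userId) =>
      // receive returns immediately; nothing stops 100 of these futures from
      // being in flight (with their 50 MB payloads resident) at the same time.
      (persistence ? FetchData(userId)).mapTo[UserData].foreach { data =>
        crunch(data) // CPU-heavy work running outside the actor's message loop
      }
  }

  def crunch(data: UserData): Unit = () // stand-in for the real work
}
```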
Futures aren’t required for this problem to occur. The same thing would happen if we never operated within a future’s callback but instead relied on Akka’s pipeTo helper and only did processing within an actor’s receive function. The 100 requests for data would still go out and be piped into an actor for processing. That actor may only process one message at a time, but letting that amount of data pile up in its mailbox can also cause problems. Piping to a router of actors may help, but not in every circumstance (see the next lesson). And what if the actor you pipe to delegates CPU-intensive work to a child actor? Rogue concurrency.
There are several ways around this, but our favorite is to always pull work rather than push it. This lets you bound the amount of work in progress at any given time.
Lesson: Avoid mixing concurrency patterns and always keep an eye on the amount of work done at any given time.
Pull, don’t push
In the early days of development we blindly pushed work from the supervisor node to a router of workers. This approach is terrible. When (not if) your nodes crash, any in-progress work is lost with no record of the queue. Without any throttling, your nodes can fall into rogue concurrency, where far more work is attempted at once than is feasible. We’ll go into greater detail on our approach in part 5 of this series; for now, read about the pattern for balancing work on the Akka blog, which forms the basis of our approach, and see the sketch below.
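To give a flavor of the approach, here is a bare-bones work-pulling sketch with invented message names; a real implementation adds supervision, acknowledgements, and re-enqueueing of work lost when a worker dies:

```scala
import akka.actor.{Actor, ActorRef}
import scala.collection.immutable.Queue

// Messages invented for this example.
case class Work(payload: String)
case object GimmeWork

class Master extends Actor {
  var pending = Queue.empty[Work]      // work waiting to be claimed
  var idle = Set.empty[ActorRef]       // workers waiting for work

  def receive = {
    case w: Work =>
      if (idle.nonEmpty) {
        val worker = idle.head
        idle -= worker
        worker ! w
      } else {
        // Queue on the master instead of pushing into a worker's mailbox,
        // so the backlog stays visible and bounded per worker.
        pending = pending.enqueue(w)
      }
    case GimmeWork =>
      if (pending.nonEmpty) {
        val (work, rest) = pending.dequeue
        pending = rest
        sender() ! work
      } else idle += sender()
  }
}

class Worker(master: ActorRef) extends Actor {
  override def preStart(): Unit = master ! GimmeWork

  def receive = {
    case Work(payload) =>
      // ... process payload; at most one unit of work in flight per worker ...
      master ! GimmeWork // ask for the next piece only when this one is done
  }
}
```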
Lesson: Pushing work to workers contributes to rogue concurrency and inhibits resilience without some additional form of record-keeping. Letting workers pull instead both limits the amount of work done at once and makes recovery simpler in case a worker dies while processing.
Most of our lessons revolve around keeping the cluster stable and resilient. Akka certainly offers a lot of help in this regard but it’s not perfect and careful thought is still required. Ultimately, we’re now very happy with where our backend is but we had to go through some very hairy moments to get here.