Aleyant Services Dashboard

One of our application modules crashed on our servers, one server at a time and at different moments, causing the application to restart on each affected server. The warmup process can take about a minute, and during that time some requests fail due to timeouts (504 errors). We do have a healthcheck procedure in place, but it only removes a failed instance after it fails to respond three times in a row, with an expected time of 15s per check. As a result, it may take up to 45s to detect that a node is having issues, and only after that does the healthcheck wait for the node to recover, which happens automatically. The crash dumps were collected and will be analysed by our engineering team.
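
The 45s figure follows from the healthcheck settings described above: three consecutive failures at 15s each. The Python sketch below illustrates that arithmetic and the consecutive-failure rule. It is not our actual healthcheck code; the function names are hypothetical, and it assumes the checks run back to back with no extra interval between them.

```python
# Illustrative sketch only; names are hypothetical, values come from the report.
FAILURES_REQUIRED = 3   # consecutive failed checks before a node is removed
CHECK_TIMEOUT_S = 15    # expected time per check, in seconds


def worst_case_detection_seconds(failures_required=FAILURES_REQUIRED,
                                 check_timeout_s=CHECK_TIMEOUT_S):
    """Longest time before a failing node is removed, assuming back-to-back checks."""
    return failures_required * check_timeout_s


def checks_until_removal(results, failures_required=FAILURES_REQUIRED):
    """Return the 1-based check number at which the node is removed, or None.

    `results` is a sequence of booleans: True for a healthy response,
    False for a failed or timed-out check. Any healthy response resets
    the consecutive-failure counter.
    """
    consecutive = 0
    for number, ok in enumerate(results, start=1):
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= failures_required:
            return number
    return None


if __name__ == "__main__":
    print(worst_case_detection_seconds())
    print(checks_until_removal([True, False, False, False]))
```

Under these assumptions, a node that stops responding keeps receiving traffic for the whole detection window, which is why requests can time out before the node is taken out of rotation.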

28th August 2018