Over the past week, a number of Second Life customers may have noticed that they were not being billed promptly for their Premium membership subscriptions, mainland tier fees, and monthly private region fees, with some customers inadvertently receiving delinquent balance notices by email, as we described on our status blog.
This incident has now been corrected, and our nightly billing system has since processed all users that should have been billed over the past week.
I wanted to share with you some of the details of what happened to cause this outage from our internal post mortem and more importantly, what we’re doing to prevent this happening in the future.
Every night, one of our batch processes collects a list of users that should be billed on that day, and processes that list through one of our internal data service subsystems. Internally, we refer to this process as the 'Nightly Biller'.
A regularly scheduled deploy to that same data service subsystem for a new internal feature inadvertently contained a regression which prevented this Nightly Biller process from running to completion.
On February 1st, 2016, we began a rolling deploy of code to one of our internal dataservice subsystems. For this particular deploy, we opted to deploy the code to the backend machines over four days, deploying to six hosts each day. The deploy was completed on February 4th.
The first support issue regarding billing issues was raised on February 8th, however, as we only had one incident reported to our payments team, we decided to wait and see if they were billed correctly the next night.
However, we were notified on the morning of February 9th that 546 private regions had not been billed, and an internal investigation began with a team assembled from Second Life Engineering, Payments, QA and Network Operations teams. This team identified the regression by 9am, and had pushed the required code fixes to our build system. By noon, the proposed fix was pushed to our staging environment for testing.
Unfortunately, testing overnight uncovered a further problem with this new code that would have prevented new users from being able to join Second Life. On February 10th, investigation into this failure, and how it was connected with both the Nightly Biller system, and our new internal tool code continued.
By February 11th, we had made the decision to roll back to the previous code version that would have allowed the Nightly Biller to complete successfully, but would have disabled our new internal feature. One final review of the new code uncovered an issue with an outdated version of some Linden-specific web libraries. Once these libraries were updated and deployed to our staging environment, our QA team were able to successfully complete the tests for our Nightly Biller, our new internal tool, and the User Registration flow.
The new code was pushed out to our production dataservice subsystem by 7pm on February 11th, and the Payments team were able to confirm that the Nightly Biller ran successfully later that evening.
As a result of this incident, we’re making some internal process changes:
- Firstly, we’ll be changing our build system to ensure that when new code is built, we’re always using the latest version of our internal libraries.
- Secondly, we are implementing changes to our workflow around code deploys to ensure that such regressions do not occur in the future
We're always striving for low risk software deploys at the Lab, and each code deploy request made is evaluated for a potential risk level. Further reducing our risk is internal documentation that describes the release process. Unfortunately, a key step was missed in the process, which inadvertently led to a high risk situation, and the failure of our nightly biller. The above changes are already in progress which will reduce the likelihood of incidents such as this recurring.