WolfBaginski Bearsfoot

Server Restarts: Have they hidden a problem?


I suppose we all know that restarting the sim server is one of the common last-resort fixes for problems in SL, and we also expect this to happen at least once a week, as a side effect of the regular roll-outs of new code.

From my point of view, not privy to the vast amount of information about what else, besides the sim servers, is getting fixed, fettled, and fscked by the Lindens, it looks as though the extended time between sim restarts, over the last six weeks or so, has greatly increased the problems of SL.

In particular, while a sim that is routinely occupied, mainland or private island, has a chance of a restart being requested, my experience of the growing number of dreadful sim crossings is that areas of open water, all "Protected Land", may not be adequately monitored.

My hypothesis is that the routine restarts have been hiding some serious underlying problems in the sim code. While it may not be formally classed as a memory leak, the gross symptoms are similar, in that at some point the accumulated effects slow down the sim. With a memory leak, the proximate cause of the slowdown may be a need to use virtual memory. I suppose you could have similar effects from a badly-coded logfile system, but I would have thought programmers have got past that carelessness long ago.
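As a toy illustration of that hypothesis (not a model of the real sim server; every number here is invented), the sketch below assumes frame time creeps up by a fixed amount for each day of uptime and that a restart resets it. The function name and its parameters are hypothetical.

```python
def worst_frame_time(restart_every_days, total_days=42,
                     base_ms=20.0, growth_ms_per_day=0.5):
    """Worst frame time (ms) seen over the period.

    Assumption: some leak-like effect adds growth_ms_per_day to the
    frame time for every day the sim has been up, and a restart
    clears it. The figures are illustrative, not measurements.
    """
    worst = base_ms
    uptime_days = 0
    for _ in range(total_days):
        frame_ms = base_ms + growth_ms_per_day * uptime_days
        worst = max(worst, frame_ms)
        uptime_days += 1
        if uptime_days == restart_every_days:
            uptime_days = 0  # restart wipes the accumulated slowdown
    return worst


weekly = worst_frame_time(restart_every_days=7)
six_weeks = worst_frame_time(restart_every_days=42)
print(f"weekly restarts: worst frame {weekly} ms")
print(f"no restart for six weeks: worst frame {six_weeks} ms")
```

With weekly restarts the slowdown never gets far enough to be noticed; stretch the interval and the same underlying fault becomes obvious.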

The viewer code is, at least, able to be studied by outsiders, and hence we know that it has had problems. Some problems, over the last six months, are consistent with a "big picture" failure which misses warning signs, but people outside Linden Lab were able to spot the causes pretty quickly.

The sim server code doesn't have that advantage. What's the fix for that? A project in the style of Viewer 2, to re-write the code from scratch? If so, all I can say is that I hope you have a solid specification of what the server code does.

 


Long story short: pretty much yes.  Frequent rolling restarts are masking issues which take a week or more to be observable. (._.)

If anything, the lab should (and hopefully does) maintain a small collection of sims which run a certain degree of endurance testing.  Getting code reviewed by an external QA is a risk many closed-model software companies won't take.  So, it's pretty general in practice to gather performance issues from compiled code only.  Not saying it's the best way to work out issues, but going open source isn't the be-all and end-all either. (>_<)

In a way, though, this is a rather old topic.  When Havok4 hit the grid, sandboxes began to be plagued with asset issues.  Ghosted prims, flaky inventory performance, and a general overall mess.  Was it all the result of some mad rush of bad code?  Likely not.  What was occurring was that Havok1 was crashing sims so often that most sandboxes would get a daily restart.  With the newfound stability of Havok4, old problems became more apparent.  (._.)

This is a pretty typical growth process for something so complex.  Each solution can possibly lead to a number of pre-existing problems being unmasked.  So, in a way, what we're experiencing right now isn't as alien as it may seem.  This isn't to say that the sim code doesn't need any in-depth review.  But, given what I got from a brief conversation with Bagman at SLCC, the lab has their work cut out for them.  With that in mind, there are no easy answers... (._.)

 

 


For over 6 months I have been trying to tell the Lindens about exactly the issue you described, Wolf. But the term I use is "sleep", because the homestead open water sims you mentioned are acting exactly like they go to sleep after a period of non-use.

Theoretically, one would study the problem, collect data, present it on the JIRA, and the Lindens would take it from there. However, I tried that and it didn't work. It appears that they don't want any new JIRAs about this issue; they want all comments to be placed at the end of a 4.5 year old "catch all" JIRA, which allows them to ignore the comments if they choose to do so. When it comes to this particular problem, they have definitely chosen to do so.

They either:

A) Do not know how bad it is because they think all the region crossing problems are related to SL just being SL, or

B) Do realize the sims are "going to sleep" but have no idea how to fix it.

I tried to open a specific JIRA on this "going to sleep" issue, which started about 9 months ago, but it was closed as a duplicate JIRA of the 4.5 year old "catch-all" JIRA.

Since I sell a product related to sim crossings, I have put in a lot of time studying this particular issue, and I have a wealth of information about it. But I have *no way* to get that information to someone who matters.

 ETA: There is a silver lining here, I believe. The Lindens tweaked region crossing performance not too long ago, and now I see they are doing it again. Tweaking is fine, so once we finally convince them that unused homestead regions are going to sleep (and they fix it) region crossings will be so tweaked they'll be absolutely awesome! :)


As Sideways said you are pretty much correct.

The Lindens do know that the weekly restarts are hiding problems. During the holidays they did restarts even when code was not rolling out. The last couple of weeks the release channel code has been recycled and not made it to the main grid. I did not see any restart notices in the first part of January. So, they may be letting them run without restart and problems are building up.

The SLEEP problem you are talking about is very much like the Time Warp problem they worked on for over a year to track down. The server OS upgrades in late 2011 were the fix for the Time Warp problem. Depending on when you tried to file a JIRA, they may have assumed Time Warp was what you were talking about.

Some of the memory leaks and other problems they may not be chasing too hard. When you are rewriting massive parts of the system as the Lab is doing, it seems reasonable for them to hope the problem is in the code being replaced.

Lots of work is going on to improve performance and fix problems. The user group meetings with the Lindens let us ask what's new. For a time now the answer is always... or mostly... stability and performance.

If you want to see how well they do or don't handle JIRAs, look at the SVC Project in the JIRA.

