
Thread: Reseller6 Problems

  1. #11
    Join Date
    Jul 2006
    Posts
    1,502

    Default

    Quote Originally Posted by TonyD
    Hmmm... wonder what exactly is recorded as "up-time" then. httpd had been going down several times a day over the last couple weeks on that server according to WHM. I suppose if you're counting the amount of time the system itself was actually turned on and running, then 99.6% makes sense. If we're talking about the amount of time that websites were accessible on that server, there is simply no way (as I said, even if we just take the time listed in the forum for those two reboots, we still don't come up with that high a number).

    But this is all a side issue anyway. As I said, I understand problems, and I understand that these problems never come one at a time. I'm glad to know what your up-time policy is, even if it is lower than what many other hosts have. I believe you hit the problem on the head in the thread regarding the network issues... communication. Expectations of good communication were my own primary reason for joining WHB.

    I like what others have suggested regarding some other type of alert system. These forums work great... but if the forums are down, like they were this week... then they do no good.

    Sincerely,

    TonyD
    You will not find better uptime at a better price, of that I am sure. The figures I posted are httpd uptime, and come from checks performed every minute, 24 hours a day.
    Matt R.
    WHB Chief Ninja

  2. #12

    Default

    Tony, in Matt's defense, your posts are wicked long and involve too much math so I don't blame him for just going by the figures he got from the monitoring software.

    I also agree that 99.6% is not such a horrible uptime for the price. If you expect 99.9% uptime then you should also expect to pay more. You can't go with one of the cheapest hosts out there and then complain that it's not perfect.

    However, Matt, if 99.621% is the figure you got from Nagios, then I think you might want to look into hosting the Nagios cluster at a different datacenter than your main locations -- or at least get one outside server to use for Nagios -- so that it can test the actual remote availability of HTTP.

    I say this because Alertra has you at 99.243% uptime for April. That's about 5 hours of downtime.

    If your Nagios cluster lives in the same datacenter as the servers it's checking, then it will only show downtime caused by internal server issues. You'll never see downtime caused by things like routing problems or service provider failures, because Nagios never has to leave the datacenter in order to contact the servers. It can just use the internal network. That could be the source of the discrepancy.

    This might also explain why in my "Downtime" thread, when I first mentioned the downtime, you said you hadn't even noticed it -- and that was after all 3 WHB sites had been going down over and over again, along with ALL the servers at FortressITX.

    Now this Alertra figure is from the Alertra monitoring that's included in the FindMyHosting.com report, and I'm pretty sure what they do is just test your main company URL. However, since it seems Tony's server suffered the same downtime as the main WHB site did in that incident a few days ago, then they must both be hosted at FortressITX, and this Alertra reading should be accurate for BOTH the WHB sites AND reseller6 (as well as every other server at Fortress).

    At least, it should be accurate in that the actual downtime for a server at Fortress is AT LEAST what Alertra says it is. Alertra wouldn't pick up on downtime caused by maintenance on one specific server, unless it happened to be the server where the main company site resides. That means that this Alertra figure only represents the downtime caused by that routing disaster. The downtime caused by reboots and maintenance on reseller6 should be ADDED to that figure.

    So I apologize for getting into math when I criticized Tony for it, but I'm going to anyway. Matt's figure from Nagios doesn't seem to include downtime caused by the routing failure, and Alertra's doesn't include the maintenance on reseller6. Therefore we should be able to put them together to get the total downtime reseller6 suffered this month, so far. The Nagios figure of 99.6% means 0.4% downtime, and Alertra's 99.243%, rounded to 99.25%, means 0.75% downtime. The grand total is 1.15% downtime, 98.85% uptime.

    That's about 8 hours of downtime by my rough estimate. Still a far cry from 15 hours as Tony claims, but, 8 hours is still a lot.
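For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation above. It assumes the two sets of outages did not overlap (so their downtime simply adds) and that "this month, so far" means 28 days, i.e. 672 hours. The function names are made up for illustration.

```python
def downtime_hours(uptime_percent, period_hours):
    """Convert an uptime percentage into hours of downtime over a period."""
    return (100.0 - uptime_percent) / 100.0 * period_hours


def combined_uptime(*uptime_percents):
    """Combine uptime figures from monitors that saw different, non-overlapping outages.

    Assumption: no outage was counted by more than one monitor,
    so the individual downtime percentages can be added together.
    """
    total_downtime = sum(100.0 - u for u in uptime_percents)
    return 100.0 - total_downtime


# Assumed period: April 1 through April 28 = 28 days.
HOURS_SO_FAR = 28 * 24

nagios_uptime = 99.6    # figure quoted from Nagios
alertra_uptime = 99.25  # Alertra's 99.243%, rounded as in the post

total = combined_uptime(nagios_uptime, alertra_uptime)
print(f"combined uptime: {total:.2f}%")
print(f"combined downtime: {downtime_hours(total, HOURS_SO_FAR):.1f} hours")
```

If the two monitors did log some of the same outage, the sum would overstate the downtime, so treat the result as an upper estimate under that assumption.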

    Hey now my post is wicked long and has a lot of math too. Oh well.

    http://fmh.alertra.com/fmhuptime/?id1=125006
    Last edited by equazcion; 04-28-2007 at 03:04 PM.

  3. #13
    Join Date
    Jul 2006
    Posts
    1,502

    Default

    Hi All,

    Our Nagios is clustered and I believe those to be accurate results. One thing that may cause the discrepancies is that we monitor at 1 minute intervals. So if a reboot takes 2 minutes, Nagios will record 2 minutes.

    I suspect your Alertra monitoring checks every 5 minutes. So if the same reboot occurs when Alertra is scheduled to check, it's going to show much more than the 2 minutes it actually took. I know Alertra can monitor at 1 minute intervals but it's very expensive to do so.
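The effect of the check interval can be sketched with a toy model in Python. This is a simplification for illustration only, not how Nagios or Alertra actually compute their figures: it assumes a monitor polls at fixed times (0, interval, 2 x interval, ...) and charges one full interval of downtime for every failed check. The function name and parameters are hypothetical.

```python
def recorded_downtime(outage_start, outage_length, interval, horizon=3600):
    """Seconds of downtime a fixed-interval poller would record.

    Toy model (assumption): checks happen at t = 0, interval, 2*interval, ...
    and each failed check is counted as `interval` seconds of downtime.
    All times are in seconds; `horizon` is how long the simulation runs.
    """
    outage_end = outage_start + outage_length
    failed_checks = sum(
        1
        for t in range(0, horizon, interval)
        if outage_start <= t < outage_end
    )
    return failed_checks * interval


# A 2-minute reboot seen by a 1-minute poller versus a 5-minute poller.
reboot = 120
print(recorded_downtime(290, reboot, 60))   # 1-minute checks
print(recorded_downtime(290, reboot, 300))  # 5-minute check lands inside the reboot
print(recorded_downtime(10, reboot, 300))   # reboot falls between 5-minute checks
```

In this model the 5-minute poller either overstates a short outage or misses it entirely, depending on when the outage starts, while the 1-minute poller stays close to the real duration.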

    Either way, the downtime this month wasn't welcomed or expected and it was ironic that it had to happen when the two most senior members of staff were travelling, and travelling back from the data center in question. They are usually more stable, and we maintain an excellent relationship with them but we have called in our SLA this month to ensure we keep them on their toes.

    We are also moving to a more independent network solution within the coming weeks and we'll announce when we do this. Wayne (our senior admin) will be doing this in conjunction with datacenter personnel.

    It is our intention to hit 100% uptime, or as close to 100% as possible, each and every month whilst still maintaining current pricing. We have always pioneered the affordability level of hosting and we will continue to do so, at a groundbreaking price and with performance we are all happy about. At the end of the month, we'll have 9-5 phone support and 24x7 live chat support too. I do not know of another host with our pricing that will offer all of this, with the reliable service you are accustomed to.

    And lastly (this is possibly the longest forum post I've ever made!), we are going to be launching "business class" hosting probably next quarter. We are looking at a number of ways to implement high availability through software and hardware load balancers to the mass market. Our goal is to have entry level pricing starting at under $10 / month. There was significant interest in this product in the recent poll we did, which reinforced our belief that a product such as this would be extremely popular.
    Matt R.
    WHB Chief Ninja

  4. #14

    Default

    Clusters are great, but as I mentioned, if they are located within the same datacenter as the servers they're monitoring, then they are not picking up downtime caused by routing issues.

    And it doesn't matter what interval Alertra is using. Alertra is not monitoring your individual servers, it is only monitoring webhostingbuzz.com. You're talking about an inaccuracy due to the rebooting of reseller6, and I'm saying there's no way it would have even known reseller6 was down. It's only showing downtime from the routing issue. The discrepancy is caused by the fact that Alertra and Nagios were each logging downtime from 2 separate issues occurring at 2 separate times. Neither one shows the actual total downtime from both.

  5. #15
    Join Date
    Jul 2006
    Posts
    1,502

    Default

    I believe the Nagios cluster is intelligent (as we have 2 Nagios servers). I'll check, and if not, we'll definitely set it up this way.
    Matt R.
    WHB Chief Ninja

  6. #16

    Default

    I'm sure it's intelligent...

    I admit I don't know a lot about what Nagios is capable of, but I'm pretty sure that it's not possible for any software to do what you're hoping it does.

    One computer trying to connect to another will always choose the shortest possible route. If both computers are within the same local network, they will end up connecting via the local router. There's just no way around that, at least for software. It would take a good many custom router configurations to somehow force the request to go out onto the internet before coming back to a local server.

    This is why people use services like Alertra -- because the only way to monitor remote connectability is by actually connecting from a remote location. Nagios monitors services and tells you if they go down, as that's what it's meant for, but it can't tell you anything about server reachability from outside the datacenter.

    There is another way which I've used to do this from my home network, to test my own local web server using its remote address, and that is by using a remote proxy server. If I connect through a remote proxy, then I'm making a request to the proxy, rather than to my local server, so the request actually leaves the local network. This still involves using a remote machine though. I don't think there's any way around that.

    If you care, I have a suggestion for you. Since you now have servers at two different locations, Texas and NJ, I would take advantage of that. Set up one Nagios server in Texas, and one in NJ. Have the Texas one monitor the servers in NJ, and have the NJ one monitor the Texas servers. That way you'll always know if anything is inaccessible for any reason whatsoever.
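To make the cross-monitoring idea concrete, here is a rough Python sketch. It is not a Nagios configuration and not a description of WHB's actual setup; the site names and hostnames are hypothetical placeholders. `cross_monitor_plan` works out which remote servers each location should watch, and `is_reachable` is the kind of plain HTTP check each monitor could run from outside the other datacenter.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=10):
    """Return True if an HTTP request to `url` succeeds within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP error codes.
        return False


def cross_monitor_plan(servers_by_site):
    """Map each monitoring site to the servers hosted at every *other* site."""
    plan = {}
    for monitor_site in servers_by_site:
        plan[monitor_site] = [
            host
            for site, hosts in servers_by_site.items()
            if site != monitor_site
            for host in hosts
        ]
    return plan


# Hypothetical inventory: one monitor per location, each watching the other.
inventory = {
    "texas": ["tx1.example.com"],
    "nj": ["nj1.example.com", "nj2.example.com"],
}
for monitor, targets in cross_monitor_plan(inventory).items():
    print(monitor, "checks", targets)
```

Because each check has to leave one datacenter and enter the other, a routing failure at either end would show up as downtime, which a monitor on the local network would not see.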

    Anyway. I'll shut up now.
    Last edited by equazcion; 04-28-2007 at 05:54 PM.

  7. #17
    Join Date
    Jul 2006
    Posts
    1,502

    Default

    Yeah, that's what we have. Alertra uses Nagios too, just fyi. It is incredibly powerful (and complex).
    Matt R.
    WHB Chief Ninja

  8. #18

    Default

    And whenever your proxy is down, downtime is reported as well.

    Now that we know what uptime you think is reasonable, what server load do you think is acceptable?
