You might’ve noticed we had a serious outage that took down our website, forums—everything.
Some people have speculated about what could possibly be so catastrophic as to take us down for WEEKS.
For legal reasons, I can't get into the details other than to say it wasn't ransomware, but it was a total data loss. This catastrophe wiped out everything at our data center, including the on-site backups, so we lost over three decades of data in one hit.
Fortunately, we run nightly offsite backups, but they’re enormous—about 34 terabytes. That’s 34,000 gigabytes, all of which has to be downloaded, scanned, extracted, and then reuploaded to new servers. Just the download alone took over a week. Then came the challenge of figuring out which parts needed immediate restoring, in which order, and whether we should rebuild them piece by piece, create entirely new services, or move to a cloud-based infrastructure to avoid having a single colocation ever again.
We’re talking about a giant library of websites, databases, skins, themes, icons, wallpapers, videos, and more. Some of it’s ancient, from when we started well before Google or Facebook existed. Imagine sifting through tens of thousands of gigabytes to find a single legacy web service, built decades ago, that needs to run on a specific OS. It’s a painstaking process.
This outage has been extremely difficult. Everything from old box art for our products to OS/2 programs I wrote in my college days—gone, at least until offsite backups did their job. We had fallback backups on hard drives, DVDs, tapes, and so on, but for a while, it wasn’t entirely clear how much of that would be usable.
A fun fact some may not know: we have one of the oldest continuously used forums around, migrated from Usenet eons ago. That entire environment was wiped, so we’re rebuilding it from offsite storage. Not everything is back yet, and it looks like a few forum user accounts will be lost. That’s not related to customer data, but still worth noting.
We appreciate everyone’s patience. Getting services running again has been the top priority. It’s been a monumental effort, but we’re seeing real progress each day, and the community’s understanding means a lot.
Thanks for sticking with us,
-Brad (Founder & CEO)
Did a lot of tech support on various levels, and this is a perfect example of the first three rules of computing.

Rule 1: Backup
Rule 2: BACKUP
Rule 3: See Rules 1 & 2

Additional advice: have multiple backups, including something offsite. You never know what will happen, and things can happen to entire buildings or more (a bare-bones example of an offsite job is sketched at the end of this post)...

Now for the stupid story from when I worked tech support for a particular company. Took the next call, and the caller informed me that live update wasn't working. I mentioned it had been working that morning and there hadn't been any notices of an outage yet, but if he could give me a moment I'd check. Sure enough, it was out, so I informed him. He asked if I knew what was wrong, to which I replied, "No, I haven't even gotten internal notice of it yet, so for all I know right now the servers are flooded, but as they're on the second floor of the building next to me, I really don't think that's likely." Sent off the internal queries/notices, and about an hour later we got informed that a water pipe on the 3rd floor had broken and flooded the servers out... Ummm... Talk about an unexpected mind-blown incident.
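To make the "something offsite" part concrete, here's a minimal sketch of a nightly offsite backup job: archive, checksum, upload. The `/data` directory, the bucket name, and the tooling (tar plus an S3-compatible upload via boto3) are all assumptions for the example, not anyone's actual setup; it's just one way to do it.

```python
#!/usr/bin/env python3
"""Minimal sketch of a nightly offsite backup: archive, checksum, upload.

Assumptions (not from the original post): the data lives under /data, the
offsite target is an S3-compatible bucket called "offsite-backups", and
credentials for boto3 are already configured.
"""
import datetime
import hashlib
import subprocess

import boto3

SOURCE_DIR = "/data"                      # hypothetical data directory
BUCKET = "offsite-backups"                # hypothetical offsite bucket
STAMP = datetime.date.today().isoformat()
ARCHIVE = f"/tmp/backup-{STAMP}.tar.gz"

# 1. Archive the data (gzip keeps the upload smaller).
subprocess.run(["tar", "-czf", ARCHIVE, SOURCE_DIR], check=True)

# 2. Checksum the archive so a later restore can verify it wasn't corrupted.
sha = hashlib.sha256()
with open(ARCHIVE, "rb") as fh:
    for chunk in iter(lambda: fh.read(1 << 20), b""):
        sha.update(chunk)
digest = sha.hexdigest()

# 3. Upload the archive and its checksum to the offsite bucket.
s3 = boto3.client("s3")
s3.upload_file(ARCHIVE, BUCKET, f"nightly/{STAMP}.tar.gz")
s3.put_object(Bucket=BUCKET, Key=f"nightly/{STAMP}.sha256", Body=digest.encode())

print(f"Uploaded {ARCHIVE} ({digest[:12]}...) to s3://{BUCKET}/nightly/")
```

The checksum file is the part people skip and regret: it's what lets a restore test prove the archive that came back is the archive that went out.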
I feel your pain
I've had to restore from backups and it's generally horribly painful. I already restore from disk, and restoring our existing 13TB database still takes... way too long. We even have to report, based on annual tests, on how long the restore actually takes.
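For anyone stuck doing those annual restore-test reports, here's a sketch of wrapping the restore in a timer and logging the result. The `pg_restore` command, paths, and CSV file are stand-ins I made up; swap in whatever the real restore procedure is.

```python
#!/usr/bin/env python3
"""Sketch of timing an annual restore test so the duration can be reported.

The restore command (pg_restore into a scratch database) and file paths are
assumptions for the example, not the poster's actual setup.
"""
import csv
import datetime
import subprocess
import time

# Hypothetical restore command targeting a scratch/test database.
RESTORE_CMD = ["pg_restore", "--dbname=restore_test", "--jobs=4", "/backups/db.dump"]

start = time.monotonic()
result = subprocess.run(RESTORE_CMD)
elapsed = time.monotonic() - start

# Append the result to a CSV so each year's test is on record.
with open("restore_tests.csv", "a", newline="") as fh:
    csv.writer(fh).writerow([
        datetime.date.today().isoformat(),
        f"{elapsed / 3600:.2f} hours",
        "ok" if result.returncode == 0 else f"failed ({result.returncode})",
    ])

print(f"Restore finished in {elapsed / 3600:.2f} hours (exit {result.returncode})")
```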
If it makes you feel better, my friend used to work at IBM, and a telco company lost the ENTIRE CUSTOMER DATABASE. All of it. It took them like an entire month to rebuild that thing from bubble gum, duct tape, and a dash of hopes and dreams. I'm pretty sure they sacrificed some goats halfway through. They were losing a LOT of money every single day that database was down.
It's kind of getting better these days. Before, it was very, very hard to justify the very, very expensive backup costs, and now that data is exploding, backup infrastructure is getting more insane to even maintain. It then becomes a cost analysis of "how badly do you like your data, and for what circumstances."

It's sort of funny: when we were looking at colo sites, our old CIO looked me dead in the eyes and asked, "So what if someone takes an RPG to the back of this datacenter?" I was thinking, "I'm getting the hell out of here, because I don't get paid enough to handle a guy with an RPG." Then, like two years later, it wasn't a guy with an RPG. It was an electrician who took out both of our redundant, independent power company lines, the backup diesel generators, and then the battery backup, and then power-cycled the power company lines over and over again so the datacenter looked like it was hosting a rave. I would have preferred the guy with the RPG...

But anyway, it's more about how much you want to pay to cover the super ultra edge cases in your backups. People really don't want to pay for stuff with a very low probability. Well, it's low probability until it's not. And you can do everything right, up until the electrician starts a rave in your datacenter and shorts everything.
One thing we're seeing now is that ransomware's first target isn't actually the data, but the backups. They hit the backup infrastructure first, lock it out, then lock the real data. In essence, they encrypt or corrupt the backups first, then ransomware the production data, so you can't restore from backup once you find out production is locked. As a result, we're getting a lot more requirements for immutable backups. It's sort of interesting to see the backup guys getting way more budget and clout than they're used to.
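For anyone curious what "immutable backups" looks like in practice, here's a minimal sketch using S3 Object Lock in compliance mode, where a locked object version can't be deleted before its retention date, even by an admin. The bucket name, retention window, and file paths are made up for the example, and other backup products have their own equivalents.

```python
#!/usr/bin/env python3
"""Minimal sketch of writing an immutable (WORM) backup copy with S3 Object Lock.

Assumptions for the example: the bucket already exists with Object Lock enabled
(it can only be turned on at bucket creation time), and the bucket name,
retention window, and file path are all hypothetical.
"""
import datetime

import boto3

BUCKET = "immutable-backups-example"   # hypothetical, Object Lock already enabled
RETENTION_DAYS = 90                    # hypothetical retention window

s3 = boto3.client("s3")

# In COMPLIANCE mode, this object version cannot be deleted, nor its retention
# shortened, by anyone -- not even the account root user -- until the date passes.
retain_until = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(
    days=RETENTION_DAYS
)

with open("/backups/nightly.tar.gz", "rb") as fh:   # hypothetical backup archive
    s3.put_object(
        Bucket=BUCKET,
        Key="nightly/latest.tar.gz",
        Body=fh,
        ObjectLockMode="COMPLIANCE",
        ObjectLockRetainUntilDate=retain_until,
    )

print(f"Wrote locked copy, protected until {retain_until.isoformat()}")
```

The point is that even if ransomware gets full control of the backup account's credentials, the locked copies stay readable until the retention window runs out.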
Yep. We had backups on site. We had backups off site. It was restoring from the offsite backups that took so long. Just so big. 34TB is no joke.
Most of our offsite backups are in Iron Mountain. I think the last time anyone actually needed something from there, it took like 2-3 days just to find the damn tape to restore from.
Yeah, there are always "issues". Some common backup failures I've seen are places that only keep one or two backups, or that keep reusing the same tapes over and over, not realizing those things are only reliable for a couple of reuses, and they never seem to test the backups.

The worst situation I ran into was 9/11. A good 2/3rds of our support people just didn't come in for the week after that, and those of us who did had a really weird week. Turns out a LOT of companies had their servers in those two buildings, and their IT guys, the ones who came in like we did, were trying to rebuild their companies' entire networks. Everybody on both sides was rather subdued, and we didn't get any of the usual "consultants" or secretaries calling.

Yes, plenty of "bosses" will screw stuff up and then have the secretary try to work with tech support to fix it, all the while demanding unreasonable conditions or they'll be fired. Most common is being unreachable while never having given the secretary the password to the laptop, or the key to the server room, or whatever they need as a bare minimum to do anything with it. To make matters worse, most of those secretaries know nothing about computers. Being able to use the suite of office tools is great for their job, but if you need file properties, or to check access rights, or worse yet have to go into a partition table or the like, they are NOT skilled in that, nor is it needed for their job; that's what IT is for. After all, you don't expect the bus driver to be able to overhaul the engine of the bus, and giving them threats of "do it or else" does not help anything.

Ok, I've babbled a wall of text more than enough. Glad people have been reading, and that other techies can sympathize/empathize and the non-techies can get an inkling of the invisible work that goes on every day, where IT fixes things without others knowing and generally keeps things running smoothly, yet only gets noticed when something too big or too bizarre not to disrupt the rest of the company unexpectedly happens.

Oh, as to the various malwares that encrypt stuff so they can demand ransoms: yeah, they're getting better at being aholes, but there are some really good devs on our side. If it was ever intended to be decryptable, there's a key somewhere, and the devs can eventually figure out how to get their digital hands on it to save the data. But with each new one, it's a race to find it before somebody gives up and pays off the scum in the hope they'll actually hand over the decryption key. (I hear it's about 50/50 if you pay them off.) As to screwing the backups first, that's a big reason keeping older backups around is a good idea. Sure, those backup services or tapes aren't exactly cheap, nor is losing six months of data or whatever, but it's cheaper than losing all your data.

Sorry, slightly more gab. As to the public impression that the creeps making the malware are talented, it's completely false. Destroying things is easy; just ask a car mechanic how mechanically talented a vandal with a sledgehammer is. Most of the malware producers are just script kiddies, or slightly better. Some aren't even that, as there are "tools" to make your own malware where you select options and it compiles it for you. Most of the malware out there is also just a modified copy of another one. Unfortunately, there is a small handful of skilled programmers who write malware; some of those are simply researchers whose malware got loose one way or another, but some have malicious intent.
Those are the ones the antivirus/antimalware people hate and have difficulty with. We still win in the end; it's just a question of how much time and effort it takes before we do. I know one dev who can usually crack, in an hour or two, something that probably took the malware creator at least 6 months to write. When a skilled malware writer made something that changed the malware paradigm again, apparently around 4-6 years of work, our dev cracked it in a month and a half. Of course, there are hundreds of people working to defeat malware, but there are thousands upon thousands of people writing that trash to start with.

If you're curious: yes, the company I worked at fighting malware, among other tasks, had an "antivirus lab". We would test ways to prevent and defeat malware in that lab. One thing outsiders don't seem to understand is that it was an electronic black hole. If anything that could store data went in, no matter what it was or whose it was, it stayed in there forever. More than a few people were upset when they didn't pay attention and lost their thumb drives and other devices, even laptops. Though admittedly, the lab was happy to get more device "donations".

Ok. I'm shutting up now.
All this made me wonder, so I Googled it:
Google has an estimated storage capacity in the range of 10-15 exabytes, which is equivalent to 10 to 15 million terabytes.
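For scale, the conversion is just powers of ten: one (decimal) exabyte is a million terabytes, so 10-15 exabytes is 10-15 million terabytes, a few hundred thousand times the 34 TB restore discussed above. A quick sanity check on that estimate (the figures are only the Googled range, nothing official):

```python
# Quick sanity check on the exabyte-to-terabyte conversion (decimal units).
EXABYTE_IN_TB = 1_000_000          # 1 EB = 10^18 bytes = 10^6 TB

low, high = 10 * EXABYTE_IN_TB, 15 * EXABYTE_IN_TB
print(f"10-15 EB = {low:,} to {high:,} TB")            # 10,000,000 to 15,000,000 TB
print(f"vs the 34 TB restore: {low // 34:,}x larger")  # ~294,117x larger
```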
I was sad to see the sites go down.
I think they have more than one data center going, though. In fact, they have regional data centers and clouds. https://datacenters.google/
Trying to restore anything that size would be incredible, otherwise.