The poet Shane Koyczan may not have been speaking about tech resilience when he said, “If your heart is broken, make art with the pieces.” But resilience is resilience, and Mr. Koyczan knows all about broken hearts. When we posted our “How Did We Survive It All” story, we received a ton of feedback from colleagues, friends, and customers, many of whom were impressed by the resilience of our Chernihiv, Sumy, and Kyiv teams. “How did you make it all work under bombs, rockets, and fire?” some asked. So, to answer that question, we present you with Part 2 of this story, to add our lessons of tech resilience to those we shared about human resilience.
“There’s Nothing Special”
“You know, it just worked. It was supposed to work this way in such circumstances. There’s nothing special. We are being paid for that.”
Most people at PortaOne were unaware of the fancy term “tech resilience” before we started asking them for their stories. But that didn’t stop our team from building a resilient tech cluster that could survive the 2021 fire at our European data center. Or from delivering uninterrupted service to our customers when Russian airplanes, assault helicopters, missile launchers, tanks, and guns poured projectiles over all three cities where our Ukrainian technology teams are based.
So instead of asking our PortaOne DevOps and information security teams to explain their specific tech resilience strategy, we asked them a simpler question. And that was: “OK. Just tell us how you made it work.”
Back in the early 2000s, Yurii Zotsenko was teaching informatics (computer science) in one of Nizhyn’s public schools. “The salary of a Ukrainian schoolteacher was modest back then. [Writer’s note: it still is.] So, in parallel, I was moonlighting as a systems administrator and a scientific associate at Mykola Gogol Nizhyn State University, my alma mater,” Yurii says during a Zoom interview.
“We traveled 200 kilometers from Nizhyn to the Kyiv Kardachi electronics flea market to buy equipment and assemble it back home,” Yurii recalls. Ukrainian tech resilience started there, at Kardachi. “You could buy almost anything, and assemble it into anything as well. We then traveled to the Petrovka flea market [now Pochaina, named after a subway station] to buy software [first on diskettes, then on CDs].”
“One day, a friend invited me to do a part-time gig as a commercial sysadmin,” Yurii tells us. “Over one weekend, I earned more than I could in a month as a schoolteacher and university associate. Of course, I wanted to earn more so I could have a family. That was the point of no return.”
The “King of Sysadmins” at PortaOne
Yurii then moved to Chernihiv, the provincial capital of Chernihiv oblast (to which Nizhyn belongs). After working at a few different places, he ended up at one of the city’s most desirable tech companies: PortaOne. “I worked in tech support at PortaOne for eight years, moving from junior engineer to team lead,” he says. In 2020, management offered Yurii a position that reminded him of his Nizhyn roots: head of the IT department (a.k.a. “the king of sysadmins”).
How COVID and Work-from-Home Improved Tech Resilience (and VPNs) at PortaOne
“COVID improved how we were treating our VPN,” explains Yurii. “The lockdown got our everyday operations infrastructure ready for what was yet to come.”
Before the lockdown, PortaOne had a VPN with limited features. Up to that point, people had preferred to work from the office, saving their time at home for their families. So, the VPN back then was mostly just for things like temporary illness, business trips, or child-care leave. It provided basic functionality: email, digital contract signatures, and access to the corporate wiki (Confluence). But it wasn’t yet robust enough for long-term remote work.
Another issue was that, because its use was originally limited, the existing VPN at PortaOne could only accommodate about a hundred simultaneous connections. That definitely wasn’t enough for the entire company to work remotely.
“For several weeks after the lockdown, we extended the VPN capacity and onboarded all business-critical systems into it: source code repository, the continuous integration tool (Gerrit), and a variety of test systems used by QA,” says Yurii. That was the early foundation of our PortaOne tech resilience.
Bring Your Device (Home)
Most PortaOne customer support team members needed two large screens to do their jobs well. One is for instant chat support or communicating face-to-face with customers, and the other is for checking documentation and wikis, running build emulators, and communicating with developers and other support team members.
So, during the lockdown, PortaOne allowed people from the support teams to take their large-screen desktops home. As you’ll see, that decision played an important role when the Russian invasion of our cities began.
A Tech Resilience Prelude: Fire at the OVHcloud Data Center
“We started using cloud infrastructure for some projects in the early 2010s,” says Andriy Zhylenko, CEO of PortaOne. “However, after the Russians invaded Crimea in 2014, it became evident that we should migrate all business-critical servers to the cloud. Plus, customers started asking ’questions’ on how crisis-ready we were.”
As a result, by 2015, all of our business-critical servers were in the cloud. At first, PortaOne selected OVHcloud as our primary cloud provider.
Building Tech Resilience at PortaOne Before the Fire
“I’m afraid of business trips to Sumy,” jokes Yurii. “We had two major incidents with OVH before the big fire. Both took place when I was traveling by car to Sumy.”
Chernihiv and Sumy, 320 kilometers away from each other, are the capital cities of two neighboring Ukrainian oblasts. While the Chernihiv team at PortaOne is split 50/50 between developers and support, Sumy mainly houses 24/7 customer support teams who work in shifts. (They are joined by teams in Chernihiv and Barcelona.) Yurii travels to Sumy from Chernihiv now and then to help with the IT infrastructure there. “We don’t discuss our incidents publicly, sorry,” Yurii explains when we ask for details.
Implementing “Hot” Backups
Learning from both incidents, PortaOne introduced “hot” backups. Before you ask, the name has nothing to do with the content of those backups, although it does depend on your definition of hot. The fundamental difference between a “hot” and a “cold” backup is system availability. You can take a “hot” backup while your system is up and running in “production” mode, whereas a “cold” backup requires taking the system offline first.
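To make the hot/cold distinction concrete, here is a minimal Python sketch. It is an illustration only and does not describe PortaOne's actual backup tooling: it assumes a small SQLite database as a stand-in for a production system, and the `tickets` table, the file name, and the helper names are all hypothetical. It relies on the standard library's `sqlite3.Connection.backup()`, which copies a database while the source connection stays open.

```python
import os
import sqlite3
import tempfile


def hot_backup(live: sqlite3.Connection, target_path: str) -> None:
    """Copy a live SQLite database to target_path while it stays open."""
    target = sqlite3.connect(target_path)
    try:
        # Connection.backup() uses SQLite's online backup API, so the
        # source connection does not have to be closed first.
        live.backup(target)
    finally:
        target.close()


def count_rows(conn: sqlite3.Connection) -> int:
    """Count rows in the (hypothetical) tickets table."""
    return conn.execute("SELECT COUNT(*) FROM tickets").fetchone()[0]


def demo() -> tuple[int, int]:
    """Return (rows in the backup, rows in the live database)."""
    live = sqlite3.connect(":memory:")  # stands in for a production database
    live.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, subject TEXT)")
    live.execute("INSERT INTO tickets (subject) VALUES ('VPN capacity')")
    live.commit()

    with tempfile.TemporaryDirectory() as tmp:
        backup_path = os.path.join(tmp, "tickets-backup.db")
        hot_backup(live, backup_path)

        # The live system keeps accepting writes after the backup is taken.
        live.execute("INSERT INTO tickets (subject) VALUES ('Satellite link')")
        live.commit()

        backup = sqlite3.connect(backup_path)
        try:
            result = (count_rows(backup), count_rows(live))
        finally:
            backup.close()
    live.close()
    return result


if __name__ == "__main__":
    print(demo())
```

The backup holds the one row that existed when it was taken, while the live database has moved on to two. A cold backup of the same data would have required closing the live connection before copying the file.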
Splitting Our Eggs Between the Baskets
“We started using various cloud providers for different elements of our corporate IT infrastructure even before I took charge of it,” says Yurii. Still, PortaOne realized that having a patchwork of different systems on different cloud providers would aggravate availability issues instead of improving tech resilience. So, we started using AWS in parallel with OVH to create identical cloud ecosystems. And we’re very glad we did.
The Tech Resilience Course of Action
On March 9, 2021, a fire took down the computing cluster at the SBG1 building of OVHcloud in Strasbourg. Yes, that’s where some of our PortaOne infrastructure resided – including a few of those hot backups. “It took OVH several weeks to restore its services after the fire,” recalls Yurii. The temperature in Chernihiv is usually below 0 Celsius (32 Fahrenheit) in early March, but a feverish work pace kept our team warm. “We had a boiling early spring with my team,” Yurii jokes.
Thanks to the OVH fire, all of our hot backups now reside with AWS in locations across the globe. The backups are synced between the two cloud providers. The PortaOne DevOps team restored corporate email, website, and code repository in minutes. Unfortunately, OVH never compensated us for the downtime. But they did provide a replacement for free for six months.
Welcome, Oracle Cloud!
At PortaOne, our corporate friendship with Oracle has spanned several decades. It started with PortaBilling Oracularius, and then with providing our clients with the option to run their PortaBilling or PortaSwitch (even when MySQL based) in the Oracle cloud datacenter (OCI). But after the OVH fire, we started asking ourselves a difficult question. What will happen to our business if both the OVH and AWS datacenters catch fire next time?
That’s how Oracle became our third cloud option. While OVH remains our primary cloud infrastructure provider, we mirrored all of its contents to both AWS and OCI. A load balancer makes all three providers available to end users within our corporate network.
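The article does not say which policy the load balancer follows, so the sketch below assumes the simplest one: try the providers in priority order and use the first healthy one. The provider labels and the `is_healthy` callback are hypothetical. In a real setup, the callback would wrap an actual health probe.

```python
from typing import Callable, Optional, Sequence

# Priority order assumed from the text: OVH is primary, AWS and OCI are mirrors.
PROVIDERS = ("ovh", "aws", "oci")


def pick_provider(
    providers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy provider in priority order, or None."""
    for name in providers:
        if is_healthy(name):
            return name
    return None


if __name__ == "__main__":
    # Simulate an outage at the primary provider.
    print(pick_provider(PROVIDERS, lambda name: name != "ovh"))
```

Passing the health check in as a function keeps the selection logic testable without any network access. With the primary marked as down, the sketch falls through to the next mirror.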
What Should Go to the Cloud and What Should Stay (Bare-Metal)?
Tech resilience is always a combination of reliability and backups with usability and financial efficiency. You can create the most bulletproof system in the world, and in the next moment customers will start bypassing its features for convenience. (Hello, Enigma and its rotors.) That’s why our cloud infrastructure team realized that not everything should go to the cloud.
The “Release Time Machine”
A great example of this is our release time machine. As much as we might like them to, not all of our customers can or do update their PortaSwitch to the most recent MR as soon as it is available. That means our employees sometimes need to replicate an issue on the exact build of whatever version of PortaSwitch the customer uses. And sometimes that means supporting a system we released 5 or 10 years ago.
Running several hundred build variants of PortaSwitch in the cloud simultaneously did not seem appealing either financially or usability-wise. So, our developers and QA teams in Chernihiv gathered some retired physical servers that were once top-notch for running performance benchmark tests but had become “old, but not obsolete.” Then, they created a tool to launch on-demand whatever virtual machine they required on those servers.
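The internals of that tool are not described here, so the following Python sketch only illustrates the general idea of on-demand placement under stated assumptions: each retired server has a fixed amount of RAM, each build's virtual machine needs a known amount, and a build that is already running is reused rather than started twice. The `Server` class, the `launch` function, and the build labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Server:
    """A retired physical server that hosts on-demand virtual machines."""
    name: str
    ram_gb: int
    running: Dict[str, int] = field(default_factory=dict)  # build -> RAM in GB

    @property
    def free_gb(self) -> int:
        return self.ram_gb - sum(self.running.values())


def launch(build: str, need_gb: int, servers: List[Server]) -> Optional[Server]:
    """Place a VM for `build`; return its server, or None if nothing fits."""
    # Reuse a VM that is already running this exact build.
    for server in servers:
        if build in server.running:
            return server

    # Otherwise pick the server with the most free RAM that can fit the VM.
    candidates = [s for s in servers if s.free_gb >= need_gb]
    if not candidates:
        return None
    best = max(candidates, key=lambda s: s.free_gb)
    best.running[build] = need_gb
    return best


if __name__ == "__main__":
    pool = [Server("old-bench-1", 16), Server("old-bench-2", 32)]
    placed = launch("build-from-2014", 8, pool)
    print(placed.name if placed else "no capacity")
```

Starting machines only when someone asks for a specific build is what lets a small pool of old hardware cover hundreds of build variants, because only a few of them are ever needed at the same time.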
The Role of RT in Our Tech Resilience
PortaOne started using RT (that means “Request Tracker”, not “Russia Today”) sometime in the mid-2000s, a.k.a. the glorious pre-Notion, pre-ClickUp, and pre-Atlassian Era. Initially, RT tracked issues and requests that customers sent to the support and sales teams and issues and requests those teams sent to our developers. Some customers liked how this system worked and asked to have it integrated with their own PortaSwitch, to be run locally and used to support their customers.
Eventually, after many integrations, RT grew to become the stickiest piece of software at PortaOne. Since then, we have tried out every possible CRM, task, time, and issue tracker (YouTrack, HubSpot, you name it), and many teams have moved to those here and there. But for basic support operations, no tool can beat RT. After a while, management gave up on looking for something different, and we implemented cloud instances of RT “just in case.” That decision saved the day (pun intended) on what we now call “Jackpot Day.”
A Deeper Look at “Jackpot Day”
With our own trademark PortaOne black humor (okay, it is not officially trademarked), we call 24.02.2022 “Jackpot Day.” Why? Because all three of our Ukrainian teams were hit simultaneously by the Russian invasion as their host cities – Chernihiv, Sumy, and Kyiv – came under Russian siege. What follows is their story of tech resilience during the first weeks of that crisis.
Why Stay in the Face of US and European Intelligence Warnings?
“It is way easier for expats to conceive of evacuating than it is for people who grew up on their land,” explains Andriy Zhylenko. Ukrainians are stubborn by nature. Many have older relatives who refused to evacuate even after the enemy began its shelling. Some had just bought their apartment or their house, or had just invested in renovations.
PortaOne created two hubs (in Ljubljana and Barcelona) for anyone on our team who decided or will decide to work from abroad while the war continues in Ukraine. But even if they evacuated during the initial siege, many of our employees returned to their native cities (Chernihiv and Sumy in particular) after the Russians retreated in early April. “While it is risky, and we support people’s decision to work from the safer regions of Ukraine or the neighboring European countries, we believe it is their right to decide where and how they want to live,” reasons Andriy Zhylenko.
When Satellite Internet Went Down
“Back in those days when we hosted business-critical servers in Chernihiv, we acquired a satellite dish, a modem, and a service subscription,” says Yurii Zotsenko. “We intended it to be a backup in case our optical cables were severed for any reason.” After our cloud transformation, the PortaOne management team decided to stay on that subscription, “just in case.”
Satellite Internet at our Ukraine offices went down on the night of February 23, 2022. One moment all was perfect, and the next our modems were displaying a “searching for satellites” message. As a later international investigation revealed, Russian government-sponsored hackers had downed satellite Internet in Ukraine on the day of their large-scale invasion.
Building a Resilient Satellite Connection
At the time of the Internet shutdown, our team was unaware that Russians had hacked all satellite modems in Ukraine. On February 24, Oleksandr Kapitanenko, president of PortaOne, called the main office of our satellite provider in Italy. They replied that they were under a cyberattack and apologized for being unable to help us.
The only exception to the large-scale hack was Starlink, which began operating in Ukraine on February 27, 2022, a few days after the cyberattack.
“I would like to thank Datagroup and NeoCom, our main connectivity providers, for supplying us with a backup Internet connection,” remarks Yurii, declining to disclose further details. After this event, PortaOne ordered a Starlink terminal to have a backup satellite system.
Russian Shells Hit Optical Internet Cables at the Chernihiv Office
While Datagroup and NeoCom provided a solution under the Russian siege, it did not arrive instantly. Thankfully, the Chernihiv team could rely on two optical cables, plus the LTE Internet of their mobile devices. But in mid-March, Russian shells hit the first optical cable, then the second. After that, people working in the “PortaOne cellar” could rely only on their own LTE devices.
At that point, PortaOne distributed our customer support traffic between the Barcelona office and any of our employees who had managed to move to a (relatively) safe place. Barcelona is where many of our C-level execs are located, so it gave them the opportunity to re-live the good old days of early PortaOne and dive back into support operations. Some of our developers, though, risked staying and working in the PortaOne cellar. That meant we still needed a fast Internet solution there – both to support their work, and simply to give them access to news and keep an eye on the rapidly changing “escape window.” (Read @Madball’s story if you want to know more about that.)
Replacing Endpoint Devices for the Sake of Tech Resilience
Interestingly, endpoint devices became a significant vulnerability on Jackpot Day. Chernihiv and Sumy came under bombardment almost immediately. Some of our employees left for the safer villages and towns in or near Western Ukraine, while some decided to stay in local bomb shelters. “We instructed everyone to prioritize their safety over corporate property,” recalls Andriy Zhylenko. That meant our team members had to leave their computers behind (remember those lockdown big-screens?). They needed to leave their homes with only some ID, some cash, some food, and their loved ones.
So, after our people got to safety, they realized they had no devices to work on. And they also discovered that working can help to handle stress, anxiety, and feelings of helplessness. So while our PortaOne teams in the invaded areas of Ukraine were not required to work, many chose to anyway. But how were they to get the equipment they needed? Our device vendor is located in Chernihiv, and during the first month of the war, its warehouses were blocked by the Russian siege.
Remembering the Good Old Kardachi Days
There was another problem, too. We not only needed to find new equipment for our teams, but we also had to get it to them. The enemy was targeting (and still targets) the logistics centers of all the major Ukrainian delivery companies. And during the first weeks of the war, many of these companies had sent their employees on paid vacation, not wanting to risk their lives.
That was “bingo time” for Yurii’s firsthand experience from the wild early 2000s. (Read: the time before assembling computers at home was hipster cool.) He contacted old friends in various regions of Ukraine, and that network provided our team with laptops, power banks, and LTE modems. Large-screen monitors, though, were still hard to come by. Still, given the situation, our team was okay with managing the discomfort of switching between windows on a single small laptop monitor.
No More Release Time Machine
Remember the “time machine” we decided to leave outside the cloud on a server somewhere in Chernihiv? Well, it became unavailable, partly due to the Internet outage and partly because we had donated our main power generator to the local water-supply facility so they could keep providing water to Chernihiv residents. As it turns out, this was an excellent opportunity to test our idea of “restoring to the cloud lightning fast.”
Unfortunately, “lightning fast” wasn’t happening. We hadn’t tested the “time machine” deployment sufficiently when we had the opportunity to do so. So, one of our Chernihiv team members decided (on his own) to walk to the server room on a day when there wasn’t as much artillery fire. That action may have been intended to save the day, but it wasn’t the best decision. No time machine is worth the risk of having a young dude (even a volunteer) blasted into pieces by a Russian rocket.
“Hibernate Mode” and Our Own Robert Neville
“I joined PortaOne in autumn 2019 and worked the ‘normal’ office life for several months. Then the lockdown started,” jokes @Denusok, our IT specialist for the Chernihiv team. Now it’s three years later, and, during that time, @Denusok mostly met his colleagues online (except those who got their new laptops from him).
“Things had started getting back to normal in autumn 2021… Then the war came,” he continues. “I love dogs. There are always packs of stray dogs around the [once-glorious and mighty Soviet] factory. Before the war, our accountant had been feeding them. She took them leftovers from the canteen, which became open to the public after the factory collapsed in the early 2000s. When the Russians started the siege of Chernihiv, that accountant also left. So, most of the time, it’s just me and one other IT specialist taking shifts, and then our driver. We are alone in this huge industrial landscape. It reminds me of that post-apocalyptic movie with Will Smith. Whenever possible, I feed those stray dogs. It’s not their fault that we humans have made their life this way.”
“Normal” Life in Abnormal Circumstances
@Denusok and his colleague @Aus are on Yurii Zotsenko’s technology infrastructure team. They have been instrumental in preserving the equipment and non-business-critical data at our Chernihiv office since the first days of the war. People are the No. 1 priority. Wars start, wars end, and the survivors need to go on with their lives and earn a living. Making preparations before the war and reacting swiftly when the hostilities erupted enabled us to put local offices and the equipment in the besieged cities in a “hibernate mode” — to save them from shells, rockets, vandals, and prying eyes.
“War,” “Peace,” and “Bayraktar”
“Chanél is our family’s favorite. She is a Cane Corso — a rare dog breed for Chernihiv. On the second day of the siege, most of my family decided to evacuate to a small town in the North. Formally, it was under Russian occupation for some time. But the Russian soldiers had bypassed it. The real fighting occurred to the South around Chernihiv, the oblast capital,” explains @Denusok.
Chanél was pregnant right before the war. She gave birth to seven healthy pups a day before the escalation reached Chernihiv. “We faced the Russian column during the evacuation from Chernihiv. It was going on an assault in the opposite direction toward Chernihiv. So we drove away from the road. Chanél and all her seven pups were sitting next to me in the car.” In Tolstoyan form, @Denusok and his family decided to name three of the pups War, Peace, and Bayraktar. The latter is the brand name of an assault drone made in Turkey by Baykar, a company that the legendary Turkish engineer Özdemir Bayraktar established in 1984. Recently, Baykar decided to open a factory in Ukraine.
Introducing the Once-Per-Quarter “Jackpot Day”
In its recent tech resilience publication, McKinsey advises companies to “make sure problem management has structure and teeth.” Our PortaOne IT department demonstrated its teeth in the wake of Europe’s first armed conflict of the XXI century. And now we have decided to sharpen those teeth even further.
“No matter how well prepared you are in theory, only handling real-life situations and incidents helps organizations build true technology resilience,” summarizes Andriy Zhylenko. “Sadly, our employees need no explanation for why this resilience is key to their survival and prosperity. They’ve seen it with their own eyes.”
So, to help maintain that personal and tech resilience, our management team has decided to introduce regular “Jackpot Day” training.
“This Is Not a Drill”
The idea is to experiment with shutting down various critical elements of our infrastructure for a single workday. (With the needed preparations to protect our operations, of course.) That might be Internet connectivity, electricity supply, corporate email, the RT, or anything else we think we rely on.
“The important part of this exercise is for people to move past the ‘this is just a drill’ attitude,” explains Mr. Zhylenko. The IT team and management will collect employee feedback, then meet to conduct a post-incident analysis. Then, they will plan whatever improvements to our corporate infrastructure (and beyond) are needed.
We think this will go very well. After all that has happened and continues to happen, getting a “this is for real” feeling won’t be hard.
What Do Tech Resilience, Rubik’s Cubes, and Potatoes Have in Common?
“Everybody in Ukraine now wants to move into the tech business. But, after more than two decades of being in it, I want to move in the opposite direction,” says Yurii Zotsenko. Yurii loves working on microchips with a soldering iron, collecting puzzles, scroll sawing, and everything related to car repairs. “Maybe I will return to teaching while continuing to earn for my family with DevOps and running the corporate IT infrastructure,” he says. “But that will happen only after we win the war.”
Yurii’s family and background have given him some good perspective on where the roots of tech resilience at PortaOne might lie. “You see, my part of Ukraine is all about growing potatoes. My parents did it, and their parents did it. Everybody does it in the village I come from. No matter what happens, we grow potatoes to have food in winter,” he says.
“When my mom calls, she doesn’t care about my science degrees or what kind of boss I am in which company. She just asks, ‘Hey, mama’s boss, when are you coming to take care of the potatoes?’ Whatever happens, we do our job. Even when the bombs are falling, we grow our potatoes.” And with that, Yurii shares his meditative advice for building tech resilience.