Summary

This article discusses requirements and solutions for backing up Oracle EPM applications. It highlights a new procedure available in Oracle EPM Cloud to help meet requirements for keeping application backups viable over a long period of time.
Introduction

Backups are one of those things we don't think much about until we need them, and then they suddenly become very important. They're something we know we have to do: routines are set up, and we hope they work when they're needed. If we're really diligent, we test the recovery process from time to time to ensure it works as expected.
One thing I've learned about backups is that they are not all the same. We take backups for different reasons and with different expectations. Saying "I have a backup" sounds good when checking it off a list, and it makes our functional application owners feel good. However, depending on the circumstances when a request comes in to restore a backup, we may find we are not as prepared as we thought we were.
Different flavors of backups

What I've come to realize is that we take backups for different reasons and to satisfy different requirements.
- We take backups for operational purposes. This is the traditional backup we think about when we want to recover an application from a failure or mistake. Common examples are an application that becomes corrupted or a user accidentally deleting something they weren't supposed to. In these scenarios a request is made to restore the application to its last backup in order to recover it and allow users to get back to work.
- Sometimes we take backups as a safety precaution when doing work in the application or on the underlying platform. Common examples are applying patches, break/fix work, or new development enhancements being implemented in production. The backups become part of the rollback plan in case something doesn't go as planned.
- Another reason for backups is to satisfy data retention requirements. Sometimes we take snapshots for point-in-time recovery, to bring an application back online "as it was" at the time the backup was taken. Depending on the purpose, the timeframe to retain these snapshots can extend over a considerable period, often years.
It's this last scenario that introduces challenges you wouldn't necessarily have to think about with the first two. Backups for operational integrity, and backups taken as a safety net during patching or development changes, will be used in the near term; if they are not needed, they are typically discarded. We may even refer to these kinds of backups as stale after a while, often taking up a lot of space somewhere but not of much use once the particular activity is completed. We wouldn't typically have much use for a backup of a Planning app two weeks after it was taken, because users have been entering new data and the backup is too old to be meaningful, except maybe for recovering an artifact like a business rule. The third scenario, data retention, definitely requires a lot more thought.
Real world example

I learned a good lesson about application snapshot retention soon after I came to work at GE. I had been onboard for a few weeks when I was copied on an email thread about a severity-one Essbase ticket. Auditors were onsite working on a project and had requested some information from prior years; in fact, they were looking back about five years. A request came in to restore a backup of the application from five years earlier! At first I thought this was odd (who keeps backups going back that far?), but I learned the application SLA called for quarterly snapshots to be retained and archived after each close. The ops team had in fact been taking backups of the app, but in this case they were not able to restore it.
I got involved with the recovery effort and discovered they were not using a very good method for taking the snapshots. The backups were copies of the server's app folder, made while the app was stopped and moved to an archive directory. The thought process was that if the app ever needed to be recovered, they would create a new app with the same name, copy the archived app directory onto the server, and access the application. If you've worked with Essbase you know this isn't a best practice, but in the real world it could work; at least they thought it would. What they had not accounted for were version upgrades to Essbase that made those archive copies incompatible with the current version of the software running on the server; remember, this was a copy from five years prior.
A number of days were spent on this task. Ultimately we found an old copy of the install files for the version the snapshots were taken in, spun up that version on a dev server, recovered the apps, and then upgraded them to the newer version so the auditors could access what they needed. Needless to say, the functional executive was not happy with the amount of time it took to bring the application online.
Architecting a better solution

As we moved forward as a team and began our endeavor to implement a shared service within GE running on Exalytics servers, it was up to me to address this requirement and come up with a solid solution to ensure we didn't run into a similar issue. I also found it wasn't just Essbase apps; there were some Planning apps too, and as we were going to be moving HFM to the new Exa platform, we would have the same snapshot requirements for HFM apps. That one was going to be particularly tricky, since all HFM apps on a server reside in the same relational schema.
I discussed this with Oracle, and the initial thought was to take LCM backups of the applications and store them offline. We could then import them into a non-prod environment when needed. This sounded good at first, but I had learned a valuable lesson not too long before: would those LCM snapshots be compatible with the future version of the software I would be running when I needed to restore them? I proposed this hypothetical scenario to Oracle and asked, "How long is an LCM snapshot supported?" The answer was "within one release of the current version." That was going to be a problem. It was completely unrealistic to think my LCM snapshots would be viable two, three, or five years down the road.
I spent a lot of time working with the functional owners, the infrastructure team, and Oracle PMs. I proposed a number of different approaches to meet this requirement, and ultimately we went with a process where we keep the archived Essbase and Planning applications live on non-prod servers; this ensures the archive copies are upgraded when the system is patched and keeps them viable. Fortunately, on Exalytics we had a tremendous amount of space to store all these copies. If the apps were not started they did little harm, and even if one was started by accident, we had enough processors and RAM to handle it. HFM was more painful, however. Unlike Essbase and Planning, where each app is independent, in HFM multiple applications all reside in a single schema. Over time this schema would become extremely bloated and could suffer performance degradation. To meet the requirement we created a standalone archive zone to store HFM snapshots. All the apps are "live" in the archive zone, and it is patched and maintained the same as the other zones used by the business. Overall this works, but it is an expensive and time-consuming solution to the problem.
Moving to the cloud

I am now working on our roadmap to migrate our EPM applications to the cloud as our Exa platform reaches end of life, and, same as before, I need to address this application archiving requirement. I knew early on it was not going to be practical to have multiple pods to support all my snapshots; we were going to have to come up with a way to keep our snapshots offline while still keeping them up to date with the latest EPM Cloud version to ensure they remain viable. I discussed this requirement with a few PMs at Oracle along with Matt Bradley, the senior executive at Oracle responsible for EPM Cloud.
I told them how the LCM process in EPM Cloud was superior to on-premises and how much I loved how quickly I could recover an application. I felt confident that as long as the LCM was compatible with future versions, it would be a great way to keep snapshots in an archive directory and spin them up as needed. Oracle confirmed they could still only guarantee that an LCM export would be officially supported within one version. So what could we do? I hypothesized that if there were a way to load the LCM up into the cloud periodically and apply the latest patch, we could then export it back out and have viable backups in perpetuity. Assuming we could automate this process, I would be able to run a job during off-hours that uses one of my environments to keep my snapshots up to date.
The solution

To solve this requirement I was introduced to Vinay Gupta from the cloud ops team. Vinay developed two scripts, one for Windows and one for Linux, that use EPM Automate to cycle through a directory of LCM snapshots, load each one into EPM Cloud, apply the latest patch, export the snapshot back down to our directory, and store it in a new folder with the same name. I tested the script provided by Vinay and it worked very well, doing exactly what was needed. A simplified sketch of the flow is shown below.
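For illustration, here is a minimal sketch of what such a refresh loop could look like on Linux, assuming the archived snapshots are stored as .zip LCM exports in a local folder. This is not Vinay's script: the paths, service account, password file, and URL are placeholders, and the official sample in the Oracle documentation may include additional steps (such as recreating the environment before each import).

```bash
#!/bin/bash
# Hypothetical sketch: re-import archived LCM snapshots into an EPM Cloud
# environment that is already on the latest monthly update, then export them
# again so the archived copies stay compatible with the current version.
# All paths, the user, the password file, and the URL below are placeholders.

ARCHIVE_DIR="/epm/archive/snapshots"     # existing <AppName>.zip LCM exports
REFRESH_DIR="/epm/archive/refreshed"     # refreshed exports land here
EPM_URL="https://example-test.epm.oraclecloud.com"
EPM_USER="svc_epm_archive"
EPM_PASSFILE="/epm/secure/password.epw"  # encrypted password file for EPM Automate

mkdir -p "$REFRESH_DIR"
epmautomate login "$EPM_USER" "$EPM_PASSFILE" "$EPM_URL" || exit 1

for zipfile in "$ARCHIVE_DIR"/*.zip; do
    name=$(basename "$zipfile" .zip)
    echo "Refreshing snapshot: $name"

    epmautomate uploadfile "$zipfile"     # push the old snapshot to the environment
    epmautomate importsnapshot "$name"    # importing brings it onto the current version
    epmautomate exportsnapshot "$name"    # export it again at the current version
    epmautomate downloadfile "$name"      # pulls the refreshed snapshot down as "$name.zip"

    mkdir -p "$REFRESH_DIR/$name"         # keep each refreshed copy in its own folder
    mv "$name.zip" "$REFRESH_DIR/$name/"
done

epmautomate logout
```

Run on a schedule during off-hours (for example, after the test pod picks up its monthly update), a loop like this keeps every archived snapshot within one release of the environment it would eventually be restored into.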
As a result of this process, the value proposition for moving to the cloud increases dramatically. Since we will be able to take all of our snapshots offline, and we will not need to maintain a separate archive environment, moving to the cloud will actually save us quite a bit of money. It will reduce our EPM footprint and provide a logical, stable approach to managing application snapshots over a long period of time.
This is a big win for us, and I am once again pleased with my collaboration with the Oracle product team. The process and the sample scripts are documented in the online EPM Automate documentation under the sample use cases. I am providing a copy here as well for reference.
Additional consideration

One side note I still have to work out is how to ensure my current snapshots remain viable when we switch products as part of the move to the cloud. How do I restore an on-premises HFM application if I no longer have HFM because I moved to FCCS? This is something I will have to ponder further and work through with my functional counterparts. It may be necessary to keep a VM running HFM just for the purpose of restoration. That sounds costly, and now I have to make sure I keep that VM up to date :/