What's insulating your business from disruption? Can you be sure that critical business and user applications are always available – even in the midst of technical changes?
- Resilience is the ability to provide and maintain an acceptable level of service in the face of any planned or unplanned interruption to normal operation.
- Resilience is proactively planning for the unexpected and being prepared to respond in a way that mitigates downtime.
What steps can you take today to improve your IT resiliency? Below is a list of tips and tricks, learned from our work with countless clients, that will help you take action and build a more resilient IT environment.
- Focus on likely scenarios first – Many companies have invested considerable time in developing a comprehensive Disaster Recovery (DR) plan. Should a catastrophe hit, they fail over to their DR target of choice. This is a critical part of a DR plan. But what about the day-to-day issues affecting your critical applications? What happens when your SAN goes down? What happens if the SAN can’t go down? What is the plan for a lost fiber connection? Understanding the effects of these likely scenarios and creating repeatable workflows to improve continuity and recoverability will reduce the impact on the business and take your organization to the next level of IT resiliency.
- Have an intentional focus on disaster avoidance – You might have a mission-critical business application that runs on infrastructure with single points of failure, such as a single SAN array. In many cases business owners and application SMEs are insulated or siloed from the underlying infrastructure and do not understand what happens when the infrastructure supporting their application fails. Make sure you are aware of points of failure above and beyond the immediate application infrastructure. Assess all dependencies of these applications and truly understand how resilient you really are.
- Have a clearly scoped DR strategy – Often IT utilities and development environments are not part of the recovery plan but have critical dependencies on production. Consider not only the IT assets that support your customer-facing digital business channels but also the workloads that support your business operations. Your development team can’t continue to innovate, for example, if there is no recovery plan for applications they rely on, such as code repositories or digital workspaces. The sales team can’t follow up on incoming leads if your single sign-on capability to Salesforce.com is no longer functioning.
- Leverage existing information – Your organization has great data. The catch is that all this great data is trapped in silos, like your CMDB, DCIM, vCenter or even that spreadsheet your DBA has been keeping for years. Start to collect and aggregate this data. Great tools are available that create centralized views of this type of information and start to map the dependencies and data gaps.
- Correlate SLAs with the capabilities of your continuity, recovery and data protection tooling – Always start with business requirements. If an application requires a zero-downtime SLA or a Recovery Time Objective (RTO) of 5 minutes, is the tooling used to protect that application’s data capable of meeting that objective? If there is a gap, don’t assume a capital expenditure is required to close it. Be sure to consider native tools that are part of the application and infrastructure stack to satisfy the business.
- Know your environment and dependencies – Dependencies, dependencies, dependencies! Make sure you have a clear view of your applications and their dependencies. Understand your application-to-application and application-to-services dependencies as well as your application-to-infrastructure dependencies. Often just recovering the primary application that failed is not enough. A clear depiction of upstream and downstream relationships is required to fully recover and to communicate impact to stakeholders. For example, let’s say you’ve recovered your primary application in the public cloud but were unaware that a SaaS provider you use for a web service is part of your payment flow and needed to be notified of an IP address change. So, you aren’t really recovered! A clear picture of the overall impact of an operational issue such as a failed server will save you critical time and reduce the impact.
- Test your plan AND test your tests – Document the sequence and estimated timing for each of the steps in a failover test. Run tabletop reviews or dress rehearsals of your failover tests. This is a non-negotiable step in the process. It is amazing how many trivial issues can be resolved just by stepping through the plan with everyone involved. Analyze your SLAs and run tests of these failover scenarios. Benchmark your estimates against the actual timings collected during your test. Conduct a “lessons learned” session and make updates based on what you have just learned.
- Make sure your plan is dynamic – Have a plan that can easily be updated and kept accurate. Establish policies and processes that will flag changes in the environment that could leave your resiliency plan exposed. Let’s say your development team released a new module of an application that requires additional servers and software to host. Is creating a resiliency plan for this new service part of your acceptance criteria? How does this impact your current resiliency plans? Wouldn’t it be great if resiliency workflows could be “smart” and account for this change?
- Create a roadmap and keep it current – Have an idea of where you want to go, and use this roadmap to look for opportunities to build resiliency into other IT initiatives such as tech refresh and cloud adoption. Review and update this roadmap regularly. At TDS, we work side by side with customers across a variety of industries who are taking on some form of IT transformation, and we frequently advise them that this is an ideal opportunity to improve overall IT resiliency.
- I like top 10 lists and we’re one short! Based on your experiences, can you suggest a 10th tip or trick for attaining IT resilience? Please post your comment below and share your insights.
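To make the dependency and SLA tips above a little more concrete, here is a minimal Python sketch, not a definitive implementation. It assumes you have already aggregated your dependency data into a simple map; every component name, RTO value and recovery time below is hypothetical and invented purely for illustration. It shows two things: which applications are affected when a piece of infrastructure fails, and which applications have an RTO that their current tooling cannot meet.

```python
from collections import deque

# Hypothetical dependency map: each key depends on the components in its list.
DEPENDS_ON = {
    "web-store": ["payment-api", "sso"],
    "payment-api": ["orders-db"],
    "orders-db": ["san-array-1"],
    "sso": ["san-array-1"],
    "code-repo": ["san-array-1"],
}


def impacted_by(failed, depends_on):
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a component to its dependents.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk upstream from the failed component.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


def rto_gaps(rto_minutes, tool_recovery_minutes):
    """Return apps whose tooling recovery time exceeds the required RTO.

    Apps with no known tooling recovery time are treated as gaps.
    """
    return sorted(
        app
        for app, rto in rto_minutes.items()
        if tool_recovery_minutes.get(app, float("inf")) > rto
    )


if __name__ == "__main__":
    # Everything that is affected if the single SAN array goes down.
    print(sorted(impacted_by("san-array-1", DEPENDS_ON)))

    # Hypothetical RTOs (business requirement) vs. what the tooling delivers.
    required = {"web-store": 5, "code-repo": 240}
    achievable = {"web-store": 60, "code-repo": 120}
    print(rto_gaps(required, achievable))
```

In this made-up environment, a single SAN array sits underneath every application, which is exactly the kind of single point of failure worth finding before an outage does it for you. In practice the map would come from sources like your CMDB or vCenter rather than a hand-written dictionary.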
TDS has been orchestrating IT change since our inception in 2002. It is from this dedicated focus that TDS created TransitionManager, a SaaS-based platform that aggregates data sources and creates a single-source view of IT environments with an application-centric focus. TransitionManager leverages an automated analysis and workflow feature to dynamically model both business and technology-based dependencies and generate workflows in real time to coordinate people and processes for an optimized recovery.
The need to be resilient is coming at us at a fast and furious pace. Businesses need to be more agile and expect the IT infrastructure to be robust, adept and ready to take on the changes necessary to keep pace. This shift toward resiliency assurance requires new approaches and tools built to manage end-to-end application availability in this hybrid IT environment.