High Availability

Availability is a significant concern for nearly every PTC customer. How you define availability, combined with your disaster recovery time, can have a substantial impact on the architecture configuration. Your required mean time to recovery (MTTR) from a disaster plays a critical role in determining the capabilities your architecture must provide. Availability and required MTTR are tightly connected and have the most influence over the configuration and cost of any system, including Servigistics InService.

Keep in mind that the definition or requirements for availability can depend on your perspective. For example, consider the following question:

If the servers that are running the application are functioning fine but the network goes down, is the system still available?

For the administrators of the system and the unaffected sites, the system is still available. For the end user at that one site, it certainly is not. The question that you need to answer is as follows:

To what degree do all of the components in the infrastructure need to be redundant in order to claim that the system is highly available?

There is a wide variety of high availability configurations, reflecting a wide variety of customer requirements. In some of the simplest cases, customers leverage multiple components of their hardware infrastructure in a manner that allows them to reconfigure their environments on the remaining available hardware. Partition mobility solutions from vendors such as HP, Sun, IBM, and VMware allow applications to be shifted from one physical server to another. Behavior that is traditionally defined as an Active/Passive capability can now be achieved in an Active/Active capacity using some of these solutions. Other customers maintain a duplicate production environment in a geographically separated data center that receives updates synchronized with production. Numerous configurations between these two extremes can accommodate nearly every customer need or budget.

Mean time to recovery (MTTR) is the target time within which your administrators can recover a system from a significant disaster. Some customers make a clear distinction between the MTTR for a failure caused by a hardware issue and the MTTR for a failure caused by a natural disaster. The speed at which a large system must be restored from a backup can have a direct impact on the cost of the infrastructure. This speed is often limited by the practical ability of the existing infrastructure to support data recovery.

Requiring data recovery in less than eight hours begins to increase the cost and complexity of the infrastructure by an order of magnitude or more. When a massive failure results from a significant disaster, acceptable recovery times are generally longer, because other critical systems tend to have a higher priority than your Servigistics InService solution.
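To make the link between restore speed and recovery time concrete, the following is a minimal sketch that estimates how long a restore from backup would take and compares it with a target such as the eight-hour threshold mentioned above. The function names, the data size, and the throughput figures are illustrative assumptions, not values from the product; real restores also include steps (validation, application restart) that this arithmetic ignores.

```python
def restore_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Estimate hours needed to restore data_gb at a sustained throughput.

    Assumes 1 GB = 1000 MB and a constant transfer rate (a simplification).
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = data_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


def meets_target(data_gb: float, throughput_mb_per_s: float,
                 target_hours: float) -> bool:
    """Return True if the estimated restore time fits within the target."""
    return restore_hours(data_gb, throughput_mb_per_s) <= target_hours


if __name__ == "__main__":
    # Hypothetical 2 TB database restored over two different backup links.
    for rate in (50, 100):
        hours = restore_hours(2000, rate)
        print(f"{rate} MB/s: {hours:.1f} h, within 8 h: {meets_target(2000, rate, 8)}")
```

If the estimate exceeds the target, the options are generally to increase restore throughput, reduce the amount of data that must be restored, or relax the target, each of which carries the cost implications described above.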

As you determine your high availability requirements and deployment needs, consider the Servigistics InService Publisher as a separate application from the Servigistics InService Viewer in large implementations.

For the Servigistics InService Publisher:

• The high availability requirements involve the priority you apply to having the latest version of service content published and accessible. Can iterative content publishing wait one or two days? How quickly can a backup Publisher application be generated? Answers to questions like these will help decide whether high availability for the Publisher is needed, or if other (slower) redeployment procedures are sufficient.

• The Servigistics InService Publisher application cannot be configured as a clustered application. A high availability deployment of the Publisher uses an active/passive solution. The passive server must be updated at least after each publishing task.

For the Servigistics InService Viewer:

• It is generally expected that the Servigistics InService Viewer application will have greater high availability requirements than the Publisher application. The high availability requirements of the Viewer involve the priority you apply to maintaining user access to the published service content. How critical is this system to your business needs? What is the highest acceptable level of performance lag? How long can the published content be unavailable to its user base? Answers to questions like these will help decide your high availability deployment.

• The first step of high availability is to configure Servigistics InService Viewers to operate as a cluster within a single publishing site. While clustering is primarily recognized as a method to support a large user base, it is also a method to provide high availability. With one extra server added to a cluster of Viewers, if one Viewer application goes offline, user performance is not impacted because the extra Viewer server covers the loss of the offline server.

• The next step up in high availability is to configure the Viewer application to operate over two or more data centers. This is referred to as a multi-site publishing configuration. At least one Viewer is established in each data center. A cluster of Viewers in each data center is also feasible. If one data center goes offline, the Viewers in the operational data centers can manage the user traffic.

• Servigistics InService Viewers are configured to recognize multiple WindchillDS installations in an active/passive fashion. There should be at least two WindchillDS installations for a high availability configuration. For a single-site configuration, there should be two installations of WindchillDS. For a multi-site installation, there should be at least one WindchillDS installation in each site.

• Oracle high availability requirements for a single site system should support the requirements of the Servigistics InService Viewer application (Oracle high availability requirements for a multi-site system are rather detailed).

• Four of the schemas (E3C, Titan, Titan2, CMI) should be available per site. For example, for a two-site solution, there should be two of each schema, one per site, ideally located in the same data center as the Viewer(s) it supports.

• For the Windchill schema, there should be only one such schema for the entire system, shared by all Viewers across all sites. This schema may require Oracle high availability solutions that span multiple data centers, such as Oracle Data Guard or Real Application Clusters (RAC).
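The Viewer guidelines above can be collected into a simple planning check. The following is a minimal sketch, assuming a hypothetical Site description invented here for illustration; it is not a Servigistics InService configuration format or API. It encodes only the rules stated in the list: at least one Viewer per site, at least two WindchillDS installations overall (and at least one per site in a multi-site deployment), the four per-site schemas, and a single shared Windchill schema. A separate helper expresses the "one extra server" clustering rule, where the number of Viewers needed at peak load is an input you must supply.

```python
from dataclasses import dataclass, field

# Schemas that the guidelines say should exist once per site.
PER_SITE_SCHEMAS = frozenset({"E3C", "Titan", "Titan2", "CMI"})


@dataclass
class Site:
    """Hypothetical description of one data center (illustration only)."""
    name: str
    viewers: int = 0
    windchillds: int = 0
    schemas: set = field(default_factory=set)


def check_topology(sites, windchill_schema_count):
    """Return a list of guideline violations; an empty list means none found."""
    problems = []
    if windchill_schema_count != 1:
        problems.append("exactly one Windchill schema is expected for the whole system")
    if sum(site.windchillds for site in sites) < 2:
        problems.append("at least two WindchillDS installations are expected")
    for site in sites:
        if site.viewers < 1:
            problems.append(f"{site.name}: no Viewer")
        # The one-per-site WindchillDS rule applies to multi-site deployments.
        if len(sites) > 1 and site.windchillds < 1:
            problems.append(f"{site.name}: no WindchillDS installation")
        missing = sorted(PER_SITE_SCHEMAS - set(site.schemas))
        if missing:
            problems.append(f"{site.name}: missing schemas {', '.join(missing)}")
    return problems


def survives_one_viewer_loss(viewers, viewers_needed_at_peak):
    """True if the cluster still covers peak load with one Viewer offline."""
    return viewers - 1 >= viewers_needed_at_peak


if __name__ == "__main__":
    plan = [
        Site("dc-east", viewers=3, windchillds=1, schemas=set(PER_SITE_SCHEMAS)),
        Site("dc-west", viewers=1, windchillds=1, schemas={"E3C", "Titan"}),
    ]
    for problem in check_topology(plan, windchill_schema_count=1):
        print(problem)
```

A check like this only confirms that the counts match the guidelines; it says nothing about whether the Oracle or WindchillDS replication between sites is actually configured and working.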