Expected Behaviors when Failures Occur
This section describes how a ThingWorx HA configuration responds to the failure of one or more of its components.
ThingWorx Server Failures
ThingWorx Leader Node Fails
Enacted HA Procedure
1. ZooKeeper receives no response from the leader.
2. ZooKeeper elects a new leader from the pool of standby ThingWorx servers.
3. ZooKeeper notifies the standby node to become the leader.
4. The new leader fully connects to the database and initializes the ThingWorx model.
5. The new leader sends confirmation to the load balancer to have all ThingWorx requests routed to it.
6. The load balancer routes all ThingWorx traffic to the new leader.
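The handoff in steps 2 and 3 is coordinated through ZooKeeper leader election. The following sketch is not ThingWorx code; it is a minimal illustration of ZooKeeper-based leader election using the Apache Curator LeaderLatch recipe, with a hypothetical connection string, election path, and node ID.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper ensemble and election path, not ThingWorx's actual values.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderLatch latch = new LeaderLatch(client, "/demo/leader-election", "server-1");
        latch.addListener(new LeaderLatchListener() {
            public void isLeader() {
                // A real server would connect to the database, initialize its model,
                // and start accepting routed traffic here (steps 4 through 6 above).
                System.out.println("This node is now the leader");
            }
            public void notLeader() {
                System.out.println("This node is no longer the leader");
            }
        });
        latch.start();

        Thread.sleep(Long.MAX_VALUE); // keep the demo process alive
    }
}

In this pattern, whichever participant acquires the latch performs the leader-only startup work, while the remaining participants stay on standby until ZooKeeper notifies them that leadership has changed.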
Load Balancer Failures
Actions and results will depend on the deployed load balancer HA solution. Active sessions should not be interrupted if the load balancer has the capacity to share session content across all load balancer nodes.
HAProxy Server Failure
If the only HAProxy node fails or all the HAProxy nodes fail, the following occurs:
The ThingWorx leader will still be accessible through its IP address, but not through the HAProxy IP address.
Requests to ThingWorx through HAProxy will not reach ThingWorx.
If one of multiple HAProxy nodes fails, the following occurs:
Existing user sessions will be recognized in ThingWorx Composer once the backup HAProxy becomes the new master. Users should not have to log in again.
Mashups will not be loaded until the backup HAProxy becomes the master.
Browsing entities in Composer will not be loaded until the backup HAProxy becomes the master.
Requests will not reach ThingWorx until the backup HAProxy becomes the master.
ZooKeeper Node Failures
One ZooKeeper Node Failure
If one of three ZooKeeper nodes fails, the following occurs:
Other ZooKeeper nodes detect the failure to respond.
A new ZooKeeper leader is elected if the failed node is the current leader.
ThingWorx servers are not impacted. They remain active and accessible.
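To verify which ensemble members are still up and which one is the leader, each ZooKeeper node can be queried with the built-in srvr four-letter-word command. This is a generic health check, not a ThingWorx utility; the host names below are hypothetical, and on ZooKeeper 3.5 and later the srvr command must be allowed by 4lw.commands.whitelist (it is whitelisted by default).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ZkEnsembleCheck {
    public static void main(String[] args) {
        // Hypothetical ensemble members; replace with the real ZooKeeper hosts.
        String[] hosts = {"zk1", "zk2", "zk3"};
        for (String host : hosts) {
            try (Socket socket = new Socket(host, 2181)) {
                OutputStream out = socket.getOutputStream();
                out.write("srvr".getBytes(StandardCharsets.US_ASCII)); // four-letter-word command
                out.flush();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("Mode:")) { // reports leader, follower, or standalone
                        System.out.println(host + " -> " + line);
                    }
                }
            } catch (Exception e) {
                System.out.println(host + " -> unreachable (" + e.getMessage() + ")");
            }
        }
    }
}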
Two ZooKeeper Node Failures
If two of the three ZooKeeper nodes fail, the following occurs:
ZooKeeper leader election cannot take place, because a three-node ZooKeeper ensemble requires two available servers to form a quorum. In general, an ensemble of N nodes tolerates at most ceil(N/2) - 1 failures; for N = 3 that is one failure, so losing two nodes breaks quorum.
The remaining ZooKeeper instance is neither a leader nor a standby.
The ThingWorx leader will be shut down, since it cannot communicate with ZooKeeper for leader election.
The ThingWorx standby server will keep trying to reconnect to ZooKeeper until at least one other ZooKeeper node comes back up.
Once two or more ZooKeeper nodes are back online, ZooKeeper leader election will occur. The ThingWorx standby node will reconnect to ZooKeeper and come back as the new leader.
The previous ThingWorx leader must be restarted to become the standby.
ThingWorx and ZooKeeper Failure
When the leaders for both ZooKeeper and ThingWorx fail, the following occurs:
All conditions listed under ‘One ZooKeeper Node Failure’ apply. The new ZooKeeper leader must be determined first so that it can then coordinate the election of the new ThingWorx leader.
All conditions listed under ‘ThingWorx Leader Node Fails’ apply.
PostgreSQL Failures
This discussion of PostgreSQL failures is based on this configuration:
Three PostgreSQL nodes: a master (writer) node and two standby nodes, referred to below as the synchronous (sync) standby and the potential standby
Using streaming replication between PostgreSQL nodes
Two Pgpool-II nodes that manage client access to PostgreSQL nodes and manage PostgreSQL recovery procedures
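One way to confirm that streaming replication is healthy in a setup like this is to query pg_stat_replication on the writer and pg_is_in_recovery() on each node. The JDBC sketch below uses hypothetical host names, database name, and credentials.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details for the current writer (master) node.
        String url = "jdbc:postgresql://pg-writer:5432/thingworx";
        try (Connection conn = DriverManager.getConnection(url, "postgres", "secret");
             Statement stmt = conn.createStatement()) {

            // One row per standby streaming from this writer; state should be 'streaming'.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT application_name, state, sync_state FROM pg_stat_replication")) {
                while (rs.next()) {
                    System.out.printf("%s state=%s sync_state=%s%n",
                            rs.getString(1), rs.getString(2), rs.getString(3));
                }
            }

            // Returns false on the writer and true on nodes replaying WAL as standbys.
            try (ResultSet rs = stmt.executeQuery("SELECT pg_is_in_recovery()")) {
                rs.next();
                System.out.println("in recovery: " + rs.getBoolean(1));
            }
        }
    }
}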
If a PostgreSQL server fails, the active Pgpool-II node detects the failure and stops routing requests to that server. User or device data being saved at the time of the failure could be lost if the information had not been committed and replicated to other nodes before the failure.
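Pgpool-II reports which backends it currently considers attached, along with their roles, through the SHOW pool_nodes pseudo-SQL command. The sketch below assumes the command can be issued through a normal client connection to the Pgpool-II port; the connection details are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class PoolNodesCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical Pgpool-II virtual IP and port; clients connect here, not to PostgreSQL directly.
        String url = "jdbc:postgresql://pgpool-vip:9999/thingworx";
        try (Connection conn = DriverManager.getConnection(url, "postgres", "secret");
             Statement stmt = conn.createStatement();
             // SHOW pool_nodes is answered by Pgpool-II itself, not by PostgreSQL.
             ResultSet rs = stmt.executeQuery("SHOW pool_nodes")) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    row.append(md.getColumnName(i)).append('=').append(rs.getString(i)).append("  ");
                }
                System.out.println(row); // includes each backend's status (up/down) and role (primary/standby)
            }
        }
    }
}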
When the master PostgreSQL node fails (assuming the sync and potential nodes are up and running), the following occurs:
Failover to the sync node occurs through Pgpool-II. The potential node now becomes the sync node to the new master node. Writes to the database are still possible (such as creating new entities and writing data to BDWS).
If the original master comes back up, you need to manually clean up and configure your environment to use the original master.
When both standby nodes fail (assuming the master node is still up and running), the following occurs:
No failover occurs, and the master node has no standby nodes available for replication.
Composer will still be accessible. Entities will be loaded and can be viewed but not saved. Logs can be viewed.
Actions that require writes to the database (such as creating and saving an entity, setting values to persistent properties, and adding a stream entry) will not be successful since PostgreSQL must replicate the data to a standby node.
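The write failures described above follow from synchronous replication: when synchronous_standby_names lists a standby and synchronous_commit is on, a commit on the master waits for that standby to acknowledge the write, so commits hang while no standby is available. A quick way to inspect those settings (with hypothetical connection details) is sketched below.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SyncReplicationSettings {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details for the current master (writer) node.
        String url = "jdbc:postgresql://pg-writer:5432/thingworx";
        try (Connection conn = DriverManager.getConnection(url, "postgres", "secret");
             Statement stmt = conn.createStatement()) {
            for (String setting : new String[]{"synchronous_commit", "synchronous_standby_names"}) {
                try (ResultSet rs = stmt.executeQuery("SHOW " + setting)) {
                    rs.next();
                    // With a non-empty synchronous_standby_names and no reachable standby,
                    // commits wait for an acknowledgement that never arrives.
                    System.out.println(setting + " = " + rs.getString(1));
                }
            }
        }
    }
}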
When the master node and the sync standby node fail, the following occurs:
Failover to the potential node occurs. The potential node is now the master node, with no standby nodes available for replication.
Composer will be accessible. Entities will be loaded and can be viewed but not saved. Logs can be viewed.
Actions that require writes to the database (such as creating and saving an entity, setting values to persistent properties, and adding a stream entry) will not be successful since the writes will hang and eventually fail.
When all three nodes fail, the following occurs:
Failover will not occur since there are no available nodes.
Composer has no access to the database. Entities will not be loaded, most services will not work (subsystem services, such as the Platform subsystem, may still work), and system functionality is limited (logs, system monitoring, and mashups may still work).
Writes to and reads from the database will not be possible.
ThingWorx and PostgreSQL Failure
When the ThingWorx leader and PostgreSQL master both fail, the following occurs:
After the ThingWorx leader goes down, the standby ThingWorx instance becomes the leader, and the sync node of the PostgreSQL database becomes the new PostgreSQL master node.
The potential node of the PostgreSQL database becomes the new sync node.
ThingWorx Composer is available and writes to the PostgreSQL database can be made (entities can be created, edited, and deleted, and data can be added).
If the original PostgreSQL master node must be reset as the master node, you must manually clean up the PostgreSQL nodes and Pgpool-II.
Pgpool-II Node Failure
Active Pgpool-II Node Failure
If the active Pgpool-II node fails, the backup detects the failure and takes over the handling of all requests to the PostgreSQL servers. Users logged onto the active ThingWorx server may experience delays in their applications, and user or device data that was being saved when the Pgpool-II node failed could be lost.
All Pgpool-II Nodes Failure
When all Pgpool-II instances fail, ThingWorx will lose access to PostgreSQL content and most functions will fail. Some limited functionality may be available in the following areas:
Logging - the Application log may still be updated with error messages.
System monitoring (such as the MonitoringPlatformStats mashup)
Mashups - widgets that do not rely on services or data from the database might still work.
Property values from non-persistent properties
Services that do not involve the database.
ThingWorx and Pgpool-II Failure
When the ThingWorx leader and Pgpool-II master instances fail simultaneously, the following occurs:
The ThingWorx leader loses leadership, so one of the standby nodes becomes the leader.
The Pgpool-II master loses the virtual IP address. One of the Pgpool-II standby nodes becomes the master, and the virtual IP address is reassigned to it.
The ThingWorx standby server will attempt to fully connect to the database and will succeed once the new Pgpool-II master is established (see the sketch after this list).
The procedure and behavior listed under ‘ThingWorx Leader Node Fails’ will apply.
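The reconnect behavior described in this list can be pictured as a simple retry loop against the Pgpool-II virtual IP: connection attempts fail until the new Pgpool-II master claims the address, then succeed. A minimal sketch with hypothetical connection details:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class RetryConnectSketch {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical Pgpool-II virtual IP; during failover the address moves to the new master.
        String url = "jdbc:postgresql://pgpool-vip:9999/thingworx";
        long delayMs = 1000;
        while (true) {
            try (Connection conn = DriverManager.getConnection(url, "postgres", "secret")) {
                System.out.println("Connected through the new Pgpool-II master");
                break;
            } catch (SQLException e) {
                System.out.println("Database not reachable yet, retrying in " + delayMs + " ms");
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, 30_000); // exponential backoff, capped at 30 seconds
            }
        }
    }
}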
Pgpool-II and PostgreSQL Failure
When PostgreSQL and Pgpool-II fail, the following occurs:
The PostgreSQL standby node becomes the master node.
The Pgpool-II standby node becomes the master node, and the virtual IP address is transferred to it.
Services are briefly unavailable during failover.