Expected Behaviors when Failures Occur
This topic describes the actions that occur in a ThingWorx clustering configuration as it responds to the failure of one or more of its components.
Load Balancer Failures
Actions and results depend on the deployed load balancer HA solution. Active sessions should not be interrupted, provided the load balancer can share session content across all of its nodes.
HAProxy Server Failure
If the only HAProxy node fails or all the HAProxy nodes fail, the following occurs:
• The ThingWorx leader will still be accessible through its IP address, but not through the HAProxy IP address.
• Requests to ThingWorx through HAProxy will not reach ThingWorx.
If one of multiple HAProxy nodes fails, the following occurs:
• Existing user sessions will be recognized in ThingWorx Composer once the backup HAProxy becomes the new master. Users should not have to log in again.
• Mashups will not be loaded until the backup HAProxy becomes the master.
• Browsing entities in Composer will not work until the backup HAProxy becomes the master.
• Requests will not reach ThingWorx until the backup HAProxy becomes the master.
ThingWorx Server Failures
When a ThingWorx server fails, the following happens:
• The server is removed from the load balancer because its health check ping fails. How quickly it is removed depends on the configured check frequency.
• ZooKeeper will detect that the server has failed, remove it from internal service discovery, and notify watchers, such as the Connection Server (see the sketch at the end of this section).
• If the server was the singleton server, ZooKeeper will notify the other servers and select a new singleton server.
Clients connected to the failed server may receive errors during the switchover but will then reconnect to a new server.
Data can be lost if a server fails or is killed; in these cases, the data in its batch queues is lost. If the server is shut down cleanly instead, the same steps occur because the server deregisters itself, and its batch queues are drained before it stops.
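The registration and notification flow can be pictured with Apache Curator, a ZooKeeper client library. The sketch below is illustrative only and does not reflect the platform's actual registry layout; the connection string, the /example/servers path, and the node name are assumptions. Each server registers an ephemeral node that disappears when its ZooKeeper session expires, and a watcher (as a Connection Server might use) is notified of the removal:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.PathChildrenCache;
import org.apache.curator.framework.recipes.cache.PathChildrenCacheEvent;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

public class ServerRegistrySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper connection string.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // A server registers itself with an EPHEMERAL node: it is removed
        // automatically when the server's ZooKeeper session expires.
        client.create().creatingParentsIfNeeded().withMode(CreateMode.EPHEMERAL)
              .forPath("/example/servers/twx-node-1", "10.0.0.11:8080".getBytes());

        // A watcher (for example, a Connection Server) observes the same path
        // and reacts when a server disappears.
        PathChildrenCache watcher = new PathChildrenCache(client, "/example/servers", true);
        watcher.getListenable().addListener((c, event) -> {
            if (event.getType() == PathChildrenCacheEvent.Type.CHILD_REMOVED) {
                System.out.println("Server gone: " + event.getData().getPath());
                // Stop routing traffic to this server.
            }
        });
        watcher.start();

        Thread.sleep(60_000); // keep the sketch alive long enough to observe events
    }
}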
ThingWorx Platform Nodes are Down
As long as at least one Platform instance is in a healthy state, other nodes can be restarted without impacting the system. However, if all Platform nodes are taken down or become unhealthy, the state stored in Ignite can become inconsistent. In this case, all Platform nodes and all Ignite nodes must be stopped and restarted in the following order:
1. Stop all Platform nodes.
2. Stop all Ignite nodes.
3. Restart all Ignite nodes.
4. Restart all Platform nodes.
If Ignite is not restarted, bind maps and other data stored in Ignite will be incorrect and will cause unexpected behavior over time.
ZooKeeper Failures
Node Failure
If one of the ZooKeeper nodes fails, the following occurs:
• Other ZooKeeper nodes detect the failure to respond.
• A new ZooKeeper leader is elected if the failed node is the current leader.
Multiple Node Failure
If multiple nodes fail and ZooKeeper loses its quorum, it will drop into read-only mode and reject requests for changes.
• Leader election for ZooKeeper cannot take place, since the original three-node ZooKeeper ensemble expects two servers to be available to form a quorum. In general, the maximum number of failures allowed for an ensemble of N nodes is ceil(N/2) - 1; with three nodes that is one failure, so losing two nodes breaks the quorum (see the arithmetic sketch after this list).
• The remaining ZooKeeper instance(s) are neither a leader nor a standby.
• All clients will receive a SUSPENDED and eventually a LOST connection state.
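As a quick check of that quorum arithmetic, the short sketch below simply evaluates the formula above to show how many failures an ensemble of a given size can tolerate:

public class ZkQuorumMath {
    // Quorum for an ensemble of N nodes is floor(N/2) + 1, so the number of
    // failures that can be tolerated is N - quorum, which equals ceil(N/2) - 1.
    static int maxTolerableFailures(int ensembleSize) {
        int quorum = ensembleSize / 2 + 1;
        return ensembleSize - quorum;
    }

    public static void main(String[] args) {
        System.out.println(maxTolerableFailures(3)); // 1: losing 2 of 3 nodes breaks quorum
        System.out.println(maxTolerableFailures(5)); // 2
    }
}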
ThingWorx Servers
• The singleton role will be unassigned while in the SUSPENDED state. During this time, no timers or schedulers will run.
• In the LOST state, all nodes will shut down.
• If ZooKeeper recovers before the system times out, a new singleton will be elected and processing will continue.
Connection Server
• If the Connection Server receives a LOST state from ZooKeeper, it will shut down, because it no longer knows the status of the ThingWorx servers.
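SUSPENDED and LOST are the standard connection states exposed by ZooKeeper client libraries such as Apache Curator. The sketch below shows how a client might react to them; it is illustrative only, is not the Connection Server's actual implementation, and the connection string is an assumption:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkStateSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",          // hypothetical ensemble
                new ExponentialBackoffRetry(1000, 3));

        client.getConnectionStateListenable().addListener((c, newState) -> {
            if (newState == ConnectionState.SUSPENDED) {
                // The connection is down but the session may still recover:
                // pause singleton work such as timers and schedulers and wait.
            } else if (newState == ConnectionState.LOST) {
                // The session is gone: shut down, because the health of the
                // other servers can no longer be determined.
                System.exit(1);
            }
        });
        client.start();

        Thread.sleep(60_000); // keep the sketch alive long enough to observe state changes
    }
}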
Ignite
• It is assumed that the ZooKeeper cluster is always visible to all nodes in the cluster. If a node disconnects from ZooKeeper, it shuts down and other nodes treat it as failed or disconnected.
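As an illustration of that dependency, the sketch below configures an Ignite node to use ZooKeeper-based discovery. It assumes the ignite-zookeeper module is on the classpath, and the connection string, root path, and timeout are hypothetical; the session timeout governs how quickly a node that loses its ZooKeeper connection is treated as failed:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

public class IgniteZkDiscoverySketch {
    public static void main(String[] args) {
        ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
        zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181"); // hypothetical ensemble
        zkSpi.setSessionTimeout(30_000); // a node disconnected longer than this is treated as failed
        zkSpi.setZkRootPath("/exampleIgnite");

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(zkSpi);

        Ignite ignite = Ignition.start(cfg);
        System.out.println("Joined cluster as node " + ignite.cluster().localNode().id());
    }
}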
Ignite Failures
The impact of Ignite failures depends on the data replication level. Ignite should always be configured to store data on at least two nodes in the cluster, so the loss of any one node has no impact on the system.
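For example, a cache with one backup keeps every entry on two nodes, so a single node failure loses no data. The sketch below is illustrative only; the cache name and settings are assumptions and do not reflect the platform's actual cache definitions:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CacheWriteSynchronizationMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class IgniteBackupSketch {
    public static void main(String[] args) {
        CacheConfiguration<String, String> cacheCfg =
                new CacheConfiguration<>("exampleCache"); // hypothetical cache name
        cacheCfg.setCacheMode(CacheMode.PARTITIONED);
        cacheCfg.setBackups(1); // primary copy plus one backup: data lives on two nodes
        cacheCfg.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCacheConfiguration(cacheCfg);

        Ignite ignite = Ignition.start(cfg);
        IgniteCache<String, String> cache = ignite.getOrCreateCache(cacheCfg);
        cache.put("key", "value"); // stored on the primary node and one backup node
    }
}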
If multiple nodes are lost, data loss is possible, which can put the system into an inconsistent state. If this happens, we recommend a full shutdown of both Ignite and ThingWorx; you can then restart Ignite and then restart ThingWorx. Ignite acts as the application memory, and if its contents are lost, processing behavior can be very inconsistent.
If there is a total Ignite failure in which ThingWorx cannot connect to any Ignite nodes, ThingWorx will shut down.
PostgreSQL Failures
This discussion of PostgreSQL failures is based on this configuration:
• Three PostgreSQL nodes (writer, reader, and standby)
• Using streaming replication between PostgreSQL nodes
• Two Pgpool-II nodes that manage client access to PostgreSQL nodes and manage PostgreSQL recovery procedures
If a PostgreSQL server fails, the active Pgpool-II node detects the failure and stops routing requests to that server. User or device data being saved at the time of the failure could be lost if the information had not been committed and replicated to other nodes before the failure.
When the master PostgreSQL node fails (assuming the sync and potential nodes are up and running), the following occurs:
• Failover to the sync node occurs through Pgpool-II. The potential node now becomes the sync node to the new master node. Writes to the database are still possible (such as creating new entities and writing data to BDWS).
• If the original master comes back up, you need to manually clean up and configure your environment to use the original master.
When both standby nodes fail (assuming the master node is still up and running), the following occurs:
• No failover occurs, and the master node is left with zero standby nodes for replication.
• Composer will still be accessible. Entities will be loaded and can be viewed but not saved. Logs can be viewed.
• Actions that require writes to the database (such as creating and saving an entity, setting values to persistent properties, and adding a stream entry) will not be successful since PostgreSQL must replicate the data to a standby node.
When the master node and the sync standby node fail, the following occurs:
• Failover to the potential node occurs. The potential node is now the master node with zero nodes for replication.
• Composer will be accessible. Entities will be loaded and can be viewed but not saved. Logs can be viewed.
• Actions that require writes to the database (such as creating and saving an entity, setting values to persistent properties, and adding a stream entry) will not be successful since the writes will hang and eventually fail.
When all three nodes fail, the following occurs:
• Failover will not occur since there are no available nodes.
• Composer has no access to the database. Entities will not load, most services will not work (subsystem services, such as those of the platform subsystem, may still work), and system functionality is limited (logs, system monitoring, and mashups may work).
• Writes to and reads from the database will not be possible.
ThingWorx and PostgreSQL Failure
When the ThingWorx leader and PostgreSQL master both fail, the following occurs:
• The standby ThingWorx instance becomes the leader after the ThingWorx leader goes down, and the sync node of the PostgreSQL database becomes the PostgreSQL master node.
◦ The potential node of the PostgreSQL database becomes the new sync node.
◦ ThingWorx Composer is available and writes to the PostgreSQL database can be made (entities can be created, edited, and deleted, and data can be added).
• If the original PostgreSQL master node must be reset as the master node, you must manually clean up the PostgreSQL nodes and Pgpool-II.
Pgpool-II Node Failure
Active Pgpool-II Node Failure
If the active Pgpool-II node fails, the backup node detects the failure and takes over handling all requests to the PostgreSQL servers. Users logged in to the active ThingWorx server may experience delays in their applications, and user or device data that was being saved when the Pgpool-II node failed could be lost.
All Pgpool-II Nodes Failure
When all Pgpool-II instances fail, ThingWorx will lose access to PostgreSQL content and most functions will fail. Some limited functionality may be available in the following areas:
• Logging - the Application log may still be updated with error messages.
• System monitoring (such as the MonitoringPlatformStats mashup).
• Mashups - widgets that do not rely on services or data from the database might still work.
• Property values of non-persistent properties.
• Services that do not involve the database.
ThingWorx and Pgpool-II Failure
When the ThingWorx leader and Pgpool-II master instances fail simultaneously, the following occurs:
• The ThingWorx leader loses leadership, so one of the standby nodes becomes the leader.
• The Pgpool-II master loses the virtual IP address. One of the Pgpool-II standby nodes becomes the master, and the virtual IP address is reassigned to it.
• The ThingWorx standby server will attempt to fully connect to the database and will succeed once the new Pgpool-II master is established.
• The procedure and behavior listed under ThingWorx leader failure will apply.
Pgpool-II and PostgreSQL Failure
When PostgreSQL and Pgpool-II fail, the following occurs:
• The PostgreSQL standby node becomes the master node.
• The Pgpool-II standby node becomes the master node, and the virtual IP address is transferred to it.
• Services are briefly unavailable during failover.