Troubleshooting for HA Clustering

Application key exists on each server but its value is blank

This issue occurs when the encryption key is different on each server. This can happen if the keystore is not properly shared.

You can configure the keystore-password and keystore.pfx files using the Security Management Tool. The ThingworxStorage directory should then be shared by each platform instance.

If that is not possible, you must start one server to create the keystore-password and keystore.pfx files, and then copy them to the other machine before starting it:

1. Start one server to create the /ThingworxPlatform/keystore-password and /ThingworxStorage/keystore.pfx files.

2. Copy those files to the other server, and then start the other server.
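
To confirm that the copied files are identical on both servers, you can compare checksums. The following is a minimal sketch, not part of ThingWorx: the function names file_sha256 and same_content are hypothetical, and it assumes both copies are readable from the machine where the script runs (otherwise, run file_sha256 on each server and compare the printed digests).

```python
import hashlib


def file_sha256(path):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True when both files have byte-identical content."""
    return file_sha256(path_a) == file_sha256(path_b)


# Example (paths taken from the steps above):
# print(file_sha256("/ThingworxStorage/keystore.pfx"))
# print(file_sha256("/ThingworxPlatform/keystore-password"))
```

If the digests differ between servers, the keystore was not copied correctly and application key values will not decrypt on the second server.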

Thing created on server A is not on server B

High availability (HA) works by synchronizing the model through the database. This synchronization happens approximately every 100 ms but can be configured. All servers must be pointing to the same PostgreSQL database instance; therefore, check the database configuration in the platform settings to ensure the connection settings match. If you want to run PostgreSQL in an HA cluster, you must follow the PostgreSQL HA configuration using Pgpool-II and a primary-secondary PostgreSQL configuration.

Property values are not being set on all servers

If you set an in-memory property value on server A and do not see the value update on server B, the cache layer that holds the state is not configured properly. ThingWorx stores property states in the Apache Ignite cache, which can be run as embedded or distributed. A topology log is written to the application log and Ignite logs, which show the number of clients and servers in the cluster. You should validate that this logged number of servers matches your expected number of servers. If the servers cannot talk to each other, which could be due to firewalls, Ignite will only run locally.

For example:

# log entry showing platform 1 has 2 clients and 1 server
platform1_1 | 13-Jan-2020 17:08:53.231 INFO [disco-event-worker-#37%twx-core-server%] org.apache.ignite.logger.java.JavaLogger.info Topology snapshot [ver=5, locNode=0cab6e47, servers=1, clients=2, state=ACTIVE, CPUs=12, offheap=1.6GB, heap=9.9GB]
# log entry showing platform 2 has 3 clients and 1 server
platform2_1 | 13-Jan-2020 15:02:29.736 INFO [ForkJoinPool.commonPool-worker-1] org.apache.ignite.logger.java.JavaLogger.info Topology snapshot [ver=4, locNode=c7383c40, servers=1, clients=3, state=ACTIVE, CPUs=16, offheap=1.6GB, heap=14.0GB]
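
To check these counts across many log lines, the topology snapshot can be parsed with a short script. This is a sketch only: parse_topology_snapshot is a hypothetical helper, and it assumes the snapshot format shown in the log entries above.

```python
import re

# Matches the bracketed field list that follows "Topology snapshot".
TOPOLOGY_PATTERN = re.compile(r"Topology snapshot \[(?P<fields>[^\]]*)\]")


def parse_topology_snapshot(line):
    """Return servers, clients, and state from a snapshot line, or None."""
    match = TOPOLOGY_PATTERN.search(line)
    if match is None:
        return None
    # Fields look like "ver=5, locNode=0cab6e47, servers=1, clients=2, ...".
    fields = dict(
        item.split("=", 1) for item in re.split(r",\s*", match.group("fields"))
    )
    return {
        "servers": int(fields["servers"]),
        "clients": int(fields["clients"]),
        "state": fields["state"],
    }
```

Compare the returned servers value with the number of Ignite servers you expect. A lower number suggests that the nodes cannot reach each other.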

Server will not start due to HTTP_PORT error

The HTTP_PORT environment variable, which identifies the port through which external services connect to ThingWorx, must be defined before server startup in order to implement service discovery. This is the port exposed by the Apache Tomcat instance on which the platform application is running. The environment variable must be available to the process running Tomcat. You can configure it in the Tomcat setenv file (setenv.sh or setenv.bat), or you can update the service definition.

Tomcat service definition:

[Unit]
Description=Apache Tomcat Web Application Container
After=network.target
[Service]
Type=forking
PIDFile=/var/run/tomcat.pid
Environment=CATALINA_PID=/var/run/tomcat.pid
Environment=JAVA_HOME=/usr/lib/jvm/jdk1.8.0_191
Environment=HTTP_PORT=8080
Environment=CATALINA_HOME=/usr/share/tomcat9/9.0.26
Environment=CATALINA_BASE=/usr/share/tomcat9/9.0.26
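
Before starting the server, you can check that HTTP_PORT is set to a usable value. The sketch below is illustrative: read_http_port is a hypothetical helper, and it only sees the environment of the process that runs it, so run it under the same user and environment as Tomcat.

```python
import os


def read_http_port(environ=None):
    """Return HTTP_PORT as an int, or raise ValueError if it is unusable."""
    environ = os.environ if environ is None else environ
    raw = environ.get("HTTP_PORT")
    if raw is None or not raw.strip():
        raise ValueError("HTTP_PORT is not set")
    try:
        port = int(raw)
    except ValueError:
        raise ValueError(f"HTTP_PORT is not a number: {raw!r}") from None
    # Valid TCP ports are 1 through 65535.
    if not 1 <= port <= 65535:
        raise ValueError(f"HTTP_PORT is out of range: {port}")
    return port


# Example: print(read_http_port())
```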

ThingWorx Connection Server is unable to connect to ThingWorx with authentication errors

You must create an application key on the ThingWorx server and add this application key to the ThingWorx Connection Server configuration. If this key does not exist or does not match, the Connection Server will throw authentication errors when trying to create connections to the platform.

ThingWorx Connection Server is unable to find ThingWorx servers

If the ThingWorx Connection Server will not connect to the platform, make sure the HTTP_PORT environment variable stored on the ThingWorx server is set to the port where the platform is running and the service name matches what is configured for ThingWorx. If either is wrong, the Connection Server will not find the ThingWorx servers.

Also, the ThingWorx server could have registered a bad address in Apache ZooKeeper. This can happen when the ThingWorx server tries to determine the IP address of the machine on which ZooKeeper is running. The address resolver will scan all IP addresses on all network interfaces on the host machine to determine the IP address most likely to be the machine's LAN address. If the machine has multiple IP addresses, this method will prefer a site-local IP address if the machine has one (for example, 192.168.x.x or 10.10.x.x, usually IPv4) and will return the first site-local address if the machine has more than one. If the machine does not have a site-local address, this method will return the first non-loopback address found (IPv4 or IPv6).
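
The selection order described above can be sketched as follows. This is not the ThingWorx resolver: pick_lan_address is a hypothetical function, and the site-local ranges used here (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, and fec0::/10) are an assumption about what counts as site-local.

```python
import ipaddress

# Assumed site-local ranges; the actual resolver may use a different definition.
SITE_LOCAL_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "fec0::/10")
]


def is_site_local(address):
    """Return True when the address falls in one of the assumed ranges."""
    return any(
        address.version == network.version and address in network
        for network in SITE_LOCAL_NETWORKS
    )


def pick_lan_address(candidates):
    """Pick the first site-local address, else the first non-loopback one."""
    parsed = [ipaddress.ip_address(candidate) for candidate in candidates]
    for address in parsed:
        if is_site_local(address):
            return str(address)
    for address in parsed:
        if not address.is_loopback:
            return str(address)
    return None
```

On a host with several interfaces, the order in which addresses are scanned decides which site-local address is chosen, which is how an unexpected address can end up registered in ZooKeeper.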

Application performance issues

In a clustering environment, how an application is written has a greater impact on performance because the Thing data is distributed. It is important to reduce round trips outside of the server where scripts are running. For more information, see Best Practices for HA Applications.

Cache provider will not start

If the Apache Ignite cache provider will not start, there could be a configuration issue. For example:

platform1_1 | 2020-07-13 17:34:14.965+0000 [L: ERROR] [O: E.c.q.l.c.Logger] [I: ] [U: SuperUser] [S: ] [P: platform1] [T: main] *** CRITICAL ERROR ON STARTUP: Failed to start CacheProvider com.thingworx.cache.ignite.IgniteCacheProvider

We recommend checking the ZooKeeper nodes to ensure that ZooKeeper is running correctly and verifying that the local node can access the ZooKeeper server on the configured port.

If ZooKeeper is fine, check the Ignite cluster to see if it is running properly. Check the log for issues and the topology snapshot to ensure the cluster is the correct size. Verify that the local node can access the Ignite host on all required ports.
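
A basic TCP reachability check from the local node can be sketched as follows. The helper name can_connect and the host name in the usage comment are hypothetical; substitute the hosts and ports from your own ZooKeeper and Ignite configuration. A successful connection only shows that the port is open, not that the service is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True when a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example with a hypothetical host name and the ZooKeeper client port:
# print(can_connect("zookeeper1.example.com", 2181))
```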

Apache Ignite Client shuts down ThingWorx JVM

The Ignite client can potentially shut down the ThingWorx JVM, with the following errors in the Catalina logs:

SEVERE: Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-5, threadName=sys-stripe-5-#6%twx-core-server%, blockedFor=10s] ...
SEVERE: JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.IgniteInterruptedException: Got interrupted while waiting for future to complete.]]

If the Ignite nodes cannot sync for a period of time, Ignite shuts down the JVM because memory is getting out of sync.

This behavior usually stems from application design issues and is typically observed when very large infotables are used as properties. Each write operation on these large infotables causes a full sync of the table property, which results in the JVM halting.

We recommend converting large infotable properties into database tables.

Apache Ignite Connectivity Issues

If Ignite has connection timeout, client connectivity, or network latency issues, enable the following advanced Ignite configurations under the cache settings in the platform-settings.json file. See the Ignite documentation for information about how to configure each of the values. For more information about the platform-settings.json file, see Platform Settings for ThingWorx HA.

# This failure timeout automatically controls the following parameters: getSocketTimeout(), getAckTimeout(), getMaxAckTimeout(), getReconnectCount().
# If any of those parameters is set explicitly, the failure timeout setting will be ignored. For example, for stable low-latency networks the
# failure detection timeout may be set to ~120 ms.

failure-detection-timeout = 10000
client-failure-detection-timeout = 30000

# should only be used for advanced configuration
tcp-communication-spi {
connection-timeout = 5000
socket-write-timeout = 2000
slow-client-queue-limit = 0
}

# should only be used for advanced configuration
tcp-discovery-spi {
socket-timeout = 5000
ack-timeout = 5000
join-timeout = 5000
network-timeout = 5000
connection-recovery-timeout = 5000
reconnect-count = 10
reconnect-delay = 2000
so-linger = 5
stats-print-frequency = 0
}

Platform node stuck on restart

If a Platform node in the cluster is restarted and remains stuck during startup at the log message Initializing shared changelog state..., the cause may be heavy contention on Ignite. The Platform node tries to acquire a permit from Ignite on startup, but because of the heavy load on Ignite, the stale permit is not released. If you encounter this, restart the Ignite nodes.

This will allow the Platform node to acquire a permit and complete startup.

Stuck locks in model providers

Model sync uses database locks to guarantee consistency in the change log. A stuck lock could hang the whole system, at least for model changes. If you encounter stuck locks, you can do the following:

• In PostgreSQL:

a. Set a lock timeout in PostgreSQL to avoid hanging while waiting for stuck locks, as described in the following: https://www.postgresql.org/docs/9.3/static/runtime-config-client.html

For example:

SET lock_timeout = 3000

b. Try to acquire the lock on a table.

c. If the server fails to acquire the lock because of the lock timeout, determine the age of the existing lock using the following query:

select extract(epoch from (now() - query_start)) from pg_stat_activity where query like '%lock <tableName> in exclusive mode;'

d. If the age of the lock is above the set threshold, execute the following query to find the process holding the lock on a given table:

SELECT t.schemaname,
t.relname,
l.locktype,
l.page,
l.virtualtransaction,
l.pid,
l.mode,
l.granted
FROM pg_locks l
JOIN pg_stat_all_tables t ON l.relation = t.relid
WHERE t.schemaname <> 'pg_toast'::name AND t.schemaname <> 'pg_catalog'::name and t.relname = '<tableName>'

e. Terminate the process holding the lock using the following command, where pid is the process ID returned by the previous query:

SELECT pg_terminate_backend(pid);

• In MS SQL:

a. Set a lock timeout in MS SQL to avoid hanging while waiting for stuck locks, as described in the following: https://docs.microsoft.com/en-us/sql/t-sql/statements/set-lock-timeout-transact-sql?view=sql-server-2017

For example:

set lock_timeout 3000;

b. Try to acquire the lock on a table.

c. If the server fails to acquire the lock because of the lock timeout, execute the following query to find the process holding the lock on a given table:

select
object_name(p.object_id) as tablename, request_owner_id, session_id
from
sys.dm_tran_locks l
inner join sys.partitions p on l.resource_associated_entity_id = p.object_id
inner join sys.dm_tran_session_transactions tst on l.request_owner_id = tst.transaction_id
and object_name(p.object_id) = '<tableName>'

d. Determine the age of the existing lock using the following query by passing in session_id retrieved in the previous step:

select datediff(second, (select last_batch from master.dbo.sysprocesses where spid = <session_id>), CURRENT_TIMESTAMP)

e. If the age of the lock is above the set threshold, use the session_id from the previous query result to terminate the process holding the lock using the following command:

kill <session_id>;

EnsembleTracker Errors

If you are getting errors like the following in the Application log, even when your ZooKeeper servers are up and running, the problem may be that the Apache Curator framework is not able to handle a configuration change.

2021-03-22 06:35:10.092+0000 [L: ERROR] [O: o.a.c.f.i.EnsembleTracker] [I: ] [U: ] [S: ] [P: VTWSandboxSVA2] [T: main-EventThread] Invalid config event received: {server.2=52.149.229.159:2888:3888:participant, server.1=0.0.0.0:2888:3888:participant, server.3=13.92.154.53:2888:3888:participant, version=0}

The solution is to change the configuration for the server mappings used in zoo.cfg to include the client port.

The following configuration can cause an error:

server.1= xx.xxx.x.xx:2888:3888
server.2= xx.xxx.x.xx:2888:3888
server.3= xx.xxx.x.xx:2888:3888

Therefore, update the configuration to include the client port as follows:

server.1= xx.xxx.x.xx:2888:3888;2181
server.2= xx.xxx.x.xx:2888:3888;2181
server.3= xx.xxx.x.xx:2888:3888;2181
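
For a zoo.cfg with many server entries, the change can be applied with a small script. This is a sketch under assumptions: add_client_port is a hypothetical helper, 2181 is used as the client port because it appears in the example above, and entries that already contain a semicolon are left untouched. Review the result before replacing your configuration file.

```python
import re

# Matches "server.N=<address spec>" lines that have no ";<clientPort>" suffix.
SERVER_LINE = re.compile(r"^(server\.\d+\s*=\s*[^;\s]+)\s*$")


def add_client_port(zoo_cfg_text, client_port=2181):
    """Append ;<client_port> to server.N lines that lack a client port."""
    updated = []
    for line in zoo_cfg_text.splitlines():
        match = SERVER_LINE.match(line)
        if match:
            updated.append(f"{match.group(1)};{client_port}")
        else:
            # Other settings and already-updated entries pass through as-is.
            updated.append(line)
    return "\n".join(updated)
```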
