Troubleshooting for HA Clustering
Application key exists on each server but its value is blank
This issue occurs when the encryption key is different on each server. This can happen if the keystore is not properly shared.
You can configure the keystore-password and keystore.pfx file using the Security Management Tool. The ThingworxStorage directory should then be shared by each platform instance.
If that is not possible, you must start one server to create the keystore-password and keystore.pfx files, and then copy them to the other machine before starting it:
1. Start one server to create the /ThingworxPlatform/keystore-password and /ThingworxStorage/keystore.pfx files.
2. Copy those files to the other server, and then start the other server.
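After copying, one way to confirm that both servers hold identical keystore files is to compare checksums. The following is a minimal sketch, not part of ThingWorx; the helper names file_sha256 and same_content are hypothetical. You could run file_sha256 on each server against /ThingworxPlatform/keystore-password and /ThingworxStorage/keystore.pfx and compare the digests.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True when both files have identical content (hypothetical helper)."""
    return file_sha256(path_a) == file_sha256(path_b)
```

If the digests differ between servers, the keystore was not copied or shared correctly, which matches the blank application key symptom described above.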
Thing created on server A is not on server B
High availability (HA) works by synchronizing the model through the database. This synchronization happens approximately every 100 ms, and the interval can be configured. All servers must point to the same PostgreSQL database instance, so check the database configuration in the platform settings to ensure that the connection settings match on every server. If you want to run PostgreSQL in an HA cluster, you must follow the PostgreSQL HA configuration using Pgpool-II and a primary-secondary PostgreSQL configuration.
Property values are not being set on all servers
If you set an in-memory property value on server A and do not see the value update on server B, the cache layer that holds the state is not configured properly. ThingWorx stores property states in the Apache Ignite cache, which can run embedded or distributed. Ignite writes a topology snapshot to the Application log and the Ignite logs, showing the number of clients and servers in the cluster. Validate that the logged number of servers matches the number you expect. If the servers cannot communicate with each other, for example because of a firewall, Ignite runs only locally.
For example:
# log entry showing platform 1 has 2 clients and 1 server
platform1_1 | 13-Jan-2020 17:08:53.231 INFO [disco-event-worker-#37%twx-core-server%] org.apache.ignite.logger.java.JavaLogger.info Topology snapshot [ver=5, locNode=0cab6e47, servers=1, clients=2, state=ACTIVE, CPUs=12, offheap=1.6GB, heap=9.9GB]
# log entry showing platform 2 has 3 clients and 1 server
platform2_1 | 13-Jan-2020 15:02:29.736 INFO [ForkJoinPool.commonPool-worker-1] org.apache.ignite.logger.java.JavaLogger.info Topology snapshot [ver=4, locNode=c7383c40, servers=1, clients=3, state=ACTIVE, CPUs=16, offheap=1.6GB, heap=14.0GB]
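The server and client counts can also be pulled out of such log lines automatically. The following is a minimal sketch that assumes the "Topology snapshot [... servers=N, clients=M ...]" layout shown in the entries above; the function names are hypothetical and the expected server count is something you supply.

```python
import re

# Matches the "servers=N, clients=M" part of an Ignite "Topology snapshot" line.
TOPOLOGY_PATTERN = re.compile(r"Topology snapshot \[.*?servers=(\d+), clients=(\d+)")


def parse_topology(line):
    """Return (servers, clients) from a topology snapshot line, or None."""
    match = TOPOLOGY_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cluster_size_ok(line, expected_servers):
    """Return True when the logged server count equals the expected count."""
    topology = parse_topology(line)
    return topology is not None and topology[0] == expected_servers
```

A line that contains no topology snapshot yields None, so the check can be run over a whole log file and only the snapshot lines are evaluated.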
Server will not start due to HTTP_PORT error
The HTTP_PORT, through which external services are required to connect to ThingWorx, must be defined before server startup in order to implement service discovery. This is the port that is exposed in Apache Tomcat on which the platform application is running. This environment variable must be available to the process running Tomcat. You can configure this in the setEnv file or you can update the service definition.
Tomcat service definition:
[Unit]
Description=Apache Tomcat Web Application Container
After=network.target
[Service]
Type=forking
PIDFile=/var/run/tomcat.pid
Environment=CATALINA_PID=/var/run/tomcat.pid
Environment=JAVA_HOME=/usr/lib/jvm/jdk1.8.0_191
Environment=HTTP_PORT=8080
Environment=CATALINA_HOME=/usr/share/tomcat9/9.0.26
Environment=CATALINA_BASE=/usr/share/tomcat9/9.0.26
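Because the variable must be visible to the Tomcat process itself, it can help to inspect the environment of the running process. The following is a minimal sketch that assumes a Linux host, where /proc/<pid>/environ holds the NUL-separated environment of a process; the helper names are hypothetical and not part of ThingWorx or Tomcat.

```python
def parse_proc_environ(data):
    """Parse the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in data.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:  # skip empty or malformed entries that have no "="
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_http_port(env):
    """Return HTTP_PORT from `env` as an int, or raise ValueError."""
    raw = env.get("HTTP_PORT", "").strip()
    if not raw:
        raise ValueError("HTTP_PORT is not set")
    if not raw.isdecimal() or not 1 <= int(raw) <= 65535:
        raise ValueError(f"HTTP_PORT is not a valid port: {raw!r}")
    return int(raw)
```

To use it, read the bytes of /proc/<pid>/environ for the Tomcat process ID (for instance, the ID stored in the CATALINA_PID file) in binary mode, pass them to parse_proc_environ, and pass the result to read_http_port.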
ThingWorx Connection Server is unable to connect to ThingWorx with authentication errors
You must create an application key on the ThingWorx server and add this application key to the ThingWorx Connection Server configuration. If this key does not exist or does not match, the Connection Server will throw authentication errors when trying to create connections to the platform.
ThingWorx Connection Server is unable to find ThingWorx servers
If the ThingWorx Connection Server will not connect to the platform, make sure the HTTP_PORT environment variable stored on the ThingWorx server is set to the port where the platform is running and the service name matches what is configured for ThingWorx. If either is wrong, the Connection Server will not find the ThingWorx servers.
Also, the ThingWorx server could have registered a bad address in Apache ZooKeeper. This can happen when the ThingWorx server tries to determine the IP address of the machine on which ZooKeeper is running. The address resolver will scan all IP addresses on all network interfaces on the host machine to determine the IP address most likely to be the machine's LAN address. If the machine has multiple IP addresses, this method will prefer a site-local IP address if the machine has one (for example, 192.168.x.x or 10.10.x.x, usually IPv4) and will return the first site-local address if the machine has more than one. If the machine does not have a site-local address, this method will return the first non-loopback address found (IPv4 or IPv6).
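The selection rule described above can be illustrated as follows. This is a sketch of the rule as stated in this section, not the resolver that ThingWorx actually ships; the function names are hypothetical, and the list of site-local ranges (the IPv4 private ranges plus the deprecated IPv6 fec0::/10 range) is an assumption about what "site-local" covers.

```python
import ipaddress

# Assumed site-local ranges: IPv4 private ranges and the deprecated IPv6 fec0::/10.
SITE_LOCAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("fec0::/10"),
]


def is_site_local(address):
    """Return True when `address` falls in one of the assumed site-local ranges."""
    return any(
        address.version == network.version and address in network
        for network in SITE_LOCAL_NETWORKS
    )


def pick_lan_address(candidates):
    """Return the first site-local address, else the first non-loopback one, else None."""
    addresses = [ipaddress.ip_address(candidate) for candidate in candidates]
    for address in addresses:
        if is_site_local(address):
            return str(address)
    for address in addresses:
        if not address.is_loopback:
            return str(address)
    return None
```

On a host with several site-local addresses, the order in which the interfaces are scanned decides which one is returned, which is how an unexpected address can end up registered in ZooKeeper.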
Application performance issues
In a clustering environment, how an application is written has greater impact on performance because the Thing data is distributed. It is important to reduce round-trips outside of the server where scripts are running. For more information, see Best Practices for HA Applications.
Cache provider will not start
If the Apache Ignite cache provider will not start, there could be a configuration issue. For example:
platform1_1 | 2020-07-13 17:34:14.965+0000 [L: ERROR] [O: E.c.q.l.c.Logger] [I: ] [U: SuperUser] [S: ] [P: platform1] [T: main] *** CRITICAL ERROR ON STARTUP: Failed to start CacheProvider com.thingworx.cache.ignite.IgniteCacheProvider
We recommend checking the ZooKeeper nodes to ensure that ZooKeeper is running correctly and verifying that the local node can access the ZooKeeper server on the configured port.
If ZooKeeper is fine, check the Ignite cluster to see if it is running properly. Check the log for issues and the topology snapshot to ensure the cluster is the correct size. Verify that the local node can access the Ignite host on all required ports.
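A basic TCP check from the local node can rule out firewall or routing problems before you dig into the logs. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and the hosts and ports to test depend on your own ZooKeeper and Ignite configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, calling can_connect with your ZooKeeper host name and its configured client port checks the ZooKeeper side; repeat the call for each Ignite host and port that your configuration uses. A successful connection only shows that the port is reachable, not that the service behind it is healthy.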
Apache Ignite Client shuts down ThingWorx JVM
The Ignite client can slow down and ultimately halt the ThingWorx JVM, with the following errors in the Catalina logs:
SEVERE: Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-5, threadName=sys-stripe-5-#6%twx-core-server%, blockedFor=10s] ...
SEVERE: JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.IgniteInterruptedException: Got interrupted while waiting for future to complete.]]
If the Ignite nodes cannot synchronize for a period of time, Ignite shuts down the JVM because memory is getting out of sync.
This behavior usually stems from application design and is most often observed when very large infotables are used as properties. Each write to such a property causes a full sync of the entire infotable, which eventually halts the JVM.
We recommend converting large infotable properties into database tables.
Apache Ignite Connectivity Issues
If Ignite has connection timeout, client connectivity, or network latency issues, enable the following advanced Ignite configurations under cache settings in the platform-settings.json file. See the Ignite documentation for information about how to configure each of the values. For more information about the platform-settings.json file, see Platform Settings for ThingWorx HA.
# This failure timeout automatically controls the following parameters: getSocketTimeout(), getAckTimeout(), getMaxAckTimeout(), getReconnectCount().
# If any of those parameters is set explicitly, the failure timeout setting will be ignored. For example, for stable low-latency networks the
# failure detection timeout may be set to ~120 ms.
failure-detection-timeout = 10000
client-failure-detection-timeout = 30000
# should only be used for advanced configuration
tcp-communication-spi {
connection-timeout = 5000
socket-write-timeout = 2000
slow-client-queue-limit = 0
}
# should only be used for advanced configuration
tcp-discovery-spi {
socket-timeout = 5000
ack-timeout = 5000
join-timeout = 5000
network-timeout = 5000
connection-recovery-timeout = 5000
reconnect-count = 10
reconnect-delay = 2000
so-linger = 5
stats-print-frequency = 0
}
Stuck locks in model providers
Model sync uses database locks to guarantee consistency in the change log. A stuck lock could hang the whole system, at least for model changes. If you encounter stuck locks, you can do the following:
• In PostgreSQL:
a. Set a lock timeout. For example:
SET lock_timeout = 3000
b. Try to acquire the lock on a table.
c. If the server fails to acquire the lock because of the lock timeout, determine the age of the existing lock using the following query:
select extract(epoch from (now() - query_start)) from pg_stat_activity where query like '%lock <tableName> in exclusive mode;'
d. If the age of the lock is above the set threshold, execute the following query to find the process holding the lock on a given table:
SELECT t.schemaname,
t.relname,
l.locktype,
l.page,
l.virtualtransaction,
l.pid,
l.mode,
l.granted
FROM pg_locks l
JOIN pg_stat_all_tables t ON l.relation = t.relid
WHERE t.schemaname <> 'pg_toast'::name AND t.schemaname <> 'pg_catalog'::name and t.relname = '<tableName>'
e. Terminate the process holding the lock using the following command, where pid is the process ID returned by the previous query:
SELECT pg_terminate_backend(pid);
• In MS SQL:
a. Set a lock timeout. For example:
set lock_timeout 3000;
b. Try to acquire the lock on a table.
c. If the server fails to acquire the lock because of the lock timeout, execute the following query to find the process holding the lock on a given table:
select
object_name(p.object_id) as tablename, request_owner_id, session_id
from
sys.dm_tran_locks l
inner join sys.partitions p on l.resource_associated_entity_id = p.object_id
inner join sys.dm_tran_session_transactions tst on l.request_owner_id = tst.transaction_id
and object_name(p.object_id) = '<tableName>'
d. Determine the age of the existing lock using the following query by passing in session_id retrieved in the previous step:
select datediff(second, (select last_batch from master.dbo.sysprocesses where spid = <session_id>), CURRENT_TIMESTAMP)
e. If the age of the lock is above the set threshold, use the session_id from the previous query result to terminate the process holding the lock using the following command:
kill <session_id>;
EnsembleTracker Errors
If you are getting errors like the following in the Application log, even when your ZooKeeper servers are up and running, the problem may be that the Apache Curator framework is not able to handle a configuration change.
2021-03-22 06:35:10.092+0000 [L: ERROR] [O: o.a.c.f.i.EnsembleTracker] [I: ] [U: ] [S: ] [P: VTWSandboxSVA2] [T: main-EventThread] Invalid config event received: {server.2=52.149.229.159:2888:3888:participant, server.1=0.0.0.0:2888:3888:participant, server.3=13.92.154.53:2888:3888:participant, version=0}
The solution is to change the configuration for the server mappings used in zoo.cfg to include the client port.
The following configuration can cause an error:
server.1= xx.xxx.x.xx:2888:3888
server.2= xx.xxx.x.xx:2888:3888
server.3= xx.xxx.x.xx:2888:3888
Therefore, update the configuration to include the client port as follows:
server.1= xx.xxx.x.xx:2888:3888;2181
server.2= xx.xxx.x.xx:2888:3888;2181
server.3= xx.xxx.x.xx:2888:3888;2181
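To spot server entries that still lack the client port, the zoo.cfg content can be scanned automatically. The following is a minimal sketch; the function name is hypothetical, and it assumes that each server entry sits on its own line in the server.N=host:port:port form shown above, optionally followed by a role and a ";clientPort" suffix.

```python
import re

# Assumed layout: server.<id>=<host>:<port>:<port>[:role][;[<clientHost>:]<clientPort>]
SERVER_LINE = re.compile(r"^\s*(server\.\d+)\s*=\s*(\S+)\s*$")
CLIENT_PORT_SUFFIX = re.compile(r";(\S+:)?\d+$")


def servers_missing_client_port(zoo_cfg_text):
    """Return the server.N keys whose value has no ';<clientPort>' suffix."""
    missing = []
    for line in zoo_cfg_text.splitlines():
        match = SERVER_LINE.match(line)
        if match and not CLIENT_PORT_SUFFIX.search(match.group(2)):
            missing.append(match.group(1))
    return missing
```

An empty result means every server entry already includes a client port; any key that is returned should be updated as shown in the corrected configuration above.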