Surviving a Reboot of the SCM Client
Application managers want to be able to reboot edge devices at the end of an SCM deployment so that they can apply firmware, operating system, and application updates that require a device restart. A reboot can occur asynchronously at any time. It cannot be tied to any internal status change for a particular job on the client because a separate service initiates the reboot after detecting a successful SCM installation. When a reboot occurs, the client can be idle, receiving a status change, transferring a file, or executing a job.
The following types of shutdowns can also cause the device to reboot:
• “Orderly”, where the operating system notifies the service or daemon
• “Signaled”, through some type of semaphore, such as the existence of a file
• “Unexpected”, such as a loss of communications or power. Note that data can be lost during an unexpected shutdown.
Orderly and Signaled shutdowns are preferable because the client can prepare for the shutdown. In the event of an Orderly or Signaled shutdown, the client takes the following actions:
1. Unbinds any SCM Things to prevent the receipt of new updates and to suspend the delivery of new file transfer data.
2. Suspends all status transitions.
3. Kills the running job, if one exists, while keeping its status set to STARTED. This action may leave the job in an indeterminate state, which the job script must handle the next time that the job runs.
4. Persists the current job list to disk in a well-known location.
5. Terminates itself.
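The five steps above can be sketched as a fixed sequence of calls. This is only an illustration of the ordering: the IShutdownTarget interface and all of its member names are hypothetical and are not part of the SDK.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction over the client operations named in the steps.
// None of these member names come from the SDK.
public interface IShutdownTarget
{
    void UnbindScmThings();           // step 1
    void SuspendStatusTransitions();  // step 2
    void KillRunningJob();            // step 3 (job status stays STARTED)
    void PersistJobList();            // step 4
    void Terminate();                 // step 5
}

public static class OrderlyShutdown
{
    // Runs the steps in the documented order.
    public static void Run(IShutdownTarget target)
    {
        if (target == null) throw new ArgumentNullException("target");
        target.UnbindScmThings();
        target.SuspendStatusTransitions();
        target.KillRunningJob();
        target.PersistJobList();
        target.Terminate();
    }
}

// Stand-in implementation that only records the order of the calls.
public sealed class RecordingTarget : IShutdownTarget
{
    private readonly List<string> calls = new List<string>();
    public IList<string> Calls { get { return calls; } }

    public void UnbindScmThings() { calls.Add("unbind"); }
    public void SuspendStatusTransitions() { calls.Add("suspend"); }
    public void KillRunningJob() { calls.Add("kill"); }
    public void PersistJobList() { calls.Add("persist"); }
    public void Terminate() { calls.Add("terminate"); }
}
```

Unbinding comes first so that no new status change or file transfer data arrives while the job list is being written.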
In the event of an unexpected shutdown, the possibility of data loss exists but is minimized by the reboot persistence feature.
The reboot persistence feature is enabled by default.
Persisting the list of SCM jobs leaves a window of potential status loss if a status change is being received at the moment of shutdown. In that case, the platform should treat the unanswered status change request as a timeout and retransmit it. The same principle applies to any file transfer in progress. Although this technique minimizes the failure window, a state that is promoted but not yet persisted can leave the client and server out of sync, which can eventually cause one or more jobs to go into the FAILED state.
Persistence of the Job List
The job list is persisted as a JSON file that contains an array of the actual job data structures. By default, the file is written to the same directory as the offline message store for the C SDK, but you can redirect it with a configuration file or an environment variable.
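As a minimal sketch of how the location could be resolved, the helper below falls back to the offline message store directory and to the default file name joblist.json when no override is given. The environment variable names are the ones that the scmClient example reads. The JobListLocation class and its fallback logic are assumptions for illustration, not the SDK's actual implementation.

```csharp
using System;
using System.IO;

// Hypothetical helper; the SDK resolves the location internally.
public static class JobListLocation
{
    public const string DefaultFileName = "joblist.json";

    // Combines the override values (if any) with the documented defaults.
    public static string Resolve(string offlineMsgStoreDir, string overrideDir, string overrideName)
    {
        string dir = string.IsNullOrWhiteSpace(overrideDir) ? offlineMsgStoreDir : overrideDir;
        string name = string.IsNullOrWhiteSpace(overrideName) ? DefaultFileName : overrideName;
        return Path.Combine(dir, name);
    }

    // Reads the overrides from the environment variables used by the example.
    public static string ResolveFromEnvironment(string offlineMsgStoreDir)
    {
        return Resolve(
            offlineMsgStoreDir,
            Environment.GetEnvironmentVariable("TWX_REBOOTSURVIVALFILEPATH"),
            Environment.GetEnvironmentVariable("TWX_REBOOTSURVIVALFILENAME"));
    }
}
```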
Restoration of the Job List on Restart
Upon restart, the job list can be restored. However, the client must connect to the ThingWorx Platform BEFORE attempting to restore the job list. Otherwise, if the platform is no longer aware of a job that the client still holds, the situation can lock up the client. By connecting first, the client can ask the platform whether each job still matters, which is required for jobs to be restored properly.
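The ordering requirement can be sketched as a guard followed by a filter. The JobRestore class and both delegates are hypothetical placeholders for "is the client connected" and "does the platform still know this job". They do not correspond to SDK calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of "connect first, then restore".
public static class JobRestore
{
    // Returns only the persisted job IDs that the platform still recognizes.
    // Throws if called before the client is connected.
    public static List<string> Restore(
        IEnumerable<string> persistedJobIds,
        Func<bool> isConnected,
        Func<string, bool> platformKnowsJob)
    {
        if (!isConnected())
        {
            throw new InvalidOperationException(
                "Connect to the platform before restoring the job list.");
        }
        return persistedJobIds.Where(platformKnowsJob).ToList();
    }
}
```

Jobs that the platform no longer recognizes are dropped rather than restored, which avoids the lock-up described above.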
Reboot Warning File
The reboot persistence feature includes a way to warn the client of an approaching shutdown. Specify the full path to a file to use as the reboot warning file. The file can have any name. If the warning file is created at run time, the client immediately disconnects and persists the job list file. You can use this file to warn the client or service without performing a proper shutdown. After it disconnects and persists, the client deletes the job list and can perform a proper shutdown. There is a slight risk that the job list may become corrupted. If it does, deletion covers that situation.
The traditional OS shutdown also works. Create the warning file as an alternative if the user cannot perform a traditional OS shutdown.
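One way to picture the warning file mechanism is a loop that polls for the existence of the file. The sketch below is an assumption about how such a check might look; the RebootWarningWatcher class is hypothetical, and the SDK may detect the file differently.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical polling watcher for a reboot warning file.
public static class RebootWarningWatcher
{
    // Polls until the warning file exists or the timeout elapses.
    // Returns true if the file was seen, false on timeout.
    public static bool WaitForWarning(string warningFilePath, TimeSpan pollInterval, TimeSpan timeout)
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        while (true)
        {
            if (File.Exists(warningFilePath))
            {
                return true;
            }
            if (DateTime.UtcNow >= deadline)
            {
                return false;
            }
            Thread.Sleep(pollInterval);
        }
    }
}
```

An operator or a separate service would then signal the client by creating the file at the configured path, for example with File.WriteAllText(path, "").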
Examples
The scmClient example shows how to use the reboot persistence feature in its Program.cs file. At the beginning of the file, the reboot survival variables are declared along with the other variables that the client needs:
static string rebootSurvivalFilePath;
static string rebootSurvivalFileName;
static string rebootSurvivalEnabled;
static bool rebootSurvivalEnabledBool = true;
static string rebootWarningFile;
In addition to handling a terminate or other kind of signal, the terminate_signal handler also performs clean-up actions: it unbinds the Thing and persists the current job state to disk before re-raising the signal to shut down.
The settings for reboot survival are stored in environment variables, which are checked after parsing properties entered at the command line for the client:
rebootSurvivalFilePath = System.Environment.GetEnvironmentVariable("TWX_REBOOTSURVIVALFILEPATH");
rebootSurvivalFileName = System.Environment.GetEnvironmentVariable("TWX_REBOOTSURVIVALFILENAME");
rebootSurvivalEnabled = System.Environment.GetEnvironmentVariable("TWX_REBOOTSURVIVALENABLED");
if (null == rebootSurvivalEnabled)
{
    rebootSurvivalEnabled = "true";
}
if (rebootSurvivalEnabled == "false")
{
    rebootSurvivalEnabledBool = false;
}
rebootWarningFile = System.Environment.GetEnvironmentVariable("TWX_REBOOTWARNINGFILE");
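The check above disables the feature only when the variable is exactly the lowercase string "false". If you want values such as "False" or " FALSE " to disable it as well, a more tolerant parser could look like the following sketch. The RebootSurvivalSettings class is hypothetical and deliberately differs from the example's exact-match behavior.

```csharp
using System;

// Hypothetical helper; not part of the scmClient example.
public static class RebootSurvivalSettings
{
    // Returns false only when the value is "false" (any casing, surrounding
    // whitespace ignored). A missing or empty value keeps the default: enabled.
    public static bool ParseEnabled(string raw)
    {
        if (string.IsNullOrWhiteSpace(raw))
        {
            return true;
        }
        return !string.Equals(raw.Trim(), "false", StringComparison.OrdinalIgnoreCase);
    }
}
```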
After setting the log level and verifying that an application key is available, the example starts the SCM extension. It passes the reboot survival file path, the file name, the enabled setting, and the reboot warning file to the ScmThing constructor:
static void StartScm()
{
    // Set the required configuration information
    var config = new ClientConfigurator() {
        MaxMsgHandlerThreadCount = 8,
        MaxApiTaskerThreadCount = 8,
        ReconnectInterval = 15, // Reconnect every 15 seconds if a disconnect occurs
        OfflineMsgStoreDir = ".",
        Claims = SecurityClaims.fromAppKeyCallback(appKeyCallback),
        AllowSelfSignedCertificates = true, // Do not set true in production
        DisableCertValidation = true // Do not set true in production
    };
    // Put the offline message store into a writable directory
    // (this also changes the current directory to the user profile)
    config.OfflineMsgStoreDir = Environment.CurrentDirectory = Environment.GetEnvironmentVariable("userprofile");
    // Don't create a gateway thing
    config.setType(null);
    // The uri for connecting to Thingworx
    config.Uri = "";
    if (portNum == 443 || portNum == 8443)
    {
        config.Uri = String.Format("wss://{0}:{1}/Thingworx/WS", hostname, portNum);
    }
    else
    {
        config.Uri = String.Format("ws://{0}:{1}/Thingworx/WS", hostname, portNum);
        ConnectedThingClient.disableEncryption();
    }
    // Create the client passing in the configuration from above
    ConnectedThingClient client = new ConnectedThingClient(config);
    try
    {
        // Start the client
        client.connect();
        // Create SCM Thing with the reboot survival settings
        ScmThing scmThing = new ScmThing(thingName, null, null, client, rebootSurvivalFilePath, rebootSurvivalFileName, rebootSurvivalEnabledBool, rebootWarningFile);
        scmThing.StagingDir = stagingDirectory;
        scmThing.LoadValidationSettingsFile(validationSettingsFile);
        scmThing.LoadWhitelistFile(whitelistFile);
        // Bind the Virtual Things
        client.bindThing(scmThing);
    }
    catch (Exception e)
    {
        LOG.ErrorFormat("Initial Start Failed {0} ", e.Message);
        Environment.Exit(-3);
    }
    // As long as the client has not been shut down, keep the process alive
    while (!client.isShutdown())
    {
        // Suspend processing at the scan rate interval
        Thread.Sleep(1000);
    }
}
The Job List File
The default name of the job list file is joblist.json. By default, it is stored in the same directory as the one specified for the offline message store. At run time, the log will tell you where the file is written.
Here is an example of a joblist.json file:
[{
    "id": "1739a009-c1b1-49c4-a5c5-d37b89db89cd",
    "entityName": "SampleScmDevice",
    "campaignName": "WindowPackage",
    "serverUpdateMgr": "TW.RSM.SFW.SoftwareManager",
    "downloadTime": 1548350710490,
    "installTime": 1548268767811,
    "script": "script.bat",
    "path": "/windows_test_package_1548268265574.zip",
    "state": 5,
    "state_string": "downloading",
    "prev_state": 5,
    "lastActivity": 1548350823914,
    "downloadDir": "C:\\Users\\me\\staging\\SampleScmDevice",
    "scriptParams": " twId=1739a009-c1b1-49c4-a5c5-d37b89db89cd twCampaign=WindowPackage twEntity=SampleScmDevice twUpdateMgr=TW.RSM.SFW.SoftwareManager"
}]
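For troubleshooting, a file like the one above can be inspected with a few lines of code. The sketch below assumes that System.Text.Json is available in your runtime and reads only a handful of the fields shown in the example. The JobSummary and JobListReader types are hypothetical and are not the SDK's own job data structure.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical, reduced view of one entry in joblist.json.
public sealed class JobSummary
{
    public string Id { get; set; }
    public string CampaignName { get; set; }
    public int State { get; set; }
    public string StateString { get; set; }
}

public static class JobListReader
{
    // Parses the JSON array and extracts a few fields from each job.
    public static List<JobSummary> Parse(string json)
    {
        var jobs = new List<JobSummary>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            foreach (JsonElement job in doc.RootElement.EnumerateArray())
            {
                jobs.Add(new JobSummary
                {
                    Id = job.GetProperty("id").GetString(),
                    CampaignName = job.GetProperty("campaignName").GetString(),
                    State = job.GetProperty("state").GetInt32(),
                    StateString = job.GetProperty("state_string").GetString()
                });
            }
        }
        return jobs;
    }
}
```

A typical call would be JobListReader.Parse(File.ReadAllText(path)), where path is the location reported in the log.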