Use Parallel Upgrades With Acceptable Failures
Fleet Upgrade provides the Parallel upgrades and Acceptable failures fields to help you both speed up and control long upgrade runs with many devices.
As explained in Run a Fleet Upgrade, whenever you create a Fleet Upgrade job, you can select up to 50 devices to be upgraded. Even though the upgrade process is entirely automated, upgrade runs can last many hours, depending on the number and size of the upgrades being performed on each device. This is especially true if the upgrades must run one at a time, in series, and you have no way to cancel the run when upgrades start failing.
To help with these issues, you can use the Parallel upgrades field to specify how many upgrades you want performed at the same time, in parallel. If you enter a Parallel upgrades value equal to the total number of devices to be upgraded (up to the maximum of 50), all the upgrades will take place at the same time. If you leave Parallel upgrades set to the default value of 1, Fleet Upgrade performs each of them one at a time.
As a practical matter, many users specify a lower Parallel upgrades value, such as 5 or 10. Doing this helps conserve processing resources and ensures that only a few of the network devices in a 50-device group will be offline at any one time.
With this type of Parallel upgrades value, Fleet Upgrade performs the upgrades in batches. For a 50-device upgrade group with a Parallel upgrades value of 5, this means 10 batches of five upgrades each. In this case, Fleet Upgrade performs all five of the upgrades in batch #1 at the same time, in parallel, and doesn't initiate any of the upgrades in batch #2 until all of the upgrades in batch #1 are done.
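The batch arithmetic described above can be sketched in a few lines of Python. This is only an illustration of how a device list divides into batches, not Fleet Upgrade's actual implementation; the function name plan_batches and the device names are hypothetical.

```python
def plan_batches(devices, parallel_upgrades=1):
    """Split devices into consecutive batches of at most parallel_upgrades.

    Hypothetical helper for illustration only; it models the batching
    described in the text, not Fleet Upgrade's internal scheduler.
    """
    if parallel_upgrades < 1:
        raise ValueError("parallel_upgrades must be at least 1")
    return [
        devices[start:start + parallel_upgrades]
        for start in range(0, len(devices), parallel_upgrades)
    ]


# Made-up device names standing in for a 50-device upgrade group.
devices = [f"device-{n:02d}" for n in range(1, 51)]

batches = plan_batches(devices, parallel_upgrades=5)
print(len(batches))     # 10 batches
print(len(batches[0]))  # 5 upgrades in batch #1
```

With parallel_upgrades left at 1, the same function yields 50 single-device batches (a fully serial run); with a value of 50, it yields one batch containing every device.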
How can you cancel a run that's failing too often? Fleet Upgrade will automatically cancel the remaining upgrades in a run depending on the number of Acceptable failures you set. The value you specify in this field acts as a failure "budget" that, when exceeded, triggers automatic cancellation of all of the remaining upgrades in the run. If you want to avoid automatic cancellation entirely, specify an Acceptable failures value equal to the total number of devices to be upgraded (up to the maximum of 50). Set it to the default value of 1 if you want the system to cancel remaining upgrades after the very first failure.
Bear in mind that, when executing parallel upgrades in batch mode, Fleet Upgrade will continue to execute each new batch until the failure budget set by Acceptable failures is actually exceeded. This can mean that the total number of failures will sometimes exceed the budget you set. It can also mean that it sometimes takes longer for cancellation to kick in than you might expect.
For example: Let's assume that our set of devices to upgrade is 50, our Parallel upgrades setting is 5, and our Acceptable failures setting is 5. That means we have 10 batches of 5 devices for Fleet Upgrade to perform. Let's further suppose that, during execution of batch #1, we encounter 4 failures. The 5-failure budget is not yet exceeded, so Fleet Upgrade will begin to execute all the upgrades in batch #2 in parallel. We then encounter 4 more failures in batch #2. The 5-failure budget is now exceeded, so Fleet Upgrade will automatically cancel execution of batch #3 and the 7 other remaining batches. However, we've actually encountered 8 failures, not 5.
Similarly, we might encounter only 1 failure each in batches #1, #2, #3, and #4, then encounter 5 failures in batch #5, triggering cancellation of the run. In this case, we've actually encountered 9 failures, almost twice the number we specified. Also, cancellation didn't take effect until batch #6, after 25 of the 50 devices (half of the entire run) had already been attempted.
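The two scenarios above can be reproduced with a small simulation. This is a minimal sketch under stated assumptions, not Fleet Upgrade's actual logic: it assumes the failure budget is checked only between batches, and that the run is cancelled once the total number of failures reaches or exceeds the Acceptable failures value (which is what the default-of-1 behavior described earlier implies). The function name simulate_run and the per-batch failure counts are hypothetical.

```python
def simulate_run(failures_per_batch, acceptable_failures):
    """Return (total_failures, batches_run, batches_cancelled).

    failures_per_batch: hypothetical failure count for each planned batch.
    Assumptions (not confirmed product behavior): the budget is checked
    only before a new batch starts, and cancellation happens once total
    failures reach or exceed acceptable_failures.
    """
    total_failures = 0
    batches_run = 0
    for failed in failures_per_batch:
        if total_failures >= acceptable_failures:
            break  # budget used up: cancel this batch and all later ones
        total_failures += failed  # a started batch always runs to completion
        batches_run += 1
    return total_failures, batches_run, len(failures_per_batch) - batches_run


# Scenario 1: 4 failures in batch #1, 4 more in batch #2.
print(simulate_run([4, 4, 0, 0, 0, 0, 0, 0, 0, 0], acceptable_failures=5))
# (8, 2, 8): 8 failures, 2 batches run, 8 batches cancelled

# Scenario 2: 1 failure in each of batches #1 to #4, then 5 in batch #5.
print(simulate_run([1, 1, 1, 1, 5, 0, 0, 0, 0, 0], acceptable_failures=5))
# (9, 5, 5): 9 failures, 5 batches run, 5 batches cancelled
```

Because a batch that has started always finishes, the total failure count can overshoot the budget by up to one batch's worth of failures. A smaller Parallel upgrades value narrows that overshoot at the cost of a longer run.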