Data Science & AI Workbench uses automatic service restarts and health monitoring to remain operational if a process halts or a worker node becomes unavailable. Additional levels of fault tolerance, such as service migration, are provided if the deployment has at least three nodes. However, the master node cannot currently be configured for automatic failover and is a single point of failure.
When Workbench is deployed to a cluster with three or more nodes, the core services are automatically configured into a fault-tolerant mode, whether the cluster is initially deployed this way or nodes are added later. As soon as three or more nodes are available, the service fault tolerance features take effect. This means that in the event of any service failure:
- Workbench core services will automatically be restarted or, if possible, migrated.
- User-initiated project deployments will automatically be restarted or, if possible, migrated.
/opt/anaconda/ on the master node should be located on a redundant disk array or backed up frequently to avoid data loss. See Backing up and restoring Workbench for more information.
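As a rough illustration of a frequent file-level backup, the sketch below writes a timestamped archive of a directory using Python's standard tarfile module. It is a generic example and not the Workbench backup procedure: the function name archive_directory and the destination path /mnt/backup/workbench are hypothetical, and archiving files while services are writing to them may capture an inconsistent state. The supported process is the one described in Backing up and restoring Workbench.

```python
import tarfile
import time
from pathlib import Path


def archive_directory(source_dir, backup_dir):
    """Write a timestamped .tar.gz archive of source_dir into backup_dir.

    Hypothetical helper for illustration only; not part of Workbench.
    """
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"not a directory: {source}")

    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Timestamp the file name so repeated runs do not overwrite older archives.
    stamp = time.strftime("%Y%m%dT%H%M%S")
    archive_path = target_dir / f"{source.name}-{stamp}.tar.gz"

    with tarfile.open(archive_path, "w:gz") as archive:
        # Store entries relative to the directory name, not the absolute path.
        archive.add(source, arcname=source.name)
    return archive_path


if __name__ == "__main__":
    # Assumed destination: ideally a mount on a different disk or host
    # than the master node, so the copy survives a master node failure.
    print(archive_directory("/opt/anaconda", "/mnt/backup/workbench"))
```

Running a script like this on a schedule, with the destination on separate storage, is one way to approximate "backed up frequently" when a redundant disk array is not available.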
To restore Workbench operations in the event of a master node failure:
- Create a new master node. Follow the installation process for adding a new cluster node, described in command-line installations. To create the new master node, select --role=ae-master instead of --role=ae-worker.
- Restore data from a backup. After the installation of the new master node is complete, follow the instructions in Backing up and restoring Workbench.