# README
cluster-backup-operator
Cluster Back up and Restore Operator
Table of Contents generated with DocToc
- Work in Progress
- Community, discussion, contribution, and support
- License
- Getting Started
- Design
- Scheduling a cluster backup
- Restoring a backup
- Backup validation using a Policy
- Active passive configuration design
- Setting up your development environment
Work in Progress
We are in the process of enabling this repo for community contribution. See wiki here.
Community, discussion, contribution, and support
Check the CONTRIBUTING Doc for how to contribute to the repo.
License
This project is licensed under the Apache License 2.0. A copy of the license can be found in LICENSE.
Getting Started
The Cluster Back up and Restore Operator provides disaster recovery solutions for the case when the Red Hat Advanced Cluster Management for Kubernetes hub goes down and needs to be recreated. Scenarios outside the scope of this component : disaster recovery scenarios for applications running on managed clusters or scenarios where the managed clusters go down.
The Cluster Back up and Restore Operator runs on the Red Hat Advanced Cluster Management for Kubernetes hub and depends on the OADP Operator to create a connection to a backup storage location on the hub, which is then used to backup and restore user created hub resources.
The Cluster Back up and Restore Operator chart is not installed automatically. Starting with Red Hat Advanced Cluster Management version 2.5, in order to enable the backup component, you have to set the cluster-backup
option to true
on the MultiClusterHub resource. The OADP Operator will be installed automatically with the Cluster Back up and Restore Operator chart, as a chart hook.
Before you can use the Cluster Back up and Restore operator, the OADP Operator must be configured to set the connection to the storage location where backups will be saved. Make sure you follow the steps to create the secret for the cloud storage where the backups are going to be saved, then use that secret when creating the DataProtectionApplication resource to setup the connection to the storage location.
OADP Operator installed by the backup chart
The Cluster Back up and Restore Operator chart is not installed automatically.
Starting with Red Hat Advanced Cluster Management version 2.5, the Cluster Back up and Restore Operator chart is installed by setting the cluster-backup
option to true
on the MultiClusterHub
resource.
The Cluster Back up and Restore Operator chart in turn automatically installs the OADP Operator, in the same namespace with the backup chart.
Note:
- The OADP Operator 1.0 has disabled building multi-arch builds and only produces x86_64 builds for the official release. This means that if you are using an architecture other than x86_64, the OADP Operator installed by the chart will have to be replaced with the right version. In this case, uninstall the OADP Operator and find the operator matching your architecture then install it.
- If you have previously installed and used the OADP Operator on this hub, you should uninstall this version since the backup chart works now with the operator installed in the chart's namespace. This should not affect your old backups and previous work. Just use the same storage location for the DataProtectionApplication resource owned by the OADP Operator installed with the backup chart and it will access the same backup data as the previous operator. The only difference is that velero backup resources are now loaded under the new OADP Operator namespace on this hub.
Policy to inform on backup configuration issues
The Cluster Back up and Restore Operator chart installs the backup-restore-enabled Policy, used to inform on issues with the backup and restore component. The Policy templates check if the required pods are running, storage location is available, backups are available at the defined location and no error status is reported by the main resources. This Policy is intended to help notify the Hub admin of any backup issues as the hub is active and expected to produce backups.
Resource Requests and Limits Customization
When Velero is initially installed, Velero pod is set with default cpu and memory limits as defined below.
resources:
limits:
cpu: "1"
memory: 256Mi
requests:
cpu: 500m
memory: 128Mi
These limits work well with regular scenarios but may need to be updated when your cluster backs up a large number of resources. For instance, when back up is executed on a hub managing 2000 clusters, Velero pod crashes due to the out-of-memory error (OOM). The following configuration allows backup to complete for this scenario.
limits:
cpu: "2"
memory: 1Gi
requests:
cpu: 500m
memory: 256Mi
In order to update the Velero pod resource(cpu, memory) limits and requests, you need to update the DataProtectionApplication
resource and insert the resourceAllocation template for the Velero pod, as described below:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
name: velero
namespace: open-cluster-management-backup
spec:
...
configuration:
...
velero:
podConfig:
resourceAllocations:
limits:
cpu: "2"
memory: 1Gi
requests:
cpu: 500m
memory: 256Mi
Refer to Velero Resource Requests and Limits Customization to find out more about the DataProtectionApplication
parameters for setting the Velero pod resource requests and limits.
Protecting data using Server-Side Encryption
Server-side encryption is the encryption of data at its destination by the application or service that receives it. Our backup mechanism itself does not encrypt data while in-transit (as it travels to and from backup storage location) or at rest (while it is stored on disks at backup storage location), instead it relies on the native mechanisms in the object and snapshot systems.
For server-side encryption using AWS provider, you can use serverSideEncryption
and kmsKeyId
configurations as explained in this sample AWS BackupStorageLocation.
The following sample specifies an AWS KMS key ID when setting up the DataProtectionApplication resource:
spec:
backupLocations:
- velero:
config:
kmsKeyId: 502b409c-4da1-419f-a16e-eif453b3i49f
profile: default
region: us-east-1
Refer to Velero supported storage providers to find out about all of the configurable parameters of other storage providers.
Design
The content below describes the backup and restore flow using the Cluster Back up and Restore Operator solution, with details on what and how the data is backed up and restored. Once you are familiar with these concepts you can follow the Active passive configuration design section to understand how to build a complete DR solution with an ACM hub set as a primary, active configuration, managing clusters, and one or more ACM hubs setup to take over in a disaster scenario.
Cluster Backup and Restore flow
The operator defines the BackupSchedule.cluster.open-cluster-management.io
resource, used to setup Red Hat Advanced Cluster Management for Kubernetes backup schedules, and the Restore.cluster.open-cluster-management.io
resource, used to process and restore these backups.
The operator sets the options required to backup remote clusters configuration and any other hub resources that need to be restored.
What is backed up
The Cluster Back up and Restore Operator solution provides backup and restore support for all Red Hat Advanced Cluster Management for Kubernetes hub resources like managed clusters, applications, policies, bare metal assets. It provides support for backing up any third party resources extending the basic hub installation. With this backup solution, you can define a cron based backup schedule which runs at specified time intervals and continuously backs up the latest version of the hub content. When the hub needs to be replaced or in a disaster scenario when the hub goes down, a new hub can be deployed and backed up data moved to the new hub, so that the new hub replaces the old one.
Steps to identify backup data
The steps below show how the Cluster Back up and Restore Operator finds the resources to be backed up. With this approach the backup includes all CRDs installed on the hub, including any extensions using third parties components.
- Exclude all resources in the MultiClusterHub namespace. This is to avoid backing up installation resources which are linked to the current Hub identity and should not be backed up.
- Backup all CRDs with an api version suffixed by
.open-cluster-management.io
. This will cover all Advanced Cluster Management resources. - Additionally, backup all CRDs from these api groups:
argoproj.io
,app.k8s.io
,core.observatorium.io
,hive.openshift.io
- Exclude all CRDs from the following api groups :
admission.cluster.open-cluster-management.io
,admission.work.open-cluster-management.io
,internal.open-cluster-management.io
,operator.open-cluster-management.io
,work.open-cluster-management.io
,search.open-cluster-management.io
,admission.hive.openshift.io
,velero.io
- Exclude the following CRDs; they are part of the included api groups but are either not needed or they are being recreated by owner resources, which are also backed up:
clustermanagementaddon
,observabilityaddon
,applicationmanager
,certpolicycontroller
,iampolicycontroller
,policycontroller
,searchcollector
,workmanager
,backupschedule
,restore
,clusterclaim.cluster.open-cluster-management.io
- Backup secrets and configmaps with one of the following labels:
cluster.open-cluster-management.io/type
,hive.openshift.io/secret-type
,cluster.open-cluster-management.io/backup
- Use this label for any other resources that should be backed up and are not included in the above criteria:
cluster.open-cluster-management.io/backup
Example :
apiVersion: my.group/v1alpha1
kind: MyResource
metadata:
labels:
cluster.open-cluster-management.io/backup: ""
- Resources picked up by the above rules that should not be backed up, can be explicitly excluded when setting this label:
velero.io/exclude-from-backup: "true"
Example :
apiVersion: my.group/v1alpha1
kind: MyResource
metadata:
labels:
velero.io/exclude-from-backup: "true"
Extending backup data
Third party components can choose to back up their resources with the ACM backup by adding the cluster.open-cluster-management.io/backup
label to these resources. The value of the label could be any string, including an empty string. It is indicated though to set a value that can be later on used to easily identify the component backing up this resource. For example cluster.open-cluster-management.io/backup: idp
if the components are provided by an idp solution.
Note:
Use the cluster-activation
value for the cluster.open-cluster-management.io/backup
label if you want the resources to be restored when the managed clusters activation resources are restored. Restoring the managed clusters activation resources result in managed clusters being actively managed by the hub where the restore was executed.
Resources restored at managed clusters activation time
As mentioned above, when you add the cluster.open-cluster-management.io/backup
label to a resource, this resource is automatically backed up under the acm-resources-generic-schedule
backup. If any of these resources need to be restored only when the managed clusters are moved to the new hub, so when the veleroManagedClustersBackupName:latest
is used on the restored resource, then you have to set the label value to cluster-activation
. This will ensure the resource is not restored unless the managed cluster activation is called.
Example :
apiVersion: my.group/v1alpha1
kind: MyResource
metadata:
labels:
cluster.open-cluster-management.io/backup: cluster-activation
Aside of these activation data resources, identified by using the cluster.open-cluster-management.io/backup: cluster-activation
label and stored by the acm-resources-generic-schedule
backup, the Cluster Back up and Restore Operator includes by default a few resources in the activation set. These resources are backed up by the acm-managed-clusters-schedule
:
- managedcluster.cluster.open-cluster-management.io
- managedcluster.clusterview.open-cluster-management.io
- klusterletaddonconfig.agent.open-cluster-management.io
- managedclusteraddon.addon.open-cluster-management.io
- managedclusterset.cluster.open-cluster-management.io
- managedclusterset.clusterview.open-cluster-management.io
- managedclustersetbinding.cluster.open-cluster-management.io
- clusterpool.hive.openshift.io
- clusterclaim.hive.openshift.io
- clustercurator.cluster.open-cluster-management.io
Passive data
Passive data is backup data such as secrets, configmaps, apps, policies and all the managed cluster custom resources which are not resulting in activating the connection between managed clusters and hub where these resources are being restored on. These resources are stored by the credentials backup and resources backup files.
Managed clusters activation data
Managed clusters activation data or activation data, is backup data which when restored on a new hub will result in managed clusters being actively managed by the hub where the restore was executed. Activation data resources are stored by the managed clusters backup, and by the resources-generic backup, using the cluster.open-cluster-management.io/backup: cluster-activation
label. More details about the activation resources are available with the backup section
Scheduling a cluster backup
A backup schedule is activated when creating the backupschedule.cluster.open-cluster-management.io
resource, as shown here
After you create a backupschedule.cluster.open-cluster-management.io
resource you should be able to run oc get bsch -n <oadp-operator-ns>
and get the status of the scheduled cluster backups. The <oadp-operator-ns>
is the namespace where BackupSchedule was created and it should be the same namespace where the OADP Operator was installed.
The backupschedule.cluster.open-cluster-management.io
creates 6 schedule.velero.io
resources, used to generate the backups.
Run os get schedules -A | grep acm
to view the list of backup scheduled.
Resources are backed up in 3 separate groups:
-
credentials backup ( 3 backup files, for hive, ACM and generic backups )
-
resources backup ( 2 backup files, one for the ACM resources and second for generic resources, labeled with
cluster.open-cluster-management.io/backup
) -
managed clusters backup, schedule labeled with
cluster.open-cluster-management.io/backup-schedule-type: acm-managed-clusters
( one backup containing only resources which result in activating the managed cluster connection to the hub where the backup was restored on)
Note:
a. The backup file created in step 2. above contains managed cluster specific resources but does not contain the subset of resources which will result in managed clusters connect to this hub. These resources, also called activation resources, are contained by the managed clusters backup, created in step 3. When you restore on a new hub just the resources from step 1 and 2 above, the new hub shows all managed clusters created with the hive api but they are in a detached state. Managed clusters imported on the primary hub using the Import cluster operation will show only when the activation data is restored on the passive hub. The managed clusters are still connected to the original hub that had produced the backup files.
b. Only managed clusters created using the hive api will be automatically connected with the new hub when the acm-managed-clusters
backup from step 3 is restored on another hub. All other managed clusters will show up as Pending Import
and must be imported back on the new hub. The reason the hive managed clusters can be connected with the new hub is because hive provides the kubeconfig to connect to the managed cluster and this is being backed up and restored on the new hub. The import controller will next update the bootstrap kubeconfig on the managed cluster using the restored configuration. This information is only available for managed clusters created using the hive api.
c. When restoring the acm-managed-clusters
backup on a new hub, by using the veleroManagedClustersBackupName: latest
option on the restore resource, make sure the old hub from where the backups have been created is shut down, otherwise the old hub will try to reconnect with the managed clusters as soon as the managed cluster reconciliation addons find the managed clusters are no longer available, so both hubs will try to manage the clusters at the same time.
Backup Collisions
As hubs change from passive to primary clusters and back, different clusters can backup up data at the same storage location. This could result in backup collisions, which means the latest backups are generated by a hub who is no longer the designated primary hub. That hub produces backups because the BackupSchedule.cluster.open-cluster-management.io
resource is still Enabled on this hub, but it should no longer write backup data since that hub is no longer a primary hub.
Situations when a backup collision could happen:
- Primary hub goes down unexpectedly:
- 1.1) Primary hub, Hub1, goes down
- 1.2) Hub1 backup data is restored on Hub2
- 1.3) The admin creates the
BackupSchedule.cluster.open-cluster-management.io
resource on Hub2. Hub2 is now the primary hub and generates backup data to the common storage location. - 1.4) Hub1 comes back to life unexpectedly. Since the
BackupSchedule.cluster.open-cluster-management.io
resource is still enabled on Hub1, it will resume writting backups to the same storage location as Hub2. Both Hub1 and Hub2 are now writting backup data at the same storage location. Any cluster restoring the latest backups from this storage location could pick up Hub1 data instead of Hub2.
- The admin tests the disaster scenario by making Hub2 a primary hub:
- 2.1) Hub1 is stopped
- 2.2) Hub1 backup data is restored on Hub2
- 2.3) The admin creates the
BackupSchedule.cluster.open-cluster-management.io
resource on Hub2. Hub2 is now the primary hub and generates backup data to the common storage location. - 2.4) After the disaster test is completed, the admin will revert to the previous state and make Hub1 the primary hub:
- 2.4.1) Hub1 is started. Hub2 is still up though and the
BackupSchedule.cluster.open-cluster-management.io
resource is Enabled on Hub2. Until Hub2BackupSchedule.cluster.open-cluster-management.io
resource is deleted or Hub2 is stopped, Hub2 could write backups at any time at the same storage location, corrupting the backup data. Any cluster restoring the latest backups from this location could pick up Hub2 data instead of Hub1. The right approach here would have been to first stop Hub2 or delete theBackupSchedule.cluster.open-cluster-management.io
resource on Hub2, then start Hub1.
- 2.4.1) Hub1 is started. Hub2 is still up though and the
In order to avoid and to report this type of backup collisions, a BackupCollision state exists for a BackupSchedule.cluster.open-cluster-management.io
resource. The controller checks regularly if the latest backup in the storage location has been generated from the current cluster. If not, it means that another cluster has more recently written backup data to the storage location so this hub is in collision with another hub.
In this case, the current hub BackupSchedule.cluster.open-cluster-management.io
resource status is set to BackupCollision and the Schedule.velero.io
resources created by this resource are deleted to avoid data corruption. The BackupCollision is reported by the backup Policy. The admin should verify what hub must be the one writting data to the storage location, then remove the BackupSchedule.cluster.open-cluster-management.io
resource from the invalid hub and recreated a new BackupSchedule.cluster.open-cluster-management.io
resource on the valid, primary hub, to resume the backup on this hub.
Example of a schedule in BackupCollision
state:
oc get backupschedule -A
NAMESPACE NAME PHASE MESSAGE
openshift-adp schedule-hub-1 BackupCollision Backup acm-resources-schedule-20220301234625, from cluster with id [be97a9eb-60b8-4511-805c-298e7c0898b3] is using the same storage location. This is a backup collision with current cluster [1f30bfe5-0588-441c-889e-eaf0ae55f941] backup. Review and resolve the collision then create a new BackupSchedule resource to resume backups from this cluster.
Restoring a backup
Prepare the new hub
Before running the restore operation on a new hub, you need to manually configure the hub and install the same operators as on the initial hub. You have to install the Red Hat Advanced Cluster Management for Kubernetes operator, in the same namespace as the initial hub, then create the DataProtectionApplication resource and connect to the same storage location where the initial hub had backed up data.
If the initial hub had any other operators installed, such as Ansible Automation Platform
, Red Hat OpenShift GitOps
, cert-manager
you have to install them now, before running the restore operation, and using the same namespace as the primary hub operators. This ensure the new hub is configured in the same way as the initial hub.
Note:
The settings for the local-cluster
managed cluster resource, such as owning managed cluster set, are not restored on new hubs.
This is because the local-cluster
managed cluster resource is not being backed up since the resource contains local cluster specific information, such as cluster url details. Restoring this content and overwriting the local-cluster
data on a new hub would corrupt the cluster where the restore is executed. As a result, any configuration changes applied to the local-cluster
resource on the primary hub - such as updating the owning managed cluster set from default
to another managed cluster set - should be manually applied on the restored cluster.
Restoring backups
In a usual restore scenario, the hub where the backups have been executed becomes unavailable and data backed up needs to be moved to a new hub. This is done by running the restore operation on the hub where the backed up data needs to be moved to. In this case, the restore operation is executed on a different hub than the one where the backup was created.
There are also cases where you want to restore the data on the same hub where the backup was collected, in order to recover data from a previous snapshot. In this case both restore and backup operations are executed on the same hub.
A restore backup is executed when creating the restore.cluster.open-cluster-management.io
resource on the hub. A few samples are available here
By passive data we mean resource that don't result in activating the connection between the new hub and managed clusters. When the passive data is restored on the new hub, the managed clusters show up but they are in detached state. The hub that produced the backup is still managing these clusters.
By activation data we mean resources that, when restored on the new hub, result in making the managed clusters to be managed by the new hub. The new hub is now the active hub, managing the clusters.
Restoring passive resources and check for new backups
Use the restore passive with sync sample if you want to restore passive data then keep checking if new backups are available and restore them automatically. For this automatic restore of new backups to work, the restore must set syncRestoreWithNewBackups
property to true
and must only restore latest, passive data. So for this option to work, you need to set VeleroResourcesBackupName
and VeleroCredentialsBackupName
to latest
and the VeleroManagedClustersBackupName
to skip
- as soon as the VeleroManagedClustersBackupName
is set to latest
, the managed clusters are activated on the new hub and this hub becomes a primary hub. When this happens, the restore resource is set to Finished
and the syncRestoreWithNewBackups
is ignored, even if set to true
. The restore operation has completed.
By default, when syncRestoreWithNewBackups
is set to true
, the controller checks for new backups every 30 minutes. If new backups are found, it restores the backed up resources. You can update the duration after which you want the controller to check for new backups using this property restoreSyncInterval
.
For example, the resource below checks for new backups every 10 minutes.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
name: restore-acm-passive-sync
spec:
syncRestoreWithNewBackups: true # restore again when new backups are available
restoreSyncInterval: 10m # check for new backups every 10 minutes
cleanupBeforeRestore: CleanupRestored
veleroManagedClustersBackupName: skip
veleroCredentialsBackupName: latest
veleroResourcesBackupName: latest
Restoring passive resources
Use the passive sample if you want to restore all resources on the new hub but you don't want to have the managed clusters be managed by the new hub. You can use this restore configuration when the initial hub is still up and you want to prevent the managed clusters to change ownership. You could use this restore option when you want to view the initial hub content using the new hub or to prepare the new hub to take over when needed. In the case of takeover, just restore the managed clusters resources using the passive activation sample; the managed clusters will now connect with the new hub.
Restoring activation resources
Use the passive activation sample when you want for this hub to manage the clusters. In this case it is assumed that the other data has been restored already on this hub using the passive sample
Restoring all resources
Use the restore sample if you want to restore all data at once and make this hub take over the managed clusters in one step.
After you create a restore.cluster.open-cluster-management.io
resource on the hub, you should be able to run oc get restore -n <oadp-operator-ns>
and get the status of the restore operation. You should also be able to verify on your hub that the backed up resources contained by the backup file have been created.
Cleaning up the hub before restore
Velero currently skips backed up resources if they are already installed on the hub. This limits the scenarios that can be used when restoring hub data on a new hub. Unless the new hub is not used and the restore is applied only once, the hub could not be relibly used as a passive configuration: the data on this hub is not reflective of the data available with the restored resources.
Restore limitations examples:
- A Policy exists on the new hub, before the backup data is restored on this hub. After the restore of the backup resources, the new hub is not identical with the initial hub from where the data was restored. The Policy should not be running on the new hub since this is a Policy not available with the backup resources.
- A Policy exists on the new hub, before the backup data is restored on this hub. The backup data contains the same Policy but in an updated configuration. Since Velero skips existing resources, the Policy will stay unchanged on the new hub, so the Policy is not the same as the one backed up on the initial hub.
- A Policy is restored on the new hub. The primary hub keeps updating the content and the Policy content changes as well. The user reapplies the backup on the new hub, expecting to get the updated Policy. Since the Policy already exists on the hub - created by the previous restore - it will not be restored again. So the new hub has now a different configuration from the primary hub, even if the backup contains the expected updated content; that content is not updated by Velero on the new hub.
To address above limitations, when a Restore.cluster.open-cluster-management.io
resource is created, the Cluster Back up and Restore Operator runs a prepare for restore set of steps which will clean up the hub, before Velero restore is called.
The prepare for cleanup option uses the cleanupBeforeRestore
property to identify the subset of objects to clean up. There are 3 options you could set for this clean up:
None
: no clean up necessary, just call Velero restore. This is to be used on a brand new hub.CleanupRestored
: clean up all resources created by a previous acm restore. This should be the common usage for this property. It is less intrusive then theCleanupAll
and covers the scenario where you start with a clean hub and keep restoring resources on this hub ( limitation sample 3 above )CleanupAll
: clean up all resources on the hub which could be part of an acm backup, even if they were not created as a result of a restore operation. This is to be used when extra content has been created on this hub which requires clean up ( limitation samples 1 and 2 above ). Use this option with caution though as this will cleanup resources on the hub created by the user, not by a previous backup. It is strongly recommended to use theCleanupRestored
option and to refrain from manually updating hub content when the hub is designated as a passive candidate for a disaster scenario. Basically avoid getting into the situation where you have to swipe the cluster using theCleanupAll
option; this is given as a last alternative.
Note:
-
Velero sets a
PartiallyFailed
status for a velero restore resource if the backup restored had no resources. This means that arestore.cluster.open-cluster-management.io
resource could be inPartiallyFailed
status if any of therestore.velero.io
resources created did not restore any resources because the corresponding backup was empty. -
The
restore.cluster.open-cluster-management.io
resource is executed once, unless you use thesyncRestoreWithNewBackups:true
to keep restoring passive data when new backups are available.For this case, follow the restore passive with sync sample. After the restore operation is completed, if you want to run another restore operation on the same hub, you have to create a newrestore.cluster.open-cluster-management.io
resource. -
Although you can create multiple
restore.cluster.open-cluster-management.io
resources, only one is allowed to be executing at any moment in time. -
The restore operation allows to restore all 3 backup types created by the backup operation, although you can choose to install only a certain type (only managed clusters or only user credentials or only hub resources).
The restore defines 3 required spec properties, defining the restore logic for the 3 types of backed up files.
veleroManagedClustersBackupName
is used to define the restore option for the managed clusters.veleroCredentialsBackupName
is used to define the restore option for the user credentials.veleroResourcesBackupName
is used to define the restore option for the hub resources (applications and policies).
The valid options for the above properties are :
latest
- restore the last available backup file for this type of backupskip
- do not attempt to restore this type of backup with the current restore operation<backup_name>
- restore the specified backup pointing to it by name
Below you can see a sample available with the operator.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
name: restore-acm
namespace: open-cluster-management-backup
spec:
cleanupBeforeRestore: CleanupRestored
veleroManagedClustersBackupName: latest
veleroCredentialsBackupName: latest
veleroResourcesBackupName: latest
View restore events
Use the oc describe -n <oadp-n> <restore-name>
command to get information about restore events.
A sample output is shown below
Spec:
Cleanup Before Restore: CleanupRestored
Restore Sync Interval: 4m
Sync Restore With New Backups: true
Velero Credentials Backup Name: latest
Velero Managed Clusters Backup Name: skip
Velero Resources Backup Name: latest
Status:
Last Message: Velero restores have run to completion, restore will continue to sync with new backups
Phase: Enabled
Velero Credentials Restore Name: example-acm-credentials-schedule-20220406171919
Velero Resources Restore Name: example-acm-resources-schedule-20220406171920
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Prepare to restore: 76m Restore controller Cleaning up resources for backup acm-credentials-hive-schedule-20220406155817
Normal Prepare to restore: 76m Restore controller Cleaning up resources for backup acm-credentials-cluster-schedule-20220406155817
Normal Prepare to restore: 76m Restore controller Cleaning up resources for backup acm-credentials-schedule-20220406155817
Normal Prepare to restore: 76m Restore controller Cleaning up resources for backup acm-resources-generic-schedule-20220406155817
Normal Prepare to restore: 76m Restore controller Cleaning up resources for backup acm-resources-schedule-20220406155817
Normal Velero restore created: 74m Restore controller example-acm-credentials-schedule-20220406155817
Normal Velero restore created: 74m Restore controller example-acm-resources-generic-schedule-20220406155817
Normal Velero restore created: 74m Restore controller example-acm-resources-schedule-20220406155817
Normal Velero restore created: 74m Restore controller example-acm-credentials-cluster-schedule-20220406155817
Normal Velero restore created: 74m Restore controller example-acm-credentials-hive-schedule-20220406155817
Normal Prepare to restore: 64m Restore controller Cleaning up resources for backup acm-resources-schedule-20220406165328
Normal Prepare to restore: 62m Restore controller Cleaning up resources for backup acm-credentials-hive-schedule-20220406165328
Normal Prepare to restore: 62m Restore controller Cleaning up resources for backup acm-credentials-cluster-schedule-20220406165328
Normal Prepare to restore: 62m Restore controller Cleaning up resources for backup acm-credentials-schedule-20220406165328
Normal Prepare to restore: 62m Restore controller Cleaning up resources for backup acm-resources-generic-schedule-20220406165328
Normal Velero restore created: 61m Restore controller example-acm-credentials-cluster-schedule-20220406165328
Normal Velero restore created: 61m Restore controller example-acm-credentials-schedule-20220406165328
Normal Velero restore created: 61m Restore controller example-acm-resources-generic-schedule-20220406165328
Normal Velero restore created: 61m Restore controller example-acm-resources-schedule-20220406165328
Normal Velero restore created: 61m Restore controller example-acm-credentials-hive-schedule-20220406165328
Normal Prepare to restore: 38m Restore controller Cleaning up resources for backup acm-resources-generic-schedule-20220406171920
Normal Prepare to restore: 38m Restore controller Cleaning up resources for backup acm-resources-schedule-20220406171920
Normal Prepare to restore: 36m Restore controller Cleaning up resources for backup acm-credentials-hive-schedule-20220406171919
Normal Prepare to restore: 36m Restore controller Cleaning up resources for backup acm-credentials-cluster-schedule-20220406171919
Normal Prepare to restore: 36m Restore controller Cleaning up resources for backup acm-credentials-schedule-20220406171919
Normal Velero restore created: 36m Restore controller example-acm-credentials-cluster-schedule-20220406171919
Normal Velero restore created: 36m Restore controller example-acm-credentials-schedule-20220406171919
Normal Velero restore created: 36m Restore controller example-acm-resources-generic-schedule-20220406171920
Normal Velero restore created: 36m Restore controller example-acm-resources-schedule-20220406171920
Normal Velero restore created: 36m Restore controller example-acm-credentials-hive-schedule-20220406171919
Backup validation using a Policy
The Cluster Back up and Restore Operator chart installs the backup-restore-enabled Policy, used to inform on issues with the backup and restore component.
The Policy has a set of templates which check for the following constraints and informs when any of them are violated.
Pod validation
The following templates check the pod status for the backup component and dependencies:
acm-backup-pod-running
template checks if Backup and Restore operator pod is runningoadp-pod-running
template checks if OADP operator pod is runningvelero-pod-running
template checks if Velero pod is running
Data Protection Application validation
data-protection-application-available
template checks if aDataProtectionApplication.oadp.openshift.io
resource was created. This OADP resource sets up Velero configurations.
Backup Storage validation
backup-storage-location-available
template checks if aBackupStorageLocation.velero.io
resource was created and the status isAvailable
. This implies that the connection to the backup storage is valid.
BackupSchedule collision validation
acm-backup-clusters-collision-report
template checks that if aBackupSchedule.cluster.open-cluster-management.io
exists on the current hub, its state is notBackupCollision
. This verifies that the current hub is not in collision with any other hub when writing backup data to the storage location. For a definition of the BackupCollision state read the Backup Collisions section
BackupSchedule and Restore status validation
acm-backup-phase-validation
template checks that if aBackupSchedule.cluster.open-cluster-management.io
exists on the current cluster, the status is not in (Failed, or empty state). This ensures that if this cluster is the primary hub and is generating backups, theBackupSchedule.cluster.open-cluster-management.io
status is healthy.- the same template checks that if a
Restore.cluster.open-cluster-management.io
exists on the current cluster, the status is not in (Failed, or empty state). This ensures that if this cluster is the secondary hub and is restoring backups, theRestore.cluster.open-cluster-management.io
status is healthy.
Backups exist validation
acm-managed-clusters-schedule-backups-available
template checks ifBackup.velero.io
resources are available at the location specified by theBackupStorageLocation.velero.io
and the backups were created by aBackupSchedule.cluster.open-cluster-management.io
resource. This validates that the backups have been executed at least once, using the Backup and restore operator.
Backups are running to completion
acm-backup-in-progress-report
template checks ifBackup.velero.io
resources are stuck InProgress. This validation is added because with a large number of resources, the velero pod restarts as the backup executes and the backup stays in progress without proceeding to completion. During a normal backup though, the backup resources are in progress at some point of the execution but they don't get stuck in this phase and run to completion. It is normal to see theacm-backup-in-progress-report
template reporting a warning during the time the schedule is running and backups are in progress.
Backups are actively running as a cron job
- This validation is done by the
backup-schedule-cron-enabled
template. It checks that aBackupSchedule.cluster.open-cluster-management.io
is actively running and creating new backups at the storage location. The template verifies there is aBackup.velero.io
with a labelvelero.io/schedule-name: acm-validation-policy-schedule
at the storage location. Theacm-validation-policy-schedule
backups are set to expire after the time set for the backups cron schedule. If no cron job is running anymore to create backups, the oldacm-validation-policy-schedule
backup is deleted because it expired and a new one is not created. So if noacm-validation-policy-schedule
backup exists at any moment in time, it means that there are no active cron jobs generating acm backups.
This Policy is intended to help notify the Hub admin of any backup issues as the hub is active and expected to produce or restore backups.
Active passive configuration design
Setting up an active passive configuration
In an active passive configuration you have
- one hub, called active or primary hub, which manages the clusters and is backing up resources at defined time intervals, using the
BackupSchedule.cluster.open-cluster-management.io
resource - one or more passive hubs, which are continously retrieving the latest backups and restoring the passive data. The passive hubs use the
Restore.cluster.open-cluster-management.io
resource to keep restoring passive data posted by the primary hub, when new backup data is available. These hubs are on standby to become a primary hub when the primary hub goes down. They are connected to the same storage location where the primary hub backs up data so they can access the primary hub backups. For more details on how to setup this automatic restore configuration see the Restoring passive resources and check for new backups section.
In the image below, the active hub manages the remote clusters and backs up hub data at regular intervals. The passive hubs restore this data, except for the managed clusters activation data, which would move the managed clusters to the passive hub. The passive hubs can restore the passive data continously, or as a one time operation.
Disaster recovery
When the primary hub goes down, one of the passive hubs is chosen by the admin to take over the managed clusters. In the image below, the admin decides to use Hub N as the new primary hub. These are the steps taken to have Hub N become a primary hub:
- Hub N restores the Managed Cluster activation data. At this point, the managed clusters connect with Hub N.
- The admin starts a backup on the new primary Hub N, by creating a
BackupSchedule.cluster.open-cluster-management.io
resource and storing the backups at the same storage location as the initial primary hub. All other passive hubs will now restore passive data using the backup data created by the new primary hub, unaware that the primary hub has changed. Hub N behaves now as the primary hub, managing clusters and backing up data.
Note:
- Step 1 is not automated since the admin should decide if the primary hub is down and needs to be replaced, or there is some network communication error between the hub and managed clusters. The admin also decides which passive hub should become primary. If desired, this step could be automated using the Policy integration with Ansible jobs: the admin can setup an Ansible job to be executed when the Backup Policy reports backup execution errors.
- Although Step 2 above is manual, the admin will be notified using the Backups are actively running as a cron job if he omits to start creating backups from the new primary hub.
Setting up your development environment
Prerequiste tools
- Operator SDK
Installation
To install the Cluster Back up and Restore Operator, you can either run it outside the cluster, for faster iteration during development, or inside the cluster.
First we require installing the Operator CRD:
make build
make install
Then proceed to the installation method you prefer below.
Outside the cluster
If you would like to run the Cluster Back up and Restore Operator outside the cluster, execute:
make run
Inside the cluster
If you would like to run the Operator inside the cluster, you'll need to build a container image. You can use a local private registry, or host it on a public registry service like quay.io.
- Build your image:
make docker-build IMG=<registry>/<imagename>:<tag>
- Push the image:
make docker-push IMG=<registry>/<imagename>:<tag>
- Deploy the Operator:
make deploy IMG=<registry>/<imagename>:<tag>
Usage
Here you can find an example of a backupschedule.cluster.open-cluster-management.io
resource definition:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: BackupSchedule
metadata:
name: schedule-acm
spec:
veleroSchedule: 0 */6 * * * # Create a backup every 6 hours
veleroTtl: 72h # deletes scheduled backups after 72h; optional, if not specified, the maximum default value set by velero is used - 720h
-
veleroSchedule
is a required property and defines a cron job for scheduling the backups. -
veleroTtl
is an optional property and defines the expiration time for a scheduled backup resource. If not specified, the maximum default value set by velero is used, which is 720h.
This is an example of a restore.cluster.open-cluster-management.io
resource definition
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
name: restore-acm
spec:
cleanupBeforeRestore: CleanupRestored
veleroManagedClustersBackupName: latest
veleroCredentialsBackupName: latest
veleroResourcesBackupName: latest
In order to create an instance of backupschedule.cluster.open-cluster-management.io
or restore.cluster.open-cluster-management.io
you can start from one of the sample configurations.
Replace the <oadp-operator-ns>
with the namespace name used to install the OADP Operator.
kubectl create -n <oadp-operator-ns> -f config/samples/cluster_v1beta1_backupschedule.yaml
kubectl create -n <oadp-operator-ns> -f config/samples/cluster_v1beta1_restore.yaml
Testing
Schedule a backup
After you create a backupschedule.cluster.open-cluster-management.io
resource you should be able to run oc get bsch -n <oadp-operator-ns>
and get the status of the scheduled cluster backups.
In the example below, you have created a backupschedule.cluster.open-cluster-management.io
resource named schedule-acm.
The resource status shows the definition for the 3 schedule.velero.io
resources created by this cluster backup scheduler.
$ oc get bsch -n <oadp-operator-ns>
NAME PHASE
schedule-acm
Restore a backup
After you create a restore.cluster.open-cluster-management.io
resource on the new hub, you should be able to run oc get restore -n <oadp-operator-ns>
and get the status of the restore operation. You should also be able to verify on the new hub that the backed up resources contained by the backup file have been created.
Velero sets a PartiallyFailed
status for a velero restore resource if the backup restored had no resources. This means that a restore.cluster.open-cluster-management.io
resource could be in PartiallyFailed
status if any of the restore.velero.io
resources created did not restore any resources because the corresponding backup was empty.
The restore defines 3 required spec properties, defining the restore logic for the 3 types of backed up files.
veleroManagedClustersBackupName
is used to define the restore option for the managed clusters.veleroCredentialsBackupName
is used to define the restore option for the user credentials.veleroResourcesBackupName
is used to define the restore option for the hub resources (applications and policies).
The valid options for the above properties are :
latest
- restore the last available backup file for this type of backupskip
- do not attempt to restore this type of backup with the current restore operation<backup_name>
- restore the specified backup pointing to it by name
The cleanupBeforeRestore
property is used to clean up resources before the restore is executed. More details about this options here.
Note: The restore.cluster.open-cluster-management.io
resource is executed once. After the restore operation is completed, if you want to run another restore operation on the same hub, you have to create a new restore.cluster.open-cluster-management.io
resource.
Below is an example of a restore.cluster.open-cluster-management.io
resource, restoring all 3 types of backed up files, using the latest available backups:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
name: restore-acm
spec:
cleanupBeforeRestore: CleanupRestored
veleroManagedClustersBackupName: latest
veleroCredentialsBackupName: latest
veleroResourcesBackupName: latest
You can define a restore operation where you only restore the managed clusters:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
name: restore-acm
spec:
cleanupBeforeRestore: None
veleroManagedClustersBackupName: latest
veleroCredentialsBackupName: skip
veleroResourcesBackupName: skip
The sample below restores the managed clusters from backup acm-managed-clusters-schedule-20210902205438
:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Restore
metadata:
name: restore-acm
spec:
cleanupBeforeRestore: None
veleroManagedClustersBackupName: acm-managed-clusters-schedule-20210902205438
veleroCredentialsBackupName: skip
veleroResourcesBackupName: skip