How to - Scheduling data image creation
One of the many benefits of Spawn is the ability to work with production-like datasets in every environment, regardless of their size, thanks to instant data container creation.
However, to take advantage of this, you first need a data image containing that data.
In this guide, you'll explore how to set up a scheduled pipeline in a CI environment to regularly create data images from your production-like datasets.
Spawn is currently in open beta. Complete the installation instructions to get access.
Prerequisites

- We'll assume you already have a masked backup of your production environment.
- We'll assume that the agent you're using to invoke Spawn has access to the masked production backup file.
Backing up your database

Depending on your environment, you'll need to obtain a backup of the database you'd like to create an image from. The following table suggests how to do this for each engine. The table is by no means exhaustive, but backup files generated by following these instructions have been tested and confirmed to work with Spawn.
Engine | Documentation |
---|---|
PostgreSQL | pg_dump |
MySQL | mysqldump |
MSSQL RDS | Native MSSQL RDS Backups |
MSSQL On-prem | MSSQL Backups |
Mongo | mongodump |
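For PostgreSQL, for instance, a plain-format dump can be produced with pg_dump. The sketch below is illustrative only: the function name, the `SOURCE_DB_URL` variable and the `widgetstore.sql` file name are hypothetical, and it assumes the database being dumped has already been masked.

```shell
# Hypothetical helper: dump an already-masked PostgreSQL database to a plain SQL file.
# SOURCE_DB_URL and the default output path are placeholders chosen for this sketch.
dump_masked_database() {
  output_file="${1:-/backups/widgetstore.sql}"
  # --format=plain writes a SQL script; --no-owner omits ownership commands
  # so the dump can be restored under a different role.
  pg_dump --format=plain --no-owner --file="$output_file" "$SOURCE_DB_URL"
}
```

With `SOURCE_DB_URL` set to a connection string, calling `dump_masked_database` writes the dump to the default path, or to the path passed as the first argument.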
Setting up Spawn in CI

Authenticating

When you're using Spawn interactively, you'll start off by running `spawnctl auth`. The authentication token you receive has a configured expiration time. This is no good for CI environments, as an interactive authentication workflow is impossible there.
Therefore, we must use access tokens to authenticate against Spawn, as these have no expiration (though they can be revoked if necessary).
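A minimal sketch of creating such a token from a workstation that is already authenticated. The `create access-token` subcommand and `--purpose` flag are written from memory of the spawnctl CLI and should be checked against `spawnctl --help`; the wrapper function and the purpose text are illustrative.

```shell
# Illustrative wrapper; the subcommand and flag are assumptions to verify with `spawnctl --help`.
create_ci_access_token() {
  purpose="${1:-}"
  # Refuse to create a token without a human-readable purpose.
  if [ -z "$purpose" ]; then
    echo "a purpose is required" >&2
    return 1
  fi
  spawnctl create access-token --purpose "$purpose"
}
```

It would be invoked once, for example as `create_ci_access_token "Scheduled WidgetStore image pipeline"`.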
This command will create an access token with a given purpose. It's best practice to give clear, human-readable purposes for your access tokens so you can understand what they're used for in the future.
Now that you have this access token, set it up as a secret in your CI pipeline of choice so that your agents can access it.
Creating the data image

Spawn can be used in any CI environment that supports running scripts. In this case, we're using a Bash script on a Linux agent, but you could use whichever OS and scripting language you like.
Defining the data image to create

First, we'll need a file in source control that represents the data image we'd like to create:
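A minimal sketch of what such a file might contain, written as a shell heredoc so it can be scaffolded and then committed. The image name, tag, teams and `/backups/` folder come from this guide; the engine, version, backup file name, the `widgetstore-prod.yaml` file name and the exact field names are assumptions that should be checked against Spawn's data image reference.

```shell
# Scaffold a data image definition to commit to source control.
# Field names, engine, version and file names are assumptions for illustration.
IMAGE_FILE="${IMAGE_FILE:-widgetstore-prod.yaml}"

cat > "$IMAGE_FILE" <<'EOF'
name: WidgetStore
engine: postgresql
version: 11.0
sourceType: backup
backups:
  - folder: /backups/
    file: widgetstore.sql
    format: plain
tags:
  - latest-production
teams:
  - Developers
  - DBAs
EOF
```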
There are some important best practices to mention in this YAML:
- The image is shared with multiple teams: in this case, Developers and DBAs in my organisation.
- The image is tagged with `latest-production`.
  - This means that consumers can always run `spawnctl create data-container --image WidgetStore:latest-production` and they'll receive a data container with the latest production data.
As called out in the prerequisites, we've assumed this agent can access the masked production backup. The YAML assumes those backups reside in the `/backups/` directory on the CI agent.
Creating the data image

Now that the data image YAML is defined in source control, we can create the image itself in our CI pipeline.
Here's an example of the script we'll use to do just that:
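A sketch of such a script follows. The install URL, the `~/.spawnctl/bin` install path, the `--file` and `--lifetime` flag spellings and the `widgetstore-prod.yaml` file name are assumptions; `PIPELINE_RUN_ID` stands for whatever run identifier your CI system exposes.

```shell
#!/usr/bin/env bash
set -eu

# Download spawnctl and put it on the PATH.
# The URL and install directory are assumptions; check the installation instructions.
install_spawnctl() {
  curl -sL https://run.spawn.cc/install | sh > /dev/null 2>&1
  export PATH="$HOME/.spawnctl/bin:$PATH"
}

# Create the data image described by the YAML file in source control.
create_data_image() {
  # spawnctl is assumed to read the access token from this environment variable.
  : "${SPAWNCTL_ACCESS_TOKEN:?SPAWNCTL_ACCESS_TOKEN must be provided as a CI secret}"
  spawnctl create data-image \
    --file ./widgetstore-prod.yaml \
    --lifetime 168h \
    --tag "$PIPELINE_RUN_ID" \
    -q
}
```

The pipeline step then calls `install_spawnctl` followed by `create_data_image`.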
This script is very short, as we're only downloading `spawnctl` and then creating a data image.
The data image YAML file contains all the information about how to construct that image.
This pipeline can be configured to run as often as you'd like to refresh your data images.
Authenticating

The `$SPAWNCTL_ACCESS_TOKEN` environment variable is the access token we created and made available to the agents in the previous steps.
An extra tag for tracing

You'll notice that we've also appended the `--tag $PIPELINE_RUN_ID` flag to the command. This is another best practice, as it adds a tag in addition to the `latest-production` tag defined in the YAML file. In this case, the additional tag is the identifier of the pipeline run that triggered this data image creation. This means you'll be able to identify which images were created by which pipeline invocation.
Image lifetimes

We've also specified a lifetime for the image.
This sets a retention period for the data image. In this case, our data image is only valid for 7 days before automatically being cleaned up by Spawn. This prevents us from having stale data images that would no longer be useful.
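The lifetime is assumed here to be passed as a duration in hours through a `--lifetime` flag (for example `168h`); check `spawnctl create data-image --help` for the accepted format. Under that assumption, a retention period expressed in days converts as follows, using a hypothetical helper:

```shell
# Convert a retention period in days to the hour-based duration string
# that the --lifetime flag is assumed to accept.
lifetime_from_days() {
  days="$1"
  echo "$((days * 24))h"
}

LIFETIME="$(lifetime_from_days 7)"   # 7 days -> 168h
```

`$LIFETIME` can then be passed to the create command as `--lifetime "$LIFETIME"`.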
Suppressing progress output

We've also added the `-q` flag to suppress output from `spawnctl` and avoid polluting the CI pipeline logs with progress messages.
Reviewing the new images

As a developer in my organisation, I can now see these newly created data images and start using them in development:
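Listing the available images from a developer machine might look like the sketch below. `spawnctl get data-images` is written from memory of the CLI and its output columns are not assumed here, so the hypothetical helper simply keeps the lines that mention the image name.

```shell
# Hypothetical helper: show only the data images whose listing mentions the given name.
list_images_named() {
  image_name="$1"
  # grep -F treats the name as a literal string rather than a pattern.
  spawnctl get data-images | grep -F "$image_name"
}
```

Running `list_images_named WidgetStore` should show the images created by the pipeline, after which `spawnctl create data-container --image WidgetStore:latest-production` provides a container from the newest one.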