github.com/nixys/nxs-data-anonymizer v1.14.0
Repository: https://github.com/nixys/nxs-data-anonymizer.git
Documentation: pkg.go.dev


# README


nxs-data-anonymizer

Introduction

nxs-data-anonymizer is a tool for anonymizing PostgreSQL and MySQL/MariaDB/Percona database dumps.

Features

  • Supported databases and versions:
    • PostgreSQL (9/10/11/12/13/14/15/all versions)
    • MySQL/MariaDB/Percona (5.7/8.0/8.1/all versions)
  • Flexible data faking based on:
    • Go templates and the Sprig template library, as in Helm. You may also use the values of other columns in the same row to build more flexible rules
    • External commands you may execute to create table field values
    • Security enforcement rules
    • Linked cells across the database that receive the same generated values
  • Stream data processing: you can use the tool through a pipe on the command line and redirect a dump from the source DB directly to the destination DB with the required transformations
  • Easy to integrate into your CI/CD

Who can use the tool

Development and project teams that deal with production and test/dev/stage environments or dynamic namespaces with databases and need to ensure security and prevent data leaks.

Quickstart

Inspect your database structure and set up the nxs-data-anonymizer config in accordance with the sensitive data you need to anonymize.
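For example, a minimal config sketch (the table and column names here are hypothetical; adjust them to your schema) that replaces every value of a users.email column might look like this:

```yaml
# Sketch of /nxs-data-anonymizer.conf; table and column names are examples
filters:
  public.users:
    columns:
      email:
        # Build a fake address from the row id; see the "Go template" section below
        value: "anonymous_{{ .Values.id }}@example.invalid"
```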

You can use this tool in any way you want. The three most common ways are described below.

Console

In order to anonymize your database via the console, go through the following steps:

  • Download and untar the nxs-data-anonymizer binary
  • Run the nxs-data-anonymizer through the command line with arguments you want to use

For example, use the following command if you need to anonymize your PostgreSQL database from production to dev on the fly (the PostgreSQL client needs to be installed):

export PGPASSWORD=password; pg_dump -U postgres prod | /path/to/nxs-data-anonymizer -t pgsql -c /path/to/nxs-data-anonymizer.conf | psql -U postgres dev

GitLab CI

This section describes how to integrate nxs-data-anonymizer into your GitLab CI. You may add the jobs presented below to your .gitlab-ci.yml and adjust them to your needs.

Job: anonymize prod

The job described in this section performs the following tasks:

  • Runs when a special tag is set for the main branch
  • Creates a production database dump, anonymizes it and uploads it to an S3 bucket

Job sample:

anonymize:
  stage: anonymize
  image: nixyslab/nxs-data-anonymizer:latest
  variables:
    GIT_STRATEGY: none
    PG_HOST: ${PG_HOST_PROD}
    PG_USER: ${PG_USER_PROD}
    PGPASSWORD: ${PG_PASS_PROD}
  before_script: 
  - echo "${S3CMD_CFG}" > ~/.s3cmd
  - echo "${NXS_DA_CFG}" > /nxs-data-anonymizer.conf
  script:
  - pg_dump -h ${PG_HOST} -U ${PG_USER} --schema=${PG_SCHEMA} ${PG_DATABASE} | /nxs-data-anonymizer -t pgsql -c /nxs-data-anonymizer.conf | gzip | s3cmd put - s3://bucket/anondump.sql.gz
  only:
  - /^v.*$/
  except:
  - branches
  - merge_requests

Job: update stage

The job described in this section does the following:

  • A manual job for the stage branch
  • Downloads the anonymized dump from the S3 bucket and loads it into the stage database

Job sample:

restore-stage:
  stage: restore
  image: nixyslab/nxs-data-anonymizer:latest
  variables:
    GIT_STRATEGY: none
    PG_HOST: ${PG_HOST_STAGE}
    PG_USER: ${PG_USER_STAGE}
    PGPASSWORD: ${PG_PASS_STAGE}
  before_script: 
  - echo "${S3CMD_CFG}" > ~/.s3cmd
  script:
  - s3cmd --no-progress --quiet get s3://bucket/anondump.sql.gz - | gunzip | psql -h ${PG_HOST} -U ${PG_USER} --schema=${PG_SCHEMA} ${PG_DATABASE}
  only:
  - stage
  when: manual

CI/CD variables

This section contains a description of CI/CD variables used in GitLab CI job samples above.

General

| Variable | Description |
| --- | --- |
| S3CMD_CFG | S3 storage config |
| PG_SCHEMA | PgSQL schema |
| PG_DATABASE | PgSQL database name |

Production

| Variable | Description |
| --- | --- |
| NXS_DA_CFG | nxs-data-anonymizer config |
| PG_HOST_PROD | PgSQL host |
| PG_USER_PROD | PgSQL user |
| PG_PASS_PROD | PgSQL password |

Stage

| Variable | Description |
| --- | --- |
| PG_HOST_STAGE | PgSQL host |
| PG_USER_STAGE | PgSQL user |
| PG_PASS_STAGE | PgSQL password |

GitHub Actions

This section describes how to integrate nxs-data-anonymizer into your GitHub Actions. You may add the jobs presented below to your .github workflows and adjust them to your needs.

Job: anonymize prod

The job described in this section performs the following tasks:

  • Runs when a special tag is set
  • Creates a production database dump, anonymizes it and uploads it to an S3 bucket

Job sample:

on:
  push:
    tags:
    - v*.*

jobs:
  anonymize:
    runs-on: ubuntu-latest
    container:
      image: nixyslab/nxs-data-anonymizer:latest
      env:
        PG_HOST: ${{ secrets.PG_HOST_PROD }}
        PG_USER: ${{ secrets.PG_USER_PROD }}
        PGPASSWORD: ${{ secrets.PG_PASS_PROD }}
        PG_SCHEMA: ${{ secrets.PG_SCHEMA }}
        PG_DATABASE: ${{ secrets.PG_DATABASE }}
    steps:
    - name: Create services configs
      run: |
        echo "${{ secrets.S3CMD_CFG }}" > ~/.s3cmd
        echo "${{ secrets.NXS_DA_CFG }}" > /nxs-data-anonymizer.conf
    - name: Anonymize
      run: |
        pg_dump -h ${PG_HOST} -U ${PG_USER} --schema=${PG_SCHEMA} ${PG_DATABASE} | /nxs-data-anonymizer -t pgsql -c /nxs-data-anonymizer.conf | gzip | s3cmd put - s3://bucket/anondump.sql.gz

Job: update stage

The job described in this section does the following:

  • A manual job
  • Downloads the anonymized dump from the S3 bucket and loads it into the stage database

Job sample:

on: workflow_dispatch

jobs:
  restore-stage:
    runs-on: ubuntu-latest
    container:
      image: nixyslab/nxs-data-anonymizer:latest
      env:
        PG_HOST: ${{ secrets.PG_HOST_STAGE }}
        PG_USER: ${{ secrets.PG_USER_STAGE }}
        PGPASSWORD: ${{ secrets.PG_PASS_STAGE }}
        PG_SCHEMA: ${{ secrets.PG_SCHEMA }}
        PG_DATABASE: ${{ secrets.PG_DATABASE }}
    steps:
    - name: Create services configs
      run: |
        echo "${{ secrets.S3CMD_CFG }}" > ~/.s3cmd
    - name: Restore
      run: |
        s3cmd --no-progress --quiet get s3://bucket/anondump.sql.gz - | gunzip | psql -h ${PG_HOST} -U ${PG_USER} --schema=${PG_SCHEMA} ${PG_DATABASE}

GitHub Actions secrets

This section contains a description of secrets used in GitHub Actions job samples above.

General

| Variable | Description |
| --- | --- |
| S3CMD_CFG | S3 storage config |
| PG_SCHEMA | PgSQL schema |
| PG_DATABASE | PgSQL database name |

Production

| Variable | Description |
| --- | --- |
| NXS_DA_CFG | nxs-data-anonymizer config |
| PG_HOST_PROD | PgSQL host |
| PG_USER_PROD | PgSQL user |
| PG_PASS_PROD | PgSQL password |

Stage

| Variable | Description |
| --- | --- |
| PG_HOST_STAGE | PgSQL host |
| PG_USER_STAGE | PgSQL user |
| PG_PASS_STAGE | PgSQL password |

Settings

Default configuration file path: /nxs-data-anonymizer.conf. The file is in YAML format.

Command line arguments

| Argument | Short | Required | Having value | Default value | Description |
| --- | --- | --- | --- | --- | --- |
| --help | -h | No | No | - | Show program help message |
| --version | -v | No | No | - | Show program version |
| --conf | -c | No | Yes | /nxs-data-anonymizer.conf | Configuration file path |
| --input | -i | No | Yes | - | File to read data from. If not specified, stdin is used |
| --log-format | -l | No | Yes | json | Log format. Available values: json or plain |
| --output | -o | No | Yes | - | File to write data to. If not specified, stdout is used |
| --type | -t | Yes | Yes | - | Database dump file type. Available values: pgsql, mysql |

General settings

| Option | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| logfile | String | No | stderr | Log file path. You may also use stdout and stderr |
| loglevel | String | No | info | Log level. Available values: debug, warn, error and info |
| progress | Progress | No | - | Anonymization progress logging |
| variables | Map of Variables (key: variable name) | No | - | Global variables to be used in filters. Variables are set at application init and remain unchanged during runtime |
| link | Slice of Link | No | - | Rules to link specified columns across the database |
| filters | Map of Filters (key: table name) | No | - | Filter sets for the specified tables (key is a table name). Note: for PgSQL you also need to specify a schema (e.g. public.tablename) |
| security | Security | No | - | Security enforcement for the anonymizer |

Progress settings

| Option | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| rhythm | String | No | 0s | How often to write the read bytes count to the log. Progress is written only when this option is set to a non-zero value. You may use human-readable values (e.g. 30s, 5m) |
| humanize | Bool | No | false | Set to true to write the read bytes count to the log in a human-readable format. With false the raw bytes count is written |
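For instance, a config fragment that logs progress every 30 seconds in a human-readable form might look like this:

```yaml
progress:
  rhythm: 30s
  humanize: true
```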
Variables settings

| Option | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| type | String | No | template | Type of field value: template and command are available |
| value | String | Yes | - | The value to be used as the global variable value within the filters. Depending on the type, this value is either a Go template or a command. See below for details |
Link settings

Link is used to create the same data with specified rules for different cells across the database.

Each link element has the following properties:

  • It can contain multiple tables and multiple columns for each table
  • All specified cells that had the same data before anonymization will have the same data after it
  • One common rule is used to generate the new values

| Option | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| type | String | No | template | Type of field value: template and command are available |
| value | String | Yes | - | The value used to replace every cell in the specified columns. Depending on the type, this value is either a Go template or a command. See below for details |
| unique | Bool | No | false | If true, the generated value is checked for uniqueness across all columns specified for the link element |
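A short sketch of a link element (the table and column names are hypothetical):

```yaml
link:
- rule:
    # One common rule: linked cells that held equal values get equal new values
    value: "{{ randInt 1 100 }}"
    unique: true
  with:
    public.users:
    - id
    public.orders:
    - user_id
```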
Filters settings

Filter settings for a specified table.

| Option | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| columns | Map of Columns (key: column name) | No | - | Filter rules for the specified columns of the table (key is a column name) |
Columns settings

| Option | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| type | String | No | template | Type of field value: template and command are available |
| value | String | Yes | - | The value used to replace every cell in the specified column. Depending on the type, this value is either a Go template or a command. See below for details |
| unique | Bool | No | false | If true, the generated value is checked for uniqueness within the whole column |

Go template

To anonymize database fields you may use Go templates with the Sprig template library's functions.

Additional filter functions:

  • null: set a field value to NULL
  • isNull: compare a field value with NULL
  • drop: drop the whole row. If a table has filters for several columns and at least one of them returns the drop value, the whole row is skipped during anonymization

You may also use the following data in templates:

  • Current table name. Statement: {{ .TableName }}
  • Current column name. Statement: {{ .CurColumnName }}
  • Values of other columns in the same row (values before substitution). Statement: {{ .Values.COLUMN_NAME }} (e.g. {{ .Values.username }})
  • Global variables. Statement: {{ .Variables.VARIABLE_NAME }} (e.g. {{ .Variables.password }})
  • Raw column data type. Statement: {{ .ColumnTypeRaw }}
  • The regex's capturing groups for the column data type. This variable is an array, so you need to use range or index to access a specific element. Statement: {{ index .ColumnTypeGroups 0 0 }}. See Types for details
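Putting this together, a hypothetical filter that keeps NULL emails as NULL and derives the rest from the row's id column might look like this:

```yaml
filters:
  public.users:
    columns:
      email:
        # isNull/null are the additional filter functions described above
        value: "{{ if isNull .Values.email }}{{ null }}{{ else }}user_{{ .Values.id }}@example.invalid{{ end }}"
```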

Command

To anonymize database fields you may use commands (scripts or binaries) with any logic you need. The command concept has the following properties:

  • The command's stdout is used as the new value for the anonymized field
  • The command must return a zero exit code, otherwise nxs-data-anonymizer fails with an error (in this case stderr is used as the error text)
  • Environment variables with the row data are available within the command:
    • ENVVARTABLE: contains the name of the current table
    • ENVVARCURCOLUMN: contains the current column name
    • ENVVARCOLUMN_{COLUMN_NAME}: contains the values (before substitution) of all columns of the current row
    • ENVVARGLOBAL_{VARIABLE_NAME}: contains the value of the specified global variable
    • ENVVARCOLUMNTYPERAW: contains the raw column data type
    • ENVVARCOLUMNTYPEGROUP_{GROUP_NUM}_{SUBGROUPNUM}: contains the regex's capturing groups for the column data type. See Types for details
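As an illustration, the sketch below emulates the environment nxs-data-anonymizer would provide (the variable values here are made up for the demo) and builds a new field value from it, printing the result to stdout as a filter command must:

```shell
#!/bin/sh
# Demo values: in a real run these variables are exported by nxs-data-anonymizer
ENVVARTABLE="public.users"
ENVVARCURCOLUMN="email"
ENVVARCOLUMN_id="42"

# The command's stdout becomes the anonymized field value
printf 'user_%s@example.invalid' "$ENVVARCOLUMN_id"
```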
Security settings

| Option | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| policy | Policy | No | - | Security policy for entities |
| exceptions | Exceptions | No | - | Exceptions for entities |
| defaults | Defaults | No | - | Default filters for entities |

Policy settings

| Option | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| tables | String | No | pass | Security policy for tables. If the value skip is used, all tables not described in the config are skipped during anonymization |
| columns | String | No | pass | Security policy for columns. If the value randomize is used, all columns not described in the config are randomized (with the default rules according to their types) during anonymization |

The values used to masquerade columns according to their types are listed below.

PgSQL:

| Type | Value to masquerade |
| --- | --- |
| smallint | 0 |
| integer | 0 |
| bigint | 0 |
| smallserial | 0 |
| serial | 0 |
| bigserial | 0 |
| decimal | 0.0 |
| numeric | 0.0 |
| real | 0.0 |
| double | 0.0 |
| character | randomized character data |
| bpchar | randomized bpchar data |
| text | randomized text data |

MySQL:

| Type | Value to masquerade |
| --- | --- |
| bit | 0 |
| bool | 0 |
| boolean | 0 |
| tinyint | 0 |
| smallint | 0 |
| mediumint | 0 |
| int | 0 |
| integer | 0 |
| bigint | 0 |
| float | 0.0 |
| double | 0.0 |
| double precision | 0.0 |
| decimal | 0.0 |
| dec | 0.0 |
| char | randomized char (string is truncated to COLUMN_SIZE length) |
| varchar | randomized varchar (string is truncated to COLUMN_SIZE length) |
| tinytext | randomized tinytext |
| text | randomized text |
| mediumtext | randomized mediumtext |
| longtext | randomized longtext |
| enum | last value from the enum |
| set | last value from the set |
| date | 2024-01-01 |
| datetime | 2024-01-01 00:00:00 |
| timestamp | 2024-01-01 00:00:00 |
| time | 00:00:00 |
| year | 2024 |
| json | {"randomized": "json_data"} |
| binary | cmFuZG9taXplZCBiaW5hcnkgZGF0YQo= |
| varbinary | cmFuZG9taXplZCBiaW5hcnkgZGF0YQo= |
| tinyblob | cmFuZG9taXplZCBiaW5hcnkgZGF0YQo= |
| blob | cmFuZG9taXplZCBiaW5hcnkgZGF0YQo= |
| mediumblob | cmFuZG9taXplZCBiaW5hcnkgZGF0YQo= |
| longblob | cmFuZG9taXplZCBiaW5hcnkgZGF0YQo= |
Exceptions settings

| Option | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| tables | Slice of strings | No | - | Names of tables without filters that will not be skipped during anonymization when security.policy.tables is set to skip |
| columns | Slice of strings | No | - | Names of columns (in any table) without filters that will not be randomized during anonymization when security.policy.columns is set to randomize |

Defaults settings

| Option | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| columns | Map of Filters | No | - | Default filters for columns (in any table). These filters are applied to columns with these names that have no filters described |
| types | Slice of Types | No | - | Custom filters for types (in any table). With these filter rules you may override the default filters for types |

Types settings

| Option | Type | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| regex | String | Yes | - | Regular expression matched against the column data type (in the CREATE TABLE section). You may use capturing groups within the regex; they are available as additional variable data in the filters (see Columns for details). This helps to create more flexible rules to generate cell values according to data type features |
| rule | Columns | Yes | - | The rule applied to columns whose data types match the specified regular expression |
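For example, a hypothetical Types rule that matches varchar(N) columns and truncates the generated value to the declared size via the first capturing group might look like this (atoi, trunc and randAlphaNum are Sprig functions):

```yaml
security:
  defaults:
    types:
    - regex: 'varchar\((\d+)\)'
      rule:
        # Truncate a random value to the size captured from the type declaration
        value: "{{ trunc (atoi (index .ColumnTypeGroups 0 0)) (randAlphaNum 64) }}"
```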

Example

Imagine you have a simple production PgSQL database with two tables, users and posts, like this:

| id | username | password | api_key |
| --- | --- | --- | --- |
| 1 | admin | ZjCX6wUxtXIMtip | epezyj0cj5rqrdtxklnzxr3f333uibtz6avek7926141t1c918 |
| 2 | alice | tuhjLkgwwetiwf8 | 2od4vfsx2irj98hgjaoi6n7wjr02dg79cvqnmet4kyuhol877z |
| 3 | bob | AjRzvRp3DWo6VbA | owp7hob5s3o083d5hmursxgcv9wc4foyl20cbxbrr73egj6jkx |

| id | poster_id | title | content |
| --- | --- | --- | --- |
| 1 | 1 | example_post_1 | epezyj0cj5rqrdtxklnzxr3f333uibtz6avek7926141t1c918 |
| 2 | 2 | example_post_2 | 2od4vfsx2irj98hgjaoi6n7wjr02dg79cvqnmet4kyuhol877z |
| 3 | 3 | example_post_3 | owp7hob5s3o083d5hmursxgcv9wc4foyl20cbxbrr73egj6jkx |
| 4 | 1 | example_post_4 | epezyj0cj5rqrdtxklnzxr3f333uibtz6avek7926141t1c918 |
| 5 | 2 | example_post_5 | 2od4vfsx2irj98hgjaoi6n7wjr02dg79cvqnmet4kyuhol877z |
| 6 | 3 | example_post_6 | owp7hob5s3o083d5hmursxgcv9wc4foyl20cbxbrr73egj6jkx |

You need to get a dump with fake values:

  • For admin: preset fixed values for the password and API key to avoid the need to change app settings in your dev/test/stage or local environment after downloading the dump
  • For other users: usernames in the format user_N (where N is the user ID) and unique random passwords and API keys
  • The data mapping between the users and posts tables in the id and poster_id columns must be preserved
  • The contents of the content column must be randomized

In accordance with these conditions, the nxs-data-anonymizer config may look like this:
variables:
# Global variables
  adminPassword:
    type: template
    value: "preset_admin_password"
  adminAPIKey:
    value: "preset_admin_api_key"

# Block defining the rules of behavior for fields and tables with no filters specified
security:
# Specifies the required actions for tables and columns that are not described in the configuration
  policy:
    tables: skip
    columns: randomize
# Excludes the policy actions for the specified tables and columns
  exceptions:
    tables:
    - public.posts
    columns:
    - title
# Overrides the default policy actions for the columns specified in this block. The value is generated once and substituted into all instances of the field
  defaults:
    columns:
      content:
        value: "{{- randAlphaNum 20 -}}"

# Rules that preserve the mapping of values between tables
link:
- rule:
# Value generation rule
    value: "{{ randInt 1 15 }}"
    unique: true
  with:
# Tables and columns to which the rule is applied
    public.users:
    - id
    public.posts:
    - poster_id

# Block describing the replacement rules for fields
filters:
  public.users:
    columns:
      username:
        value: "{{ if eq .Values.username \"admin\" }}{{ .Values.username }}{{ else }}user_{{ .Values.id }}{{ end }}"
      password:
        type: command
        value: /path/to/script.sh
        unique: true
      api_key:
        value: "{{ if eq .Values.username \"admin\" }}{{ .Variables.adminAPIKey }}{{ else }}{{- randAlphaNum 50 | nospace | lower -}}{{ end }}"
        unique: true

The content of /path/to/script.sh is the following:

#!/bin/bash

# Print the preset password if the current user is admin
if [ "$ENVVARCOLUMN_username" == "admin" ]; then
    echo -n "$ENVVARGLOBAL_adminPassword"
    exit 0
fi

# Generate a password for other users
p=$(pwgen -s 5 1 2>&1)
if [ $? -ne 0 ]; then
    # On error, print the message to stderr and exit with a non-zero code
    echo -n "$p" >&2
    exit 1
fi

# Print the generated password
echo $p | tr -d '\n'

exit 0

Now you may execute the following command in order to load anonymized data into your dev DB:

pg_dump ... | ./nxs-data-anonymizer -t pgsql -c filters.conf | psql -h localhost -U user example

As a result:

| id | username | password | api_key |
| --- | --- | --- | --- |
| 5 | admin | preset_admin_password | preset_admin_api_key |
| 4 | user_2 | Pp4HY | dhx4mccxyd8ux5uf1khpbqsws8qqeqs4efex1vhfltzhtjcwcu |
| 7 | user_3 | vu5TW | lgkkq3csskuyew8fr52vfjjenjzudokmiidg3cohl2bertc93x |

| id | poster_id | title | content |
| --- | --- | --- | --- |
| 1 | 5 | example_post_1 | EDlT6bGXJ2LOS7CE2E4b |
| 2 | 4 | example_post_2 | EDlT6bGXJ2LOS7CE2E4b |
| 3 | 7 | example_post_3 | EDlT6bGXJ2LOS7CE2E4b |
| 4 | 5 | example_post_4 | EDlT6bGXJ2LOS7CE2E4b |
| 5 | 4 | example_post_5 | EDlT6bGXJ2LOS7CE2E4b |
| 6 | 7 | example_post_6 | EDlT6bGXJ2LOS7CE2E4b |

It's easy. You can find more examples in doc/examples.

Roadmap

The following features are already in the backlog for our development team and will be released soon:

  • Global variables with templated values that you may use in the filters for all tables and columns
  • Ability to delete tables and rows from the faked dump
  • Ability to output custom messages to the log. This is quite useful in order to obtain some generated data like admin passwords, etc.
  • Support for a wider variety of databases

Feedback

For support and feedback please contact me:

For news and discussions, subscribe to the channels:

License

nxs-data-anonymizer is released under the Apache License 2.0.