Post Generation

Normally the aim of synthetic data is to look as close to real data as possible, but sometimes it is useful to differentiate it from the source.

The following are example use cases for optionally transforming the data further.

  • Data can be transformed so that it is clear to users looking at the data that it is synthetic data and not production data.

  • Data can be transformed to make sure that, should it get linked up to a live system by mistake, the data cannot be used. For example, to make sure that email addresses would not be deliverable. Further transformation is only required in this case as an absolute guarantee that there is no chance that the synthesized email addresses correspond to email addresses actually in use.

Note that synthetic data is already abstracted data and further transformation is not required to ensure privacy.

Transformations can be applied to any string column. Note that transformed data is likely to no longer conform to the format rules for that data. For example, an email address that is changed in this way may no longer be a valid email address.

Options

There are three different types of transformation that can be applied: prepending, appending and encrypting. They can be used individually or in combination. For example, the email address test@test.com could be transformed in the following ways:
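
A minimal sketch of the prepend and append cases, using plain Python string operations rather than the Hazy implementation. The helper `extend` is hypothetical, and the strings "%" and "@" are the ones used in the configuration example in this section. The encrypted form is not shown because it depends on the key and tweak.

```python
def extend(value: str, method: str, string: str) -> str:
    """Hypothetical helper: prepend or append `string` to `value`."""
    if method == "prepend":
        return string + value
    if method == "append":
        return value + string
    raise ValueError(f"method must be 'append' or 'prepend', got {method!r}")


email = "test@test.com"
prepended = extend(email, "prepend", "%")  # "%test@test.com"
appended = extend(email, "append", "@")    # "test@test.com@"
both = extend(prepended, "append", "@")    # "%test@test.com@"
```

Neither result is a valid email address any more, which is the intended effect when the data must not be usable in a live system.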

Configuration

class hazy_configurator.general_params.post_generation_config.PostGenerationConfig

Bases: VersionedHazyBaseModel

Main class for all post generation options.

Examples

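from hazy_configurator import PostGenerationConfig
from hazy_configurator.general_params.post_generation_config import (
    EncryptMethodConfig,
    ExtendMethodConfig,
)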

post_generation_config = PostGenerationConfig(
    target={
        "table_name": {
            "column_name_in_table": [
                EncryptMethodConfig(
                    padding_character="@",
                    key="A051114A0B8F9075BE488347FA97F908",
                    tweak="4CEB496A1E0037"
                ),
                ExtendMethodConfig(method="prepend", string="%"),
                ExtendMethodConfig(method="append", string="@"),
            ]
        }
    }
)

field target: Dict[str, Dict[str, List[Union[ExtendMethodConfig, EncryptMethodConfig]]]] = None

Configuration for applying post-generation to a set of tables. Keys in this object should correspond to table names. Nested keys should correspond to column names.

Extend Method Configuration

class hazy_configurator.general_params.post_generation_config.ExtendMethodConfig

Bases: VersionedHazyBaseModel

Used to append or prepend strings to column values.

Fields:
field method: Literal['append', 'prepend'] [Required]

Method for extending generated data to ensure that it does not match real data. Must be one of [append, prepend].

field string: str [Required]

String to be appended or prepended to the generated data.

Encrypt Method Configuration

class hazy_configurator.general_params.post_generation_config.EncryptMethodConfig

Bases: VersionedHazyBaseModel

Used to encrypt column values.

Fields:
field method: Literal['encrypt'] = 'encrypt'

field padding_character: ConstrainedStrValue [Required]

Required when the method is “encrypt”. The character is used in two ways. First, it pads the synthesised data to the minimum length required for encryption; its value therefore only matters if the data needs to be decrypted afterwards. Second, it is appended in clear text to the encrypted string, so that the encrypted data cannot accidentally match a valid value. In both cases the character should be one that never appears at the end of the source data. For many scenarios “@” is a good choice.

Constraints:
  • maxLength = 1
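
The reason the padding character must never appear at the end of the source data can be illustrated with a small sketch. It assumes right-padding, which is what the trailing strip in the decryption code in this section suggests. The helper names and the `MIN_LEN` value are hypothetical, and no encryption is performed here.

```python
def pad_to_min_length(value: str, pad_char: str, min_len: int) -> str:
    """Right-pad `value` with `pad_char` until it is at least `min_len` long."""
    return value.ljust(min_len, pad_char)


def strip_padding(value: str, pad_char: str) -> str:
    """Remove every trailing occurrence of `pad_char`."""
    return value.rstrip(pad_char)


MIN_LEN = 6  # assumed for illustration; the real minimum depends on the cipher

padded = pad_to_min_length("ab", "@", MIN_LEN)  # "ab@@@@"
restored = strip_padding(padded, "@")           # "ab"

# A value that already ends with the padding character does not round-trip:
# its own trailing "@" is stripped together with the padding.
lossy = strip_padding(pad_to_min_length("ab@", "@", MIN_LEN), "@")  # "ab"
```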

field key: Optional[str] = None

A 32-character hexadecimal string used as the encryption key. It is only needed if the data must later be decrypted to recover the original synthesised string. If no key is specified, the algorithm uses a random value. If a key is specified, a tweak must also be specified.

field tweak: Optional[str] = None

A 14-character hexadecimal string that plays a role similar to an initialisation vector in symmetric encryption. It is only needed if the data must later be decrypted to recover the original synthesised string. If no tweak is specified, the algorithm uses a random value. If a tweak is specified, a key must also be specified.
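
As a sketch of the rules above, the following standard-library code generates a key and tweak of the stated lengths and checks a pair before it is placed in a configuration. The function `check_key_and_tweak` is hypothetical and is not part of hazy_configurator; whether the library performs equivalent validation itself is not stated here.

```python
import re
import secrets
from typing import Optional

HEX_32 = re.compile(r"[0-9A-Fa-f]{32}")
HEX_14 = re.compile(r"[0-9A-Fa-f]{14}")


def check_key_and_tweak(key: Optional[str], tweak: Optional[str]) -> None:
    """Hypothetical check: key and tweak are given together and well-formed."""
    if (key is None) != (tweak is None):
        raise ValueError("key and tweak must be specified together")
    if key is not None and not HEX_32.fullmatch(key):
        raise ValueError("key must be a 32 character hexadecimal string")
    if tweak is not None and not HEX_14.fullmatch(tweak):
        raise ValueError("tweak must be a 14 character hexadecimal string")


# 16 random bytes give 32 hex characters; 7 random bytes give 14.
key = secrets.token_hex(16).upper()
tweak = secrets.token_hex(7).upper()
check_key_and_tweak(key, tweak)

# The values from the configuration example also pass the check.
check_key_and_tweak("A051114A0B8F9075BE488347FA97F908", "4CEB496A1E0037")
```

Storing the generated key and tweak securely is what makes later decryption possible; if both are left unset, the random values used are not available afterwards.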

Decrypting encrypted data

Data that has been encrypted can be decrypted, provided that a key and tweak were specified when doing the encryption. The following code can be used to decrypt the encrypted data.

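# pip3 install "ff3>=1.0.1"
from ff3 import FF3Cipher
import pandas as pd


def decrypt_standalone(col: pd.Series, key, tweak, padding_character):
    """Decrypt a column using the key, tweak and padding character it was encrypted with."""

    def decrypt_cell(ciphertext, cipher, pad_char):
        # Decrypt the value in chunks of at most cipher.maxLen characters.
        ciphertext_chunked = chunk_str(ciphertext, cipher.maxLen)
        cleartext_chunked = [cipher.decrypt(c) for c in ciphertext_chunked]
        cleartext_padded = "".join(cleartext_chunked)
        if cleartext_padded == pad_char * (cipher.minLen + 2):
            # minLen + 2 padding characters decode to an empty string.
            cleartext = ""
        elif cleartext_padded == pad_char * (cipher.minLen + 1):
            # minLen + 1 padding characters decode to a missing value.
            cleartext = pd.NA
        else:
            # Otherwise strip the trailing padding.
            cleartext = cleartext_padded.rstrip(pad_char)
        return cleartext

    def create_alphabet(column, padding_character):
        # Collect every character seen in the column; the padding character goes last.
        alphabet = set()
        for cell in column:
            if not pd.isna(cell):
                alphabet |= set(cell)
        return sorted(alphabet - set(padding_character)) + [padding_character]

    def chunk_str(text, chunk_size):
        # Split text into consecutive pieces of at most chunk_size characters.
        return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

    alphabet = create_alphabet(col, padding_character)
    cipher = FF3Cipher.withCustomAlphabet(key, tweak, alphabet)
    # x[:-1] drops the clear-text padding character appended after encryption.
    col = col.apply(lambda x: decrypt_cell(x[:-1], cipher, padding_character))
    return col


# ciphertext_column is the encrypted pd.Series; key, tweak and padding_character
# are the values that were set in EncryptMethodConfig.
decrypted_column = decrypt_standalone(ciphertext_column, key, tweak, padding_character)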