1.6.0

Features

Database Support

Db2

Read/Write

  • Capability to integrate with IBM Db2 as a data source and sink. Db2 is a relational database that delivers data management and analytics capabilities for transactional workloads. This operational database is designed to deliver high performance, actionable insights, data availability and reliability, and it is supported across the z/OS, Linux, UNIX and Windows operating systems.

  • Supported data types:

    • Strings: CHAR, VARCHAR, GRAPHIC, VARGRAPHIC, CLOB, DBCLOB
    • Integers/Boolean: BOOLEAN, SMALLINT, INT, BIGINT
    • Binary floating point numbers: REAL, DOUBLE
    • Fixed precision decimal numbers: DECIMAL
      • Due to conversion to and from binary floating point representation during synthesis, the effective lower and upper bounds of values inserted into DECIMAL columns in Db2 are restricted to double precision floating point limits (illustrated in the sketch after this list).
    • Date/time: DATE, TIME, TIMESTAMP
      • Can read TIMESTAMP values up to precision 12 (picosecond level)
      • Can synthesise TIMESTAMP values up to precision 3 (millisecond level)
    • Other: XML
  • Unsupported data types:

    • Binary objects: BINARY, VARBINARY, BLOB
    • Decimal floating points: DECFLOAT

Note that while these types are supported for read and write, some types, such as XML, cannot be synthesised.
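
The DECIMAL restriction above can be reproduced with nothing more than the Python standard library. This is a minimal sketch of the effect, not of the synthesiser's internals: round-tripping an exact decimal value through a binary double loses precision, and the double range caps the magnitudes that can be produced.

    import sys
    from decimal import Decimal

    # An exact decimal value picks up binary floating point error once it is
    # round-tripped through a double, as happens during synthesis.
    exact = Decimal("0.1")
    print(Decimal(float(exact)))
    # 0.1000000000000000055511151231257827021181583404541015625

    # Doubles top out near 1.8e308, so even though a wide DECIMAL column could
    # store larger magnitudes, synthesised values stay within this range.
    print(sys.float_info.max)  # 1.7976931348623157e+308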

SQL Server

Write

  • Capability to integrate with Microsoft SQL Server as a data sink. SQL Server is a relational database management system developed by Microsoft. As a database server, its primary function is to store and retrieve data as requested by other software applications, which may run on the same computer or on another computer across a network (including the Internet).

Fail-safe behaviour

  • The synthesiser makes use of SQL transactions, allowing us to validate any inserts into tables prior to committing them. If an error occurs within any one of these transactions during the write process, the transactions for all tables will be rolled back, leaving the database(s) in exactly the same state they were in prior to generation (see the sketch after this list).
  • A fail-safe location can be provided as part of the configuration so that, if the write to the database fails for any reason, the synthetic data can be written to disk instead.
  • Known limitation: because all rows are written in a single transaction, it is possible to quickly reach the transaction log size limit configured for your database. If issues relating to transaction log sizes arise during training, this limit can be increased in your database configuration.
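
As a rough illustration of the fail-safe behaviour, the sketch below runs all inserts for one table inside a single transaction using pyodbc and rolls back on any error. The connection string, table name and rows are placeholders, and the real synthesiser coordinates this across all tables rather than just one.

    import pyodbc

    rows = [(1, "Alice"), (2, "Bob")]  # placeholder synthetic rows

    # Placeholder connection string; adjust the driver, server and credentials.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
        "DATABASE=synth;UID=user;PWD=secret"
    )
    conn.autocommit = False  # run every insert inside a single transaction

    try:
        cursor = conn.cursor()
        for row in rows:
            cursor.execute("INSERT INTO customers (id, name) VALUES (?, ?)", row)
        conn.commit()      # commit only once every insert has succeeded
    except pyodbc.Error:
        conn.rollback()    # any failure leaves the database exactly as it was
        raise              # at this point the fail-safe location would be used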

Data Handlers

Text Category Handler

  • This handler allows the user to specify any number of regex patterns to match in the target column, and then assign a Sampler to generate values for the matched patterns.
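
The sketch below shows the general idea in plain Python; the pattern list, sampler functions and synthesise helper are all hypothetical stand-ins for the handler's actual configuration.

    import random
    import re

    # Hypothetical mapping of regex patterns to samplers for one target column.
    patterns = [
        (re.compile(r"REF-\d{6}"), lambda: f"REF-{random.randrange(10**6):06d}"),
        (re.compile(r"[A-Z]{2}\d{4}"),
         lambda: "".join(random.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=2))
                 + f"{random.randrange(10**4):04d}"),
    ]

    def synthesise(value: str) -> str:
        # Replace a source value with output from the sampler whose pattern
        # matches it; unmatched values fall through unchanged in this sketch.
        for pattern, sampler in patterns:
            if pattern.fullmatch(value):
                return sampler()
        return value

    print(synthesise("REF-004211"))  # e.g. REF-837164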

Conditioned ID Handler

  • With this feature a user can specify any number of queries (using the Pandas query language) to search for matches across a table. When a record matches a given query, its corresponding ID sampler will be used to populate the value of the target column (see the sketch after this list).
  • This can be used in conjunction with the Composite ID Sampler to generate semi-structured data that matches the correct format based on values in other columns.
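
A minimal sketch of the mechanism, assuming a pandas DataFrame and hypothetical ID samplers; the real handler's configuration format will differ.

    import random
    import pandas as pd

    df = pd.DataFrame({
        "country": ["GB", "GB", "US"],
        "account_id": [None, None, None],
    })

    # Hypothetical mapping of Pandas queries to ID samplers.
    conditions = {
        "country == 'GB'": lambda: f"GB-{random.randrange(10**6):06d}",
        "country == 'US'": lambda: f"US-{random.randrange(10**8):08d}",
    }

    for query, sampler in conditions.items():
        matched = df.query(query).index
        df.loc[matched, "account_id"] = [sampler() for _ in matched]

    print(df)  # each row's account_id now matches its country's format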

Person Handler improvements

  • Improvements to an existing handler that creates a representation of a person in the synthetic data.
    • Adds the ability to specify custom_columns in the PersonHandler. These can include print names for different types of correspondence (e.g. for letters or emails), and titles that match the gender of the generated name even when no gender or title column exists in the dataset.
    • When extra names are used in the custom columns constructor, their column names are numbered incrementally with each additional extra name (an illustrative configuration follows this list).
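
The configuration below is illustrative only. Every key and value is assumed for the sake of the example; consult the handler documentation for the actual schema.

    # Illustrative only; the real PersonHandler configuration may differ.
    person_config = {
        "custom_columns": {
            "letter_salutation": "{title} {last_name}",   # e.g. "Dr Smith"
            "email_display": "{first_name} {last_name}",  # e.g. "Ada Smith"
        },
        # A title matching the gender of the sampled name is generated even
        # though the source dataset has no gender or title column.
        "generate_title": True,
        # Extra names would yield incrementally numbered name columns.
        "extra_names": 2,
    }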

ID Mixture Handler

  • An improvement to the ID handler means we can now support multiple ID formats for a single primary / foreign key column through the use of regex pattern matching (sketched below).
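
The sketch below shows the matching idea in plain Python with hypothetical formats and generators: each source ID is matched against the configured patterns, and a new ID is generated in the same format.

    import random
    import re

    # Hypothetical format table: a regex for each ID format present in the key
    # column, paired with a generator that produces new IDs in that format.
    formats = [
        (re.compile(r"\d{8}"), lambda: f"{random.randrange(10**8):08d}"),
        (re.compile(r"[A-Z]{3}-\d{4}"),
         lambda: "".join(random.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=3))
                 + f"-{random.randrange(10**4):04d}"),
    ]

    def sample_like(source_id: str) -> str:
        for pattern, generator in formats:
            if pattern.fullmatch(source_id):
                return generator()
        raise ValueError(f"no configured format matches {source_id!r}")

    print(sample_like("ABC-1234"))  # e.g. QWZ-5830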

Multi-table

Composite key improvements

  • Ability to generate a combination of any of the supported ID types for each value in a column.
    • A composite key, in the context of relational databases, is a combination of two or more columns in a table that can be used to uniquely identify each row in the table.
    • Uniqueness is only guaranteed when the columns are combined; taken individually, the columns do not guarantee uniqueness (demonstrated after this list).
  • Improved representation of the number of rows in the source data for tables with composite keys.
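
The uniqueness property is easy to demonstrate with pandas: neither column below is unique on its own, but the pair identifies each row. Table and column names are invented for the example.

    import pandas as pd

    df = pd.DataFrame({
        "order_id": [1, 1, 2, 2],
        "line_no":  [1, 2, 1, 2],
        "product":  ["A", "B", "A", "C"],
    })

    print(df["order_id"].is_unique)  # False
    print(df["line_no"].is_unique)   # False
    # The (order_id, line_no) pair uniquely identifies every row.
    print(df.duplicated(subset=["order_id", "line_no"]).any())  # False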

Relational database improvement

  • Synthesisers will now support a given column being both a Primary Key and a Foreign Key.
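
A column that is both a primary key and a foreign key typically models a one-to-one relationship. The illustrative schema below uses the sqlite3 standard library purely to show the shape; the table and column names are invented.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
    # passport.person_id is simultaneously this table's primary key and a
    # foreign key into person: at most one passport per person.
    conn.execute(
        "CREATE TABLE passport ("
        " person_id INTEGER PRIMARY KEY REFERENCES person(id),"
        " number TEXT)"
    )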

Multi-table column mapper

  • Handlers can now refer to columns in tables other than the table currently being configured. To allow for cross-table dependencies, a MultiTableManager has been implemented that automatically queries all requested columns.
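
The class name MultiTableManager comes from this release, but its interface below is assumed for illustration: a minimal sketch that resolves "table.column" references against in-memory pandas tables.

    import pandas as pd

    class MultiTableManager:
        # Assumed interface: resolve "table.column" references for handlers.
        def __init__(self, tables: dict):
            self.tables = tables

        def get_column(self, reference: str) -> pd.Series:
            table, column = reference.split(".")
            return self.tables[table][column]

    manager = MultiTableManager({
        "customers": pd.DataFrame({"id": [1, 2], "segment": ["retail", "sme"]}),
        "orders": pd.DataFrame({"customer_id": [1, 1, 2]}),
    })
    # A handler configuring `orders` can condition on a customers column.
    print(manager.get_column("customers.segment"))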

Improvements

  • Show the training parameters for the upcoming Release 2.0 JSON schema in the Hazy Hub.

Fixes

Hub

  • Keep position in metrics page when switching between models.
  • Fix organisation Generators tab issue.