2.3.0

Features

Subsetting

Subsetting has been introduced to speed up training and reduce memory usage by training on a smaller, representative sample of the original data. Subsets are built so that inter-table referential integrity is maintained and the consistency of intra-table connections is respected. The feature works for files (CSV, Parquet, etc.) and for Db2 and SQLite. It is disabled by default and can be enabled in the 'General Setting' section of the UI. Users can tell the algorithm to skip any number of tables, in which case those tables are not subsetted.
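To illustrate what maintaining inter-table referential integrity means for a subset, here is a minimal sketch in plain Python. It is not the product's algorithm: the function name, the table layout and the sampling strategy are all assumptions made for illustration.

```python
import random

def subset_with_integrity(parents, children, pk, fk, fraction, seed=0):
    """Sample a fraction of parent rows, then keep only the child rows
    whose foreign key points at a sampled parent (hypothetical helper)."""
    rng = random.Random(seed)
    k = max(1, round(len(parents) * fraction))
    sampled_parents = rng.sample(parents, k)
    kept_keys = {row[pk] for row in sampled_parents}
    # Dropping children of unsampled parents avoids dangling references.
    sampled_children = [row for row in children if row[fk] in kept_keys]
    return sampled_parents, sampled_children

# Hypothetical data: 10 customers, each with 5 orders.
customers = [{"customer_id": i} for i in range(10)]
orders = [{"order_id": n, "customer_id": n % 10} for n in range(50)]

sub_customers, sub_orders = subset_with_integrity(
    customers, orders, pk="customer_id", fk="customer_id", fraction=0.3
)
```

Every order in sub_orders refers to a customer that is present in sub_customers, so a model trained on the subset never sees a foreign key without its parent row.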

Users can choose to perform subsetting in cache. When enabled, this option uses encrypted temporary storage, with cache keys acting as unique identifiers for fetching and re-using data. On subsequent training runs, data that has already been read is fetched directly from the cache instead of being sampled again, which reduces run time. If a training run fails with this option enabled, users can reuse the work already in the cache rather than rerunning everything from scratch, saving time and computational resources.
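The general idea of fetching by cache key can be sketched as follows. These notes do not describe how cache keys are derived or how the storage is encrypted, so everything here is an assumption: the key is a hash of the subsetting settings, and an in-memory dictionary stands in for the encrypted temporary storage.

```python
import hashlib
import json

_cache = {}  # stand-in for the encrypted temporary storage (assumption)

def cache_key(settings):
    """Derive a deterministic key from the settings (hypothetical scheme)."""
    payload = json.dumps(settings, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def load_subset(settings, sample_fn):
    """Return the cached subset if present, otherwise sample and store it."""
    key = cache_key(settings)
    if key not in _cache:
        _cache[key] = sample_fn(settings)
    return _cache[key]

calls = []

def expensive_sample(settings):
    # Placeholder for reading and sampling the source data.
    calls.append(settings["table"])
    return [1, 2, 3]

first = load_subset({"table": "orders", "fraction": 0.1}, expensive_sample)
# Same settings in a different key order hit the cache; no second sampling.
second = load_subset({"fraction": 0.1, "table": "orders"}, expensive_sample)
```

After both calls, expensive_sample has run only once, which is the saving the cache option is meant to provide on repeated or restarted runs.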

Known Limitations

  • Does not support non-categorical string values
  • Cache keys are only available for subsetting, not for normal training

Enhanced DataType Detection

Enhanced data type detection has been introduced as an optional feature that yields more detailed and accurate results from the analysis stage of configuration, making the configuration of a dataset a less manual process. It relies less on column names to infer types and instead conducts a statistical analysis of the underlying data itself. That said, the highest accuracy is achieved when column names reflect the type of data they contain. The results of data type detection should always be reviewed by someone with an understanding of the data itself.

Enhanced data type detection supports Integer, Float, Category, DateTime, ID, Person and Location types. Standard data type detection remains the default and can be used if preferred. Please see the Configurator Installation page of our documentation for information on how to get set up with the latest version of analysis.
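As a rough illustration of inferring a type from the values rather than from the column name, the sketch below classifies a column of strings as Integer, Float or Category. It is not the detection logic shipped in the product: the function name, the "Unknown" fallback and the 0.5 uniqueness threshold are assumptions chosen for the example.

```python
def infer_type(values, max_category_ratio=0.5):
    """Guess a column type from string values alone (hypothetical heuristic)."""
    non_null = [v for v in values if v not in ("", None)]
    if not non_null:
        return "Unknown"

    def all_parse(cast):
        # True only if every non-null value can be converted by `cast`.
        try:
            for v in non_null:
                cast(v)
            return True
        except (TypeError, ValueError):
            return False

    if all_parse(int):
        return "Integer"
    if all_parse(float):
        return "Float"
    # Few distinct values relative to the row count suggests a category.
    if len(set(non_null)) / len(non_null) <= max_category_ratio:
        return "Category"
    return "Unknown"

result_int = infer_type(["1", "2", "3"])
result_float = infer_type(["1.5", "2"])
result_cat = infer_type(["a", "b", "a", "a"])
```

A heuristic like this can be wrong on small or unusual samples, which is one reason the detected types should still be reviewed by someone who knows the data.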

Known Limitations

Enhanced data type detection has all of the capabilities of the standard analysis; however, some limitations remain:

  • Real, TimeDelta, Raw, Age, Mapped and Combination datatypes are not supported.
  • Common date formats such as "%d%m%Y" are automatically detected. More unusual formats may not be detected and will require manual configuration.
  • Unable to detect title, gender and custom columns for person type.
  • Unable to detect incode, outcode, region, region code, street, street number, floor, door for location type.
  • Does not support cross table dependencies for types.
  • Enhanced data type detection can only be enabled via the UI.
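The date format limitation above can be understood with a small sketch: a detector that tries a fixed list of candidate formats will only recognise formats on that list. The candidate list and function name below are assumptions for illustration, not the formats the product actually checks.

```python
from datetime import datetime

# Hypothetical candidate list; the product's real list is not documented here.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d%m%Y"]

def detect_date_format(values, formats=CANDIDATE_FORMATS):
    """Return the first format that parses every value, or None."""
    for fmt in formats:
        try:
            for v in values:
                datetime.strptime(v, fmt)
        except ValueError:
            continue  # at least one value did not match; try the next format
        return fmt
    return None

iso = detect_date_format(["2023-01-31", "2022-12-01"])
compact = detect_date_format(["31012023", "01122022"])
unusual = detect_date_format(["31 Jan 23"])
```

Here unusual is None because no candidate matches, which corresponds to the case where the format has to be configured manually.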

Automatic Configuration of PK/CK/FK Constraints

With enhanced data type detection, we have also introduced the ability to automatically configure primary, composite and foreign key constraints during the analysis stage of configuration.

The way this is done can differ depending on the source of the data:

  • When reading data from disk or S3, the constraints are inferred by analysing the underlying data itself.
  • When reading data from a database, an attempt is first made to query the constraints directly from the database. If no constraints are found, or database permission issues prevent the constraints from being queried, the underlying data is analysed to infer them.
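The "query first, then fall back to inference" flow for databases can be sketched with SQLite, using its PRAGMA table_info and PRAGMA foreign_key_list statements. The function names and the infer_fn callback are hypothetical, and the product's own queries (including those for Db2) are not shown in these notes.

```python
import sqlite3

def query_constraints(conn, table):
    """Read declared PK columns and FKs for one table from SQLite."""
    # The table name is interpolated, so only pass trusted names.
    pk = [row[1] for row in conn.execute(f"PRAGMA table_info({table})") if row[5] > 0]
    # Each FK is (column, referenced_table, referenced_column).
    fks = [(row[3], row[2], row[4])
           for row in conn.execute(f"PRAGMA foreign_key_list({table})")]
    return pk, fks

def get_constraints(conn, table, infer_fn):
    """Prefer declared constraints; fall back to inference from the data."""
    try:
        pk, fks = query_constraints(conn, table)
    except sqlite3.Error:
        pk, fks = [], []  # e.g. the catalogue could not be queried
    if not pk and not fks:
        return infer_fn(table)
    return pk, fks

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(customer_id))"
)
conn.execute("CREATE TABLE notes (body TEXT)")  # no declared constraints

declared = get_constraints(conn, "orders", lambda t: ([], []))
fallback = get_constraints(conn, "notes", lambda t: (["inferred_pk"], []))
```

The orders table returns its declared keys, while notes has none declared and so goes through the placeholder inference callback.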

By default, this feature is disabled. Please see the Configurator Installation page of our documentation for information on how to get set up with the latest version of analysis.

Known Limitations

When constraints are inferred by analysing the underlying data, the following limitations apply:

  • Foreign keys will only be matched if they have the same column name as the primary key that they reference.
  • Composite keys will only be identified in the case where every column that makes up the key is also a foreign key.
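The two limitations above can be made concrete with a short sketch. It assumes a simplified schema description (dictionaries of table and column names) and hypothetical function names; it mirrors the stated rules rather than the product's implementation.

```python
def infer_foreign_keys(primary_keys, tables):
    """Match FKs purely by column name.

    primary_keys: {table: pk_column}; tables: {table: [columns]}.
    A column is treated as an FK only if it has the same name as
    another table's primary key.
    """
    fks = {}
    for table, columns in tables.items():
        for owner, pk in primary_keys.items():
            if owner != table and pk in columns:
                fks.setdefault(table, []).append((pk, owner))
    return fks

def is_detectable_composite_key(key_columns, table_fks):
    """A composite key is found only if every part is also an FK."""
    fk_columns = {col for col, _ in table_fks}
    return len(key_columns) > 1 and all(c in fk_columns for c in key_columns)

# Hypothetical schema.
tables = {
    "customers": ["customer_id", "name"],
    "products": ["product_id", "title"],
    "order_lines": ["customer_id", "product_id", "quantity"],
}
primary_keys = {"customers": "customer_id", "products": "product_id"}

fks = infer_foreign_keys(primary_keys, tables)
found = is_detectable_composite_key(["customer_id", "product_id"], fks["order_lines"])
missed = is_detectable_composite_key(["customer_id", "quantity"], fks["order_lines"])
```

A foreign key column named cust_id would not be matched to customers.customer_id under this rule, and a composite key that includes a non-FK column such as quantity would not be identified.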

Configuration of 100s of tables from the UI

UI navigation improvements have been made to enable users to configure and navigate an increased number of big tables via the user interface.

In prior versions, table names were shown in a horizontal list, which made it cumbersome for users to click through them all. In v2.3, the UI has been redesigned as a paginated list that is searchable and sortable, making it easier to navigate a large number of tables and columns.

Known Limitations

  • Suffers from performance issues when table column count exceeds 15000.

Fixes

Hazy Synthesisers

  • Key for FormFieldView
  • Many queries for large tables
  • Static id now parsed to correct type instead of string
  • DB exception and clean up naming
  • Negative ages bug when century not provided in dates of birth
  • IBAN/BBAN ID settings now generate valid data for all locales
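For context on the negative ages fix listed above, the sketch below shows how a two-digit year can produce a birth date in the future. Python's %y directive maps 00 to 68 onto 2000 to 2068, so "45" becomes 2045. The correction shown (moving future dates back by a century) is one possible approach and is an assumption, not necessarily how the fix was implemented.

```python
from datetime import date, datetime

def parse_birth_date(text, today, fmt="%d/%m/%y"):
    """Parse a date of birth with a two-digit year (hypothetical helper)."""
    born = datetime.strptime(text, fmt).date()
    if born > today:
        # A birth date cannot be in the future; assume the previous century.
        born = born.replace(year=born.year - 100)
    return born

def age_in_years(born, today):
    # Subtract one if this year's birthday has not happened yet.
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

today = date(2024, 6, 1)
naive = datetime.strptime("15/03/45", "%d/%m/%y").date()  # year parsed as 2045
naive_age = age_in_years(naive, today)                    # negative
fixed = parse_birth_date("15/03/45", today)
fixed_age = age_in_years(fixed, today)
```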

Improvements

Hazy Synthesiser / Configurator

  • Improvements to composite keys (category and date/time dtypes)
  • Navigating and configuring 100s of tables
  • Show warning for columns set as categories in case they contain PII data
  • Change Synth default settings to reduce memory & cpu time
  • Support category and datetime CK parts
  • Increased Schema number limit in dropdown selector

Training Configuration Breaking Changes

  • Fix: Broken locales removed from PersonLocales: es_MX, lt_LT, hi_IN, lv_LV, en_IN, en_TH.
  • Fix: The default values of the ID settings SwiftSettings, Swift8Settings, Swift11Settings and CompanySettings have changed because the previous defaults did not match the type.
  • Fix: IBAN and BBAN locales were previously incorrect and have been replaced.
  • Fix: StaticIdSettings are now parsed to the correct type. They were previously parsed to strings.