Once you have chosen your source files, you can run our preliminary analysis on them. This will give you an overview of the statistical properties of the data, and help you decide which properties you wish to model in the synthetic data.

The analysis step will also identify any potential issues with the data, such as missing values, outliers, or data types that are not supported by the hub.
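The kind of issue detection described above can be sketched in plain Python. This is a hypothetical illustration, not the hub's actual implementation: it counts missing values in a column and flags numeric outliers using the common 1.5 × IQR rule.

```python
import statistics

# Hypothetical sketch of a pre-analysis scan, assuming each column
# arrives as a list of raw string values.
def scan_column(values):
    """Report missing values and numeric outliers (1.5 x IQR rule)."""
    missing = sum(1 for v in values if v in ("", None))
    numbers = []
    for v in values:
        try:
            numbers.append(float(v))
        except (TypeError, ValueError):
            pass  # non-numeric entries are ignored for outlier detection
    outliers = []
    if len(numbers) >= 4:
        q1, _, q3 = statistics.quantiles(numbers, n=4)
        iqr = q3 - q1
        lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers = [x for x in numbers if x < lo or x > hi]
    return {"missing": missing, "outliers": outliers}

report = scan_column(["1", "2", "", "3", "3", "4", "100"])
```

Here `report` records one missing value and flags `100.0` as an outlier; a real analysis would apply checks like this across every column of the source.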

Our automatic analysis connects to the chosen source to:

  • Determine primary, composite and foreign key constraints (for multi-table datasets)

  • Compute statistical properties of all columns, including data types, cardinality, and distribution of values

  • Infer an appropriate Hazy data type if possible. See Data types.

  • Detect common date formats such as %d%m%Y. More unusual formats may not be detected and will require manual configuration.
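The per-column steps above can be sketched as follows. This is an illustrative approximation, assuming simplified type names and a small list of date formats; Hazy's real type system and format detection are more extensive (see Data types).

```python
from datetime import datetime

# Illustrative list of common formats; the real analysis recognises more.
COMMON_DATE_FORMATS = ["%d%m%Y", "%Y-%m-%d", "%d/%m/%Y"]

def detect_date_format(values):
    """Return the first format that parses every value, else None."""
    for fmt in COMMON_DATE_FORMATS:
        try:
            for v in values:
                datetime.strptime(v, fmt)
            return fmt
        except ValueError:
            continue
    return None

def analyse_column(values):
    """Hypothetical analysis: cardinality, then a simple type guess."""
    result = {"cardinality": len(set(values))}
    fmt = detect_date_format(values)
    if fmt:
        result["type"], result["format"] = "date", fmt
    elif all(v.lstrip("-").isdigit() for v in values):
        result["type"] = "int"
    else:
        result["type"] = "category"
    return result
```

For example, `analyse_column(["01022021", "15082020"])` would report a date column in the `%d%m%Y` format, while `["12", "7", "12"]` would be treated as an integer column with cardinality 2. Formats outside the known list fall through to a generic type, which is why unusual formats need manual configuration.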

The results of data type detection should always be reviewed by someone who understands the data itself.

When reading data from disk or cloud storage, constraints are inferred by analysing the underlying data itself. When reading data from a database, an attempt is first made to query the constraints directly from the database. If no constraints are found, or permission issues prevent them from being queried, the underlying data is analysed to infer the constraints instead.
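The data-side fallback can be sketched like this. It is a minimal, hypothetical illustration of constraint inference, assuming rows are already loaded into memory: a column is a candidate primary key if its values are complete and unique.

```python
# Hypothetical sketch of the fallback path: when constraints cannot be
# queried from the database, infer candidate primary keys from the data
# itself. Composite and foreign keys would need additional passes.
def candidate_primary_keys(rows, columns):
    """Return columns whose values are non-null and unique across rows."""
    candidates = []
    for i, col in enumerate(columns):
        values = [row[i] for row in rows]
        if None not in values and len(set(values)) == len(values):
            candidates.append(col)
    return candidates

rows = [(1, "a", None), (2, "b", "x"), (3, "a", "y")]
print(candidate_primary_keys(rows, ["id", "code", "note"]))  # ["id"]
```

Querying the database first is preferred because declared constraints are authoritative; inference from data can only say that a column happens to be unique in this snapshot, not that it is guaranteed to be.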