Data types
Hazy defines a set of data types that columns from the source data need to be mapped to in order to allow the correct interpretation of the data during training.
The most commonly used types are Category, Datetime, Float, Integer, Foreign Key, ID, and Person.
Age¶
Usage¶
- The underlying data type will most likely be integer. It could also be a string.
- The table contains both an age column and a date of birth column. The date of birth column can also be in a parent table.
- Date of birth and age columns are generated together so that they remain logically consistent, e.g. an 80-year-old person needs a date of birth 40 years before that of a 40-year-old person.
Date of birth column¶
- The column must be in the current table or in a parent table.
- Age will be generated to stay logical to this column.
Reference Date¶
- Defines the date when the age column will be calculated.
- The format needs to be %Y-%m-%d, e.g. 2023-09-25.
- If the date of birth value is 2000-10-25 and the reference date is set to 2023-11-25, the age will be calculated as 23 years.
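The age calculation against a reference date can be sketched as follows. This is a minimal illustration using Python's standard library, not Hazy's implementation; the function name age_at is hypothetical.

```python
from datetime import datetime


def age_at(date_of_birth: str, reference_date: str, fmt: str = "%Y-%m-%d") -> int:
    """Return the age in whole years on the reference date."""
    dob = datetime.strptime(date_of_birth, fmt).date()
    ref = datetime.strptime(reference_date, fmt).date()
    # Subtract one year if the birthday has not yet occurred in the reference year.
    had_birthday = (ref.month, ref.day) >= (dob.month, dob.day)
    return ref.year - dob.year - (0 if had_birthday else 1)


print(age_at("2000-10-25", "2023-11-25"))  # 23
```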
Category¶
Usage¶
- The underlying datatype will most likely be string/varchar or integer. However, it can be almost anything.
- The column must contain values from a fixed-sized list.
- A basic example would be a car_brand column where the column would only contain values from the list ["Audi", "Toyota", "Aston Martin", "Tesla", "Ford"].
- The max_cat parameter describes how many values in the fixed-sized list will be accurately modelled. This parameter can be found on the general settings page and can be increased to model distributions more accurately, or decreased to reduce memory consumption and training time.
Advanced settings¶
- Combination ID: Used to link multiple category columns together so that only combinations of categories observed in the source data will be generated in the synthetic data. See Combination ID for more information.
- Repeat by: See denormalisation handling.
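The constraint that a Combination ID imposes can be illustrated with a short sketch. It only shows the idea of restricting output to observed combinations; the car_model column, the helper name, and the uniform sampling are assumptions made for illustration and do not describe how Hazy models the data.

```python
import random

# Hypothetical source rows; car_model is an invented second category column.
rows = [
    {"car_brand": "Audi", "car_model": "A4"},
    {"car_brand": "Ford", "car_model": "Fiesta"},
    {"car_brand": "Tesla", "car_model": "Model 3"},
    {"car_brand": "Ford", "car_model": "Fiesta"},
]


def observed_combinations(rows, columns):
    """Collect the distinct value combinations seen across the linked columns."""
    return sorted({tuple(row[column] for column in columns) for row in rows})


combos = observed_combinations(rows, ["car_brand", "car_model"])

# Sampling whole combinations means unseen pairs such as ("Tesla", "Fiesta")
# can never appear in the output.
rng = random.Random(0)
synthetic = [rng.choice(combos) for _ in range(5)]
```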
City¶
Usage¶
- The underlying datatype will be string/varchar or integer.
- The table contains location data, including a column that relates to cities.
- The entity_id parameter allows this city column to be linked to a location entity, meaning that the location data across a given record will be coherent.
- The type parameter describes whether to generate regular city names or city codes.
Entity Settings¶
- Locales: List of locales that represent the geographical regions on which the location data is based.
- Mismatch: Handling behaviour when postcodes/zipcodes cannot be matched against the source data during location modelling.
- Number of clusters: Number of clusters to be used for location modelling.
- Territory modelling: Modelling approach used for location data that is less specific than a postcode/zipcode, i.e. country, region, state, district, city.
Constant¶
Usage¶
- The user wishes to replace a column with a single value.
- Can be used as a form of redaction.
- The underlying data type can be anything.
Value¶
- Constant value to replace the column.
- The underlying type will be chosen by parsing the value following this order of preference: integer, float, bool, datetime, timedelta, str. "None" should be supplied if a user wishes to indicate missing data.
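The order of preference above can be sketched as a chain of parsers tried in turn. This is an assumption-laden illustration rather than Hazy's parser: the function names, the accepted bool spellings, the ISO datetime layout, and the HH:MM:SS timedelta layout are all hypothetical choices.

```python
import re
from datetime import datetime, timedelta


def _parse_bool(raw: str) -> bool:
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    raise ValueError(f"not a bool: {raw!r}")


def _parse_timedelta(raw: str) -> timedelta:
    # Assumed HH:MM:SS layout, purely for illustration.
    match = re.fullmatch(r"(\d+):(\d{2}):(\d{2})", raw)
    if match is None:
        raise ValueError(f"not a timedelta: {raw!r}")
    hours, minutes, seconds = (int(part) for part in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def parse_constant(raw: str):
    """Try integer, float, bool, datetime, timedelta in order, then fall back to str."""
    if raw == "None":
        return None  # explicit marker for missing data
    parsers = (int, float, _parse_bool, datetime.fromisoformat, _parse_timedelta)
    for parser in parsers:
        try:
            return parser(raw)
        except ValueError:
            continue
    return raw
```

Because integer is tried first, a value such as 1 is kept as an integer rather than being read as a float or a bool.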
Copy¶
Usage¶
- Used to handle a type of denormalisation. See Copy type for further information.
- The underlying data type can be anything.
Country¶
Usage¶
- The underlying datatype will be string/varchar.
- The table contains location data, including a column that relates to country.
- The entity_id parameter allows this country column to be linked to a location entity, meaning that the location data across a given record will be coherent.
- The type parameter describes whether to generate country names or ISO2/ISO3 codes.
Entity Settings¶
- Locales: List of locales that represent the geographical regions on which the location data is based.
- Mismatch: Handling behaviour when postcodes/zipcodes cannot be matched against the source data during location modelling.
- Number of clusters: Number of clusters to be used for location modelling.
- Territory modelling: Modelling approach used for location data that is less specific than a postcode/zipcode, i.e. country, region, state, district, city.
Currency¶
Usage¶
- The underlying data type is string, int, or float.
- There are relationships to other columns in the data involving currency or currency calculations. For instance, one column contains a collection of currency symbols, e.g. £, £, $, $, $, £, £, and a second column contains amounts (the underlying type can be float, string, or int), e.g. 404, 203, 2305, 432, 102, 50, 60. In this case, the user would set the column containing the amount to currency type, and the currency symbol column would be set to categorical.
- If the underlying data type is string, it can also include decimal and thousand separators, e.g. "£250,609.00", "€250.609,00" or "USD250,609.00".
- If there is a date when an amount was recorded and you want exchange rate variability to be taken into account, you can specify a column of type datetime. This column will help ensure that the right exchange rate is applied based on the date the amount was recorded.
Currency column¶
- Points to the column where the currency units are stored.
- Should be left blank if there are no currency units (i.e. in the case of an int/float underlying data type).
- Should be left blank if currency units are part of the column in the underlying string data type case.
Decimal and thousand separator¶
- Used when the underlying type is a string and separators are present.
- Used to normalise the data.
Date column¶
- Used if a date column is present alongside the amount.
- Used to allow exchange rate information to be included in the modelling.
Currency map¶
- Used to handle non-ISO currency codes.
- Providing a map allows non-standard currency codes to be mapped to their ISO format.
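How the separators and the currency map might work together on string amounts can be sketched as below. This is an illustration only: parse_amount and its parameter names are hypothetical, and the real normalisation performed by Hazy may differ.

```python
import re
from decimal import Decimal


def parse_amount(raw, decimal_sep=".", thousands_sep=",", currency_map=None):
    """Split a string amount into a numeric value and a currency code."""
    # Leading non-digit characters are treated as the currency unit.
    match = re.fullmatch(r"\s*(\D*?)\s*([\d.,]+)\s*", raw)
    if match is None:
        raise ValueError(f"unrecognised amount: {raw!r}")
    unit, digits = match.groups()
    # Drop thousands separators, then switch the decimal separator to ".".
    normalised = digits.replace(thousands_sep, "").replace(decimal_sep, ".")
    # Map non-ISO units (for example a symbol) to an ISO code when a map is given.
    code = (currency_map or {}).get(unit, unit) or None
    return Decimal(normalised), code


print(parse_amount("£250,609.00", currency_map={"£": "GBP"}))
print(parse_amount("€250.609,00", decimal_sep=",", thousands_sep=".",
                   currency_map={"€": "EUR"}))
```

Both calls yield the same numeric value, 250609.00, with the codes GBP and EUR respectively.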
Custom Address¶
Usage¶
- The underlying datatype will be string/varchar or integer.
- The table contains location data, including a column that contains a concatenation of other location information.
- The entity_id parameter allows this custom address column to be linked to a location entity, meaning that the location data across a given record will be coherent.
- The format_string parameter describes which sub-parts of the location entity you would like to concatenate to form this column, along with how you would like them to be concatenated, e.g. "{street_number} {street}, {postcode}".
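The effect of a format string on the sub-parts of a location entity can be illustrated with Python's str.format, which uses the same brace placeholder style. The location values are invented, and whether Hazy evaluates the string with exactly these semantics is an assumption.

```python
# Hypothetical sub-parts of one generated location entity.
location = {
    "street_number": "221B",
    "street": "Baker Street",
    "postcode": "NW1 6XE",
    "city": "London",
}

format_string = "{street_number} {street}, {postcode}"

# Only the sub-parts named in the format string appear in the result.
custom_address = format_string.format(**location)
print(custom_address)  # 221B Baker Street, NW1 6XE
```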
Entity Settings¶
- Locales: List of locales that represent the geographical regions on which the location data is based.
- Mismatch: Handling behaviour when postcodes/zipcodes cannot be matched against the source data during location modelling.
- Number of clusters: Number of clusters to be used for location modelling.
- Territory modelling: Modelling approach used for location data that is less specific than a postcode/zipcode, i.e. country, region, state, district, city.
Datetime¶
Usage¶
- Datatype for datetime columns in databases.
- Datatype for datetime columns in .parquet and .avro file formats.
- For strings in .csv and .csv.gz files conforming to a datetime format string.
Format string¶
- Essential for date parsing in files.
- See strftime.org for format codes.
- Also required for databases, used for exporting synthetic data to files. The format is ignored during data read.
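As an illustration of how a format string drives both parsing and export, the following sketch uses Python's strptime and strftime with the format codes documented on strftime.org. The sample value and the chosen format are invented.

```python
from datetime import datetime

fmt = "%d/%m/%Y %H:%M"  # day/month/year, 24-hour time

# Reading: the string is only interpretable once the format is known.
parsed = datetime.strptime("25/09/2023 14:30", fmt)
print(parsed.isoformat())  # 2023-09-25T14:30:00

# Writing: the same format string renders the value back to text.
exported = parsed.strftime(fmt)
print(exported)  # 25/09/2023 14:30
```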
Advanced settings¶
- Formulas: The datetime can be derived using a formula that might combine multiple columns and constants. The other columns involved will most likely be a blend of Datetime and Timedelta types.
- Bounds: Datetimes can be bounded, i.e. have a minimum or maximum value set only by values in another column. For example, if an account_requested date is always before an account_opened date, this is the setting to use.
- Repeat by: See denormalisation handling.
District¶
Usage¶
- The underlying datatype will be string/varchar or integer.
- The table contains location data, including a column that relates to districts.
- The entity_id parameter allows this district column to be linked to a location entity, meaning that the location data across a given record will be coherent.
- The type parameter describes whether to generate regular district names or district codes.
Entity Settings¶
- Locales: List of locales that represent the geographical regions on which the location data is based.
- Mismatch: Handling behaviour when postcodes/zipcodes cannot be matched against the source data during location modelling.
- Number of clusters: Number of clusters to be used for location modelling.
- Territory modelling: Modelling approach used for location data that is less specific than a postcode/zipcode, i.e. country, region, state, district, city.
Driving license number¶
Usage¶
- The source data is not used when this type is configured, and the driving licenses generated will be completely random.
- Alternatively, the source data can be used and the license numbers are randomly generated, but the counts match the distribution of the source data.
- In the latter case, epsilon still affects the amount of noise injected into the distribution.
Driving license locales¶
- Generate driving license numbers from a set of specified country locales.
- Multiple locales can be provided.
Preserve distribution¶
- If on, the license numbers are randomly generated, but the counts of each license match the distribution of the source data.
- If off, the driving licenses generated will be completely random and won't match the source distribution.
Advanced Settings¶
- Repeat by: See denormalisation handling.
Email¶
Usage¶
- The underlying datatype will be string/varchar.
- The table contains data relating to a person, including a column that relates to emails.
- The entity_id parameter allows this email column to be linked to a person entity, meaning that the person data across a given record will be coherent.
Entity Settings¶
- Locales: List of locales that represent the geographical regions that the person data is based on.
Float¶
Usage¶
- Used for floating point data in databases.
- Used for floating point data in .avro and .parquet file formats.
- Used for strings in files which resemble floating point numbers. For example, 1e4, 1.0, 1.00, -1.0, 1.2e2 would all be examples that should be defined by the Float type.
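When inspecting string columns in files, a quick way to see whether values resemble floats or integers is to try the built-in conversions. This helper is a hypothetical heuristic for choosing between the Float and Integer types, not a description of any detection that Hazy performs.

```python
def classify_numeric_string(text: str):
    """Return "Integer", "Float", or None for a string value."""
    try:
        int(text)
        return "Integer"
    except ValueError:
        pass
    try:
        float(text)
        return "Float"
    except ValueError:
        return None


for value in ["1e4", "1.0", "1.00", "-1.0", "1.2e2", "10", "-1"]:
    print(value, classify_numeric_string(value))
```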
Advanced settings¶
- Formulas: The float columns can be derived using a formula that might combine multiple columns and constants. Other columns involved in formulas will most likely be float or integer data types.
- Bounds: Float columns can be bounded, i.e. have a minimum or maximum value set by either another column or a constant.
- Repeat by: See denormalisation handling.
Foreign Key¶
Usage¶
- This is a column that points to a primary key in a parent table linking the two tables together.
- These types are not configurable through the Column page of the UI, but are instead automatically generated when a foreign key relationship is defined on the Structure page.
- Foreign keys can also be defined as primary keys for instances where there is a 1:1 relationship between tables, or when the foreign key column forms part of a composite primary key.
Gender¶
Usage¶
- The underlying datatype will most likely be string/varchar or possibly integer/enum.
- Accounts for different types of gender encoding.
Example¶
genero | nombre |
---|---|
hombre | Ricardo |
mujer | Marta |
otro | Carmen |
hombre | José |
... | ... |
In this example, genero would be set as the Gender type with the gender map hombre=m, mujer=f, otro=o.
The nombre column would be set as Person Type, with the settings sub part type=first_name, person_id=1 (where the 1 would be autogenerated when creating a new person entity), and person_gender=genero.
Gender map¶
- A mapping can be provided from labels in the user's database to the standardised values m, f, and o (male, female, and other).
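The gender map from the example above behaves like a simple lookup from source labels to the standardised codes. The sketch below shows that idea with a plain dictionary; the function name is hypothetical and the way unknown labels are handled (a KeyError here) is an assumption.

```python
# Source labels from the example table mapped to the standardised codes.
gender_map = {"hombre": "m", "mujer": "f", "otro": "o"}


def standardise_gender(values, gender_map):
    """Translate source labels to m / f / o; unknown labels raise KeyError."""
    return [gender_map[value] for value in values]


print(standardise_gender(["hombre", "mujer", "otro", "hombre"], gender_map))
# ['m', 'f', 'o', 'm']
```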
Advanced settings¶
- Repeat by: See denormalisation handling.
Entity Settings¶
- Locales: List of locales that represent the geographical regions that the person data is based on.
ID¶
Usage¶
- The underlying data type will most likely be string or integer.
- These columns will be uniquely generated to match certain patterns.
- Many different ID type formats are supported.
- This type can be a primary key.
ID type¶
Used to specify the type of ID generated. The different types of ID are described here in more detail.
Most formats do not rely on the underlying data. Some formats such as Conditioned and ID mixture can be used to model distributions of IDs within the source data.
Advanced settings¶
- Repeat by: See denormalisation handling.
Integer¶
Usage¶
- Used for integer data in databases.
- Used for integer data in .avro and .parquet file formats.
- Used for strings in files which resemble integer numbers. For example, 1, 10, and -1 would all be examples that should be defined by the Integer type.
Advanced settings¶
- Formulas: The integer columns can be derived using a formula that might combine multiple columns and constants. Other columns involved in formulas will most likely be float or integer data types.
- Bounds: Integer columns can be bounded, i.e. have a minimum or maximum value set by either another column or a constant.
- Repeat by: See denormalisation handling.
List¶
Usage¶
- Used when the source data contains information in the form of a list of items within a single cell/column. These items are typically presented in the form of a comma-separated list and may include tags, keywords, or other enumerated data.
- Internally, the items within the list are separated and treated as individual categorical columns. However, the data will be returned as a single column, mirroring its format in the source data.
- The underlying data type is string since it deals with text-based lists.
Example¶
ID | Company | Industries |
---|---|---|
1 | ABC Inc. | AI, Data, Banking, Finance |
2 | XYZ Plc. | Banking, Finance |
3 | DEF Ltd. | Data, SaaS, eLearning |
In this example, the column Industries would be configured using the list type. During training, the items in Industries will be split and modelled as separate category columns. Using this list type will give the ability to return a column of lists that aren't exact copies of the lists in source data but maintain a sense of coherence by grouping related items together.
Separator¶
- You can specify the separator in the list. By default, it uses a comma (,), which is common for comma-separated lists. However, you can customize this separator to match the specific format of your data.
Maximum columns to split¶
- To control the number of columns the data is split into, you can specify the maximum number of columns. Any additional columns beyond this limit will be discarded. For example, if the maximum number was set to 4 in the example above, the items AI, Data, Banking, and Finance will each be placed in their own individual columns. However, if the maximum column count is restricted to 3, AI, Data, and Banking will be placed in individual columns, while Finance will be dropped.
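The combined effect of the separator and the maximum number of columns can be sketched as follows. The function name is hypothetical and the whitespace trimming is an assumption; the sketch only mirrors the splitting and truncation described above.

```python
def split_list_cell(cell: str, separator: str = ",", max_columns=None):
    """Split one cell into items, keeping at most max_columns of them."""
    items = [item.strip() for item in cell.split(separator) if item.strip()]
    # Items beyond the limit are discarded.
    return items if max_columns is None else items[:max_columns]


print(split_list_cell("AI, Data, Banking, Finance", max_columns=4))
# ['AI', 'Data', 'Banking', 'Finance']
print(split_list_cell("AI, Data, Banking, Finance", max_columns=3))
# ['AI', 'Data', 'Banking']
```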
Mapped¶
Usage¶
- Used for categorical columns containing sensitive labels that you wish to obfuscate.
- An ID sampler is configured to replace the source category values.
- It generates the same frequency distribution as the source data; however, the values are replaced by samples from one of our ID samplers.
- See this example for why it might be used.
- A list of ID samplers that can be used for value replacement can be found here.
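The idea of replacing labels while keeping their frequencies can be sketched with a consistent one-to-one mapping. The sequential sampler below is a hypothetical stand-in for an ID sampler, and direct relabelling of source rows is a simplification: the actual type generates synthetic data rather than relabelling the source.

```python
from collections import Counter
from itertools import count


def make_sequential_sampler(prefix: str = "CAT"):
    """Hypothetical ID sampler that yields CAT-001, CAT-002, ..."""
    counter = count(1)
    return lambda: f"{prefix}-{next(counter):03d}"


def obfuscate(values, sampler):
    """Replace each distinct label with one sampled ID, reused consistently."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = sampler()
    return [mapping[value] for value in values]


source = ["diabetes", "asthma", "asthma", "diabetes", "asthma"]
replaced = obfuscate(source, make_sequential_sampler())
print(replaced)
# ['CAT-001', 'CAT-002', 'CAT-002', 'CAT-001', 'CAT-002']
```

The labels change but the counts per label (2 and 3 here) are unchanged.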
Name¶
Usage¶
- The underlying datatype will be string/varchar.
- The table contains data relating to a person, including a column that relates to names.
- The entity_id parameter allows this name column to be linked to a person entity, meaning that the person data across a given record will be coherent.
- The type parameter describes what type of name to generate, e.g. first_name, last_name, full_name.
- The format_string parameter can be used alongside the custom_name name type to specify a bespoke format of the name, for instance {first_name:.1}. {last_name} if you wanted names such as J. Smith, D. Zemlak, B. Purdy.
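The format string {first_name:.1}. {last_name} follows the same syntax as Python's str.format, where a precision of .1 truncates a string to its first character. The sketch below demonstrates that behaviour; the first names are invented, and whether Hazy supports every Python format specifier is an assumption.

```python
format_string = "{first_name:.1}. {last_name}"

# Hypothetical generated person entities.
people = [
    {"first_name": "John", "last_name": "Smith"},
    {"first_name": "Dana", "last_name": "Zemlak"},
    {"first_name": "Blake", "last_name": "Purdy"},
]

# ".1" keeps only the first character of first_name.
names = [format_string.format(**person) for person in people]
print(names)  # ['J. Smith', 'D. Zemlak', 'B. Purdy']
```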
Entity Settings¶
- Locales: List of locales that represent the geographical regions that the person data is based on.
Passport¶
Usage¶
- Used for generating passport numbers, either for a single country or multiple countries which can be specified in the configuration.
- When configured with the Passport type, source data is not used. Instead, passport numbers are randomly generated using known passport regex patterns for the selected countries.
Passport countries¶
The countries used to generate the passport numbers can be configured in two different ways:
- It is possible to choose a list of countries using the type's countries field.
- Alternatively, you can set a related country column using the type's country_column field. This column is typically part of a location entity or just a category column. When setting a country column, it's often necessary to provide an accompanying country map to map each value within the column to a recognizable 2-letter country code.
If both a list of countries and a country column are provided, the countries found in the country column will take precedence over the ones specified in the countries list.
Advanced Settings¶
- Repeat by: See denormalisation handling.
Percentage¶
Usage¶
- The Percentage type is designed to handle numerical columns that may contain a % symbol, comma separators, and decimal points, such as %10,000.
- The underlying data type is string.
Decimal and thousand separator¶
- When separators are present, you can specify the symbols used to denote the decimal and thousands separators. These symbols are used to normalize the data.
Advanced settings¶
- Repeat by: See denormalisation handling.
Postcode¶
Usage¶
- The underlying datatype will be string/varchar or integer.
- The table contains location data, including a column that relates to postcodes/zipcodes.
- The entity_id parameter allows this postcode/zipcode column to be linked to a location entity, meaning that the location data across a given record will be coherent.
- The type parameter describes whether to generate regular postcodes/zipcodes, incodes or outcodes.
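For UK-style postcodes, the outcode is the part before the space and the incode is the final three characters. The sketch below shows that split; the function name is hypothetical and the split does not apply to locales whose postcodes have no such structure.

```python
def split_uk_postcode(postcode: str):
    """Return (outcode, incode) for a UK-style postcode."""
    compact = postcode.replace(" ", "").upper()
    # The incode is always the last three characters.
    return compact[:-3], compact[-3:]


print(split_uk_postcode("SW1A 1AA"))  # ('SW1A', '1AA')
print(split_uk_postcode("m1 1ae"))    # ('M1', '1AE')
```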
Entity Settings¶
- Locales: A set of country locales that represent all of the countries that exist in the column.
- Mismatch: Handling behaviour when postcodes/zipcodes cannot be matched against the source data during location modelling.
- Number of clusters: Number of clusters to be used for location modelling.
- Territory modelling: Modelling approach used for location data that is less specific than a postcode/zipcode, i.e. country, region, state, district, city.
Person¶
Usage¶
- For modelling names of people, email addresses, and user names.
- A full example can be found here.
Person ID¶
- For linking names belonging to the same person together.
- All constituent parts relating to a single person will be generated together.
Person sub part¶
- For picking the type of name, i.e. First Name/Last Name/Full Name.
- Multiple parts can be provided.
Format string¶
- Used to specify a bespoke format of the name, for instance {first_name:.1}. {last_name} if you wanted names such as A. Chiles, D. Zemlak, B. Purdy.
- You must select the matching sub-parts that make up the format string.
Entity Settings¶
- Locales: A set of country locales for names to be picked from. More than one locale can be selected.
- Person Title: For pointing to a title column which can be in a parent table or in the same table.
- Person Gender: For pointing to a gender column configured with the Gender type.
Raw¶
Usage¶
- The underlying type can be anything, most likely string.
- A type set as Raw must have a custom handler configured on the General Settings page.
- It is designed to handle bespoke use cases and requires a high level of understanding to be able to configure custom handlers. Talk to a Hazy engineer if you think you might need to use them.
Real¶
If you are at all concerned about private information in this column do not use this data type!
Usage¶
- Can only be used inside a Reference table.
- Used for replicating a column exactly.
- The underlying data type can be anything.
- For instance, inside a Reference table you might have a URL column you wish to be replicated exactly. Since each value is unique, setting it as any other type such as Category has no statistical value.
- If you believe there is statistical value in the column, i.e. the values are not unique and there is some distribution present, consider using one of the other types, e.g. Category, Datetime, Float, Integer.
- This presents a privacy risk so should only be used with columns the user does not care about applying privacy to.
Region¶
Usage¶
- The underlying datatype will be string/varchar or integer.
- The table contains location data, including a column that relates to regions.
- The entity_id parameter allows this region column to be linked to a location entity, meaning that the location data across a given record will be coherent.
- The type parameter describes whether to generate regular region names or region codes.
Entity Settings¶
- Locales: A set of country locales that represent all of the countries that exist in the column.
- Mismatch: Handling behaviour when postcodes/zipcodes cannot be matched against the source data during location modelling.
- Number of clusters: Number of clusters to be used for location modelling.
- Territory modelling: Modelling approach used for location data that is less specific than a postcode/zipcode, i.e. country, region, state, district, city.
State¶
Usage¶
- The underlying datatype will be string/varchar or integer.
- The table contains location data, including a column that relates to states.
- The entity_id parameter allows this state column to be linked to a location entity, meaning that the location data across a given record will be coherent.
- The type parameter describes whether to generate regular state names or state codes.
Entity Settings¶
- Locales: A set of country locales that represent all of the countries that exist in the column.
- Mismatch: Handling behaviour when postcodes/zipcodes cannot be matched against the source data during location modelling.
- Number of clusters: Number of clusters to be used for location modelling.
- Territory modelling: Modelling approach used for location data that is less specific than a postcode/zipcode, i.e. country, region, state, district, city.
Street Address¶
Usage¶
- The underlying datatype will be string/varchar or integer.
- The table contains location data, including a column that relates to street addresses.
- The entity_id parameter allows this street address column to be linked to a location entity, meaning that the location data across a given record will be coherent.
- The type parameter describes whether to generate door numbers, floor numbers, street numbers or street names.
Entity Settings¶
- Locales: A set of country locales that represent all of the countries that exist in the column.
- Mismatch: Handling behaviour when postcodes/zipcodes cannot be matched against the source data during location modelling.
- Number of clusters: Number of clusters to be used for location modelling.
- Territory modelling: Modelling approach used for location data that is less specific than a postcode/zipcode, i.e. country, region, state, district, city.
Symbol¶
Usage¶
- The underlying data type is string.
- Often used to deal with file data sources, however, data like this is sometimes present in a database.
- Allows support for numerical columns with a symbol leading or trailing the numerical value, as well as thousand separators and decimal points such as VAL10,000.
- If the data represents a percentage, the more specific Percentage type should be used.
- If the data represents a currency, the more specific Currency type should be used.
Symbol/pattern¶
- Leading or trailing pattern to strip away.
- In the case of values such as VAL4.0, VAL89.3, the pattern would be VAL.
Decimal and thousand separator¶
- Used when the data does not use the standard . to indicate a decimal place or , to indicate thousands.
- Used to normalise the data.
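Stripping the pattern and normalising the separators can be sketched as below. The function and parameter names are hypothetical and the sketch is only an approximation of the normalisation described above, not Hazy's implementation.

```python
def parse_symbol_value(raw: str, pattern: str, decimal_sep: str = ".",
                       thousands_sep: str = ",") -> float:
    """Strip a leading or trailing pattern, then normalise the separators."""
    text = raw.strip()
    if text.startswith(pattern):
        text = text[len(pattern):]
    elif text.endswith(pattern):
        text = text[:-len(pattern)]
    # Remove thousands separators, then switch the decimal separator to ".".
    text = text.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(text)


print(parse_symbol_value("VAL10,000", "VAL"))  # 10000.0
print(parse_symbol_value("VAL89.3", "VAL"))    # 89.3
```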
Advanced settings¶
- Repeat by: See denormalisation handling.
Timedelta¶
Usage¶
- The underlying data type is integer.
- Used to represent periods of time, for example seconds, minutes, hours, or days.
- Normally used when the column concerned interacts with Datetime type columns via formulas.
- If the column interacts with no other columns, the Integer type is usually sufficient.
Time delta unit¶
- Unit of time the column represents.
Advanced settings¶
- Formulas: timedelta columns can be derived using a formula that might combine multiple columns and constants. Other columns involved in formulas will most likely be datetime or timedelta data types. Typically this column will be used inside the formula of a datetime column. A timedelta column can be added to a datetime column.
- Bounds: Timedelta columns can be bounded, i.e. have a minimum or maximum value set by either another column or a constant.
- Repeat by: See denormalisation handling.
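How an integer timedelta column with a unit combines with a datetime column can be sketched using Python's datetime module. The processing_days name and the helper are hypothetical; account_requested and account_opened are borrowed from the Datetime bounds example. Formula syntax in Hazy itself is not shown here.

```python
from datetime import datetime, timedelta


def add_timedelta(start: datetime, amount: int, unit: str) -> datetime:
    """Add an integer amount of the given unit (seconds, minutes, hours, days)."""
    return start + timedelta(**{unit: amount})


account_requested = datetime(2023, 9, 25, 9, 0)
processing_days = 3  # integer column value with time delta unit "days"

account_opened = add_timedelta(account_requested, processing_days, "days")
print(account_opened)  # 2023-09-28 09:00:00
```

With a non-negative timedelta, the derived account_opened value can never precede account_requested.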
Title¶
Usage¶
- The underlying datatype will be string/varchar.
- The table contains data relating to a person, including a column that relates to titles, e.g. "Mr", "Mrs", "Ms", etc.
- The entity_id parameter allows this title column to be linked to a person entity, meaning that the person data across a given record will be coherent.
Entity Settings¶
- Locales: List of locales that represent the geographical regions that the person data is based on.
Username¶
Usage¶
- The underlying datatype will be string/varchar.
- The table contains data relating to a person, including a column that relates to usernames.
- The entity_id parameter allows this username column to be linked to a person entity, meaning that the person data across a given record will be coherent.
Entity Settings¶
- Locales: List of locales that represent the geographical regions that the person data is based on.