Formula Settings

class hazy_configurator.settings.formula_setting.FormulaSetting

Bases: HazyBaseModel

This class specifies the formula used to compute the data for a column.

Fields:
field expression: str [Required]

Formula to apply to the column.

Some examples of valid formulas are "a + b + c" or "if(is_last(x), y, z)". Note that the formula cannot contain static values such as integers or strings. If a static value is required, supply it through the column map instead. See Expression syntax for available syntax and examples.

field column_map: Dict[str, Union[str, StaticValue, ColId]] = {}

A mapping between variables defined in the formula and the columns of the provided data.

Each dictionary value can be a string specifying the name of a column, a ColId object (for when the column exists in a different table to the target column), or a static value. A static value should be given in the format StaticValue(value="30 days, 2 hours", dtype="timedelta"): the type is always static, value is the value itself, and dtype is a required parameter used to convert the provided value into a datatype. The dtype can be "string", "float", "integer", "boolean", "datetime" or "timedelta".
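
As a rough illustration of how the expression and column_map fit together, the sketch below writes the two fields out as a plain Python dictionary. The column names are hypothetical, and the dictionary form of the static value (keys type, value and dtype) is inferred from the description above rather than confirmed; check the StaticValue reference before relying on it.

```python
# Hypothetical configuration fragment. Only the field names "expression" and
# "column_map" and the dtype names come from the documentation above.
formula_setting = {
    # The formula uses variables only, never literal numbers or strings.
    "expression": "net + tax + fee",
    "column_map": {
        # Plain strings name columns in the same table as the target column.
        "net": "net_amount",
        "tax": "tax_amount",
        # Static value in dictionary form (assumed keys: type, value, dtype).
        "fee": {"type": "static", "value": 2.5, "dtype": "float"},
    },
}

# Every variable used in the expression needs an entry in column_map.
variables = {"net", "tax", "fee"}
missing = variables - set(formula_setting["column_map"])
assert not missing, f"unmapped variables: {missing}"
```

Routing the constant 2.5 through the column map keeps the formula itself free of static values, as the expression field requires.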

When using “timedelta” in the column_map the following units can be used in strings:

  • W

  • D / days / day

  • hours / hour / hr / h

  • m / minute / min / minutes / T

  • S / seconds / sec / second

  • ms / milliseconds / millisecond / milli / millis / L

  • us / microseconds / microsecond / micro / micros / U

  • ns / nanoseconds / nano / nanos / nanosecond / N

It is recommended to use these in a comma separated list. Some examples are:

  • 30 days, 2 hours

  • 1W - i.e. 1 week

  • 1 hour, 30 mins, 30 seconds
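
To make the meaning of these strings concrete, here is a minimal sketch that converts a comma-separated string into a standard-library timedelta using the unit aliases listed above. It is not the parser the library uses; parse_timedelta and the lookup tables are hypothetical names, and nanosecond units are omitted because datetime.timedelta only has microsecond resolution.

```python
import re
from datetime import timedelta

# Unit aliases copied from the list above, keyed by datetime.timedelta keyword.
UNIT_ALIASES = {
    "weeks": ["W"],
    "days": ["D", "days", "day"],
    "hours": ["hours", "hour", "hr", "h"],
    "minutes": ["m", "minute", "min", "minutes", "T"],
    "seconds": ["S", "seconds", "sec", "second"],
    "milliseconds": ["ms", "milliseconds", "millisecond", "milli", "millis", "L"],
    "microseconds": ["us", "microseconds", "microsecond", "micro", "micros", "U"],
}
# Reverse lookup: alias -> timedelta keyword (case-sensitive, e.g. "m" vs "ms").
ALIAS_TO_KEYWORD = {
    alias: keyword for keyword, aliases in UNIT_ALIASES.items() for alias in aliases
}
# One component: a number, optional whitespace, then a unit name.
PART = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)")


def parse_timedelta(text: str) -> timedelta:
    """Sum the comma-separated components of a duration string."""
    total = timedelta()
    for part in text.split(","):
        match = PART.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"cannot parse component {part!r}")
        amount, unit = match.groups()
        if unit not in ALIAS_TO_KEYWORD:
            raise ValueError(f"unknown unit {unit!r}")
        total += timedelta(**{ALIAS_TO_KEYWORD[unit]: float(amount)})
    return total


thirty_days = parse_timedelta("30 days, 2 hours")
one_week = parse_timedelta("1W")
```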

When using a “datetime” value in the column_map it is recommended to use isoformat following YYYY-MM-DD[*HH[:MM[:SS[.fff[fff]]]][+HH:MM[:SS[.ffffff]]]] where * can match any single character.

Some examples are:

  • 2011-11-04

  • 2011-11-04T00:05:23

  • 2011-11-04 00:05:23.283

  • 2011-11-04 00:05:23.283+00:00

  • 2011-11-04T00:05:23+04:00
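
The pattern above matches the format accepted by Python's datetime.fromisoformat, so one way to check a value before placing it in the column_map is to parse it with the standard library. Whether the library itself parses values with this function is an assumption; the sketch only shows that each example string is a valid ISO-format datetime.

```python
from datetime import datetime

# The example strings listed above.
examples = [
    "2011-11-04",
    "2011-11-04T00:05:23",
    "2011-11-04 00:05:23.283",
    "2011-11-04 00:05:23.283+00:00",
    "2011-11-04T00:05:23+04:00",
]

# fromisoformat raises ValueError for strings that do not follow the pattern.
parsed = {text: datetime.fromisoformat(text) for text in examples}

for text, value in parsed.items():
    # tzinfo is None for the naive examples and a fixed offset for the others.
    print(f"{text} -> {value.isoformat()} (tzinfo={value.tzinfo})")
```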

field condition: str = None

Query condition used to apply the formula to only the subset of the data satisfying the specified condition.

The following are examples of condition syntax, where a, b, c, d and col with spaces are column names. Note that backticks ` should be used around column names that contain spaces:

  • (a < b) & (b < c) i.e. apply formula when column b is between a and c.

  • a not in b i.e. apply formula when the value in column a is not a value in column b.

  • a in b and c < d i.e. apply formula when the value in column a is in column b and value in c is less than value in d.

  • a in (b + c + d) i.e. apply formula when value in a is in a column which is the sum of b, c and d.

  • b == ["a", "b", "c"] i.e. apply formula when column b is equal to the value “a”, “b” or “c”. Notice quotes are used for possible values.

  • c != [1, 2] i.e. apply formula when column c is not equal to 1 or 2.

  • [1, 2] in c i.e. apply formula when column c is equal to 1 or 2. Can also be written c == [1, 2].

  • `col with spaces` < b i.e. apply formula when values in column col with spaces are less than values in column b.
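
The sketch below spells out, in plain Python over a small hypothetical table, which rows three of the conditions above would select. It illustrates the semantics as described in the bullets, reading "a value in column b" as membership in the whole column; it does not use the library's own condition evaluator.

```python
# A tiny hypothetical table, one dictionary per row.
rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 2, "b": 5, "c": 2},
    {"a": 5, "b": 6, "c": 9},
]

# (a < b) & (b < c): a row-wise comparison, true when b lies between a and c.
between = [r["a"] < r["b"] and r["b"] < r["c"] for r in rows]

# a in b: the value in column a is tested against all values of column b,
# not only the value of b in the same row.
b_values = {r["b"] for r in rows}
a_in_b = [r["a"] in b_values for r in rows]

# c != [1, 2]: true when the value in column c is none of the listed values.
c_not_listed = [r["c"] not in (1, 2) for r in rows]

# The formula would be applied only to rows where the mask is True.
selected = [r for r, keep in zip(rows, between) if keep]
```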

field model_error: bool = False

When set to True the system examines the source data to see if there is any difference between the result of the calculation in the source data and the actual value seen in the source data. If there is a difference then this difference will be modelled and the difference replicated in the synthetic data.

An example of where this is useful is bank transaction data in which each transaction row includes the account balance, which is the sum of the transactions so far plus a starting balance. In this case the error is the starting balance, so if this parameter is set to True the synthetic data will have a similar distribution of starting balances. The system either models the difference for each row or, if the error in the source is constant for all the records in a sequence, adds a constant error over the sequence in the synthetic data.
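
The bank balance case can be worked through with a few numbers. The sketch below uses made-up transactions for a single account and computes the error as the observed balance minus the running sum; it only illustrates what the "error" is in this scenario, not how the system models it.

```python
from itertools import accumulate

# Hypothetical source data for one account (one sequence).
transactions = [50.0, -20.0, 100.0, -30.0]
observed_balance = [1050.0, 1030.0, 1130.0, 1100.0]

# What a running-sum formula would compute from the transactions alone.
computed_balance = list(accumulate(transactions))

# The error is the gap between the source value and the formula result.
errors = [obs - calc for obs, calc in zip(observed_balance, computed_balance)]

# Here the error is the same on every row: it is the starting balance.
error_is_constant = len(set(errors)) == 1
starting_balance = errors[0]
```

Because the error is constant over the sequence, this is the case in which a single constant error would be added across the whole sequence in the synthetic data.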

field group_by: List[Union[str, ExpressionConfig]] = []

When provided, the data is grouped according to the provided set of keys or expressions before applying the formula to each group.

This parameter must be provided as a list of either:

  • Column name as a string

  • A dictionary specifying an expression. For example, an expression could be used to group rows by the month of the year based on a datetime column. The dictionary has the following form, shown here with a simple sum of two columns:

{
    "type": "expression",
    "expression": "a+b",
    "column_map": {
        "a": "col_1",
        "b": "col_2",
    }
}

field sort_by: List[Union[str, ExpressionConfig]] = []

When provided the data is sorted by the provided columns or expressions before applying the formula.

This parameter follows the same syntax as group_by.
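
To show why sorting and grouping matter for order-dependent formulas, the following plain-Python sketch sorts a small hypothetical table by day, groups it by account, and computes a running total within each group. The column names are made up, and applying the sort before the grouping is an assumption; the sketch mimics the effect of sort_by=["day"] and group_by=["account"] rather than calling the library.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Hypothetical rows, deliberately out of order.
rows = [
    {"account": "B", "day": 2, "amount": 5},
    {"account": "A", "day": 2, "amount": -3},
    {"account": "A", "day": 1, "amount": 10},
    {"account": "B", "day": 1, "amount": 7},
]

# Equivalent of sort_by=["day"]: order the rows before the formula runs.
ordered = sorted(rows, key=itemgetter("day"))

# Equivalent of group_by=["account"]: sorted() is stable, so the day order
# is preserved inside each account after this second sort.
by_account = sorted(ordered, key=itemgetter("account"))

running_total = {}
for account, group in groupby(by_account, key=itemgetter("account")):
    # The formula (here a running sum) is applied to each group separately.
    running_total[account] = list(accumulate(r["amount"] for r in group))
```

Without the sort, account A's running total would start from the day 2 transaction, and without the grouping the totals of the two accounts would be mixed together.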