Formula Settings
- class hazy_configurator.settings.formula_setting.FormulaSetting
Bases: HazyBaseModel
This class specifies the formula used to compute the values of the target column.
- Fields:
- field expression: str [Required]
Formula to apply to the column.
Some examples of valid formulas are "a + b + c" or "if(is_last(x), y, z)". Note that the formula cannot contain static values such as integers or strings. If one is required, map a variable to a static value in the column map. See Expression syntax for available syntax and examples.
- field column_map: Dict[str, Union[str, StaticValue, ColId]] = {}
A mapping between variables defined in the formula and the columns of the provided data.
Each dictionary value can be a string naming a column, a ColId object (for when the column exists in a different table from the target column), or a dictionary specifying a static value. A static value should be given in the format
StaticValue(value="30 days, 2 hours", dtype="timedelta")
where the type is always static, value is the value itself, and dtype is a required parameter used to convert the provided value into a datatype. The dtype can be "string", "float", "integer", "boolean", "datetime" or "timedelta".
When using “timedelta” in the column_map, the following units can be used in strings (aliases for the same unit are separated by /):
  - W
  - D / days / day
  - hours / hour / hr / h
  - m / minute / min / minutes / T
  - S / seconds / sec / second
  - ms / milliseconds / millisecond / milli / millis / L
  - us / microseconds / microsecond / micro / micros / U
  - ns / nanoseconds / nano / nanos / nanosecond / N
It is recommended to combine these in a comma-separated list. Some examples are:
  - 30 days, 2 hours
  - 1W (i.e. 1 week)
  - 1 hour, 30 mins, 30 seconds
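To make the unit table concrete, the following is a minimal sketch of how such comma-separated strings could be interpreted. It is not the library's parser: the function name `timedelta_to_ns`, the case-sensitive matching, and the integer-only amounts are assumptions made for illustration, and the alias `mins` is included only because it appears in the examples above.

```python
import re

# Unit sizes expressed in nanoseconds.
US = 1_000
MS = 1_000 * US
SEC = 1_000 * MS

# Aliases copied from the unit list above ("mins" is taken from the examples).
_ALIASES_BY_SIZE = {
    7 * 24 * 3600 * SEC: ["W"],
    24 * 3600 * SEC: ["D", "days", "day"],
    3600 * SEC: ["hours", "hour", "hr", "h"],
    60 * SEC: ["m", "minute", "min", "minutes", "T", "mins"],
    SEC: ["S", "seconds", "sec", "second"],
    MS: ["ms", "milliseconds", "millisecond", "milli", "millis", "L"],
    US: ["us", "microseconds", "microsecond", "micro", "micros", "U"],
    1: ["ns", "nanoseconds", "nano", "nanos", "nanosecond", "N"],
}
UNIT_TO_NS = {
    alias: size for size, aliases in _ALIASES_BY_SIZE.items() for alias in aliases
}

# One "<integer><optional space><unit>" component, e.g. "30 days" or "1W".
_PART = re.compile(r"(\d+)\s*([A-Za-z]+)")


def timedelta_to_ns(text: str) -> int:
    """Return the total duration of a comma-separated string in nanoseconds."""
    total = 0
    for part in text.split(","):
        match = _PART.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"cannot parse component: {part!r}")
        amount, unit = match.groups()
        if unit not in UNIT_TO_NS:
            raise ValueError(f"unknown unit: {unit!r}")
        total += int(amount) * UNIT_TO_NS[unit]
    return total


two_hours_past_30_days = timedelta_to_ns("30 days, 2 hours")
one_week = timedelta_to_ns("1W")
```

Nanoseconds are returned as an integer because Python's `datetime.timedelta` only resolves to microseconds.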
When using a “datetime” value in the column_map, it is recommended to use ISO format following YYYY-MM-DD[*HH[:MM[:SS[.fff[fff]]]][+HH:MM[:SS[.ffffff]]]], where * can match any single character. Some examples are:
  - 2011-11-04
  - 2011-11-04T00:05:23
  - 2011-11-04 00:05:23.283
  - 2011-11-04 00:05:23.283+00:00
  - 2011-11-04T00:05:23+04:00
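The format string above matches the one accepted by Python's standard `datetime.fromisoformat`, so the example values can be checked locally before they are placed in a column_map. Whether the library itself uses this function is an assumption; the sketch only shows that each example is a valid ISO-format string.

```python
from datetime import datetime, timedelta

# The example values listed above.
examples = [
    "2011-11-04",
    "2011-11-04T00:05:23",
    "2011-11-04 00:05:23.283",
    "2011-11-04 00:05:23.283+00:00",
    "2011-11-04T00:05:23+04:00",
]

# fromisoformat raises ValueError for strings that do not follow the format.
parsed = [datetime.fromisoformat(text) for text in examples]

# Values without an offset are naive; values with one carry a UTC offset.
has_offset = [value.tzinfo is not None for value in parsed]
```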
- field condition: str = None
Query condition that restricts the formula to the subset of the data satisfying it.
Examples of condition syntax, where a, b, c, d and col with spaces are column names. Note that backticks (`) should be used around column names which contain spaces:
  - (a < b) & (b < c) i.e. apply formula when column b is between a and c.
  - a not in b i.e. apply formula when the value in column a is not a value in column b.
  - a in b and c < d i.e. apply formula when the value in column a is in column b and the value in c is less than the value in d.
  - a in (b + c + d) i.e. apply formula when the value in a is in a column which is the sum of b, c and d.
  - b == ["a", "b", "c"] i.e. apply formula when column b is equal to the value “a”, “b” or “c”. Notice quotes are used for possible values.
  - c != [1, 2] i.e. apply formula when column c is not equal to 1 or 2.
  - [1, 2] in c i.e. apply formula when column c is equal to 1 or 2. Can also be written c == [1, 2].
  - `col with spaces` < b i.e. apply formula when the values in column col with spaces are less than the values in column b.
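The semantics described for three of the conditions above can be spelled out in plain Python as row masks. This is only an illustration of which rows each condition selects, using made-up data; it is not how the library evaluates conditions.

```python
# Hypothetical data: each dict is one row with columns a, b and c.
rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 5, "b": 4, "c": 6},
    {"a": 2, "b": 7, "c": 9},
]

# (a < b) & (b < c): compared row by row.
between = [row["a"] < row["b"] < row["c"] for row in rows]

# a not in b: membership is tested against every value in column b.
b_values = {row["b"] for row in rows}
a_not_in_b = [row["a"] not in b_values for row in rows]

# c == [3, 9]: the row matches if c equals any of the listed values.
c_is_3_or_9 = [row["c"] in (3, 9) for row in rows]

# The formula would be applied only to rows whose mask entry is True.
selected_by_between = [row for row, keep in zip(rows, between) if keep]
```

Note that `a not in b` excludes the third row because its value of a (2) appears in column b of the first row, not because of anything in the third row itself.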
- field model_error: bool = False
When set to True the system examines the source data to see if there is any difference between the result of the calculation in the source data and the actual value seen in the source data. If there is a difference then this difference will be modelled and the difference replicated in the synthetic data.
An example of where this is useful is bank transactions where each transaction row includes the account balance, which is the sum of the transactions so far plus a starting balance. In this case the error will be the starting balance, so if this parameter is set to True the synthetic data will have a similar distribution of starting balances. The system will either model the difference for each row or, if the error in the source is constant for all the records in a sequence, add a constant error over the sequence in the synthetic data.
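The bank balance case can be worked through with a few numbers. The sketch below uses invented transaction data for a single account and shows that the difference between the recorded balance and the running sum is constant and equal to the starting balance; it illustrates the idea only and says nothing about how the library models the error.

```python
from itertools import accumulate

# Hypothetical source data for one account (one sequence).
amounts = [50, -20, 30]      # transaction amounts
balances = [150, 130, 160]   # balance recorded on each transaction row

# What a running-sum formula would compute from the amounts alone.
formula_result = list(accumulate(amounts))

# The "error" is the gap between the recorded value and the formula result.
errors = [actual - computed for actual, computed in zip(balances, formula_result)]

# A constant error over the sequence corresponds to the starting balance.
error_is_constant = len(set(errors)) == 1
starting_balance = errors[0] if error_is_constant else None
```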
- field group_by: List[Union[str, ExpressionConfig]] = []
When provided, the data is grouped according to the provided set of keys or expressions before the handler is applied to each group.
This parameter must be provided as a list in which each item is either:
  - A column name as a string
  - A dictionary specifying an expression. For example, this could be used to group rows by the month of the year based on a datetime column. The dictionary takes the form:
    {
        "type": "expression",
        "expression": "a+b",
        "column_map": {
            "a": "col_1",
            "b": "col_2",
        }
    }
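As a rough picture of what grouping by an expression means, the sketch below groups invented rows by the value of "a+b" from the dictionary above. The helper `group_key` is hypothetical and hard-codes that one expression; the library's own expression evaluation is not reproduced here.

```python
from collections import defaultdict

# Hypothetical rows; col_1 and col_2 are the columns named in the column_map.
rows = [
    {"col_1": 1, "col_2": 2, "amount": 10},
    {"col_1": 2, "col_2": 1, "amount": 20},
    {"col_1": 4, "col_2": 0, "amount": 30},
]

group_spec = {
    "type": "expression",
    "expression": "a+b",
    "column_map": {"a": "col_1", "b": "col_2"},
}


def group_key(row, spec):
    """Evaluate the literal expression "a+b" for one row (sketch only)."""
    column_map = spec["column_map"]
    return row[column_map["a"]] + row[column_map["b"]]


# Rows with the same expression value end up in the same group.
groups = defaultdict(list)
for row in rows:
    groups[group_key(row, group_spec)].append(row["amount"])
```

The first two rows share the key 3 and form one group; the third row forms its own group with key 4.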
- field sort_by: List[Union[str, ExpressionConfig]] = []
When provided, the data is sorted by the provided columns or expressions before the formula is applied.
This parameter follows the same syntax as group_by.
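The fields described above can be pictured together as a single configuration. The sketch below writes one out as a plain Python dictionary keyed by the field names; the column names are invented, the dictionary form of the static value (with "type": "static") is inferred from the column_map description, and whether FormulaSetting accepts such a dictionary directly is an assumption rather than something this page confirms.

```python
import json

# Hypothetical setting: add a fixed offset to a date column, per account.
formula_setting = {
    # The expression only names variables; it contains no literal values.
    "expression": "a + b",
    "column_map": {
        "a": "start_date",  # hypothetical column name
        # Static value supplied through the column map instead of the formula.
        "b": {"type": "static", "value": "30 days, 2 hours", "dtype": "timedelta"},
    },
    "model_error": False,
    "group_by": ["account_id"],  # hypothetical column name
    "sort_by": ["start_date"],
}

# Serialised form, e.g. for storing alongside other configuration.
serialised = json.dumps(formula_setting, indent=2)
```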