Database Subsetting¶
Database subsetting allows training to occur on a smaller subset of the data. It recursively samples the database structure, producing a subset of all tables that maintains referential integrity.
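As a minimal illustrative sketch (the import path for DatabaseSubsettingParams is assumed here, following the pattern used for DBSubsetFilter below), a configuration that keeps roughly 10% of the original data could look like:

from hazy_configurator import DatabaseSubsettingParams  # import path assumed

# Keep roughly 10% of the original rows while maintaining referential integrity.
subsetting = DatabaseSubsettingParams(magnitude=0.1)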
Component Preserving Sampling¶
When a table has multiple foreign keys, as is typical when modelling many-to-many relationships, the keys can form connected components: subgraphs in which every pair of keys is connected via a path, i.e. a relationship spanning multiple rows of the table. It can be important to keep these components intact when sampling. To do this, instead of sampling row-wise, the product can sample component-wise. Since each row/key belongs to exactly one component, once the components have been identified, sampling can be performed over them. Note that this can only be run on the target_table.
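As an illustrative sketch (the import path and table name are assumptions), component-preserving sampling on a many-to-many link table could be configured as:

from hazy_configurator import DatabaseSubsettingParams  # import path assumed

# "account_owners" is a hypothetical link table with multiple foreign keys.
DatabaseSubsettingParams(
    target_table="account_owners",
    preserve_components=True,  # sample whole connected components rather than individual rows
)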
Caching¶
There is the option to turn on an encrypted cache. To set this up, the environment variables HAZY_CACHE_FOLDER and HAZY_CACHE_PASSWORD must be configured. The customer is responsible for choosing a strong password. The cache can save time on subsequent training runs by allowing the subsetting to read tables directly from it, and each file in the cache is encrypted with a different key. This feature is turned on by setting use_cache=True. Setting valid_cache=False ensures that no tables are read from the cache on that run; however, if use_cache=True, they will still be written to the cache afterwards, ready for future training runs.
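As an illustrative sketch (folder, password and import path below are placeholders/assumptions), the cache could be set up as follows:

import os

# Placeholder values; the customer is responsible for choosing a strong password.
os.environ["HAZY_CACHE_FOLDER"] = "/data/hazy_cache"
os.environ["HAZY_CACHE_PASSWORD"] = "a-strong-password"

from hazy_configurator import DatabaseSubsettingParams  # import path assumed

# Read from and write to the encrypted cache; add valid_cache=False to force a refresh.
DatabaseSubsettingParams(use_cache=True)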
Target Table¶
Often the most sensible choice of table to begin subsetting from is a pure parent table, i.e. a table with no foreign keys to other tables. However, it really depends on the use case.
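For illustration only, a pure parent table can be identified from a simple mapping of tables to the tables they reference (the schema and helper below are hypothetical):

# Hypothetical mapping: table name -> tables referenced by its foreign keys.
foreign_keys = {
    "customers": [],                    # no foreign keys: a pure parent table
    "accounts": ["customers"],
    "transactions": ["accounts"],
}

# Pure parent tables are those with no outgoing foreign keys.
pure_parents = [table for table, refs in foreign_keys.items() if not refs]
print(pure_parents)  # ['customers'] would be a sensible target_table candidate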
In Memory¶
At the moment only in-memory subsetting is supported, as most enterprise clients opt to move data from their production environments into curated environments, so they can run the product away from their production systems and minimise risk.
Skip Reference Tables¶
Enabling this ensures that every reference table is excluded from the database subsetting process.
Filters¶
This parameter allows the user to specify filters for each table in the database, applied before the subsetting process. Each filter is written like a SQL WHERE clause and can be supplied either through the Python configuration classes or in JSON format. Example:
from hazy_configurator import DBSubsetFilter
DBSubsetFilter(table="table_A", where="amount > 1000 AND type = 'debit'")
The equivalent filter in JSON format:
{
    "table": "table_A",
    "where": "amount > 1000 AND type = 'debit'"
}
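Several filters can be combined, one per table. The sketch below only builds the list of filters; the configuration field that accepts this list is not shown in this section, so how it is attached to the configuration is left open:

from hazy_configurator import DBSubsetFilter

filters = [
    DBSubsetFilter(table="table_A", where="amount > 1000 AND type = 'debit'"),
    DBSubsetFilter(table="table_B", where="created_at >= '2020-01-01'"),  # hypothetical table and column
]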
Out-of-Core Sampling¶
It is possible to sample down larger-than-memory datasets via the subsetting pipeline by leveraging Dask’s parallel computing capabilities. To enable this, toggle on the use_dask switch in the subsetting configuration. Exercise caution and enable this feature only when necessary, as Dask is not optimal for smaller datasets; in such cases it can introduce unnecessary overhead and slow down performance. Also ensure that the ratio specified is small enough that the subsetted dataset can comfortably fit into memory.
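A sketch of an out-of-core configuration (import path assumed); note the small ratio so that the resulting subset fits comfortably in memory:

from hazy_configurator import DatabaseSubsettingParams  # import path assumed

DatabaseSubsettingParams(
    magnitude=0.05,  # keep ~5% of the data so the subset fits in memory
    use_dask=True,   # distribute the subsetting work across local cores via Dask
)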
Configuration¶
- class hazy_configurator.general_params.database_subsetting_config.DatabaseSubsettingParams¶
Bases: HazyBaseModel
Database subsetting parameters.
These define how a database subset should be obtained from the original database.
- field magnitude: float = 1.0¶
Ratio of original data to be kept. For example, a value of 0.5 will (try to) keep around 50% of the data.
- Constraints:
exclusiveMinimum = 0
maximum = 1
- field target_table: Optional[str] = None¶
The name of the table to be sampled first. Primary/foreign keys dropped/kept will be propagated throughout the rest of the database. Connected components can be preserved in this table (in_memory sampling mode only). Connected components exist where there are multiple Foreign Keys in a table.
- field preserve_components: bool = False¶
Boolean flag to preserve connected components on target_table.
- field use_cache: bool = False¶
Use cache to save time reading tables from source. Files in the cache are encrypted.
- field valid_cache: bool = True¶
By setting the cache as invalid (False), no tables will be read from the cache; however, they will still be written to the cache after sampling, which can save time on subsequent runs if use_cache is set to True.
- field skip_reference: bool = False¶
Whether or not reference tables should be skipped during subsetting.
- field enforce_ref_integrity: bool = True¶
The database subsetting algorithm is guaranteed to maintain referential integrity if the source data does. However, in cases where the source data does not, this parameter can enforce that the offending records are dropped.
Defines the navigation mode/algorithm to be used when calculating the route to go through the database schema when subsetting it.
- field sampling_mode: SamplingMode = SamplingMode.IN_MEMORY¶
Defines where the computation of the sampling will be done.
- field schema_name: Optional[str] = None¶
Name of the schema to write temporary sampled tables into. If none is provided, the default schema of the database will be used. If the provided schema does not exist, it will be created - if this fails, the default schema will be used.
- field use_dask: bool = False¶
Enable Dask to leverage parallel computing for efficiently subsetting a dataset using a local cluster. When set to True, Dask will distribute the subsetting task across multiple cores on the local machine, allowing it to read in and then subset larger-than-memory datasets. It is recommended to use Dask only when the dataset size exceeds available memory: it provides an advantage in handling larger-than-memory datasets, but can result in slower processing compared to the standard Pandas pipeline when enabled for smaller datasets. Dask is not recommended for datasets less than 7.0 GB in size.
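Putting the fields together, a fuller configuration might look like the following sketch (import path and table name assumed):

from hazy_configurator import DatabaseSubsettingParams  # import path assumed

DatabaseSubsettingParams(
    magnitude=0.25,               # keep roughly 25% of the data
    target_table="customers",     # hypothetical pure parent table to sample first
    use_cache=True,               # encrypted cache; requires HAZY_CACHE_FOLDER and HAZY_CACHE_PASSWORD
    skip_reference=True,          # exclude reference tables from subsetting
    enforce_ref_integrity=True,   # drop records that would break referential integrity
)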