Graph modelling at Hazy: our approach & a case study
At Hazy, we’ve had the opportunity to work on several projects where the data available represents a graph. In these instances, it becomes impossible to treat the data records, which typically represent edges between nodes, as records independently sampled from the same underlying distribution. As such, a new overall approach is needed; one that focuses on maintaining the overall graph properties while still generating meaningful and representative nodes and edges in a safe and private fashion.
Our approach
In any synthetic data problem, a definition of a record is required, not only during training, where it can heavily influence the overall training strategy, but also during generation as the user typically requires a certain count per record type. In the case of a graph dataset, the immediate record types that come to mind are nodes and edges, as they fully define a network.
During the training step, the system extracts node, edge and graph properties and utilises this information to train two separate generative models to model nodes and edges. During the generation step, nodes are generated in a structure informed by the graph properties captured during the training step. The edge types are then conditionally generated to build a complete, realistic network.
Training
With nodes and edges as inputs, the training approach has been designed as a two-step process in order to accommodate the particularities of graph data while leveraging as much of our existing generative capabilities as possible.
1. Graph analysis
Training starts by analysing the graph. This usually involves ingesting edges in batches to build three main outputs:
- Node Feature Table: a table where each row corresponds to a node within the graph. The table contains the various properties a node might have, including static attributes assigned to it, degree information, the number of triangles it belongs to, which connected sub-component it belongs to, and any other relevant graph property.
- Edge Feature Table: similarly to the Node Feature Table, this is a summary, for each edge within the graph, of its attributes as well as any relevant graph property that can be assigned to it.
- Graph Statistics and Properties: a collection of distributions and properties extracted from the graph, such as the cycle size distribution, the proportion of cycles in the graph, the distribution of connected sub-component sizes, density, etc.
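To make this step concrete, below is a minimal sketch of the graph analysis using networkx and pandas. The `src`/`dst` column names and the exact feature set are illustrative assumptions, not Hazy's actual schema:

```python
# A minimal sketch of the graph analysis step. Feature names and the
# set of extracted properties are illustrative assumptions.
import networkx as nx
import pandas as pd

def analyse_graph(edge_df: pd.DataFrame):
    """Build a Node Feature Table and a bag of graph statistics."""
    G = nx.from_pandas_edgelist(edge_df, source="src", target="dst")

    triangles = nx.triangles(G)  # number of triangles each node belongs to
    # Identify which connected sub-component each node belongs to.
    component_of = {
        n: cid
        for cid, comp in enumerate(nx.connected_components(G))
        for n in comp
    }

    node_features = pd.DataFrame({
        "node": list(G.nodes),
        "degree": [G.degree(n) for n in G.nodes],
        "triangles": [triangles[n] for n in G.nodes],
        "component": [component_of[n] for n in G.nodes],
    })

    graph_stats = {
        "density": nx.density(G),
        "component_sizes": sorted(
            (len(c) for c in nx.connected_components(G)), reverse=True
        ),
    }
    return node_features, graph_stats
```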
2. Feature analysis
Following the graph analysis, standard tabular modelling and training can then be applied to both the Node Feature Table and the Edge Feature Table (using conditional generation). The output of this process is two trained generative models, responsible for generating the nodes and for populating the edges appropriately.
For this step, Hazy’s standard tooling works well and can directly be used to accomplish the training, and produce a safe synthetic data generator for both nodes and edges.
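As an illustration of this step, the open-source SDV library can play the role of the tabular generative model; this is a stand-in for Hazy's own tooling, not the actual implementation:

```python
# Stand-in for Hazy's tabular modelling step using the open-source SDV
# library (not Hazy's actual tooling). `node_features` is the Node
# Feature Table produced by the graph analysis sketch above; the same
# pattern applies to the Edge Feature Table.
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(node_features)

node_model = GaussianCopulaSynthesizer(metadata)
node_model.fit(node_features)

# During generation, sample as many synthetic nodes as requested.
synthetic_nodes = node_model.sample(num_rows=10_000)
```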
Generation
During generation, the user specifies a number of nodes and a number of edges (it is also possible to infer the missing parameter if only one is provided). This gives the user a significant amount of control over the shape, size and density of the graph while preserving the realism of the generated graph. Generation breaks down into three steps.
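As an example of inferring the missing parameter, the density recorded during graph analysis can be reused; the heuristic below is an assumption for illustration, not necessarily the rule Hazy applies:

```python
# Hedged sketch: infer a missing edge count from the requested node
# count using the density captured at training time
# (density = edges / possible edges).
def infer_edge_count(n_nodes: int, trained_density: float) -> int:
    return round(trained_density * n_nodes * (n_nodes - 1) / 2)
```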
1. Node generation
Using our standard Hazy generation pipeline and the trained Node Feature model, a synthetic Node Feature Table can be generated as described above.
2. Graph resolution
Once nodes have been generated, the Graph Resolver attempts to optimally solve the problem of creating a skeleton graph with “empty” edges between nodes, aiming to match both the per-node graph properties and the overall graph properties extracted during the graph analysis training step, such as sub-component and clique related distributions, cycle sizes and frequencies, average distance between nodes, etc.
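As a heavily reduced sketch of what the Graph Resolver does, the configuration model below builds a skeleton whose degree sequence matches the generated nodes; the actual resolver also targets the richer properties listed above:

```python
# Minimal stand-in for the Graph Resolver: build a skeleton graph whose
# degree sequence matches the generated Node Feature Table.
import networkx as nx

def resolve_skeleton(node_features) -> nx.Graph:
    degrees = node_features["degree"].astype(int).tolist()
    if sum(degrees) % 2:  # a valid degree sequence must have an even sum
        degrees[0] += 1
    G = nx.configuration_model(degrees, seed=0)
    G = nx.Graph(G)  # collapse parallel edges
    G.remove_edges_from(nx.selfloop_edges(G))  # drop self-loops
    return G
```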
3. Edge sampling
The previously generated empty edges can then be populated by conditionally feeding them through the Edge Feature model, generating the missing edge properties appropriately.
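The sketch below illustrates this step; `edge_model.sample_conditional` is a hypothetical interface standing in for conditional generation with the trained Edge Feature model:

```python
# Populate each "empty" skeleton edge with attributes generated
# conditionally on its endpoints' features.
import pandas as pd

def populate_edges(skeleton, node_features, edge_model) -> pd.DataFrame:
    feats = node_features.set_index("node")
    rows = []
    for src, dst in skeleton.edges:
        condition = {
            "src_degree": feats.loc[src, "degree"],
            "dst_degree": feats.loc[dst, "degree"],
        }
        attrs = edge_model.sample_conditional(condition)  # hypothetical API
        rows.append({"src": src, "dst": dst, **attrs})
    return pd.DataFrame(rows)
```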
Use case: Los Alamos National Laboratory
As part of an engagement with a Government customer, our team was involved in enabling a third party to prove the capability of their software to ingest and analyse large graph datasets. For this, the Los Alamos National Laboratory (LANL) Unified Network Dataset was selected. While the dataset is large, it does not cover all of the graph properties that required testing, which meant we had to develop a dynamic graph synthetic data generator that gives the user more control over the shape of the generated graph data, allowing our customer to test the third-party software across various graph sizes and types.
What is the dataset?
The Unified Host and Network Dataset is a subset of network and computer (host) events collected from the Los Alamos National Laboratory enterprise network over the course of approximately 90 days.
The data can be broken down into two sets of files: netflow event logs and host event logs. Host event logs are a set of JSON files that record activity on a set of devices. From a graph point of view, the host events contain two different types of events, or edges: authentication edges that can connect any two device nodes, and other events that are ‘loops’ marking events happening within a single device. Netflow, or network, events are a set of CSV files containing logs of data exchanges between nodes/devices.
The LANL dataset is a standard benchmark for graph data analysis and is often used when testing graph-related features.
Why synthetic data?
The LANL dataset is a large dataset, spanning 90 days with more than 100 gigabytes of data in total. However, the network has a particular shape: while it has 100+ million edges, it only has tens of thousands of nodes. This makes the dataset particularly useful for use cases where a dense graph is needed, but less so where a large but sparser graph (i.e. a larger number of nodes) is needed.
A synthetic data generation tool is quite useful in this instance, as it allows a user to generate a realistic graph dataset with full control over the shape of the graph while preserving the overall statistical properties of each node.
What are the challenges?
Users
Each host event is linked to a user. In order to appropriately model the overall graph, it is necessary to model the degree distribution (number of edges per user) and overall user properties, so that user-related behaviour is preserved.
Nodes
In the LANL graph, devices/computers constitute the nodes; however, due to the different types of edges, nodes have several distinct degree distributions. Furthermore, given that edges have a time component, the edge distribution has to be modelled both in terms of the number of “unique” edges and the number of occurrences of each edge over time.
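One concrete modelling target, sketched below with assumed column names, is the multiplicity of each unique edge, i.e. how often the same pair of devices interacts over time:

```python
# Count unique edges per edge type and how often each one recurs over
# time. Column names (src, dst, edge_type) are assumptions.
import pandas as pd

def edge_multiplicity(events: pd.DataFrame) -> pd.DataFrame:
    return (
        events.groupby(["edge_type", "src", "dst"])
        .size()
        .rename("occurrences")
        .reset_index()
    )
```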
Sequences
The time component is not only an edge property but also a mechanism through which edges from different “layers” (login, event, netflow) can be linked to map an overall event. In this particular use case, a typical event type to be modelled is a set of sequences of edges (login -> event -> netflow), all starting and ending at the same two nodes.
Although there’s no unique identifier for each sequence, the time component can be used to identify which elements belong to each sequence.
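One simple way to do this, sketched below, is to start a new sequence whenever the gap between consecutive edges for the same node pair exceeds a time window; the 30-second window and column names are illustrative assumptions:

```python
# Tag edges with sequence identifiers by grouping edges between the
# same endpoints that occur within a short time window of each other.
import pandas as pd

def tag_sequences(edges: pd.DataFrame, window_s: float = 30.0) -> pd.DataFrame:
    # edges columns assumed: src, dst, timestamp (in seconds), layer
    edges = edges.sort_values(["src", "dst", "timestamp"]).copy()
    gap = edges.groupby(["src", "dst"])["timestamp"].diff()
    # A NaN gap marks the first edge for a node pair; a gap larger than
    # the window marks the start of a new sequence.
    edges["sequence_id"] = (gap.isna() | (gap > window_s)).cumsum()
    return edges
```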
Scale
The amount of data to both train against and generate is likely larger than memory, and any provided solution will need to support the generation of large numbers of nodes and edges despite memory constraints.
The solution
Training
To train a generative model for the LANL graph, we introduced a two-step training strategy; both steps are batch friendly.
First is the graph analysis step. The full dataset is ingested in batches and all edges are merged into a single sorted pile before being processed sequentially. During this analysis, all graph information is captured, including degree distributions, sequence properties, etc. Additionally, any edge within a sequence is tagged with a unique sequence identifier as well as a unique event identifier, if it belongs to an ‘event’ as outlined in the section above.
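The single sorted pile can be built without loading everything into memory, for example with a k-way merge over pre-sorted batch files, as in this sketch (which assumes CSV batches whose first column is a timestamp):

```python
# Stream a globally time-ordered view of the edges from batch files
# that have each been sorted by timestamp beforehand.
import csv
import heapq

def stream_sorted_edges(batch_paths):
    streams = []
    for path in batch_paths:
        reader = csv.reader(open(path, newline=""))
        # Key each row by its timestamp so heapq.merge can interleave
        # the pre-sorted batches. (File handles are left open for brevity.)
        streams.append(((float(row[0]), row) for row in reader))
    for _timestamp, row in heapq.merge(*streams):
        yield row
```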
Secondly, once the graph properties are determined, separate models are trained for both users and nodes. Additionally, conditional edge generation models are trained by conditioning on the node and user as well as the previous edges within a sequence.
Generation
Similarly, generation is batch friendly: each step can be run in multiple batches. Generation starts with users and nodes; the generated records contain all relevant degree distribution information for each node and user.
Given that the user controls the number of edges, the various edge counts generated during that step are recalibrated to match the requested number of edges. The edge generation and event generation algorithms then come into play, appropriately matching the various nodes given their degree distributions. Given an existing set of edges, ‘events’ as outlined above are then generated.
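The recalibration itself can be as simple as proportional scaling with a rounding-remainder correction, as in the sketch below (an assumed approach, shown for illustration):

```python
# Scale per-node edge counts so their sum matches the requested total,
# then distribute the rounding remainder to the largest fractional parts.
import numpy as np

def recalibrate(edge_counts: np.ndarray, target_total: int) -> np.ndarray:
    scaled = edge_counts * (target_total / edge_counts.sum())
    counts = np.floor(scaled).astype(int)
    remainder = target_total - counts.sum()
    order = np.argsort(scaled - counts)[::-1]
    counts[order[:remainder]] += 1
    return counts
```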
Once a graph structure has been appropriately generated, the actual edge properties are then generated conditioned on all related nodes, users and previous steps within a sequence.
Finally, all edges are serialised in their source format, ensuring that they remain equivalent to the source data when ingested by pre-existing graph analysis pipelines.
To find out more about this graph modelling case study, get in touch.