Data Set Keys

A Data Set Key uniquely identifies the Data Sets targeted from the source system.

The choice of Key is often constrained by the source system and typically represents the minimum set of parameters required to uniquely target data for ingestion.

Choosing the right Key is a crucial first step, because much of the definition of the Descriptor is influenced by it. Broadly speaking, the Key affects how the ingestion is targeted and how much data is ingested into a single Data Set.

The targeting of the Keys can be either:

  • Coarse - Targets a larger amount of data (e.g. the entire Customers table)

  • Fine - Targets a smaller amount of data (e.g. a Customer Record from an API)
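
The difference between the two can be sketched in code. The following is only an illustration: the `DataSetKey` class, its fields, and the `granularity` method are hypothetical names chosen for this example, not part of any documented API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataSetKey:
    """Hypothetical key: a table name plus an optional record id."""
    table: str
    record_id: Optional[str] = None

    def granularity(self) -> str:
        # Without a record id the key targets the whole table (coarse);
        # with one it targets a single record (fine).
        return "coarse" if self.record_id is None else "fine"


# Coarse: the entire Customers table becomes one Data Set.
coarse_key = DataSetKey(table="Customers")

# Fine: a single Customer record becomes one Data Set.
fine_key = DataSetKey(table="Customers", record_id="42")
```

A frozen dataclass is used so that keys are hashable and compare by value, which suits something meant to uniquely identify a Data Set.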

In either case, some good rules to follow when choosing keys are:

  • Mutually Exclusive - The keys should guarantee that data from one Data Partition cannot exist in another Data Partition. If duplicate data is found downstream, it could indicate an improper choice of keys that results in the same data being fetched twice. e.g. A category or a tag could be a bad choice if the same data is able to exist in multiple categories or under multiple tags.

  • Atomic - The keys should identify data in the source system so that it can be retrieved in a single request. If multiple requests are needed to retrieve the data for a set of Partition Keys, it could mean too few Partition Keys have been defined. e.g. A Partition Key that needs an additional parameter, which must first be discovered, before the data can be fetched.

  • Symmetric - Ideally, the loader should fetch data from the source system and use the entire payload of the response without filtering or removing data. If all the data from the response is retained, the symmetry between source and target is high. Low symmetry could mean too many Partition Keys have been defined, resulting in data that is requested but thrown away. e.g. Using a Partition Key that exists in the body of a response and not in the URL parameters.
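
Two of these rules lend themselves to simple checks during development. The sketch below is a minimal illustration under stated assumptions: `find_overlaps` and `symmetry` are hypothetical helper names, and records are assumed to carry some hashable identifier. It flags record ids that appear under more than one key (a mutual-exclusivity violation) and computes the fraction of fetched records that is actually kept (a rough symmetry measure).

```python
from typing import Dict, Hashable, Iterable, List


def find_overlaps(
    partitions: Dict[str, Iterable[Hashable]]
) -> Dict[Hashable, List[str]]:
    """Return the record ids that were fetched under more than one key."""
    seen: Dict[Hashable, List[str]] = {}
    for key, record_ids in partitions.items():
        for record_id in set(record_ids):
            seen.setdefault(record_id, []).append(key)
    return {rid: keys for rid, keys in seen.items() if len(keys) > 1}


def symmetry(records_fetched: int, records_kept: int) -> float:
    """Fraction of the fetched payload that is retained (1.0 = nothing discarded)."""
    if records_fetched == 0:
        return 1.0  # Nothing was fetched, so nothing was thrown away.
    return records_kept / records_fetched


# Hypothetical data: record 2 belongs to two categories, so "category"
# is not a mutually exclusive choice of key here.
by_category = {"books": [1, 2], "gifts": [2, 3]}
overlaps = find_overlaps(by_category)

# Hypothetical numbers: 100 records requested, 25 kept after filtering.
ratio = symmetry(records_fetched=100, records_kept=25)
```

A non-empty `overlaps` result or a low `ratio` does not prove the keys are wrong, but either is a prompt to revisit the choice.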

Note: These are guidelines rather than an exact science, and they often require a bit of tweaking to get right.

Depending on Data Ingestion needs, the following are all good choices for a Partition Key:

  • Table - The name of a Table in a Database

  • Primary/Foreign Keys - For more granular ingestion

  • Database Partition - For larger ingestions

  • URI - The URL of a website

  • REST Endpoint with Primary Parameters - The Endpoint and Non-Nullable parameters
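
As an illustration of the last option, a key for a REST source could be derived from the endpoint and its non-nullable parameters alone. This is a sketch, not a prescribed format: the `rest_key` function, the endpoint path, and the parameter names are all hypothetical. Parameters are sorted so that the same inputs always produce the same key.

```python
from urllib.parse import urlencode


def rest_key(endpoint: str, required_params: dict) -> str:
    """Build a deterministic key from an endpoint and its required parameters."""
    # Sorting by parameter name makes the key independent of dict ordering.
    query = urlencode(sorted(required_params.items()))
    return f"{endpoint}?{query}" if query else endpoint


# Hypothetical endpoint and parameters, for illustration only.
key = rest_key("/v1/customers", {"region": "eu", "account_id": 7})
same_key = rest_key("/v1/customers", {"account_id": 7, "region": "eu"})
```

Because the key is built only from values known before the request is made, nothing in the response payload is needed to determine it.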

Avoid choosing a Partition Key that requires inspecting the response payload to determine
