Conceptually, a hook is simply something you hang things from. In the HOOK methodology, a hook represents a point of integration in the warehouse. A hook is a logical construct, and in practical terms, you don’t really need to physically build them. Regardless of whether we choose to physically implement hooks or not, they have a structure that we need to understand…
Simple Hooks
A simple hook represents a single core business concept. By default, there will be a hook defined for every core business concept in the business glossary, which, by convention, will have the same name. If there is a core business concept called Product then there will also be a hook defined with the same name. In Figure 4.1 we can see the Product hook, from which a bag has been hung that contains the Product Details data.
It is a simple visual that mirrors the pegs outside the classroom that I talked about in Chapter 2. Now imagine a row of hooks labelled in alignment with the core business concepts in the business glossary with bags of data hanging from them.
We can hang any number of bags from the same hook, as shown in Figure 4.2.
From this diagram, we can deduce that the Product Details, Stock Levels and Product Suppliers data all contain business keys that identify Products. Equally, we can say that Customer Details and Orders data contain business keys that identify Customers.
By convention, simple hooks are named after the core business concepts from which they are derived. However, we could give the hook a different name. Although I wouldn’t recommend it, we could name the hook that represents the Product core business concept as Commodity. We should keep the name of a hook in lock-step with the name of the core business concept from which it is derived. Just because you can use a different name doesn’t mean you should.
We could also define multiple hooks that represent the same core business concept as a way to represent synonyms. For example, we could create hooks named Product, Widget and Gadget, all referring to the same Product core business concept. Although this is possible, the recommended method for representing synonyms is within the glossary itself. We can define three business concepts named Product, Widget and Gadget. Product would be a core business Concept, whilst Widget and Gadget are non-core business concepts which both refer to Product as their synonym.
There is another method that uses Key Sets to resolve synonyms which we will explore further in Chapter 6.
Figure 4.1 shows the Product Details data hanging from a single Product hook, but we can extend our model to allow a bag to hang from more than one hook. Figure 4.3 shows the Order Lines Bag hanging from the Order and Product hooks. In effect, the Order Lines data has been classified into both Order and Product categories.
There is no practical limit to the number of hooks we can hang a bag from. If the underlying data contains identifiable business keys, we can use them to associate with a hook.
Compound Hooks
In some situations, it may be preferable to create hooks which incorporate more than one core business concept. These are called compound hooks and can be used to represent relationships between core business concepts. For example, we could create an Order Line hook, which forms a relationship between Order and Product core business concepts, as shown in Figure 4.4.
Here is a compound hook labelled Order Line. The symbol for a compound hook is a hook enclosed in braces. We can explicitly show the contributing core business concepts, as shown in Figure 4.5.
I have expanded the Order Line compound hook to show that it is derived from the Order and Product core business concepts. The dotted lines show the association. Alternatively, we could use the shorthand as shown in Figure 4.6.
Here we have called out that the Order Line hook is made up of the Order and Product core business concepts. In other words, the dimensionality of the Order Line hook is Order x Product.
There will be times when the same core business concept is referred to more than once within a compound hook, in which case we must qualify each reference with a unique name, as shown in Figure 4.7.
The Employee core business concept is used twice in the Reporting Structure hook to represent the employee/manager relationship. Both employee and manager are defined by the Employee core business concept, so we need to qualify one, or both of them, to ensure uniqueness. This is done by appending the name of the role played by the employee with the relationship. In this case, I have chosen to qualify the manager resulting in the label Employee.Manager. Going back to the Order Line example in Figure 4.4, the business key in the Order Line hook will be a concatenation of the Customer and Product business key values.
But we don’t have to model it this way. If we don’t want to call out the Customer/Product relationship, we can achieve the same result using simple hooks as shown previously in Figure 4.4. Either approach works; it depends on how you prefer to represent the relationship.
But consider this example. What if the ordering system allowed Order Lines to be related to one another? Maybe the customer responded to an offer that was generated from another order. In this scenario, we can refer to the Order and Product hooks twice, as shown in Figure 4.8.
Now we have four distinct references to hooks, and we must qualify any duplicate references for the same hook, in this case, Original and Offer. It is beginning to look a bit messy. To clean things up a bit, we could make use of the Order Line compound hook, as shown in Figure 4.9.
We still need to reference the hook twice, and as such, each reference must be qualified, but now the number of references has halved from four to two.
Hooks as Physical Tables
In its simplest form, a hook represents a set of business keys, and a business key is an identifier associated with a core business concept defined in the business glossary. A hook, if implemented as a physical table, contains only a list of distinct business key values for a core business concept.
To be clear, just because a hook has been defined does not mean that we have to build a physical hook table. The HOOK methodology does not mandate that hook tables be constructed and populated. But if we did, the table definition for the Customer hook would look something like this:
CREATE TABLE HOOK_CUSTOMER ( HK_Customer VARBINARY(100));
Yep, that’s it. The table - in this case, called HOOK_CUSTOMER
-
contains a single column also named in line with the associated core business concept, in this case, HK_Customer
. For completeness, and if the database management system hosting the warehouse supports it, we might also add an index to enforce uniqueness and improve join performance.
CREATE UNIQUE INDEX IX_HOOK_CUSTOMER ON HOOK_CUSTOMER (HK_Customer);
In this example, I’ve defined the type for the hook key as a VARBINARY
. The reason for this is that joins between binary fields are usually more efficient (and reliable) than joins between text fields. But if you did want fixed-width hook keys then you could use a hashing function, though I wouldn’t recommend it for a number of reasons:
The width of the hash key, in most cases, will exceed the size of the fixed width hashed value. To minimise the probability of key clashes, the hash key would need to be a minimum of 32 bytes (256 bits).
Calculating the hash requires significant computing resources. When the same hash is applied to the hook keys in the bags, which are calculated at query time, this could prove to be expensive.
With hash keys, there is always the possibility, however small, that the same hash value could be generated for different business key values.
Using a Hook as an Anchor
Hook tables are used as a convenient way to join bags. The hook table will contain a superset of all the business keys exposed in all bags that refer to that core business concept. We can use it as an anchoring table for LEFT JOIN
s as follows.
SELECT *
FROM HOOK_CUSTOMER HOOK
LEFT JOIN BAG_Customer_Detail BAG1
ON HOOK.HK_Customer = BAG1.HK_Customer
LEFT JOIN BAG_Customer_Order BAG2
ON HOOK.HK_Customer = BAG2.HK_Customer
Of course, we could join the two bags directly using a full outer join as follows.
SELECT *
FROM BAG_Customer_Detail BAG1
FULL OUTER JOIN BAG_Customer_Order BAG2
ON BAG1.HK_CUstomer = BAG2.HK_Customer
Therefore, we don’t really need the hook table at all. However, utilising full outer joins is generally discouraged for performance reasons. There really is no real need to implement hook tables. Unless you need, for some reason, to maintain a complete list of business keys in a single table, then I see no reason why you should bother.
Summary
Hooks are more of a logical concept rather than something you need to build. For the sake of simplicity, start out with the assumption that you don’t need them. Only implement them if there is a requirement for a reference list of business keys.
Next Chapter
Chapter 5 will talk about the bag construct in some detail. If you read just one chapter of this book, Chapter 5 is probably the one you want.
It would be great to get more details about the pros/cons of materializing hooks and potentially include some metadata (I know that would be basically the hub concept in DV), because I would consider creating a hook table once I have an integration challenge per hook (reconciliate 2+ datasets). I think full joins with COALESCE calculations to standardize keys are definitely not a good choice to combine datasets, the hook tables won't be difficult to create or maintain and would ease querying, but maybe I am missing something.