The truth of the matter is that HOOK is nothing new, not really. If I’m honest, it is just a re-imagining of ideas put forward by the leading lights of our industry, such as Bill Inmon, Dan Linstedt and Ralph Kimball. For that very reason, I have been reluctant to put my name to something that isn’t what I would consider particularly original thinking. However, the longer I have worked in this industry and the more broken projects I have encountered, the more it seems that a different approach is warranted.
There is nothing inherently wrong with the current data warehousing approaches; Data Vault and Dimensional modelling work well if done right. Unfortunately, they are rarely done right, and this is where things start to go wrong. These methodologies require high levels of discipline and consistency: to be successful, they must be followed rigorously, with no shortcuts allowed, and failing to do so will ultimately lead to failure. That level of rigour is expensive and time-consuming, though, and when the business is applying pressure to deliver benefits, rigour is the first casualty.
We need an approach that is a little more, dare I say it, forgiving. An approach that allows results to be delivered quickly but at the same time makes it easier to change when (not if) we get things wrong the first time around. We need a truly flexible and agile approach.
In the HOOK methodology, data is contained in bags and can be hung from hooks. It's a simple metaphor - there are others I could have used - but what is the thinking behind it? Was there a light bulb moment? Kind of, but it was more of a gradual twist on the dimmer switch.
A Very Old Idea
One of my earliest memories is of my first day at school. It was a significant event, particularly in the life of a five-year-old. I remember the last-minute activity leading up to that day as my parents organised all the things I needed. The school had provided a list of essential items: a satchel to carry my books; a gym bag and gym gear (t-shirt, shorts and gym shoes); and a pencil case with pencils (too young to be trusted with a pen!), a sharpener, and a ruler. There were new school clothes, a winter coat, and shoes. My mother spent hours labelling every one of those items with my name.
Laden down with all our stuff, my new classmates and I were first directed to a row of pegs on the wall outside our classroom. Each peg was labelled with one of our names, in alphabetical order, and, because we couldn’t read yet, a picture of an animal (funny the things you remember, but I think mine was a giraffe). We were then told to hang our belongings on our respective hooks.
A row of hooks, a simple flat structure, delivers an incredibly effective way to organise the items from different sources. It occurred to me that each of these hooks acted like a point of integration for each student. Each child’s belongings would be co-located on their hook. If items are suitably labelled, we can quickly figure out where they belong.
As previously discussed, to qualify as a data warehouse, the Inmon Criteria must be met. The idea of a hook seemed like a suitable representation of a point of integration within a warehouse. Hooks are the mechanism by which the subject-oriented and, to a certain extent, the integrated Inmon Criteria are met. We will effectively hang data in the data warehouse from a set of defined hooks. Each hook will represent one or more Core Business Concepts.
HOOK Components
The HOOK approach is meant to be lightweight and simple to grasp. There are very few moving parts to the approach, yet it offers a great deal of flexibility when building a data warehouse. Let’s look briefly at the main components.
Business Glossary
The foundational construct for a HOOK data warehouse is the business glossary. It is critical that there is a common understanding of the business terminology used within the organisation. If a company cannot clearly define what it means by, say, Customer, then it is unreasonable to assume we can integrate customer data.
The business glossary is a list of business terms or concepts that are clearly defined by the business. These concepts can be represented as entities within the Enterprise Conceptual Model (“ECM”) to demonstrate how they relate to one another, but the priority is the identification and definition of the concepts themselves.
We may only align data to a business concept once that concept has been formally defined and documented. I will demonstrate how to identify these business concepts using the ELM methodology later in the book.
The business glossary is described in more detail in Chapter 3.
Hooks
A hook is a representation of a business concept. More accurately, it is a structure that records the business keys that identify instances of a business concept. If we have a Customer concept, then we can also have a Customer hook, which is used to record all the business keys for customers.
Hooks may also be used to represent relationships between business concepts. For example, if a Customer buys a Product, we can construct a Sales hook that associates the two concepts.
Although hooks can be represented as physical tables, they are entirely optional, and their existence may be implied.
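To make this concrete, here is a minimal sketch of a Customer hook materialised as a table, assuming the warehouse is exposed through a SQL engine. The table name, column name and key format are illustrative placeholders, not prescribed conventions:

```sql
-- A minimal sketch of a Customer hook as a physical table.
-- The 'hook_'/'hk_' names and the key format are illustrative
-- assumptions, not a prescribed HOOK convention.
CREATE TABLE hook_customer (
    hk_customer VARCHAR(100) NOT NULL PRIMARY KEY  -- a qualified business key, e.g. 'SYSX.CUS|123'
);
```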
Hooks are described in more detail in Chapter 4.
Bags
A bag is a virtual construct that is used to align raw data acquired from operational systems with concepts in the business glossary. Using a strict naming convention, a bag will clearly define how the underlying source-aligned data references business concepts.
Bags can be implemented as physical tables but are more commonly defined as views. We will discuss these options in more detail later.
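For illustration, here is a sketch of a bag defined as a view over a source-aligned table. The source table name, the hk_ column prefix and the keyset codes are all invented for this example:

```sql
-- A sketch of a bag: a view that decorates raw source data with
-- hook key columns. All names and the keyset codes ('SYSX.CUS',
-- 'SYSX.PRD') are illustrative assumptions.
CREATE VIEW bag_sysx_orders AS
SELECT
    'SYSX.CUS|' || CAST(o.customer_id AS VARCHAR(20)) AS hk_customer, -- aligns the row to the Customer concept
    'SYSX.PRD|' || CAST(o.product_id  AS VARCHAR(20)) AS hk_product,  -- aligns the row to the Product concept
    o.*                                                               -- the raw source data passes through untouched
FROM src_sysx.orders AS o;
```

Note that the underlying data is neither copied nor transformed; the view simply declares how each row references the business concepts.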
Bags are described in more detail in Chapter 5.
KeySets
When data is brought together from multiple sources, business keys will be used to identify the business concepts within each system. There is a high chance that those business key values overlap: Customer #123 from System X might not be the same customer as Customer #123 from System Y. Where the respective key spaces do overlap, we need to qualify the keys to ensure their values are unique.
We do this by formally defining keysets that describe the set of business key values. For example, we might define a keyset called “Customer identifiers from System X” and one called “Customer identifiers from System Y”. Each keyset has its own identifying number or code, which is then used to qualify the customer business keys from the respective systems.
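A sketch of the idea, reusing the two keysets above; the keyset codes ('SYSX.CUS', 'SYSY.CUS') and table names are invented for illustration:

```sql
-- Qualifying business keys with keyset codes keeps overlapping key
-- spaces apart. The codes and table names are illustrative assumptions.
SELECT 'SYSX.CUS|' || CAST(customer_id AS VARCHAR(20)) AS hk_customer  -- Customer identifiers from System X
FROM src_sysx.customers
UNION ALL
SELECT 'SYSY.CUS|' || CAST(customer_id AS VARCHAR(20)) AS hk_customer  -- Customer identifiers from System Y
FROM src_sysy.customers;
-- Customer #123 now surfaces as 'SYSX.CUS|123' and 'SYSY.CUS|123':
-- two distinct hook keys, so the values can no longer collide.
```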
Keysets are described in more detail in Chapter 6.
Data Lake + HOOK = Data Lakehouse
To round off this chapter, I want to revisit the discussion from the previous chapter, where I suggested that HOOK was, by accident, an ideal approach for building a data lakehouse.
Modern data warehouses are riddled with holes left by silver bullets supplied by vendors selling us new and shiny tools and toys that will supposedly solve all the problems we never really knew we had. One relatively recent development is the Data Lakehouse. I’ve (tried to) read articles on this new idea, and I’m still not entirely clear about what it is. Different vendors present different definitions of a Data Lakehouse, each heavily influenced by the features and capabilities of their respective products. To avoid confusion, it would be better if we had a technology-agnostic definition of the Data Lakehouse. I can’t seem to find one, and believe me, I’ve looked.
We can give it a go here, but we need to make some assumptions:
The Data Lakehouse must be technology-agnostic. Stating the obvious, I know, but a definition based on specific technologies is not a useful definition.
The Data Lakehouse is built on Data Lake technologies. Data Lakes are fairly well understood; they at least have a definition on Wikipedia. In simple terms, a Data Lake is a scalable file store in which we can record information in different formats. Data Lake technologies excel at getting data in and then back out again; updating data once it has been loaded, however, is less straightforward. (That said, implementing a data lake on traditional database technologies is also an option, assuming we adhere to the principle of immutability.)
The Data Lakehouse serves as a data warehouse and, as such, must meet the Inmon Criteria that we discussed in Chapter 0.
All of which sums up into this simple definition:
data lakehouse [dey-tuh leyk-houz] (noun)
A repository of data stored in its natural/raw format that is subject-oriented, integrated, time-variant and non-volatile.
Certainly, there is no reference to any specific technology. “Repository” is a generic enough term that it could translate to any type of data storage technology. In Modern Data Stacks, the repository would be implemented on data lake technologies in the cloud or on any relational database management system (RDBMS). There are numerous options available from Microsoft, Amazon, Google, Databricks and Snowflake (to name but a few), and they are similar in many respects. But a repository is not a data warehouse; a data warehouse must meet the Inmon Criteria.
Fortunately, data lake technologies meet some of these criteria out of the box. For a start, data lakes are inherently non-volatile: it is generally difficult to change the contents of a file in a data lake. Say you have stored table-like data in CSV or Parquet files, each containing thousands, maybe millions, of rows, and you want to update one record. On most platforms, you can’t update the file directly; you’d have to copy the file, make the change in the copy, and then overwrite the old file with it. That means copying thousands or millions of rows of data to change just one. Highly inefficient.
The recommended approach is to create a new file that contains only the updated records and to leave the old file well alone. We then have an audit of changes over time. The data lake technologies have effectively forced us to implement the time-variant and non-volatile requirements.
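Reading the current state back out of such an append-only layout is straightforward, provided the files are exposed through a SQL engine. A minimal sketch, in which the table name and the load_ts column are invented for illustration:

```sql
-- The current state of an append-only table is simply the latest
-- record per business key. 'lake.sysx_customers' and 'load_ts' are
-- illustrative assumptions.
SELECT *
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY load_ts DESC) AS rn
    FROM lake.sysx_customers AS c
) AS latest
WHERE rn = 1;  -- keep only the most recent record for each customer
```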
But we still need to address the other two criteria. Our data is neither subject-oriented nor integrated, and without these criteria, we end up with a data swamp, an unmanageable jumble of unrelated information. HOOK delivers the subject-oriented and integrated criteria without touching the data in the data lake; we can turn the lake into a data lakehouse simply by layering the HOOK artefacts over the top. However, this does assume that the data can be exposed through a SQL-compliant engine, which, fortunately, most platform vendors provide.
So, based on our definition:
Data Lake + HOOK = Data Lakehouse
In particular, the bag construct alone allows an unorganised source-aligned data lake table to be transformed into a business-aligned data artefact. All we need to do is define a view. It seems almost too easy.
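And once those views exist, the integration payoff is immediate. A hypothetical query, assuming both bags carry customer keys from the same keyset (every name here is invented):

```sql
-- With a shared hook key, data from two different systems integrates
-- with a plain join. Bag and column names are illustrative assumptions.
SELECT o.hk_customer,
       o.order_value,
       i.invoice_amount
FROM bag_sysx_orders   AS o
JOIN bag_sysy_invoices AS i
  ON o.hk_customer = i.hk_customer;  -- the same Customer, regardless of source system
```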
Summary
There are four basic components in the HOOK methodology: the Business Glossary, Hooks, Bags, and KeySets; there isn’t much more to it. Once you grasp these simple concepts, you will be ready to start designing and building your HOOK data warehouse.
Next Chapter
In Chapter 3, we will look at the first of these components, the Business Glossary, in greater depth.