It seems that I, or you (or maybe both of us), have misunderstood something fundamental here, and I am writing this to help you develop your book. So don’t take this as criticism (I might be wrong and will learn something from you, which would be a win-win for both of us). Going back to the core of data warehousing: when I see that something needs to be “subject-oriented, nonvolatile, integrated, time-variant” (the Inmon criteria), I don’t interpret these as you have done. For you, it seems that if the source data is in a data lake, then we have ticked off two of the criteria (non-volatile and time-variant). But the way I see it (and hopefully Inmon would agree), these four criteria work together, meaning that you must have a subject-oriented, integrated layer with historical data that doesn’t get updated (only new data arrives). Once you have that, you have a DW that can be used for audit and analytics. Both current and new data can be compared, and using the data we can feed both source systems (to get better data in the future) and users. So having files stored in a data lake does make them non-volatile and time-variant, but not from a DW perspective, since the other two parts are missing. You could say that I see the criteria like the pieces in this picture: https://commons.wikimedia.org/wiki/File:Jigsaw.svg. Each piece represents one of the Inmon criteria, and without one of them you don’t have the whole picture (the full puzzle). So I don’t agree that just because you have the data in a DL, you have ticked off two of the criteria.
"
We all like a good analogy, and it occurred to me that the HOOK approach to data warehousing mirrors how a library works. A library is just a big room that contains a whole bunch of shelves on which there are a whole bunch of books. The books happen to be organised or indexed so that it should be easy to locate books about a particular subject. Its organising structure is what makes a library work; otherwise, it is just a room full of books and finding what you want is an almost impossible task.
"
For me, the DL is a library that is not organised. It still ticks off two of the criteria, but not as a whole.
I'm not sure I entirely agree. To qualify as a data warehouse, we need to hit all four criteria. So if the DL ticks off two of them, then we are halfway there; but I agree it still isn't a data warehouse. At best, it's a swamp.
If I understand you correctly, are you suggesting that the data must be organised as it is loaded into the DL? In most cases that is impractical. It requires a level of understanding of the data that, at the time, we might not have. Do we refuse to load the data? Some would argue that is exactly what we should do, but I disagree. It is normal to have data we don't fully understand but that might be useful in the future, so we should start loading it and building up history.
The problem we then have is that the data is disorganised, and to hit the other two criteria (subject-oriented and integrated) we would need to restructure the data, which of course we don't want to do either. That is one of the main reasons why I developed the HOOK approach. The idea is that we are able to apply the "model" AFTER we have loaded the data. So rather than a jigsaw, I see it more as a layered approach: "non-volatile" and "time-variant" as the foundation, with "subject-oriented" and "integrated" built on top.
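The "model after load" idea can be sketched in a few lines. Here the raw rows land untouched, and a hook (a key-set prefix concatenated with the business key) is computed on top of them afterwards. The names (`KS_PERSON`, `HK_person`, `ssn`) are my own illustrative assumptions, not anything fixed by the book:

```python
# Sketch: applying the HOOK "model" AFTER the data has landed.
# The raw rows are loaded as-is (non-volatile, time-variant); the hook
# column is layered on top later, without restructuring the data.
# All column and key-set names here are illustrative assumptions.

def add_hook(rows, keyset, business_key_column, hook_column):
    """Return new rows with a hook column added; originals stay untouched."""
    return [
        {**row, hook_column: f"{keyset}|{row[business_key_column]}"}
        for row in rows
    ]

raw = [{"ssn": "19800101-1234", "amount": 100}]          # landed as-is
hooked = add_hook(raw, "KS_PERSON", "ssn", "HK_person")  # model applied later
print(hooked[0]["HK_person"])  # KS_PERSON|19800101-1234
```

The point of the sketch is only that the hook is a pure function of data already loaded, so it can be (re)applied at any time after landing.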
You asked: "If I understand you correctly, are you suggesting that the data must be organised as it is loaded into the DL?"
No, I don't mean that; it's not practical. What I am saying is that you have taken a definition and made your own interpretation of it, which I don't agree with. I see the definition as a jigsaw (all four criteria working together), and you see it as something else (layered, as you call it). Maybe you can reach out to Inmon and ask what he actually means by the definition.
Also, when you say you don’t want to restructure data, I totally disagree. Every system has its own purpose, and part of the DW’s purpose is to restructure data and make it subject-oriented and integrated. If we don’t restructure data, why even bother creating unified star schemas, a HOOK DW, or a DV DW?
Anyway, let's agree to disagree on this one.
Let me explain the situation in which I am using your HOOK ideas, because I think that is where HOOK shines and has a clear advantage over DV.
I just started a new assignment, and for 10+ years the people in this company have sourced data from different systems. What they have done is take data from system 1, then system 2, and lots of other source systems, and the only thing these different data sets have in common is a few keys: social security number and a handful of others. They never thought of building a DW; they kind of jumped over the DW layer and just created wide tables to do analytics. Staging layer directly to data marts, nothing in between.
When creating those large, wide data sets, they obviously used the common keys (social security number, etc.) to join the data from different systems. This has led to no integration of data, no transparency, nobody knowing what data we have, and the same data having different meanings in different mart tables, just to mention a few of the problems. And there are tons more. So what are my options here?
“Start over” and build a DV DW? That’s not going to work (just imagine the amount of data I would need to deal with, the re-engineering, etc.!). And here is the beautiful part of HOOK: I already have the keys that connect the different data sources (the hooks). I just need to define a framework for how to populate those keys the way you have done in your HOOK book. Suddenly, we can automate this creation of keys according to a hook pattern.

Now, we still need to integrate the data (the hooks only give me the possibility to integrate), and the people I work with ask how we can say the data is not integrated if we can join it. So here I will use the concept of a “bag of data” to visually show that, based on HOOK keys, I can create a bag of data that answers a set of questions. Notice that I don’t even need to touch the data; I just need to document the joins they have already made, but using your concept of bags (with my own interpretation). Now, from the users’ perspective, instead of looking at some tables they don’t understand (which have been pre-joined for them), they look at different bags of data and the kinds of questions they answer. I see this situation as going to a grocery shop and shopping for ingredients (data) that can make me a meal (answer business questions). So, in my situation, we went from not having any pattern of integration to having a pattern for key integration (hooks and key sets). I also went from a black box of data to a “bags of data” situation from the users’ perspective, where they can shop for data (integrated by hooks) to answer business questions.
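As a hedged sketch of that situation: two data sets from different systems share nothing but a person key, and stamping both with the same hook makes the possible join explicit and documentable without touching the underlying data. All names and values below are invented for illustration:

```python
# Two source systems whose only common ground is a person key.
# All identifiers and values here are made up for illustration.
system1 = [{"ssn": "1234", "claim": "C-9"}]     # e.g. a claims system
system2 = [{"ssn": "1234", "name": "Alice"}]    # e.g. a customer system

def hook(keyset, key):
    """A hook: key-set prefix concatenated with the business key."""
    return f"{keyset}|{key}"

# A "bag" here is just the source rows plus their hooks -- documentation
# of how the set CAN be integrated, not a physical restructuring of it.
bag1 = [{**r, "HK_person": hook("KS_PERSON", r["ssn"])} for r in system1]
bag2 = [{**r, "HK_person": hook("KS_PERSON", r["ssn"])} for r in system2]

# Integration happens on demand, by joining the bags on the shared hook.
joined = [
    {**a, **b}
    for a in bag1
    for b in bag2
    if a["HK_person"] == b["HK_person"]
]
print(joined)  # one row combining claim C-9 with Alice
```

The join itself is nothing new; what the hooks and bags add is a uniform, documented pattern for it in place of ad-hoc joins buried inside wide mart tables.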
I still need to do a lot of work (ELM) and create a persistent stage (DL). What I am trying to say is that one of the biggest selling points of HOOK should be “visual integration”*; integration does not have to be virtual or physical, if you understand what I mean. And what have I actually done here? I created reusable program code that can be applied to any data to create hooks/keys, and I created documentation of how the data can be combined (bags of data). Compare this to creating a DV DW… the time and effort… oh dear god. I would be done by 2030 if I was lucky.
Maybe the above does not make any sense (my English is a bit limited), but I will contact you once I can show this to you. To summarise: I wouldn’t use HOOK when starting a new DW initiative; however, as with most DWs, when things go wrong (and they do), I would rather use the HOOK method (hooks, key sets, and the “bags of data” concept) to make things go faster while I buy myself time to map all the CBCs and create a persistent stage. And the best part is that all of this is done without actually re-engineering anything or affecting end users.
* Visual integration (documentation) must be followed by virtual/physical integration at the database level. Hopefully, the visual part will match the actual virtual/physical implementation, to make the data user-friendly.