Best way to Load Data in data warehouse

Multi tool use


Best way to Load Data in data warehouse
We are building a new near realtime(refreshed every 10 minutes) hospital data warehouse from scratch. Data lies on an oracle database and we are planning to use SSIS packages to extract data from Oracle and load in the SQL Server database. I am very new with DWHousing and need suggestions on building the DWH.
Questions :
1) When extracting the data from Oracle should I import the complete tables or should I get only selected columns which I want?
For example : If I have person,address,facility_hist tables should I join them in the extracting package and only get the selected columns? or should I extract all three tables separately in staging area and then combine them and load them in SQL Server.
2)Indexes should be created on the target tables or while extracting the data?
3) I am planning to do the initial load in batches by using the create_dt_tm for the records and load the data in months. Once everything is loaded, we are planning to use the updt_dt_tm and load any updates done in last 10 minutes.
4) Is connecting to the database a good idea to get the data every 10 minutes?
Image shows DWH ETL SSIS Package I am planning to use (instead of flat file I'll be connecting it to the oracle DB)enter image description here
Thanks for your reply Nick. We are in a very initial stage but I have designed data model to begin with. I have also identified a date field which is updated every time a new row is added to the tables or any old row is updated.(I am new to DWH but I have worked on the same health Database for which we need to design the DWH) I just need to plan the process now and try to create a temp DWH with may be 2-3 tables to test how we can extract the data from Oracle and load them in SQL Server DWH using SSIS Packages. Do you have any recommendations to my questions?
– NancyK
16 mins ago
1 Answer
1
This is a very broad question and you haven't yet got into the deep complexities of
mapping source medical data to a business friendly star schema.
Where did the data warehouse requirement and the ten minute requirement come from? What's the more important driver for the business: having a latency of only ten minutes, or having consistent correct information? I guess the focus is operational (admittance etc.), not financial or HR?
When extracting the data from Oracle should I import the complete
tables or should I get only selected columns which I want? For example
: If I have person,address,facility_hist tables should I join them in
the extracting package and only get the selected columns? or should I
extract all three tables separately in staging area and then combine
them and load them in SQL Server.
In an "ODS Layer" approach, there is a full replica of Oracle tables in your SQL Database (usually in a different SQL Server schema. Note "schema" in Oracle and SQL Server mean subtly different things)
An ODS is basically a copy of the source system. If you want 10 minute refreshes then you will need to do incremental (changes only) loads into your ODS. To do that you need to know what changed in the source since you last loaded your ODS. If there is a 'last changed' field in the source, this is easy. Otherwise it gets very difficult.
The ODS approach aligns with a 'batch load' pattern, when data is batched through in layers. An ODS allows you to test and troubleshoot at different layers.
The ODS approach is different to a 'stream' approach, which supports real time reporting, but contains isolated transformation rules all along the stream.
Indexes should be created on the target tables or while extracting the data?
Indexes should generally be disabled/dropped on tables before data loads then rebuilt afterwards
I am planning to do the initial load in batches by using the
create_dt_tm for the records and load the data in months. Once
everything is loaded, we are planning to use the updt_dt_tm and load
any updates done in last 10 minutes.
I don't know what updt_dt_tm
is but that sounds like a good idea
updt_dt_tm
Is connecting to the database a good idea to get the data every 10 minutes?
Yes. How else will you get the data out?
In summary, focus on getting a ten minute ODS replica working, based off incremental loads. Then your next step is to transform the data from the ODS to the star schema.
The 10 minute requirement came from the hospitals and people working on gathering data and analyzing data. That was the reason this project started. "Indexes should generally be disabled/dropped on tables before data loads then rebuilt afterwards"
– NancyK
5 mins ago
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
For a ten minute refresh cycle you're definitely going to work out an incremental data load method. The easiest way to do this is identify an existing field in the source that records whenever a record is changed. If you have this, then your incremental load is going to be a lot faster and more likely to fit into the ten minute window. Have you designed the data warehouse model yet? Medical data warehouses are tricky because the data sources are very 'key/value', which is not the easiest thing to model.
– Nick.McDermaid
28 mins ago