While conducting a data engineering task, I loaded data from a columnar database, stored it in a Parquet file, and then placed the Parquet file in a data lake. I aimed to comprehend why a data lake was more suitable for this use case than a data warehouse. My curiosity prompted me to conduct research, ask questions, learn, and, ultimately, share my findings through this article.
I would be using the database for an e-commerce website to explain what a data warehouse and a data lake mean and when to choose one over the other. I will call the e-commerce database eDB.
From OLTP to OLAP
eDB stores the current transaction data for the e-commerce website. The company has decided to enhance its decision-making capabilities using data and has recruited several data professionals for this vital task.
The Data Analysts already have a predefined structure for customer records, enabling them to create dashboards for monitoring and reporting. Machine Learning Engineers prefer having access to the raw data for more in-depth analysis. Both sets of professionals aim to conduct analytical processing on the data stored in eDB.
eDB serves as an OLTP database, specifically designed for real-time storage of day-to-day transactions on the e-commerce website. However, for complex analysis and aggregations, a different type of database is needed. This is where OLAP comes into play.
OLAP: Data Warehouse
A data warehouse can be likened to a physical goods warehouse. Similar to how a goods warehouse organizes items in specific categories and arrangements, a data warehouse structures data with a predefined method.
A data warehouse, a type of OLAP database, operates with a predefined schema, requiring data transformation for alignment. This schema consistency makes it a preferred choice for Data Analysts working with DB data.
Unlike OLTP databases, data warehouses amalgamate both current and historical data. Data, sourced from various origins such as eDB, is periodically ingested (daily, weekly, monthly, bi-weekly). Consequently, the data in the warehouse may not be up-to-date at all times.
Storing data in a data warehouse involves intricate orchestration, incurring costs. This process is recommended when the data structure is known in advance. Various data warehouses exist, and the choice should align with specific business needs.
OLAP: Data Lake
A data lake is comparable to a natural lake, a body of water where various entities, such as fishes and insects, can coexist.
A data lake, similar to a data warehouse, functions as an OLAP database but stands out by accommodating diverse data types—structured, semi-structured, and unstructured. This makes it ideal for eDB Machine Learning Engineers dealing with uncertain data structures.
Similar to a data warehouse, a data lake incorporates both current and historical data through periodic ingestion. The key distinction lies in the absence of a fixed schema, making a data lake more flexible and less reliant on extensive data orchestration.
However, this flexibility demands careful data governance to prevent the data lake from turning into a chaotic "data swamp," emphasizing the need for responsible management practices.
Based on these findings, I now understand why I opted for a data lake over a data warehouse for my data loading. I hope this article provides a clear insight into the distinctions and similarities, helping you make informed choices between the two. 🙂