Apache Spark + Delta Lake concepts

1) Leave it up to your data scientists. They should be comfortable working in the silver and gold layers; some more advanced data scientists will want to go back to the raw data and parse out additional information that was not carried into the silver/gold tables.

2) Bronze = raw data, either in its native format or in Delta Lake format. Silver = sanitized and cleaned data in Delta Lake. Gold = data that is accessed via Delta Lake or pushed to a data warehouse, depending on business requirements.

3) Delta architecture is a simpler version of lambda architecture. It is a commercial term at this point; we'll see if that changes in the future.

4) Delta Lake + Spark is the most scalable data storage mechanism at a reasonable price. You're welcome to test the performance against your business requirements. Delta Lake will be far cheaper than any data warehouse for storage. Your requirements around data access and latency will be the larger question.

5) Kafka, Kinesis, or Event Hubs are the tools for getting data from the edge to the data lake. Delta Lake can act as both a source and a sink for a streaming application. There are actually very few problems with using Delta as a source. The Delta Lake source lives on blob storage, so we sidestep many infrastructure issues but take on the consistency issues of blob storage. Delta Lake as a source for streaming jobs is far more scalable than Kafka, Kinesis, or Event Hubs, but you still need those tools to get data from the edge into the Delta Lake.
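
To make point 5 concrete, here is a minimal PySpark Structured Streaming sketch of Kafka feeding a bronze Delta table, and that bronze table then acting as the streaming source for silver. It is an illustration under assumptions, not a reference implementation: the paths, the function names, and the single null filter standing in for "cleaning" are all hypothetical, and `spark` is assumed to be an existing SparkSession with the Delta Lake and Kafka connector packages available.

```python
# Sketch only: the paths and function names below are hypothetical placeholders.
# `spark` is assumed to be a SparkSession with the Delta Lake and Kafka
# connector packages on the classpath.

BRONZE_PATH = "/mnt/lake/bronze/events"  # hypothetical location
SILVER_PATH = "/mnt/lake/silver/events"  # hypothetical location


def ingest_to_bronze(spark, bootstrap_servers, topic, checkpoint):
    """Kafka -> bronze: land the raw payload as-is (Delta as a sink)."""
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )
    return (
        # Keep the payload untouched apart from casting bytes to a string.
        raw.selectExpr("CAST(value AS STRING) AS raw_value", "timestamp")
        .writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint)
        .start(BRONZE_PATH)
    )


def bronze_to_silver(spark, checkpoint):
    """Bronze -> silver: the bronze Delta table is the streaming source."""
    bronze = spark.readStream.format("delta").load(BRONZE_PATH)
    # Placeholder cleaning step; real sanitization rules would go here.
    cleaned = bronze.where("raw_value IS NOT NULL")
    return (
        cleaned.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint)
        .start(SILVER_PATH)
    )
```

Each stream needs its own checkpoint location. Kafka appears only at the edge; every hop after bronze reads from and writes to Delta.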


  1. The medallion tables are a recommendation based on how our customers are using Delta Lake. You do not have to follow it exactly; however, it does align nicely with how people design EDWs. As for machine learning and which table to use, that is going to be a choice for the folks doing the machine learning. Some may want to access the Bronze tables because that is the raw data and nothing has been done to it. Others may want the Silver tables because they are presumed to be clean, albeit augmented. Usually the Gold tables are highly refined and specific to answering well-defined business questions.

  2. Not exactly. The Bronze tables are the raw event data, e.g. one row per event or measurement. The Silver tables are also at the event/measurement level, but they are highly refined and ready for queries, reporting, dashboards, etc. The Gold tables can be fact and dimension tables, aggregate tables, or curated data sets. It is important to remember that Delta is not meant to be used as a transactional, OLTP system. It is really meant for OLAP workloads.

  3. Delta architecture is the name we gave to a particular implementation of Delta Lake. It is not a commercial term per se, but hopefully it becomes one. There is enough information out there to compare and contrast it with the Kappa and Lambda architectures. The Delta architecture is well defined throughout the Delta documentation and Databricks blogs, tech talks, YouTube videos, etc.

  4. I would ask what exactly you want to compare: speed, features, products, ...?

  5. Delta Lake is not trying to replace any pub/sub messaging systems; they have different use cases. Delta Lake can connect to each of the products you mention, both as a subscriber and as a publisher. Don't forget that Delta Lake is an open storage layer that brings ACID-compliant transactions, high performance, and high reliability to data lakes.
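
As a rough illustration of the granularity described in point 2, the sketch below uses plain Python lists as stand-ins for the three tables (no Spark or Delta involved). The sensor readings, field names, and helper functions are made up for the example. Bronze and silver both hold one row per measurement, while gold is an aggregate.

```python
from collections import defaultdict

# Hypothetical raw readings: one row per measurement, exactly as received.
bronze = [
    {"device": "a", "temp_c": "21.5"},
    {"device": "a", "temp_c": "22.5"},
    {"device": "b", "temp_c": "bad"},  # malformed measurement
    {"device": "b", "temp_c": "19.0"},
]


def to_silver(rows):
    """Still one row per measurement, but typed, with malformed rows dropped."""
    silver = []
    for row in rows:
        try:
            silver.append({"device": row["device"], "temp_c": float(row["temp_c"])})
        except ValueError:
            continue  # the unparseable reading remains available in bronze
    return silver


def to_gold(rows):
    """Aggregate table: average temperature per device."""
    readings = defaultdict(list)
    for row in rows:
        readings[row["device"]].append(row["temp_c"])
    return {device: sum(vals) / len(vals) for device, vals in readings.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
```

The malformed reading is dropped from silver but is still present in bronze, which is why some data scientists prefer to go back to the raw layer.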

Louis.