Where's your metadata? Part 2

Exploring open-source metadata systems

Oct 27, 2021

As I continue my research into different metadata systems, I want to share my experiences and commentary along the way. My exploration of metadata systems is actually a part of my overall exploration of the modern data stack.

During my career, I have been involved in building several platforms and data products. I feel that discussions on data platforms and stacks don't actively include metadata systems. Perhaps we all take metadata for granted, which would be a mistake.

Metadata should be at the heart of the modern data stack, not at the periphery, not an afterthought.

Development of and interest in metadata management have been growing steadily, with many contenders, both open source and commercial. Data-first companies (like those listed below) saw an explosion of data and of demand for access to that data, and therefore needed to invent ways to manage and govern it. In the process, each created its own iteration of a metadata system to support data discovery, exploration, enrichment, compliance, governance, and usage. Here is an incomplete list of data discovery/metadata projects:

Uber’s Databook
LinkedIn’s Datahub
Lyft’s Amundsen
Netflix’s Metacat
WeWork’s Marquez
Spotify’s Lexikon
Airbnb’s Dataportal
Shopify’s Artifact

As you can see, many teams have been at it for a while, some for several years and some more recently. Commercial products like Alation and Collibra (and perhaps others?) have been around for years. I don’t have experience with them, so I can’t say much and won’t cover them. Cloud providers offer their own “data catalog” solutions: Google Cloud Data Catalog, AWS Glue Data Catalog, and Azure Data Catalog. I don’t plan to research these solutions either, as they are specific to their ecosystems.

My focus is on open-source systems. I am interested in the needs of teams similar to the ones I have led, teams that build data platforms and data stacks, and I want to understand what they should consider for metadata management as part of a modern data stack.

If you are involved in building data platforms and stacks, I am sure you too have seen your share of challenges and needs that resulted in these metadata systems. The need for a metadata system can be felt regardless of the variations in scale, size, and complexity. Everyone building a data-first ecosystem will need to think seriously about metadata management. Pay attention to metadata now rather than wait and regret.

Of the open-source projects I have looked at so far, Open Metadata (from ex-Uber folks) and Datahub (from LinkedIn) seem the most interesting and contemporary to me.

Open Metadata

Last time, I wrote about Open Metadata, which appears to be the newest of this group. I have installed it and studied the APIs. You can play with their demo sandbox without installing anything.

Previous article: Where's your metadata (Dispatches by Deepak Alur)

Datahub

Datahub is ahead in terms of development, maturity, and community. Highlights of what I liked:

Extensible Schema-based approach which is super important (uses PDL, though I prefer standard JSON-based schemas)
Rich set of GraphQL APIs - it is smart to use GraphQL
An impressive list of sources to ingest out-of-the-box
Demo site to play around
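To make the GraphQL point concrete, here is a minimal sketch of what a search request to a GraphQL metadata API could look like. It only builds the JSON body and sends nothing over the network. The query text, the `SearchInput` shape, and the `DATASET` entity type are assumptions based on my reading of Datahub's GraphQL schema, and `build_search_request` is a hypothetical helper of my own, so check all of it against the version you run.

```python
import json

# Assumed query shape: the field names (search, searchResults, entity, urn)
# are assumptions to verify against the GraphQL schema of your deployment.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_request(text, entity_type="DATASET", start=0, count=10):
    """Return the JSON body for a GraphQL POST (hypothetical helper).

    Nothing is sent; the caller decides which endpoint and auth to use.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    return {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {
                "type": entity_type,
                "query": text,
                "start": start,
                "count": count,
            }
        },
    }


if __name__ == "__main__":
    body = build_search_request("orders", count=5)
    # Show only the variables; the query text is printed as-is above.
    print(json.dumps(body["variables"], indent=2))
```

What I like about this style is visible in the query text itself: the client names exactly the fields it wants back (here just `urn` and `type`), so a UI, a script, and a governance tool can each ask for a different slice of the same metadata graph.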

Shirshanka Das, who founded and architected Datahub while at LinkedIn[1], wrote a great article on different metadata systems: Popular metadata architectures explained. He compares and contrasts the architectures (1st, 2nd, and 3rd generation) at play among these systems. It is important to understand the various architectures and their tradeoffs; do not automatically write off a solution just because it is not the latest generation. An architecture is neither good nor bad in itself; it has to be judged by the overall context and the value it delivers. Architecture aside, my focus as a user/adopter of these systems is on the value they can deliver to data teams and users.

[Figure: Datahub architecture. Source: Datahub Architecture Overview]

So far, I have tried to lay the foundation for exploring metadata systems here and in my previous article.

Going forward, I want to discuss several topics and concepts related to data, metadata, data platforms, and data stacks: use cases, user personas, the role of metadata in modern data stacks, how it relates to Data Mesh[2], and how metadata aids automation, machine learning, and AI. I also want to dig into topics around metadata including versioning, dependency management, and standardization. Let me know if these topics interest you.

What are you doing about metadata? Share your experiences and opinions.

[1] Shirshanka Das (ex-LinkedIn) and Swaroop Jagadish (ex-Airbnb) are cofounders of Acryl Data, the company behind Datahub.

[2] I first heard about Data Mesh from Todd Fast as we embarked on building our new data platform at OpenGov. Todd joined us from Intuit, where they are firm believers in Data Mesh. Recommended read: Data Mesh by Zhamak Dehghani.
