As I continue my research into different metadata systems, I want to share my experiences and commentary along the way. My exploration of metadata systems is actually a part of my overall exploration of the modern data stack.
During my career, I have been involved in building several platforms and data products. I feel that discussions on data platforms and stacks don't actively include metadata systems. Perhaps we all take metadata for granted, which would be a mistake.
Metadata should be at the heart of the modern data stack, not at the periphery, not an afterthought.
Development of and interest in metadata management have been growing steadily, with many contenders, both open source and commercial. Data-first companies (like those listed below) saw an explosion of data and of demand for access to that data, and therefore needed to invent ways to manage and govern it. In the process, each created its own iteration of a metadata system to support data discovery, exploration, enrichment, compliance, governance, and usage. Here is an incomplete list of data discovery/metadata projects:
Uber’s Databook
LinkedIn’s Datahub
Lyft’s Amundsen
Netflix’s Metacat
WeWork’s Marquez
Spotify’s Lexikon
Airbnb’s Dataportal
Shopify’s Artifact
As you can see, many teams have been at it for a while, some for several years and some more recently. Commercial products like Alation and Collibra (and perhaps others?) have been around for years. I don’t have experience with them, so I can't say much and won’t cover them. Cloud providers offer their own “data catalog” solutions: Google Cloud Data Catalog, AWS Glue Data Catalog, and Azure Data Catalog. I don’t plan to research these solutions either, as they are specific to their ecosystems.
My focus is on open-source systems. I am interested in the needs of teams similar to the ones that I have led, teams that build data platforms and data stacks, and I am trying to understand what the considerations for metadata management should be as part of a modern data stack.
If you are involved in building data platforms and stacks, I am sure you too have seen your share of the challenges and needs that resulted in these metadata systems. The need for a metadata system can be felt regardless of variations in scale, size, and complexity. Everyone building a data-first ecosystem will need to think seriously about metadata management. Pay attention to metadata now rather than wait and regret it later.
Of the open-source projects I have looked at, Open Metadata (from ex-Uber folks) and Datahub (from LinkedIn) seem the most interesting and contemporary to me.
Open Metadata
Last time, I wrote about Open Metadata, which appears to be the newest of this group. I have installed it and studied the APIs. You can play with their demo sandbox without installing anything.
Datahub
Datahub is ahead in terms of development, maturity, and community. Highlights of what I liked:
Extensible schema-based approach, which is super important (uses PDL, though I prefer standard JSON-based schemas)
Rich set of GraphQL APIs - it is smart to use GraphQL
An impressive list of sources to ingest out-of-the-box
Demo site to play around with
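To make the GraphQL point a little more concrete, here is a minimal sketch of what a dataset search request could look like. It only builds the JSON body and does not send anything. The endpoint path, the `SearchInput` type name, and the query shape are my assumptions about Datahub's GraphQL schema and should be checked against the version you run; `build_search_request` is a hypothetical helper name used only for illustration.

```python
import json

# Assumption: path and query shape are illustrative, not taken from this
# article; verify them against the GraphQL schema of your Datahub version.
GRAPHQL_PATH = "/api/graphql"

SEARCH_QUERY = """
query searchDatasets($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_request(text, count=10):
    """Return the JSON body for a dataset search, without sending it."""
    payload = {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {
                "type": "DATASET",  # assumed entity type name
                "query": text,
                "start": 0,
                "count": count,
            }
        },
    }
    return json.dumps(payload)


# Hypothetical usage: inspect the variables that would be posted.
body = build_search_request("orders", count=5)
print(json.loads(body)["variables"])
```

The appeal of GraphQL here is that the caller names exactly the fields it wants (only `urn` and `type` in this sketch), so the same API can serve a search box, a lineage view, or a script without a separate endpoint for each.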
Shirshanka Das, who founded and architected Datahub while at LinkedIn [1], wrote a great article on different metadata systems: "Popular metadata architectures explained". He compares and contrasts the various architectures (1st, 2nd, and 3rd generation) at play among these systems. It is important to understand the various architectures and their tradeoffs; do not automatically write off a solution just because it is not the latest generation. An architecture is neither good nor bad in itself; it has to be judged against the overall context and the value it delivers. Architecture aside, my focus as a user/adopter of these systems is on the value they can deliver to data teams and users.
So far, I have tried to lay the foundation for exploring metadata systems here and in my previous article.
Going forward, I want to discuss several topics and concepts related to data, metadata, data platforms, and data stacks: use cases, user personas, the role of metadata in modern data stacks, how it relates to Data Mesh, and how metadata aids and assists automation, machine learning, and AI. I also want to dig into topics around metadata including versioning, dependency management, and standardization. Let me know if these topics interest you.
What are you doing about metadata? Share your experiences and opinions.
[1] Shirshanka Das (ex-LinkedIn) and Swaroop Jagadish (ex-Airbnb) are cofounders of Acryl Data, the company behind Datahub.