In my earlier post, I shared that I had narrowed my exploration of open-source metadata systems to two options: OpenMetadata (from ex-Uber folks) and DataHub (from LinkedIn).
I want to share another article, this one by Julia Valenti from Cisco, who recently did an excellent job sharing her insights. I hope she writes a similar article on DataHub for comparison.
See: Why OpenMetadata is taking the right approach to metadata cataloging
Julia highlights the three main things we should care about in any metadata system: a schema-based design, APIs, and lineage. At a high level, both systems support all three, but they differ in how.
On APIs, note that DataHub has a rich set of GraphQL APIs, while OpenMetadata has a rich collection of REST APIs. What matters more than the choice between GraphQL and conventional REST is the set of features these APIs actually expose, and how easily your developers can use and integrate them to derive maximum value.
On lineage, as Julia notes, everyone is working on improving lineage information. In my opinion, this is where a "standard" can come into play at some point. I hope the metadata community converges and collaborates to establish some standards among all these emerging systems. I, as a user, would want that from these systems. It becomes essential for interoperability and portability, but I am getting ahead of myself here.
Metadata aside, I am getting curious about actual data access management. Sure, we get great information about our data from metadata systems. But when tools or programs access the underlying datasets and data sources, how do the metadata systems make that easier? How and where should we implement access control for those datasets and data sources? Where do the access control policies reside, and where do they get enforced? What is the role of an API gateway in relation to metadata systems?
I believe that metadata plays a significant role in policy definitions for access control and management. If you have opinions or pointers for these areas, please share.
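To make the idea of metadata-driven policies a bit more concrete, here is a minimal sketch in Python. It is purely illustrative and not based on the API of either OpenMetadata or DataHub: the names `DatasetMetadata`, `Policy`, and `is_allowed`, and the "PII" tag, are all hypothetical. It assumes a catalog can supply classification tags for a dataset, and that a policy maps a tag to the roles allowed to read data carrying it.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical view of what a metadata catalog knows about a dataset."""
    name: str
    tags: FrozenSet[str]  # classification tags, e.g. {"PII"}


@dataclass(frozen=True)
class Policy:
    """Hypothetical rule: data carrying `tag` is readable only by `allowed_roles`."""
    tag: str
    allowed_roles: FrozenSet[str]


def is_allowed(role: str, dataset: DatasetMetadata, policies: Iterable[Policy]) -> bool:
    """Deny access if any policy matching one of the dataset's tags excludes the role."""
    for policy in policies:
        if policy.tag in dataset.tags and role not in policy.allowed_roles:
            return False
    # No matching policy denied the request. A stricter design might deny by default.
    return True


# Example data (made up for illustration).
policies = [Policy(tag="PII", allowed_roles=frozenset({"privacy_officer"}))]
customers = DatasetMetadata(name="customers", tags=frozenset({"PII"}))
page_views = DatasetMetadata(name="page_views", tags=frozenset())

print(is_allowed("analyst", customers, policies))
print(is_allowed("privacy_officer", customers, policies))
print(is_allowed("analyst", page_views, policies))
```

In this sketch the policy is defined in terms of metadata (tags) rather than individual dataset names, so tagging a new dataset in the catalog would bring it under the existing policy. Where such a check should actually run (an API gateway, the query engine, or the storage layer) is exactly the open question.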