I work for a data science software company where we make tools to help people do data science (DS). Makes sense. Actually, we make one tool that covers the whole range: data wrangling, normalizing, testing, modeling, hosting and monitoring. There’s a lot more to DS than the cool stuff data scientists love to do. Date normalization is an easy target. We might come back to this later; in the meantime…
What’s a Datamesh? According to http://Starburst.io the answer is: “Data mesh is a new approach based on a modern, distributed architecture for analytical data management. It enables end users to easily access and query data where it lives without first transporting it to a data lake or data warehouse. The decentralized strategy of data mesh distributes data ownership to domain-specific teams that manage, own, and serve the data as a product.”
Let’s discuss: “New approach”? Not really, it’s just the correct way to manage data. Saying “new approach” always sounds cooler.
“Modern, distributed architecture for analytical data management.” So… distributed OLAP? Distributed how? Distributed why? We don’t like warehouses anymore?
“It enables end users to easily access and query data where it lives without first transporting it to a data lake or data warehouse.” “Easily” is always a fun word. Sorta like “user friendly” from ’80s software. Right, play the data where it lies… AKA remote query. So instead of moving data, you connect directly to the source and its local processing engine. That last bit is important: not all data sources are SQL engines. Sometimes you have to deal with clusters.
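To make “play the data where it lies” concrete, here is a minimal sketch in Python. It assumes the source is a SQL engine, and it uses an in-memory SQLite database as a stand-in for the remote store; the `orders` table and both function names are made up for illustration. The point is only that the aggregation runs inside the source’s engine and just the small result comes back.

```python
import sqlite3


def build_demo_source() -> sqlite3.Connection:
    """Stand-in for a remote data source (hypothetical table and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("east", 10.0), ("east", 5.0), ("west", 7.5)],
    )
    conn.commit()
    return conn


def total_by_region(conn: sqlite3.Connection) -> dict:
    """Push the aggregation down to the source engine.

    Only one row per region travels back to the caller, instead of
    copying every order into a lake or warehouse first.
    """
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    source = build_demo_source()
    print(total_by_region(source))
```

When the source is a cluster rather than a SQL engine, the same idea applies, but the “query” becomes a job submitted to that cluster instead of a SQL string.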
“The decentralized strategy of data mesh distributes data ownership to domain-specific teams that manage, own, and serve the data as a product.” Awesome. What tools are you providing to enable these “domain-specific teams”? (We used to call them domain experts or SMEs.) I’m all about SMEs being able to curate and publish data products. I’m all about making it easily searchable. I’m also all about making sure it’s well explained, quality controlled and secure when needed.
So, is Datamesh a product? Nope. It’s a concept. It’s data management done the way it should be. Sorta like Cloud is basic IT done the way it should be. Individual corporations struggle to implement either one effectively because it’s “overhead.” Especially if you have to build the solution yourself: Elasticsearch, Airflow, some wiki, some platform for hosting all of this, user management, security, etc. Plus monitoring and hopefully some Data freaking Science!
Get this all in one product: Dataiku DSS. It is a centralized data platform that allows you to connect to multiple data stores, process data where it lies, create DS models, deploy them and monitor them in one stop. It can also generate model documentation to be reviewed so we don’t have another Apple credit card incident. Plus, it allows SMEs to contribute to the process via “Visual Recipes.” I hear all of the data scientists now crying about visual tools. You can still code the cool bits, in your own IDE or a notebook. This just lets SMEs do the easy stuff, like normalizing date/time fields, expanding log file attributes, or a couple of hundred other things to prep your data.
The easy example they use in their 101 tutorial is normalizing T-shirts. M-T-Blk should be Men-T-Black to match all of the other records in the table. Done visually, so the SME doesn’t have to worry about learning Python, Rust, Go, R, Julia, etc.
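For the coders in the room, here is roughly what that T-shirt cleanup amounts to if you write it by hand. This is a sketch in plain Python, not what DSS generates or runs. Only the M-T-Blk to Men-T-Black mapping comes from the example above; the other abbreviations, the three-part label layout and the function name are my assumptions.

```python
# Hypothetical abbreviation tables. Only "M" -> "Men" and "Blk" -> "Black"
# come from the tutorial example; the other entries are illustrative.
CATEGORY = {"M": "Men", "W": "Women"}
COLOR = {"Blk": "Black", "Wht": "White"}


def normalize_label(label: str) -> str:
    """Expand abbreviated parts of a 'category-garment-color' label.

    Labels that do not have exactly three dash-separated parts are
    returned unchanged, as are parts with no known abbreviation.
    """
    parts = label.split("-")
    if len(parts) != 3:
        return label
    category, garment, color = parts
    return "-".join(
        [CATEGORY.get(category, category), garment, COLOR.get(color, color)]
    )


if __name__ == "__main__":
    for raw in ["M-T-Blk", "Men-T-Black", "W-T-Wht"]:
        print(raw, "->", normalize_label(raw))
```

Already-clean values pass through untouched, so the function is safe to run over the whole column. It is also exactly the kind of thing an SME should not need a programming language for.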
DSS also has “Project” level wikis. Document your heart out with common wiki tools. It’ll be attached to your project and you can even link to specific parts of your flow for further clarification. And when your curated data is ready, you can share it easily. Other users can find shared datasets using the Data Catalog that is also included.
I agree with the concepts of Datamesh. Just know it’s a concept, not a product and it will take a lot of work to implement. Should you? YES.
I imagine a large company with many unique datasets and a centralized DS team. They allow their SMEs to “curate” datasets for exposure to the corporate hub. Then the DS team can search the Data Catalog to find appropriate datasets, read about how they were curated and use them in unique projects as needed.
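The publish-then-search workflow I’m imagining can be sketched as a toy, in-memory catalog. To be loud about it: this is not Dataiku’s API or anyone else’s. `DataProduct`, `Catalog`, `publish`, `search` and the sample entries are all hypothetical names, here only to show the shape of the idea: SMEs publish a described, owned dataset and the DS team finds it by keyword.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A curated dataset as an SME team would publish it (hypothetical)."""
    name: str
    owner: str
    description: str
    tags: set[str] = field(default_factory=set)


class Catalog:
    """Toy in-memory catalog: publish by name, search by keyword."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        # Re-publishing under the same name replaces the earlier entry.
        self._products[product.name] = product

    def search(self, term: str) -> list[str]:
        """Return names of products whose name, description or tags match."""
        term = term.lower()
        return sorted(
            p.name
            for p in self._products.values()
            if term in p.name.lower()
            or term in p.description.lower()
            or term in {t.lower() for t in p.tags}
        )


if __name__ == "__main__":
    catalog = Catalog()
    catalog.publish(DataProduct(
        "tshirt_sales", "retail-smes",
        "Normalized T-shirt orders, one row per order line.", {"retail", "clean"},
    ))
    catalog.publish(DataProduct(
        "web_logs", "platform-smes",
        "Expanded web server log attributes.", {"logs"},
    ))
    print(catalog.search("retail"))
```

A real catalog adds the parts that make this hard: access control, lineage, quality checks and the documentation of how each dataset was curated.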
According to the above definition, Dataiku DSS can be a Datamesh if you use it correctly. Just my $0.02.