The modern data stack: inspiration from 100 stacks from the MDN.
Building a data stack is akin to lining up a winning sports team. The data decision-maker is the team’s coach. Trophies are data quality, reliability, business insights, ML models, dashboards, etc. Coaches define their strategy, selecting players based on their capabilities. They make sure they fit together in a delicate choreography. All of this within budget and staffing constraints. Data decision-makers perform the same exercise, with their data stack components.
This is a difficult exercise. Money can’t buy your way to the Champions League. Likewise, you can’t buy your way out of data problems. You need to build the proper data stack to solve them.
Do you want inspiration from 100 companies with best-in-class modern data stacks? You are in the right place. In this article, we share statistics and analyses from the 100+ companies in the Modern Data Network (MDN). Being data-driven people, we love to base our decisions on numbers. So here they are!
The MDN’s objective is to build a community to inform data teams’ decision-making. We achieve this by sharing experience and learning from each other. We believe this helps teams make better decisions and ultimately facilitates their mission to deliver value to their stakeholders. If you don’t know the Modern Data Network, you can take a look at this post for the full story behind our mission. If you don’t have time, here is its most condensed form: share and learn.
Before jumping in, one important comment. Not all companies play the same game. Winning your data trophies depends on your company’s business model, its industry, your team’s maturity, etc. Don’t buy the stack of someone else. Build the stack that solves your challenges. Leave a comment to let us know your view on this matter.
Who are the 100?
Let’s start with three high-level statistics about our sample:
- Business model: 45% B2B and 42% B2C.
- Industry: 10% finance, 9% food, 5% travel, 5% retail, 3% health, 3% advertising.
- Size: 52% have fewer than 100 employees, 10% have over 500.
A data stack solves a need in a specific context. This context defines the objectives and constraints for the stack. For instance, the objective could be providing data to the customers of a B2B fintech company of 2,000 employees. One constraint would be data security and privacy. Maybe that prevents having public cloud providers in your stack. Maybe your stack objective is to enable real-time processing for ML applications, without a strong cost constraint. Obviously, there are more than three dimensions determining the context a stack operates in. But these three already help you interpret the rest of the stats by putting them in context.
If Singer and Stitch make you think of sewing, the following will be surprising. Likewise if you associate Snowflake with winter, and Redshift with the Doppler effect. Never mind if Tableau is what you expect to hang in museums, or Periscope evokes submarines. These are all part of the data stack ecosystem.
The following introduces tools to manage the different steps in creating value out of data. These steps are below. For each step, you’ll get usage statistics from the MDN. Hopefully that helps you make decisions. NB: the sequence is not meant to be the “correct” one (if it exists).
- extract data from the system where it was produced,
- load it into your system,
- transform it through business logic,
- store it, to access it at will,
- and last, but not least, visualize it.
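To make the five steps concrete, here is a minimal sketch of them in Python. Everything in it is an assumption for illustration: the `SOURCE_ORDERS` records, the table names and the function names are hypothetical, and an in-memory SQLite database stands in for a real warehouse. It is not how any MDN member runs its stack.

```python
import sqlite3

# Hypothetical source records, standing in for an API or a production database.
SOURCE_ORDERS = [
    {"order_id": 1, "country": "FR", "amount_cents": 1250},
    {"order_id": 2, "country": "FR", "amount_cents": 3000},
    {"order_id": 3, "country": "DE", "amount_cents": 990},
]


def extract():
    """Extract: read rows from the system where they were produced."""
    return list(SOURCE_ORDERS)


def load(conn, rows):
    """Load: copy the raw rows, untransformed, into our own system."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, country TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :country, :amount_cents)", rows
    )


def transform(conn):
    """Transform and store: apply business logic in SQL and persist the result."""
    conn.execute(
        """
        CREATE TABLE revenue_by_country AS
        SELECT country, SUM(amount_cents) / 100.0 AS revenue
        FROM raw_orders
        GROUP BY country
        """
    )


def visualize(conn):
    """Visualize: a text bar chart stands in for a dashboard."""
    rows = conn.execute(
        "SELECT country, revenue FROM revenue_by_country ORDER BY country"
    ).fetchall()
    return [
        f"{country} {'#' * round(revenue / 10)} {revenue:.2f}"
        for country, revenue in rows
    ]


def run_pipeline():
    """Chain the steps: extract, load, transform and store, then visualize."""
    conn = sqlite3.connect(":memory:")
    load(conn, extract())
    transform(conn)
    lines = visualize(conn)
    conn.close()
    return lines


if __name__ == "__main__":
    print("\n".join(run_pipeline()))
```

In a real stack each function would typically be a separate tool or service. The point of the sketch is only the order of responsibilities, which, as noted above, is not the only possible one.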
Disclaimer #1: The data comes from a snapshot of the MDN as of 31/05/2021. The reality behind this snapshot is fluid: companies evolve their stacks, newcomers join the MDN, new products are launched, etc. If you’re reading this article in Q4 2021, beware that the snapshot will be outdated!
Disclaimer #2: The data is declarative. The MDN doesn’t guarantee it is exact. If you’re curious about a specific piece, please reach out with your questions.
Despite growing adoption of market tools, homemade solutions dominate the extract, load and transform steps.
56% of companies rely only on in-house platforms to extract and load data into their stacks. 14% have the opposite approach, using only out-of-the-box tools. And 26% leverage a mix of both. With the recent democratization of solutions like Fivetran or Stitch, this could be surprising. After some discussions, we converged on three assumptions explaining this phenomenon. First is legacy: companies built their stacks when these solutions were not well known. For instance, Google Trends for France shows that “Fivetran” as a company started to be searched in late 2019.
The two other assumptions were:
- Costs: Scale is expensive. For example, Stitch’s list price is $12K/year for 300M rows/month. At low volumes, SaaS tools save data teams time and effort. When business scales up, SaaS running costs and specific use cases may push companies towards custom solutions.
- Security: SaaS tools need to connect to your cloud or on-premise architecture. This means giving them a read-only account on your production database. Even if these tools don’t store data, this dependency could be a hindrance in certain situations.
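To put the cost argument in numbers, here is a small back-of-the-envelope sketch based on the Stitch list price quoted above. The function names are ours, and the linear extrapolation to higher volumes is an assumption made for illustration only: real price lists usually have tiers and negotiated discounts.

```python
def price_per_million_rows(annual_price_usd, million_rows_per_month):
    """Unit price in USD per million rows, from an annual price and a monthly row allowance."""
    million_rows_per_year = million_rows_per_month * 12
    return annual_price_usd / million_rows_per_year


def annual_cost_if_linear(unit_price_usd, million_rows_per_month):
    """ASSUMPTION: cost scales linearly with volume, which real pricing rarely does."""
    return unit_price_usd * million_rows_per_month * 12


# List price quoted in the article: $12K/year for 300M rows/month.
stitch_unit_price = price_per_million_rows(12_000, 300)

# Hypothetical volume, ten times the quoted allowance.
cost_at_3b_rows = annual_cost_if_linear(stitch_unit_price, 3_000)

if __name__ == "__main__":
    print(f"Unit price: ${stitch_unit_price:.2f} per million rows")
    print(f"Linear extrapolation at 3B rows/month: ${cost_at_3b_rows:,.0f} per year")
```

The quoted price works out to roughly $3.33 per million rows. Under the linear assumption, a tenfold increase in volume means a tenfold bill, which is the point where building in-house starts to look attractive.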
Splitting by company size, we observe that companies with fewer than 100 employees are twice as likely to use only SaaS tools as companies with more than 100. We saw two reasons. The first is scale. A small data team wants to deliver quickly the tasks covered by these tools; this keeps the company focused on its product-market fit. And usage-based pricing makes market solutions much cheaper at a small scale. The second is legacy. Smaller companies tend to be recent ones. That means less legacy and more opportunities to start with recent tools.
Python and SQL dominate the transformation space. 70% of the community use one of them. Third and growing fast is dbt. More than a third of our sample implements it for their transformations. Companies that have switched to dbt seem to love it, although operating it at scale remains a hot topic. Less than 10% of the sample uses Spark. We wonder if ubiquitous Cloud Data Warehouses are making Spark less relevant. But we also note that only a couple of companies in the sample are managing PB-scale data.
Cloud Data Warehouse: the ubiquitous piece
The Cloud Data Warehouse (CDW) is the cornerstone of the modern data stack. Its seemingly limitless capabilities brought fast adoption. BigQuery is leading the pack: 46% of companies in the sample use it. Redshift is second with 29% and Snowflake ranks third at 17%. Only 11% of our sample declared having only an on-premise data warehouse. CDWs leveraged the innovation / adoption feedback loop well. The ecosystem developed innovations and tools for CDWs. This improved the value of having one, which further grew the ecosystem by attracting more developers. Spinning this flywheel resulted in 90% of our sample including at least one CDW component.
A possible explanation for BigQuery’s dominance is its integration within Google’s ecosystem. BigQuery is available by default to Google Cloud Platform customers (like Redshift is for AWS). Moreover, 63% of our sample use Google Analytics, whose exports to BigQuery are trivial. This could further support adoption of BigQuery. And then, there is Data Studio, Google’s visualization solution.
The top 3 CDWs have their own strengths and weaknesses. Snowflake stands out by not being the default CDW of a public cloud provider. BigQuery and Redshift both leverage integration within their respective ecosystems, while Snowflake touts its multi-cloud capabilities. Competition on product features is fierce. Selecting a CDW is a difficult choice. We unfortunately have no magic recipe for making it. But below is our modest contribution: a table comparing BigQuery, Snowflake and Redshift.
Visualization: the tip of the iceberg
A picture is worth a thousand words, as the saying goes. Unsurprisingly, most people consume data from dashboards and graphs rather than straight out of their data warehouse. Anyone who has searched for a pattern in a query output would agree. Producing a graph makes the pattern stand out immediately. That’s why building and maintaining reporting consumes a substantial share of a data team’s bandwidth. It is the best way to find insights for the majority of data consumers. From a data consumer standpoint, the visualization layer is like the tip of the data stack iceberg. It is the most visible part, hiding 90% of the stack beneath it.
In this context, picking the right tool to translate data into charts is intimidating. It becomes the prism through which stakeholders see their data team’s output. How often have you heard that “the data is broken” when a dashboard doesn’t work? On the one hand, stakeholders have no PhDs in Chart Engineering or Dashboard Design. They want something simple, intuitive and (alas!) where it’s easy to copy and paste data into slides or spreadsheets. On the other hand, data teams need versatile and powerful tools that scale. They are heavy users and optimize for their productivity. Balancing both isn’t easy!
That’s why the MDN organized a workshop for its members putting Tableau and Looker in perspective. The format worked well and we’re considering repeating it with other solutions.
To finally break the suspense, here are the most popular visualization tools. The top 4 are: Tableau (35%), Data Studio (29%), Metabase (19%) and Looker (16%). NB: a third of respondents have implemented more than one visualization tool. With Google’s acquisition of Looker in Feb 2020, the aggregated market share of Data Studio and Looker puts Google ahead of Salesforce’s Tableau. This is nuanced, however, by the fact that Data Studio is paired with another solution two-thirds of the time. The complete league is available below.
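Because companies can use several tools, adding up the Data Studio and Looker shares double-counts any company using both. The sketch below bounds the combined share using only the percentages quoted above. The function names are ours, and the actual overlap between the two tools is not part of the published figures, so it is deliberately left unknown.

```python
def combined_share_bounds(share_a, share_b):
    """Bounds, in percent, on the share of companies using tool A or tool B (or both).

    Lower bound: every user of the smaller tool also uses the larger one.
    Upper bound: no company uses both (capped at 100%).
    """
    return max(share_a, share_b), min(100, share_a + share_b)


def overlap_threshold(share_a, share_b, rival_share):
    """Overlap, in points, below which A-or-B still exceeds the rival's share."""
    return share_a + share_b - rival_share


DATA_STUDIO, LOOKER, TABLEAU = 29, 16, 35  # percentages from the article

low, high = combined_share_bounds(DATA_STUDIO, LOOKER)
threshold = overlap_threshold(DATA_STUDIO, LOOKER, TABLEAU)

if __name__ == "__main__":
    print(f"Google tools combined: between {low}% and {high}% of companies")
    print(f"Ahead of Tableau only if fewer than {threshold}% use both")
```

In other words, Google’s two tools together reach between 29% and 45% of the sample. They stay ahead of Tableau’s 35% only as long as fewer than 10% of companies run both Data Studio and Looker.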
A hundred stacks to inspire you
We hope going over the stacks of a hundred members of the MDN was a source of inspiration. You can take a look at the underlying data here. If you were to remember only two things from this article, please keep in mind to:
- Identify your challenges and build the stack that solves them.
- Share and learn. Ask for feedback before picking your stack’s components.
We left out a number of pieces of a modern data stack. Things like reverse ETL, cataloging, lineage, MLOps, etc. They are important as well; let us know if you’d like to know more about them.
Please let us know how useful this content is, as well as which topics you’re interested in reading about. Send us an email at email@example.com; we’re looking forward to it!
Thanks for reading this far and see you soon in the MDN community!
Article by Christophe Blefari and Emmanuel Martin-Chave. Reviewed by the MDN founders.