“There's more than one way to skin a cat,” as the saying goes. If you're a cat person, try not to think too hard about that metaphor, but the sentiment holds, and we would argue the adage applies perfectly to this particular topic.
While most of your data problems can be solved with a terminal command, a Python script, or even a simple Excel spreadsheet, things become more complicated once you take speed, consistency, and scale into account.
Moreover, the variety of processes and tools in the data space has led tool developers to specialize in specific tasks rather than supporting the interplay of core disciplines such as data modeling, statistics, and effective data visualization.
Fortunately, there are exceptions to this. The best approach to building data platforms today seems to be a combination of DIY work and managed services from the modern data stack, engineering your platform to be flexible enough to handle scenarios you cannot yet anticipate.
5 Steps to Building a Modular Data Platform
If implemented correctly, the modern data stack allows data professionals to focus on science and math instead of getting bogged down in archaic processes revolving around documentation and administration.

Modern data platforms are based on the concept of modularization. Despite clever marketing and sales campaigns, no single vendor or technology currently owns the entire data landscape. To put together the right solution for your specific project, you must understand each component, from the source, through integration, transformation, and other aspects, all the way to presentation and transportation.
Let’s look at each stage separately and discuss the tools and services you can use to set up your modern data platform. As the source depends on your data set, we will skip this step and move straight to integration.
Integration
Data sources can take many forms, and the integration layer should be flexible enough to accommodate them all. Airflow is one of the most popular DIY tools in this category, enabling users to build robust end-to-end pipelines.

Other Apache options, such as Kafka, offer a more event-based approach to data integration. Combining these tools to extend your data pipeline further is a viable strategy as well.
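To make this concrete, here is a minimal sketch of what an Airflow pipeline can look like: a daily DAG that pulls records from a source API and hands them off to a load step. The endpoint, table, and task names are hypothetical, and the imports assume Airflow 2.x; treat it as a starting point rather than a prescription.

```python
# A minimal daily extract-and-load DAG. All names and the source endpoint
# are hypothetical; adapt them to your own stack.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders(**context):
    # Pull records from a (hypothetical) source API.
    resp = requests.get("https://api.example.com/orders", timeout=30)
    resp.raise_for_status()
    return resp.json()  # pushed to XCom for the next task


def load_orders(**context):
    # Retrieve the extracted records and land them in the warehouse's raw zone.
    orders = context["ti"].xcom_pull(task_ids="extract_orders")
    # Insert into your warehouse here (e.g., via SQLAlchemy or a native client).
    print(f"Loaded {len(orders)} orders")


with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_orders", python_callable=load_orders)
    extract >> load
```

An event-based pipeline built on Kafka would look quite different: instead of a scheduled extract, a consumer would react to messages as they arrive.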
Managed services have seen significant progress in the integration space in recent years. Several leaders in this space, including Confluent and Astronomer, offer flexibility while helping accelerate development. While Segment stands out as the leader in event-based data integration, Meltano is emerging as a top-notch open-source solution for more traditional ELT implementations.
Data Warehouse
Modern data platforms are incomplete without a data warehouse, which may be the most complex and critical part. This is partly because legacy database technologies such as MySQL, Postgres, and SQL Server are still very effective.

Despite this, newcomers like Snowflake provide a clear path for the future. BigQuery, Redshift, Snowflake, and others are cloud-based data warehouses that offer numerous benefits over their predecessors.
No matter which of these you choose, the concept of partitioning a cloud-based data warehouse into layers is still evolving. The growing consensus is that your warehouse should have two separate "zones": one storing raw/unstructured data, the other storing normalized/transformed data.
Though there is much to debate on this topic, the benefit of splitting your data into two distinct areas is that it lets you manage the ever-changing rules for turning unprocessed data into usable information.
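As a rough illustration of the two-zone pattern, the sketch below creates a raw zone that lands data as-is and a transformed zone derived from it. It assumes a Postgres-compatible warehouse and SQLAlchemy; the connection string, schema, and column names are invented for the example.

```python
# A minimal sketch of the two-zone warehouse pattern. All names are
# illustrative; adapt them to your own warehouse.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@host/warehouse")  # hypothetical DSN

with engine.begin() as conn:
    # Zone 1: raw data, landed as-is from the integration layer.
    conn.execute(text("CREATE SCHEMA IF NOT EXISTS raw"))
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS raw.orders (
            payload JSONB,
            loaded_at TIMESTAMP DEFAULT now()
        )
    """))

    # Zone 2: normalized/transformed data, rebuilt as the rules evolve.
    conn.execute(text("CREATE SCHEMA IF NOT EXISTS analytics"))
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS analytics.orders AS
        SELECT
            (payload ->> 'id')::INT         AS order_id,
            (payload ->> 'amount')::NUMERIC AS amount,
            loaded_at
        FROM raw.orders
    """))
```

Because the transformed zone is derived entirely from the raw zone, you can rebuild it from scratch whenever the transformation rules change, without re-ingesting anything from the source.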
Transformation
While the data warehouse is probably the most crucial component of the modern data stack, the transformation side of things is the most neglected. In most organizations, data transformations are dispersed across visualization platforms, business tools, and even handwritten documents, yet centrally managing data transformations is a clear hallmark of a mature organization.

As the battle between ETL and ELT entered the mainstream, the idea of efficiently managing transformations began to take root. Even though it may seem pedantic, simply rearranging the letters of a common acronym paved the way for data products to be built by people with no data background.
As a result of this paradigm shift, concepts such as data governance and master data management, which are heavily dependent on stakeholder input, have been given a new lease on life.
From a DIY perspective, Python reigns supreme: it easily handles fundamental SQL/task-based transformations with libraries like Airflow and SQLAlchemy, and it supports machine learning transformations via scikit-learn and TensorFlow.
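As a sketch of what a centrally managed, Python-based transformation might look like, the following reads from the raw zone, cleans the data with pandas, and writes the result to the transformed zone. The connection string and table names are hypothetical.

```python
# A short, hypothetical transformation: raw zone in, analytics zone out.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@host/warehouse")  # hypothetical DSN

# Extract raw records loaded by the integration layer.
raw = pd.read_sql("SELECT * FROM raw.customers", engine)

# Transform: deduplicate, enforce types, drop unusable rows.
clean = (
    raw.drop_duplicates(subset="email")
       .assign(signup_date=lambda df: pd.to_datetime(df["signup_date"], errors="coerce"))
       .dropna(subset=["signup_date"])
)

# Load the result into the transformed zone for downstream consumers.
clean.to_sql("customers", engine, schema="analytics", if_exists="replace", index=False)
```

The point is less the specific cleaning steps than where they live: in one versioned, reviewable place rather than scattered across dashboards and spreadsheets.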
In terms of managed services, dbt is hard to beat. Although all three major cloud providers (AWS, Microsoft, and Google) have transformation management tools specialized for their platforms, dbt appears to be ahead of the competition from a platform-agnostic perspective.
Presentation
So far, we have discussed mostly infrastructure components. Although the data warehouse and transformation components are essential for data scientists, engineers, and analysts, most end users will never touch them directly; the majority only interact with the presentation layer, i.e., the dashboard.

It is fair to say that the presentation component covers a wide range. Who says a tool that incorporates elements of transformation cannot also be used for presentation? Databricks, for example, has successfully used this strategy and appears to be on the verge of becoming the next big tech IPO.
From a historical perspective, visualization tools like Tableau, Sisense, Qlik, Power BI, and Looker have dominated both the transformation and presentation categories, demonstrating how managing transformations and creating beautiful visualizations aren't mutually exclusive.
With the data stack continuing to evolve, we believe those focusing on visualization capabilities will be more likely to succeed than those only interested in transformation.
It becomes increasingly challenging to manage transformations at the presentation level as more data sources are integrated and data volumes grow exponentially. As a result, organizations that transform data at the presentation layer risk producing ill-defined metrics and inaccurate analysis.
Transportation
For this approach to be considered uniquely modern, it must include the transportation component.

In the past, it was acceptable for end users to consume data through external analytics tools and dashboards, but today, if data professionals cannot feed their insights back into systems of record, their efforts may be wasted.
Often referred to as embedded analytics, the idea of data transportation bridges the gap between data tools and systems of record (such as CRM, marketing automation, and customer success platforms). Managed services have yet to effectively solve this problem, and even those that have emerged are still maturing.
Unless you have copious developer resources and experience in automating information exchange, companies like Syncari, Census, and Hightouch seem to be your only options on most projects.
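For a sense of what the DIY route involves, here is a hedged sketch that reads a modeled metric from the warehouse and pushes it into a CRM over its REST API. The CRM endpoint, authentication scheme, and field names are entirely made up; a real system of record will have its own API, rate limits, and batching requirements, which is precisely why the managed services above exist.

```python
# A hypothetical reverse-ETL push: warehouse metric -> CRM custom field.
import pandas as pd
import requests
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@host/warehouse")  # hypothetical DSN

# A modeled insight produced by the transformation layer.
scores = pd.read_sql(
    "SELECT account_id, health_score FROM analytics.account_health", engine
)

for row in scores.itertuples(index=False):
    # Write each score back to the CRM so sales sees it in their own tool.
    resp = requests.patch(
        f"https://crm.example.com/api/accounts/{row.account_id}",  # hypothetical API
        json={"custom_fields": {"health_score": row.health_score}},
        headers={"Authorization": "Bearer <token>"},
        timeout=10,
    )
    resp.raise_for_status()
```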
Wrapping Up
We are currently in the middle of a data landscape shift. Companies are materializing overnight to solve problems related to the observability and security of data platforms. In light of this, the key takeaways here are flexibility and agnosticism.

The entire data stack may eventually be distilled into a single platform, but it could take a decade for any one vendor to get there. In the meantime, put this framework into practice, understanding that new ideas will arrive daily and your thinking will need to evolve.