There’s never been a better time to be a data engineer, in large part due to the rapid innovation and rising popularity of modern warehouses and data processing tools. When working with customer data, though, collecting all of your relevant information in one place and making it usable around the organization is a non-trivial challenge.
Customer Data Platforms (CDPs) have tried to solve for data collection and activation, but unfortunately most of them make the problem worse by creating additional data silos and integration gaps. …
This post looks at Mattermost’s customer data stack, which allows them to seamlessly leverage unlimited, real-time data across multiple sources to drive various analytics use-cases. We also look at how this data stack aligns with their open-source values and complies with their strict data privacy and security requirements.
Mattermost is an open-source messaging and collaboration platform that is a popular alternative to enterprise business communication tools like Slack. It is built for high-trust environments: deployments are fully self-hosted and bring all enterprise-wide communications together in one place.
As you’d expect from an open-source tool, they offer hundreds of third-party integrations and connect to popular DevOps and developer workflow tools. …
Congratulations to the team at Segment for their massive acquisition by Twilio. From being founded in 2012 to getting acquired in 2020 (for billions) is a huge success and a testament to their team’s outstanding execution.
When Segment initially released analytics.js, it was criticized for being only marginally better than a tag manager, but developers on Hacker News loved the idea. Hence, Segment’s success is also a testament to the power of the developer community on HN. Hats off to everyone who supported the project, especially in the early days.
Segment recently published a CDP report where they shared some data around top destinations. Interestingly, the SMS & push category was in a distant 11th place, with only 13% of businesses leveraging those types of connections. Even more interesting is that within the SMS and push category companies like Braze & Customer.IO …
Over the last 5 years, cloud SaaS tools have made the jobs of developers and data engineering teams much easier in many ways. One of the most profound improvements has been the ability for teams to ‘outsource’ the build and infrastructure management of core functionalities.
It’s a good time to be building software when Stripe manages payments infrastructure, Okta takes care of SSO, Algolia provides robust search, and so on.
When it comes to customer data, though, cloud SaaS tooling often tells a different story.
There’s no shortage of powerful software for creating audiences, running user analytics, and other use cases, all of which are valuable for downstream teams like marketing and product. The way most of the systems are built, though, creates an unintended consequence for data engineers: an additional data silo, which almost always means some sort of integration project or security discussion at some point, or, at the very least, a challenge around data disparity between platforms. …
In our previous post, we discussed why Apache Kafka wasn’t the right solution for RudderStack’s core streaming/queuing engine. Instead, we built our own streaming engine on top of PostgreSQL. This article discusses the internals of our implementation using the queuing system in more detail.
The core concept behind any queuing system is trivial. A CS101 implementation involves a linked list of items. A queuing system adds elements (or, in our case, events) to one end while consuming them from the other, as shown in the figure below. Once the system consumes an event, one can remove it from the list.
To be usable in practice, though, especially in a complex system like RudderStack, a queuing system must address several challenges. …
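The CS101 queue described above can be sketched in a few lines. This is only an illustration of the linked-list idea, not RudderStack's implementation; the names `EventQueue`, `enqueue`, and `dequeue` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class _Node:
    """One item in the linked list, holding an event and a link to the next node."""
    event: Any
    next: Optional["_Node"] = None


class EventQueue:
    """Singly linked FIFO queue: events are added at the tail and consumed from the head."""

    def __init__(self) -> None:
        self._head: Optional[_Node] = None
        self._tail: Optional[_Node] = None
        self._size = 0

    def enqueue(self, event: Any) -> None:
        """Add an event at the tail of the list."""
        node = _Node(event)
        if self._tail is None:
            # Empty queue: the new node is both head and tail.
            self._head = self._tail = node
        else:
            self._tail.next = node
            self._tail = node
        self._size += 1

    def dequeue(self) -> Any:
        """Consume the oldest event and remove it from the list."""
        if self._head is None:
            raise IndexError("dequeue from an empty queue")
        node = self._head
        self._head = node.next
        if self._head is None:
            # The queue is now empty, so the tail must be cleared too.
            self._tail = None
        self._size -= 1
        return node.event

    def __len__(self) -> int:
        return self._size
```

Everything here lives in memory, so a process crash loses the queue. Durability is one of the practical challenges a production system has to add on top of this basic structure.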
In this post, we break down the data stack built by 1mg that allows them to harness unlimited, real-time data securely. We will also look at the tools they use to activate this data for their downstream analytics and personalization use-cases.
1mg is an online platform that provides services for medical diagnostics, consultation, lab tests, and general healthcare. Every day, millions of users visit the 1mg website or use their apps to buy medicine, schedule time with doctors, or simply find helpful medical information.
In this post, we answer the all-important question: why did we not prefer Apache Kafka over PostgreSQL for building RudderStack? We discuss some of the challenges of using Apache Kafka compared with the PostgreSQL-based solution we implemented.
At its core, RudderStack is a queuing system. It gets events from multiple sources, persists them, and then sends them to different destinations. Persisting the events is crucial because RudderStack needs to be able to handle different kinds of failures.
Let’s consider an example: a destination could be down for any length of time. In such a scenario, RudderStack should retain the events and retry sending them once the destination is functional again. …
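The retain-and-retry behavior in this example can be sketched as follows. This is a minimal illustration under stated assumptions, not RudderStack's code: an in-memory `deque` stands in for the persistent store, and `deliver_pending` and `send` are hypothetical names.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def deliver_pending(pending: Deque[Dict], send: Callable[[Dict], bool]) -> List[Dict]:
    """Send events in order; stop at the first failure and retain the rest.

    `send` is a hypothetical callable that returns True when the destination
    accepted the event and False when the destination is unavailable.
    """
    delivered = []
    while pending:
        event = pending[0]      # peek without removing
        if not send(event):
            break               # destination is down: keep this event and all later ones
        pending.popleft()       # remove only after a confirmed send
        delivered.append(event)
    return delivered
```

Calling `deliver_pending` again later, for example on a timer, retries the retained events in their original order. An event leaves the queue only after a successful send, so a destination outage delays delivery rather than dropping data.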
In this article, we break down the ideal architecture for “the complete customer data stack” from the perspective of the data engineer. With new customer data software tools being launched every day and unclear definitions around terms like “customer data platform,” we make the argument that these individual tools are always part of a much more comprehensive customer data stack that should be managed by IT and engineering.
In business software, terms like “unified customer data” and “360º view of the customer” are hot marketing buzzwords. …
This blog presents an approach for routing data to RudderStack using Amazon Kinesis and AWS Lambda Functions.
Many organizations today make use of streaming event data from their applications and websites. To collect these data streams, they use tools like Amazon Kinesis. But how can these businesses turn the data streams into actionable insights? A popular approach is a process called activation. In this process, we transform the raw data and then route it to different applications and services for insights. …
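As a rough sketch of the transform-and-route step, the Lambda handler below decodes Kinesis records, which Lambda delivers base64-encoded under `Records[].kinesis.data`, and reshapes each one into a track-style event. The input field names `user_id` and `action`, the output shape, and the `send` callback are assumptions for illustration. In a real deployment `send` would make an authenticated HTTP request to your RudderStack data plane, which is not shown here.

```python
import base64
import json
from typing import Any, Callable, Dict, List


def decode_kinesis_records(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Decode the base64-encoded JSON payloads that Lambda receives from Kinesis."""
    decoded = []
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(raw))
    return decoded


def to_track_call(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Transform a raw payload into a track-style event (field names are assumed)."""
    return {
        "userId": payload["user_id"],
        "event": payload["action"],
        "properties": {
            k: v for k, v in payload.items() if k not in ("user_id", "action")
        },
    }


def handler(
    event: Dict[str, Any],
    context: Any = None,
    send: Callable[[Dict[str, Any]], Any] = print,
) -> int:
    """Lambda entry point: decode, transform, and hand each event to `send`.

    `send` defaults to print so the sketch runs locally; it is a placeholder
    for the call that forwards the event onward.
    """
    calls = [to_track_call(p) for p in decode_kinesis_records(event)]
    for call in calls:
        send(call)
    return len(calls)
```

Keeping the decode and transform steps as pure functions makes them easy to test locally without an AWS account.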
In our previous blog, The Tale of Identity Graph and Identity Resolution, we described the problem of identity resolution. We used a concrete example of a user visiting an eCommerce site from multiple websites. Specifically, we showed how the app events can be associated with multiple identities and how these identities can be tied together using the identify() call. We captured the association using the following identity graph:
The identity graph is stored in a SQL database in the identify table as below:
In this blog, we will show how we can associate a virtual ID with all these IDs. This association between (anonymous or user ID) and the virtual ID will let us tie all the events as originating from one end user. …
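One way to picture the association is as a connected-components problem: every `identify()` call links an anonymous ID to a user ID, and all IDs in the same connected component share one virtual ID. The sketch below uses a union-find structure in Python to show the idea. It is an illustrative assumption, not the approach the post itself takes, and both the function name `assign_virtual_ids` and the choice of the lexicographically smallest ID as the virtual ID are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def assign_virtual_ids(edges: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map every ID to a virtual ID shared by all IDs connected to it.

    `edges` are (anonymous_id, user_id) pairs, one per identify() call.
    The virtual ID is the lexicographically smallest ID in each component
    (an arbitrary but deterministic choice made for this sketch).
    """
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one so the root of a
            # component is always its smallest member.
            lo, hi = sorted((root_a, root_b))
            parent[hi] = lo

    return {node: find(node) for node in list(parent)}
```

For example, the pairs `("anon1", "user1")` and `("anon2", "user1")` place `anon1`, `anon2`, and `user1` in one component, so events carrying any of those IDs can be attributed to the same end user.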