There’s never been a better time to be a data engineer, in large part due to the rapid innovation and rising popularity of modern warehouses and data processing tools. When working with customer data, though, collecting all of your relevant information in one place and making it usable around the organization is a non-trivial challenge.
Customer Data Platforms (CDPs) have tried to solve for data collection and activation, but unfortunately most of them make the problem worse by creating additional data silos and integration gaps. …
How concerned are you about companies collecting your personal data? This isn’t a new concern. In fact, it’s been almost eight years since Edward Snowden’s release of materials detailing the sharing of personal data by companies like Facebook, Google, and Apple with the NSA.
It’s common to see people take action on this concern. Hacker News regularly sees first-page articles about individuals who go through the effort of self-hosting digital products like email to avoid sharing their data.
Consumer-facing products like Gmail, Amazon, and others come to mind first when we think about personal data privacy. …
Open source is a component of almost all software development that takes place today. If you look back, the influence has been potent. For example, open-source contributors are the main reason Python became the language best suited for machine learning. In fact, Google open-sourced TensorFlow because of the enormous open-source community that tirelessly develops Python.
Joe Worrall, Director of Open Source and Developer Advocacy at New Relic, describes the dynamics behind the power of building contributor-centric systems:
“Contributors don’t give to the cause. They are a part of it.”
RudderStack is an open-source customer data pipeline tool for developers. Being open source is a tag we wear with pride, so much so that we recently partnered with GitHub for GitHub Sponsors for Companies. We value the developer community that works hard to build and support open source projects. In a recent blog post, we explained why RudderStack directly compensates developers for their contributions to our project. This post discusses why and how we open-sourced our content and took the next step in our open source journey. …
This post looks at Mattermost’s customer data stack, which allows them to seamlessly leverage unlimited, real-time data across multiple sources to drive various analytics use-cases. We also look at how this data stack aligns with their open-source values and complies with their strict data privacy and security requirements.
Mattermost is an open-source messaging and collaboration platform that is a popular alternative to enterprise business communication tools like Slack. It is built for high-trust environments: deployments are fully self-hosted and bring all enterprise-wide communications into one place.
As you’d expect from an open-source tool, they offer hundreds of third-party integrations and connect to popular DevOps and developer workflow tools. …
Congratulations to the team at Segment for their massive acquisition by Twilio. From being founded in 2012 to getting acquired in 2020 (for billions) is a huge success and a testament to their team’s outstanding execution.
When Segment initially released analytics.js, it was criticized for being only marginally better than a tag manager, but developers on Hacker News loved the idea. Hence, Segment’s success is also a testament to the power of the developer community on HN. Hats off to everyone who supported the project, especially in the early days.
Segment recently published a CDP report where they shared some data around top destinations. Interestingly, the SMS & push category was in a distant 11th place, with only 13% of businesses leveraging those types of connections. Even more interesting is that within the SMS and push category companies like Braze & Customer.IO …
Over the last 5 years, cloud SaaS tools have made the jobs of developers and data engineering teams much easier in many ways. One of the most profound improvements has been the ability for teams to ‘outsource’ the build and infrastructure management of core functionalities.
It’s a good time to be building software when Stripe manages payments infrastructure, Okta takes care of SSO, Algolia provides robust search, and so on.
When it comes to customer data, though, cloud SaaS tooling often tells a different story.
There’s no shortage of powerful software for creating audiences, running user analytics, and other use cases, all of which are valuable for downstream teams like marketing and product. The way most of the systems are built, though, creates an unintended consequence for data engineers: an additional data silo, which almost always means some sort of integration project or security discussion at some point, or, at the very least, a challenge around data disparity between platforms. …
In our previous post, we discussed why Apache Kafka wasn’t the right solution for RudderStack’s core streaming/queuing engine. Instead, we built our own streaming engine on top of PostgreSQL. This article discusses the internals of our implementation using the queuing system in more detail.
The core concept behind any queuing system is trivial. A CS101 implementation involves a linked list of items. A queuing system adds elements (or, in our case, events) to one end while consuming them from the other, as shown in the figure below. Once the system consumes an event, one can remove it from the list.
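The linked-list idea described above can be sketched in a few lines of Python. This is only an illustrative, in-memory version of the CS101 concept; the names `LinkedListQueue`, `enqueue`, and `dequeue` are hypothetical and are not taken from RudderStack's PostgreSQL-based implementation.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class _Node:
    """One item in the list: the event plus a pointer to the next node."""
    event: Any
    next: Optional["_Node"] = None


class LinkedListQueue:
    """Minimal FIFO queue: events are added at the tail, consumed at the head."""

    def __init__(self) -> None:
        self._head: Optional[_Node] = None
        self._tail: Optional[_Node] = None
        self._size = 0

    def enqueue(self, event: Any) -> None:
        """Append an event to the tail of the list."""
        node = _Node(event)
        if self._tail is None:
            # Empty queue: the new node is both head and tail.
            self._head = self._tail = node
        else:
            self._tail.next = node
            self._tail = node
        self._size += 1

    def dequeue(self) -> Any:
        """Consume the event at the head and remove it from the list."""
        if self._head is None:
            raise IndexError("dequeue from empty queue")
        node = self._head
        self._head = node.next
        if self._head is None:
            # The queue became empty, so the tail must be cleared too.
            self._tail = None
        self._size -= 1
        return node.event

    def __len__(self) -> int:
        return self._size
```

Both operations touch only one end of the list, so each runs in constant time regardless of how many events are waiting.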
In this post, we break down the data stack built by 1mg that allows them to harness unlimited, real-time data securely. We will also look at the tools they use to activate this data for their downstream analytics and personalization use-cases.
1mg is an online platform that provides services for medical diagnostics, consultation, lab tests, and general healthcare. Every day, millions of users visit the 1mg website or use their apps to buy medicine, schedule time with doctors, or simply find helpful medical information.
In this post, we answer the all-important question — “Why we did not prefer Apache Kafka over PostgreSQL for building RudderStack”. We discuss some of the challenges with using Apache Kafka over our implemented solution that uses PostgreSQL.
At its core, RudderStack is a queuing system. It gets events from multiple sources, persists them, and then sends them to different destinations. Persisting the events is crucial because RudderStack needs to be able to handle different kinds of failures.
Consider an example: a destination could be down for any length of time. In such a scenario, RudderStack should retain the events and retry sending them once that destination is functional again. …
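The retain-and-retry behavior described above might look roughly like the following sketch. It is an assumption-heavy illustration, not RudderStack's implementation: the class `RetryingForwarder`, its methods, and the use of `ConnectionError` to signal a destination outage are all hypothetical, and an in-memory `deque` stands in for the durable store a real system would need.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Event = Dict[str, str]


class RetryingForwarder:
    """Holds events until the destination accepts them, preserving order."""

    def __init__(self, send: Callable[[Event], None]) -> None:
        # `send` is a hypothetical delivery function that raises
        # ConnectionError while the destination is unavailable.
        self._send = send
        # In-memory stand-in for persistent storage (assumption).
        self._pending: Deque[Event] = deque()

    def accept(self, event: Event) -> None:
        """Store the event before any delivery attempt is made."""
        self._pending.append(event)

    def flush(self) -> int:
        """Try to deliver pending events in order; return how many succeeded."""
        delivered = 0
        while self._pending:
            event = self._pending[0]
            try:
                self._send(event)
            except ConnectionError:
                # Destination is down: keep this event and all later ones
                # so that a later flush() can retry them.
                break
            # Remove the event only after a successful send.
            self._pending.popleft()
            delivered += 1
        return delivered

    def pending(self) -> List[Event]:
        """Return a copy of the events still waiting for delivery."""
        return list(self._pending)
```

Calling `flush()` again after the destination recovers delivers the retained events in their original order. Removing an event only after a successful send is what keeps a failure from losing data, at the cost of possible duplicate deliveries if a send succeeds but the acknowledgement is lost.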
In this article, we break down the ideal architecture for “the complete customer data stack” from the perspective of the data engineer. With new customer data software tools being launched every day and unclear definitions around terms like “customer data platform,” we make the argument that these individual tools are always part of a much more comprehensive customer data stack that should be managed by IT and engineering.
In business software, terms like “unified customer data” and “360º view of the customer” are hot marketing buzzwords. …