There’s never been a better time to be a data engineer, in large part due to the rapid innovation and rising popularity of modern warehouses and data processing tools. When working with customer data, though, collecting all of your relevant information in one place and making it usable around the organization is a non-trivial challenge.
Customer Data Platforms (CDPs) have tried to solve for data collection and activation, but unfortunately most of them make the problem worse by creating additional data silos and integration gaps. …
How concerned are you about companies collecting your personal data? This isn’t a new concern. In fact, it’s been almost eight years since Edward Snowden’s release of materials detailing the sharing of personal data by companies like Facebook, Google, and Apple with the NSA.
It’s common to see people take action on this concern. Hacker News regularly sees first-page articles about individuals who go through the effort of self-hosting digital products like email to avoid sharing their data.
Consumer-facing products like Gmail, Amazon, and others come to mind first when we think about personal data privacy. …
Open source is a component of almost all software development that takes place today. Looking back, its influence has been profound. For example, the main reason Python became the language best suited for machine learning is its open-source contributors. In fact, because of the enormous size of the open-source community that is tirelessly developing Python, Google open-sourced TensorFlow.
Joe Worrall, Director of Open Source and Developer Advocacy at New Relic, describes the dynamics behind the power of building contributor-centric systems:
“Contributors don’t give to the cause. They are a part of it.”
This post looks at Mattermost’s customer data stack, which allows them to seamlessly leverage unlimited, real-time data across multiple sources to drive various analytics use-cases. We also look at how this data stack aligns with their open-source values and complies with their strict data privacy and security requirements.
Mattermost is an open-source messaging and collaboration platform that is a popular alternative to enterprise business communication tools like Slack. It is built for high-trust environments: the deployment is fully self-hosted, and it brings all enterprise-wide communications together in one place.
As you’d expect from an open-source tool, they offer hundreds of…
Congratulations to the team at Segment on their massive acquisition by Twilio. Going from being founded in 2012 to being acquired in 2020 (for billions) is a huge success and a testament to their team’s outstanding execution.
When Segment initially released analytics.js, it was criticized for being only marginally better than a tag manager, but developers on Hacker News loved the idea. Hence, Segment’s success is also a testament to the power of the developer community on HN. Hats off to everyone who supported the project, especially in the early days.
Segment recently published a CDP report where they shared some data around…
Over the last 5 years, cloud SaaS tools have made the jobs of developers and data engineering teams much easier in many ways. One of the most profound improvements has been the ability for teams to ‘outsource’ the build and infrastructure management of core functionalities.
It’s a good time to be building software when Stripe manages payments infrastructure, Okta takes care of SSO, Algolia provides robust search, and so on.
When it comes to customer data, though, cloud SaaS tooling often tells a different story.
There’s no shortage of powerful software for creating audiences, running user analytics, and other use…
In our previous post, we discussed why Apache Kafka wasn’t the right solution for RudderStack’s core streaming/queuing engine. Instead, we built our own streaming engine on top of PostgreSQL. This article discusses the internals of that queuing system in more detail.
The core concept behind any queuing system is trivial. A CS101 implementation involves a linked list of items. A queuing system adds elements (or, in our case, events) to one end while consuming them from the other, as shown in the figure below. Once the system consumes an event, one can remove it from the list.
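The linked-list idea described above can be sketched in a few lines of Python. This is a minimal, in-memory illustration of the textbook version only, not RudderStack's PostgreSQL-backed implementation; the class and method names (`LinkedListQueue`, `enqueue`, `dequeue`) are hypothetical and chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class _Node:
    """One item in the list: an event plus a pointer to the next node."""
    event: Any
    next: Optional["_Node"] = None


class LinkedListQueue:
    """FIFO queue backed by a singly linked list (illustrative only)."""

    def __init__(self) -> None:
        self._head: Optional[_Node] = None  # events are consumed from here
        self._tail: Optional[_Node] = None  # events are added here
        self._size = 0

    def enqueue(self, event: Any) -> None:
        """Add an event at the tail of the list."""
        node = _Node(event)
        if self._tail is None:
            self._head = self._tail = node
        else:
            self._tail.next = node
            self._tail = node
        self._size += 1

    def dequeue(self) -> Any:
        """Consume the oldest event and remove it from the list."""
        if self._head is None:
            raise IndexError("dequeue from empty queue")
        node = self._head
        self._head = node.next
        if self._head is None:
            self._tail = None
        self._size -= 1
        return node.event

    def __len__(self) -> int:
        return self._size
```

Both operations touch only one end of the list, so each runs in constant time regardless of how many events are waiting.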
In this post, we break down the data stack built by 1mg that allows them to harness unlimited, real-time data securely. We will also look at the tools they use to activate this data for their downstream analytics and personalization use-cases.
1mg is an online platform that provides services for medical diagnostics, consultation, lab tests, and general healthcare. Every day, millions of users visit the 1mg website or use their apps to buy medicine, schedule time with doctors, or simply find helpful medical information.
In this post, we answer the all-important question: “Why we did not prefer Apache Kafka over PostgreSQL for building RudderStack.” We discuss some of the challenges of using Apache Kafka compared with the PostgreSQL-based solution we implemented.
At its core, RudderStack is a queuing system. It gets events from multiple sources, persists them, and then sends them to different destinations. Persisting the events is crucial because RudderStack needs to be able to handle different kinds of failures.
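As a rough sketch of why persistence matters, the snippet below stores each event before attempting delivery and marks it delivered only after the send succeeds. Everything here is an assumption made for illustration: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and the table layout and function names (`open_store`, `persist`, `deliver_pending`) are hypothetical, not RudderStack's actual schema or API.

```python
import sqlite3
from typing import Callable


def open_store() -> sqlite3.Connection:
    """Create an in-memory event store (stand-in for a PostgreSQL table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " delivered INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def persist(conn: sqlite3.Connection, payload: str) -> None:
    """Store an incoming event before any delivery is attempted."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO events (payload) VALUES (?)", (payload,))


def deliver_pending(conn: sqlite3.Connection, send: Callable[[str], None]) -> int:
    """Try to send undelivered events in order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE delivered = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for event_id, payload in rows:
        try:
            send(payload)
        except Exception:
            # Destination failed: stop here. The remaining events stay
            # in the store and can be retried on a later call.
            break
        with conn:
            conn.execute(
                "UPDATE events SET delivered = 1 WHERE id = ?", (event_id,)
            )
        sent += 1
    return sent
```

Because events are written to the store first, a failed send loses nothing: calling `deliver_pending` again later picks up where the previous attempt stopped.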
Let’s consider an example: a destination could be down for any length of time. In such…
In this article, we break down the ideal architecture for “the complete customer data stack” from the perspective of the data engineer. With new customer data software tools being launched every day and unclear definitions around terms like “customer data platform,” we make the argument that these individual tools are always part of a much more comprehensive customer data stack that should be managed by IT and engineering.
In business software, terms like “unified customer data” and “360° view of the customer” are hot marketing buzzwords. …