Data Engineering Podcast

Available Episodes

5 of 461

Simplifying Data Pipelines with Durable Execution
Summary
In this episode of the Data Engineering Podcast, Jeremy Edberg, CEO of DBOS, talks about durable execution and its impact on designing and implementing business logic for data systems. Jeremy explains how DBOS's serverless platform and orchestrator provide local resilience and reduce operational overhead, ensuring exactly-once execution in distributed systems through the use of the Transact library. He discusses the importance of version management in long-running workflows and how DBOS simplifies system design by reducing infrastructure needs like queues and CI pipelines, making it beneficial for data pipelines, AI workloads, and agentic AI.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Jeremy Edberg about durable execution and how it influences the design and implementation of business logic.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what DBOS is and the story behind it?
What is durable execution?
What are some of the notable ways that inclusion of durable execution in an application architecture changes the ways that the rest of the application is implemented? (e.g. error handling, logic flow, etc.)
Many data pipelines involve complex, multi-step workflows. How does DBOS simplify the creation and management of resilient data pipelines?
How does durable execution impact the operational complexity of data management systems?
One of the complexities in durable execution is managing code/data changes to workflows while existing executions are still processing. What are some of the useful patterns for addressing that challenge and how does DBOS help?
Can you describe how DBOS is architected?
How have the design and goals of the system changed since you first started working on it?
What are the characteristics of Postgres that make it suitable for the persistence mechanism of DBOS?
What are the guiding principles that you rely on to determine the boundaries between the open source and commercial elements of DBOS?
What are the most interesting, innovative, or unexpected ways that you have seen DBOS used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on DBOS?
When is DBOS the wrong choice?
What do you have planned for the future of DBOS?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
DBOS, Exactly Once Semantics, Temporal, Semaphore, Postgres, DBOS Transact (Python, TypeScript), Idempotency Keys, Agentic AI, State Machine, YugabyteDB (Podcast Episode), CockroachDB, Supabase, Neon (Podcast Episode), Airflow

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
--------
39:49
Overcoming Redis Limitations: The Dragonfly DB Approach
Summary
In this episode of the Data Engineering Podcast, Roman Gershman, CTO and founder of Dragonfly DB, explores the development and impact of high-speed in-memory databases. Roman shares his experience creating a more efficient alternative to Redis, focusing on performance gains, scalability, and cost efficiency, while addressing the limitations of Redis in high-throughput, low-latency scenarios. He explains how Dragonfly DB solves operational complexities for users and delves into its technical aspects, including maintaining compatibility with Redis while innovating on memory efficiency. Roman discusses the importance of cost efficiency and operational simplicity in driving adoption and shares insights on the broader ecosystem of in-memory data stores, future directions like SSD tiering and vector search capabilities, and the lessons learned from building a new database engine.

Announcements
Your host is Tobias Macey and today I'm interviewing Roman Gershman about building a high-speed in-memory database and the impact of the performance gains on data applications.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what DragonflyDB is and the story behind it?
What is the core problem/use case that is solved by making a "faster Redis"?
The other major player in the high performance key/value database space is Aerospike. What are the heuristics that an engineer should use to determine whether to use that vs. Dragonfly/Redis?
Common use cases for Redis involve application caches and queueing (e.g. Celery/RQ). What are some of the other applications that you have seen Redis/Dragonfly used for, particularly in data engineering use cases?
There is a piece of tribal wisdom that it takes 10 years for a database to iron out all of the kinks. At the same time, there have been substantial investments in commoditizing the underlying components of database engines. Can you describe how you approached the implementation of DragonflyDB to arrive at a functional and reliable implementation?
What are the architectural elements that contribute to the performance and scalability benefits of Dragonfly?
How have the design and goals of the system changed since you first started working on it?
For teams who migrate from Redis to Dragonfly, beyond the cost savings what are some of the ways that it changes the ways that they think about their overall system design?
What are the most interesting, innovative, or unexpected ways that you have seen Dragonfly used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on DragonflyDB?
When is DragonflyDB the wrong choice?
What do you have planned for the future of DragonflyDB?

Contact Info
GitHub, LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
DragonflyDB, Redis, ElastiCache, Valkey, Aerospike, Laravel, Sidekiq, Celery, Seastar Framework, Shared-Nothing Architecture, io_uring, midi-redis, Dunning-Kruger Effect, Rust
--------
43:58
Bringing AI Into The Inner Loop of Data Engineering With Ascend
Summary
In this episode of the Data Engineering Podcast, Sean Knapp, CEO of Ascend.io, explores the intersection of AI and data engineering. He discusses the evolution of data engineering and the role of AI in automating processes, alleviating burdens on data engineers, and enabling them to focus on complex tasks and innovation. The conversation covers the challenges and opportunities presented by AI, including the need for intelligent tooling and its potential to streamline data engineering processes. Sean and Tobias also delve into the impact of generative AI on data engineering, highlighting its ability to accelerate development, improve governance, and enhance productivity, while also noting the current limitations and future potential of AI in the field.

Announcements
Your host is Tobias Macey and today I'm interviewing Sean Knapp about how Ascend is incorporating AI into their platform to help you keep up with the rapid rate of change.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Ascend is and the story behind it?
The last time we spoke was August of 2022. What are the most notable or interesting evolutions in your platform since then?
In that same time "AI" has taken up all of the oxygen in the data ecosystem. How has that impacted the ways that you and your customers think about their priorities?
The introduction of AI as an API has caused many organizations to try and leap-frog their data maturity journey and jump straight to building with advanced capabilities. How is that impacting the pressures and priorities felt by data teams?
At the same time that AI-focused product goals are straining data teams' capacities, AI also has the potential to act as an accelerator to their work. What are the roadblocks/speedbumps that are in the way of that capability?
Many data teams are incorporating AI tools into parts of their workflow, but it can be clunky and cumbersome. How are you thinking about the fundamental changes in how your platform works with AI at its center?
Can you describe the technical architecture that you have evolved toward that allows for AI to drive the experience rather than being a bolt-on?
What are the concrete impacts that these new capabilities have on teams who are using Ascend?
What are the most interesting, innovative, or unexpected ways that you have seen Ascend + AI used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on incorporating AI into the core of Ascend?
When is Ascend the wrong choice?
What do you have planned for the future of AI in Ascend?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Ascend, Cursor AI Code Editor, Devin, GitHub Copilot, OpenAI DeepResearch, S3 Tables, AWS Glue, AWS Bedrock, Snowpark, Co-Intelligence: Living and Working with AI by Ethan Mollick (affiliate link), OpenAI o3
--------
52:47
Astronomer's Role in the Airflow Ecosystem: A Deep Dive with Pete DeJoy
Summary
In this episode of the Data Engineering Podcast, Pete DeJoy, co-founder and product lead at Astronomer, talks about building and managing Airflow pipelines on Astronomer and the upcoming improvements in Airflow 3. Pete shares his journey into data engineering, discusses Astronomer's contributions to the Airflow project, and highlights the critical role of Airflow in powering operational data products. He covers the evolution of Airflow, its position in the data ecosystem, and the challenges faced by data engineers, including infrastructure management and observability. The conversation also touches on the upcoming Airflow 3 release, which introduces data awareness, architectural improvements, and multi-language support, and Astronomer's observability suite, Astro Observe, which provides insights and proactive recommendations for Airflow users.

Announcements
Your host is Tobias Macey and today I'm interviewing Pete DeJoy about building and managing Airflow pipelines on Astronomer and the upcoming improvements in Airflow 3.

Interview
Introduction
Can you describe what Astronomer is and the story behind it?
How would you characterize the relationship between Airflow and Astronomer?
Astronomer just released your State of Airflow 2025 Report yesterday and it is the largest data engineering survey ever with over 5,000 respondents. Can you talk a bit about top-level findings in the report?
What about the overall growth of the Airflow project over time?
How have the focus and features of Astronomer changed since it was last featured on the show in 2017?
Astro Observe GA'd in early February. What does the addition of pipeline observability mean for your customers?
What are other capabilities similar in scope to observability that Astronomer is looking at adding to the platform?
Why is Airflow so critical in providing an elevated observability (or cataloging, or something similar) experience in a DataOps platform?
What are the notable evolutions in the Airflow project and ecosystem in that time?
What are the core improvements that are planned for Airflow 3.0?
What are the most interesting, innovative, or unexpected ways that you have seen Astro used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Airflow and Astro?
What do you have planned for the future of Astro/Astronomer/Airflow?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Astronomer, Airflow, Maxime Beauchemin, MongoDB, Databricks, Confluent, Spark, Kafka, Dagster (Podcast Episode), Prefect, Airflow 3, The Rise of the Data Engineer blog post, dbt, Jupyter Notebook, Zapier, cosmos library for dbt in Airflow, Ruff, Airflow Custom Operator, Snowflake
--------
51:41
Accelerated Computing in Modern Data Centers With Datapelago
Summary
In this episode of the Data Engineering Podcast, Rajan Goyal, CEO and co-founder of Datapelago, talks about improving efficiencies in data processing by reimagining system architecture. Rajan explains the shift from hyperconverged to disaggregated and composable infrastructure, highlighting the importance of accelerated computing in modern data centers. He discusses the evolution from proprietary to open, composable stacks, emphasizing the role of open table formats and the need for a universal data processing engine, and outlines Datapelago's strategy to leverage existing frameworks like Spark and Trino while providing accelerated computing benefits.

Announcements
Your host is Tobias Macey and today I'm interviewing Rajan Goyal about how to drastically improve efficiencies in data processing by re-imagining the system architecture.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the main factors that contribute to performance challenges in data lake environments?
The different components of open data processing systems have evolved from different starting points with different objectives. In your experience, how has that unplanned and unsynchronized evolution of the ecosystem hindered the capabilities and adoption of open technologies?
The introduction of a new cross-cutting capability (e.g. Iceberg) has typically taken a substantial amount of time to gain support across different engines and ecosystems. What do you see as the point of highest leverage to improve the capabilities of the entire stack with the least amount of coordination?
What was the motivating insight that led you to invest in the technology that powers Datapelago?
Can you describe the system design of Datapelago and how it integrates with existing data engines?
The growth in the generation and application of unstructured data is a notable shift in the work being done by data teams. What are the areas of overlap in the fundamental nature of data (whether structured, semi-structured, or unstructured) that you are able to exploit to bridge the processing gap?
What are the most interesting, innovative, or unexpected ways that you have seen Datapelago used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datapelago?
When is Datapelago the wrong choice?
What do you have planned for the future of Datapelago?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Datapelago, MIPS Architecture, ARM Architecture, AWS Nitro, Mellanox, Nvidia, Von Neumann Architecture, TPU (Tensor Processing Unit), FPGA (Field-Programmable Gate Array), Spark, Trino, Iceberg (Podcast Episode), Delta Lake (Podcast Episode), Hudi (Podcast Episode), Apache Gluten, Intermediate Representation, Turing Completeness, LLVM, Amdahl's Law, LSTM (Long Short-Term Memory)
--------
55:36

About Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Technology, Education