Data Engineering Podcast

Tobias Macey
Latest episode

513 episodes

  • Data Engineering Podcast

    Holding Kafka Right: Product-Friendly Streaming with TypeStream

    2026-06-18 | 49 min.
    Summary
    In this episode Jevin Maltais talks about the practical realities of building reliable, product-focused streaming systems with Kafka. Jevin shares lessons from roles at Zapier, Humi, and Clio, where real-time synchronization, customer data unification, and document sync at scale highlighted both the strengths and common misuses of Kafka. He digs into using events as the source of truth, materialized views with KTables, and how schema registries and type safety prevent downstream breakage. Jevin explains why teams often reach for heavyweight Kafka clusters without leveraging Streams, Connect, or interactive queries, and how his project, TypeStream, aims to make those capabilities accessible via config-as-code while keeping a thin abstraction and clear escape hatches. He also explores trade-offs across Kafka-compatible alternatives, CDC with Debezium in the real world, and where abstractions should stop so teams can scale responsibility as complexity grows.

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    This episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. Go to dataengineeringpodcast.com/datadriven today to start practicing.
    Your host is Tobias Macey and today I'm interviewing Jevin Maltais about the challenges of building a reliable streaming

    Interview

    Introduction
    How did you get involved in the area of data management?
    Can you describe what TypeStream is and the story behind it?
    What are the common challenges that teams encounter when trying to build on top of Kafka?
    How do those challenges/misconfigurations impact the team's ability to deliver on product goals?
    What are the fundamental design aspects of Kafka that contribute to the difficulties that teams encounter when using it as an element of their architecture?
    There have been numerous projects taking aim at Kafka, with varying approaches and degrees of effectiveness (e.g. Redpanda, AutoMQ, Pulsar). What are the tradeoffs that each of those approaches requires?
    What makes the original Kafka project so resilient in the face of all of that competition?
    Can you describe the architecture of TypeStream and how each of the core elements contributes to a better user experience?
    For teams who want to take advantage of streaming capabilities, but don't want to invest in becoming Kafka experts, what does the TypeStream workflow look like?
    If they don't want to manage the operational overhead of a Kafka cluster, how tightly coupled is TypeStream to the original Kafka? (Can someone use Redpanda or AutoMQ instead?)
    What are the most interesting, innovative, or unexpected ways that you have seen TypeStream used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on TypeStream?
    When is TypeStream the wrong choice?
    What do you have planned for the future of TypeStream?

    Contact Info

    Website

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    TypeStream
    Zapier
    Airflow
    Kafka
    KTables
    KSQL
    Redpanda
    Pulsar
    AutoMQ
    Kafka Schema Registry
    Debezium
    Change Data Capture
    Kafka Connect
    Terraform
    Kafka Compacted Topic

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    Text to Data Products: Kaarvi’s End-to-End AI for Ingestion, Quality, and Dashboards

    2026-06-08 | 52 min.
    Summary
    In this episode Shravan Gunda, founder and CEO of Kaarvi AI, talks about building an AI-native, agent-driven data platform designed to eliminate the janitorial work that consumes most data teams. He explores Kaarvi’s multi-agent architecture that runs queries across seven LLMs in parallel for reliability, its synthetic data generator that mirrors source schemas for quick testing, and “Hey Kaarvi” chat for text-to-SQL, text-to-transformations, and text-to-dashboard workflows. He also digs into on-prem versus SaaS deployments, domain-specialized agents for privacy and accuracy, code blocks for custom Python/SQL, and the roadmap for a marketplace and desktop assistant. Shravan highlights how Kaarvi compresses weeks of work into hours and bridges the gap between business users and data engineers by turning AI into a dependable force multiplier.

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Your host is Tobias Macey and today I'm interviewing Shravan Gunda about building an agent-driven data platform at Kaarvi
    Interview
    Introduction
    How did you get involved in the area of data management?
    Can you describe what Kaarvi is and the story behind it?
    "AI" is a very broad term that encompasses numerous possible implementations. Can you give some more detail about the different types and applications of AI in Kaarvi's architecture?
    What are some of the core assumptions of data workflows that need to be reconsidered when AI is embedded in the execution path?
    What are the most interesting, innovative, or unexpected ways that you have seen Kaarvi used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Kaarvi?
    When is Kaarvi the wrong choice?
    What do you have planned for the future of Kaarvi?

    Contact Info

    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Kaarvi
    Synthetic Data
    n8n

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture

    2026-06-01 | 54 min.
    Summary
    In this episode Weimo Liu, co‑founder of PuppyGraph, talks about the engineering behind their “zero-copy” graph querying engine for lakehouse and database sources. He explores how PuppyGraph lets you run Cypher and Gremlin traversals and graph algorithms directly on data in Iceberg, Delta, Hudi, Hive, and even MongoDB—without loading into a separate graph store. Weimo explains their edge-sharded, vectorized, MPP architecture that tackles hub nodes, multi-hop traversals, and shuffle at scale, targeting sub-second to single-digit-second workloads. He digs into practical graph data modeling on top of normalized and denormalized tables, logical views, and flexible mappings; strategies for caching, adaptive reads, and leveraging Iceberg metadata; and how PuppyGraph’s operator-based engine unifies query and algorithms. He also covers real-world applications—from cybersecurity log analysis to entity resolution and agentic workflows—when to choose embedded or transactional graph databases instead, and what’s next for enterprise features and broader warehouse integrations.

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Your host is Tobias Macey and today I'm interviewing Weimo Liu about the engineering behind PuppyGraph's zero-copy ETL for querying your lakehouse as a graph
    Interview
    Introduction
    How did you get involved in the area of data management?
    Can you start by describing what PuppyGraph is and the story behind it?
    What are some of the key use cases that people are turning to PuppyGraph and graph data models for?
    Graph engines have struggled to take off for several years, not least because the topological nature of the data makes them difficult to scale to large data volumes. Can you describe the architecture of PuppyGraph and some of the ways that you are addressing that challenge of data volume for graphs?
    latency/data exploration
    types of traversals and limitations
    lakehouse architecture pros/cons for graphs
    data modeling/translation
    shortcomings of zero-ETL and how transforming the underlying representation could provide benefits
    For someone who is looking for a graph engine to support a connected data use case, what are the guiding questions that you would ask to lead them toward PuppyGraph vs. a dedicated graph database like Memgraph or Neo4j?
    What are the most interesting, innovative, or unexpected ways that you have seen PuppyGraph used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on PuppyGraph?
    When is PuppyGraph the wrong choice?
    What do you have planned for the future of PuppyGraph and graph data exploration on large data volumes?
    Contact Info
    LinkedIn
    Parting Question
    From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    PuppyGraph
    TigerGraph
    Google F1
    Graph Database
    Google Pregel
    Iceberg
    Graph Supernode
    MPP == Massively Parallel Processing
    Spark GraphX
    Trino
    Ladybug DB
    lance-graph
    KuzuDB
    Memgraph
    Labelled Property Graph
    RDF Triples
    Cypher Query Language
    Gremlin
    CDC == Change Data Capture
    Neo4j
    JanusGraph
    NetworkX
    PyTorch
    DuckDB
    Iceberg Array
    LanceDB
    Palo Alto Networks
    Columnar ADBC
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes

    2026-05-06 | 58 min.
    Summary
    In this episode Robert Nishihara, co-founder of Anyscale and co-creator of Ray, talks about maximizing hardware utilization for AI and data-intensive workloads. He explores Ray’s evolution alongside Kubernetes and PyTorch, and why consolidation at these layers has enabled a new generation of complex, heterogeneous workloads. Robert explains how data preparation has shifted to GPU- and inference-heavy, multimodal pipelines; where Ray fits compared to Spark and workflow orchestrators; and why Ray excels at composing heterogeneous pools of compute, handling failures, and scaling complex systems like multi-node LLM inference and reinforcement learning. He digs into practical strategies for boosting GPU utilization across training and inference, elasticity and prioritization of workloads, topology-aware scheduling, and the importance of fast failure recovery as hardware scales from nodes to racks. If you’re wrestling with expensive GPUs, multimodal data curation, or cross-node LLM inference, this conversation offers concrete mental models and architectural guidance.

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Your host is Tobias Macey and today I'm interviewing Robert Nishihara about the challenges of maximizing the utility of your available hardware for AI applications
    Interview
    Introduction
    How did you get involved in the area of data management?
    Can you start by giving an overview of the major contributors to wasted or idle compute?
    Why does it matter if the available compute isn't being maximized?
    What are some of the typical ad-hoc methods that teams might use to try to get the most out of their available hardware (especially GPUs)?
    What are the most interesting, innovative, or unexpected ways that you have seen Ray used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ray and distributed compute for data and AI?
    When is Ray the wrong choice?
    What do you have planned for the future of Ray?
    Contact Info
    LinkedIn
    Parting Question
    From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    Anyscale
    Ray
    Deep Learning
    Computer Vision
    Kubernetes
    Cursor
    Claude Code
    KubeRay
    PyTorch
    TensorFlow
    Theano
    Caffe
    vLLM
    SGLang
    Ray Tune
    Neural Network
    Learning Rates
    Reinforcement Learning
    AlphaGo
    Cursor Composer 2
    ImageNet
    Transformer Architecture
    Stochastic Gradient Descent
    Airflow
    Dagster
    Flyte
    Mixture of Experts
    Prefill
    Temporal
    Actor Framework
    RDMA == Remote Direct Memory Access
    Neoclouds
    AI Engineering Podcast Episode
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    The AI-First Data Engineer: 10–50x Productivity and What Changes Next

    2026-04-07 | 59 min.
    Summary
    In this episode, I sit down with Gleb Mezhanskiy, CEO and co-founder of Datafold, to explore how agentic AI is reshaping data engineering. We unpack the leap from chat-assisted coding to truly agentic workflows where AI not only writes SQL and dbt models but also executes queries, debugs, runs tests, and ships production-ready outcomes. Gleb explains why teams that master this AI-first loop can see 10–50x gains, how security/compliance concerns can be addressed with platform-native LLM endpoints, and why the role of data engineers is shifting from code authors to operators of autonomous agents. We dig into the consolidation of the modern data stack, the economics driving more data products (Jevons paradox), and why product thinking, domain knowledge, and cross-functional skills will define the next wave of standout data professionals. We also cover practical steps for leaders and ICs: modernizing off legacy platforms, establishing safe AI adoption paths, codifying reusable “skills” and context for agents, and building validation utilities that keep the inner loop fast and trustworthy. Finally, Gleb shares how Datafold moved to fully AI-driven software delivery and why “outcomes over tools” is the emerging model for complex initiatives like data platform migrations—and how this reframes data quality for the AI era, emphasizing broad data access plus rich context over brittle human-centric tests.
    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm bringing back Gleb Mezhanskiy to talk about our predictions for the impact of AI on data engineering for 2026

    Interview
    Introduction
    How did you get involved in the area of data management?
    What are the concrete steps that teams need to be taking today to take advantage of agentic AI capabilities?
    What are the new guardrails/constraints/workflows that need to be in place before you let AI loose on your data systems?
    How do you balance the potential cost savings and productivity increases with the up-front investment and variability in inference spend?

    Contact Info

    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Blog Post
    Datafold
    Claude Opus 4.5
    Harry Potter - Muggles
    Jevons Paradox
    Modern Data Stack
    Dagster Compass
    Gravity Orion
    MCP == Model Context Protocol
    Qwen

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
About Data Engineering Podcast
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.