Larger Than Memory Data Workflows with Apache Arrow and Ibis

This the website for the workshop “Larger Than Memory Data Workflows with Apache Arrow and Ibis” taught at the J on The Beach conference on May 10th, 2023.

Abstract

As datasets become larger and more complex, the boundaries between data engineering and data science are becoming blurred. Data analysis pipelines with larger-than-memory data are becoming commonplace, creating a gap that needs to be bridged: between engineering tools designed to work with very large datasets on the one hand, and data science tools that provide the analysis capabilities used in data workflows on the other. One way to build this bridge is with Apache Arrow, a multi-language toolbox for working with larger-than-memory tabular data. Arrow is designed to improve performance and efficiency, and places emphasis on standardization and interoperability among workflow components, programming languages, and systems. We’ll combine the power of Arrow for compressing data into its most compact form with Ibis for data analysis in Python. Ibis is a pandas-like, Python interface that allows you to do robust analytics work on data of any size. It does this by letting you choose from a range of powerful database engines that can process your data queries. In this workshop we’ll be working with a very large dataset and drawing insights from it all from our local machine.

Introduction

Ice-Breaking Activity

Type in the chat:

  • what you know about Apache Arrow in one sentence
  • what you know about Ibis in one sentence

What You’ll Learn

By the end of this tutorial you will learn:

  • How to access and convert large CSV files into Parquet using Apache Arrow
  • How changing CSV files into parquet compresses them efficiently letting you work with them on your local machine
  • How to analyse larger than memory parquet files using Ibis
  • How to drawing interesting insights from the PUMS census dataset efficiently
  • How to switch between and local and remote database using Ibis

Demonstration

Demonstration of what the training is working towards. The instructor will go through some slides explaining the dataset, Arrow and Ibis

Specifics:

  • Introduction to PUMS
  • What to consider when working with large datasets
  • Introduction to Apache Arrow
  • Introduction to Ibis

Overview of the Training

  • Importing data in the Apache Arrow ecosystems, what are the different file formats, how do they differ, which ones to use?
  • Learning how to use Ibis for data wrangling on Parquet files
  • Use Ibis with remote data sources: Snowflake (demo to try at home)

Diagram showing the components that will be covered during the workshop. We will use PyArrow to convert CSV files into Parquet. We will use Ibis to query the data in the Parquet format. We will also use Ibis to query the data stored in Snowflake.

Diagram showing the components that will be covered during the workshop (black arrows). Ibis uses different engines to interact with the data (in orange). Dashed, gray arrows show things that are possible but not covered in the workshop.