Larger Than Memory Data Workflows with Apache Arrow and Ibis
This the website for the workshop “Larger Than Memory Data Workflows with Apache Arrow and Ibis” taught at the J on The Beach conference on May 10th, 2023.
Abstract
As datasets become larger and more complex, the boundaries between data engineering and data science are becoming blurred. Data analysis pipelines with larger-than-memory data are becoming commonplace, creating a gap that needs to be bridged: between engineering tools designed to work with very large datasets on the one hand, and data science tools that provide the analysis capabilities used in data workflows on the other. One way to build this bridge is with Apache Arrow, a multi-language toolbox for working with larger-than-memory tabular data. Arrow is designed to improve performance and efficiency, and places emphasis on standardization and interoperability among workflow components, programming languages, and systems. We’ll combine the power of Arrow for compressing data into its most compact form with Ibis for data analysis in Python. Ibis is a pandas-like, Python interface that allows you to do robust analytics work on data of any size. It does this by letting you choose from a range of powerful database engines that can process your data queries. In this workshop we’ll be working with a very large dataset and drawing insights from it all from our local machine.
Introduction
Ice-Breaking Activity
Type in the chat:
- what you know about Apache Arrow in one sentence
- what you know about Ibis in one sentence
What You’ll Learn
By the end of this tutorial you will learn:
- How to access and convert large CSV files into Parquet using Apache Arrow
- How changing CSV files into parquet compresses them efficiently letting you work with them on your local machine
- How to analyse larger than memory parquet files using Ibis
- How to drawing interesting insights from the PUMS census dataset efficiently
- How to switch between and local and remote database using Ibis
Demonstration
Demonstration of what the training is working towards. The instructor will go through some slides explaining the dataset, Arrow and Ibis
Specifics:
- Introduction to PUMS
- What to consider when working with large datasets
- Introduction to Apache Arrow
- Introduction to Ibis
Overview of the Training
- Importing data in the Apache Arrow ecosystems, what are the different file formats, how do they differ, which ones to use?
- Learning how to use Ibis for data wrangling on Parquet files
- Use Ibis with remote data sources: Snowflake (demo to try at home)