pandas - Python Data Analysis Library

What's Next?

Source: datasframe | Author: Tom Augspurger | Published: Nov 11, 2020

Some personal news: Last Friday was my last day at Anaconda. Next week, I'm joining Microsoft's AI for Earth team. This is a very bittersweet transition. While I loved working at Anaconda and all the great people there, I'm extremely excited about what I'll be working on at Microsoft. Reflections …

Maintaining Performance

Source: datasframe | Author: Tom Augspurger | Published: Apr 01, 2020

As pandas' documentation claims: pandas provides high-performance data structures. But how do we verify that the claim is correct? And how do we ensure that it stays correct over many releases? This post describes pandas' current setup for monitoring performance, and my personal debugging strategy for understanding and fixing performance regressions …

pandas 1.0

Source: pandas blog | Author: pandas team | Published: Jan 29, 2020

Today pandas celebrates its 1.0.0 release. In many ways this is just a normal release with a host of new features, performance improvements, and bug fixes, which are documented in

Towards consistent missing value handling in Pandas

Source: Joris Van den Bossche - pandas | Author: Joris Van den Bossche | Published: Nov 30, 2019

This blogpost gives some background and motivation for my proposal on better missing value support in pandas, and the changes that have been merged in the development version (to be released in pandas 1.0): a new pd.NA scalar is introduced that can be used consistently across all data types.
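As a rough illustration of what "used consistently across all data types" can look like, here is a minimal sketch. It assumes pandas 1.0 or later and uses the nullable "Int64" and "boolean" extension dtypes; the variable names are made up for this example and are not taken from the post.

```python
import pandas as pd

# Nullable integer and boolean arrays (assumed dtypes for this sketch).
ints = pd.array([1, None, 3], dtype="Int64")
bools = pd.array([True, None], dtype="boolean")

# Both arrays use the very same missing-value marker.
same_marker = ints[1] is pd.NA and bools[1] is pd.NA
print(same_marker)

# Comparisons with pd.NA propagate the missing value ...
comparison = pd.NA == 1
print(comparison)

# ... while logical operations follow three-valued logic:
# the result is known whenever the other operand decides it.
known_or = pd.NA | True
known_and = pd.NA & False
print(known_or, known_and)
```

The point of the sketch is only that one scalar stands in for "missing" regardless of the array's dtype, which is the behaviour the excerpt describes.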

2019 NumFOCUS Awards and New Contributor Recognition

Source: pandas | NumFOCUS | Author: Admin | Published: Nov 15, 2019

The post 2019 NumFOCUS Awards and New Contributor Recognition appeared first on NumFOCUS.

Chan Zuckerberg Initiative Funds Maintenance of NumFOCUS Projects

Source: pandas | NumFOCUS | Author: Admin | Published: Nov 14, 2019

The post Chan Zuckerberg Initiative Funds Maintenance of NumFOCUS Projects appeared first on NumFOCUS.

Highlights From The 2019 Pandas Hack

Source: pandas | NumFOCUS | Author: nf-admin | Published: Sep 13, 2019

The post Highlights From The 2019 Pandas Hack appeared first on NumFOCUS.

2019 pandas user survey

Source: pandas blog | Author: pandas team | Published: Aug 22, 2019

Pandas recently conducted a user survey to help guide future development. Thanks to everyone who participated! This post presents the high-level results. This analysis and the raw data can be found on

GeoPandas now uses the pandas ExtensionArray interface

Source: Joris Van den Bossche - pandas | Author: Joris Van den Bossche | Published: Aug 13, 2019

Short summary: the upcoming 0.6.0 release of GeoPandas will feature a refactor based on the pandas ExtensionArray interface. Although this change should keep the user interface mostly stable, it enables more robust integration with pandas and allows for more upcoming changes in the future. And given the invasive code changes under the hood, testing is very welcome!

pandas + binder

Source: datasframe | Author: Tom Augspurger | Published: Jul 21, 2019

This post describes the start of a journey to get pandas' documentation running on Binder. The end result is this nice button: For a while now I've been jealous of Dask's examples repository. That's a repository containing a collection of Jupyter notebooks demonstrating Dask in action. It stitches together some …

pandas extension arrays

Source: pandas blog | Author: pandas team | Published: Jan 04, 2019

Extensibility was a major theme in pandas development over the last couple of releases. This post introduces the pandas extension array interface: the motivation behind it and how it might affect you

Inaugural NumFOCUS Awards and New Contributor Recognition

Source: pandas | NumFOCUS | Author: Admin | Published: Sep 27, 2018

The post Inaugural NumFOCUS Awards and New Contributor Recognition appeared first on NumFOCUS.

Tabular Data in Scikit-Learn and Dask-ML

Source: datasframe | Author: Tom Augspurger | Published: Sep 17, 2018

Scikit-Learn 0.20.0 will contain some nice new features for working with tabular data. This blogpost will introduce those improvements with a small demo. We'll then see how Dask-ML was able to piggyback on the work done by scikit-learn to offer a version that works well with Dask Arrays …

Distributed Auto-ML with TPOT with Dask

Source: datasframe | Author: Tom Augspurger | Published: Aug 30, 2018

This work is supported by Anaconda Inc. This post describes a recent improvement made to TPOT. TPOT is an automated machine learning library for Python. It does some feature engineering and hyper-parameter optimization for you. TPOT uses genetic algorithms to evaluate which models are performing well and how to choose …

Moral Philosophy for pandas or: What is .values?

Source: datasframe | Author: Tom Augspurger | Published: Aug 14, 2018

The other day, I put up a Twitter poll asking a simple question: What's the type of series.values? "Pop Quiz! What are the possible results for the following: >>> type(pandas.Series.values)" (Tom Augspurger, @TomAugspurger, August 6, 2018). I was a bit limited for space, so I'll expand on …

Modern Pandas (Part 8): Scaling

Source: datasframe | Author: Tom Augspurger | Published: Apr 23, 2018

This is part 8 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling As I sit down to write this, the third-most popular pandas question on StackOverflow covers how to use pandas for large datasets. This is in …

The Worldwide Pandas Documentation Sprint: A Closer Look

Source: pandas | NumFOCUS | Author: Admin | Published: Mar 27, 2018

The post The Worldwide Pandas Documentation Sprint: A Closer Look appeared first on NumFOCUS.

Activity on the pandas github repo during the March 10 documentation sprint

Source: Joris Van den Bossche - pandas | Author: Joris Van den Bossche | Published: Mar 13, 2018

Last weekend, Marc Garcia and many others organised a world-wide pandas documentation sprint (https://python-sprints.github.io/pandas/). The goal was to improve the pandas API documentation, and I have to say, it was a great success!

dask-ml 0.4.1 Released

Source: datasframe | Author: Tom Augspurger | Published: Feb 13, 2018

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation. dask-ml 0.4.1 was released today with a few enhancements. See the changelog for all the changes from 0.4.0. Conda packages are available on conda-forge $ conda install -c conda-forge dask-ml …

Extension Arrays for Pandas

Source: datasframe | Author: Tom Augspurger | Published: Feb 12, 2018

This is a status update on some enhancements for pandas. The goal of the work is to store things that are sufficiently array-like in a pandas DataFrame, even if they aren't a regular NumPy array. Pandas already does this in a few places for some blessed types (like Categorical); we'd …

Easy distributed training with Joblib and dask

Source: datasframe | Author: Tom Augspurger | Published: Feb 05, 2018

This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation. This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I'm thankful to them for hosting me …

dask-ml

Source: datasframe | Author: Tom Augspurger | Published: Oct 26, 2017

Today we released the first version of dask-ml, a library for parallel and distributed machine learning. Read the documentation or install it with pip install dask-ml Packages are currently building for conda-forge, and will be up later today. conda install -c conda-forge dask-ml The Goals dask is, to quote the …

Scalable Machine Learning (Part 2): Partial Fit

Source: datasframe | Author: Tom Augspurger | Published: Sep 15, 2017

This work is supported by Anaconda, Inc. and the Data Driven Discovery Initiative from the Moore Foundation. This is part two of my series on scalable machine learning. Small Fit, Big Predict Scikit-Learn Partial Fit You can download a notebook of this post here. Scikit-learn supports out-of-core learning (fitting a …

Scalable Machine Learning (Part 1)

Source: datasframe | Author: Tom Augspurger | Published: Sep 11, 2017

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation. Anaconda is interested in scaling the scientific python ecosystem. My current focus is on out-of-core, parallel, and distributed machine learning. This series of posts will introduce those concepts, explore what we have available …

Introducing Stitch

Source: datasframe | Author: Tom Augspurger | Published: Aug 30, 2016

Today I released stitch into the wild. If you haven't yet, check out the examples page to see an example of what stitch does, and the Github repo for how to install. I'm using this post to explain why I wrote stitch, and some issues it tries to solve. Why …

Modern Pandas (Part 7): Timeseries

Source: datasframe | Author: Tom Augspurger | Published: May 13, 2016

This is part 7 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Timeseries Pandas started out in the financial world, so naturally it has strong timeseries support. The first half of this post will look at pandas' …

Modern Pandas (Part 6): Visualization

Source: datasframe | Author: Tom Augspurger | Published: Apr 28, 2016

This is part 6 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Visualization and Exploratory Analysis A few weeks ago, the R community went through some hand-wringing about plotting packages. For outsiders (like me) the details aren't …

Modern Pandas (Part 5): Tidy Data

Source: datasframe | Author: Tom Augspurger | Published: Apr 22, 2016

This is part 5 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Reshaping & Tidy Data Structuring datasets to facilitate analysis (Wickham 2014) So, you've sat down to analyze a new dataset. What do you do first? In …

Modern Pandas (Part 3): Indexes

Source: datasframe | Author: Tom Augspurger | Published: Apr 11, 2016

This is part 3 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Indexes can be a difficult concept to grasp at first. I suspect this is partly because they're somewhat peculiar to pandas. These aren't like the …

Modern Pandas (Part 4): Performance

Source: datasframe | Author: Tom Augspurger | Published: Apr 08, 2016

This is part 4 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Wes McKinney, the creator of pandas, is kind of obsessed with performance. From micro-optimizations for element access, to embedding a fast hash table inside pandas …

Modern Pandas (Part 2): Method Chaining

Source: datasframe | Author: Tom Augspurger | Published: Apr 04, 2016

This is part 2 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Method Chaining Method chaining, where you call methods on an object one after another, is in vogue at the moment. It's always been a style …

Modern Pandas (Part 1)

Source: datasframe | Author: Tom Augspurger | Published: Mar 21, 2016

This is part 1 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Effective Pandas Introduction This series is about how to make effective use of pandas, a data analysis library for the Python programming language. It's targeted …

Why pandas users should be excited about Apache Arrow

Source: Wes McKinney - pandas | Author: Wes McKinney | Published: Feb 22, 2016

I'm super excited to be involved in the new open source Apache Arrow community initiative. For Python (and R, too!), it will help enable: substantially improved data access speeds; closer to native performance Python extensions for big data systems like Apache Spark; and new in-memory analytics functionality for nested / JSON-like data. There's plenty of places you can learn more about Arrow, but this post is about how it's specifically relevant to pandas users. See, for example: "Python and Hadoop: A State of the Union"; "Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard"; "Introducing Apache Arrow: Columnar In-Memory Analytics".

NumFOCUS Announces New Fiscally Sponsored Project: pandas

Source: pandas | NumFOCUS | Author: nf-admin | Published: Oct 09, 2015

by Gina Helfrich NumFOCUS is pleased to announce pandas as our newest fiscally sponsored project. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas enables users to carry out their entire data analysis workflow in Python without having to switch to a more domain-specific language like […] The post NumFOCUS Announces New Fiscally Sponsored Project: pandas appeared first on NumFOCUS.

dplyr and pandas

Source: datasframe | Author: Tom Augspurger | Published: Oct 16, 2014

This notebook compares pandas and dplyr. The comparison is just on syntax (verbiage), not performance. Whether you're an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition. We'll work through the introductory dplyr vignette to analyze some flight …

Practical Pandas Part 3 - Exploratory Data Analysis

Source: datasframe | Author: Tom Augspurger | Published: Sep 16, 2014

Welcome back. As a reminder: in part 1 we got a dataset with my cycling data from last year merged and stored in an HDF5 store; in part 2 we did some cleaning and augmented the cycling data with data from http://forecast.io. You can find the full source code …

Practical Pandas Part 2 - More Tidying, More Data, and Merging

Source: datasframe | Author: Tom Augspurger | Published: Sep 04, 2014

This is Part 2 in the Practical Pandas Series, where I work through a data analysis problem from start to finish. It's a misconception that we can cleanly separate the data analysis pipeline into a linear sequence of steps from data acquisition, data tidying, exploratory analysis, model building, production. As …

Practical Pandas Part 1 - Reading the Data

Source: datasframe | Author: Tom Augspurger | Published: Aug 26, 2014

This is the first post in a series where I'll show how I use pandas on real-world datasets. For this post, we'll look at data I collected with Cyclemeter on my daily bike ride to and from school last year. I had to manually start and stop the tracking at …

Using Python to tackle the CPS (Part 4)

Source: datasframe | Author: Tom Augspurger | Published: May 19, 2014

Last time, we got to where we'd like to have started: One file per month, with each month laid out the same. As a reminder, the CPS interviews households 8 times over the course of 16 months. They're interviewed for 4 months, take 8 months off, and are interviewed four …

Using Python to tackle the CPS (Part 3)

Source: datasframe | Author: Tom Augspurger | Published: May 19, 2014

In part 2 of this series, we set the stage to parse the data files themselves. As a reminder, we have a dictionary that looks like

         id  length  start  end
0    HRHHID      15      1   15
1   HRMONTH       2     16   17
2   HRYEAR4       4     18   21
3  HURESPLI       2     22   23
…
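To make the role of that dictionary concrete, here is a minimal sketch of how its start and end columns could be used to slice one fixed-width record. It assumes the positions are 1-indexed and inclusive (which matches the lengths shown); the parse_line helper and the sample line are hypothetical and made up for illustration, not taken from the post or from real CPS data.

```python
# Field layout copied from the excerpt: (id, length, start, end).
# Assumption: start/end are 1-indexed and inclusive.
FIELDS = [
    ("HRHHID", 15, 1, 15),
    ("HRMONTH", 2, 16, 17),
    ("HRYEAR4", 4, 18, 21),
    ("HURESPLI", 2, 22, 23),
]


def parse_line(line, fields=FIELDS):
    """Slice one fixed-width record into a dict of stripped strings."""
    record = {}
    for name, _length, start, end in fields:
        # Convert the 1-indexed inclusive range to a Python slice.
        record[name] = line[start - 1:end].strip()
    return record


# Made-up 23-character record, for illustration only.
sample = "000123456789012" + "01" + "2014" + " 1"
row = parse_line(sample)
print(row)
```

For whole files, pandas.read_fwf accepts a colspecs list of half-open (start, end) pairs, so the same dictionary could be converted with (start - 1, end) for each field.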

Tidy Data in Action

Source: datasframe | Author: Tom Augspurger | Published: Mar 27, 2014

Hadley Wickham wrote a famous paper (for a certain definition of famous) about the importance of tidy data when doing data analysis. I want to talk a bit about that, using an example from a StackOverflow post, with a solution using pandas. The principles of tidy data aren't language specific …

Organizing Papers

Source: datasframe | Author: Tom Augspurger | Published: Feb 13, 2014

As a graduate student, you read a lot of journal articles... a lot. With the material in the articles being as difficult as it is, I didn't want to worry about organizing everything as well. That's why I wrote this script to help (I may have also been procrastinating from …

Using Python to tackle the CPS (Part 2)

Source: datasframe | Author: Tom Augspurger | Published: Feb 04, 2014

Last time, we used Python to fetch some data from the Current Population Survey. Today, we'll work on parsing the files we just downloaded. We downloaded two types of files last time: CPS monthly tables: a fixed-width format text file with the actual data Data Dictionaries: a text file describing …

Using Python to tackle the CPS

Source: datasframe | Author: Tom Augspurger | Published: Jan 27, 2014

The Current Population Survey is an important source of data for economists. Its modern form took shape in the 70's and unfortunately the data format and distribution shows its age. Some centers like IPUMS have attempted to put a nicer face on accessing the data, but they haven't done everything …