What's Next?
Source: datasframe | Author: Tom Augspurger | Published: Nov 11, 2020
Some personal news: Last Friday was my last day at Anaconda.
Next week, I'm joining Microsoft's AI for Earth team. This is a very bittersweet transition. While I loved working at Anaconda and all the great people there, I'm extremely excited about what I'll be working on at Microsoft.
Reflections …
Read more
Maintaining Performance
Source: datasframe | Author: Tom Augspurger | Published: Apr 01, 2020
As pandas' documentation claims: pandas
provides high-performance data structures. But how do we verify that the claim
is correct? And how do we ensure that it stays correct over many releases?
This post describes
pandas' current setup for monitoring performance
My personal debugging strategy for understanding and fixing performance
regressions …
Read more
pandas 1.0
Source: pandas blog | Author: pandas team | Published: Jan 29, 2020
Today pandas celebrates its 1.0.0 release. In many ways this is just a normal release with a host of new features, performance improvements, and bug fixes, which are documented in
Read more
Towards consistent missing value handling in Pandas
Source: Joris Van den Bossche - pandas | Author: Joris Van den Bossche | Published: Nov 30, 2019
This blogpost gives some background and motivation for my proposal on better
missing value support in pandas, and the changes that have been merged in the
development version (to be released in pandas 1.0): a new pd.NA scalar is
introduced that can be used consistently across all data types.
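A minimal sketch of what the pd.NA scalar looks like in practice, assuming pandas 1.0 or later. The example arrays are made up for illustration and are not taken from the post.

```python
import pandas as pd

# Nullable extension dtypes ("Int64" with a capital I, "boolean")
# represent a missing entry with the same pd.NA scalar.
ints = pd.array([1, None, 3], dtype="Int64")
bools = pd.array([True, None, False], dtype="boolean")

missing_is_na = [arr[1] is pd.NA for arr in (ints, bools)]
print(missing_is_na)

# Comparisons with pd.NA propagate the missing value instead of returning False.
comparison = pd.NA == 1
print(comparison is pd.NA)

# Logical operations follow three-valued (Kleene) logic: the result is only
# missing when it actually depends on the unknown value.
kleene_or = pd.NA | True
kleene_and = pd.NA & False
print(kleene_or, kleene_and)
```

The point of the sketch is that the same scalar shows up regardless of dtype, which is the consistency the excerpt refers to.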
Read more
2019 NumFOCUS Awards and New Contributor Recognition
Source: pandas | NumFOCUS | Author: Admin | Published: Nov 15, 2019
The post 2019 NumFOCUS Awards and New Contributor Recognition appeared first on NumFOCUS.
Read more
Chan Zuckerberg Initiative Funds Maintenance of NumFOCUS Projects
Source: pandas | NumFOCUS | Author: Admin | Published: Nov 14, 2019
The post Chan Zuckerberg Initiative Funds Maintenance of NumFOCUS Projects appeared first on NumFOCUS.
Read more
Highlights From The 2019 Pandas Hack
Source: pandas | NumFOCUS | Author: nf-admin | Published: Sep 13, 2019
The post Highlights From The 2019 Pandas Hack appeared first on NumFOCUS.
Read more
2019 pandas user survey
Source: pandas blog | Author: pandas team | Published: Aug 22, 2019
Pandas recently conducted a user survey to help guide future development.
Thanks to everyone who participated! This post presents the high-level results.
This analysis and the raw data can be found on
Read more
GeoPandas now uses the pandas ExtensionArray interface
Source: Joris Van den Bossche - pandas | Author: Joris Van den Bossche | Published: Aug 13, 2019
Short summary: the upcoming 0.6.0 release of GeoPandas will feature a refactor based on the pandas ExtensionArray interface. Although this change should keep the user interface mostly stable, it enables more robust integration with pandas and allows for more upcoming changes in the future. And given the invasive code changes under the hood, testing is very welcome!
Read more
pandas + binder
Source: datasframe | Author: Tom Augspurger | Published: Jul 21, 2019
This post describes the start of a journey to get pandas' documentation running
on Binder. The end result is this nice button:
For a while now I've been jealous of Dask's examples
repository. That's a repository containing a
collection of Jupyter notebooks demonstrating Dask in action. It stitches
together some …
Read more
pandas extension arrays
Source: pandas blog | Author: pandas team | Published: Jan 04, 2019
Extensibility was a major theme in pandas development over the last couple of
releases. This post introduces the pandas extension array interface: the
motivation behind it and how it might affect you
Read more
Inaugural NumFOCUS Awards and New Contributor Recognition
Source: pandas | NumFOCUS | Author: Admin | Published: Sep 27, 2018
The post Inaugural NumFOCUS Awards and New Contributor Recognition appeared first on NumFOCUS.
Read more
Tabular Data in Scikit-Learn and Dask-ML
Source: datasframe | Author: Tom Augspurger | Published: Sep 17, 2018
Scikit-Learn 0.20.0 will contain some nice new features for working with tabular data.
This blogpost will introduce those improvements with a small demo.
We'll then see how Dask-ML was able to piggyback on the work done by scikit-learn to offer a version that works well with Dask Arrays …
Read more
Distributed Auto-ML with TPOT with Dask
Source: datasframe | Author: Tom Augspurger | Published: Aug 30, 2018
This work is supported by Anaconda Inc.
This post describes a recent improvement made to TPOT. TPOT is an
automated machine learning library for Python. It does some feature
engineering and hyper-parameter optimization for you. TPOT uses genetic
algorithms to evaluate which models are performing well and how to choose …
Read more
Moral Philosophy for pandas or: What is .values?
Source: datasframe | Author: Tom Augspurger | Published: Aug 14, 2018
The other day, I put up a Twitter poll asking a simple question: What's the type of series.values?
Pop Quiz! What are the possible results for the following: >>> type(pandas.Series.values) (Tom Augspurger, @TomAugspurger, August 6, 2018)
I was a bit limited for space, so I'll expand on …
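To make the question concrete, here is a small sketch that inspects the type of .values for a few Series. The sample data is invented, and the results reflect recent pandas releases; for some dtypes the answer has differed between versions, so treat this as illustrative rather than as the post's own answer.

```python
import numpy as np
import pandas as pd

# A few Series with different dtypes (sample data is made up).
series_by_kind = {
    "int64": pd.Series([1, 2, 3]),
    "category": pd.Series(["a", "b", "a"], dtype="category"),
    "tz-aware datetime": pd.Series(
        pd.date_range("2018-08-06", periods=2, tz="UTC")
    ),
}

# Record the class name of .values for each kind of Series.
values_types = {
    kind: type(s.values).__name__ for kind, s in series_by_kind.items()
}

for kind, type_name in values_types.items():
    print(f"{kind}: {type_name}")
```

Plain numeric data comes back as a NumPy array, while categorical data comes back as a pandas Categorical, so the type depends on the dtype of the Series.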
Read more
Modern Pandas (Part 8): Scaling
Source: datasframe | Author: Tom Augspurger | Published: Apr 23, 2018
This is part 8 in my series on writing modern idiomatic pandas.
Modern Pandas
Method Chaining
Indexes
Fast Pandas
Tidy Data
Visualization
Time Series
Scaling
As I sit down to write this, the third-most popular pandas question on StackOverflow covers how to use pandas for large datasets. This is in …
Read more
The Worldwide Pandas Documentation Sprint: A Closer Look
Source: pandas | NumFOCUS | Author: Admin | Published: Mar 27, 2018
The post The Worldwide Pandas Documentation Sprint: A Closer Look appeared first on NumFOCUS.
Read more
Activity on the pandas github repo during the March 10 documentation sprint
Source: Joris Van den Bossche - pandas | Author: Joris Van den Bossche | Published: Mar 13, 2018
Last weekend, Marc Garcia and many others organised a world-wide pandas documentation sprint (https://python-sprints.github.io/pandas/). The goal was to improve the pandas API documentation, and I have to say, it was a great success!
Read more
dask-ml 0.4.1 Released
Source: datasframe | Author: Tom Augspurger | Published: Feb 13, 2018
This work is supported by Anaconda Inc and the Data
Driven Discovery Initiative from the Moore
Foundation.
dask-ml 0.4.1 was released today with a few enhancements. See the
changelog for all the changes from 0.4.0.
Conda packages are available on conda-forge
$ conda install -c conda-forge dask-ml …
Read more
Extension Arrays for Pandas
Source: datasframe | Author: Tom Augspurger | Published: Feb 12, 2018
This is a status update on some enhancements for pandas. The goal of the work
is to store things that are sufficiently array-like in a pandas DataFrame,
even if they aren't a regular NumPy array. Pandas already does this in a few
places for some blessed types (like Categorical); we'd …
Read more
Easy distributed training with Joblib and dask
Source: datasframe | Author: Tom Augspurger | Published: Feb 05, 2018
This work is supported by Anaconda Inc and the Data
Driven Discovery Initiative from the Moore
Foundation.
This past week, I had a chance to visit some of the scikit-learn developers at
Inria in Paris. It was a fun and productive week, and I'm thankful to them for
hosting me …
Read more
dask-ml
Source: datasframe | Author: Tom Augspurger | Published: Oct 26, 2017
Today we released the first version of dask-ml, a library for parallel and
distributed machine learning. Read the documentation or install it with
pip install dask-ml
Packages are currently building for conda-forge, and will be up later today.
conda install -c conda-forge dask-ml
The Goals
dask is, to quote the …
Read more
Scalable Machine Learning (Part 2): Partial Fit
Source: datasframe | Author: Tom Augspurger | Published: Sep 15, 2017
This work is supported by Anaconda, Inc. and the
Data Driven Discovery Initiative from the Moore Foundation.
This is part two of my series on scalable machine learning.
Small Fit, Big Predict
Scikit-Learn Partial Fit
You can download a notebook of this post here.
Scikit-learn supports out-of-core learning (fitting a …
Read more
Scalable Machine Learning (Part 1)
Source: datasframe | Author: Tom Augspurger | Published: Sep 11, 2017
This work is supported by Anaconda Inc. and the Data Driven Discovery
Initiative from the Moore Foundation.
Anaconda is interested in scaling the scientific python ecosystem. My current
focus is on out-of-core, parallel, and distributed machine learning. This series
of posts will introduce those concepts, explore what we have available …
Read more
Introducing Stitch
Source: datasframe | Author: Tom Augspurger | Published: Aug 30, 2016
Today I released stitch into the
wild. If you haven't yet, check out the examples
page to see an example of what stitch does,
and the Github repo for how to
install. I'm using this post to explain why I wrote stitch, and some
issues it tries to solve.
Why …
Read more
Modern Pandas (Part 7): Timeseries
Source: datasframe | Author: Tom Augspurger | Published: May 13, 2016
This is part 7 in my series on writing modern idiomatic pandas.
Modern Pandas
Method Chaining
Indexes
Fast Pandas
Tidy Data
Visualization
Time Series
Scaling
Timeseries
Pandas started out in the financial world, so naturally it has strong timeseries support.
The first half of this post will look at pandas' …
Read more
Modern Pandas (Part 6): Visualization
Source: datasframe | Author: Tom Augspurger | Published: Apr 28, 2016
This is part 6 in my series on writing modern idiomatic pandas.
Modern Pandas
Method Chaining
Indexes
Fast Pandas
Tidy Data
Visualization
Time Series
Scaling
Visualization and Exploratory Analysis
A few weeks ago, the R community went through some hand-wringing about plotting packages.
For outsiders (like me) the details aren't …
Read more
Modern Pandas (Part 5): Tidy Data
Source: datasframe | Author: Tom Augspurger | Published: Apr 22, 2016
This is part 5 in my series on writing modern idiomatic pandas.
Modern Pandas
Method Chaining
Indexes
Fast Pandas
Tidy Data
Visualization
Time Series
Scaling
Reshaping & Tidy Data
Structuring datasets to facilitate analysis (Wickham 2014)
So, you've sat down to analyze a new dataset.
What do you do first?
In …
Read more
Modern Pandas (Part 3): Indexes
Source: datasframe | Author: Tom Augspurger | Published: Apr 11, 2016
This is part 3 in my series on writing modern idiomatic pandas.
Modern Pandas
Method Chaining
Indexes
Fast Pandas
Tidy Data
Visualization
Time Series
Scaling
Indexes can be a difficult concept to grasp at first.
I suspect this is partly because they're somewhat peculiar to pandas.
These aren't like the …
Read more
Modern Pandas (Part 4): Performance
Source: datasframe | Author: Tom Augspurger | Published: Apr 08, 2016
This is part 4 in my series on writing modern idiomatic pandas.
Modern Pandas
Method Chaining
Indexes
Fast Pandas
Tidy Data
Visualization
Time Series
Scaling
Wes McKinney, the creator of pandas, is kind of obsessed with performance. From micro-optimizations for element access, to embedding a fast hash table inside pandas …
Read more
Modern Pandas (Part 2): Method Chaining
Source: datasframe | Author: Tom Augspurger | Published: Apr 04, 2016
This is part 2 in my series on writing modern idiomatic pandas.
Modern Pandas
Method Chaining
Indexes
Fast Pandas
Tidy Data
Visualization
Time Series
Scaling
Method Chaining
Method chaining, where you call methods on an object one after another, is in vogue at the moment.
It's always been a style …
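As a rough illustration of the style, the sketch below chains several DataFrame methods into one pipeline. The flights table and the add_delay_flag helper are hypothetical and exist only for this example; they are not from the post.

```python
import pandas as pd

# Made-up flight data for illustration.
flights = pd.DataFrame({
    "carrier": ["AA", "AA", "DL", "DL", "UA"],
    "dep_delay": [10.0, -5.0, 30.0, None, 0.0],
})


def add_delay_flag(df, threshold=15):
    """Hypothetical helper: flag rows whose delay exceeds the threshold."""
    return df.assign(delayed=df["dep_delay"] > threshold)


# Each method returns a new object, so the steps read top to bottom.
summary = (
    flights
    .dropna(subset=["dep_delay"])          # drop rows with unknown delay
    .pipe(add_delay_flag, threshold=15)    # plug a custom function into the chain
    .groupby("carrier")
    .agg(mean_delay=("dep_delay", "mean"), n_delayed=("delayed", "sum"))
    .sort_values("mean_delay", ascending=False)
)
print(summary)
```

Using .pipe keeps user-defined steps in the same left-to-right flow as the built-in methods, without intermediate variables.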
Read more
Modern Pandas (Part 1)
Source: datasframe | Author: Tom Augspurger | Published: Mar 21, 2016
This is part 1 in my series on writing modern idiomatic pandas.
Modern Pandas
Method Chaining
Indexes
Fast Pandas
Tidy Data
Visualization
Time Series
Scaling
Effective Pandas
Introduction
This series is about how to make effective use of pandas, a data analysis library for the Python programming language.
It's targeted …
Read more
Why pandas users should be excited about Apache Arrow
Source: Wes McKinney - pandas | Author: Wes McKinney | Published: Feb 22, 2016
I'm super excited to be involved in the new open source Apache Arrow
community initiative. For Python (and R, too!), it will help enable
Substantially improved data access speeds
Closer to native performance Python extensions for big data systems like
Apache Spark
New in-memory analytics functionality for nested / JSON-like data
There's plenty of places you can learn more about Arrow, but this post is about
how it's specifically relevant to pandas users. See, for example:
"Python and Hadoop: A State of the Union"
"Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard"
"Introducing Apache Arrow: Columnar In-Memory Analytics"
Read more
NumFOCUS Announces New Fiscally Sponsored Project: pandas
Source: pandas | NumFOCUS | Author: nf-admin | Published: Oct 09, 2015
By Gina Helfrich. NumFOCUS is pleased to announce pandas as our newest fiscally sponsored project. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas enables users to carry out their entire data analysis workflow in Python without having to switch to a more domain-specific language like […]
The post NumFOCUS Announces New Fiscally Sponsored Project: pandas appeared first on NumFOCUS.
Read more
dplyr and pandas
Source: datasframe | Author: Tom Augspurger | Published: Oct 16, 2014
This notebook compares pandas
and dplyr.
The comparison is just on syntax (verbiage), not performance. Whether you're an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition.
We'll work through the introductory dplyr vignette to analyze some flight …
Read more
Practical Pandas Part 3 - Exploratory Data Analysis
Source: datasframe | Author: Tom Augspurger | Published: Sep 16, 2014
Welcome back. As a reminder:
In part 1 we got a dataset with my cycling data from last year merged and stored in an HDF5 store
In part 2 we did some cleaning and augmented the cycling data with data from http://forecast.io.
You can find the full source code …
Read more
Practical Pandas Part 2 - More Tidying, More Data, and Merging
Source: datasframe | Author: Tom Augspurger | Published: Sep 04, 2014
This is Part 2 in the Practical Pandas Series, where I work through a data analysis problem from start to finish.
It's a misconception that we can cleanly separate the data analysis pipeline into a linear
sequence of steps from
data acquisition
data tidying
exploratory analysis
model building
production
As …
Read more
Practical Pandas Part 1 - Reading the Data
Source: datasframe | Author: Tom Augspurger | Published: Aug 26, 2014
This is the first post in a series where I'll show how I use pandas on real-world datasets.
For this post, we'll look at data I collected with Cyclemeter on
my daily bike ride to and from school last year.
I had to manually start and stop the tracking at …
Read more
Using Python to tackle the CPS (Part 4)
Source: datasframe | Author: Tom Augspurger | Published: May 19, 2014
Last time, we got to where we'd like to have started: One file per month, with each month laid out the same.
As a reminder, the CPS interviews households 8 times over the course of 16 months. They're interviewed for 4 months, take 8 months off, and are interviewed four …
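The rotation pattern above (4 months in, 8 months out, 4 months in) can be sketched with a little arithmetic. The function name and the zero-based month index are assumptions made for this illustration.

```python
def interview_months(first_month, months_in=4, months_out=8):
    """Return the month offsets in which a household is interviewed.

    Hypothetical helper: months are counted as integer offsets, with
    first_month being the month of the first interview.
    """
    first_block = range(first_month, first_month + months_in)
    second_start = first_month + months_in + months_out
    second_block = range(second_start, second_start + months_in)
    return list(first_block) + list(second_block)


months = interview_months(0)
print(months)  # [0, 1, 2, 3, 12, 13, 14, 15]

n_interviews = len(months)
span_in_months = months[-1] - months[0] + 1
print(n_interviews, span_in_months)  # 8 16
```

This reproduces the numbers in the excerpt: 8 interviews spread over a 16-month window.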
Read more
Using Python to tackle the CPS (Part 3)
Source: datasframe | Author: Tom Augspurger | Published: May 19, 2014
In part 2 of this series, we set the stage to parse the data files themselves.
As a reminder, we have a dictionary that looks like
   id        length  start  end
0  HRHHID        15      1   15
1  HRMONTH        2     16   17
2  HRYEAR4        4     18   21
3  HURESPLI       2     22   23 …
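A minimal sketch of how a dictionary like this could drive the parsing of one fixed-width record, assuming start and end are 1-based and inclusive as the table suggests. The sample line is made up, and parse_line is a hypothetical helper rather than the code from the post.

```python
# (name, start, end) triples copied from the rows shown above.
fields = [
    ("HRHHID", 1, 15),
    ("HRMONTH", 16, 17),
    ("HRYEAR4", 18, 21),
    ("HURESPLI", 22, 23),
]


def parse_line(line, fields):
    """Slice one fixed-width line into a dict of field name -> text.

    Positions are 1-based and inclusive, so field (start, end)
    corresponds to the Python slice line[start - 1:end].
    """
    return {name: line[start - 1:end].strip() for name, start, end in fields}


# Invented record: 15-character household id, month, year, respondent line.
sample_line = "123456789012345" + "05" + "2014" + " 1"
record = parse_line(sample_line, fields)
print(record)
```

For whole files, pandas.read_fwf accepts a colspecs argument with half-open (from, to) intervals, so the same table would translate to (start - 1, end) pairs.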
Read more
Tidy Data in Action
Source: datasframe | Author: Tom Augspurger | Published: Mar 27, 2014
Hadley Wickham wrote a famous paper (for a certain definition of famous) about the importance of tidy data when doing data analysis.
I want to talk a bit about that, using an example from a StackOverflow post, with a solution using pandas. The principles of tidy data aren't language specific …
Read more
Organizing Papers
Source: datasframe | Author: Tom Augspurger | Published: Feb 13, 2014
As a graduate student, you read a lot of journal articles... a lot.
With the material in the articles being as difficult as it is, I didn't want to worry about organizing everything as well.
That's why I wrote this script to help (I may have also been procrastinating from …
Read more
Using Python to tackle the CPS (Part 2)
Source: datasframe | Author: Tom Augspurger | Published: Feb 04, 2014
Last time, we used Python to fetch some data from the Current Population Survey. Today, we'll work on parsing the files we just downloaded.
We downloaded two types of files last time:
CPS monthly tables: a fixed-width format text file with the actual data
Data Dictionaries: a text file describing …
Read more
Using Python to tackle the CPS
Source: datasframe | Author: Tom Augspurger | Published: Jan 27, 2014
The Current Population Survey is an important source of data for economists. Its modern form took shape in the 70's and unfortunately the data format and distribution show their age. Some centers like IPUMS have attempted to put a nicer face on accessing the data, but they haven't done everything …
Read more