In learning data science, you’ll soon discover that your skillset doesn’t end with Python – there are a number of other tools you’ll need to grasp. One of those is Pandas, a Python library which facilitates data processing. We asked Joe Eddy, Senior Data Scientist at Metis’ Data Science Bootcamp to explains what Pandas is, how data scientists and real companies are using it, and how beginners who want to learn Pandas can start dabbling on their own.
Joe is a Senior Data Scientist and instructor at the Metis bootcamp.
Joe has a bachelor’s degree in math. Before Metis, he worked as a marketing data analyst and as an equity research associate.
Fun fact: Before he wound up as an instructor, Joe was actually a Metis bootcamp student himself!
Joe has been teaching at Metis since the fall of 2017.
Why does Pandas exist and where did it come from?
Pandas is a catch-all Python library; a resource for doing data analysis and manipulation; any kind of data processing, analyzing, filtering, and aggregating. Pandas can be used for just about any process where you're trying to gain insight from data using code.
Pandas was developed by someone working in a quantitative role at a hedge fund called AQR. He was writing a lot of data processing scripts and doing a lot of it from scratch. He realized that it would be very helpful to himself and other people if there was a more standardized and systematic library for running these types of processes, and so he started developing the library and made it an open-source package. In the last few years, Pandas has caught on as the golden standard for Python data analysis.
What is a Python library?
Python, as a programming language, lets you do a set collection of operations and fairly low-level things. Any library is a pre-existing codebase that someone has taken the time to develop, and has made available to others to repurpose (essentially code work that someone else has done and shared, so that you don’t have to.)
Libraries use the lower-level operations of a coding language to build a more sophisticated program that operates in a more specialized or efficient way, or to build an interface for doing some kind of common processing task – allowing the programmer to abstract away the specifics of the lower-level processes that are happening. Libraries leverage these lower-level building blocks to build up more high-level functionalities. A library can also work in tandem with other libraries. Libraries make it so that you don't have to reinvent the wheel every time you want to perform a standard processing task.
When might Pandas be used in the workplace?
For those coming into data science with perhaps a background in Excel (or a similar analysis tool) it might be helpful to conceive of the central building-block of Pandas – the dataframe – as being similar to an Excel spreadsheet made up of several columns and rows storing different related data points.
Here are a couple of examples:
Let's say, for instance, that you wanted to study the housing market in New York – perhaps to see how those prices or homes changed over time, or maybe to study the effect of neighborhood on house price – then you could use Pandas.
Another example might be if you wanted to interpret a data set of weather readings where each row might show the weather forecast at a specific time. You could use Pandas to investigate weather trends at different times of the year, or to look at trends over time, or at certain times of the day.
Are there any specific technologies that often get used alongside Pandas?
Yes, definitely. The two biggest ones are NumPy and scikit-learn; both of which are also Python libraries.
NumPy, I would describe as (in some sense) being like a fancy math library in which there’s a lot of linear algebra, or standard math calculations, implemented in a really efficient way – so: operations on arrays, kind of. Pandas is actually built on top of this library, and therefore uses a lot of the underlying data structures and functionalities coming from NumPy. For that reason, they go hand-in-hand. You often wind up using them together, maybe referencing some specific NumPy function to do something on a Pandas-style object, or going back and forth.
Scikit-learn is the gold standard for most machine learning applications in Python. If you're working on a predictive modeling type of problem, a common workflow would be to first use Pandas and NumPy to load the data: to analyze, format and create the data that you would actually want to use to build a predictive model. And then, once you have completed that pre-processing step, you would feed the data to a model from scikit-learn and use that model to make predictions, for example. This workflow and combination of packages are extremely common in data science applications.
What alternatives are there to Pandas?
In Python, there's no real clear alternative to Pandas.
Outside of Python, you would find the main parallel options. For example, some people who work with R use a library called dplyr. I believe it's a similar idea to Pandas: a data manipulation library that makes certain types of functionalities easier and faster. While I haven’t used it myself, there's also something called Alteryx, which I think is used to set up data formatting pipelines. I think that that suite of technology is the main set of alternatives.
What sort of companies might need their data scientists to use Pandas for analysis?
Just about every company using Python for its data analysis would require Pandas, simply because of its versatility.
Pandas is well-suited to working with most tabular data structures – so any company with tabular data (i.e., data that can be represented as rows and columns) would find Pandas useful. In fact, pretty much any type of analysis that you'd possibly want to do is feasible with Pandas.
The only cases I can think of where it might be harder to use Pandas would be when working with incompatible data formats like images or audio or potentially something like text data, where the structure of those data types can't quite be captured in the normal structure that you'd use with Pandas.
Even so, most companies working with these less compatible styles of data will also have some tabular data they're working with.
Do you have any examples of specific companies that use Pandas?
Again, just about any company that I can think of uses it. Some examples are:
Any company that's working with relational data on customers and customer transactions to analyze trends and model behavior. Examples are the financial payments company Square and the grocery delivery company Instacart.
Companies like Zillow, which studies large quantities of house price and house characteristic data to determine housing trends and build market price forecasting models.
The list is endless. Any company doing data science in Python is probably using Pandas.
Why is Pandas “the golden standard” for Python data analysis?
The real reason that Pandas is so popular now is probably because it was the first library of its kind. Being the first of its kind and open-source (theoretically anyone can contribute to making it better), more and more people became invested in it. I think that because it was really the first Python library that served this function at scale, it just became very popular.
Aside from that, I'd say that it is just a really nice library: it has a very nice interface; it's very well-documented; it is fairly intuitive to use (once you get past the initial learning curve), and it leads to a coding style that feels fairly natural once you're comfortable with the package.
Overall, I think its popularity is due to the fact that it fits within Python, which is a popular language.
Is Pandas a useful library for beginner data scientists to learn? If so, why?
Pandas would be the next library that I would suggest a beginner learning, after learning basic Python. The reason being, that it opens the door to any data work that you want to do. If you want to work with a data set, try to understand it, format it, and extract information from it, then the easiest way will be with Pandas. And that type of information extraction (actually working with data) is what I think that you should aim to do when starting out as a data scientist; to get used to working with the data as quickly as possible.
I’d be remiss if I didn’t also say that a lot the job of a data scientist is spent purely on cleaning and processing data, and getting it into the right format to do a downstream analysis or to build a downstream model. Being that such a big part of the job, you should spend ample time on the packages that allow you to do that kind of work – and that’s exactly the role Pandas plays.
Are there any disadvantages to learning Pandas?
No, none that I can think of. But I do think that if someone tried to learn Pandas too quickly, before they had a strong grasp of Python, that they might end up taking some shortcuts that might stunt their understanding of Python syntax:
There are certain syntax constructions in Pandas, that are built on more basic Python data types (e.g. understanding things like dictionary reference style; how to pass a list as an argument to another function). Pandas makes use of those kinds of syntax.
If someone were too focused on trying to use Pandas, without really understanding Python – the language itself – they might end up learning in a more roundabout way, trying to adapt to how that particular syntax worked.
Whereas, if that individual had a stronger foundational grounding in the more basic language, they would probably have a much more intuitive experience and understanding of Pandas overall.
How hard is it to learn Pandas?
Pandas can be used to apply a lot of advanced functionality, so it does get complex. But most basic, core operations of Pandas can be picked up fairly quickly. There are, however, a number of other, more tricky core operations to learn which take more time:
Using groupby objects and performing aggregate calculations
Merging and/or concatenating dataframes
Time series processing (e.g. rolling statistics)
You’ll need a good amount of Pandas practice to become comfortable with those, but I think that with a reasonable amount of time, it’s not too hard to do.
Why does Metis teach Pandas?
Pandas is central to the data science workflows of so many companies and so commonly used in industry, that our graduates really should have a very strong grasp of it, if they want to be able to do the job effectively. It’s the sort of thing that will easily come up in interviews and take-home challenges where a student would be asked to take a data set; run some analysis and maybe build a model with that data set. I think pretty much any company would expect that a student using Python would use Pandas to do that analysis.
Pandas is core to the workflow and it is core to opening up the door to working on any data project; to being able to understand any data set, and to being able to build a predictive model with any data set.
How does Metis teach Pandas?
Our curriculum revolves around projects and developing a project portfolio. Students begin using Pandas the very first week, and use it throughout the rest of the course - in pretty much every project worked on.
The first project is very focussed on exploratory data analysis – with Pandas being one of the major tools for that.
During lectures, students are taught Pandas in depth, and have their understanding and skills reinforced through the completion of the practical exercises that we assign them, in and outside of lectures.
In the fourth week of the course we do a lecture on feature engineering which heavily leverages Pandas syntax and operations. This is a way to connect some of the concepts learned earlier on in the course to more advanced concepts in something like predictive modeling.
Students also use Pandas as a tool in their projects. We believe that the best way to teach Pandas is to have our students apply it to a data set that actually interests them, or is useful. In essence, our method at Metis is to introduce Pandas early on, and to reinforce understanding through continuous practical application.
What kind of jobs require knowledge of Pandas?
Any job title in the data stack that you can think about in Python would probably require Pandas: a data engineering role, or a data analyst role, and for sure a data scientist role. Analyst and scientist roles would heavily leverage Pandas. Even if you’re not doing something like statistical models or machine learning, you still want to be able to use Pandas to:
Clean and prepare data
Extract summary information and analyze data (if you’re using it to run an A/B test, for example.)
To create visualizations to help guide business decisions
Pandas is pretty monolithic in terms of any data application that’s using the Python technology stack.
How can beginners get started learning Pandas, and what resources do you suggest?
I’m a firm believer in applying Pandas as early on as possible. Here are some resources::
The Pandas documentation is pretty good and has some introductory materials like walk-throughs and explanations. This makes for a good starting point.
Online Pandas challenge repositories. These “repos” can be used as a way to regularly practice your core skills as they develop over time.
Sites like Kaggle, where you can find a variety of interesting different data sets and see how others have used Pandas to analyze them. This is a nice way to see the library brought to life, working with real data.
My greatest recommendation for learning Pandas is this: use it on your own project. This is the best way to learn Pandas.
Find an existing dataset online or scrape a dataset you’ve found that interests you. Take that data; try to analyze it; use Pandas to tease out interesting trends in the data. When working with a dataset that you find stimulating, you’ll find that Pandas becomes a lot more concrete, a lot more quickly.
Go back and scrutinize your work. Where could your code be better? In my experience, there’s usually a more efficient route that you could have taken. Maybe in one section the code is taking too long to run; maybe another section uses 20 lines of code and should be more simple. There’s a very good chance that you can improve something. That second pass is your opportunity to improve your own code and to learn Pandas a level deeper. This kind of workflow – pushing yourself to work with a real-life, interesting data set; getting your hands dirty – is the very best way to learn Pandas.
Did you know that Metis’ immersive program is now offered live online? Find out more and read Metis reviews on Course Report. This article was produced by the Course Report team in partnership with Metis.
These are the skills to add to your 2020 tech toolbox...
Our guide to choosing between Dataquest and Datacamp – two self-paced, online data science classes.
Every coding bootcamp will teach CSS – but what is it? Get a head start with this guide to CSS!