Pandas: The Swiss Army Knife for Your Data, Part 2

This is part two of a two-part tutorial about Pandas, the amazing Python data analytics toolkit. 

In part one, we covered the basic data types of Pandas: the series and the data frame. We imported and exported data, selected subsets of data, worked with metadata, and sorted the data. 

In this part, we’ll continue our journey and deal with missing data, data manipulation, data merging, data grouping, time series, and plotting.

Dealing With Missing Values

One of the strongest points of pandas is its handling of missing values. It will not just crash and burn in the presence of missing data. When data is missing, pandas represents it with NumPy’s np.nan (not a number), and most computations, such as sum() and mean(), skip it by default.

Let’s reindex our data frame, adding more rows and columns, but without any new data. To make it interesting, we’ll populate some values.

Note that df.index.append() returns a new index and doesn’t modify the existing index. Also, df.reindex() returns a new data frame that I assign back to the df variable.
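
The reindexing step described above might look like the following sketch. The starting frame, its row labels ("one" to "five"), and the column names are assumptions chosen to match the description, not the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical starting frame: five labelled rows, two columns of random floats.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.standard_normal((5, 2)),
    index=["one", "two", "three", "four", "five"],
    columns=["a", "b"],
)

# Index.append() returns a new index; the original index is left untouched.
new_index = df.index.append(pd.Index(["six"]))
new_columns = list(df.columns) + ["c"]

# reindex() returns a new data frame, so assign it back to df.
df = df.reindex(index=new_index, columns=new_columns)

# Populate a couple of values in the new column.
df.loc["three", "c"] = 3
df.loc["four", "c"] = 4
print(df)
```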

At this point, our data frame has six rows. The last row is all NaNs, and all other rows except the third and the fourth have NaN in the “c” column. What can you do with missing data? Here are your options:

  • Keep it (but it will not participate in computations).
  • Drop it (the result of the computation will not contain the missing data).
  • Replace it with a default value.
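
As a rough illustration of the three options, here is a small sketch on a made-up frame. The column names and values are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing values.
df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, np.nan, np.nan]})

# Keep: reductions such as sum() skip NaN by default.
kept_sum = df["a"].sum()  # 1.0 + 2.0

# Drop: remove rows where all values are missing, or where any value is missing.
dropped_all = df.dropna(how="all")
dropped_any = df.dropna()

# Replace: substitute a default value for every NaN.
filled = df.fillna(0)

print(kept_sum)
print(dropped_all)
print(dropped_any)
print(filled)
```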

If you just want to check if you have missing data in your data frame, use the isnull() method. This returns a boolean mask of your data frame, which is True for missing values and False elsewhere.
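
A minimal sketch of checking for missing data, using a made-up two-column frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three missing cells.
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

mask = df.isnull()               # boolean frame, True where a value is missing
has_missing = mask.values.any()  # is anything missing at all?
missing_per_column = mask.sum()  # count of missing values in each column

print(mask)
print(has_missing)
print(missing_per_column)
```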

Manipulating Your Data

When you have a data frame, you often need to perform operations on the data. Let’s start with a new data frame that has four rows and three columns of random integers between 1 and 9 (inclusive).
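
One way to build such a frame is sketched below. The seed and the column names "a", "b", and "c" are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed chosen arbitrarily for repeatability

# integers() excludes the upper bound, so high=10 yields values 1 through 9.
df = pd.DataFrame(rng.integers(1, 10, size=(4, 3)), columns=["a", "b", "c"])
print(df)
```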

Now, you can start working on the data. Let’s sum up all the columns and assign the result to the last row, and then sum all the rows (axis 1) and assign the result to the last column:
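
The sketch below uses fixed values instead of random ones so the results are easy to follow. Note that, as described, the sums overwrite the existing last row and last column rather than adding new ones.

```python
import pandas as pd

# Hypothetical fixed values standing in for the random integers.
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 1, 1]],
    columns=["a", "b", "c"],
)

# Sum each column (over all four rows) and store the totals in the last row.
df.loc[3] = df.sum()

# Sum each row (axis=1) and store the totals in the last column.
df["c"] = df.sum(axis=1)
print(df)
```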

You can also perform operations on the entire data frame. Here is an example of subtracting 3 from each and every cell:
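
A minimal sketch with made-up values; arithmetic operators apply element-wise to the whole frame and return a new frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5], "b": [6, 7]})

# Subtract 3 from every cell; df itself is left unchanged.
shifted = df - 3
print(shifted)
```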

For total control, you can apply arbitrary functions:
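
As a sketch, apply() accepts any callable and passes it each column (the default) or each row (axis=1). The functions and values here are arbitrary examples.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 3], "b": [2, 4]})

# Column-wise transformation: the lambda receives each column as a Series.
squared_plus_one = df.apply(lambda col: col ** 2 + 1)

# Column-wise reduction: one value per column.
col_range = df.apply(lambda col: col.max() - col.min())

# Row-wise: with axis=1 the lambda receives each row as a Series.
row_total = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(squared_plus_one)
print(col_range)
print(row_total)
```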

Merging Data

Another common scenario when working with data frames is combining and merging data frames (and series) together. Pandas, as usual, gives you different options. Let’s create another data frame and explore the various options.

Concat

When using pd.concat, pandas simply concatenates all the rows of the provided parts in order. There is no alignment of indexes. See in the following example how duplicate index values are created:
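
A small sketch with two made-up frames. The ignore_index=True variant is included only to show how to get a fresh index if the duplicates are unwanted.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

# Rows are stacked in order; each frame keeps its own index labels,
# so the labels 0 and 1 both appear twice.
combined = pd.concat([df1, df2])
print(combined)

# Optional: discard the original labels and number the rows afresh.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered)
```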

You can also concatenate columns by using the axis=1 argument:
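
A sketch of column-wise concatenation, assuming a first frame with two rows and a second frame with four rows (the values are made up):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # only two rows
df2 = pd.DataFrame({"c": [5, 6, 7, 8], "d": [9, 10, 11, 12]})

# With axis=1, frames are placed side by side and aligned on the row index.
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
print(side_by_side.dtypes)
```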

Note that because the first data frame had fewer rows (I used only two), the missing values were automatically populated with NaNs, which changed those column types from int to float.

It’s possible to concatenate any number of data frames in one call.

Merge

The merge function behaves much like a SQL join. It merges all the columns from rows that have matching keys. Note that it operates on two data frames only:
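
A minimal sketch of a merge on a shared key column. The column name "key" and the values are assumptions for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# Default is an inner join: only keys present in both frames survive.
inner = pd.merge(left, right, on="key")

# An outer join keeps every key and fills the gaps with NaN.
outer = pd.merge(left, right, on="key", how="outer")

print(inner)
print(outer)
```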

Append

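The data frame’s append() method is a little shortcut. It functionally behaves like concat(), but saves some keystrokes. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use pd.concat() instead.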
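
Because append() is not available in every pandas version, the sketch below uses pd.concat() to get the same row-wise result. The frames and values are made up.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5], "b": [6]})

# Row-wise equivalent of df1.append(df2): stack df2 under df1.
appended = pd.concat([df1, df2])
print(appended)
```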

Grouping Your Data

Here is a data frame that contains the members and ages of two families: the Smiths and the Joneses. You can use the groupby() method to group data by last name and find information at the family level like the sum of ages and the mean age:
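
A sketch of the grouping described above. The first names and ages are invented for illustration and are not the original data.

```python
import pandas as pd

# Hypothetical family members.
df = pd.DataFrame({
    "first_name": ["John", "Jane", "Jack", "Jill", "Joe"],
    "last_name": ["Smith", "Smith", "Jones", "Jones", "Jones"],
    "age": [40, 38, 35, 33, 5],
})

# Group by family name, then aggregate the age column per family.
ages_by_family = df.groupby("last_name")["age"]
total_age = ages_by_family.sum()
mean_age = ages_by_family.mean()

print(total_age)
print(mean_age)
```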

Time Series

A lot of important data is time series data. Pandas has strong support for time series data, starting with date ranges, going through localization and time conversion, and all the way to sophisticated frequency-based resampling.

The date_range() function can generate sequences of datetimes. Here is an example of generating a six-week period starting on 1 January 2017 using the UTC time zone.
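
One possible way to generate such a range is sketched below. It assumes weekly frequency ("W", which anchors on Sundays) and six periods; 1 January 2017 was a Sunday, so the range starts on that date.

```python
import pandas as pd

# Six weekly timestamps starting on 1 January 2017, localized to UTC.
weeks = pd.date_range("2017-01-01", periods=6, freq="W", tz="UTC")
print(weeks)
```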

Adding a timestamp to your data frames, either as a data column or as the index, is great for organizing and grouping your data by time. It also allows resampling. Here is an example of resampling per-minute data into five-minute aggregations.
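
A minimal resampling sketch, assuming ten made-up per-minute readings and the mean as the aggregation (sum(), max(), and others work the same way):

```python
import pandas as pd

# Ten per-minute readings with a datetime index.
minutes = pd.date_range("2017-01-01", periods=10, freq="min")
series = pd.Series(range(10), index=minutes)

# Bucket the readings into five-minute bins and average each bin.
five_minute_mean = series.resample("5min").mean()
print(five_minute_mean)
```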

Plotting

Pandas supports plotting with matplotlib. Make sure it’s installed: pip install matplotlib. To generate a plot, call the plot() method of a series or a data frame. There are many options to control the plot, but the defaults work for simple visualization purposes. Here is how to generate a line graph and save it to a PDF file.
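
A rough sketch of the plotting step follows. The data, the file name line_plot.pdf, and the choice of the non-interactive "Agg" backend (so the script runs without a display) are all assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed

import pandas as pd

# Hypothetical data to plot.
series = pd.Series([1, 3, 2, 5, 4])

# plot() draws a line graph by default and returns a matplotlib Axes object.
ax = series.plot()

# Save the figure that owns the axes; the .pdf extension selects the format.
ax.get_figure().savefig("line_plot.pdf")
```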

Note that on macOS, depending on the matplotlib version and backend, Python may need to be installed as a framework for plotting with Pandas.

Conclusion

Pandas is a very broad data analytics framework. It has a simple object model with the concepts of series and data frame and a wealth of built-in functionality. You can compose and mix pandas functions and your own algorithms. 

The data import and export facilities in pandas are also extensive, so you can integrate it easily into existing systems. If you’re doing any data processing in Python, pandas belongs in your toolbox.
