In this tutorial you'll learn all about histograms and density plots.

Set up the notebook

As always, we begin by setting up the coding environment.

Select a dataset

We'll work with a dataset of 150 different flowers: 50 each from three species of iris (Iris setosa, Iris versicolor, and Iris virginica).

Load and examine the data

Each row in the dataset corresponds to a different flower. There are four measurements: the sepal length and width, along with the petal length and width. We also keep track of the corresponding species.
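As a rough sketch of what examining such a table can look like, the snippet below builds a tiny stand-in DataFrame with pandas and prints its first rows and shape. The column names and the measurements are assumptions made up for illustration; the real dataset has 150 rows and would normally be loaded from a file.

```python
import pandas as pd

# Stand-in rows with made-up measurements; the real dataset has 150 rows.
# The column names are assumptions, not taken from the original file.
iris_data = pd.DataFrame(
    {
        "Sepal Length (cm)": [5.0, 6.1, 6.8],
        "Sepal Width (cm)": [3.4, 2.9, 3.1],
        "Petal Length (cm)": [1.5, 4.4, 5.7],
        "Petal Width (cm)": [0.2, 1.4, 2.1],
        "Species": ["Iris setosa", "Iris versicolor", "Iris virginica"],
    }
)

# Each row is one flower: four measurements plus the species label.
print(iris_data.head())
print(iris_data.shape)
```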

Histograms

Say we would like to create a histogram to see how petal length varies in iris flowers. We can do this with the sns.distplot command. (In seaborn 0.11 and later, sns.distplot is deprecated in favor of sns.histplot.)

We customize the behavior of the command with two additional pieces of information:

Density plots

The next type of plot is a kernel density estimate (KDE) plot. In case you're not familiar with KDE plots, you can think of one as a smoothed histogram.
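To make the "smoothed histogram" idea concrete, here is a small sketch of a Gaussian kernel density estimate written by hand. The function name, the fixed bandwidth, and the sample values are all assumptions for illustration; seaborn chooses a bandwidth automatically and does this work for us.

```python
import math

def gaussian_kde(samples, x, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centered on each sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

# Made-up petal lengths (cm), for illustration only.
petal_lengths = [1.4, 1.5, 1.3, 4.5, 4.7, 5.1]

# The estimate is high near clusters of samples and low in the gap between them.
density_near_cluster = gaussian_kde(petal_lengths, 1.4, bandwidth=0.3)
density_in_gap = gaussian_kde(petal_lengths, 3.0, bandwidth=0.3)
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual samples more closely.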

To make a KDE plot, we use the sns.kdeplot command. Setting shade=True colors the area below the curve (in seaborn 0.11 and later, fill=True is the recommended spelling), and data= has the same function as the corresponding argument in the histogram above.

2D KDE plots

We're not restricted to a single column when creating a KDE plot. We can create a two-dimensional (2D) KDE plot with the sns.jointplot command.

In the plot below, the color-coding shows us how likely we are to see different combinations of sepal width and petal length, where darker parts of the figure are more likely.

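Note that in addition to the 2D KDE plot in the center, sns.jointplot draws a one-dimensional KDE plot for each of the two variables along the top and right margins of the figure.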

Color-coded plots

For the next part of the tutorial, we'll create plots to understand differences between the species. To accomplish this, we begin by breaking the dataset into three separate files, with one for each species.
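The tutorial works with one file per species. If only a single combined table were available, a similar split could be sketched with pandas as shown below. This is an alternative under stated assumptions: the "Species" column name and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the full dataset (column names are assumed).
iris_data = pd.DataFrame(
    {
        "Petal Length (cm)": [1.4, 1.5, 4.5, 4.7, 5.8, 6.0],
        "Species": [
            "Iris setosa", "Iris setosa",
            "Iris versicolor", "Iris versicolor",
            "Iris virginica", "Iris virginica",
        ],
    }
)

# One DataFrame per species, keyed by the species name.
species_frames = {
    name: group.reset_index(drop=True)
    for name, group in iris_data.groupby("Species")
}
```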

In the code cell below, we create a different histogram for each species by using the sns.distplot command (as above) three times. We use label= to set how each histogram will appear in the legend.

In this case, the legend does not automatically appear on the plot. To force it to show (for any plot type), we can always use plt.legend().

We can also create a KDE plot for each species by using sns.kdeplot (as above). Again, label= is used to set the values in the legend.

One interesting pattern in the plots is that the plants seem to fall into one of two groups: Iris versicolor and Iris virginica have similar values for petal length, while Iris setosa is in a category all by itself.

In fact, according to this dataset, we might even be able to classify any iris plant as Iris setosa (as opposed to Iris versicolor or Iris virginica) just by looking at the petal length: if the petal length of an iris flower is less than 2 cm, it's most likely to be Iris setosa!
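Written as code, this rule of thumb is a single comparison. The function name and the sample measurements below are hypothetical, and the 2 cm threshold is simply the one read off the plots above, not a fitted model.

```python
def looks_like_setosa(petal_length_cm, threshold_cm=2.0):
    """Rule of thumb from the plots: short petals suggest Iris setosa."""
    return petal_length_cm < threshold_cm

# Made-up measurements for illustration.
samples = {"flower_a": 1.4, "flower_b": 4.7, "flower_c": 5.9}
predictions = {name: looks_like_setosa(length) for name, length in samples.items()}
```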

What's next?

Put your new skills to work in a coding exercise!

