Commit e29ab75

source commit: 653397b
0 parents  commit e29ab75

File tree

119 files changed

+20479
-0
lines changed

01-intro.md

Lines changed: 419 additions & 0 deletions
Large diffs are not rendered by default.

02-numpy.md

Lines changed: 839 additions & 0 deletions
Large diffs are not rendered by default.

03-matplotlib.md

Lines changed: 361 additions & 0 deletions
@@ -0,0 +1,361 @@
---
title: Visualizing Tabular Data
teaching: 30
exercises: 20
---

::::::::::::::::::::::::::::::::::::::: objectives

- Plot simple graphs from data.
- Plot multiple graphs in a single figure.

::::::::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::: questions

- How can I visualize tabular data in Python?
- How can I group several plots together?

::::::::::::::::::::::::::::::::::::::::::::::::::

## Visualizing data

The mathematician Richard Hamming once said, "The purpose of computing is insight, not numbers,"
and the best way to develop insight is often to visualize data. Visualization deserves an entire
lecture of its own, but we can explore a few features of Python's `matplotlib` library here. While
there is no official plotting library, `matplotlib` is the *de facto* standard. First, we will
import the `pyplot` module from `matplotlib` and use two of its functions to create and display a
[heat map](../learners/reference.md#heat-map) of our data:

:::::::::::::::::::::::::::::::::::::::::: prereq

## Episode Prerequisites

If you are continuing in the same notebook from the previous episode, you already
have a `data` variable and have imported `numpy`. If you are starting a new
notebook at this point, you need the following two lines:

```python
import numpy
data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')
```

::::::::::::::::::::::::::::::::::::::::::::::::::

```python
import matplotlib.pyplot
image = matplotlib.pyplot.imshow(data)
matplotlib.pyplot.show()
```

![](fig/inflammation-01-imshow.svg){alt='Heat map representing the data variable. Each cell is colored by value along a color gradient from blue to yellow.'}

Each row in the heat map corresponds to a patient in the clinical trial dataset, and each column
corresponds to a day in the dataset. Blue pixels in this heat map represent low values, while
yellow pixels represent high values. As we can see, the general number of inflammation flare-ups
for the patients rises and falls over a 40-day period.

So far, so good: this is in line with our knowledge of the clinical trial and Dr. Maverick's
claims:

- the patients take their medication once their inflammation flare-ups begin
- it takes around 3 weeks for the medication to take effect and begin reducing flare-ups
- and flare-ups appear to drop to zero by the end of the clinical trial.
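To read actual values off a heat map rather than just "low" and "high", it can help to add a color scale next to it. The sketch below is one possible way to do that with `matplotlib.pyplot.colorbar`. The small `sample` array and the label text are made up for illustration and stand in for the `data` variable loaded from `inflammation-01.csv`.

```python
import numpy
import matplotlib.pyplot

# Hypothetical stand-in for `data`: 4 patients (rows) by 6 days (columns)
sample = numpy.array([[0, 1, 2, 3, 2, 1],
                      [0, 2, 4, 6, 4, 2],
                      [0, 1, 3, 5, 3, 1],
                      [0, 0, 1, 2, 1, 0]])

image = matplotlib.pyplot.imshow(sample)

# Draw a color scale beside the heat map showing which color means which value
colorbar = matplotlib.pyplot.colorbar(image)
colorbar.set_label('inflammation')
```

As in the other examples in this episode, a call to `matplotlib.pyplot.show()` afterwards displays the figure.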

Now let's take a look at the average inflammation over time:

```python
ave_inflammation = numpy.mean(data, axis=0)
ave_plot = matplotlib.pyplot.plot(ave_inflammation)
matplotlib.pyplot.show()
```

![](fig/inflammation-01-average.svg){alt='A line graph showing the average inflammation across all patients over a 40-day period.'}

Here, we have put the average inflammation per day across all patients in the variable
`ave_inflammation`, then asked `matplotlib.pyplot` to create and display a line graph of those
values. The result is a reasonably linear rise and fall, in line with Dr. Maverick's claim that
the medication takes 3 weeks to take effect. But a good data scientist doesn't just consider the
average of a dataset, so let's have a look at two other statistics:

```python
max_plot = matplotlib.pyplot.plot(numpy.amax(data, axis=0))
matplotlib.pyplot.show()
```

![](fig/inflammation-01-maximum.svg){alt='A line graph showing the maximum inflammation across all patients over a 40-day period.'}

```python
min_plot = matplotlib.pyplot.plot(numpy.amin(data, axis=0))
matplotlib.pyplot.show()
```

![](fig/inflammation-01-minimum.svg){alt='A line graph showing the minimum inflammation across all patients over a 40-day period.'}

The maximum value rises and falls linearly, while the minimum seems to be a step function.
Neither trend seems particularly likely, so either there's a mistake in our calculations or
something is wrong with our data. This insight would have been difficult to reach by examining
the numbers themselves without visualization tools.
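Once a plot has raised a suspicion like this, a quick numerical check can back it up. The sketch below is only an illustration on made-up values, not the trial data: it uses `numpy.diff` to compute the change from each day to the next, so a perfectly linear rise and fall shows up as steps that are all the same size. With the real data you would pass in `numpy.amax(data, axis=0)` instead of the hypothetical `daily_max` array.

```python
import numpy

# Made-up daily maxima that rise by 1 each day and then fall by 1 each day
daily_max = numpy.array([0, 1, 2, 3, 4, 3, 2, 1, 0])

# Change from each day to the next (one element shorter than the input)
steps = numpy.diff(daily_max)
print(steps)

# True if every day-to-day change has exactly the same size
same_size = numpy.all(numpy.abs(steps) == 1)
print(same_size)
```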

### Grouping plots

You can group similar plots in a single figure using subplots.
The script below uses a number of new commands. The function `matplotlib.pyplot.figure()`
creates a space into which we will place all of our plots. The parameter `figsize`
tells Python how big to make this space (width and height, in inches). Each subplot is placed
into the figure using its `add_subplot` [method](../learners/reference.md#method). The
`add_subplot` method takes 3 parameters. The first denotes how many total rows of subplots
there are, the second parameter refers to the total number of subplot columns, and the final
parameter denotes which subplot your variable is referencing (left-to-right, top-to-bottom).
Each subplot is stored in a different variable (`axes1`, `axes2`, `axes3`). Once a subplot is
created, its axes can be labeled using the `set_xlabel()` method (or `set_ylabel()`).
Here are our three plots side by side:

```python
import numpy
import matplotlib.pyplot

data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')

fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

axes1 = fig.add_subplot(1, 3, 1)
axes2 = fig.add_subplot(1, 3, 2)
axes3 = fig.add_subplot(1, 3, 3)

axes1.set_ylabel('average')
axes1.plot(numpy.mean(data, axis=0))

axes2.set_ylabel('max')
axes2.plot(numpy.amax(data, axis=0))

axes3.set_ylabel('min')
axes3.plot(numpy.amin(data, axis=0))

fig.tight_layout()

matplotlib.pyplot.savefig('inflammation.png')
matplotlib.pyplot.show()
```

![](fig/inflammation-01-group-plot.svg){alt='Three line graphs showing the daily average, maximum and minimum inflammation over a 40-day period.'}

The [call](../learners/reference.md#function-call) to `loadtxt` reads our data,
and the rest of the program tells the plotting library
how large we want the figure to be,
that we're creating three subplots,
what to draw for each one,
and that we want a tight layout.
(If we leave out that call to `fig.tight_layout()`,
the graphs will actually be squeezed together more closely.)
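The way `add_subplot` numbers its positions is easier to see on a grid with more than one row. The following is a small sketch, separate from the inflammation example, that fills a 2 by 2 grid and titles each subplot with its position number. The variable names and titles are chosen only for this illustration.

```python
import matplotlib.pyplot

fig = matplotlib.pyplot.figure(figsize=(6.0, 4.0))

# 2 rows, 2 columns: positions are counted left-to-right, then top-to-bottom
top_left = fig.add_subplot(2, 2, 1)
top_right = fig.add_subplot(2, 2, 2)
bottom_left = fig.add_subplot(2, 2, 3)
bottom_right = fig.add_subplot(2, 2, 4)

top_left.set_title('position 1')
top_right.set_title('position 2')
bottom_left.set_title('position 3')
bottom_right.set_title('position 4')

fig.tight_layout()
```

Position 3 ends up below position 1 because counting continues on the next row once the first row is full.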

The call to `savefig` stores the plot as a graphics file. This can be
a convenient way to store your plots for use in other documents, web
pages, etc. The graphics format is automatically determined by
Matplotlib from the file name ending we specify; here PNG from
'inflammation.png'. Matplotlib supports many different graphics
formats, including SVG, PDF, and JPEG.
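As a minimal sketch of how the file name ending selects the format, the example below saves one small figure three times under different endings. The `values` array and the `example-plot` file names are made up for illustration, and the files are written to the current working directory.

```python
import os
import numpy
import matplotlib.pyplot

# Made-up values, standing in for one of the daily statistics
values = numpy.array([0, 1, 3, 2, 1])

fig = matplotlib.pyplot.figure(figsize=(4.0, 3.0))
axes = fig.add_subplot(1, 1, 1)
axes.plot(values)

# Same figure, three formats: only the file name ending changes
fig.savefig('example-plot.png')
fig.savefig('example-plot.svg')
fig.savefig('example-plot.pdf')

# Confirm that the files were written
print(os.path.exists('example-plot.png'))
print(os.path.exists('example-plot.svg'))
print(os.path.exists('example-plot.pdf'))
```

SVG and PDF are vector formats, so they stay sharp when scaled, which can make them a better fit for printed documents than PNG.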

::::::::::::::::::::::::::::::::::::::::: callout

## Importing libraries with shortcuts

In this lesson we use the `import matplotlib.pyplot`
[syntax](../learners/reference.md#syntax)
to import the `pyplot` module of `matplotlib`. However, shortcuts such as
`import matplotlib.pyplot as plt` are frequently used.
Importing `pyplot` this way means that after the initial import, rather than writing
`matplotlib.pyplot.plot(...)`, you can now write `plt.plot(...)`.
Another common convention is to use the shortcut `import numpy as np` when importing the
NumPy library. We can then write `np.loadtxt(...)` instead of `numpy.loadtxt(...)`,
for example.
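Here is a short sketch of what code looks like with both shortcuts in place. The `sample` array is made up for illustration and is much smaller than the inflammation data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: 2 patients (rows) by 3 days (columns)
sample = np.array([[0, 1, 2],
                   [2, 3, 4]])

# Same functions as before, reached through the shorter names
daily_mean = np.mean(sample, axis=0)
mean_plot = plt.plot(daily_mean)
```

After these imports only the names `np` and `plt` are defined, so the full names `numpy` and `matplotlib.pyplot` are not available in this code.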

Some people prefer these shortcuts because they are quicker to type and result in shorter
lines of code, especially for libraries with long names! You will frequently see
Python code online using a `pyplot` function with `plt`, or a NumPy function with
`np`, and it's because they've used this shortcut. It makes no difference which
approach you choose to take, but you must be consistent: if you use
`import matplotlib.pyplot as plt`, then `matplotlib.pyplot.plot(...)` will not work, and
you must use `plt.plot(...)` instead. Because of this, when working with other people it
is important that you agree on how libraries are imported.

::::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::: challenge

## Plot Scaling

Why do all of our plots stop just short of the upper end of our graph?

::::::::::::::: solution

## Solution

Because matplotlib normally sets x and y axes limits to the min and max of our data
(depending on data range).

:::::::::::::::::::::::::

If we want to change this, we can use the `set_ylim(min, max)` method of each 'axes',
for example:

```python
axes3.set_ylim(0, 6)
```

Update your plotting code to automatically set a more appropriate scale.
(Hint: you can make use of the `max` and `min` methods to help.)

::::::::::::::: solution

## Solution

```python
# One method
axes3.set_ylabel('min')
axes3.plot(numpy.amin(data, axis=0))
axes3.set_ylim(0, 6)
```

:::::::::::::::::::::::::

::::::::::::::: solution

## Solution

```python
# A more automated approach
min_data = numpy.amin(data, axis=0)
axes3.set_ylabel('min')
axes3.plot(min_data)
axes3.set_ylim(numpy.amin(min_data), numpy.amax(min_data) * 1.1)
```

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::: challenge

## Drawing Straight Lines

In the center and right subplots above, we expect all lines to look like step functions because
non-integer values are not realistic for the minimum and maximum values. However, you can see
that the lines are not always vertical or horizontal, and in particular the step function
in the subplot on the right looks slanted. Why is this?

::::::::::::::: solution

## Solution

Because matplotlib interpolates (draws a straight line) between the points.
One way to avoid this is to use the Matplotlib `drawstyle` option:

```python
import numpy
import matplotlib.pyplot

data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')

fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

axes1 = fig.add_subplot(1, 3, 1)
axes2 = fig.add_subplot(1, 3, 2)
axes3 = fig.add_subplot(1, 3, 3)

axes1.set_ylabel('average')
axes1.plot(numpy.mean(data, axis=0), drawstyle='steps-mid')

axes2.set_ylabel('max')
axes2.plot(numpy.amax(data, axis=0), drawstyle='steps-mid')

axes3.set_ylabel('min')
axes3.plot(numpy.amin(data, axis=0), drawstyle='steps-mid')

fig.tight_layout()

matplotlib.pyplot.show()
```

![](fig/inflammation-01-line-styles.svg){alt='Three line graphs, with step lines connecting the points, showing the daily average, maximum and minimum inflammation over a 40-day period.'}

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::: challenge

## Make Your Own Plot

Create a plot showing the standard deviation (`numpy.std`)
of the inflammation data for each day across all patients.

::::::::::::::: solution

## Solution

```python
std_plot = matplotlib.pyplot.plot(numpy.std(data, axis=0))
matplotlib.pyplot.show()
```

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::: challenge

## Moving Plots Around

Modify the program to display the three plots on top of one another
instead of side by side.

::::::::::::::: solution

## Solution

```python
import numpy
import matplotlib.pyplot

data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')

# change figsize (swap width and height)
fig = matplotlib.pyplot.figure(figsize=(3.0, 10.0))

# change add_subplot (swap first two parameters)
axes1 = fig.add_subplot(3, 1, 1)
axes2 = fig.add_subplot(3, 1, 2)
axes3 = fig.add_subplot(3, 1, 3)

axes1.set_ylabel('average')
axes1.plot(numpy.mean(data, axis=0))

axes2.set_ylabel('max')
axes2.plot(numpy.amax(data, axis=0))

axes3.set_ylabel('min')
axes3.plot(numpy.amin(data, axis=0))

fig.tight_layout()

matplotlib.pyplot.show()
```

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::: keypoints

- Use the `pyplot` module from the `matplotlib` library for creating simple visualizations.

::::::::::::::::::::::::::::::::::::::::::::::::::
