| 1 | +--- |
| 2 | +title: Visualizing Tabular Data |
| 3 | +teaching: 30 |
| 4 | +exercises: 20 |
| 5 | +--- |
| 6 | + |
| 7 | +::::::::::::::::::::::::::::::::::::::: objectives |
| 8 | + |
| 9 | +- Plot simple graphs from data. |
| 10 | +- Plot multiple graphs in a single figure. |
| 11 | + |
| 12 | +:::::::::::::::::::::::::::::::::::::::::::::::::: |
| 13 | + |
| 14 | +:::::::::::::::::::::::::::::::::::::::: questions |
| 15 | + |
| 16 | +- How can I visualize tabular data in Python? |
| 17 | +- How can I group several plots together? |
| 18 | + |
| 19 | +:::::::::::::::::::::::::::::::::::::::::::::::::: |
| 20 | + |
| 21 | +## Visualizing data |
| 22 | + |
| 23 | +The mathematician Richard Hamming once said, "The purpose of computing is insight, not numbers," |
| 24 | +and the best way to develop insight is often to visualize data. Visualization deserves an entire |
| 25 | +lecture of its own, but we can explore a few features of Python's `matplotlib` library here. While |
| 26 | +there is no official plotting library, `matplotlib` is the *de facto* standard. First, we will |
| 27 | +import the `pyplot` module from `matplotlib` and use two of its functions to create and display a |
| 28 | +[heat map](../learners/reference.md#heat-map) of our data: |
| 29 | + |
| 30 | +:::::::::::::::::::::::::::::::::::::::::: prereq |
| 31 | + |
| 32 | +## Episode Prerequisites |
| 33 | + |
| 34 | +If you are continuing in the same notebook from the previous episode, you already |
| 35 | +have a `data` variable and have imported `numpy`. If you are starting a new |
| 36 | +notebook at this point, you need the following two lines: |
| 37 | + |
| 38 | +```python |
| 39 | +import numpy |
| 40 | +data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',') |
| 41 | +``` |
| 42 | + |
| 43 | +:::::::::::::::::::::::::::::::::::::::::::::::::: |
| 44 | + |
| 45 | +```python |
| 46 | +import matplotlib.pyplot |
| 47 | +image = matplotlib.pyplot.imshow(data) |
| 48 | +matplotlib.pyplot.show() |
| 49 | +``` |
| 50 | + |
| 51 | +{alt='Heat map representing the data variable. Each cell is colored by value along a color gradient from blue to yellow.'} |
| 52 | + |
| 53 | +Each row in the heat map corresponds to a patient in the clinical trial dataset, and each column |
| 54 | +corresponds to a day in the dataset. Blue pixels in this heat map represent low values, while |
| 55 | +yellow pixels represent high values. As we can see, the general number of inflammation flare-ups |
| 56 | +for the patients rises and falls over a 40-day period. |
| 57 | + |
| 58 | +So far so good, as this is in line with our knowledge of the clinical trial and Dr. Maverick's
| 59 | +claims:
| 60 | + |
| 61 | +- the patients take their medication once their inflammation flare-ups begin |
| 62 | +- it takes around 3 weeks for the medication to take effect and begin reducing flare-ups |
| 63 | +- and flare-ups appear to drop to zero by the end of the clinical trial. |
| 64 | + |
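The color mapping described above (low values in blue, high values in yellow) can be made explicit by attaching a color bar to the heat map. The following is a minimal sketch, not part of the lesson's own code: `demo_data` is a hypothetical stand-in for the inflammation data so that the example runs without the CSV file, and the `Agg` backend line is only there so it also runs without a display (leave it out in a notebook).

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available; omit this line in a notebook
import matplotlib.pyplot
import numpy

# Hypothetical stand-in for the inflammation data: 6 "patients" (rows), 8 "days" (columns).
demo_data = numpy.arange(48).reshape(6, 8)

# Draw the heat map, then attach a color bar that shows which colors mean which values.
image = matplotlib.pyplot.imshow(demo_data)
colorbar = matplotlib.pyplot.colorbar(image)

# Label the axes of the heat map itself: columns are days, rows are patients.
image.axes.set_xlabel('day')
image.axes.set_ylabel('patient')
```

With the real dataset you would pass `data` to `imshow` instead of `demo_data` and call `matplotlib.pyplot.show()` to display the result.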
| 65 | +Now let's take a look at the average inflammation over time: |
| 66 | + |
| 67 | +```python |
| 68 | +ave_inflammation = numpy.mean(data, axis=0) |
| 69 | +ave_plot = matplotlib.pyplot.plot(ave_inflammation) |
| 70 | +matplotlib.pyplot.show() |
| 71 | +``` |
| 72 | + |
| 73 | +{alt='A line graph showing the average inflammation across all patients over a 40-day period.'} |
| 74 | + |
| 75 | +Here, we have put the average inflammation per day across all patients in the variable |
| 76 | +`ave_inflammation`, then asked `matplotlib.pyplot` to create and display a line graph of those |
| 77 | +values. The result is a reasonably linear rise and fall, in line with Dr. Maverick's claim that |
| 78 | +the medication takes 3 weeks to take effect. But a good data scientist doesn't just consider the |
| 79 | +average of a dataset, so let's have a look at two other statistics: |
| 80 | + |
| 81 | +```python |
| 82 | +max_plot = matplotlib.pyplot.plot(numpy.amax(data, axis=0)) |
| 83 | +matplotlib.pyplot.show() |
| 84 | +``` |
| 85 | + |
| 86 | +{alt='A line graph showing the maximum inflammation across all patients over a 40-day period.'} |
| 87 | + |
| 88 | +```python |
| 89 | +min_plot = matplotlib.pyplot.plot(numpy.amin(data, axis=0)) |
| 90 | +matplotlib.pyplot.show() |
| 91 | +``` |
| 92 | + |
| 93 | +{alt='A line graph showing the minimum inflammation across all patients over a 40-day period.'} |
| 94 | + |
| 95 | +The maximum value rises and falls linearly, while the minimum seems to be a step function. |
| 96 | +Neither trend seems particularly likely, so either there's a mistake in our calculations or |
| 97 | +something is wrong with our data. This insight would have been difficult to reach by examining |
| 98 | +the numbers themselves without visualization tools. |
| 99 | + |
| 100 | +### Grouping plots |
| 101 | + |
| 102 | +You can group similar plots in a single figure using subplots. |
| 103 | +The script below uses a number of new commands. The function `matplotlib.pyplot.figure()`
| 104 | +creates a space into which we will place all of our plots. The parameter `figsize`
| 105 | +sets the width and height of this space in inches. Each subplot is placed into the figure using
| 106 | +its `add_subplot` [method](../learners/reference.md#method). The `add_subplot` method takes
| 107 | +3 parameters. The first denotes how many total rows of subplots there are, the second parameter
| 108 | +refers to the total number of subplot columns, and the final parameter denotes which subplot
| 109 | +your variable is referencing (left-to-right, top-to-bottom). Each subplot is stored in a
| 110 | +different variable (`axes1`, `axes2`, `axes3`). Once a subplot is created, the axes can
| 111 | +be labeled using the `set_xlabel()` method (or `set_ylabel()`).
| 112 | +Here are our three plots side by side: |
| 113 | + |
| 114 | +```python |
| 115 | +import numpy |
| 116 | +import matplotlib.pyplot |
| 117 | + |
| 118 | +data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',') |
| 119 | + |
| 120 | +fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0)) |
| 121 | + |
| 122 | +axes1 = fig.add_subplot(1, 3, 1) |
| 123 | +axes2 = fig.add_subplot(1, 3, 2) |
| 124 | +axes3 = fig.add_subplot(1, 3, 3) |
| 125 | + |
| 126 | +axes1.set_ylabel('average') |
| 127 | +axes1.plot(numpy.mean(data, axis=0)) |
| 128 | + |
| 129 | +axes2.set_ylabel('max') |
| 130 | +axes2.plot(numpy.amax(data, axis=0)) |
| 131 | + |
| 132 | +axes3.set_ylabel('min') |
| 133 | +axes3.plot(numpy.amin(data, axis=0)) |
| 134 | + |
| 135 | +fig.tight_layout() |
| 136 | + |
| 137 | +matplotlib.pyplot.savefig('inflammation.png') |
| 138 | +matplotlib.pyplot.show() |
| 139 | +``` |
| 140 | + |
| 141 | +{alt='Three line graphs showing the daily average, maximum and minimum inflammation over a 40-day period.'} |
| 142 | + |
| 143 | +The [call](../learners/reference.md#function-call) to `loadtxt` reads our data, |
| 144 | +and the rest of the program tells the plotting library |
| 145 | +how large we want the figure to be, |
| 146 | +that we're creating three subplots, |
| 147 | +what to draw for each one, |
| 148 | +and that we want a tight layout. |
| 149 | +(If we leave out that call to `fig.tight_layout()`, |
| 150 | +the graphs will actually be squeezed together more closely.) |
| 151 | + |
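The third `add_subplot` parameter is easier to see on a grid with more than one row. The sketch below is an illustration only (the 2 x 2 grid, the `positions` dictionary and the subplot titles are not part of the lesson's script): it creates four empty subplots and records where each one ends up in the figure.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available; omit this line in a notebook
import matplotlib.pyplot

fig = matplotlib.pyplot.figure(figsize=(6.0, 4.0))

# Create a 2-row, 2-column grid and remember the position of each subplot.
positions = {}
for index in range(1, 5):
    axes = fig.add_subplot(2, 2, index)
    axes.set_title('subplot ' + str(index))
    # get_position() returns the subplot's bounding box in figure coordinates,
    # where x grows to the right and y grows upward.
    positions[index] = axes.get_position()

# Index 2 sits to the right of index 1; index 3 starts the second (lower) row.
print(positions[1].x0 < positions[2].x0)
print(positions[3].y0 < positions[1].y0)
```

Indices 1 and 2 fill the top row from left to right, and indices 3 and 4 fill the bottom row, which matches the left-to-right, top-to-bottom numbering described earlier.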
| 152 | +The call to `savefig` stores the plot as a graphics file. This can be
| 153 | +a convenient way to store your plots for use in other documents, web
| 154 | +pages, etc. The graphics format is automatically determined by
| 155 | +Matplotlib from the file name ending we specify; here, PNG from
| 156 | +`inflammation.png`. Matplotlib supports many different graphics
| 157 | +formats, including SVG, PDF, and JPEG.
| 158 | + |
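As a small sketch of how the file name ending selects the format, the example below saves one figure three times with different extensions. The plotted values, the temporary output directory and the `demo_plot` file name are all made up for illustration. JPEG is left out here because, in many installations, writing it relies on the optional Pillow package.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # assumption: no display available; omit this line in a notebook
import matplotlib.pyplot

# A throwaway figure with made-up values, just so there is something to save.
fig = matplotlib.pyplot.figure(figsize=(4.0, 3.0))
axes = fig.add_subplot(1, 1, 1)
axes.plot([0, 1, 2, 3], [0, 1, 4, 9])

# Save the same figure under three names; only the extension changes.
output_dir = tempfile.mkdtemp()
saved_files = []
for extension in ['png', 'svg', 'pdf']:
    path = os.path.join(output_dir, 'demo_plot.' + extension)
    fig.savefig(path)
    saved_files.append(path)

print(saved_files)
```

Each file in `saved_files` holds the same plot, written in the format implied by its extension.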
| 159 | +::::::::::::::::::::::::::::::::::::::::: callout |
| 160 | + |
| 161 | +## Importing libraries with shortcuts |
| 162 | + |
| 163 | +In this lesson we use the `import matplotlib.pyplot` |
| 164 | +[syntax](../learners/reference.md#syntax) |
| 165 | +to import the `pyplot` module of `matplotlib`. However, shortcuts such as |
| 166 | +`import matplotlib.pyplot as plt` are frequently used. |
| 167 | +Importing `pyplot` this way means that after the initial import, rather than writing |
| 168 | +`matplotlib.pyplot.plot(...)`, you can now write `plt.plot(...)`. |
| 169 | +Another common convention is to use the shortcut `import numpy as np` when importing the |
| 170 | +NumPy library. We then can write `np.loadtxt(...)` instead of `numpy.loadtxt(...)`, |
| 171 | +for example. |
| 172 | + |
| 173 | +Some people prefer these shortcuts as they are quicker to type and result in shorter
| 174 | +lines of code, especially for libraries with long names! You will frequently see
| 175 | +Python code online using a `pyplot` function with `plt`, or a NumPy function with
| 176 | +`np`, and it's because they've used this shortcut. It makes no difference which
| 177 | +approach you choose to take, but you must be consistent: if you use
| 178 | +`import matplotlib.pyplot as plt`, then `matplotlib.pyplot.plot(...)` will not work, and
| 179 | +you must use `plt.plot(...)` instead. Because of this, when working with other people it
| 180 | +is important that you agree on how libraries are imported.
| 181 | + |
| 182 | + |
| 183 | +:::::::::::::::::::::::::::::::::::::::::::::::::: |
| 184 | + |
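A short sketch of the shortcut style described in the callout above is shown here. The `demo_values` array is made up for illustration, and the `switch_backend` call is only an assumption that lets the example run without a display; it is not needed in a notebook.

```python
import matplotlib.pyplot as plt
import numpy as np

plt.switch_backend('Agg')  # assumption: no display available; omit in a notebook

# Made-up values standing in for one row of real data.
demo_values = np.array([0, 1, 4, 9, 16])

# With the shortcuts, np.mean(...) and plt.plot(...) replace
# numpy.mean(...) and matplotlib.pyplot.plot(...).
average = np.mean(demo_values)
lines = plt.plot(demo_values)

print(average)
```

Because only the names `plt` and `np` are defined by these imports, the rest of the script has to use them consistently.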
| 185 | +::::::::::::::::::::::::::::::::::::::: challenge |
| 186 | + |
| 187 | +## Plot Scaling |
| 188 | + |
| 189 | +Why do all of our plots stop just short of the upper end of our graph? |
| 190 | + |
| 191 | +::::::::::::::: solution |
| 192 | + |
| 193 | +## Solution |
| 194 | + |
| 195 | +Because Matplotlib normally sets the x- and y-axis limits to the min and max of our data
| 196 | +(depending on data range).
| 197 | + |
| 198 | + |
| 199 | +::::::::::::::::::::::::: |
| 200 | + |
| 201 | +If we want to change this, we can use the `set_ylim(min, max)` method of each 'axes', |
| 202 | +for example: |
| 203 | + |
| 204 | +```python |
| 205 | +axes3.set_ylim(0, 6) |
| 206 | +``` |
| 207 | + |
| 208 | +Update your plotting code to automatically set a more appropriate scale. |
| 209 | +(Hint: you can make use of the `max` and `min` methods to help.) |
| 210 | + |
| 211 | +::::::::::::::: solution |
| 212 | + |
| 213 | +## Solution |
| 214 | + |
| 215 | +```python |
| 216 | +# One method |
| 217 | +axes3.set_ylabel('min') |
| 218 | +axes3.plot(numpy.amin(data, axis=0)) |
| 219 | +axes3.set_ylim(0, 6) |
| 220 | +``` |
| 221 | + |
| 222 | +::::::::::::::::::::::::: |
| 223 | + |
| 224 | +::::::::::::::: solution |
| 225 | + |
| 226 | +## Solution |
| 227 | + |
| 228 | +```python |
| 229 | +# A more automated approach |
| 230 | +min_data = numpy.amin(data, axis=0) |
| 231 | +axes3.set_ylabel('min') |
| 232 | +axes3.plot(min_data) |
| 233 | +axes3.set_ylim(numpy.amin(min_data), numpy.amax(min_data) * 1.1) |
| 234 | +``` |
| 235 | + |
| 236 | +::::::::::::::::::::::::: |
| 237 | + |
| 238 | +:::::::::::::::::::::::::::::::::::::::::::::::::: |
| 239 | + |
| 240 | +::::::::::::::::::::::::::::::::::::::: challenge |
| 241 | + |
| 242 | +## Drawing Straight Lines |
| 243 | + |
| 244 | +In the center and right subplots above, we expect all lines to look like step functions because |
| 245 | +non-integer values are not realistic for the minimum and maximum values. However, you can see |
| 246 | +that the lines are not always vertical or horizontal, and in particular the step function |
| 247 | +in the subplot on the right looks slanted. Why is this? |
| 248 | + |
| 249 | +::::::::::::::: solution |
| 250 | + |
| 251 | +## Solution |
| 252 | + |
| 253 | +Because Matplotlib interpolates (draws a straight line) between the points.
| 254 | +One way to avoid this is to use the Matplotlib `drawstyle` option:
| 255 | + |
| 256 | +```python |
| 257 | +import numpy |
| 258 | +import matplotlib.pyplot |
| 259 | + |
| 260 | +data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',') |
| 261 | + |
| 262 | +fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0)) |
| 263 | + |
| 264 | +axes1 = fig.add_subplot(1, 3, 1) |
| 265 | +axes2 = fig.add_subplot(1, 3, 2) |
| 266 | +axes3 = fig.add_subplot(1, 3, 3) |
| 267 | + |
| 268 | +axes1.set_ylabel('average') |
| 269 | +axes1.plot(numpy.mean(data, axis=0), drawstyle='steps-mid') |
| 270 | + |
| 271 | +axes2.set_ylabel('max') |
| 272 | +axes2.plot(numpy.amax(data, axis=0), drawstyle='steps-mid') |
| 273 | + |
| 274 | +axes3.set_ylabel('min') |
| 275 | +axes3.plot(numpy.amin(data, axis=0), drawstyle='steps-mid') |
| 276 | + |
| 277 | +fig.tight_layout() |
| 278 | + |
| 279 | +matplotlib.pyplot.show() |
| 280 | +``` |
| 281 | + |
| 282 | +{alt='Three line graphs, with step lines connecting the points, showing the daily average, maximum and minimum inflammation over a 40-day period.'}
| 283 | + |
| 284 | + |
| 285 | + |
| 286 | +::::::::::::::::::::::::: |
| 287 | + |
| 288 | +:::::::::::::::::::::::::::::::::::::::::::::::::: |
| 289 | + |
| 290 | +::::::::::::::::::::::::::::::::::::::: challenge |
| 291 | + |
| 292 | +## Make Your Own Plot |
| 293 | + |
| 294 | +Create a plot showing the standard deviation (`numpy.std`) |
| 295 | +of the inflammation data for each day across all patients. |
| 296 | + |
| 297 | +::::::::::::::: solution |
| 298 | + |
| 299 | +## Solution |
| 300 | + |
| 301 | +```python |
| 302 | +std_plot = matplotlib.pyplot.plot(numpy.std(data, axis=0)) |
| 303 | +matplotlib.pyplot.show() |
| 304 | +``` |
| 305 | + |
| 306 | +::::::::::::::::::::::::: |
| 307 | + |
| 308 | +:::::::::::::::::::::::::::::::::::::::::::::::::: |
| 309 | + |
| 310 | +::::::::::::::::::::::::::::::::::::::: challenge |
| 311 | + |
| 312 | +## Moving Plots Around |
| 313 | + |
| 314 | +Modify the program to display the three plots on top of one another |
| 315 | +instead of side by side. |
| 316 | + |
| 317 | +::::::::::::::: solution |
| 318 | + |
| 319 | +## Solution |
| 320 | + |
| 321 | +```python |
| 322 | +import numpy |
| 323 | +import matplotlib.pyplot |
| 324 | + |
| 325 | +data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',') |
| 326 | + |
| 327 | +# change figsize (swap width and height) |
| 328 | +fig = matplotlib.pyplot.figure(figsize=(3.0, 10.0)) |
| 329 | + |
| 330 | +# change add_subplot (swap first two parameters) |
| 331 | +axes1 = fig.add_subplot(3, 1, 1) |
| 332 | +axes2 = fig.add_subplot(3, 1, 2) |
| 333 | +axes3 = fig.add_subplot(3, 1, 3) |
| 334 | + |
| 335 | +axes1.set_ylabel('average') |
| 336 | +axes1.plot(numpy.mean(data, axis=0)) |
| 337 | + |
| 338 | +axes2.set_ylabel('max') |
| 339 | +axes2.plot(numpy.amax(data, axis=0)) |
| 340 | + |
| 341 | +axes3.set_ylabel('min') |
| 342 | +axes3.plot(numpy.amin(data, axis=0)) |
| 343 | + |
| 344 | +fig.tight_layout() |
| 345 | + |
| 346 | +matplotlib.pyplot.show() |
| 347 | +``` |
| 348 | + |
| 349 | +::::::::::::::::::::::::: |
| 350 | + |
| 351 | +:::::::::::::::::::::::::::::::::::::::::::::::::: |
| 352 | + |
| 353 | + |
| 354 | + |
| 355 | +:::::::::::::::::::::::::::::::::::::::: keypoints |
| 356 | + |
| 357 | +- Use the `pyplot` module from the `matplotlib` library for creating simple visualizations. |
| 358 | + |
| 359 | +:::::::::::::::::::::::::::::::::::::::::::::::::: |
| 360 | + |
| 361 | + |