Wednesday, November 26, 2014

Think Stats: Exploratory Data Analysis by Allen B. Downey; O'Reilly Media

I recently finished reading Think Stats: Exploratory Data Analysis by Allen B. Downey, an introduction to using probability and statistics to analyze data sets. The book uses Python to explore and analyze several example data sets.


I have a decent statistics background (several undergraduate and graduate level statistics courses), and this book definitely takes a different approach from any I have seen before. The focus is on an exploratory and computational approach to analyzing a data set. This approach is very valuable, and it provides a skill set that is much easier to apply than what a traditional statistics introduction offers.
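As a rough illustration of what I mean by a computational approach, here is a minimal sketch of a permutation test for a difference in means. This is my own example, not code from the book: the function name `permutation_test` and its parameters are made up for illustration, and it deliberately uses only the standard library so it runs without any extra packages.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_iter=1000, seed=0):
    """Estimate a p-value for the difference in means by shuffling labels.

    Hypothetical helper for illustration only (not from thinkstats2.py).
    Returns (observed absolute difference, estimated p-value).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = abs(mean(group_a) - mean(group_b))

    # Under the null hypothesis the group labels are interchangeable,
    # so pool the values and repeatedly re-split them at random.
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1

    # Fraction of shuffles at least as extreme as the observed difference.
    return observed, hits / n_iter


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5]
    b = [101, 102, 103, 104, 105]
    observed, p_value = permutation_test(a, b)
    print("observed difference:", observed)
    print("estimated p-value:", p_value)
```

Instead of looking up a formula for a test statistic, you simulate the null hypothesis directly and count how often chance alone produces a result as extreme as the one observed.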

This book is not a thorough reference (though it often provides links to Wikipedia or other external sources for more information), and it won't replace my other statistics textbooks. However, it is a good introduction to the field (including many more advanced topics) and is easy to follow. I would be very interested in seeing a class that used this book as the text and followed the approach presented here. The book flows logically, but the topics are presented in a very different order from the one in which I originally learned them.

To get the most out of this book, I would definitely recommend working through the examples.  An even better approach would be to work through the topics on a data set you have at hand that is of interest to you.

Most of the examples in the book use the author's "thinkstats2.py" module. You can get the thinkstats2.py module (along with other sample code) at the book's GitHub page, and all of the examples can be viewed in IPython Notebooks. The examples are fairly straightforward, but I have not used the module enough to know whether I would consider it a candidate for a general purpose tool beyond working through the book. The author assumes you are familiar with Python, and having the module available lets the reader focus on the data and analysis.

The non-core Python packages used in this book are: pandas, NumPy, SciPy, StatsModels, and matplotlib. Pandas, in particular, is used quite heavily in thinkstats2.py. The author recommends the Anaconda distribution, which gives you all of these packages and many more. I've been using Anaconda as my primary distribution for the past six months or so and am very happy with it.
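If you are not using Anaconda, a quick way to see which of these packages your environment already has is a small check like the one below. This is just a sketch of my own (the `check_packages` helper is hypothetical, not something from the book); it only looks up the standard import names and does not actually import anything.

```python
from importlib.util import find_spec

# Import names for the packages listed above.
REQUIRED = ["pandas", "numpy", "scipy", "statsmodels", "matplotlib"]


def check_packages(names=REQUIRED):
    """Return a dict mapping each package name to True if it is installed."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, installed in check_packages().items():
        print(f"{name:12s} {'found' if installed else 'MISSING'}")
```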

The book is also available under a CC BY-NC 3.0 license at Green Tea Press.

Disclaimer: I received a free ebook copy of this work under the O'Reilly Blogger Review Program.