Wednesday, November 26, 2014

Think Stats: Exploratory Data Analysis by Allen B. Downey; O'Reilly Media

I recently finished reading Think Stats: Exploratory Data Analysis by Allen B. Downey, an introduction to using probability and statistics to analyze data sets.  The book uses Python throughout to explore and perform statistical analysis on several example data sets.


I have a decent statistics background (several undergraduate and graduate level statistics courses), and this book definitely takes a different approach from any I have seen before.  The focus is on an exploratory and computational approach to analyzing a data set.  This approach is very valuable, and it provides a much more readily applied skill set than a traditional statistics introduction.

This book is not a thorough reference (though it often provides links to Wikipedia or other external sources for more information), and it won't replace my other statistics textbooks.  However, it is a good introduction to the field (including many more advanced topics) and is easy to follow.  I would be very interested in seeing a class that used this book as the text and followed the approach presented here.  The book flows logically, but the topics were presented in a very different order than I was originally exposed to them.

To get the most out of this book, I would definitely recommend working through the examples.  An even better approach would be to work through the topics on a data set you have at hand that is of interest to you.

Most of the examples in the book use the author's "thinkstats2.py" module.  You can get thinkstats2.py (along with the other sample code) from the book's GitHub page, and all of the examples can be viewed as IPython Notebooks.  The examples are fairly straightforward, but I have not used the module enough to know whether I would consider it a candidate for a general-purpose tool beyond working through the book.  The author assumes you are familiar with Python, and having the module available is useful because it allows the reader to focus on the data and the analysis.

The non-core Python packages used in this book are: pandas, NumPy, SciPy, StatsModels, and matplotlib.  Pandas, in particular, is used quite heavily in thinkstats2.py.  The author recommends the Anaconda distribution, which gives you all of these packages and many more.  I've been using Anaconda as my primary distribution for the past 6 months or so and am very happy with it.
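To give a flavor of the kind of exploratory workflow the book teaches, here is a minimal sketch using pandas and NumPy.  The data here is synthetic (the book works with real survey data such as the NSFG); the `ecdf` helper is my own illustration of the empirical CDF, a representation Think Stats leans on heavily.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for a real dataset -- purely illustrative,
# not an example from the book.
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=100, scale=15, size=1000))

# Basic exploratory summaries: count, mean, std, quartiles.
summary = values.describe()

def ecdf(series):
    """Empirical CDF: for each sorted value, the fraction of
    observations at or below it."""
    xs = np.sort(series.to_numpy())
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys

xs, ys = ecdf(values)
print(summary)
```

From `xs` and `ys` it is a one-liner to plot the CDF with matplotlib (`plt.plot(xs, ys)`), which is the kind of distribution-first look at data the book encourages before reaching for summary statistics.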

The book is also available under a CC BY-NC 3.0 license at Green Tea Press.

Disclaimer: I received a free Ebook copy of this work under the O'Reilly Blogger Review Program.

Thursday, September 18, 2014

High Performance Python by Micha Gorelick & Ian Ozsvald; O'Reilly Media

One of the big draws of the Python programming language is that it is very easy to develop something relatively complex quite rapidly.  However, Python is much more than a prototyping language, and High Performance Python: Practical Performant Programming for Humans is a great resource to help you think about how you approach problems in Python, and to track down and eliminate bottlenecks in your code.


This book is definitely not going to teach you Python.  There are many other tutorials and references out there for learning the language, and this book assumes you are already a proficient Python programmer who can read and understand the code examples the authors provide.  This book is about tuning your Python code to run faster.

The progression of the chapters is very logical, and some of the same toy problems re-appear throughout the book as additional optimizations provide even greater efficiency improvements.  The book introduces a large number of tools, and it mostly gives you an idea of what the tool is and why you might consider it.  To really use any of the tools in practice, you'll want to reference online documentation, but this book gives you a good idea of where to start looking.


I was particularly interested in reading the "Clusters and Job Queues" chapter before I got the book, and it helped guide me to an IPython.parallel solution that fits my current problem quite nicely, as well as pointing me to some other tools I may investigate in the future.

The authors recommend the Anaconda Python distribution by Continuum Analytics on several occasions, and I definitely agree.  Some of the tools and techniques in the book use only the Standard Library, but most of the more advanced topics require external modules.  Many of the modules referenced (numpy, Cython, Tornado, & IPython to name just a few) are included in the Anaconda distribution as one simple download.

This book's use is twofold.  First, it is worth a full read-through for the discussion of the various things that tend to slow down Python code (or code in general) and what kinds of approaches you should be aware of.  Second, it provides good, brief examples of many different tools in practice, as well as listing other recommended resources at the end of each chapter, allowing it to serve as a good reference text.


One point the authors make repeatedly is that you must consider the trade-off between code execution time and development velocity.  Many of the things you can do to speed up your code will make it considerably harder to understand and work with in the future.  It's important to always have evidence that you are optimizing the right portions of code and that the benefits are worth it.  They help you look for the "big wins" where you can get drastic speed improvements with minimal effort and complexity.
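That "measure first" discipline is easy to demonstrate with nothing but the standard library.  This sketch is my own, not an example from the book: two equivalent implementations of a task, timed with `timeit` so the comparison rests on numbers rather than intuition.

```python
import timeit

def sum_squares_loop(n):
    # Straightforward pure-Python accumulation loop.
    total = 0
    for i in range(n):
        total += i * i
    return total

def sum_squares_builtin(n):
    # Same result via a generator expression fed to the built-in sum.
    return sum(i * i for i in range(n))

n = 10_000
# Always confirm the "optimized" version is still correct.
assert sum_squares_loop(n) == sum_squares_builtin(n)

# Measure both before deciding which to keep; results vary by
# interpreter and workload, which is exactly why you measure.
t_loop = timeit.timeit(lambda: sum_squares_loop(n), number=200)
t_builtin = timeit.timeit(lambda: sum_squares_builtin(n), number=200)
print(f"loop: {t_loop:.4f}s  builtin: {t_builtin:.4f}s")
```

For whole-program hotspots the book's tools of choice go further (cProfile, line_profiler), but the habit is the same: profile, change one thing, re-measure.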

Disclaimer: I received a free Ebook copy of this work under the O'Reilly Blogger Review Program.  I also happened to like it so much that I bought a hard copy as well so I can have it on my reference shelf at work.