Tuesday, October 8, 2013

The Joys and Cautions of Statistics

Documentary which takes viewers on a rollercoaster ride through the wonderful world of statistics to explore the remarkable power they have to change our understanding of the world, presented by superstar boffin Professor Hans Rosling, whose eye-opening, mind-expanding and funny online lectures have made him an international internet legend. 
Rosling is a man who revels in the glorious nerdiness of statistics, and here he entertainingly explores their history, how they work mathematically and how they can be used in today's computer age to see the world as it really is, not just as we imagine it to be. 
Rosling's lectures use huge quantities of public data to reveal the story of the world's past, present and future development. Now he tells the story of the world in 200 countries over 200 years using 120,000 numbers - in just four minutes. 
The film also explores cutting-edge examples of statistics in action today. In San Francisco, a new app mashes up police department data with the city's street map to show what crime is being reported street by street, house by house, in near real-time. Every citizen can use it and the hidden patterns of their city are starkly revealed. Meanwhile, at Google HQ the machine translation project tries to translate between 57 languages, using lots of statistics and no linguists. 
Despite its light and witty touch, the film nonetheless has a serious message - without statistics we are cast adrift on an ocean of confusion, but armed with stats we can take control of our lives, hold our rulers to account and see the world as it really is. What's more, [Rosling] concludes, we can now collect and analyse such huge quantities of data and at such speeds that scientific method itself seems to be changing.
This is a highly informative, entertaining film that shows the remarkable things statistics can do.  Here are three takeaways - that is, points of emphasis vis-a-vis Big Data and Analytics - for you to consider:

Statistics helps us understand better what is going on in the world around us and what we need to do in that world.

The car we drive has quite a lot more engineering, chemistry and technology beneath the hood, for example, than we can see or even wish to see.  Our day-to-day world is like a car:  We run it, we navigate through it, and we get things done in it.  But what of the complexity underlying our world?  What of those things we don't normally see?  Some may argue that as long as things work well enough for us to manage that world effectively, there isn't a need to see more or know more.  This is what good technology promises, isn't it:  effectiveness, ease and convenience.

But citizens of San Francisco had cause to delve more into their world, as Hans Rosling points out.  Data on crime, in particular, became community statistics when city officials made it public.  It told citizens, and visitors for that matter, where crime was more likely to occur and where crime was less likely to occur.  That's critical information.  I will not hazard a guess about what citizens and visitors believed, but clearly statistics surfaced a more complete truth about their city.  

In good measure, this is my Theory of Algorithms.  Back in March 2010, something came over me and crystallized whatever it was that had been brewing in my mind for years:  the drive to get at the underlying essence of things, to grasp reality as it is and things as they are, and to discern the patterns below the surface that bring things together.  Statistics is not the only means at my disposal, but as a mathematical framework and instrument, statistics plays a crucial role in my Theory of Algorithms.  

Those community statistics helped San Francisco citizens and visitors navigate the city safely and, more importantly, call upon officials to do something about the crime.  Community statistics helped everyone monitor the city and hold officials accountable for ensuring safety and security.

Numbers alone don't tell the whole story; we also have to analyze them.

Let me resort to a different analogy.  We go to the theater to watch a play.  There are actors and props, and they play out a drama that we can sit back and enjoy.  Because he is so entertaining and enthusiastic, Rosling makes statistics like a stage production.  Formally this is called descriptive statistics, that is, numbers acting out a drama and telling us a story.  

Inferential statistics, on the other hand, is the mathematics - calculation, computation or analysis - we do to the numbers in order to draw conclusions that reach beyond the data in hand.  Why do this?  Because the numbers themselves have a reality, and therefore have relationships and patterns among them, which, when analyzed, give us an even sharper, more illuminating picture of our broader reality.  Whether it's Pearson correlation, hierarchical regression, factor analysis or a simple t-test, it's inferential statistics.
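
To make the distinction a little more concrete, here is a minimal Python sketch.  The weekly counts for two neighborhoods are made up for illustration, and the function names describe and welch_t are my own.  The first function only summarizes the numbers we have (descriptive).  The second computes Welch's t statistic, one common way of asking whether a difference between two averages is bigger than chance variation alone would suggest (inferential).

```python
import math
import statistics


def describe(sample):
    """Descriptive statistics: summarize the numbers we actually have."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "stdev": statistics.stdev(sample),  # sample standard deviation
    }


def welch_t(a, b):
    """Inferential step: Welch's t statistic for a difference in means."""
    var_a = statistics.variance(a)  # sample variance of group a
    var_b = statistics.variance(b)  # sample variance of group b
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Made-up weekly incident counts for two hypothetical neighborhoods.
north = [12, 15, 11, 14, 13, 16, 12, 15]
south = [9, 11, 8, 10, 12, 9, 10, 11]

print(describe(north))
print(describe(south))
print(welch_t(north, south))  # roughly 4.5 for these made-up numbers
```

Turning that t statistic into a probability requires the t distribution, which this sketch leaves out.  The point is only that describing the numbers and testing them are two different steps.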

To draw on our theatrical analogy, inferential statistics is getting a behind-the-scenes look at the play, for example, going backstage during a performance.  It may also entail joining the repertory during several rehearsals.  What is more, it is also about getting below the surface of things.  So we may speak to the playwright, the director, and the producer, from the point of conceiving the play, through its staging, and even afterwards.  We gain a granular understanding, then, of the why and the how of their play, which we don't normally see.  Again, many of us may have no desire or need to get this deeply into things.  But clearly some of us do want to and some of us do have a need for that.  

By the way, IBM, for one, positions its analytic tools as creating predictive models.  We all seem to love a prediction, as we long to see the future as clearly as we see the present.  Sports media and fans especially love a prediction, before a game or before the season.  Business leaders do have a stake in knowing what customers will buy, for example.  So they turn to predictive models, based on data, which help them create the right products and services and position these effectively in the market.  

In one respect, this is the endgame of science:  To predict what phenomena will occur, based on certain occurrences.  But scientists are very cautious about drawing conclusions on what results can and cannot predict.  They are so cautious that all they may conclude, from a regression analysis, for example, is this:  Factors A, B and C (i.e., independent variables) significantly account for this amount of variance in Factors X, Y and Z (i.e., dependent variables).  This phrasing is definitely more cumbersome, and frankly nerdy, than just saying, wow, we can predict this and that.  
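
That cautious phrasing about variance accounted for corresponds to the R-squared of a regression.  The sketch below is a simplification under stated assumptions:  one independent variable and one dependent variable rather than several, an ordinary least squares line, and made-up numbers.  The names fit_line, r_squared, factor_a and outcome_x are mine, purely for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y):
    """Share of the variance in y that the fitted line accounts for."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum(
        (yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)
    )
    return 1 - ss_residual / ss_total


# Made-up data: one independent variable and one dependent variable.
factor_a = [1, 2, 3, 4, 5]
outcome_x = [2, 4, 5, 4, 5]

share = r_squared(factor_a, outcome_x)
print(f"Factor A accounts for about {share:.0%} of the variance in X")
```

Notice that the output says nothing about what will happen next.  It only says how much of the variation already observed lines up with the factor, which is exactly the caution scientists exercise.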

Actually, the term inferential statistics is better than the term predictive modeling, because it accounts for that caution that we must all exercise when conducting analysis and interpreting results.  Be that as it may, contemporary media and marketing have commandeered the sexier term around prediction, so we go with it.  But at least we're better informed and cautioned now.

What correlation doesn't replace is good human thought.

A friend remarked, not too long ago, something along the lines of:  Technology has weakened our minds.  We turn to Google Maps on the smart phone, a GPS device on the dashboard, and the MapQuest site to help us navigate our way from one place to some new place.  I've used all the above, and still do.  But my friend's point is very well-taken.  We mustn't dismiss our overall sense for place and direction.  Obviously these tools help, but they have inevitable flaws in them.  Because they're created by people, they simply cannot be perfect.

Let me give an example:  This past Spring, the northern Chicago area suffered heavy storms, which resulted in flooding in our county.  My daughter was babysitting for a family, who lived a distance away, and I was driving her there.  The Maps app on her iPhone 5 didn't account for the fact that typical routes were blocked off, because of flooding from the river.  What's more, traffic formed quickly on alternate routes to that family's house.  I didn't know exactly where they lived, but I did have a general sense and therefore drove us in that direction.  To be a bit more precise, the family lived east northeast of where we lived.  But because of flooding and traffic, I had to drive us north, then south, then a little angled west, before we could even head east.  In the end, we relied, too, on a very low-tech method:  We called the family, and asked for directions.

In this film, Rosling points out that fewer heart attacks were correlated with a low-fat, little-wine diet in Japan.  Yet, fewer heart attacks were also correlated with a high-fat, much-wine diet in France.  How do we make sense of these crazy-making statistics?  (It reminded me of how confusing some signs were in the UAE, where I lived:  For example, one sign at a crossroad pointed an arrow left and right for Dubai.)  

The mantra that statisticians have repeated is:  Correlation is not causation.  In the above example, the relationship between heart disease and personal diet is very real, but it may be confounded or mediated by one or more other factors, such as physical activity, hereditary makeup, or even some other, yet undisclosed item in Japanese and French diets.  
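
Here is a toy Python illustration of how a hidden factor can manufacture a correlation.  Everything in it is invented for the purpose:  the fat scores, the risk scores, and the assumption that physical activity is the hidden factor.  Across everyone, fat and risk appear strongly related.  Within each activity group, the relationship disappears.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Invented scores. In this toy world, less active people happen to eat
# more fat AND carry more heart risk, for unrelated reasons.
low_activity_fat = [6, 7, 8, 6, 7, 8]
low_activity_risk = [10, 9, 10, 8, 9, 8]
high_activity_fat = [2, 3, 4, 2, 3, 4]
high_activity_risk = [4, 3, 4, 2, 3, 2]

all_fat = low_activity_fat + high_activity_fat
all_risk = low_activity_risk + high_activity_risk

overall = pearson(all_fat, all_risk)
within_low = pearson(low_activity_fat, low_activity_risk)
within_high = pearson(high_activity_fat, high_activity_risk)

print(overall)      # about 0.89: fat and risk look tightly linked
print(within_low)   # 0.0 once activity is held constant
print(within_high)  # 0.0 once activity is held constant
```

Real data is never this tidy, of course.  But the sketch shows why a strong correlation, taken alone, cannot tell us what is driving what.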

Our friends on social media, and writers in popular publications, may tout findings from this or that scientific study.  I encourage us to take note and read them, but also to think about them and ask, What do these findings mean?  Do they really make sense?  What do other people say?  What have other studies found?  

Rosling's colleagues in the film are more blunt about this point:  Scientists must challenge, scientists must refute, scientists must try to destroy correlational findings.  That is, with alternative perspectives and explanations.  If the correlation still stands, then most likely it truly elucidates a particular relationship in our reality.   

I have often drawn on the multitrait-multimethod approach of inquiry and knowledge.  Basically, it means that we're most likely to arrive at the most accurate, most complete picture of things, if we measure a few things at a time (because they're interconnected, after all) and if we measure them in various ways (because each measure has margins of error).  

So if you had stopped at Japanese diet or at French diet alone, then you would've ended up with a skewed grasp of what it took to reduce heart disease.  But the fact that you took a multimethod tack, instead, and found yourself with a seeming contradiction, is a good thing.  The contradiction prompted you to question more and investigate further, and overall helped you realize the greater complexity of heart disease.

In conclusion, statistics is an awesome platform and reference, but to understand our world we have to resort to more than statistics.

Rosling and his production team do a superb job of elucidating the value and the joy of statistics.  We can understand our world better through numbers and analytics.  But we cannot, and must not try to, dispense with our thinking.  As limited or as flawed as it may be, our thinking is still invaluable and necessary.

On this note, I offer my Tripartite Model.  Briefly for now, I suggest that science, art and religion altogether position us better for understanding not just the world around us, but also the people in that very world.  Each of these houses such bodies of knowledge, history and culture that I would be remiss to simplify them.  So just for the sake of brevity here, let me elaborate the attributes or characteristics that I mean:  (a) science: analytic, quantitative and rational; (b) art: intuitive, qualitative and creative; and (c) religion: spiritual, philosophical and meta-rational.  

Whether we refer to it as statistics, or Big Data and Analytics, the scientific endeavor is a very powerful platform for understanding and, as Rosling so deftly demonstrates, very entertaining, too.  But it simply is not enough.  It is irrevocably just part of a bigger whole.  

Thank you for reading, and let me know what you think!

Ron Villejo, PhD
