A very warm welcome back to everyone here at Data Science Central. I decided to post today because a friend on a social network we share sent me a link that I think will interest this community of good and responsible Data Scientists.
It is a blog post from Quantopian, a crowd-sourced investing platform and one of a new breed of vendors emerging in Quantitative Finance and Fund Management. In a somewhat academic but certainly well-informed and pragmatic way, the post delves into some of the intricacies of Data Science, such as the relationship between models and inference and the problem of fitting a model to data. It was a nice read, especially coming from a field that is data intensive even if not very intensive in sharing knowledge, but that is another story. I hope readers will like it and, well, share it.
Here are some excerpts and images:
Unfortunately, as anyone who has done such a thing can attest, it can be extremely difficult to fit your dream model and requires you to take many short-cuts for mathematical convenience. For example, everyone knows that financial returns are not normally distributed but still, explicitly or implicitly this assumption is still made a lot (e.g. the Sharpe ratio as I show in my talk, but also every time you use a linear regression like when estimating financial alpha and beta). Why? Because it's so convenient to work with! Thus, statistical modeling more often looks like this in reality:
So a lot of times we don't build the models we think best capture our data but rather the models we can make inference on.
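To see why these shortcuts are so tempting, here is a minimal sketch in plain Python of the two quantities the excerpt mentions. It is not code from the Quantopian post: the function names and the return series are made up for illustration. Both estimates reduce to a few closed-form lines, and neither checks whether the returns are actually normal. The Sharpe ratio summarizes returns by mean and standard deviation alone, and the least-squares fit coincides with maximum likelihood only when the errors are normal.

```python
import statistics

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation (not annualized)."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

def alpha_beta(asset, benchmark):
    """Ordinary least squares fit of asset = alpha + beta * benchmark."""
    mean_a = statistics.mean(asset)
    mean_b = statistics.mean(benchmark)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(asset, benchmark))
    var = sum((b - mean_b) ** 2 for b in benchmark)
    beta = cov / var
    alpha = mean_a - beta * mean_b
    return alpha, beta

# Hypothetical daily returns, invented for this example only.
benchmark = [0.010, -0.020, 0.015, 0.003, -0.007]
asset = [0.001 + 1.5 * b for b in benchmark]

alpha, beta = alpha_beta(asset, benchmark)
print("alpha:", alpha, "beta:", beta)
print("Sharpe ratio:", sharpe_ratio(asset))
```

Because the hypothetical asset series is an exact linear function of the benchmark, the fit recovers the alpha and beta used to build it. Real returns, with fat tails and outliers, are where the convenient formulas start to mislead.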
I have blogged before about Probabilistic Programming and besides posting the video of a recent talk I gave with accompanying code (see below), I would like to highlight how Probabilistic Programming gets us much closer to the ideal I visualized above. In short, Probabilistic Programming Systems allow you to specify statistical models in code. Once specified in such a way, fitting this model to data (i.e. inference) is completely automatic (if things go well).
The video linked in the excerpt is a good way to learn more about Probabilistic Programming, which can be of great help in dealing with Data Science problems. Well worth a view.
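The split between specifying a model and fitting it can be sketched in a few lines of plain Python. This is a hand-rolled toy, not a real Probabilistic Programming System and not the code that accompanies the talk: the prior, the fixed scale and degrees of freedom, the return series, and all function names are assumptions made for illustration. The model uses a fat-tailed Student-t likelihood instead of a normal one, and a generic random-walk Metropolis sampler does the fitting without knowing anything about the model.

```python
import math
import random

def log_student_t(x, mu, sigma, nu):
    """Log density of a Student-t with location mu, scale sigma, nu degrees of freedom."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

def log_posterior(mu, returns, sigma=0.01, nu=3.0):
    """Model specification: Normal(0, 0.1) prior on mu, Student-t likelihood."""
    log_prior = -0.5 * (mu / 0.1) ** 2  # up to an additive constant
    return log_prior + sum(log_student_t(r, mu, sigma, nu) for r in returns)

def metropolis(log_post, start, n_samples=5000, step=0.005, seed=0):
    """Generic random-walk Metropolis sampler for a one-dimensional parameter."""
    rng = random.Random(seed)
    current, current_lp = start, log_post(start)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

# Hypothetical daily returns with one large negative outlier.
returns = [0.012, -0.004, 0.007, 0.001, -0.085, 0.009, 0.003, 0.006]

samples = metropolis(lambda mu: log_posterior(mu, returns), start=0.0)
kept = samples[1000:]  # discard the first draws as burn-in
posterior_mean = sum(kept) / len(kept)
print("posterior mean of mu:", posterior_mean)
```

Changing the model means editing only `log_posterior`; the sampler stays untouched. A real system automates much more than this toy does, including more efficient samplers and convergence diagnostics.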