
In my prior blog post, I wrote of a clever elf that could predict the outcome of a mathematically fair process roughly ninety percent of the time. Actually, it is ninety-three percent of the time, and why it is ninety-three percent instead of ninety percent is also important.

The purpose of the prior blog post was to illustrate the weakness of using the minimum variance unbiased estimator (MVUE) in applied finance. Nonetheless, that raises a more general question of when and why it should be used, or when a Bayesian or Likelihood-based method should be applied instead. Fortunately, the prior blog post provides a way of looking at the problem.

Fisher’s Likelihood-based method, Pearson and Neyman’s Frequency-based method, and Laplace’s method of inverse probability really are at odds with one another. Indeed, much of the literature of the mid-twentieth century had a polemical ring to it. Unfortunately, what ended up coming about was a hybridization of the tools, and so it can be challenging to see how the differing properties matter.

In fact, each type of tool was created to solve a different kind of problem. It should be unsurprising that they excel in some places and may even be problematic in others.

In the prior blog post, the clever elf was able to have perfect knowledge of the outcome of a mathematically fair process in eighty percent of the cases and was superior in thirteen of the remaining twenty percent of the cases because the MVUE violates the likelihood principle. Of course, that is by design. It is supposed to violate the likelihood principle. It could not display its virtues if it tried to conform. Nonetheless, it forces a split between Pearson and Neyman on one side and Fisher and Laplace on the other.

In most statistical disagreements, the split is between null hypothesis methods and Bayesian methods. In this case, Fisher’s method will sit on Laplace’s side of the fence rather than Pearson and Neyman’s.
The goal of this post is neither to defend nor to attack the likelihood principle. Others have done that with great technical skill.

This post is to provide a background of the issues separate from a technical presentation of them. While this post could muck around in measure theory, the goal is to extend the example of the cakes so the differences can be made apparent. As it happens, there is a clean and clear break in the information used between the methodologies.

The likelihood principle is divisive in the field of probability and statistics. Some adherents to the principle argue that it rules out Pearson and Neyman’s methodology entirely. Opponents either say that its construction is flawed in some way or simply state that, for most practical problems, no one need care about the difference because orthodox procedures work often enough in practical situations. Yet these positions illustrate why not knowing the core arguments could cause a data scientist or subject matter expert to choose the wrong method.

The likelihood principle follows from two separate tenets that are not individually controversial, or at least not very controversial. There has been work to explore it, such as by Berger and Wolpert. There has also been work to refute it and to refute the refutations. See, for example, Deborah Mayo’s work. So far, no one has generated an argument so convincing that the opposing sides believe that the discussion is even close to being closed. It remains a fertile field for graduate students to produce research and advancements.

The first element of the likelihood principle is the sufficiency principle. No one tends to dispute it. The second is the conditionality principle. It tends to be the source of any contention.
We will only consider the discrete case here, but for a discussion of the continuous case, see Berger and Wolpert’s work listed below.

A little informally, the weak conditionality principle supposes that two possible experiments could take place regarding a parameter. In Birnbaum’s original formulation, he considered two possible experiments that could be chosen, each with a probability of one-half. The conditionality principle states that all of the evidence regarding the parameter comes only from the experiment that was actually performed. The experiment that did not happen plays no role. That last sentence is the source of the controversy.

Imagine that a representative sample is chosen from a population to measure the heights of its members. There will be several experiments performed by two research groups for many different studies over many unrelated topics. The lab has two types of equipment that can be used. The first is a carpenter’s tape that is accurate to 1/8th of an inch (3.175 mm), while the other is a carpenter’s tape that is accurate to 1 millimeter. A coin is tossed to determine which team gets which carpenter’s tape.

The conditionality principle states that the results of the experiment depend only on the accuracy of the instrument used and the members of the sample, and that the information that would have been collected by using the other device or a different sample has to be ignored. To most people, that would be obvious, but that is the controversial part.

Pearson and Neyman’s methods choose the optimal solution before seeing any data. Any randomness that impacts the process must be accounted for, and so the results that could have been obtained but were not are supposed to affect the form of the solution. Pearson and Neyman’s algorithm is optimal, having never seen the data, but may not be optimal after seeing the data. There can exist an element of the sample space that would cause Pearson and Neyman’s method to produce poor results.
The guarantee is for good average results upon repetition over the sample space, not good results in any one experiment.

There are examples of pathological results in the literature where a Frequentist and a Bayesian statistician can draw radically different conclusions from the same data. To understand another place that a violation of the likelihood principle may occur, consider the lowly t-test.

Imagine a more straightforward case where the lab only had one piece of equipment, and it was accurate to 1 millimeter. If a result is significant, then the sample statistic is as extreme or more extreme than what one would expect if the null is true. The test compares the result to the set of all possible samples that could have been taken if the null is true. Of course, more extreme values were not observed for the sample mean. If a more extreme result were found, then that would have been the result and not the one actually observed. What if the result is the most extreme result any person has ever seen? Can someone really argue that the tail probability is full?

The conditionality principle says that if you didn’t see it, then you do not have information about it. You cannot use samples that were not seen to perform inference. That excludes all t-, F-, and z-tests, and most Frequentist tests, because they are conditioned on a model that assumes that certain things are real that have never been observed.

A big difference between Laplace and Fisher on one side and Pearson and Neyman on the other is whether all the evidence that you have about a parameter is in the sample, or whether samples unseen must be included as well.

The non-controversial part of the likelihood principle is the sufficiency principle. The sufficiency principle is a core element of all methods. It states something pretty obvious.

Imagine you are performing a set of experiments to gather evidence about some parameter, and you are going to use a statistic $t(x)$, where $t$ is a statistic sufficient for the parameter.
Then if you conducted two experiments and the statistics were equal, $t(x_1) = t(x_2)$, then the evidence about the parameter in both experiments is equal.

When the two principles are combined, Birnbaum asserts that the likelihood principle follows from them. The math lands on the following proposition. If you are performing an experiment, then the evidence about a parameter should depend only on the experiment actually conducted and the data observed through the likelihood function.

In other words, Fisher’s likelihood function and Laplace’s likelihood function are the only functions that contain all the information from the experiment. If you do not use the method of maximum likelihood or the method of inverse probability, then you are not using all of the information. You are surrendering information if you choose something else.

Before we look at the ghostly cakes again, there are two reasons to rule out Fisher’s method of maximum likelihood. The first is that, as mentioned in the prior blog post, there is not a unique maximum likelihood estimator in this case. The second, however, is that Fisher’s method isn’t designed so that you could make a decision from knowing its value. It is designed for epistemological inference. The p-value does not imply that there is an action that would follow from knowing it. It is designed to provide knowledge, not to prescribe actions or behaviors.

If you use Fisher’s method as he intended, then the p-value is the weight of the evidence against your idea. It doesn’t have an alternate hypothesis or an automatic cut-off. Either you decide that the weight is enough that you should further investigate the phenomenon, or it isn’t enough, and you go on with life investigating other things.

In the prior blog post, the engineer was attempting to find the center of the cake using Cartesian coordinates. The purpose was to take an action: cutting a cake through a particular point.
She had a blade, anchored at the origin, that was long enough regardless of where the cake sat. In practice, her only real concern was the angle, not the distance. Even though two Cartesian dimensions were measured, only one is used in polar coordinates, the angle.

The clever elf, however, was using a Bayesian method, and the likelihood was based on the distance between points as well as the angles. As such, it had to use both dimensions to get a result. The reason the MVUE was less precise is that it violates the likelihood principle and throws away information.

It is here we can take another look at our ghostly cakes by leaving the Cartesian coordinate system and moving over to polar coordinates so we can see the source of the information loss directly. This difference can be seen in the sampling distribution of the two tools by angle.

Why bother with the MVUE at all? After all, when it doesn’t match Fisher’s method of maximum likelihood, then it must be a method of mediocre likelihood. What does the MVUE have to offer that the Bayesian posterior density cannot?

The MVUE is an estimator that comes with an insurance policy. It also permits answers to questions that the Bayesian posterior cannot answer.

Neither Laplace’s Bayesian nor Fisher’s Likelihood methods directly concern themselves with either precision or accuracy. Both tend to be biased methods, but that is in part because neither method cares about bias. An unbiased estimator solves a type of problem that a biased estimator cannot solve.

Imagine an infinite number of parallel universes where each is slightly different. A method is either unbiased and accurate or biased and inaccurate.
For someone trying to determine which world they live in, the use of a Bayesian method implies they will always tend to believe they live in one of the nearby universes, but never find which one is their own, except by chance.

Using Pearson and Neyman’s method also allows a guarantee against the frequency of false positives and a way to control for false negatives. Such assurance can be valuable, particularly in manufacturing quality control. That assurance extends to confidence, tolerance, and predictive intervals. Such a guarantee of coverage also holds value in academia.

Finally, under mild conditions such as correct instrumentation and methodology, Pearson and Neyman’s method allows for a solution to inferences in two ways that are unavailable to the Bayesian approach.

First, Frequency methods allow for a complete form of reasoning that is not generally available to Bayesian methods. Bayesian methods lack a null hypothesis and are not restricted to two hypotheses. There should be one hypothesis for every possible combination of ways the world could exist. Unfortunately, it is possible that the set of Bayesian hypotheses cannot contain the real model of the world.

Before Einstein and relativity, there couldn’t have been a hypothesis that included curvatures in space-time, and so a Bayesian test would have found the closest fit to reality but also would have been wrong. Without knowing about relativity, a null hypothesis could test whether Newton’s laws were valid for the orbit of Mercury and discover that they were not. That does not mean a good solution exists from current knowledge, but it would show that there is something wrong with Newton’s laws.

Additionally, Bayesian methods have no good way to test a sharp null hypothesis. A null hypothesis is sharp if it is of the form $H_0: \theta = \theta_0$. Although there is a Bayesian solution in the discrete case, in the continuous case, there cannot be one because it would have zero measure.
If it is assumed that $\theta = \theta_0$, then the world should work in a specific manner. If that is the real question, then a Bayesian method cannot provide a really good solution. There are Bayesian approximations but no exact answer.

In applied finance, only Bayesian methods make sense for data scientists and subject matter experts, but in academic finance, there is a strong argument for Frequentist methods. The insurance policy is paid for in information loss, but it provides benefits unavailable to other methods.

If the engineer in the prior blog post had been using polar coordinates rather than Cartesian coordinates, there would not be a need to measure the distance from the origin to find the MVUE because the blade was built to be long enough. The Bayesian method would have required the measurement of the distance from the origin.

At a certain level, it seems strange that adding any variable to the angles observed could improve the information about the angle alone, yet it does. The difference between the MVUE and the posterior mean is obvious. The likelihood function requires knowledge of the distances. Even though the distance estimator is not used in the cutting of the cake, and even though there are errors in the estimation of the distance to the center from each point, the increase in information substantially outweighs the added source of error. Overall, the noise gets reduced from the added information.

Finally, the informational differences should act as a warning to model builders. Information that can be discarded using a Frequency-based model may be critical information in a Bayesian model. Models like the CAPM come apart in a Bayesian construction, implying that a different approach is required by the subject matter expert. The data scientist performing the implementation will have differences too.

Models of finance tend to be generic and have a cookbook feel.
That is because the decision tree implicit in financial modeling is built around finding the right tool for the right mix of general issues. Concerns for things such as heteroskedasticity, autocorrelation, or other symptoms or pathologies all but vanish in a Bayesian methodology. Instead, the headaches often revolve around combinatorics, finding local extreme values, and the data generation process. To take advantage of all available information, the model has to contain it. In the ghostly cakes, more than the angle is necessary. The model needs to be proposed and measured.

Berger, J. O., & Wolpert, R. L. (1984). The likelihood principle. Hayward, Calif: Institute of Mathematical Statistics.

Mayo, D. G. (2014). On the Birnbaum argument for the strong likelihood principle. Statist. Sci. 29, no. 2, 227-239.

How should an estimator be chosen? The academic training of economists and finance professionals has traditionally favored the minimum variance unbiased estimator (MVUE). Sometimes, the maximum likelihood estimator (MLE) is chosen. From time to time, the method of moments (MOM) or the generalized method of moments (GMM) is used. Because of their subjective nature, Bayesian methods are rarely used. The problem with this hierarchy is that it is preference-based and ignores the axiomatic structures that underlie the possible choices.

The argument to be made here is that the choice of an estimator can have unexpected consequences. To illustrate this, and before going back to the underlying principles, we are going to look at a simple information-based lottery situation where the choice of estimator creates a surprising outcome.

Our story begins with an engineer and the owner of a bakery. The owner has a problem. His best baker is a ghost, and the cakes he makes are invisible. The important thing about the cakes is not that they are invisible, but that they become visible once cut. The cakes have an unusual magical property. If the cakes are cut into two pieces, the large piece turns purple while the small piece turns green. But, if the cakes are cut precisely in half, then both halves glow like gold and float in the air.

The gold cakes sell for much more than the ordinary green and purple ones, so much so that the owner hired an engineer to maximize the number of gold cakes. Specifically, the contract is to build a device or devices, to be attached to an existing table, that will maximize the number of golden cakes.

All of the cakes are six inches in diameter and remain invisible until cut. After baking the cakes, the ghost places them on the large square table. Because ghosts are not subject to limitations like gravity or barriers such as walls, doors, or tables, the ghost places the cake on the table randomly.
Not only is the cake placed at a random location, but the locations are uniformly distributed over the surface of the table. Upon investigation, the engineer determines that the cake releases an invisible light from the decay of the ectoplasm used to ice the cake. The ectoplasm detector determines that the release of ghostly particles is also uniformly distributed over the surface of the cake. The engineer realizes that if the center of the cake can be found and if a blade could be placed on one corner of the table, treating it as the origin, then a blade cut through the center of the cake would split the cake into two equal pieces. The engineer was excited. A long blade was easy to forge, even though the blade had to have magical glyphs inscribed in it. Furthermore, attaching the blade at a corner was a straightforward engineering task. Detecting the decaying particles of ectoplasm was simple because detectors already existed, but finding the center of the cake did not have an obvious solution. So the engineer picked up a statistics textbook. She reads up on statistical estimators and decides to use the MVUE, which in this case, is the sample mean. She excludes the MLE because there cannot be a unique estimator. The MVUE has two very desirable properties. The first is that there is no way to construct a more accurate estimator than to construct an unbiased estimator for finite samples. Second, by having the smallest variance for the sampling distribution of the estimator, there is no less risky way to construct an unbiased estimator, given no prior information as to where the cake is located. The engineer calculates how many points can be collected until the cake is cool enough to cut. It turns out to be forty points. The engineer is more than satisfied with the sample size. Being proud of the design, the engineer tells many friends about it, including a clever elf. 
The clever elf goes to the bakery owner and recommends cutting the cake in a large open area so the entire town can watch the process for each cake. The elf was right; the town was fascinated. The elf also suggested building a display to show the location of every data point. Like a countdown for a rocket launch, watching a value appear in the data display increased anticipation. That, in turn, increased sales of coffee, cookies, and muffins. Once enough elves had arrived, the elf started taking gambles with other elves about whether the left side or the right side would be green. The bakery turned into a veritable casino filled with all the elves in town. That lasted until the clever elf started betting. At first, no one paid any attention to the elf. However, word started getting around that the clever elf was winning a lot of money. A few of the older and wiser elves started paying attention. They noticed that about one-fifth of the time, the clever elf would not gamble at all. In fact, the clever elf often said that he was sick at those times and apologized because he felt he couldn’t concentrate. The older, wiser elves brought in a wizard to detect if magic was being used. It was not. They did notice that when a new data point appeared on the display, the clever elf would input it into his cell phone. They also noticed that during the other four-fifths of the time, the clever elf would bet as much as possible, even borrowing money from others. They also noticed that the clever elf never lost. The clever elf was winning one hundred percent of the time, or not gambling at all. Word got around. No one would take his bets and gambling halted when he was around because everyone would mimic his bets. No one would take the other side of the bet. The clever elf realized that he had outsmarted himself, so he went to the orcs. He explained to the orcs about how much fun the elves were having gambling and so the orcs went to the bakery. The clever elf changed tactics. 
The elf realized that he would have to gamble all the time. He just chose to make small bets twenty percent of the time and bet everything eighty percent of the time. The behavior was so obvious that the orcs quickly realized something was going on, and they accused the clever elf of cheating. The clever elf learned a hard lesson about gambling with orcs. Finally, the clever elf went to the humans. He decided that there were so many humans that if he gambled the same amount on every bet and spread it over enough people, no one would notice. The clever elf won around ninety percent of the time. Eventually, the clever elf built up enough money that he went to the king and asked to be admitted to the nobility. The king agreed, for a very substantial fee, of course, and the disclosure of how the elf won so much money. The elf explained that whenever the MVUE was outside the Bayesian posterior, then it had to be in a location where it was physically impossible for that to be the center of location for the cake. The clever elf started to explain Bayes' law, but the King stopped him. He said, “show me in pictures about this Bayes’ law.” The elf explained that Bayesian analysis is built on the likelihood function and prior knowledge. Of course, no one had prior knowledge of where the baker would place the cakes, so each point was equally likely. The prior probability of the cake’s center being at any point on the table was one divided by the area of the table if the center could be on an edge and a reduced area if the entire cake had to be on the table somewhere. Nonetheless, it is the likelihood that matters here. Because the cake is of known diameter, it is known that the center of the cake must always be within three inches of any observed point. The likelihood function reverses this property. Each point three inches around the first observation is equally likely to be the true center of the location of the cake. 
The likelihood is for every point inside or on the perimeter of the circle that it is the correct point. This new circle is called the posterior density of the first observation. This posterior density is also the prior density of the second observation. This process of changing beliefs about where the center of the cake is located is called “posterior updating.” Once a second observation happens, any point inside the intersection of the two circles is now the posterior distribution of beliefs about possible locations. In this case, it creates a posterior that looks like a lens.

The only points that could be the center of the cake sit inside this lens. This lens-like shape becomes the posterior distribution of the second observation and the prior distribution for the third observation. It is with the third observation that the Frequentist sample mean and the Bayesian posterior mean no longer agree. The three observed points are (4.5,8), (9.5,8), and (6.75,8).

The reason that there is no difference in the posterior from the second to the third observation is that the valid choices for the center of the cake based only on the third observation make no changes to the intersection. That brings up an essential element of the multiplicative nature of Bayesian analysis versus the averaging nature of Frequentist analysis. Only new information about a parameter gets into the posterior density or mass function. If, as here, the prior already has the same or more information content than the likelihood, then the posterior will equal the prior. For a data point to impact the posterior calculations, that data point has to have information not already known about a parameter. It is true that, had the third point been observed before the second, all three points would have affected the calculation of the second and third posteriors. Even so, the joint posterior depends only on the first two observations, not on the information in the final one.
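The three observations above are enough to check the two estimators by hand. The following is a minimal sketch, assuming a flat prior and a three-inch radius as in the story; the function names `could_be_center` and `posterior_mean` and the grid step are illustrative choices, not anything from the original analysis. It approximates the posterior mean as the centroid of every grid point that lies within three inches of all observed points.

```python
RADIUS = 3.0  # the cakes are six inches in diameter

# The three observations given in the text.
points = [(4.5, 8.0), (9.5, 8.0), (6.75, 8.0)]


def could_be_center(c, pts, radius=RADIUS):
    """True if c is within `radius` of every observation."""
    return all((c[0] - x) ** 2 + (c[1] - y) ** 2 <= radius ** 2 for x, y in pts)


def posterior_mean(pts, radius=RADIUS, step=0.01):
    """Grid approximation to the centroid of the set of possible centers.

    Assumes a flat prior, so the posterior is uniform over the
    intersection of the circles drawn around each observation.
    """
    # Any possible center lies inside this box.
    x_lo = max(x for x, _ in pts) - radius
    x_hi = min(x for x, _ in pts) + radius
    y_lo = max(y for _, y in pts) - radius
    y_hi = min(y for _, y in pts) + radius
    nx = int(round((x_hi - x_lo) / step))
    ny = int(round((y_hi - y_lo) / step))
    sum_x = sum_y = 0.0
    count = 0
    for i in range(nx + 1):
        cx = x_lo + i * step
        for j in range(ny + 1):
            cy = y_lo + j * step
            if could_be_center((cx, cy), pts, radius):
                sum_x += cx
                sum_y += cy
                count += 1
    return (sum_x / count, sum_y / count)


# Frequentist estimate: the plain average of the observations.
sample_mean = (
    sum(x for x, _ in points) / len(points),
    sum(y for _, y in points) / len(points),
)
post_mean = posterior_mean(points)

print("sample mean:   ", sample_mean)
print("posterior mean:", post_mean)
print("sample mean is a possible center:", could_be_center(sample_mean, points))
```

Dropping the third point from `points` should leave the posterior mean essentially unchanged, since the third circle already contains the whole lens, while the sample mean moves.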
The Frequentist calculation, on the other hand, is based on averaging over the sample space. It has several valuable properties. First, it is the MVUE. Second, it has good properties over the entire sample space but may not have good properties for a specific sample. Third, if it is used in a decision as though it were the correct point, it minimizes the maximum possible loss. The MVUE is based on the impact of the average amount of information given a true model and does not depend on knowing the real value of the parameter.

The difference causes the Bayesian posterior mean to be (7,8), and the Frequentist sample mean to be (6.91667,8). In this case, both are inside the posterior.

The clever elf shows the King the plot of the posterior for the second cake that was baked, assuming it was centered on (7,8). It is clear that the sample mean (green) is outside the set of possible values (blue, with red for the posterior mean).

The engineer splutters at hearing this. The elf, now called Lord Lolthlorian, asks the engineer whether they remember seeing, in their first statistics course, a confidence interval such as [-2,12] for a sample that could only contain positive numbers.

The engineer replies, “Well, of course, but the only thing that matters about a ninety-five percent confidence interval is that the interval covers the parameter ninety-five percent of the time or higher. It does not matter that the left bound is impossible; it only matters that the interval covers the parameter often enough.”

“Yes,” says Lord Lolthlorian, “and for the MVUE, what matters is that it minimizes quadratic loss over the sample space, while, for the posterior mean, what matters is that it minimizes quadratic loss over the supported parameter space.” It can happen that the posterior mean is not itself a possible value, such as when there is a hole in the parameter space and the average of the posterior falls in that hole. In that case, however, a quadratic loss would be inappropriate for either method.
In that case, the Bayesian posterior mean would not be a closed operation over the possible set.

The engineer, recovering her balance, points out that the goal was not to build a casino but to cut cakes well. Lord Lolthlorian agreed. He said, “We have seen a thousand cakes cut; we should look at each cut and see how it worked out. We can ignore the specific cuts and look at the performance of the estimators. First, we should check to verify the data appears as it should.”

They make a chart of all 40,000 points to see if there are any unusual holes or patterns. They see none. They also find the marginal densities along the x- and y-axis using kernel density estimation, expecting to see a parabola, and that is approximately what they do see. The marginal density along the y-axis is omitted here.

Next, they find the kernel density estimates of the two types of estimators, placing them on the same scale for comparison. They do so along each axis, although the sampling distributions along the y-axis are omitted here for space.

Finally, they construct Tukey’s five-number summary with the mean for both dimensions, looking at both the Frequentist sample mean and the Bayesian estimates of the center.

            x_bar    y_bar    Posterior_c_x    Posterior_c_y
Min.        6.126    7.192    6.565            7.654
1st Qu.     6.836    7.851    6.945            7.953
Median      7.001    8.005    7.001            8.000
Mean        6.999    8.010    6.999            8.002
3rd Qu.     7.157    8.170    7.052            8.051
Max.        7.671    8.763    7.540            8.452

This disturbs the engineer. Lord Lolthlorian points out that the Bayesian estimator, subject to the choice of a prior density, automatically makes tradeoffs between accuracy and precision. The Bayesian estimator is intrinsically biased but exchanges that bias for increased precision. The trade-off happens in a manner such that a Bayesian estimator cannot be first-order stochastically dominated by another estimator, subject to the prior.
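The thousand-cake comparison can be imitated with a small simulation. What follows is only a rough sketch under stated assumptions: every cake is centered at (7,8), each cake yields forty detections uniform over its disc, the prior is flat, and the posterior mean is approximated on a coarse grid. The function names, the seed, the grid size, and the default of 200 cakes (chosen to keep the run short) are all hypothetical choices and will not reproduce the table's figures exactly.

```python
import math
import random

RADIUS = 3.0         # six-inch cakes
CENTER = (7.0, 8.0)  # assumed true center for every simulated cake


def draw_points(rng, n, center=CENTER, radius=RADIUS):
    """Draw n detections uniformly over the disc of the cake."""
    pts = []
    for _ in range(n):
        r = radius * math.sqrt(rng.random())  # sqrt gives uniform area density
        a = 2.0 * math.pi * rng.random()
        pts.append((center[0] + r * math.cos(a), center[1] + r * math.sin(a)))
    return pts


def feasible(c, pts, radius=RADIUS):
    """True if c is within `radius` of every detection."""
    r2 = radius * radius
    return all((c[0] - x) ** 2 + (c[1] - y) ** 2 <= r2 for x, y in pts)


def posterior_mean(pts, radius=RADIUS, grid=24):
    """Grid approximation to the centroid of the feasible set (flat prior)."""
    x_lo = max(x for x, _ in pts) - radius
    x_hi = min(x for x, _ in pts) + radius
    y_lo = max(y for _, y in pts) - radius
    y_hi = min(y for _, y in pts) + radius
    while True:
        sum_x = sum_y = 0.0
        count = 0
        for i in range(grid):
            cx = x_lo + (i + 0.5) * (x_hi - x_lo) / grid
            for j in range(grid):
                cy = y_lo + (j + 0.5) * (y_hi - y_lo) / grid
                if feasible((cx, cy), pts, radius):
                    sum_x += cx
                    sum_y += cy
                    count += 1
        if count:
            return (sum_x / count, sum_y / count)
        grid *= 2  # refine if the feasible region slipped between grid points


def simulate(cakes=200, n=40, seed=0):
    """Return (RMSE of sample mean, RMSE of posterior mean, share of cakes
    where the sample mean is not a possible center)."""
    rng = random.Random(seed)
    se_mvue = se_bayes = 0.0
    outside = 0
    for _ in range(cakes):
        pts = draw_points(rng, n)
        xbar = (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)
        post = posterior_mean(pts)
        se_mvue += (xbar[0] - CENTER[0]) ** 2 + (xbar[1] - CENTER[1]) ** 2
        se_bayes += (post[0] - CENTER[0]) ** 2 + (post[1] - CENTER[1]) ** 2
        outside += not feasible(xbar, pts)
    return math.sqrt(se_mvue / cakes), math.sqrt(se_bayes / cakes), outside / cakes


rmse_mvue, rmse_bayes, frac_outside = simulate()
print("RMSE of sample mean:   ", rmse_mvue)
print("RMSE of posterior mean:", rmse_bayes)
print("share of cakes with sample mean outside the posterior:", frac_outside)
```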
This disturbs the engineer, and they ask, “why isn’t this taught in statistics courses for practitioners?” Lord Lolthlorian responds that there is only so much time in a program to teach statistical methods, and these methods require calculus and numerical integration skills. Also, although it is not as useful as a tool for gambling, lotteries, finance, or cake cutting, the MVUE would be superior in certain inferential problems. If there were a null to falsify, particularly a “sharp null,” the Frequentist method would shine. Also, if the maximum loss were terrible, then it may be superior. There is a difference in the utility of a tool if one of the options is that everyone in some circumstance would die if a bad sample were obtained instead of some people losing money from a gamble or some cakes being a little lopsided. Lord Lolthlorian said, as long as we are discussing it, we should look at the properties of the Bayesian and Frequentist estimator when the MVUE is inside the posterior and when it is outside the posterior to gain some information about how either of them is performing as estimators. When the MVUE is inside the posterior, their sampling distribution is very similar, although the Bayesian estimator has a bit more mass in the center. Only the density along the x-axis is shown for brevity. It is when the MVUE is outside the posterior that the differences become substantial. A value near the perimeter of the cake, or a run of values on one side of the axis, will tend to pull the sample mean. On the other hand, a value near the perimeter tends to cut down the size of the posterior, making it more precise. Likewise, a run of many values located near together has almost no impact on the posterior density because the points have approximately the same information in them. 
The impact of this can be seen on the sampling distributions of both types of estimators based on whether the MVUE has been pulled outside the posterior or not.

The precision of the MVUE deteriorates when it is pulled outside the posterior, either by values near the perimeter or by runs, although it is equally accurate. The Bayesian estimator, however, has its precision improved by the added information from runs or the presence of edge values. Because the MVUE was outside the posterior in seven hundred and ninety-one of the one thousand cases, the effect is more pronounced than it would be if being outside the posterior were rare.

Looking at the sampling distribution of the estimates of the center, where all 1000 cakes were placed at (7,8), shows a pronounced difference in precision between the two. The MVUE generates a wide, mildly sloping hill, while the Bayesian posterior mean generates a narrow, steep mountain.

The engineer and the owner of the bakery confer. Lord Lolthlorian asks if they are going to switch to Bayesian methods. The engineer replies, “no.” Lord Lolthlorian remained surprised until the engineer came out thirty minutes later with a six-by-six-inch table with a metal frame to slowly move the cake from the edge if it is placed partly over an edge and a blade that cuts diagonally across one of the four corners. The engineer says, “if you cannot control for the natural variation in nature, you change the nature of the problem so that you do not have to deal with this.”

To maximize revenue, the cutting of the golden cake was performed out of sight, and the cake was given as a prize for coffee drinkers that bought lottery tickets while watching the old cake splitter still produce cakes, but with hidden data so no one could replicate the clever elf’s solution.

The clever elf moved to the United States to work in finance. He was offered a job in statistical arbitrage; he declined the position and instead set up a real arbitrage fund.
The elf noted that there were several sources of arbitrage present in the finance market and that some data scientists were unknowingly violating a principle of probability and a principle of macroeconomics. Together, the use of these tools was creating arbitrage opportunities against market makers and hedge funds. The elf could not help but notice the irony that tools designed to capture statistical arbitrage opportunities were accidentally creating arbitrage opportunities. A friend that introduced the elf to finance asked: “well, couldn’t we just use the same formulas but with a Bayesian estimator?” Lord Lolthlorian responded, “no.” The models in use are built on Frequentist axioms, and when one attempts to derive them in a Bayesian framework, the results are not the same. Models that look like do not follow as solutions under Bayesian probability interpretations. Underlying these models is an assumption that the parameters are known. Parameters are random variables in Bayesian thinking. A data scientist cannot just pick up one model and plop it into a Bayesian space; the data scientist has to start over. His friend complains, “but these are valid Frequentist models, and there is no edge to the cake in a parameter space that is a half-bounded subset of the real numbers.” The elf replied, “you are thinking about this as a regression and in terms of parameters, but there is an edge. That edge has to be the edge for any model of finance. It requires that the price of an asset, in equilibrium, must equal its discounted cash flows, which must also equal the replacement cost of the firm.” The friend complained, “but what are the two principles being violated?” The elf explained that the probability issue is called coherence. A statistic is only coherent if fair gambles could be placed on it. Frequentist statistics are not coherent because that is not their goal. Their goal is protective. It is to control for certain types of errors. 
That makes the frequencies subject to a loss function rather than being actual probabilities. Because the sampling distribution of the mean and the median are different, they generate different confidence intervals, different predictive intervals, and different point estimates. They imply different frequencies. The distributions are first conditioned on a loss function. Bayesian predictions minimize the K-L divergence between the model and nature, so they are, intrinsically, the closest model to nature that is possible given the information. That happens because the K-L Divergence can be derived directly from Bayes theorem. It is a direct transformation of the Bayesian posterior predictive density and nature’s density. The macroeconomics principle is a bit more subtle. An example of this can be seen in the savings paradox. If everyone starts saving, then no one consumes, and the return on savings collapses, and investment generates total losses. If everyone consumes, then no seeds are saved for the next season, and everyone dies. If all data scientists use the same general methodology, without checking the rationality of those models against ground truth and instead do backtesting and cross-validation, then they create Keynes style rigidities that would not otherwise exist in nature because they have unintentionally adopted highly similar trading rules. When Long Term Capital Management (LTCM) collapsed, it was a surprise because the markets had been functioning as predicted up to the collapse. What was missing was that LTCM was, unintentionally, controlling the trading rules for the entire system. As such, everything had to go the way they priced everything. The rigidities in the system were their own. When cash flows for the underlying assets diverged from their models, the system collapsed. “If most data scientists are using the same models and the models do not match the physical reality, then they unintentionally create long-run arbitrage,” explained the elf. 
“Want to go out for some cake?” asked the friend. “I know a great bakery,” said the elf.

David Harris can be found on LinkedIn.

The code to produce the data is:

```r
rm(list = ls())  # clear variables

# grab libraries
library(ggplot2)
library(export)

set.seed(101)  # create repeatable set of random variables

# must be 4/pi times greater than target or more
Number_of_Samples <- 1400
Sample_size <- 40

# center of circle
c_x <- 7
c_y <- 8

# draw points in the square and keep those inside the unit circle
x <- matrix(runif(Number_of_Samples * Sample_size, -1, 1), nrow = 1)
y <- matrix(runif(Number_of_Samples * Sample_size, -1, 1), nrow = 1)
Boolean <- matrix(rep(0, Number_of_Samples * Sample_size), nrow = 1)
Boolean <- ifelse(x**2 + y**2 <= 1, 1, 0)
x <- x[Boolean == 1]
y <- y[Boolean == 1]

# target number
Number_of_Samples <- 1000
rm(Boolean)
x <- x[1:(Number_of_Samples * Sample_size)] * 3 + c_x
y <- y[1:(Number_of_Samples * Sample_size)] * 3 + c_y
x <- matrix(x, nrow = Sample_size, ncol = Number_of_Samples)
y <- matrix(y, nrow = Sample_size, ncol = Number_of_Samples)

# initialize and define estimators for x and y axis, top=frequentist, bottom=bayesian
# constructed as a matrix for clean use of apply function
estimators <- matrix(rep(0, 5 * Number_of_Samples), ncol = Number_of_Samples, nrow = 5)
row.names(estimators) <- c("x_bar", "y_bar", "Posterior_c_x", "Posterior_c_y", "Is_Freq_Est_Possible")

bayesian_posterior_means <- function(variables, s = Sample_size) {
  # this splits data back into x and y
  # variables passed as single matrix to permit using apply family
  x <- variables[1:s]
  y <- variables[(s + 1):(2 * s)]
  # it is cleaner to recreate mean x and y than to pass them
  x_bar <- mean(x)
  y_bar <- mean(y)
  # creates acceptance-rejection variables
  AR_TRIES <- 10000
  # creates random draws for acceptance-rejection
  r_c_x <- runif(AR_TRIES, min = max(x) - 3, max = min(x) + 3)
  r_c_y <- runif(AR_TRIES, min = max(y) - 3, max = min(y) + 3)
  bayes_feasible <- rep(0, AR_TRIES)
  # creates a reporting variable as to whether the frequentist sample mean
  # is more than three units from at least one observation
  freq_feasible <- 0
  # tests each element of random possible solutions for feasibility
  for (i in 1:AR_TRIES) {
    if (max((x - r_c_x[i])**2 + (y - r_c_y[i])**2) <= 9) bayes_feasible[i] <- 1
  }
  if (max((x - x_bar)**2 + (y - y_bar)**2) <= 9) freq_feasible <- 1
  # posterior means for a uniform, bounded distribution are the marginal means
  bayes_x <- mean(r_c_x[bayes_feasible == 1])
  bayes_y <- mean(r_c_y[bayes_feasible == 1])
  return(c(x_bar, y_bar, bayes_x, bayes_y, freq_feasible))
}

# applies bayesian posterior mean construction over the data set
estimators[1:5, ] <- apply(matrix(rbind(x, y), ncol = Number_of_Samples), 2, bayesian_posterior_means)

# constructs mean squared error of each type of estimator
Frequency_Mean_Squared_Error <- sum((estimators[1, ] - c_x)**2 + (estimators[2, ] - c_y)**2) / Number_of_Samples
Bayesian_Mean_Squared_Error <- sum((estimators[3, ] - c_x)**2 + (estimators[4, ] - c_y)**2) / Number_of_Samples
Frequency_to_Bayesian_Relative_Efficiency <- Frequency_Mean_Squared_Error / Bayesian_Mean_Squared_Error
Percentage_of_MVUEs_That_Are_Impossible <- (Number_of_Samples - sum(estimators[5, ])) / Number_of_Samples
```

]]>

I want to apologize to my small audience for being so long in waiting to post a blog post. My goal had been one per week, but life intervened in the meantime. While I have tried to be productive, the blog post fell by the wayside. I did write a couple of short stories, I must confess. I wrote How to Engage in Counterespionage Operations Against Ghosts because I grew up watching Sherlock Holmes movies and the Twilight Zone. I wrote a small set of short stories about the first day of the second American civil war. Over the last decade, people have carelessly bandied about war as a solution to America’s structural problems. I wanted them to really get what that would mean. I also wrote a short story called The Exfiltration of a President. After all, if one of the most watched and guarded persons on Earth had to get to a jurisdiction without extradition, how would that be accomplished? Today’s post is about none of those, although national stability should surely be included in the risk premiums of equity securities, shouldn’t it? I know I promised a sex edition, but it will have to wait until the next blog post; I am going to provide something quite a bit more valuable. I have a paper in peer review for publication and a second to follow it upon acceptance. I also have two others waiting in the wings, but the implication of this is that the public will see everything a year or so from now. The post here is an attempt to thread the needle between respecting copyright and the peer review process by not duplicating content, while still permitting people to build on the content while mine is standing in line. It also solves a validation problem for data science. Most data scientists are not aware of the controversies in finance or economics. There is no reason for them to be. They need the tools because data scientists are builders; they do not need to be mired down in the Allais paradox, the equity premium paradox, or the many other issues economists are paid to deal with. 
I say this because of the types of questions data scientists and others in the area of finance ask on Stack Exchange. It is obvious what is in the pedagogy of data scientists and what is not.

In 1963, Benoit Mandelbrot published a paper called On the Variation of Certain Speculative Prices. The synopsis of the article would be, “if that is your theory, then this cannot be your data, and this is your data.” By 1973, Fama and MacBeth should have put mean-variance finance to bed as a construction of the world with a falsification sufficient to close out the field. The difficulty is that there was nothing to replace it with. Economics has been in the place that physics was following the Michelson and Morley experiments. It knew classical physics had a severe problem somewhere, but it would have to wait for a generation until quantum mechanics and relativity came around for it to work again.

The mistake, however, in using models such as the Capital Asset Pricing Model, Black-Scholes or Fama-French is that we know they do not work. The error in thinking is to conclude that this implies nothing works and that we don't know anything at all about what does work. The structure of the rest of this post is to show why mean-variance finance cannot work, followed by showing a tool that works consistently, even though there isn’t a lot of theory at this point as to why.

Why worry about a controversy in economics if someone is paying you to build something that does not work? Because if you can create something that can work, you have a customer for life. Although I have an idea of how I would move this process forward, as I have performed and have taught securities analysis for over two decades, given the size of the professional population, many people will have better ideas than mine that may never have been considered if the broader group of professionals were not focused on what works empirically.

The goal of this post is to get people thinking, talking and building. 
The proof regarding the excludability of mean-variance methods, and the argument for their general exclusion, was first put forth in protoform by Poisson and Augustin Cauchy in articles in the first half of the nineteenth century. A similar argument was made by R.A. Fisher in the 1930s as an example of how the statistical methods of Pearson and Neyman could go wrong. Because of how much work is required to include mathematical symbols in a blog post, I will reference the work of others at times rather than re-derive their well-known work.

I am conscious that formalisms are weird in blog posts, but they permit serious rebuttal and a sober analysis by data scientists working in finance. This also isn’t being submitted as a paper for two reasons. First, the proof has been well known in statistics since the 1940s. Second, I have a similar paper already out there. For the lemmas and the theorem, there will be no dividends or mergers, and no firm can go bankrupt. That is in line with the assumptions of the standard models, and the other items are not necessary to exclude mean-variance finance. Dividends, bankruptcy, liquidity costs, and merger risks will be essential for the portion discussing how to move forward.

Proofs

Assumptions and Definitions

Assumptions

1. There are very many potential buyers and sellers.
2. The market is in equilibrium.
3. The securities are equity securities. This excludes various forms of bonds and other assets, such as antiques, which have different distributions.
4. The securities are exchanged in a double auction.
5. Buyers purchase and sellers sell $q$ securities at price $p$ where [equation missing].
6. All securities are purchased at time $t$ and sold at time $t+1$.
7. It is known with certainty that none of these firms will go bankrupt or merge out of existence.
8. The parameters are estimated from information.
9. Errors at time $t$ and $t+1$ are independent.

Definitions

1. The reward for investing resources at time $t$ is defined as $R_t = \frac{p_{t+1} q_{t+1}}{p_t q_t}$.
2. The return is defined as $r_t = R_t - 1$.
3. A statistic is any function of the data.
4. Equilibrium price is defined as [equation missing].
5. Equilibrium reward is defined as [equation missing]. The reward for investing is also defined as the sum of the equilibrium reward and a random variable.

Lemma 1. The distribution of the reward for investing, or the return, is approximately the distribution of the errors for securities near their equilibrium prices.

By Wold's Decomposition theorem and given the assumption of an equilibrium price, prices can be written as [equation missing]. Since assumption 7 requires that $q_{t+1} = q_t$, it follows that definition 1 can be reduced to the ratio of prices, $R_t = p_{t+1} / p_t$. Definitions 1 and 5 provide two definitions of the reward; equating them gives, for small errors around the equilibrium set of prices, [equation missing].

Lemma 2. If the price errors from the first lemma are normally distributed around zero, then as prices go to the equilibrium the distribution of the errors to reward is the same as the distribution of the ratio of errors about the prices.

If the price errors are normally distributed around zero, then after several normalizations, it is known from Marsaglia that the distribution of $(a + x)/(b + y)$, where $a$ and $b$ are constants and $x$ and $y$ are normal random variables, is proportionate to the standard Cauchy multiplied by a function which will go to one as the price goes to equilibrium, leaving only the Cauchy distribution portion. 
Note that this does not hold as prices go far from the equilibrium as would be the case in a bubble or market collapse. Nonetheless if the errors are normally distributed. Theorem Given the assumptions and the first two lemmas, the distribution of returns of the reward function is the truncated Cauchy distribution.Assumption 3 has an important consequence conditional on definition 1. There is no requirement in definition 1 that either the numerator or the denominator have a stochastic component. Had assumption 7 not excluded mergers and bankruptcy and assumption 3 not required an asset to have the properties of an equity security, then a wide variety of possible stochastic processes could be built into returns.If an asset had been a zero coupon bond, then the numerator would be known with certainty, and the distribution would reflect on the error of pricing at purchase, excluding bankruptcy risk. Likewise, if it were certain that a cash-for-stock merger would happen, then the numerator would also have been cash and certain. In addition, because firms should merge with undervalued firms, the assumption of an equilibrium should have been violated. Also, if the firm were to go bankrupt, then the distribution of prices would not matter, only the probability from the Bernoulli process that the future quantity was equal to zero.Do note that such certainty is not a real-world problem if the probability of a reward is decomposed into the reward given the firm remains going concern multiplied by the probability it will remain a going concern, plus the probability of a reward given a merger multiplied by the probability of a merger, plus zero times the probability of a bankruptcy. From assumption 4, it follows that there cannot be a winner's curse in equilibrium. The overlap in the limit book would prevent the possibility of a cursed price, so the rational behavior is for each bidder to bid their estimation of the expected price. 
From assumption 1 and the central limit theorem, it must be the case that the distribution of the limit book must converge to the Gaussian distribution as the number of bidders becomes large enough. From Curtiss, Gurland, and Marsaglia, it is well known that the distribution of the ratio of normal variates centered on zero, or in this case the equilibrium, must be the Cauchy distribution. Alternatively, if one converted the prices into polar coordinates, it follows that the solution is also the solution to Gull's Lighthouse Problem and again converges to a Cauchy distribution.From assumption 7 there are other states of nature not included in this proof including bankruptcy which limits losses to the original investment, truncating the distribution in reward space at zero and in return space at negative one hundred percent. As such, integrating the kernel from zero to infinity rather than negative infinity to positive infinity produces a density of where is the scale parameter of returns and is the ratio of the standard deviations of prices, making it also a measure of price heteroskedasticity. It is important to note that the scale parameter is not a variance as neither the population mean nor variance is defined. The reason the distribution lacks a defined mean or variance depends, in part, on how the integrals are defined, but in either circumstance, the expectation is which clearly diverges. Although the arctangent goes to unity as the reward goes to infinity, it is obvious that the product goes to infinity as reward goes to infinity, implying an undefined expectation.This absence of a mean has an unexpected, but well known, result in statistics. The sampling distribution of an estimator of a mean or of a least squares estimator will map to the distribution of the data. 
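Both claims, that a ratio of zero-centered normal variates is Cauchy and that averaging Cauchy data does not tighten the estimate, can be checked by simulation. The following is a minimal sketch in R and not the code behind this post; the seed, the sample sizes, and all variable names are illustrative assumptions. It compares interquartile ranges, since the standard Cauchy has quartiles at -1 and +1.

```r
set.seed(42)  # arbitrary seed, used only for repeatability

n_reps <- 5000  # number of simulated samples (illustrative choice)
n_obs  <- 100   # observations per sample (illustrative choice)

# Ratio of two independent standard normals: should behave like a standard
# Cauchy, whose interquartile range is 2.
ratio <- rnorm(n_reps) / rnorm(n_reps)
iqr_ratio <- IQR(ratio)

# Sample means of Cauchy data: the spread does not shrink as n_obs grows.
cauchy_means <- replicate(n_reps, mean(rcauchy(n_obs)))
iqr_cauchy_means <- IQR(cauchy_means)

# Sample means of normal data, for contrast: the spread shrinks like 1/sqrt(n_obs).
normal_means <- replicate(n_reps, mean(rnorm(n_obs)))
iqr_normal_means <- IQR(normal_means)

print(c(ratio = iqr_ratio,
        cauchy_means = iqr_cauchy_means,
        normal_means = iqr_normal_means))
```

Under these assumptions the first two interquartile ranges should both sit near 2, while the third should be roughly a tenth of the single-observation normal value of about 1.35.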
The implication is that one randomly chosen element of a sample, if used as the estimator of the center of location, has the same informational value as the sample mean of a billion points of data. If a squares minimizing process or an arithmetic average is used, then no meaningful solution can be found.

With no mean, the models collapse.

Apple as an Example

Using Apple as an example, consider the daily returns. Rewards were normalized to daily rewards to allow for weekends and market closings. As such, a value of 1 is the same thing as a zero percent return. Summary statistics for Apple from R are:

| Statistic | Value |
| --- | --- |
| Min | 0.4813 |
| 1 Qtr | 0.9896 |
| Median | 1 |
| 3 Qtr | 1.0113 |
| Max | 1.33323 |
| Mean | 1.00078 |

The lifetime range is almost eighty-six percent. The difference between the mean and the median seems small, but these are daily returns. The annualized difference is almost thirty-three percent. The Cauchy distribution, ignoring the truncation at zero, uses the median as the center of location. The normal distribution’s most efficient estimator is the mean. Which to use? A kernel density estimate of Apple’s daily return using the bi-weight method is shown below. Now the implicit model using the normal distribution is used below. The normal is in red. The maximum likelihood estimator was used. The systematic effects of liquidity costs, dividends, truncation, and uncertainty regarding the estimator were ignored. The same is true for the Cauchy model below. It is possible to improve the modeling for both by proper accounting for other effects. The implicit model using the Cauchy distribution is a substantial improvement but creates a problem. If it is a distribution without a mean, then least squares methods should not be used. For many models, the log difference is used rather than the raw data. The log model does have a mean and variance, but no covariance. 
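The contrast between raw and log data can be illustrated with a small simulation. This is only a sketch under a simplifying assumption: it uses a standard Cauchy variable rather than the truncated reward distribution discussed in this post, and the seed, sample size, and variable names are illustrative. The log of the absolute value of a standard Cauchy variable has density sech(y)/pi, a hyperbolic secant law with mean zero and variance pi^2/4, so its sample moments settle down even though the raw draws have no mean.

```r
set.seed(7)  # arbitrary seed, used only for repeatability

n <- 100000          # illustrative sample size
raw <- rcauchy(n)    # standard Cauchy draws (assumption: no truncation)

# Log of the absolute value: follows a hyperbolic secant distribution.
logged <- log(abs(raw))

theoretical_var <- pi^2 / 4  # variance of the density sech(y) / pi
log_mean <- mean(logged)     # should be close to 0
log_var  <- var(logged)      # should be close to theoretical_var

print(c(log_mean = log_mean, log_var = log_var, theoretical_var = theoretical_var))
```

The sketch says nothing about covariance; it only shows that the first two moments exist after the log transformation.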
The log distribution is the hyperbolic secant distribution and an improvement in the sense that a mean and variance exist, but not a gain concerning least squares as there is still no covariance structure about which to discuss systematic and idiosyncratic risk.

Path Forward

The news on the path forward is both good and bad. The good news is that the path forward has yet to be built in an automated format, and so there is a small fortune to be made in creating the design the market ends up adopting. Someone reading this may get rich. The bad news is that the path forward has yet to be built in an automated format, and it won't look like a regression of the style traditional in existing models. There will be many failures.

Dividends cannot be ignored. Bankruptcy and mergers cannot be ignored. Liquidity costs cannot be ignored. It also requires building across data sets. If you observe firm X alone in a time series, how will you capture its probability of going out of existence prior to it going out of existence? If you observe a firm that has never paid a dividend, how will you predict likely future dividends? The idea of observing a single stationary time series is inadequate.

I am hoping to create a push toward new activity and end discussions of older ideas such as volatility surfaces or WACC, as they won't matter anymore. Many things will vanish. Alpha and beta will go away. Factors will likely come back, but without the good fortune of having a covariance structure to work with.

So how do we move forward? By beginning with things that are known to work. It is time to unshackle our minds from the straitjacket of fixating on the elegant. One of the tools that works is value investing.

Value Investing

I am going to begin this exposition on value investing with a financial story set in Montana. The story begins decades ago with two brothers meeting, falling in love with, and marrying two sisters. 
The two new families purchased homes diagonal to one another on a street corner in Great Falls. Great Falls was a planned city. Founded in 1883 by Paris Gibson and built on the advantages hydroelectric power could provide to an industrial location, it is a study in the history of American architecture. A drive from downtown shows the slow expansion of the city, and the periods where growth happened can be identified by looking at the design of homes on a block.

The two homes the couples moved into were started and finished on the exact same day. The construction was identical, and the exteriors were identical. To save money, the two families made bulk purchases when repairs or changes were needed, and the two homes remained identical all through the years. Both families had one child. The children, Charlie and Sam, grew up. Sam moved to New York while cousin Charlie moved to Los Angeles. They were building successful careers when tragedy struck.

The two couples loved to do things together and decided to go to Glacier National Park. While traveling up one of the mountains, their car went out of control and fell hundreds of feet off the road, killing everyone instantly. The cousins returned to Great Falls to bury their parents and settle their estates.

Charlie’s parents had built up an illiquid real estate empire in Cascade County and around Montana. Sam’s parents were of modest means and, except for the home, only held highly liquid assets. The estates settled on the same day, and both cousins listed their homes for sale on the same day. Both had immediate offers for $200,000, and they immediately accepted them. The couple that made the offer on Charlie’s home decided sometime later to take a camping trip in Glacier and traveled there for a weekend of fun. Sadly, the couple came upon the same curve in the road, and they too fell hundreds of feet to their deaths. 
Charlie was notified of the deaths by the realtor and was told the couple’s estates were empty and that the sale was off. Charlie returned to Great Falls to see what could be done, as the estate was bleeding cash, and decided that it would best be handled in person. Incidentally, Sam was there for the upcoming closing on the house. They both went to the old neighborhood to see how things had changed.

Afterward, Charlie went to a bar to find as many pints of Dam Fog from the Mighty Mo Brewing Company as possible. While sitting at the bar, Charlie talked about the failed sale of the home when someone interrupted and said, “Would you take $140,000 for it? I can’t do more, it’s what I can get.” Charlie, ecstatic, cheerfully accepted the offer. His parents’ estate was asset rich and cash poor.

Sam and Charlie had the closing on the sale at the same attorney’s office at the same time, just down the hall from each other. They went out for a celebratory drink and promised not to let so much time pass before they saw each other again.

The new homeowners, Alex and Jessie, were friends and worked at the same industrial concern. Their homes were identical, and the only difference between the two houses was that Alex paid $140,000 and Jessie paid $200,000, both in cash.

Eleven months passed, and the industrial concern announced a planned expansion. Real estate prices in the city rose, and the two friends decided to have the homes appraised just to see if they could turn a quick buck. The appraiser set the value of the houses at $220,000. Neither was satisfied with the price improvement, but both discussed waiting until the expansion happened so that they could downsize if they could get enough money. Unfortunately for both of them, embezzlement happened at their place of work and the firm was suddenly shuttered. Unemployed, without immediate prospects, both sold their homes and moved away. 
Incidentally, they sold them two years from the date of purchase for $180,000.

Now the question is, did one of them take more risk than the other, and if so, which one?

Because the homes were fundamentally identical and located at approximately the same place, the risk of loss from fire, meteor strike, civil commotion, and so forth should be equal. The fundamental chance of damage to the structures is the same.

| Time | Alex | Jessie |
| --- | --- | --- |
| 0 | $140,000 | $200,000 |
| 1 | $220,000 | $220,000 |
| 2 | $180,000 | $180,000 |
| Standard Deviation | $40,000 | $20,000 |

The sample standard deviation of prices for Jessie’s home was $20,000 while it was $40,000 for Alex’s home over the period. As measured by variance, Jessie’s house was the less risky investment. Was it less risky? Consider the following three definitions of risk. One definition is exposure to loss, the second is exposure to uncertainty, the third is exposure to goal failure.

To make things slightly more comparable, let us add the stipulation that Alex lied to Charlie and actually had another $60,000 in savings so that both have equal assets at the beginning. Those funds are still in savings. Imagine that instead of being either Alex or Jessie, we are nature, and we know the true probability distribution of prices at the end of the second year. Let us assume it follows the following, somewhat strange, ad hoc cumulative mass process.

| Market Price in Thousands of Dollars | Probability a Value is Less Than or Equal to the Market Price |
| --- | --- |
| 50 | 0.00% |
| 100 | 10.00% |
| 140 | 20.00% |
| 170 | 30.00% |
| 190 | 40.00% |
| 200 | 50.00% |
| 210 | 60.00% |
| 230 | 70.00% |
| 260 | 80.00% |
| 300 | 90.00% |
| 350 | 100.00% |

Based on the first definition, being exposed to the risk of loss, Alex exposed less money and only had a twenty percent chance of experiencing a loss. In addition, thirty percent of Alex's portfolio is in a federally insured savings account, and so the variance could be considered zero. Because the variance in a Bernoulli trial is greatest at the fifty percent mark, Jessie took the greatest risk in terms of the uncertainty of outcome. 
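The figures in the price history and the cumulative mass process above can be checked with a few lines of R. This is a sketch for illustration only; the vector and function names are assumptions and not part of the original analysis. It recomputes the sample standard deviations, reads each buyer's probability of a loss off the tabulated values, and evaluates the Bernoulli variance p(1 - p) at those probabilities.

```r
# Prices at times 0, 1 and 2 for each house.
alex_prices   <- c(140000, 220000, 180000)
jessie_prices <- c(200000, 220000, 180000)

sd_alex   <- sd(alex_prices)    # sample standard deviation
sd_jessie <- sd(jessie_prices)

# Tabulated cumulative mass process: price in thousands of dollars and
# the probability that the market price is at or below it.
price_k  <- c(50, 100, 140, 170, 190, 200, 210, 230, 260, 300, 350)
cum_prob <- seq(0, 1, by = 0.1)

# Probability of selling at or below the purchase price.
p_loss_alex   <- cum_prob[price_k == 140]
p_loss_jessie <- cum_prob[price_k == 200]

# Variance of a Bernoulli trial; it peaks at p = 0.5.
bernoulli_var <- function(p) p * (1 - p)

print(c(sd_alex = sd_alex, sd_jessie = sd_jessie))
print(c(alex = bernoulli_var(p_loss_alex), jessie = bernoulli_var(p_loss_jessie)))
```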
Alex's variance is $22,400, while Jessie's is $50,000 when measured as the uncertainty of loss. What about when measured as exposure to raw uncertainty?

Looking forward, once the homes’ values had gone back to equilibrium and stabilized, both houses had the same exposure to uncertainty at that price level, but Alex exposed fewer resources and still took less risk. Now consider exposure to goal failure. Let us imagine the goal was to make a ten percent simple interest rate of return on all investments over the two years. For Alex, the profit needs to be at least $28,000 plus returns on the cash, while Jessie must make $40,000. If we assume that the above mass function is piecewise linear, then Jessie has a 26.67% chance of succeeding. Conversely, Alex has a 70.67% chance of succeeding.

Now consider the case where Alex, a great-great-grandchild of Rip van Winkle, fell asleep at the moment of purchase and woke up just in time to sell. Alex couldn’t allocate the other $60,000 in cash to other investments, and so the home must make all $40,000 to reach the goal. Alex still has a 65% chance of succeeding while still holding fewer risky assets. When does Alex’s risk catch up to Jessie’s risk? If Alex were to make a catastrophic purchase and lose 100% of the investment in the complementary set of assets, then they would have the same risk of goal failure.

Part of the Math Behind Value Investing

Let us go back to the above concepts of present value and future value. We are still going to ignore liquidity costs, dividends, merger risk, and bankruptcy risk to simplify the discussion, which we clearly should not do in the real world. Let $E$ be earnings and $p$ be price, and note that $p = \frac{p}{E} \times E$. Also, note that these are, for our purposes, economic operating earnings and not accounting earnings. The difference is that the accounting principles are a tool with a purpose. The tool is a mixture of meeting business needs and the political needs of management, shareholders, and legislatures. 
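Stepping back to the comparison of Alex and Jessie, the loss variances and goal-failure probabilities quoted above follow from the tabulated cumulative mass process once it is treated as piecewise linear. The sketch below is an illustration under that stated assumption, using base R's `approx` for the linear interpolation; the variable and function names are hypothetical.

```r
# Tabulated cumulative mass process, prices in thousands of dollars.
price_k  <- c(50, 100, 140, 170, 190, 200, 210, 230, 260, 300, 350)
cum_prob <- seq(0, 1, by = 0.1)

# Uncertainty of loss: Bernoulli variance scaled by the amount exposed.
loss_var_alex   <- 0.2 * (1 - 0.2) * 140000
loss_var_jessie <- 0.5 * (1 - 0.5) * 200000

# P(price >= threshold), interpolating linearly between tabulated points.
prob_at_least <- function(threshold_k) {
  1 - approx(price_k, cum_prob, xout = threshold_k)$y
}

goal_rate <- 0.10 * 2  # ten percent simple interest over two years

jessie_needed      <- 200 * (1 + goal_rate)   # house must reach 240
alex_needed        <- 140 * (1 + goal_rate)   # house must reach 168
alex_asleep_needed <- 140 + 200 * goal_rate   # house must earn all 40, so 180

p_jessie      <- prob_at_least(jessie_needed)
p_alex        <- prob_at_least(alex_needed)
p_alex_asleep <- prob_at_least(alex_asleep_needed)

print(round(c(jessie = p_jessie, alex = p_alex, alex_asleep = p_alex_asleep), 4))
```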

I want to apologize to my small audience for taking so long to post. My goal had been one post per week, but life intervened in the meantime. While I have tried to be productive, the blog fell by the wayside. I did write a couple of short stories, I must confess. I wrote How to Engage in Counterespionage Operations Against Ghosts because I grew up watching Sherlock Holmes movies and the Twilight Zone. I wrote a small set of short stories about the first day of the second American civil war. Over the last decade, people have carelessly bandied about war as a solution to America’s structural problems. I wanted them to really get what that would mean. I also wrote a short story called The Exfiltration of a President. After all, if one of the most watched and guarded persons on Earth had to get to a jurisdiction without extradition, how would that be accomplished?

Today’s post is about none of those, although national stability should surely be included in the risk premiums of equity securities, shouldn’t it? I know I promised a sex edition, but it will have to wait until the next blog post. I am going to provide something quite a bit more valuable. I have a paper in peer review for publication and a second to follow it upon acceptance. I also have two others waiting in the wings, but the implication is that the public will see everything a year or so from now. The post here is an attempt to thread the needle between respecting copyright and the peer review process by not duplicating content, while also permitting people to build on the content while mine is standing in line.

It also solves a validation problem for data science. Most data scientists are not aware of the controversies in finance or economics. There is no reason for them to be. They need the tools because data scientists are builders; they do not need to be mired down in the Allais paradox, the equity premium paradox, or the many other issues economists are paid to deal with.
I say this because of the types of questions data scientists and others in the area of finance ask on Stack Exchange. It is obvious what is in the pedagogy of data scientists and what is not.

In 1963, Benoit Mandelbrot published a paper called The Variation of Certain Speculative Prices. The synopsis of the article would be, “if that is your theory, then this cannot be your data, and this is your data.” By 1973, Fama and MacBeth should have put mean-variance finance to bed as a construction of the world with a falsification sufficient to close out the field. The difficulty is that there was nothing to replace it with. Economics has been in the place that physics was following the Michelson and Morley experiments. Physics knew classical theory had a severe problem somewhere, but it would have to wait a generation, until quantum mechanics and relativity came around, for it to work again.

The mistake in using models such as the Capital Asset Pricing Model, Black-Scholes, or Fama-French is that we know they do not work. The error in thinking is to conclude that nothing works and that we don't know anything at all about what does work. The structure of the rest of this post is to show why mean-variance finance cannot work, followed by showing a tool that works consistently, even though there isn’t a lot of theory at this point as to why.

Why worry about a controversy in economics if someone is paying you to build something that does not work? Because if you can create something that can work, you have a customer for life. I have an idea of how I would move this process forward, as I have performed and taught securities analysis for over two decades. Given the size of the professional population, however, many people will have better ideas than mine, ideas that may never be considered unless the broader group of professionals focuses on what works empirically.

The goal of this post is to get people thinking, talking, and building.
The proof regarding the excludability of mean-variance methods and the argument for their general exclusion were first put forth in their protoform by Poisson and Augustin Cauchy in articles in the first half of the nineteenth century. A similar argument was made by R.A. Fisher in the 1930s as an example of how the statistical methods of Pearson and Neyman could go wrong. Because of how much work is required to include mathematical symbols in a blog post, I will reference the work of others at times rather than re-derive their well-known work.

I am conscious that formalisms are weird in blog posts, but they permit serious rebuttal and a sober analysis by data scientists working in finance. This also isn’t being submitted as a paper for two reasons. First, the proof has been well known in statistics since the 1940s; second, I have a similar paper already out there. For the lemmas and the theorem, there will be no dividends or mergers, and no firm can go bankrupt. That is in line with the assumptions of the standard models, and the other items are not necessary to exclude mean-variance finance. Dividends, bankruptcy, liquidity costs, and merger risks will be essential for the portion discussing how to move forward.

Proofs

Assumptions and Definitions

Assumptions

1. There are very many potential buyers and sellers.
2. The market is in equilibrium.
3. The securities are equity securities. This excludes various forms of bonds and other assets, such as antiques, which have different distributions.
4. The securities are exchanged in a double auction.
5. Buyers purchase and sellers sell $q$ securities at price $p$ where [equation missing].
6. All securities are purchased at time $t$ and sold at time $t+1$.
7. It is known with certainty that none of these firms will go bankrupt or merge out of existence.
8. The parameters are estimated from information.
9. Errors at time $t$ and $t+1$ are independent.

Definitions

1. The reward for investing resources at time $t$ is defined as $R_t = \frac{p_{t+1} q_{t+1}}{p_t q_t}$.
2. The return is defined as $r_t = R_t - 1$.
3. A statistic is any function of the data.
4. Equilibrium price is defined as [equation missing].
5. Equilibrium reward is defined as [equation missing].
6. The reward for investing is also defined as [equation missing], the sum of the equilibrium reward and a random variable.

Lemma 1. The distribution of the reward for investing, or the return, is approximately the distribution of the errors for securities near their equilibrium prices.

By Wold's Decomposition theorem and given the assumption of an equilibrium price, prices can be written as [equation missing]. Since assumption 7 requires that $q_{t+1} = q_t$, it follows that definition 1 can be reduced to the ratio of prices, $R_t = p_{t+1} / p_t$.

Definitions 1 and 6 provide two definitions of the reward, which could be written as [equation missing], which leads to [equation missing], which for small errors around the equilibrium set of prices is [equation missing].

Lemma 2. If the price errors from the first lemma are normally distributed around zero, then as prices go to the equilibrium the distribution of the errors to reward is the same as the distribution of the ratio of errors about the prices.

However, if [equation missing] is normally distributed around zero, then after several normalizations, it is known from Marsaglia that the distribution of $\frac{a + x}{b + y}$, where $a$ and $b$ are constants and $x$ and $y$ are normal random variables, is proportionate to the standard Cauchy multiplied by a function which will go to one as the price goes to equilibrium, leaving only the Cauchy distribution portion.
Note that this does not hold as prices go far from the equilibrium, as would be the case in a bubble or market collapse. Nonetheless, [equation missing] if the errors are normally distributed.

Theorem. Given the assumptions and the first two lemmas, the distribution of returns of the reward function is the truncated Cauchy distribution.

Assumption 3 has an important consequence conditional on definition 1. There is no requirement in definition 1 that either the numerator or the denominator have a stochastic component. Had assumption 7 not excluded mergers and bankruptcy and assumption 3 not required an asset to have the properties of an equity security, then a wide variety of possible stochastic processes could be built into returns.

If an asset had been a zero coupon bond, then the numerator would be known with certainty, and the distribution would reflect the error of pricing at purchase, excluding bankruptcy risk. Likewise, if it were certain that a cash-for-stock merger would happen, then the numerator would also have been cash and certain. In addition, because firms should merge with undervalued firms, the assumption of an equilibrium would have been violated. Also, if the firm were to go bankrupt, then the distribution of prices would not matter, only the probability from the Bernoulli process that the future quantity was equal to zero.

Do note that such certainty is not a real-world problem if the probability of a reward is decomposed into the reward given the firm remains a going concern multiplied by the probability it will remain a going concern, plus the probability of a reward given a merger multiplied by the probability of a merger, plus zero times the probability of a bankruptcy.

From assumption 4, it follows that there cannot be a winner's curse in equilibrium. The overlap in the limit book would prevent the possibility of a cursed price, so the rational behavior is for each bidder to bid their estimation of the expected price.
From assumption 1 and the central limit theorem, the distribution of the limit book must converge to the Gaussian distribution as the number of bidders becomes large enough. From Curtiss, Gurland, and Marsaglia, it is well known that the distribution of the ratio of normal variates centered on zero, or in this case the equilibrium, must be the Cauchy distribution. Alternatively, if one converted the prices into polar coordinates, it follows that the solution is also the solution to Gull's Lighthouse Problem and again converges to a Cauchy distribution.

From assumption 7, there are other states of nature not included in this proof, including bankruptcy, which limits losses to the original investment, truncating the distribution in reward space at zero and in return space at negative one hundred percent. As such, integrating the kernel from zero to infinity rather than negative infinity to positive infinity produces a density of

$$f(R_t) = \left[\frac{\pi}{2} + \arctan\left(\frac{\mu}{\sigma}\right)\right]^{-1} \frac{\sigma}{\sigma^2 + (R_t - \mu)^2}, \qquad R_t \ge 0,$$

where $\mu$ is the center of location and $\sigma$ is the scale parameter of returns and is the ratio of the standard deviations of prices, making it also a measure of price heteroskedasticity. It is important to note that the scale parameter is not a variance, as neither the population mean nor variance is defined. The reason the distribution lacks a defined mean or variance depends, in part, on how the integrals are defined, but in either circumstance, the expectation is [equation missing], which clearly diverges. Although the arctangent goes to unity as the reward goes to infinity, it is obvious that the product goes to infinity as reward goes to infinity, implying an undefined expectation.

This absence of a mean has an unexpected, but well known, result in statistics. The sampling distribution of an estimator of a mean or of a least squares estimator will map to the distribution of the data.
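A quick simulation can make the last two claims concrete. The sketch below is not part of the derivation: it assumes independent standard normal errors in the numerator and the denominator, and the function names, seed, and sample sizes are illustrative choices only. It draws ratios of zero-centered normals, checks that their quartiles sit near the standard Cauchy values of -1 and +1 (an interquartile range near 2), and then looks at the spread of sample means built from 200 such draws each.

```python
import random
import statistics


def iqr(values):
    """Interquartile range computed from the sample quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def ratio_of_normals(rng):
    """One draw of x / y with x, y independent standard normals (assumed setup)."""
    y = rng.gauss(0.0, 1.0)
    while y == 0.0:  # guard against an exact zero denominator
        y = rng.gauss(0.0, 1.0)
    return rng.gauss(0.0, 1.0) / y


def simulate(n_single=100_000, n_means=2_000, sample_size=200, seed=12345):
    """Compare the spread of single draws with the spread of sample means."""
    rng = random.Random(seed)
    singles = [ratio_of_normals(rng) for _ in range(n_single)]
    means = []
    for _ in range(n_means):
        draws = [ratio_of_normals(rng) for _ in range(sample_size)]
        means.append(sum(draws) / sample_size)
    return {
        "single_median": statistics.median(singles),
        "single_iqr": iqr(singles),
        "mean_iqr": iqr(means),
    }


result = simulate()
print(result)
```

If the data had a finite variance, the interquartile range of the sample means would shrink by roughly the square root of the sample size. Here it should stay close to the interquartile range of the single draws, which is the sense in which averaging buys nothing.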
The implication is that one randomly chosen element of a sample, if used as the estimator of the center of location, has the same informational value as the sample mean of a billion points of data. If a squares minimizing process or an arithmetic average is used, then no meaningful solution can be found.

With no mean, the models collapse.

Apple as an Example

Using Apple as an example, consider the daily returns. Rewards were normalized to daily rewards to allow for weekends and market closings. As such, a value of 1 is the same thing as a zero percent return. Summary statistics for Apple from R are:

| Statistic | Value |
|---|---|
| Min | 0.4813 |
| 1st Qtr | 0.9896 |
| Median | 1 |
| 3rd Qtr | 1.0113 |
| Max | 1.33323 |
| Mean | 1.00078 |

The lifetime range is almost eighty-six percent. The difference between the mean and the median seems small, but these are daily returns. The annualized difference is almost thirty-three percent, since $1.00078^{365} \approx 1.33$. The Cauchy distribution, ignoring the truncation at zero, uses the median as the center of location. The normal distribution’s most efficient estimator is the mean. Which to use?

A kernel density estimate of Apple’s daily return using the bi-weight method is shown below.

[Figure: kernel density estimate of Apple’s daily returns]

Now the implicit model using the normal distribution is shown below. The normal is in red. The maximum likelihood estimator was used. The systematic effects of liquidity costs, dividends, truncation, and uncertainty regarding the estimator were ignored. The same is true for the Cauchy model below. It is possible to improve the modeling for both by properly accounting for other effects.

[Figure: kernel density estimate with the fitted normal model in red]

[Figure: kernel density estimate with the fitted Cauchy model]

The implicit model using the Cauchy distribution is a substantial improvement but creates a problem. If it is a distribution without a mean, then least squares methods should not be used. For many models, the log difference is used rather than the raw data. The log model does have a mean and variance, but no covariance.
The log distribution is the hyperbolic secant distribution and an improvement in the sense that a mean and variance exist, but not a gain concerning least squares, as there is still no covariance structure about which to discuss systematic and idiosyncratic risk.

Path Forward

The news on the path forward is both good and bad. The good news is that the path forward has yet to be built in an automated format, and so there is a small fortune to be made in creating the design the market ends up adopting. Someone reading this may get rich. The bad news is that the path forward has yet to be built in an automated format, and it won't look like a regression of the style traditional in existing models. There will be many failures.

Dividends cannot be ignored. Bankruptcy and mergers cannot be ignored. Liquidity costs cannot be ignored. It also requires building across data sets. If you observe firm X alone in a time series, how will you capture its probability of going out of existence prior to it going out of existence? If you observe a firm that has never paid a dividend, how will you predict likely future dividends? The idea of observing a single stationary time series is inadequate.

I am hoping to create a push toward new activity and end discussions of older ideas such as volatility surfaces or WACC, as they won't matter anymore. Many things will vanish. Alpha and beta will go away. Factors will likely come back, but without the good fortune of having a covariance structure to work with.

So how to move forward? By beginning with things that are known to work. It is time to unshackle our minds from the straitjacket of fixating on the elegant. One of the tools that works is value investing.

Value Investing

I am going to begin this exposition on value investing with a financial story set in Montana. The story begins decades ago with two brothers meeting, falling in love with, and marrying two sisters.
The two new families purchased homes diagonal to one another on a street corner in Great Falls. Great Falls was a planned city. Founded in 1883 by Paris Gibson and built on the advantages hydroelectric power could provide to an industrial location, it is a study in the history of American architecture. A drive from downtown shows the slow expansion of the city, and the periods where growth happened can be identified by looking at the design of homes on a block.

The two homes the couples moved into were started and finished on the exact same day. The construction was identical, and the exteriors were identical. To save money, the two families made bulk purchases when repairs or changes were needed, and the two homes remained identical all through the years. Both families had one child. The children, Charlie and Sam, grew up. Sam moved to New York while cousin Charlie moved to Los Angeles. They were building successful careers when tragedy struck.

The two couples loved to do things together and decided to go to Glacier National Park. While traveling up one of the mountains, their car went out of control and fell hundreds of feet off the road, killing everyone instantly. The cousins returned to Great Falls to bury their parents and settle their estates.

Charlie’s parents had built up an illiquid real estate empire in Cascade County and around Montana. Sam’s parents were of modest means and, except for the home, only held highly liquid assets. The estates settled on the same day, and both cousins listed their homes for sale on the same day. Both had immediate offers for $200,000, and they immediately accepted them. The couple that made the offer on Charlie’s home decided sometime later to take a camping trip in Glacier and traveled there for a weekend of fun. Sadly, the couple came upon the same curve in the road, and they too fell hundreds of feet to their deaths.
Charlie was notified of the deaths by the realtor and was told the couple’s estates were empty and that the sale was off. Because the estate was bleeding cash, Charlie decided the matter would best be handled in person and returned to Great Falls to see what could be done. Incidentally, Sam was there for the upcoming closing on the house. They both went to the old neighborhood to see how things had changed.

Afterward, Charlie went to a bar to find as many pints of Dam Fog from the Mighty Mo Brewing Company as possible. While sitting at the bar, Charlie talked about the failed sale of the home when someone interrupted and said, “Would you take $140,000 for it? I can’t do more, it’s what I can get.” Charlie, ecstatic, cheerfully accepted the offer. His parents’ estate was asset rich and cash poor.

Sam and Charlie had the closing on the sale at the same attorney’s office at the same time, just down the hall from each other. They went out for a celebratory drink and promised not to let so much time pass before seeing each other again.

The new homeowners, Alex and Jessie, were friends and worked at the same industrial concern. Their homes were identical, and the only difference between the two houses was that Alex paid $140,000 and Jessie paid $200,000, both in cash.

Eleven months passed, and the industrial concern announced a planned expansion. Real estate prices in the city rose, and the two friends decided to have the homes appraised just to see if they could turn a quick buck. The appraiser set the value of the houses at $220,000. Neither was satisfied with the price improvement, but both discussed waiting until the expansion happened, when they could downsize if they could get enough money. Unfortunately for both of them, embezzlement happened at their place of work, and the firm was suddenly shuttered. Unemployed, without immediate prospects, both sold their homes and moved away.
Incidentally, they sold them two years from the date of purchase for $180,000.

Now the question is, did one of them take more risk than the other, and if so, which one?

Because the homes were fundamentally identical and located at approximately the same place, the risk of loss from fire, meteor strike, civil commotion, and so forth should be equal. The fundamental chance of damage to the structures is the same.

| Time | Alex | Jessie |
|---|---|---|
| 0 | $140,000 | $200,000 |
| 1 | $220,000 | $220,000 |
| 2 | $180,000 | $180,000 |
| Standard Deviation | $40,000 | $20,000 |

The sample standard deviation of prices for Jessie’s home was $20,000 while it was $40,000 for Alex’s home over the period. As measured by variance, Jessie’s house was the less risky investment. Was it less risky? Consider the following three definitions of risk. One definition is exposure to loss, the second is exposure to uncertainty, and the third is exposure to goal failure.

To make things slightly more comparable, let us add the stipulation that Alex lied to Charlie and actually had another $60,000 in savings so that both have equal assets at the beginning. Those funds are still in savings. Imagine that instead of being either Alex or Jessie, we are nature, and we know the true probability distribution of prices at the end of the second year. Let us assume it follows the somewhat strange, ad hoc cumulative mass process below.

| Market Price in Thousands of Dollars | Probability a Value is Less Than or Equal to the Market Price |
|---|---|
| 50 | 0.00% |
| 100 | 10.00% |
| 140 | 20.00% |
| 170 | 30.00% |
| 190 | 40.00% |
| 200 | 50.00% |
| 210 | 60.00% |
| 230 | 70.00% |
| 260 | 80.00% |
| 300 | 90.00% |
| 350 | 100.00% |

Based on the first definition, being exposed to the risk of loss, Alex exposed less money and only had a twenty percent chance of experiencing a loss. In addition, thirty percent of Alex's portfolio is in a federally insured savings account, and so the variance could be considered zero. Because the variance in a Bernoulli trial is greatest at the fifty percent mark, Jessie took the greatest risk in terms of the uncertainty of outcome.
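The figures in the two tables above can be checked with a few lines of Python. This is only a sketch: the variable and function names are mine, and reading the cumulative table between its tabulated points by linear interpolation is an assumption about how the ad hoc process behaves there.

```python
import statistics
from bisect import bisect_right

# Observed prices at times 0, 1 and 2 (purchase, appraisal, sale).
alex_prices = [140_000, 220_000, 180_000]
jessie_prices = [200_000, 220_000, 180_000]

# statistics.stdev uses the n - 1 (sample) denominator.
alex_sd = statistics.stdev(alex_prices)
jessie_sd = statistics.stdev(jessie_prices)

# Cumulative table: market price in thousands of dollars -> P(value <= price).
PRICES = [50, 100, 140, 170, 190, 200, 210, 230, 260, 300, 350]
CDF = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]


def prob_at_or_below(price):
    """P(value <= price), interpolating linearly between table rows (assumed)."""
    if price <= PRICES[0]:
        return CDF[0]
    if price >= PRICES[-1]:
        return CDF[-1]
    i = bisect_right(PRICES, price)  # PRICES[i - 1] <= price < PRICES[i]
    x0, x1 = PRICES[i - 1], PRICES[i]
    y0, y1 = CDF[i - 1], CDF[i]
    return y0 + (y1 - y0) * (price - x0) / (x1 - x0)


# Chance of selling at or below the purchase price.
alex_loss = prob_at_or_below(140)
jessie_loss = prob_at_or_below(200)

print(f"sample sd: Alex {alex_sd:,.0f}, Jessie {jessie_sd:,.0f}")
print(f"P(loss):   Alex {alex_loss:.0%}, Jessie {jessie_loss:.0%}")
```

The same function can be evaluated at prices that fall between table rows, for example `1 - prob_at_or_below(240)` for the chance that a home sells above $240,000.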
Alex's variance is $22,400, while Jessie's is $50,000 when measured as the uncertainty of loss, that is, the Bernoulli variance of a loss scaled by the amount exposed: 0.2 × 0.8 × $140,000 for Alex and 0.5 × 0.5 × $200,000 for Jessie. What about when measured as exposure to raw uncertainty?

Looking forward, once the homes’ values had gone back to equilibrium and stabilized, both houses had the same exposure to uncertainty at that price level, but Alex exposed fewer resources and still took less risk. Now consider exposure to goal failure. Let us imagine the goal was to make a ten percent simple interest rate of return on all investments over the two years. For Alex, the profit needs to be at least $28,000 plus returns on the cash, while Jessie must make $40,000. In price terms, Jessie needs a sale at $240,000 or more, while Alex needs $168,000 or more on the house. If we assume that the above mass function is piecewise linear, then Jessie has a 26.67% chance of succeeding. Conversely, Alex has a 70.67% chance of succeeding.

Now consider the case where Alex, a great-great-grandchild of Rip van Winkle, fell asleep at the moment of purchase and woke up just in time to sell. Alex couldn’t allocate the other $60,000 in cash to other investments, and so the home must make all $40,000 to reach the goal. Alex still has a 65% chance of succeeding while still holding fewer risky assets. When does Alex’s risk catch up to Jessie’s risk? If Alex were to make a catastrophic purchase and lose 100% of the investment in the complementary set of assets, then they would have the same risk of goal failure.

Part of the Math Behind Value Investing

Let us go back to the above concepts of present value and future value. We are still going to ignore liquidity costs, dividends, merger risk, and bankruptcy risk to simplify the discussion, which we clearly should not do in the real world. Let $e_t$ be earnings and $p_t$ be price, and note that $p_t = \frac{p_t}{e_t} e_t$. Also, note that these are, for our purposes, economic operating earnings and not accounting earnings. The difference is that the accounting principles are a tool with a purpose. The tool is a mixture of meeting business needs and the political needs of management, shareholders, and legislatures.
Each nation has its own standard set of accounting rules. It is common for legislatures to pass laws deferring the payment of taxes for powerful interests. Consider a firm that entered into a transaction that resulted in a one hundred million dollar tax liability, but where the law allows for the deferral of the payment by ten years. Real taxes are rarely that simple, but in the US, MACRS is such a set of rules. If a ten-year eight percent zero coupon bond were available, then, ignoring the tax on the gain, the tax can be canceled by investing forty-six million, three hundred and twenty thousand dollars in the bond and simply waiting for it to mature, since $100{,}000{,}000 / 1.08^{10} \approx 46{,}320{,}000$. What happened to the other fifty-four million dollars? It is really equity. If you can cancel a hundred million dollar accounting liability for forty-six million dollars, then the other fifty-four million is a fiction in economic terms.

Operating accounting earnings are a rules-based specification of how to divvy up operating cash flows between stakeholders such as customers, employees, creditors, and shareholders. They tend to be less volatile than cash flows, but are still rules driven. For our purposes, there will be no stochastic component, and earnings will be perfectly representative.

The PE ratio, commonly used in value investing, will be denoted as $\mathrm{PE}_t = p_t / e_t$, with $p_t$ the price and $e_t$ the earnings at time $t$. The reward on investing can be rewritten as

$$R_t = \frac{p_{t+1}}{p_t} = \frac{\mathrm{PE}_{t+1}\, e_{t+1}}{\mathrm{PE}_t\, e_t}.$$

Future earnings can be restated as prior earnings multiplied by a growth factor, $e_{t+1} = g\, e_t$. So the reward for investing can be stated as

$$R_t = g\, \frac{\mathrm{PE}_{t+1}}{\mathrm{PE}_t}.$$

In this simplified world, the relationship makes $\mathrm{PE}_t$, the multiple paid at purchase, the sole controllable variable.

Note this is not modeled like an economic theory. In standard economics, [equation missing]. Standard economics either describes a system at rest or a system moving toward rest. There is no system here. In equilibrium, value investing is called the value trap. Because prices are properly ordered, no excess gain exists in the system. However, a system driven by equilibria is a system that will seek to return over time to its equilibrium.
That fact can be seen as an advantage. Additionally, I have improperly set aside dividends to make life simpler, but the econometrics of dividends will need to be a future blog posting.

The goal of value investing is to purchase assets with the smallest measures of price to value. It is sometimes mistakenly thought that this would imply the lowest price to earnings, price to sales, or price to book, but those are only markers of value. A more sophisticated view is concerned with economic value and not accounting measures, which can be skewed.

If dividends were added, then return would be a discounted sum of the parts. This posting has no direct concept of time. I ignored time by making the growth factor $g$ instead of standardizing it as [equation missing]. If dividends were present, then a more complicated sum would be used, but the lessons would be no different. Likewise, if real rules of accountancy were used, then this would probably run sixteen hundred pages long. It would need to contain the content of the 5th and 6th editions, from 1987 and 1943 respectively, of Graham and Dodd's Security Analysis. Yes, the sixth edition is a reprint of the 1943 edition. The fifth edition describes how things should be done; the sixth edition describes why it is done. They really are inseparable.

So, in this nearly perfect world, how to apply the above story? First ask, “what role does the data scientist play here?” Is the data scientist the appraiser, the market maker, the trader, the portfolio manager, or several of the above? The data scientist would move over the set of securities, favoring none. The Bayesian predictive distribution of future prices and earnings would need to be constructed. Such a thing includes risk in the distribution, inherently. The Bayesian predictive distribution is [equation missing], where [symbol missing] is the sample space. Predictions need to form on all cash flows in the holding period.
If something increases bankruptcy risk, it decreases the probability of getting a standard return or a return from a merger happening. In a world with dividends, anything that would make a dividend uncertain makes the return uncertain. The underlying firm operating risk is a return risk and is included in the prediction. Actual economic and accounting values would have to rear their ugly heads and be included in the analysis. Given the above good news and bad news, that is the bad news.

Still, we are not at a model; all we have done is spoken about the nature of return and current price. If we are not currently in equilibrium, that is, we are not about to be in a value trap, then prices are not properly ordered. Most likely, most prices are properly ordered, but some will not be. Some will be overpriced, and some will be underpriced. What happens with the purchase P/E ratio is that it can be used to index the predictive distribution as a cumulative density. Consider an investor that required an eight percent rate of return. Imagine that if [value missing] then there is a fifty percent chance of reaching the goal, but if [value missing] then there is an eighty percent chance of reaching the goal. It converts this problem from one without a mean or a variance to a multinomial problem, or a problem of minimizing the expected loss from goal failure.

The viewpoint of value investing differs radically from models such as the CAPM. Each cash flow holds a potential value. The less one pays for a cash flow, the less risk and the higher return one would expect to receive. Because of this, misvalued securities should be rare, and they are. Value investors are looking for errors in the ordering of securities and other investments in terms of price to value. Earnings were used above, but profits are not always a reliable index, for both economic and accounting reasons. Accounting statements include both stated values and notes to explain the stated values. Real accounting happens in the notes.
It is also where the valuation process begins.

From the view of data science, it would be a massive undertaking. However, since the first portion of this blog post points out that factor models, beta-based models, and Ito models are intrinsically invalid, it makes sense to use things that have been observed to work. The first thing I tell students in introductory economics courses is that empiricism says that if something is always observed to work, then do that. If something always fails, then do not do that. If something works, contingent on other things, discover what those other things are. No matter how much you love an idea, model, or thing, if it is not supported in the empirical literature, then do not do that.

Value investing does not span the set of all things that are known to work, but it is attractive because it inverts the risk and reward trade-off compared to that faced in equilibrium. The data scientist constructing software to find and invest in the disparities that exist between value and price has to begin with accounting and economic data. The scientist would break up the problem into many small parts, such as accounting for valuation issues created by inventory methods, tax adjustments, goodwill adjustments, and so forth. Then the data scientist can use these variables to filter the set down to the most undervalued securities and adjust them for liquidity costs. I would still recommend having a person read the financial statements and make the decisions, but that is because I believe that machine learning, when combined with highly skilled humans, can produce far better results than either alone.

Bibliography

Curtiss, J. H. (1941). On the distribution of the quotient of two chance variables. Annals of Mathematical Statistics, 12:409-421.

Fama, E. F. and MacBeth, J. D. (1973). Risk, return, and equilibrium: Empirical tests. The Journal of Political Economy, 81(3):607-636.

Fisher, R. (1934). Two new properties of mathematical likelihood. Proceedings of the Royal Society of London, Series A, 144:285-307.

Graham, B. and Dodd, D. L. (1934). Security Analysis. Whittlesey House, McGraw-Hill Book Company, New York.

Graham, B. et al. (1988). Graham and Dodd's Security Analysis. McGraw-Hill.

Graham, B. and Dodd, D. L. (2009). Security Analysis: Principles and Technique. McGraw-Hill, New York.

Gull, S. F. (1988). Bayesian inductive inference and maximum entropy. In Erickson, G. J. and Smith, C. R., editors, Maximum-Entropy and Bayesian Methods in Science and Engineering: Foundations, volume 1 of Fundamental Theories of Physics, pages 53-74. Springer.

Gurland, J. (1948). Inversion formulae for the distribution of ratios. The Annals of Mathematical Statistics, 19(2):228-237.

Mandelbrot, B. (1963). The variation of certain speculative prices. The Journal of Business, 36(4):394-419.

Marsaglia, G. (1965). Ratios of normal variables and ratios of sums of uniform variables. Journal of the American Statistical Association, 60(309):193-204.

Stigler, S. M. (1974). Studies in the history of probability and statistics. XXXIII: Cauchy and the witch of Agnesi: An historical note on the Cauchy distribution. Biometrika, 61(2):375-380.