Time for everyone to put on their propeller hats.

I bring you the inverse hyperbolic sine transformation: log(y_{i}+(y_{i}^{2}+1)^{1/2})

According to a ranting Canadian economist,

Except for very small values of y, the inverse sine is approximately equal to log(2yi) or log(2)+log(yi), and so it can be interpreted in exactly the same way as a standard logarithmic dependent variable.
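A quick numerical check of that claim, as a sketch only: the grid of y values is made up for illustration and `ihs` is just a helper name chosen here. It compares the transformation with log(2y) and shows the gap shrinking as y grows.

```python
import numpy as np

def ihs(y):
    """Inverse hyperbolic sine, written out as log(y + sqrt(y^2 + 1))."""
    return np.log(y + np.sqrt(y**2 + 1.0))

# Illustrative values only; zero is included because the transformation is defined there.
y = np.array([0.0, 0.1, 1.0, 10.0, 100.0, 1000.0])

# Gap between the transformation and log(2y), skipping y = 0 where log is undefined.
gap = ihs(y[1:]) - np.log(2.0 * y[1:])

for value, g in zip(y[1:], gap):
    print(f"y = {value:>7}: ihs(y) - log(2y) = {g:.6f}")
```

The gap is large for y well below 1 and negligible once y is in the tens or hundreds, which is the "except for very small values" caveat in the quote.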

Here is more.

What’s the fuss about? If you want to look at the effect of an event or policy on income or employment or actions of some sort, you face a problem: the variable has a long upper tail (there are millionaires and workaholics) and so you want to take out the skew (or else have very imprecise estimates).

But transforming income with a natural logarithm doesn’t work, because there are many people with no income or employment or no acts, and ln(0) is undefined. So you either have to drop the zero income folks or (unholy of unholies) give all the zero earners $1. Which rumor has it will get your paper rejected by the AER.

Doubtful? Me too, but rumor also has it this function has the blessing of Cardinal David Card.

h/t Rema Hanna.

Link takes us to a journal, but not a specific paper

He’s quoting “Worthwhile Canadian Initiative,” which has a longer piece on this topic with complete reference info for the article in question plus other relevant articles. See:

http://worthwhile.typepad.com/worthwhile_canadian_initi/2011/07/a-rant-on-inverse-hyperbolic-sine-transformations.html

Sorry, link fixed.

Thank you for that post. I have one naïve question: what is nice about log(y) is that the underlying model is a Cobb-Douglas. If I use log(y_i+(y_i^2+1)^{1/2}) instead of y, I assume that the underlying model is the inverse of that function, and then I have to give some economic meaning to that inverse function. Right?

Is it a trade-off between keeping the zeros and using a correctly specified model?

I don’t know much about this, so this may be quite naive. If Y is drawn from a distribution where the mean is a function of “regressors” and 0 values are reported when the draw is below a “reservation” amount (so that people don’t work or purchase a good), then what you may want to do is impute a value for the 0’s that maintains the same underlying CEF in the log specification with and without the 0 values. This is theoretically in keeping with the basic model.

There is an old paper that does this, cited below. It gives a simple method for doing such an imputation, under the assumption that there is no selection on the 0-values, which is also assumed in the solution above. It seems to work well in applications, although it may have its own problems; perhaps some Monte Carlo simulations comparing the different solutions would be useful?

I think that this solution is better than adding 1 to 0 values, which I have never understood. Adding 1 assumes that in a setup where households have cows ranging from 0 to 10, adding 1 cow (raising deep questions of what a log cow looks like to begin with) in a log-spec is identical to adding $1 to a wage distribution that looks completely different…

& a thanks to Pramila Krishnan for pointing this out a long time ago….she seems to know every published paper on econometric solutions going back to 1920….(http://www.econ.cam.ac.uk/faculty/person.html?id=krishnan&group=faculty)

Johnson, S. R., and Gordon C. Rausser. 1971. “Effects of Misspecifications of Log-Linear Functions when Sample Values are Zero or Negative.” American Journal of Agricultural Economics 53(1):120-124.

Unless you really want elasticities, why not just use the Poisson QMLE or non-linear least squares with the exponential function? If you make your function flexible enough, it’s not like saying the conditional mean is linear is any weaker of an assumption than saying it is the exponential function.
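For concreteness, here is a minimal sketch of what a Poisson QMLE with an exponential conditional mean could look like, fitted by plain Newton steps in NumPy. Everything in it is an assumption for illustration: the data are simulated, the coefficients (0.5 and 0.3) are invented, and `poisson_qmle` is a hypothetical helper, not a library routine. The point is only that zeros in y need no special handling.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (illustrative): E[y | x] = exp(0.5 + 0.3 x), with many exact zeros in y.
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])  # first column is the constant
true_beta = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ true_beta)).astype(float)

def poisson_qmle(X, y, iterations=25):
    """Maximize the Poisson quasi-likelihood by Newton's method.

    Assumes the first column of X is a constant, which is used for the starting value.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())
    for _ in range(iterations):
        mu = np.exp(X @ beta)              # fitted conditional mean
        score = X.T @ (y - mu)             # gradient of the quasi-log-likelihood
        hessian = X.T @ (X * mu[:, None])  # negative Hessian
        beta = beta + np.linalg.solve(hessian, score)
    return beta

beta_hat = poisson_qmle(X, y)
print("share of zeros in y:", np.mean(y == 0))
print("estimated coefficients:", beta_hat)
```

In real work you would also want standard errors that are robust to the variance not being Poisson; this sketch only produces the point estimates.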

I have to say I am doubtful too. What bothers me about log(1+y) is that the choice of 1 is arbitrary (unless y is unitless, like inflation) and potentially influential on the estimates. That goes equally for the 1 in log(y+(y^2+1)^0.5). y could be denominated in Malawian kwacha of 1997. Why is adding 1 Malawian kwacha of 1997 more right in the latter case than the former? Why not 1 kwacha of 2007, or 5 kwacha of 1997? The choice affects the results.
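To make the units point concrete, here is a small sketch with made-up amounts and a made-up rescaling factor of 100 (think of switching currency units). A change of units shifts log(y) by the same constant for every observation, so it only moves the intercept, but it shifts log(1+y) and the inverse hyperbolic sine by different amounts for small and large y.

```python
import numpy as np

# Hypothetical amounts in some currency unit, and a hypothetical change of units.
y = np.array([0.5, 5.0, 500.0])
c = 100.0

shift_log = np.log(c * y) - np.log(y)          # plain log
shift_log1p = np.log1p(c * y) - np.log1p(y)    # log(1 + y)
shift_ihs = np.arcsinh(c * y) - np.arcsinh(y)  # log(y + sqrt(y^2 + 1))

print("log      :", shift_log)
print("log(1+y) :", shift_log1p)
print("ihs      :", shift_ihs)
```

Only the first row is constant across observations; in the other two the shift depends on the size of y, which is why the choice of units can change the estimates.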

In my own experience–in replicating Pitt and Khandker’s study of the impact of microcredit–this issue has come up with right-side variables, as distinct from left-side. It matters there too. The treatment variable is log credit, which has lots of zeroes. I think the right way to handle the trouble there is to represent credit as two variables: a dummy for whether any borrowing occurs, and a variable for the log borrowings. This renders the arbitrariness in the imputation for log(0) harmless. After performing such a regression, the impact of switching from no borrowing to borrowing a given amount is determined from the coefficients on both variables together. Changing the imputation on log(0) will result in a perfectly offsetting change in the coefficient on the dummy. I figured this out, but I assume it was figured out by others decades ago.
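A minimal sketch of that offsetting property, on simulated data: all numbers, variable names, and the helper functions `fit` and `effect_of_borrowing` are invented for illustration and have nothing to do with the actual Pitt and Khandker data. The regression is run twice with different values imputed for log(0), and the implied effect of moving from no borrowing to borrowing a given amount comes out the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative): about 40% of households borrow a positive amount.
n = 2000
borrows = rng.random(n) < 0.4
credit = np.where(borrows, np.exp(rng.normal(8.0, 1.0, n)), 0.0)
log_positive = np.log(np.where(borrows, credit, 1.0))  # log credit for borrowers, 0 otherwise
outcome = 1.0 + 0.5 * borrows + 0.2 * log_positive + rng.normal(0.0, 1.0, n)

def fit(imputed_log_zero):
    """OLS of outcome on a constant, a borrowing dummy, and log credit,
    with `imputed_log_zero` standing in for log(0) among non-borrowers."""
    log_credit = np.where(borrows, log_positive, imputed_log_zero)
    X = np.column_stack([np.ones(n), borrows.astype(float), log_credit])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef  # [intercept, dummy, slope on log credit]

def effect_of_borrowing(coef, imputed_log_zero, amount):
    """Predicted change from no borrowing to borrowing `amount`."""
    _, dummy, slope = coef
    return dummy + slope * (np.log(amount) - imputed_log_zero)

k_a, k_b = 0.0, np.log(100.0)  # two arbitrary imputations for log(0)
coef_a, coef_b = fit(k_a), fit(k_b)

print("dummy coefficients:", coef_a[1], coef_b[1])
print("slopes on log credit:", coef_a[2], coef_b[2])
print("effect of borrowing 3000:",
      effect_of_borrowing(coef_a, k_a, 3000.0),
      effect_of_borrowing(coef_b, k_b, 3000.0))
```

The slope on log credit is identical across the two runs, the dummy coefficient moves by the slope times the change in the imputed value, and the combined effect is unchanged.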

Interestingly, since P&K treat credit as endogenous, they model it too—it is also a left-side variable. They use Tobit, which seems to deal with the 0’s better than a game like inverse hyperbolic sine. Of course, Tobit has its own problems, like inconsistency in the presence of heteroskedasticity, which I guess is what leads in the direction of quantile regressions. The point is, it seems better in the ideal to confront the question of how to model the troublesome variable than to dodge with a clever formula.

–David