(Fruit Is Not Like Mother's Milk--continued, Part H)


Defects of an Alleged Statistical "Proof" that Fruit and Mother's Milk are Similar

Outline of the "obligate crank science proof"

An alleged statistical "proof" circulates in fruitarian circles; it claims that the nutritional composition of mother's milk is "closest" to that of fruit. The proof uses correlation and covariance, and the claim is made that these are "robust" methods for comparing foods.

The alleged proof is probably the most detailed analysis ever presented by a fruitarian in an attempt to support the claim that fruit is similar to mother's milk. That is, if the proof were valid, it would be one of the few pieces of evidence anyone has been able to put forward in support of the "fruit is like mother's milk" theory. Let's review the major steps in the alleged proof so that if you happen to encounter it or those promoting it, you'll recognize the "proof" (and its numerous problems) as the make-believe counterfeit that it is.

  1. The food composition (nutrient) profiles from USDA Handbook 8 were averaged in broad categories: fruit, dairy, beef, grain, etc. The profile for human milk is a single food nutrient profile and is not averaged.

  2. Correlations were computed between the human milk profile and the average category profiles calculated in step 1. The correlations were highest between human milk and two categories: fruit (0.93) and poultry (0.96).

  3. The covariances between the human milk profile and the two categories mentioned above were examined; the covariance between human milk and fruit was 3162; and between human milk and poultry, 6853. As the value of the milk-fruit covariance is lower, the claim is then made that fruit is "closer" to milk than (any) other foods.
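
The three steps above can be sketched roughly as follows. This is only an illustration of the general procedure, not the code used in the "proof": the three-nutrient profiles are invented (they are not Handbook 8 data), and the helper names (mean, cov, corr, category_average) are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Covariance estimated as Avg(X*Y) - Avg(X)*Avg(Y)."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def corr(xs, ys):
    """Correlation as covariance divided by the product of standard deviations."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

def category_average(profiles):
    """Step 1: average several nutrient profiles, nutrient by nutrient."""
    return [mean(column) for column in zip(*profiles)]

# Invented three-nutrient profiles, for illustration only.
milk = [88.0, 7.0, 4.0]
fruit_items = [[85.0, 14.0, 0.2], [90.0, 9.0, 0.4]]

fruit_avg = category_average(fruit_items)   # step 1: category average
r = corr(milk, fruit_avg)                   # step 2: correlation with milk
c = cov(milk, fruit_avg)                    # step 3: covariance with milk
print(fruit_avg, r, c)
```

Note that the single milk profile is compared with an average, so the result depends entirely on which items go into that average.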

Summary analysis of the alleged statistical "proof"

As some readers may find it difficult to understand detailed discussions of statistics, this brief summary section is to give you an overall assessment of the proof, and a relatively non-technical, partial list of the major errors it contains. After that, the material is discussed in depth in two subsequent sections, which you can read or skip, as you wish.

Synopsis: As we'll see shortly, the above statistical "proof" of the similarity between milk and fruit is all of the following:

A partial list of the major errors in the proof is as follows:

In conclusion, the entire alleged statistical "proof" is essentially a farce, and a prime example of fallacious, crackpot crank science.

Let's now examine in depth the wide assortment of errors to be found in the alleged proof in the subsections that follow. The section immediately following discusses the numerous flaws due to statistical errors in the fallacious crank science proof. As a comparison, the subsequent, and final, section directly contrasts the crank science proof with the approach used in this paper. The discussion assumes that you are comfortable with statistical concepts. (If that does not describe you, you might prefer to skip the material.) However, note that even if you cannot completely follow the statistical concepts, the analysis of the myriad errors in the crank science proof may still help, by way of comparison, to illustrate the underlying reasoning behind the statistical approach used in this paper.

Errors of the "obligate crank science proof"

Errors in the use of category averages

Recall that the hypothesis of interest is that fruit--specifically raw, unprocessed fruit, consisting mostly of sweet fruit--is allegedly closer in composition to mother's milk than any other food. This implies that if you are going to compare mother's milk and fruit, then you want to compare milk against averages of commonly consumed raw fruits (i.e., as is done in this paper).

However, that is not what the extremist's proof does. Instead, it reportedly averages all the nutrient profiles in the fruit category of USDA Handbook 8. A quick inspection reveals that the handbook (and an average produced by averaging everything in the handbook) includes the following: raw fruits; cooked fruits; canned fruits, including fruit salads--in water packs, light sugar syrup, and heavy sugar syrup; dried fruit; fruit-juice concentrates (some of them undiluted); and frozen fruit. What's especially interesting here is that this mixed bag of fruit would not even meet the fruitarian's own criteria for what is acceptable fruit in a raw diet. The utter sloppiness of approach here indicates from the beginning, then, how little care or rigor is exercised in this "proof."

As the USDA handbook fruit category average, then, is not necessarily the same as an average of raw fruits, the data analyzed by the extremist are not the data needed to test the hypothesis of interest. In other words, the "wrong" data are analyzed, and the entire proof is invalid and irrelevant from this point on.

However, as there are many more interesting errors in the remaining structure of the proof, let us continue our discussion of the errors.

Note: Before leaving this topic, readers should be aware that the USDA handbook categories are very broad. For example, USDA Handbook 8-1 (dairy) includes eggs and egg products, non-dairy substitutes, and so on. Use of such broad averages obscures important detail. In particular, such averages are so broad that it would be (nearly) impossible for a person to eat all the items included in such an average. This reflects on the unrealistic nature of the analysis presented in the fallacious proof.

Possible errors in handling missing data

The "proof" reports that missing data are treated as zeroes for purposes of analysis. Of particular interest here is treating missing values as zeroes in computing averages. There is a subtle but statistically important difference between treating missing data as zeroes and excluding them from the analysis. The difference occurs in determining the value of N, which is the number of data points in the analysis, as used in calculation of sample means, variances, etc.

If missing data are treated as zeroes, the value of N will reflect both missing and non-missing data. If you exclude missing data, then the value of N reflects only the non-missing data. The end result is that counting missing values as zeroes increases the value of N and biases averages downward. It may also increase correlations simply because the items being compared have similar patterns of missing values.
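
A minimal sketch of the difference in N, using invented values (not taken from USDA Handbook 8). The function names are hypothetical, and None is assumed to mark a missing entry.

```python
# Hypothetical values of one nutrient across five foods in a category.
# None marks a missing entry; the numbers are invented for illustration.
values = [4.0, None, 6.0, None, 5.0]

def mean_missing_as_zero(xs):
    """Average with missing entries counted as 0, so N includes them."""
    filled = [0.0 if x is None else x for x in xs]
    return sum(filled) / len(filled)

def mean_excluding_missing(xs):
    """Average over the non-missing entries only, so N excludes missing ones."""
    present = [x for x in xs if x is not None]
    return sum(present) / len(present)

zero_mean = mean_missing_as_zero(values)      # N = 5
clean_mean = mean_excluding_missing(values)   # N = 3
print(zero_mean, clean_mean)                  # 3.0 5.0
```

With two of five entries missing, the zero-filled average is 3.0 while the average of the reported values is 5.0.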

Note that it is desirable to use data that are complete, or nearly so, in an analysis. In some cases that makes it appropriate to exclude an item from an averaged profile if it has "excessive" missing data.

Correlation and covariance: background information

Covariance is a gross measure of the joint variation of two variables. The theoretical definition is (notation explained after formula):

Cov(X,Y) = E[ (X-E(X)) * (Y-E(Y)) ]

Cov(X,Y) = covariance between variables X, Y; and,
E(*) is the expected value (expectation) operator, i.e., in this case, the true mean.

Readers unfamiliar with the E(*) notation can simply substitute Avg(*), or average, for the expectation operator. Thus, for example, (X-E(X)) simply means the value of X, with its average subtracted from it.

It is also the case that:

Cov(X,Y) = E[X*Y] - ( E(X)*E(Y) ),

which when estimated yields the standard formula:

Cov(X,Y) = Avg(X*Y) - ( Avg(X)*Avg(Y) )

where Avg(*) is the average, i.e., arithmetic mean.

Note the relationship between variance and covariance:

Variance of X = Var(X) = Cov(X,X),

and note the terminology: SD(X) = standard deviation of X = square root of Var(X).

Note that covariance changes when data are rescaled:

Cov(aX,cY) = a*c * Cov(X,Y)

where a,c are arbitrary coefficients.
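
As a quick numerical check of the formulas above, here is a small sketch with invented data. The function names are hypothetical, and the averages are taken over all N points (i.e., the Avg(*) form given above, not the N-1 sample form).

```python
def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Covariance as Avg(X*Y) - Avg(X)*Avg(Y)."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def cov_centered(xs, ys):
    """Same quantity via the definition: Avg((X - Avg(X)) * (Y - Avg(Y)))."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

# Invented data, for illustration only.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 9.0, 10.0]
a, c = 3.0, 0.5   # arbitrary rescaling coefficients

base = cov(x, y)
rescaled = cov([a * v for v in x], [c * v for v in y])
print(base, cov_centered(x, y), rescaled, a * c * base)
```

Both forms of the covariance agree (7.5 for these data), and the rescaled covariance equals a*c times the original.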

Correlation is a measure of the linear relationship between two variables, X and Y. If the relationship between two variables X and Y is nearly a straight line (of non-zero slope), then the correlation will be close to 1 or -1. Correlation and covariance are closely related:

Corr(X,Y) = Cov(X,Y)/[SD(X)*SD(Y)]

where Corr(X,Y) = Correlation between variables X and Y, and SD(X) is the standard deviation of variable X.

Correlation is always in the range of -1 to +1. Hence, in a figurative sense, correlation can be considered to be a sort of "standardized" covariance.
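
The "standardized" nature of correlation can be illustrated with a short sketch. The data are invented and the function names are hypothetical; the point is only that rescaling one variable by a positive factor changes the covariance but leaves the correlation unchanged (a negative factor would flip its sign).

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Covariance as Avg(X*Y) - Avg(X)*Avg(Y)."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def corr(xs, ys):
    """Corr(X,Y) = Cov(X,Y) / (SD(X) * SD(Y)), with SD(X) = sqrt(Cov(X,X))."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

# Invented data, for illustration only.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 9.0, 10.0]
y_scaled = [0.46 * v for v in y]   # rescale Y by a positive factor

r = corr(x, y)
r_scaled = corr(x, y_scaled)
print(r, r_scaled)                        # same correlation
print(cov(x, y), cov(x, y_scaled))        # different covariances
```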

For a nice introduction to the topics of correlation and covariance, see:
DeGroot MH (1975) Probability and Statistics, Addison-Wesley Publishing, pp. 172-178.

Errors in the use of correlation:
Basic errors

The crank science proof uses correlations to compare milk and the derived category averages. However, let us consider the following.

Errors in the use of correlation:
An unproven, implicit assumption of the "proof"

The use of correlations in the alleged proof is based on a very important, implicit structural assumption. Given a suitable data set with two variables, then the calculation of correlations and other statistics is an objective exercise. However, note the term "suitable data set"--when do you have enough data? What is "suitable" in this context?

The implicit assumption--and a highly debatable assumption--made in the crank science proof is that the USDA nutrient list is both necessary and sufficient (in the strict logical sense) for an analytical comparison of two foods. An obvious defense of the USDA handbook is that it is a standard document, it includes many nutrients, etc. However, the USDA handbook has some very serious shortcomings for a comparison of fruit against other foods. In the USDA handbooks, the sugars are not broken out by type. Instead, all sugars and starches are listed combined, simply as total carbohydrate. This prevents one from comparing types of sugars, amount of starch, etc. In fact, the lack of such data in the USDA handbook is the reason I used the German (Scherz et al.) data for my analysis--i.e., the German data specifically included the (very important) breakdown of sugars/starch, by type.

If one compares the USDA and German tables, one notices many structural differences. Similar remarks apply if one compares these two tables to other standard tables (e.g., British), or to published papers in the journals. The conclusion one quickly reaches is that there is no universally acknowledged, complete, "standard" nutrient list for food composition. Correlations based on different nutrient lists could yield different results.

Thus the allegedly objective analysis--correlations--presented in the "proof" is actually based on a table that is at least partially subjective in construction. Further, the completeness of the table is highly debatable.

Remark: Inasmuch as regression and correlation are closely related, I deliberately limited regression analysis in this paper to narrowly defined data sets of closely related quantities, all of which were measured in the same units (amino acid profile, fatty acid profile).

Errors in the use of correlation:
Why the "proof" probably cannot be "fixed"

Let's assume that you can somehow get a nutrient list that will be accepted as standard by most observers. Let's also assume you have data in the format of such a list, and you want to compare (despite the limitations therein) the nutrient lists (foods) via correlation. How, then, can you "fix" the problems found in the fallacious crank science proof? Let's consider some possible fixes and see why they won't work well, if at all.

Fix 1: Drop energy and vitamin A data. Convert all remaining data (all of which are weights) into one measure--grams--and do correlations.

This gives you data all in one measure--grams. However, it gives you lists that contain a few numbers well above zero (proximate composition--grams of water, fat, protein, carbohydrate, etc.), with the rest of the numbers, in relative terms, nearly zero (i.e., mg and mcg converted to grams). Because of these magnitude differences within the lists, the correlations observed will be a function of the few numbers well above zero, as the rest of each list is nearly zero. (That is, (X-Avg(X)) in the calculation of the correlations is nearly constant for X "small," i.e., mg/mcg terms converted to grams.) Yet depending on the vitamin or mineral in question (those quantities usually measured in mg or mcg), certain nutrients can have significant effects at very low concentrations. Thus, in effect, the "importance" of nutrients that occur at low relative concentrations is unjustifiably "discounted" by this approach, in terms of their effect on correlation. Hence this fix does not work, as the correlations computed under such conditions really don't reflect the entire data set.
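
The magnitude problem can be illustrated with a small sketch. Both profiles below are invented (they are not real food data): the first four entries stand in for proximate composition in grams, and the last four stand in for micronutrients converted from mg to grams. The micronutrient patterns are deliberately made opposite in the two foods.

```python
import math

def corr(xs, ys):
    """Pearson correlation computed from centered sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented profiles per 100 g, all in grams.
# Entries 0-3: water, carbohydrate, protein, fat.
# Entries 4-7: four micronutrients, originally mg, converted to grams.
food_a = [85.0, 10.0, 3.0, 2.0, 0.050, 0.001, 0.020, 0.002]
food_b = [84.0, 11.0, 2.0, 3.0, 0.001, 0.050, 0.002, 0.020]

r_all = corr(food_a, food_b)            # driven by the proximate entries
r_micro = corr(food_a[4:], food_b[4:])  # micronutrients alone
print(r_all, r_micro)
```

For these invented lists the overall correlation is above 0.99 even though the micronutrient entries, taken alone, are negatively correlated. The small entries have almost no influence on the overall result.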

Fix 2: Eliminate units of measure by converting the nutrient lists into lists of ratios, i.e., (nutrient list value)/(index list value). The candidates for serving as an "index list" are as follows.

  1. List of RDA/RDI values for nutrients.
  2. The nutrient list for human milk.
  3. Another constructed list.


  1. RDA/RDI lists: Both the nutrients and values therein are controversial and incomplete. There is no RDA/RDI for many important nutrients, and the different authorities often disagree as to the values for the RDA/RDI. You will not get anything approaching a consensus using this approach.

  2. If human milk is used for calculating ratios, then you end up trying to calculate a correlation between a nutrient ratio list--milk--that is constant, i.e. all 1's, versus other nutrient lists. A constant list has variance (and standard deviation) of zero, hence to compute correlation, you end up dividing by zero--which is mathematically undefined. This approach definitely won't work.

  3. It may be very difficult to produce an index list that is internally consistent and that can be defended statistically/logically. The problem that milk has lactose, and other foods do not, presents some major challenges in constructing such a list. An overall average or "small" reference value for lactose would yield a high lactose content ratio for dairy, zero for other foods. This could influence correlations.

    Of course, the greatest challenge in producing any index list to use in calculating ratios for analysis is that any/all index lists constructed are subject to challenge, both in regard to index list values and the nutrients included in such an index list. Also, correlation is limited as an analysis tool, per the remarks above (it does not accurately reflect the important differences in proximate composition, sugar breakdown, etc.). An analysis that is limited to correlation (and covariance) alone, and that does not consider proximate composition and carbohydrate breakdown (e.g., the crank science proof) is an incomplete and invalid analysis.
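
The division-by-zero problem with human milk as the index list (candidate 2 above) can be seen in a short sketch. The nutrient values are invented, and corr_or_none is a hypothetical helper that returns None instead of dividing by zero.

```python
import math

def corr_or_none(xs, ys):
    """Pearson correlation, or None when either list has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None   # correlation is undefined: SD of a constant list is 0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Invented nutrient values, for illustration only.
milk = [88.0, 7.0, 1.0, 4.0]
other = [85.0, 12.0, 0.5, 0.3]

# Ratios against milk as the index list: milk itself becomes all 1's.
milk_ratio = [m / m for m in milk]
other_ratio = [o / m for o, m in zip(other, milk)]

result = corr_or_none(milk_ratio, other_ratio)
print(milk_ratio, result)   # [1.0, 1.0, 1.0, 1.0] None
```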

Errors in covariance and conclusions of the "proof"

The claim in the proof that the lower covariance for milk and fruit (vs. milk and poultry) suggests the numbers for milk and fruit are lower in magnitude (hence may be "closer" in gross terms) is approximately correct. However, the covariances are irrelevant because the average used for fruit is wrong, and the use of correlations is invalid per the above discussion. Hence the conclusion of the proof is logically and statistically invalid and does not apply.

However, we can have some fun with the invalid logic of the proof. Recall that covariance changes with scale, but correlation does not. Now consider the covariances calculated in the proof: Cov(M,F) = 3162, and Cov(M,P) = 6853, where M = milk, F = fruit, and P = poultry. Let's use the result that Cov(aX,cY) = a*c*Cov(X,Y), where we let a = 1, c = (3162/6853) = ~0.46. We then have:

Cov(M,cP) = 3162,

and Corr(M,cP) = Corr(M,P) = 0.96,

versus Cov(M,F) = 3162, and Corr(M,F) = 0.93.

That is, using the results from the crank science proof, the best match of foods is not 100 g of milk vs. 100 g of fruit, but instead 100 g of milk vs. 46 g of poultry!

Remark: As correlation does not change with rescaling but covariance does, the analytical advantage of poultry (higher correlation) can be maintained. If fruit is rescaled to yield a different covariance, simply rescale poultry yet again to match the fruit covariance.

To summarize the above in plain language: the covariance results of the crank science proof, rescaled, suggest that poultry, in smaller amounts, is a better match to milk than is fruit.
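
The rescaling arithmetic above can be checked in a few lines. The covariances and correlations are the values reported in the "proof" as quoted in this section; the variable names are hypothetical.

```python
# Values reported in the "proof" (M = milk, F = fruit, P = poultry).
cov_milk_fruit = 3162.0
cov_milk_poultry = 6853.0
corr_milk_fruit = 0.93
corr_milk_poultry = 0.96

# Choose c so that the rescaled poultry covariance matches the fruit covariance.
c = cov_milk_fruit / cov_milk_poultry

# Cov(aX, cY) = a*c*Cov(X,Y), with a = 1.
cov_milk_scaled_poultry = 1.0 * c * cov_milk_poultry

# Correlation is unchanged by the (positive) rescaling, so it stays at 0.96.
corr_milk_scaled_poultry = corr_milk_poultry

grams_poultry = 100.0 * c   # poultry amount matched against 100 g of milk
print(round(c, 2), round(grams_poultry))   # 0.46 46
```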

Differences: crank science "proof" versus this paper

Here we'll look at some of the major differences between the analytical approach of this paper and the confused, sloppy approach of the crank science proof that uses correlation and covariance. Such an analysis reveals the following.

Flawed approach of the fruitarian "proof"

Approach of the analysis in this paper

In sharp contrast to the above, the analysis in this paper emphasizes basic, common sense points, as follows.

--Tom Billings
