(Fruit Is Not Like Mother's Milk--continued, Part H)

APPENDIX

Defects of an Alleged Statistical "Proof" that Fruit and Mother's Milk are Similar

Outline of the "obligate crank science proof"

An alleged statistical "proof" circulates in fruitarian circles, claiming that the nutritional composition of mother's milk is "closest" to that of fruit. The proof uses correlation and covariance, and the claim is made that these are "robust" methods for comparing foods.

The alleged proof is probably the most detailed analysis ever presented by a fruitarian in an attempt to support the claim that fruit is similar to mother's milk. That is, if the proof were valid, it would be one of the few pieces of evidence anyone has been able to put forward in support of the "fruit is like mother's milk" theory. Let's review the major steps in the alleged proof so that if you happen to encounter it or those promoting it, you'll recognize the "proof" (and its numerous problems) as the make-believe counterfeit that it is.

1. The food composition (nutrient) profiles from USDA Handbook 8 were averaged in broad categories: fruit, dairy, beef, grain, etc. The profile for human milk is a single food nutrient profile and is not averaged.

2. Correlations were computed between the human milk profile and the average category profiles calculated in step 1. The correlations were highest between human milk and two categories: fruit (0.93) and poultry (0.96).

3. The covariances between the human milk profile and the two categories mentioned above were examined; the covariance between human milk and fruit was 3162, and between human milk and poultry, 6853. As the value of the milk-fruit covariance is lower, the claim is then made that fruit is "closer" to milk than (any) other foods.

Summary analysis of the alleged statistical "proof"

As some readers may find it difficult to follow detailed discussions of statistics, this brief summary section gives an overall assessment of the proof, and a relatively non-technical, partial list of the major errors it contains. After that, the material is discussed in depth in two subsequent sections, which you can read or skip, as you wish.

Synopsis: As we'll see shortly, the above statistical "proof" of the similarity between milk and fruit is all of the following:

• Incorrect and/or invalid at every step.

• Fallacious, hence does not prove anything.

• In my opinion, a prime example of crank science--indeed, it is one of the worst examples of crank science that I have ever seen.

A partial list of the major errors in the proof is as follows:

• The nutrient profiles used are category averages from the USDA handbooks. Thus, the "fruit" profile is actually not an average of only raw fruits, but includes processed fruits as well. In other words, the data are not what they need to be to test the hypothesis. From this point on (i.e., from the very beginning of the proof), the analysis is invalid--after all, an analysis done on the wrong data does not prove anything.

• Treating missing data as zero may introduce bias and increase correlations.

• Correlation is not an appropriate way to compare lists (nutrient profiles) that lack internal consistency; i.e., the USDA nutrient profiles are dissimilar items measured in 7 different units. This assertion follows from the definition of correlation, and is discussed in detail further below.

• The use of correlation in the proof is based on a major structural assumption, which also happens to be an implicit assumption. The result is that the proof is not as objective as the fruitarian extremist might want you to believe it to be.

• In figurative terms, correlation can be interpreted as a kind of "standardized covariance"; i.e., correlation and covariance are related. Note that covariance changes when a variable is rescaled (multiplied by a coefficient); correlation does not. The fact that covariance is dependent on scale makes its use in the proof inappropriate (discussed below).

• The combination of correlation and covariance does not "prove" that milk is "closer" to fruit than to poultry. As previously mentioned, correlation as used in the proof is inappropriate.

• Humorous but true: If one pretends that the data, method, and results from the proof are actually valid, then use of the rescaling property of covariance "proves" that 100 g of mother's milk is "closest" to 46 g of poultry--not (100 g of) fruit!

In conclusion, the entire alleged statistical "proof" is essentially a farce, and a prime example of fallacious, crackpot crank science.

Let's now examine in depth the wide assortment of errors in the alleged proof. The section immediately following discusses the numerous statistical errors in the fallacious crank science proof. The subsequent, and final, section directly contrasts the crank science proof with the approach used in this paper. The discussion assumes that you are comfortable with statistical concepts. (If that does not describe you, you might prefer to skip the material.) However, even if you cannot completely follow the statistical concepts, the analysis of the myriad errors in the crank science proof may still help, by way of comparison, to illustrate the reasoning behind the statistical approach used in this paper.

Errors of the "obligate crank science proof"

Errors in the use of category averages

Recall that the hypothesis of interest is that fruit--specifically raw, unprocessed fruit, consisting mostly of sweet fruit--is allegedly more similar, or "closest," in composition to mother's milk than any other food. This implies that if you are going to compare mother's milk and fruit, you want to compare milk against averages of commonly consumed raw fruits (i.e., as is done in this paper).

However, that is not what the extremist's proof does. Instead, it reportedly averages all the nutrient profiles in the fruit category of USDA Handbook 8. A quick inspection reveals that the handbook (and an average produced by averaging everything in the handbook) includes the following: raw fruits; cooked fruits; canned fruits, including fruit salads--in water packs, light sugar syrup, and heavy sugar syrup; dried fruit; fruit-juice concentrates (some of them undiluted); and frozen fruit. What's especially interesting here is that this mixed bag of fruit would not even meet the fruitarian's own criteria for what is acceptable fruit in a raw diet. The utter sloppiness of approach here indicates from the beginning, then, how little care or rigor is exercised in this "proof."

As the USDA handbook fruit category average, then, is not necessarily the same as an average of raw fruits, the data analyzed by the extremist are not the data needed to test the hypothesis of interest. In other words, the "wrong" data are analyzed, and the entire proof is invalid and irrelevant from this point on.

However, as there are many more interesting errors in the remaining structure of the proof, let us continue our discussion of the errors.

Note: Before leaving this topic, readers should be aware that the USDA handbook categories are very broad. For example, USDA Handbook 8-1 (dairy) includes eggs and egg products, non-dairy substitutes, and so on. Use of such broad averages obscures important detail. In particular, such averages are so broad that it would be (nearly) impossible for a person to eat all the items included in such an average. This reflects the unrealistic nature of the analysis presented in the fallacious proof.

Possible errors in handling missing data

The "proof" reports that missing data are treated as zeroes for purposes of analysis. Of particular interest here is treating missing values as zeroes in computing averages. There is a subtle but statistically important difference between treating missing data as zeroes and excluding them from the analysis. The difference occurs in determining the value of N, which is the number of data points in the analysis, as used in the calculation of sample means, variances, etc.

If missing data is treated as zeroes, the value of N will reflect both missing and non-missing data. If you exclude missing data, then the value of N reflects only the non-missing data. The end result is that including missing values as zero can increase the value of N, and bias averages downward--hence, may increase correlations simply due to similar patterns of missing values in items being compared.
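The effect of the choice of N can be illustrated with a minimal Python sketch. All numbers below are hypothetical and chosen only for illustration (they are not taken from the USDA handbook or from the "proof"), and the helper names are assumptions of this sketch. It shows a zero-filled average coming out lower than the average of the measured values, and two profiles that share a pattern of missing values gaining a positive correlation once the gaps are filled with zeroes.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def zero_filled(values):
    """Treat missing entries (None) as zeroes; N counts them."""
    return [0.0 if v is None else v for v in values]

def drop_missing_pairs(x, y):
    """Keep only positions where both lists have a value; N shrinks."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return [a for a, _ in pairs], [b for _, b in pairs]

# Hypothetical values of one nutrient for four foods in a category;
# two of the four were never measured.
nutrient = [50.0, 30.0, None, None]
mean_with_zeroes = mean(zero_filled(nutrient))                  # N = 4
mean_excluding = mean([v for v in nutrient if v is not None])   # N = 2

# Two hypothetical profiles that are missing the same three entries.
profile_x = [5.0, 1.0, 4.0, None, None, None]
profile_y = [1.0, 5.0, 3.0, None, None, None]
r_zero_filled = pearson(zero_filled(profile_x), zero_filled(profile_y))
x_seen, y_seen = drop_missing_pairs(profile_x, profile_y)
r_excluding = pearson(x_seen, y_seen)

print(mean_with_zeroes, mean_excluding)
print(r_zero_filled, r_excluding)
```

In this made-up case the measured values are negatively related, yet the zero-filled lists correlate positively, purely because both lists are zero in the same places.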

Note that it is desirable to use data that is complete, or nearly so, in an analysis. In some cases that makes it appropriate to exclude an item from an averaged profile if it has "excessive" missing data.

Correlation and covariance: background information

Covariance is a gross measure of the joint variation of two variables. The theoretical definition is (notation explained after formula):

Cov(X,Y) = E[ (X-E(X)) * (Y-E(Y)) ]

where:
Cov(X,Y) = covariance between variables X, Y; and,
E(*) is the expected value (expectation) operator, i.e., in this case, the true mean.

Readers unfamiliar with the E(*) notation can simply substitute Avg(*), or average, for the expectation operator. Thus, for example, (X-E(X)) simply means the value of X, with its average subtracted from it.

It is also the case that:

Cov(X,Y) = E[X*Y] - ( E(X)*E(Y) ),

which when estimated yields the standard formula:

Cov(X,Y) = Avg(X*Y) - ( Avg(X)*Avg(Y) )

where Avg(*) is the average, i.e., arithmetic mean.
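As a quick check that the two forms above agree, here is a small Python sketch. The data values and function names are hypothetical, and the average divides by N (matching the Avg formula above) rather than by N-1 as some sample estimators do.

```python
import math

def avg(values):
    return sum(values) / len(values)

def cov_definition(x, y):
    """Avg[(X - Avg(X)) * (Y - Avg(Y))]"""
    mx, my = avg(x), avg(y)
    return avg([(a - mx) * (b - my) for a, b in zip(x, y)])

def cov_shortcut(x, y):
    """Avg(X*Y) - Avg(X)*Avg(Y)"""
    return avg([a * b for a, b in zip(x, y)]) - avg(x) * avg(y)

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 5.0, 10.0]

print(cov_definition(x, y), cov_shortcut(x, y))
```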

Note the relationship between variance and covariance:

Variance of X = Var(X) = Cov(X,X),

and note the terminology: SD(X) = standard deviation of X = square root of Var(X).

Note that covariance changes when data are rescaled:

Cov(aX,cY) = a*c * Cov(X,Y)

where a,c are arbitrary coefficients.

Correlation is a measure of the linear relationship between two variables, X and Y. If the relationship between two variables X and Y is nearly a straight line (of non-zero slope), then the correlation will be close to 1 or -1. Correlation and covariance are closely related:

Corr(X,Y) = Cov(X,Y)/[SD(X)*SD(Y)]

where Corr(X,Y) = Correlation between variables X and Y, and SD(X) is the standard deviation of variable X.

Correlation is always in the range of -1 to +1. Hence, in a figurative sense, correlation can be considered to be a sort of "standardized" covariance.
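The contrast between the two measures under rescaling can be sketched in Python as follows. The data and the coefficients a and c are hypothetical. Both coefficients are positive here; if the product a*c were negative, the sign of the correlation would flip while its magnitude stayed the same.

```python
import math

def avg(values):
    return sum(values) / len(values)

def cov(x, y):
    return avg([a * b for a, b in zip(x, y)]) - avg(x) * avg(y)

def corr(x, y):
    # Corr(X,Y) = Cov(X,Y) / [SD(X) * SD(Y)], with Var(X) = Cov(X,X)
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

# Hypothetical data and positive rescaling coefficients.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 5.0, 10.0]
a, c = 3.0, 0.5

ax = [a * v for v in x]
cy = [c * v for v in y]

print(cov(x, y), cov(ax, cy))    # covariance is multiplied by a*c
print(corr(x, y), corr(ax, cy))  # correlation is unchanged
```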

For a nice introduction to the topics of correlation and covariance, see:
DeGroot M (1975) Probability and Statistics, Addison-Wesley Publishing, pp. 172-178.

Errors in the use of correlation: PART 1. Basic errors

The crank science proof uses correlations to compare milk and the derived category averages. However, let us consider the following.

• The USDA nutrient profiles are a collection of different nutrient values measured in 7 different units: weight (gm, mg, mcg); energy levels (kcal, kJ); vitamin A (RE, IU).

• The definition/formula for covariance, variance, and correlation all involve calculating a mean and subtracting it from the raw data. Consider the mean or (internal) average of the set of numbers in any one specific (USDA) nutrient profile (whether a solo food or category average does not matter). What units does that average/mean have? Obviously, one cannot assign any units to such an average, as it is a mixture of numbers displaying 7 different types of units, per above. Hence, quantities like E(X) or Avg(X), in the calculation of covariance, variance, and correlation are not meaningful. It follows from this that covariance, variance, and correlation are not meaningful in this context, and that the use of covariance and correlation in the proof is inappropriate.

In other words, in the alleged proof, the fruitarian extremist applied a statistical technique without bothering with the critical detail of whether it is appropriate for the data. (By the way, in preparing to do the analysis given in this paper, I considered, then rejected, the use of correlations across entire nutrient lists, for this very reason, and additional reasons discussed below.)
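One way to see the mixed-units problem concretely is the following Python sketch. The two profiles are hypothetical (invented numbers for two unnamed foods, not USDA data), and the function names are assumptions of the sketch. It computes the correlation across a mixed-unit list, then re-expresses the single energy entry in kJ instead of kcal, an equally valid choice of unit, and recomputes.

```python
import math

def avg(values):
    return sum(values) / len(values)

def corr(x, y):
    mx, my = avg(x), avg(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical profiles; the entries are, in order:
# water (g), protein (g), fat (g), carbohydrate (g),
# energy (kcal), calcium (mg), vitamin A (IU)
food_a = [88.0, 1.0, 4.4, 6.9, 70.0, 32.0, 240.0]
food_b = [84.0, 0.8, 0.4, 14.0, 55.0, 12.0, 300.0]

KJ_PER_KCAL = 4.184
ENERGY_INDEX = 4

def energy_in_kj(profile):
    """Same food, same energy content, expressed in kJ rather than kcal."""
    converted = list(profile)
    converted[ENERGY_INDEX] = profile[ENERGY_INDEX] * KJ_PER_KCAL
    return converted

r_kcal = corr(food_a, food_b)
r_kj = corr(energy_in_kj(food_a), energy_in_kj(food_b))

print(r_kcal, r_kj)
```

The foods have not changed, only the label on one line of the list, yet the "similarity" reported by the correlation moves. A statistic that depends on an arbitrary choice of units in this way is not measuring a property of the foods.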

• Characteristics of the USDA nutrient profiles that impact data analysis. The USDA food profile data for food energy are linear functions (more precisely, rescalings) of each other (i.e., the same basic quantity in related units: kcal and kJ); similar remarks apply to the data for vitamin A, measured in RE and IU. The proximate composition data (less the food energy data) is linearly dependent, as it must add to 100 g in most cases. Further, the energy data--whether in kcal or kJ, is approximately a linear function of the proximate composition data (i.e., calories are a function of the fat, carbohydrate, and protein content of a food). The result of these redundancies and internal linear relationships in the USDA nutrient profiles may be to increase correlation and to introduce bias.

• Correlations are given and references to (implicitly) significant differences in correlations are suggested in the (very poorly written) "proof," all without any formal tests of significance. Such an approach is necessarily incomplete and dubious.

• An important structural limit in applying correlations to nutrient lists (not an error): The definition of correlation is such that it does not acknowledge differences in the importance of the factors used in computation. In other words, in this case, the grams of fat and sugar (which are critical differences) are given no more weight in calculating correlations than the levels of vitamins A and C. However, as milk is a fatty food, and fruit a sweet food, such differences are of great real-world importance. That is why I analyzed the proximate composition (fat, carbohydrate, protein) in calories, separately, in my paper. Ignoring such differences by limiting analysis to correlations (and covariances) alone is irrational and produces an incomplete and misleading analysis.

Errors in the use of correlation: PART 2. An unproven, implicit assumption of the "proof"

The use of correlations in the alleged proof is based on a very important, implicit structural assumption. Given a suitable data set with two variables, the calculation of correlations and other statistics is an objective exercise. However, note the term "suitable data set"--when do you have enough data? What is "suitable" in this context?

The implicit assumption--and a highly debatable assumption--made in the crank science proof is that the USDA nutrient list is both necessary and sufficient (in the strict logical sense) for an analytical comparison of two foods. An obvious defense of the USDA handbook is that it is a standard document, it includes many nutrients, etc. However, the USDA handbook has some very serious shortcomings for a comparison of fruit against other foods. In the USDA handbooks, the sugars are not broken out by type. Instead, all sugars and starches are listed combined, simply as total carbohydrate. This prevents one from comparing types of sugars, amount of starch, etc. In fact, the lack of such data in the USDA handbook is the reason I used the German (Scherz et al.) data for my analysis--i.e., the German data specifically included the (very important) breakdown of sugars/starch, by type.

If one compares the USDA and German tables, one notices many structural differences. Similar remarks apply if one compares these two tables to other standard tables (e.g., British), or to published papers in the journals. The conclusion one quickly reaches is that there is no universally acknowledged, complete, "standard" nutrient list for food composition. Correlations based on different nutrient lists could yield different results.

Thus the allegedly objective analysis--correlations--presented in the "proof" is actually based on a table that is at least partially subjective in construction. Further, the completeness of the table is highly debatable.

Remark: Inasmuch as regression and correlation are closely related, I deliberately limited regression analysis in this paper to narrowly defined data sets of closely related quantities, all of which were measured in the same units (amino acid profile, fatty acid profile).

Errors in the use of correlation: PART 3. Why the "proof" probably cannot be "fixed"

Let's assume that you can somehow get a nutrient list that will be accepted as standard by most observers. Let's also assume you have data in the format of such a list, and you want to compare (despite the limitations therein) the nutrient lists (foods) via correlation. How, then, can you "fix" the problems found in the fallacious crank science proof? Let's consider some possible fixes and see why they won't work well, if at all.

Fix 1: Drop energy and vitamin A data. Convert all remaining data (all of which are weights) into one measure--grams--and do correlations.

This gives you data all in one measure--grams. However, it gives you lists that contain a few numbers well above zero (proximate composition--grams of water, fat, protein, carbohydrate, etc.), with the rest of the numbers, in relative terms, nearly zero (i.e., mg and mcg converted to grams). Because of these magnitude differences within the lists, the correlations observed will be a function of the few numbers well above zero, as the rest of each list is nearly zero. (That is, (X-Avg(X)) in the calculation of the correlations is nearly constant for X "small," i.e., mg/mcg terms converted to grams.) Yet depending on the vitamin or mineral in question (those quantities usually measured in mg or mcg), certain nutrients can have significant effects at very low concentrations. Thus, in effect, the "importance" of nutrients that occur at low relative concentrations is unjustifiably "discounted" by this approach, in terms of their effect on correlation. Hence this fix does not work, as the correlations computed under such conditions really don't reflect the entire data set.
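The following Python sketch illustrates the "discounting" described above. The profiles are hypothetical (all entries in grams, invented for illustration), as are the function names. The last three entries of each list stand in for mineral and vitamin amounts converted from mg or mcg to grams. Multiplying those three entries of one food by ten, a large nutritional change, barely moves the correlation.

```python
import math

def avg(values):
    return sum(values) / len(values)

def corr(x, y):
    mx, my = avg(x), avg(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical profiles, every entry in grams per 100 g of food:
# water, protein, fat, carbohydrate, then three micronutrients.
food_a = [88.0, 1.0, 4.4, 6.9, 0.032, 0.005, 0.00006]
food_b = [84.0, 0.8, 0.4, 14.0, 0.012, 0.05, 0.00001]

# The same food_b, but with ten times the micronutrient content.
N_MACRO = 4
food_b_rich = food_b[:N_MACRO] + [10.0 * v for v in food_b[N_MACRO:]]

r_original = corr(food_a, food_b)
r_tenfold = corr(food_a, food_b_rich)

print(r_original, r_tenfold, abs(r_original - r_tenfold))
```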

Fix 2: Eliminate units of measure by converting the nutrient lists into lists of ratios, i.e., (nutrient list value)/(index list value). The candidates for serving as an "index list" are as follows.

1. List of RDA/RDI values for nutrients.
2. The nutrient list for human milk.
3. Another constructed list.

1. RDA/RDI lists: Both the nutrients and values therein are controversial and incomplete. There is no RDA/RDI for many important nutrients, and the different authorities often disagree as to the values for the RDA/RDI. You will not get anything approaching a consensus using this approach.

2. If human milk is used for calculating ratios, then you end up trying to calculate a correlation between a nutrient ratio list--milk--that is constant, i.e. all 1's, versus other nutrient lists. A constant list has variance (and standard deviation) of zero, hence to compute correlation, you end up dividing by zero--which is mathematically undefined. This approach definitely won't work.

3. It may be very difficult to produce an index list that is internally consistent and that can be defended statistically/logically. The problem that milk has lactose, and other foods do not, presents some major challenges in constructing such a list. An overall average or "small" reference value for lactose would yield a high lactose content ratio for dairy, zero for other foods. This could influence correlations.
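The division-by-zero problem with candidate 2 (human milk as the index list) can be shown with a short Python sketch. The profile values are hypothetical, and `corr_or_none` is a helper invented for this sketch that returns None where the correlation is undefined.

```python
import math

def avg(values):
    return sum(values) / len(values)

def corr_or_none(x, y):
    """Pearson correlation, or None when either list has zero variance."""
    mx, my = avg(x), avg(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denominator = math.sqrt(sxx * syy)
    if denominator == 0.0:
        return None  # correlation is undefined: division by zero
    return sxy / denominator

# Hypothetical nutrient lists (same nutrients, same order, same units).
milk = [88.0, 1.0, 4.4, 6.9]
other_food = [84.0, 0.8, 0.4, 14.0]

# Ratios against the milk list: milk itself becomes a list of all 1's.
milk_ratios = [v / m for v, m in zip(milk, milk)]
other_ratios = [v / m for v, m in zip(other_food, milk)]

print(milk_ratios)
print(corr_or_none(milk_ratios, other_ratios))
```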

Of course, the greatest challenge in producing any index list to use in calculating ratios for analysis is that any/all index lists constructed are subject to challenge, both in regard to index list values and the nutrients included in such an index list. Also, correlation is limited as an analysis tool, per the remarks above (it does not accurately reflect the important differences in proximate composition, sugar breakdown, etc.). An analysis that is limited to correlation (and covariance) alone, and that does not consider proximate composition and carbohydrate breakdown (e.g., the crank science proof), is an incomplete and invalid analysis.

Errors in covariance and conclusions of the "proof"

The claim in the proof that the lower covariance for milk and fruit (vs. milk and poultry) suggests the numbers for milk and fruit are lower in magnitude (hence may be "closer" in gross terms) is approximately correct. However, the covariances are irrelevant because the average used for fruit is wrong, and the use of correlations is invalid per the above discussion. Hence the conclusion of the proof is logically and statistically invalid and does not apply.

However, we can have some fun with the invalid logic of the proof. Recall that covariance changes with scale, but correlation does not. Now consider the covariances calculated in the proof: Cov(M,F) = 3162, and Cov(M,P) = 6853, where M = milk, F = fruit, and P = poultry. Let's use the result that Cov(aX,cY) = a*c*Cov(X,Y), where we let a = 1, c = (3162/6853) = ~0.46. We then have:

Cov(M,cP) = 3162,

and Corr(M,cP) = Corr(M,P) = 0.96,

versus Cov(M,F) = 3162, and Corr(M,F) = 0.93.

That is, using the results from the crank science proof, the best match of foods is not 100 g of milk vs. 100 g of fruit, but instead 100 g of milk vs. 46 g of poultry!

Remark: As correlation does not change with rescaling but covariance does, the analytical advantage of poultry (higher correlation) can be maintained. If fruit is rescaled to yield a different covariance, simply rescale poultry yet again to match the fruit covariance.
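The arithmetic of the rescaling can be checked with a few lines of Python. The covariance and correlation figures are the ones reported in the "proof" (quoted above); the variable names are this sketch's own, and nothing here validates the figures themselves.

```python
import math

# Figures as reported in the "proof".
cov_milk_fruit = 3162.0
cov_milk_poultry = 6853.0
corr_milk_fruit = 0.93
corr_milk_poultry = 0.96

# Rescale poultry by c (and milk by a = 1): Cov(aM, cP) = a*c*Cov(M, P)
a = 1.0
c = cov_milk_fruit / cov_milk_poultry

cov_milk_rescaled_poultry = a * c * cov_milk_poultry
grams_of_poultry = 100.0 * c  # portion matching the fruit covariance

# Correlation is unaffected by the (positive) rescaling.
corr_milk_rescaled_poultry = corr_milk_poultry

print(round(c, 2), round(grams_of_poultry))
print(cov_milk_rescaled_poultry, corr_milk_rescaled_poultry)
```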

To summarize the above in plain language: the covariance results of the crank science proof, rescaled, suggest that poultry, in smaller amounts, is a better match to milk than is fruit.

Differences: crank science "proof" versus this paper

Here we'll look at some of the major differences between the analytical approach of this paper and the confused, sloppy approach of the crank science proof that uses correlation and covariance. Such an analysis reveals the following.

Flawed approach of the fruitarian "proof"

• The crank science proof used only the USDA handbook tables for analysis. As was discussed, these tables do not provide a breakdown of types of sugar. In contrast, the analysis in this paper uses the German tables, which generally provide a sugar breakdown. Such information is important, as fruit is of central interest here, and sugar is the major source of calories in most (but not all) fruits; also milk contains an "uncommon" sugar, lactose.

• The crank science proof used USDA handbook category averages--quite literally, the wrong data for fruit. In contrast, the analysis in this paper uses realistic averages of small numbers of raw fruits. The averages used in this paper were chosen based on real-world experience as a fruitarian (no fruitarian would eat an average of the USDA fruit handbook, as it includes many cooked and canned fruits). Additionally, the German data was augmented (where necessary) with data from the USDA and/or British tables, to provide relatively "complete" data for the analysis of this paper.

Note also that since fruitarians typically rely on a small number of (low-priced, in-season) fruits as staples, the approach in this paper--comparing milk to a realistic average of fruits--provides a real-world test of how the claim "fruit is just like mother's milk" actually holds up in practice.

• The crank science proof used only correlations and covariances, nothing else. The use of correlations in the proof is inappropriate for the reasons cited previously.

Approach of the analysis in this paper

In sharp contrast to the above, the analysis in this paper emphasizes basic, common sense points, as follows.

• The percentage breakdown by calories--proximate composition--is of great importance, and limiting analysis to correlations and covariances (per the crank science proof) effectively ignores the importance of the proximate composition. All of the hand-waving and bogus crank science statistics cannot overcome the simple reality that milk at 52.53% calories from fat is a "fatty food," and sweet fruits at 84.87% calories from sugar are a "sweet food," and that sugar and fat are indeed different (hence milk and fruit are different). This point is common sense and obvious to all, with the notable exception of the extremist fruitarian promoters of the fallacious crank science "fruit is just like mother's milk" theories.

• Types of sugars: The sugar breakdown is an explicit part of the analysis in this paper. This is important, as lactose (milk) and fructose (fruits) are slowly assimilated, while glucose and sucrose (fruits) are more rapidly assimilated.

Note also that an analysis based only on correlations and covariances does not give proper emphasis to differences in proximate composition or sugar breakdown, per the points above.

• Instead of general correlations, the analysis in this paper uses linear regression (a technique closely related to correlation), within limited, narrowly defined areas--the comparison of the amino acid and fatty acid profiles. Regression is reasonable in those situations as the data are all of the same type, closely related, and measured in the same units. Such narrow use of regression can be more easily defended, logically, than the muddled and uninformed crank science "proof's" approach of attempting correlation on nutrient profiles that are internally inconsistent, and which are arguably incomplete.

--Tom Billings
