
Measuring Biological Distance

The Chi-Square Test
	The chi-square test is a non-parametric technique with a distribution similar to the binomial distribution, but it is not limited to two variables. The test is flexible and easy to carry out, and it gives the probability that the differences between two samples are not due to random variation. It is effectively a comparison of the squared differences between observed and expected values, each divided by the expected value.

X2 = Σ [(Oi - Ei)2 / Ei]
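To make the basic statistic concrete, a minimal Python sketch follows (added here as an illustration only; the observed and expected counts are hypothetical):

    # Minimal sketch of the basic chi-square statistic (hypothetical counts).
    observed = [45, 30, 25]   # observed counts per class (made-up values)
    expected = [40, 35, 25]   # expected counts per class (made-up values)

    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    degrees_of_freedom = len(observed) - 1

    print(f"X2 = {chi_square:.4f} with {degrees_of_freedom} degrees of freedom")
    # The statistic would then be compared with a chi-square distribution table
    # at the chosen significance level (e.g. 5%).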
Dividing by Ei standardizes the statistic by weighting the contribution of each class, so that the largest classes do not necessarily produce the largest statistic (Thomas, 1976). The null hypothesis for calculating biological distance with the chi-square statistic is that the samples being compared come from the same original population. The statistic is then compared with the chi-square distribution tables to determine how likely such a value would be to occur by chance alone. A result that would be expected to occur randomly less than 5% of the time is considered significant enough to reject the hypothesis that there is no significant difference between the populations. An outcome that is not significant at the 5% level does not necessarily mean that the two samples are from the same population; it may instead mean that there is a close affinity between the populations and that the differences are slight.

The results from the chi-square test correlate well with those of other, more sophisticated methods, and the test is generally considered sensitive enough given the nature of the data (Constandse-Westermann, 1972). These methods all treat the traits as though they were independent of each other, which is not necessarily true. The chi-square statistic for biological distance is calculated with the following formula from Constandse-Westermann (1972):
Dk2 = Σ {(p1jk - p2jk)2 / pjkw}

To judge the significance of Dk2:

T2 = {(2n1 * 2n2) / (2n1 + 2n2)} * Dk2
where:
D = distance
p = the % of individuals possessing the trait
i = designates population
j = designates trait
k = class of trait

so pijk = the frequency of the k class of the j trait in the i population, and pjkw = the weighted expected value, calculated by:
(2n1j * p1jk + 2n2j * p2jk) / (2n1j + 2n2j)
(n = sample size)

This X2 value is then multiplied by 100 to standardize sample sizes, and is calculated for each trait between each pair of populations. All of the X2 values are added together to get the X2 statistic. The degrees of freedom are calculated as k-1, and the statistic with the given number of degrees of freedom is compared with the value from the distribution table at the 5% significance level. If the X2 statistic is greater than would be expected by chance, the null hypothesis is rejected.

Simple example of the chi-square test:

            p1jk      p2jk
            (n=301)   (n=93)
trait 1     0.220     0.218
trait 2     0.155     0.188

pjkw = (2n1j * p1jk + 2n2j * p2jk) / (2n1j + 2n2j)
(602*0.220 + 186*0.218) / 788 = 0.222   weighted mean for trait #1
(602*0.155 + 186*0.188) / 788 = 0.163   weighted mean for trait #2

Dk2 = Σ {(p1jk - p2jk)2 / pjkw}
{(0.220-0.218)2 / 0.222} + {(0.155-0.188)2 / 0.163} = 0.006699

T2 = {(2n1 * 2n2) / (2n1 + 2n2)} * Dk2
{(602*186) / (602+186)} * 0.006699 = 0.9519041, degrees of freedom = 1

The distribution table gives a value of 3.84146 for 5% significance with one degree of freedom, so the differences between the two populations are not significant at the 5% level for these two traits.
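This worked example can be reproduced with the short Python sketch below (added as an illustration, not part of the original method description); the slight differences from the figures above come from carrying the weighted means to full precision rather than rounding them to three decimals:

    # Sketch of the Constandse-Westermann chi-square distance for two populations.
    n1, n2 = 301, 93                      # sample sizes from the example above
    p1 = [0.220, 0.155]                   # trait frequencies in population 1
    p2 = [0.218, 0.188]                   # trait frequencies in population 2

    d_k2 = 0.0
    for p1jk, p2jk in zip(p1, p2):
        # weighted expected value pjkw
        pjkw = (2 * n1 * p1jk + 2 * n2 * p2jk) / (2 * n1 + 2 * n2)
        d_k2 += (p1jk - p2jk) ** 2 / pjkw

    # T2 judges the significance of Dk2 against the chi-square distribution.
    t2 = (2 * n1 * 2 * n2) / (2 * n1 + 2 * n2) * d_k2

    print(f"Dk2 = {d_k2:.6f}, T2 = {t2:.4f}")
    # With 1 degree of freedom the 5% critical value is 3.84146,
    # so these two traits do not differ significantly at the 5% level.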
Squared Difference

To get an overall idea of the resemblance between populations, the coefficient of relationship R can be calculated with the following formula:

R2 = d2jk / nj
d = (p1jk - p2jk)2 / n
variance of R = (variance of p1jk - variance of p2jk) / nj
Pooled Dispersion Matrix

In order to separate out the relative contributions of each trait to the difference between populations, the pooled dispersion matrix cjkl is calculated by this formula:
cjkl = {((n1)2 * a1jkl) + ... + ((ni)2 * aijkl)} / (n1 + ... + ni)

where aijkl = pijk (1 - pijk) / ni, in the form of a matrix, i.e.
[  x  -x ]
[ -x   x ]

An example of the pooled dispersion matrix (Constandse-Westermann, 1972), showing the relative contributions of the MN and ABO systems to the difference between two populations:

gene freq.   Pop1 (n=100)   Pop2 (n=200)   Pop3 (n=200)
M            0.20           0.50           0.30
N            0.80           0.50           0.70
p            0.40           0.30           0.50
q            0.10           0.30           0.20
r            0.50           0.40           0.30

aijkl for the MN system:
pop1 = (0.20*0.80)/100 = 0.0016
pop2 = (0.50*0.50)/200 = 0.00125
pop3 = (0.30*0.70)/200 = 0.00105

cjkl = {((n1)2 * a1jkl) + ... + ((ni)2 * aijkl)} / (n1 + ... + ni)
{(100*100*0.0016) + (200*200*0.00125) + (200*200*0.00105)} / 500
cjkl = 108/500 = 0.216

Contribution of the MN system to djk between pop1 and pop2:
[1/cjkl] * (freq. of N in pop1 - freq. of N in pop2)2 = (1/0.216) * (0.30)2 = 0.4167 = 41.67%

Calculating cjkl for the ABO system (remembering that aijkl is a matrix, reduced as above by dropping one row and one column) gives a percent contribution of 65.77%. The relative contribution can then be calculated for the differences between the other sample combinations.
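The MN-system portion of this example can be checked with the following Python sketch (an added illustration using the gene frequencies and sample sizes tabulated above):

    # Sketch of the pooled dispersion value c for one trait (the MN system)
    # and its relative contribution to the distance between pop1 and pop2.
    ns = [100, 200, 200]                  # sample sizes of the three populations
    freq_M = [0.20, 0.50, 0.30]           # frequency of M in each population

    # a_i = p(1 - p)/n for each population (the diagonal of the dispersion matrix)
    a = [p * (1 - p) / n for p, n in zip(freq_M, ns)]

    # pooled dispersion: sum of n_i^2 * a_i divided by the total sample size
    c = sum(n ** 2 * ai for n, ai in zip(ns, a)) / sum(ns)

    # contribution of the MN system to the pop1-pop2 difference
    freq_N = [1 - p for p in freq_M]      # N = 1 - M for this two-allele system
    contribution = (freq_N[0] - freq_N[1]) ** 2 / c

    print(f"c = {c:.3f}, MN contribution = {contribution:.4f} ({contribution:.2%})")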
Angular Transformation

For multinomial qualitative traits, a simple angular transformation can be used to calculate biological distance. This method is based on the geometric formula a2 + b2 = c2, with the square root of c2 giving a linear distance value (Constandse-Westermann, 1972). Using the symbol key above:

djk = p1jk - p2jk
The result is then expressed as an angle rather than a frequency through the linear transformation, which also standardizes the variances:
pijk′ = sin-1 (square root of pijk), with variance of pijk′ = 1/(4nij)

the other angle, θijk = sin-1 (1 - 2pijk), with variance of θijk = 1/nij
Using the weighted mean for trait 1 (0.222) from the above example:

pijk′ = sin-1 (square root of 0.222) = sin-1 (0.471) = 28.11 degrees
variance of pijk′ = 1/(4nij) = 1/8 = 0.125

θijk = sin-1 (1 - 2*0.222) = sin-1 (0.556) = 33.78 degrees
variance of θijk = 1/nij = 0.125

The distance, a chord, between the two groups can be measured by subtracting pijk′ - θijk, and the two standardized variances can be added.
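The two angles in this example can be reproduced with the short Python sketch below (an added illustration, not part of the original text):

    import math

    # Angular transformation of a trait frequency (p = 0.222 from the example above).
    p = 0.222

    angle_p = math.degrees(math.asin(math.sqrt(p)))      # sin-1(sqrt(p))  -> ~28.11 degrees
    angle_theta = math.degrees(math.asin(1 - 2 * p))     # sin-1(1 - 2p)   -> ~33.78 degrees

    print(f"sin-1(sqrt(p)) = {angle_p:.2f} degrees")
    print(f"sin-1(1 - 2p)  = {angle_theta:.2f} degrees")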
Mean Measure of Divergence

The linear transformation is also used in the Mean Measure of Divergence formula (Berry and Berry, 1967):

(θ1 - θ2)2 - (1/n1 + 1/n2)
Where again, θ = the angular transformation of the incidence frequency for each population, expressed in radians:
θ = sin-1 (1 - 2p), with variance = 1/n

D = θ1 - θ2 has variance (1/n1) + (1/n2), and (θ1 - θ2)2 / variance will have an approximate X2 distribution (df = 1).

The variance of D2 = 4D2 (1/n1 + 1/n2) for a pair of traits.
The variance of D2 = 4 (1/n1 + 1/n2) Σ D2 for a pair of populations.
The variance for the MMD between two populations = 4 (1/n1 + 1/n2) [(θ1 - θ2)2 - (1/n1 + 1/n2)] / (number of traits).
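As a concrete sketch of the Berry and Berry form of the MMD (an added illustration; the trait frequencies below are hypothetical, and the per-trait terms are averaged over the number of traits, as the final formula above implies):

    import math

    def mmd(p1_list, p2_list, n1, n2):
        """Mean Measure of Divergence between two populations (Berry and Berry form).

        p1_list, p2_list: trait frequencies in each population
        n1, n2: sample sizes
        """
        terms = []
        for p1, p2 in zip(p1_list, p2_list):
            theta1 = math.asin(1 - 2 * p1)      # angular transformation, in radians
            theta2 = math.asin(1 - 2 * p2)
            # per-trait term: squared angular difference minus its sampling variance
            terms.append((theta1 - theta2) ** 2 - (1 / n1 + 1 / n2))
        return sum(terms) / len(terms)

    # Hypothetical frequencies for three traits in two samples (illustration only).
    print(mmd([0.220, 0.155, 0.40], [0.218, 0.188, 0.52], 301, 93))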
Freeman and Tukey (1950) define θ as:
θ = sin-1 [square root of r/(n+1)] + sin-1 [square root of (r+1)/(n+1)]
where r = the number of individuals possessing the trait, and θ is expressed in radians.

The difference lies in the variance, which is apparently calculated incorrectly and underestimated with the original MMD formula (Souza and Houghton, 1977). This transformation, however, is most necessary when the sample sizes are under 100 and alpha lies between 0.05 and 0.95. Since neither of those conditions applies to this study, I believe the formula from Berry and Berry (1967) will suffice.
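For comparison, a minimal Python sketch of the Freeman and Tukey version of the transformation (an added illustration; r is taken here to be the count of individuals showing the trait in a sample of size n, and the count used is hypothetical):

    import math

    def theta_freeman_tukey(r, n):
        """Freeman-Tukey angular transformation of a trait count r out of n (radians)."""
        return math.asin(math.sqrt(r / (n + 1))) + math.asin(math.sqrt((r + 1) / (n + 1)))

    # Hypothetical count: 21 of 93 individuals show the trait (illustration only).
    print(theta_freeman_tukey(21, 93))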
