I am by no means a statistician. There are a variety of tests and techniques I use, most of which I mostly understand. This post is about one of those techniques that I became familiar with while working in 40Ar/39Ar labs, and trying to write some of my own data reduction software. One thing I noticed was that although this statistic is rather common in the earth sciences, there wasn’t a single place that explained the steps required to calculate or interpret the results. So, here it goes…
There are a variety of methods for calculating a best-fit line through a set of data. The most common methods are the least squares and weighted least squares techniques. The least squares method allows for the calculation of a best fit line, but does not consider errors in the measurements, while the weighted least squares method considers only errors in one axis or measurement. For geochronologic applications, it is important to consider errors in measurements in both the x- and y-directions. The following method employs a set of equations derived by Derek York, originally for use with Rb-Sr isochron diagrams, but later employed in a variety of geochronologic systems, including 40Ar/39Ar and U-Th-Pb. The basic idea of the York fit is to produce a line that minimizes the distance of each data point from the line, while weighing the precision of each point, as shown below
The above diagram shows the basics of the York fit. First of all, the goal is to come up with the line that minimizes the distance from your data point to the line. The average distance of each data point (xi,yi) to the best fit line (point Xi,Yi on line Y=aX+b) is the value that should be minimized. In addition, more weight, or importance in the fit, is given to samples with higher precision. This minimization parameter S is therefore defined by York (1969) as
where the weight of each data point is defined as
Calculation of the best-fit line using this method is an iterative process, where you start with a reasonable approximation for the slope (b) and y-intercept (a), as well as the correlation coefficient (r) for the data errors. Reasonable starting values can be calculated from a least squares fit to the data, which does not take into account the errors in the x- and y-direction (Eq.15 and 16). The correlation coefficient between the errors is given in equation 16. Equations 3-8 can be used to calculate the slope (b), y-intercept (a), and x-intercept (c) from a given data set. Once the new value for b is calculated, it is reinserted into the equation, and the values are calculated again. The answers will eventually converge, depending upon the desired precision, scatter in the data, and quality of the original estimate
Uncertainties for the a and b values calculated above are given by McDougall and Harrison (1997) as
The Mean Square of the Weighted Deviates (MSWD) is the parameter used to assess the goodness of fit of the calculated line to the data. It is the ratio of the minimization parameter calculated above (S) to the number of degrees of freedom in the system (in this case, n-2). The MSWD cannot be compared to other parameters, such as r2, that are typically used to describe the goodness of fit of a line. The MSWD is unitless, and is normalized to the input values (this is not a measure of absolute error). The standard deviation of the MSWD is given by equation 12. Equation 13 is the target MSWD for a given data set. If the calculated MSWD is less than the value calculated in equation 13, then there is a > 5% probability that the data are adequately described by the line, and are generally accepted. The target MSWD is typically around 2, but varies with the number of data points. As the number of data points increases, the target MSWD approaches 1. An MSWD << 1 indicates that the reported errors are most likely overestimations of the true error.
Example: Sample data set and associated calculations from an 40Ar/39Ar analysis. Use the X and Y data and the associated uncertainties to calculate fits and statistics. Answers below!
X |
X stdev |
Y |
Y stdev |
2.7023E-03 |
2.7102E-05 |
3.0800E-03 |
2.7400E-05 |
3.5143E-03 |
4.6900E-05 |
3.0226E-03 |
4.6800E-05 |
1.1232E-02 |
5.7000E-05 |
1.8457E-03 |
2.0600E-05 |
1.7668E-02 |
7.0400E-05 |
1.0541E-03 |
1.3000E-05 |
2.3694E-02 |
2.8400E-04 |
1.3909E-04 |
1.3900E-04 |
1.8766E-02 |
5.8708E-05 |
8.8394E-04 |
1.7017E-05 |
2.1217E-02 |
6.8546E-05 |
5.7748E-04 |
1.4818E-05 |
2.5336E-02 |
1.2869E-04 |
2.4434E-06 |
2.5362E-05 |
1.8063E-02 |
4.8321E-05 |
9.8585E-04 |
1.5664E-05 |
1.9546E-02 |
5.5549E-05 |
8.0004E-04 |
1.5398E-05 |
1.7832E-02 |
6.9203E-05 |
1.0204E-03 |
1.9451E-05 |
1.2708E-02 |
4.1328E-05 |
1.7162E-03 |
2.0225E-05 |
1.2232E-02 |
3.9281E-05 |
1.8275E-03 |
1.9135E-05 |
1.1136E-02 |
2.8326E-05 |
1.9530E-03 |
1.3782E-05 |
1.1843E-02 |
2.1066E-05 |
1.8550E-03 |
7.7317E-06 |
1.3436E-02 |
1.5847E-05 |
1.6080E-03 |
6.0629E-06 |
r = 0.898
a = 3.45E-3 ± 1.48E-5
b = -0.1362 ± 1.01E-3
c = 0.02535
MSWD = 2.12
Target MSWD = 1.76
Bad Fit (P < 5%)
There are a large number of publications looking at the York II fit (as we call it) to 40Ar/39Ar isochrons. Ramifications of MSWD values, slopes and intercepts, have been looked at in quite a few publications.
I think you may have forgotten to define the parameter f.
f is the degrees of freedom! Nice catch.
This is a very useful summary! Do you have the expressions for the case of zero intercept (that is, the intercept a = 0)? If so, I would appreciate these.
Continuing from my message above, I would appreciate the approximate expression for sigma_b (that is, eq. 9 in this post for the case of a=0).
Thanks for the comments! I don’t have those expressions handy, but do you mean a case where the line is forced through the origin? If the calculated origin is 0, I don’t see why different equations would be needed, but if you are trying to force the line through the origin, or know if must pass through is a priori, I’m not sure how to deal with that!
Yes, I mean that the line is forced through the origin because it must pass through the origin. Hence, in eq. (1), a = 0. It is relatively easy to obtain the expression of b by minimizing S, but finding a good approximation of the error on b (analogy to eq. 9) is difficult.
Very helpful. Thank you
Before asking my question, I just would like to say your post is amazing and absolutely important. There is a huge number of professional who works with geochronology and have no idea about the calculations! The results are poor conclusions or even meaningless ages. Well, my question: what is the definition of rhoi, in equation 2? Where can I find it?
Cheers,
Felipe