Expressing Data Quality in Fitting

Do Kester

1. Introduction

In real fitting processes it often occurs that data points are of different quality. Either as the result of previous calculations, e.g. when the datapoint is obtained from an integrating receiver, or from known deficiencies in the data such that the data have unequal quality, through calbration of the measurements, or even just by an educated hunch of the experimenter.

In this note we explore the possibilities to use these differences in order to obtain a better fitting result. We can express the qualities as either importance weights attached to the data points, or as a scale factor in the residuals.

Mostly data points are measured at locations (in time, space, frequency etc.) that are known with much higher precision than what is measured. These locations are called the independent variables, and what is measured at these locations, are called the dependent variables. In these cases, we fit the model by minimizing the mismatches in the dependent variable, called y, assuming that the other variable, the independent one, called x, is known with a precision, at least an order of magnitude better.

Assume we have N data points D = {xi,yi}. We want to fit the data with an as-yet-unspecified model

  Equation 1

such that some norm of the residuals, ri, are minimumized over the parameters, θ.

  Equation 2

2. Quality as Weights

In our paper (Kester and Mueller, 2021) (KM), we state our preference for weighting over scaling on the basis that weights are more general, more flexible than scales. We also state that our weights are treated as multiple cq. fractal observations. An observation with a weight of 2 would yield in the same result as including that same observation twice in the dataset, without weight, i.e. a weight of 1.

The common suggestion, repeated in our paper, that the choice between weights and scales is irrelevant with respect to the final fitting results, is somewhat too fast. Up on closer inspection it turns out that the least-squares results and the maximum likelihood results are indeed the same, provided that we choose w = σ-2, where w is the weight and σ is the accuracy. However, the calculation of the size of the posterior and consequently, the evidence is different. The overall shape of the posterior is not affected.

The crux is in the likelihood. The likelihood is the product of the error distribution values at individual data points, xi.

  Equation 3

When we choose a Gaussian error distribution we obtain for individual data points, without weights. We will address other error distributions later on in this note.

  Equation 4

where yi is the measured value at location xi and mi = F(xi : θ). It is the value returned by the model, F, at the same location and for certain values of the parameter vector, θ, also known as mock data.. The scale factor, σ, is in the narrative of KM, part of the likelihood. It is either a fixed value or a hyperparameter to be estimated in the problem too.

The likelihood of Eq. 2 is displayed in Fig. 1, using the black line. It is shown as a function of the residuals r = y - m. The scale factor σ, is set to 1.0.

  log L

Figure 1. The Gaussian likelihood as function of residual size. In black for σ = 1, in green for w = 2, and in red for σ = ½√2.

When there are qualitative differences between individual data points, there are 2 ways to handle them: via weighting of the datapoints, and via scaling of the residuals. For weighting we assume that a data point that has a weight, w has the same impact as w (of the same) data points with a weight equal to 1. We can easily extend this to floating values of the weight, wi.

With these assumptions Eq. 4 transforms into

  Equation 5

This is the formula we have been using in our NestedSampler.

In Fig. 1 the likelihood of Eq. 5 is displayed in green for a weight value, w = 2. This likelihood is not normalized as it is actually a product of 2 likelihoods of the form of Eq. 4. The latter is less than 1, everywhere, so the green line is consistently below the black line.

However, there is another option, where we know the accuracy of the errors on the data point, e.g. when the data point is the result of a previous calculation. Assume σi represents the size of the errors. In this narrative the σ's are a given. They are part of the data, not part of the likelihood.

  Equation 6

The likelihood of Eq. 6 is shown in Fig.1 as the red line, using a value for σ = 1/ √2. This is a normalized likelihood, so when the width decreases, the height must increase.

Even if we put, as usual w = s-2 (as in Fig.1), there is still a difference between Eq. 5 and Eq. 6. The exponential parts of the equations turn out the same, but the normalizing parts are different. The least-squares and the maximum likelihood solutions only depend on the exponential parts. So they do not differ in both representations of the data quality. The Bayesian solution, posterior and evidence, take the full likelihood into account, so they are different. Which one is better is up to the user to decide: which of the notions, weighting or scaling, reflects better on the problem and the data at hand.

In the logarithm the likelihood transform into

  Equation 7

Eq. 4 (no differing data quality) transforms (N is the number of data points).

  Equation 8

Equation 5 (using weights) transforms into.

  Equation 9

Equation 6 (using scales) transforms into.

  Equation 10

3. Quality as Errors

We will follow the discussions in Phil Gregory's Book (1) and the supplement to it (2). They are reference resp, as PG1 and PG2.

There are several narratives that can be followed, each valid in applicable circumstances. Here we assume that the likelihood is a Gaussian error distribution.

3.1. Single scale

All errors are from a Normal distribution with a single scale, σ. At this point it does not matter how the model mi is calculated. We just assume it is there.

  Equation 11

The scale, σ, is either a given of the problem or an extra (hyper)parameter, that must be estimated from the data.

3.2. Known Accuracies

For each of the data points, the accuracy is known as σi. The likelihood is the same as in Eq.2 except that the accuracies, σi, are different at each data point, replacing the σ in Eq. 2 by σi.

  Equation 12

3.3. Errors in the Model

When our (knowledge of the) model is incomplete, not embodying all subtleties present in the data, there can be (systematic) errors which are not caught when knowing the accuracies only. The model incompleteness might be known at the level of each data point as σm,i, or globally as σm. The latter might be an unknowm quantity, estimatable as a hyperparameter of the problem.

Errors in the model can occur simultaneously to errors in the data points. When combining 2 possible errors, the distribution takes the form of a convolution of both individual error distributions. For Gaussian distributions we obtain another Gaussian where the variance is the sum of the individual variances. For other distributions it is not so clear cut. In section 4 we discuss that situation.

  Equation 13

All σ values might be the same for all data points, provided they are a given in the problem. Only one of them can be sensibly estimated from the data.

3.4. Errors in x and y

When we have errors in x and y of similar size, we do not know the exact location of the independent variable. We have to estimate them too. We call them the targets, ti. In stead of Eq.1 we have

  Equation 14

And in addition to Eq. 2 we have an extra residual to minimize; this time for the unknown x-location, for which holds that

  Equation 15

The unknown x-locations are so called nuisance parameters of the problem, each with their own prior. They have to be integrated over all possible values to obtain the values for the model parameters, θ. When using the nested sampling algorithm, it means that the nuisance parameters just can be ignored.

PG1 presents a case where the nuisance parameters are integrated analytically, so they are removed from the problem. PG1 needs to assume that the model is a straight line and that one prior suffices for all targets. The latter encodes the case where the possible range of the values for the targets, t, is much smaller than the possible range of the errors, εi. PG1 ends up with one (maybe 2) nuisance parameters in stead of N.

We think that these conditions are too restrictive; we prefer working with nuisance parameters for the N targets. Again. using a Gaussian error distribution, we multiply the probabilities of Eqs. 11, 12, or 13, whatever the case is, with the probability for the residuals in x.

  Equation 16

where τ is the accuracy of the data points in the x-direction.

As the targets are (nuisance) parameters to the problem, they need a prior. As said before, PG1 chooses a single prior for all targets and he could integrate the nuisance parameters out (for straight line models). We want a prior for each target as we think that in most cases the values for xi are not drawn from a single distribution, but are significantly different in themselves with an added error contribution. This leads to priors which are centered on the data point values of xi. This may seem like cheating as we are using the data for establishing the prior. But it is not. We know beforehand that the targets need to be near the x-data, whatever these data points are. We are not looking at the data and decide what to choose; we already know what to do.

Moreover, for a large class of distributions that are function of the size of the mismatch between target and data, it holds that the probability of the data, given the target, is the same as the probabilty of the target given the data. E.g. for a Gaussian distribution as prior

  Equation 17

where the scale of the distribution, τ, needs to be specified, estimated by the user.

In its simplest form the likelihood becomes

  Equation 18

The values of mi are now evalutated on the targets of x: mi = F(ti : θ).

More succinctly, the likelihood can be written in matrix notation as

  Equation 19

where

  Equation 20

So zi is the data vector, the ζi is the target vector and Vi is the variance matrix. The variance matrix contains σy being the accuracy deemed for the y-data, eventually compounded by the accuracy (known or unknown) for the model values at the target values, t, as in Eq. 5. The values for σx are the accuracies deemed for the x-data. Both σ's might be known on the level of individual data points, or only globally for the problem as a whole.

When there are errors on both axes, the contours of equal likelihood form nested ellipsoids, whose semimajor axes are multiples of σx and σy. The target positions, t, are found where these ellipisoids touch the model function, F. The optimal solution is found where the distances from data to targets is minimal. Geometrically, we have to find those parameters, θ, for which the ellipsoidal distance from data to model is minimal.

3.5. Correlated errors in x and y

When the errors in x and y are correlated, the variance matrix, Vi, needs to be replaced by a covariance matrix.

  Equation 21

where ρi is the covariance factor between xi and yi. With this replacement, the likelihood of Eq. 18 stays the same.

With correlated errors, the ellipsoids are rotated according to the correlation between x and y. We still want to find the minimum in the ellipsoidal distances from data to model.

3.5.1. Example

Suppose we have measurements of the magnitudes of stars in 3 bands, U, B and V. To eliminate the influence of the distance of the stars we take differences between the bands: U - B and B - V.

By construction the the errors in the differences are (anti)correlated. Every error in B shows negative in U - B and positive in B - V.

If the accuracies of U, B and V were σU, σB, and σV, resp., the covariance matrix would be

  Equation 22

All this, of course, under conditions of independent measurements

3.5.2. Model Mismatch

When there is also an mismatch in the model of (possibly unknown) size σm, the most probable model function is the one closest (in 2 dimensions) to the data points. The covariance matrix reverts to

  Equation 23

The model mismatch is supposed to be independent of the other accuracies; it does not affect the covariance.

3.5.3. Computation

For computational reasons, we write out Eq. 19, using the covariance matrix of Eq. 22. Firstly we write the covariance matrix as

  Equation 24

Where the v's are (co)variances of y and x.

The determinant of V becomes

  Equation 25

And the inverse of V is

  Equation 26

In most computations we need the likelihood in logarithmic form.

  Equation 27

All relevant items in Eq. 28 need indices i. They are omitted to keep it (relatively) simple.

We also need the partial derivatives of the log-likelihood to the parameters, θ, to the targets, t, and, if present to the unknown model accuracy, σm as in Eq. 23.

  Equation 28

  Equation 29

Where δi,k is the Kronecker delta.

The unknown model scale is present in vyy, vxx, and subsequently in D.

  Equation 30

So in case of Eq. 23 we have

  Equation 31

Combining these

  Equation 32

4. Other Error Distributions

The schemes above only work on scaled error distributions; so not for Poisson and Bernouilli distributions.

In the table below is indicated what is implemented in BayesicFitting with , and what is not implemented with . The numbers in brackets refer to notes.

The orange σm indicate a scale factor due to the incompleteness of the model. It might either be known or unknown. In the latter case the scale needs to be estimated from the data. All other scales are known accuracies that need to be presented with the data.

ErrorDistributions
Scale Eqn Gauss Laplace Uniform Expon Cauchy Mixed
σ 11 (1)
σm 11
σi 12 (1)
σi + σm 13 (2) (3) (2) (2) (4)
Errors in X and Y (5)
( σi, τi ) 18 (6) (6) (6) (6) (1)
( σi, τi, ρi ) 21 (7) (7) (7) (7) (1)
( σi + σm, τi + σm, ρi ) 22 (2) (3) (2) (2) (4)
Notes
1   Only of interest when both scales of the mixing distributions are unknown.
2   Convolution distribution unknown
3   Convolution is a chopped triangular distribution.
4   Only when both mixed distributions are Gauss.
5   Assuming the same distribution in X and Y.
6   Could in principle be done. However not implemented yet for want of a good use case.
7   Dont know; Maybe a rotated version of the previous. Analogous to Gauss.

5. Combining weights and accuracies

In principle accuracies and weights can be combined. E.g. when one data point represents a known number of observations, all with the same known accuracy.

The overall likelihood is the product of the likelihoods of the individual points, as in Eq. 3. Using weights Eq. 3 changes in

  Equation 33

It is irrelevant what the exact form of the individual likelihoods is. They can all be combined in this way with weights.