Glossary

A list of terms used in this package, with a short explanation.

The sizes listed with the various terms use the following symbols:
D: dimension of the input variable(s).
K: number of parameters in the model.
N: number of data points.

Independent Variable

The vector(s) of coordinates (locations, times, frequencies or whatever) at which the measurements were made. These vectors are assumed to be known beforehand. In the ordinary 1-dimensional case this would be the x-axis when the data are plotted. However, more than one input vector is allowed; then it becomes a multi-dimensional problem.
Size = N if D == 1 else D * N.

Dependent Variable

The vector of measured data points. For the fitter classes this needs to be a 1-dimensional vector. When fitting maps, cubes or even higher-dimensional datasets, automatic conversion is done to get the dependent and independent variables into the proper shape.
Size = N.
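
As an illustration of these shapes, a minimal numpy sketch; the names and the row-per-dimension ordering are illustrative assumptions, not prescribed by the package:

    import numpy as np

    N = 100                                   # number of data points

    # 1-dimensional case (D == 1): the independent variable has shape (N,)
    x1 = np.linspace( 0.0, 10.0, N )

    # multi-dimensional case (D == 2): here assumed as one row per dimension, shape (2,N)
    x2 = np.stack( [np.linspace( 0.0, 10.0, N ),
                    np.linspace( -1.0, 1.0, N )] )

    # the dependent variable is always a 1-dimensional vector of N measurements
    y = 3.0 * x1 + 1.5 + np.random.normal( 0.0, 0.2, N )

    print( x1.shape, x2.shape, y.shape )      # (100,) (2, 100) (100,)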

Weight

A vector of the same shape as the dependent variable representing the weights of the individual datapoints. Weights by nature are non-negative.
Weights are defined such that a point with weight w is equivalent to having that same point w times. This concept is extended to non-integral values of the weights.
Weights can sometimes be derived from the standard deviations of a previous calculation. In that case the weights should be set to the inverse squares of those standard deviations. However, weights do not need to be inverse variances; they can also be derived in other ways. One especially useful feature of weights is that some of them can be set to zero, causing those points not to contribute to the fit at all.
Size = N.
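
A small numpy sketch of how weights might be constructed and used; it is illustrative only, with a weighted mean standing in for a full fit:

    import numpy as np

    stdev = np.array( [0.1, 0.1, 0.2, 0.5, 0.5] )   # standard deviations from earlier work
    y = np.array( [2.05, 1.98, 2.10, 2.60, 1.40] )  # measured data points

    weights = 1.0 / stdev ** 2                      # weights as inverse variances
    weights[3] = 0.0                                # a zero weight removes point 3 from the fit

    # weighted mean as the simplest possible "fit": a constant model with one parameter
    wmean = np.sum( weights * y ) / np.sum( weights )
    print( wmean )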

Accuracy

The accuracy is a number (or set of numbers) that represents a user-provided estimate of the size of the errors.
Accuracies do not change the "number of observations", as weights do. Each measurement might have a different accuracy; it is still one measurement. When choosing weight = accuracy^-2, the difference only matters in the calculation of the evidence.
Accuracy can be 1 number, valid for all data, or a vector of N values, one for each data point. When there are possibly errors in both the dependent variable and the independent variable, it can be a matrix of (2,N) or (3,N). In the latter case the third row holds the (Pearson) correlation coefficient between both variables.
Size = 1 or 2 or 3 (all datapoints the same value) or
       N or (2,N) or (3,N) (one value for each data point).
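
A sketch of the possible accuracy shapes; which variable the rows refer to is an assumption made for illustration:

    import numpy as np

    N = 50

    acc_scalar = 0.3                            # one accuracy, valid for all data points
    acc_vector = np.full( N, 0.3 )              # one accuracy per data point, shape (N,)

    # errors in both the dependent and the independent variable: shape (2,N);
    # with a (Pearson) correlation coefficient as a third row: shape (3,N)
    acc_xy  = np.stack( [np.full( N, 0.3 ),     # accuracy of one variable
                         np.full( N, 0.1 )] )   # accuracy of the other variable
    acc_xyr = np.vstack( [acc_xy,
                          np.full( N, 0.2 )] )  # correlation coefficient

    print( acc_xy.shape, acc_xyr.shape )        # (2, 50) (3, 50)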

Model

The mathematical relationship that is assumed to hold between the independent and the dependent variables. The relationship usually contains one or more unknown values, called parameters. The fitting process is a search for those model parameters that minimize the differences between the modelled data and the measured data.

Parameter

The parameters of the model. After fitting they are at the optimal values.
Size = K.

Problem

A container object that collects all elements of a problem, e.g. the Model, the independent and dependent variables and, if present, the weights and/or accuracies. Problems are only relevant in the context of NestedSampler.

Chisq

Chisq is the global misfit of the data (D) with respect to the model (M), scaled by the accuracies and/or multiplied by the weights, if applicable:
χ² = Σ w * ( ( D - M ) / σ )²
Minimizing chisq (least squares) is equivalent to maximizing the log-likelihood of a Gaussian error distribution: least squares is easy Bayes. In a least-squares setting, the fitters minimize chisq to find the optimal parameters.
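
For example, chisq can be computed directly with numpy; this is an illustrative sketch with made-up data, a single accuracy σ and unit weights:

    import numpy as np

    np.random.seed( 1 )
    N = 20
    x = np.linspace( 0.0, 1.0, N )
    sigma = 0.1                                      # accuracy of each data point
    y = 2.0 * x + 1.0 + np.random.normal( 0.0, sigma, N )

    model = 2.1 * x + 0.9                            # model values at some trial parameters
    weights = np.ones( N )                           # optional weights

    chisq = np.sum( weights * ( ( y - model ) / sigma ) ** 2 )
    print( chisq )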

Likelihood

The combined probability of the data, given the parameters and the model. In practice the log of the likelihood is used, as it is a more manageable number.
MaxLikelihood fitters search for a (global) maximum in the likelihood landscape. At that position, the maximum-likelihood solution for the parameters is found. In the case of a Gaussian likelihood, this solution is the same as the least-squares solution.
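
A sketch of the Gaussian log-likelihood and its relation to chisq, assuming a single known error scale sigma; the function name is illustrative:

    import numpy as np

    def gauss_loglikelihood( data, model, sigma ):
        """Log of the Gaussian likelihood of the data, given model values and error scale sigma."""
        chisq = np.sum( ( ( data - model ) / sigma ) ** 2 )
        norm = 0.5 * len( data ) * np.log( 2.0 * np.pi * sigma * sigma )
        return -0.5 * chisq - norm                   # maximizing this minimizes chisq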

Standard Deviation

The standard deviations of the parameters. They are the square roots of the diagonal elements of the covariance matrix.
When the number of data points increases, the standard deviations decrease roughly as 1/sqrt(N).
Size = K.

Scale or Noise Scale

The average amount of noise left over when the model with optimized parameters has been subtracted.
s = sqrt( χ² / ( N - K ) )
The scale does not decrease when the number of data points increases.
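
A minimal sketch of this quantity; the function name is illustrative and the residuals are taken as data minus fitted model:

    import numpy as np

    def noise_scale( data, model, nparams ):
        """Noise scale remaining after the fitted model has been subtracted."""
        chisq = np.sum( ( data - model ) ** 2 )          # unweighted chisq
        return np.sqrt( chisq / ( len( data ) - nparams ) )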

Confidence Region

The confidence region is the wiggle room of the optimal solution. It is derived via a Monte Carlo method from the covariance matrix.
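
One way such a Monte Carlo confidence region can be realized, sketched here for a straight-line model; this is an illustration, not the package's own implementation:

    import numpy as np

    def confidence_region( x, params, covar, nsamples=1000, percent=68.0 ):
        """Monte Carlo confidence band for a straight-line model y = p0 + p1 * x."""
        rng = np.random.default_rng( 12345 )
        psamples = rng.multivariate_normal( params, covar, size=nsamples )
        ysamples = psamples[:,0,None] + psamples[:,1,None] * x       # (nsamples, N)
        lo = np.percentile( ysamples, 50.0 - percent / 2, axis=0 )
        hi = np.percentile( ysamples, 50.0 + percent / 2, axis=0 )
        return lo, hi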

Design Matrix

The matrix of the partial derivatives of the model function with respect to each of its parameters, at every data point. It is also known as the Jacobian (matrix).
Size = N * K.

Hessian Matrix

The inner product of the design matrix with its transpose. When weights are present, they are folded in as well.
Size = K * K.

Covariance Matrix

The covariance matrix is the inverse of the Hessian matrix multiplied by the scale squared. The standard deviations are defined as the square root of the diagonal elements of this matrix.
Size = K * K.
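
The following sketch ties the design matrix, Hessian matrix, covariance matrix, scale and standard deviations together for a simple weighted straight-line fit; it is illustrative only, using plain numpy rather than the package's own classes:

    import numpy as np

    np.random.seed( 2 )
    N = 30
    x = np.linspace( 0.0, 1.0, N )
    y = 1.0 + 2.0 * x + np.random.normal( 0.0, 0.1, N )
    w = np.ones( N )                                    # weights

    # design matrix (Jacobian) of the straight line y = p0 + p1 * x : shape (N,K)
    design = np.stack( [np.ones( N ), x], axis=1 )

    # hessian = design^T * (weights) * design : shape (K,K)
    hessian = design.T @ ( w[:,None] * design )

    # weighted least-squares solution for the parameters
    params = np.linalg.solve( hessian, design.T @ ( w * y ) )

    # scale, covariance matrix and standard deviations of the parameters
    K = len( params )
    chisq = np.sum( w * ( y - design @ params ) ** 2 )
    scale = np.sqrt( chisq / ( N - K ) )
    covar = np.linalg.inv( hessian ) * scale ** 2
    stdevs = np.sqrt( np.diag( covar ) )
    print( params, stdevs )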

Prior or Prior probability

The prior is the probability of the parameters (in our case) before considering the data.
There is a lot of mumbo-jumbo about priors. They are said to be subjective and thus (wildly) different, depending on the whim of the actors. However, in real-life problems this is not the case. From the layout of the problem you are analysing, it mostly follows directly where the parameters can be allowed to go. Say you have data from a spectrometer: any frequency derived should be within the measuring domain of the instrument; fluxes should be above zero and below the saturation level, etc. My personal rule of thumb is: whenever you start to frown on the outcome of a parameter, it is out of your prior range.

Posterior or Posterior probability

The posterior is the probability of the parameters (in our case) after considering the data.
According to Bayes Rule, the joint probability of the data and the parameters, given the model, can be factored in two ways:
joint = posterior * evidence = likelihood * prior
P(p,D|M) = P(p|D,M) * P(D|M) = P(D|p,M) * P(p|M)
where P is probability, p are the parameters, D is the data and M is the model.
As the integral of the posterior over the parameter space must be 1.0 for it to be a proper probability, the evidence acts as the normalizing constant of prior * likelihood.
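
A small numerical sketch of this normalization, for a one-parameter problem on a grid; it is illustrative only, and constant factors of the likelihood are dropped as they cancel in the posterior:

    import numpy as np

    np.random.seed( 3 )
    data = np.random.normal( 1.0, 0.5, 10 )          # data with an unknown mean
    sigma = 0.5                                      # known error scale

    # grid over the single parameter (the mean), with a flat prior on [-5, 5]
    p = np.linspace( -5.0, 5.0, 1001 )
    dp = p[1] - p[0]
    prior = np.full_like( p, 1.0 / 10.0 )

    # (relative) likelihood of the data at each grid point
    loglik = np.array( [-0.5 * np.sum( ( ( data - m ) / sigma ) ** 2 ) for m in p] )
    likelihood = np.exp( loglik - np.max( loglik ) )

    # joint = likelihood * prior; the evidence normalizes it into the posterior
    joint = likelihood * prior
    evidence = np.sum( joint ) * dp
    posterior = joint / evidence

    print( np.sum( posterior ) * dp )                # integrates to ~1.0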

Evidence

The integral of prior * likelihood over the parameter space. It provides the evidence the data carry for the model, i.e. it tells us how probable the model is given the data. Technically, another application of Bayes Rule is needed to get from P(D|M) to P(M|D): P(M|D) ~ P(M) * P(D|M). If we can ignore the priors on the models (all being the same), then the probability of the data given the model, P(D|M), is proportional to the probability of the model given the data, P(M|D).

Because of the proportionality, the number itself does not say anything. It can be compared to evidences obtained for other models fitted to the same data.

In practice the log of the evidence is calculated, as these numbers can be extremely large or small. If the log10 evidences of two models, A and B, differ by a number f, then the probability of model A is 10^f times larger than that of model B: P(A)/P(B) = 10^f.

Information

The log of the ratio of the space available to the parameters under the prior probability to the space available under the posterior probability.