Continuing from WSPR for A/B tests – a discussion – part 1.

Above is a frequency histogram of the experiment log.

## Approximately…

The histogram uses 1dB intervals for the bars, so it chunks the data into discrete bands. That hides an important issue with WSPR SNR data: its granularity is 1dB, which is a very coarse measure given the spread of the data.

Let's compare the probability distribution of the measured difference data with an ideal normal distribution.

Above is a quantile-quantile (Q-Q) plot of the raw data and an ideal response with the same standard deviation as the raw data. The data is for 4508 points, so these dots each typically represent a large number of observations, more so in the middle region.
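As a rough illustration of how such a plot is built, the sketch below pairs each sorted observation with the quantile of a normal distribution fitted to the sample. The data here is synthetic (a Gaussian with an assumed SD of 2.4dB, rounded to 1dB), not the experiment log, and `qq_points` is a hypothetical helper name. Matching the sample mean as well as the SD is also an assumption.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    xs = sorted(sample)
    n = len(xs)
    # Reference normal with the same mean and SD as the sample (assumption).
    ref = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i - 0.5) / n avoid infinite quantiles at 0 and 1.
    theo = [ref.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theo, xs))

# Synthetic stand-in for the experiment log: NOT the real data.
rng = random.Random(1)
diffs = [round(rng.gauss(0.0, 2.4)) for _ in range(4508)]

points = qq_points(diffs)
distinct = len({obs for _, obs in points})
print(f"{len(points)} points, but only {distinct} distinct observed values")
```

Because thousands of observations share a handful of integer values, many points stack at the same observed level, which is why each visible dot stands for a large number of observations.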

There are two main departures:

1. the response is a staircase rather than a straight line;
2. the response departs from a straight line by curving to higher slope at the low end and high end.

(1) is due to the 1dB granularity of WSPR reports. The representation of an underlying continuous variable as an integer adds statistical noise and causes the data to fail tests of normality (so invalidating parametric methods dependent on normality).
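The added noise can be estimated with a small simulation. This is a sketch on synthetic data, assuming a Gaussian underlying variable with an SD of 2.4dB (an assumed figure, not taken from the log). For a quantisation step of 1dB, the textbook expectation is an added variance of about 1/12 dB².

```python
import random
from statistics import pvariance

# Synthetic continuous "true" differences: an assumption, not the real log.
rng = random.Random(42)
raw = [rng.gauss(0.0, 2.4) for _ in range(50_000)]

# WSPR-style reporting: each value rounded to the nearest whole dB.
reported = [round(x) for x in raw]

# Variance added by the rounding step alone.
added = pvariance(reported) - pvariance(raw)
print(f"added variance: {added:.3f} dB^2 (theory for 1dB steps: {1 / 12:.3f})")
```

The added variance is small next to the spread of the data, but the rounding also collapses the values onto a few integers, and that is what produces the staircase.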

(2) is a characteristic of the measurement system which exhibits some non-linearity of response at the low and high ends of the WSPR detector range. The effect of this defect is diminished by the lower number of observations in the lower and upper tails of the data.

So, to the eye, the data might at first look normally distributed, but it fails normality tests. The mean is -0.09dB, the median is 0dB, and a Shapiro-Wilk normality test gives p=2.62e-37, ie data like this would be extremely unlikely if the underlying distribution were normal.

## Testing hypotheses

We have a set of observations with a mean of -0.09dB, and the question arises whether there is in fact a difference between transmitters A and B, or whether this mean was a result of chance.

In statistical speak, we want to test the null hypothesis Ho: there is no difference between A and B.

If this were normally distributed data, we could use a paired Student's t test to test that hypothesis, and further, we could use the properties of a normal distribution to set confidence limits to the calculated difference.

It is not normally distributed data, so we could apply a non-parametric test for Ho.

The Wilcoxon signed rank test is suitable, and it gives p=2.9e-24: a result like this would be extremely unlikely to occur by chance if Ho were true (ie if B-A=0). So it is extremely unlikely that A and B are the same. Although we can say that with conviction, we cannot set confidence limits on the calculated mean (-0.09dB).
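For readers who want to see the mechanics, here is a minimal sketch of the signed rank test using the large-sample normal approximation. It assumes zero differences are discarded and tied magnitudes share an average rank. `wilcoxon_signed_rank` is a hypothetical helper, not the tool used for the figure quoted above, and statistics packages may apply exact methods or continuity corrections that give slightly different p-values.

```python
from math import sqrt
from statistics import NormalDist

def wilcoxon_signed_rank(diffs):
    """Two-sided signed rank test on paired differences (normal approximation)."""
    d = [x for x in diffs if x != 0]          # zero differences carry no sign
    n = len(d)
    if n == 0:
        raise ValueError("all differences are zero")
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied magnitudes starting at position i.
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1            # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48  # tie-corrected
    z = (w_plus - mean_w) / sqrt(var_w)
    p = 2 * NormalDist().cdf(-abs(z))
    return w_plus, z, p

# Illustrative inputs only, not the experiment data.
print(wilcoxon_signed_rank([-1, 1, -2, 2]))   # balanced signs
print(wilcoxon_signed_rank(list(range(1, 21))))  # all differences positive
```

With 1dB granularity the real data has very heavy ties, so the tie correction in the variance matters more here than it would for continuous measurements.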

We can observe that the mean of the measurements was -0.09dB and that 95% of the measurements fell within the range -3.0 to 3.0dB.

Were the data normally distributed, we could calculate a confidence interval based on SD and N and say that the difference of B and A is -0.09dB +/-0.070dB with 95% confidence.
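The arithmetic behind such an interval is the usual z × SD / √N. The sketch below uses N=4508 from the text. The SD is not stated in this article, so the 2.4dB used here is an assumed value, back-calculated from the quoted +/-0.070dB. `mean_ci_half_width` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci_half_width(sd, n, confidence=0.95):
    """Half-width of a normal-theory confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * sd / sqrt(n)

N = 4508           # number of paired observations, from the text
SD_ASSUMED = 2.4   # dB; assumed, back-calculated from the quoted +/-0.070dB

half = mean_ci_half_width(SD_ASSUMED, N)
print(f"-0.09dB +/-{half:.3f}dB")
```

Note how much narrower this is than the -3.0 to 3.0dB range of the individual measurements: the interval on the mean shrinks with √N, while the spread of the observations does not.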

The latter is a stronger statement, as it makes an inference about B with respect to A, whereas the statement before that simply reports measurements.

Continued at WSPR for A/B tests – a discussion – part 3.