NTGuru705

Statistics Question

I have a general statistical problem here.

I have, say, 50 records in a table that are "readings" (integers) from a device. The readings are not consecutive (as in 103458, 103459, 103460), but they should be increasing: 103458, 103465, 104560, etc.

From time to time there are errors in the read... let's say one out of 50. These errors are usually way out of sequence. For example: 103458, 103465, 234, 104560 - note that the 234 is WRONG and way out of sequence.

I am trying to compute the max value minus the min value of these records, but I want to throw out these inconsistent records. The numbers can be anywhere between 1 and about a million depending on which counter I am tracking, so there isn't a good way to do this with fixed variances. Standard deviation sort of works, but in some cases the standard deviation is very low because the numbers DO NOT have noise in them, and thus I cannot throw out everything that falls more than one std dev away from the average.
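For reference, the one-standard-deviation filter described above can be sketched as follows. This is a minimal illustration of the approach being questioned, not a recommendation - as noted, it misbehaves when the good readings have little noise:

```python
import statistics

def range_after_stddev_filter(readings):
    """Max - min after dropping readings more than one population
    standard deviation from the mean (the naive approach)."""
    mean = statistics.mean(readings)
    sd = statistics.pstdev(readings)
    kept = [r for r in readings if abs(r - mean) <= sd]
    return max(kept) - min(kept) if kept else 0

# On the example data the single gross outlier inflates the std dev
# enough that only the bad reading is dropped:
print(range_after_stddev_filter([103458, 103465, 234, 104560]))  # 1102
```

On clean data with no outliers, however, the std dev shrinks toward the natural spread of the counter, and this filter can start discarding perfectly good readings - which is exactly the problem described.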

I need to come up with something that uses the standard deviation to determine which values should be included, but I am a little rusty on my stats here... any help is appreciated.

Thank you

I am trying to understand how I can detect these data elements that are out of character with the others.
NTGuru705 (ASKER)
I should add that I cannot adjust the values as they are being inserted... that is outside my control - only the analysis.
aburr
I would avoid the SD route if possible, since you have only about 50 numbers. I would suggest testing for "out of order" instead. Number your results from 1 to N, so you have an array A consisting of a(n), with a(n) being the nth number in your set.
For each consecutive pair, calculate the difference d(n, n+1) = a(n+1) - a(n). If d(n, n+1) is < 0, throw out a(n+1) and continue calculating.
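The pairwise check above can be sketched like this - assuming the readings arrive in insertion order and should be non-decreasing, each value smaller than the last accepted one is treated as a bad read:

```python
def drop_out_of_order(readings):
    """Keep readings that are >= the last accepted reading;
    discard any value that goes backwards."""
    kept = []
    for value in readings:
        if not kept or value >= kept[-1]:
            kept.append(value)
        # else: value dropped below the last good reading - bad read
    return kept

print(drop_out_of_order([103458, 103465, 234, 104560]))
# [103458, 103465, 104560]
```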
If you calculate a least squares regression line, then for reading n you will get a straight line with constant factors a and b such that

estimate = a*n + b

Using that, you can calculate the average deviation from the regression line and throw out the gross deviations.
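The regression idea can be sketched as follows, under the assumption that good readings grow roughly linearly with their position n in the table. The cutoff factor of 3 average deviations is an illustrative choice, not part of the suggestion:

```python
def least_squares_fit(ys):
    """Ordinary least squares fit of ys against their indices 0..n-1.
    Returns slope a and intercept b of estimate = a*n + b."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def drop_gross_deviations(ys, factor=3.0):
    """Drop readings whose deviation from the fitted line exceeds
    `factor` times the average absolute deviation."""
    a, b = least_squares_fit(ys)
    residuals = [abs(y - (a * i + b)) for i, y in enumerate(ys)]
    avg_dev = sum(residuals) / len(residuals)
    if avg_dev == 0:  # perfectly linear data: nothing to drop
        return list(ys)
    return [y for y, r in zip(ys, residuals) if r <= factor * avg_dev]
```

Note that with very few points a single gross outlier can drag the fitted line toward itself; with ~50 readings and one bad value, the outlier's residual still dwarfs the average deviation.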
I like the difference option, but it assumes that the first value isn't a bum value. The values could be 145, 123456, 134567, 127, 1456789, 1567890, 1567950.

The 145 and 127 in the above are junk. I don't think this will work when the first number is one of the bum numbers.

Least squares regression is not anything I recall from my one stats class years ago, so I'll have to do some more digging on what that is exactly.
Thank you both for your help.

ASKER CERTIFIED SOLUTION
phoffric
There are timestamps, but I am not sure I understand why that would matter, because the outliers can happen at any point on the time axis.

If I take the mean, what will I use to determine which values are the ones I need?

If I use the standard deviation, I still have the same problem presented above when the standard deviation is low, meaning there are no outliers.
Thank you for your direction.
SOLUTION
I can tell that it is outside the range of the other items that make up the bulk of the numbers... but I don't have a set number that could be used, because the values could be between 1 and a million. A percentage won't work either, because when the numbers are lower the percentage of change can be higher.

I think I need to figure out the max range of the numbers and exclude anything more than ±0.75 of that from the median.
I will try this and see.
Thank you
"but I dont have a set number that could be used because the numbers could be between 1 and a million."
-
If you do not have a set number, the computer cannot have a set number.
-
"I think I need to figure out the max range of numbers and exclude +- .75 of that from the median."
Sounds very reasonable
I have a series of machines that are sending in "values" - depending on the machine, the counts can vary, meaning one machine may have a count of 500K and another 500, just due to where it is in its cycle.

I do know, though, that each unit should produce at most X units - so I can get the median and make sure I only accept values within 0.75 of my X value on either side of the median.

Thank you
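The median-window plan above can be sketched as follows, assuming X is the known maximum count a machine can produce (the `max_units=10000` value in the example is hypothetical):

```python
import statistics

def filter_by_median_window(readings, max_units):
    """Keep readings within 0.75 * max_units of the median."""
    med = statistics.median(readings)
    window = 0.75 * max_units
    return [r for r in readings if abs(r - med) <= window]

print(filter_by_median_window([103458, 103465, 234, 104560], max_units=10000))
# [103458, 103465, 104560]
```

Because the median of mostly-good readings is itself a good reading, this avoids the problem of the mean and standard deviation being dragged around by the outliers.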
phoffric
>> 145,123456,134567,127,1456789,1567890,1567950
>> The 145 and 127 in the above are junk.

How can you say that the 145 is junk in the above example?
The median (after sort) is 134567,
and (134567-145) = 134422 = 1.34422e+5
and (1456789-134567) = 1322222 = 1.322222e+6

So, 1456789 appears to be more of an outlier than 145.

Also, do you need to be concerned about wrap-around? That is, when the highest number is reached, the next number will be small and will falsely appear to be an outlier. If so, then a special case needs to be made when your numbers get near the max threshold.

>> There are timestamps but I am not sure I understand why that would matter because the outliers can happen at any point on the time axis.
Depending upon what your data represents, timestamps can be relevant in determining outliers. Suppose, for example, that the numbers represent electric kWh usage. If two readings are 1 day apart, but a third reading is 30 days later, then this last reading would likely be much larger than the first two.
So, with a model of what your data represents, it may be possible to predict what the next reading should be (within some % margin) based on time: the larger the time difference, the larger, possibly, the difference between two consecutive readings.
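The timestamp idea can be sketched like this, under the assumption that readings grow at a roughly constant rate. The sample data, the per-hour rate, and the 50% margin are all hypothetical:

```python
def flag_by_rate(samples, rate_per_hour, margin=0.5):
    """samples: list of (hours, reading) pairs in time order.
    Predict each reading from the last accepted one plus elapsed
    time * rate; drop readings too far from the prediction."""
    kept = [samples[0]]
    for hours, value in samples[1:]:
        prev_hours, prev_value = kept[-1]
        predicted = prev_value + rate_per_hour * (hours - prev_hours)
        if abs(value - predicted) <= margin * predicted:
            kept.append((hours, value))
    return [v for _, v in kept]

print(flag_by_rate([(0, 1000), (1, 1100), (2, 50), (3, 1300)],
                   rate_per_hour=100))
# [1000, 1100, 1300]
```

As with the pairwise-difference approach, this assumes the first sample is good; a bad first reading would need separate handling.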
>> Also, do you need to be concerned about wrap-around?

There isn't really a max value... the field is an integer, and the counter will be taken offline before it hits the real max value. Good point, though.

I see your point on the timestamp, though in this case there isn't a way to do that, because the reading is not related to time. Thanks for clarifying for me.

Ok, but looking at your data, I am still not sure what the criterion is for saying that 145 is an outlier. Its distance to the next reading, 123456, is smaller than the distance from 134567 (presumably good) to its next "good" reading, 1456789.

So, I think we need to clarify a little more what the criterion is for declaring a reading to be an outlier. Also, if you provide more examples (perhaps more subtle ones) and point out which values are outliers, we can deduce the criteria.
SOLUTION