Mytix
asked on
simple outlier question
Hi Guys,
Say I have a set of data which represents time over distance (x/y).
Is there a way to determine outliers in the distance? What is the usual way of determining this?
I hope my question is clear enough...otherwise Ill try and provide an example problem.
Thanks
Say I have a set of data which represents time over distance (x/y).
Is there a way to determine outliers in the distance? What is the usual way of determining this?
I hope my question is clear enough...otherwise Ill try and provide an example problem.
Thanks
Please provide an example
You could take the average/standard deviation of the distance. Then call an outlier more/less than (average +/- X*standard deviation). Set X to higher values to call only most extreme outliers as outliers, set X to lower to call less extreme outliers as outliers
ASKER
Ok.....I understood only half of that adam...
for example..here is what a very very simple sample might look like ..note that althought it can be represented by a function..it isnt one or isnt defined by a function
x y
------------
30 20
31 19
32 20
33 21
34 19
35 18
36 15
37 20
38 21
39 19
40 18
Now from this example the outlier is x=36, but how do I calculate it? what is y represented a slope/curve-like function?
Thanks again
for example..here is what a very very simple sample might look like ..note that althought it can be represented by a function..it isnt one or isnt defined by a function
x y
------------
30 20
31 19
32 20
33 21
34 19
35 18
36 15
37 20
38 21
39 19
40 18
Now from this example the outlier is x=36, but how do I calculate it? what is y represented a slope/curve-like function?
Thanks again
Your question is a bit unclear and unusual. I will assume your first column is x (time) and the second is y(distance). It is unclear what x/y represents (an inverse velocity?!?) No matter the procedure is the same no matter how unphysical the background.
Presumably each line is data taken about what you hope is the same thing. Perform the calculation (x/y) on each of the 11 points, take the average, find the standard deviation. An outlier is often defined as any point where
|zi – zbar| > 3 sigma
where zi is the data point (xi/yi), zbar is the average and sigma is the standard deviation.
The whole problem of outliers is controversial and fraught with dangers. In my opinion data should NERVER be rejected without specific cause.
Presumably each line is data taken about what you hope is the same thing. Perform the calculation (x/y) on each of the 11 points, take the average, find the standard deviation. An outlier is often defined as any point where
|zi – zbar| > 3 sigma
where zi is the data point (xi/yi), zbar is the average and sigma is the standard deviation.
The whole problem of outliers is controversial and fraught with dangers. In my opinion data should NERVER be rejected without specific cause.
In this example, you would want to calculate the average and standard deviation of the Y data. Then you would want to take the absolute value of each Y - Average. If this is greater than C * Standard Deviation, then the data is an outlier. C is a constant, and the value for C depends on how much you want to call an outlier. In this particular example, setting C=2 will give you only the 1 outlier, where X=36. If you were to set C=1, X=33 and X=38 would also be called outliers. If you were to set C=3, none of your data points would be outliers.
The details of how to calculate this will depend on the software you are using. If using Excel, here is an example:
-Assuming the titles X and Y are in row 1, Columns A and B, and data follows immediately
-Add a column Info
x y Info
-- -- ----
30 20
31 19
32 20
33 21
34 19
35 18
36 15
37 20
38 21
39 19
40 18
In cell B15, put this formula: =AVERAGE(B2:B12)
In cell B16, put this formula: =STDEV(B2:B12)
In cell B17, put this number (this is the C from above): 2
In cell C2, put this formula: =IF(ABS(B2-B$15)>(B$16*B$1 7),"OUTLIE R","normal ")
fill this formula down to C12
you can change the C (cell B17) to see how it affects things.
like aburr said, throwing out data because it's not like the others isn't usually a good idea. and in this particular case, 15 is that different from the others..... but that's specific to your problem
The details of how to calculate this will depend on the software you are using. If using Excel, here is an example:
-Assuming the titles X and Y are in row 1, Columns A and B, and data follows immediately
-Add a column Info
x y Info
-- -- ----
30 20
31 19
32 20
33 21
34 19
35 18
36 15
37 20
38 21
39 19
40 18
In cell B15, put this formula: =AVERAGE(B2:B12)
In cell B16, put this formula: =STDEV(B2:B12)
In cell B17, put this number (this is the C from above): 2
In cell C2, put this formula: =IF(ABS(B2-B$15)>(B$16*B$1
fill this formula down to C12
you can change the C (cell B17) to see how it affects things.
like aburr said, throwing out data because it's not like the others isn't usually a good idea. and in this particular case, 15 is that different from the others..... but that's specific to your problem
ASKER
Okay....x and y have almost no correlation..I had to use time and distance as an example. Im trying to detect outliers because I want to keep them and throw out the rest. My bigger problem involves finding outliers for data which are almost random, finding a pattern in it and finding occurances of data which are not following this pattern. I can see that if the case was liner using average would be easier, but what if the pattern looked non-linear? parabolic maybe? is the method the same to find outliers despite the different possible functions?
thanks again!
thanks again!
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.