TDSM 3.15
From The Data Science Design Manual Wikia
Below are some ways to screen a dataset for outliers:
- Visualize the data with graphs and look for points that stand apart from the rest as candidate outliers.
- Inspect the minimum and maximum values in each dimension; values that are impossible or far outside the expected range are candidates.
- For normally distributed data, a point that lies several standard deviations (sigma) from the mean, commonly three or more, can be an outlier.
- Cluster the data points; a point that lies very far from its cluster center can be an outlier.
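The sigma rule and the distance-from-center rule can be sketched as follows. This is a minimal illustration, not a definitive method: the function names, the default threshold of 3 sigma, and the rule "farther than `factor` times the median distance" are assumptions chosen for the example, and the data is treated as a single cluster whose center is the mean of the points. In practice the centers would come from a clustering algorithm.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Boolean mask: True where a value is more than `threshold` sigma from the mean."""
    x = np.asarray(values, dtype=float)
    sigma = x.std()
    if sigma == 0:
        # All values identical, so nothing can be flagged.
        return np.zeros(x.shape, dtype=bool)
    return np.abs(x - x.mean()) > threshold * sigma

def center_distance_outliers(points, factor=3.0):
    """Boolean mask: True where a point is farther from the center than
    `factor` times the median distance (an assumed rule for this sketch)."""
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)  # single-cluster assumption
    dist = np.linalg.norm(pts - center, axis=1)
    return dist > factor * np.median(dist)

# Hypothetical data: 19 values near 10 and one value of 100.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
          10, 9, 11, 10, 12, 9, 10, 11, 10, 100]
value_mask = zscore_outliers(values)

# Hypothetical 2-D points: a small group near the origin and one far point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (10, 10)]
point_mask = center_distance_outliers(points)
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so in a small sample a single outlier may not reach the 3 sigma threshold.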
Handling of outliers:
- If the outliers are caused by measurement error or some other error in data collection, they can be deleted, which can improve the model.
- If the outliers are not caused by an error, they should not be deleted: similar points can appear again in the test data, and a model trained without them will give worse results on those points.
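The handling rule above can be sketched as a filter that removes a point only when it is both flagged as an outlier and known to be an error. The function name, the example heights, and the two hand-written masks are hypothetical; deciding which outliers are errors is a judgment about the data source that the code does not make.

```python
import numpy as np

def drop_confirmed_errors(values, is_outlier, is_error):
    """Remove only the values that are flagged as outliers AND known to be errors."""
    x = np.asarray(values, dtype=float)
    remove = np.asarray(is_outlier, dtype=bool) & np.asarray(is_error, dtype=bool)
    return x[~remove]

# Hypothetical heights in cm: -1.0 is an impossible reading (an error),
# while 221.0 is unusual but could be a genuine measurement.
heights_cm = [170.0, 165.0, -1.0, 180.0, 221.0]
is_outlier = [False, False, True, False, True]
is_error = [False, False, True, False, False]

cleaned = drop_confirmed_errors(heights_cm, is_outlier, is_error)
```

Here the impossible reading is dropped and the genuine extreme value stays in the training data.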