TDSM 9.5
In long-tailed distributions, a high-frequency population is followed by a low-frequency population that gradually tails off asymptotically.
Rule of thumb: the majority of occurrences (more than half, and 80% when the Pareto principle applies) are accounted for by the first 20% of items in the distribution.
Compared with a thin-tailed distribution such as the Gaussian, the least frequently occurring 80% of items carry more weight as a proportion of the total population.
Related: Zipf’s law, Pareto distribution, power laws.
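The 80/20 rule of thumb can be checked numerically. The sketch below is only an illustration: it assumes a made-up population of 1,000 items whose weight decays as 1/rank (a Zipf-like shape), and the helper name `top_share` is hypothetical.

```python
def top_share(weights, top_fraction=0.2):
    """Share of the total carried by the largest `top_fraction` of items."""
    ordered = sorted(weights, reverse=True)
    k = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)

# Assumed long-tailed population: weight of the item at rank r is 1/r.
zipf_weights = [1.0 / rank for rank in range(1, 1001)]
# Contrast case with no head and no tail: every item weighs the same.
uniform_weights = [1.0] * 1000

zipf_share = top_share(zipf_weights)
uniform_share = top_share(uniform_weights)

print(f"top 20% share, 1/rank weights: {zipf_share:.2f}")
print(f"top 20% share, equal weights:  {uniform_share:.2f}")
```

With 1/rank weights the top 20% of items carry a little under 80% of the total, while with equal weights they carry exactly 20%.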
Examples:
1) Natural language
- Given some corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table (Zipf’s law)
- The most frequent word will occur twice as often as the second most frequent, three times as often as the third most frequent…
- “The” accounts for about 7% of all word occurrences in the Brown Corpus (roughly 70,000 out of 1 million)
- “of” accounts for about 3.5%, followed by “and”…
- Only 135 vocabulary items are needed to account for half of the Brown Corpus!
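The rank-frequency relationship above can be sketched in a few lines. This is a minimal illustration, not a fitting procedure: the toy sentence is made up (and far too small to follow Zipf's law closely), and the helper names `rank_frequency` and `zipf_prediction` are hypothetical.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common()

def zipf_prediction(top_count, rank):
    """Count predicted at `rank` if frequency is proportional to 1/rank."""
    return top_count / rank

# Toy corpus, for illustration only.
tokens = "the cat and the dog and the bird saw the cat".split()
table = rank_frequency(tokens)
top_count = table[0][1]

for rank, (word, count) in enumerate(table, start=1):
    predicted = zipf_prediction(top_count, rank)
    print(f"{rank} {word:5s} observed={count} predicted={predicted:.2f}")

# Using the figure quoted above: if "the" occurs about 70,000 times,
# the second-ranked word is predicted to occur about half as often.
second_rank_prediction = zipf_prediction(70000, 2)
```

The last line reproduces the arithmetic in the notes: half of roughly 70,000 is roughly 35,000, which is the 3.5% quoted for “of”.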
2) Allocation of wealth among individuals: a large share of the wealth of any society is controlled by a small percentage of the people
3) File size distribution of Internet traffic
4) Additional: hard disk error rates, sizes of oil reserves across fields (a few large fields, many small ones), sizes of sand particles, sizes of meteorites
Importance in classification and regression problems:
- Skewed distribution
- Which metric to use? Accuracy can be misleading on imbalanced classes (accuracy paradox in classification); consider F-score or AUC
- Issue with models that assume linearity (linear regression): apply a monotone transformation to the data first (logarithm, square root, sigmoid function…)
- Issue when sampling: with plain random sampling your data can become even more unbalanced! Use stratified sampling instead of random sampling, SMOTE (“Synthetic Minority Over-sampling Technique”, N. V. Chawla), or an anomaly detection approach
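The accuracy paradox is easy to reproduce by hand. The sketch below uses made-up labels (95 negatives, 5 positives) and two hypothetical classifiers. The metrics are written out in plain Python for clarity; in practice a library such as scikit-learn provides equivalent functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Assumed imbalanced data set: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Classifier A: always predicts the majority class.
always_negative = [0] * 100
# Classifier B: 6 false alarms, then finds 4 of the 5 positives.
imperfect = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1

for name, y_pred in [("always negative", always_negative), ("imperfect", imperfect)]:
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.2f} "
          f"F1={f1_score(y_true, y_pred):.2f}")
```

Classifier A wins on accuracy (0.95 against 0.93) although it never detects a positive, while the F-score ranks the two the other way round (0.00 against 0.53).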