TDSM 9.5

From The Data Science Design Manual Wikia
Revision as of 22:02, 9 December 2017 by Kv (talk | contribs)

In long-tailed distributions, a high-frequency population is followed by a low-frequency population that gradually tails off asymptotically.

- Rule of thumb: the majority of occurrences (more than half, and 80% when the Pareto principle applies) are accounted for by the first 20% of items in the distribution.
- The least frequently occurring 80% of items nevertheless account for a larger share of the total population than they would under a thin-tailed distribution such as the Gaussian.
- Related concepts: Zipf’s law, the Pareto distribution, power laws.
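
The rule of thumb above can be checked on any list of item counts. The following is a minimal sketch, not a reference implementation: the helper name `top_share` and both sample lists are hypothetical, chosen only to contrast a Zipf-like tail with a flat distribution.

```python
def top_share(counts, fraction=0.2):
    """Share of all occurrences held by the most frequent `fraction` of items."""
    ordered = sorted(counts, reverse=True)
    k = max(1, round(fraction * len(ordered)))  # number of "head" items
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical Zipf-like counts: the item at rank r occurs about 1000 / r times.
zipf_like = [1000 // r for r in range(1, 101)]
# Flat counts for comparison: every item occurs equally often.
uniform = [10] * 100

print(f"Zipf-like: top 20% of items hold {top_share(zipf_like):.0%} of occurrences")
print(f"Uniform:   top 20% of items hold {top_share(uniform):.0%} of occurrences")
```

With flat counts the top 20% of items hold exactly 20% of the occurrences; with the Zipf-like counts they hold well over half.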


Examples:

1) Natural language
- Given some corpus of natural language, the frequency of any word is approximately inversely proportional to its rank in the frequency table (Zipf’s law).
- The most frequent word occurs about twice as often as the second most frequent, three times as often as the third most frequent, and so on.
- In the Brown Corpus, “the” accounts for about 7% of all word occurrences (roughly 70,000 out of 1 million).
- “of” accounts for about 3.5%, followed by “and”…
- Only 135 vocabulary items are needed to account for half of the Brown Corpus!
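
The rank-frequency relationship in the natural language example can be sketched in a few lines. This is an illustration under stated assumptions: the helper names (`rank_frequency`, `zipf_expected`, `ranks_to_cover`) and the toy sentence are hypothetical, and the exponent of 1 is the idealized Zipf case rather than a value fitted to any corpus.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]

def zipf_expected(top_count, rank, exponent=1.0):
    """Count predicted at `rank` if counts fall off as 1 / rank**exponent."""
    return top_count / rank ** exponent

def ranks_to_cover(counts, share=0.5):
    """Number of top-ranked items needed to reach `share` of all occurrences."""
    target = share * sum(counts)
    running = 0
    for n, count in enumerate(sorted(counts, reverse=True), start=1):
        running += count
        if running >= target:
            return n
    return len(counts)

# Toy corpus (hypothetical): real corpora are far larger.
tokens = "the cat and the dog and the bird saw the cat".split()
for rank, word, count in rank_frequency(tokens):
    # Observed count next to the idealized Zipf prediction from the top count.
    print(rank, word, count, round(zipf_expected(4, rank), 2))
```

Under the idealized law, a top word with 70,000 occurrences predicts 35,000 for rank 2, which matches the 7% and 3.5% figures quoted above. Applying `ranks_to_cover` to real corpus counts is the kind of calculation behind the "135 vocabulary items" statement.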

2) Allocation of wealth among individuals: the larger portion of the wealth of any society is controlled by a small percentage of the people.

3) File size distribution of Internet traffic.

Additional examples: hard disk error rates, values of oil reserves in a field (a few large fields, many small ones), sizes of sand particles, sizes of meteorites.

Importance in classification and regression problems:
- Skewed distributions: which metrics to use? Accuracy can be misleading (the accuracy paradox in classification); consider the F-score or AUC instead.
- Models that assume linearity (e.g. linear regression) fit heavy-tailed data poorly: apply a monotone transformation to the data first (logarithm, square root, sigmoid function…).
- Sampling can make the data even more unbalanced! Use stratified sampling instead of simple random sampling, SMOTE (“Synthetic Minority Over-sampling Technique”, N. V. Chawla), or an anomaly detection approach.
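
The accuracy paradox mentioned above is easy to reproduce on a skewed label set. The sketch below is an illustration only: the 1% positive rate, both sets of predictions, and the helper names are hypothetical. Libraries such as scikit-learn provide equivalent metrics, but plain Python is used here so the arithmetic stays visible.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """F-score (harmonic mean of precision and recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical skewed data: 10 positives among 1000 examples.
y_true = [1] * 10 + [0] * 990

# Baseline that ignores the minority class entirely.
always_negative = [0] * 1000
# Imperfect detector: finds 8 of the 10 positives, raises 20 false alarms.
detector = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

for name, y_pred in [("always negative", always_negative), ("detector", detector)]:
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.3f}, "
          f"F1={f1_score(y_true, y_pred):.3f}")
```

The always-negative baseline reaches 99% accuracy while its F-score is 0, because it never finds a positive. The detector has lower accuracy but a clearly higher F-score, which is why the F-score or AUC is the safer choice on skewed data.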