Distance-network-TDSM


Distance and Network Methods

Distance Metrics


10-1. Prove that Euclidean distance is in fact a metric.

(Solution 10.1)
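
For reference in Exercises 10-1, 10-3, and 10-5, a distance function [math]d(x,y)[/math] is a metric when it satisfies the following properties for all points [math]x, y, z[/math]:

  1. Positivity: [math]d(x,y) \geq 0[/math], with [math]d(x,y) = 0[/math] if and only if [math]x = y[/math]
  2. Symmetry: [math]d(x,y) = d(y,x)[/math]
  3. Triangle inequality: [math]d(x,z) \leq d(x,y) + d(y,z)[/math]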


10-3. Prove that dimension-weighted [math]L_p[/math] distance is a metric, for all [math]p \geq 1[/math].

(Solution 10.3)


10-5. Prove that edit distance on text strings defines a metric.

(Solution 10.5)
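
As a companion to Exercise 10-5, here is a minimal sketch of unit-cost edit (Levenshtein) distance computed by dynamic programming. The function name edit_distance and the unit costs for insertion, deletion, and substitution are assumptions for illustration, not part of the exercise.

  def edit_distance(s, t):
      """Unit-cost Levenshtein distance between strings s and t."""
      m, n = len(s), len(t)
      # dp[i][j] = cost of transforming s[:i] into t[:j]
      dp = [[0] * (n + 1) for _ in range(m + 1)]
      for i in range(m + 1):
          dp[i][0] = i          # delete all of s[:i]
      for j in range(n + 1):
          dp[0][j] = j          # insert all of t[:j]
      for i in range(1, m + 1):
          for j in range(1, n + 1):
              sub = 0 if s[i - 1] == t[j - 1] else 1
              dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                             dp[i][j - 1] + 1,        # insertion
                             dp[i - 1][j - 1] + sub)  # match / substitution
      return dp[m][n]

  # Symmetry spot-check on a classic example: both directions give 3.
  assert edit_distance("kitten", "sitting") == edit_distance("sitting", "kitten") == 3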


Nearest Neighbor Classification


10-7. What is the maximum number of nearest neighbors that a given point [math]p[/math] can have in two dimensions, assuming the possibility of ties?

(Solution 10.7)


10-9. Construct a two-class point set on [math]n \geq 10[/math] points in two dimensions, where every point would be misclassified according to its nearest neighbor.

(Solution 10.9)
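
One way to sanity-check a candidate construction for Exercise 10-9 is a leave-one-out test: every point's nearest other point should carry the opposite label. The sketch below is only such a checker; the four placeholder points and labels are an arbitrary assumption and not a solution to the exercise.

  from math import dist  # Python 3.8+

  def all_misclassified(points, labels):
      """Return True if every point's nearest *other* point has a different label."""
      for i, p in enumerate(points):
          j = min((k for k in range(len(points)) if k != i),
                  key=lambda k: dist(p, points[k]))
          if labels[j] == labels[i]:   # nearest neighbor agrees, so this point is classified correctly
              return False
      return True

  # Placeholder data: replace with your own candidate construction on n >= 10 points.
  points = [(0, 0), (1, 0), (0, 1), (1, 1)]
  labels = ['+', '+', '-', '-']
  print(all_misclassified(points, labels))   # False for this placeholder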


10-11. Suppose a two-class, [math]k=1[/math] nearest-neighbor classifier is trained with at least three positive points and at least three negative points.

  1. Is it possible that this classifier could label all new examples as positive?
  2. What if [math]k = 3[/math]?

(Solution 10.11)


Networks


10-13. Power law distributions on vertex degree in networks usually result from preferential attachment, a mechanism by which new edges are more likely to connect to nodes of high degree. For each of the following networks, suggest what its vertex degree distribution looks like, and if it is power law distributed, describe what the preferential attachment mechanism might be.

  1. Social networks like Facebook or Instagram.
  2. Sites on the World Wide Web (WWW).
  3. Road networks connecting cities.
  4. Product/customer networks like Amazon or Netflix.

(Solution 10.13)
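
To build intuition for Exercise 10-13, here is a rough pure-Python sketch of preferential attachment: each new node attaches to an existing node chosen with probability proportional to its current degree, and the resulting degree distribution is heavy-tailed. The growth rule of one edge per new node is a simplifying assumption.

  import random
  from collections import Counter

  def preferential_attachment(n, seed=0):
      """Grow a graph where each new node links to one existing node,
      chosen with probability proportional to current degree."""
      random.seed(seed)
      degree = {0: 1, 1: 1}          # start from a single edge (0, 1)
      targets = [0, 1]               # each node appears once per unit of degree
      for v in range(2, n):
          u = random.choice(targets)  # degree-proportional choice
          degree[v] = 1
          degree[u] += 1
          targets.extend([u, v])
      return degree

  degrees = preferential_attachment(10000)
  hist = Counter(degrees.values())
  for d in sorted(hist)[:10]:
      print(d, hist[d])   # counts fall off roughly as a power law in d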


10-15. Prove that in any simple graph, there is always an even number of vertices with odd degree.

(Solution 10.15)
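
An empirical sanity check (not a proof) for Exercise 10-15: generate random simple graphs and count the vertices of odd degree; the count comes out even every time. The graph size and the edge probability of 0.3 are arbitrary assumptions.

  import random
  from itertools import combinations

  random.seed(1)
  for trial in range(5):
      n = 12
      # Build a random simple graph by keeping each possible edge with probability 0.3.
      edges = [e for e in combinations(range(n), 2) if random.random() < 0.3]
      degree = [0] * n
      for u, v in edges:
          degree[u] += 1
          degree[v] += 1
      odd = sum(1 for d in degree if d % 2 == 1)
      print(odd, odd % 2 == 0)   # the number of odd-degree vertices is always even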


Clustering


10-17. For a data set with points at positions [math](4,10), (7,10), (4,8), (6,8), (3,4), (2,2), (5,2), (9,3), (12,3), (11,4), (10,5), (12,6)[/math], show the clustering that results from

  1. Single-linkage clustering
  2. Average-linkage clustering
  3. Furthest-neighbor (complete-linkage) clustering

(Solution 10.17)
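
A quick way to check hand-drawn answers for Exercise 10-17 is to run the three linkage criteria through SciPy; the use of scipy.cluster.hierarchy and matplotlib here is one possible tooling choice, not something the exercise prescribes.

  from scipy.cluster.hierarchy import linkage, dendrogram
  import matplotlib.pyplot as plt

  points = [(4, 10), (7, 10), (4, 8), (6, 8), (3, 4), (2, 2),
            (5, 2), (9, 3), (12, 3), (11, 4), (10, 5), (12, 6)]

  for method in ("single", "average", "complete"):
      Z = linkage(points, method=method)   # pairwise Euclidean distances by default
      plt.figure()
      plt.title(f"{method}-linkage dendrogram")
      dendrogram(Z)
  plt.show()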


10-19. Perform k-means clustering manually on the following points, for [math]k = 2[/math]: [math]S=\{(1,4), (1,3), (0,4), (5,1), (6,2), (4,0)\}[/math]. Plot the points and the final clusters.

(Solution 10.19)
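
After working Exercise 10-19 by hand, a small Lloyd's-iteration sketch like the following can confirm the result; the starting centroids chosen below are an arbitrary assumption, and a different initialization can change the outcome.

  import numpy as np

  S = np.array([(1, 4), (1, 3), (0, 4), (5, 1), (6, 2), (4, 0)], dtype=float)
  centers = S[[0, 3]].copy()   # arbitrary initialization: one point from each apparent group

  for _ in range(10):          # Lloyd's iterations: assign points, then recompute centroids
      labels = np.argmin(((S[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
      centers = np.array([S[labels == k].mean(axis=0) for k in range(2)])

  print(labels)    # cluster assignment of each point
  print(centers)   # final centroids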


10-21. Identify a data set on entities where you have some sense of the natural clusters that should emerge, be it on people, universities, companies, or movies. Cluster it by one or more algorithms, perhaps k-means and agglomerative clustering. Then evaluate the resulting clusters based on your knowledge of the domain. Did the algorithms do a good job? What did they get wrong? Can you explain why they did not reconstruct what was in your head?

(Solution 10.21)


10-23. Assume that we are trying to cluster [math]n=10[/math] points in one dimension, where point [math]p_i[/math] has a position of [math]x=2^i[/math]. What is the agglomerative clustering tree for these points under

  1. Single-linkage clustering
  2. Average-linkage clustering
  3. Complete-link/furthest-neighbor clustering

(Solution 10.23)


Implementation Projects


10-25. Do experiments studying the impact of merging criteria (single-link, centroid, average-link, furthest link) on the properties of the resulting cluster tree. Which leads to the tallest trees? The most balanced? How do their running times compare? Which method produces results most consistent with k-means clustering?

(Solution 10.25)
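
One possible starting harness for Exercise 10-25, assuming SciPy's implementations of the merging criteria and synthetic Gaussian data; tree "height" is read off here as the distance of the final merge, which is just one of several reasonable definitions.

  import time
  import numpy as np
  from scipy.cluster.hierarchy import linkage

  rng = np.random.default_rng(0)
  X = rng.normal(size=(2000, 5))   # synthetic data; swap in a real data set

  for method in ("single", "centroid", "average", "complete"):
      start = time.perf_counter()
      Z = linkage(X, method=method)
      elapsed = time.perf_counter() - start
      height = Z[-1, 2]            # distance at which the last two clusters merge
      print(f"{method:9s}  height={height:8.3f}  time={elapsed:.2f}s")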


Interview Questions


10-27. What is the curse of dimensionality? How does it affect distance and similarity measures?

(Solution 10.27)
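
A small experiment that illustrates the distance-concentration aspect of Exercise 10-27: as the dimension grows, the gap between a query point's nearest and farthest neighbor shrinks relative to the nearest distance. The uniform data and sample size are arbitrary assumptions.

  import numpy as np

  rng = np.random.default_rng(0)
  for d in (2, 10, 100, 1000):
      X = rng.random((500, d))                 # 500 points uniform in the unit cube
      q = rng.random(d)                        # a query point
      dists = np.linalg.norm(X - q, axis=1)
      contrast = (dists.max() - dists.min()) / dists.min()
      print(f"d={d:5d}  relative contrast={contrast:.3f}")  # shrinks as d grows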


10-29. How might we be able to estimate the right number of clusters to use with a given data set?

(Solution 10.29)
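
For Exercise 10-29, one common heuristic is to sweep k and score each clustering. The sketch below uses scikit-learn's KMeans and silhouette score, which is an assumption about tooling rather than the only approach; the elbow method on within-cluster sum of squares is another.

  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.metrics import silhouette_score

  rng = np.random.default_rng(0)
  # Synthetic data with three blobs; replace with your own data set.
  X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ((0, 0), (5, 0), (0, 5))])

  for k in range(2, 7):
      labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
      print(k, round(silhouette_score(X, labels), 3))   # the highest score suggests a good k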


10-31. How can you deal with correlated features in your data set by reducing the dimensionality of the data?

(Solution 10.31)
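
A minimal sketch for Exercise 10-31, assuming scikit-learn's PCA: project the correlated features onto a smaller set of uncorrelated principal components and keep enough components to explain most of the variance. The synthetic correlated features below are an illustrative assumption.

  import numpy as np
  from sklearn.decomposition import PCA

  rng = np.random.default_rng(0)
  # Two nearly redundant (highly correlated) features plus one independent feature.
  base = rng.normal(size=(200, 1))
  X = np.hstack([base, base + 0.05 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])

  pca = PCA(n_components=2).fit(X)
  print(pca.explained_variance_ratio_)   # most variance is captured by the first component(s)
  X_reduced = pca.transform(X)           # decorrelated, lower-dimensional representation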


Kaggle Challenges


10-33. Which people are most influential in a given social network? https://www.kaggle.com/c/predict-who-is-more-influential-in-a-social-network

(Solution 10.33)


10-35. Predict which product a consumer is most likely to buy. https://www.kaggle.com/c/coupon-purchase-prediction

(Solution 10.35)