Distance and Network Methods
Distance Metrics
10-1.
Prove that Euclidean distance is in fact a metric.
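As a reminder of what must be verified (standard definitions, not part of the exercise statement): a distance function [math]d[/math] is a metric when [math]d(x,y) \geq 0[/math] with equality if and only if [math]x = y[/math], [math]d(x,y) = d(y,x)[/math], and [math]d(x,z) \leq d(x,y) + d(y,z)[/math]. For Euclidean distance the first two properties are immediate; the triangle inequality is the only nontrivial step, and one standard route to it is the Cauchy–Schwarz inequality.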
10-3.
Prove that dimension-weighted [math]L_p[/math] distance is a metric, for all [math]p \geq 1[/math].
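One common reading of dimension-weighted [math]L_p[/math] distance, assumed here (check your definition), is [math]d(x,y) = \left( \sum_{i=1}^{d} w_i |x_i - y_i|^p \right)^{1/p}[/math] with fixed weights [math]w_i > 0[/math]. Under that reading, the triangle inequality follows from the Minkowski inequality applied to the rescaled coordinates [math]w_i^{1/p} x_i[/math].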
10-5.
Prove that edit distance on text strings defines a metric.
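It can help to have a concrete definition in hand while writing the proof. Below is a minimal Python sketch of Levenshtein edit distance (unit-cost insertions, deletions, and substitutions), assuming that is the variant the exercise intends; it is also handy for spot-checking symmetry and the triangle inequality numerically.

<syntaxhighlight lang="python">
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning s into t."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete s[i-1]
                           dp[i][j - 1] + 1,        # insert t[j-1]
                           dp[i - 1][j - 1] + sub)  # substitute or match
    return dp[m][n]

assert edit_distance("kitten", "sitting") == 3
</syntaxhighlight>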
Nearest Neighbor Classification
10-7.
What is the maximum number of nearest neighbors that a given point p can have in two dimensions, assuming the possibility of ties?
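To experiment with tie-heavy configurations, the hypothetical helper below returns every point tied for minimum distance to [math]p[/math]; the circle layout in the example is just one configuration worth trying.

<syntaxhighlight lang="python">
import math

def tied_nearest_neighbors(p, points, eps=1e-9):
    """Return all points tied for the minimum Euclidean distance to p."""
    dists = [math.dist(p, q) for q in points]
    dmin = min(dists)
    return [q for q, d in zip(points, dists) if d <= dmin + eps]

# Example configuration: points evenly spaced on a circle around p.
n = 12
p = (0.0, 0.0)
circle = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
          for k in range(n)]
print(len(tied_nearest_neighbors(p, circle)))  # all 12 points are tied
</syntaxhighlight>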
10-9.
Construct a two-class point set on [math]n \geq 10[/math] points in two dimensions, where every point would be misclassified according to its nearest neighbor.
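One way to sanity-check a candidate construction is leave-one-out 1-NN, as in the sketch below; the alternating-label line shown is just one candidate layout (an assumption to verify, not the required answer).

<syntaxhighlight lang="python">
import math

# Candidate construction: n = 10 points on a line, labels alternating.
points = [(float(i), 0.0) for i in range(10)]
labels = [i % 2 for i in range(10)]

def nn_label(i):
    """Leave-one-out 1-NN: label of the closest other training point."""
    j = min((j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]))
    return labels[j]

misclassified = [i for i in range(len(points)) if nn_label(i) != labels[i]]
print(len(misclassified))  # 10 means every point is misclassified
</syntaxhighlight>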
10-11.
Suppose a two-class, [math]k=1[/math] nearest-neighbor classifier is trained with at least three positive points and at least three negative points.
- Is it possible that this classifier labels all new examples as positive?
- What if [math]k=3[/math]? (A sketch for probing both cases appears after this list.)
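A sketch for probing both parts: run majority-vote [math]k[/math]-NN over a grid of queries for a training configuration you choose. The configuration below (clustered positives, widely separated negatives) is only one assumption worth testing.

<syntaxhighlight lang="python">
import math
from collections import Counter

def knn_predict(q, train, k):
    """Majority vote among the k training points nearest to query q."""
    nearest = sorted(train, key=lambda pl: math.dist(q, pl[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# One configuration to probe: positives packed together, negatives far apart.
train = [((0, 0), '+'), ((0.1, 0), '+'), ((0, 0.1), '+'),
         ((100, 0), '-'), ((0, 100), '-'), ((-100, -100), '-')]

for k in (1, 3):
    grid = [(x, y) for x in range(-120, 121, 5) for y in range(-120, 121, 5)]
    seen = {knn_predict(q, train, k) for q in grid}
    print(f"k={k}: labels observed over the grid: {seen}")
</syntaxhighlight>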
Networks
10-13.
Power law distributions on vertex degree in networks usually result from preferential attachment, a mechanism by which new edges are more likely to connect to nodes of high degree. For each of the following graphs, suggest what their vertex degree distribution is, and if they are power law distributed, describe what the preferential attachment mechanism might be (a simulation sketch appears after the list).
- Social networks like Facebook or Instagram.
- Sites on the World Wide Web (WWW).
- Road networks connecting cities.
- Product/customer networks like Amazon or Netflix.
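For intuition about what preferential attachment produces, the simulation sketch below (assuming the networkx library is available) grows a Barabási–Albert graph and summarizes its degree distribution; on a log-log plot the counts fall roughly on a line.

<syntaxhighlight lang="python">
import collections
import networkx as nx

# Barabasi-Albert preferential attachment: each new node attaches m edges
# to existing nodes with probability proportional to their current degree.
G = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)

degree_counts = collections.Counter(d for _, d in G.degree())
for d in sorted(degree_counts)[:5]:
    print(f"degree {d}: {degree_counts[d]} nodes")  # many low-degree nodes
print("max degree:", max(degree_counts))            # a few large hubs
</syntaxhighlight>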
10-15.
Prove that in any simple graph, there are always an even number of vertices with odd vertex degree.
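A useful starting identity here is the handshake lemma, [math]\sum_{v \in V} \deg(v) = 2|E|[/math]: every edge contributes exactly 2 to the total degree, so the sum is even, which constrains how many odd terms it can contain.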
Clustering
10-17.
For a data set with points at positions [math](4,10), (7,10), (4,8), (6,8), (3,4), (2,2), (5,2), (9,3), (12,3), (11,4), (10,5)[/math], and [math](12,6)[/math], show the clustering that results from each of the following methods (a sketch for checking your work appears after the list):
- Single-linkage clustering
- Average-linkage clustering
- Furthest-neighbor (complete linkage) clustering
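A sketch for checking hand work (assuming scipy is available; the method names correspond to single, average, and complete linkage, and cutting the tree into three clusters is an arbitrary choice for display):

<syntaxhighlight lang="python">
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

points = np.array([(4, 10), (7, 10), (4, 8), (6, 8), (3, 4), (2, 2),
                   (5, 2), (9, 3), (12, 3), (11, 4), (10, 5), (12, 6)],
                  dtype=float)

for method in ("single", "average", "complete"):
    Z = linkage(points, method=method)               # agglomerative merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, labels)
</syntaxhighlight>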
10-19.
Perform k-means clustering manually on the following points, for [math]k = 2[/math]:
[math]S=\{(1,4), (1,3), (0,4), (5,1), (6,2), (4,0)\}[/math]
Plot the points and the final clusters.
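To check the hand computation, here is a minimal Lloyd's-algorithm sketch; the initialization (one point from each apparent group) is an assumption, since the exercise may intend a different starting choice.

<syntaxhighlight lang="python">
import math

S = [(1, 4), (1, 3), (0, 4), (5, 1), (6, 2), (4, 0)]
centers = [S[0], S[3]]  # assumed initialization; no cluster goes empty here

for _ in range(10):  # Lloyd's algorithm: assign to nearest center, re-average
    clusters = [[], []]
    for p in S:
        j = min(range(2), key=lambda j: math.dist(p, centers[j]))
        clusters[j].append(p)
    centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]

print("centers:", centers)
print("clusters:", clusters)
</syntaxhighlight>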
10-21.
Identify a data set on entities where you have some sense of natural clusters which should emerge, be it on people, universities, companies, or movies. Cluster it by one or more algorithms, perhaps k-means and agglomerative clustering. Then evaluate the resulting clusters based on your knowledge of the domain. Did the algorithms do a good job? What did they get wrong? Can you explain why they did not reconstruct what was in your head?
10-23.
Assume that we are trying to cluster [math]n=10[/math] points in one dimension, where point [math]p_i[/math] has a position of [math]x=2^i[/math]. What is the agglomerative clustering tree for these points under each of the following criteria (a sketch for comparing them appears after the list)?
- Single-linkage clustering
- Average-linkage clustering
- Complete-link/furthest-neighbor clustering
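A sketch for comparing the three trees on this data (assuming scipy, and taking [math]i = 1, \ldots, 10[/math]; the points form a one-column observation matrix):

<syntaxhighlight lang="python">
import numpy as np
from scipy.cluster.hierarchy import linkage

# p_i at x = 2^i for i = 1..10, as a one-column observation matrix
X = np.array([[2.0 ** i] for i in range(1, 11)])

for method in ("single", "average", "complete"):
    Z = linkage(X, method=method)
    # Each row of Z: (cluster a, cluster b, merge distance, new cluster size)
    print(method)
    print(np.round(Z, 1))
</syntaxhighlight>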
Implementation Projects
10-25.
Do experiments studying the impact of merging criteria (single-link, centroid, average-link, furthest link) on the properties of the resulting cluster tree. Which leads to the tallest trees? The most balanced? How do their running times compare? Which method produces results most consistent with k-means clustering?
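A possible starting point (assuming numpy and scipy; here tree height is read off as the distance of the final merge, and the data are synthetic, to be swapped for real sets):

<syntaxhighlight lang="python">
import time
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))  # synthetic data; swap in a real set

for method in ("single", "centroid", "average", "complete"):
    t0 = time.perf_counter()
    Z = linkage(X, method=method)
    elapsed = time.perf_counter() - t0
    height = Z[-1, 2]  # distance of the final merge = height of the tree
    print(f"{method:9s} height={height:8.3f} time={elapsed:.3f}s")
</syntaxhighlight>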
Interview Questions
10-27.
What is the curse of dimensionality? How does it affect distance and similarity measures?
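One concrete way to see the effect on distances is the simulation below (dimensions chosen arbitrarily): as dimension grows, the nearest and farthest points from a query become nearly equidistant, which undermines nearest-neighbor reasoning.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))  # 500 uniform points in the unit hypercube
    q = rng.random(d)         # one query point
    dists = np.linalg.norm(X - q, axis=1)
    # The min/max ratio drifts toward 1 as d grows: distances concentrate.
    print(f"d={d:4d}  min/max distance ratio: {dists.min() / dists.max():.3f}")
</syntaxhighlight>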
10-29.
How might we be able to estimate the right number of clusters to use with a given data set?
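One common heuristic is the elbow method: plot the within-cluster sum of squares against [math]k[/math] and look for the bend past which adding clusters stops helping much. A sketch, assuming scikit-learn and toy data:

<syntaxhighlight lang="python">
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances; look for
    # the k past which it stops dropping sharply (the "elbow").
    print(f"k={k}: inertia={km.inertia_:.1f}")
</syntaxhighlight>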
10-31.
How can you deal with correlated features in your data set by reducing the dimensionality of the data?
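Principal component analysis is the standard tool here; the sketch below (assuming scikit-learn, on deliberately correlated synthetic columns) keeps just enough components to explain 95% of the variance.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
# Append near-copies of the first two columns to create correlated features.
X = np.hstack([base, base[:, :2] + 0.01 * rng.normal(size=(200, 2))])

Xs = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale
pca = PCA(n_components=0.95)            # keep 95% of the variance
Xr = pca.fit_transform(Xs)
print(X.shape, "->", Xr.shape)          # the correlated columns collapse
</syntaxhighlight>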
Kaggle Challenges
10-33.
Which people are most influential in a given social network?
https://www.kaggle.com/c/predict-who-is-more-influential-in-a-social-network
10-35.
Predict which product a consumer is most likely to buy.
https://www.kaggle.com/c/coupon-purchase-prediction