Difference between revisions of "TDSM 6.7"

From The Data Science Design Manual Wikia

Revision as of 11:24, 8 September 2017

6-7. Construct scatter plots for sets of 10, 100, 1000, and 10,000 points. Experiment with the point size to find the most revealing value for each data set.


The following plots were constructed with the "Master.csv" data set from this archive: http://www.data-manual.com/datafiles/baseballdatabank-master.zip
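As a rough sketch of how the data might be prepared for plotting, the snippet below reads weight and age at death from the CSV file. The function name `load_weight_and_age` is hypothetical, the column names `weight`, `birthYear`, and `deathYear` are assumptions about the layout of "Master.csv", and computing age at death as a difference of years is a simplification that may not match how the plots here were actually built.

```python
import csv

def load_weight_and_age(path):
    """Read (weight, age_at_death) pairs, skipping rows with missing fields.

    Assumes columns named weight, birthYear, and deathYear (not confirmed
    by the original page).
    """
    pairs = []
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        for row in csv.DictReader(f):
            try:
                weight = float(row["weight"])
                # Approximate age at death from years only (an assumption).
                age = int(row["deathYear"]) - int(row["birthYear"])
            except (KeyError, ValueError, TypeError):
                continue  # blank or absent fields, e.g. players still living
            pairs.append((weight, age))
    return pairs
```

Calling `load_weight_and_age("Master.csv")` on the unzipped file would return a list of pairs from which samples of different sizes can be drawn.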

6.7.10.png

6.7.100.png

6.7.10k.png

One might suspect that a higher weight would be positively correlated with dying at a younger age due to the various health complications that often come with being overweight. However, that doesn't appear to be the case for baseball players.

Of the three plots, the first two can be misleading: the 10-point plot suggests a strongly positive correlation and the 100-point plot a strongly negative one. They also show less of the range. These smaller samples don't accurately represent the entire data set, but they do show that increasing the point size when there is less data makes traits such as correlation and density easier to see.
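The idea of shrinking the points as the sample grows can be sketched as follows. This is only an illustration, not the code used for the plots above: the helper names, the `base / n` sizing rule, and the output file name are all assumptions, and the plotting function assumes matplotlib is installed and that `rows` is a list of (weight, age at death) pairs.

```python
import random

def sample_rows(rows, n, seed=0):
    """Return n rows drawn without replacement (all rows if fewer exist)."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

def marker_area(n, base=2000.0, smallest=1.0):
    """Heuristic marker area in points^2: large for few points, small for many."""
    return max(smallest, base / max(n, 1))

def plot_samples(rows, sizes=(10, 100, 1000, 10000), out="scatter.png"):
    """Draw one scatter panel per sample size and save the figure to `out`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display required
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, len(sizes), figsize=(4 * len(sizes), 4),
                             squeeze=False)
    for ax, n in zip(axes[0], sizes):
        pts = sample_rows(rows, n)
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        # Semi-transparent markers keep dense regions readable.
        ax.scatter(xs, ys, s=marker_area(len(pts)), alpha=0.5)
        ax.set_title(f"n = {len(pts)}")
        ax.set_xlabel("weight (lbs)")
        ax.set_ylabel("age at death (years)")
    fig.tight_layout()
    fig.savefig(out)
    plt.close(fig)
```

With these defaults the marker area is 200 for 10 points, 20 for 100, 2 for 1000, and 1 for 10,000. The "most revealing" value is still a matter of experiment, so `base` is best treated as a starting point to adjust by eye.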

We can learn a lot more from the 10k plot, which shows barely any correlation and a wider range. I personally learned two things from this last plot. First, Eddie Gaedel, an American with dwarfism who weighed only 65 lbs, appeared in a Major League Baseball game in the 1950s[1]. Second, there is a very weak negative correlation between the age at which a baseball player dies and their weight while in the league (the Pearson correlation coefficient for this data was -0.0269160733072).
  1. https://en.wikipedia.org/wiki/Eddie_Gaedel
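
For readers who want to check a correlation figure like the one quoted above, here is a minimal standard-library sketch of the Pearson correlation coefficient. The function name `pearson` is hypothetical, and the original page does not say which tool produced its value, so this is one possible way to compute it rather than the method actually used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Passing the weights as `xs` and the ages at death as `ys` gives a value between -1 and 1. Values near 0 indicate little linear relationship.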