TDSM 7.21


In my personal experience, I would say good data is more important than a good model.

If you don't have a good model, you can try other models or settle for a relatively good model and a relatively good result. However, with bad data, there is nothing you can do but hate the world and hate society.

From another perspective, it is much, much more expensive and difficult to collect, label, and maintain a good dataset. This can involve a lot of manual labor.

In terms of good data:

  1. well labeled. If the textbooks we give kids are all wrong, we cannot expect the kids to understand the patterns and rules of the real world. Of course, sometimes we deliberately add noise to a dataset to test whether the system is robust, and there are many tricks for getting rid of noise; some errors and mistakes in a dataset are inevitable.
  2. balanced. You cannot expect a good result if you have 10,000 examples of class A and only 1 of class B. It is always hard to learn from an unbalanced dataset (see the small sketch after this list). At the very least, your data should represent the prior probabilities well. Ideally, the data would be randomly sampled from the real world.
  3. well structured. This simply makes the data easier for others to use.
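As a rough illustration of point 2, here is a minimal sketch in plain Python, using made-up numbers that match the 10,000-to-1 example above. It shows how accuracy can look excellent on an unbalanced dataset even when the minority class is never found; the "model" here is just a degenerate predictor, not anything from the book.

```python
# Hypothetical unbalanced dataset: 10,000 examples of A, 1 of B.
labels = ["A"] * 10000 + ["B"] * 1

# Degenerate "model" that always predicts the majority class A.
predictions = ["A"] * len(labels)

# Accuracy looks nearly perfect even though nothing was learned.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on class B shows the real picture: every B is missed.
recall_b = (
    sum(p == y == "B" for p, y in zip(predictions, labels))
    / sum(y == "B" for y in labels)
)

print(f"accuracy    = {accuracy:.4%}")  # ~99.99%
print(f"recall on B = {recall_b:.0%}")  # 0%
```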

In terms of a good model, there is no absolutely best or even absolutely good model. It depends entirely on how you use it and on what kind of problem you want to solve. There are several metrics for judging a model, such as accuracy, precision, and recall, but which one matters also depends on the situation. When you judge a cancer classifier, it is better to aim for higher recall. When you want a terrorist detector, it is better to pick the one with higher precision (unless you want to find every possible terrorist and you think it is fine to shoot at the innocent).
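To make the precision/recall trade-off concrete, here is a small sketch in plain Python with hypothetical confusion-matrix counts; the helper function and the numbers are illustrative, not taken from the book. A cancer screen cares more about recall (miss few real cases), while a detector whose alarms trigger drastic action cares more about precision (raise few false alarms).

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical classifier: 90 true positives, 30 false positives,
# 10 false negatives (10 real cases missed).
p, r = precision_recall(tp=90, fp=30, fn=10)
print(f"precision = {p:.2f}")  # 0.75 -- 1 in 4 flagged cases is a false alarm
print(f"recall    = {r:.2f}")  # 0.90 -- 90% of real cases are caught
```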