Sources

Where do the training data come from?

Summary

  • Most deep learning applications require lots of labeled data. Publicly available datasets can serve as a starting point, but relying on them offers no competitive advantage.

  • Most companies spend a lot of money and time labeling their own data.

  • A data flywheel means harnessing the power of users to rapidly improve the whole machine learning system.

  • Semi-supervised learning is a relatively recent learning technique in which the training data is labeled automatically rather than by hand.

  • Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models, without actually collecting new data.

  • Synthetic data is data that is generated programmatically. It is an underrated idea that is almost always worth starting with.
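
To make the data augmentation point concrete, here is a minimal sketch using only the Python standard library. It assumes a grayscale image stored as a list of rows of floats in [0, 1]. The function name `augment` and the parameters `max_shift` and `noise_std` are hypothetical, chosen for illustration. In practice a library would usually handle this step.

```python
import random


def augment(image, rng, max_shift=1, noise_std=0.05):
    """Return a randomly perturbed copy of a grayscale image.

    `image` is a list of rows, each a list of floats in [0, 1].
    All names and default values here are illustrative assumptions.
    """
    h, w = len(image), len(image[0])
    # Random horizontal flip with probability 0.5.
    flip = rng.random() < 0.5
    # Random shift of up to `max_shift` pixels along each axis.
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Clamp the source coordinates so the shift never reads out of bounds.
            sy = min(max(y + dy, 0), h - 1)
            sx = min(max(x + dx, 0), w - 1)
            if flip:
                sx = w - 1 - sx
            # Add small Gaussian noise, then clip back to the valid range.
            value = image[sy][sx] + rng.gauss(0.0, noise_std)
            row.append(min(max(value, 0.0), 1.0))
        out.append(row)
    return out


rng = random.Random(0)
# Stand-in for one real labeled example.
image = [[rng.random() for _ in range(8)] for _ in range(8)]
label = 1
# Four new training pairs derived from one labeled image; the label is reused unchanged.
augmented = [(augment(image, rng), label) for _ in range(4)]
```

Each call yields a different variant of the same image with the same label, so the training set becomes more diverse without any new collection or labeling effort.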
