# Sources

{% embed url="<https://www.youtube.com/watch?v=5rY5HPe9UzI>" %}
Sources - Data Management
{% endembed %}

## Summary

* Most deep learning applications require lots of labeled data. Publicly available datasets can serve as a starting point, but relying on them alone gives no competitive advantage.
* Most companies spend a lot of money and time labeling their own data.
* **Data flywheel** means harnessing the power of users to rapidly improve the whole machine learning system.
* **Semi-supervised learning** is a relatively recent learning technique in which training labels are generated automatically (for example, from the data itself or by a model trained on a small labeled set) rather than by hand.
* **Data augmentation** is a strategy that enables practitioners to significantly increase the diversity of data available for training models, without actually collecting new data.
* **Synthetic data** is data that’s generated programmatically, an underrated idea that is almost always worth starting with.
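
To make the data augmentation point concrete, here is a minimal sketch in plain Python. It assumes images are small 2D lists of pixel values and uses two common transformations, a random crop and a horizontal flip. The function names (`horizontal_flip`, `random_crop`, `augment`), the toy image, and the `"cat"` label are hypothetical and chosen only for illustration; real projects would typically rely on an image library rather than nested lists.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D image (a list of rows) left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position in the image."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, label, n_copies, crop_h, crop_w, seed=0):
    """Return n_copies randomly transformed (image, label) pairs.

    The label is reused unchanged, which assumes the transformations
    do not alter what the image depicts.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_copies):
        new_image = random_crop(image, crop_h, crop_w, rng)
        if rng.random() < 0.5:  # flip roughly half of the copies
            new_image = horizontal_flip(new_image)
        samples.append((new_image, label))
    return samples


# Toy 4x4 "image" of pixel intensities with a hypothetical label.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
augmented = augment(image, label="cat", n_copies=5, crop_h=3, crop_w=3)
```

One labeled example becomes five training samples here without any new data collection. Whether a given transformation is safe depends on the task: a horizontal flip is usually harmless for object photos but would change the meaning of text or digits.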
