Frameworks and Distributed Training

How to choose a deep learning framework? How to enable distributed training for your models?

Summary

  • Unless you have a good reason not to, you should use either TensorFlow or PyTorch.

  • Both frameworks are converging to a point where they are good for research and production.

  • fast.ai is a solid option for beginners who want to iterate quickly.

  • Distributed training of neural networks can be approached in two ways: (1) data parallelism and (2) model parallelism.

  • In practice, data parallelism is the more popular approach and is widely used by large organizations to run production-level deep learning workloads (see the data-parallel sketch after this list).

  • Model parallelism, on the other hand, is only necessary when a model does not fit on a single GPU.

  • Ray is an open-source project for effortless, stateful, parallel, and distributed computing in Python (a minimal task/actor sketch follows below).

  • RaySGD is a library built on top of Ray for distributed data-parallel training that provides fault tolerance and seamless parallelization.

  • Horovod is Uber’s open-source distributed deep learning framework. It builds on standard multi-process communication frameworks such as MPI, which can make multi-node training easier to set up (a minimal PyTorch sketch follows below).
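
To make the data-parallel idea concrete, here is a minimal sketch of data-parallel training with PyTorch's DistributedDataParallel. The model, data, and hyperparameters are placeholders, and the script is assumed to be launched with one process per GPU (e.g. `torchrun --nproc_per_node=4 train.py`).

```python
# Minimal sketch of data-parallel training with PyTorch DistributedDataParallel.
# Model and data are placeholders; torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # One process per GPU; NCCL handles the gradient all-reduce between them.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; wrapping it in DDP averages gradients across processes.
    model = nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(100):
        # Each process would see a different shard of the data (random tensors here).
        x = torch.randn(32, 10).cuda(local_rank)
        y = torch.randn(32, 1).cuda(local_rank)

        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()   # gradients are all-reduced across workers during backward
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```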

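As a rough illustration of Ray's programming model, the sketch below turns an ordinary Python function into a parallel task and a class into a stateful actor. The function and class names (`preprocess`, `Counter`) are made up for illustration.

```python
# Minimal sketch of Ray's task and actor model.
import ray

ray.init()  # start or connect to a Ray cluster (local by default)


@ray.remote
def preprocess(shard):
    # Hypothetical CPU-bound work that can run in parallel across the cluster.
    return [x * 2 for x in shard]


@ray.remote
class Counter:
    # A stateful actor: Ray keeps this object alive in one worker process.
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n


# Launch tasks in parallel; .remote() returns futures immediately.
futures = [preprocess.remote(list(range(i, i + 5))) for i in range(0, 20, 5)]
results = ray.get(futures)  # block until all tasks finish

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # -> 1
```

RaySGD builds on these same primitives to launch and coordinate the data-parallel training workers for you.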
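
Finally, here is a minimal sketch of wrapping an existing PyTorch training loop with Horovod; the model and data are placeholders, and the script is assumed to be launched with one process per GPU (e.g. `horovodrun -np 4 python train.py`).

```python
# Minimal sketch of Horovod with PyTorch; launch one process per GPU via horovodrun.
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # pin each process to one GPU

# Placeholder model and optimizer.
model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrap the optimizer so gradients are averaged across workers (ring-allreduce),
# and start every worker from the same initial weights and optimizer state.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.MSELoss()
for step in range(100):
    # Each worker would read its own shard of the data (random tensors here).
    x = torch.randn(32, 10).cuda()
    y = torch.randn(32, 1).cuda()

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```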