Deep Learning (DL) has emerged as a powerful tool for solving complex problems in various domains, including natural language processing and computer vision. DL, being an empirical process, requires tuning hyperparameters and exploring neural network architectures, both of which demand significant compute resources, storage, memory, time, and human effort. While tools exist to address the challenges of large datasets or of large DL models, comprehensive solutions that efficiently handle both large-scale models and large-scale datasets remain scarce. The advent of Transformers and Large Language Models (LLMs) has underscored these problems and made overcoming them more important than ever. These issues are particularly acute for small companies and individuals, who lack the resources of big tech. There is a need to democratize large-scale DL.
In response, we propose Cerebro, a novel end-to-end platform that scales DL efficiently on a cluster, regardless of dataset or model size. Our platform can preprocess data; train, validate, and test models; and visualize results, all under one roof. Cerebro achieves this through a novel scheduler that hybridizes task, data, and model parallelism.
Our design supports fault tolerance and heterogeneous cluster resources. Cerebro's user-friendly templates make scaling DL effortless, allowing users to work with the same familiarity as on their local machines. To evaluate the platform, we conducted experiments on various DL tasks, including image captioning and object detection. The experiments demonstrated that Cerebro scales DL workloads efficiently, significantly reducing the time, effort, and resource costs of large-scale model selection.

This thesis describes the methods and approaches taken in the design and development of the Cerebro platform. We also discuss our experimental observations of Cerebro in action in detail and outline directions this work can take in the future.