Data-driven approaches, especially those that leverage deep learning (DL), have led to significant progress on many important problems in computer vision and image/video processing over the last decade -- fueled by the availability of large-scale training datasets. Typically, for supervised DL tasks that assess unambiguous aspects of visual media -- such as classifying an object in an image or recognizing an activity in a video -- large-scale datasets can be reliably captured with human-provided labels specifying the expected correct answer. In contrast, an important class of perceptual tasks deserves special attention: assessing various aspects of the quality and authenticity of visual media. DL for these tasks can enable widespread downstream applications. However, the subjective nature of these tasks makes it difficult to capture unambiguous and consistent large-scale human-annotated training data. This poses an interesting challenge -- designing DL-based methods for such perceptual tasks with noisy/limited training data -- which is the focus of this dissertation.

We first explore DL for perceptually consistent image-error assessment, where we want to predict the perceived error between a reference and a distorted image. We begin by addressing the limitations of existing training datasets: we label our proposed large-scale dataset using a novel, noise-robust scheme based on pairwise visual preferences, which reliably captures human perception of visual error. We then design a learning framework to leverage this dataset and obtain state-of-the-art results in perceptual image-error prediction.

Perceptual metrics have been vital to the advancement of deep generative models for images and videos, which, although promising, also pose a looming societal threat (e.g., in the form of malicious deepfakes). In a separate chapter, we therefore explore a complementary question: given a high-quality video without any human-perceivable artifacts, can we predict whether it is authentic? Within this context, we specifically focus on robust deepfake detection using domain-invariant, generalizable input features.

Lastly, we find that for certain perceptual tasks, such as modeling the visual saliency of a stimulus, the only way to overcome the ambiguity/noise in the training data is to query more humans, e.g., using a gaze tracker. This tends to be onerous, especially for video-based stimuli; hence, most existing datasets are limited in their accuracy. Since noise-robust dataset capture is often impossible in this case, we design a noise-aware training paradigm for video and image saliency prediction that prevents overfitting to the noise in the training data and yields consistent improvements over traditional training schemes. Further, since existing video-saliency datasets do not capture video-specific aspects such as temporally evolving content, we design a novel videogame-based saliency dataset with temporally evolving semantics and multiple attractors of human attention.

Overall, through this dissertation, we make critical strides towards robust DL for visual perceptual tasks related to visual quality and authenticity assessment.