In this thesis, we investigate novel modules that can be integrated into popular detection networks to help them learn attentive representations for practical object detection and instance segmentation applications, namely universal object detection and 3D medical image segmentation. For universal object detection, despite increasing efforts on universal representations for visual recognition, few works have addressed object detection. We develop an effective and efficient universal object detection system that is capable of working on various image domains, from human faces and traffic signs to medical CT images. Unlike multi-domain models, this universal model does not require prior knowledge of the domain of interest. This is achieved by introducing a new family of adaptation layers, based on the principles of squeeze and excitation, and a new domain-attention mechanism. In the proposed universal detector, all parameters and computations are shared across domains, and a single network processes all domains all the time. Experiments on a newly established universal object detection benchmark of 11 diverse datasets show that the proposed detector outperforms a bank of individual detectors, a multi-domain detector, and a baseline universal detector, with a 1.3× parameter increase over a single-domain baseline detector. For 3D medical image segmentation, high-resolution 3D medical images offer abundant detail of human body parts and allow early detection of small lesions. However, because of GPU memory limitations, most methods either take a down-sampled 3D volume as input, which significantly reduces the detectability of small lesions, or use 2.5D networks that crop neighboring image slices at the original resolution, which loses context information along the z direction. Both approaches can significantly degrade the performance of the final model.
To address these limitations, we propose a cross-slice spatial and channel attention module, which maintains the spatial resolution of the input data and effectively utilizes context information along the z direction of the 3D volume. To obtain higher-quality mask predictions, a cascade mask refinement module is designed to provide a pixel-wise objectness attention map for the input feature maps. Furthermore, our scheme allows us to reuse pretrained 2D detection models and achieve good results even with a limited amount of training data, a situation that is common in medical applications and poses a major challenge to many deep learning methods. With these two novel modules, we achieve a state-of-the-art Dice per case score of 74.10 on the Liver Tumor Segmentation Challenge (LiTS), outperforming the previous year's challenge winner by 6.7 points and ranking first on the LiTS benchmark leaderboard at the time of submission.
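
For readers unfamiliar with the metric, the following is a minimal sketch of how a Dice per case score can be computed: the Dice coefficient is evaluated separately for each volume and the per-volume values are then averaged. It is an illustration only, not the official LiTS evaluation code. The function names `dice` and `dice_per_case`, the flat 0/1 mask representation, and the convention that two empty masks score 1.0 are all assumptions made here.

```python
def dice(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    pred = [bool(p) for p in pred]
    target = [bool(t) for t in target]
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    # Number of voxels labelled positive in both masks.
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


def dice_per_case(cases):
    """Average the Dice coefficient over cases; `cases` is a list of (pred, target) pairs."""
    if not cases:
        raise ValueError("at least one case is required")
    scores = [dice(pred, target) for pred, target in cases]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Two toy "volumes", flattened to four voxels each (hypothetical data).
    toy_cases = [
        ([1, 1, 0, 0], [1, 0, 0, 0]),  # one false positive
        ([0, 1, 1, 1], [0, 1, 1, 1]),  # perfect overlap
    ]
    print(round(dice_per_case(toy_cases), 4))
```

The functions return values in [0, 1]. A figure such as 74.10 would correspond to this value multiplied by 100, assuming the score is reported on a percentage-style scale.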