Egocentric vision captures a scene from the point of view of the camera wearer, while exocentric vision captures the overall scene context. Jointly modelling ego and exo views is a crucial step towards developing next-generation AI agents, and the research community has recently shown renewed interest in egocentric vision. While third-person and first-person views have each been thoroughly investigated, very few works study the two jointly. Exocentric videos contain many relevant signals that are transferable to egocentric videos. We propose a multimodal LLM that leverages large-scale exocentric information for the task of egocentric action recognition. This thesis also provides a broad overview of works combining egocentric and exocentric vision.