Advancements in deep learning have catalyzed growth in robotic applications, extending their utility beyond constrained settings. Nevertheless, a significant challenge remains in enabling robots to efficiently navigate and interact within unstructured and dynamic environments. Existing methods for robot navigation rely on dense geometric representations such as high-definition maps or full 3D reconstructions. These representations are nontrivial and resource-intensive to both create and use, and they quickly become stale in environments that change constantly. To scale with the size and novelty of the environment, new algorithms are required that use representations accounting for semantics. Moreover, to interact and collaborate with humans, these representations must ground visual information in other modalities, such as text, while retaining a long-term memory. This thesis presents work in these directions: the development of new approaches and a discussion of experiments applying these methods to navigation and robot instruction following in home environments.