We introduce a modular recurrent neural architecture, which learns distributed, generative temporal models of biological motion. It separately encodes visual and proprioceptive (angular) biological motions by means of autoencoders, structuring the respective postures, motion directions, and motion magnitudes. The submodal encoders are interdependent in that they temporally predict each other's next autoencoder states. As a result, distributed attractor states can develop from self-generated motions. We show that the architecture is able to synchronize its activities across modalities towards overall consistent action-encoding attractors. Moreover, the developing spatial and temporal structures can complete partially observable actions, e.g., when only visual information is provided. Furthermore, we show that the network is capable of simulating whole-body actions without any sensory stimulation, thus imagining unfolding actions. Finally, we show that the network is able to infer the visual perspective on a biological motion. Thus, the neural architecture enables embodied perspective taking and action inference.
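
To make the described coupling concrete, the following is a minimal sketch, not the authors' implementation, of the core idea: two modality-specific autoencoders (visual and proprioceptive) whose latent states cross-predict each other's next latent states. All layer sizes, module names, and the choice of GRU cells and mean-squared losses are illustrative assumptions.

```python
# Hypothetical sketch of cross-modal predictive autoencoders;
# sizes, module names, and losses are assumptions, not the paper's spec.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalAutoencoder(nn.Module):
    """Encodes one modality (e.g., visual posture/motion) into a latent code."""
    def __init__(self, input_dim, latent_dim, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, input_dim))

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

class CrossModalPredictor(nn.Module):
    """Each modality's latent state predicts the other modality's next latent state."""
    def __init__(self, vis_dim, prop_dim, latent_dim):
        super().__init__()
        self.vis_ae = ModalAutoencoder(vis_dim, latent_dim)
        self.prop_ae = ModalAutoencoder(prop_dim, latent_dim)
        self.vis_to_prop = nn.GRUCell(latent_dim, latent_dim)  # visual -> next proprioceptive code
        self.prop_to_vis = nn.GRUCell(latent_dim, latent_dim)  # proprioceptive -> next visual code

    def step_loss(self, vis_t, prop_t, vis_next, prop_next):
        z_vis, vis_rec = self.vis_ae(vis_t)
        z_prop, prop_rec = self.prop_ae(prop_t)
        z_vis_next, _ = self.vis_ae(vis_next)
        z_prop_next, _ = self.prop_ae(prop_next)
        pred_prop = self.vis_to_prop(z_vis, z_prop)   # predict next proprioceptive latent
        pred_vis = self.prop_to_vis(z_prop, z_vis)    # predict next visual latent
        return (F.mse_loss(vis_rec, vis_t) + F.mse_loss(prop_rec, prop_t)
                + F.mse_loss(pred_prop, z_prop_next.detach())
                + F.mse_loss(pred_vis, z_vis_next.detach()))

# Example usage with dummy data (dimensions are placeholders)
model = CrossModalPredictor(vis_dim=30, prop_dim=12, latent_dim=8)
vis_t, vis_next = torch.randn(16, 30), torch.randn(16, 30)
prop_t, prop_next = torch.randn(16, 12), torch.randn(16, 12)
loss = model.step_loss(vis_t, prop_t, vis_next, prop_next)
loss.backward()
```

In such a setup, imagining or completing a partially observed action would amount to iterating the cross-modal predictions in place of the missing sensory stream; this closed-loop use is only gestured at here, not implemented.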