Human beings have an excellent ability to form and recognise object categories. In this paper, a novel system for multimodal object recognition and categorisation by performing interactive behaviours is introduced. Video clips are filmed as the raw input of the system. A dataset of 100 objects in 18 categories with 5 different interactions is used to evaluate the performance. A convolutional neural network is used to train the classifier and learn the categories. The results report the highest, lowest and average recognition accuracies for every object in every category, and the receiver operating characteristic for every category. The connection between the presented system and the human cognitive system is discussed in the conclusion and future work.