Infants learn to imitate and recognize words at an early age, but phonemic awareness develops later, guided for example by the acquisition of literacy. We investigate the hypothesis that speech representations in the brain are formed partly through articulatory-acoustic learning, and that these representations may serve as a basis for learning an additional mapping to phonemes. We train a convolutional recurrent neural network with an articulatory branch and a phonemic branch for multitask learning. When the network is trained with real conversational speech and aligned synthesized articulation, we show that the articulatory representation boosts phoneme recognition accuracy when the first convolutional layers are shared between the two branches. We hypothesize that the speech perception representations formed in the brain during childhood may be partly based on articulatory learning, and that an additional mapping from these low-level speech representations to phonemes has to be learned.
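The abstract describes a shared-front-end multitask architecture but gives no implementation details. The following is a minimal sketch of such a network, assuming a PyTorch implementation; the layer sizes, feature dimensions, loss weighting, and use of GRUs are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class MultitaskCRNN(nn.Module):
    """Sketch of a convolutional recurrent network with shared first
    convolutional layers feeding two branches: articulatory feature
    regression and framewise phoneme classification. All sizes are
    hypothetical placeholders."""

    def __init__(self, n_mels=40, n_artic=12, n_phones=40, hidden=128):
        super().__init__()
        # Shared convolutional layers over the time axis of the spectrogram.
        self.shared = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Articulatory branch: recurrent layer + linear regression head.
        self.artic_rnn = nn.GRU(64, hidden, batch_first=True, bidirectional=True)
        self.artic_out = nn.Linear(2 * hidden, n_artic)
        # Phonemic branch: recurrent layer + framewise phoneme classifier.
        self.phone_rnn = nn.GRU(64, hidden, batch_first=True, bidirectional=True)
        self.phone_out = nn.Linear(2 * hidden, n_phones)

    def forward(self, x):
        # x: (batch, n_mels, frames) log-mel spectrogram (assumed input).
        h = self.shared(x).transpose(1, 2)   # (batch, frames, 64)
        artic, _ = self.artic_rnn(h)
        phones, _ = self.phone_rnn(h)
        return self.artic_out(artic), self.phone_out(phones)

# Joint multitask loss: MSE against aligned synthesized articulatory
# trajectories plus cross-entropy against framewise phoneme labels
# (equal weighting is an assumption).
model = MultitaskCRNN()
x = torch.randn(8, 40, 200)                    # batch of spectrograms
artic_target = torch.randn(8, 200, 12)         # synthesized articulation
phone_target = torch.randint(0, 40, (8, 200))  # aligned phoneme labels
artic_pred, phone_pred = model(x)
loss = nn.functional.mse_loss(artic_pred, artic_target) \
     + nn.functional.cross_entropy(phone_pred.reshape(-1, 40),
                                    phone_target.reshape(-1))
loss.backward()
```

Because only the convolutional front end is shared, gradients from the articulatory regression task shape the low-level acoustic features that the phoneme branch then builds on, which is the mechanism the abstract credits for the accuracy gain.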