Infants’ speech perception adapts to the phonemic categories of their native language, a process assumed to be driven by the distributional properties of speech. This study investigates whether deep neural networks (DNNs), the current state of the art in distributional feature learning, are capable of learning phoneme-like representations of speech in an unsupervised manner. We trained DNNs with unlabeled and labeled speech and analyzed the activations of each layer with respect to the phones in the input segments. The analyses reveal that the emergence of phonemic invariance in DNNs depends on the availability of phonemic labeling of the input during training. No increased phonemic selectivity of the hidden layers was observed in the purely unsupervised networks, despite successful learning of low-dimensional representations for speech. This suggests that additional learning constraints or more sophisticated models are needed to account for the emergence of phone-like categories in distributional learning operating on natural speech.
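
The layer-wise analysis mentioned above can be illustrated with a minimal sketch: given frame-level hidden-layer activations and phone labels, one way to quantify how phone-selective a layer is would be a simple linear probe. The synthetic data, the `phone_selectivity` helper, and the scikit-learn probe below are illustrative assumptions, not the paper's actual analysis pipeline.

```python
# Minimal sketch (illustrative assumption): estimating how phone-selective a
# hidden layer is by training a linear probe on its activations.
# The toy data and the probe choice stand in for real DNN activations
# and phone annotations; they are not the authors' method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def phone_selectivity(activations: np.ndarray, phone_labels: np.ndarray) -> float:
    """Return held-out accuracy of a linear phone classifier on layer activations.

    activations  -- (n_frames, n_units) hidden-layer activations
    phone_labels -- (n_frames,) integer phone identity per frame
    """
    X_train, X_test, y_train, y_test = train_test_split(
        activations, phone_labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    return probe.score(X_test, y_test)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_frames, n_units, n_phones = 2000, 64, 10
    labels = rng.integers(0, n_phones, size=n_frames)
    # Toy activations: phone-dependent means plus noise.
    centers = rng.normal(size=(n_phones, n_units))
    acts = centers[labels] + rng.normal(scale=2.0, size=(n_frames, n_units))
    print(f"linear-probe phone accuracy: {phone_selectivity(acts, labels):.3f}")
```

Comparing such a score across layers, and between networks trained with and without phone labels, is one possible way to operationalize the "phonemic selectivity" contrast described in the abstract.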