Part-whole relations and their representation play a vital role in perceptual organization and conceptual reasoning. Humans readily parse visual scenes into objects and parts and organize them into hierarchies, yet few studies have examined how well neural networks learn such part-whole hierarchies from visual inputs. In this paper, we introduce a new diagnostic dataset, CChar, to probe this ability. It contains frame-based images of the writing process of 6,840 Chinese characters, together with annotations of their hierarchical structures. Our results show that RNN and Transformer models can recognize some of the high-level components above the stroke level, demonstrating a limited ability to learn part-whole hierarchies. However, neither model exhibits robust compositional reasoning. To assess the role of conceptual guidance in predicting hierarchical structures, we compare visual features extracted by self-supervised and fine-tuned models on the task of generating hierarchical sequences, and find that conceptual guidance is important for learning part-whole hierarchies. Finally, we explore the relationship between hierarchy depth and model performance: RNNs perform worse as hierarchies deepen, whereas Transformers improve with increasing depth.