The ever-increasing variety of microcontrollers aggravates the challenge of porting embedded software to new devices, which requires much manual work, whereas code generators can be used only in special cases. Moreover, little technical documentation for these devices is available in machine-readable formats that could facilitate automating porting efforts. Instead, the bulk of documentation comes as print-oriented PDFs. We hence identify a strong need for a processor that accesses the PDFs and extracts their data with high quality to improve code generation for embedded software.
In this paper, we design and implement a modular processor that uses deterministic table processing to extract detailed datasets for thousands of microcontrollers from PDF files containing technical documentation. Namely, we systematically extract device identifiers, interrupt tables, packages and pinouts, pin functions, and register maps. In our evaluation, we compare the documentation from STMicro against existing machine-readable sources. Our results show that our processor matches 96.5 % of almost 6 million reference data points, and we further discuss identified issues in both sources. Hence, our tool yields very accurate data with only limited manual effort and can enable and enhance a significant number of existing and new code generation use cases in the embedded software domain that are currently limited by a lack of machine-readable data sources.