Objective: The rapidly increasing prevalence and application of machine learning (ML) across disciplines creates a pressing need to establish guidance for data curation professionals. Establishing that guidance first requires understanding the characteristics of ML-related objects shared in generalist and specialist repositories, and the extent to which repository metadata fields enable findability and reuse of ML objects.
Methods: We used a combination of API queries and web scraping to retrieve metadata for ML objects in eight commonly used generalist and ML-specific data repositories. We assessed both the metadata schemas and the characteristics of deposited ML objects in the context of the widely adopted FAIR (Findable, Accessible, Interoperable, Reusable) Principles. We also calculated summary statistics for properties of objects, including number of objects per year, dataset size, domains represented, and availability of related resources.
Results: Generalist repositories excelled at providing provenance metadata, specifically unique identifiers, unambiguous citations, clear licenses, and related resources, while specialist repositories emphasized ML-specific descriptive metadata, such as task type and the number of attributes and instances. In terms of object content, we noted a wide range of file formats and licenses, both of which affect reusability.
Conclusions: Generalist repositories will benefit from some of the practices adopted by specialists, and specialist repositories will benefit from adopting proven data curation practices of generalist repositories. A step forward for repositories will be to invest more in the use of labels and persistent identifiers to improve workflow documentation, provenance, and related resource linking of ML objects, which will increase their findability, interoperability, and reusability.
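The summary statistics named in the Methods (objects per year, dataset size, domains represented, availability of related resources) could be computed along the following lines. This is a minimal sketch, not the study's actual code: the `summarize` function, the record field names (`year`, `size_bytes`, `domain`, `related`), and the sample records are all hypothetical and do not reflect the schema of any particular repository.

```python
from collections import Counter
from statistics import median

# Hypothetical, hand-made records for illustration only. The field names
# are assumptions and not the metadata schema of any real repository.
records = [
    {"year": 2019, "size_bytes": 1_200_000, "domain": "biology",
     "related": ["10.1234/example-paper"]},
    {"year": 2020, "size_bytes": 300_000, "domain": "physics", "related": []},
    {"year": 2020, "size_bytes": None, "domain": "biology",
     "related": ["10.1234/example-code"]},
    {"year": 2021, "size_bytes": 5_000_000, "domain": "chemistry", "related": []},
]


def summarize(records):
    """Compute simple per-object summary statistics from metadata records."""
    per_year = Counter(r["year"] for r in records)
    domains = Counter(r["domain"] for r in records)
    # Skip records whose size is missing rather than treating it as zero.
    sizes = [r["size_bytes"] for r in records if r.get("size_bytes") is not None]
    # A record counts as linked if it lists at least one related resource.
    with_related = sum(1 for r in records if r.get("related"))
    return {
        "objects_per_year": dict(sorted(per_year.items())),
        "median_size_bytes": median(sizes) if sizes else None,
        "domains": dict(domains.most_common()),
        "share_with_related": with_related / len(records) if records else 0.0,
    }


if __name__ == "__main__":
    for name, value in summarize(records).items():
        print(f"{name}: {value}")
```

Missing sizes are excluded from the median rather than counted as zero, since repositories do not always report file size and a zero would skew the statistic downward.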