Guang Yang, Muru Zhang, Lin Qiu, Yanming Wan, Noah A. Smith
Optical music recognition (OMR) aims to convert music notation into digital formats. One approach to OMR is a multi-stage pipeline, where the system first detects visual music notation elements in the image (object detection) and then assembles them into a music notation (notation assembly). Most previous work on notation assembly unrealistically assumes perfect object detection. In this study, we focus on the MUSCIMA++ v2.0 dataset, which represents musical notation as a graph with pairwise relationships among detected music objects, and we consider both stages together. First, we introduce a music object detector based on YOLOv8, which improves detection performance. Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output. We find that this model outperforms existing models trained on perfect detection output, showing the benefit of considering the detection and assembly stages in a more holistic way. These findings, together with our novel evaluation metric, are important steps toward a more complete OMR solution.
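To make the graph formulation concrete, the following is a minimal sketch of notation assembly as pairwise edge prediction over detected objects. It is not the model described in this work: the `DetectedObject` fields, the `assemble` helper, the class labels, the 30-pixel threshold, and the hand-written `toy_rule` are all hypothetical, and `toy_rule` merely stands in for a learned pairwise classifier.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, List, Set, Tuple


@dataclass(frozen=True)
class DetectedObject:
    """One detector output: a class label and an axis-aligned bounding box."""
    obj_id: int
    label: str
    top: float
    left: float
    bottom: float
    right: float


Edge = Tuple[int, int]  # directed edge (source obj_id, target obj_id)


def center_distance(a: DetectedObject, b: DetectedObject) -> float:
    """Euclidean distance between the bounding-box centers of a and b."""
    ax, ay = (a.left + a.right) / 2, (a.top + a.bottom) / 2
    bx, by = (b.left + b.right) / 2, (b.top + b.bottom) / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5


def assemble(
    objects: List[DetectedObject],
    is_related: Callable[[DetectedObject, DetectedObject], bool],
) -> Set[Edge]:
    """Query every ordered pair of objects and keep the pairs judged related."""
    return {
        (a.obj_id, b.obj_id)
        for a, b in permutations(objects, 2)
        if is_related(a, b)
    }


def toy_rule(a: DetectedObject, b: DetectedObject) -> bool:
    # Hypothetical stand-in for a trained pairwise classifier:
    # link a notehead to a stem whose center is within 30 pixels.
    return (
        a.label == "noteheadFull"
        and b.label == "stem"
        and center_distance(a, b) < 30.0
    )


# Illustrative detections (made-up coordinates, not dataset values).
objects = [
    DetectedObject(0, "noteheadFull", top=100, left=50, bottom=110, right=62),
    DetectedObject(1, "stem", top=60, left=60, bottom=105, right=62),
    DetectedObject(2, "stem", top=60, left=200, bottom=105, right=202),
]

edges = assemble(objects, toy_rule)
print(edges)
```

In this toy input only the nearby stem is linked to the notehead. In a setting that does not assume perfect detection, `objects` would hold detector predictions rather than ground-truth annotations, so missed or spurious detections carry over into the assembled graph.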