Earlier studies have shown that musically trained individuals may have an advantage over non-musicians in adverse listening situations, especially in speech-on-speech perception. However, the literature reports largely conflicting results. In the current study, we employed different measures of spoken language processing to test whether we could capture potential differences between musicians and non-musicians in speech-on-speech processing. We used an offline measure of speech perception (a sentence recall task), which reveals a post-task response, and online measures of real-time spoken language processing: gaze-tracking and pupillometry. We used stimuli of comparable complexity across both paradigms and tested the same groups of participants. In the sentence recall task, musicians recalled more words correctly than non-musicians. In the eye-tracking experiment, both groups showed reduced fixations to the images of the target and competitor words as the level of the speech maskers increased. The time course of gaze fixations to the competitor did not differ between groups in the speech-in-quiet condition, whereas it did differ between groups when the two-talker masker was added to the target signal. As the level of the two-talker masker increased, musicians showed reduced lexical competition, as indicated by their gaze fixations to the competitor. The pupil dilation data showed differences mainly at one target-to-masker ratio, which does not allow us to draw conclusions regarding potential differences in the use of cognitive resources between groups. Overall, the eye-tracking measure enabled us to observe that musicians may be using a different strategy than non-musicians to attain spoken word recognition as the noise level increased. However, further investigation with more fine-grained alignment between the processes captured by online and offline measures is necessary to establish whether musicians differ due to better cognitive control or sound processing.