Background The Patient Benefit Assessment Scale for Hospitalised Older Patients (P-BAS HOP) is a tool developed both to identify the priorities of the individual patient and to measure the outcomes relevant to that patient. It results in a Patient Benefit Index (PBI), which indicates how much benefit the patient experienced from the hospitalisation. The reliability and validity of the P-BAS HOP were not yet satisfactory. The aims of this study were therefore to adapt the P-BAS HOP and transform it into a picture version, the P-BAS-P, and to evaluate its feasibility, reliability, validity, responsiveness and interpretability.
Methods The instrument was developed and evaluated among hospitalised older patients. The process included pilot tests using Three-Step Test-Interviews (TSTI), test-retest reliability at baseline and at follow-up, in which the PBI of test and retest were compared using the Intraclass Correlation Coefficient (ICC), and hypothesis testing to evaluate construct validity. Responsiveness of the individual P-BAS-P scores and of the PBI under two different weighting schemes was evaluated using anchor questions. Interpretability of the PBI was evaluated with the visual anchor-based minimal important change (MIC) distribution method and by computing the smallest detectable change (SDC) based on the ICC.
Results Fourteen hospitalised older patients participated in TSTIs at baseline and 13 at follow-up after discharge. After several adaptations, the P-BAS-P appeared feasible when accompanied by good interviewer instructions. Participants considered the pictures relevant and helpful. Reliability was tested with 41 participants at baseline and 50 at follow-up. The ICC between baseline test and retest was 0.76 for PBI1 and 0.73 for PBI2; at follow-up it was 0.86 and 0.85, respectively. Construct validity was tested in 169 participants; the hypotheses regarding the importance of goals were confirmed.
Regarding the status of goals, only the hypotheses on follow-up status were confirmed; those on baseline status and on change were not. The responsiveness of the individual scores and of the PBI was weak, resulting in poor interpretability with many misclassifications. The SDC was larger than the MIC.
Conclusions The P-BAS-P appeared to be a feasible instrument, but methodological barriers hampered the evaluation of its reliability, validity and responsiveness. We therefore recommend further research into the P-BAS-P.
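
As an illustration of the SDC computation mentioned in the Methods, the following minimal Python sketch uses the commonly applied relations SEM = SD × √(1 − ICC) and SDC = 1.96 × √2 × SEM. Whether the study used exactly these formulas is an assumption. The function names and the standard deviation in the usage lines are hypothetical and are not taken from the study; only the ICC of 0.86 comes from the Results.

```python
import math


def sem_from_icc(sd: float, icc: float) -> float:
    """Standard error of measurement from the score SD and the ICC.

    Assumes the common relation SEM = SD * sqrt(1 - ICC).
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    if sd < 0.0:
        raise ValueError("sd must be non-negative")
    return sd * math.sqrt(1.0 - icc)


def sdc_individual(sd: float, icc: float, z: float = 1.96) -> float:
    """Smallest detectable change for an individual patient.

    Assumes SDC = z * sqrt(2) * SEM, with z = 1.96 for 95% confidence.
    """
    return z * math.sqrt(2.0) * sem_from_icc(sd, icc)


def exceeds_measurement_error(mic: float, sdc: float) -> bool:
    """True when the minimal important change is larger than the SDC."""
    return mic > sdc


# Usage: the ICC is the follow-up value reported for PBI1;
# the SD of 1.0 is a hypothetical placeholder, not a study result.
hypothetical_sd = 1.0
sdc = sdc_individual(hypothetical_sd, icc=0.86)
print(f"SDC for SD={hypothetical_sd} and ICC=0.86: {sdc:.2f}")
```

When the SDC exceeds the MIC, as reported in the Results, a change that patients consider important cannot be distinguished from measurement error at the level of the individual patient.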