TY - JOUR
T1 - Separability versus prototypicality in handwritten word-image retrieval
AU - van Oosten, Jean-Paul
AU - Schomaker, Lambertus
PY - 2014
Y1 - 2014/3
AB - Hit lists are at the core of retrieval systems. The top ranks are important, especially if user feedback is used to train the system. Analysis of hit lists revealed counter-intuitive instances in the top ranks for good classifiers. In this study, we propose that two functions need to be optimised: (a) separability is required in order to reduce a massive set of instances to a likely subset among ten thousand or more classes. However, the results need to be intuitive after ranking, reflecting (b) the prototypicality of instances. Optimising these requirements sequentially strongly reduces the number of distracting images; the subsequent nearest-centroid-based instance ranking retains an intuitive (low edit-distance) ranking. We show that in handwritten word-image retrieval, precision improvements of up to 35 percentage points can be achieved, yielding up to 100% top-hit precision and 99% top-7 precision in data sets with 84 000 instances, while maintaining high recall performance. The method is conveniently implemented in a massive-scale, continuously trainable retrieval engine, Monk. (C) 2013 Elsevier Ltd. All rights reserved.
KW - Image retrieval
KW - Handwriting recognition
KW - Nearest centroid
KW - Support-vector machines
KW - Separability
KW - Prototypicality
KW - Historical manuscripts
KW - Big data
KW - Continuous machine learning
KW - RECOGNITION
U2 - 10.1016/j.patcog.2013.09.006
DO - 10.1016/j.patcog.2013.09.006
M3 - Article
VL - 47
SP - 1031
EP - 1038
JO - Pattern Recognition
JF - Pattern Recognition
SN - 0031-3203
IS - 3
ER -