The OLIVE project aims to develop a multilingual indexing tool for broadcast material based on speech recognition, which automatically produces indexes from the soundtrack of a television or radio program. Such a tool allows multimedia archives to be searched by keyword and the corresponding fragments to be retrieved. This paper reports on the alignment module, one of the components of the retrieval environment to be developed in OLIVE. The module assigns time-codes to non-time-coded textual documents that describe the content of the video. Time-coding these documents will increase the overall level of disclosure. The assignment is based on a similarity measure between the non-time-coded texts and either subtitle files or speech recognition transcripts. The core of the alignment module is a generic algorithm that generates the links on which the insertion of time-codes into non-time-coded texts is based. An additional step combines similarity values with locality information. The test data are closed-caption files of Dutch news broadcasts and the autocue files of the same broadcasts. Adaptations of the initial algorithm that yielded improved performance figures involved a threshold related to sentence length and the application of a high- and low-frequency term stoplist compiled from the time-coded text under consideration.
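
The linking idea described in the abstract can be illustrated with a minimal sketch. It is not the OLIVE implementation: the similarity measure (a plain count of shared terms), the function names, and the parameters `max_doc_fraction`, `min_count` and `min_overlap_ratio` are all assumptions made for illustration, and the locality step is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_stoplist(timed_segments, max_doc_fraction=0.8, min_count=1):
    """Compile a stoplist from the time-coded text itself (assumed criteria).

    High-frequency terms: found in more than max_doc_fraction of the segments.
    Low-frequency terms: fewer than min_count occurrences overall
    (the default of 1 removes no low-frequency terms).
    """
    doc_freq = Counter()
    total = Counter()
    for _, text in timed_segments:
        tokens = tokenize(text)
        total.update(tokens)
        doc_freq.update(set(tokens))
    n = len(timed_segments)
    high = {t for t, df in doc_freq.items() if df / n > max_doc_fraction}
    low = {t for t, c in total.items() if c < min_count}
    return high | low


def align(untimed_sentences, timed_segments, min_overlap_ratio=0.3):
    """Link each untimed sentence to the time-code of its most similar segment.

    timed_segments is a list of (time_code, text) pairs. A sentence is linked
    only if the shared terms cover at least min_overlap_ratio of its own
    content terms, so the required overlap grows with sentence length.
    """
    stop = build_stoplist(timed_segments)
    seg_terms = [(tc, set(tokenize(text)) - stop) for tc, text in timed_segments]
    links = []
    for sentence in untimed_sentences:
        terms = set(tokenize(sentence)) - stop
        best_tc, best_score = None, 0
        for tc, candidates in seg_terms:
            score = len(terms & candidates)  # similarity: shared term count
            if score > best_score:
                best_tc, best_score = tc, score
        if terms and best_score / len(terms) >= min_overlap_ratio:
            links.append((sentence, best_tc))
        else:
            links.append((sentence, None))  # no sufficiently similar segment
    return links
```

Here a caption or transcript file would be passed as `(time_code, text)` pairs, and each descriptive sentence would come back with either a time-code or `None`. A locality step, as mentioned in the abstract, could then adjust these links using the time-codes assigned to neighbouring sentences.
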
|Title of host publication||Proceedings of RIAO'2000|
|Place of Publication||Collège de France, Paris, France|
|Publication status||Published - 2000|