TY - GEN
T1 - Dynamic pipeline changes in scientific data processing
AU - Mwebaze, Johnson
AU - Boxhoorn, Danny
AU - Valentijn, Edwin
PY - 2011
Y1 - 2011
N2 - Understanding the differences between data objects is a major problem, especially in scientific collaborations that allow scientists to collectively reuse data and to modify and adapt scripts developed by their peers to process data, while publishing the results to a centralized data store. Although data provenance has been studied extensively to address the origins of a data item, it does not address changes made to the source code. Systems often consist of a large number of modules, each containing hundreds of lines of code. It is, in general, not obvious which parts of the source code contributed to a change in a data object. The paper introduces the Class-Based Object Versioning framework, which overcomes some of the shortcomings of popular versioning systems (e.g. CVS, SVN) in maintaining data and code provenance information in scientific computing environments. The framework automatically identifies and captures useful fine-grained changes in the data and code of scripts that perform scientific experiments, so that important information about intermediate stages (i.e. unrecorded changes in experiment parameters and procedures) can be identified and analyzed.
AB - Understanding the differences between data objects is a major problem, especially in scientific collaborations that allow scientists to collectively reuse data and to modify and adapt scripts developed by their peers to process data, while publishing the results to a centralized data store. Although data provenance has been studied extensively to address the origins of a data item, it does not address changes made to the source code. Systems often consist of a large number of modules, each containing hundreds of lines of code. It is, in general, not obvious which parts of the source code contributed to a change in a data object. The paper introduces the Class-Based Object Versioning framework, which overcomes some of the shortcomings of popular versioning systems (e.g. CVS, SVN) in maintaining data and code provenance information in scientific computing environments. The framework automatically identifies and captures useful fine-grained changes in the data and code of scripts that perform scientific experiments, so that important information about intermediate stages (i.e. unrecorded changes in experiment parameters and procedures) can be identified and analyzed.
KW - Astro-WISE
KW - data provenance
KW - object versioning
KW - scientific computing
UR - http://www.scopus.com/inward/record.url?scp=84856316979&partnerID=8YFLogxK
U2 - 10.1109/eScience.2011.44
DO - 10.1109/eScience.2011.44
M3 - Conference contribution
AN - SCOPUS:84856316979
SN - 9780769545974
T3 - Proceedings - 2011 7th IEEE International Conference on eScience, eScience 2011
SP - 263
EP - 270
BT - Proceedings - 2011 7th IEEE International Conference on eScience, eScience 2011
PB - IEEE
T2 - 7th IEEE International Conference on eScience, eScience 2011
Y2 - 5 December 2011 through 8 December 2011
ER -