Article title
Parallel labeling of massive XML data with MapReduce
Table of contents
Introduction
Overview
Parallel XML labeling with MapReduce
Optimizations
Performance study
Conclusion
Excerpt from the article
Prefix-based labeling scheme
In a prefix-based labeling scheme, the label of a node in the XML tree is the concatenation of its parent's label and the node's own local order. We formally define the prefix-based labeling scheme as follows.
Prefix-based labeling scheme: a prefix-based labeling scheme encodes the position of an XML tree node v, whose parent is u, as L(v) = a1.a2…am, such that:
1. L(v) is the concatenation of L(u) and the local order of v, delimited by '.'.
2. The local order of v is i if v is the i-th child of u.
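The definition above can be illustrated with a minimal sequential sketch. It is not the paper's parallel MapReduce algorithm; it only shows how prefix-based labels are derived from a parent's label and a child's local order. The function name `prefix_labels`, the choice of "1" as the root label, and the decision to label element nodes only are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def prefix_labels(root, root_label="1"):
    """Return (label, tag) pairs for every element, in document order.

    Assumptions for this sketch: the root is labeled "1", only element
    nodes are labeled, and '.' separates the components of a label.
    """
    labeled = []
    stack = [(root, root_label)]
    while stack:
        node, label = stack.pop()
        labeled.append((label, node.tag))
        children = list(node)
        # Push children in reverse so they are popped in document order.
        # The i-th child (1-based) gets its parent's label plus ".i".
        for i in range(len(children), 0, -1):
            stack.append((children[i - 1], f"{label}.{i}"))
    return labeled


if __name__ == "__main__":
    doc = "<a><b><d/></b><c/></a>"
    for label, tag in prefix_labels(ET.fromstring(doc)):
        print(label, tag)
```

For the small document in the example, the root a receives 1, its children b and c receive 1.1 and 1.2, and d, the only child of b, receives 1.1.1. An explicit stack is used instead of recursion so that deep documents do not hit Python's recursion limit.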
Parallel labeling of massive XML data with MapReduce
Hyebong Choi · Kyong-Ha Lee · Yoon-Joon Lee
© Springer Science+Business Media New York 2013

Abstract
The volume of XML data has become enormous and is still growing quickly, as more and more data are encoded in XML by virtue of its simplicity and extensibility. While a tree labeling algorithm plays a crucial role in XML query processing, conventional algorithms are all sequential, so they fail to label a large volume of XML data in a timely manner. To address this issue, we devise parallel tree labeling algorithms for massive XML data. Specifically, we focus on how to efficiently label a single large XML file in parallel. We first propose parallel versions of two prominent tree labeling schemes based on the MapReduce framework. We then present techniques for runtime workload balancing and data repartitioning to solve performance issues caused by data skewness and MapReduce's inherent limitations. Through extensive experiments with synthetic and real-world datasets on 15 nodes, we show that our parallel labeling algorithms are up to 17 times faster than conventional algorithms while remaining robust against data skewness.

Keywords: Parallel computing · XML · Tree labeling algorithm · MapReduce

H. Choi · Y.-J. Lee
Department of Computer Science, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 305-701, Republic of Korea
H. Choi e-mail: hbchoi@dbserver.kaist.ac.kr
Y.-J. Lee e-mail: yoonjoon.lee@kaist.ac.kr

K.-H. Lee (corresponding author)
Intelligent Convergence Media Research Department, Broadcasting & Telecommunications Media Research Laboratory, ETRI, 218 Gajeong-ro, Yuseong-gu, Daejeon 305-700, Republic of Korea
e-mail: kyongha@etri.re.kr

1 Introduction
XML is currently one of the most popular formats for data representation and transmission on the Internet [9]. As more data are encoded in XML, individual XML documents have also become enormous and continue to grow quickly. For example, Wikipedia provides its page dumps as a single XML document of over 40 GBytes [4]. Large XML documents are even more common in scientific areas. For instance, UniProtKB, which provides a collection of functional relationships between proteins, now exceeds 108 GBytes in a single file [5]. Moreover, the size of this XML document keeps growing as biologists discover new facts about proteins in their experiments. Consequently, there is a growing demand for query processing support over large XML documents.