Assignment No. 03
SEMESTER Spring 2011
CS614- Data Warehousing
Question # 1:-
How the approach of parallelism is used in Data warehouse and OLAP?
Ans:-
By partitioning a data among a set of processors OLAP queries can be executed in parallel potentially achieving linear speedup and thus significantly improving query response time…..
In recent years the databssase community has experienced a tremendous increase in the availability of new technologies to support efficient storage and retrieval of large volumes of data, namely data warehousing and On-Line Analytical Processing (OLAP) products. Efficient query processing is critical in such an environment, yet achieving quick response times with OLAP queries is still largely an open issue. In this paper we propose a solution approach to this problem by applying parallel processing techniques to a warehouse environment. We suggest an efficient partitioning strategy based on the relational representation of a data warehouse (i.e., star schema). Furthermore, we incorporate a particular indexing strategy, Data Indexes, to further improve query processing times and parallel resource utilization, and propose a preliminary parallel star-join strategy.
A parallel aproch for the data ware house is strong due to inherent nature of such an environment.
Still another appeal for parallelism in datawarehouse in the OLAP system is the logical is a logical design for the warehouse Data warehouses often contain large tables and require techniques both for managing these large tables and for providing good query performance across these large tables. This chapter discusses two key methodologies for addressing these needs: parallelism and partitioning.
Parallelism is certainly an issue if you're using DTS. If you don't specifically set up the package to use parallelism then your only choice is to run multiple packages at once to get parallel processes running. If you
have abstracted the data load in such a way that a single package is used to load multiple files dynamically, then parallelism needs to be considered in order to get the package running multiple threads within the same package.
There are different degrees of how and where parallelism can and should be used. Ensuring that the query processor is setup correctly to handle parallel processing of the data load is also a consideration. One can also have multiple servers running to get the data loaded across federated servers. This is also considered to be parallelism
Question #2:-
1) How the parallel star join works that is discussed in this research article.
Ans:-
To compute the star join, we first find out the rows of the fact table that will be participating in the final
cube grouping. Since restriction predicates reduce the number of rows in the final result table, we utilise
them first. The following query shows how restriction predicates limit the number of rows in the result
table:
SELECT Year, State, Product_Name, SUM (Quantity) AS ‘Total Quantity’
FROM Product P, Date D, Sales S, Region R
WHERE P.Product_No = S. Product_No AND D.Date_key = S.Date_Key AND
R.Region_ID = S. Region_ID AND
D.Year IN (1998, 1999) AND P.Category IN (‘Printer’, ‘Scanner’)
GROUP BY CUBE Year, State, Product_Name;
Assuming the data warehouse (Figure 1) maintains sales records of 10 years, we can straight away notice that only the last two years’ sales rows will be picked from the fact table. The second restriction predicate on product category effects a further reduction.
We use a simple method viz. to form rowsets (of the fact table) for one restriction predicate. The restriction predicates are applied to their corresponding dimensions in parallel. The JVIs that satisfy each predicate are passed on to a coordinator node where an intersection of the RIDs of all the restricted dimensions is performed. The resulting rowset (set of row IDs) thus formed is the result of the star join. The dimensions for which no restriction predicates are present do not participate in this operation as they annot cause any reduction to the resulting rowset. While operating on the fact table we try to distribute equally to each node available as much processing as is possible, so that I/O and computation could be performed efficiently in parallel. Algorithm 2 shows how the parallel star join is performed.
The schema models the activities of a world-wide wholesale supplier over a period of seven years. The fact table is the SALES table, and the dimension tables are the PART, SUPPLIER, CUSTOMER, and TIME tables. The fact table contains foreign keys to each of the dimension tables. This schema suggests an eÆcient data partitioning as we will soon show.
A common type of query in OLAP systems is the star-join query. In a star-join, one or more dimension tables are joined with the fact table.
Query 1
SELECT U.Name, SUM(S.ExtPrice)
FROM SALES S, TIME T, CUSTOMER C, SUPPLIER U
WHERE T.Year BETWEEN 1996 AND 1998
AND U.Nation='United States' AND C.Nation='United States'
AND S.ShipDate = T.TimeKey AND S.CustKey = C.CustKey
AND S.SuppKey = U.SuppKey
GROUP BY U.Name
A set of attributes that is frequently used in join predicates can be readily identi_ed in the structure of a star schema. In the example star schema, ShipDate, CustKey, SuppKey, and PartKey of the SALES table can be identi_ed as attributes that will often participate in joins with the corresponding dimension tables. We can thus use this information to apply a vertical partitioning method on these attributes to achieve the bene_ts of parallelism. This paper shows, in fact, that one can use a combination of vertical and horizontal partitioning techniques to extract the parallelism inherent in star schemas. Speci_cally, we propose a declustering strategy which incorporates both task and data parti- tioning and present the Parallel Star Join (PSJ) algorithm,
which provides a means to perform a star join in parallel using eÆcient operations involving only rowsets and projection columns.
The Parallel Star Join Algorithm:-
In this section we present our algorithm to perform star joins in parallel.
We represent a general k-dimensional star-join query as follows.
Query 3
SELECT Ad
P , Am
P FROM F, D1, : : :, Dk WHERE P./ AND P_
Here D1; : : : ;Dk are the k dimensional tables participating in the join. P_ and P./ denote a set of
restriction and join predicates respectively. We assume that each individual restriction predicate
in P_ only concerns one table and is of the form (aj hopi constant), where aj is any attribute in the
warehouse schema and hopi denotes a comparison operator (e.g., =;_;_). We assume each join
predicate in P./ is of the form al = at where at is any dimensional key attribute and al is the foreign
key referenced by at in the fact table.
0 comments
Post a Comment