A Mixed Transaction Cost Model for Coarse Grained Multi-column Partitioning in a Shared-nothing Database Machine

Michael Unwalla

Submitted for the degree of Doctor of Philosophy
University of Sheffield
Department of Computer Science
Faculty of Engineering
submitted July 1993
accepted 1993


Contents

Abstract
List_of_illustrations
List of tables
Preface
Acknowledgements
Author's declaration
List of terms specific to the thesis

1. Introduction
    1.1 Context and background
    1.2 Overview of work presented
    1.3 Organisation of thesis
    1.4 The problem in more detail

2. Partitioning and placement
    2.1 General aims of partitioning and placement
    2.2 Horizontal partitioning
    2.3 Data placement in shared-nothing database machines
    2.4 Hybrid range partitioning strategy
    2.5 Multi-column partitioning for database machines
    2.6 Liou's MDD file structure
        2.6.1 MDD overview
        2.6.2 Discussion of MDD file structure
    2.7 The Grid File
    2.7.1 Grid File overview
    2.7.2 Discussion of the Grid File
    2.8 General limitations of multi-column partitioning

3. Database machines, IDIOMS and the problem
    3.1 Database Machines
    3.2 IDIOMS
        3.2.1 IDIOMS overview
        3.2.2 Parser/data dictionary
        3.2.3 Current data partitioning in IDIOMS
        3.2.4 Proposed multi-column data partitioning and placement
        3.2.5 Pipelined MIS query
    3.3 Comparison of IDIOMS and other database machines

4. Scan time estimation
    4.1 Introduction
    4.2 Review of the approximate formula
    4.3 Derivation of the approximate formula
    4.4 Exact average number of partitions accessed
    4.5 Scan time differences between levels of partitioning
    4.6 Discussion of results

5. Cost model
    5.1 Introduction
    5.2 Single column cost model
        5.2.1 Single column partitioning costs
        5.2.2 Single column migrations increase with n
        5.2.3 Single column workload savings
    5.3 Multi-column partitioning
        5.3.1 Multi-column partitioning cost
        5.3.2 Multi-column scan time savings
        5.3.3 Heuristic solution to determine scan time saving
        5.3.4 Underestimating STS is a safe solution
        5.3.5 Determination of n for each partitioning column
    5.4 Partitioning when OLTP costs are zero or unknown
    5.5 Discussion

6. Worked example
    6.1 Real-world database background
        6.1.1 Database statistics
        6.1.2 Real world OLTP transactions
    6.2 Table for partitioning
    6.3 Sample MIS queries
    6.4 Determination of pcostn
    6.5 Multi-column partitioning
        6.5.1 Determination of n
        6.5.2 Calculation of total WLS
    6.6 Single column partitioning
    6.7 Placement of partitions
    6.8 Discussion

7. Summary and conclusions
    7.1 Summary
    7.2 Contribution to field
    7.3 Conclusions
    7.4 Future Research

Appendices

a1. Comments on the determination of the number of partitions for MDD
a2. Data values for figure 5.10
a3. Data values to determine n
a4. Determination of WLS for various partitionings
a5. List of publications
a6. Glossary of terms and abbreviations
References

List of illustrations

1.1 Single column partition trade-offs

2.1 Types of horizontal partitioning
2.2 Uneven load balancing
2.3 Hybrid range partitioning strategy (HRPS)
2.4 First degree cells
2.5 Second degree cells
2.6 MDD mapping to data pages
2.7(a) The Grid File
2.7(b) Grid block assignment
2.8 Operation of the grid directory
2.9 Bucket and partition splitting
2.10 Efficiency varies with query size and number of partitions
2.11 Cell to bucket mapping may be sub-optimal
2.12 High precision does not imply contiguous access

3.1 Logical multi-processor database designs
3.2 IDIOMS architecture
3.3 Partitioning to aid MIS queries
3.4 Fragmenting an MIS partition over many discs
3.5 Access plan for example query
3.6 Data flow for example query

4.1 Access of single data valued partitions
4.2 Maximum number of partitions accessed by large queries
4.3 Minimum and maximum partitions accessed
4.4 "End effect"
4.5 Average number of partitions accessed
4.6 Scan time for varying numbers of partitions
4.7 % scan time differences between exact and approximate formulae
4.8 % scan time differences between exact formula and min(1,((1/n)+QS))
4.9 Scan time differences between n and 2n
4.10 % differences between exact and approximate scan time difference calculations
4.11 Maximum and minimum scan times converge

5.1 Single column cost saving
5.2 Migration probability
5.3 Migration probability increases linearly with number of partitions
5.4 Partition cardinality does not affect migration rate
5.5 Value dependent updates
5.6 Logical and physical migration
5.7 Minimum and maximum update
5.8 Multi-column scan time heuristic solution is close to exact solution
5.9 Proportionally large wasted work for small queries
5.10 Workload savings and selection of optimum number of partitions
5.11 Minimum and maximum total savings
5.12 Wasted work for varying QS and n
5.13 Overall cost-benefit when including directory cost

6.1 Approximation of d
6.2a Minimum and maximum STS and WLS for AGE
6.2b Minimum and maximum STS and WLS for CRED_LIM
6.2c WLS (=STS) for JOIN_DATE
6.3 Minimum and maximum WLS of all four columns

List of tables

4.1 Query size affects ΔST3,6

6.1 Real-world database statistics
6.2 Transaction occurrence
6.3 Selected OLTP queries from "other"
6.4 QS values of example MIS queries
6.5 Maximum and minimum WLS for GENDER
6.6 Value of n giving greatest saving for maxWLS and minWLS curves
6.7 JOIN_DATE ΔWLS>n,n+1 compared to file size
6.8 Summary of derived partitionings
6.9 Overall workload savings of different partitionings

Appendix 1 is a PDF file, so these tables are not directly available
a1.1 All query types used to determine mi
a1.2 Query set 1, atomic queries, equal attribute reference
a1.3 Query set 2, atomic and conjunctive queries
a1.4 Query set 3
a1.5 npa for query set 3 using Liou's determination of fi
a1.6 Calculated and integral mi values
a1.7 npa for query set 3 using improved fi determination

a2.1 Data values for figure 5.10

a3.1 ∏QSj on the "other" columns
a3.2 AGE column maximum and minimum scan time savings
a3.3 AGE column costs and savings
a3.4 CRED_LIM column maximum and minimum scan time savings
a3.5 CRED_LIM column costs and savings
a3.6 JOIN_DATE column maximum and minimum scan time savings
a3.7 JOIN_DATE column costs and savings

a4.1 AS for QS values used in following example queries
a4.2 Determination of WLS for different partitionings