Instructor: Xi He

Coordinator: Ahmed Haj-Yasien

Week 1. May 11

defn. database: an organized body of related information

defn. database management system (DBMS): a software system that facilitates the creation, maintenance and use of an electronic database

earily effort: CODASYL standard:

relational revolution (Edgar F. Codd) (1970s)

standard DBMS features:

standard DBMS architecture:

+------------------+ | applications | +----+-------+-----+ | ^ CRUD | | response v | +----+-------+-----+ | DBMS | +---------------------------+ +--------+---------+ | ^ | v File system interface | +--------+---------+ bypass| | OS | | +--------+---------+ | ^ | v Storage system interface | +--------+---------+ | | Disk | <---------------------------+ +------------------+

relational data model

defn. schema (metadata):

defn. instance:

defn. integrity constraints:

defn. a (candidate) key if a set of attributes KK for a relation RR if

  1. in no instance of RR will two different tuples agree on all attributes of KK (ie K is an identifier)
  2. no proper subset of KK satisfies the above condition (ie K is minimal)

eg. in User(uid, name, age, pop)

defn. a primary key is a designated candidate key in the schema declaration (we can have multiple keys)

eg. foreign keys ("pointer" to other rows)

|uid |name |age |pop | |142 |bart |10 |0.9 | |gid | name | |dps |Deal Putting Society| |uid |gid | # refer to above two tables |142 |dps |

referential integrity: a tuple with non-null value for a foreign key that does not match the primary key value of a tuple in the referenced relation is not allowed

relational algebra

a language for querying relational data based on "operators"

core operator 1: selection

eg. users with popularity higher than 0.5: σpop>0.5User\sigma_{\mathrm{pop}>0.5}\mathrm{User}

core operator 2: projection

eg. get user ages: πageUser\pi_\mathrm{age}\mathrm{User}

core operator 3: cross product

derived operator 1: join (theta join)

eg. info about users, plus IDs of their groups: UserUser.uid=Member.uidMember\mathrm{User}\Join_\mathrm{User.uid=Member.uid}\mathrm{Member}

USER MEMBER |uid |name |age |pop | |uid |gid | |123 |Milhouse|10 |0.2 | |123 |gov | |857 |Lisa |8 |0.7 | |857 |abc | |857 |gov | USER JOINS MEMBER ON UID=UID |uid |name |age |pop |uid |gid | |123 |Milhouse|10 |0.2 |123 |gov | |857 |Lisa |8 |0.7 |857 |abc | |857 |Lisa |8 |0.7 |857 |gov |

derived operator 2: natural join

eg. like in join eg, duplicate uid is omitted

core operator 4: union

core operator 5: difference

derived operator 3: intersection

core operator 6: renaming

eg. create identical column names for natural joins:

eg. find IDs of users who belong to at least two groups

MEMBER |uid |gid | |100 |gov | |100 |abc | |200 |gov |

answer:

πuid1(ρuiduid1,gidgid1Memberuid1=uid2gid1gid2ρuiduid2,gidgid2Member)\begin{aligned} \pi_\mathrm{uid1}(\\ &\rho_\mathrm{uid\to uid1,gid\to gid1}\mathrm{Member}\\ &\Join_\mathrm{uid1=uid2\land gid1\ne gid2} \\ &\rho_\mathrm{uid\to uid2,gid\to gid2}\mathrm{Member}\\ ) \end{aligned}

eg. USER(uid, name, age, pop), GROUP(gid, name), MEMBER(uid, gid)

defn. the operator is non-monotone if after adding more rows to the input, some old output rows may become invalid and need to be removed.

relational calculus

limits of relational algebra:

Week 2. May 18

SQL

SQL: structured query language

DDL

create table

eg.

CREATE TABLE User (uid DECIMAL(3,0), name VARCHAR(30), age DECIMAL(2,0), pop DECIMAL(3,2)); CREATE TABLE Group (gid CHAR(10), name VARCHAR(100)); CREATE TABLE Member (uid DECIMAL (3,0), gid CHAR(10));

delete table

SFW statement

semantics:

for t1 in R1: ... for tm in Rm: if cond(t1, ..., tm): compute and output A1,...,An as a row

eg.

-- all rows in the User table SELECT * FROM User; -- name of users under 18 SELECT name FROM User WHERE age<18; -- where was Lisa born? -- select list can contain exprs and funcs eg. SUBSTR, ABS -- string literals (case sensitive) are enclosed in single quote SELECT 2021-age FROM User WHERE name = 'Lisa';

eg. list IDs and names of groups with a user whose name contains "Simpson"

-- gid and name column names are not unique, so specify table SELECT Group.gid, Group.name FROM User, Member, Group WHERE User.uid = Member.uid -- join clause AND Group.gid = Member.gid -- join clause AND User.name LIKE '%Simpson%'; -- %: wildcard

eg. Names of all groups that Lisa and Ralph are both in

SELECT g.name FROM User AS u1, User AS u2, Member AS m1, Member AS m2, Group AS g -- can omit AS WHERE u1.name = 'Lisa' AND u2.name = 'Ralph' AND u1.uid = m1.uid AND u2.uid = m2.uid AND m1.uid = g.gid AND m2.uid = g.gid; -- both people are in group with gid

eg. πR.A,S.B(Rp1S)p2(πT.Cσp3T)=πR.A,S.B,T.Cσp1p2p3(R×S×T)\pi_{R.A,S.B}(R\Join_{p_1}S)\Join_{p_2}(\pi_{T.C}\sigma_{p_3}T)=\pi_{R.A,S.B,T.C}\sigma_{p_1\land p_2 \land p_3}(R\times S\times T) (combine selections, projections and products)

difference relational algebra vs. sql:

SQL set & bag operations

eg.

Bag1 Bag2 Fruit Fruit ----- ----- apple apple apple orange orange orange -- add counts > (SELECT * FROM Bag1) UNION ALL (SELECT * FROM Bag2); Fruit ----- apple apple orange apple orange orange -- apple: 3 orange: 3 -- proper subtraction on counts > (SELECT * FROM Bag1) EXCEPT ALL (SELECT * FROM Bag2); Fruit ----- apple -- 2-1=1 -- take minimum of two counts > (SELECT * FROM Bag1) INTERSECT ALL (SELECT * FROM Bag2); Fruit ----- apple orange

table subqueries

eg. names of users who poked others more than others poked them

-- uid1 uid2 ---------------- -- user1 user2 SELECT DISTINCT name FROM User, (SELECT uid1 FROM Poke) EXCEPT ALL (SELECT uid2 FROM Poke) AS T WHERE User.uid = T.uid;

scalar subqueries

eg. users at the same age as Bart

SELECT * FROM User WHERE age = (SELECT age FROM User WHERE name = 'Bart');

IN subqueries

eg. users at the same age as (some) Bart(s)

SELECT * FROM User WHERE age IN (SELECT age FROM User WHERE name = 'Bart');

EXISTS subqueries

eg. users at the same age as (some) Bart(s)

-- a correlated subquery: subquery that references tuple variables in surrounding queries SELECT * FROM User u WHERE EXISTS(SELECT * FROM User WHERE name = 'Bart' AND age = u.age);

quantified subqueries

eg.

-- most popular users SELECT * FROM User WHERE pop >= ALL(SELECT pop FROM User); SELECT * FROM User WHERE NOT (pop < ALL(SELECT pop FROM User));

aggregates

eg.

-- number of users under 18 and their avg pop -- COUNT(*) -> number of rows SELECT COUNT(*), AVG(pop) FROM User WHERE age < 18; -- how many users are in some group? SELECT COUNT(*) FROM (SELECT DISTINCT uid FROM MEMBER); SELECT COUNT(DISTINCT uid) FROM MEMBER;

grouping

eg.

uid name age pop ---------------------- 142 Bart 10 .9 857 Lisa 8 .7 123 Milhouse 10 .2 456 Ralph 8 .3 -- average popularity for each age group > SELECT age, AVG(pop) FROM User GROUP BY age; age AVG(pop) ------------- 10 .55 8 .5

having

eg.

-- list avg pop for each age group with more than 100 users SELECT age, AVG(pop) FROM User GROUP BY age HAVING COUNT(*) > 100; SELECT T.age, T.apop FROM (SELECT age, AVG(pop) apop, COUNT(*) gsize FROM User GROUP BY age) T WHERE T.gsize > 100;

aggregation and group provide more expressive power to relational algebra

order by

eg.

-- all users sort them by pop (descending), then name (ascending) SELECT * FROM User ORDER BY pop DESC, name; SELECT * FROM User ORDER BY 4 DESC, 2;

three-valued logic

eg.

-- if a row has NULL pop, following are not equivalent SELECT AVG(pop) FROM User; SELECT SUM(pop)/COUNT(*) FROM User; -- if a row has NULL pop, following are not equivalent SELECT * FROM User; SELECT * FROM User WHERE pop=pop; -- select users with NULL pop values -- this does not work SELECT * FROM User WHERE pop=NULL; -- works (SELECT * FROM User) EXCEPT ALL (SELECT * FROM User WHERE pop=pop); SELECT * FROM User WHERE pop IS NULL; -- IS NOT NULL for non-null

outerjoin

eg.

Group gid name uid gid ---------------------------- --------- abc Book Club 142 dps gov Student Govrnt 123 gov dps Dead Putting Society 857 abc nuk United Nuclear Workers 857 gov 789 foo > SELECT * FROM Group FULL OUTER JOIN Member ON Group.gid = Member.gid; gid name uid -------------------------------- abc Book Club 857 gov Student Govrnt 123 gov Student Govrnt 857 dps Dead Putting Society 142 foo United Nuclear Workers NULL foo NULL 789 > SELECT * FROM Group LEFT OUTER JOIN Member ON Group.gid = Member.gid; gid name uid -------------------------------- abc Book Club 857 gov Student Govrnt 123 gov Student Govrnt 857 dps Dead Putting Society 142 foo United Nuclear Workers NULL -- similar for ... INNER JOIN ... ON ...; -- ... NATURAL JOIN ...;

DML

insert

eg.

-- User 789 joins dps INSERT INTO Members VALUES (789, 'dps'); -- everybody joins dps INSERT INTO Member ( SELECT uid, 'dps' FROM User WHERE uid NOT IN ( SELECT uid FROM Member WHERE gid = 'dps' ) );

delete

eg.

-- delete everything from table DELETE FROM Member; -- User 789 leaves dps DELETE FROM Member WHERE uid = 789 AND gid = 'dps'; -- Users under age 18 must leave unk DELETE FROM Member WHERE uid in ( SELECT uid FROM User WHERE age < 18 ) AND gid = 'nuk';

update

eg.

-- User 142 changes name to Barney UPDATE User SET name = 'Barney' WHERE uid = 142; -- change everyone's pop -- note subquery avg is always computed ahead using old table... UPDATE User SET pop = (SELECT AVG(pop) FROM User);

constraints

common constraints

eg.

CREATE TABLE Member (uid DECIMAL(3,0) NOT NULL, gid CHAR(10) NOT NULL, PRIMARY KEY(uid, gid));

referential integrity

eg. if an uid appears in Member, it must appear in User, if gid appears in Member, it must appear in Group (to ensure no dangling ptrs exist)

CREATE TABLE Member ( uid DECIMAL(3,0) NOT NULL REFERENCES User(uid), gid CHAR(10) NOT NULL, PRIMARY KEY(uid, gid), FOREIGN KEY (gid) REFERENCES Group(gid) -- required for multiple-key eg Member(uid,gid) );

enforcing referential integrity

general assertion

eg.

CREATE ASSERTION MemberUserRefIntegrity CHECK ( NOT EXISTS ( SELECT * FROM Member WHERE uid NOT IN ( SELECT uid FROM User ) ) );

tuple- and attribute-based checks

eg.

CREATE TABLE User (age INTEGER CHECK(age IS NULL OR age > 0), ...); CREATE TABLE Member (uid INTEGER NOT NULL CHECK(uid IN (SELECT uid FROM User)), ...);

Week 3. May 25

triggers

a trigger is an event-condition-action (ECA) rule

eg. if user with pop < 0.5 joins group, delete them

CREATE TRIGGER PickySGroup AFTER UPDATE OF pop ON User -- event REFERENCING NEW ROW AS newUser -- transition variable FOR EACH ROW WHEN (newUser.pop < 0.5) -- cond AND (newUser.uid IN (SELECT uid FROM Member WHERE gid = 'sgroup')) DELETE FROM Member -- action WHERE uid = newUser.uid AND gid = 'sgroup'; -- same CREATE TRIGGER PickySGroup2 AFTER UPDATE OF pop ON User REFERENCING NEW TABLE AS newUsers -- can only be used with AFTER trigger FOR EACH STATEMENT DELETE FROM Member WHERE gid = 'sgroup' AND uid IN (SELECT uid FROM newUsers WHERE pop < 0.5);

availability for AFTER trigger:

event row statement
delete old row; old table old table
insert new row; new table new table
update old row, new row; old table, new table old table, new table

availability for BEFORE trigger

event row statement
delete old row
insert new row
update old row, new row

statement- vs. row-level triggers

system issues

views

a view is like a virtual table

eg. reuse query sgroup

CREATE VIEW SGroup AS SELECT * FROM User WHERE uid IN (SELECT uid FROM Member WHERE gid = 'sgroup'); SELECT AVG(pop) FROM SGroup; -- deletion DROP VIEW SGroup;

reasons to use views:

sql92 updateable views:

eg. let the average pop be 0.5

CREATE VIEW AveragePop(pop) AS -- renamed column SELECT AVG(pop) FROM User; CREATE TRIGGER AdjustAveragePop INSTEAD OF UPDATE ON AveragePop -- don't do action on view, do it in base table REFERENCING OLD ROW AS o, NEW ROW AS n FOR EACH ROW UPDATE User SET pop = pop + (n.pop - o.pop); UPDATE AveragePop SET pop = 0.5;

indexes

an index is an auxiliary persistent data structure

recursive queries

eg. find all ancestors of Bart

parent child -------------- Homer Bart Homer Lisa Marge Bart Marge Lisa Abe Homer Ape Abe -- nonlinear recursion WITH RECURSIVE Ancestor(anc, desc) AS (( -- base case SELECT parent, child FROM Parent ) UNION ( -- recursive relation SELECT a1.anc, a2.desc FROM Ancestor a1, Ancestor a2 WHERE a1.desc = a2.enc ) -- execution stops until a fixed point ) SELECT anc FROM Ancestor WHERE desc = 'Bart'; -- linear recursion WITH RECURSIVE Ancestor2(anc, desc) AS (( SELECT parent, child FROM Parent ) UNION ( SELECT anc, child FROM Ancestor2, Parent WHERE Ancestor2.desc = Parent.parent ) )...

((mysql) should you put "desc" condition in base case?)

defn. for f:DDf:D\rightarrow D, a fixed point of ff is a value xx such that f(x)=xf(x)=x.

to compute a fixed point of f, start with a seed xx0x\leftarrow x_0, compute f(x)f(x)

defn. a query q is just a function that maps an input table to an output table, so a fixed point of q is a table T such that q(T)=T.

to compute fixed point of q, start with empty table TT\leftarrow\varnothing, eval q over T

linear vs. nonlinear recursion

eg. Natural = 1,2,3,... 1 is odd; odd+1 is even; even+1 is odd/

WITH RECURSIVE Even(n) AS ( SELECT n FROM Natural WHERE n = ANY(SELECT n+1 FROM odd) ), RECURSIVE Odd(n) AS (( SELECT n FROM Natural WHERE n = 1 ) UNION ( SELECT n FROM Natural WHERE n = ANY(SELECT n+1 FROM Even) ) )...

semantics:
WITH RECURSIVE R1 AS Q1,..., RECURSIVE Rn AS Qn Q;

  1. R1,...,RnR_1\leftarrow\varnothing,...,R_n\leftarrow\varnothing
  2. eval Q1,...,QnQ_1,...,Q_n using current contents of R1,...,RnR_1,...,R_n
  3. if RinewRiR_i^{\textrm{new}}\ne R_i for some i
    1. R1R1new,...,RnRnnewR_1\leftarrow R_1^\textrm{new},...,R_n\leftarrow R_n^\textrm{new}
    2. goto 2
  4. compute QQ using current contents of R1,...,RnR_1,...,R_n and output

if q is monotone, then starting from empty set produces the unique minimal fixed point

if q is non-monotone:

should recursive queries contain negation?

eg. find pairs of persons with no common ancestors

WITH RECURSIVE Ancestor(anc, desc) AS (SELECT parent, child FROM Parent) UNION (SELECT a1.anc, a2.desc FROM Ancestor a1, Ancestor a2 WHERE a1.desc = a2.anc)), Person(person) AS ((SELECT parent FROM Parent) UNION (SELECT child FROM Parent)), NoCommonAnc(person1, person2) AS ((SELECT p1.person, p2.person FROM Person p1, Person p2 WHERE p1.person <> p2.person) EXCEPT (SELECT a1.desc, a2.desc FROM Ancestor a1, Ancestor a2 WHERE a1.anc = a2.anc)) SELECT * FROM NoCommonAnc;

img

evaluating stratified negation

using sql

cursor

SQL/PSM

advantages:

working with SQL through API

embedded sql

pros:

cons:

support db features through an OO language

extend a general-purpose programming language with sql-like constructs

Week 4. June 2

database design

  1. understand real-world domain being modeled -> specify it using database design model
  2. translate specs to date model of DBMS
  3. create DBMS schema

entity-relationship model

eg. how to add info regarding isMemberOfSince? img

multiplicity of relationships

general cardinality constraints

eg.
img

weak entity sets

eg. seats in rooms in building
img

ISA (extension)

eg.
img

structured attributes (extension)

eg. an emplyee can have multiple hobbies
img

attributes or entity sets?

eg. how to model employee's phones?
img

entity sets or relationships?

eg.

eg.

img

translate entity sets

translate relationship sets

eg.
img

translate ISA

eg.
img

represent aggregation

principle

Week 5. June 8

eg. supplier & part

SuppliedItems(*Sno*, Sname, City, *Pno*, Pname, Price) -- vs Suppliers(*Sno*, Sname, City) Parts(*Pno*, Pname) Supplies(*Sno*, *Pno*, Price)

single-table schema suffer from

designing good dbs

goals:

functional dependency

a functional dependency has the form X->Y where X and Y are sets of attributes in the relation. whenever two rows agree on all the attributes in X, they must agree on all attributes in Y.

eg.

EmpProj(*SIN*, *PNum*, Hours, EName, PName, PLoc, Allowance)
  1. SIN determines employee name: SIN -> EName
  2. project # determines project name and location: PNum -> PName, PLoc
  3. allowances are always the same for the same number of hours at same location: PLoc, Hours -> Allowance

defn. a set of FDs F\mathcal{F} logically implies a FD X->Y if X->Y holds in all instances of R that satisfy F\mathcal{F}.

defn. F+:={XY:F logically implies XY}\mathcal{F^+}:=\{X\rightarrow Y: \mathcal{F}\text{ logically implies } X\rightarrow Y\} is the closure of a FD set F\mathcal{F}.

axioms. (Armstrong's)

theorem. (decomposition) if XYZX\rightarrow YZ, then XYX\rightarrow Y and XZX\rightarrow Z.

theorem. (union) if XY,XZX\rightarrow Y,X\rightarrow Z, then XYZX\rightarrow YZ.

eg. given

show SIN, PNum -> Allowance.

1 SIN, PNum -> Hours in F
2 PNum -> PName, PLoc in F
3 PLoc, Hours -> Allowance in F
4 SIN, PNum -> PNum reflexivity
5 SIN, PNum -> PName, PLoc transitivity, 4, 2
6 SIN, PNum -> PLoc decomposition, 5
7 SIN, PNum -> PLoc, Hours union, 6, 1
8 SIN, PNum -> Allowance transitivity, 7, 3

defn. the closure of attributes ZZ in a relation (denoted Z+Z^+) with respect to a set of FDs is the set of all attributes {A1,A2,...}\{A_1,A_2,...\} such that ZA1,A2...Z\rightarrow A_1,A_2....

to compute Z+(Z,F)Z^+(Z,\mathcal{F}):

  1. start with closure = Z
  2. if X->Y and X is already in the closure, then also add Y to the closure
  3. repeat until no new attrs can be added

eg. we can show SIN, PNum -> Allowance by claiming {Allowance} is contained by Z+({SIN, Num}, F).

eg. to check K is a key of R, check its attribute closure is equal to the set of all keys.

decomposing relation

defn. let R be a relation schema, the collection {R1,...,Rn}\{R_1,...,R_n\} of relations is a decomposition of R if R=R1...RnR=R_1\cup...\cup R_n.

'good' schema decomposition

eg. R={Student, Assignment, Group, Mark}, F={Student, Assignment -> Group, Mark}

defn. a relation R is in BCNF iff whenever (XY)F+(X\rightarrow Y)\in \mathcal{F}+ and XYRXY\subseteq R, then either

BCNF decomposition algorithm:

  1. find a BCNF violation, ie a nontrivial FD XYX\rightarrow Y where X is not superkey
  2. decompose R into
  3. repeat until all in BCNF

theorem. BCNF decomposition is lossless, ie given nontrivial XYX\rightarrow Y in R where X is not a superkey, we have R=πXYRπXZRR=\pi_{XY}R\Join \pi_{XZ}R.

eg.
img

defn. a relation R is in 3NF (third normal form) iff whenever (XY)F+(X\rightarrow Y)\in \mathcal{F}+ and XYRXY\subseteq R, then either

defn. a set of FD is minimal if

finding minimal cover:

  1. remove first violation: replace XYZX\rightarrow YZ with XYX\rightarrow Y and XZX\rightarrow Z
  2. remove second violation: remove XAX\rightarrow A from F\mathcal{F} if AcomputeX+(X,F{XA})A\in\text{compute}X^+(X,\mathcal{F}-\{X\rightarrow A\})
  3. remove third violation: remove A from lhs of XBX\rightarrow B iif BcomputeX+(X{A},F)B\in\text{compute}X^+(X-\{A\},\mathcal{F})

computing 3NF decomposition:

  1. initialize the decomposition with empty set
  2. find a minimal cover for F\mathcal{F}, let it be F\mathcal{F}^*
  3. for every (XY)F(X\rightarrow Y)\in \mathcal{F}^*, add a relation {XY}\{XY\} to the decomposition
  4. if no relation contains a candidate key for R, then compute a candidate key K for R, and add {K}\{K\} to decomposition.

3NF allows more redundancy, but is dependency-preserving.

Week 6. June 15

physical data organization

location cycles
registers 1
on-chip cache 2
on-board cache 10
memory 100
disk 10^6 non volatile
tape 10^9 non volatile
spindle | dsik =============== platter head ---------+ | =============== platter =============== platter | tracks (concentric circles)

magnetic disks

disk access time is the sum of:

random disk access:

sequential disk access:

SSD

consequences:

performance tricks:

record layout

record = row in a table

eg. fixed-length fields

CREATE TABLE User(uid INT, name CHAR(20), age INT, pop FLOAT); 0 4 24 28 36 +-----+----------------------+----+----------+ | 142 | 'Bart' | 10 | 0.9 |

what about NULL?

eg. variable-length record

CREATE TABLE User(uid INT, name VARCHAR(20), age INT, pop FLOAT, comment VARCHAR(100));

put all variable-length field at then end, and

eg. LOB fields

CREATE TABLE User(uid INT, name CHAR(20), age INT, pop FLOAT, avatar BLOB(32000));

block layout

NSM (N-ary Storage Model):

img

PAX (Partition Attributes Across):

img

column stores:

indexing

img

ISAM (index sequential access method)

B+-tree:

range query:
img

insert:

deleteion:

performance:

in practice:

the halloween problem:

B+-tree vs. ISAM

B+-tree vs. B-tree

multi-attribute indices

eg.

CREATE INDEX NameIndex ON User(LastName, FirstName); -- beneficial SELECT * FROM User WHERE LastName = '...' AND FirstName = '...'; -- not beneficial SELECT * FROM User WHERE FirstName = '...';

eg. index-only plan

-- if there is non-clustering index on (firstname, pop) SELECT COUNT(*) FROM User WHERE pop > '0.8' AND firstname = '...';

guidelines for indices:

Week 7. June 22

query processing

notation:

|u1 u2| |u3 u4| Memory: - # memory blocks: M | User | Member | | u1 | m1 | | u2 | m2 | Disk: - # rows for a table: |Users| - # disk blocks for a table: B(Users) = |Users| / # rows per block

table scan

scan table R and proces the query

nested-loop join: RpSR\Join_p S

for each block of R: for each row r in R block: for each block of S: for each row s in S block: if p(r, s) is true: output rs

blocked-based nested-loop join

for each block of R: for each block of S: for each r in R block: for each s in S block: if p(r, s) is true: output rs

more improvements

sorting

sort R, but R does not fit in memory. use external merge sort

eg.

3 memory blocks available, each holding 1 number input: 1,7,4,5,2,8,3,6,9 phase 0: - 1,7,4 -> 1,4,7 - 5,2,8 -> 2,5,8 - 9,6,3 -> 3,6,9 phase 1: - 1,4,7+2,5,8 -> 1,2,4,5,7,8 - 3,6,8 phase 2: - 1,2,4,5,7,8+3,6,9 -> 1,2,3,4,5,6,7,8,9

sort-merge join: RR.A=S.BSR\Join_{R.A=S.B}S

sort R by A, sort S by B r = R[0], s = S[0] repeat: if r.A > s.B: s = next in S else if r.A < s.B: r = next in R else: output all matching tuples s = next in S r = next in R until R or S is exhausted

optimization of SMJ:

other sort-based algos:

hashing

hash join: RR.A=S.BSR\Join_{R.A=S.B}S

main idea:

steps:

what if partition is too large for memory? read the partition back and partition it again. at most O(logMB(R))O(\log_MB(R)) times

hash join vs smj

other hash algorithms:

index-based algos

selection using index:

index nested-loop join: RR.A=S.BSR\Join_{R.A=S.B}S

for each block of R: for each r in R block: if r.A exists in S(B) s: output rs

zig-zag join using ordered indices: RR.A=S.BSR\Join_{R.A=S.B}S

index vs table scan:

summary

Week 8. June 29

query optimization

query tree logical plan physical plan parser --> validator --> optimizer --> executer

parsing and validation

logical plan:

physical plan

physical plan:

eg.

SELECT Group.name FROM User, Member, Group WHERE User.name = 'Bart' AND User.uid = Member.uid AND Member.gid = Group.gid; -- physical 1: PROJECT(Group.name) | MERGE-JOIN(gid) / \ / \ SORT(gid) SCAN(Group) | MERGE-JOIN(uid) / \ / \ FILTER(name='Bart') SORT(uid) | | SCAN(User) SCAN(Member) -- physical 2: PROJECT(Group.name) | INDEX-NESTED-LOOP-JOIN(gid) / \ / \ INDEX-NESTED-LOOP-JOIN(uid) index on Group(gid) / \ / \ INDEX-SCAN(name='Bart') index on Member(uid) | index on User(name)

cardinality estimation

selection with equality predicate: Q=σA=vRQ=\sigma_{A=v}R

eg. consider |User| = 1000, πnameUser|\pi_{\text{name}}\text{User}| = 50

FILTER(name='Bart') | SCAN(User)

then size of this output is 1000/50=20.

selection with conjunctive predicate: Q=σA=uB=vRQ=\sigma_{A=u\land B=v}R

selection with negation: Q=σAvRQ=\sigma_{A\neq v}R

selection with disjunctive predicate: Q=σA=uB=vRQ=\sigma_{A=u\lor B=v}R

eg. consider |User| = 1000, πnameUser|\pi_{\text{name}}\text{User}| = 50, πpopUser|\pi_{\text{pop}}\text{User}| = 5.
then σname=Bartpop=2User|\sigma_{\text{name=Bart}\lor pop=2}\text{User}| = 1000/50 + 1000/(50*5) - 1000/(50*5) = 216.

selection with range: Q=σA>vRQ=\sigma_{A>v}R

two-way equi-join: Q=R(A,B)S(A,C)Q=R(A,B)\Join S(A,C)

using similar ideas, we can estimate size of projection, duplicate elimination, union, difference, aggregation (with grouping).

eg. estimate io cost of physical 1 plan from bottom to up.

heuristics-based query optimization

we use heuristic transformation rules to find a cheaper plan, for example:

convert σ−× to/from $\Join_p$: σp(R×S)=RpS\sigma_p(R\times S)=R\Join_pS

merge/split σ's: σp(σqR)=σpqR\sigma_{p}(\sigma_qR)=\sigma_{p\land q}R

merge/split π's: πL1(πL2R)=πL1R\pi_{L_1}(\pi_{L_2}R)=\pi_{L_1}R when L1L2L_1\subseteq L_2

push down/pull up σ: σpprps(RpS)=(σprR)pp(σpsS)\sigma_{p\land p_r\land p_s}(R\Join_{p'}S)=(\sigma_{p_r}R)\Join_{p\land p'}(\sigma_{p_s}S), where

push down π: πL(σpR)=πL(σp(πLLR))\pi_L(\sigma_pR)=\pi_L(\sigma_p(\pi_{LL'}R)), where

eg.
img

heuristics-based query optimization:

cost-based optimization:

Week 9. July 6

transactions

problems to solve:

defn. a transaction is a sequence of database operations with the following properties (ACID):

sql transactions

sql transactions:

fine prints:

atomicity:

durability:

consistency:

isolation:

sql isolation

sql isolation levels:

READ UNCOMMITTED:

-- t1 UPDATE User SET pop = 0.99 WHERE uid = 142; -- 1 ROLLBACK; -- 3 -- t2 SELECT AVG(pop) FROM User; COMMIT; -- 2

READ COMMITTED:

-- t1 UPDATE User SET pop = 0.99 WHERE uid = 142; COMMIT; -- 2 -- t2 SELECT AVG(pop) FROM User; -- 1 SELECT AVG(pop) FROM User; COMMIT; -- 3

REPEATABLE READ:

-- t1 INSERT INTO User VALUES(789, '', 10, 0.1); COMMIT; -- 2 -- t2 SELECT AVG(pop) FROM User; -- 1 SELECT AVG(pop) FROM User; COMMIT; -- 3

SERIALIZABLE:

defn. let

then transaction T is a partial order T={Σ,<}T=\{\Sigma,<\} where

eg. consider transaction T = {read(s), read(y), x=x+y, write(x), commit}

defn. an execution history over a set of transactions T1,...,TnT_1,...,T_n is an interleaving of the operations of T1,...,TnT_1,...,T_n where the operation ordering imposed by each transaction is preserved

eg. T1={w1[x],w1[y],c1},T2={r2[x],r2[y],c2}T_1=\{w_1[x],w_1[y],c_1\},T_2=\{r_2[x],r_2[y],c_2\}, we can have history

we can check equivalence also by:

defn. two operations conflict if

  1. they belong to different transactions,
  2. they operate on same object, and
  3. at least one of the operations is a write.

defn. two histories are (conflict) equivalent if

defn. a history HH is (conflict) serializable if there exists some serial hostory HH' that is (conflict) equivalent to HH.

eg. show Hc=w1[x]r2[x]r2[y]w1[y]c1c2H_c=w_1[x]r_2[x]r_2[y]w_1[y]c_1c_2 is not serializable.

defn. a serialization graph SGH(V,E)SG_H(V,E) for history H is

theorem. a history is serializable iff its serialization graph is acyclic.

eg. show Ha=w1[x]r2[x]w1[y]r2[y]c1c2H_a=w_1[x]r_2[x]w_1[y]r_2[y]c_1c_2 is serializable.

locking (for isolation)

rules:

lock compatibility:

S X
S grant no
X no no

eg. basic locking is not enough. given a non-serializable history:
img

x and y start from x=y=100. at end x=202, y=201. the interleaving breaks x=y.

two-phase locking:

img

remaining problem of 2PL:

strict 2PL:

recovery

execution model:

failures:

naive approach

T1: read(A, a); a = a - 100; write(A, a); read(B, b); b = b + 100; write(B, b); commit;

logging:

steps:

checkpointing:

Opt. July 27

distributed systems

data delivery alternatives:

distributed dbms promises:

distributed dbms architecture:

+-------+ | USER | ++-----++ ^ | +-------------------------+-----+-----------------------------------------------------------+ | | v | | user +------+-----+-----------+ +----------------+ | | processor | user interface handler | <-------+-+external schema | | | +-----------+------------+ | +----------------+ | | v | | | +-----------+------------+ <-------+ | | |semantic data controller| | | +-----------+------------+ <-------+ +------------------------+ | | v +-+global conceptual schema| | | +-----------+------------+ | +------------------------+ | | | global query optimizer | <-------+ | | +-----------+------------+ | | v | | +-----------+------------+ | | |global execution monitor| | | +------------------------+ | +------------------------------|------------------------------------------------------------+ +------------------------------v------------------------------------------------------------+ | +------------------------+ +-----------------------+ | | data | local query processor | <--------+ |local conceptual schema| | | processor +-----------+------------+ +-----------------------+ | | v | | +-----------+------------+ +-----------+ | | |local recovery manager | <--------+ |system log | | | +-----------+------------+ +-----------+ | | v | | +-----------+------------+ +-----------------------+ | | |runtime support processor <--------+ |local internal schema | | | +-----------+------------+ +-----------------------+ | | | | +------------------------------+------------------------------------------------------------+ v +--+---+ |store | +------+

client/server architecture:
img

distributed database server:

+---------+ +-------+ | clients | ... |clients| +----+----+ +---+---+ network | | <-----------------+--+----------------------+-----+---------------> | | | | +--------+---------+ +--------+---------+ |application server| |application server| ... +--------+---------+ +--------+---------+ | | network | | <-------------+------+----------+-----------+------+---------------> | | | | | | +------+--------+ +------+--------+ +-------+-------+ |database server| |database server| |database server| +------+--------+ +------+--------+ +-------+-------+ ^ ^ ^ v v v +--+--+ +--+--+ +--+--+ |store| |store| |store| | | | | | | +-----+ +-----+ +-----+

peer-to-peer architecture:

+-------------------------------------------------------------+ | | | +-----------------------+ | local query | +---------+ | data management layer | | +----------->+ data | | | +--------+ | global query | | mgmt | | +query manager | | P2P | | +----------->+ api / +<-->+ | +<-->+ network| | results | | user | | |update manager | | layer | | +--------+ <------------+interface| | | | +--------+ | | peer | | +---------+ | +cache manager | | +--------+ | | | +<---> cloud | v-->+--------+------------+-+ | | wrapper ^ ^ | +--------+ | ^v v v | | peer | | +----------+ +---+----+ +-+--------+ | +--------+ | |local data| |semantic| |remote | | | |source | |mappings| |data cache| | | | | | | | | | | +----------+ +--------+ +----------+ | | | | peer | +-------------------------------------------------------------+

data distribution

eg. horizontal fragmentation:

PROJ Pno Pname Budget Loc ------------------------------------ P1 instrut 150000 montreal P2 db dev 135000 NY P3 cad/cam 250000 NY -- projects with budgets < 200000 PROJ1 Pno Pname Budget Loc ------------------------------------ P1 instrut 150000 montreal P2 db dev 135000 NY -- projects with budgets >= 200000 PROJ2 Pno Pname Budget Loc ------------------------------------ P3 cad/cam 250000 NY

eg. vertical fragmentation:

-- project budget info PROJ3 Pno Budget ---------------- P1 150000 P2 135000 P3 250000 -- project other info Pno Pname Loc ------------------------ P1 instrut montreal P2 db dev NY P3 cad/cam NY

allocation alternatives:

if read-only queriesupdate queries1\frac{\text{read-only queries}}{\text{update queries}}\gg 1, then replication is advantageous, otherwise problems.

query processing

+ calculus query on | distributed relations | v | +-------------------------+ | |query normalization | +-----------------+ control| |analysis, simplification | <- |global schema | site | +-------------------------+ +-----------------+ | v | algebraic query on | distributed relations | v | +-------------------------+ +------------------+ | |data localization | <- |fragment schema | | +-------------------------+ +------------------+ | v | fragment query | v | +-------------------------+ +------------------+ + |global optimization | <- |stats on fragments| +-------------------------+ +------------------+ v optimized fragment query + with communication operations | v local| +-------------------------+ +------------------+ sites| |local optimization | <- |local schema | | +-------------------------+ +------------------+ | v + optimized local queries

data localization:

eg. ProjEmp(Proj1Emp)(Proj2Emp)\texttt{Proj}\Join\texttt{Emp}\Rightarrow (\texttt{Proj1}\Join\texttt{Emp})\cup(\texttt{Proj2}\Join\texttt{Emp}) (parellel).

global query optimization:

cost-based optimization:

cost function:

concurrency control

distributed locking:

distributed reliability:

data replication

one-copy equivalence: effect of transactions performed by clients on replicated objects should be the same as if they had been performed on single set of objects.

write(x) v +---+ | x | logical data +---+ item +------------+------------+ wtite(x1) write(x2) write(x3) v v v +--+ +--+ +--+ |x1| |x2| |x3| physical items +--+ +--+ +--+ (replicas)

consistency models: how to reason about consistency of the 'global execution state'?

update management strategies:

eg. there is only one copy which can be updated (master), all others (slaves) are updated reflecting changes to the master.

parallel systems

parallel dbms:

parallel system architectures:

img

issues:

warehousing

OLAP data data + visualization mining | + + | | | +-v v v-----+ +------+-------+ |data warehouse| | | +-+----+-----+-+ | | | | | | | | | | | | | | | v----+ v +---v +--+ +-++ +--+ | | | | | | | | | | | | +--+ +--+ +--+ operational databases

creating warehouse:

on-line analytical processing (OLAP)

OLAP queries:

eg.

Location lid store city province country ------------------------------------------------- A Weber Waterloo ON CA B F-H Kitchener ON CA C Park Kitchener ON CA Product pid pname category price --------------------------------------- 1 Bolt Hardware .1 2 Nut Hardware .05 3 Wrench Tools 1.99 Sales -------------------------------------- lid pid tid sales A 1 1 11 A 2 1 12 B 1 1 14 ... Time tid date week month quarter year ---------------------------------------------------- virtual relation

operations:
img

generating the data cube:

SUM(sales) GROUP BY location, product, time; -- raw cells SUM(sales) GROUP BY location, time; SUM(sales) GROUP BY product, time; SUM(sales) GROUP BY product, location; SUM(sales) GROUP BY product; SUM(sales) GROUP BY location; SUM(sales) GROUP BY time; SUM(sales); -- sql:1999 cube operator groups by all combinations SELECT lid, pid, tid, SUM(sales) FROM Sales GROUP BY CUBE(lid, pid, tid); -- materialized view -- useful for any query that -- 1. rolls-up the Location dimension to at least City -- 2. rolls-up the Time dimension to at least Quarter CREATE VIEW ByCityQuarter(city, pid, quarter, sales) AS SELECT city, pid, QUARTER(tid), SUM(sales) FROM Sales S, Location L WHERE S.lid = L.lid GROUP BY city, pid, QUARTER(tid);

issues related to using materialized views:

data mining

classification rules:

img

construction of decision trees:

other types of classifiers:

association rules:

clustering:

hierarchical clustering:

chordata / \ mammal reptiles / \ / \ leopards humans snakes crocodiles