Disk Storage

Primary file organization: how files are physically stored on disk; where the read/write head has to go to read the file. You want to mess with this as little as possible! Find your data a good home and leave it there. There’s only one way to do this per file: you will store/arrange your records by one particular field (ID, SSN, LName, etc.)

Secondary file organization: auxiliary means of accessing data; indices! You can do all of these you want! Index by City, index by SSN… you basically list the records and their physical locations in order of whatever field suits your needs.

Primary file organization (Chapter 13!)

Things to remember when working with disks:

  1. Disks offer random access, read by block (LBA: Logical Block Address — usu. starts at 0, goes to n-1, where n = # blocks on disk)
  2. We still use tape backup because tape is the cheapest storage per bit!
  3. reading by bits or records is inefficient! We read by blocks!
  4. seek time: time to move head to block
  5. latency (a.k.a. rotation delay): time to spin disk to right position under head
  6. block transfer time: time to move block from disk to head to memory (or back, for write) — usually takes the most time
  7. placing blocks contiguously reduces seek time and latency, but that’s beyond scope of this course, so we will concentrate on reducing the number of block reads and writes
  8. Double buffering allows us to read Block 1, let CPU start beating on it while we load Block 2 into the other buffer. Then the moment CPU gets done beating on #1, it turns to the other buffer and beats on #2, no waiting! This speeds up the total process by allowing processing to happen simultaneously with some data transfer
  9. Non-spanning: records can’t be broken up across blocks. This wastes some block space
  10. Blocking factor: how many records we can fit per block
  11. Fixed-length records waste space, but they are much more efficient to process and are what we will usually use.
  12. Read takes one block transfer; write usually requires two transfers, unless we luck out and the block we want is sitting in memory. Systems may allow pegging (a.k.a. pinning): we designate certain blocks as worth keeping around in the buffer.
  13. Choose your primary file organization to make most efficient the operations most important to you (search, insert, update, delete)
    1. Heap: unordered! INSERT by APPEND, records ordered by time created.
      1. Inserting takes 1-2 block reads (find last block, see if there’s room, write here or create new block to write).
      2. Searching is a pain, requires b/2 block reads on average; requires reading all b blocks on a search for a non-existent record.
      3. DELETE: must first search (avg. b/2). Then must decide what to do with the empty space! Instead of physically deleting the record and either Swiss-cheesing the disk or having to haul records in to backfill, you can just flag the record with a deletion bit. Odds are we’ll use downtime to compress the disk to fill the holes (garbage collection).
    2. Physical sort: ordered on some oft-used field (like EmpID, SSN, ZIP…)
      1. INSERT: pain! You’ve got to move everyone over to make room for new records… or you create overflow file in heap, deal with searching the whole overflow until we have time to do garbage collection during downtime.
      2. search: easier on the sort field (so you use this method when you do a lot of searches on that particular field)
      3. DELETE: again, sped up. Leave flag on DELETEd record. Note that DELETE could allow us to stick in INSERTs if there’s room… but making the computer consider that possibility usually eats more time than just sending every INSERT to the overflow.
      4. Note that physical sort allows binary search! Go to the middle, check whether we need to go below or above, keep halving until we hit target! Binary search takes about log2(b) block reads (see the sketch after this list).
  14. Hashing: a way of mapping records to disk by shrinking a wide key range down to a small range of bucket addresses; the primary key mod the number of buckets is often used; good academic topic that we’re not spending a lot of time on!
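
To make the costs above concrete, here is a minimal back-of-the-envelope sketch in Python. All of the sizes (record size, block size, record count, key value) are made up for illustration; only the formulas come from the notes above.

    import math

    # Hypothetical file (all sizes invented for illustration).
    record_size = 100            # bytes per fixed-length record
    block_size = 4096            # bytes per block
    num_records = 1_000_000

    # Blocking factor: records per block (non-spanning, so round down).
    bfr = block_size // record_size              # 40 records per block
    b = math.ceil(num_records / bfr)             # 25,000 blocks in the file

    # Heap (unordered) file: linear search.
    avg_heap_reads = b / 2                       # hit found halfway in, on average
    miss_heap_reads = b                          # must read every block to prove a miss

    # Physically sorted file: binary search on the ordering field.
    sorted_reads = math.ceil(math.log2(b))       # about 15 block reads

    # Hashing: mod of the primary key jumps straight to a bucket (~1 block read).
    bucket = 123_456_789 % b                     # bucket count is just a stand-in here

    print(bfr, b, avg_heap_reads, miss_heap_reads, sorted_reads, bucket)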

Secondary File Structures (Indexing! Chapter 14!)

  1. Making an index will save time in most cases
  2. Primary Index: built on primary (unique) key
  3. Clustering index: built on a non-key ordering field; groups records in chunks (all the Bobs, all the Marys, etc.)
  4. Primary Index is another table, a separate file, with two fields: the primary (candidate) key and the block number in which the record with that primary key value is stored. The primary key may be the ordering key for the primary storage; doing a binary search on this index is still shorter than doing a binary search on the disk itself (fewer blocks to churn through).
  5. Primary index allows binary search! Zoom! Searches now take log2(#blocks in Primary Index) + one more block read from disk (see the sketch after this list).
  6. Remember, the primary index may still be huge (imagine Blogger’s primary index on usernames), but it won’t be as huge (won’t use as many blocks) as the actual database
  7. Dense index: every record in database gets entry in index.
  8. Sparse index: not dense! (entries for only some records, e.g., one per block)
  9. A primary index can be sparse: since it uses the primary key, it can simply list the first or last key value (the anchor value) of each block on disk. This is good!
  10. Secondary index uses some key other than the ordering field of the primary file organization. Since the value of the key in this index has nothing to do with the block number, we need to have an index entry for every record.
  11. If search matters, create an index!
  12. You probably still have to read the index from disk. If you can keep the index in memory for lots of operations, you really speed things up.
  13. Binary search tree: every child to left smaller, every child to right larger; efficiency comes from minimizing the depth of the tree (number of levels necessary = ROUNDUP(log2(n+1)), where n = # records)
  14. B-tree: index scheme based on the binary search tree; the idea is that since we live in the world of disks and have to read entire blocks, we might as well pack as much into each block as we can
  15. B+-tree: internal nodes do not store pointer to the data block, which means there’s room for more index values and thus larger fan-out; only the leaf nodes have records paired with block pointers, so every search must go to the bottom of the tree. But the B+-tree’s bigger fan-out gives an advantage!
  16. So, B-tree, you probably need to go to the bottom leaf, but you might get lucky. B+-tree always goes to the bottom leaf. (Somewhere there’s a formula for this!)
  17. B+-tree also has the advantage that you can sequentially read the bottom row and get every record! B-tree, you’ve got to jump all over to get that complete list.
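
Continuing the made-up numbers from the sketch in the Chapter 13 list, here is a rough comparison of binary search on the data file versus a search through a sparse primary index, plus the kind of fan-out arithmetic that makes B+-trees shallow. The key and pointer sizes are assumptions for illustration, not anything from the lecture.

    import math

    block_size = 4096
    b = 25_000                   # data blocks, from the earlier sketch
    num_records = 1_000_000

    # Sparse primary index: one (anchor key, block pointer) entry per data block.
    key_size, ptr_size = 9, 6                                   # assumed sizes in bytes
    entries_per_block = block_size // (key_size + ptr_size)     # ~273 entries per index block
    index_blocks = math.ceil(b / entries_per_block)             # ~92 index blocks

    binary_search_data = math.ceil(math.log2(b))                # ~15 reads on the data file
    via_primary_index = math.ceil(math.log2(index_blocks)) + 1  # ~8 reads: index + 1 data block

    # B+-tree: internal nodes hold only keys and child pointers, so fan-out is large.
    fanout = entries_per_block                                  # ~273 children per internal node
    bplus_levels = math.ceil(math.log(num_records, fanout))     # ~3 levels down to a leaf

    print(index_blocks, binary_search_data, via_primary_index, fanout, bplus_levels)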

Next Time: Optimization!

Chapter 6: Relational Algebra (yum!)

plan: expect midterm posted Monday or Tuesday; due Monday following spring break. Open note, open Net….

SQL is based on relational calculus, which asks for results but doesn’t specify an order of operations. Relational algebra is the basic set of operations for the relational model, more prescriptive, spelling out the steps. We’re looking at RA here because that’s what query optimizers (Chapter 15) use to perform the transformations necessary.

All objects in RA are relations: no duplicate tuples (records)! SQL doesn’t mind duplicates, eliminates them with UNIQUE or DISTINCT. Sorting to remove duplicates is expensive! In the old days, we didn’t want to incur that expense, so the guys who came up with SQL let duplicates slide.

sigma σ = SELECT

  • subscript like WHERE clause
  • horizontal partition: selects rows (see the toy sketch after the PROJECT list)
  • results always have fewer or same number of tuples as original R
  • commutative

PROJECT = pi π

  • vertical partition: selects columns
  • note! this is like a straight SQL SELECT, without WHERE
  • results must be set of tuples, no duplicates
  • results always have fewer or same number of tuples as original R
  • not commutative
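
A toy sketch of σ and π, treating relations as Python sets of tuples. The EMPLOYEE data is invented; the point is that SELECT keeps whole rows that pass a condition, while PROJECT keeps columns and, because the result is a set, silently collapses duplicates.

    # Toy relation: (name, city, salary). A set, so no duplicate tuples.
    EMPLOYEE = {
        ("Bob",   "Dallas", 50000),
        ("Mary",  "Austin", 60000),
        ("Alice", "Dallas", 60000),
    }

    def select(relation, predicate):           # sigma: horizontal partition (rows)
        return {t for t in relation if predicate(t)}

    def project(relation, *cols):              # pi: vertical partition (columns)
        return {tuple(t[c] for c in cols) for t in relation}

    # sigma_{city = 'Dallas'}(EMPLOYEE) -> Bob and Alice
    print(select(EMPLOYEE, lambda t: t[1] == "Dallas"))

    # pi_{salary}(EMPLOYEE) -> {(50000,), (60000,)}: the duplicate 60000 disappears
    print(project(EMPLOYEE, 2))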

You can write RA ops in a big nested expression, or you can step it and name the intermediate results

RENAME = rho ρ

  • renames table and/or columns

UNION ( ∪ )

  • the two relations we unite must be “type compatible”: same number of attributes, and each pair of corresponding attributes must be type compatible (have same or compatible domains: strings, CHARS, integers, etc…. it’s the datatype that matters, not the field name: default is to take on the field name of the first table)
  • Note that usually what you’ll do is PROJECT two disparate tables, then UNION the results, since we don’t leave a bunch of type-compatible tables sitting around in the database

Type compatibility is a requirement for all three set operations, UNION, INTERSECTION ( ∩ ), and DIFFERENCE ( – )

UNION and INTERSECTION are commutative

CARTESIAN/CROSS PRODUCT ( × )

  • has lots of tuples: multiply tuple count from each relation
  • add attribute count from each relation
  • usually not a meaningful relation in itself!

THETA JOIN, EQUIJOIN, NATURAL JOIN (NJ removes the duplicate join attribute); a toy sketch follows the SQL mapping below

SQL.SELECT = PROJECT
SQL.FROM = JOIN
SQL.WHERE = SELECT
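
And the same toy style for the set and join operators. The EMP and DEPT relations are invented; the comments carry the tuple-count arithmetic from the CROSS PRODUCT bullets above.

    EMP  = {("Bob", "D1"), ("Mary", "D2"), ("Alice", "D1")}    # (name, dept_no)
    DEPT = {("D1", "Dallas"), ("D2", "Austin")}                # (dept_no, city)

    def union(r, s):                   # requires type compatibility (same arity/domains)
        return r | s

    def cross(r, s):                   # tuple counts multiply, attribute counts add
        return {t + u for t in r for u in s}

    def natural_join(r, s, r_col, s_col):
        # Equijoin on r_col = s_col, then drop the duplicate join attribute from s.
        return {t + u[:s_col] + u[s_col + 1:]
                for t in r for u in s if t[r_col] == u[s_col]}

    print(union(DEPT, {("D3", "Brookings")}))      # 3 tuples; the relations are type compatible
    print(len(cross(EMP, DEPT)))                   # 3 * 2 = 6 tuples, 2 + 2 = 4 attributes each
    print(natural_join(EMP, DEPT, 1, 0))           # 3 tuples: each employee paired with a city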

We used to use rule-based query optimization, rules of thumb; now we use statistics-based query optimization: we have stats on the tables that inform our choices (see Chapter 15).

Relational databases took off because we can mathematically prove that stuff will work, using relational algebra! It is not the most efficient model (JOINs take time!), but that’s outweighed by the formal definitions that make stuff work.

Dr. Stephen Krebsbach introduces us to the course. This is a capstone course. Interestingly, this course is a theoretical response to the practical nature of the preceding database courses in this track. We’re going to talk about the why.

Want to know how advanced we are? We’re starting in Chapter 17. :-)

Get your money’s worth: ask questions! Read the book! Communicate!

First assignment: read Chapters 17-19, watch the videos for each chapter, and come up with good questions for next class period! Watch for homework assignment, probably coming out next week.

And remember: the videos were recorded last year: ignore any references to dates, other assignments, etc.! Admin info will come by e-mail, not video!

Cory presents on open access publication and peer review… holy cow! Then he gets to rest and listen to Ruba and Matt.

Quick note: during Q&A, Ruba observes that maybe we miss important phenomena by focusing on reviewing all the A-lit and not reading the trade journals.

Ruba develops ontology and model management further (refer to earlier presentation).

  • Based on existing semantic Web tech: use what you’ve got
  • Standardization is a big deal! allows widespread adoption
  • Remember Krishnan and Chari (2000) give the really good background on model management
  • distributed model management needed (models just like info; gotta be able to share)
  • model management and Service-Oriented Architectures synergy
  • models are like services: they have inputs, processes, and outputs
  • ontologies are applied to services… maybe do same to models?
  • RDF: Resource Description Framework: mechanism for describing Web resources; effort at standardization of metadata
  • then came RDF-S, OWL (more complex), SWRL/SQWRL (these are pronounced swirl and squirrel)…W3C can explain further
  • a lot of this is about helping machines reason through rules, figure out things like “The brother of Bob’s son is another one of Bob’s sons.”
  • We’re trying to incorporate semantic info (that’s hard!) along with syntactic info (that’s easy) and apply it to decision-making models
  • Three levels, from abstract to concrete:
    1. modeling paradigm: capture constructs
    2. model schema: model indep. of data
    3. model instance: particular problem with data
  • Again, the service-model analogy is a big deal. Models can be thought of as services, services can be thought of as models….
  • [Warning: the acronyms are growing to, in one instance, six letters long. Could the length of acronyms be an indicator of ripeness for a paradigm shift?… check that: eight letters. And a hyphen.]
  • I ask whether there is a field where we might get good empirical validation of the model; Deokar mentions semantic wikis in academic setting, useful also as portal for industry to register their own models in collaboration with other organizations, maybe security….

Matt Wills returns with electronic health records (EHR) and performance assessment (see Feb presentation). Tonight: some further investigation and proposal

  • Objectives:
    1. estab. strong theoretical foundation for model of clinical decision task performance with EHR
    2. redefine EHR in context of ERP systems
    3. explore/further define clinical decision task and EHR characteristics
    4. propose research protocol for data analysis and collection
  • Task-technology fit (TTF) theory (Goodhue and Thompson 1995)
    • nutshell: IT will have pos. impact when tech cap. matches task demands
  • EHR as ERP?
    • third-gen (3G) similar in many ways: basically, EHR can be the ERP for health care industry, can cover much the same business process aspects as ERP; applying the ERP model makes sense
    • TTF proposed for eval of entire IS infrastructure, not isolated tech
    • unlike EHR, ERP extensively studied, can guide this research
  • Four major themes in clinical decision theory:
    1. Pattern recognition (lowest level: requires no specific depth of knowledge; experience-based; basis of majority of decisions)
    2. Algorithms/heuristics (rules of thumb; still grouping, like pattern recognition, but requires more knowledge)
    3. Event-driven (no existing rules of thumb or patterns help; focus shifts to immediate therapeutic options, not driven by diagnosis or deep info)
    4. Hypothetical deduction (highest level of deduction)
  • utilization may drop out of Matt’s version of the TTF model, since EHR utilization will likely be mandated: nothing to study there! Performance is the only outcome variable of interest

Exam up tonight! Due Thurs. 11:55 p.m.

  • Remember: you get 2 hours!
  • Penalty per word over 150 per essay!

Panko, Chapter 11: Network Applications

Users care about applications. The rest is a black box they don’t want or need to open.

1950s-60s: apps hosted by mainframes that did all the work, dumb terminals connected (usu. via coax-style cable, but some by modem-telephone network-modem). No integrated circuit CPU until 1971 (Intel! the 4004 4-bit microprocessor); all done by vacuum tubes (when “bugs” in the system meant bugs in the system: moths came to the light emitted!). Slow response time, monochrome text, graphics rare, transmission expensive.

1981: first IBM PC used Intel 8088 8-bit CPU

Client/Server Computing: the terminal does at least some processing. Remember: today’s laptops can run the Space Shuttle. Servers are still faster.

Filtering the e-mail that comes into your server is really important if you’re in one of the industries where regs require archiving all e-mail for seven years! Several apps can help with that process.

E-mail protocols:

  1. SMTP (Simple Mail Transfer Protocol): sender-initiated
  2. POP or IMAP to download; receiver initiates
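
A minimal sketch of that sender/receiver split using Python’s standard library. The host names, addresses, and password are placeholders, not anything from the lecture; real servers and credentials would be needed for this to run end to end.

    import poplib
    import smtplib
    from email.message import EmailMessage

    # Sending: SMTP is sender-initiated.
    msg = EmailMessage()
    msg["From"], msg["To"], msg["Subject"] = "me@example.com", "you@example.com", "Hi"
    msg.set_content("Testing SMTP.")
    with smtplib.SMTP("smtp.example.com", 587) as smtp:   # placeholder server
        smtp.starttls()
        smtp.login("me@example.com", "app-password")      # placeholder credentials
        smtp.send_message(msg)

    # Receiving: POP3 is receiver-initiated; the client asks to download what is waiting.
    pop = poplib.POP3_SSL("pop.example.com")              # placeholder server
    pop.user("me@example.com")
    pop.pass_("app-password")
    msg_count, mailbox_bytes = pop.stat()                 # how much mail is waiting
    pop.quit()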

Separate E-mail Body Standards for all-text, HTML…

NetAdmins need to worry more about viruses, worms, Trojan horses, spam, spyware, etc. Widespread problems; antivirus software is almost universal and generally ineffective

  1. So outsource to Postini! Even if you still maintain your own firewall, two lines of defense are better than one.

Spam is bad:

  1. consumes bandwidth
  2. consumes network staff time
  3. consumes user time
  4. may trigger sexual harassment suit (hostile environment! network admin has responsibility not to let such garbage into the workplace!)
  5. consumes storage space if regs require archiving of all e-mails

E-Commerce: Internet opened to it in 1991 (come on, almost 20 years: we should have it figured out by now! ;-) )

  • started with online catalogs
  • then shopping cart, checkout, payment
  • customer relationship management (CRM)
  • links to external systems
  • links to internal systems (accounting, pricing, warehousing, shipment, etc.: go check out eBay!)

Peer-to-Peer Computing:

  • Gehl investigated using this model to distribute user manuals and other docs so they wouldn’t have to upgrade a file server
  • method avoids overload of central server
  • also avoids single point of failure
  • gives end user more freedom
  • uses client capacity better
  • Problem: clients come and go (switched off for night!)
  • Problem: client IP addresses switch!
  • Biggest Problem: security! no central control!
  • grad student at research symposium last week talked about working up a model on using P2P capacity in Mayo Clinic to analyze med data
  • Pure P2P IM has no servers, but finding each other can be a hassle if you restart your computer and get a new IP

Ruba Aljafari speaks tonight (with contributions from Drs. Deokar and El-Gayar) on model management.

Hey! Don’t forget to do the quiz on D2L before April 15! It will be over the CSI 2008 slideshow PDF.

APA Bib: single-space refs, hanging indent, leave double space between entries.

Mark also said we need not use all-caps for headings. He doesn’t like the shouting.

Table titles go on top; figure titles go on bottom!