CS186 SQL5 Berkeley

notes / Database notes

字数统计: 1.3k | 阅读时长≈ 5 分钟

¶写在前面

Berkeley CS186 Intro to DB Systems

视频地址课程记录

¶SQL 5 File Organizations

¶Recall: Heap Files

回顾：堆文件

Unordered collection of records
记录的无序集合
Recall API for higher layers of the DBMS
用于数据库管理系统的更高层的API
- Insert / delete / modify record 增删改记录
- Fetch a particular record by record id 通过记录id获取特定记录
  - Record id is a pointer encoding pair of (pageID, location on page) 记录ID是一个指针，根据页面ID+记录位置编码而成的指针
- Scan all records 扫描所有记录
  - Possibly with some conditions on the records to be retrieved

¶Recall: Multiple File Organizations

数据库中的多种文件组织性质

Many alternatives exist, each good in some situationa and less so in others
各种文件形式都各有好坏
Heap Files：Suitable when typical access is a full scan of all record 堆文件→适用于对所有记录的完整浏览
Sorted Files：Best for retrieval in order, or when a range of records is needed 排序文件→适用于按顺序检索，尤其是需要同时检索若干个记录时
Clustered Files & Indexes：Group data into blocks to enable fast lookup and efficient modifications 集群文件和索引→适用于将数据分组到块中，实现快速查找和高效修改

¶Question

What is the “best” file organization? 最好的文件组织形式是什么？
- 取决于access pattern 即对数据库的访问模式

¶COST MODEL AND ANALYSIS

成本模型和分析

B：The number of data blocks in the file
B→文件中的数据块的数量
R：Number of records per block
R→每个数据块中的记录数量
D：(Average) time to read / write disk block
D→读写磁盘块的平均时间
Focus：Average case analysis for uniform random workloads 对统一的随机工作负载进行平均案例分析

¶Find Key x: Heap File

在堆文件形式中寻找某一记录x

P(i)：Probability that key is on page i is 1/B
T(i)： Number of pages touched if key on page i is i
此处缺一张图 Lec 5 Part 4 Equality (youtube.com) 1:17
平均检索x的成本：∑P(i) * T(i) = 2/B

¶Find Key x: Sorted File

在排序文件中寻找某一记录x

Worst-case: Pages touched in binary search 采用二分搜索
- log2 (B)
Average-case
- log2 (B) - (B - 1) / B
可以把二分搜索的过程画成二分树理解一下
推导的过程在Lec 5 Part 4 Equality (youtube.com) 4:00左右此处还是缺图。。。

¶Find Keys Between x and y: Heap File

在堆形式文件中寻找范围从x到y的若干记录

Always touch all blocks
基本都要遍历所有项，因为无序+可能存在重复项

¶Find Keys Between x and y: Comparison between Heap File and Sorted File

比较在堆形式文件和排序文件中寻找范围从x到y的若干记录

Heap File 堆
- Find beginning of range
  从开头开始搜索
Sorted File 排序文件
- 可以采用二分搜索

¶Insert x: Heap File

在堆形式文件中插入记录x

在堆末尾插入
插入成本：**2 * D ** = 读取该页 + 写入数据 + 把该内存区域写回磁盘上的页面【即读取+写入=2D】
注：每次处理磁盘上的数据时，不是直接对磁盘进行操作
- 而是：从磁盘读取数据后，在内存中操作数据，然后将内存缓冲区写回磁盘
- 所以任何的数据修改都会产生双重成本【读取+写入】

¶Insert x: Heap Vs Sorted

Heap File 堆文件
- Read last page, append, write. Cost = 2 * D 读取最后一部分，添加数据，写入磁盘
Sorted File
- Find location for record. Cost = log2 (BD) 遍历找到插入位置
- insert and shift rest of file 插入并移动文件的其余部分

¶Delete x: Heap File

在堆形式文件中删除记录x

Average case to find the record: B / 2 reads
Delete record from page
Cost = (B / 2 + 1) * D
- 找到要删除的数据的位置 + 删除操作

¶Delete x: Heap File Vs Sorted File

比较在堆形式文件和中寻找范围从x到y的若干记录

Heap File 堆文件
- Average case runtime: (B / 2 + 1) * D
Sorted File 排序文件
- Find location for record: log2 B 找到要删除的位置
- Shift the rest by 1 record 2 * (B / 2) 将剩余的记录移动一位，填补空缺位置

¶Cost of Operations: Equation Search Cost

操作成本:等式搜索成本

	Heap File	Sorted File
Scan all records 全浏览	B*D	B*D
Equality Search 寻找某一匹配项返回一个数据	0.5 * B * D	(log2 B) * D
Range Search 范围搜索返回一些数据	B*D	((log2 B) + pages) * D
Insert 插入	2*D	((log2 B) + B) * D
Delete 删除	(0.5 * B + 1) * D	((log2 B) + B) * D

B：The number of data blocks in the file
B→文件中的数据块的数量
R：Number of records per block
R→每个数据块中的记录数量
D：(Average) time to read / write disk block
D→读写磁盘块的平均时间
可以发现：对于排序文件来说，插入和删除记录需要的成本是一样的，但是对于堆文件就不一样：插入是无意识的操作，只需要在末尾插入；但是删除则需要找到对应的数据再进行删除，多了检索成本
do better ？ → use indexes 使用索引，可以be better！

请我喝杯咖啡吧~

支付宝

微信