CS186 SQL4 Disks, Files and Buffers I

2024-07-13

notes / Database notes

字数统计: 3.4k | 阅读时长≈ 14 分钟

¶写在前面

Berkeley CS186 Intro to DB Systems

视频地址课程记录

对应的视频链接地址：
- 本次第一部分：Lec 4 5 Part 1 Files (youtube.com)

¶SQL 4

¶Overview: Files of Pages of Records

Tables stored as logical files 在数据库中，表通常被视为逻辑上的文件，它们是数据的逻辑存储单元
- Consist of pages 由页面组成
  - Pages contain a collection of records 页面包含一系列记录
Pages are managed 页面管理
- On disk by the disk space manager: pages read / written to physical disk/files 磁盘空间管理器在磁盘上：页面被读取/写入到物理磁盘/文件
- In memory by the buffer manager: higher levels of DBMS only operate in memory 缓冲区管理器在内存中：数据库管理系统的更高层次仅在内存中操作

¶Database Files (Files of Pages of Records)

数据库文件（记录页文件）
DB FILE: A collection of pages, each containing a collection of records 数据库文件：页面集合，每页包含一组记录
API for higher layers of the DBMS 为数据库管理系统的更高层提供API
- Insert / delete / modify record 增删查改
- Fetch a particular record by record id 通过记录ID获取特定记录
  - Record id is a pointer encoding pair of (pageID, location on page) 记录ID是一个指针，编码为（页面ID，页面上的位置）
- Scan all records 扫描所有记录
  - Possibly with some conditions on the records to be retrieved 对要检索的记录有一些条件
Could span multiple OS files and even machines 跨越多个操作系统文件甚至机器
- Or “raw” disk devices 或“原始”磁盘设备

¶Many DB File Structure

数据库文件结构
数据库中的文件本身可以以多种方式构建
Unordered Heap Files 无序堆文件
- Records placed arbitrarily across pages 记录以没有特定顺序的方式任意放置在各个页面中
Clustered Heap Files 集群堆文件
- Records and pages are grouped 记录和页面被分组为具有相似值的记录
Sorted Files 排序的文件
- Pages and records are in sorted order 页面和记录以特定顺序保存
Index Files 索引文件
- 本质上是基于磁盘的数据结构，用于对数据进行查找
- B+ Trees, Linear Hashing B+ 树、线性哈希
- May contain records or point to records in other files 本身可能包含记录 / 包含指向其他文件中的记录的指针

¶Unordered Heap Files （无序堆文件）

Collection of records in no particular order 没有特定顺序的记录集合
- Not to be confused with “heap” data-structure 堆这种数据结构支持高效查找最小值，
As file shrinks / grows, pages (de)allocated 随着文件缩小/增大，页面分配（解除分配）
To support record level operations, we must 为了支持记录级操作
- Keep track of the pages in a file 跟踪文件中的页面
- Keep track of free space on pages 跟踪页面上的可用空间
- Keep track of the records on a page 跟踪页面上的记录

¶Heap File Implemented as List

页眉本身只有两个指针，一个指向没有可用空间的页面链接列表，另一个指向有可用空间的页面链接列表
Header page ID and Heap file name stored elsewhere 页眉 ID 和堆文件名存储在其他地方
- Database catalog 数据库目录
Each page contains 2 “pointers” plus free space and data 每个页面包含 2 个“指针”加上 可用空间 和数据
What is wrong with this? 这种设计有缺陷
- How do I find a page with enough space for a 20 byte records? 如何找到有足够空间容纳 20 字节记录的页面？
- 目前已知的方法：沿着可用空间页的链接列表走下去，查看其中一个页面是否有20字节。
- 问题：这样的方法效率较低
此处缺少一张图 5:09 画面下方的结构图

¶Better: Use a Page Directory

使用页面目录
Directory entries include: 目录条目包括
- #free bytes on the referenced page 引用页面上的空闲字节数
- 指向该页面的指针
Header pages accessed often → likely in cache 经常访问的标题页 → 可能在RAM的缓存中
Finding a page to fit a record required far fewer page 查找适合记录的页面所需的成本要少得多
- Because: One header page load reveals free space of many pages 因为：一次标题页加载会显示许多页面的可用空间
You can imagine optimizing the page directory further 进一步优化页面目录→可能通过压缩或将其放入更复杂的数据结构中
- But diminishing returns? 为了少量的Header页覆盖大量数据空间，不划算
此处缺少一张图 5:51 画面右侧

¶Summary

Table encoded as files which are collections of pages 表格被编码为页面集合的文件
表table是一个逻辑对象，但在磁盘上编码为文件，且文件本身是页面page的集合

¶Page Layout

单个页面的布局

¶Page Basics: The Header

页面标题区域可能包含以下内容
Header may contain:
- Number of records 页面包含的记录总数
- Free space 页面上的可用空间
- Maybe a next / last pointer 指向链接列表中下一个/最后一个页面的指针
- Bitmaps, Slot Table 位图、插槽表用于查找页面上的可用空间

¶Things to Address

需要在页面设计中解决的问题
Record length? Fixed√ or Variable 固定长度页面：记录records都有相同的长度可变长度页面：记录records长度不同
需要判断页面长度是否可变→来设计不同的页面布局
Find records by record id ? 通过记录的ID查找记录
- Record id = (Page, Location in Page)
- 同时还需要使ID随着时间推移而保持稳定
How do we add and delete records?

¶Options for Page Layouts

页面布局过程中的选择
Depends on
- Record length (fixed or variable) 记录长度是否可变
- Page packing (packed or unpacked) 页面是否根据其可用空间进行打包/解包

¶A Note On Imagery

尝试展示页面的图片（？
Data is stored in linear order 数据以线性顺序存储在磁盘上
- 1 byte per position 页面上每个位置都是1字节内存地址
- Memory addresses are ordered 有序的内存地址
- Disk addresses are ordered 有序的磁盘地址
This doesn’t fit nicely on screen 但是长长的一条，没法展示在屏幕上
- So we will “wrap around” the linear order into a rectangle 改进成矩形的内存区域/磁盘页面区域
此处缺一张图 2:06 Lec 4 5 Part 2 Pages for Fixed Length (youtube.com)

¶Fixed Length Records: Packed

Pack records densely 记录在固定长度的页头之后密集地打包在页面
将记录ID存储为偏移量从距页头的字节偏移量开始字节数加上record的长度
Record id = (pageId, “location in page”)?
- (pageId, record number in page)!
- We know the offset from start of page
- 举例：Record id: (Page 2, Record 4)
Easy to add: just append 计算空闲位置在哪里，然后把记录加到末尾就行了
Delete?
- Packed implies re-arrange 删除记录，需要重新打包界面
- Record Id pointers need to be updated 记录id的指针也需要更新
  - Could be expensive if they 're in other files 如果这些指针存储在数据库的其他文件夹中，需要的成本就更高
结构图就是Page Header为头指针的单向链表，后面跟着每条记录，连接下一条record
此处缺一张图 2:15 (16) Lec 4 5 Part 2 Pages for Fixed Length - YouTube

¶Fixed Length Records: Unpacked

未打包布局+固定长度记录
Bitmap denotes “slots” with records 额外成本：在页眉保留一个位图，这个位图将为页面上的每种插槽提供一个位，并且记录ID
位图作用：跟踪页面上的空白和完整空间
Record id: record number in page
Insert: find first empty slot 需要插入record时，在位图中找到第一个空槽，将记录插入其中并将位图该处标记为已满
Delete: Clear bit 将位图中对应的槽标记为空，则该处对应的数据就不存在了好处是不用改变指针，因为unpacked
还有个好处：固定长度记录的页面上没有浪费任何空间，新记录可以插入先前删除记录留下的孔中
比起固定长度+packed方法，就多了bitmap位图这一部分的开销，且更稳定【better choice】
此处缺一张图 4:00 (16) Lec 4 5 Part 2 Pages for Fixed Length - YouTube

¶Variable Length Records

可变长度记录
How do we know where each record begins? 查找记录开始的位置
What happens when we add and delete records? 增删记录时发生

此处缺一张图 Lec 4 5 Part 3 Pages for Variable Length (youtube.com) 0:51
把元数据放在页面末尾的页脚footer，不放页面的开头

¶Slotted Page

此处缺一张图 Lec 4 5 Part 3 Pages for Variable Length (youtube.com) 1:19
Introduce slot directory in footer 在页脚引入槽位目录
- Pointer to free space 空闲空间指针
- Length + Pointer to beginning of record 长度+指向记录开始的指针
  - reverse order 倒序
- Record ID = Location in slot table 记录ID =在槽位表中的位置
  - from right 槽位的顺序是从右到左
- Delete record (page2, record 4): set 4th slot directory pointer to null 删除某条记录，就直接删除其在槽位中的对应指针
  - doesn’t affect pointers to other records 而不会影响槽位内的其他指针

¶Slotted Page: Insert Record

Insert:
- Place record in free space on page 将记录放置在页面的空闲空间中
- Create pointer/length pair in next open slot in slot directory 在slot目录下一个空槽中创建指针/长度对
- Update the free space pointer 更新空闲空间指针
- fragmentation? 对于碎片化空间的处理？→合并
  - reorganize data on page 将所有输出重新打包，然后更改指针来反映记录的新位置

¶Slotted Page: Leading Questions

Recognize data on page
- Safe?
  - √ records ids don’t change
When should reorganize?
- coudl re-organize on delete 有删除记录后，可以重新organize
- or wait until fragmentation blocks record addition and then reorganize 等到碎片块记录添加，然后重新组织
- Often pays to be a little sloppy if page never gets more records 如果页面没有得到更多的记录，那么有点马虎通常是有好处的

¶Slotted Page: Growing Slots

插槽页面:增长插槽
Tracking number of slots in slot directory 跟踪槽位目录中的槽位数
- Empty of full
Extend number of slots in slot directory 扩展槽位目录的槽位数
- Slots grow from end of page inward 槽从页尾向内生长
- Records grow from beginning of page inward 记录从页面开始向内增长
- if we need more slots 当插槽用完，但是页面仍有剩余空间
  - 可以将另一个槽插入到当前槽的槽位中

¶Slotted Page: Summary

Typically use Slotted Page 通常使用开槽页
- Good for variable and fixed length records 适用于可变和固定长度的记录
Not bad for fixed length records too 对于固定长度的记录也不错
- Re-arrange (e.g., sort) and squash null fields 重新排列(例如，排序)和压缩空字段
- But for a whole table of fixed-length non-null records, can be worth the optimization of fixed length format 但是对于整张表的定长非空记录，可以值得优化定长格式

¶Record Layout

记录的布局

¶Record Formats

Relational Model 关系模型→
- Each record in table has some fixed type 每个记录都有固定的类型
Assume System Catalog stores the Schema 假设有一个当前正在存储的表的架构
- No need to store type information with records (save space!) 无需存储带有记录的类型信息(节省空间!)
- Catalog is just another table … 目录只是另一个表…
Goals:
- Records should be compact in memory & disk format 记录应以内存和磁盘格式压缩
- Fast access to fields (why?) 快速访问字段
Easy Case: Fixed Length Fields 简单情况:固定长度字段
Interesting Case: Variable Length Fields 有趣的案例:可变长度字段

¶Record Formats: Fixed Length

记录的格式：固定长度

Field types same for all records in a file.
一个文件中所有记录的字段类型相同
- Type into stored separately in system catalog.
On disk byte representation same as in memory
在磁盘上的字节表示也完全相同
将格式保留在内存中，并将逐字存储在磁盘上
实际上将在磁盘上保留原始内存表示形式（不会像Java那样进行序列化）
Finding i’th field ?
- done via arithmetic (fast)
  通过算数来完成寻找记录中的第i字段（固定长度）
Compact? (Nulls?)
记录在内存中的分布是否紧凑？（如果有空记录，会造成内存空间的浪费）

¶Record Formats: Variable Length

第一种：例如给只需要3字节的数据分配32字节
- 可能出现情况：数据长度超过分配的空间
第二种：在字段之间使用分隔符，例如逗号
- 提取数据时：从左到右开始，计算逗号，直到到达正确的位置
- 可能出现情况1：分隔符本身也要作为数据存储到record内，但是分隔符内容有可能跟record内容产生冲突（例如重复），造成无法分辨分隔符的结果
  - 逗号分隔值也是CSV文件等文本格式中常见的挑战→例如使用转义字符
- 可能出现情况2：为了访问特定record，需要在内存中产生一些扫描成本
第三种：引入一个Header（记录头）
- 在存储中，把可变长度字段放在最后，在Header中存储指向可变长度字段的指针
- 访问固定长度字段：采用算术计算长度 / 跟随指向变量的指针来直接访问
- 如果有空值→使指针指向与下一个指针相同的位置，

¶Summary

^ ^？？？

¶写在前面

¶SQL 4

¶Overview: Files of Pages of Records

¶Database Files (Files of Pages of Records)

¶Many DB File Structure

¶Unordered Heap Files （无序堆文件）

¶Heap File Implemented as List

¶Better: Use a Page Directory

¶Summary

¶Page Layout

¶Page Basics: The Header

¶Things to Address

¶Options for Page Layouts

¶A Note On Imagery

¶Fixed Length Records: Packed

¶Fixed Length Records: Unpacked

¶Variable Length Records

¶First: Relocate metadata to footer

¶Slotted Page

¶Slotted Page: Insert Record

¶Slotted Page: Leading Questions

¶Slotted Page: Growing Slots

¶Slotted Page: Summary

¶Record Layout

¶Record Formats

¶Record Formats: Fixed Length

¶Record Formats: Variable Length

¶Summary