rebuild: RocksDB db_bench runs successfully
# ZVFS

ZVFS is a lightweight user-space file-system prototype built on `SPDK Blobstore`. It uses `LD_PRELOAD` to intercept common POSIX file APIs and turns file I/O under the `/zvfs` path into Blob I/O.

The goal is to let upper-layer applications reuse the blocking file interface with as few changes as possible, while approaching SPDK's performance ceiling at low queue depth (QD≈1).

## 1. Project Structure

```text
zvfs/
├── src/
│   ├── hook/          # POSIX API hook layer (open/read/write/...)
│   ├── fs/            # inode/path/fd runtime metadata management
│   ├── spdk_engine/   # SPDK Blobstore wrapper
│   ├── common/        # alignment and buffer helper functions
│   ├── config.h       # defaults (JSON, bdev, xattr key, etc.)
│   └── Makefile       # builds libzvfs.so
├── tests/
│   ├── hook/            # hook API semantics tests
│   ├── ioengine_test/   # Blob engine unit tests
│   └── Makefile
├── scripts/           # helper scripts for db_bench/hook tests
├── spdk/              # SPDK submodule
└── README.md
```
## 2. Core Architecture

### 2.1 Layering

Current implementation:

```text
App (open/read/write/fstat/...)
  -> LD_PRELOAD Hook (src/hook)
  -> ZVFS Runtime Metadata (src/fs)
  -> SPDK Engine (src/spdk_engine)
  -> SPDK Blobstore
  -> bdev (Malloc/NVMe)
```

Target architecture (daemon + IPC):

```text
App (multi-process, e.g. PostgreSQL)
  -> LD_PRELOAD Hook Client
  -> IPC (Unix Domain Socket)
  -> zvfs daemon
     -> metadata manager
     -> SPDK worker threads
     -> SPDK Blobstore / bdev
```

### 2.2 Target Architecture in Brief (hook layer + daemon layer)

- `Hook layer`
  - Intercepts POSIX APIs on `/zvfs` paths and issues synchronous IPC requests.
  - Keeps minimal local state (e.g. `fd -> remote_handle_id`).
  - Passes non-`/zvfs` paths through to the `real_*` syscalls (POSIX passthrough).
- `Daemon layer`
  - Exclusively owns the SPDK resources (`spdk_env/blobstore/spdk_thread`).
  - Centralizes metadata and concurrency control (path/inode/handle).
  - Receives IPC requests, performs the actual I/O, and returns POSIX-style results with errno.

### 2.3 Metadata and Data Mapping

- File data: stored in SPDK blobs.
- File-to-blob mapping: written to the real file's `xattr` (key: `user.zvfs.blob_id`).
- Three tables are maintained at runtime:
  - `inode_table`: `blob_id -> inode`
  - `path_cache`: `path -> inode`
  - `fd_table`: `fd -> open_file`

### 2.4 I/O Path Highlights (current implementation)

- `blob_read/blob_write` always go through DMA buffers aligned to `io_unit_size`.
- Unaligned writes trigger read-modify-write (RMW): read the aligned blocks first, patch the affected range, then write back.
- `readv/writev` are coalesced in the hook layer to reduce the number of submitted I/Os.
- `fsync/fdatasync` call `blob_sync_md` for zvfs fds; `sync_file_range` on zvfs paths returns success directly.

## 3. Build

> The commands below assume the repository root is `/home/lian/try/zvfs`.

### 3.1 Initialize and Build SPDK

```bash
git submodule update --init --recursive

cd spdk
./scripts/pkgdep.sh
./configure --with-shared
make -j"$(nproc)"
```

To run the functional test against the hook library:

```bash
# sometimes needed to reset the device first: dd if=/dev/zero of=/dev/nvme0n1 bs=1M count=10
LD_PRELOAD=./libzvfs.so ./func_test
```

### 3.2 Build ZVFS and Tests

```bash
cd /home/lian/try/zvfs
make -j"$(nproc)"
make test -j"$(nproc)"
```

## Tests

### Summary

Because the goal is to hook blocking APIs, the effective queue depth is 1.

At queue depth 1, the SPDK benchmark tool `spdk_nvme_perf` measures:

1. iosize = 4K: 100 MiB/s
2. iosize = 128K: 1843 MiB/s

zvfs measures:

1. iosize = 4K: 95 MiB/s
2. iosize = 128K: 1662 MiB/s

That is about 90% of the SPDK tool's read/write throughput.

Syscalls for comparison:

1. O_DIRECT
   1. small blocks (4K): 43 MiB/s
   2. large blocks (128K): 724 MiB/s
2. without O_DIRECT
   1. small blocks (4K): 1460 MiB/s
   2. large blocks (128K): 1266 MiB/s

With unaligned I/O, write throughput roughly halves, because each write needs read-update-write.

### spdk_nvme_perf Baseline

```shell
cd /home/lian/share/10.1-spdk/spdk

export LD_LIBRARY_PATH=/home/lian/share/10.1-spdk/zvfs/spdk/build/lib:/home/lian/share/10.1-spdk/zvfs/spdk/dpdk/build/lib:$LD_LIBRARY_PATH
export PATH=/home/lian/share/10.1-spdk/zvfs/spdk/build/bin:$PATH

./build/bin/spdk_nvme_perf \
    -r 'trtype:PCIe traddr:0000:03:00.0' \
    -q 1 -o 4096 -w randwrite -t 5
Initializing NVMe Controllers
Attached to NVMe Controller at 0000:03:00.0 [15ad:07f0]
Associating PCIE (0000:03:00.0) NSID 1 with lcore 0
Initialization complete. Launching workers.
========================================================
                                                    Latency(us)
Device Information          :       IOPS      MiB/s    Average        min        max
PCIE (0000:03:00.0) NSID 1 from core 0:  25765.92     100.65      38.77      16.58     802.32
========================================================
Total                       :  25765.92     100.65      38.77      16.58     802.32

./build/bin/spdk_nvme_perf \
    -r 'trtype:PCIe traddr:0000:03:00.0' \
    -q 32 -o 4096 -w randwrite -t 5
Initializing NVMe Controllers
Attached to NVMe Controller at 0000:03:00.0 [15ad:07f0]
Associating PCIE (0000:03:00.0) NSID 1 with lcore 0
Initialization complete. Launching workers.
========================================================
                                                    Latency(us)
Device Information          :       IOPS      MiB/s    Average        min        max
PCIE (0000:03:00.0) NSID 1 from core 0:  80122.94     312.98     399.36      36.31    2225.64
========================================================
Total                       :  80122.94     312.98     399.36      36.31    2225.64

./build/bin/spdk_nvme_perf \
    -r 'trtype:PCIe traddr:0000:03:00.0' \
    -q 1 -o 131072 -w write -t 5
Initializing NVMe Controllers
Attached to NVMe Controller at 0000:03:00.0 [15ad:07f0]
Associating PCIE (0000:03:00.0) NSID 1 with lcore 0
Initialization complete. Launching workers.
========================================================
                                                    Latency(us)
Device Information          :       IOPS      MiB/s    Average        min        max
PCIE (0000:03:00.0) NSID 1 from core 0:  14746.80    1843.35      67.79      40.16    4324.96
========================================================
Total                       :  14746.80    1843.35      67.79      40.16    4324.96

./build/bin/spdk_nvme_perf \
    -r 'trtype:PCIe traddr:0000:03:00.0' \
    -q 32 -o 131072 -w write -t 5
Initializing NVMe Controllers
Attached to NVMe Controller at 0000:03:00.0 [15ad:07f0]
Associating PCIE (0000:03:00.0) NSID 1 with lcore 0
Initialization complete. Launching workers.
========================================================
                                                    Latency(us)
Device Information          :       IOPS      MiB/s    Average        min        max
PCIE (0000:03:00.0) NSID 1 from core 0:  21997.40    2749.68    1455.09      96.64   26152.13
========================================================
Total                       :  21997.40    2749.68    1455.09      96.64   26152.13
```
### Syscall Baselines

#### No O_DIRECT, small blocks

```shell
root@ubuntu:/home/lian/share/10.1-spdk/zvfs# ./func_test

=== test_single_file_perf ===
Path    : /tmp/test.dat
IO size : 4 KB
Max file: 2048 MB
Duration: 10 sec

WRITE:
  total : 12668.9 MB
  time  : 10.003 sec
  IOPS  : 324211 ops/sec
  BW    : 1266.45 MB/s

READ:
  total : 7664.5 MB
  time  : 10.000 sec
  IOPS  : 196210 ops/sec
  BW    : 766.44 MB/s

=== all tests PASSED ===
```
#### No O_DIRECT, large blocks

```shell
root@ubuntu:/home/lian/share/10.1-spdk/zvfs# ./func_test

=== test_single_file_perf ===
Path    : /tmp/test.dat
IO size : 128 KB
Max file: 2048 MB
Duration: 10 sec

WRITE:
  total : 14609.5 MB
  time  : 10.000 sec
  IOPS  : 11688 ops/sec
  BW    : 1460.95 MB/s

READ:
  total : 8138.6 MB
  time  : 10.000 sec
  IOPS  : 6511 ops/sec
  BW    : 813.85 MB/s

=== all tests PASSED ===
```

#### No O_DIRECT, random, aligned, large blocks

```shell
root@ubuntu:/home/lian/share/10.1-spdk/zvfs/zvfs# ./func_test

=== test_single_file_random_perf ===
Path    : /tmp/test.dat
IO size : 128 KB
Range   : 2048 MB
Duration: 10 sec

RANDOM WRITE:
  total : 8930.8 MB
  time  : 10.001 sec
  IOPS  : 7144 ops/sec
  BW    : 893.01 MB/s

RANDOM READ:
  total : 8238.9 MB
  time  : 10.000 sec
  IOPS  : 6591 ops/sec
  BW    : 823.89 MB/s

=== all tests PASSED ===
```

#### No O_DIRECT, random, unaligned, large blocks

```shell
root@ubuntu:/home/lian/share/10.1-spdk/zvfs/zvfs# ./func_test

=== test_single_file_random_perf ===
Path    : /tmp/test.dat
IO size : 128 KB
Range   : 2048 MB
Duration: 10 sec

RANDOM WRITE:
  total : 5964.4 MB
  time  : 10.000 sec
  IOPS  : 4771 ops/sec
  BW    : 596.43 MB/s

RANDOM READ:
  total : 6607.8 MB
  time  : 10.000 sec
  IOPS  : 5286 ops/sec
  BW    : 660.77 MB/s

=== all tests PASSED ===
```

#### O_DIRECT, small blocks

```shell
root@ubuntu:/home/lian/share/10.1-spdk/zvfs# ./func_test

=== test_single_file_perf ===
Path    : /tmp/test.dat
IO size : 4 KB
Max file: 2048 MB
Duration: 10 sec

WRITE:
  total : 434.5 MB
  time  : 10.000 sec
  IOPS  : 11122 ops/sec
  BW    : 43.45 MB/s

READ:
  total : 373.8 MB
  time  : 10.000 sec
  IOPS  : 9568 ops/sec
  BW    : 37.38 MB/s

=== all tests PASSED ===
```

#### O_DIRECT, large blocks

```shell
root@ubuntu:/home/lian/share/10.1-spdk/zvfs# ./func_test

=== test_single_file_perf ===
Path    : /tmp/test.dat
IO size : 128 KB
Max file: 2048 MB
Duration: 10 sec

WRITE:
  total : 7245.4 MB
  time  : 10.000 sec
  IOPS  : 5796 ops/sec
  BW    : 724.53 MB/s

READ:
  total : 9006.5 MB
  time  : 10.000 sec
  IOPS  : 7205 ops/sec
  BW    : 900.64 MB/s

=== all tests PASSED ===
```

## 4. Run and Verify

`make` produces:

- `src/libzvfs.so`
- `tests/bin/hook_api_test`
- `tests/bin/ioengine_single_blob_test`
- `tests/bin/ioengine_multi_blob_test`
- `tests/bin/ioengine_same_blob_mt_test`

### 4.1 Hook API Semantics Tests

```bash
mkdir -p /zvfs
cd /home/lian/try/zvfs
LD_PRELOAD=$PWD/src/libzvfs.so ZVFS_TEST_ROOT=/zvfs ./tests/bin/hook_api_test
```

Coverage includes:

- `open/openat/rename/unlink`
- `read/write/pread/pwrite/readv/writev/pwritev`
- `fstat/lseek/ftruncate`
- `fcntl/ioctl(FIONREAD)`
- `fsync/fdatasync`

### 4.2 SPDK Engine Tests

```bash
cd /home/lian/try/zvfs
SPDK_BDEV_NAME=Malloc0 ./tests/bin/ioengine_single_blob_test
SPDK_BDEV_NAME=Malloc0 ./tests/bin/ioengine_multi_blob_test
SPDK_BDEV_NAME=Malloc0 ./tests/bin/ioengine_same_blob_mt_test
```

### zvfs func_test Results (via LD_PRELOAD)

#### Unaligned

```shell
root@ubuntu:/home/lian/share/10.1-spdk/zvfs# LD_PRELOAD=./libzvfs.so ./func_test /zvfs

=== test_single_file_perf ===
Path    : /zvfs/file.dat
IO size : 128 KB
Max file: 2048 MB
Duration: 10 sec

WRITE:
  total : 10304.0 MB
  time  : 10.000 sec
  IOPS  : 8243 ops/sec
  BW    : 1030.40 MB/s

READ:
  total : 17788.5 MB
  time  : 10.000 sec
  IOPS  : 14231 ops/sec
  BW    : 1778.85 MB/s

=== all tests PASSED ===
```

#### Fully aligned, large blocks

```shell
root@ubuntu:/home/lian/share/10.1-spdk/zvfs# LD_PRELOAD=./libzvfs.so ./func_test /zvfs

=== test_single_file_perf ===
Path    : /zvfs/file.dat
IO size : 128 KB
Max file: 2048 MB
Duration: 10 sec

WRITE:
  total : 16624.4 MB
  time  : 10.000 sec
  IOPS  : 13299 ops/sec
  BW    : 1662.43 MB/s

READ:
  total : 16430.8 MB
  time  : 10.000 sec
  IOPS  : 13145 ops/sec
  BW    : 1643.07 MB/s

=== all tests PASSED ===
```

#### Fully aligned, small blocks

```shell
root@ubuntu:/home/lian/share/10.1-spdk/zvfs# LD_PRELOAD=./libzvfs.so ./func_test /zvfs

=== test_single_file_perf ===
Path    : /zvfs/file.dat
IO size : 4 KB
Max file: 2048 MB
Duration: 10 sec

WRITE:
  total : 944.5 MB
  time  : 10.000 sec
  IOPS  : 24179 ops/sec
  BW    : 94.45 MB/s

READ:
  total : 982.8 MB
  time  : 10.000 sec
  IOPS  : 25159 ops/sec
  BW    : 98.28 MB/s

=== all tests PASSED ===
```

#### Aligned random writes (large blocks)

```shell
root@ubuntu:/home/lian/share/10.1-spdk/zvfs/zvfs# LD_PRELOAD=./libzvfs.so ./func_test /zvfs

=== test_single_file_random_perf ===
Path    : /zvfs/file.dat
IO size : 128 KB
Range   : 2048 MB
Duration: 10 sec

RANDOM WRITE:
  total : 17461.8 MB
  time  : 10.000 sec
  IOPS  : 13969 ops/sec
  BW    : 1746.17 MB/s

RANDOM READ:
  total : 17439.5 MB
  time  : 10.000 sec
  IOPS  : 13952 ops/sec
  BW    : 1743.95 MB/s

=== all tests PASSED ===
```

#### Unaligned random writes (large blocks)

```shell
root@ubuntu:/home/lian/share/10.1-spdk/zvfs/zvfs# LD_PRELOAD=./libzvfs.so ./func_test /zvfs

=== test_single_file_random_perf ===
Path    : /zvfs/file.dat
IO size : 128 KB
Range   : 2048 MB
Duration: 10 sec

RANDOM WRITE:
  total : 7500.2 MB
  time  : 10.000 sec
  IOPS  : 6000 ops/sec
  BW    : 750.02 MB/s

RANDOM READ:
  total : 15143.8 MB
  time  : 10.000 sec
  IOPS  : 12115 ops/sec
  BW    : 1514.35 MB/s

=== all tests PASSED ===
```

## 5. Key Environment Variables

- `SPDK_BDEV_NAME`: selects the backing bdev (default `Malloc0`).
- `ZVFS_BDEV`: bdev name used by `zvfs_ensure_init` (falls back to the `config.h` default when unset).
- `SPDK_JSON_CONFIG`: overrides the default SPDK JSON config path.

## 6. Performance Notes (trends only)

The historical benchmark numbers in this `README` come from an older version and must not be taken as conclusions about the current version, but they remain useful as design-trend references:

- The target workload is blocking APIs, approximately `QD=1`.
- On the old data, ZVFS reaches about `90%~95%` of `spdk_nvme_perf` at `QD=1`.
  - 4K: about `95 MiB/s` vs `100 MiB/s`
  - 128K: about `1662 MiB/s` vs `1843 MiB/s`
- Against the `O_DIRECT` path on the same machine, the old data shows roughly `2.2x~2.3x` higher write bandwidth.
- Unaligned writes incur RMW and throughput drops noticeably (in the old data, commonly close to half of aligned writes).

If the numbers are needed for external reporting, re-measure on the current commit with a fixed hardware environment.

## 7. Current Limitations

- Only `/zvfs` paths are intercepted.
- `mmap` on a zvfs fd currently returns `ENOTSUP` (upper layers should disable mmap reads/writes).
- `dup/dup2/dup3` on a zvfs fd currently return `ENOTSUP`.
- `rename` across `/zvfs` and non-`/zvfs` paths returns `EXDEV`.
- `fallocate(FALLOC_FL_PUNCH_HOLE)` is not implemented.

## 8. Next Steps

- Complete the mmap path (mmap table + dirty-page writeback).
- Flesh out the semantics and benchmark baselines under multi-threading/high concurrency.
- Add versioned benchmark reports so historical numbers in the README don't go stale.

## 9. Blobstore Lessons Learned

### Owner-Thread Binding

Blobstore handles concurrency control internally by running all metadata operations on a single thread: callbacks are bound to the thread that created the blobstore. So in a multi-threaded model it is not the case that whichever thread sends a message can poll its callback.

Correct architecture:

```
metadata thread
    spdk_bs_load()
    resize
    delete
    sync_md

worker thread
    blob_io_read
    blob_io_write
```

### spdk_for_each_channel() Barrier

Some metadata operations are very slow:

```
resize
delete
unload
snapshot
```

Internally, these operations call spdk_for_each_channel().

## SPDK Concepts

1. blob_store: the blob repository; manages multiple blob objects.
2. blob: a storage object, logically contiguous but not necessarily physically contiguous. Comparable to a file.
3. cluster: the allocation unit; a blob is made of one or more clusters, and growing a blob means allocating new clusters. Comparable to a file system's block group.
4. page: the I/O unit; a cluster consists of multiple pages.

## Architecture Design

```scss
| Application
|   (POSIX API: open/read/write/close)
| LD_PRELOAD interception layer
|   (simple path check, forward to zvfs)
| zvfs file-system layer
|   (blob operations)
| SPDK Blobstore
| Block device (Malloc0)
```

### On-Disk Layout

```scss
BlobStore:
|—— Super Blob (metadata, anchored via SPDK's super blob)
|     |—— superblock
|     |—— directory entries / directory log
|—— Blob 1 (file A...)
|—— Blob 2 (file B...)
|—— Blob N (file C...)
```

### Data Structures

#### Super Blob (metadata)

```scss
[Superblock]
- magic_number: 0x5A563146 (ZV1F)
- version: 1

[Directory entry]
- filename[256]: file name
- blob_id: ID of the data blob
- file_size: actual file size in bytes
- allocated_clusters: number of clusters allocated
- is_valid: validity flag (used for delete)
```
In C, roughly:

```c
/* Directory entry (in-memory directory) */
typedef struct {
    char filename[256];
    spdk_blob_id blob_id;
    uint64_t file_size;          // logical file size in bytes
    uint32_t allocated_clusters; // number of clusters allocated
    bool is_valid;               // false means deleted
    int32_t open_count;          // number of open file handles
} zvfs_dirent_t;

/* Global file-system structure */
typedef struct zvfs {
    struct spdk_blob_store *bs;
    struct spdk_io_channel *channel;
    struct spdk_blob *super_blob;    // blob carrying the directory log
    uint64_t io_unit_size;           // page size in bytes

    /* directory */
    zvfs_dirent_t *dirents;          // dirent array, e.g. #define ZVFS_MAX_FILES 1024
    uint32_t dirent_count;           // current number of valid entries

    /* pseudo-FD table */
    struct zvfs_file *fd_table[ZVFS_MAX_FD]; // e.g. #define ZVFS_MAX_FD 64
    int fd_base;                     // pseudo-FD base, e.g. 10000
    int openfd_count;

    /* metadata */
    uint32_t magic;                  // 0x5A563146 (ZV1F)
    uint32_t version;                // 1
} zvfs_t;

/* Open file handle */
typedef struct zvfs_file {
    zvfs_t *fs;
    struct spdk_blob *blob;
    zvfs_dirent_t *dirent;           // back-pointer for file_size/allocated_clusters

    uint64_t current_offset;         // current read/write position
    int flags;                       // O_RDONLY / O_RDWR / O_CREAT etc.
    int pseudo_fd;

    /* scratch DMA buffer (optional: one per file, avoids per-call malloc) */
    void *dma_buf;
    uint64_t dma_buf_size;
} zvfs_file_t;
```

#### spdk_for_each_channel() semantics

Semantics: the callback is executed on the thread that owns each io_channel, similar to:

```
for each channel:
    send_msg(channel->thread)
```

### Workflows

#### mount

There is no good place to mount from inside hooked POSIX APIs, so the single-threaded version currently mounts lazily.

```scss
1. [Create the block device]
   - spdk_bdev_create_bs_dev_ext
2. [Initialize the file system]
   - spdk_bs_init, or spdk_bs_load when data already exists
   - spdk_bs_get_io_unit_size to get the I/O unit (page) size
   - spdk_bs_alloc_io_channel to allocate the blobstore read/write channel
3. [Read metadata]
   - spdk_bs_get_super_blob to get the Super Blob ID
   - spdk_bs_open_blob to open the Super Blob
   - read the superblock, verify the magic
   - read the dirent array into the in-memory dirents
4. [Create the zvfs_t structure]
   - allocate zvfs_t
   - fill bs/channel/super_blob/dirents, etc.
```

#### open

##### O_RDONLY / O_RDWR

```scss
1. [Name lookup]
   - scan dirents for filename with is_valid=true
   - return -ENOENT if not found
2. [Open the blob]
   - spdk_bs_open_blob(dirent->blob_id)
   - dirent->open_count++
   - fs->openfd_count++
3. [Allocate a file handle]
   - create zvfs_file_t, dirent pointer set to the directory entry
   - allocate a pseudo-FD, store in fd_table
4. [Return the pseudo-FD]
```
#### Pitfall 1: the thread holding a channel does not poll

If the owning thread does not poll, the operation hangs.

#### Pitfall 2: a thread exits without releasing its channel

The operation hangs forever.

##### O_CREAT

```scss
1. [Name lookup]
   - scan dirents; if filename already exists with is_valid=true, return -EEXIST
   - find a slot with is_valid=false; if none, append (dirent_count < max_files)
2. [Create the blob]
   - spdk_bs_create_blob → blob_id
   - spdk_bs_open_blob → blob handle
   - spdk_blob_resize for the initial allocation
   - spdk_blob_sync_md to persist the cluster allocation
3. [Write the directory]
   - fill filename/blob_id/file_size=0/is_valid=true
   - dirent->open_count = 1
4. [Create the file handle]
   - create zvfs_file_t
   - allocate a pseudo-FD, store in fd_table
5. [Return the pseudo-FD]
```

> Note: directory changes are memory-only; they are persisted in one pass at unmount.

### I/O callbacks behave differently from metadata callbacks

The callbacks of spdk_blob_io_read / spdk_blob_io_write are delivered through the io_channel you pass in, so they return to the thread that allocated that channel.

### read

Reads and writes are byte-granular: offset / count are in bytes, with alignment computed against io_unit_size.

```scss
1. [Arguments]
   - fd
   - buffer
   - count
   - offset (implicit)
2. [Bounds check]
   - readable = min(count, dirent->file_size - current_offset)
   - if readable is 0, return 0
3. [Locate within the blob]
   - start_page  = current_offset / io_unit_size
   - page_offset = current_offset % io_unit_size
   - num_pages   = (page_offset + readable + io_unit_size - 1) / io_unit_size
4. [DMA read]
   - unaligned read (page_offset != 0 || count is not whole pages)
     - needs a temporary DMA buffer (spdk_dma_zmalloc)
     - spdk_blob_io_read(blob, channel, dma_buffer, start_page, num_pages, ...)
     - copy from dma_buffer + page_offset into the user buffer
   - aligned
     - still read through the DMA buffer, then copy into the user buffer
5. [Update the offset]
   - current_offset += readable
6. [Return the number of bytes read]
```

> Note: SPDK needs DMA-capable memory, and the user buffer supplied by the application usually does not qualify. Even an aligned user buffer cannot be handed to spdk_blob_io_* directly; use the DMA buffer as a bounce buffer. Registering a memory pool could enable direct submission later.

### write

```scss
1. [Arguments]
   - fd
   - buffer
   - count
   - offset (implicit)
2. [Capacity check]
   - needed = current_offset + count
   - if it exceeds the capacity of allocated_clusters:
     - spdk_blob_resize to grow
     - spdk_blob_sync_md
     - update dirent->allocated_clusters
3. [Locate the write]
   - start_page / page_offset / num_pages (same as read)
4. [DMA write]
   - unaligned write (page_offset != 0 || count is not whole pages)
     - read the affected head/tail pages into the scratch DMA buffer
     - patch the target range
     - write back: spdk_blob_io_write(blob, channel, dma_buffer, start_page, num_pages, ...)
   - aligned
     - still submit the write through the DMA buffer
5. [Update state]
   - current_offset += count
   - dirent->file_size = max(dirent->file_size, current_offset)
6. [Return the number of bytes written]
```

### close

```scss
1. [Close the blob]
   - spdk_blob_close(file->blob)
   - dirent->open_count--
   - fs->openfd_count--
   - if open_count == 0 and is_valid == false (already unlinked): spdk_bs_delete_blob, clear the dirent
   - if openfd_count == 0, unmount
2. [Release buffers]
   - free dma_buf
   - clear fd_table[pseudo_fd]
   - free(zvfs_file_t)
3. [Return 0]
```

### unlink

```scss
1. [Find the dirent]
   - scan dirents for filename with is_valid=true
   - return -ENOENT if not found
2. [Mark deleted]
   - dirent->is_valid = false
3. [Decide whether to delete now]
   - open_count == 0: spdk_bs_delete_blob, clear the slot
   - open_count > 0: defer; the last close performs the deletion
4. [Return 0]
```

### unmount

```scss
1. [Release the channel]
   - spdk_bs_free_io_channel
2. [Unload the BlobStore]
   - spdk_bs_unload
3. [Free the FS]
   - free(fs)
```

### Alternatives

Instead of `LD_PRELOAD` hooks, FUSE could be used.\
FUSE is a kernel file-system driver mounted on a directory; accesses under that directory go through the FUSE file system.\
The file system forwards requests to a user-space program, and that program can be built on SPDK. This avoids having to handle everything else.

### Timeout tasks

Once you add timeouts, you cannot prevent the callback from completing after the timeout has fired; the late callback still triggers, which creates a use-after-free risk.
