postgres hook tests passing

2026-03-13 01:59:05 +00:00
parent a153ca5040
commit 544f532bf5
53 changed files with 5964 additions and 1674 deletions

README.md (376 lines)
# ZVFS
ZVFS is a lightweight userspace file system prototype built on SPDK Blobstore. Through `LD_PRELOAD` it intercepts common POSIX file APIs and redirects file I/O under the `/zvfs` path onto a high-performance userspace storage path (Blob I/O), without modifying application code.
The core idea is to reuse the Linux file management machinery (namespaces/directories/metadata) as the control plane while placing the file data plane in ZVFS.
The goal is to let applications keep their blocking file interfaces with minimal changes while approaching SPDK's performance ceiling at low queue depth (QD≈1).
- Hook mechanism: `LD_PRELOAD`
- Mount prefix: `/zvfs`
- Architecture: multi-process clients + standalone daemon + SPDK
- Semantics: synchronous blocking (request/response)
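To make the hook mechanism concrete, here is a minimal sketch of the interception decision: a prefix check plus passthrough to the real libc symbol via `dlsym(RTLD_NEXT, ...)`. In the real library the exported symbol would be `open` itself; it is renamed `zvfs_open` here (and the IPC step is elided) so the sketch can run standalone. The helper names are illustrative, not the project's actual symbols.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <sys/types.h>

#define ZVFS_PREFIX "/zvfs/"

/* A path is managed if it is exactly "/zvfs" or lives under "/zvfs/". */
int zvfs_is_managed_path(const char *path)
{
    if (strcmp(path, "/zvfs") == 0)
        return 1;
    return strncmp(path, ZVFS_PREFIX, strlen(ZVFS_PREFIX)) == 0;
}

/* Hooked open(): managed paths would go to the ZVFS client over IPC;
 * everything else passes through to the next "open" in link order. */
int zvfs_open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...);
    mode_t mode = 0;

    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    if (zvfs_is_managed_path(path)) {
        /* real library: send an OPEN request to the daemon and record
         * fd -> remote_handle_id; elided in this sketch */
    }
    return real_open(path, flags, mode);
}
```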
---
## 1. Project Positioning
The point of this project is not just getting I/O to run, but tying the following engineering problems together:
1. Transparent interception in multi-threaded/multi-process applications (RocksDB / PostgreSQL).
2. Preserving POSIX semantics (open/close/dup/fork/append/sync, etc.).
3. Centralizing SPDK resources in the daemon to avoid per-process re-initialization.
4. Making the protocol, concurrency, and error handling complete under synchronous blocking semantics.
---
## 2. Architecture
![](zvfs架构图.excalidraw.svg)
Repository layout:
```text
zvfs/
├── src/
│   ├── hook/        # POSIX API hook layer (open/read/write/...)
│   ├── fs/          # inode/path/fd runtime metadata management
│   ├── spdk_engine/ # SPDK Blobstore wrapper
│   ├── common/      # alignment and buffer utilities
│   ├── config.h     # defaults (JSON, bdev, xattr key, etc.)
│   └── Makefile     # produces libzvfs.so
├── tests/
│   ├── hook/          # hook API semantics tests
│   ├── ioengine_test/ # blob engine unit tests
│   └── Makefile
├── scripts/         # db_bench/hook helper scripts
├── spdk/            # SPDK submodule
└── README.md
```
Current layering:
```text
App (open/read/write/fstat/...)
-> LD_PRELOAD Hook (src/hook)
-> ZVFS Runtime Metadata (src/fs)
-> SPDK Engine (src/spdk_engine)
-> SPDK Blobstore
-> bdev (Malloc/NVMe)
```
Target architecture (daemon + IPC):
```text
App (PostgreSQL / RocksDB / db_bench / pgbench)
-> LD_PRELOAD libzvfs.so
-> Hook Client (POSIX interception + local state)
-> Unix Domain Socket IPC (sync/blocking)
-> zvfs_daemon
-> protocol deserialization + dispatch
-> metadata thread + io threads
-> SPDK Blobstore / bdev
```
### 2.1 Passthrough Strategy
- Hook layer
  - Intercepts POSIX APIs on `/zvfs` paths and issues synchronous IPC requests.
  - Maintains minimal local state (e.g. `fd -> remote_handle_id`).
  - Passes non-`/zvfs` paths straight through to the `real_*` syscalls (POSIX passthrough).
- Daemon layer
  - Owns the SPDK resources exclusively (`spdk_env/blobstore/spdk_thread`).
  - Centralizes metadata and concurrency control (path/inode/handle).
  - Receives IPC requests, performs the actual I/O, and returns POSIX-style results with errno.
**The control plane reuses Linux; the data plane goes through ZVFS.**
### 2.2 Metadata and Data Mapping
- Control plane (handled by Linux)
  - Directory/namespace management.
  - File node lifecycle and permission semantics (create/open/close/stat/rename/unlink, etc.).
  - These operations also execute real syscalls under `/zvfs`; ZVFS does not re-implement directory tree management.
- File data: stored in SPDK blobs.
  - The file-to-blob mapping is written into the real file's `xattr` (key: `user.zvfs.blob_id`).
  - The runtime maintains three tables:
    - `inode_table`: `blob_id -> inode`
    - `path_cache`: `path -> inode`
    - `fd_table`: `fd -> open_file`
- Data plane (handled by ZVFS)
  - File contents are carried by blobs.
  - The real data path of `read/write` bypasses the Linux file data plane and goes through ZVFS IPC + SPDK.
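The file-to-blob binding above can be sketched as follows. The xattr key `user.zvfs.blob_id` comes from the text; the assumption that the value is the blob id rendered as a decimal string, and the helper names, are illustrative (the actual on-disk encoding in ZVFS may differ).

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/xattr.h>

#define ZVFS_XATTR_KEY "user.zvfs.blob_id"

/* Render a blob id into an xattr value buffer. */
int zvfs_encode_blob_id(uint64_t blob_id, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%llu", (unsigned long long)blob_id);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Parse an xattr value back into a blob id. */
uint64_t zvfs_decode_blob_id(const char *buf)
{
    return strtoull(buf, NULL, 10);
}

/* On create: bind a blob to the freshly created Linux file.
 * Requires a filesystem with user.* xattr support. */
int zvfs_bind_blob(const char *path, uint64_t blob_id)
{
    char val[32];

    if (zvfs_encode_blob_id(blob_id, val, sizeof(val)) != 0)
        return -1;
    return setxattr(path, ZVFS_XATTR_KEY, val, strlen(val), 0);
}
```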
### 2.3 Current I/O Path Details
- Key binding steps
  - `create`: really create the Linux file + create a blob in ZVFS + write the `blob_id` into the file's xattr.
  - `open`: really `open` the Linux file + read the xattr to get the `blob_id` + open the blob in ZVFS.
  - `write`: after the blob write succeeds, use `ftruncate` to keep the Linux-side `st_size` in sync.
- `blob_read/blob_write` always go through DMA buffers aligned to `io_unit_size`.
  - Unaligned writes trigger read-modify-write (RMW): read the aligned blocks first, patch them, and write back.
  - `readv/writev` are aggregated in the hook layer to reduce the number of I/O submissions.
- `fsync/fdatasync` on zvfs fds call `blob_sync_md`; `sync_file_range` on the zvfs path returns success directly.
- Engineering payoff
  - Cuts roughly 50% of the implementation work.
  - Better compatibility: databases can reuse their existing file organization directly.
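The aligned-window computation behind the RMW path can be sketched as a pure function: an unaligned `(offset, len)` write is expanded to `io_unit_size`-aligned bounds, the aligned blocks are read, patched, and written back. The struct and function names are illustrative.

```c
#include <stdint.h>

struct zvfs_rmw_window {
    uint64_t aligned_off;  /* offset rounded down to io_unit_size */
    uint64_t aligned_len;  /* length rounded up to cover [offset, offset+len) */
    uint64_t head_pad;     /* bytes of the first unit preserved by the read */
};

/* Expand an unaligned write to the io_unit-aligned window it touches. */
struct zvfs_rmw_window zvfs_rmw_expand(uint64_t offset, uint64_t len,
                                       uint64_t io_unit_size)
{
    struct zvfs_rmw_window w;
    uint64_t end = offset + len;
    /* round the end up to the next io_unit boundary */
    uint64_t aligned_end =
        (end + io_unit_size - 1) / io_unit_size * io_unit_size;

    w.aligned_off = offset - offset % io_unit_size;
    w.head_pad = offset - w.aligned_off;
    w.aligned_len = aligned_end - w.aligned_off;
    return w;
}
```

An aligned write (`offset` and `len` both multiples of `io_unit_size`) yields `head_pad == 0` and `aligned_len == len`, i.e. no extra read.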
### 2.4 Layer Responsibilities
- Client`src/hook` + `src/spdk_engine/io_engine.c`
- 判断是否 `/zvfs` 路径。
- 拦截 POSIX API 并发起同步 IPC。
- 维护最小本地状态(`fd_table/path_cache/inode_table`)。
- Daemon`src/daemon`
- 独占 SPDK 环境与线程。
- 统一执行 blob create/open/read/write/resize/sync/delete。
- 统一管理 handle/ref_count。
- 协议层(`src/proto/ipc_proto.*`
- 统一头 + per-op body。
- Request Header`opcode + payload_len`
- Response Header`opcode + status + payload_len`
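The fixed headers above can be sketched with assumed field widths (the real layout in `src/proto/ipc_proto.*` may differ); `memcpy`-based (de)serialization keeps the wire format independent of struct padding.

```c
#include <stdint.h>
#include <string.h>

/* Request header: opcode + payload_len (widths assumed). */
struct zvfs_req_hdr {
    uint32_t opcode;
    uint32_t payload_len;
};

/* Response header: opcode + status + payload_len.
 * status is POSIX-style: >= 0 result, negative errno on failure. */
struct zvfs_resp_hdr {
    uint32_t opcode;
    int32_t  status;
    uint32_t payload_len;
};

size_t zvfs_req_hdr_encode(const struct zvfs_req_hdr *h, uint8_t *buf)
{
    memcpy(buf, &h->opcode, 4);
    memcpy(buf + 4, &h->payload_len, 4);
    return 8;
}

size_t zvfs_req_hdr_decode(struct zvfs_req_hdr *h, const uint8_t *buf)
{
    memcpy(&h->opcode, buf, 4);
    memcpy(&h->payload_len, buf + 4, 4);
    return 8;
}
```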
### 2.5 Why Synchronous Blocking IPC
- Lowest compatibility cost for applications; the easiest way to match POSIX semantics.
- The most direct debugging path (one request maps to one response).
- Get correctness and semantic completeness right first; consider going asynchronous later.
---
## 3. Feature Coverage (Current)
### 3.1 Intercepted Core APIs
- Control-plane cooperation: `open/openat/creat/rename/unlink/...` (real syscalls + ZVFS metadata coordination)
- Data-plane takeover: `read/write/pread/pwrite/readv/writev/pwritev`
- Metadata: `fstat/lseek/ftruncate/fallocate`
- Sync: `fsync/fdatasync/sync_file_range`
- FD semantics: `dup/dup2/dup3/fork/close_range`
### 3.2 Semantics Notes
- `write` uses `AUTO_GROW` by default.
- Without `AUTO_GROW`, out-of-bounds writes return `ENOSPC`.
- `O_APPEND` semantics are guaranteed by the inode's `logical_size`.
- A successful `write` also updates the Linux file size (`ftruncate`) so the `stat` view stays consistent.
- `mmap` on a zvfs fd currently returns `ENOTSUP` (non-zvfs fds pass through).
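The append/size bookkeeping above can be sketched as follows: the effective write offset is resolved against the inode's `logical_size` for `O_APPEND` fds, and `logical_size` grows to cover the write (which is then mirrored to Linux via `ftruncate`). The names are illustrative.

```c
#include <fcntl.h>
#include <stdint.h>

struct zvfs_inode {
    uint64_t logical_size;  /* authoritative file size on the ZVFS side */
};

/* Resolve the effective write offset and advance logical_size. */
uint64_t zvfs_resolve_write(struct zvfs_inode *ino, int open_flags,
                            uint64_t cur_off, uint64_t len)
{
    /* O_APPEND writes always land at the current logical end. */
    uint64_t off = (open_flags & O_APPEND) ? ino->logical_size : cur_off;
    uint64_t end = off + len;

    if (end > ino->logical_size)
        ino->logical_size = end;  /* mirrored to Linux via ftruncate */
    return off;
}
```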
### 3.3 Mapping
- File data lives in SPDK blobs.
- The file-to-blob mapping goes through an xattr: `user.zvfs.blob_id`.
---
## 4. Build and Run
> The commands below assume the repository root is `/home/lian/try/zvfs`.
### 4.1 Build
```bash
cd /home/lian/try/zvfs
git submodule update --init --recursive
cd spdk
./scripts/pkgdep.sh
./configure --with-shared
make -j"$(nproc)"
```
Then build ZVFS and the tests:
```bash
cd /home/lian/try/zvfs
make -j"$(nproc)"
make test -j"$(nproc)"
Artifacts:
- `src/libzvfs.so`
- `src/daemon/zvfs_daemon`
- `tests/bin/*`
### 4.2 Start the Daemon
```bash
cd /home/lian/try/zvfs
./src/daemon/zvfs_daemon
```
Optional environment variables:
- `SPDK_BDEV_NAME`: backend bdev to use (default `Malloc0`).
- `ZVFS_BDEV`: bdev name used by `zvfs_ensure_init` (falls back to the `config.h` default).
- `SPDK_JSON_CONFIG`: overrides the default SPDK JSON config path.
- `ZVFS_SOCKET_PATH` / `ZVFS_IPC_SOCKET_PATH`: override the IPC socket path.
### 4.3 Quick Verification
```bash
mkdir -p /zvfs
cd /home/lian/try/zvfs
LD_PRELOAD=./src/libzvfs.so ZVFS_TEST_ROOT=/zvfs ./tests/bin/hook_api_test
./tests/bin/ipc_zvfs_test
```
Covered points include:
- `open/openat/rename/unlink`
- `read/write/pread/pwrite/readv/writev/pwritev`
- `fstat/lseek/ftruncate`
- `fcntl/ioctl(FIONREAD)`
- `fsync/fdatasync`
---
### 4.4 SPDK Engine Tests
```bash
cd /home/lian/try/zvfs
SPDK_BDEV_NAME=Malloc0 ./tests/bin/ioengine_single_blob_test
SPDK_BDEV_NAME=Malloc0 ./tests/bin/ioengine_multi_blob_test
SPDK_BDEV_NAME=Malloc0 ./tests/bin/ioengine_same_blob_mt_test
```
---
## 5. Performance Testing
### 5.1 Goals
- Target scenario: blocking I/O performance at low queue depth.
- Baselines: `spdk_nvme_perf` and the kernel path (`O_DIRECT`).
### 5.2 Tools and Scripts
- RocksDB: `scripts/run_db_bench_zvfs.sh`
- PostgreSQL: `codex/run_pgbench_no_mmap.sh`
Recommendations:
- When testing PostgreSQL, disable the mmap path (switch shared memory to sysv to avoid mmap interference).
- The target workload is blocking APIs (approximately `QD=1`).
### 5.3 Historical Results
> The numbers below are conclusions from an older version, kept to illustrate the design direction. Re-measure on the current commit and fixed hardware before quoting them externally.
- At `QD=1`, ZVFS reaches roughly `90%~95%` of `spdk_nvme_perf` (4K: `95 MiB/s` vs `100 MiB/s`; 128K: `1662 MiB/s` vs `1843 MiB/s`).
- Sequential write throughput is roughly `2.2x~2.3x` that of the same machine's `O_DIRECT` path.
- Unaligned writes drop noticeably due to RMW overhead (historically close to half of aligned-write throughput).
---
## 6. Key Engineering Problems and Pitfall Postmortems
This section is the most valuable part of the project: it records the key problems encountered on the way from "it runs" to "usable under database workloads".
### 6.1 SPDK Metadata Callback Thread Model
Problem: dispatching metadata operations to arbitrary threads easily hangs the daemon or leaves callbacks that never fire.
Root cause:
- Blobstore handles concurrency internally by running all metadata operations on one thread: their callbacks are bound to the thread that created the blobstore. In a multi-thread model, the thread you `send_msg` to is not necessarily the one that can poll the callback.
- Some metadata operations (`resize/delete/unload/snapshot`) are slow because they internally run an `spdk_for_each_channel()` barrier, conceptually:
```text
for each channel:
    send_msg(channel->thread)
```
  The callback must execute on every thread that owns an io_channel. If an owning thread stops polling, the operation stalls; if a thread exits without releasing its channel, the barrier waits forever.
- I/O callbacks behave differently: `spdk_blob_io_read`/`spdk_blob_io_write` deliver their callbacks through the io_channel passed in, back on the thread that allocated that channel.
- Timeout tasks: once a timeout is added, the callback can still fire after the timeout path has run, which is a use-after-free risk if the context was freed early.
Correct architecture:
```text
metadata thread
    spdk_bs_load()
    resize
    delete
    sync_md
worker threads
    blob_io_read
    blob_io_write
```
Fixes:
- Split the metadata thread and the io threads explicitly.
- Guarantee that any thread holding a channel keeps polling.
- Release channels strictly on thread exit so barriers can never wait on a dead thread.
### 6.2 Daemon Hang (Request Received, Processing Stalls)
Symptom: request logs stop halfway through and the benchmark process blocks.
Root cause:
- The UDS stream reader had no complete framing.
- A fixed small buffer made response serialization fail (`serialize resp failed`).
Fixes:
- Switch to a per-connection receive buffer and loop reading until `EAGAIN`.
- Consume only complete packets; keep partial packets for the next round.
- Serialize responses into a dynamically sized buffer and use `send_all`.
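The "consume only complete packets" fix can be sketched as follows, assuming a `[4-byte length][payload]` frame layout for illustration (the real protocol carries a fuller header): whole frames are dispatched, and any trailing partial frame is shifted to the front of the buffer for the next read.

```c
#include <stdint.h>
#include <string.h>

/* Consume complete frames from the receive buffer.
 * Returns the number of frames consumed; *used is rewound to the
 * leftover byte count, with the remainder moved to the front. */
size_t zvfs_consume_frames(uint8_t *buf, size_t *used)
{
    size_t off = 0, frames = 0;

    while (*used - off >= 4) {
        uint32_t len;
        memcpy(&len, buf + off, 4);        /* frame header: payload length */
        if (*used - off - 4 < len)
            break;                         /* partial frame: wait for more */
        /* dispatch(buf + off + 4, len) would run here */
        off += 4 + len;
        frames++;
    }
    memmove(buf, buf + off, *used - off);  /* keep the partial remainder */
    *used -= off;
    return frames;
}
```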
### 6.3 PostgreSQL Tablespace Misses the Hook
Symptom: after creating a tablespace, file operations go through `pg_tblspc/...` paths and the daemon logs no requests.
Root cause:
- PostgreSQL accesses tablespaces through symbolic links.
- A plain string-prefix check on `/zvfs` misses them.
Fix:
- Run `realpath()` before the path check.
- For `O_CREAT` on a file that does not exist yet, check `realpath(parent)+basename`.
### 6.4 PostgreSQL `Permission denied` (Cross-User Daemon Connection)
Symptom: `CREATE DATABASE ... TABLESPACE ...` fails with a permission error.
Root cause:
- The daemon is started by root, and the UDS file's permissions are subject to umask.
- The postgres user cannot `connect(/tmp/zvfs.sock)`.
Fix:
- After `bind`, the daemon explicitly calls `chmod(socket, 0666)`.
### 6.5 PostgreSQL `Message too long`
Symptom: some SQL (notably the `CREATE DATABASE` path) fails with `Message too long`.
Root cause:
- Not a daemon parsing failure: the client's serialized request exceeds `ZVFS_IPC_BUF_SIZE`.
- The hook aggregates `writev` into one large write request, which easily hits the cap.
Current handling:
- Raise `ZVFS_IPC_BUF_SIZE` to `16MB` (`src/common/config.h`).
Future direction:
- Transparent sharding in the client's `blob_write_ex` (keeping the synchronous blocking semantics).
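The proposed sharding direction can be sketched as a planning step: a large write is split into chunks that each fit under the IPC cap, and each shard would then be sent as one blocking request/response. The header-room constant and helper name are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute shard boundaries for a write of `len` bytes at `off`, with
 * each shard at most `max_chunk` bytes (max_chunk would be derived from
 * ZVFS_IPC_BUF_SIZE minus assumed header room).
 * Returns the number of shards written into offs[]/lens[]. */
size_t zvfs_plan_shards(uint64_t off, size_t len, size_t max_chunk,
                        uint64_t *offs, size_t *lens, size_t max_shards)
{
    size_t n = 0, done = 0;

    while (done < len && n < max_shards) {
        size_t c = len - done;
        if (c > max_chunk)
            c = max_chunk;        /* each shard fits the IPC buffer */
        offs[n] = off + done;
        lens[n] = c;
        done += c;
        n++;
    }
    return n;
}
```

Because each shard is still a synchronous request, a failure mid-sequence can simply report the bytes written so far, matching short-write semantics.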
### 6.6 dup/dup2/fork Semantic Consistency
Problem: when multiple fds point at the same open file description, how do we keep the handle reference count consistent?
Approach:
- Add `ADD_REF` / `ADD_REF_BATCH` to the protocol.
- Explicitly add references in the hook on `dup/dup2/dup3/fork`.
- Add bounds protection to `close_range` (avoiding an infinite loop in the `UINT_MAX` case).
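The fd-to-handle bookkeeping behind this can be sketched with an in-process table; in the real client each ref change is also mirrored to the daemon via `ADD_REF`/`ADD_REF_BATCH`. The table layout and names are illustrative.

```c
#include <stdint.h>

#define ZVFS_MAX_FD     1024
#define ZVFS_MAX_HANDLE 1024

static int32_t  handle_of[ZVFS_MAX_FD];    /* 0 = not a zvfs fd; handles > 0 */
static uint32_t refcnt[ZVFS_MAX_HANDLE];   /* indexed by handle id */

void zvfs_track_open(int fd, int32_t handle)
{
    handle_of[fd] = handle;
    refcnt[handle] = 1;
}

/* dup/dup2/dup3 (and each inherited fd after fork): the new fd shares
 * the open file description, so it adds a reference to the handle. */
void zvfs_track_dup(int oldfd, int newfd)
{
    int32_t h = handle_of[oldfd];
    if (h > 0) {
        handle_of[newfd] = h;
        refcnt[h]++;
    }
}

/* close: drop one reference; returns 1 when the last reference died
 * and a RELEASE should be sent to the daemon. */
int zvfs_track_close(int fd)
{
    int32_t h = handle_of[fd];
    if (h <= 0)
        return 0;
    handle_of[fd] = 0;
    return --refcnt[h] == 0;
}
```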
---
## 7. Current Limitations and Next Steps
### 7.1 Current Limitations
- Only `/zvfs` paths are intercepted.
- A single request is still bounded by `ZVFS_IPC_BUF_SIZE`.
- `mmap` on zvfs fds is not yet supported (returns `ENOTSUP`; disable mmap paths in the application).
- `rename` between `/zvfs` and non-`/zvfs` paths returns `EXDEV`.
- `fallocate(FALLOC_FL_PUNCH_HOLE)` is not implemented.
- `ADD_REF_BATCH` currently prioritizes functionality over atomicity.
### 7.2 Next Steps
1. Implement transparent client-side sharding for `WRITE` to remove the single-packet cap entirely.
2. Keep hardening the PostgreSQL scenario (tablespace + pgbench + crash/restart).
3. Fill in the mmap path (mmap table + dirty-page writeback).
4. More systematic performance re-testing (fixed hardware, fixed parameters, full versioned reports) so stale numbers do not linger in the README.

postgresql.conf (new file, 751 lines)
# -----------------------------
# PostgreSQL configuration file
# -----------------------------
#
# This file consists of lines of the form:
#
# name = value
#
# (The "=" is optional.) Whitespace may be used. Comments are introduced with
# "#" anywhere on a line. The complete list of parameter names and allowed
# values can be found in the PostgreSQL documentation.
#
# The commented-out settings shown in this file represent the default values.
# Re-commenting a setting is NOT sufficient to revert it to the default value;
# you need to reload the server.
#
# This file is read on server startup and when the server receives a SIGHUP
# signal. If you edit the file on a running system, you have to SIGHUP the
# server for the changes to take effect, run "pg_ctl reload", or execute
# "SELECT pg_reload_conf()". Some parameters, which are marked below,
# require a server shutdown and restart to take effect.
#
# Any parameter can also be given as a command-line option to the server, e.g.,
# "postgres -c log_connections=on". Some parameters can be changed at run time
# with the "SET" SQL command.
#
# Memory units: B = bytes Time units: us = microseconds
# kB = kilobytes ms = milliseconds
# MB = megabytes s = seconds
# GB = gigabytes min = minutes
# TB = terabytes h = hours
# d = days
#------------------------------------------------------------------------------
# FILE LOCATIONS
#------------------------------------------------------------------------------
# The default values of these variables are driven from the -D command-line
# option or PGDATA environment variable, represented here as ConfigDir.
#data_directory = 'ConfigDir' # use data in another directory
# (change requires restart)
#hba_file = 'ConfigDir/pg_hba.conf' # host-based authentication file
# (change requires restart)
#ident_file = 'ConfigDir/pg_ident.conf' # ident configuration file
# (change requires restart)
# If external_pid_file is not explicitly set, no extra PID file is written.
#external_pid_file = '' # write an extra PID file
# (change requires restart)
#------------------------------------------------------------------------------
# CONNECTIONS AND AUTHENTICATION
#------------------------------------------------------------------------------
# - Connection Settings -
#listen_addresses = 'localhost' # what IP address(es) to listen on;
# comma-separated list of addresses;
# defaults to 'localhost'; use '*' for all
# (change requires restart)
#port = 5432 # (change requires restart)
max_connections = 100 # (change requires restart)
#superuser_reserved_connections = 3 # (change requires restart)
#unix_socket_directories = '/var/run/postgresql' # comma-separated list of directories
# (change requires restart)
#unix_socket_group = '' # (change requires restart)
#unix_socket_permissions = 0777 # begin with 0 to use octal notation
# (change requires restart)
#bonjour = off # advertise server via Bonjour
# (change requires restart)
#bonjour_name = '' # defaults to the computer name
# (change requires restart)
# - TCP settings -
# see "man 7 tcp" for details
#tcp_keepalives_idle = 0 # TCP_KEEPIDLE, in seconds;
# 0 selects the system default
#tcp_keepalives_interval = 0 # TCP_KEEPINTVL, in seconds;
# 0 selects the system default
#tcp_keepalives_count = 0 # TCP_KEEPCNT;
# 0 selects the system default
#tcp_user_timeout = 0 # TCP_USER_TIMEOUT, in milliseconds;
# 0 selects the system default
# - Authentication -
#authentication_timeout = 1min # 1s-600s
#password_encryption = md5 # md5 or scram-sha-256
#db_user_namespace = off
# GSSAPI using Kerberos
#krb_server_keyfile = 'FILE:${sysconfdir}/krb5.keytab'
#krb_caseins_users = off
# - SSL -
#ssl = off
#ssl_ca_file = ''
#ssl_cert_file = 'server.crt'
#ssl_crl_file = ''
#ssl_key_file = 'server.key'
#ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL' # allowed SSL ciphers
#ssl_prefer_server_ciphers = on
#ssl_ecdh_curve = 'prime256v1'
#ssl_min_protocol_version = 'TLSv1'
#ssl_max_protocol_version = ''
#ssl_dh_params_file = ''
#ssl_passphrase_command = ''
#ssl_passphrase_command_supports_reload = off
#------------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#------------------------------------------------------------------------------
# - Memory -
shared_buffers = 128MB # min 128kB
# (change requires restart)
#huge_pages = try # on, off, or try
# (change requires restart)
#temp_buffers = 8MB # min 800kB
#max_prepared_transactions = 0 # zero disables the feature
# (change requires restart)
# Caution: it is not advisable to set max_prepared_transactions nonzero unless
# you actively intend to use prepared transactions.
#work_mem = 4MB # min 64kB
#maintenance_work_mem = 64MB # min 1MB
#autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem
#max_stack_depth = 2MB # min 100kB
shared_memory_type = sysv # the default is the first option
# supported by the operating system:
# mmap
# sysv
# windows
# (change requires restart)
dynamic_shared_memory_type = sysv # the default is the first option
# supported by the operating system:
# posix
# sysv
# windows
# mmap
# (change requires restart)
# - Disk -
#temp_file_limit = -1 # limits per-process temp file space
# in kB, or -1 for no limit
# - Kernel Resources -
#max_files_per_process = 1000 # min 25
# (change requires restart)
# - Cost-Based Vacuum Delay -
#vacuum_cost_delay = 0 # 0-100 milliseconds (0 disables)
#vacuum_cost_page_hit = 1 # 0-10000 credits
#vacuum_cost_page_miss = 10 # 0-10000 credits
#vacuum_cost_page_dirty = 20 # 0-10000 credits
#vacuum_cost_limit = 200 # 1-10000 credits
# - Background Writer -
#bgwriter_delay = 200ms # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100 # max buffers written/round, 0 disables
#bgwriter_lru_multiplier = 2.0 # 0-10.0 multiplier on buffers scanned/round
#bgwriter_flush_after = 512kB # measured in pages, 0 disables
# - Asynchronous Behavior -
#effective_io_concurrency = 1 # 1-1000; 0 disables prefetching
#max_worker_processes = 8 # (change requires restart)
#max_parallel_maintenance_workers = 2 # limited by max_parallel_workers
#max_parallel_workers_per_gather = 2 # limited by max_parallel_workers
#parallel_leader_participation = on
#max_parallel_workers = 8 # number of max_worker_processes that
# can be used in parallel operations
#old_snapshot_threshold = -1 # 1min-60d; -1 disables; 0 is immediate
# (change requires restart)
#backend_flush_after = 0 # measured in pages, 0 disables
#------------------------------------------------------------------------------
# WRITE-AHEAD LOG
#------------------------------------------------------------------------------
# - Settings -
#wal_level = replica # minimal, replica, or logical
# (change requires restart)
#fsync = on # flush data to disk for crash safety
# (turning this off can cause
# unrecoverable data corruption)
#synchronous_commit = on # synchronization level;
# off, local, remote_write, remote_apply, or on
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
# fdatasync (default on Linux and FreeBSD)
# fsync
# fsync_writethrough
# open_sync
#full_page_writes = on # recover from partial page writes
#wal_compression = off # enable compression of full-page writes
#wal_log_hints = off # also do full page writes of non-critical updates
# (change requires restart)
#wal_init_zero = on # zero-fill new WAL files
#wal_recycle = on # recycle WAL files
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
#wal_writer_flush_after = 1MB # measured in pages, 0 disables
#commit_delay = 0 # range 0-100000, in microseconds
#commit_siblings = 5 # range 1-1000
# - Checkpoints -
#checkpoint_timeout = 5min # range 30s-1d
max_wal_size = 1GB
min_wal_size = 80MB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_flush_after = 256kB # measured in pages, 0 disables
#checkpoint_warning = 30s # 0 disables
# - Archiving -
#archive_mode = off # enables archiving; off, on, or always
# (change requires restart)
#archive_command = '' # command to use to archive a logfile segment
# placeholders: %p = path of file to archive
# %f = file name only
# e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
# - Archive Recovery -
# These are only used in recovery mode.
#restore_command = '' # command to use to restore an archived logfile segment
# placeholders: %p = path of file to restore
# %f = file name only
# e.g. 'cp /mnt/server/archivedir/%f %p'
# (change requires restart)
#archive_cleanup_command = '' # command to execute at every restartpoint
#recovery_end_command = '' # command to execute at completion of recovery
# - Recovery Target -
# Set these only when performing a targeted recovery.
#recovery_target = '' # 'immediate' to end recovery as soon as a
# consistent state is reached
# (change requires restart)
#recovery_target_name = '' # the named restore point to which recovery will proceed
# (change requires restart)
#recovery_target_time = '' # the time stamp up to which recovery will proceed
# (change requires restart)
#recovery_target_xid = '' # the transaction ID up to which recovery will proceed
# (change requires restart)
#recovery_target_lsn = '' # the WAL LSN up to which recovery will proceed
# (change requires restart)
#recovery_target_inclusive = on # Specifies whether to stop:
# just after the specified recovery target (on)
# just before the recovery target (off)
# (change requires restart)
#recovery_target_timeline = 'latest' # 'current', 'latest', or timeline ID
# (change requires restart)
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
#------------------------------------------------------------------------------
# REPLICATION
#------------------------------------------------------------------------------
# - Sending Servers -
# Set these on the master and on any standby that will send replication data.
#max_wal_senders = 10 # max number of walsender processes
# (change requires restart)
#wal_keep_segments = 0 # in logfile segments; 0 disables
#wal_sender_timeout = 60s # in milliseconds; 0 disables
#max_replication_slots = 10 # max number of replication slots
# (change requires restart)
#track_commit_timestamp = off # collect timestamp of transaction commit
# (change requires restart)
# - Master Server -
# These settings are ignored on a standby server.
#synchronous_standby_names = '' # standby servers that provide sync rep
# method to choose sync standbys, number of sync standbys,
# and comma-separated list of application_name
# from standby(s); '*' = all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
# These settings are ignored on a master server.
#primary_conninfo = '' # connection string to sending server
# (change requires restart)
#primary_slot_name = '' # replication slot on sending server
# (change requires restart)
#promote_trigger_file = '' # file name whose presence ends recovery
#hot_standby = on # "off" disallows queries during recovery
# (change requires restart)
#max_standby_archive_delay = 30s # max delay before canceling queries
# when reading WAL from archive;
# -1 allows indefinite delay
#max_standby_streaming_delay = 30s # max delay before canceling queries
# when reading streaming WAL;
# -1 allows indefinite delay
#wal_receiver_status_interval = 10s # send replies at least this often
# 0 disables
#hot_standby_feedback = off # send info from standby to prevent
# query conflicts
#wal_receiver_timeout = 60s # time that receiver waits for
# communication from master
# in milliseconds; 0 disables
#wal_retrieve_retry_interval = 5s # time to wait before retrying to
# retrieve WAL after a failed attempt
#recovery_min_apply_delay = 0 # minimum delay for applying changes during recovery
# - Subscribers -
# These settings are ignored on a publisher.
#max_logical_replication_workers = 4 # taken from max_worker_processes
# (change requires restart)
#max_sync_workers_per_subscription = 2 # taken from max_logical_replication_workers
#------------------------------------------------------------------------------
# QUERY TUNING
#------------------------------------------------------------------------------
# - Planner Method Configuration -
#enable_bitmapscan = on
#enable_hashagg = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_indexonlyscan = on
#enable_material = on
#enable_mergejoin = on
#enable_nestloop = on
#enable_parallel_append = on
#enable_seqscan = on
#enable_sort = on
#enable_tidscan = on
#enable_partitionwise_join = off
#enable_partitionwise_aggregate = off
#enable_parallel_hash = on
#enable_partition_pruning = on
# - Planner Cost Constants -
#seq_page_cost = 1.0 # measured on an arbitrary scale
#random_page_cost = 4.0 # same scale as above
#cpu_tuple_cost = 0.01 # same scale as above
#cpu_index_tuple_cost = 0.005 # same scale as above
#cpu_operator_cost = 0.0025 # same scale as above
#parallel_tuple_cost = 0.1 # same scale as above
#parallel_setup_cost = 1000.0 # same scale as above
#jit_above_cost = 100000 # perform JIT compilation if available
# and query more expensive than this;
# -1 disables
#jit_inline_above_cost = 500000 # inline small functions if query is
# more expensive than this; -1 disables
#jit_optimize_above_cost = 500000 # use expensive JIT optimizations if
# query is more expensive than this;
# -1 disables
#min_parallel_table_scan_size = 8MB
#min_parallel_index_scan_size = 512kB
#effective_cache_size = 4GB
# - Genetic Query Optimizer -
#geqo = on
#geqo_threshold = 12
#geqo_effort = 5 # range 1-10
#geqo_pool_size = 0 # selects default based on effort
#geqo_generations = 0 # selects default based on effort
#geqo_selection_bias = 2.0 # range 1.5-2.0
#geqo_seed = 0.0 # range 0.0-1.0
# - Other Planner Options -
#default_statistics_target = 100 # range 1-10000
#constraint_exclusion = partition # on, off, or partition
#cursor_tuple_fraction = 0.1 # range 0.0-1.0
#from_collapse_limit = 8
#join_collapse_limit = 8 # 1 disables collapsing of explicit
# JOIN clauses
#force_parallel_mode = off
#jit = on # allow JIT compilation
#plan_cache_mode = auto # auto, force_generic_plan or
# force_custom_plan
#------------------------------------------------------------------------------
# REPORTING AND LOGGING
#------------------------------------------------------------------------------
# - Where to Log -
#log_destination = 'stderr' # Valid values are combinations of
# stderr, csvlog, syslog, and eventlog,
# depending on platform. csvlog
# requires logging_collector to be on.
# This is used when logging to stderr:
#logging_collector = off # Enable capturing of stderr and csvlog
# into log files. Required to be on for
# csvlogs.
# (change requires restart)
# These are only used if logging_collector is on:
#log_directory = 'log' # directory where log files are written,
# can be absolute or relative to PGDATA
#log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern,
# can include strftime() escapes
#log_file_mode = 0600 # creation mode for log files,
# begin with 0 to use octal notation
#log_truncate_on_rotation = off # If on, an existing log file with the
# same name as the new log file will be
# truncated rather than appended to.
# But such truncation only occurs on
# time-driven rotation, not on restarts
# or size-driven rotation. Default is
# off, meaning append to existing files
# in all cases.
#log_rotation_age = 1d # Automatic rotation of logfiles will
# happen after that time. 0 disables.
#log_rotation_size = 10MB # Automatic rotation of logfiles will
# happen after that much log output.
# 0 disables.
# These are relevant when logging to syslog:
#syslog_facility = 'LOCAL0'
#syslog_ident = 'postgres'
#syslog_sequence_numbers = on
#syslog_split_messages = on
# This is only relevant when logging to eventlog (win32):
# (change requires restart)
#event_source = 'PostgreSQL'
# - When to Log -
#log_min_messages = warning # values in order of decreasing detail:
# debug5
# debug4
# debug3
# debug2
# debug1
# info
# notice
# warning
# error
# log
# fatal
# panic
#log_min_error_statement = error # values in order of decreasing detail:
# debug5
# debug4
# debug3
# debug2
# debug1
# info
# notice
# warning
# error
# log
# fatal
# panic (effectively off)
#log_min_duration_statement = -1 # -1 is disabled, 0 logs all statements
# and their durations, > 0 logs only
# statements running at least this number
# of milliseconds
#log_transaction_sample_rate = 0.0 # Fraction of transactions whose statements
# are logged regardless of their duration. 1.0 logs all
# statements from all transactions, 0.0 never logs.
# - What to Log -
#debug_print_parse = off
#debug_print_rewritten = off
#debug_print_plan = off
#debug_pretty_print = on
#log_checkpoints = off
#log_connections = off
#log_disconnections = off
#log_duration = off
#log_error_verbosity = default # terse, default, or verbose messages
#log_hostname = off
#log_line_prefix = '%m [%p] ' # special values:
# %a = application name
# %u = user name
# %d = database name
# %r = remote host and port
# %h = remote host
# %p = process ID
# %t = timestamp without milliseconds
# %m = timestamp with milliseconds
# %n = timestamp with milliseconds (as a Unix epoch)
# %i = command tag
# %e = SQL state
# %c = session ID
# %l = session line number
# %s = session start timestamp
# %v = virtual transaction ID
# %x = transaction ID (0 if none)
# %q = stop here in non-session
# processes
# %% = '%'
# e.g. '<%u%%%d> '
#log_lock_waits = off # log lock waits >= deadlock_timeout
#log_statement = 'none' # none, ddl, mod, all
#log_replication_commands = off
#log_temp_files = -1 # log temporary files equal or larger
# than the specified size in kilobytes;
# -1 disables, 0 logs all temp files
log_timezone = 'Etc/UTC'
#------------------------------------------------------------------------------
# PROCESS TITLE
#------------------------------------------------------------------------------
#cluster_name = '' # added to process titles if nonempty
# (change requires restart)
#update_process_title = on
#------------------------------------------------------------------------------
# STATISTICS
#------------------------------------------------------------------------------
# - Query and Index Statistics Collector -
#track_activities = on
#track_counts = on
#track_io_timing = off
#track_functions = none # none, pl, all
#track_activity_query_size = 1024 # (change requires restart)
#stats_temp_directory = 'pg_stat_tmp'
# - Monitoring -
#log_parser_stats = off
#log_planner_stats = off
#log_executor_stats = off
#log_statement_stats = off
#------------------------------------------------------------------------------
# AUTOVACUUM
#------------------------------------------------------------------------------
#autovacuum = on # Enable autovacuum subprocess? 'on'
# requires track_counts to also be on.
#log_autovacuum_min_duration = -1 # -1 disables, 0 logs all actions and
# their durations, > 0 logs only
# actions running at least this number
# of milliseconds.
#autovacuum_max_workers = 3 # max number of autovacuum subprocesses
# (change requires restart)
#autovacuum_naptime = 1min # time between autovacuum runs
#autovacuum_vacuum_threshold = 50 # min number of row updates before
# vacuum
#autovacuum_analyze_threshold = 50 # min number of row updates before
# analyze
#autovacuum_vacuum_scale_factor = 0.2 # fraction of table size before vacuum
#autovacuum_analyze_scale_factor = 0.1 # fraction of table size before analyze
#autovacuum_freeze_max_age = 200000000 # maximum XID age before forced vacuum
# (change requires restart)
#autovacuum_multixact_freeze_max_age = 400000000 # maximum multixact age
# before forced vacuum
# (change requires restart)
#autovacuum_vacuum_cost_delay = 2ms # default vacuum cost delay for
# autovacuum, in milliseconds;
# -1 means use vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1 # default vacuum cost limit for
# autovacuum, -1 means use
# vacuum_cost_limit
#------------------------------------------------------------------------------
# CLIENT CONNECTION DEFAULTS
#------------------------------------------------------------------------------
# - Statement Behavior -
#client_min_messages = notice # values in order of decreasing detail:
# debug5
# debug4
# debug3
# debug2
# debug1
# log
# notice
# warning
# error
#search_path = '"$user", public' # schema names
#row_security = on
#default_tablespace = '' # a tablespace name, '' uses the default
#temp_tablespaces = '' # a list of tablespace names, '' uses
# only default tablespace
#default_table_access_method = 'heap'
#check_function_bodies = on
#default_transaction_isolation = 'read committed'
#default_transaction_read_only = off
#default_transaction_deferrable = off
#session_replication_role = 'origin'
#statement_timeout = 0 # in milliseconds, 0 is disabled
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#vacuum_freeze_min_age = 50000000
#vacuum_freeze_table_age = 150000000
#vacuum_multixact_freeze_min_age = 5000000
#vacuum_multixact_freeze_table_age = 150000000
#vacuum_cleanup_index_scale_factor = 0.1 # fraction of total number of tuples
# before index cleanup, 0 always performs
# index cleanup
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
#gin_fuzzy_search_limit = 0
#gin_pending_list_limit = 4MB
# - Locale and Formatting -
datestyle = 'iso, mdy'
#intervalstyle = 'postgres'
timezone = 'Etc/UTC'
#timezone_abbreviations = 'Default' # Select the set of available time zone
# abbreviations. Currently, there are
# Default
# Australia (historical usage)
# India
# You can create your own file in
# share/timezonesets/.
#extra_float_digits = 1 # min -15, max 3; any value >0 actually
# selects precise output mode
#client_encoding = sql_ascii # actually, defaults to database
# encoding
# These settings are initialized by initdb, but they can be changed.
lc_messages = 'en_US.UTF-8' # locale for system error message
# strings
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting
# default configuration for text search
default_text_search_config = 'pg_catalog.english'
# - Shared Library Preloading -
#shared_preload_libraries = '' # (change requires restart)
#local_preload_libraries = ''
#session_preload_libraries = ''
#jit_provider = 'llvmjit' # JIT library to use
# - Other Defaults -
#dynamic_library_path = '$libdir'
#------------------------------------------------------------------------------
# LOCK MANAGEMENT
#------------------------------------------------------------------------------
#deadlock_timeout = 1s
#max_locks_per_transaction = 64 # min 10
# (change requires restart)
#max_pred_locks_per_transaction = 64 # min 10
# (change requires restart)
#max_pred_locks_per_relation = -2 # negative values mean
# (max_pred_locks_per_transaction
# / -max_pred_locks_per_relation) - 1
#max_pred_locks_per_page = 2 # min 0
#------------------------------------------------------------------------------
# VERSION AND PLATFORM COMPATIBILITY
#------------------------------------------------------------------------------
# - Previous PostgreSQL Versions -
#array_nulls = on
#backslash_quote = safe_encoding # on, off, or safe_encoding
#escape_string_warning = on
#lo_compat_privileges = off
#operator_precedence_warning = off
#quote_all_identifiers = off
#standard_conforming_strings = on
#synchronize_seqscans = on
# - Other Platforms and Clients -
#transform_null_equals = off
#------------------------------------------------------------------------------
# ERROR HANDLING
#------------------------------------------------------------------------------
#exit_on_error = off # terminate session on any error?
#restart_after_crash = on # reinitialize after backend crash?
#data_sync_retry = off # retry or panic on failure to fsync
# data?
# (change requires restart)
#------------------------------------------------------------------------------
# CONFIG FILE INCLUDES
#------------------------------------------------------------------------------
# These options allow settings to be loaded from files other than the
# default postgresql.conf. Note that these are directives, not variable
# assignments, so they can usefully be given more than once.
#include_dir = '...' # include files ending in '.conf' from
# a directory, e.g., 'conf.d'
#include_if_exists = '...' # include file only if it exists
#include = '...' # include file
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------
# Add settings for extensions here

scripts/do_pgbecnh.md Normal file

@@ -0,0 +1,77 @@
```shell
# 1. Install PostgreSQL and pgbench
sudo apt-get update
sudo apt-get install -y postgresql postgresql-contrib
# 2. Locate postgresql.conf (on Ubuntu it is usually here)
ls /etc/postgresql/*/main/postgresql.conf
# 3. Disable mmap shared memory (edit postgresql.conf)
shared_memory_type = sysv
dynamic_shared_memory_type = sysv
# 4. Restart PostgreSQL with a fresh data directory
sudo systemctl stop postgresql
rm -rf /home/lian/pg/pgdata
rm -rf /zvfs/pg_ts_bench
sudo chown -R postgres:postgres /home/lian/pg
sudo -u postgres mkdir -p /home/lian/pg/pgdata
sudo chown -R postgres:postgres /home/lian/pg/pgdata
sudo -u postgres env LD_PRELOAD=/home/lian/try/zvfs/src/libzvfs.so \
/usr/lib/postgresql/12/bin/initdb -D /home/lian/pg/pgdata
cp ./postgresql.conf /home/lian/pg/pgdata/
sudo -u postgres env LD_PRELOAD=/home/lian/try/zvfs/src/libzvfs.so \
/usr/lib/postgresql/12/bin/pg_ctl -D /home/lian/pg/pgdata -l /tmp/pg.log start
sudo -u postgres env LD_PRELOAD=/home/lian/try/zvfs/src/libzvfs.so \
/usr/lib/postgresql/12/bin/psql
sudo -u postgres env LD_PRELOAD=/home/lian/try/zvfs/src/libzvfs.so \
/usr/lib/postgresql/12/bin/pg_ctl -D /home/lian/pg/pgdata -l /tmp/pg.log restart
# Create the test environment
sudo -u postgres mkdir -p /zvfs/pg_ts_bench
sudo chown -R postgres:postgres /zvfs/pg_ts_bench
sudo chmod 700 /zvfs/pg_ts_bench
CREATE TABLESPACE zvfs_ts LOCATION '/zvfs/pg_ts_bench';
DROP DATABASE IF EXISTS benchdb;
CREATE DATABASE benchdb TABLESPACE zvfs_ts;
DROP TABLE IF EXISTS hook_probe;
CREATE TABLE hook_probe(id int) TABLESPACE zvfs_ts;
INSERT INTO hook_probe VALUES (1);
INSERT INTO hook_probe VALUES (2);
INSERT INTO hook_probe VALUES (3);
INSERT INTO hook_probe VALUES (4);
SELECT * FROM hook_probe;
DELETE FROM hook_probe WHERE id = 1;
UPDATE hook_probe SET id = 11 WHERE id = 2;
SELECT * FROM hook_probe;
# 5. Verify the configuration took effect
pid=$(pgrep -u postgres -xo postgres)
echo "pid=$pid"
sudo grep libzvfs /proc/$pid/maps
sudo -u postgres psql -p 5432 -c "show data_directory;"
sudo -u postgres psql -c "SHOW shared_memory_type;"
sudo -u postgres psql -c "SHOW dynamic_shared_memory_type;"
# 6. Create the benchmark database (if not created yet)
sudo -u postgres createdb benchdb
# 7. Run the bench script
bash /home/lian/try/zvfs/scripts/run_pgbench_no_mmap.sh
```


@@ -21,7 +21,7 @@ BENCHMARKS="fillrandom,readrandom"
# number of keys
# NUM=1000000
NUM=50000
NUM=500
# number of threads
THREADS=2

scripts/run_pgbench_no_mmap.sh Executable file

@@ -0,0 +1,91 @@
#!/usr/bin/env bash
set -euo pipefail
# pgbench-only runner (does not install PostgreSQL, no initdb, no service start/stop, no config edits)
#
# Prerequisites:
# 1) PostgreSQL is already running.
# 2) The test database already exists (default: benchdb).
# 3) PostgreSQL has been configured externally to avoid mmap shared memory:
#    shared_memory_type = sysv
#    dynamic_shared_memory_type = sysv
#
# About Malloc0:
# - The current backend is an in-memory virtual device with limited capacity.
# - The defaults are deliberately small to avoid loading too much data at once.
#
# About LD_PRELOAD:
# - USE_LD_PRELOAD_INIT=1: enable LD_PRELOAD during initialization (pgbench -i)
# - USE_LD_PRELOAD_RUN=1 : enable LD_PRELOAD during the benchmark run
# - Set either to 0 to disable LD_PRELOAD for that phase.
#
# Usage:
#   bash codex/run_pgbench_no_mmap.sh
#
# Optional environment variables:
#   PG_HOST=127.0.0.1
#       PostgreSQL server address.
#   PG_PORT=5432
#       PostgreSQL server port (default 5432).
#   PG_DB=benchdb
#       Benchmark database name.
#   PG_SCALE=2
#       pgbench initialization scale factor (-s); larger means more initial data.
#   PG_TIME=20
#       Benchmark duration (pgbench -T).
#   PG_CLIENTS=2
#       Number of concurrent clients (pgbench -c).
#   PG_JOBS=2
#       Number of worker threads (pgbench -j).
#   PG_SUPERUSER=postgres
#       System user that runs pgbench (usually postgres).
#   LD_PRELOAD_PATH=/home/lian/try/zvfs/src/libzvfs.so
#       Path to the LD_PRELOAD target library (the zvfs hook .so).
#   PG_BIN_DIR=/usr/lib/postgresql/16/bin
#       Directory containing pgbench; if unset, resolved from PATH.
#   USE_LD_PRELOAD_INIT=1
#       Enable LD_PRELOAD during initialization (pgbench -i): 1=on, 0=off.
#   USE_LD_PRELOAD_RUN=1
#       Enable LD_PRELOAD during the benchmark run: 1=on, 0=off.
PG_HOST="${PG_HOST:-127.0.0.1}"
PG_PORT="${PG_PORT:-5432}"
PG_DB="${PG_DB:-benchdb}"
PG_SCALE="${PG_SCALE:-2}"
PG_TIME="${PG_TIME:-20}"
PG_CLIENTS="${PG_CLIENTS:-2}"
PG_JOBS="${PG_JOBS:-2}"
PG_SUPERUSER="${PG_SUPERUSER:-postgres}"
LD_PRELOAD_PATH="${LD_PRELOAD_PATH:-/home/lian/try/zvfs/src/libzvfs.so}"
PG_BIN_DIR="${PG_BIN_DIR:-$(dirname "$(command -v pgbench 2>/dev/null || true)")}"
USE_LD_PRELOAD_INIT="${USE_LD_PRELOAD_INIT:-1}"
USE_LD_PRELOAD_RUN="${USE_LD_PRELOAD_RUN:-1}"
if [[ -z "${PG_BIN_DIR}" || ! -x "${PG_BIN_DIR}/pgbench" ]]; then
echo "pgbench not found; set PG_BIN_DIR or add pgbench to PATH." >&2
exit 1
fi
run_pgbench_cmd() {
local use_preload="$1"
shift
if [[ "${use_preload}" == "1" ]]; then
sudo -u "${PG_SUPERUSER}" env LD_PRELOAD="${LD_PRELOAD_PATH}" "$@"
else
sudo -u "${PG_SUPERUSER}" "$@"
fi
}
echo "Current parameters:"
echo " host=${PG_HOST} port=${PG_PORT} db=${PG_DB}"
echo " scale=${PG_SCALE} clients=${PG_CLIENTS} jobs=${PG_JOBS} time=${PG_TIME}s"
echo " preload_init=${USE_LD_PRELOAD_INIT} preload_run=${USE_LD_PRELOAD_RUN}"
echo "[1/2] Initializing data (pgbench -i)"
run_pgbench_cmd "${USE_LD_PRELOAD_INIT}" \
"${PG_BIN_DIR}/pgbench" -h "${PG_HOST}" -p "${PG_PORT}" -i -s "${PG_SCALE}" "${PG_DB}"
echo "[2/2] Running benchmark (pgbench -T)"
run_pgbench_cmd "${USE_LD_PRELOAD_RUN}" \
"${PG_BIN_DIR}/pgbench" -h "${PG_HOST}" -p "${PG_PORT}" \
-c "${PG_CLIENTS}" -j "${PG_JOBS}" -T "${PG_TIME}" -P 5 "${PG_DB}"

scripts/search_libzvfs.sh Executable file

@@ -0,0 +1,4 @@
#!/usr/bin/env bash
pgrep -u postgres -x postgres | while read p; do
echo "PID=$p"
sudo grep -m1 libzvfs /proc/$p/maps || echo " (no libzvfs)"
done


@@ -6,7 +6,6 @@
SPDK_ROOT_DIR := $(abspath $(CURDIR)/../spdk)
include $(SPDK_ROOT_DIR)/mk/spdk.common.mk
include $(SPDK_ROOT_DIR)/mk/spdk.modules.mk
include $(SPDK_ROOT_DIR)/mk/spdk.app_vars.mk
LIBZVFS := libzvfs.so
@@ -18,6 +17,7 @@ C_SRCS := \
fs/zvfs_path_entry.c \
fs/zvfs_open_file.c \
fs/zvfs_sys_init.c \
proto/ipc_proto.c \
hook/zvfs_hook_init.c \
hook/zvfs_hook_fd.c \
hook/zvfs_hook_rw.c \
@@ -28,24 +28,40 @@ C_SRCS := \
hook/zvfs_hook_dir.c \
hook/zvfs_hook_mmap.c \
# Header search paths
CFLAGS += -I$(abspath $(CURDIR)) -fPIC
# SPDK library dependencies
SPDK_LIB_LIST = $(ALL_MODULES_LIST) event event_bdev
LIBS += $(SPDK_LIB_LINKER_ARGS)
CFLAGS += -I$(abspath $(CURDIR))
LDFLAGS += -shared -rdynamic -Wl,-z,nodelete -Wl,--disable-new-dtags \
# Linker options
LDFLAGS += -shared -Wl,-soname,$(LIBZVFS) -Wl,-z,nodelete \
-Wl,--disable-new-dtags \
-Wl,-rpath,$(SPDK_ROOT_DIR)/build/lib \
-Wl,-rpath,$(SPDK_ROOT_DIR)/dpdk/build/lib
# System libraries
SYS_LIBS += -ldl
# Resolve linker arguments for the SPDK libraries
SPDK_LIBS = $(call spdk_lib_list_to_linker_args,$(SPDK_LIB_LIST))
DEPS = $(OBJS:.o=.d)
all: $(LIBZVFS)
@:
$(MAKE) -C daemon
$(LIBZVFS): $(OBJS) $(SPDK_LIB_FILES) $(ENV_LIBS)
$(LINK_C)
# Build object files
$(OBJDIR)/%.o: %.c
$(CC) $(CFLAGS) -c $< -o $@
# Build the shared library
$(LIBZVFS): $(OBJS)
$(CC) $(LDFLAGS) -o $@ $^ $(SPDK_LIBS) $(SYS_LIBS)
clean:
$(CLEAN_C) $(LIBZVFS)
rm -f $(DEPS) $(OBJS) $(LIBZVFS)
$(MAKE) -C daemon clean
include $(SPDK_ROOT_DIR)/mk/spdk.deps.mk


@@ -1,33 +1,20 @@
#ifndef __ZVFS_CONFIG_H__
#define __ZVFS_CONFIG_H__
/**
* ZVFS
*/
#define ZVFS_XATTR_BLOB_ID "user.zvfs.blob_id"
/**
* SPDK
*/
// dev
#define SPDK_JSON_PATH "/home/lian/try/zvfs/src/zvfsmalloc.json"
// #define ZVFS_BDEV "Nvme0n1"
#ifndef ZVFS_BDEV
#define ZVFS_BDEV "Malloc0"
#endif
// super blob
#define ZVFS_SB_MAGIC UINT64_C(0x5A5646535F534200) /* "ZVFS_SB\0" */
#define ZVFS_SB_VERSION UINT32_C(1)
// dma
#define ZVFS_DMA_BUF_SIZE (1024 * 1024)
// waiter
#define WAITER_MAX_TIME 10000000
#define ZVFS_WAIT_TIME 5000ULL
#define ZVFS_IPC_DEFAULT_SOCKET_PATH "/tmp/zvfs.sock"
// #define ZVFS_IPC_BUF_SIZE 4096
#define ZVFS_IPC_BUF_SIZE (16 * 1024 * 1024)
#endif // __ZVFS_CONFIG_H__


@@ -50,44 +50,3 @@ int zvfs_calc_ceil_units(uint64_t bytes,
}
return 0;
}
int buf_init(zvfs_buf_t *b, size_t initial)
{
b->data = malloc(initial);
if (!b->data) return -1;
b->cap = initial;
b->len = 0;
return 0;
}
void buf_free(zvfs_buf_t *b)
{
free(b->data);
b->data = NULL;
b->len = b->cap = 0;
}
/*
 * Ensure the buffer has room for `need` more bytes; if not, grow the
 * capacity geometrically (doubling) via realloc.
 */
int buf_reserve(zvfs_buf_t *b, size_t need)
{
if (b->len + need <= b->cap) return 0;
size_t new_cap = b->cap ? b->cap * 2 : 64; /* guard cap==0 so the doubling loop terminates */
while (new_cap < b->len + need) new_cap *= 2;
uint8_t *p = realloc(b->data, new_cap);
if (!p) return -1;
b->data = p;
b->cap = new_cap;
return 0;
}
int buf_append(zvfs_buf_t *b, const void *src, size_t n)
{
if (buf_reserve(b, n) != 0) return -1;
memcpy(b->data + b->len, src, n);
b->len += n;
return 0;
}


@@ -15,15 +15,4 @@ int zvfs_calc_ceil_units(uint64_t bytes,
uint64_t unit_size,
uint64_t *units_out);
typedef struct {
uint8_t *data;
size_t cap;
size_t len;
} zvfs_buf_t;
int buf_init(zvfs_buf_t *b, size_t initial);
void buf_free(zvfs_buf_t *b);
int buf_reserve(zvfs_buf_t *b, size_t need);
int buf_append(zvfs_buf_t *b, const void *src, size_t n);
#endif // __ZVFS_COMMON_UTILS_H__

src/daemon/Makefile Normal file

@@ -0,0 +1,20 @@
# SPDX-License-Identifier: BSD-3-Clause
# Copyright (C) 2017 Intel Corporation
# All rights reserved.
#
SPDK_ROOT_DIR := $(abspath $(CURDIR)/../../spdk)
PROTO_DIR := $(abspath $(CURDIR)/../proto)
COMMON_DIR := $(abspath $(CURDIR)/../common)
include $(SPDK_ROOT_DIR)/mk/spdk.common.mk
include $(SPDK_ROOT_DIR)/mk/spdk.modules.mk
APP = zvfs_daemon
CFLAGS += -I$(abspath $(CURDIR)/..)
C_SRCS := main.c ipc_cq.c ipc_reactor.c spdk_engine.c spdk_engine_wrapper.c $(PROTO_DIR)/ipc_proto.c $(COMMON_DIR)/utils.c
SPDK_LIB_LIST = $(ALL_MODULES_LIST) event event_bdev
include $(SPDK_ROOT_DIR)/mk/spdk.app.mk

src/daemon/ipc_cq.c Normal file

@@ -0,0 +1,61 @@
#include "ipc_cq.h"
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
struct cq *g_cq;
struct cq *CQ_Create(void) {
struct cq *q = (struct cq*)malloc(sizeof(*q));
if (!q) return NULL;
q->head = q->tail = NULL;
pthread_mutex_init(&q->lock, NULL);
return q;
}
void CQ_Destroy(struct cq *q) {
while (q->head) {
struct cq_item *tmp = q->head;
q->head = tmp->next;
free(tmp->resp->data); /* safe when resp has no data: free(NULL) is a no-op */
free(tmp->resp);
free(tmp);
}
pthread_mutex_destroy(&q->lock);
free(q);
}
/* Push a response */
void CQ_Push(struct cq *q, struct zvfs_resp *resp) {
struct cq_item *item = (struct cq_item *)malloc(sizeof(*item));
if (!item) {
    /* OOM: drop the response rather than crash; the peer will not get a reply */
    free(resp->data);
    free(resp);
    return;
}
item->resp = resp;
item->next = NULL;
pthread_mutex_lock(&q->lock);
if (q->tail) {
q->tail->next = item;
q->tail = item;
} else {
q->head = q->tail = item;
}
pthread_mutex_unlock(&q->lock);
}
/* Pop a response */
struct zvfs_resp *CQ_Pop(struct cq *q) {
pthread_mutex_lock(&q->lock);
struct cq_item *item = q->head;
if (!item) {
pthread_mutex_unlock(&q->lock);
return NULL;
}
q->head = item->next;
if (!q->head) q->tail = NULL;
pthread_mutex_unlock(&q->lock);
struct zvfs_resp *resp = item->resp;
free(item);
return resp;
}

src/daemon/ipc_cq.h Normal file

@@ -0,0 +1,26 @@
#ifndef __ZVFS_IPC_CQ_H__
#define __ZVFS_IPC_CQ_H__
#include "proto/ipc_proto.h"
#include <pthread.h>
struct cq_item {
struct zvfs_resp *resp;
struct cq_item *next;
};
struct cq {
struct cq_item *head;
struct cq_item *tail;
pthread_mutex_t lock;
};
struct cq *CQ_Create(void);
void CQ_Destroy(struct cq *q);
void CQ_Push(struct cq *q, struct zvfs_resp *resp);
struct zvfs_resp *CQ_Pop(struct cq *q);
extern struct cq *g_cq;
#endif

src/daemon/ipc_reactor.c Normal file

@@ -0,0 +1,309 @@
#include "ipc_reactor.h"
#include "ipc_cq.h"
#include "common/config.h"
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/epoll.h>
#include <sys/stat.h>
#include <stdint.h>
static int send_all(int fd, const uint8_t *buf, size_t len) {
size_t off = 0;
while (off < len) {
ssize_t sent = send(fd, buf + off, len - off, 0);
if (sent > 0) {
off += (size_t)sent;
continue;
}
if (sent < 0 && errno == EINTR) {
continue;
}
if (sent < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
/* Functionality-first implementation: sleep briefly and retry until the peer is writable. */
usleep(100);
continue;
}
return -1;
}
return 0;
}
/** ====================================================== */
/* CQ OP */
/** ====================================================== */
static void cq_consume_send(struct cq *q) {
struct zvfs_resp *resp;
while ((resp = CQ_Pop(q)) != NULL) {
struct zvfs_conn *conn = resp->conn;
size_t cap = ZVFS_IPC_BUF_SIZE;
uint8_t *buf = NULL;
// printf("[resp][%s]\n",cast_opcode2string(resp->opcode));
buf = malloc(cap);
if (!buf) {
fprintf(stderr, "serialize resp failed: alloc %zu bytes\n", cap);
free(resp->data);
free(resp);
continue;
}
size_t n = zvfs_serialize_resp(resp, buf, cap);
if (n == 0 && resp->status == 0 && resp->opcode == ZVFS_OP_READ) {
if (resp->length <= SIZE_MAX - 64) {
size_t need = (size_t)resp->length + 64;
uint8_t *bigger = realloc(buf, need);
if (bigger) {
buf = bigger;
cap = need;
n = zvfs_serialize_resp(resp, buf, cap);
}
}
}
if (n == 0) {
fprintf(stderr, "serialize resp failed: op=%u status=%d len=%lu cap=%zu\n",
resp->opcode, resp->status, resp->length, cap);
free(buf);
free(resp->data);
free(resp);
continue;
}
if (send_all(conn->fd, buf, n) != 0) {
perror("send");
free(buf);
free(resp->data);
free(resp);
continue;
}
free(buf);
// cleanup
if(resp->data) free(resp->data);
free(resp);
}
}
static int set_nonblock(int fd){
int flags = fcntl(fd, F_GETFL, 0);
if (flags < 0)
return -1;
return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
static void epoll_add(struct zvfs_reactor *r, int fd, void *ptr, uint32_t events)
{
struct epoll_event ev;
memset(&ev, 0, sizeof(ev));
ev.events = events;
ev.data.ptr = ptr;
epoll_ctl(r->epfd, EPOLL_CTL_ADD, fd, &ev);
}
static void epoll_mod(struct zvfs_reactor *r, int fd, void *ptr, uint32_t events){
struct epoll_event ev;
memset(&ev, 0, sizeof(ev));
ev.events = events;
ev.data.ptr = ptr;
epoll_ctl(r->epfd, EPOLL_CTL_MOD, fd, &ev);
}
static void conn_destroy(struct zvfs_conn *c){
close(c->fd);
free(c);
}
int zvfs_conn_get_fd(struct zvfs_conn *conn){
return conn->fd;
}
void zvfs_conn_set_ctx(struct zvfs_conn *conn, void *ctx){
conn->user_ctx = ctx;
}
void *zvfs_conn_get_ctx(struct zvfs_conn *conn){
return conn->user_ctx;
}
void zvfs_conn_enable_write(struct zvfs_conn *conn){
if (conn->want_write)
return;
conn->want_write = 1;
struct zvfs_reactor *r = conn->reactor;
epoll_mod(r, conn->fd, conn,
EPOLLIN | EPOLLOUT | EPOLLET);
}
void zvfs_conn_disable_write(struct zvfs_conn *conn){
if (!conn->want_write)
return;
conn->want_write = 0;
struct zvfs_reactor *r = conn->reactor;
epoll_mod(r, conn->fd, conn,
EPOLLIN | EPOLLET);
}
void zvfs_conn_close(struct zvfs_conn *conn){
struct zvfs_reactor *r = conn->reactor;
if (r->opts.on_close)
r->opts.on_close(conn, r->opts.cb_ctx);
epoll_ctl(r->epfd, EPOLL_CTL_DEL, conn->fd, NULL);
conn_destroy(conn);
}
/**
 * AF_UNIX -> Unix Domain Socket
 * SOCK_STREAM -> TCP-like, connection-oriented byte stream
 * path -> the endpoint is a filesystem path used for communication
 */
static int create_listen_socket(const char *path, int backlog){
int fd = socket(AF_UNIX, SOCK_STREAM, 0);
if (fd < 0)
return -1;
struct sockaddr_un addr;
memset(&addr, 0, sizeof(addr));
addr.sun_family = AF_UNIX;
strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
unlink(path);
if (bind(fd, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
    close(fd);
    return -1;
}
/*
 * When the daemon is started as root, the socket mode is subject to umask,
 * which would make connect() fail with EACCES for other users (e.g. postgres).
 */
if (chmod(path, 0666) < 0) {
    close(fd);
    return -1;
}
if (listen(fd, backlog) < 0) {
    close(fd);
    return -1;
}
set_nonblock(fd);
return fd;
}
struct zvfs_reactor *zvfs_reactor_create(const struct zvfs_reactor_opts *opts){
struct zvfs_reactor *r = calloc(1, sizeof(*r));
r->opts = *opts;
r->epfd = epoll_create1(0);
r->listen_fd = create_listen_socket(
opts->socket_path,
opts->backlog);
epoll_add(r, r->listen_fd, NULL, EPOLLIN);
return r;
}
static void handle_accept(struct zvfs_reactor *r){
for (;;) {
int fd = accept(r->listen_fd, NULL, NULL);
if (fd < 0) {
    /* EAGAIN/EWOULDBLOCK: backlog drained; other errors also end this round */
    return;
}
set_nonblock(fd);
struct zvfs_conn *conn = calloc(1, sizeof(*conn));
if (!conn) {
    close(fd);
    continue;
}
conn->fd = fd;
conn->reactor = r;
epoll_add(r, fd, conn, EPOLLIN | EPOLLET);
if (r->opts.on_accept)
r->opts.on_accept(conn, r->opts.cb_ctx);
}
}
int
zvfs_reactor_run(struct zvfs_reactor *r){
struct epoll_event events[64];
r->running = 1;
while (r->running) {
int n = epoll_wait(r->epfd, events, 64, 0); /* zero timeout: busy-poll so the CQ below is drained promptly */
for (int i = 0; i < n; i++) {
if (events[i].data.ptr == NULL) {
handle_accept(r);
continue;
}
struct zvfs_conn *conn = events[i].data.ptr;
if (events[i].events & (EPOLLHUP | EPOLLERR)) {
zvfs_conn_close(conn);
continue;
}
if ((events[i].events & EPOLLIN) &&
r->opts.on_read) {
r->opts.on_read(conn, r->opts.cb_ctx);
}
if ((events[i].events & EPOLLOUT) &&
r->opts.on_write) {
r->opts.on_write(conn, r->opts.cb_ctx);
}
}
cq_consume_send(g_cq);
}
return 0;
}
void zvfs_reactor_stop(struct zvfs_reactor *r){
r->running = 0;
}
void zvfs_reactor_destroy(struct zvfs_reactor *r){
close(r->listen_fd);
close(r->epfd);
free(r);
}

src/daemon/ipc_reactor.h Normal file

@@ -0,0 +1,118 @@
#ifndef __ZVFS_IPC_REACTOR_H__
#define __ZVFS_IPC_REACTOR_H__
#include <stdint.h>
#include <stddef.h>
#ifdef __cplusplus
extern "C" {
#endif
struct zvfs_reactor_opts;
struct zvfs_conn;
struct zvfs_reactor;
/* callbacks */
typedef void (*zvfs_on_accept_fn)(
struct zvfs_conn *conn,
void *ctx);
typedef void (*zvfs_on_read_fn)(
struct zvfs_conn *conn,
void *ctx);
typedef void (*zvfs_on_write_fn)(
struct zvfs_conn *conn,
void *ctx);
typedef void (*zvfs_on_close_fn)(
struct zvfs_conn *conn,
void *ctx);
/* configuration */
struct zvfs_reactor_opts {
const char *socket_path;
int backlog;
int max_events;
zvfs_on_accept_fn on_accept;
zvfs_on_read_fn on_read;
zvfs_on_write_fn on_write;
zvfs_on_close_fn on_close;
void *cb_ctx;
};
struct zvfs_conn {
int fd;
int want_write;
void *user_ctx;
struct zvfs_reactor *reactor;
};
struct zvfs_reactor {
int epfd;
int listen_fd;
int running;
struct zvfs_reactor_opts opts;
};
/* reactor lifecycle */
struct zvfs_reactor *
zvfs_reactor_create(const struct zvfs_reactor_opts *opts);
int
zvfs_reactor_run(struct zvfs_reactor *reactor);
void
zvfs_reactor_stop(struct zvfs_reactor *reactor);
void
zvfs_reactor_destroy(struct zvfs_reactor *reactor);
/* connection helpers */
int
zvfs_conn_get_fd(struct zvfs_conn *conn);
void
zvfs_conn_close(struct zvfs_conn *conn);
void
zvfs_conn_enable_write(struct zvfs_conn *conn);
void
zvfs_conn_disable_write(struct zvfs_conn *conn);
void
zvfs_conn_set_ctx(struct zvfs_conn *conn, void *ctx);
void *
zvfs_conn_get_ctx(struct zvfs_conn *conn);
#ifdef __cplusplus
}
#endif
#endif

src/daemon/main.c Normal file

@@ -0,0 +1,259 @@
#include "common/config.h"
#include "proto/ipc_proto.h"
#include "ipc_reactor.h"
#include "ipc_cq.h"
#include "spdk_engine_wrapper.h"
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/types.h>
#include <errno.h>
#include <stdlib.h>
// #define IPC_REACTOR_ECHO
#define IPC_REACTOR_ZVFS
extern struct zvfs_spdk_io_engine g_engine;
#ifdef IPC_REACTOR_ECHO
static void on_accept(struct zvfs_conn *conn, void *ctx)
{
printf("client connected fd=%d\n",
zvfs_conn_get_fd(conn));
}
static void on_read(struct zvfs_conn *c, void *ctx)
{
int fd = zvfs_conn_get_fd(c);
char buf[4096];
ssize_t n = read(fd, buf, sizeof(buf));
if (n == 0) {
zvfs_conn_close(c);
return;
}
if (n < 0) {
if (errno == EAGAIN || errno == EWOULDBLOCK)
return;
perror("read");
zvfs_conn_close(c);
return;
}
printf("recv %ld bytes: %.*s\n", n, (int)n, buf);
ssize_t w = write(fd, buf, n);
if (w < 0) {
perror("write");
zvfs_conn_close(c);
return;
}
}
static void on_write(struct zvfs_conn *conn, void *ctx)
{
/* the echo server does not need a write queue */
}
static void on_close(struct zvfs_conn *conn, void *ctx)
{
printf("connection closed fd=%d\n",
zvfs_conn_get_fd(conn));
}
int main()
{
struct zvfs_reactor_opts opts = {
.socket_path = "/tmp/zvfs.sock",
.backlog = 128,
.max_events = 64,
.on_accept = on_accept,
.on_read = on_read,
.on_write = on_write,
.on_close = on_close,
.cb_ctx = NULL
};
struct zvfs_reactor *r = zvfs_reactor_create(&opts);
printf("echo server started: %s\n", opts.socket_path);
zvfs_reactor_run(r);
return 0;
}
#else
static void on_accept(struct zvfs_conn *conn, void *ctx)
{
struct {
uint8_t *buf;
size_t len;
size_t cap;
} *rctx = calloc(1, sizeof(*rctx));
if (!rctx) {
fprintf(stderr, "[accept] alloc conn ctx failed\n");
zvfs_conn_close(conn);
return;
}
rctx->cap = ZVFS_IPC_BUF_SIZE;
rctx->buf = calloc(1, rctx->cap);
if (!rctx->buf) {
fprintf(stderr, "[accept] alloc conn rx buffer failed\n");
free(rctx);
zvfs_conn_close(conn);
return;
}
zvfs_conn_set_ctx(conn, rctx);
printf("client connected fd=%d\n",
zvfs_conn_get_fd(conn));
}
static void on_read(struct zvfs_conn *c, void *ctx)
{
int fd = zvfs_conn_get_fd(c);
struct {
uint8_t *buf;
size_t len;
size_t cap;
} *rctx = zvfs_conn_get_ctx(c);
if (!rctx || !rctx->buf || rctx->cap == 0) {
fprintf(stderr, "[read] invalid conn ctx fd=%d\n", fd);
zvfs_conn_close(c);
return;
}
for (;;) {
if (rctx->len >= rctx->cap) {
fprintf(stderr, "[read] rx buffer overflow fd=%d len=%zu cap=%zu\n",
fd, rctx->len, rctx->cap);
zvfs_conn_close(c);
return;
}
ssize_t n = read(fd, rctx->buf + rctx->len, rctx->cap - rctx->len);
if (n == 0) {
fprintf(stderr, "[read] fd=%d closed\n", fd);
zvfs_conn_close(c);
return;
}
if (n < 0) {
if (errno != EAGAIN && errno != EWOULDBLOCK) {
perror("[read]");
zvfs_conn_close(c);
return;
}
break;
}
rctx->len += (size_t)n;
}
size_t offset = 0;
while (offset < rctx->len) {
struct zvfs_req *req = calloc(1, sizeof(*req));
if (!req) {
fprintf(stderr, "malloc failed\n");
break;
}
size_t consumed = zvfs_deserialize_req(rctx->buf + offset, rctx->len - offset, req);
if (consumed == 0) {
free(req);
break; /* wait for more data */
}
printf("[req][%s]\n", cast_opcode2string(req->opcode));
req->conn = c;
offset += consumed;
if (dispatch_to_worker(req) < 0) {
fprintf(stderr, "[dispatcher] [fd:%d] dispatch error\n", c->fd);
}
}
if (offset > 0) {
size_t remain = rctx->len - offset;
if (remain > 0) {
memmove(rctx->buf, rctx->buf + offset, remain);
}
rctx->len = remain;
}
if (rctx->len == rctx->cap) {
fprintf(stderr, "[read] request too large or malformed fd=%d cap=%zu\n",
fd, rctx->cap);
zvfs_conn_close(c);
}
}
static void on_close(struct zvfs_conn *conn, void *ctx)
{
struct {
uint8_t *buf;
size_t len;
size_t cap;
} *rctx = zvfs_conn_get_ctx(conn);
if (rctx) {
free(rctx->buf);
free(rctx);
zvfs_conn_set_ctx(conn, NULL);
}
printf("connection closed fd=%d\n",
zvfs_conn_get_fd(conn));
}
int main(void){
const char *bdev_name = getenv("SPDK_BDEV_NAME") ? getenv("SPDK_BDEV_NAME") : ZVFS_BDEV;
const char *json_file = getenv("SPDK_JSON_CONFIG") ? getenv("SPDK_JSON_CONFIG") : SPDK_JSON_PATH;
g_cq = CQ_Create();
zvfs_engine_init(bdev_name, json_file, 4);
struct zvfs_reactor_opts opts = {
.socket_path = ZVFS_IPC_DEFAULT_SOCKET_PATH,
.backlog = 128,
.max_events = 64,
.on_accept = on_accept,
.on_read = on_read,
.on_write = NULL,
.on_close = on_close,
.cb_ctx = &g_engine
};
struct zvfs_reactor *r = zvfs_reactor_create(&opts);
zvfs_reactor_run(r);
if(g_cq) CQ_Destroy(g_cq);
}
#endif

src/daemon/spdk_engine.c Normal file

File diff suppressed because it is too large

src/daemon/spdk_engine.h Normal file

@@ -0,0 +1,68 @@
#ifndef __ZVFS_SPDK_ENGINE_H__
#define __ZVFS_SPDK_ENGINE_H__
#include "common/uthash.h"
#include "proto/ipc_proto.h"
#include <stdint.h>
#include <sys/types.h>
#include <stdatomic.h>
#include <spdk/blob.h>
// blob_handle: low-level blob state; the file-level size is maintained by the upper layer
typedef struct zvfs_blob_handle {
spdk_blob_id blob_id;
struct spdk_blob *blob;
void *dma_buf;
uint64_t dma_buf_size;
atomic_uint ref_count;
} zvfs_blob_handle_t;
struct zvfs_io_thread {
struct spdk_thread *thread;
struct spdk_io_channel *channel; // each io thread owns a dedicated channel
pthread_t tid;
bool ready;
};
typedef uint64_t zvfs_handle_id_t;
struct zvfs_blob_cache_entry {
zvfs_handle_id_t handle_id; // key != blob_id
struct zvfs_blob_handle *handle;
UT_hash_handle hh;
};
typedef struct zvfs_spdk_io_engine {
struct spdk_bs_dev *bs_dev;
struct spdk_blob_store *bs;
/* thread pool: thread_pool[0] is the metadata thread; the rest are io threads */
struct zvfs_io_thread *thread_pool; // thread pool
int thread_count; // total threads (= number of CPU cores)
int io_thread_count; // number of io threads
struct zvfs_blob_cache_entry *handle_cache; // handle_id -> handle map
pthread_mutex_t cache_mu;
uint64_t io_unit_size;
uint64_t cluster_size;
} zvfs_spdk_io_engine_t;
int engine_cache_insert(struct zvfs_blob_handle *handle, zvfs_handle_id_t *out_id);
struct zvfs_blob_handle *engine_cache_lookup(zvfs_handle_id_t handle_id);
void engine_cache_remove(zvfs_handle_id_t handle_id);
int io_engine_init(const char *bdev_name, const char *json_file, int thread_num);
int blob_create(struct zvfs_req *req);
int blob_open(struct zvfs_req *req);
int blob_write(struct zvfs_req *req);
int blob_read(struct zvfs_req *req);
int blob_resize(struct zvfs_req *req);
int blob_sync_md(struct zvfs_req *req);
int blob_close(struct zvfs_req *req);
int blob_delete(struct zvfs_req *req);
#endif // __ZVFS_SPDK_ENGINE_H__


@@ -0,0 +1,210 @@
#include "spdk_engine_wrapper.h"
#include "spdk_engine.h"
#include "ipc_cq.h"
#include <spdk/log.h>
extern struct zvfs_spdk_io_engine g_engine;
/** cq op */
static void push_err_resp(struct zvfs_req *req, int status) {
struct zvfs_resp *resp = calloc(1, sizeof(*resp));
if (!resp) {
SPDK_ERRLOG("push_err_resp: calloc failed, op_code=%u\n", req->opcode);
if (req->data) free(req->data);
if (req->add_ref_items) free(req->add_ref_items);
free(req);
return;
}
resp->opcode = req->opcode;
resp->conn = req->conn;
resp->status = status;
if (req->data) free(req->data);
if (req->add_ref_items) free(req->add_ref_items);
free(req);
CQ_Push(g_cq, resp);
}
static void push_ok_resp(struct zvfs_req *req) {
struct zvfs_resp *resp = calloc(1, sizeof(*resp));
if (!resp) {
SPDK_ERRLOG("push_ok_resp: calloc failed, op_code=%u\n", req->opcode);
if (req->data) free(req->data);
if (req->add_ref_items) free(req->add_ref_items);
free(req);
return;
}
resp->opcode = req->opcode;
resp->conn = req->conn;
resp->status = 0;
if (req->data) free(req->data);
if (req->add_ref_items) free(req->add_ref_items);
free(req);
CQ_Push(g_cq, resp);
}
/** hash map op */
int engine_cache_insert(struct zvfs_blob_handle *handle, zvfs_handle_id_t *out_id) {
struct zvfs_blob_cache_entry *entry = calloc(1, sizeof(*entry));
if (!entry) return -ENOMEM;
entry->handle_id = (zvfs_handle_id_t)(uintptr_t)handle;
entry->handle = handle;
pthread_mutex_lock(&g_engine.cache_mu);
HASH_ADD(hh, g_engine.handle_cache, handle_id, sizeof(zvfs_handle_id_t), entry);
pthread_mutex_unlock(&g_engine.cache_mu);
*out_id = entry->handle_id;
return 0;
}
struct zvfs_blob_handle *engine_cache_lookup(zvfs_handle_id_t handle_id) {
struct zvfs_blob_cache_entry *entry = NULL;
pthread_mutex_lock(&g_engine.cache_mu);
HASH_FIND(hh, g_engine.handle_cache, &handle_id, sizeof(zvfs_handle_id_t), entry);
pthread_mutex_unlock(&g_engine.cache_mu);
return entry ? entry->handle : NULL;
}
void engine_cache_remove(zvfs_handle_id_t handle_id) {
struct zvfs_blob_cache_entry *entry = NULL;
pthread_mutex_lock(&g_engine.cache_mu);
HASH_FIND(hh, g_engine.handle_cache, &handle_id, sizeof(zvfs_handle_id_t), entry);
if (entry) { HASH_DEL(g_engine.handle_cache, entry); free(entry); }
pthread_mutex_unlock(&g_engine.cache_mu);
}
static int fill_handle(struct zvfs_req *req, const char *op) {
struct zvfs_blob_handle *handle = engine_cache_lookup(req->handle_id);
if (!handle) {
SPDK_ERRLOG("%s: invalid handle_id=%lu\n", op, req->handle_id);
push_err_resp(req, -EBADF);
return -EBADF;
}
req->handle = handle;
return 0;
}
// zvfs wrapper
int zvfs_engine_init(const char *bdev_name, const char *json_file, int thread_num) {
return io_engine_init(bdev_name, json_file, thread_num);
}
/* create / open: the handle is registered inside the engine callback; the wrapper passes straight through */
static int zvfs_create(struct zvfs_req *req) {
return blob_create(req);
}
static int zvfs_open(struct zvfs_req *req) {
return blob_open(req);
}
/* delete: only needs blob_id, no handle */
static int zvfs_delete(struct zvfs_req *req) {
return blob_delete(req);
}
/* the operations below must have the handle filled in first */
static int zvfs_write(struct zvfs_req *req) {
if (fill_handle(req, "zvfs_write") != 0) return -EBADF;
return blob_write(req);
}
static int zvfs_read(struct zvfs_req *req) {
if (fill_handle(req, "zvfs_read") != 0) return -EBADF;
return blob_read(req);
}
static int zvfs_resize(struct zvfs_req *req) {
if (fill_handle(req, "zvfs_resize") != 0) return -EBADF;
return blob_resize(req);
}
static int zvfs_sync_md(struct zvfs_req *req) {
if (fill_handle(req, "zvfs_sync_md") != 0) return -EBADF;
return blob_sync_md(req);
}
/* close: after fill_handle, the engine callback also removes the cache entry */
static int zvfs_close(struct zvfs_req *req) {
if (fill_handle(req, "zvfs_close") != 0) return -EBADF;
return blob_close(req);
}
static int zvfs_add_ref(struct zvfs_req *req) {
if (req->ref_delta == 0) {
push_err_resp(req, -EINVAL);
return -EINVAL;
}
if (fill_handle(req, "zvfs_add_ref") != 0) return -EBADF;
atomic_fetch_add(&req->handle->ref_count, req->ref_delta);
push_ok_resp(req);
return 0;
}
static int zvfs_add_ref_batch(struct zvfs_req *req) {
int rc = 0;
uint32_t i = 0;
if (req->add_ref_count == 0 || !req->add_ref_items) {
push_err_resp(req, -EINVAL);
return -EINVAL;
}
/* TODO: functionality-first, non-atomic batch add-ref implementation. */
for (i = 0; i < req->add_ref_count; i++) {
struct zvfs_add_ref_item *item = &req->add_ref_items[i];
struct zvfs_blob_handle *handle = NULL;
if (item->ref_delta == 0) {
rc = -EINVAL;
continue;
}
handle = engine_cache_lookup(item->handle_id);
if (!handle) {
rc = -EBADF;
continue;
}
atomic_fetch_add(&handle->ref_count, item->ref_delta);
}
if (rc != 0) {
push_err_resp(req, rc);
return rc;
}
push_ok_resp(req);
return 0;
}
int dispatch_to_worker(struct zvfs_req *req){
switch (req->opcode)
{
case ZVFS_OP_CREATE:
return zvfs_create(req);
case ZVFS_OP_OPEN:
return zvfs_open(req);
case ZVFS_OP_READ:
return zvfs_read(req);
case ZVFS_OP_WRITE:
return zvfs_write(req);
case ZVFS_OP_RESIZE:
return zvfs_resize(req);
case ZVFS_OP_SYNC_MD:
return zvfs_sync_md(req);
case ZVFS_OP_CLOSE:
return zvfs_close(req);
case ZVFS_OP_DELETE:
return zvfs_delete(req);
case ZVFS_OP_ADD_REF:
return zvfs_add_ref(req);
case ZVFS_OP_ADD_REF_BATCH:
return zvfs_add_ref_batch(req);
default:
break;
}
/* Unknown opcode: reply with an error so the client is not left blocked. */
push_err_resp(req, -EINVAL);
return -EINVAL;
}


@@ -0,0 +1,13 @@
#ifndef __ZVFS_ENGINE_H__
#define __ZVFS_ENGINE_H__
#include "proto/ipc_proto.h"
int zvfs_engine_init(const char *bdev_name, const char *json_file, int thread_num);
int dispatch_to_worker(struct zvfs_req *req);
#endif

src/daemon/zvfs_daemon (binary executable, not shown)


@@ -1,7 +1,8 @@
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include "config.h"
#include "common/config.h"
#include "common/utils.h"
#include "fs/zvfs.h"
#include "fs/zvfs_inode.h"
@@ -10,6 +11,7 @@
#include <sys/xattr.h>
#include <sys/types.h>
#include <errno.h>
struct zvfs_fs g_fs = {0};
/* ------------------------------------------------------------------ */


@@ -67,10 +67,11 @@ void inode_remove(uint64_t blob_id) {
/* size / timestamp helpers (caller holds inode->mu) */
/* ------------------------------------------------------------------ */
void inode_update_size(struct zvfs_inode *inode, int real_fd, uint64_t new_size) {
int inode_update_size(struct zvfs_inode *inode, int real_fd, uint64_t new_size) {
inode->logical_size = new_size;
if (real_fd >= 0)
ftruncate(real_fd, (off_t)new_size); /* sync st_size, ignore errors */
return ftruncate(real_fd, (off_t)new_size); /* sync st_size; propagate errors */
return 0;
}
void inode_touch_atime(struct zvfs_inode *inode) {


@@ -49,7 +49,7 @@ void inode_remove(uint64_t blob_id);
// Update logical_size; also calls ftruncate to sync st_size.
// Caller must hold inode->mu.
void inode_update_size(struct zvfs_inode *inode, int real_fd, uint64_t new_size);
int inode_update_size(struct zvfs_inode *inode, int real_fd, uint64_t new_size);
// Update timestamps (caller must hold inode->mu).
void inode_touch_atime(struct zvfs_inode *inode);


@@ -15,7 +15,7 @@
struct zvfs_open_file *openfile_alloc(int fd,
struct zvfs_inode *inode,
int flags,
struct zvfs_blob_handle *handle)
uint64_t handle_id)
{
struct zvfs_open_file *of = calloc(1, sizeof(*of));
if (!of)
@@ -23,11 +23,10 @@ struct zvfs_open_file *openfile_alloc(int fd,
of->fd = fd;
of->inode = inode;
of->handle = handle;
of->handle_id = handle_id;
of->flags = flags;
of->fd_flags = 0;
of->offset = 0;
atomic_init(&of->ref_count, 1);
return of;
}


@@ -3,33 +3,26 @@
#include "common/uthash.h"
#include "spdk_engine/io_engine.h"
#include <stdatomic.h>
#include <stdint.h>
#ifndef SPDK_BLOB_ID_DEFINED
typedef uint64_t spdk_blob_id;
#define SPDK_BLOB_ID_DEFINED
#endif
struct zvfs_open_file {
int fd; // hash key, 1:1 with the real fd
struct zvfs_inode *inode;
struct zvfs_blob_handle *handle;
uint64_t handle_id;
int flags;
int fd_flags;
uint64_t offset; // current position for non-APPEND mode
atomic_int ref_count; // for dup / close
UT_hash_handle hh;
};
// Allocate an openfile (not inserted into the global table); ref_count starts at 1.
// Allocate an openfile (not inserted into the global table).
struct zvfs_open_file *openfile_alloc(int fd, struct zvfs_inode *inode,
int flags, struct zvfs_blob_handle *handle);
int flags, uint64_t handle_id);
// Free memory (caller must ensure ref_count == 0); does not blob_close.
// Free memory.
void openfile_free(struct zvfs_open_file *of);
// Insert into the global table (caller must hold fd_mu).


@@ -2,7 +2,8 @@
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include "config.h"
#include "common/config.h"
#include "zvfs_sys_init.h"
#include "fs/zvfs.h" // zvfs_fs_init
#include "spdk_engine/io_engine.h"
@@ -17,17 +18,6 @@ static int _init_ok = 0;
static void
do_init(void)
{
const char *bdev = getenv("ZVFS_BDEV");
if (!bdev) {
bdev = ZVFS_BDEV;
fprintf(stderr, "[zvfs] ZVFS_BDEV not set, defaulting to (%s)\n", ZVFS_BDEV);
}
if (io_engine_init(bdev) != 0) {
fprintf(stderr, "[zvfs] FATAL: io_engine_init(%s) failed\n", bdev);
abort();
}
_init_ok = 1;
}


@@ -68,9 +68,19 @@ zvfs_fcntl_impl(int fd, int cmd, va_list ap)
/* ---- dup 类 -------------------------------------------------- */
case F_DUPFD:
case F_DUPFD_CLOEXEC: {
(void)va_arg(ap, int);
errno = ENOTSUP;
int minfd = va_arg(ap, int);
int newfd = real_fcntl(fd, cmd, minfd);
if (newfd < 0)
return -1;
int new_fd_flags = (cmd == F_DUPFD_CLOEXEC) ? FD_CLOEXEC : 0;
if (zvfs_dup_attach_newfd(fd, newfd, new_fd_flags) < 0) {
int saved = errno;
real_close(newfd);
errno = saved;
return -1;
}
return newfd;
}
/* ---- 文件锁(不实现,假装无锁)-------------------------------- */


@@ -19,6 +19,91 @@
#include <pthread.h>
#include <stdio.h>
/* ------------------------------------------------------------------ */
/* Internal: path-classification helper */
/* ------------------------------------------------------------------ */
/**
 * open/openat may land under /zvfs only after following a symlink, which
 * a plain prefix check on the raw path cannot catch. So:
 *
 * 1. Check whether the raw path is under /zvfs.
 * 2. Check whether realpath() of the path is under /zvfs.
 * 3. With O_CREAT and a non-existent target, realpath() yields nothing;
 *    resolve the parent directory first, re-append the final component,
 *    and check whether the result falls under /zvfs.
 */
static int
zvfs_classify_path(const char *abspath, int may_create,
char *normalized_out, size_t out_size)
{
char resolved[PATH_MAX];
char tmp[PATH_MAX];
char parent[PATH_MAX];
char candidate[PATH_MAX];
const char *name;
char *slash;
int n;
if (!abspath || !normalized_out || out_size == 0) {
return 0;
}
strncpy(normalized_out, abspath, out_size);
normalized_out[out_size - 1] = '\0';
if (zvfs_is_zvfs_path(abspath)) {
return 1;
}
if (realpath(abspath, resolved) != NULL) {
if (zvfs_is_zvfs_path(resolved)) {
strncpy(normalized_out, resolved, out_size);
normalized_out[out_size - 1] = '\0';
return 1;
}
return 0;
}
if (!may_create) {
return 0;
}
strncpy(tmp, abspath, sizeof(tmp));
tmp[sizeof(tmp) - 1] = '\0';
slash = strrchr(tmp, '/');
if (!slash) {
return 0;
}
name = slash + 1;
if (*name == '\0') {
return 0;
}
if (slash == tmp) {
strcpy(parent, "/");
} else {
*slash = '\0';
strncpy(parent, tmp, sizeof(parent));
parent[sizeof(parent) - 1] = '\0';
}
if (realpath(parent, resolved) == NULL) {
return 0;
}
n = snprintf(candidate, sizeof(candidate), "%s/%s", resolved, name);
if (n <= 0 || (size_t)n >= sizeof(candidate)) {
return 0;
}
if (!zvfs_is_zvfs_path(candidate)) {
return 0;
}
strncpy(normalized_out, candidate, out_size);
normalized_out[out_size - 1] = '\0';
return 1;
}
/* ------------------------------------------------------------------ */
/* Internal: core open logic (path already resolved to absolute) */
/* ------------------------------------------------------------------ */
@@ -37,15 +122,14 @@ static int
zvfs_open_impl(int real_fd, const char *abspath, int flags, mode_t mode)
{
struct zvfs_inode *inode = NULL;
struct zvfs_blob_handle *handle = NULL;
uint64_t blob_id = 0;
uint64_t handle_id = 0;
if (flags & O_CREAT) {
/* ---- create path ------------------------------------------ */
/* 1. create the blob */
handle = blob_create(0);
if (!handle) {
if (blob_create(0, &blob_id, &handle_id) != 0) {
int saved = errno;
if (saved == 0) saved = EIO;
fprintf(stderr,
@@ -54,7 +138,6 @@ zvfs_open_impl(int real_fd, const char *abspath, int flags, mode_t mode)
errno = saved;
goto fail;
}
blob_id = handle->id;
/* 2. write blob_id into the real file's xattr */
if (zvfs_xattr_write_blob_id(real_fd, blob_id) < 0) goto fail;
@@ -88,8 +171,10 @@ zvfs_open_impl(int real_fd, const char *abspath, int flags, mode_t mode)
if (inode) {
/* path_cache hit: reuse the cached inode, blob_open again */
blob_id = inode->blob_id;
handle = blob_open(blob_id);
if (!handle) { if (errno == 0) errno = EIO; goto fail; }
if (blob_open(blob_id, &handle_id) != 0) {
if (errno == 0) errno = EIO;
goto fail;
}
/* shared inode: bump the reference */
atomic_fetch_add(&inode->ref_count, 1);
@@ -106,6 +191,10 @@ zvfs_open_impl(int real_fd, const char *abspath, int flags, mode_t mode)
pthread_mutex_unlock(&g_fs.inode_mu);
if (inode) {
if (blob_open(blob_id, &handle_id) != 0) {
if (errno == 0) errno = EIO;
goto fail;
}
atomic_fetch_add(&inode->ref_count, 1);
} else {
/* brand-new inode: stat the real file for mode/size */
@@ -123,15 +212,16 @@ zvfs_open_impl(int real_fd, const char *abspath, int flags, mode_t mode)
pthread_mutex_lock(&g_fs.path_mu);
path_cache_insert(abspath, inode);
pthread_mutex_unlock(&g_fs.path_mu);
if (blob_open(blob_id, &handle_id) != 0) {
if (errno == 0) errno = EIO;
goto fail;
}
}
handle = blob_open(blob_id);
if (!handle) { if (errno == 0) errno = EIO; goto fail; }
}
}
/* ---- allocate openfile, insert into fd_table ---------------- */
struct zvfs_open_file *of = openfile_alloc(real_fd, inode, flags, handle);
struct zvfs_open_file *of = openfile_alloc(real_fd, inode, flags, handle_id);
if (!of) { errno = ENOMEM; goto fail_handle; }
pthread_mutex_lock(&g_fs.fd_mu);
@@ -141,7 +231,9 @@ zvfs_open_impl(int real_fd, const char *abspath, int flags, mode_t mode)
return real_fd;
fail_handle:
blob_close(handle);
if (handle_id != 0) {
blob_close(handle_id);
}
fail:
/* if the inode was just allocated (ref_count==1), roll it back */
if (inode && atomic_load(&inode->ref_count) == 1) {
@@ -165,6 +257,10 @@ open(const char *path, int flags, ...)
{
ZVFS_HOOK_ENTER();
char abspath[PATH_MAX];
char normpath[PATH_MAX];
int is_zvfs_path = 0;
mode_t mode = 0;
if (flags & O_CREAT) {
va_list ap;
@@ -173,8 +269,13 @@ open(const char *path, int flags, ...)
va_end(ap);
}
if (zvfs_resolve_atpath(AT_FDCWD, path, abspath, sizeof(abspath)) == 0) {
is_zvfs_path = zvfs_classify_path(abspath, (flags & O_CREAT) != 0,
normpath, sizeof(normpath));
}
int ret;
if (ZVFS_IN_HOOK() || !zvfs_is_zvfs_path(path)) {
if (ZVFS_IN_HOOK() || !is_zvfs_path) {
ret = real_open(path, flags, mode);
ZVFS_HOOK_LEAVE();
return ret;
@@ -186,7 +287,7 @@ open(const char *path, int flags, ...)
int real_fd = real_open(path, flags, mode);
if (real_fd < 0) { ZVFS_HOOK_LEAVE(); return -1; }
ret = zvfs_open_impl(real_fd, path, flags, mode);
ret = zvfs_open_impl(real_fd, normpath, flags, mode);
if (ret < 0) {
int saved = errno;
real_close(real_fd);
@@ -217,6 +318,9 @@ openat(int dirfd, const char *path, int flags, ...)
{
ZVFS_HOOK_ENTER();
char normpath[PATH_MAX];
int is_zvfs_path = 0;
mode_t mode = 0;
if (flags & O_CREAT) {
va_list ap; va_start(ap, flags);
@@ -230,9 +334,11 @@ openat(int dirfd, const char *path, int flags, ...)
ZVFS_HOOK_LEAVE();
return -1;
}
is_zvfs_path = zvfs_classify_path(abspath, (flags & O_CREAT) != 0,
normpath, sizeof(normpath));
int ret;
if (ZVFS_IN_HOOK() || !zvfs_is_zvfs_path(abspath)) {
if (ZVFS_IN_HOOK() || !is_zvfs_path) {
ret = real_openat(dirfd, path, flags, mode);
ZVFS_HOOK_LEAVE();
return ret;
@@ -243,7 +349,7 @@ openat(int dirfd, const char *path, int flags, ...)
int real_fd = real_openat(dirfd, path, flags, mode);
if (real_fd < 0) { ZVFS_HOOK_LEAVE(); return -1; }
ret = zvfs_open_impl(real_fd, abspath, flags, mode);
ret = zvfs_open_impl(real_fd, normpath, flags, mode);
if (ret < 0) {
int saved = errno;
real_close(real_fd);
@@ -321,43 +427,23 @@ int __libc_open(const char *path, int flags, ...)
/* ------------------------------------------------------------------ */
/*
 * zvfs_close_impl - close logic for a zvfs fd
 *
 * Caller already holds fd_mu; it is released inside before the inode is handled.
 * zvfs_release_openfile - release the zvfs resources behind one openfile.
 * Only zvfs bookkeeping here; real_close(fd) is not called.
*/
static int
zvfs_close_impl(int fd)
zvfs_release_openfile(struct zvfs_open_file *of, int do_sync_md)
{
/* with fd_mu held, look up the openfile and remove it from the table */
pthread_mutex_lock(&g_fs.fd_mu);
struct zvfs_open_file *of = openfile_lookup(fd);
if (!of) {
pthread_mutex_unlock(&g_fs.fd_mu);
errno = EBADF;
return -1;
}
int new_ref = atomic_fetch_sub(&of->ref_count, 1) - 1;
if (new_ref == 0)
openfile_remove(fd);
pthread_mutex_unlock(&g_fs.fd_mu);
if (new_ref > 0) {
/*
 * Other dup'ed fds still reference this openfile:
 * only close the real fd; leave the blob and inode alone.
*/
return real_close(fd);
}
/* ---- openfile refcount hit zero: flush metadata, then close the blob handle ---- */
int saved_errno = 0;
struct zvfs_inode *inode = of->inode;
struct zvfs_blob_handle *handle = of->handle;
int sync_failed = 0;
uint64_t handle_id = of->handle_id;
openfile_free(of);
if (blob_sync_md(handle) < 0)
sync_failed = 1;
blob_close(handle);
if (do_sync_md && handle_id != 0 && blob_sync_md(handle_id) < 0) {
saved_errno = (errno != 0) ? errno : EIO;
}
if (handle_id != 0 && blob_close(handle_id) < 0 && saved_errno == 0) {
saved_errno = (errno != 0) ? errno : EIO;
}
/* ---- inode ref_count-- --------------------------------------- */
int inode_ref = atomic_fetch_sub(&inode->ref_count, 1) - 1;
@@ -372,8 +458,8 @@ zvfs_close_impl(int fd)
do_delete = inode->deleted;
pthread_mutex_unlock(&inode->mu);
if (do_delete)
blob_delete(inode->blob_id);
if (do_delete && blob_delete(inode->blob_id) < 0 && saved_errno == 0)
saved_errno = (errno != 0) ? errno : EIO;
pthread_mutex_lock(&g_fs.inode_mu);
inode_remove(inode->blob_id);
@@ -403,13 +489,52 @@ zvfs_close_impl(int fd)
inode_free(inode);
}
if (saved_errno != 0) {
errno = saved_errno;
return -1;
}
return 0;
}
/*
 * zvfs_detach_fd_mapping - only remove the fd -> openfile mapping and
 * release the zvfs resources. Does not call real_close(fd); used by
 * dup2/dup3 to clean up newfd's previous binding.
*/
static int
zvfs_detach_fd_mapping(int fd, int do_sync_md)
{
pthread_mutex_lock(&g_fs.fd_mu);
struct zvfs_open_file *of = openfile_lookup(fd);
if (!of) {
pthread_mutex_unlock(&g_fs.fd_mu);
errno = EBADF;
return -1;
}
openfile_remove(fd);
pthread_mutex_unlock(&g_fs.fd_mu);
return zvfs_release_openfile(of, do_sync_md);
}
/*
 * zvfs_close_impl - the zvfs path for close(fd):
 * bookkeeping first, then real_close(fd).
*/
static int
zvfs_close_impl(int fd)
{
int bk_rc = zvfs_detach_fd_mapping(fd, 1);
int bk_errno = (bk_rc < 0) ? errno : 0;
int rc = real_close(fd);
if (rc < 0)
return -1;
if (sync_failed) {
errno = EIO;
if (bk_rc < 0) {
errno = bk_errno;
return -1;
}
return 0;
}
@@ -436,6 +561,180 @@ close(int fd)
int __close(int fd) { return close(fd); }
int __libc_close(int fd) { return close(fd); }
/* ------------------------------------------------------------------ */
/* dup helper */
/* ------------------------------------------------------------------ */
int
zvfs_dup_attach_newfd(int oldfd, int newfd, int new_fd_flags)
{
struct zvfs_open_file *old_of, *new_of;
int fd_flags;
int rc;
int saved;
if (oldfd < 0 || newfd < 0) {
errno = EBADF;
return -1;
}
pthread_mutex_lock(&g_fs.fd_mu);
old_of = openfile_lookup(oldfd);
if (!old_of) {
pthread_mutex_unlock(&g_fs.fd_mu);
errno = EBADF;
return -1;
}
if (openfile_lookup(newfd) != NULL) {
pthread_mutex_unlock(&g_fs.fd_mu);
errno = EEXIST;
return -1;
}
rc = blob_add_ref(old_of->handle_id, 1);
if (rc != 0) {
pthread_mutex_unlock(&g_fs.fd_mu);
return -1;
}
new_of = openfile_alloc(newfd, old_of->inode, old_of->flags, old_of->handle_id);
if (!new_of) {
saved = (errno != 0) ? errno : ENOMEM;
(void)blob_close(old_of->handle_id);
pthread_mutex_unlock(&g_fs.fd_mu);
errno = saved;
return -1;
}
new_of->offset = old_of->offset;
fd_flags = (new_fd_flags >= 0) ? new_fd_flags : old_of->fd_flags;
new_of->fd_flags = fd_flags;
atomic_fetch_add(&old_of->inode->ref_count, 1);
openfile_insert(new_of);
pthread_mutex_unlock(&g_fs.fd_mu);
return 0;
}
static int
zvfs_add_ref_batch_or_fallback(const uint64_t *handle_ids,
const uint32_t *ref_deltas,
uint32_t count)
{
uint32_t i;
if (count == 0)
return 0;
if (blob_add_ref_batch(handle_ids, ref_deltas, count) == 0)
return 0;
for (i = 0; i < count; i++) {
if (blob_add_ref(handle_ids[i], ref_deltas[i]) != 0)
return -1;
}
return 0;
}
static void
zvfs_rollback_added_refs(const uint64_t *handle_ids, uint32_t count)
{
uint32_t i;
for (i = 0; i < count; i++) {
if (handle_ids[i] != 0)
(void)blob_close(handle_ids[i]);
}
}
static int
zvfs_snapshot_fd_handles(uint64_t **handle_ids_out,
uint32_t **ref_deltas_out,
uint32_t *count_out)
{
struct zvfs_open_file *of, *tmp;
uint32_t i = 0;
uint32_t count;
uint64_t *handle_ids = NULL;
uint32_t *ref_deltas = NULL;
*handle_ids_out = NULL;
*ref_deltas_out = NULL;
*count_out = 0;
pthread_mutex_lock(&g_fs.fd_mu);
count = (uint32_t)HASH_COUNT(g_fs.fd_table);
if (count == 0) {
pthread_mutex_unlock(&g_fs.fd_mu);
return 0;
}
handle_ids = calloc(count, sizeof(*handle_ids));
ref_deltas = calloc(count, sizeof(*ref_deltas));
if (!handle_ids || !ref_deltas) {
pthread_mutex_unlock(&g_fs.fd_mu);
free(handle_ids);
free(ref_deltas);
errno = ENOMEM;
return -1;
}
HASH_ITER(hh, g_fs.fd_table, of, tmp) {
if (i >= count)
break;
handle_ids[i] = of->handle_id;
ref_deltas[i] = 1;
i++;
}
pthread_mutex_unlock(&g_fs.fd_mu);
*handle_ids_out = handle_ids;
*ref_deltas_out = ref_deltas;
*count_out = i;
return 0;
}
static int
zvfs_snapshot_fds_in_range(unsigned int first, unsigned int last,
int **fds_out, uint32_t *count_out)
{
struct zvfs_open_file *of, *tmp;
uint32_t cap;
uint32_t n = 0;
int *fds = NULL;
*fds_out = NULL;
*count_out = 0;
pthread_mutex_lock(&g_fs.fd_mu);
cap = (uint32_t)HASH_COUNT(g_fs.fd_table);
if (cap == 0) {
pthread_mutex_unlock(&g_fs.fd_mu);
return 0;
}
fds = calloc(cap, sizeof(*fds));
if (!fds) {
pthread_mutex_unlock(&g_fs.fd_mu);
errno = ENOMEM;
return -1;
}
HASH_ITER(hh, g_fs.fd_table, of, tmp) {
if (of->fd < 0) {
continue;
}
if ((unsigned int)of->fd < first || (unsigned int)of->fd > last) {
continue;
}
fds[n++] = of->fd;
}
pthread_mutex_unlock(&g_fs.fd_mu);
*fds_out = fds;
*count_out = n;
return 0;
}
/* ------------------------------------------------------------------ */
/* close_range */
/* ------------------------------------------------------------------ */
@@ -452,32 +751,53 @@ close_range(unsigned int first, unsigned int last, int flags)
return ret;
}
if (first > last) {
errno = EINVAL;
ZVFS_HOOK_LEAVE();
return -1;
}
/*
 * Walk every fd in range: zvfs fds go through zvfs_close_impl one by one;
 * the rest are handed to real_close_range (if the kernel supports it).
 * Without kernel close_range (< 5.9), close them one at a time.
 * Only snapshot the fds that actually hit the zvfs fd_table, avoiding a
 * full [first,last] scan (last=UINT_MAX would be very slow, and the old
 * loop could wrap around).
*/
int any_err = 0;
int inited = 0;
for (unsigned int fd = first; fd <= last; fd++) {
if (zvfs_is_zvfs_fd((int)fd)) {
int *zvfs_fds = NULL;
uint32_t zvfs_fd_count = 0;
if (zvfs_snapshot_fds_in_range(first, last, &zvfs_fds, &zvfs_fd_count) < 0) {
ZVFS_HOOK_LEAVE();
return -1;
}
for (uint32_t i = 0; i < zvfs_fd_count; i++) {
if (!inited) {
zvfs_ensure_init();
inited = 1;
}
if (zvfs_close_impl((int)fd) < 0) any_err = 1;
if (zvfs_close_impl(zvfs_fds[i]) < 0) {
any_err = 1;
}
}
free(zvfs_fds);
/* let the kernel handle the remaining non-zvfs fds (flags such as CLOEXEC also take effect here) */
if (real_close_range) {
if (real_close_range(first, last, flags) < 0 && !any_err)
any_err = 1;
} else {
/* fallback: close non-zvfs fds one by one */
for (unsigned int fd = first; fd <= last; fd++) {
/* fallback: close non-zvfs fds one by one (cap the upper bound at open-max) */
unsigned int upper = last;
long open_max = sysconf(_SC_OPEN_MAX);
if (open_max > 0 && upper >= (unsigned int)open_max) {
upper = (unsigned int)open_max - 1;
}
for (unsigned int fd = first; fd <= upper; fd++) {
if (!zvfs_is_zvfs_fd((int)fd))
real_close((int)fd);
if (fd == upper)
break;
}
}
@@ -501,16 +821,26 @@ dup(int oldfd)
return ret;
}
/*
 * This version does not support dup on a zvfs fd.
 * Return ENOTSUP explicitly to avoid exposing wrong offset semantics.
*/
zvfs_ensure_init();
errno = ENOTSUP;
int newfd = real_dup(oldfd);
if (newfd < 0) {
ZVFS_HOOK_LEAVE();
return -1;
}
if (zvfs_dup_attach_newfd(oldfd, newfd, 0) < 0) {
int saved = errno;
(void)real_close(newfd);
errno = saved;
ZVFS_HOOK_LEAVE();
return -1;
}
ZVFS_HOOK_LEAVE();
return newfd;
}
/* ------------------------------------------------------------------ */
/* dup2 */
/* ------------------------------------------------------------------ */
@@ -534,11 +864,34 @@ dup2(int oldfd, int newfd)
}
zvfs_ensure_init();
errno = ENOTSUP;
int newfd_was_zvfs = zvfs_is_zvfs_fd(newfd);
int ret = real_dup2(oldfd, newfd);
if (ret < 0) {
ZVFS_HOOK_LEAVE();
return -1;
}
if (newfd_was_zvfs && zvfs_detach_fd_mapping(newfd, 1) < 0) {
int saved = errno;
(void)real_close(newfd);
errno = saved;
ZVFS_HOOK_LEAVE();
return -1;
}
if (zvfs_dup_attach_newfd(oldfd, newfd, 0) < 0) {
int saved = errno;
(void)real_close(newfd);
errno = saved;
ZVFS_HOOK_LEAVE();
return -1;
}
ZVFS_HOOK_LEAVE();
return ret;
}
/* ------------------------------------------------------------------ */
/* dup3 */
/* ------------------------------------------------------------------ */
@@ -561,8 +914,92 @@ dup3(int oldfd, int newfd, int flags)
return -1;
}
zvfs_ensure_init();
errno = ENOTSUP;
if ((flags & ~O_CLOEXEC) != 0) {
errno = EINVAL;
ZVFS_HOOK_LEAVE();
return -1;
}
zvfs_ensure_init();
int newfd_was_zvfs = zvfs_is_zvfs_fd(newfd);
int ret = real_dup3(oldfd, newfd, flags);
if (ret < 0) {
ZVFS_HOOK_LEAVE();
return -1;
}
if (newfd_was_zvfs && zvfs_detach_fd_mapping(newfd, 1) < 0) {
int saved = errno;
(void)real_close(newfd);
errno = saved;
ZVFS_HOOK_LEAVE();
return -1;
}
int fd_flags = (flags & O_CLOEXEC) ? FD_CLOEXEC : 0;
if (zvfs_dup_attach_newfd(oldfd, newfd, fd_flags) < 0) {
int saved = errno;
(void)real_close(newfd);
errno = saved;
ZVFS_HOOK_LEAVE();
return -1;
}
ZVFS_HOOK_LEAVE();
return ret;
}
/* ------------------------------------------------------------------ */
/* fork */
/* ------------------------------------------------------------------ */
pid_t
fork(void)
{
ZVFS_HOOK_ENTER();
if (ZVFS_IN_HOOK()) {
pid_t ret = real_fork();
ZVFS_HOOK_LEAVE();
return ret;
}
uint64_t *handle_ids = NULL;
uint32_t *ref_deltas = NULL;
uint32_t count = 0;
if (zvfs_snapshot_fd_handles(&handle_ids, &ref_deltas, &count) < 0) {
ZVFS_HOOK_LEAVE();
return -1;
}
if (count > 0) {
zvfs_ensure_init();
if (zvfs_add_ref_batch_or_fallback(handle_ids, ref_deltas, count) < 0) {
int saved = errno;
free(handle_ids);
free(ref_deltas);
errno = saved;
ZVFS_HOOK_LEAVE();
return -1;
}
}
pid_t ret = real_fork();
if (ret < 0) {
int saved = errno;
if (count > 0)
zvfs_rollback_added_refs(handle_ids, count);
free(handle_ids);
free(ref_deltas);
errno = saved;
ZVFS_HOOK_LEAVE();
return -1;
}
free(handle_ids);
free(ref_deltas);
ZVFS_HOOK_LEAVE();
return ret;
}


@@ -12,16 +12,17 @@
 * non-zvfs path → pass through
 *
 * close
 *   zvfs fd → openfile ref_count--
 *     on zero: blob_close; if inode->deleted: blob_delete + inode_free
 *     inode ref_count--; on zero: path_cache_remove + inode_free
 *   zvfs fd → blob_sync_md + blob_close
 *     inode ref_count--; on zero: if inode->deleted, blob_delete, then inode_free
 *   real_close
 *   non-zvfs fd → pass through
 *
 * dup / dup2 / dup3
 *   zvfs fd → new fd inserted into fd_table, openfile.ref_count++ (shared openfile)
 *     real_dup* runs as well (the kernel must also know about the new fd)
 *   zvfs fd → real_dup* + daemon ADD_REF + local openfile/inode reference upkeep
 *   non-zvfs fd → pass through
 *
 * fork
 *   before forking, ADD_REF_BATCH the zvfs handles the child will inherit
 *   (falls back to per-handle ADD_REF on failure)
*/
/* open family */
@@ -40,6 +41,10 @@ int close_range(unsigned int first, unsigned int last, int flags);
int dup(int oldfd);
int dup2(int oldfd, int newfd);
int dup3(int oldfd, int newfd, int flags);
pid_t fork(void);
/* internal helper reused by fcntl(F_DUPFD*) */
int zvfs_dup_attach_newfd(int oldfd, int newfd, int new_fd_flags);
/* glibc internal aliases (share logic with the open/close bodies; just forward) */
int __open(const char *path, int flags, ...);


@@ -114,6 +114,10 @@ extern void *(*real_mmap64)(void *addr, size_t length, int prot, int flags,
extern int (*real_munmap)(void *addr, size_t length);
extern int (*real_msync)(void *addr, size_t length, int flags);
/* process */
extern pid_t (*real_fork)(void);
extern pid_t (*real_vfork)(void);
/* glibc internal aliases */
extern int (*real___open)(const char *path, int flags, ...);


@@ -7,6 +7,7 @@
#include "fs/zvfs.h"
#include "fs/zvfs_open_file.h"
#include "fs/zvfs_inode.h"
#include "proto/ipc_proto.h"
#include "spdk_engine/io_engine.h"
#include <errno.h>
@@ -50,7 +51,7 @@ zvfs_pread_impl(struct zvfs_open_file *of,
if (count == 0)
return 0;
if (blob_read(of->handle, offset, buf, count) < 0) {
if (blob_read(of->handle_id, offset, buf, count) < 0) {
errno = EIO;
return -1;
}
@@ -74,33 +75,15 @@ zvfs_pwrite_impl(struct zvfs_open_file *of,
uint64_t end = offset + count;
/*
 * If the write range exceeds the blob's current physical size, resize first.
 * blob_resize is an SPDK-side operation (it may allocate new clusters).
*/
pthread_mutex_lock(&of->inode->mu);
uint64_t old_size = of->inode->logical_size;
pthread_mutex_unlock(&of->inode->mu);
if (end > old_size) {
if (blob_resize(of->handle, end) < 0) {
errno = EIO;
return -1;
}
}
if (blob_write(of->handle, offset, buf, count) < 0) {
errno = EIO;
if (blob_write_ex(of->handle_id, offset, buf, count, ZVFS_WRITE_F_AUTO_GROW) < 0) {
return -1;
}
/* update logical_size under the lock; inode_update_size handles the ftruncate */
if (end > old_size) {
pthread_mutex_lock(&of->inode->mu);
if (end > of->inode->logical_size) /* double-check */
inode_update_size(of->inode, of->fd, end);
pthread_mutex_unlock(&of->inode->mu);
}
return (ssize_t)count;
}
@@ -151,7 +134,7 @@ zvfs_iov_pread(struct zvfs_open_file *of,
char *tmp = malloc(total_len);
if (!tmp) { errno = ENOMEM; return -1; }
if (blob_read(of->handle, offset, tmp, total_len) < 0) {
if (blob_read(of->handle_id, offset, tmp, total_len) < 0) {
free(tmp);
errno = EIO;
return -1;
@@ -477,36 +460,16 @@ write(int fd, const void *buf, size_t count)
uint64_t write_off;
if (of->flags & O_APPEND) {
/*
 * O_APPEND: each write's position = the current logical_size (atomically).
 * Holding inode->mu makes the read-then-write sequence atomic, so two
 * O_APPEND fds writing concurrently cannot clobber each other's data.
 */
/* --- O_APPEND inline write -------------------------------------- */
/* O_APPEND: each write's position = the current logical_size. */
pthread_mutex_lock(&of->inode->mu);
write_off = of->inode->logical_size; /* re-read under the lock to avoid TOCTOU */
uint64_t end = write_off + count;
pthread_mutex_unlock(&of->inode->mu);
if (blob_resize(of->handle, end) < 0) {
errno = EIO;
ssize_t r = zvfs_pwrite_impl(of, buf, count, write_off);
if (r > 0)
of->offset = write_off + (uint64_t)r;
ZVFS_HOOK_LEAVE();
return -1;
}
if (blob_write(of->handle, write_off, buf, count) < 0) {
errno = EIO;
ZVFS_HOOK_LEAVE();
return -1;
}
pthread_mutex_lock(&of->inode->mu);
if (end > of->inode->logical_size)
inode_update_size(of->inode, of->fd, end);
pthread_mutex_unlock(&of->inode->mu);
ZVFS_HOOK_LEAVE();
return (ssize_t)count;
return r;
} else {
write_off = of->offset;
@@ -572,28 +535,14 @@ writev(int fd, const struct iovec *iov, int iovcnt)
ssize_t r;
if (of->flags & O_APPEND) {
/*
 * O_APPEND + writev: needs the same atomic sequence as write.
 * Compute the total byte count first, finish via iov_pwrite, holding
 * inode->mu for the whole sequence.
 */
size_t total_len = 0;
for (int i = 0; i < iovcnt; i++) total_len += iov[i].iov_len;
/* O_APPEND + writev: use the current logical_size as the write start. */
pthread_mutex_lock(&of->inode->mu);
uint64_t write_off = of->inode->logical_size;
uint64_t end = write_off + total_len;
pthread_mutex_unlock(&of->inode->mu);
if (blob_resize(of->handle, end) < 0) { errno = EIO; ZVFS_HOOK_LEAVE(); return -1; }
r = zvfs_iov_pwrite(of, iov, iovcnt, write_off);
if (r > 0) {
pthread_mutex_lock(&of->inode->mu);
uint64_t new_end = write_off + (uint64_t)r;
if (new_end > of->inode->logical_size)
inode_update_size(of->inode, of->fd, new_end);
pthread_mutex_unlock(&of->inode->mu);
}
if (r > 0)
of->offset = write_off + (uint64_t)r;
} else {
r = zvfs_iov_pwrite(of, iov, iovcnt, of->offset);
if (r > 0) of->offset += (uint64_t)r;


@@ -69,21 +69,21 @@ off_t lseek64(int fd, off_t offset, int whence)
/*
 * zvfs_truncate_inode_with_handle - truncate via an openfile that has a handle.
 * Find any openfile that has this inode open and take its handle.
 * zvfs_truncate_inode_with_handle - truncate via an openfile that has a handle_id.
 * Find any openfile that has this inode open and take its handle_id.
*/
static int
zvfs_truncate_inode_with_handle(struct zvfs_inode *inode,
int real_fd, uint64_t new_size)
{
/* find an openfile in fd_table that points at this inode and take its handle */
struct zvfs_blob_handle *handle = NULL;
/* find an openfile in fd_table that points at this inode and take its handle_id */
uint64_t handle_id = 0;
pthread_mutex_lock(&g_fs.fd_mu);
struct zvfs_open_file *of, *tmp;
HASH_ITER(hh, g_fs.fd_table, of, tmp) {
(void)tmp;
if (of->inode == inode) {
handle = of->handle;
handle_id = of->handle_id;
break;
}
}
@@ -93,20 +93,23 @@ zvfs_truncate_inode_with_handle(struct zvfs_inode *inode,
uint64_t old_size = inode->logical_size;
pthread_mutex_unlock(&inode->mu);
if (new_size != old_size && handle) {
if (blob_resize(handle, new_size) < 0) {
if (new_size != old_size && handle_id != 0) {
if (blob_resize(handle_id, new_size) < 0) {
errno = EIO;
return -1;
}
} else if (new_size != old_size && !handle) {
} else if (new_size != old_size && handle_id == 0) {
/*
 * The file is not open anywhere: a temporary blob_open is needed.
 * This happens when truncate(path, ...) is called but no fd exists.
*/
handle = blob_open(inode->blob_id);
if (!handle) { errno = EIO; return -1; }
int rc = blob_resize(handle, new_size);
blob_close(handle);
uint64_t temp_handle_id = 0;
if (blob_open(inode->blob_id, &temp_handle_id) < 0) {
errno = EIO;
return -1;
}
int rc = blob_resize(temp_handle_id, new_size);
blob_close(temp_handle_id);
if (rc < 0) { errno = EIO; return -1; }
}


@@ -39,7 +39,7 @@ fsync(int fd)
 * zvfs has no write buffer; data reaches SPDK storage at blob_write time.
 * Call blob_sync_md to persist the blob metadata (size, etc.).
*/
int r = blob_sync_md(of->handle);
int r = blob_sync_md(of->handle_id);
if (r < 0) errno = EIO;
ZVFS_HOOK_LEAVE();
@@ -75,7 +75,7 @@ fdatasync(int fd)
 * For zvfs, data is unbuffered; syncing the size metadata via blob_sync_md suffices.
 * Same implementation as fsync; fork the logic here if data/metadata are ever distinguished.
*/
int r = blob_sync_md(of->handle);
int r = blob_sync_md(of->handle_id);
if (r < 0) errno = EIO;
ZVFS_HOOK_LEAVE();


src/proto/ipc_proto.c (new file, 1056 lines; diff suppressed as too large)

src/proto/ipc_proto.h (new file, 265 lines)

@@ -0,0 +1,265 @@
#ifndef __IPC_PROTO_H__
#define __IPC_PROTO_H__
#include <stddef.h>
#include <stdint.h>
#ifdef __cplusplus
extern "C" {
#endif
struct zvfs_conn;
struct zvfs_blob_handle;
enum zvfs_opcode {
ZVFS_OP_CREATE = 1,
ZVFS_OP_OPEN,
ZVFS_OP_READ,
ZVFS_OP_WRITE,
ZVFS_OP_RESIZE,
ZVFS_OP_SYNC_MD,
ZVFS_OP_CLOSE,
ZVFS_OP_DELETE,
ZVFS_OP_ADD_REF,
ZVFS_OP_ADD_REF_BATCH
};
static inline const char *cast_opcode2string(uint32_t op)
{
switch (op) {
case ZVFS_OP_CREATE: return "CREATE";
case ZVFS_OP_OPEN: return "OPEN";
case ZVFS_OP_READ: return "READ";
case ZVFS_OP_WRITE: return "WRITE";
case ZVFS_OP_RESIZE: return "RESIZE";
case ZVFS_OP_SYNC_MD: return "SYNC";
case ZVFS_OP_CLOSE: return "CLOSE";
case ZVFS_OP_DELETE: return "DELETE";
case ZVFS_OP_ADD_REF: return "ADD_REF";
case ZVFS_OP_ADD_REF_BATCH: return "ADD_REF_BATCH";
default: return "UNKNOWN";
}
}
#define ZVFS_WRITE_F_AUTO_GROW (1u << 0)
/* minimal fixed header (sync/blocking scenario; no request_id) */
struct zvfs_req_header {
uint32_t opcode;
uint32_t payload_len;
};
struct zvfs_resp_header {
uint32_t opcode;
int32_t status;
uint32_t payload_len;
};
/* -------------------- per-op request body -------------------- */
struct zvfs_req_create_body {
uint64_t size_hint;
};
struct zvfs_req_open_body {
uint64_t blob_id;
};
struct zvfs_req_read_body {
uint64_t handle_id;
uint64_t offset;
uint64_t length;
};
struct zvfs_req_write_body {
uint64_t handle_id;
uint64_t offset;
uint64_t length;
uint32_t flags;
const void *data;
};
struct zvfs_req_resize_body {
uint64_t handle_id;
uint64_t new_size;
};
struct zvfs_req_sync_md_body {
uint64_t handle_id;
};
struct zvfs_req_close_body {
uint64_t handle_id;
};
struct zvfs_req_delete_body {
uint64_t blob_id;
};
struct zvfs_add_ref_item {
uint64_t handle_id;
uint32_t ref_delta;
};
struct zvfs_req_add_ref_body {
uint64_t handle_id;
uint32_t ref_delta;
};
struct zvfs_req_add_ref_batch_body {
uint32_t item_count;
const struct zvfs_add_ref_item *items;
};
/* -------------------- per-op response body -------------------- */
struct zvfs_resp_create_body {
uint64_t blob_id;
uint64_t handle_id;
};
struct zvfs_resp_open_body {
uint64_t handle_id;
uint64_t size;
};
struct zvfs_resp_read_body {
uint64_t length;
void *data;
};
struct zvfs_resp_write_body {
uint64_t bytes_written;
};
/* resize/sync_md/close/delete have an empty body on success */
size_t zvfs_serialize_resp_resize(uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_resp_resize(const uint8_t *buf, size_t buf_len);
size_t zvfs_serialize_resp_sync_md(uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_resp_sync_md(const uint8_t *buf, size_t buf_len);
size_t zvfs_serialize_resp_close(uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_resp_close(const uint8_t *buf, size_t buf_len);
size_t zvfs_serialize_resp_delete(uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_resp_delete(const uint8_t *buf, size_t buf_len);
/* -------------------- legacy-compatible req/resp -------------------- */
struct zvfs_req {
uint32_t opcode;
uint64_t size_hint;
uint64_t blob_id;
uint64_t handle_id;
uint64_t offset;
uint64_t length;
uint32_t write_flags;
void *data;
uint32_t ref_delta;
uint32_t add_ref_count;
struct zvfs_add_ref_item *add_ref_items;
struct zvfs_conn *conn;
struct zvfs_blob_handle *handle;
};
struct zvfs_resp {
uint32_t opcode;
int32_t status;
uint64_t blob_id;
uint64_t handle_id;
uint64_t size;
uint64_t length;
void *data;
uint64_t bytes_written;
struct zvfs_conn *conn;
};
/* -------------------- header serialize/deserialize -------------------- */
size_t zvfs_serialize_req_header(const struct zvfs_req_header *header, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_req_header(const uint8_t *buf, size_t buf_len, struct zvfs_req_header *header);
size_t zvfs_serialize_resp_header(const struct zvfs_resp_header *header, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_resp_header(const uint8_t *buf, size_t buf_len, struct zvfs_resp_header *header);
/* -------------------- request body serialization/deserialization -------------------- */
size_t zvfs_serialize_req_create(const struct zvfs_req_create_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_req_create(const uint8_t *buf, size_t buf_len, struct zvfs_req_create_body *body);
size_t zvfs_serialize_req_open(const struct zvfs_req_open_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_req_open(const uint8_t *buf, size_t buf_len, struct zvfs_req_open_body *body);
size_t zvfs_serialize_req_read(const struct zvfs_req_read_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_req_read(const uint8_t *buf, size_t buf_len, struct zvfs_req_read_body *body);
size_t zvfs_serialize_req_write(const struct zvfs_req_write_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_req_write(const uint8_t *buf, size_t buf_len, struct zvfs_req_write_body *body);
size_t zvfs_serialize_req_resize(const struct zvfs_req_resize_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_req_resize(const uint8_t *buf, size_t buf_len, struct zvfs_req_resize_body *body);
size_t zvfs_serialize_req_sync_md(const struct zvfs_req_sync_md_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_req_sync_md(const uint8_t *buf, size_t buf_len, struct zvfs_req_sync_md_body *body);
size_t zvfs_serialize_req_close(const struct zvfs_req_close_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_req_close(const uint8_t *buf, size_t buf_len, struct zvfs_req_close_body *body);
size_t zvfs_serialize_req_delete(const struct zvfs_req_delete_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_req_delete(const uint8_t *buf, size_t buf_len, struct zvfs_req_delete_body *body);
size_t zvfs_serialize_req_add_ref(const struct zvfs_req_add_ref_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_req_add_ref(const uint8_t *buf, size_t buf_len, struct zvfs_req_add_ref_body *body);
size_t zvfs_serialize_req_add_ref_batch(const struct zvfs_req_add_ref_batch_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_req_add_ref_batch(const uint8_t *buf, size_t buf_len, struct zvfs_req_add_ref_batch_body *body);
/* -------------------- response body serialization/deserialization -------------------- */
size_t zvfs_serialize_resp_create(const struct zvfs_resp_create_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_resp_create(const uint8_t *buf, size_t buf_len, struct zvfs_resp_create_body *body);
size_t zvfs_serialize_resp_open(const struct zvfs_resp_open_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_resp_open(const uint8_t *buf, size_t buf_len, struct zvfs_resp_open_body *body);
size_t zvfs_serialize_resp_read(const struct zvfs_resp_read_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_resp_read(const uint8_t *buf, size_t buf_len, struct zvfs_resp_read_body *body);
size_t zvfs_serialize_resp_write(const struct zvfs_resp_write_body *body, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_resp_write(const uint8_t *buf, size_t buf_len, struct zvfs_resp_write_body *body);
/* -------------------- legacy wrappers -------------------- */
size_t zvfs_serialize_req(struct zvfs_req *req, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_req(uint8_t *buf, size_t buf_len, struct zvfs_req *req);
size_t zvfs_serialize_resp(struct zvfs_resp *resp, uint8_t *buf, size_t buf_len);
size_t zvfs_deserialize_resp(uint8_t *buf, size_t buf_len, struct zvfs_resp *resp);
#ifdef __cplusplus
}
#endif
#endif


@@ -2,42 +2,20 @@
#define __ZVFS_IO_ENGINE_H__
#include <stdint.h>
#include <sys/types.h>
#include <spdk/blob.h>
#include <stddef.h>
// blob_handle: low-level blob state; the file-level size is maintained by the upper layer
typedef struct zvfs_blob_handle {
spdk_blob_id id;
struct spdk_blob *blob;
uint64_t size;
void *dma_buf;
uint64_t dma_buf_size;
} zvfs_blob_handle_t;
int io_engine_init(void);
typedef struct zvfs_spdk_io_engine {
struct spdk_bs_dev *bs_dev;
struct spdk_blob_store *bs;
struct spdk_thread *md_thread;
uint64_t io_unit_size;
uint64_t cluster_size;
int reactor_count;
} zvfs_spdk_io_engine_t;
typedef struct zvfs_tls_ctx {
struct spdk_thread *thread;
struct spdk_io_channel *channel;
} zvfs_tls_ctx_t;
int io_engine_init(const char *bdev_name);
struct zvfs_blob_handle *blob_create(uint64_t size_hint); // create and open; returns a handle
struct zvfs_blob_handle *blob_open(uint64_t blob_id); // open an existing blob; returns a handle
int blob_write(struct zvfs_blob_handle *handle, uint64_t offset, const void *buf, size_t len);
int blob_read(struct zvfs_blob_handle *handle, uint64_t offset, void *buf, size_t len);
int blob_resize(struct zvfs_blob_handle *handle, uint64_t new_size);
int blob_sync_md(struct zvfs_blob_handle *handle);
int blob_close(struct zvfs_blob_handle *handle); // close the blob* held by this handle
int blob_delete(uint64_t blob_id); // delete the whole blob; no handle required
int blob_create(uint64_t size_hint, uint64_t *blob_id_out, uint64_t *handle_id_out);
int blob_open(uint64_t blob_id, uint64_t *handle_id_out);
int blob_write_ex(uint64_t handle_id, uint64_t offset, const void *buf, size_t len, uint32_t write_flags);
int blob_write(uint64_t handle_id, uint64_t offset, const void *buf, size_t len);
int blob_read(uint64_t handle_id, uint64_t offset, void *buf, size_t len);
int blob_resize(uint64_t handle_id, uint64_t new_size);
int blob_sync_md(uint64_t handle_id);
int blob_close(uint64_t handle_id);
int blob_delete(uint64_t blob_id);
int blob_add_ref(uint64_t handle_id, uint32_t ref_delta);
int blob_add_ref_batch(const uint64_t *handle_ids, const uint32_t *ref_deltas, uint32_t count);
#endif // __ZVFS_IO_ENGINE_H__


@@ -7,7 +7,7 @@
"method": "bdev_malloc_create",
"params": {
"name": "Malloc0",
"num_blocks": 262140,
"num_blocks": 1048576,
"block_size": 512
}
}


@@ -1,4 +1,4 @@
SUBDIRS := ioengine_test hook
SUBDIRS := hook_test daemon_test
.PHONY: all clean $(SUBDIRS)


@@ -0,0 +1,12 @@
BIN_DIR := $(abspath $(CURDIR)/../bin)
PROTO_DIR := $(abspath $(CURDIR)/../../src/proto)
CFLAGS := -I$(abspath $(CURDIR)/../../src)
all:
gcc -g -o $(BIN_DIR)/ipc_echo_test ipc_echo_test.c
gcc -g $(CFLAGS) -o $(BIN_DIR)/ipc_zvfs_test ipc_zvfs_test.c $(PROTO_DIR)/ipc_proto.c
clean:
rm -rf $(BIN_DIR)/ipc_echo_test $(BIN_DIR)/ipc_zvfs_test


@@ -0,0 +1,33 @@
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
int main(void)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }
    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/zvfs.sock", sizeof(addr.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }
    const char *msg = "hello reactor\n";
    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); close(fd); return 1; }
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0) { perror("read"); close(fd); return 1; }
    printf("echo: %.*s\n", (int)n, buf);
    close(fd);
    return 0;
}


@@ -0,0 +1,265 @@
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>
#include "proto/ipc_proto.h"
#define SOCKET_PATH "/tmp/zvfs.sock"
#define BUF_SIZE 4096
int connect_to_server() {
int fd = socket(AF_UNIX, SOCK_STREAM, 0);
if (fd < 0) {
perror("socket");
return -1;
}
struct sockaddr_un addr;
memset(&addr, 0, sizeof(addr));
addr.sun_family = AF_UNIX;
strncpy(addr.sun_path, SOCKET_PATH, sizeof(addr.sun_path)-1);
if (connect(fd, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
perror("connect");
close(fd);
return -1;
}
return fd;
}
// -------------------- operation helpers --------------------
void do_create(int fd) {
struct zvfs_req req;
memset(&req, 0, sizeof(req));
req.opcode = ZVFS_OP_CREATE;
req.size_hint = 1024; // 1KB
uint8_t buf[BUF_SIZE];
size_t n = zvfs_serialize_req(&req, buf, sizeof(buf));
if (n == 0) { fprintf(stderr,"serialize failed\n"); return; }
if (write(fd, buf, n) != n) { perror("write"); return; }
uint8_t resp_buf[BUF_SIZE];
ssize_t r = read(fd, resp_buf, sizeof(resp_buf));
if (r <= 0) { perror("read"); return; }
struct zvfs_resp resp;
memset(&resp, 0, sizeof(resp));
size_t consumed = zvfs_deserialize_resp(resp_buf, r, &resp);
if (consumed == 0) { fprintf(stderr, "deserialize failed\n"); return; }
printf("Received CREATE response: status=%d, blob_id=%lu, handle_id=%lu\n",
resp.status, resp.blob_id, resp.handle_id);
if(resp.data) free(resp.data);
}
void do_open(int fd, uint64_t blob_id) {
struct zvfs_req req;
memset(&req,0,sizeof(req));
req.opcode = ZVFS_OP_OPEN;
req.blob_id = blob_id;
uint8_t buf[BUF_SIZE];
size_t n = zvfs_serialize_req(&req, buf, sizeof(buf));
if (n == 0) { fprintf(stderr,"serialize failed\n"); return; }
if (write(fd, buf, n) != n) { perror("write"); return; }
uint8_t resp_buf[BUF_SIZE];
ssize_t r = read(fd, resp_buf, sizeof(resp_buf));
if (r <= 0) { perror("read"); return; }
struct zvfs_resp resp;
memset(&resp,0,sizeof(resp));
size_t consumed = zvfs_deserialize_resp(resp_buf, r, &resp);
if (consumed == 0) { fprintf(stderr, "deserialize failed\n"); return; }
printf("Received OPEN response: status=%d, handle_id=%lu, size=%lu\n",
resp.status, resp.handle_id, resp.size);
if(resp.data) free(resp.data);
}
void do_read(int fd, uint64_t handle_id, uint64_t offset, uint64_t length) {
struct zvfs_req req;
memset(&req,0,sizeof(req));
req.opcode = ZVFS_OP_READ;
req.handle_id = handle_id;
req.offset = offset;
req.length = length;
uint8_t buf[BUF_SIZE];
size_t n = zvfs_serialize_req(&req, buf, sizeof(buf));
if (n == 0) { fprintf(stderr,"serialize failed\n"); return; }
if (write(fd, buf, n) != n) { perror("write"); return; }
uint8_t resp_buf[BUF_SIZE];
ssize_t r = read(fd, resp_buf, sizeof(resp_buf));
if (r <= 0) { perror("read"); return; }
struct zvfs_resp resp;
memset(&resp,0,sizeof(resp));
size_t consumed = zvfs_deserialize_resp(resp_buf, r, &resp);
if (consumed == 0) { fprintf(stderr, "deserialize failed\n"); return; }
printf("Received READ response: status=%d, length=%lu\n",
resp.status, resp.length);
if(resp.data) {
printf("Data: ");
for(size_t i=0;i<resp.length;i++)
printf("%02x ", ((uint8_t*)resp.data)[i]);
printf("\n");
free(resp.data);
}
}
void do_write(int fd, uint64_t handle_id, uint64_t offset,
const char *data, size_t len, uint32_t write_flags) {
struct zvfs_req req;
memset(&req,0,sizeof(req));
req.opcode = ZVFS_OP_WRITE;
req.handle_id = handle_id;
req.offset = offset;
req.length = len;
req.write_flags = write_flags;
req.data = (void*)data;
uint8_t buf[BUF_SIZE];
size_t n = zvfs_serialize_req(&req, buf, sizeof(buf));
if (n == 0) { fprintf(stderr,"serialize failed\n"); return; }
if (write(fd, buf, n) != n) { perror("write"); return; }
uint8_t resp_buf[BUF_SIZE];
ssize_t r = read(fd, resp_buf, sizeof(resp_buf));
if (r <= 0) { perror("read"); return; }
struct zvfs_resp resp;
memset(&resp,0,sizeof(resp));
size_t consumed = zvfs_deserialize_resp(resp_buf, r, &resp);
if (consumed == 0) { fprintf(stderr, "deserialize failed\n"); return; }
printf("Received WRITE response: status=%d, bytes_written=%lu\n",
resp.status, resp.bytes_written);
if(resp.data) free(resp.data);
}
void do_close(int fd, uint64_t handle_id) {
struct zvfs_req req;
memset(&req,0,sizeof(req));
req.opcode = ZVFS_OP_CLOSE;
req.handle_id = handle_id;
uint8_t buf[BUF_SIZE];
size_t n = zvfs_serialize_req(&req, buf, sizeof(buf));
if (n == 0) { fprintf(stderr,"serialize failed\n"); return; }
if (write(fd, buf, n) != n) { perror("write"); return; }
uint8_t resp_buf[BUF_SIZE];
ssize_t r = read(fd, resp_buf, sizeof(resp_buf));
if (r <= 0) { perror("read"); return; }
struct zvfs_resp resp;
memset(&resp,0,sizeof(resp));
size_t consumed = zvfs_deserialize_resp(resp_buf, r, &resp);
if (consumed == 0) { fprintf(stderr, "deserialize failed\n"); return; }
printf("Received CLOSE response: status=%d\n", resp.status);
if(resp.data) free(resp.data);
}
void do_delete(int fd, uint64_t blob_id) {
struct zvfs_req req;
memset(&req,0,sizeof(req));
req.opcode = ZVFS_OP_DELETE;
req.blob_id = blob_id;
uint8_t buf[BUF_SIZE];
size_t n = zvfs_serialize_req(&req, buf, sizeof(buf));
if (n == 0) { fprintf(stderr,"serialize failed\n"); return; }
if (write(fd, buf, n) != n) { perror("write"); return; }
uint8_t resp_buf[BUF_SIZE];
ssize_t r = read(fd, resp_buf, sizeof(resp_buf));
if (r <= 0) { perror("read"); return; }
struct zvfs_resp resp;
memset(&resp,0,sizeof(resp));
size_t consumed = zvfs_deserialize_resp(resp_buf, r, &resp);
if (consumed == 0) { fprintf(stderr, "deserialize failed\n"); return; }
printf("Received DELETE response: status=%d\n", resp.status);
if(resp.data) free(resp.data);
}
void do_resize(int fd, uint64_t handle_id, uint64_t new_size) {
struct zvfs_req req;
memset(&req,0,sizeof(req));
req.opcode = ZVFS_OP_RESIZE;
req.handle_id = handle_id;
req.size_hint = new_size;
uint8_t buf[BUF_SIZE];
size_t n = zvfs_serialize_req(&req, buf, sizeof(buf));
if (n == 0) { fprintf(stderr,"serialize failed\n"); return; }
if (write(fd, buf, n) != n) { perror("write"); return; }
uint8_t resp_buf[BUF_SIZE];
ssize_t r = read(fd, resp_buf, sizeof(resp_buf));
if (r <= 0) { perror("read"); return; }
struct zvfs_resp resp;
memset(&resp,0,sizeof(resp));
size_t consumed = zvfs_deserialize_resp(resp_buf, r, &resp);
if (consumed == 0) { fprintf(stderr, "deserialize failed\n"); return; }
printf("Received RESIZE response: status=%d\n", resp.status);
if(resp.data) free(resp.data);
}
// -------------------- main --------------------
int main() {
int fd = connect_to_server();
if(fd < 0) return 1;
printf("Connected to server at %s\n", SOCKET_PATH);
printf("Commands:\n create\n open <blob>\n read <handle> <offset> <len>\n write <handle> <offset> <data>\n writeg <handle> <offset> <data>\n close <handle>\n delete <blob>\n resize <handle> <size>\n quit\n");
char line[256];
while (1) {
printf("> ");
if(!fgets(line, sizeof(line), stdin)) break;
char cmd[32];
uint64_t a,b,c;
char data[128];
if (sscanf(line, "%31s", cmd) != 1) continue;
if (strcmp(cmd,"quit")==0) break;
else if (strcmp(cmd,"create")==0) do_create(fd);
else if (strcmp(cmd,"open")==0 && sscanf(line,"%*s %lu",&a)==1) do_open(fd,a);
else if (strcmp(cmd,"read")==0 && sscanf(line,"%*s %lu %lu %lu",&a,&b,&c)==3) do_read(fd,a,b,c);
else if (strcmp(cmd,"write")==0 && sscanf(line,"%*s %lu %lu %127s",&a,&b,data)==3)
do_write(fd, a, b, data, strlen(data), 0);
else if (strcmp(cmd,"writeg")==0 && sscanf(line,"%*s %lu %lu %127s",&a,&b,data)==3)
do_write(fd, a, b, data, strlen(data), ZVFS_WRITE_F_AUTO_GROW);
else if (strcmp(cmd,"close")==0 && sscanf(line,"%*s %lu",&a)==1) do_close(fd,a);
else if (strcmp(cmd,"delete")==0 && sscanf(line,"%*s %lu",&a)==1) do_delete(fd,a);
else if (strcmp(cmd,"resize")==0 && sscanf(line,"%*s %lu %lu",&a,&b)==2) do_resize(fd,a,b);
else printf("Unknown or invalid command\n");
}
close(fd);
return 0;
}


@@ -1,43 +0,0 @@
# SPDX-License-Identifier: BSD-3-Clause
SPDK_ROOT_DIR := $(abspath $(CURDIR)/../../spdk)
include $(SPDK_ROOT_DIR)/mk/spdk.common.mk
include $(SPDK_ROOT_DIR)/mk/spdk.modules.mk
include $(SPDK_ROOT_DIR)/mk/spdk.app_vars.mk
# output directory
BIN_DIR := $(abspath $(CURDIR)/../bin)
TEST_BINS := \
ioengine_single_blob_test \
ioengine_multi_blob_test \
ioengine_same_blob_mt_test
COMMON_SRCS := \
test_common.c \
../../src/spdk_engine/io_engine.c \
../../src/common/utils.c
SPDK_LIB_LIST = $(ALL_MODULES_LIST) event event_bdev
LIBS += $(SPDK_LIB_LINKER_ARGS)
CFLAGS += -I$(abspath $(CURDIR)/../../src) -I$(CURDIR)
.PHONY: all clean
all: $(BIN_DIR) $(addprefix $(BIN_DIR)/,$(TEST_BINS))
# create the bin directory
$(BIN_DIR):
mkdir -p $(BIN_DIR)
$(BIN_DIR)/ioengine_single_blob_test: ioengine_single_blob_test.c $(COMMON_SRCS) $(SPDK_LIB_FILES) $(ENV_LIBS)
$(CC) $(CFLAGS) -o $@ $< $(COMMON_SRCS) $(LDFLAGS) $(LIBS) $(ENV_LDFLAGS) $(SYS_LIBS)
$(BIN_DIR)/ioengine_multi_blob_test: ioengine_multi_blob_test.c $(COMMON_SRCS) $(SPDK_LIB_FILES) $(ENV_LIBS)
$(CC) $(CFLAGS) -o $@ $< $(COMMON_SRCS) $(LDFLAGS) $(LIBS) $(ENV_LDFLAGS) $(SYS_LIBS)
$(BIN_DIR)/ioengine_same_blob_mt_test: ioengine_same_blob_mt_test.c $(COMMON_SRCS) $(SPDK_LIB_FILES) $(ENV_LIBS)
$(CC) $(CFLAGS) -o $@ $< $(COMMON_SRCS) $(LDFLAGS) $(LIBS) $(ENV_LDFLAGS) $(SYS_LIBS)
clean:
rm -f $(addprefix $(BIN_DIR)/,$(TEST_BINS))


@@ -1,106 +0,0 @@
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "spdk_engine/io_engine.h"
#include "test_common.h"
#define MULTI_BLOB_COUNT 3
int main(void) {
int rc = 0;
const char *bdev_name = getenv("SPDK_BDEV_NAME");
struct zvfs_blob_handle *handles[MULTI_BLOB_COUNT] = {0};
uint64_t ids[MULTI_BLOB_COUNT] = {0};
uint64_t cluster = 0;
void *wbuf = NULL;
void *rbuf = NULL;
int i = 0;
if (!bdev_name) {
bdev_name = "Malloc0";
}
if (io_engine_init(bdev_name) != 0) {
fprintf(stderr, "TEST2: io_engine_init failed (bdev=%s)\n", bdev_name);
return 1;
}
printf("[TEST2] single thread / multi blob\n");
handles[0] = blob_create(0);
if (!handles[0]) {
fprintf(stderr, "TEST2: create first blob failed\n");
return 1;
}
ids[0] = handles[0]->id;
cluster = handles[0]->size;
if (cluster == 0) {
fprintf(stderr, "TEST2: invalid cluster size\n");
rc = 1;
goto out;
}
if (blob_resize(handles[0], cluster * 2) != 0) {
fprintf(stderr, "TEST2: resize first blob failed\n");
rc = 1;
goto out;
}
for (i = 1; i < MULTI_BLOB_COUNT; i++) {
handles[i] = blob_create(cluster * 2);
if (!handles[i]) {
fprintf(stderr, "TEST2: create blob %d failed\n", i);
rc = 1;
goto out;
}
ids[i] = handles[i]->id;
}
if (alloc_aligned_buf(&wbuf, cluster) != 0 || alloc_aligned_buf(&rbuf, cluster) != 0) {
fprintf(stderr, "TEST2: alloc aligned buffer failed\n");
rc = 1;
goto out;
}
for (i = 0; i < MULTI_BLOB_COUNT; i++) {
fill_pattern((uint8_t *)wbuf, cluster, (uint8_t)(0x20 + i));
memset(rbuf, 0, cluster);
if (blob_write(handles[i], 0, wbuf, cluster) != 0) {
fprintf(stderr, "TEST2: blob_write[%d] failed\n", i);
rc = 1;
goto out;
}
if (blob_read(handles[i], 0, rbuf, cluster) != 0) {
fprintf(stderr, "TEST2: blob_read[%d] failed\n", i);
rc = 1;
goto out;
}
if (memcmp(wbuf, rbuf, cluster) != 0) {
fprintf(stderr, "TEST2: blob[%d] readback mismatch\n", i);
rc = 1;
goto out;
}
}
out:
for (i = 0; i < MULTI_BLOB_COUNT; i++) {
if (handles[i]) {
(void)blob_close(handles[i]);
}
}
for (i = 0; i < MULTI_BLOB_COUNT; i++) {
if (ids[i] != 0) {
(void)blob_delete(ids[i]);
}
}
free(wbuf);
free(rbuf);
if (rc == 0) {
printf("[TEST2] PASS\n");
return 0;
}
printf("[TEST2] FAIL\n");
return 1;
}


@@ -1,147 +0,0 @@
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "spdk_engine/io_engine.h"
#include "test_common.h"
#define THREAD_COUNT 4
struct mt_case_arg {
struct zvfs_blob_handle *handle;
uint64_t cluster_size;
uint64_t offset;
uint8_t seed;
pthread_barrier_t *barrier;
int rc;
};
static void *mt_case_worker(void *arg) {
struct mt_case_arg *ctx = (struct mt_case_arg *)arg;
void *wbuf = NULL;
void *rbuf = NULL;
if (alloc_aligned_buf(&wbuf, ctx->cluster_size) != 0 ||
alloc_aligned_buf(&rbuf, ctx->cluster_size) != 0) {
free(wbuf);
free(rbuf);
ctx->rc = 1;
return NULL;
}
fill_pattern((uint8_t *)wbuf, ctx->cluster_size, ctx->seed);
(void)pthread_barrier_wait(ctx->barrier);
if (blob_write(ctx->handle, ctx->offset, wbuf, ctx->cluster_size) != 0) {
ctx->rc = 1;
goto out;
}
if (blob_read(ctx->handle, ctx->offset, rbuf, ctx->cluster_size) != 0) {
ctx->rc = 1;
goto out;
}
if (memcmp(wbuf, rbuf, ctx->cluster_size) != 0) {
ctx->rc = 1;
goto out;
}
ctx->rc = 0;
out:
free(wbuf);
free(rbuf);
return NULL;
}
int main(void) {
int rc = 0;
const char *bdev_name = getenv("SPDK_BDEV_NAME");
int i = 0;
struct zvfs_blob_handle *h = NULL;
uint64_t blob_id = 0;
uint64_t cluster = 0;
pthread_t tids[THREAD_COUNT];
struct mt_case_arg args[THREAD_COUNT];
pthread_barrier_t barrier;
int barrier_inited = 0;
if (!bdev_name) {
bdev_name = "Malloc0";
}
if (io_engine_init(bdev_name) != 0) {
fprintf(stderr, "TEST3: io_engine_init failed (bdev=%s)\n", bdev_name);
return 1;
}
printf("[TEST3] multi thread / same blob\n");
h = blob_create(0);
if (!h) {
fprintf(stderr, "TEST3: blob_create failed\n");
return 1;
}
blob_id = h->id;
cluster = h->size;
if (cluster == 0) {
fprintf(stderr, "TEST3: invalid cluster size\n");
rc = 1;
goto out;
}
if (blob_resize(h, cluster * THREAD_COUNT) != 0) {
fprintf(stderr, "TEST3: blob_resize failed\n");
rc = 1;
goto out;
}
if (pthread_barrier_init(&barrier, NULL, THREAD_COUNT) != 0) {
fprintf(stderr, "TEST3: barrier init failed\n");
rc = 1;
goto out;
}
barrier_inited = 1;
for (i = 0; i < THREAD_COUNT; i++) {
args[i].handle = h;
args[i].cluster_size = cluster;
args[i].offset = cluster * (uint64_t)i;
args[i].seed = (uint8_t)(0x40 + i);
args[i].barrier = &barrier;
args[i].rc = 1;
if (pthread_create(&tids[i], NULL, mt_case_worker, &args[i]) != 0) {
fprintf(stderr, "TEST3: pthread_create[%d] failed\n", i);
rc = 1;
while (--i >= 0) {
pthread_join(tids[i], NULL);
}
goto out;
}
}
for (i = 0; i < THREAD_COUNT; i++) {
pthread_join(tids[i], NULL);
if (args[i].rc != 0) {
fprintf(stderr, "TEST3: worker[%d] failed\n", i);
rc = 1;
}
}
out:
if (barrier_inited) {
(void)pthread_barrier_destroy(&barrier);
}
if (h) {
(void)blob_close(h);
}
if (blob_id != 0) {
(void)blob_delete(blob_id);
}
if (rc == 0) {
printf("[TEST3] PASS\n");
return 0;
}
printf("[TEST3] FAIL\n");
return 1;
}


@@ -1,136 +0,0 @@
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "spdk_engine/io_engine.h"
#include "test_common.h"
int main(void) {
int rc = 0;
const char *bdev_name = getenv("SPDK_BDEV_NAME");
struct zvfs_blob_handle *h = NULL;
struct zvfs_blob_handle *reopen = NULL;
uint64_t blob_id = 0;
uint64_t cluster = 0;
void *wbuf = NULL;
void *rbuf = NULL;
if (!bdev_name) {
bdev_name = "Malloc0";
}
if (io_engine_init(bdev_name) != 0) {
fprintf(stderr, "TEST1: io_engine_init failed (bdev=%s)\n", bdev_name);
return 1;
}
printf("[TEST1] single thread / single blob\n");
h = blob_create(0);
if (!h) {
fprintf(stderr, "TEST1: blob_create failed\n");
return 1;
}
blob_id = h->id;
cluster = h->size;
if (cluster == 0) {
fprintf(stderr, "TEST1: invalid cluster size\n");
rc = 1;
goto out;
}
rc = blob_resize(h, cluster * 2);
if (rc != 0) {
fprintf(stderr, "TEST1: blob_resize failed: %d\n", rc);
rc = 1;
goto out;
}
rc = alloc_aligned_buf(&wbuf, cluster);
if (rc != 0) {
fprintf(stderr, "TEST1: alloc write buf failed: %d\n", rc);
rc = 1;
goto out;
}
rc = alloc_aligned_buf(&rbuf, cluster);
if (rc != 0) {
fprintf(stderr, "TEST1: alloc read buf failed: %d\n", rc);
rc = 1;
goto out;
}
fill_pattern((uint8_t *)wbuf, cluster, 0x11);
rc = blob_write(h, 0, wbuf, cluster);
if (rc != 0) {
fprintf(stderr, "TEST1: blob_write failed: %d\n", rc);
rc = 1;
goto out;
}
rc = blob_read(h, 0, rbuf, cluster);
if (rc != 0) {
fprintf(stderr, "TEST1: blob_read failed: %d\n", rc);
rc = 1;
goto out;
}
if (memcmp(wbuf, rbuf, cluster) != 0) {
fprintf(stderr, "TEST1: readback mismatch\n");
rc = 1;
goto out;
}
rc = blob_sync_md(h);
if (rc != 0) {
fprintf(stderr, "TEST1: blob_sync_md failed: %d\n", rc);
rc = 1;
goto out;
}
rc = blob_close(h);
if (rc != 0) {
fprintf(stderr, "TEST1: blob_close failed: %d\n", rc);
rc = 1;
goto out;
}
h = NULL;
reopen = blob_open(blob_id);
if (!reopen) {
fprintf(stderr, "TEST1: blob_open(reopen) failed\n");
rc = 1;
goto out;
}
memset(rbuf, 0, cluster);
rc = blob_read(reopen, 0, rbuf, cluster);
if (rc != 0) {
fprintf(stderr, "TEST1: reopen blob_read failed: %d\n", rc);
rc = 1;
goto out;
}
if (memcmp(wbuf, rbuf, cluster) != 0) {
fprintf(stderr, "TEST1: reopen readback mismatch\n");
rc = 1;
goto out;
}
out:
if (reopen) {
(void)blob_close(reopen);
}
if (h) {
(void)blob_close(h);
}
if (blob_id != 0) {
(void)blob_delete(blob_id);
}
free(wbuf);
free(rbuf);
if (rc == 0) {
printf("[TEST1] PASS\n");
return 0;
}
printf("[TEST1] FAIL\n");
return 1;
}


@@ -1,20 +0,0 @@
#include "test_common.h"
#include <stdlib.h>
#include <string.h>
int alloc_aligned_buf(void **buf, size_t len) {
int rc = posix_memalign(buf, 4096, len);
if (rc != 0) {
return -rc;
}
memset(*buf, 0, len);
return 0;
}
void fill_pattern(uint8_t *buf, size_t len, uint8_t seed) {
size_t i = 0;
for (i = 0; i < len; i++) {
buf[i] = (uint8_t)(seed + (uint8_t)i);
}
}


@@ -1,10 +0,0 @@
#ifndef __IOENGINE_TEST_COMMON_H__
#define __IOENGINE_TEST_COMMON_H__
#include <stddef.h>
#include <stdint.h>
int alloc_aligned_buf(void **buf, size_t len);
void fill_pattern(uint8_t *buf, size_t len, uint8_t seed);
#endif // __IOENGINE_TEST_COMMON_H__
