postgres \c 调用 fscanf 导致的 bug。

This commit is contained in:
2026-03-14 18:52:28 +08:00
parent cf13b0228e
commit 1732163cbf
15 changed files with 107 additions and 175 deletions

172
README.md
View File

@@ -113,15 +113,15 @@ App (PostgreSQL / RocksDB / db_bench / pgbench)
### 4.1 构建
```bash
cd /home/lian/try/zvfs
git submodule update --init --recursive
cd zvfs
git submodule update --init --recursive --progress
cd spdk
./scripts/pkgdep.sh
./configure --with-shared
make -j"$(nproc)"
cd /home/lian/try/zvfs
cd ..
make -j"$(nproc)"
make test -j"$(nproc)"
```
@@ -135,7 +135,7 @@ make test -j"$(nproc)"
### 4.2 启动 daemon
```bash
cd /home/lian/try/zvfs
cd zvfs
./src/daemon/zvfs_daemon
```
@@ -181,109 +181,75 @@ LD_PRELOAD=./src/libzvfs.so ZVFS_TEST_ROOT=/zvfs ./tests/bin/hook_api_test
### 5.4 fio
```shell
root@ubuntu20:/home/lian/try/zvfs# fio ./fio_script/psync.fio
root@ubuntu20-129:/home/lian/share/zvfs# LD_PRELOAD=/home/lian/share/zvfs/src/libzvfs.so fio ./fio_script/zvfs.fio
test: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=64
fio-3.16
Starting 1 thread
Jobs: 1 (f=1): [w(1)][100.0%][w=53.0MiB/s][w=3455 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=23035: Fri Mar 13 13:25:32 2026
test: Laying out IO file (1 file / 0MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=13.4MiB/s][w=857 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=16519: Sat Mar 14 14:11:27 2026
Description : ["variable bs"]
write: IOPS=3644, BW=56.9MiB/s (59.7MB/s)(570MiB/10001msec); 0 zone resets
clat (usec): min=127, max=3496, avg=272.07, stdev=91.62
lat (usec): min=128, max=3497, avg=272.41, stdev=91.69
write: IOPS=829, BW=12.0MiB/s (13.6MB/s)(130MiB/10001msec); 0 zone resets
clat (usec): min=778, max=4000, avg=1199.89, stdev=377.74
lat (usec): min=779, max=4001, avg=1200.38, stdev=377.78
clat percentiles (usec):
| 1.00th=[ 155], 5.00th=[ 169], 10.00th=[ 184], 20.00th=[ 202],
| 30.00th=[ 225], 40.00th=[ 249], 50.00th=[ 262], 60.00th=[ 277],
| 70.00th=[ 293], 80.00th=[ 322], 90.00th=[ 371], 95.00th=[ 420],
| 99.00th=[ 545], 99.50th=[ 611], 99.90th=[ 881], 99.95th=[ 1467],
| 99.99th=[ 2409]
bw ( KiB/s): min=49376, max=70387, per=99.97%, avg=58295.30, stdev=6828.37, samples=20
iops : min= 3086, max= 4399, avg=3643.40, stdev=426.71, samples=20
lat (usec) : 250=41.44%, 500=56.78%, 750=1.59%, 1000=0.10%
lat (msec) : 2=0.07%, 4=0.01%
cpu : usr=1.82%, sys=37.19%, ctx=54169, majf=0, minf=0
| 1.00th=[ 848], 5.00th=[ 898], 10.00th=[ 922], 20.00th=[ 955],
| 30.00th=[ 979], 40.00th=[ 1004], 50.00th=[ 1029], 60.00th=[ 1074],
| 70.00th=[ 1221], 80.00th=[ 1500], 90.00th=[ 1614], 95.00th=[ 1975],
| 99.00th=[ 2606], 99.50th=[ 2966], 99.90th=[ 3359], 99.95th=[ 3425],
| 99.99th=[ 4015]
bw ( KiB/s): min=10048, max=15520, per=99.91%, avg=13258.32, stdev=1465.96, samples=19
iops : min= 628, max= 970, avg=828.63, stdev=91.62, samples=19
lat (usec) : 1000=38.46%
lat (msec) : 2=56.79%, 4=4.74%, 10=0.01%
cpu : usr=5.27%, sys=0.00%, ctx=8499, majf=0, minf=7
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,36448,0,0 short=0,0,0,0 dropped=0,0,0,0
issued rwts: total=0,8295,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=56.9MiB/s (59.7MB/s), 56.9MiB/s-56.9MiB/s (59.7MB/s-59.7MB/s), io=570MiB (597MB), run=10001-10001msec
WRITE: bw=12.0MiB/s (13.6MB/s), 12.0MiB/s-12.0MiB/s (13.6MB/s-13.6MB/s), io=130MiB (136MB), run=10001-10001msec
Disk stats (read/write):
dm-0: ios=122/36085, merge=0/0, ticks=32/7748, in_queue=7780, util=99.13%, aggrios=109/36455, aggrmerge=13/12, aggrticks=28/7260, aggrin_queue=0, aggrutil=98.95%
sda: ios=109/36455, merge=13/12, ticks=28/7260, in_queue=0, util=98.95%
root@ubuntu20:/home/lian/try/zvfs# LD_PRELOAD=/home/lian/try/zvfs/src/libzvfs.so fio ./fio_script/zvfs.fio
test: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=64
fio-3.16
Starting 1 thread
Jobs: 1 (f=1): [w(1)][100.0%][w=10.2MiB/s][w=650 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=23891: Fri Mar 13 13:26:54 2026
Description : ["variable bs"]
write: IOPS=521, BW=8345KiB/s (8545kB/s)(81.5MiB/10001msec); 0 zone resets
clat (usec): min=529, max=52465, avg=1909.16, stdev=1181.14
lat (usec): min=530, max=52467, avg=1909.72, stdev=1181.21
clat percentiles (usec):
| 1.00th=[ 734], 5.00th=[ 922], 10.00th=[ 1037], 20.00th=[ 1237],
| 30.00th=[ 1418], 40.00th=[ 1500], 50.00th=[ 1614], 60.00th=[ 1860],
| 70.00th=[ 2024], 80.00th=[ 2311], 90.00th=[ 3130], 95.00th=[ 3982],
| 99.00th=[ 5669], 99.50th=[ 5932], 99.90th=[ 6456], 99.95th=[ 6849],
| 99.99th=[52691]
bw ( KiB/s): min= 4704, max=11200, per=99.95%, avg=8339.75, stdev=2577.19, samples=20
iops : min= 294, max= 700, avg=521.10, stdev=161.03, samples=20
lat (usec) : 750=1.23%, 1000=7.07%
lat (msec) : 2=60.10%, 4=26.84%, 10=4.74%, 100=0.02%
cpu : usr=0.00%, sys=3.84%, ctx=5461, majf=0, minf=7
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,5216,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=8345KiB/s (8545kB/s), 8345KiB/s-8345KiB/s (8545kB/s-8545kB/s), io=81.5MiB (85.5MB), run=10001-10001msec
Disk stats (read/write):
dm-0: ios=39/7287, merge=0/0, ticks=12/7888, in_queue=7900, util=1.72%, aggrios=39/478, aggrmerge=0/6818, aggrticks=12/352, aggrin_queue=0, aggrutil=1.70%
sda: ios=39/478, merge=0/6818, ticks=12/352, in_queue=0, util=1.70%
sda: ios=0/118, merge=0/104, ticks=0/66, in_queue=67, util=0.24%
```
#### psync
```shell
root@ubuntu20:/home/lian/try/zvfs# fio ./fio_script/psync.fio
root@ubuntu20-129:/home/lian/share/zvfs# fio ./fio_script/psync.fio
test: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=64
fio-3.16
Starting 1 thread
Jobs: 1 (f=1): [w(1)][100.0%][w=53.0MiB/s][w=3455 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=23035: Fri Mar 13 13:25:32 2026
Jobs: 1 (f=1): [w(1)][100.0%][w=39.5MiB/s][w=2528 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=16831: Sat Mar 14 14:13:39 2026
Description : ["variable bs"]
write: IOPS=3644, BW=56.9MiB/s (59.7MB/s)(570MiB/10001msec); 0 zone resets
clat (usec): min=127, max=3496, avg=272.07, stdev=91.62
lat (usec): min=128, max=3497, avg=272.41, stdev=91.69
write: IOPS=2468, BW=38.6MiB/s (40.4MB/s)(386MiB/10001msec); 0 zone resets
clat (usec): min=184, max=16636, avg=401.35, stdev=153.93
lat (usec): min=184, max=16636, avg=401.86, stdev=153.98
clat percentiles (usec):
| 1.00th=[ 155], 5.00th=[ 169], 10.00th=[ 184], 20.00th=[ 202],
| 30.00th=[ 225], 40.00th=[ 249], 50.00th=[ 262], 60.00th=[ 277],
| 70.00th=[ 293], 80.00th=[ 322], 90.00th=[ 371], 95.00th=[ 420],
| 99.00th=[ 545], 99.50th=[ 611], 99.90th=[ 881], 99.95th=[ 1467],
| 99.99th=[ 2409]
bw ( KiB/s): min=49376, max=70387, per=99.97%, avg=58295.30, stdev=6828.37, samples=20
iops : min= 3086, max= 4399, avg=3643.40, stdev=426.71, samples=20
lat (usec) : 250=41.44%, 500=56.78%, 750=1.59%, 1000=0.10%
lat (msec) : 2=0.07%, 4=0.01%
cpu : usr=1.82%, sys=37.19%, ctx=54169, majf=0, minf=0
| 1.00th=[ 247], 5.00th=[ 273], 10.00th=[ 289], 20.00th=[ 318],
| 30.00th=[ 338], 40.00th=[ 355], 50.00th=[ 375], 60.00th=[ 400],
| 70.00th=[ 429], 80.00th=[ 469], 90.00th=[ 545], 95.00th=[ 611],
| 99.00th=[ 783], 99.50th=[ 873], 99.90th=[ 1106], 99.95th=[ 1287],
| 99.99th=[ 1942]
bw ( KiB/s): min=31136, max=51008, per=100.00%, avg=39550.32, stdev=4617.82, samples=19
iops : min= 1946, max= 3188, avg=2471.79, stdev=288.55, samples=19
lat (usec) : 250=1.32%, 500=83.72%, 750=13.61%, 1000=1.11%
lat (msec) : 2=0.23%, 4=0.01%, 20=0.01%
cpu : usr=2.83%, sys=28.58%, ctx=24699, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,36448,0,0 short=0,0,0,0 dropped=0,0,0,0
issued rwts: total=0,24690,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=56.9MiB/s (59.7MB/s), 56.9MiB/s-56.9MiB/s (59.7MB/s-59.7MB/s), io=570MiB (597MB), run=10001-10001msec
WRITE: bw=38.6MiB/s (40.4MB/s), 38.6MiB/s-38.6MiB/s (40.4MB/s-40.4MB/s), io=386MiB (405MB), run=10001-10001msec
Disk stats (read/write):
dm-0: ios=122/36085, merge=0/0, ticks=32/7748, in_queue=7780, util=99.13%, aggrios=109/36455, aggrmerge=13/12, aggrticks=28/7260, aggrin_queue=0, aggrutil=98.95%
sda: ios=109/36455, merge=13/12, ticks=28/7260, in_queue=0, util=98.95%
sda: ios=0/24384, merge=0/4, ticks=0/7764, in_queue=7764, util=99.10%
```
### 5.5 pgbench
@@ -329,39 +295,22 @@ tps = 92.206743 (excluding connections establishing)
## 6. 关键工程难点与踩坑复盘(重点)
这一节是项目最有价值的部分,记录了从“能跑”到“可用于数据库 workload”过程中遇到的关键问题。
### SPDK 元数据回调线程模型
### 6.1 SPDK 元数据回调线程模型
问题:把 metadata 操作随意派发到任意线程容易卡住或回调不回来。metadata的回调默认派发给初始化blobstore的线程。
问题:把 metadata 操作随意派发到任意线程,容易卡住或回调不回来
blobstore metadata 操作与创建线程/通道绑定
根因:
需要明确 metadata thread 和 io thread 分工。
- blobstore metadata 操作与创建线程/通道绑定。
- `resize/delete/unload` 内部会走 `spdk_for_each_channel()` barrier。
### resize 导致程序卡死
- `resize/delete/unload` 内部会走 `spdk_for_each_channel()` barrier。大概是让其他的spdk线程同步resize之后的状态才能返回。所以如果要做要尽可能少做resize。
修复策略:
- 明确 metadata thread 和 io thread 分工。
如果其他spdk线程持有iochannel并且没有持续poll就会导致卡死。
- 保证持有 channel 的线程持续 poll。
- 线程退出时严格释放 channel避免 barrier 永久等待。
### 6.2 Daemon 卡住(请求已收但流程停滞)
现象:请求日志打印到一半后卡住,压测进程阻塞。
根因:
- UDS 流式读取没有完整分帧处理。
- 固定小缓冲导致回包序列化失败(`serialize resp failed`)。
修复:
- 改为连接级接收缓冲,循环读到 `EAGAIN`
- 按“完整包”消费,残包保留到下一轮。
- 回包序列化改为动态缓冲 + `send_all`
### 6.3 PostgreSQL Tablespace 无法命中 Hook
### PostgreSQL Tablespace 无法命中 Hook
现象:建表空间后文件操作路径是 `pg_tblspc/...`daemon 无请求日志。
@@ -375,7 +324,7 @@ tps = 92.206743 (excluding connections establishing)
- 路径判定增加 `realpath()` 后再判断。
- `O_CREAT` 且文件尚不存在时,使用 `realpath(parent)+basename` 判定。
### 6.4 PostgreSQL 报 `Permission denied`(跨用户连接 daemon
### PostgreSQL 报 `Permission denied`(跨用户连接 daemon
现象:`CREATE DATABASE ... TABLESPACE ...` 报权限错误。
@@ -388,7 +337,7 @@ tps = 92.206743 (excluding connections establishing)
- daemon `bind` 后显式 `chmod(socket, 0666)`
### 6.5 PostgreSQL 报 `Message too long`
### PostgreSQL 报 `Message too long`
现象:部分 SQL尤其 `CREATE DATABASE` 路径)失败,错误为 `Message too long`
@@ -405,7 +354,7 @@ tps = 92.206743 (excluding connections establishing)
- 在 client `blob_write_ex` 做透明分片发送(保持同步阻塞语义)。
### 6.6 dup/dup2/fork 语义一致性
### dup/dup2/fork 语义一致性
问题:多个 fd 指向同一 open file description 时,如何保证 handle 引用计数一致。
@@ -415,6 +364,19 @@ tps = 92.206743 (excluding connections establishing)
- 在 hook 中对 `dup/dup2/dup3/fork` 明确执行引用增加。
- `close_range` 增加边界保护(避免 `UINT_MAX` 场景死循环)。
### pg \c 调用链条
1. psql \c ...
2. fork backend
3. InitPostgres -> ValidatePgVersion ...
4. libc fopen
5. libc 内部 -> __IO_file_fopen -> _IO_file_open -> __open64
6. kernel openat
7. fscanf
8. libc __isoc99_fscanf -> \_IO\_* -> __read
9. kernel read
glic 内部调用走的不是 动态符号定位可能是一些隐藏别名。可能会绕过hook。需要拦截非常多变体
---
## 7. 当前限制与下一步