ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
Hello everyone,
I recently set up a 250TB storage system for a file server. I did some speed testing a few weeks back and everything looked great; now that the first production data is on the system, I'm not 100% happy with the performance. I should mention that I got some consulting for the general setup of the system, but since their last invoice I've been trying to avoid asking them for more help - we had quite a discussion about what they charged me.
Hardware: Xeon Silver 4214 + 256GB RAM + 2x 480GB SSD for the OS + 20x 14TB SAS HDD + 1x 6.4TB Samsung PM1735 SSD
It's a single pool with RAIDZ2, and the SSD is currently split 50% for SLOG and 50% for L2ARC.
There are no databases running and 80%+ of the files are between 10MB and 200MB, so generally a lot of small files.
I'm aware of the risk of a non-redundant SLOG: if this disk/partition dies, we will lose data if the power fails during writes.
I've not tested anything with sync=always yet.
Our network connection is 10G fiber, so I'd love to get close to the network's limit when writing to the storage.
Internally I get about 150-200MB/s when moving files. I was expecting/hoping for more.
Any suggestions on how to improve the setup?
Thanks in advance
sappel
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
I can't help you with ZFS, but it should be a *lot* faster. I'm running LizardFS on 1 Gbit/s networks and get 700-900 MB/s file transfer rates.
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
By 150-200MB/s I mean 150-200 MByte/s; x8 = 1,200-1,600 Mbit/s, so we are above the 1 GBit/s limit. But it is still way too slow - I had tests with 700-800 MByte/s, which is closer to 10 GBit/s.
And the above speed of 150-200MByte/s was internally, without any network.
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
He's running a 10G fiber network.
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
First of all, I personally would not have recommended using an SSD at all, with that amount of RAM available.
Returning to performance, the very first thing to do is to understand whether it is an "internal" problem or an "external" one.
Try writing large amounts of incompressible data directly to the machine.
There you will see the "maximum" speed that you can realistically achieve.
I hardly use ZFS on Debian, but I would really test the speed first WITHOUT the network.
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
ok, thanks for your reply.
So my first attempts:
Code:
root@bps-storage:/tank/data2# dd if=/dev/zero of=test.img bs=1M count=8k
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 5.96411 s, 1.4 GB/s
root@bps-storage:/tank/data2# dd if=/dev/zero of=test.img bs=8k count=512k
524288+0 records in
524288+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 4.58741 s, 936 MB/s
root@bps-storage:/tank/data2# dd if=/dev/zero of=test.img bs=4k count=512k
524288+0 records in
524288+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 3.41273 s, 629 MB/s
root@bps-storage:/tank/data2# dd if=/dev/zero of=test.img bs=128k count=512k
^C343403+0 records in
343403+0 records out
45010518016 bytes (45 GB, 42 GiB) copied, 34.7899 s, 1.3 GB/s
Is that a sufficient test in your understanding? If I do speed tests with real-world files (e.g. 150GB of approx. 120MB TIFF files), I'm not sure how best to measure the speed. Any ideas?
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
Well.. no
No, because at 99% probability you have ZFS compression enabled (do you? what kind?).
Do not use zero, use random.
As you can see, "zero" is way too easy for ZFS.
Code:
root@aserver:/temporaneo/ugo # dd if=/dev/zero of=test.img bs=1M count=8k
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 2.097144 secs (4096015943 bytes/sec)
root@aserver:/temporaneo/ugo #
Code:
root@aserver:/temporaneo/ugo # dd if=/dev/urandom of=test.img bs=1M count=8k
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 79.356579 secs (108244769 bytes/sec)
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
And the output of the following, please:
Code:
zfs get all /tank/data2
zfs list
zpool status
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
I already thought /dev/random might be better..
Thanks for your support! Anyway:
Code:
bps-storage:/tank/data3# zfs get all /tank/data2
NAME PROPERTY VALUE SOURCE
tank/data2 type filesystem -
tank/data2 creation Thu Jun 2 10:33 2022 -
tank/data2 used 13.9T -
tank/data2 available 36.1T -
tank/data2 referenced 11.8T -
tank/data2 compressratio 1.00x -
tank/data2 mounted yes -
tank/data2 quota 50T local
tank/data2 reservation none default
tank/data2 recordsize 128K default
tank/data2 mountpoint /tank/data2 default
tank/data2 sharenfs off default
tank/data2 checksum on default
tank/data2 compression off default
tank/data2 atime on default
tank/data2 devices on default
tank/data2 exec on default
tank/data2 setuid on default
tank/data2 readonly off default
tank/data2 zoned off default
tank/data2 snapdir hidden default
tank/data2 aclmode discard default
tank/data2 aclinherit restricted default
tank/data2 createtxg 468426 -
tank/data2 canmount on default
tank/data2 xattr sa local
tank/data2 copies 1 default
tank/data2 version 5 -
tank/data2 utf8only off -
tank/data2 normalization none -
tank/data2 casesensitivity mixed -
tank/data2 vscan off default
tank/data2 nbmand off default
tank/data2 sharesmb off local
tank/data2 refquota none default
tank/data2 refreservation none default
tank/data2 guid 13804224936539947431 -
tank/data2 primarycache all default
tank/data2 secondarycache all default
tank/data2 usedbysnapshots 2.09T -
tank/data2 usedbydataset 11.8T -
tank/data2 usedbychildren 0B -
tank/data2 usedbyrefreservation 0B -
tank/data2 logbias latency default
tank/data2 objsetid 1284 -
tank/data2 dedup off default
tank/data2 mlslabel none default
tank/data2 sync standard default
tank/data2 dnodesize auto local
tank/data2 refcompressratio 1.00x -
tank/data2 written 9.00T -
tank/data2 logicalused 13.8T -
tank/data2 logicalreferenced 11.8T -
tank/data2 volmode default default
tank/data2 filesystem_limit none default
tank/data2 snapshot_limit none default
tank/data2 filesystem_count none default
tank/data2 snapshot_count none default
tank/data2 snapdev hidden default
tank/data2 acltype off default
tank/data2 context none default
tank/data2 fscontext none default
tank/data2 defcontext none default
tank/data2 rootcontext none default
tank/data2 relatime off default
tank/data2 redundant_metadata all default
tank/data2 overlay on default
tank/data2 encryption off default
tank/data2 keylocation none default
tank/data2 keyformat none default
tank/data2 pbkdf2iters 0 default
tank/data2 special_small_blocks 0 default
tank/data2 org.debian:periodic-trim enable inherited from tank
Code:
bps-storage:/tank/data3# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 179T 40.1T 299K /tank
tank/team 768K 5.00T 277K /tank/team
tank/data1 36.8T 40.1T 31.6T /tank/data1
tank/data2 13.9T 36.1T 11.8T /tank/data2
tank/data3 77.9T 40.1T 77.9T /tank/data3
tank/testdata 50.6T 40.1T 24.4T /tank/testdata
Code:
bps-storage:/tank/data3# zpool status
pool: tank
state: ONLINE
scan: scrub in progress since Sun Aug 14 00:24:02 2022
151T scanned at 1.09G/s, 137T issued at 1017M/s, 179T total
0B repaired, 76.66% done, 11:59:11 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sdc ONLINE 0 0 0
scsi-35000c500d84a8b3b ONLINE 0 0 0
scsi-35000c500d84aefbf ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
sdh ONLINE 0 0 0
sdi ONLINE 0 0 0
scsi-35000c500d84c065f ONLINE 0 0 0
scsi-35000c500d84b490b ONLINE 0 0 0
scsi-35000c500d849b517 ONLINE 0 0 0
scsi-35000c500d848a82b ONLINE 0 0 0
sdn ONLINE 0 0 0
sdo ONLINE 0 0 0
sdp ONLINE 0 0 0
scsi-35000c500d84b41bf ONLINE 0 0 0
scsi-35000c500d84aeab7 ONLINE 0 0 0
sds ONLINE 0 0 0
scsi-35000c500d849ada3 ONLINE 0 0 0
scsi-35000c500d84b1203 ONLINE 0 0 0
sdv ONLINE 0 0 0
logs
nvme0n1p1 ONLINE 0 0 0
cache
nvme0n1p2 ONLINE 0 0 0
errors: No known data errors
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
You are welcome.
1) A scrub is running, so performance is degraded. Wait until it has completed.
2) You have no compression; I suggest turning it on (LZ4) - see the example below.
3) I suggest turning atime off.
4) Look at how big the ARC is (sorry, I use FreeBSD, so I'm not sure where to check on Debian).
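For example (a minimal sketch using the pool/dataset names from the output above; note that compression only applies to data written after it is enabled):
Code:
zfs set compression=lz4 tank
zfs set atime=off tank
# verify that the child datasets inherit the new values
zfs get compression,atime tank/data2
# ARC size on Debian can be checked with arc_summary or straight from the kstats
grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats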
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
The scrub is done now; I'll run some checks again later today.
Will look into compression and atime later as well.
ARC output:
Code:
root@bps-storage:/tank# arc_summary -s arc
------------------------------------------------------------------------
ZFS Subsystem Report Tue Aug 16 03:18:51 2022
Linux 5.10.0-14-amd64 2.0.3-9
Machine: bps-storage.bps.local (x86_64) 2.0.3-9
ARC status: HEALTHY
Memory throttle count: 0
ARC size (current): 100.0 % 125.3 GiB
Target size (adaptive): 100.0 % 125.2 GiB
Min size (hard limit): 6.2 % 7.8 GiB
Max size (high water): 16:1 125.2 GiB
Most Frequently Used (MFU) cache size: 40.0 % 40.2 GiB
Most Recently Used (MRU) cache size: 60.0 % 60.3 GiB
Metadata cache size (hard limit): 75.0 % 93.9 GiB
Metadata cache size (current): 53.0 % 49.8 GiB
Dnode cache size (hard limit): 10.0 % 9.4 GiB
Dnode cache size (current): 99.3 % 9.3 GiB
ARC hash breakdown:
Elements max: 29.3M
Elements current: 95.4 % 27.9M
Collisions: 1.5G
Chain max: 11
Chains: 6.8M
ARC misc:
Deleted: 2.0G
Mutex misses: 12.1k
Eviction skips: 9.9M
root@bps-storage:/tank#
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
OK, just a little update:
I'll enable compression later.
Is disabling atime of any use? My files are not particularly slow, and I guess I/O is not the big issue in my scenario - what do you think?
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
It's an R/W access the system doesn't have to do. In all my years of working with Debian, I have never come across a situation where I wanted to know what the last access timestamp of a file was. So I always disable atime on all my file systems, local or networked.
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
Some random 2c (IANAE):
dd is a terrible benchmark. dd from /dev/zero is worse than terrible.
Enable compression, there's really no downside with modern CPUs.
Disable atime unless you actually need it.
Experiment with disabling sync, especially if using NFS.
Experiment with recordsize. The 128k default is just a "doesn't suck too hard at anything" compromise. I find 1M pretty good for general-purpose fileserver workloads myself, but YMMV (a quick sketch follows after this list).
I'd call 20 drives in a single RAIDZ2 vdev too wide for optimal performance, but that's my speed/space/redundancy tradeoff and only you can find yours.
SV: RAIDZ IOPS are limited by the slowest drive in a vdev, but vdevs stripe. Resilvers on wide RAIDZ vdevs can also be pretty slow, during which time both redundancy and performance are compromised.
L2ARC on an SSD may or may not be a good idea. I have no experience with that particular Samsung drive, but I can say that L2ARC will kill most consumer and light-commercial SSDs fairly quickly while providing questionable benefit.
Ideally you want something like optane here... Yes, I know, I want a pony too.
Referencing drives by /dev/sd[x] makes me cringe a little. Maybe that's just me being weird, but using /dev/disk/by-id or /dev/disk/by-path really helps identifying physical units if nothing else.
Presumably you have read the OpenZFS wiki as a starting point?
There are plenty of performance questions masquerading as "issues" on ZoL's github as well, and the ZoL guys don't seem to mind. It's worth a browse if nothing else.
This is all just bikeshedding without proper benchmarks of course, and I suspect fcorbelli knows more on this topic than I. I'll be watching in the hopes I learn something too.
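A quick sketch of the recordsize experiment mentioned above (using the tank/data2 dataset from earlier in the thread; like compression, it only affects newly written files, so re-copy a test set afterwards to compare):
Code:
zfs get recordsize tank/data2
zfs set recordsize=1M tank/data2
# new writes now use records up to 1M; existing files keep their old record size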
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
Sorry everyone for not replying - I was unexpectedly away for a week.
Thanks for the further tips + hints.
I'll disable atime later. Right now I also feel like 20 drives in a single RAIDZ2 vdev is too wide, but that was actually the recommendation of the consulting company. I might go for 2x 14 or something similar if we decide to upgrade at some point.
I'm aware of the potential issue with the L2ARC on SSD. I chose an SSD rated for 3 drive writes per day for starters. Let's see how it performs.
My goal is/was a storage setup that maxes out our 10G fiber as well as possible with 5-10 clients writing data to the storage or reading from it - 90% of the time only big 100-200MB media files.
Regarding referencing /dev/sdX vs /dev/disk/by-id: the system actually set it up that way when I first created the pool using only sdX. I read (IIRC in a Proxmox forum) that nowadays, with modern udev, it doesn't make any difference?
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
The by-id and by-path nodes are just symlinks to /dev/sd[x] anyway (thanks udev), so it makes no difference in terms of functionality what you use. I find /dev/sd[x] a bit fat-finger prone when dealing with a large number of drives is all, that and it's an extra step to figure out which tray to pull when you want to hotswap a disk.
A bit of naughty ls parsing works for "which physical drive is /dev/sd[x]" questions as well, e.g.
Code:
alias diskids='ls -l /dev/disk/by-id/ata-* | grep -v part | awk "{print \$11,\$9;}" | cut -d"/" -f3- | sort'
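If you ever want the pool itself to show by-id names in zpool status, exporting and re-importing it by id should do the trick (just a sketch - the pool must be idle for the export, so do it in a maintenance window):
Code:
zpool export tank
zpool import -d /dev/disk/by-id tank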
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
One more thing I forgot to ask:
If using dd with /dev/zero or /dev/random is bad for performance testing, can anyone suggest a better way to check read/write performance, both for internal-only transfers and over external connections? fio?
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
Sure.
Piping from /dev/zero falls into the trap of being infinitely compressible, and piping from /dev/urandom will probably just end up benchmarking the random number generator and CPU.
Creating a large file from /dev/urandom and then using that for benchmarking (controlling for caching of course) would be better, but personally I'd use tools designed for the job like FIO or IOZone.
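Something like this is roughly where I'd start with fio for a sequential test on the pool (only a sketch - directory, sizes and job counts are guesses; --refill_buffers keeps the data from being trivially compressible):
Code:
fio --name=seqwrite --directory=/tank/data2 --rw=write --bs=1M --size=8G --numjobs=4 --ioengine=posixaio --refill_buffers --group_reporting
# for the read side, use a data set larger than your ARC or you'll mostly be measuring RAM
fio --name=seqread --directory=/tank/data2 --rw=read --bs=1M --size=8G --numjobs=4 --ioengine=posixaio --group_reporting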
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
Thanks for your reply steve_v.
I'm frankly disappointed looking at the fio results.
I feel a bit undereducated, and seriously badly advised, the more I read into the topic of write speeds.
Is it correct that an NVMe ZIL/SLOG doesn't make any sense for asynchronous writes? So there is no way to speed up asynchronous writes? Would more RAM help? God, I feel stupid right now.
Here is what I get:
Code:
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=128k --numjobs=16 --size=256m --iodepth=16 --runtime=60 --time_based --sync=1
Run status group 0 (all jobs):
WRITE: bw=292MiB/s (306MB/s), 17.0MiB/s-18.4MiB/s (18.8MB/s-19.3MB/s), io=17.4GiB (18.7GB), run=61162-61166msec
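For comparison, the same job without --sync=1 would measure asynchronous writes - something to try next:
Code:
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=128k --numjobs=16 --size=256m --iodepth=16 --runtime=60 --time_based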
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
As far as I am aware, and from my own experience (on rather wussier hardware), that is correct. I don't think that fio run was async though...
The primary purpose of the ZIL (whether a dedicated SLOG device or spread amongst the data disks) is to protect synchronous writes while they're sitting in RAM waiting to be flushed to disk, such that they aren't lost if the power goes down.
Asynchronous writes return right away whether they've actually been flushed to stable storage or not, so the ZIL isn't involved at all.
For synchronous writes however, a fast SLOG device can make a massive difference to performance because ZIL operations aren't causing write-duplication and preempting other activity on the pool. That's a big deal if you're e.g. storing databases or serving NFS.
Here's a pretty easy to grok writeup that's worth a read, my explaining tends to suck kinda hard.
Don't sweat it, there's a reason "The ZIL and SLOG are two of the most misunderstood concepts in ZFS". The first time I built a pool I put SLOG and L2ARC drives in it... Then pulled them again because all my writes were async and all my cache hits were coming from ARC in RAM.
To properly see the impact of your SLOG, you'll want to play with fio in both sync and async mode (dense manual warning, or google-foo cheat-sheet like here), and/or compare "sync" write performance with sync=always vs sync=standard vs sync=disabled on the filesystem.
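A rough sketch of that comparison on the dataset from this thread (sync=disabled is for benchmarking only - it tells applications their data is on stable storage before it actually is):
Code:
zfs get sync tank/data2
zfs set sync=always tank/data2      # every write goes through the ZIL/SLOG
zfs set sync=disabled tank/data2    # ZIL bypassed entirely - benchmark only
zfs set sync=standard tank/data2    # back to the default when you're done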
I don't have a comparable pool to play with (kinda wussy, remember ), so if you want something to compare to you'll want to go fishing for someone with deeper pockets than mine. Shouldn't be too hard to find though.