ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
Hello everyone,
I recently set up a 250TB storage system for a file server. I did some speed testing a few weeks back and everything looked great; now that the first production data is on the system, I'm not 100% happy with the performance. I should mention that I got some consulting for the general setup of the system, but since their last invoice I've been trying to avoid asking them for more help - we had quite a discussion about what they charged me.
Hardware: Xeon Silver 4214 + 256GB RAM + 2x 480GB SSD for the OS + 20x 14TB SAS HDD + 1x 6.4TB Samsung PM1735 SSD
It's a single pool with RAIDZ2, and the SSD is currently split 50% for SLOG and 50% for L2ARC.
There are no databases running and 80%+ of the files are between 10MB and 200MB, so generally a lot of small files.
I'm aware of the risk of a non-redundant SLOG: if this disk/partition dies, we will lose data if the power fails during writes.
I've not tested anything with sync=always yet.
Our network connection is 10G fiber, so I'd love to get close to the network's limit when writing to the storage.
Internally I get about 150-200MB/s when moving files. I was expecting/hoping for more.
Any suggestions on how to improve the setup?
Thanks in advance
sappel
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
I can't help you with ZFS, but it should be a *lot* faster. I'm running LizardFS on 1 Gbit/s networks and get 700-900 MB/s file transfer rates.
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
By 150-200MB/s I mean 150-200 MByte/s; x8 = 1,200-1,600 Mbit/s, so we are above the 1 GBit/s limit. But it is still way too slow - I had tests with 700-800 MByte/s, which is closer to 10 GBit/s.
And the above speed of 150-200MByte/s was internally, without any network.
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
He's running a 10G fiber network.
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
First of all, I personally would not have recommended using an SSD at all, with that amount of RAM available.
Returning to performance, the very first thing to do is to understand whether it is an "internal" problem or an "external" one.
Try writing large amounts of incompressible data directly to the machine.
There you will see the "maximum" speed that you can realistically achieve.
I hardly use ZFS on Debian, but I would really test the speed first WITHOUT the network.
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
ok, thanks for your reply.
So my first attempts:
Code:
root@bps-storage:/tank/data2# dd if=/dev/zero of=test.img bs=1M count=8k
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 5.96411 s, 1.4 GB/s
root@bps-storage:/tank/data2# dd if=/dev/zero of=test.img bs=8k count=512k
524288+0 records in
524288+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 4.58741 s, 936 MB/s
root@bps-storage:/tank/data2# dd if=/dev/zero of=test.img bs=4k count=512k
524288+0 records in
524288+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 3.41273 s, 629 MB/s
root@bps-storage:/tank/data2# dd if=/dev/zero of=test.img bs=128k count=512k
^C343403+0 records in
343403+0 records out
45010518016 bytes (45 GB, 42 GiB) copied, 34.7899 s, 1.3 GB/s
Is that a sufficient test in your understanding? If I do speed tests with real-world files (e.g. 150GB of approx. 120MB TIFF files), I'm not sure how best to measure the speed. Any ideas?
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
Well.. no
No, because at 99% probability you have ZFS compression enabled (do you? what kind?).
Do not use zero, use random.
As you can see, "zero" is way too easy for ZFS.
Code:
root@aserver:/temporaneo/ugo # dd if=/dev/zero of=test.img bs=1M count=8k
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 2.097144 secs (4096015943 bytes/sec)
root@aserver:/temporaneo/ugo #
Code:
root@aserver:/temporaneo/ugo # dd if=/dev/urandom of=test.img bs=1M count=8k
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 79.356579 secs (108244769 bytes/sec)
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
And the output of the following, please:
Code:
zfs get all /tank/data2
zfs list
zpool status
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
I already thought /dev/random might be better..
Thanks for your support! Anyway:
Code:
bps-storage:/tank/data3# zfs get all /tank/data2
NAME PROPERTY VALUE SOURCE
tank/data2 type filesystem -
tank/data2 creation Thu Jun 2 10:33 2022 -
tank/data2 used 13.9T -
tank/data2 available 36.1T -
tank/data2 referenced 11.8T -
tank/data2 compressratio 1.00x -
tank/data2 mounted yes -
tank/data2 quota 50T local
tank/data2 reservation none default
tank/data2 recordsize 128K default
tank/data2 mountpoint /tank/data2 default
tank/data2 sharenfs off default
tank/data2 checksum on default
tank/data2 compression off default
tank/data2 atime on default
tank/data2 devices on default
tank/data2 exec on default
tank/data2 setuid on default
tank/data2 readonly off default
tank/data2 zoned off default
tank/data2 snapdir hidden default
tank/data2 aclmode discard default
tank/data2 aclinherit restricted default
tank/data2 createtxg 468426 -
tank/data2 canmount on default
tank/data2 xattr sa local
tank/data2 copies 1 default
tank/data2 version 5 -
tank/data2 utf8only off -
tank/data2 normalization none -
tank/data2 casesensitivity mixed -
tank/data2 vscan off default
tank/data2 nbmand off default
tank/data2 sharesmb off local
tank/data2 refquota none default
tank/data2 refreservation none default
tank/data2 guid 13804224936539947431 -
tank/data2 primarycache all default
tank/data2 secondarycache all default
tank/data2 usedbysnapshots 2.09T -
tank/data2 usedbydataset 11.8T -
tank/data2 usedbychildren 0B -
tank/data2 usedbyrefreservation 0B -
tank/data2 logbias latency default
tank/data2 objsetid 1284 -
tank/data2 dedup off default
tank/data2 mlslabel none default
tank/data2 sync standard default
tank/data2 dnodesize auto local
tank/data2 refcompressratio 1.00x -
tank/data2 written 9.00T -
tank/data2 logicalused 13.8T -
tank/data2 logicalreferenced 11.8T -
tank/data2 volmode default default
tank/data2 filesystem_limit none default
tank/data2 snapshot_limit none default
tank/data2 filesystem_count none default
tank/data2 snapshot_count none default
tank/data2 snapdev hidden default
tank/data2 acltype off default
tank/data2 context none default
tank/data2 fscontext none default
tank/data2 defcontext none default
tank/data2 rootcontext none default
tank/data2 relatime off default
tank/data2 redundant_metadata all default
tank/data2 overlay on default
tank/data2 encryption off default
tank/data2 keylocation none default
tank/data2 keyformat none default
tank/data2 pbkdf2iters 0 default
tank/data2 special_small_blocks 0 default
tank/data2 org.debian:periodic-trim enable inherited from tank
Code:
bps-storage:/tank/data3# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 179T 40.1T 299K /tank
tank/team 768K 5.00T 277K /tank/team
tank/data1 36.8T 40.1T 31.6T /tank/data1
tank/data2 13.9T 36.1T 11.8T /tank/data2
tank/data3 77.9T 40.1T 77.9T /tank/data3
tank/testdata 50.6T 40.1T 24.4T /tank/testdata
Code:
bps-storage:/tank/data3# zpool status
pool: tank
state: ONLINE
scan: scrub in progress since Sun Aug 14 00:24:02 2022
151T scanned at 1.09G/s, 137T issued at 1017M/s, 179T total
0B repaired, 76.66% done, 11:59:11 to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sdc ONLINE 0 0 0
scsi-35000c500d84a8b3b ONLINE 0 0 0
scsi-35000c500d84aefbf ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
sdh ONLINE 0 0 0
sdi ONLINE 0 0 0
scsi-35000c500d84c065f ONLINE 0 0 0
scsi-35000c500d84b490b ONLINE 0 0 0
scsi-35000c500d849b517 ONLINE 0 0 0
scsi-35000c500d848a82b ONLINE 0 0 0
sdn ONLINE 0 0 0
sdo ONLINE 0 0 0
sdp ONLINE 0 0 0
scsi-35000c500d84b41bf ONLINE 0 0 0
scsi-35000c500d84aeab7 ONLINE 0 0 0
sds ONLINE 0 0 0
scsi-35000c500d849ada3 ONLINE 0 0 0
scsi-35000c500d84b1203 ONLINE 0 0 0
sdv ONLINE 0 0 0
logs
nvme0n1p1 ONLINE 0 0 0
cache
nvme0n1p2 ONLINE 0 0 0
errors: No known data errors
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
You are welcome.
1) A scrub is running, so performance is degraded. Wait until it has completed.
2) You have no compression; I suggest turning it on (LZ4) - see the example below.
3) I suggest turning atime off.
4) Look at how big the ARC is (sorry, I use FreeBSD, so I'm not sure where to check on Debian).
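For example (a minimal sketch using the pool/dataset names from the output above; note that compression only applies to data written after it is enabled):
Code:
zfs set compression=lz4 tank
zfs set atime=off tank
# verify that the child datasets inherit the new values
zfs get compression,atime tank/data2
# ARC size on Debian can be checked with arc_summary or straight from the kstats
grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats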
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
The scrub is done now; I'll run some checks again later today.
Will look into compression and atime later as well.
ARC output:
Code:
root@bps-storage:/tank# arc_summary -s arc
------------------------------------------------------------------------
ZFS Subsystem Report Tue Aug 16 03:18:51 2022
Linux 5.10.0-14-amd64 2.0.3-9
Machine: bps-storage.bps.local (x86_64) 2.0.3-9
ARC status: HEALTHY
Memory throttle count: 0
ARC size (current): 100.0 % 125.3 GiB
Target size (adaptive): 100.0 % 125.2 GiB
Min size (hard limit): 6.2 % 7.8 GiB
Max size (high water): 16:1 125.2 GiB
Most Frequently Used (MFU) cache size: 40.0 % 40.2 GiB
Most Recently Used (MRU) cache size: 60.0 % 60.3 GiB
Metadata cache size (hard limit): 75.0 % 93.9 GiB
Metadata cache size (current): 53.0 % 49.8 GiB
Dnode cache size (hard limit): 10.0 % 9.4 GiB
Dnode cache size (current): 99.3 % 9.3 GiB
ARC hash breakdown:
Elements max: 29.3M
Elements current: 95.4 % 27.9M
Collisions: 1.5G
Chain max: 11
Chains: 6.8M
ARC misc:
Deleted: 2.0G
Mutex misses: 12.1k
Eviction skips: 9.9M
root@bps-storage:/tank#
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
OK, just a little update:
I'll enable compression later.
Is disabling atime of any use? My files are not particularly slow, and I guess I/O is not the big issue in my scenario - what do you think?
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
It's an R/W access the system doesn't have to do. In all my years of working with Debian, I have never come across a situation where I wanted to know what the last access timestamp of a file was. So I always disable atime on all my file systems, local or networked.
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
Some random 2c (IANAE):
dd is a terrible benchmark. dd from /dev/zero is worse than terrible.
Enable compression, there's really no downside with modern CPUs.
Disable atime unless you actually need it.
Experiment with disabling sync, especially if using NFS.
Experiment with recordsize. The 128k default is just a "doesn't suck too hard at anything" compromise. I find 1M pretty good for general-purpose fileserver workloads myself, but YMMV (a quick sketch follows after this list).
I'd call 20 drives in a single RAIDZ2 vdev too wide for optimal performance, but that's my speed/space/redundancy tradeoff and only you can find yours.
SV: RAIDZ IOPS are limited by the slowest drive in a vdev, but vdevs stripe. Resilvers on wide RAIDZ vdevs can also be pretty slow, during which time both redundancy and performance are compromised.
L2ARC on an SSD may or may not be a good idea. I have no experience with that particular Samsung drive, but I can say that L2ARC will kill most consumer and light-commercial SSDs fairly quickly while providing questionable benefit.
Ideally you want something like optane here... Yes, I know, I want a pony too.
Referencing drives by /dev/sd[x] makes me cringe a little. Maybe that's just me being weird, but using /dev/disk/by-id or /dev/disk/by-path really helps identifying physical units if nothing else.
Presumably you have read the OpenZFS wiki as a starting point?
There are plenty of performance questions masquerading as "issues" on ZoL's github as well, and the ZoL guys don't seem to mind. It's worth a browse if nothing else.
This is all just bikeshedding without proper benchmarks of course, and I suspect fcorbelli knows more on this topic than I. I'll be watching in the hopes I learn something too.
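A quick sketch of the recordsize experiment mentioned above (using the tank/data2 dataset from earlier in the thread; like compression, it only affects newly written files, so re-copy a test set afterwards to compare):
Code:
zfs get recordsize tank/data2
zfs set recordsize=1M tank/data2
# new writes now use records up to 1M; existing files keep their old record size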
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
Sorry everyone for not replying - I was unexpectedly away for a week.
Thanks for the further tips + hints.
I'll disable atime later. Right now I also feel like 20 drives in a single RAIDZ2 vdev is too wide, but that was actually the recommendation of the consulting company. I might go for 2x 14 or something similar if we decide to upgrade at some point.
I'm aware of the potential issue with the L2ARC on SSD. I chose an SSD rated for 3 drive writes per day for starters. Let's see how it performs.
My goal is/was a storage setup that maxes out our 10G fiber as well as possible with 5-10 clients writing data to the storage or reading from it - 90% of the time only big 100-200MB media files.
Regarding referencing /dev/sdX vs /dev/disk/by-id: the system actually set it up that way when I first created the pool using only sdX. I read (IIRC in a Proxmox forum) that nowadays, with modern udev, it doesn't make any difference?
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
The by-id and by-path nodes are just symlinks to /dev/sd[x] anyway (thanks udev), so it makes no difference in terms of functionality what you use. I find /dev/sd[x] a bit fat-finger prone when dealing with a large number of drives is all, that and it's an extra step to figure out which tray to pull when you want to hotswap a disk.
A bit of naughty ls parsing works for "which physical drive is /dev/sd[x]" questions as well, e.g.
Code:
alias diskids='ls -l /dev/disk/by-id/ata-* | grep -v part | awk "{print \$11,\$9;}" | cut -d"/" -f3- | sort'
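If you ever want the pool itself to show by-id names in zpool status, exporting and re-importing it by id should do the trick (just a sketch - the pool must be idle for the export, so do it in a maintenance window):
Code:
zpool export tank
zpool import -d /dev/disk/by-id tank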
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
One more thing I forgot to ask:
If using dd with /dev/zero or /dev/random is bad for performance testing, can anyone suggest a better way to check read/write performance, both for internal-only transfers and over external connections? fio?
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
Sure.
Piping from /dev/zero falls into the trap of being infinitely compressible, and piping from /dev/urandom will probably just end up benchmarking the random number generator and CPU.
Creating a large file from /dev/urandom and then using that for benchmarking (controlling for caching of course) would be better, but personally I'd use tools designed for the job like FIO or IOZone.
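Something like this is roughly where I'd start with fio for a sequential test on the pool (only a sketch - directory, sizes and job counts are guesses; --refill_buffers keeps the data from being trivially compressible):
Code:
fio --name=seqwrite --directory=/tank/data2 --rw=write --bs=1M --size=8G --numjobs=4 --ioengine=posixaio --refill_buffers --group_reporting
# for the read side, use a data set larger than your ARC or you'll mostly be measuring RAM
fio --name=seqread --directory=/tank/data2 --rw=read --bs=1M --size=8G --numjobs=4 --ioengine=posixaio --group_reporting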
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
Thanks for your reply steve_v.
I'm frankly disappointed looking at the fio results.
I feel a bit undereducated, and seriously badly advised, the more I read into the topic of write speeds.
Is it correct that an NVMe ZIL/SLOG doesn't make any sense for asynchronous writes? So there is no way to speed up asynchronous writes? Would more RAM help? God, I feel stupid right now.
Here is what I get:
Code:
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=128k --numjobs=16 --size=256m --iodepth=16 --runtime=60 --time_based --sync=1
Run status group 0 (all jobs):
WRITE: bw=292MiB/s (306MB/s), 17.0MiB/s-18.4MiB/s (18.8MB/s-19.3MB/s), io=17.4GiB (18.7GB), run=61162-61166msec
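For comparison, the same job without --sync=1 would measure asynchronous writes - something to try next:
Code:
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=128k --numjobs=16 --size=256m --iodepth=16 --runtime=60 --time_based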
Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers
As far as I am aware, and from my own experience (on rather wussier hardware), that is correct. I don't think that fio run was async though...
The primary purpose of the ZIL (whether a dedicated SLOG device or spread amongst the data disks) is to protect synchronous writes while they're sitting in RAM waiting to be flushed to disk, such that they aren't lost if the power goes down.
Asynchronous writes return right away whether they've actually been flushed to stable storage or not, so the ZIL isn't involved at all.
For synchronous writes however, a fast SLOG device can make a massive difference to performance because ZIL operations aren't causing write-duplication and preempting other activity on the pool. That's a big deal if you're e.g. storing databases or serving NFS.
Here's a pretty easy to grok writeup that's worth a read, my explaining tends to suck kinda hard.
Don't sweat it, there's a reason "The ZIL and SLOG are two of the most misunderstood concepts in ZFS". The first time I built a pool I put SLOG and L2ARC drives in it... Then pulled them again because all my writes were async and all my cache hits were coming from ARC in RAM.
To properly see the impact of your SLOG, you'll want to play with fio in both sync and async mode (dense manual warning, or google-foo cheat-sheet like here), and/or compare "sync" write performance with sync=always vs sync=standard vs sync=disabled on the filesystem.
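A rough sketch of that comparison on the dataset from this thread (sync=disabled is for benchmarking only - it tells applications their data is on stable storage before it actually is):
Code:
zfs get sync tank/data2
zfs set sync=always tank/data2      # every write goes through the ZIL/SLOG
zfs set sync=disabled tank/data2    # ZIL bypassed entirely - benchmark only
zfs set sync=standard tank/data2    # back to the default when you're done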
I don't have a comparable pool to play with (kinda wussy, remember ), so if you want something to compare to you'll want to go fishing for someone with deeper pockets than mine. Shouldn't be too hard to find though.