ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

Linux Kernel, Network, and Services configuration.
sappel
Posts: 11
Joined: 2022-08-15 07:07
Been thanked: 1 time

ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#1 Post by sappel »

Hello everyone,
I recently set up a 250TB storage system for a file server. I did some speed testing a few weeks back and everything looked great; now that the first production data is on the system, I'm not 100% happy with the performance. I should mention that I got some consulting for the general setup of the system, but since their last invoice I've been trying to avoid asking them for more help - we had quite a discussion about what they charged.
Hardware: Xeon Silver 4214 + 256GB RAM + 2x 480GB SSD for the OS + 20x 14TB SAS HDD + 1x 6.4TB Samsung PM1735 SSD

It's a single pool with RAIDZ2, and the SSD is currently split 50% for SLOG and 50% for L2ARC.

There are no databases running and 80%+ of the files are between 10MB and 200MB, so generally a lot of small files.

I'm aware of the risk of a non-redundant SLOG. If this disk/partition dies, we will lose data in case of a power loss during write processes.

I've not tested anything with sync=always yet.


Our network connection is 10G fiber, so I'd love to get close to the network's limit when writing to the storage.
Internally I get about 150-200 MB/s when moving files. I was expecting/hoping for more.

Any suggestions on how to improve the setup?

Thanks in advance
sappel

Bloom
df -h | grep > 90TiB
Posts: 504
Joined: 2017-11-11 12:23
Been thanked: 26 times

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#2 Post by Bloom »

I can't help you with ZFS, but it should be a *lot* faster. I'm running LizardFS on 1 Gbit/s networks and get 700-900 MB/s file transfer rates.

sappel
Posts: 11
Joined: 2022-08-15 07:07
Been thanked: 1 time

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#3 Post by sappel »

By 150-200MB/s I mean 150-200 MByte/s; x8 that is roughly 1200-1600 Mbit/s, so we are above the 1 Gbit/s limit. But it is still way too slow - I had tests with 700-800 MByte/s, which is closer to the 10 Gbit limit.
And the 150-200 MByte/s above was measured internally, without any network involved.

Bloom
df -h | grep > 90TiB
Posts: 504
Joined: 2017-11-11 12:23
Been thanked: 26 times

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#4 Post by Bloom »

He's running a 10G fiber network.

fcorbelli
Posts: 40
Joined: 2022-08-15 09:03
Location: Italy
Has thanked: 2 times

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#5 Post by fcorbelli »

First of all, I personally would not have recommended using an SSD at all, given the amount of RAM available.

Returning to performance, the very first thing to do is to work out whether it is an "internal" problem or an "external" one.
Try writing large amounts of incompressible data directly on the machine.
That will show you the "maximum" speed you can realistically achieve.

I hardly use ZFS on Debian, but I would really test the speed first WITHOUT the network.
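
Something along these lines, for example (a rough sketch, file names made up; generate the random data once so you are not just benchmarking the random number generator):

Code: Select all

# build an 8 GiB incompressible test file once, so the RNG cost is paid only once
dd if=/dev/urandom of=/tank/data2/rand.bin bs=1M count=8192
# then time re-writing it to the pool, including the final flush to disk
time sh -c 'cp /tank/data2/rand.bin /tank/data2/rand-copy.bin && sync'
8 GiB divided by the wall-clock seconds gives a rough MB/s figure for the write path.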

sappel
Posts: 11
Joined: 2022-08-15 07:07
Been thanked: 1 time

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#6 Post by sappel »

ok, thanks for your reply.

So my first attempts:

Code: Select all

root@bps-storage:/tank/data2# dd if=/dev/zero of=test.img bs=1M count=8k
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 5.96411 s, 1.4 GB/s
root@bps-storage:/tank/data2# dd if=/dev/zero of=test.img bs=8k count=512k
524288+0 records in
524288+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 4.58741 s, 936 MB/s
root@bps-storage:/tank/data2# dd if=/dev/zero of=test.img bs=4k count=512k
524288+0 records in
524288+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 3.41273 s, 629 MB/s
root@bps-storage:/tank/data2# dd if=/dev/zero of=test.img bs=128k count=512k
^C343403+0 records in
343403+0 records out
45010518016 bytes (45 GB, 42 GiB) copied, 34.7899 s, 1.3 GB/s
Is that a sufficient test in your view?
If I do speed tests with real-world files (e.g. 150GB of approx. 120MB TIF files), I'm not sure of the best way to measure the speed. Any ideas?

fcorbelli
Posts: 40
Joined: 2022-08-15 09:03
Location: Italy
Has thanked: 2 times

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#7 Post by fcorbelli »

Well... no.
No, because there is a 99% chance you have ZFS compression enabled (do you? what kind?).
Don't use zero, use random data.

As you can see, "zero" is way too easy for ZFS:

Code: Select all

root@aserver:/temporaneo/ugo # dd if=/dev/zero of=test.img bs=1M count=8k
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 2.097144 secs (4096015943 bytes/sec)
root@aserver:/temporaneo/ugo #

Code: Select all

root@aserver:/temporaneo/ugo # dd if=/dev/urandom of=test.img bs=1M count=8k
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 79.356579 secs (108244769 bytes/sec)
I never use the GUI on BSD, Linux and Solaris. Ever.

fcorbelli
Posts: 40
Joined: 2022-08-15 09:03
Location: Italy
Has thanked: 2 times

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#8 Post by fcorbelli »

And please post the output of:

Code: Select all

zfs get all /tank/data2
zfs list
zpool status
I never use the GUI on BSD, Linux and Solaris. Ever.

sappel
Posts: 11
Joined: 2022-08-15 07:07
Been thanked: 1 time

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#9 Post by sappel »

I already thought /dev/random might be better...
Anyway:

Code: Select all

bps-storage:/tank/data3#   zfs get all /tank/data2
NAME        PROPERTY                  VALUE                     SOURCE
tank/data2  type                      filesystem                -
tank/data2  creation                  Thu Jun  2 10:33 2022     -
tank/data2  used                      13.9T                     -
tank/data2  available                 36.1T                     -
tank/data2  referenced                11.8T                     -
tank/data2  compressratio             1.00x                     -
tank/data2  mounted                   yes                       -
tank/data2  quota                     50T                       local
tank/data2  reservation               none                      default
tank/data2  recordsize                128K                      default
tank/data2  mountpoint                /tank/data2               default
tank/data2  sharenfs                  off                       default
tank/data2  checksum                  on                        default
tank/data2  compression               off                       default
tank/data2  atime                     on                        default
tank/data2  devices                   on                        default
tank/data2  exec                      on                        default
tank/data2  setuid                    on                        default
tank/data2  readonly                  off                       default
tank/data2  zoned                     off                       default
tank/data2  snapdir                   hidden                    default
tank/data2  aclmode                   discard                   default
tank/data2  aclinherit                restricted                default
tank/data2  createtxg                 468426                    -
tank/data2  canmount                  on                        default
tank/data2  xattr                     sa                        local
tank/data2  copies                    1                         default
tank/data2  version                   5                         -
tank/data2  utf8only                  off                       -
tank/data2  normalization             none                      -
tank/data2  casesensitivity           mixed                     -
tank/data2  vscan                     off                       default
tank/data2  nbmand                    off                       default
tank/data2  sharesmb                  off                       local
tank/data2  refquota                  none                      default
tank/data2  refreservation            none                      default
tank/data2  guid                      13804224936539947431      -
tank/data2  primarycache              all                       default
tank/data2  secondarycache            all                       default
tank/data2  usedbysnapshots           2.09T                     -
tank/data2  usedbydataset             11.8T                     -
tank/data2  usedbychildren            0B                        -
tank/data2  usedbyrefreservation      0B                        -
tank/data2  logbias                   latency                   default
tank/data2  objsetid                  1284                      -
tank/data2  dedup                     off                       default
tank/data2  mlslabel                  none                      default
tank/data2  sync                      standard                  default
tank/data2  dnodesize                 auto                      local
tank/data2  refcompressratio          1.00x                     -
tank/data2  written                   9.00T                     -
tank/data2  logicalused               13.8T                     -
tank/data2  logicalreferenced         11.8T                     -
tank/data2  volmode                   default                   default
tank/data2  filesystem_limit          none                      default
tank/data2  snapshot_limit            none                      default
tank/data2  filesystem_count          none                      default
tank/data2  snapshot_count            none                      default
tank/data2  snapdev                   hidden                    default
tank/data2  acltype                   off                       default
tank/data2  context                   none                      default
tank/data2  fscontext                 none                      default
tank/data2  defcontext                none                      default
tank/data2  rootcontext               none                      default
tank/data2  relatime                  off                       default
tank/data2  redundant_metadata        all                       default
tank/data2  overlay                   on                        default
tank/data2  encryption                off                       default
tank/data2  keylocation               none                      default
tank/data2  keyformat                 none                      default
tank/data2  pbkdf2iters               0                         default
tank/data2  special_small_blocks      0                         default
tank/data2  org.debian:periodic-trim  enable                    inherited from tank

Code: Select all

bps-storage:/tank/data3# zfs list
NAME            USED  AVAIL     REFER  MOUNTPOINT
tank            179T  40.1T      299K  /tank
tank/team       768K  5.00T      277K  /tank/team
tank/data1     36.8T  40.1T     31.6T  /tank/data1
tank/data2     13.9T  36.1T     11.8T  /tank/data2
tank/data3     77.9T  40.1T     77.9T  /tank/data3
tank/testdata  50.6T  40.1T     24.4T  /tank/testdata

Code: Select all

bps-storage:/tank/data3# zpool status
  pool: tank
 state: ONLINE
  scan: scrub in progress since Sun Aug 14 00:24:02 2022
        151T scanned at 1.09G/s, 137T issued at 1017M/s, 179T total
        0B repaired, 76.66% done, 11:59:11 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            scsi-35000c500d84a8b3b  ONLINE       0     0     0
            scsi-35000c500d84aefbf  ONLINE       0     0     0
            sdf                     ONLINE       0     0     0
            sdg                     ONLINE       0     0     0
            sdh                     ONLINE       0     0     0
            sdi                     ONLINE       0     0     0
            scsi-35000c500d84c065f  ONLINE       0     0     0
            scsi-35000c500d84b490b  ONLINE       0     0     0
            scsi-35000c500d849b517  ONLINE       0     0     0
            scsi-35000c500d848a82b  ONLINE       0     0     0
            sdn                     ONLINE       0     0     0
            sdo                     ONLINE       0     0     0
            sdp                     ONLINE       0     0     0
            scsi-35000c500d84b41bf  ONLINE       0     0     0
            scsi-35000c500d84aeab7  ONLINE       0     0     0
            sds                     ONLINE       0     0     0
            scsi-35000c500d849ada3  ONLINE       0     0     0
            scsi-35000c500d84b1203  ONLINE       0     0     0
            sdv                     ONLINE       0     0     0
        logs
          nvme0n1p1                 ONLINE       0     0     0
        cache
          nvme0n1p2                 ONLINE       0     0     0

errors: No known data errors
thanks for your support!

fcorbelli
Posts: 40
Joined: 2022-08-15 09:03
Location: Italy
Has thanked: 2 times

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#10 Post by fcorbelli »

You are welcome.

1) A scrub is running, so performance is degraded. Wait until it has completed.
2) You have no compression; I suggest turning it on (LZ4).
3) I suggest turning atime off.
4) Check how big the ARC is (sorry, I use FreeBSD, so I am not sure where to look on Debian). A rough sketch of 2) to 4) follows below.
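
If the usual OpenZFS userland is installed, something like this should cover 2) to 4) (a sketch, untested from here; compression only applies to data written after the change, and child datasets inherit both properties unless they override them):

Code: Select all

zfs set compression=lz4 tank   # new writes get LZ4-compressed, existing data stays as-is
zfs set atime=off tank         # stop updating access times on every read
arc_summary -s arc             # should print current/target ARC size, if the tool is present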
I never use the GUI on BSD, Linux and Solaris. Ever.

sappel
Posts: 11
Joined: 2022-08-15 07:07
Been thanked: 1 time

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#11 Post by sappel »

The scrub is done now; I'll run some checks again later today.

Will look into compression and atime later as well.

ARC output:

Code: Select all

root@bps-storage:/tank# arc_summary -s arc

------------------------------------------------------------------------
ZFS Subsystem Report                            Tue Aug 16 03:18:51 2022
Linux 5.10.0-14-amd64                                            2.0.3-9
Machine: bps-storage.bps.local (x86_64)                          2.0.3-9

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                   100.0 %  125.3 GiB
        Target size (adaptive):                       100.0 %  125.2 GiB
        Min size (hard limit):                          6.2 %    7.8 GiB
        Max size (high water):                           16:1  125.2 GiB
        Most Frequently Used (MFU) cache size:         40.0 %   40.2 GiB
        Most Recently Used (MRU) cache size:           60.0 %   60.3 GiB
        Metadata cache size (hard limit):              75.0 %   93.9 GiB
        Metadata cache size (current):                 53.0 %   49.8 GiB
        Dnode cache size (hard limit):                 10.0 %    9.4 GiB
        Dnode cache size (current):                    99.3 %    9.3 GiB

ARC hash breakdown:
        Elements max:                                              29.3M
        Elements current:                              95.4 %      27.9M
        Collisions:                                                 1.5G
        Chain max:                                                    11
        Chains:                                                     6.8M

ARC misc:
        Deleted:                                                    2.0G
        Mutex misses:                                              12.1k
        Eviction skips:                                             9.9M

root@bps-storage:/tank#

sappel
Posts: 11
Joined: 2022-08-15 07:07
Been thanked: 1 time

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#12 Post by sappel »

OK, just a little update:
I'll enable compression later.
Is disabling atime of any use? My files are not particularly small, and I guess I/O is not the big deal in my scenario - what do you think?

Bloom
df -h | grep > 90TiB
Posts: 504
Joined: 2017-11-11 12:23
Been thanked: 26 times

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#13 Post by Bloom »

It's an R/W access the system doesn't have to do. In all my years of working with Debian, I have never come across a situation where I wanted to know what the last access timestamp of a file was. So I always disable atime on all my file systems, local or networked.

steve_v
df -h | grep > 20TiB
Posts: 1400
Joined: 2012-10-06 05:31
Location: /dev/chair
Has thanked: 79 times
Been thanked: 175 times

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#14 Post by steve_v »

Some random 2c (IANAE):

dd is a terrible benchmark. dd from /dev/zero is worse than terrible.

Enable compression, there's really no downside with modern CPUs.

Disable atime unless you actually need it.

Experiment with disabling sync, especially if using NFS.

Experiment with recordsize. The 128k default is just a "doesn't suck too hard at anything" compromise. I find 1M pretty good for general-purpose fileserver workloads myself, but YMMV.
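
For example (a sketch; recordsize only applies to blocks written after the change, so existing files keep their old block size until they are rewritten):

Code: Select all

zfs set recordsize=1M tank/data2
zfs get recordsize tank/data2   # confirm the new value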

I'd call 20 drives in a single RAIDZ2 vdev too wide for optimal performance, but that's my speed/space/redundancy tradeoff and only you can find yours.
SV: RAIDZ IOPS are limited by the slowest drive in a vdev, but vdevs stripe. Resilvers on wide RAIDZ vdevs can also be pretty slow, during which time both redundancy and performance are compromised.

L2ARC on an SSD may or may not be a good idea. I have no experience with that particular Samsung drive, but I can say that it will kill most consumer and light-commercial SSDs fairly quickly while providing questionable benefit.
Ideally you want something like Optane here... Yes, I know, I want a pony too. :P
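
If you want to check whether the L2ARC is actually earning its keep, the hit/miss stats should tell you - assuming your arc_summary build exposes an l2arc section:

Code: Select all

arc_summary -s l2arc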

Referencing drives by /dev/sd[x] makes me cringe a little. Maybe that's just me being weird, but using /dev/disk/by-id or /dev/disk/by-path really helps with identifying physical units if nothing else.
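
If you ever want to switch the existing pool over to the by-id names, I believe a plain export and re-import pointed at that directory does it (a sketch; do it in a quiet maintenance window):

Code: Select all

zpool export tank
zpool import -d /dev/disk/by-id tank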

Presumably you have read the OpenZFS wiki as a starting point?
There are plenty of performance questions masquerading as "issues" on ZoL's github as well, and the ZoL guys don't seem to mind. It's worth a browse if nothing else.


This is all just bikeshedding without proper benchmarks of course, and I suspect fcorbelli knows more on this topic than I. I'll be watching in the hopes I learn something too. ;)
Once is happenstance. Twice is coincidence. Three times is enemy action. Four times is Official GNOME Policy.

sappel
Posts: 11
Joined: 2022-08-15 07:07
Been thanked: 1 time

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#15 Post by sappel »

Sorry everyone for not replying sooner - I was unexpectedly away for a week.

Thanks for the further tips + hints.

I'll disable atime later. By now I also feel that 20 drives in a single RAIDZ2 vdev is too wide, but that was actually the recommendation of the consulting company. I might go for 2x 14 or something similar if we decide to upgrade at some point.

I'm aware of the potential issue with the L2ARC on SSD. I chose an SSD rated for 3 drive writes per day for starters. Let's see how it performs.

My goal is/was a storage setup that maxes out our 10G fiber as well as possible in a scenario with 5-10 clients writing data to the storage or reading from it - 90% of the time only big 100-200MB media files.

Regarding referencing /dev/sdX vs /dev/disk/by-id: the system actually set it up that way when I first created the pool using only sdX names. I read (IIRC in a Proxmox forum) that with modern udev it doesn't make any difference nowadays?

steve_v
df -h | grep > 20TiB
Posts: 1400
Joined: 2012-10-06 05:31
Location: /dev/chair
Has thanked: 79 times
Been thanked: 175 times

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#16 Post by steve_v »

sappel wrote: 2022-08-22 06:28 I read (IIRC in a Proxmox forum) that with modern udev it doesn't make any difference nowadays?
The by-id and by-path nodes are just symlinks to /dev/sd[x] anyway (thanks udev), so it makes no difference in terms of functionality what you use. I find /dev/sd[x] a bit fat-finger prone when dealing with a large number of drives is all, that and it's an extra step to figure out which tray to pull when you want to hotswap a disk.

A bit of naughty ls parsing works for "which physical drive is /dev/sd[x]" questions as well, e.g.

Code: Select all

alias diskids='ls -l /dev/disk/by-id/ata-* | grep -v part | awk "{print \$11,\$9;}" | cut -d"/" -f3- | sort'
Yes shell-police, I know. I should use readlink or ask udev directly, but I'm lazy and it's only for informational purposes anyway. :P
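
For the record, a readlink-based version would look something like this (untested sketch; with SAS drives like yours the links are scsi-* rather than ata-*):

Code: Select all

for l in /dev/disk/by-id/scsi-*; do
    case "$l" in *-part*) continue ;; esac     # skip partition symlinks
    printf '%s  %s\n' "$(basename "$(readlink -f "$l")")" "${l##*/}"
done | sort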
Once is happenstance. Twice is coincidence. Three times is enemy action. Four times is Official GNOME Policy.

sappel
Posts: 11
Joined: 2022-08-15 07:07
Been thanked: 1 time

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#17 Post by sappel »

One more thing I forgot to ask:

If using dd with /dev/zero or /dev/random is bad for performance testing, can anyone suggest a better way to check read/write performance, both for internal-only data transfers and for external connections as well? fio?

steve_v
df -h | grep > 20TiB
Posts: 1400
Joined: 2012-10-06 05:31
Location: /dev/chair
Has thanked: 79 times
Been thanked: 175 times

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#18 Post by steve_v »

sappel wrote: 2022-08-22 08:46 fio
Sure.
Piping from /dev/zero falls into the trap of being infinitely compressible, and piping from /dev/urandom will probably just end up benchmarking the random number generator and CPU.
Creating a large file from /dev/urandom and then using that for benchmarking (controlling for caching of course) would be better, but personally I'd use tools designed for the job like FIO or IOZone.
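
A minimal fio starting point for the big-sequential-files case might look something like this (a sketch; the sizes and job counts are guesses - tune them so a run lasts more than a few seconds, and bump --size if you want reads to get past the ARC):

Code: Select all

# sequential write, roughly matching large media-file ingest
fio --name=seqwrite --directory=/tank/data2 --rw=write --bs=1M --size=8g \
    --numjobs=4 --ioengine=posixaio --iodepth=8 --end_fsync=1 --group_reporting
# sequential read over the same kind of data
fio --name=seqread --directory=/tank/data2 --rw=read --bs=1M --size=8g \
    --numjobs=4 --ioengine=posixaio --iodepth=8 --group_reporting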
Once is happenstance. Twice is coincidence. Three times is enemy action. Four times is Official GNOME Policy.

sappel
Posts: 11
Joined: 2022-08-15 07:07
Been thanked: 1 time

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#19 Post by sappel »

Thanks for your reply, steve_v.

I'm frankly disappointed looking at the fio results.
I feel a bit undereducated, and rather badly advised, the more I look into the topic of write speeds:

Code: Select all

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=128k --numjobs=16 --size=256m --iodepth=16 --runtime=60 --time_based --sync=1

Run status group 0 (all jobs):
  WRITE: bw=292MiB/s (306MB/s), 17.0MiB/s-18.4MiB/s (18.8MB/s-19.3MB/s), io=17.4GiB (18.7GB), run=61162-61166msec
Is it correct that the NVMe ZIL as SLOG doesn't make any sense for asynchronous writes? So there is no way to speed up asynchronous writes? Would more RAM help? God, I feel stupid right now.

steve_v
df -h | grep > 20TiB
Posts: 1400
Joined: 2012-10-06 05:31
Location: /dev/chair
Has thanked: 79 times
Been thanked: 175 times

Re: ZFS Setup + Performance for Snapshotting/Remote Snapshots + File Transfers

#20 Post by steve_v »

sappel wrote: 2022-08-22 14:01 SLOG doesn't make any sense for asynchronous writes?
As far as I am aware, and from my own experience (on rather wussier hardware), that is correct. I don't think that fio run was async though...

The primary purpose of the ZIL (whether a dedicated SLOG device or spread amongst the data disks) is to protect synchronous writes while they're sitting in RAM waiting to be flushed to disk, such that they aren't lost if the power goes down.
Asynchronous writes return right away whether they've actually been flushed to stable storage or not, so the ZIL isn't involved at all.
For synchronous writes, however, a fast SLOG device can make a massive difference to performance, because ZIL writes land on the dedicated device instead of the data disks, so they aren't causing write duplication or preempting other activity on the pool. That's a big deal if you're e.g. storing databases or serving NFS.

Here's a pretty easy-to-grok writeup that's worth a read; my explaining tends to suck kinda hard.
sappel wrote: 2022-08-22 14:01 I feel stupid right now
Don't sweat it, there's a reason "The ZIL and SLOG are two of the most misunderstood concepts in ZFS". The first time I built a pool I put SLOG and L2ARC drives in it... Then pulled them again because all my writes were async and all my cache hits were coming from ARC in RAM.

To properly see the impact of your SLOG, you'll want to play with fio in both sync and async mode (dense manual warning, or google-foo cheat-sheet like here), and/or compare "sync" write performance with sync=always vs sync=standard vs sync=disabled on the filesystem.
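
Something like this on a throwaway dataset would give you the three-way comparison without touching production data (a sketch, dataset name made up):

Code: Select all

zfs create tank/fiotest
for mode in standard always disabled; do
    zfs set sync=$mode tank/fiotest
    fio --name=sync-$mode --directory=/tank/fiotest --rw=write --bs=128k --size=2g \
        --numjobs=4 --ioengine=posixaio --iodepth=16 --sync=1 --end_fsync=1 --group_reporting
done
zfs destroy -r tank/fiotest
The gap between disabled and the other two is roughly what the ZIL path costs you; standard vs. always shows what forcing everything through the SLOG would do.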

I don't have a comparable pool to play with (kinda wussy, remember :(), so if you want something to compare to you'll want to go fishing for someone with deeper pockets than mine. Shouldn't be too hard to find though.
Last edited by steve_v on 2022-08-22 15:26, edited 1 time in total.
Once is happenstance. Twice is coincidence. Three times is enemy action. Four times is Official GNOME Policy.
