ZFS junkie? That's me. Charbonneau's ZFS notes.

CIRRELT (Centre Interuniversitaire de Recherche sur les Réseaux d'Entreprise, la Logistique et le Transport)
GERAD (Groupe d'études et de recherche en analyse des décisions)

  Download Solaris 10 including ZFS,
Free Sun Studio (compilers),
Free Sun Cluster.

ZFS Management and Troubleshooting Guide, by Princeton University

Does ZFS Obsolete Expensive NAS/SANs? No, it does not!
from the fast-secure-reliable-cheap dept.

ZFS Best Practices Guide, by Solaris Internals <<-- A must!

ZFS video demo on USB sticks and instructions to play with ZFS in a /root/Desktop environment

::::::::::::::
.bashrc
::::::::::::::
PS1='`uname -n`[`pwd`]# '
export PATH=/usr/ccs/bin:$PATH:/usr/sfw/bin:/usr/sfw/sbin:/usr/local/sbin
alias ll="ls -l"
alias rm="rm -i"
alias mv="mv -i"
alias cp="cp -i"

::::::::::::::
ZFS.instructions
::::::::::::::

/usr/sbin/smcwebserver start
/usr/sbin/smcwebserver enable

https://atlantide:6789/

# Create the raid-Z pool
/usr/sbin/zpool create -f atlantide raidz c4t8d0 c4t9d0 c4t10d0 c4t11d0
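
# A quick check of the new pool's layout and free space (optional)
/usr/sbin/zpool status atlantide
/usr/sbin/zpool list atlantide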

# Create the file system
/usr/sbin/zfs create atlantide/pierrega

# Set the quota
/usr/sbin/zfs set quota=20MB atlantide/pierrega

# Set the reservation
/usr/sbin/zfs set reservation=20MB atlantide/danielc

zfs set sharenfs=on atlantide
zfs set sharenfs=rw=@10.100.1.0/24,rw=@10.100.2.0/24 atlantide
zfs get sharenfs atlantide


ZFS Snapshots
=============
zfs snapshot atlantide/danielc@mardi

ZFS Rollback
============
zfs rollback atlantide/danielc@jeudi

Look at Wednesday's version of foo.c

% cat ~danielc/.zfs/snapshot/mercredi/foo.c
I messed up this morning with rsync --delete and lost my whole account
on w1. But no matter...

w1[/]# zfs rollback home/daniel@heure_09
cannot rollback to 'home/daniel@heure_09': more recent snapshots exist
use '-r' to force deletion of the following snapshots:
home/daniel@heure_10

w1[/]# zfs rollback -r home/daniel@heure_09
cannot unmount '/home/daniel': Device busy

w1[/]# zfs rollback -rf home/daniel@heure_09

And voilà, everything is OK again in a fraction of a second...
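
Before rolling back, a quick way to see which snapshots exist (same dataset as above):

w1[/]# zfs list -t snapshot -r home/daniel

Or grab a single file straight out of a snapshot without rolling back at all:

w1[/]# cp /home/daniel/.zfs/snapshot/heure_10/<some_file> /tmp/
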
ZFS Data Migration
==================
old# zpool export atlantide
new# zpool import atlantide
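
If you don't remember the pool name on the new machine, zpool import with no
arguments scans the attached devices and lists every pool available for import:

new# zpool import
new# zpool import atlantide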

Backups and restores to tape
============================
atlantide# time zfs send atlantide/daniel@TMP_BackuP |dd ibs=258048 of=/dev/rmt/1un obs=2064384
atlantide# time dd if=/dev/rmt/0n bs=2064384 | zfs receive atlantide/danieltest

MAN PAGE
=============
atlantide.crt.umontreal.ca[/]# zfs -?
usage: zfs command args ...
where 'command' is one of the following:

create <filesystem>
create [-s] [-b blocksize] -V <size> <volume>
destroy [-rRf] <filesystem|volume|snapshot>

snapshot <filesystem@name|volume@name>
rollback [-rRf] <snapshot>
clone <snapshot> <filesystem|volume>
rename <filesystem|volume|snapshot> <filesystem|volume|snapshot>

list [-rH] [-o property[,property]...] [-t type[,type]...]
[filesystem|volume|snapshot] ...

set <property=value> <filesystem|volume> ...


============================
post-mortem of the w1 zfs mirror
..... here is the follow-up ...
============================

w1[~]# /sbin/zpool status -v
pool: home
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-D3
scrub: resilver completed with 0 errors on Tue Apr 17 15:56:02 2007
config:

        NAME        STATE     READ WRITE CKSUM
        home        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c0d0s7  UNAVAIL      0     0     0  cannot open
            c0d0s7  ONLINE       0     0     0

errors: No known data errors
w1[~]# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
0. c0d0
/pci@0,0/pci-ide@7/ide@0/cmdk@0,0
1. c1d0
/pci@0,0/pci-ide@7/ide@1/cmdk@0,0
Specify disk (enter its number): 0


w1[~]# zpool remove home c0d0s7
cannot remove c0d0s7: only hot spares can be removed
w1[~]# zpool detach home c0d0s7
w1[~]# zpool status -x
all pools are healthy
w1[~]# zpool status -v
pool: home
state: ONLINE
scrub: resilver completed with 0 errors on Tue Apr 17 16:23:27 2007
config:

        NAME        STATE     READ WRITE CKSUM
        home        ONLINE       0     0     0
          c0d0s7    ONLINE       0     0     0

errors: No known data errors
w1[~]# zpool add home mirror c1d0s7
invalid vdev specification: mirror requires at least 2 devices
w1[~]# zpool add home mirror c0d0s7 c1d0s7
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c0d0s7 is part of active ZFS pool home. Please see zpool(1M).
w1[~]# zpool attach home c0d0s7 c1d0s7
w1[~]# zpool status -v
pool: home
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress, 3.37% done, 0h5m to go
config:

        NAME        STATE     READ WRITE CKSUM
        home        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0d0s7  ONLINE       0     0     0
            c1d0s7  ONLINE       0     0     0

errors: No known data errors
w1[~]# zpool status -v
pool: home
state: ONLINE
scrub: resilver completed with 0 errors on Tue Apr 17 16:36:25 2007
config:

        NAME        STATE     READ WRITE CKSUM
        home        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0d0s7  ONLINE       0     0     0
            c1d0s7  ONLINE       0     0     0

errors: No known data errors


MINI course...
Getting Started
===============

Everything you hate about managing filesystems and volumes is gone: you don't have to format, newfs, mount, edit /etc/vfstab, fsck, growfs, metadb, metainit, etc.

Meet your new best friends: zpool(1M) and zfs(1M).

ZFS is easy, so let's get on with it! It's time to create your first pool:

    # zpool create tank c1t2d0

You now have a single-disk storage pool named tank, with a single filesystem mounted at /tank. There is nothing else to do.
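
If you want to confirm what you just created, two quick checks:

    # zpool list tank

    # zfs list tank

The first shows the pool's size and health; the second shows the filesystem that was automatically created and mounted at /tank.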

If you want mirrored storage for mail and home directories, that's easy too:

Create the pool:

    # zpool create tank mirror c1t2d0 c2t2d0

Create the /var/mail filesystem:

    # zfs create tank/mail

    # zfs set mountpoint=/var/mail tank/mail

Create home directories, and mount them all in /export/home/<username>:

    # zfs create tank/home

    # zfs set mountpoint=/export/home tank/home

    # zfs create tank/home/ahrens

    # zfs create tank/home/billm

    # zfs create tank/home/bonwick

    # zfs create tank/home/eschrock

Filesystems in ZFS are hierarchical: each one inherits properties from above. In this example, the mountpoint property is inherited as a pathname prefix. That is, tank/home/ahrens is automatically mounted at /export/home/ahrens because tank/home is mounted at /export/home. You don't have to specify the mountpoint for each individual user, you just tell ZFS the pattern.
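
To see the inheritance at work, ask for the property recursively; the SOURCE column will report "inherited from tank/home" for each child:

    # zfs get -r mountpoint tank/home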

This is how we actually set up home directory and mail service on zion.eng, which has been running ZFS for over a year and a half.

But wait, there's more!

ZFS provides built-in compression. To compress all home directories:

    # zfs set compression=on tank/home

To give ahrens a 10G quota:

    # zfs set quota=10g tank/home/ahrens

To give bonwick a 100G reservation (membership has its privileges):

    # zfs set reservation=100g tank/home/bonwick

To automatically NFS-export all home directories read/write:

    # zfs set sharenfs=rw tank/home

To scrub all disks and verify the integrity of all data in the pool:

    # zpool scrub tank

To replace a flaky disk:

    # zpool replace tank c2t2d0 c4t1d0

To add more space:

    # zpool add tank mirror c5t1d0 c6t1d0

To move your pool from SPARC machine 'sparky' to AMD machine 'amdy':

[on sparky]
    # zpool export tank

Physically move your disks from sparky to amdy.

[on amdy]
    # zpool import tank

Everything will just work; ZFS has 'adaptive endianness' to cope with different byte orders on different platforms.

You get the idea: it's simple. Any common ZFS operation can be done with a single short command.

For more information
====================

You can find more information in the documentation section. You can also join the ZFS discussion at zfs-discuss AT opensolaris DOT org.


http://blogs.sun.com/roch/entry/when_to_and_not_to
======================================

WHEN TO (AND NOT TO) USE RAID-Z

RAID-Z is the technology  used by ZFS  to implement a data-protection  scheme
which is less  costly  than  mirroring  in  terms  of  block
overhead.

Here,  I'd  like  to go  over,    from a theoretical standpoint,   the
performance implication of using RAID-Z.   The goal of this technology
is to allow a storage subsystem to be able  to deliver the stored data
in  the face of one  or more disk   failures.  This is accomplished by
joining  multiple disks into  a  N-way RAID-Z  group. Multiple  RAID-Z
groups can be dynamically striped to form a larger storage pool.

To store file data onto  a RAID-Z group, ZFS  will spread a filesystem
(FS) block onto the N devices that make up the  group.  So for each FS
block,  (N - 1) devices  will  hold file  data  and 1 device will hold
parity  information.   This information  would eventually   be used to
reconstruct (or  resilver) data in the face  of any device failure. We
thus  have 1 / N  of the available disk  blocks that are used to store
the parity  information.   A 10-disk  RAID-Z group  has 9/10th of  the
blocks effectively available to applications.

A common alternative for data protection is the use of mirroring. In
this technology, a filesystem block is stored on 2 (or more) mirror
copies.  Here again, the system will survive a single disk failure (or
more with N-way mirroring).  So a 2-way mirror actually delivers similar
data protection at the expense of providing applications access to
only one half of the disk blocks.

Now let's look at this from the performance angle, in particular that
of delivered filesystem blocks per second (FSBPS).  An N-way RAID-Z
group achieves its protection by spreading a ZFS block onto the N
underlying devices.  That means that a single ZFS block I/O must be
converted to N device I/Os.  To be more precise, in order to access a
ZFS block, we need N device I/Os for output and (N - 1) device I/Os for
input, as the parity data need not generally be read in.

Now after a request for a ZFS block has been spread this way, the I/O
scheduling code will take control of all the device I/Os that need to
be issued.  At this stage, the ZFS code is capable of aggregating
adjacent physical I/Os into fewer ones.  Because of the ZFS
Copy-On-Write (COW) design, we actually do expect this reduction in
the number of device-level I/Os to work extremely well for just about
any write-intensive workload.  We also expect it to help streaming input
loads significantly.  The situation of random inputs is one that needs
special attention when considering RAID-Z.

Effectively,  as  a first approximation,  an  N-disk RAID-Z group will
behave as   a single   device in  terms  of  delivered    random input
IOPS. Thus  a 10-disk group of devices  each capable of 200-IOPS, will
globally act as a 200-IOPS capable RAID-Z group.  This is the price to
pay to achieve proper data  protection without  the 2X block  overhead
associated with mirroring.

With 2-way mirroring, each FS block output must be sent to 2 devices.
Half of the available IOPS are thus lost to mirroring.  However, for
inputs, each side of a mirror can service read calls independently from
one another since each side holds the full information.  Given a
proper software implementation that balances the inputs between sides
of a mirror, the FS blocks delivered by a mirrored group are actually
no less than what a simple non-protected RAID-0 stripe would give.

So, looking at random-access input load in terms of FS blocks per
second (FSBPS): given N devices grouped either as RAID-Z, a 2-way
mirror, or a simple stripe (a.k.a. RAID-0, no data protection!), the
equation would be (where dev represents the capacity, in blocks or
IOPS, of a single device):

                  Blocks Available    Random FS Blocks / sec
                  ----------------    ----------------------
        RAID-Z    (N - 1) * dev       1 * dev
        Mirror    (N / 2) * dev       N * dev
        Stripe     N      * dev       N * dev


Now let's take 100 disks of 100 GB, each capable of 200 IOPS, and
look at different possible configurations.  In the table below, the
configuration labeled:

      "Z 5 x (19+1)"

refers to a dynamic striping of 5 RAID-Z groups, each group made of 20
disks (19 data disks + 1 parity).  M refers to a 2-way mirror and S to a
simple dynamic stripe.


        Config             Blocks Available    Random FS Blocks / sec
        ------------       ----------------    ----------------------
        Z  1 x (99+1)           9900 GB              200
        Z  2 x (49+1)           9800 GB              400
        Z  5 x (19+1)           9500 GB             1000
        Z 10 x (9+1)            9000 GB             2000
        Z 20 x (4+1)            8000 GB             4000
        Z 33 x (2+1)            6600 GB             6600

        M  2 x (50)             5000 GB            20000
        S  1 x (100)           10000 GB            20000
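
A small ksh sketch (an illustrative helper; "raidz_math" is a hypothetical name) that simply plugs numbers into the two formulas above and reproduces these rows:

#!/bin/ksh
# raidz_math -- rough space/IOPS estimate for dynamically striped RAID-Z groups
# usage: raidz_math <total_disks> <disks_per_group> <GB_per_disk> <IOPS_per_disk>
N=$1 G=$2 GB=$3 IOPS=$4
groups=$((N / G))
blocks=$((groups * (G - 1) * GB))       # (G - 1) data columns per group
fsbps=$((groups * IOPS))                # each RAID-Z group delivers ~1 device of random-read IOPS
echo "Z ${groups} x ($((G - 1))+1)   ${blocks} GB   ${fsbps} FS blocks/sec"

For example, "raidz_math 100 5 100 200" prints the "Z 20 x (4+1)" row above: 8000 GB and 4000 FS blocks/sec.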


So RAID-Z  gives you  at most 2X  the number  of blocks that mirroring
provides  but hits you  with  much fewer  delivered IOPS.  That  means
that, as the number of  devices in a  group N increases, the  expected
gain over mirroring (disk blocks)  is bounded (to  at most 2X) but the
expected cost  in IOPS is not  bounded (cost in  the range of [N/2, N]
fewer IOPS). 

Note that for wide RAID-Z configurations, ZFS takes into account the
sector size of devices (typically 512 bytes) and dynamically adjusts
the effective number of columns in a stripe.  So even if you request a
99+1  configuration, the actual data  will probably be  stored on much
fewer data columns than that.   Hopefully this article will contribute
to steering deployments away from those types of configuration.

In conclusion, when preserving IOPS capacity is important, the size of
RAID-Z groups    should be restrained  to smaller   sizes and one must
accept some level of disk block overhead.

When performance matters most, mirroring should be highly favored.  If
mirroring  is considered too   costly but performance  is nevertheless
required, one could proceed like this:

      Given N devices each capable of X IOPS.

      Given a target of delivered  Y FS blocks per second
      for the storage pool.

      Build your storage using dynamically  striped RAID-Z groups of
      (Y / X) devices.

For instance:

      Given 50 devices each capable of 200 IOPS.

      Given a target of delivered 1000 FS blocks per second
      for the storage pool.

      Build your storage using dynamically striped RAID-Z groups of
      (1000 / 200) = 5 devices.

In that system we then would have  20% block overhead lost to maintain
RAID-Z level parity.

RAID-Z is a great  technology not only  when disk blocks are your most
precious resources but also  when your available  IOPS far exceed your
expected needs.  But beware  that if you  get your hands on fewer very
large  disks, the IOPS capacity  can  easily become your most precious
resource. Under those conditions, mirroring should be strongly favored
or alternatively a  dynamic stripe of RAID-Z  groups each made up of a
small number of devices.


Comments:

This is a very important article, and thank you for writing it. I have to admit that I did not understand everything. In particular, it always seemed to me that in a first approximation, raid5 (and raidz) was just (n-1) striped disks and the last one used for parity (except that the disk used for parity is not always the same). But if that were the case, there would not be such a huge performance hit between raid0 and raidz, at least for input. If you get the time some day, could you expand a bit on this please? Anyway, thank you, this will definitely change the way I am using raidz.

Posted by Marc on May 31, 2006 at 01:54 PM MEST #
Because in raid5, a file system block goes to a single disk. So a re-read hits only that disk. The downside of that design is that to write the block to that one disk, you need to read the previous block, read the parity, update the parity, then write both the new block and parity. And since those 2 writes cannot usually be done atomically, you have the write-hole mess and potential silent data corruption.

Posted by Roch on May 31, 2006 at 04:48 PM MEST #
Great article, simply great! As soon as 10 6/06 is out, I will dearly need this information, since I have an enterprise storage server to configure. This is exactly what was needed; I even printed this article, just in case. Keep 'em coming!

Posted by ux-admin on May 31, 2006 at 06:00 PM MEST #
So in order to avoid the need for no more than 4 disk I/Os on each block write (two reads - one or both of which may be satisfied by cached data - plus two writes), you instead force a write to *all* the disks in the stripe (often more than 4 disk I/O operations, though all can be performed in parallel) *plus* degrade parallel read performance from N * dev IOPS to 1 * dev IOPS. Some people would suggest that that's a lousy trade-off, especially given the option to capture the modifications efficiently elsewhere (say, mirrored, with the new locations logged as look-aside addresses) and perform the stripe update lazily later. ZFS has some really neat innovations, but this does not appear to be one of them. - bill

Posted by Bill Todd on June 01, 2006 at 01:33 AM MEST #
ZFS has other technologies that will affect the number of IOPS that effectively happen (the mitigating factors), both for inputs and outputs. For streaming purposes, RAID-Z performance will be on par with anything else. Now, the article highlighted the trade-offs one is faced with given a bunch of disks and the need for data protection: mirror or RAID-Z. That is the question that many of the smaller players will be facing out there. The question that Bill Todd raises is a different interesting issue: given a RAID-5 controller, 1GB of NVRAM and a bunch of disks, should I throw away the controller or keep it? That is a much more complex question...

Posted by Roch on June 01, 2006 at 09:31 AM MEST #
On : zfs-discuss-AT-opensolaris-DOT-org There's an important caveat I want to add to this. When you're doing sequential I/Os, or have a write-mostly workload, the issues that Roch explained so clearly won't come into play. The trade-off between space-efficient RAID-Z and IOP-efficient mirroring only exists when you're doing lots of small random reads. If your I/Os are large, sequential, or write-mostly, then ZFS's I/O scheduler will aggregate them in such a way that you'll get very efficient use of the disks regardless of the data replication model. It's only when you're doing small random reads that the difference between RAID-Z and mirroring becomes significant.... Jeff

Posted by Roch quoting Jeff B on June 01, 2006 at 10:27 AM MEST #
While streaming/large-sequential writes won't be any worse than in more conventional RAID-5, a write-mostly workload using *small* (especially single-block) writes will be if the stripe width exceeds 4 disks: all the disks will be written, vs. just two disk writes (plus at most two disk reads) in a conventional RAID-5 implementation - at least if the updates must be applied synchronously. If the updates can be deferred until several accumulate and can be written out together (assuming that revectoring them to new disk locations - even if they are updates to existing file data rather than appended data - is part of ZFS's bag of tricks), then Jeff's explanation makes more sense. And ISTR some mention of a 'log' for small synchronous updates that might function in the manner I suggested (temporarily capturing the update until it could be applied lazily - of course, the log would need to be mirrored to offer equivalent data protection until that occurred). The impact on small, parallel reads remains a major drawback, and suggesting that this is necessary for data integrity seems to be a red herring if indeed ZFS can revector new writes at will, since it can just delay logging the new locations until both data and parity updates have completed. If there's some real problem doing this, I'm curious as to what it might be. - bill

Posted by Bill Todd on June 02, 2006 at 12:04 AM MEST #
So yes, the synchronous writes go through the ZFS Intent Log (ZIL), and Jeff mentioned this week that mirroring those tmp blocks N-way seems a good idea. ZFS does revector all writes to new locations (even block updates) and that allows it to stream just about any write-intensive workload at top speed. It does seem possible, as you suggest, to implement RAID-5 with ZFS; I suggest that would lead to a 4X degradation on all output-intensive workloads. Since we won't overwrite live data, to output X MB of data ZFS would have to read X MB of freed blocks, read X MB of parity, then write those blocks with new data. Maybe some of the reads could already be cached, but it's not clear that they commonly would be. Maybe what this is saying is that RAID-5 works nicely for filesystems that allow themselves to overwrite live data. That may be OK, but it does seem to require NVRAM to work. This is just not the design point of ZFS. It appears that for ZFS, RAID-5 would pessimize all write-intensive workloads and RAID-Z pessimizes non-cached random-read type loads.

Posted by Roch on June 03, 2006 at 03:20 PM MEST #
Don't be silly. You'd output X MB in full-stripe writes (possibly handling the last stripe specially), just as you do now, save that instead of smearing each block across all the drives you'd write them to individual drives (preferably in contiguous groups within a single file) such that they could be read back efficiently later. In typical cases (new file creation or appends to files rather than in-file updates to existing data) you wouldn't even experience any free-space checkerboarding, but where you did it could be efficiently rectified during your subsequent scrubbing passes (especially within large files where all the relevant metadata would be immediately at hand; in small files you'd need to remember some information about what needed reconsolidating, though given that in typical environments large files consume most of the space - though not necessarily most of the access activity - using parity RAID for small files is of questionable utility anyway). And there's no need for NVRAM with the full-stripe writes you'd still be using. (Hope this doesn't show up as a double post - there seemed to be some problem with the first attempt.) - bill

Posted by Bill Todd on June 05, 2006 at 01:43 PM MEST #
On free-space checkerboarding and file defragmentation: You don't have to remember anything if you're really doing not-too-infrequent integrity scrubs, since as you scrub each file you're reading it all in anyway and have a golden opportunity to write it back out in large, efficient, contiguous chunks, whereas with small files that are fragmenting the free space you can just accumulate them until you've got a large enough batch to write out (again, efficiently) to a location better suited to their use. As long as you don't get obsessive about consolidating the last few small chunks of free space (i.e., are willing to leave occasional unusable free holes because they're negligible in total size), this should work splendidly. - bill

Posted by Bill Todd on June 06, 2006 at 01:26 AM MEST #
Rats: given your snapshot/cloning mechanisms, you can't rearrange even current data as cavalierly as that, I guess. In fact, how *do* you avoid rather significant free-space fragmentation issues (and, though admittedly in atypical files, severe performance degradation in file access due to extreme small-update-induced fragmentation) in the presence of (possibly numerous and long-lived) snapshots (or do you just let performance degrade as large areas of free space become rare and make up what would have been large writes - even of single blocks, in the extreme case - out of small pieces of free space)? Possibly an incentive to have explored metadata-level snapshotting similar to, e.g., Interbase's versioning, since adding an entire new level of indirection at the block level just to promote file and free-space contiguity gets expensive as the system size increases beyond the point where all the indirection information can be kept memory-resident... - bill

Posted by Bill Todd on June 06, 2006 at 05:46 AM MEST #

http://www.sun.com/cgi-bin/sun/bigadmin/xpertApp.cgi?session=21_zfs&xpert=cgerhard&action=questions




::::::::::::::
backup.sh
::::::::::::::
#!/bin/sh
echo "y"|newfs /dev/rdsk/c0t3d0s0 # controller?target?disk?slice? c?t?d?s?
fsck /dev/rdsk/c0t3d0s0           # Check the future root partition
mount /dev/dsk/c0t3d0s0 /mnt2     # Mount the future root partition
svcadm disable ntp                # because ntp conflicts with the fssnap snapshot
sleep 2                           # because that is life
ufsdump 0ufN - /dev/rdsk/c0t2d0s0 `fssnap -F ufs -o raw,bs=/snapshot,unlink /` | ( cd /mnt2; ufsrestore rf - )
fssnap -d /                       # Delete the UFS snapshot
svcadm enable ntp
sleep 2                           # because that is life
cd /mnt2; rm restoresymtable      # ufsrestore bookkeeping file, not needed afterwards
cd /; umount /mnt2
fsck /dev/rdsk/c0t3d0s0           # Check it again with all the data on it
/sbin/installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t3d0s0   # Make the copy bootable


::::::::::::::
crontab.atlantide
::::::::::::::
#
# disk-to-disk backup of the root (/) filesystem
#
12 16 * * 5 nice -2 /root/backup.sh 2>&1 | /usr/ucb/mail -s "Backup disque `uname -n` /" admin@crt.umontreal.ca
#  5,10,15,20,25,30,35,40,45,50,55 * * * * /root/snapshotZFS minute
15 * * * * /root/snapshotZFS heure
10 0 * * * /root/snapshotZFS jour
5 0 1 * * /root/snapshotZFS mois
::::::::::::::
snapshotZFS
::::::::::::::
#!/bin/ksh -p

function take_snap
{
        if zfs list -H -o name $1 >/dev/null 2>&1
        then
                zfs destroy $1
        fi
        zfs snapshot ${1}
}

case ${1:-boot} in
        "boot")
                snap=$(date '+%F-%T')
                ;;
        "minute")
                snap=minute_$(date +%M)
                ;;
        "heure")
                snap=heure_$(date +%H)
                ;;
        "jour")
                snap=jour_$(date +%d)
                ;;
        "mois")
                snap=mois_$(date +%m)
                ;;
esac

for fs in $(zfs list -H -o name -t filesystem |tail +2)
do
        take_snap ${fs}@${snap}
done
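
# Note: "snapshotZFS heure" run at (say) 14:15 replaces <fs>@heure_14 on every
# filesystem (take_snap destroys the old snapshot of that name first), so the
# crontab above keeps rolling sets of 24 hourly, 31 daily and 12 monthly
# snapshots per filesystem.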
 


HOW TO get the VXA-3 working at full speed with ZFS on
an Opteron with Solaris 10.

# df -k

Filesystem            kbytes    used      avail capacity  Mounted on
atlantide 1147797504 79 1116634681 1% /atlantide
atlantide/daniel 1147797504 9036281 1116634681 1% /atlantide/daniel
atlantide/danielc 1147797504 49 1116634681 1% /atlantide/danielc
atlantide/pierreg 1147797504 8500911 1116634681 1% /atlantide/pierreg
atlantide/pierrega 1147797504 1166 1116634681 1% /atlantide/pierrega
atlantide/testlr 1147797504 1416308 1116634681 1% /atlantide/testlr
atlantide/theo 1147797504 6681586 1116634681 1% /atlantide/theo

atlantide.crt.umontreal.ca[/]# zfs list |grep Back
atlantide/daniel@TMP_BackuP 33K - 8.63G -
atlantide/danielc@TMP_BackuP 0 - 49K -
atlantide/pierreg@TMP_BackuP 120K - 8.16G -
atlantide/pierrega@TMP_BackuP 0 - 1.43M -

ZFS on atlantide is on a 3150 (4 SCSI disks, 300 GB, raidz)
===========================================================
Tape drive: VXA-3 from Exabyte, $2,600 CAD, with a 12 MB/s transfer rate.
Around 30 GB/hour is our best on 230m tapes (VXAtape X23).

I forgot to say: we have two VXA-320s running in parallel on the same SCSI cable
(for a total 60 GB/hour backup window).

Here are various tests...
=========================

time ufsdump 0fb /dev/rmt/0un 512 /
DUMP: Date of this level 0 dump: Fri Jan 05 13:49:51 2007
DUMP: Date of last level 0 dump: the epoch
DUMP: Dumping /dev/rdsk/c0t2d0s0 (atlantide.crt.umontreal.ca:/) to /dev/rmt/0un
DUMP: Mapping (Pass I) [regular files]
DUMP: Mapping (Pass II) [directories]
DUMP: Writing 256 Kilobyte records
DUMP: Estimated 7704328 blocks (3761.88MB).
DUMP: Dumping (Pass III) [directories]
DUMP: Dumping (Pass IV) [regular files]
DUMP: 7704062 blocks (3761.75MB) on 1 volume at 15599 KB/sec
DUMP: DUMP IS DONE
real 4:11.8
user 8.0
sys 26.8

time ufsdump 0fb /dev/rmt/0un 480 /
DUMP: Date of this level 0 dump: Fri Jan 05 13:54:31 2007
DUMP: Date of last level 0 dump: the epoch
DUMP: Dumping /dev/rdsk/c0t2d0s0 (atlantide.crt.umontreal.ca:/) to /dev/rmt/0un
DUMP: Mapping (Pass I) [regular files]
DUMP: Mapping (Pass II) [directories]
DUMP: Writing 240 Kilobyte records
DUMP: Estimated 7704466 blocks (3761.95MB).
DUMP: Dumping (Pass III) [directories]
DUMP: Dumping (Pass IV) [regular files]
DUMP: 7703998 blocks (3761.72MB) on 1 volume at 16463 KB/sec
DUMP: DUMP IS DONE
real 4:00.1
user 8.0
sys 26.7

time ufsdump 0fb /dev/rmt/0un 448 /
DUMP: Date of this level 0 dump: Fri Jan 05 13:59:00 2007
DUMP: Date of last level 0 dump: the epoch
DUMP: Dumping /dev/rdsk/c0t2d0s0 (atlantide.crt.umontreal.ca:/) to /dev/rmt/0un
DUMP: Mapping (Pass I) [regular files]
DUMP: Mapping (Pass II) [directories]
DUMP: Writing 224 Kilobyte records
DUMP: Estimated 7704556 blocks (3761.99MB).
DUMP: Dumping (Pass III) [directories]
DUMP: Dumping (Pass IV) [regular files]
DUMP: 7704254 blocks (3761.84MB) on 1 volume at 16387 KB/sec
DUMP: DUMP IS DONE
real 4:00.4
user 7.9
sys 26.9

time ufsdump 0fb /dev/rmt/0un 384 /
DUMP: Date of this level 0 dump: Fri Jan 05 14:07:57 2007
DUMP: Date of last level 0 dump: the epoch
DUMP: Dumping /dev/rdsk/c0t2d0s0 (atlantide.crt.umontreal.ca:/) to /dev/rmt/0un
DUMP: Mapping (Pass I) [regular files]
DUMP: Mapping (Pass II) [directories]
DUMP: Writing 192 Kilobyte records
DUMP: Estimated 7704616 blocks (3762.02MB).
DUMP: Dumping (Pass III) [directories]
DUMP: Dumping (Pass IV) [regular files]
DUMP: 7704574 blocks (3762.00MB) on 1 volume at 16304 KB/sec
DUMP: DUMP IS DONE
real 4:02.3
user 7.9
sys 26.8

time ufsdump 0fb /dev/rmt/0un 256 /
DUMP: Date of this level 0 dump: Fri Jan 05 14:26:17 2007
DUMP: Date of last level 0 dump: the epoch
DUMP: Dumping /dev/rdsk/c0t2d0s0 (atlantide.crt.umontreal.ca:/) to /dev/rmt/0un
DUMP: Mapping (Pass I) [regular files]
DUMP: Mapping (Pass II) [directories]
DUMP: Writing 128 Kilobyte records
DUMP: Estimated 7704904 blocks (3762.16MB).
DUMP: Dumping (Pass III) [directories]
DUMP: Dumping (Pass IV) [regular files]
DUMP: 7704830 blocks (3762.12MB) on 1 volume at 15474 KB/sec
DUMP: DUMP IS DONE
real 4:14.0
user 8.0
sys 26.4

time ufsdump 0fb /dev/rmt/0un 192 /
DUMP: Date of this level 0 dump: Fri Jan 05 14:35:46 2007
DUMP: Date of last level 0 dump: the epoch
DUMP: Dumping /dev/rdsk/c0t2d0s0 (atlantide.crt.umontreal.ca:/) to /dev/rmt/0un
DUMP: Mapping (Pass I) [regular files]
DUMP: Mapping (Pass II) [directories]
DUMP: Writing 96 Kilobyte records
DUMP: Estimated 7705056 blocks (3762.23MB).
DUMP: Dumping (Pass III) [directories]
DUMP: Dumping (Pass IV) [regular files]
DUMP: 7704958 blocks (3762.19MB) on 1 volume at 14850 KB/sec
DUMP: DUMP IS DONE
real 4:24.5
user 8.1
sys 26.5

time ufsdump 0fb /dev/rmt/0un 128 /
DUMP: Date of this level 0 dump: Fri Jan 05 14:46:01 2007
DUMP: Date of last level 0 dump: the epoch
DUMP: Dumping /dev/rdsk/c0t2d0s0 (atlantide.crt.umontreal.ca:/) to /dev/rmt/0un
DUMP: Mapping (Pass I) [regular files]
DUMP: Mapping (Pass II) [directories]
DUMP: Writing 64 Kilobyte records
DUMP: Estimated 7705240 blocks (3762.32MB).
DUMP: Dumping (Pass III) [directories]
DUMP: Dumping (Pass IV) [regular files]
DUMP: 7705214 blocks (3762.31MB) on 1 volume at 13020 KB/sec
DUMP: DUMP IS DONE
real 5:02.4
user 8.4
sys 27.3

time ufsdump 0fb /dev/rmt/0un 96 /
DUMP: Date of this level 0 dump: Fri Jan 05 14:51:32 2007
DUMP: Date of last level 0 dump: the epoch
DUMP: Dumping /dev/rdsk/c0t2d0s0 (atlantide.crt.umontreal.ca:/) to /dev/rmt/0un
DUMP: Mapping (Pass I) [regular files]
DUMP: Mapping (Pass II) [directories]
DUMP: Writing 48 Kilobyte records
DUMP: Estimated 7705332 blocks (3762.37MB).
DUMP: Dumping (Pass III) [directories]
DUMP: Dumping (Pass IV) [regular files]
DUMP: 7705342 blocks (3762.37MB) on 1 volume at 10523 KB/sec
DUMP: DUMP IS DONE
real 6:12.5
user 7.9
sys 21.9


Interesting: a blocking factor of 480 is better than 512 ??? WHY ???

Because we use 63 for 1 KB blocks, 126 for 512-byte blocks, so... Hmm, keep doubling 63:

        63 * 2 =    126
                    252
                    504
                   1008
                   2016
                   4032
                   8064
                  16128
                  32256
                  64512
                 129024   (128K would be 131072)
                 258048   (256K would be 262144)
                 516096
                1032192   (1024K would be 1048576)
                2064384   (instead of 2048K, 2097152)

So we test with bs=1024K and bs=1032192.
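
A throwaway ksh loop (illustrative only) to regenerate the tail of that series and show that every candidate stays a multiple of 63 sectors:

b=64512                 # 126 x 512-byte sectors
while [ $b -le 2064384 ]; do
        echo "$b bytes = $((b / 512)) sectors = 63 x $((b / 512 / 63))"
        b=$((b * 2))
done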

atlantide.crt.umontreal.ca[/]# mt -f /dev/rmt/1u rew
atlantide.crt.umontreal.ca[/]# time zfs send atlantide/daniel@TMP_BackuP |dd ibs=192024 of=/dev/rmt/1un obs=2064384
0+514814 records in
3370+1 records out
real 10m11.970s
user 0m6.568s
sys 0m28.695s

atlantide.crt.umontreal.ca[/]# time zfs send atlantide/daniel@TMP_BackuP |dd ibs=192024 of=/dev/rmt/1un obs=1032192
0+506400 records in
6741+1 records out
real 10m27.031s
user 0m6.659s
sys 0m29.105s

atlantide.crt.umontreal.ca[/]# mt -f /dev/rmt/1u rew
atlantide.crt.umontreal.ca[/]# time zfs send atlantide/daniel@TMP_BackuP |dd ibs=258048 of=/dev/rmt/1un obs=2064384
0+514417 records in
3370+1 records out
real 8m48.880s
user 0m6.491s
sys 0m27.837s

atlantide.crt.umontreal.ca[/]# mt -f /dev/rmt/1u rew
atlantide.crt.umontreal.ca[/]# time zfs send atlantide/pierreg@TMP_BackuP |dd ibs=258048 of=/dev/rmt/1un obs=1032192
0+526160 records in
6393+1 records out
real 9m24.162s
user 0m6.369s
sys 0m28.549s

atlantide.crt.umontreal.ca[/]# time zfs send atlantide/pierreg@TMP_BackuP |dd ibs=258048 of=/dev/rmt/1un obs=2064384
0+535184 records in
3196+1 records out
real 8m33.434s
user 0m6.253s
sys 0m27.483s

atlantide.crt.umontreal.ca[/]# mt -f /dev/rmt/1u rew
atlantide.crt.umontreal.ca[/]# time zfs send atlantide/pierreg@TMP_BackuP |dd ibs=1024k of=/dev/rmt/1un obs=2048k
0+547371 records in
3147+1 records out
real 9m5.103s
user 0m6.329s
sys 0m28.294s

atlantide.crt.umontreal.ca[/]# time zfs send atlantide/pierreg@TMP_BackuP |dd ibs=512k of=/dev/rmt/1un obs=2048k
0+533319 records in
3147+1 records out
real 9m13.835s
user 0m6.289s
sys 0m27.802s

atlantide.crt.umontreal.ca[/]# mt -f /dev/rmt/1u rew
atlantide.crt.umontreal.ca[/]# time zfs send atlantide/pierreg@TMP_BackuP |dd ibs=245k of=/dev/rmt/1un obs=2048k
0+542066 records in
3147+1 records out
real 8m56.045s
user 0m6.401s
sys 0m28.876s

atlantide.crt.umontreal.ca[/]# nice iostat -xcnMCXTdz 5
Mon Jan 8 12:49:07 2007
cpu
us sy wt id
8 3 0 89
extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  555.1   60.4   19.8    0.1  0.0  1.2    0.0    2.0   0  83 c5
  143.5   16.2    5.0    0.0  0.0  0.3    0.0    2.1   0  23 c5t8d0
  121.5   15.8    4.9    0.0  0.0  0.3    0.0    2.1   0  20 c5t9d0
  146.3   16.4    5.0    0.0  0.0  0.3    0.0    1.9   0  20 c5t10d0
  143.7   12.0    5.0    0.0  0.0  0.3    0.0    2.0   0  21 c5t11d0
      0.0   44.0    0.0    9.8  0.0  0.6    0.0   13.6  0  60 rmt/0
      0.0   44.8    0.0    9.9  0.0  0.6    0.0   13.3  0  59 rmt/1


You made it all the way down here, so here's a little extra gift.
Daniel Charbonneau
Tel: 514-343-6111 ext. 1-5478
Fax: 514-343-7121
Email: daniel.charbonneau  (at)  cirrelt  (dot)  ca
Web: http://w1.cirrelt.ca/Daniel.Charbonneau
Mailing address:
CIRRELT
Université de Montréal
C.P. 6128, succ. Centre-ville
Montréal (Québec), Canada
H3C 3J7