  1. #1
    Just Joined!
    Join Date
    Jan 2010
    Posts
    2

    efficiently create multiple copies of one file


    Hi Forums,

    I've got a problem that my google-fu can't solve:

    - I've got a large file (15GB) on server A.
    - I want anywhere between two and eight identical copies of this file on server B.

    I see a couple of slow, easy ways to do this:
    1) scp the file 8 times. (booo...)
    2) scp the file once and copy it locally however many times it's needed. (better, since the HDD speed is much greater than the network speed)
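    For reference, option 2 would look roughly like this (hostname and paths are just placeholders):

    Code:
    # option 2: transfer the file once over the network, then copy it locally
    scp serverA:/path/to/bigfile ./bigfile
    for i in $(seq 1 8); do cp bigfile copy$i; done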

    But I want to do it the hard, fast way:
    - simultaneously write the scp'd file to many different files on server B as the data is coming in.

    Is there a utility to do this? Maybe I could pipe it through tar somehow? I could probably write something in Python to do this, but why re-invent the wheel if it already exists?

    Rock on!

  2. #2
    Trusted Penguin Irithori's Avatar
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,345
    Do you really need 8 copies?
    IMHO, this only makes sense if each of them will be changed individually.

    If this is the case, antipasti might help:
    Antipasti - Gitorious
    Just found it on freshmeat, but the description looks promising.


    However, if the copies will stay unmodified, you have multiple ways to create them without significant resource usage:
    - softlinks
    - hardlinks
    - bindmounts of the containing directory
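    Roughly like this (untested; filenames and mount points are just examples):

    Code:
    # softlink: a tiny pointer to the original path
    ln -s /data/original_file /data/copy_link
    # hardlink: a second directory entry for the same data (same filesystem only)
    ln /data/original_file /data/copy_hardlink
    # bindmount: make the containing directory visible at a second location (needs root)
    mount --bind /data /mnt/data_again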
    You must always face the curtain with a bow.

  3. #3
    Trusted Penguin Irithori's Avatar
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,345
    Sorry, antipasti doesn't seem to create identical copies.

    But the command pee, which is included in moreutils, should work:
    Code:
    cat original_file | pee "cat > copy_1" "cat > copy_2"
    I haven't benchmarked it against a simple series of cp. And reading one 15GB file and writing it 8x at once is surely not so beneficial if the destination files are on the same disc or same raid.
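    If you don't want to install moreutils, plain tee from coreutils should do much the same (untested sketch):
    Code:
    # tee writes stdin to each named file and to stdout
    cat original_file | tee copy_1 > copy_2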
    You must always face the curtain with a bow.

  4. #4
    Linux Engineer rcgreen's Avatar
    Join Date
    May 2006
    Location
    the hills
    Posts
    1,134
    Hard or soft links seem to be best, unless you have multiple
    users who need their own copies to manipulate individually.

    UNIX man pages : ln ()

  5. #5
    Just Joined!
    Join Date
    Jan 2010
    Posts
    2
    Hi guys,

    Cool, thanks for the help. I checked out antipasti, and it's not quite what I want: it takes a file and splits it into multiple parts, rather than creating clones.

    pee, however, seems to work well. I'll have to do some more serious benchmarking, but just on my desktop I did the following test:

    Code:
    time gunzip -c some3gbFile.gz | pee "cat > copy1" "cat > copy2" "cat > copy3"
    and it took 2m09s.

    Doing it linearly, i.e. timing the execution of:
    Code:
    #!/bin/bash
    gunzip some3gbFile.gz
    cp some3gbFile copy1
    cp some3gbFile copy2
    mv some3gbFile copy3
    took 3m18s. (I checksummed all the resulting files too; all of the copies were created okay.) Sweet!
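    (For the record, the check was just comparing sums, something along these lines:)
    Code:
    md5sum copy1 copy2 copy3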

    Oh, and the reason I need real copies instead of links is this: the files are virtual machine images. I want to boot 8 identical copies of the VM on my 8-core worker node, so that *requires* 8 individual files, not links. I'm copying a single image that's hosted on a repository, so if I can create the 8 copies of the file as it's streaming in, I can save some time between the call to create and the boot.
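    So the plan on the worker node is roughly this (untested; the repo host and image path are placeholders):
    Code:
    # stream the image from the repository once and write all 8 copies as it arrives
    ssh repo 'cat /images/vm-image' | pee \
        "cat > vm1.img" "cat > vm2.img" "cat > vm3.img" "cat > vm4.img" \
        "cat > vm5.img" "cat > vm6.img" "cat > vm7.img" "cat > vm8.img"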

    Thanks!

  6. #6
    Trusted Penguin Irithori's Avatar
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,345
    Some thoughts:
    - You can avoid saving and reading the compressed image by piping it from server A to server B over the net via netcat, and from netcat to pee.
    - Using UDP for transmitting should maximize net performance. But do a checksum of the VMs on server B by piping the stream not only to the cats, but also to an md5sum, and compare it to a checksum of the source file on server A.
    - 8 concurrent writes *will* result in some overhead/write performance decrease on a single disc/raid. Spread the copies to as many individual discs/raids as possible.

    - Also, you could compress/decompress the image on the fly, but I doubt the output stream of gzip would be close to or above the 100 MByte/s that you can expect from a gigabit/s UDP connection. You could compress the image beforehand on A, pipe it over the net, and decompress it on the fly into pee.

    Something like this:
    Code:
    # Server A:
    cat vm-image.gz | nc -u <IP> 10000
    # Server B:
    nc -l -u -p 10000 | gunzip -c | pee "cat > /data1/copy1" "cat > /data2/copy2" "cat > /data3/copy3" "md5sum"
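    To get the matching checksum of the source on server A (since the stream on B is already decompressed), something like this should do (untested):
    Code:
    gunzip -c vm-image.gz | md5sum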


    To get some numbers, iptraf for network throughput and of course vmstat or iostat for IO come to mind.


    Code:
    dd if=/dev/zero of=/data1/delete_me bs=1M
    This should show the maximum sequential write speed for your disc/raid.

    Do 8 of them concurrently to the same disc/raid and you will see the numbers drop.
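    For example, a quick sketch (count limits each writer to 4GB, and the test files can be removed afterwards):
    Code:
    # start 8 sequential writers in parallel on the same disc/raid
    for i in $(seq 1 8); do
        dd if=/dev/zero of=/data1/delete_me_$i bs=1M count=4096 &
    done
    wait
    rm /data1/delete_me_*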
    Last edited by Irithori; 01-26-2010 at 10:30 PM.
    You must always face the curtain with a bow.
