- 01-22-2010 #1 Just Joined! (Join Date: Jan 2010, Posts: 2)
efficiently create multiple copies of one file
Hi Forums,
I've got a problem that my google-fu can't solve:
- I've got a large file (15GB) on server A.
- I want anywhere between two and eight identical copies of this file on server B.
I see a couple of slow easy ways to do this:
1) scp the file 8 times. (booo...)
2) scp the file once and copy it locally however many times it's needed. (better, since the HDD speed is much greater than the network speed)
But I want to do it the hard fast way:
- simultaneously write the scp'd file to many different files on server B as the data is coming in.
Is there a utility to do this? Maybe I could pipe it through tar somehow? I could probably write something in python to do this, but why re-invent the wheel if it already exists?
Rock on!
- 01-23-2010 #2
Do you really need 8 copies?
IMHO, this only makes sense if each of them will be changed individually.
If this is the case, antipasti might help:
Antipasti - Gitorious
Just found it on freshmeat, but the description looks promising
However, if the copies will stay unmodified, you have multiple ways to create them without significant resource usage:
- softlinks
- hardlinks
- bindmounts of the containing directory
You must always face the curtain with a bow.
- 01-23-2010 #3
Sorry, antipasti doesn't seem to create identical copies.
But the command pee, which is included in moreutils, should work:
Code:
cat original_file | pee "cat > copy_1" "cat > copy_2"
I haven't benchmarked it against a simple series of cp.
And reading one 15GB file while writing it 8x at once is probably not that beneficial
if the destination files are on the same disc or the same raid.
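If moreutils is not installed, plain tee from coreutils can fan one stream out to several files in a single pass. This is a swapped-in alternative to pee, not something benchmarked in this thread. A minimal sketch, assuming a small generated file as a stand-in for the real one and hypothetical file names in a temporary directory:

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory; all file names here are illustrative only.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Stand-in for the large source file (1 MiB of random data).
head -c 1048576 /dev/urandom > original_file

# tee writes its input to every named file and also to stdout;
# redirecting stdout yields the last copy without an extra process.
tee copy_1 copy_2 < original_file > copy_3

# Verify that each copy is byte-identical to the source.
for f in copy_1 copy_2 copy_3; do
    cmp original_file "$f" && echo "$f ok"
done
```

The source is read once and written three times, which is the same access pattern as the pee command above.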
- 01-23-2010 #4
Hard or soft links seem to be best, unless you have multiple
users who need their own copies to manipulate individually.
See the UNIX man page for ln.
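To see why links only help when the copies stay unmodified, here is a small sketch of how hard links, soft links, and real copies behave when one of them is written to. The file names are hypothetical and everything happens in a temporary directory:

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

echo "original contents" > image

ln image hardlink        # second name for the same inode, no data copied
ln -s image softlink     # pointer to the path "image"
cp image realcopy        # independent copy of the data

# Appending through the hard link changes the data seen under every name
# that shares the inode, but leaves the real copy alone.
echo "modified" >> hardlink

cat image       # shows both lines
cat softlink    # same data, reached via the symlink
cat realcopy    # still only the first line
```

Anything that writes to its "copy" through a link is therefore writing to the one shared file.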
- 01-26-2010 #5 Just Joined! (Join Date: Jan 2010, Posts: 2)
Hi guys,
Cool thanks for the help. I checked out antipasti, and it's not quite what I want. It takes a file and splits it into multiple parts, rather than creating clones.
pee, however, seems to work well. I'll have to do some more serious benchmarking, but just on my desktop I did the following test:
Code:
time gunzip -c some3gbFile.gz | pee "cat > copy1" "cat > copy2" "cat > copy3"
and it took 2m09s.
Doing it linearly, i.e. timing the execution of:
Code:
#!/bin/bash
gunzip some3gbFile.gz
cp some3gbFile copy1
cp some3gbFile copy2
mv some3gbFile copy3
took 3m18s. (I checksummed all the resulting files too, and all of the copies were created okay.) Sweet!
Oh, and the reason I need real copies instead of links is this: the files are virtual machine images. I want to boot 8 identical copies of the VM on my 8 core worker node, so that *requires* 8 individual files, not links. I'm copying a single image that's hosted on a repository, so if I can create the 8 copies of the file as it's streaming in, I can save some time between the call to create and the boot.
Thanks!
- 01-26-2010 #6
Some thoughts:
- You can avoid saving and reading the compressed image by piping it from server A to server B over the net via netcat and from netcat to pee
- Using UDP for the transfer might reduce protocol overhead, but UDP does not guarantee delivery or ordering. So do a checksum of the VMs on server B by piping the stream not only to the cats but also to md5sum, and compare the result to a checksum of the source file on server A.
- 8 concurrent writes *will* result in some overhead/write performance decrease on a single disc/raid. Spread the copies to as many individual discs/raids as possible.
- You could also compress/decompress the image on the fly. But I doubt the output stream of gzip would come close to or exceed the roughly 100 MByte/s that you can expect from a gigabit connection.
You could compress the image beforehand on A, pipe it over the net, and decompress on the fly to pee.
Something like this:
Server A: cat vm-image.gz | nc -u <IP> 10000
Server B: nc -l -u -p 10000 | gunzip -c | pee "cat > /data1/copy1" "cat > /data2/copy2" "cat > /data3/copy3" "md5sum"
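The on-the-fly checksum idea can be tried locally before involving the network. A rough sketch, using tee in place of pee (an assumption, for machines without moreutils) and a small generated file standing in for the VM image. The file names are hypothetical:

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Stand-in for the image and its compressed form on server A.
head -c 1048576 /dev/urandom > vm-image
gzip -c vm-image > vm-image.gz

# Checksum of the uncompressed source, as it would be computed on server A.
src_sum=$(md5sum < vm-image | cut -d' ' -f1)

# "Server B" side: decompress the stream once, write every copy, and feed
# the same bytes to md5sum through tee's stdout.
stream_sum=$(gunzip -c < vm-image.gz | tee copy1 copy2 copy3 | md5sum | cut -d' ' -f1)

if [ "$src_sum" = "$stream_sum" ]; then
    echo "stream checksum matches"
else
    echo "stream checksum MISMATCH" >&2
    exit 1
fi
```

This only verifies the bytes that arrived in the stream. Checksumming the written files afterwards additionally covers write errors on the destination discs.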
To get some numbers,
iptraf for network throughput
and of course vmstat or iostat for IO come to mind.
dd if=/dev/zero of=/data1/delete_me bs=1M
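should show the maximum sequential write speed for your disc/raid. Note that without a count= option dd keeps writing until it is interrupted or the disc is full, so stop it with Ctrl-C and delete the file afterwards.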
Do 8 of them concurrently to the same disc/raid and you will see the numbers drop.
Last edited by Irithori; 01-26-2010 at 10:30 PM.
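A bounded version of the concurrent dd test might look like the sketch below. It assumes GNU dd (for bs=1M and conv=fsync), and the target directory, file size, and writer count are placeholders chosen to keep the run tiny. For a real measurement, point it at the disc in question and use files of several GiB so caching does not dominate:

```shell
#!/bin/sh
set -eu

# Target directory for the test files; /data1 in the post, a temp dir here.
target=$(mktemp -d)
trap 'rm -rf "$target"' EXIT

size_mb=4     # size of each test file in MiB (deliberately small)
writers=8     # number of concurrent writers

start=$(date +%s)
i=1
while [ "$i" -le "$writers" ]; do
    # count= bounds the write; conv=fsync makes dd flush the file to disk
    # before it exits, so the elapsed time includes the actual write.
    dd if=/dev/zero of="$target/delete_me_$i" bs=1M count="$size_mb" conv=fsync 2>/dev/null &
    i=$((i + 1))
done
wait    # block until all background writers have finished
end=$(date +%s)

echo "wrote $writers x $size_mb MiB in $((end - start)) s"
```

Running it once with writers=1 and once with writers=8 on the same disc gives a rough before/after comparison.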