  1. #1
    Just Joined!
    Join Date
    Mar 2010
    Posts
    15

    mapping memory space to a path/filename

    Hi all,

    I'm working with a lot of data, and it is always the same data. I have, say, 2 GB that I load about 100 times a day from a local disk to do some computations.
    I was wondering whether it is possible to read it once and for all, and then access it like a file but at the speed of RAM access. I am looking for something like:
    Code:
    file2mem ~/mybigdatafile.dat ~/mybigdata_thats_now_accessed_superfast.dat
    The data would then be accessible through the second path, much like through a symlink.

    Does anything like that exist?
    Thanks a lot!
    Max

  2. #2
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    The problem here is that every process has its own memory space, so you can't just let a bunch of processes access each other's memory.

    A possible solution is shared memory:
    shm_overview(7)

    This would allow you to create a shared memory object with shm_open(), copy the file into it, and then mmap() that same memory into multiple other processes. Shared memory is persistent, so once it has been created, it will continue to exist until explicitly deleted or the system is shut down (even if the various processes exit).
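
    For illustration, here is a minimal sketch of that idea in C. It is not the file2mem utility discussed later in this thread. The function name file_to_shm and its return convention are assumptions made for this example; shm_open(), ftruncate(), mmap() and shm_unlink() are the standard POSIX calls. On older glibc versions you may need to link with -lrt.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: copy the file at `path` into a POSIX shared memory
 * object named `shm_name` (the name should start with '/').
 * Returns 0 on success and -1 on failure. */
int file_to_shm(const char *path, const char *shm_name)
{
    int src = open(path, O_RDONLY);
    if (src < 0)
        return -1;

    struct stat st;
    if (fstat(src, &st) < 0) {
        close(src);
        return -1;
    }
    size_t size = (size_t)st.st_size;

    /* Create (or reset) the shared memory object. */
    int dst = shm_open(shm_name, O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (dst < 0) {
        close(src);
        return -1;
    }

    int rc = -1;
    /* Size the object first: accessing mapped pages beyond the end of
     * the object raises SIGBUS. */
    if (ftruncate(dst, st.st_size) == 0) {
        if (size == 0) {
            rc = 0; /* nothing to copy; mmap() rejects a zero length */
        } else {
            void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                             MAP_SHARED, dst, 0);
            if (mem != MAP_FAILED) {
                size_t done = 0;
                ssize_t n;
                /* Read the source file straight into the mapping. */
                while (done < size &&
                       (n = read(src, (char *)mem + done, size - done)) > 0)
                    done += (size_t)n;
                rc = (done == size) ? 0 : -1;
                munmap(mem, size);
            }
        }
    }

    if (rc != 0)
        shm_unlink(shm_name); /* do not leave a partial object behind */
    close(dst);
    close(src);
    return rc;
}
```

    With made-up names, a call such as file_to_shm("/home/max/mybigdatafile.dat", "/mybigdata") would leave the data visible on Linux as /dev/shm/mybigdata.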

    Do you understand how to use this?
    DISTRO=Arch
    Registered Linux User #388732

  3. #3
    Just Joined!
    Join Date
    Mar 2010
    Posts
    15
    Thanks Cabhan, I think I understand...
    I would be able to access files mapped to memory by opening /dev/shm/myfile, right?
    Now, let's put it simply: I have never programmed in C. All I know is some shell scripting and MATLAB. Can I write a small shell script to do that, or would it have to be a compiled program?
    Also, I see the mapping is persistent. How do I erase it when I'm done? I don't want to end up restarting the system every time I lose the address of my memory map (does that make sense?)...

  4. #4
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    Yes. If you read the file into shared memory segment myfile, it would be under /dev/shm/myfile. Attempting to read this file would access kernel memory and should therefore be a straight memory read instead of going to disk (excluding RAM swapping and such).

    To set this up, as far as I know, you would need to write a C program. I'm actually rather tempted to write a C program that would read a file into shared memory so that anyone could access it, but I'm not on my Linux box right now. But once the shared memory was set up, anyone should be able to access it via the /dev/shm/ directory, with no regard for the programming language.
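
    As a rough sketch of the reading side, a second process could map an existing shared memory object read-only as shown here. The helper name shm_map_readonly is hypothetical; the calls it makes are standard POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map an existing shared memory object read-only.
 * On success, returns a pointer to the data and stores its length in
 * *len_out. Returns NULL if the object is missing, empty, or cannot be
 * mapped. The caller releases the mapping with munmap(). */
const void *shm_map_readonly(const char *shm_name, size_t *len_out)
{
    int fd = shm_open(shm_name, O_RDONLY, 0);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *mem = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (mem == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return mem;
}
```

    Programs that cannot call these functions can still read the same bytes through ordinary file I/O on the /dev/shm/ path, as described above.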

    As for removing the mapping, this can be done programmatically as well, by calling the shm_unlink() function. What I meant to say before is that the mapping will exist until it is either explicitly removed or the system is shut down.

    On a Linux system, shared memory functionality is handled by librt.so. It is designed to be used with C, but theoretically any language that can employ a shared library should be able to use these functions. Bash and Matlab do not fall into this category, sadly. C, C++, and any languages that do have librt bindings could use it.

    So I'm not sure what to tell you. If you're willing to wait a few days, I'm actually pretty interested in this right now, so I'd be willing to throw together a simple program to try it out (your file2mem program, for instance), but otherwise, you'll need to find someone who knows a bit about C programming.

  5. #5
    Just Joined!
    Join Date
    Mar 2010
    Posts
    15
    Awesome!
    I am happy to wait a few days, of course. This has been a long-standing problem for me, so I can certainly wait a bit more.
    Doing a rm /dev/shm/myfile would not remove the mapping, would it?
    In any case, let me know if you need some testing or if I can help in any way.
    Thanks!

  6. #6
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    Sorry to dig up an old thread, but I had some interesting fun with this.

    I whipped up a small utility that takes a file and reads it into shared memory. I then ran a few tests.

    For small files, reading the file from disk or from memory seemed to take exactly the same amount of time, as measured by both real time and system time. This held even for file sizes up to 4 MB.

    I then decided to go all out and created a file of size 1 GB. This is where I got some interesting results. First of all, loading the file into memory:
    Code:
    alex@danu:~/source/file2mem$ time ./file2mem test_big test
    Bus error
    
    real	0m23.188s
    user	0m1.000s
    sys	0m3.216s
    I don't know what caused the bus error, but there is now a 1 GB file sitting in shared memory. Note that only 3.216s of system time were used, but 23.188s of real time passed. I expect the difference was spent waiting on disk I/O. Also, to show that all 1 GB was loaded into memory:
    Code:
    alex@danu:~/source/file2mem$ ls -lh test_big
    -rw-r--r-- 1 alex alex 1.0G 2010-05-13 00:13 test_big
    alex@danu:~/source/file2mem$ ls -lh /dev/shm/test
    -r--r--r-- 1 alex alex 1.0G 2010-05-13 00:24 /dev/shm/test
    I will note that my computer is running pretty sluggishly right now, as it has a very small amount of RAM. I only have 1.2 GB of RAM, so I am now dedicating virtually all of it to this file. Remember that shared memory takes up space in kernel memory buffers, so it appears to be a bit harder to hide away somewhere. But apparently some swapping is happening (more on this later).

    Now we will compare the speed of reading data from disk to the speed of reading data from the shared memory:
    Code:
    alex@danu:~/source/file2mem$ time cat test_big > /dev/null
    
    real	0m28.383s
    user	0m0.000s
    sys	0m1.312s
    alex@danu:~/source/file2mem$ time cat /dev/shm/test > /dev/null
    
    real	0m9.598s
    user	0m0.012s
    sys	0m1.268s
    alex@danu:~/source/file2mem$ time cat /dev/shm/test > /dev/null
    
    real	0m0.914s
    user	0m0.016s
    sys	0m0.796s
    alex@danu:~/source/file2mem$ time cat /dev/shm/test > /dev/null
    
    real	0m0.852s
    user	0m0.036s
    sys	0m0.768s
    The first output here is from reading from disk. You will note that 28.383s are spent reading from disk, only 1.312s of which is actual program time. The rest is waiting on disk I/O.

    The outputs after that are from reading from the shared memory. The first one takes 9.6s, which is about a 66% reduction compared with the 28.4s disk read. The one after that, however, takes under a second, which is obviously MUCH faster. It is for this reason that I believe the shared memory may have been swapped out, and once re-read it becomes much more responsive. The last test is just to show that these very fast speeds continue after the shared memory has been swapped back in.

    Obviously, shared memory needs to be used with care: the larger the shared memory, the more memory the kernel is using, which can crowd out normal processes' use of memory. However, on machines with more RAM than mine, you can see roughly a 30x improvement in read time by using shared memory (28.4s from disk versus about 0.9s from memory in the runs above).

    If anyone is interested in playing with the software that I wrote, please feel free to contact me via PM and I'll send it to you. I'd be interested in any comments or feedback.

  7. #7
    Linux Newbie theNbomr's Avatar
    Join Date
    May 2007
    Location
    BC Canada
    Posts
    150
    Don't forget that the OS does a lot of optimization by storing disk data in memory caches. Also, you have the possibility to use a RAM disk for storing your data file(s). This allows any application that reads/writes data files to use the data natively. The downside, of course, is that the data is volatile until backed up to spinning media, but with shared memory, you already have the same problem.

    --- rod.
    Stuff happens. Then stays happened.

  8. #8
    Just Joined!
    Join Date
    Mar 2010
    Posts
    15
    Hi,
    Interesting info.
    I still have 3 questions:
    1) how do I create a RAM disk?
    2) can I control what the OS does in terms of caching?
    3) this is more about file2mem, directly: if I understand correctly, it is not really worth using this to read a lot of small data files into memory. Then I guess I will have to wait for my disk...

    Thanks a lot,
    Max

  9. #9
    Linux Newbie theNbomr's Avatar
    Join Date
    May 2007
    Location
    BC Canada
    Posts
    150
    A couple of links explain RAMdisk creation:
    Linux Ramdisk mini-HOWTO
    Linux RAM Disk: Creating A Filesystem In RAM
    I don't think there is very much, if any, control of caching by users. I'm almost certain that you could not access or manipulate the cache directly through any filesystem calls.
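
    Control is indeed limited, though a program can at least pass the kernel a hint. The sketch below uses posix_fadvise() with POSIX_FADV_WILLNEED to ask that a file be read into the page cache ahead of time. The helper name prefetch_file is hypothetical, and the call is advisory only: the kernel may ignore it or evict the pages again later.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: hint that the whole file at `path` will be needed
 * soon, so the kernel may start reading it into the page cache.
 * Returns 0 if the hint was accepted and -1 otherwise. */
int prefetch_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Offset 0 with length 0 means "from the start to the end of file".
     * posix_fadvise() returns an error number directly, not via errno. */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    close(fd);
    return (err == 0) ? 0 : -1;
}
```

    This does not pin the data in memory; it only gives the kernel a head start before the application begins reading.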
    You can never 'skip' the step of reading the data from disk into memory. What you might be able to do is start the process early in the boot sequence, so the data is ready to use when your application(s) start. I don't think there is any downside to using a lot of small files versus one large file in a RAM disk. The one certain downside is that once memory is allocated to the RAM disk, it cannot be relinquished back to general-purpose RAM for all applications.

    --- rod.
    Stuff happens. Then stays happened.

  10. #10
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    Quote Originally Posted by theNbomr View Post
    Don't forget that the OS does a lot of optimization by storing disk data in memory caches. Also, you have the possibility to use a RAM disk for storing your data file(s). This allows any application that reads/writes data files to use the data natively. The downside, of course, is that the data is volatile until backed up to spinning media, but with shared memory, you already have the same problem.

    --- rod.
    I agree about the data storage optimization, but for a file of a large size, it obviously may not fit in disk block caches or memory caches, hence the file2mem approach. It basically forces the file to be saved in memory, regardless of size.

    As for RAM disks, I did forget about such things, and I feel like I'm basically reimplementing that. That may be a better approach, simply because it's better supported.
