- 04-19-2010 #1Just Joined!
- Join Date
- Apr 2010
- Posts
- 18
Shared Memory IPC Variations
Hi,
I have two applications that send and receive messages through shared memory IPC. When I run them, it works, but the number of messages per second keeps changing drastically: sometimes it is 400-500 per second, then 800, then 1200, then 2000. Is this normal with SHM IPC, or could it be a code-related issue?
- 04-20-2010 #2Just Joined!
- Join Date
- Jan 2006
- Posts
- 1
You did not mention any locking mechanisms. Assuming these are OK (if they exist), suspect an external system pathology, such as running out of memory and page faulting, or other CPU hogs running. Start with vmstat for system-level load, then look at the information in /proc/<proc-id> for your applications.
Also, make sure your time measurement is accurate. Verify by averaging over more than 10 seconds to get a per-second rate.
- 04-20-2010 #3Just Joined!
- Join Date
- Mar 2009
- Location
- Moves between London, Oslo, Brussels
- Posts
- 30
IPC on Shared Memory
This may sound strange, but it's based on a number of Unix porting projects and several ports of huge application packages.
Do not "invent" your own IPC based on shared memory. There are a number of reasons; the usual one is wrong signalling between the processes, so that your applications do not catch the signal. Use local "Sockets" - tcp/ip, or udp if you need multidrop.
Shared memory is intended for applications to share data structures of a persistent nature: e.g. all share a pool of instrument read-outs, update these, and read the data to show on screen.
You will never ever be able to obtain the same consistency in signal handling as the kernel (which uses the I/O system). I have seen a number of implementations of these, and few come close to the performance of plain vanilla "sockets". The top-end 2000 messages per second is getting close, but I would expect sockets to exceed 5000 on most (local) platforms today, and 20 000 on the better ones.
You also get a highly modular system that you can split across a number of processors, leaving it to the system to figure out the fastest way to communicate.
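For comparison, here is a rough sketch of kernel-supplied local IPC between a parent and a forked child. It swaps in a Unix-domain `socketpair` with `SOCK_SEQPACKET` rather than the tcp/ip or udp sockets named above, purely to keep the example short and to get a clean end-of-file when the parent closes its end; the function name `socket_roundtrips` is hypothetical. It assumes Linux.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Send n small messages to a forked child over a local socket pair and
 * count the echoes. Returns the number of completed round trips, or -1. */
static int socket_roundtrips(int n)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {                 /* child: echo until the parent closes */
        char buf[64];
        ssize_t len;
        close(sv[0]);
        while ((len = recv(sv[1], buf, sizeof buf, 0)) > 0)
            send(sv[1], buf, (size_t)len, 0);
        _exit(0);
    }

    close(sv[1]);                   /* parent keeps only its own end */
    char msg[16] = "ping", reply[16];
    int done = 0;
    for (int i = 0; i < n; i++) {
        if (send(sv[0], msg, sizeof msg, 0) != (ssize_t)sizeof msg)
            break;
        if (recv(sv[0], reply, sizeof reply, 0) != (ssize_t)sizeof msg)
            break;
        done++;
    }
    close(sv[0]);                   /* child sees end-of-file and exits */
    waitpid(pid, NULL, 0);
    return done;
}
```

Timing a call such as `socket_roundtrips(100000)` gives a baseline to compare the shared memory design against on the same machine.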
- 04-20-2010 #4Just Joined!
- Join Date
- Mar 2009
- Location
- Moves between London, Oslo, Brussels
- Posts
- 30
Hmmm... the locking should not be needed between two processes if you include message length and status in the message header. System semaphores are usually implemented by the same mechanism as "sockets", and are just overhead. The only way to beat it is for the processes that read, e.g., a queue to all wait for a user signal; the writer then signals every waiting process, which catches the event of the write (reading the update flag on the queue after catching the event).
If you lock, just drop it; the system-supplied IPC will be 10 times faster.
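One way to read the "length and status in the message header" idea is a single-writer, single-reader slot whose status field hands ownership back and forth. The sketch below is an assumption about what such a layout could look like, not the poster's design: `struct slot`, `slot_write`, `slot_read` and the 256-byte payload are all hypothetical, and C11 atomics are swapped in so the status flag is published safely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

enum { SLOT_EMPTY = 0, SLOT_FULL = 1 };
#define SLOT_PAYLOAD 256

/* Hypothetical layout for one message slot placed in shared memory. */
struct slot {
    _Atomic uint32_t status;     /* SLOT_EMPTY or SLOT_FULL */
    uint32_t length;             /* valid bytes in payload */
    char payload[SLOT_PAYLOAD];
};

/* Writer side: returns 0 on success, -1 if the slot is still full. */
static int slot_write(struct slot *s, const void *data, uint32_t len)
{
    if (len > SLOT_PAYLOAD)
        return -1;
    if (atomic_load_explicit(&s->status, memory_order_acquire) != SLOT_EMPTY)
        return -1;
    memcpy(s->payload, data, len);
    s->length = len;
    /* Release: payload and length become visible before the status flips. */
    atomic_store_explicit(&s->status, SLOT_FULL, memory_order_release);
    return 0;
}

/* Reader side: returns the message length, or -1 if nothing is waiting. */
static int slot_read(struct slot *s, void *out, uint32_t cap)
{
    if (atomic_load_explicit(&s->status, memory_order_acquire) != SLOT_FULL)
        return -1;
    uint32_t len = s->length;
    if (len > cap)
        return -1;
    memcpy(out, s->payload, len);
    atomic_store_explicit(&s->status, SLOT_EMPTY, memory_order_release);
    return (int)len;
}
```

This only holds with exactly one writer and one reader per slot. With several writers the status check and the copy are no longer one step, and some form of locking comes back.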
- 04-21-2010 #5Just Joined!
- Join Date
- Apr 2010
- Posts
- 18
Well, what happens design wise is actually this.
I have one application that forks 5 instances (for test purposes; the actual scenario is around 100). Each instance communicates with one central process using shared memory (it writes to the central process's shared memory), and the central process in turn relays the message to one of 5 forked receiving processes. The receiving process replies directly to the forked sender process. They signal each other using their process IDs. I use semaphores to lock the structure while writing and signaling another process.
Currently I get anywhere between 300-2000 messages per second per instance.
Are you saying this can be 20000 PER SECOND? I was thinking even 2000 was maybe high; I was expecting a few hundred. The signals seem to be working fine (I ran the application overnight and over 48 hrs and no signals seem to be lost, since I unlock only after sending the signal and then do a sleep(0) so control immediately switches to the process that received the signal). But the varied results do bother me.
- 04-22-2010 #6Just Joined!
- Join Date
- Mar 2009
- Location
- Moves between London, Oslo, Brussels
- Posts
- 30
NO - but the variation in performance will be less. You should see if you can use simpler signalling, allow all processes to work on the queue, and use semaphores with caution. Depending on the kernel, 20 000 is possible; we did that in 1987.
Protect the message queue, and signal once at the end of inserting a new element. Minimize the code around the read/write and reuse shared memory.
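A sketch of "protect the queue, signal once per insert", assuming POSIX unnamed semaphores placed in an anonymous shared mapping that forked children inherit. The names `shm_queue`, `queue_create`, `queue_push` and `queue_pop` are hypothetical, and using a counting semaphore as the wake-up is one possible choice, not necessarily what was meant above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <semaphore.h>
#include <stddef.h>
#include <sys/mman.h>

#define QCAP 64   /* power of two, so the unsigned indices wrap cleanly */

struct shm_queue {
    sem_t lock;            /* binary semaphore guarding head and tail */
    sem_t items;           /* counts queued messages; readers sleep on it */
    unsigned head, tail;
    int msgs[QCAP];
};

/* Map the queue so that processes forked afterwards share it. */
static struct shm_queue *queue_create(void)
{
    struct shm_queue *q = mmap(NULL, sizeof *q, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (q == MAP_FAILED)
        return NULL;
    /* Second argument 1: semaphore is shared between processes. */
    if (sem_init(&q->lock, 1, 1) != 0 || sem_init(&q->items, 1, 0) != 0) {
        munmap(q, sizeof *q);
        return NULL;
    }
    q->head = q->tail = 0;
    return q;
}

/* Returns 0 on success, -1 if the queue is full. */
static int queue_push(struct shm_queue *q, int msg)
{
    int ok = -1;
    sem_wait(&q->lock);
    if (q->tail - q->head < QCAP) {
        q->msgs[q->tail++ % QCAP] = msg;
        ok = 0;
    }
    sem_post(&q->lock);
    if (ok == 0)
        sem_post(&q->items);   /* one wake-up, after the insert is complete */
    return ok;
}

/* Blocks until a message is available, then returns it. */
static int queue_pop(struct shm_queue *q)
{
    sem_wait(&q->items);
    sem_wait(&q->lock);
    int msg = q->msgs[q->head++ % QCAP];
    sem_post(&q->lock);
    return msg;
}
```

The critical section holds only the index update and one store, in line with the advice to minimize the code around the read/write. On older glibc versions this may need to be linked with `-pthread`.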
Variation in performance is an indicator of hitting a bottleneck. You should review all system resources; there is a "performance bug" in there just waiting to be found. The bad news is that this may even be hardware - the way memory is organised. If the cache needs to be cleared because you use shared memory, the "Cache Clear" may cause page faults (PF) because of the huge cache used. On the larger computers this is solved by the "Scalable Coherent Interface".
- 04-22-2010 #7Just Joined!
- Join Date
- Apr 2010
- Posts
- 18
Thanks khf. I am going to walk through the detailed design once more and check for any loopholes. I ran it overnight yesterday for 12 hrs and it gave me 60 million messages per instance. Can you brief me a little more on the cache factor you were talking about? I am not aware of how it works in this shared memory scenario or what I should do.
- 04-23-2010 #8Just Joined!
- Join Date
- Mar 2009
- Location
- Moves between London, Oslo, Brussels
- Posts
- 30
GREAT! That is the correct method.
Hmmm. The guys at Dolphin need that edge.
OK. The Intel hardware is a menace. The addressing was designed as a quick solution for the S100 bus - intended to make washing machines. You need 5 cycles to load from memory (and 8 if you use 64 bit addressing) on the Intel instruction set. So a 2GHz computer would need 2.5ns to fetch an instruction or data at that rate. Well, the memory ("RAM") runs at a fraction of this speed - say 200MHz, and needs 5 full cycles per access, i.e. about 25ns... So what we do is to keep a copy near the CPU, hidden - "cached". Thus the CPU can run at full speed. Then as the memory is modified, the modifications are written to the RAM. With increased application size, the "cache" grows to keep enough local copies for the CPU to stay busy as much as possible.
When you change process, the OS (Linux) will flush the content of this cache to RAM, and start the next process with an empty cache. If you use single process but different "threads" you can avoid this flushing.
Dolphin's SCI technology is about managing the content of these caches so that multi-processors can access a shared RAM and retain a consistent cache. Beware that our PCs also use shared memory with dedicated video controllers and disk DMA that access the same memory. The video will never write, so that is safe, but a disk DMA transfer writes to memory whose content the CPU may hold copies of - and thus causes an inconsistency.
You can gain a lot in performance by just keeping things tight - do what you have to do in tight loops, simply because a CPU with cache will run a tight loop without memory delay. A kernel request may cause the cache to be flushed. A few MB sounds like overkill, but if the hardware is already busy updating the screen and a page fault causes the DMA channel to be activated, then the memory bus may end up without cycles - since these also need 5 "cycles" to access RAM, and we are talking about delays of nanoseconds here.
If the kernel passes the same message, it makes a queue, copies from one process to the other, and signals at completion. The kernel is small enough and active enough to be resident, so passing messages with "Sockets" may be faster; the kernel "owns" all the other processes, and a switch to it is a "light" context switch.
I have seen so many vain attempts at making "RAM-Resident Databases" that my reason for responding is to raise awareness that "it's not easy". The common error is that they blame not capturing events properly, and then end up with complex "handshaking" to make certain that the communication works. Try removing the semaphore, and let just one process be the "writer", with all the others sending their updates to it as socket requests. Then they can read freely and are halted only when sending a message. Or they can set a "dirty bit" to indicate that an update is needed, and skip reading until the dirty bit has been cleared. Processes that only read also cause very few "dirty bits" to be set in the RAM cache.
There is also a fleet of flags you can set for socket messages. Since these are so critical for network performance, a lot of effort has been put into getting them as fast as possible. This is not Windows, all socket options are exposed - so what is stopping you?
The theory here is queueing theory from mathematics / statistics. Denning at MIT has tried to explain a lot about why things go into hiccup mode in his "Working Set models". An old classic is Coffman & Denning: Operating Systems Theory.
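As one concrete instance of such a flag, the sketch below sets `TCP_NODELAY` with `setsockopt` and reads it back with `getsockopt`. The post does not say which options it has in mind, so this choice is an assumption; the helper names `tcp_socket_nodelay` and `tcp_nodelay_enabled` are hypothetical.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket with Nagle's algorithm disabled, so small messages
 * are sent without waiting to be coalesced. Returns the fd, or -1. */
static int tcp_socket_nodelay(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    int on = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read the option back: 1 if enabled, 0 if not, -1 on error. */
static int tcp_nodelay_enabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof val;
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
        return -1;
    return val != 0;
}
```

Whether any given option helps depends on the message sizes and traffic pattern, so it is worth measuring before and after.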
- 05-24-2010 #9Just Joined!
- Join Date
- May 2010
- Posts
- 1
Hello.
So. I have a related problem. I'm using shared memory between two processes. One is 'unwilling' and I hack into it using LD_PRELOAD, hook some library functions and set up the shared memory.
Second is a 'client' that connects to the 'unwilling' process.
Synchronization is done with a homebrew spinlock that resides in the shared memory - basically a variable the processes poll for changes, with some strict rules on what happens depending on its value.
As you can see, I don't want to get the kernel involved much, but at the same time, I could benefit from some better process scheduling. The most common use case is getting sparse sets of memory pieces out of the 'unwilling' process in a kind of 'burst' pattern. Once every 100ms, I update the state of the 'client' and pull data from the 'unwilling'. This has to complete as fast as possible, because any extra time spent moving data causes visible latencies in both programs.
Some (very) rough numbers, might not add up:
With two cores, I get ~700000 messages per second and ~800MB/s (varies depending on message length obviously, this is with messages of ~0-4K in length).
This is quite nice, but requires both cores. I can live with that.
With only one core available, I'm forced to do a bit of wizardry in the spinlock - I tell it to give up the CPU to the other process. This way, I get only ~179000 messages/s. This I can't really live with. I'd like to get it closer to ~300000 so that it's equivalent to my previous implementation -- reading from /proc/PID/mem.
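For readers following along, a spin-then-yield wait of the kind described might look roughly like this. It is a sketch under assumptions, not the code from the linked repository: `wait_for_state` and `spin_limit` are hypothetical names, and `sched_yield` is assumed as the way of giving up the CPU.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

/* Wait until *state equals `want`. Polls up to `spin_limit` times, then
 * yields the CPU on every further miss so the other process can run.
 * Returns the number of yields performed. */
static unsigned wait_for_state(_Atomic int *state, int want, unsigned spin_limit)
{
    unsigned spins = 0, yields = 0;
    while (atomic_load_explicit(state, memory_order_acquire) != want) {
        if (spins < spin_limit) {
            spins++;               /* cheap when the peer runs on another core */
        } else {
            sched_yield();         /* on one core, spinning only burns the time slice */
            yields++;
        }
    }
    return yields;
}
```

With both processes pinned to one core, a `spin_limit` of 0 makes the waiter yield immediately. Every hand-over still costs a trip through the scheduler, which is consistent with the lower single-core rate reported above.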
Any ideas on how to make the thing behave better when pinned to one core?
The code that goes into the 'unwilling' program can be found here: github.com/peterix/dfhack/tree/master/dfhack/shm/
I know, it's a bit crazy, but I want to make this as fast as possible.
- 05-24-2010 #10Just Joined!
- Join Date
- Mar 2009
- Location
- Moves between London, Oslo, Brussels
- Posts
- 30
On the dual core you have a way of kicking "another" instead of waiting, which is the clue. The "ditch" that may cause the single core to go slower is that you wait in a tight loop that is always ready to execute. That will consume CPU resources. Have the waiting processes insert their PIDs, and have the writer, when it is waiting, kick the waiting readers (signal them).
A simple "performance bug" is to patch in the code "Jump to current location" - "JMP *". That will (usually) monopolise the CPU. A loop such as While (qEmpty) Signal (IOWAIT,NextPID); may be too tight given the amount of memory in the cache. Make "qEmpty" access more memory....
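The "insert your PID and get kicked" scheme can be sketched with a blocked signal and `sigwait`, so the reader sleeps instead of polling. This is a hedged illustration assuming Linux and a parent/child pair; `struct mailbox`, `kick_waiting_reader` and the choice of `SIGUSR1` are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct mailbox {
    _Atomic pid_t waiter;   /* PID of the sleeping reader, 0 if none */
    _Atomic int value;      /* the "message" */
};

/* Fork a reader that registers its PID and sleeps; the writer stores a value
 * and kicks it. Returns the value the reader saw (0..255), or -1 on error. */
static int kick_waiting_reader(int value)
{
    struct mailbox *mb = mmap(NULL, sizeof *mb, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mb == MAP_FAILED)
        return -1;              /* fresh anonymous mappings are zero-filled */

    sigset_t set, old;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    /* Block before fork: an early kick stays pending instead of being lost. */
    sigprocmask(SIG_BLOCK, &set, &old);

    pid_t pid = fork();
    if (pid < 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        munmap(mb, sizeof *mb);
        return -1;
    }
    if (pid == 0) {                              /* reader */
        int sig;
        atomic_store(&mb->waiter, getpid());     /* register as waiting */
        sigwait(&set, &sig);                     /* sleep without using CPU */
        _exit(atomic_load(&mb->value) & 0xff);
    }

    pid_t w;
    while ((w = atomic_load(&mb->waiter)) == 0)  /* wait for registration */
        sched_yield();
    atomic_store(&mb->value, value);             /* write first, then kick */
    kill(w, SIGUSR1);

    int status = 0;
    waitpid(pid, &status, 0);
    sigprocmask(SIG_SETMASK, &old, NULL);
    munmap(mb, sizeof *mb);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The writer still yields briefly while the reader registers; after that the reader consumes no CPU until it is signalled, which is the point of kicking rather than spinning on a single core.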



