Edit: I finally tracked the problem down, thanks to a tip to use valgrind. Sockets that the client closed cleanly were not being closed on the server end (sockets that hit an error were closed fine). My client only started closing its sockets properly recently, which is why the problem wasn't obvious from the start.
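
For anyone hitting the same thing, here is a minimal sketch of the case described above. It is not the server's actual code: `ReadOrClose` and `ReadResult` are hypothetical names, and it assumes a plain stream socket read with recv(). The point it illustrates is that recv() returning 0 means the peer closed the connection cleanly, and the server still has to close() its own descriptor in that case, just as it does after an error.

```cpp
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>
#include <cstddef>

// Hypothetical result type; not taken from the original server code.
enum ReadResult { kGotData, kWouldBlock, kClosed };

// Reads once from *fd. On an orderly shutdown by the peer (recv() == 0) or on
// a hard error (recv() < 0), the descriptor is closed and *fd is set to -1 so
// it cannot be added to the next fd_set by mistake.
ReadResult ReadOrClose(int *fd, char *buf, std::size_t len, ssize_t *got)
{
	ssize_t n = recv(*fd, buf, len, 0);
	if (n > 0)
	{
		*got = n;
		return kGotData;
	}

	*got = 0;
	if (n < 0 && (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK))
		return kWouldBlock;   // transient condition; keep the socket open

	close(*fd);               // n == 0: clean close by peer; n < 0: hard error
	*fd = -1;
	return kClosed;
}
```

Handling only the `n < 0` branch would match the symptom above: a client that disconnects cleanly leaves a descriptor (and whatever per-user state hangs off it) open on the server.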


Hi there, I'm currently working on a chat server. Most of it works fine, but after several hours, even with no client-side activity, the process just disappears with a "Killed" message.

The dmesg output for it is the following:
Code:
ninjaserver.exe invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0

Call Trace:
 [<ffffffff800bed05>] out_of_memory+0x8e/0x2f5
 [<ffffffff8000f071>] __alloc_pages+0x22b/0x2b4
 [<ffffffff80012720>] __do_page_cache_readahead+0x95/0x1d9
 [<ffffffff800618e1>] __wait_on_bit_lock+0x5b/0x66
 [<ffffffff880fbc61>] :dm_mod:dm_any_congested+0x38/0x3f
 [<ffffffff800130ab>] filemap_nopage+0x148/0x322
 [<ffffffff800087ed>] __handle_mm_fault+0x1f8/0xdf4
 [<ffffffff80064a6a>] do_page_fault+0x4b8/0x81d
 [<ffffffff80013333>] tcp_poll+0x0/0x12f
 [<ffffffff8005bde9>] error_exit+0x0/0x84

Mem-info:
Node 0 DMA per-cpu:
cpu 0 hot: high 0, batch 1 used:0
cpu 0 cold: high 0, batch 1 used:0
cpu 1 hot: high 0, batch 1 used:0
cpu 1 cold: high 0, batch 1 used:0
cpu 2 hot: high 0, batch 1 used:0
cpu 2 cold: high 0, batch 1 used:0
cpu 3 hot: high 0, batch 1 used:0
cpu 3 cold: high 0, batch 1 used:0
Node 0 DMA32 per-cpu:
cpu 0 hot: high 186, batch 31 used:30
cpu 0 cold: high 62, batch 15 used:49
cpu 1 hot: high 186, batch 31 used:29
cpu 1 cold: high 62, batch 15 used:48
cpu 2 hot: high 186, batch 31 used:30
cpu 2 cold: high 62, batch 15 used:59
cpu 3 hot: high 186, batch 31 used:30
cpu 3 cold: high 62, batch 15 used:49
Node 0 Normal per-cpu:
cpu 0 hot: high 186, batch 31 used:15
cpu 0 cold: high 62, batch 15 used:54
cpu 1 hot: high 186, batch 31 used:68
cpu 1 cold: high 62, batch 15 used:55
cpu 2 hot: high 186, batch 31 used:172
cpu 2 cold: high 62, batch 15 used:14
cpu 3 hot: high 186, batch 31 used:45
cpu 3 cold: high 62, batch 15 used:60
Node 0 HighMem per-cpu: empty
Free pages:       40792kB (0kB HighMem)
Active:1118174 inactive:897487 dirty:0 writeback:0 unstable:0 free:10198 slab:3934 mapped-file:28 mapped-anon:2015317 pagetables:9900
Node 0 DMA free:11172kB min:12kB low:12kB high:16kB active:0kB inactive:0kB present:10820kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 3511 8056 8056
Node 0 DMA32 free:23164kB min:5004kB low:6252kB high:7504kB active:1966912kB inactive:1572876kB present:3596256kB pages_scanned:7479413 all_unreclaimable? yes
lowmem_reserve[]: 0 0 4544 4544
Node 0 Normal free:6456kB min:6476kB low:8092kB high:9712kB active:2463032kB inactive:2059696kB present:4654072kB pages_scanned:13406328 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 3*4kB 3*8kB 4*16kB 6*32kB 4*64kB 3*128kB 0*256kB 0*512kB 2*1024kB 0*2048kB 2*4096kB = 11172kB
Node 0 DMA32: 5*4kB 3*8kB 1*16kB 0*32kB 1*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 5*4096kB = 23164kB
Node 0 Normal: 14*4kB 2*8kB 1*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 1*4096kB = 6456kB
Node 0 HighMem: empty
Swap cache: add 508414, delete 508414, find 75/128, race 0+0
Free swap  = 0kB
Total swap = 2031608kB
Free swap:            0kB
2228224 pages of RAM
182590 reserved pages
1052 pages shared
0 pages swap cached
Out of memory: Killed process 5866 (ninjaserver.exe).

I'm originally a Windows programmer and am picking up Linux as I go. I currently develop with a Cygwin CDT toolchain/Eclipse setup, then move the code across to a Linux server, where I recompile it for "real" testing. This crash has been bugging me for a couple of weeks now, and the cause is hard to diagnose because it can take up to 8 hours to happen, and gdb doesn't appear to pick up on anything odd until the kill request is sent.

If I intercepted the kill signal, is there a way from inside the program to dump information that would be useful for diagnosing the issue?
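
One caveat on the question above: the OOM killer uses SIGKILL, which a process cannot catch, block, or ignore, so nothing can be dumped at the moment of the kill. An alternative is to log resource usage periodically while the server is still running. The following is a rough sketch under the assumption that the server runs on Linux with /proc mounted; `CurrentRssKb` and `OpenFdCount` are hypothetical helper names, not part of the source below.

```cpp
#include <cstdio>
#include <cstring>
#include <dirent.h>

// Returns the resident set size in kB as reported by the "VmRSS:" line of
// /proc/self/status, or -1 if it cannot be read.
long CurrentRssKb()
{
	FILE *f = std::fopen("/proc/self/status", "r");
	if (!f)
		return -1;

	char line[256];
	long kb = -1;
	while (std::fgets(line, sizeof line, f))
	{
		if (std::strncmp(line, "VmRSS:", 6) == 0)
		{
			std::sscanf(line + 6, "%ld", &kb);
			break;
		}
	}
	std::fclose(f);
	return kb;
}

// Counts this process's open descriptors by listing /proc/self/fd, or returns
// -1 if the directory cannot be opened. The count includes the descriptor
// that opendir() itself holds while the listing is in progress.
int OpenFdCount()
{
	DIR *d = opendir("/proc/self/fd");
	if (!d)
		return -1;

	int count = 0;
	while (struct dirent *e = readdir(d))
	{
		if (e->d_name[0] != '.')   // skip "." and ".."
			++count;
	}
	closedir(d);
	return count;
}
```

Writing both numbers to the log each time select() times out (the `results == 0` path in GetComms) would show whether memory or the descriptor count climbs steadily over the hours before the kill.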

I'm including the relevant parts of the source too, in case they give any hints. I've trawled through this several times and can't see anything that would cause a memory leak.

Any help would be appreciated, and if anyone has questions about the setup or anything else, I'll try to provide the necessary information.

Code:
// main loop
	while(true)
	{
		if(!GetComms(&sockIncoming) || g_exitFlag)
			break;
	}


// GetComms
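int GetComms(Listener * incoming)
{
	// The set is rebuilt from scratch on every call; only the user list
	// actually needs to persist between calls.
	static fd_set socks;
	static int highSock = 0;
	
	static UserList users;
	
	// UpdateSocks() calls FD_ZERO itself before adding the listener and users.
	UpdateSocks(&socks, incoming, &highSock, &users);
	
	// select() modifies the set it is given, so work on a copy.
	fd_set tempSocks;
	memcpy(&tempSocks, &socks, sizeof(socks));
	
	// On Linux select() may also modify the timeout, so set it on every call.
	timeval timeOut;
	timeOut.tv_sec = 10;
	timeOut.tv_usec = 0;
	
	int results = select(highSock+1, &tempSocks, (fd_set*)0, (fd_set*)0, &timeOut);
	
	if(results < 0)
	{
		incoming->AddLog("Error in GetComms. Negative results on select");
		return 0;
	}
	
	if(results > 0)
	{
		return ProcessSocks(&tempSocks, incoming, &users);
	}
	// results == 0: timed out with no activity.
	return 1;
}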

void UpdateSocks(fd_set * socks, Listener * incoming, int * highSock, UserList * users)
{
	FD_ZERO(socks);
	FD_SET(incoming->GetSock(), socks);
	*highSock = incoming->GetSock();
	
	users->UpdateSet(socks, highSock);
}

void UserList::UpdateSet(fd_set * socks, int * highSock)
{
	if(m_users.empty())
		return;
	
	for(list<User>::iterator it = m_users.begin(); it != m_users.end(); it++)
	{
		if(it->IsValid())
		{
			it->AddToList(socks, highSock);
		}
	}
}

bool User::AddToList(fd_set * socks, int * highSock)
{
	if(m_sock.IsValid())
	{
		FD_SET(m_sock.GetSock(), socks);
		if(m_sock.GetSock() > (*highSock))
			*highSock = m_sock.GetSock();
		
		return true;
	}
	else
		return false;
}
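
One thing worth checking in the code above: UpdateSet skips users for which IsValid() is false, but nothing shown here removes them from m_users. Whether they are erased elsewhere is not visible in this excerpt; if they are not, the list keeps growing for the lifetime of the process. A minimal sketch of pruning such entries follows. `SketchUser`, `IsDead`, and `PruneDeadUsers` are hypothetical stand-ins, since the real User class is not shown.

```cpp
#include <cstddef>
#include <list>

// Hypothetical stand-in for the User class; only what the sketch needs.
struct SketchUser
{
	int fd;                                   // -1 once the socket is closed
	bool IsValid() const { return fd >= 0; }
};

// Predicate for std::list::remove_if.
static bool IsDead(const SketchUser &u)
{
	return !u.IsValid();
}

// Removes every entry whose socket is no longer valid and returns how many
// entries were dropped.
std::size_t PruneDeadUsers(std::list<SketchUser> &users)
{
	std::size_t before = users.size();
	users.remove_if(IsDead);
	return before - users.size();
}
```

A call like this could run once per pass before the fd_set is rebuilt, so that dead entries do not accumulate between select() calls.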