NFS inode cache eats up memory, design flaw?
Hello,
We have an NFS share with a huge number of files and noticed that the servers that access many files soon have almost no free memory left (even counting buffers and cache). We soon found out that most of it was taken by the nfs_inode_cache slab.
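A minimal sketch of how the size of that slab cache could be checked, assuming the usual /proc/slabinfo layout (name, active_objs, num_objs, objsize, ...). The function name slab_cache_bytes is made up for illustration, and the result is only an estimate (object count times object size, ignoring per-slab overhead):

```python
def slab_cache_bytes(slabinfo_text, cache_name):
    """Return the approximate bytes held by one slab cache, or None if absent.

    Expects the text of /proc/slabinfo, where each data line starts with:
    name active_objs num_objs objsize ...
    """
    for line in slabinfo_text.splitlines():
        fields = line.split()
        # Skips blank lines, the version line and the '# name ...' header,
        # because none of them start with the requested cache name.
        if len(fields) < 4 or fields[0] != cache_name:
            continue
        num_objs = int(fields[2])   # allocated objects (active or not)
        obj_size = int(fields[3])   # size of one object in bytes
        return num_objs * obj_size
    return None


def read_slab_cache_bytes(cache_name, path="/proc/slabinfo"):
    """Same as above, but reads the file itself (usually needs root)."""
    with open(path) as f:
        return slab_cache_bytes(f.read(), cache_name)
```

Run as root, read_slab_cache_bytes("nfs_inode_cache") should give roughly the number we see in slabtop.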
Dropping the caches via /proc/sys/vm/drop_caches frees the memory, but setting swappiness to 1 doesn't help: the cache still grew to the same size and applications were swapped out to free up memory. For us swapping is bad: we have strict SLAs for the response times of our applications and very lax SLAs on the NFS.
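For completeness, a sketch of the manual cleanup step. It assumes the value 2, which per the kernel documentation frees reclaimable dentries and inodes (1 frees the page cache, 3 frees both). The function name is hypothetical, and the path is a parameter only so the function can be tried against an ordinary file:

```python
import os


def drop_inode_and_dentry_caches(path="/proc/sys/vm/drop_caches"):
    """Ask the kernel to free reclaimable dentries and inodes (value 2).

    Must be run as root when pointed at the real /proc file.
    """
    # Flush dirty data first: drop_caches only frees clean objects.
    os.sync()
    with open(path, "w") as f:
        f.write("2\n")
```

This is a one-off operation; the cache simply fills up again afterwards, which is exactly what we observe.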
We can easily reproduce the issue by doing a full scan of the NFS drive: the nfs_inode_cache soon grows to much more than 1GB, while cache and buffers together (as reported by the free command) stay below 100MB.
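The gap between the two numbers can be put side by side with a small sketch that reads the Buffers, Cached, Slab and SReclaimable fields of /proc/meminfo. The helper names are made up for illustration; note that, depending on its version, free may or may not count slab memory in its cache column:

```python
def parse_meminfo(meminfo_text):
    """Map /proc/meminfo field names to their numeric value (kB for most fields)."""
    values = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed 'Name:   12345 kB' style lines.
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def cache_vs_slab_kb(meminfo_text):
    """Return (buffers + page cache, total slab, reclaimable slab) in kB."""
    m = parse_meminfo(meminfo_text)
    page_cache = m.get("Buffers", 0) + m.get("Cached", 0)
    return page_cache, m.get("Slab", 0), m.get("SReclaimable", 0)
```

Calling cache_vs_slab_kb(open("/proc/meminfo").read()) during the scan should show the first number staying small while the slab numbers keep growing.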
The issue is so bad that the find we execute even grinds to a halt. We aren't sure it is due to the cache, but we are investigating it.
So much for the things I'm sure about. I investigated further and came to the following conclusion (correct me if I'm wrong).
I found the following image describing the file system design of Linux: tldp.org/LDP/tlk/fs/vfs.gif (sorry, I can't post urls yet). The nfs_inode_cache (and ext4_inode_cache) aren't shown in it, which seems to be exactly the problem the kernel has.
The kernel seems to be aware only of the VFS inodes; it does not (for obvious reasons) know about the NFS or ext4 inode cache. As far as I could understand the kernel code, NFS inodes are added to and removed from the cache based on calls that come from the VFS. Because the VFS only sees its own cache, it might think it uses only x MB while in fact there is another x GB behind it that it can't see. This could explain why the kernel does not shrink the cache under memory pressure, while clearing it manually does free the NFS inode cache: the NFS cache is linked to the VFS cache, so dropping the VFS entries releases the NFS ones as well.
It seems to me that the problem exists for all file systems, but is especially visible for NFS. Since NFS is network based, it might require quite a lot more information in the cache per inode (this is pure guessing). This would cause a much bigger "hidden" cache.
So my questions are:
* Is my conclusion (somewhat) correct?
* What can I do about it?
Kind regards,
Bryan Brouckaert.