Intermittent rpm DB corruption issue on Red Hat 6
I'm trying to track down an intermittent rpm DB corruption issue, and I'd like to know whether other folks have encountered the same issue and/or know how to prevent it.
I'm in an enterprise environment where we package our apps into many different rpm packages. We have many builds per day and many internal deployments per day. Our internal deployments span 200+ servers. Our system had been working beautifully for a long time with no issues. A few months back we migrated from CentOS to Red Hat, and shortly thereafter we started seeing bizarre corruption in our rpm database. (We're not sure that the CentOS --> Red Hat switch had anything to do with it, but it's something we're looking at.)
In short, our rpm database gets into a funky state where "rpm -qa" shows multiple versions of the same package installed, yet when you query the rpm db for that package by name, it says that the package is not installed. When we see this, we manually remove the packages in question and rebuild the DB to get things back to a working state. This problem pops up intermittently, across many different servers and many different packages (but all of them internal packages that we deploy frequently). Weeks can go by without seeing this issue, and then we'll see an "outbreak" of it on a random assortment (10%) of our servers.
We've written automation to detect this and even automation to correct the problem, but we haven't figured out the root cause, nor can we reproduce the issue on demand.
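To make the symptom concrete, here is a minimal sketch of what a detection check along these lines could look like. It is not the actual automation: the function names `duplicate_names` and `check_rpmdb` are hypothetical, and the check assumes the symptom is exactly "name appears more than once in `rpm -qa`, but `rpm -q <name>` fails".

```shell
#!/bin/sh
# Hypothetical helper: print every name that occurs more than once on stdin
# (one package name per line).
duplicate_names() {
    sort | uniq -d
}

# Hypothetical check: for each package name that rpm -qa lists more than once,
# ask rpm for it by name. Legitimate multi-version packages (e.g. kernel)
# still answer to rpm -q, so only the inconsistent ones are reported.
check_rpmdb() {
    rpm -qa --qf '%{NAME}\n' | duplicate_names |
    while read -r name; do
        if ! rpm -q "$name" >/dev/null 2>&1; then
            echo "inconsistent: $name"
        fi
    done
}
```

Calling `check_rpmdb` from cron or at the end of each deployment would print one line per affected package name, which at least narrows down when the DB went bad.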
At the end of the day, our autodeployment automation uses "rpm --install" or "rpm --upgrade --oldpackage" for deployment (we explicitly don't use yum to install our internal packages, for various unrelated reasons). Our automation also ensures that multiple installations don't occur at the same time (on top of rpm's own DB locks). We deploy somewhere between 2 and 8 different app packages per machine, and up to 12 different versions of each app package are deployed each day. Most of these deployments occur flawlessly. The app rpms deploy locally, and as far as we know the scriptlets don't depend on anything outside the host.
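For illustration only, the deployment step described above could be serialized and checked roughly as follows. This is a sketch under assumptions, not the real deployment code: `with_lock` and `deploy_pkg` are hypothetical names, the lock file path and the 600-second timeout are made up, and it relies on `flock` from util-linux being available.

```shell
#!/bin/sh
# Hypothetical helper: run a command while holding an exclusive lock on
# the given lock file. Waits up to 600 seconds (assumed value) for the lock.
with_lock() {
    lockfile=$1
    shift
    (
        flock -w 600 9 || exit 1
        "$@"
    ) 9>"$lockfile"
}

# Hypothetical deploy step: upgrade one package file under the lock, then
# immediately confirm that rpm can find the package by name.
deploy_pkg() {
    pkgfile=$1
    name=$(rpm -qp --qf '%{NAME}\n' "$pkgfile") || return 1
    with_lock /var/lock/app-deploy.lock \
        rpm --upgrade --oldpackage "$pkgfile" || return 1
    if ! rpm -q "$name" >/dev/null 2>&1; then
        echo "rpmdb inconsistent right after deploying $name" >&2
        return 2
    fi
}
```

The post-install `rpm -q` is the interesting part: if it ever fails right after a successful upgrade, the corruption is pinned to that specific transaction instead of being discovered days later.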
--- begin example ---
# rpm -qa | grep jsearch
$ rpm -q jsearch_wa
package jsearch_wa is not installed (**wrong**)
$ rpm -q jsearch_wa-
$ rpm -qa jsearch_wa
$ rpm --erase --allmatches jsearch_wa-\*
$ rpm --rebuilddb
Autodeployment of latest package which calls the following:
rpm --install --oldpackage <path>/jsearch_wa-118.0.53-1.x86_64.rpm
$ rpm -q jsearch_wa
Subsequent deployments (rpm --upgrade) work as expected (old package removed, new package installed, rpmdb queries return correct results).
--- done with example ---
We've started logging some "yum history" output after our deployments to try to track this issue down further, but the problem is randomly intermittent, so we're waiting for it to occur again.
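Alongside the yum history output, a timestamped package-list snapshot before and after each deployment might help pin down when the DB changed. The following is only a sketch: `save_snapshot` is a hypothetical helper, and the `/var/log/rpm-snapshots` directory in the usage comment is an assumed location.

```shell
#!/bin/sh
# Hypothetical helper: read a package list on stdin, sort it, and save it
# under directory $1 in a file named by UTC timestamp and label $2.
# Prints the path of the file it wrote.
save_snapshot() {
    dir=$1
    label=$2
    mkdir -p "$dir" || return 1
    out="$dir/$(date -u +%Y%m%dT%H%M%SZ)-$label.txt"
    sort > "$out" && echo "$out"
}

# Assumed usage around a deployment:
#   rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH} %{INSTALLTIME}\n' \
#       | save_snapshot /var/log/rpm-snapshots pre
#   ... deployment runs here ...
#   rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH} %{INSTALLTIME}\n' \
#       | save_snapshot /var/log/rpm-snapshots post
```

Diffing the "pre" and "post" files from the deployment where duplicates first show up would indicate which transaction left the stale entries behind.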
Anyone ever seen anything like this? Have any advice on where to look for the root cause?
Thanks in advance.