  1. #1

    kswapd algorithm

    Hi All,
    I encountered a weird problem with kswapd:
    it runs in what looks like an infinite loop, trying to reclaim until order 10 of zone highmem is balanced. As I understand it, zone highmem has nothing to do with contiguous memory (because there is no 1-1 mapping), which means kswapd will keep trying to balance order 10 of zone highmem forever (or until someone releases a very large chunk of highmem).
    Can anyone please explain the kswapd algorithm to me, and why it tries to balance order 10 of zone highmem?

    I built an instrumented kernel with debug messages in the "zone_watermark_ok" function. From the code and the debug messages I see that "zone_watermark_ok" returns 0 when kswapd invokes it (through balance_pgdat) to decide whether zone highmem is balanced. In some configurations this leads to an infinite loop in kswapd (if no large chunks of highmem are released). I added a condition to "balance_pgdat" so that it doesn't try to balance orders higher than 1 in zone highmem, and this condition solved the problem. What are the risks of such a solution? Isn't it a bug that kswapd is looking for contiguous memory in zone highmem (as I understand it, there is no 1-1 mapping in zone highmem, so contiguity there is meaningless for kswapd)?
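
For context, here is a rough user-space sketch of why an order-10 check can keep returning 0 even when plenty of pages are free. It is modeled on the shape of the 2.6-era zone_watermark_ok, but it is a simplification: struct zone_model, total_free_pages and watermark_ok_model are hypothetical names, the alloc_flags adjustments are left out, and lowmem_reserve is reduced to a single number.

```c
#include <assert.h>

#define MAX_ORDER 11 /* orders 0..10 (assumed default for that era) */

/* Hypothetical user-space stand-in for a zone's buddy free lists. */
struct zone_model {
    unsigned long nr_free[MAX_ORDER]; /* free blocks at each order */
    long lowmem_reserve;              /* reserve added to the first check */
};

/* Total free pages: each block at order o holds 2^o pages. */
static long total_free_pages(const struct zone_model *z)
{
    long total = 0;
    for (int o = 0; o < MAX_ORDER; o++)
        total += (long)(z->nr_free[o] << o);
    return total;
}

/* Returns 1 if the zone passes the watermark for this order, else 0. */
int watermark_ok_model(const struct zone_model *z, int order, long mark)
{
    assert(order >= 0 && order < MAX_ORDER);

    long min = mark;
    long free_pages = total_free_pages(z) - (1L << order) + 1;

    if (free_pages <= min + z->lowmem_reserve)
        return 0;

    for (int o = 0; o < order; o++) {
        /* Blocks smaller than the request cannot satisfy it. */
        free_pages -= (long)(z->nr_free[o] << o);
        /* Require fewer free pages at each higher order. */
        min >>= 1;
        if (free_pages <= min)
            return 0;
    }
    return 1;
}
```

In this model a zone with 10000 free pages that are all order-0 blocks passes an order-0 check but fails an order-10 check, because the loop discounts every block smaller than the requested order. Reclaiming more single pages does not change that result.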


  2. #2

    Explanation of the algorithm supplied by the Linux community

    First, here is a digest of balance_pgdat. The numbers on the left are source line numbers.

    1709 static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
    1710 {
    1732 loop_again:
    1741 for (priority = DEF_PRIORITY; priority >= 0; priority--) {
    1755 for (i = pgdat->nr_zones - 1; i >= 0; i--) {
    1773 if (!zone_watermark_ok(zone, order, zone->pages_high,
    1774 0, 0)) {
    1775 end_zone = i;
    1776 break;
    1777 }
    1778 }
    1779 if (i < 0)
    1780 goto out;
    1797 for (i = 0; i <= end_zone; i++) {
    1808 if (!zone_watermark_ok(zone, order, zone->pages_high,
    1809 end_zone, 0))
    1810 all_zones_ok = 0;

    1820 nr_reclaimed += shrink_zone(priority, zone, &sc);
    1821 reclaim_state->reclaimed_slab = 0;
    1822 nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
    1823 lru_pages);
    1840 }
    1841 if (all_zones_ok)
    1842 break; /* kswapd: all done */
    1856 if (nr_reclaimed >= SWAP_CLUSTER_MAX)
    1857 break;
    1858 }
    1859 out:
    1870 if (!all_zones_ok) {
    1871 cond_resched();
    1873 try_to_freeze();
    1875 goto loop_again;
    1876 }
    1878 return nr_reclaimed;
    1879 }

    The core algorithm is simple.
    It is a triple nested loop.
    The outer loop is 1732-1875. Its role is to prevent unnecessary priority escalation (explained below).
    The middle loop is 1741-1858. Its role is to escalate reclaim from low to high priority (the priority value counts down from DEF_PRIORITY to 0).
    The inner loop is 1797-1840. Its role is to traverse each zone.

    Why does the outer loop exist?
    Suppose the system has two CPUs: cpu-0 allocates pages continuously while
    cpu-1 runs kswapd reclaim.
    zone_watermark_ok(pages_high) does not readily succeed, even though kswapd
    succeeds in reclaiming pages.
    Unfortunately, a priority near 0 means very aggressive reclaim and
    causes a slowdown.
    So, once kswapd has reclaimed SWAP_CLUSTER_MAX pages, the priority is reset.

    On the other hand, if kswapd can't reclaim enough pages (= your case),
    it also retries the outer loop.
    kswapd hopes to free enough memory by scanning pages continuously.
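
To make the two retry situations concrete, here is a small user-space simulation of the loop structure in the digest. It is only a sketch under assumptions: struct zone_sim, shrink_zone_sim and balance_sim are made-up names, the watermark test ignores order, the "scan more as priority drops" rule is a guess at the general shape, the reclaimed count is accumulated across passes for convenience, and max_passes exists only so the simulation terminates.

```c
#include <assert.h>

#define DEF_PRIORITY     12
#define SWAP_CLUSTER_MAX 32

/* Hypothetical simplified zone: no orders, no lowmem_reserve. */
struct zone_sim {
    long free;        /* pages currently free */
    long pages_high;  /* high watermark */
    long reclaimable; /* pages that reclaim can still free */
};

static int zone_ok(const struct zone_sim *z)
{
    return z->free > z->pages_high;
}

/* Reclaim more per call as priority approaches 0 (assumed scan rule). */
static long shrink_zone_sim(struct zone_sim *z, int priority)
{
    long n = (z->reclaimable >> priority) + 1;
    if (n > z->reclaimable)
        n = z->reclaimable;
    z->reclaimable -= n;
    z->free += n;
    return n;
}

/* Returns pages reclaimed over all passes; *passes counts outer-loop runs. */
long balance_sim(struct zone_sim *zones, int nr_zones, int max_passes, int *passes)
{
    long total = 0;
    int all_zones_ok;

    assert(nr_zones > 0 && max_passes > 0);
    *passes = 0;
    do {                                          /* outer loop: loop_again */
        long nr_reclaimed = 0;
        all_zones_ok = 1;
        (*passes)++;
        for (int priority = DEF_PRIORITY; priority >= 0; priority--) {
            int end_zone = -1;
            all_zones_ok = 1;
            for (int i = nr_zones - 1; i >= 0; i--) {
                if (!zone_ok(&zones[i])) {
                    end_zone = i;
                    break;
                }
            }
            if (end_zone < 0)
                break;                            /* every zone is balanced */
            for (int i = 0; i <= end_zone; i++) { /* inner loop over zones */
                if (!zone_ok(&zones[i]))
                    all_zones_ok = 0;
                nr_reclaimed += shrink_zone_sim(&zones[i], priority);
            }
            if (all_zones_ok)
                break;
            if (nr_reclaimed >= SWAP_CLUSTER_MAX)
                break;                            /* progress: restart at DEF_PRIORITY */
        }
        total += nr_reclaimed;
    } while (!all_zones_ok && *passes < max_passes);

    return total;
}
```

With a zone that still has reclaimable pages, each pass stops early once SWAP_CLUSTER_MAX pages are freed and the zone is balanced after a few passes. With reclaimable set to 0, every pass walks all priorities, frees nothing, and the simulation only stops because of max_passes, which the real loop does not have.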

    By the way, pay attention to line 1870.
    There are 4 routes to reach it.

    case A. jump from 1780
    OK, kswapd found enough memory. It exits balance_pgdat().

    case B. jump from 1842
    Similar to A. kswapd found enough memory. It exits balance_pgdat().

    case C. jump from 1856
    kswapd reclaimed some pages. The priority is reset.

    case D. exiting the middle loop
    kswapd couldn't reclaim any pages.
    Oops, the priority is reset and kswapd retries.

    This patch only changes case D.
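
The four routes can also be written down as a small classifier. This is only an illustration: struct step, classify_pass and route_retries are hypothetical names, and each step record stands in for what one iteration of the middle loop observed. Note that, going by the digest, case D is reached whenever fewer than SWAP_CLUSTER_MAX pages were reclaimed across all priorities, which includes reclaiming none at all.

```c
#include <assert.h>

#define DEF_PRIORITY     12
#define SWAP_CLUSTER_MAX 32
#define NR_STEPS         (DEF_PRIORITY + 1)

enum route { ROUTE_A, ROUTE_B, ROUTE_C, ROUTE_D };

/* Hypothetical record of what one priority step observed. */
struct step {
    int found_unbalanced; /* top-down scan found a zone below pages_high */
    int all_zones_ok;     /* no zone failed the check inside the inner loop */
    long reclaimed;       /* pages reclaimed during this step */
};

/* Decide by which route one pass of the middle loop reaches "out:". */
enum route classify_pass(const struct step steps[NR_STEPS])
{
    long nr_reclaimed = 0;

    for (int s = 0; s < NR_STEPS; s++) {
        if (!steps[s].found_unbalanced)
            return ROUTE_A;               /* goto out at 1780 */
        nr_reclaimed += steps[s].reclaimed;
        if (steps[s].all_zones_ok)
            return ROUTE_B;               /* break at 1842 */
        if (nr_reclaimed >= SWAP_CLUSTER_MAX)
            return ROUTE_C;               /* break at 1857 */
    }
    return ROUTE_D;                       /* middle loop ran to priority 0 */
}

/* Routes A and B leave balance_pgdat; C and D jump back to loop_again. */
int route_retries(enum route r)
{
    return r == ROUTE_C || r == ROUTE_D;
}
```

A pass in which every step finds an unbalanced zone and reclaims nothing classifies as case D, and route_retries reports that it loops again, which matches the behavior described in the question.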
