The interface to __alloc_pages() is wrapped up by the
get_free_pages() and alloc_pages() functions, but __alloc_pages() is
really the central physical-page allocation function in the
kernel.
It is comparatively well-commented, so I won't go into too much detail
here. Essentially, __alloc_pages() looks at the zones in the given
zonelist in order, trying to allocate a free block of 2^order pages
from each in turn using the buddy
allocator. If there are not enough free pages, it wakes up kswapd
and/or bdflush; those tasks (both kernel threads) attempt to free
some pages by writing dirty pages to disk.
There are some complicated interactions between __alloc_pages() and
the higher-level VM code. Each zone keeps track of pages that have
recently been mapped into some task's VM, and we may decide to reclaim
some of those pages rather than allocating actual free pages.
If there are a lot of free pages (>= zone->pages_low) in any of the
zones in the zonelist, we attempt to immediately allocate a block of
the requested order at line 332 with a call to rmqueue(). Otherwise, the actual allocation
attempts are delegated to __alloc_pages_limit(). Basically, we are
trying to allocate something first from a not-heavily-used zone, and
second from a more-preferred zone. That is, if one zone in
the zonelist has substantially more free pages, we are likely to
allocate from it, even if it's not the preferred zone for the
allocation. But when all the zones are more-or-less equally allocated,
we will allocate from the most-preferred zone that can support the
allocation request.
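To make the first-pass policy concrete, here is a minimal sketch of the "is any zone comfortably above pages_low?" check. The struct and function names (sketch_zone, pick_plentiful_zone) are made up for illustration and are much simpler than the kernel's zone_t; the sketch only models the threshold test, not the allocation itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for the kernel's zone_t. */
struct sketch_zone {
    const char *name;
    unsigned long free_pages;  /* pages currently on the freelists */
    unsigned long pages_low;   /* "low on free pages" threshold */
};

/*
 * Walk a NULL-terminated zonelist in preference order and return the
 * first zone with at least pages_low free pages. Return NULL if every
 * zone is below its threshold; that is the case in which the real code
 * hands the decision over to __alloc_pages_limit().
 */
struct sketch_zone *pick_plentiful_zone(struct sketch_zone **zonelist)
{
    assert(zonelist != NULL);
    for (struct sketch_zone **z = zonelist; *z != NULL; z++) {
        if ((*z)->free_pages >= (*z)->pages_low)
            return *z;
    }
    return NULL;
}
```

Because the list is walked in preference order, a less-preferred zone wins only when every zone ahead of it is below its own pages_low.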
__alloc_pages_limit(zonelist_t *zonelist, unsigned long order, int
limit, int direct_reclaim) does the actual work of interrogating the
zone_structs and determining whether an allocation from any zone is
possible within the specified constraints. It's called when
__alloc_pages() can't immediately find a mostly-free zone from which
to do the allocation.
The first two arguments are obvious; the last two are less so.
limit is an enumerated value (passed as an int) that selects which of
three per-zone thresholds a zone's free+inactive_clean (F+IC) page
count must reach before we use that zone for allocation.
zone_struct->pages_min is the minimum number of F+IC pages the
zone should keep available; zone_struct->pages_low is the number of
F+IC pages at which the zone is considered to be low on free pages;
and zone_struct->pages_high is the number of F+IC pages considered to
be "lots". If we find a zone with a suitable number of F+IC pages, we
call rmqueue() to try to perform the
allocation.
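The watermark selection can be sketched as follows. Everything here is an assumption made for illustration: the enum constants, the wm_zone struct, and the use of >= for the comparison (the kernel may use a strict comparison). Only the idea, "pick a threshold according to limit, then compare F+IC against it", comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; the kernel's own constants may differ. */
enum wm_limit { LIMIT_MIN, LIMIT_LOW, LIMIT_HIGH };

/* Hypothetical, simplified zone: just the counters and thresholds. */
struct wm_zone {
    unsigned long free_pages;
    unsigned long inactive_clean_pages;
    unsigned long pages_min, pages_low, pages_high;
};

/* Map the limit selector onto the matching per-zone threshold. */
unsigned long wm_threshold(const struct wm_zone *z, enum wm_limit limit)
{
    switch (limit) {
    case LIMIT_MIN: return z->pages_min;
    case LIMIT_LOW: return z->pages_low;
    default:        return z->pages_high;
    }
}

/*
 * Return the first zone in the NULL-terminated zonelist whose F+IC
 * count meets the chosen threshold, or NULL if none does.
 */
struct wm_zone *first_usable_zone(struct wm_zone **zonelist,
                                  enum wm_limit limit)
{
    assert(zonelist != NULL);
    for (; *zonelist != NULL; zonelist++) {
        struct wm_zone *z = *zonelist;
        unsigned long f_ic = z->free_pages + z->inactive_clean_pages;
        if (f_ic >= wm_threshold(z, limit))
            return z;
    }
    return NULL;
}
```

In the real function, a zone that passes this test is then handed to rmqueue(); the sketch stops at the decision.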
direct_reclaim is 1 if the allocation request is one which can
possibly be fulfilled by reclaiming a page from the inactive_clean
list. This flag is set in __alloc_pages() on line 297. (The PF_MEMALLOC
flag is set if the allocation is a recursive one - that is, if, while
trying to get a free block, we end up needing some memory for some
reason and enter __alloc_pages() again, PF_MEMALLOC will be set.)
Note: the comment beginning on line 498 doesn't make sense to
me. direct_reclaim is a local that is set only once, and the loop
referred to in the comment doesn't change it, so it seems to me that
direct reclaim is possible under exactly the conditions expressed on
line 297, unless there's stuff going on inside reclaim_pages() that's
sensitive to PF_MEMALLOC, and there doesn't appear to be anything like
that.
[working on it...]
rmqueue(zone_t *zone, unsigned long order) attempts to
remove a block of the given order from the freelists of the given
zone. This is the fundamental allocation operation: when you've
removed (the page struct representing) a page from the freelists and
fixed up the buddy bitmaps, that memory is allocated.
Here's how it works:
- The do loop beginning on line 181 loops over page orders, starting
at the requested order, so on each trip through the loop we are
considering free areas twice the size of those in the previous
iteration.
- On line 185 we check the freelist of the current order
to see if it's empty. If so, we proceed to the next order; otherwise:
  - Line 191: We remove the first element from the chosen freelist.
  - Line 193: We adjust the fragmentation bit for the buddy pair from
  which we're allocating.
  - Line 194: We update the accounting info in the zone struct.
  - Line 196: If we chose a block of higher-than-requested order from
  which to allocate, expand() walks the buddy bitmaps, marking
  lower-order blocks fragmented and adding their fragments to the
  appropriate freelists.
  - Finally, we return the page struct.
I must say, this is a whole lot clearer than it was in 2.2!
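The search part of the steps above can be sketched in a few lines. This is a toy model, not the kernel code: the rq_* types are hypothetical, freelists are small arrays of base page numbers rather than linked lists of page structs, the buddy bitmaps are left out entirely, and the block is taken from the end of the array rather than the head of a list. The split that expand() performs is also left out.

```c
#include <assert.h>

#define RQ_MAX_ORDER  4    /* toy value: orders 0..3 */
#define RQ_MAX_BLOCKS 16

/* Hypothetical freelist for one order: base page of each free block. */
struct rq_area {
    unsigned long base[RQ_MAX_BLOCKS];
    int nr_free;
};

struct rq_zone {
    struct rq_area free_area[RQ_MAX_ORDER];
    unsigned long free_pages;   /* accounting, in pages */
};

struct rq_block {
    long base;            /* base page of the block, or -1 if none */
    unsigned int order;   /* order of the freelist it came from */
};

/*
 * Starting at the requested order, find the first non-empty freelist,
 * unlink one block from it, and charge the zone for the 2^order pages
 * the caller asked for. The caller would then split the block if it
 * came from a higher order than requested.
 */
struct rq_block toy_rmqueue_search(struct rq_zone *zone, unsigned int order)
{
    struct rq_block none = { -1, order };

    assert(order < RQ_MAX_ORDER);
    for (unsigned int cur = order; cur < RQ_MAX_ORDER; cur++) {
        struct rq_area *area = &zone->free_area[cur];
        if (area->nr_free == 0)
            continue;                      /* empty: try the next order */
        struct rq_block found = { (long)area->base[--area->nr_free], cur };
        zone->free_pages -= 1UL << order;  /* accounting for the zone */
        return found;
    }
    return none;                           /* no block large enough */
}
```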
struct page *expand(zone_t *zone, struct page *page, unsigned long index, int low, int high, free_area_t *area)
adjusts the buddy bitmaps and freelists to reflect a new
allocation. zone is the zone from which the allocation took
place, page is the allocated page, index is the
index into mem_map of the allocated page, low is the order of
the requested allocation, high is the order of the block
actually removed from the freelists, and area is the
free_area_struct for the order of the block actually allocated. It may
be that high != low, and area != the area of the requested allocation,
if the block was allocated from a free list of higher-than-requested
order.
What we're going to do in that case is return the top 2^low
pages of the allocated block to the requester, and put the remainder
of the pages on the appropriate freelists, managing the buddy bitmaps
as appropriate. For example, let's say we've requested a 2-page block,
but we had to allocate that block from the order-3 (8-page) freelist,
and we happen to have chosen physical page 800 as the base of the
8-page block. Pages 800-803 go back onto the order-2 freelist, pages
804-805 go back onto the order-1 freelist, and the requester gets
pages 806-807.
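The page-800 example can be traced with a small simulation of the splitting loop. This is a sketch under simplifying assumptions: pages are plain numbers, the split_result struct and toy_expand() are invented for illustration, and the buddy bitmaps are ignored. It only shows which half goes back to which freelist and which pages the requester ends up with.

```c
#include <assert.h>

#define EX_MAX_ORDER 10   /* toy upper bound on orders */

/*
 * Hypothetical record of one split: for each order, the base page of
 * the fragment returned to that order's freelist (-1 if none), plus
 * the base page of the block handed to the requester.
 */
struct split_result {
    long freed_base[EX_MAX_ORDER];
    unsigned long allocated_base;
};

/*
 * Split a block of order `high` starting at page `base` down to order
 * `low`. At each step the lower half is given back to the freelist of
 * the next-lower order and the upper half is split further, so the
 * requester receives the top 2^low pages of the original block.
 */
struct split_result toy_expand(unsigned long base, unsigned int low,
                               unsigned int high)
{
    struct split_result r;
    unsigned long size;

    assert(low <= high && high < EX_MAX_ORDER);
    for (unsigned int i = 0; i < EX_MAX_ORDER; i++)
        r.freed_base[i] = -1;

    size = 1UL << high;
    while (high > low) {
        high--;
        size >>= 1;
        r.freed_base[high] = (long)base;  /* lower half: back on a freelist */
        base += size;                     /* keep splitting the upper half */
    }
    r.allocated_base = base;
    return r;
}
```

Calling toy_expand(800, 1, 3) records page 800 as the order-2 fragment and page 804 as the order-1 fragment, and reports 806 as the base of the 2-page block that is actually allocated.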
__free_pages(struct page *page, unsigned long order) frees the block
of size 2^order pages starting at the given
page. It is very slightly tricky. The entire function looks like this:
void __free_pages(struct page *page, unsigned long order)
{
	/* Free only if not reserved and we just dropped the last reference. */
	if (!PageReserved(page) && put_page_testzero(page))
		__free_pages_ok(page, order);
}
The tricky part is the call to put_page_testzero(), which decrements
the page's reference count and returns 1 if the reference count is 0
after the decrement. Thus, if the caller is not the last user of the
page, it will not actually be freed.
A great deal of the higher-level VM logic depends upon this test
being done properly; see for example try_to_swap_out().
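The reference-count test is easy to model outside the kernel. In this sketch the toy_page struct and both functions are hypothetical, the counter is a plain int rather than an atomic, and a "freed" flag stands in for the call to __free_pages_ok(). It is only meant to show that a page with more than one user survives a call to the free routine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified page descriptor. */
struct toy_page {
    int count;      /* reference count */
    bool reserved;  /* stands in for PageReserved() */
    bool freed;     /* set where __free_pages_ok() would be called */
};

/*
 * Drop one reference and report whether it was the last one.
 * The kernel does this atomically; the toy version does not.
 */
bool toy_put_page_testzero(struct toy_page *page)
{
    assert(page->count > 0);
    return --page->count == 0;
}

/* Mirrors the shape of __free_pages(): reserved pages are never freed,
 * and other pages are freed only by their last user. */
void toy_free_pages(struct toy_page *page)
{
    if (!page->reserved && toy_put_page_testzero(page))
        page->freed = true;
}
```

Note that the short-circuit && means a reserved page's count is not even decremented, just as in the original function.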
If the caller is the last user of the page block, then
__free_pages() goes on to call __free_pages_ok(), which links the page
block back into the freelists and manages the buddy maps, coalescing
buddy blocks as necessary. This is essentially the inverse operation
to expand(), and therefore I will not discuss it further here.
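For readers who want the inverse operation spelled out anyway, here is a generic buddy-coalescing sketch. It is not the kernel's implementation: the kernel tracks buddy pairs with bitmaps, whereas this toy keeps a per-order "is this block free?" table, and the co_* names and toy_free_block() are invented for illustration. It also uses zone-relative page indices, so the 8-page block from the earlier example starts at index 0 rather than page 800.

```c
#include <assert.h>
#include <stdbool.h>

#define CO_MAX_ORDER 4
#define CO_NPAGES    16   /* 2^CO_MAX_ORDER pages, zone-relative indices */

/* Hypothetical zone: is_free[o][i] is true when a free block of
 * order o starts at page index i. */
struct co_zone {
    bool is_free[CO_MAX_ORDER + 1][CO_NPAGES];
};

/* The buddy of an order-o block differs from it only in bit o. */
unsigned long buddy_of(unsigned long index, unsigned int order)
{
    return index ^ (1UL << order);
}

/*
 * Free the block of the given order at the given index, merging with
 * its buddy for as long as the buddy is also free. Returns the order
 * of the block that finally lands on a freelist.
 */
unsigned int toy_free_block(struct co_zone *z, unsigned long index,
                            unsigned int order)
{
    assert(order <= CO_MAX_ORDER && index < CO_NPAGES);
    assert((index & ((1UL << order) - 1)) == 0);  /* must be aligned */

    while (order < CO_MAX_ORDER) {
        unsigned long buddy = buddy_of(index, order);
        if (!z->is_free[order][buddy])
            break;                         /* buddy in use: stop merging */
        z->is_free[order][buddy] = false;  /* take buddy off its freelist */
        index &= ~(1UL << order);          /* merged block starts lower */
        order++;
    }
    z->is_free[order][index] = true;
    return order;
}
```

Freeing the 2-page block at index 6 when indices 4-5 and 0-3 are already free merges twice and leaves a single order-3 block at index 0, undoing the split from the expand() example.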
I'm suspending work on the zone-allocator documentation for the moment
and moving on to higher-level VM stuff. I need to understand how
direct reclaim works, which means understanding exactly how the page
lists are managed. I'll get back to this at some point, but if you
understand how __alloc_pages() works and the nature of the data
structures involved, figuring out other zone allocator functions (e.g.
__free_pages_ok()) is not too hard, if you ignore the interactions
between the zone allocator and the page-aging logic.