Vmalloc on huge memory machine??





wangcm
2005-08-30, 06:25 PM
I recently ran into a customer's machine (FC3 running a 2.6.12.5 SMP kernel, 6GB RAM; BTW, the kernel config has highmem support up to 64GB) where Oracle complains "linux vmalloc failed", and the kernel also logs a message suggesting a vmalloc= parameter be added at boot. The /proc contents are as follows:

/proc/meminfo:
MemTotal: 6236528 kB
MemFree: 1148544 kB
Buffers: 232600 kB
Cached: 3426956 kB
SwapCached: 13668 kB
Active: 2341312 kB
Inactive: 2237616 kB
HighTotal: 5373412 kB
HighFree: 963008 kB
LowTotal: 863116 kB
LowFree: 185536 kB
SwapTotal: 3389664 kB
SwapFree: 3375268 kB
Dirty: 8260 kB
Writeback: 0 kB
Mapped: 1115880 kB
Slab: 127224 kB
Committed_AS: 1395364 kB
PageTables: 292756 kB
VmallocTotal: 114680 kB
VmallocUsed: 83144 kB
VmallocChunk: 13792 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB

/proc/mtrr:
reg00: base=0xf8000000 (3968MB), size= 128MB: uncachable, count=1
reg01: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1
reg02: base=0x100000000 (4096MB), size=2048MB: write-back, count=1
reg03: base=0x180000000 (6144MB), size= 128MB: write-back, count=1
reg04: base=0xf7f80000 (3967MB), size= 512KB: uncachable, count=1

So memory is nowhere near exhausted, yet vmalloc tops out at 128MB. I therefore added vmalloc=256M at reboot, but then the machine wouldn't boot at all (kernel panic :|||: ). The same box running Slackware (likewise kernel v2.6.12) has no such problem, and with 1GB of RAM or less there's no issue either (set vmalloc as large as you like :D ). Has anyone here run into something similar? :confused: :confused: ....



dou0228
2005-08-30, 06:32 PM
I suspect it's because Red$hit's own customized kernel doesn't let you change vmalloc

Kernel 2.6.13 was released yesterday; you might as well try the official kernel

wangcm
2005-08-30, 06:57 PM
I suspect it's because Red$hit's own customized kernel doesn't let you change vmalloc

Kernel 2.6.13 was released yesterday; you might as well try the official kernel

But my kernels are all compiled myself from the vanilla kernel source; in theory a kernel image built from the same source shouldn't behave differently on different distributions (at boot the kernel doesn't yet touch any distribution-specific libraries/utilities, though I don't know whether builds from different gcc versions could matter :confused: ).... BTW, I found an article via Google (original on LWN (http://lwn.net/Articles/75174/)); this looks like it could be the cause....

Virtual Memory I: the problem

This article serves mostly as background to help understand why the kernel developers are considering making fundamental virtual memory changes at this point in the development cycle. It can probably be skipped by readers who understand how high and low memory work on 32-bit systems.

A 32-bit processor can address a maximum of 4GB of memory. One could, in theory, extend the instruction set to allow for larger pointers, but, in practice, nobody does that; the effects on performance and compatibility would be too strong. So the limitation remains: no process on a 32-bit system can have an address space larger than 4GB, and the kernel cannot directly address more than 4GB.

In fact, the limitations are more severe than that. Linux kernels split the 4GB address space between user processes and the kernel; under the most common configuration, the first 3GB of the 32-bit range are given over to user space, and the kernel gets the final 1GB starting at 0xc0000000. Sharing the address space gives a number of performance benefits; in particular, the hardware's address translation buffer can be shared between the kernel and user space.

If the kernel wishes to be able to access the system's physical memory directly, however, it must set up page tables which map that memory into the kernel's part of the address space. With the default 3GB/1GB mapping, the amount of physical memory which can be addressed in this way is somewhat less than 1GB - part of the kernel's space must be set aside for the kernel itself, for memory allocated with vmalloc(), and various other purposes. That is why, until a few years ago, Linux could not even fully handle 1GB of memory on 32-bit systems. In fact, back in 1999, Linus decreed that 32-bit Linux would never, ever support more than 2GB of memory. "This is not negotiable."

Linus's views notwithstanding, the rest of the world continued on with the strange notion that 32-bit systems should be able to support massive amounts of memory. The processor vendors added paging modes which could use physical addresses which exceed 32 bits in length, thus ending the 4GB limit for physical memory. The internal addressing limitations in the Linux kernel remained, however. Happily for users of large systems, Linus can acknowledge an error and change his mind; he did eventually allow large memory support into the 2.3 kernel. That support came with its own costs and limitations, however.

On 32-bit systems, memory is now divided into "high" and "low" memory. Low memory continues to be mapped directly into the kernel's address space, and is thus always reachable via a kernel-space pointer. High memory, instead, has no direct kernel mapping. When the kernel needs to work with a page in high memory, it must explicitly set up a special page table to map it into the kernel's address space first. This operation can be expensive, and there are limits on the number of high-memory pages which can be mapped at any particular time.

For the most part, the kernel's own data structures must live in low memory. Memory which is not permanently mapped cannot appear in linked lists (because its virtual address is transient and variable), and the performance costs of mapping and unmapping kernel memory are too high. High memory is useful for process pages and some kernel tasks (I/O buffers, for example), but the core of the kernel stays in low memory.

Some 32-bit processors can now address 64GB of physical memory, but the Linux kernel is still not able to deal effectively with that much; the current limit is around 8GB to 16GB, depending on the load. The problem now is that larger systems simply run out of low memory. As the system gets larger, it requires more kernel data structures to manage, and eventually room for those structures can run out. On a very large system, the system memory map (an array of struct page structures which represents physical memory) alone can occupy half of the available low memory.

There are users out there wanting to scale 32-bit Linux systems up to 32GB or more of main memory, so the enterprise-oriented Linux distributors have been scrambling to make that possible. One approach is the 4G/4G patch written by Ingo Molnar. This patch separates the kernel and user address spaces, allowing user processes to have 4GB of virtual memory while simultaneously expanding the kernel's low memory to 4GB. There is a cost, however: the translation buffer is no longer shared and must be flushed for every transition between kernel and user space. Estimates of the magnitude of the performance hit vary greatly, but numbers as high as 30% have been thrown around. This option makes some systems work, however, so Red Hat ships a 4G/4G kernel with its enterprise offerings.

The 4G/4G patch extends the capabilities of the Linux kernel, but it remains unpopular. It is widely seen as an ugly solution, and nobody likes the performance cost. So there are efforts afoot to extend the scalability of the Linux kernel via other means. Some of these efforts will likely go forward - in 2.6, even - but the kernel developers seem increasingly unwilling to distort the kernel's memory management systems to meet the needs of a small number of users who are trying to stretch 32-bit systems far beyond where they should go. There will come a time where they will all answer as Linus did back in 1999: go get a 64-bit system.

dou0228
2005-08-30, 11:59 PM
But my kernels are all compiled myself from the vanilla kernel source; in theory a kernel image built from the same source shouldn't behave differently on different distributions (at boot the kernel doesn't yet touch any distribution-specific libraries/utilities, though I don't know whether builds from different gcc versions could matter :confused: ).... BTW, I found an article via Google (original on LWN (http://lwn.net/Articles/75174/)); this looks like it could be the cause....

[deleted]


Judging from this article, it should have nothing to do with vmalloc being fixed at 128 MB
That definition lives in <kernel_src>/arch/<arch>/mm/init.c

as __VMALLOC_RESERVE = 128 << 20;

But "FC3 v2.6.12.5smp kernel" suggests this is a doctored kernel,
not an official one

Slackware's kernel should be a stock build; it has nothing to do with the GCC version

2.6.13 still hardcodes it at 128 MB; I don't have time to change it right now, but it really isn't hard to change

Also: this problem most likely lies in the FC3 kernel itself, not in gcc or the libraries

wangcm
2005-08-31, 01:19 AM
Judging from this article, it should have nothing to do with vmalloc being fixed at 128 MB
That definition lives in <kernel_src>/arch/<arch>/mm/init.c

as __VMALLOC_RESERVE = 128 << 20;

But on my own Socket 754 SP2800 (an older stepping without AMD64 support :|||: ) with 512MB RAM (32MB of which is "borrowed" by the onboard VGA), /proc shows the following, and I'm not sure what VmallocTotal/VmallocChunk really mean :confused: :confused: ....

/proc/meminfo:
MemTotal: 483560 kB
MemFree: 209936 kB
Buffers: 8352 kB
Cached: 184700 kB
SwapCached: 0 kB
Active: 123036 kB
Inactive: 134336 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 483560 kB
LowFree: 209936 kB
SwapTotal: 257000 kB
SwapFree: 257000 kB
Dirty: 284 kB
Writeback: 0 kB
Mapped: 109188 kB
Slab: 12788 kB
CommitLimit: 498780 kB
Committed_AS: 108876 kB
PageTables: 1196 kB
VmallocTotal: 540664 kB
VmallocUsed: 760 kB
VmallocChunk: 539884 kB


But "FC3 v2.6.12.5smp kernel" suggests this is a doctored kernel,
not an official one

Slackware's kernel should be a stock build; it has nothing to do with the GCC version

Sorry for the misunderstanding: "FC3 v2.6.12.5smp" means the distribution is FC3, but the kernel, exactly as on Slackware, is official v2.6.12.5 kernel source that I compiled myself with SMP/64G highmem support (even the config is identical; I just don't know whether FC3 and Slackware shipped the same gcc version).... BTW, most current CPUs support PAE, and newer ones (e.g. K8 or Xeon) even have 64-bit extensions (AMD64, EM64T...); if one only updates the kernel without updating the distribution, is there a way out of this situation??....

wangcm
2005-09-01, 07:12 PM
I found another related piece (original in the Linux-Kernel Archive (http://www.ussg.iu.edu/hypermail/linux/kernel/0504.0/0782.html)), but I haven't had a chance to try it --- none of my machines at home can take that much RAM :D :D ..... I also can't quite understand how, on the very same machine, different distributions using the same loader (GRUB; the only difference being that FC3 uses an initrd while Slackware doesn't) and the very same kernel I compiled can behave differently. Perhaps it really is the initrd :|||: :|||: ....

Re: UPDATE: out of vmalloc space - but vmalloc parameter does not allow boot
From: Ranko Zivojnovic
Date: Mon Apr 04 2005 - 18:27:54 EST

--------------------------------------------------------------------------------
Ok, I think I've figured it out so I will try and answer my own
questions (the best part is at the end)...

On Mon, 2005-04-04 at 17:36 +0300, Ranko Zivojnovic wrote:
> (please do CC replies as I am still not on the list)
>
> As I am kind of pressured to resolve this issue, I've set up a test
> environment using VMWare in order to reproduce the problem and
> (un)fortunately the attempt was successful.
>
> I have noticed a few points that relate to the size of the physical RAM
> and the behavior vmalloc. As I am not sure if this is by design or a
> bug, so please someone enlighten me:
>
> The strange thing I have seen is that with the increase of the physical
> RAM, the VmallocTotal in the /proc/meminfo gets smaller! Is this how it
> is supposed to be?
>

As the size of memory grows, more gets allocated to the low memory, less
to the vmalloc memory - within first 1GB of RAM.

> Now the question: Is this behavior normal?
I guess it is (nobody said the opposite).

> Should it not be in reverse -
> more RAM equals more space for vmalloc?
>

It really depends on the setup and the workload - some reasonable
defaults (i.e. 128M) have been defined - you can change them using
vmalloc parameter - but with the _extreme_ care as it gets really tricky
if your RAM is 1G and above - read on...

> With regards to the 'vmalloc' kernel parameter, I was able to boot
> normally using kernel parameter vmalloc=192m with RAM sizes 256, 512,
> 768 but _not_ with 1024M of RAM and above.
>
> With 1024M of RAM (and apparently everything above), it is unable to
> boot if vmalloc parameter is specified to a value larger than default
> 128m. It panics with the following:
>
> EXT2-fs: unable to read superblock
> isofs_fill_super: bread failed, dev=md0, iso_blknum=16, block=32
> XFS: SB read failed
> VFS: Cannot open root device "md0" or unknown-block(9,0)
> Please append a correct "root=" boot option
> Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(9,0)
>
And not just - I have just seen the actual culprit message (way up
front):
initrd extends beyond end of memory (0x37fef33a > 0x34000000)
disabling initrd

> Question: Is this inability to boot related to the fact that the system
> is unable to reserve enough space for vmalloc?
>

The resolution (or rather workaround) to the above is to _trick_ the
GRUB into loading the initrd image into the area below what is _going_
to be the calculated "end of memory" using the "uppermem" command.

Now:
1. I hope this is the right way around the problem.
2. I hope this is going to help someone.

Best regards,

Ranko
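For reference, the uppermem trick above might look like this in a GRUB legacy menu.lst; the 768MB figure and the file names are hypothetical, chosen so the initrd is loaded below the lowmem ceiling implied by vmalloc=256M:

```
# Hypothetical menu.lst entry. uppermem takes kilobytes of memory above
# 1MB; understating it forces GRUB to place the initrd lower in RAM.
title Linux 2.6.12.5 (vmalloc=256M)
root (hd0,0)
uppermem 786432
kernel /boot/vmlinuz-2.6.12.5 ro root=/dev/md0 vmalloc=256M
initrd /boot/initrd-2.6.12.5.img
```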

dou0228
2005-09-01, 08:41 PM
I found another related piece (original in the Linux-Kernel Archive (http://www.ussg.iu.edu/hypermail/linux/kernel/0504.0/0782.html)), but I haven't had a chance to try it --- none of my machines at home can take that much RAM :D :D ..... I also can't quite understand how, on the very same machine, different distributions using the same loader (GRUB; the only difference being that FC3 uses an initrd while Slackware doesn't) and the very same kernel I compiled can behave differently. Perhaps it really is the initrd :|||: :|||: ....
[delete]

Suddenly I feel a bit cheated :|||:
Didn't you say the kernel configs on the two setups were identical? How come one of them uses an initrd?
That said, I never use an initrd unless there's a special need

Going by the write-up above, it should be the initrd being loaded at a bad address that causes this

wangcm
2005-09-02, 12:34 AM
Suddenly I feel a bit cheated :|||:
Didn't you say the kernel configs on the two setups were identical? How come one of them uses an initrd?

The configs are basically the same, except that on FC3, to stay consistent with the rpm-style kernels, only the parts absolutely required at boot are statically linked into the kernel image and everything else is built as modules as a rule, so usually only the initrd needs touching. On Slackware (my everyday main OS :D ), wrangling an initrd isn't as convenient as on the RH/FC family, so I rarely use one -- which may be exactly how it dodged this bullet :D :D ....


That said, I never use an initrd unless there's a special need

Going by the write-up above, it should be the initrd being loaded at a bad address that causes this

How so :confused: :confused: .... An initrd cuts down on how often the kernel has to be recompiled (just update the initrd.... BTW, compiling is no longer a chore on today's machines, but I often end up running Linux on old boxes, and compiling a kernel there really is painful :|||: ), so apart from a few cases (e.g. this one) the benefits should outweigh the drawbacks :) :) ....

dou0228
2005-09-02, 08:00 AM
How so :confused: :confused: .... An initrd cuts down on how often the kernel has to be recompiled (just update the initrd.... BTW, compiling is no longer a chore on today's machines, but I often end up running Linux on old boxes, and compiling a kernel there really is painful :|||: ), so apart from a few cases (e.g. this one) the benefits should outweigh the drawbacks :) :) ....
Cutting down on compiles & compile time was a pre-2.6 kernel concern;
with a 2.6 kernel, adding a module and recompiling is still very fast

so there's really no time to be saved by using an initrd
For old machines, I usually just compile on another box and copy vmlinuz over :D

wangcm
2005-09-22, 01:07 AM
One more note: I tried running Linux with 4GB of RAM installed, and any heavy reading/writing slowed to a crawl, while with 2GB everything was fine (as I recall, on x86 in 32-bit mode PCI and other MMIO claims the 3-4GB memory address range, so the RAM at 3-4GB has to be relocated above 4GB where the OS can't use it directly; using it only as a buffer should be a bit slower, but surely not this bad :|||: ). I'm wary of switching to x86_64 in case the applications break. Has anyone here tried installing more than 3GB of RAM on x86? :confused: :confused: ....