Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip into next
Pull Xen updates from David Vrabel:
"xen: features and fixes for 3.16-rc0
- support foreign mappings in PVH domains (needed when dom0 is PVH)
- fix mapping high MMIO regions in x86 PV guests (this is also the
first half of removing the PAGE_IOMAP PTE flag).
- ARM suspend/resume support.
- ARM multicall support"
* tag 'stable/for-linus-3.16-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: map foreign pfns for autotranslated guests
xen-acpi-processor: Don't display errors when we get -ENOSYS
xen/pciback: Document the entry points for 'pcistub_put_pci_dev'
xen/pciback: Document when the 'unbind' and 'bind' functions are called.
xen-pciback: Document when we FLR an PCI device.
xen-pciback: First reset, then free.
xen-pciback: Cleanup up pcistub_put_pci_dev
x86/xen: do not use _PAGE_IOMAP in xen_remap_domain_mfn_range()
x86/xen: set regions above the end of RAM as 1:1
x86/xen: only warn once if bad MFNs are found during setup
x86/xen: compactly store large identity ranges in the p2m
x86/xen: fix set_phys_range_identity() if pfn_e > MAX_P2M_PFN
x86/xen: rename early_p2m_alloc() and early_p2m_alloc_middle()
xen/x86: set panic notifier priority to minimum
arm,arm64/xen: introduce HYPERVISOR_suspend()
xen: refactor suspend pre/post hooks
arm: xen: export HYPERVISOR_multicall to modules.
arm64: introduce virt_to_pfn
arm/xen: Remove definiition of virt_to_pfn in asm/xen/page.h
arm: xen: implement multicall hypercall support.
|
|
While I play inhouse patches with much memory pressure on qemu-kvm,
3.14 kernel was randomly crashed. The reason was kernel stack overflow.
When I investigated the problem, the callstack was a little bit deeper
by involve with reclaim functions but not direct reclaim path.
I tried to diet stack size of some functions related with alloc/reclaim
so did a hundred of byte but overflow was't disappeard so that I encounter
overflow by another deeper callstack on reclaim/allocator path.
Of course, we might sweep every sites we have found for reducing
stack usage but I'm not sure how long it saves the world(surely,
lots of developer start to add nice features which will use stack
agains) and if we consider another more complex feature in I/O layer
and/or reclaim path, it might be better to increase stack size(
meanwhile, stack usage on 64bit machine was doubled compared to 32bit
while it have sticked to 8K. Hmm, it's not a fair to me and arm64
already expaned to 16K. )
So, my stupid idea is just let's expand stack size and keep an eye
toward stack consumption on each kernel functions via stacktrace of ftrace.
For example, we can have a bar like that each funcion shouldn't exceed 200K
and emit the warning when some function consumes more in runtime.
Of course, it could make false positive but at least, it could make a
chance to think over it.
I guess this topic was discussed several time so there might be
strong reason not to increase kernel stack size on x86_64, for me not
knowing so Ccing x86_64 maintainers, other MM guys and virtio
maintainers.
Here's an example call trace using up the kernel stack:
Depth Size Location (51 entries)
----- ---- --------
0) 7696 16 lookup_address
1) 7680 16 _lookup_address_cpa.isra.3
2) 7664 24 __change_page_attr_set_clr
3) 7640 392 kernel_map_pages
4) 7248 256 get_page_from_freelist
5) 6992 352 __alloc_pages_nodemask
6) 6640 8 alloc_pages_current
7) 6632 168 new_slab
8) 6464 8 __slab_alloc
9) 6456 80 __kmalloc
10) 6376 376 vring_add_indirect
11) 6000 144 virtqueue_add_sgs
12) 5856 288 __virtblk_add_req
13) 5568 96 virtio_queue_rq
14) 5472 128 __blk_mq_run_hw_queue
15) 5344 16 blk_mq_run_hw_queue
16) 5328 96 blk_mq_insert_requests
17) 5232 112 blk_mq_flush_plug_list
18) 5120 112 blk_flush_plug_list
19) 5008 64 io_schedule_timeout
20) 4944 128 mempool_alloc
21) 4816 96 bio_alloc_bioset
22) 4720 48 get_swap_bio
23) 4672 160 __swap_writepage
24) 4512 32 swap_writepage
25) 4480 320 shrink_page_list
26) 4160 208 shrink_inactive_list
27) 3952 304 shrink_lruvec
28) 3648 80 shrink_zone
29) 3568 128 do_try_to_free_pages
30) 3440 208 try_to_free_pages
31) 3232 352 __alloc_pages_nodemask
32) 2880 8 alloc_pages_current
33) 2872 200 __page_cache_alloc
34) 2672 80 find_or_create_page
35) 2592 80 ext4_mb_load_buddy
36) 2512 176 ext4_mb_regular_allocator
37) 2336 128 ext4_mb_new_blocks
38) 2208 256 ext4_ext_map_blocks
39) 1952 160 ext4_map_blocks
40) 1792 384 ext4_writepages
41) 1408 16 do_writepages
42) 1392 96 __writeback_single_inode
43) 1296 176 writeback_sb_inodes
44) 1120 80 __writeback_inodes_wb
45) 1040 160 wb_writeback
46) 880 208 bdi_writeback_workfn
47) 672 144 process_one_work
48) 528 112 worker_thread
49) 416 240 kthread
50) 176 176 ret_from_fork
[ Note: the problem is exacerbated by certain gcc versions that seem to
generate much bigger stack frames due to apparently bad coalescing of
temporaries and generating too many spills. Rusty saw gcc-4.6.4 using
35% more stack on the virtio path than 4.8.2 does, for example.
Minchan not only uses such a bad gcc version (4.6.3 in his case), but
some of the stack use is due to debugging (CONFIG_DEBUG_PAGEALLOC is
what causes that kernel_map_pages() frame, for example). But we're
clearly getting too close.
The VM code also seems to have excessive stack frames partly for the
same compiler reason, triggered by excessive inlining and lots of
function arguments.
We need to improve on our stack use, but in the meantime let's do this
simple stack increase too. Unlike most earlier reports, there is
nothing simple that stands out as being really horribly wrong here,
apart from the fact that the stack frames are just bigger than they
should need to be. - Linus ]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S Tsirkin <mst@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: PJ Waskiewicz <pjwaskiewicz@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The XSAVE instruction family takes a memory argment. The macros use
(%edi)/(%rdi) as that memory argument - make that clear to the reader.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-7-git-send-email-fenghua.yu@intel.com
|
|
In standard form, each state is saved in the xsave area in fixed offset.
But in compacted form, offset of each saved state only can be calculated during
run time because some xstates may not be enabled and saved.
We define kernel API get_xsave_addr() returns address of a given state saved in a xsave area.
It can be called in kernel to get address of each xstate in xsave area in
either standard format or compacted format.
It's useful when kernel wants to directly access each state in xsave area.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-17-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
__save_fpu() can be called during early booting time when cpu caps are not
enabled and alternative can not be used yet. Therefore, it calls
xsave_state_booting() during booting time to save xstate to task's xsave area.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-14-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
Since boot_cpu_data and cpu capabilities are not enabled yet during early
booting time, alternative can not be used in some functions to access xsave
area. Therefore, we define two new functions xrstor_state_booting() and
xsave_state_booting() to access xsave area just during early booting time.
xrstor_state_booting restores xstate from xsave area during early booting time.
xsave_state_booting saves xstate to xsave area during early booting time.
The two functions are similar to xrstor_state and xsave_state respectively.
But the two functions don't use alternatives because alternatives are not
enabled when they are called in such early booting time.
xrstor_state_booting is called only by functions defined as __init. So it's
defined as __init and will be removed from memory after booting time. There
is no extra memory cost caused by this function during running time.
But because xsave_state_booting can be called by run-time function __save_fpu(),
it's not defined as __init and will stay in memory during running time although
it will not be called anymore during running time. It is not ideal to
have this function stay in memory during running time. But it's a pretty small
function and the memory cost will be small. By doing in this way, we can
avoid to change a lot of code to just remove this small function and save a
bit memory for running time.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-13-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
We use legacy xsave/xrstor to save and restore standard form of xsave area
in user space context. No xsaveopt or xsaves is used here for two reasons.
First, we don't want to use modified optimization which is implemented in
xsaveopt and xsaves because xrstor/xrstors might track a wrong user space
application.
Secondly, we don't use compacted format of xsave area for backward
compatibility because legacy user space applications only don't understand
the compacted format of the xsave area.
Using standard form of the xsave area may allocate more memory for
user context than compacted form, but preserves compatibility with
legacy applications. Furthermore, even with holes, the relevant cache
lines don't get touched and thus the performance impact is limited.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-11-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
If xsaves is eanbled, use xsaves/xrstors for context switch to support
compacted format xsave area to occupy less memory and modified optimization
to improve saving performance.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-10-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
If xsaves is eanbled, use xsaves/xrstors instrucitons to save and restore
xstate. xsaves and xrstors support compacted format, init optimization,
modified optimization, and supervisor states.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-9-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
Define a macro to handle fault generated by xsave, xsaveopt, xsaves, xrstor,
and xrstors instructions. It is used in functions like xsave_state() etc.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-8-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
Define macros for xsave, xsaveopt, xsaves, xrstor, and xrstors inline
instructions. The instructions will be used for saving and restoring xstate.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-7-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
The XSAVE area header is changed to support both compacted format and
standard format of xsave area.
The XSAVE header of an xsave area comprises the 64 bytes starting at offset
512 from the area base address:
- Bytes 7:0 of the xsave header is a state-component bitmap called
xstate_bv. It identifies the state components in the xsave area.
- Bytes 15:8 of the xsave header is a state-component bitmap called
xcomp_bv. It is used as follows:
- xcomp_bv[63] indicates the format of the extended region of
the xsave area. If it is clear, the standard format is used.
If it is set, the compacted format is used.
- xcomp_bv[62:0] indicate which features (starting at feature 2)
have space allocated for them in the compacted format.
- Bytes 63:16 of the xsave header are reserved.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-6-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
features and input
alternative_input_2() replaces old instruction with new instructions with
input based on two features.
In alternative_input_2(oldinstr, newinstr1, feature1, newinstr2, feature2,
input...),
feature2 has higher priority to replace oldinstr than feature1.
If CPU has feature2, newinstr2 replaces oldinstr and newinstr2 is
executed during run time.
If CPU doesn't have feature2, but it has feature1, newinstr1 replaces oldinstr
and newinstr1 is executed during run time.
If CPU doesn't have feature2 and feature1, oldinstr is executed during run
time.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-5-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
Detect the xsaveopt, xsavec, xgetbv, and xsaves features in processor extended
state enumberation sub-leaf (eax=0x0d, ecx=1):
Bit 00: XSAVEOPT is available
Bit 01: Supports XSAVEC and the compacted form of XRSTOR if set
Bit 02: Supports XGETBV with ECX = 1 if set
Bit 03: Supports XSAVES/XRSTORS and IA32_XSS if set
The above features are defined in the new word 10 in cpu features.
The IA32_XSS MSR (index DA0H) contains a state-component bitmap that specifies
the state components that software has enabled xsaves and xrstors to manage.
If the bit corresponding to a state component is clear in XCR0 | IA32_XSS,
xsaves and xrstors will not operate on that state component, regardless of
the value of the instruction mask.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-3-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
In each X86 feature macro definition, add one space in front of the word
number which is a one-digit number currently.
The purpose of reformatting the macros is to align one-digit and two-digit
word numbers.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-2-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
pcibios_penalize_isa_irq() is only implemented by x86 now, and legacy ISA
is not used by some architectures. Make pcibios_penalize_isa_irq() a
__weak function to simplify the code. This removes the need for new
platforms to add stub implementations of pcibios_penalize_isa_irq().
[bhelgaas: changelog, comments]
Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
|
|
Since mis-order issues have been solved, we can cleanup redundant
definitions that already have defaults in <acpi/platform/acenv.h>.
This patch removes redudant environments for __KERNEL__ surrounded code.
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
<asm/acpi.h>
There is a mis-order inclusion for <asm/acpi.h>.
As we will enforce including <linux/acpi.h> for all Linux ACPI users, we
can find the inclusion order is as follows:
<linux/acpi.h>
<acpi/acpi.h>
<acpi/platform/acenv.h>
(acenv.h before including aclinux.h)
<acpi/platform/aclinux.h>
...........................................................................
(aclinux.h before including asm/acpi.h)
<asm/acpi.h> @Redundant@
(ACPICA specific stuff)
...........................................................................
...........................................................................
(Linux ACPI specific stuff) ? - - - - - - - - - - - - +
(aclinux.h after including asm/acpi.h) @Invisible@ |
(acenv.h after including aclinux.h) @Invisible@ |
other ACPICA headers @Invisible@ |
............................................................|..............
<acpi/acpi_bus.h> |
<acpi/acpi_drivers.h> |
<asm/acpi.h> (Excluded) |
(Linux ACPI specific stuff) ! <- - - - - - - - - - - - - +
NOTE that, in ACPICA, <acpi/platform/acenv.h> is more like Kconfig
generated <generated/autoconf.h> for Linux, it is meant to be included
before including any ACPICA code.
In the above figure, there is a question mark for "Linux ACPI specific
stuff" in <asm/acpi.h> which should be included after including all other
ACPICA header files. Thus they really need to be moved to the position
marked with exclaimation mark or the definitions in the blocks marked with
"@Invisible@" will be invisible to such architecture specific "Linux ACPI
specific stuff" header blocks. This leaves 2 issues:
1. All environmental definitions in these blocks should have a copy in the
area marked with "@Redundant@" if they are required by the "Linux ACPI
specific stuff".
2. We cannot use any ACPICA defined types in <asm/acpi.h>.
This patch splits architecture specific ACPICA stuff from <asm/acpi.h> to
fix this issue.
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Conflicts:
drivers/net/bonding/bond_alb.c
drivers/net/ethernet/altera/altera_msgdma.c
drivers/net/ethernet/altera/altera_sgdma.c
net/ipv6/xfrm6_output.c
Several cases of overlapping changes.
The xfrm6_output.c has a bug fix which overlaps the renaming
of skb->local_df to skb->ignore_df.
In the Altera TSE driver cases, the register access cleanups
in net-next overlapped with bug fixes done in net.
Similarly a bug fix to send ALB packets in the bonding driver using
the right source address overlaps with cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
I noticed on some of my systems that page fault tracing doesn't
work:
cd /sys/kernel/debug/tracing
echo 1 > events/exceptions/enable
cat trace;
# nothing shows up
I eventually traced it down to CONFIG_KVM_GUEST. At least in a
KVM VM, enabling that option breaks page fault tracing, and
disabling fixes it. I tried on some old kernels and this does
not appear to be a regression: it never worked.
There are two page-fault entry functions today. One when tracing
is on and another when it is off. The KVM code calls do_page_fault()
directly instead of calling the traced version:
> dotraplinkage void __kprobes
> do_async_page_fault(struct pt_regs *regs, unsigned long
> error_code)
> {
> enum ctx_state prev_state;
>
> switch (kvm_read_and_reset_pf_reason()) {
> default:
> do_page_fault(regs, error_code);
> break;
> case KVM_PV_REASON_PAGE_NOT_PRESENT:
I'm also having problems with the page fault tracing on bare
metal (same symptom of no trace output). I'm unsure if it's
related.
Steven had an alternative to this which has zero overhead when
tracing is off where this includes the standard noops even when
tracing is disabled. I'm unconvinced that the extra complexity
of his apporach:
http://lkml.kernel.org/r/20140508194508.561ed220@gandalf.local.home
is worth it, expecially considering that the KVM code is already
making page fault entry slower here. This solution is
dirt-simple.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
CS.RPL is not equal to the CPL in the few instructions between
setting CR0.PE and reloading CS. And CS.DPL is also not equal
to the CPL for conforming code segments.
However, SS.DPL *is* always equal to the CPL except for the weird
case of SYSRET on AMD processors, which sets SS.DPL=SS.RPL from the
value in the STAR MSR, but force CPL=3 (Intel instead forces
SS.DPL=SS.RPL=CPL=3).
So this patch:
- modifies SVM to update the CPL from SS.DPL rather than CS.RPL;
the above case with SYSRET is not broken further, and the way
to fix it would be to pass the CPL to userspace and back
- modifies VMX to always return the CPL from SS.DPL (except
forcing it to 0 if we are emulating real mode via vm86 mode;
in vm86 mode all DPLs have to be 3, but real mode does allow
privileged instructions). It also removes the CPL cache,
which becomes a duplicate of the SS access rights cache.
This fixes doing KVM_IOCTL_SET_SREGS exactly after setting
CR0.PE=1 but before CS has been reloaded.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Not needed anymore now that the CPL is computed directly
during task switch.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Merge x86/espfix into x86/vdso, due to changes in the vdso setup code
that otherwise cause conflicts.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
Merge in Linus' tree with:
fa81511bb0bb x86-64, modify_ldt: Make support for 16-bit segments a runtime option
... reverted, to avoid a conflict. This commit is no longer necessary
with the proper fix in place.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
Add a cmdline param which disables the microcode loader. This is useful
mostly in debugging situations where we want to turn off microcode
loading, both early from the initrd and late, as a means to be able to
rule out its influence on the machine.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1400525957-11525-3-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
|
|
Carve out early cmdline parsing function into .../lib/cmdline.c so it
can be used by early code in the kernel proper as well.
Adapted from arch/x86/boot/cmdline.c.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1400525957-11525-2-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
|
|
Using arch_vma_name to give special mappings a name is awkward. x86
currently implements it by comparing the start address of the vma to
the expected address of the vdso. This requires tracking the start
address of special mappings and is probably buggy if a special vma
is split or moved.
Improve _install_special_mapping to just name the vma directly. Use
it to give the x86 vvar area a name, which should make CRIU's life
easier.
As a side effect, the vvar area will show up in core dumps. This
could be considered weird and is fixable.
[hpa: I say we accept this as-is but be prepared to deal with knocking
out the vvars from core dumps if this becomes a problem.]
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/276b39b6b645fb11e345457b503f17b83c2c6fd0.1400538962.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
ia64 and x86 share this driver. x86 is moving to a different irq
allocation and ia64 keeps its private irq_create/destroy stuff.
Use macros to redirect to one or the other. Yes, macros to avoid
include hell.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Joerg Roedel <joro@8bytes.org>
Cc: x86@kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20140507154336.372289825@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
No need to expose this outside of the ioapic code. The dynamic
allocations are guaranteed not to happen in the gsi space. See commit
62a08ae2a.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/20140507154335.959870037@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Trivial, make math_error() static.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
|
|
It is possible to replace rip-relative addressing mode with addressing
mode of the same length: (reg+disp32). This eliminates the need to fix
up immediate and correct for changing instruction length.
And we can kill arch_uprobe->def.riprel_target.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
|
|
The invalidation is required in order to maintain proper semantics
under CoW conditions. In scenarios where a process clones several
threads, a thread operating on a core whose DTLB entry for a
particular hugepage has not been invalidated, will be reading from
the hugepage that belongs to the forked child process, even after
hugetlb_cow().
The thread will not see the updated page as long as the stale DTLB
entry remains cached, the thread attempts to write into the page,
the child process exits, or the thread gets migrated to a different
processor.
Signed-off-by: Anthony Iliopoulos <anthony.iliopoulos@huawei.com>
Link: http://lkml.kernel.org/r/20140514092948.GA17391@server-36.huawei.corp
Suggested-by: Shay Goikhman <shay.goikhman@huawei.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org> # v2.6.16+ (!)
|
|
Added all the MBI units below and their associated read/write
opcodes:
- Host Bridge Arbiter
- Host Bridge
- Remote Management Unit
- Memory Manager & eSRAM
- SoC Unit
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Link: http://lkml.kernel.org/r/1399668248-24199-3-git-send-email-david.e.box@linux.intel.com
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
Currently drivers that run on non-IOSF systems (Core/Xeon) can't use the IOSF
driver on SOC's without selecting it which forces an unnecessary and limiting
dependency. Provides dummy functions to allow these modules to conditionally
use the driver on IOSF equipped platforms without impacting their ability to
compile and load on non-IOSF platforms. Build default m to ensure availability
on x86 SOC's.
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: http://lkml.kernel.org/r/1399668248-24199-2-git-send-email-david.e.box@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
The spuriously added semicolon didn't have any effect because the
macro isn't currently in use.
c0a639ad0bc6b178b46996bd1f821a04643e2bde
Signed-off-by: Andres Freund <andres@anarazel.de>
Link: http://lkml.kernel.org/r/1399598957-7011-3-git-send-email-andres@anarazel.de
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
|
|
Standardize the idle polling indicator to TIF_POLLING_NRFLAG such that
both TIF_NEED_RESCHED and TIF_POLLING_NRFLAG are in the same word.
This will allow us, using fetch_or(), to both set NEED_RESCHED and
check for POLLING_NRFLAG in a single operation and avoid pointless
wakeups.
Changing from the non-atomic thread_info::status flags to the atomic
thread_info::flags shouldn't be a big issue since most polling state
changes were followed/preceded by a full memory barrier anyway.
Also, fix up the apm_32 idle function, clearly that was forgotten in
the last conversion. The default idle state is !POLLING so just kill
the lot.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Link: http://lkml.kernel.org/n/tip-7yksmqtlv4nfowmlqr1rifoi@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
HPET on some platform has accuracy problem. Making
"boot_hpet_disable" extern so that we can runtime disable
the HPET timer by using quirk to check the platform.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1398327498-13163-1-git-send-email-feng.tang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This makes the 64-bit and x32 vdsos use the same mechanism as the
32-bit vdso. Most of the churn is deleting all the old fixmap code.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/8af87023f57f6bb96ec8d17fce3f88018195b49b.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
This unifies the vdso mapping code and teaches it how to map special
pages at addresses corresponding to symbols in the vdso image. The
new code is used for all vdso variants, but so far only the 32-bit
variants use the new vvar page position.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/b6d7858ad7b5ac3fd3c29cab6d6d769bc45d195e.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
Currently, vdso.so files are prepared and analyzed by a combination
of objcopy, nm, some linker script tricks, and some simple ELF
parsers in the kernel. Replace all of that with plain C code that
runs at build time.
All five vdso images now generate .c files that are compiled and
linked in to the kernel image.
This should cause only one userspace-visible change: the loaded vDSO
images are stripped more heavily than they used to be. Everything
outside the loadable segment is dropped. In particular, this causes
the section table and section name strings to be missing. This
should be fine: real dynamic loaders don't load or inspect these
tables anyway. The result is roughly equivalent to eu-strip's
--strip-sections option.
The purpose of this change is to enable the vvar and hpet mappings
to be moved to the page following the vDSO load segment. Currently,
it is possible for the section table to extend into the page after
the load segment, so, if we map it, it risks overlapping the vvar or
hpet page. This happens whenever the load segment is just under a
multiple of PAGE_SIZE.
The only real subtlety here is that the old code had a C file with
inline assembler that did 'call VDSO32_vsyscall' and a linker script
that defined 'VDSO32_vsyscall = __kernel_vsyscall'. This most
likely worked by accident: the linker script entry defines a symbol
associated with an address as opposed to an alias for the real
dynamic symbol __kernel_vsyscall. That caused ld to relocate the
reference at link time instead of leaving an interposable dynamic
relocation. Since the VDSO32_vsyscall hack is no longer needed, I
now use 'call __kernel_vsyscall', and I added -Bsymbolic to make it
work. vdso2c will generate an error and abort the build if the
resulting image contains any dynamic relocations, so we won't
silently generate bad vdso images.
(Dynamic relocations are a problem because nothing will even attempt
to relocate the vdso.)
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/2c4fcf45524162a34d87fdda1eb046b2a5cecee7.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
This code is used during CPU setup, and it isn't strictly speaking
related to the 32-bit vdso. It's easier to understand how this
works when the code is closer to its callers.
This also lets syscall32_cpu_init be static, which might save some
trivial amount of kernel text.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/4e466987204e232d7b55a53ff6b9739f12237461.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
Rather than using 'vdso_enabled' and an awful #define, just call the
parameters vdso32_enabled and vdso64_enabled.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/87913de56bdcbae3d93917938302fc369b05caee.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
Note add32_with_carry(a, b) is suboptimal, as it forces
a and b in registers.
b could be a memory or a register operand.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add csum_add function for x86_64.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Header guard is #ifndef, not #ifdef...
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
Sparse warns that the percpu variables aren't declared before they are
defined. Rather than hacking around it, move espfix definitions into
a proper header file.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
The IRET instruction, when returning to a 16-bit segment, only
restores the bottom 16 bits of the user space stack pointer. This
causes some 16-bit software to break, but it also leaks kernel state
to user space. We have a software workaround for that ("espfix") for
the 32-bit kernel, but it relies on a nonzero stack segment base which
is not available in 64-bit mode.
In checkin:
b3b42ac2cbae x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
we "solved" this by forbidding 16-bit segments on 64-bit kernels, with
the logic that 16-bit support is crippled on 64-bit kernels anyway (no
V86 support), but it turns out that people are doing stuff like
running old Win16 binaries under Wine and expect it to work.
This works around this by creating percpu "ministacks", each of which
is mapped 2^16 times 64K apart. When we detect that the return SS is
on the LDT, we copy the IRET frame to the ministack and use the
relevant alias to return to userspace. The ministacks are mapped
readonly, so if IRET faults we promote #GP to #DF which is an IST
vector and thus has its own stack; we then do the fixup in the #DF
handler.
(Making #GP an IST exception would make the msr_safe functions unsafe
in NMI/MC context, and quite possibly have other effects.)
Special thanks to:
- Andy Lutomirski, for the suggestion of using very small stack slots
and copy (as opposed to map) the IRET frame there, and for the
suggestion to mark them readonly and let the fault promote to #DF.
- Konrad Wilk for paravirt fixup and testing.
- Borislav Petkov for testing help and useful comments.
Reported-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Lutomriski <amluto@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dirk Hohndel <dirk@hohndel.org>
Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
Cc: comex <comexk@gmail.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: <stable@vger.kernel.org> # consider after upstream merge
|
|
The only insn which could have both UPROBE_FIX_IP and UPROBE_FIX_CALL
was 0xe8 "call relative", and now it is handled by branch_xol_ops.
So we can change default_post_xol_op(UPROBE_FIX_CALL) to simply push
the address of next insn == utask->vaddr + insn.length, just we need
to record insn.length into the new auprobe->def.ilen member.
Note: if/when we teach branch_xol_ops to support jcxz/loopz we can
remove the "correction" logic, UPROBE_FIX_IP can use the same address.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
|
|
handle_riprel_insn() assumes that nobody else could modify ->fixups
before. This is correct but fragile, change it to use "|=".
Also make ->fixups u8, we are going to add the new members into the
union. It is not clear why UPROBE_FIX_RIP_.X lived in the upper byte,
redefine them so that they can fit into u8.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
|