summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)Author
2019-09-16drm/amdgpu: split the VM entity into direct and delayedChristian König
For page fault handling we need to use a direct update which can't be blocked by ongoing user CS. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: grab the id mgr lock while accessing passid_mappingChristian König
Need to make sure that we actually dropping the right fence. Could be done with RCU as well, but to complicated for a fix. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Chunming Zhou <david1.zhou@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu/SRIOV: Navi12 SRIOV VF doesn't load TOCJiange Zhao
In SRIOV case, the autoload sequence is the same as bare metal, except VF won't load TOC. Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu/SRIOV: Navi10/12 VF doesn't support SMUJiange Zhao
In SRIOV case, SMU and powerplay are handled in HV. VF shouldn't have control over SMU and powerplay. Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amd/amdgpu: power up sdma engine when S3 resume backPrike Liang
The sdma_v4 should be ungated when the IP resume back, otherwise it will hang up and resume time out error. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: For Navi12 SRIOV VF, register mailbox functionsJiange Zhao
Mailbox functions and interrupts are only for Navi12 VF. Register functions and irqs during initialization. Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu/sriov: add ring_stop before ring_create in psp v11 codeJack Zhang
psp v11 code missed ring stop in ring create function(VMR) while psp v3.1 code had the code. This will cause VM destroy1 fail and psp ring create fail. For SIOV-VF, ring_stop should not be deleted in ring_create function. Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amd/powerplay: properly set mp1 state for SW SMU suspend/reset routineEvan Quan
Set mp1 state properly for SW SMU suspend/reset routine. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: Fix KFD-related kernel oops on HawaiiFelix Kuehling
Hawaii needs to flush caches explicitly, submitting an IB in a user VMID from kernel mode. There is no s_fence in this case. Fixes: eb3961a57424 ("drm/amdgpu: remove fence context from the job") Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: Fix mutex lock from atomic context.Andrey Grodzovsky
Problem: amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu because writing to EEPROM during ASIC reset was unstable. But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called directly from ISR context and so locking is not allowed. Also it's irrelevant for this partilcular interrupt as this is generic RAS interrupt and not memory errors specific. Fix: Avoid calling amdgpu_ras_reserve_bad_pages if not in task context. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: Add SRIOV mailbox backend for Navi1xJiange Zhao
Mimic the ones for Vega10, add mailbox backend for Navi1x Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: implement ras query function for pcie bifGuchun Chen
ras error query funtionality implementation Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: support pcie bif ras query and injectGuchun Chen
Call pcie bif ras query/inject in amdgpu ras. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: add ras error query count interface for nbioGuchun Chen
Add the interface query_ras_error_count for nbio. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: fix CPDMA hang in PRT mode for VEGA10Tianci.Yin
add and_mask since the programming logic of golden setting changed Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Tianci.Yin <tianci.yin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: enable error injection to XGMI block via debugfsHawking Zhang
allow inject error to XGMI block via debugfs node ras_ctrl Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: initialize ras structures for xgmi block (v2)Hawking Zhang
init ras common interface and fs node for xgmi block v2: remove unnecessary physical node number check before invoking amdgpu_xgmi_ras_late_init Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: Allow to reset to EERPOM table.Andrey Grodzovsky
The table grows quickly during debug/development effort when multiple RAS errors are injected. Allow to avoid this by setting table header back to empty if needed. v2: Switch to debugfs entry instead of load time parameter. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: Add amdgpu_ras_eeprom_reset_tableAndrey Grodzovsky
This will allow to reset the table on the fly. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: rename umc ras_init to err_cnt_initTao Zhou
this interface is related to specific version of umc, distinguish it from ras_late_init Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: move umc ras init to umc blockTao Zhou
move umc ras init from ras module to umc block, generic ras module should pay less attention to specific ras block. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: move umc late init from gmc to umc blockTao Zhou
umc late init is umc specific, it's more suitable to be put in umc block Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: remove duplicated header file includeGuchun Chen
amdgpu_ras.h is already included. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: remove needless usage of #ifdefShirish S
define sched_policy in case CONFIG_HSA_AMD is not enabled, with this there is no need to check for CONFIG_HSA_AMD else where in driver code. Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Shirish S <shirish.s@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: fix build error without CONFIG_HSA_AMDShirish S
If CONFIG_HSA_AMD is not set, build fails: drivers/gpu/drm/amd/amdgpu/amdgpu_device.o: In function `amdgpu_device_ip_early_init': drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:1626: undefined reference to `sched_policy' Use CONFIG_HSA_AMD to guard this. Fixes: 1abb680ad371 ("drm/amdgpu: disable gfxoff while use no H/W scheduling policy") Signed-off-by: Shirish S <shirish.s@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: Avoid RAS recovery init when no RAS support.Andrey Grodzovsky
Fixes driver load regression on APUs. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: cleanup PTE flag generation v3Christian König
Move the ASIC specific code into a new callback function. v2: mask the flags for SI and CIK instead of a BUG_ON(). v3: remove last missed BUG_ON(). Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: cleanup mtype mappingChristian König
Unify how we map the UAPI flags to the PTE hardware flags for a mapping. Only the MTYPE is actually ASIC dependent, all other flags should be copied over 1 to 1 and ASIC differences are handled later on. Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: add navi14 PCI ID for work station SKUTianci.Yin
Add the navi14 PCI device id. Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Tianci.Yin <tianci.yin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: check if nbio->ras_if existPhilip Yang
To avoid NULL function pointer access. This happens on VG10, reboot command hangs and have to power off/on to reboot the machine. This is serial console log: [ OK ] Reached target Unmount All Filesystems. [ OK ] Reached target Final Step. Starting Reboot... [ 305.696271] systemd-shutdown[1]: Syncing filesystems and block devices. [ 306.947328] systemd-shutdown[1]: Sending SIGTERM to remaining processes... [ 306.963920] systemd-journald[1722]: Received SIGTERM from PID 1 (systemd-shutdow). [ 307.322717] systemd-shutdown[1]: Sending SIGKILL to remaining processes... [ 307.336472] systemd-shutdown[1]: Unmounting file systems. [ 307.454202] EXT4-fs (sda2): re-mounted. Opts: errors=remount-ro [ 307.480523] systemd-shutdown[1]: All filesystems unmounted. [ 307.486537] systemd-shutdown[1]: Deactivating swaps. [ 307.491962] systemd-shutdown[1]: All swaps deactivated. [ 307.497624] systemd-shutdown[1]: Detaching loop devices. [ 307.504418] systemd-shutdown[1]: All loop devices detached. [ 307.510418] systemd-shutdown[1]: Detaching DM devices. [ 307.565907] sd 2:0:0:0: [sda] Synchronizing SCSI cache [ 307.731313] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 307.738802] #PF: supervisor read access in kernel mode [ 307.744326] #PF: error_code(0x0000) - not-present page [ 307.749850] PGD 0 P4D 0 [ 307.752568] Oops: 0000 [#1] SMP PTI [ 307.756314] CPU: 3 PID: 1 Comm: systemd-shutdow Not tainted 5.2.0-rc1-kfd-yangp #453 [ 307.764644] Hardware name: ASUS All Series/Z97-PRO(Wi-Fi ac)/USB 3.1, BIOS 9001 03/07/2016 [ 307.773580] RIP: 0010:soc15_common_hw_fini+0x33/0xc0 [amdgpu] [ 307.779760] Code: 89 fb e8 60 f5 ff ff f6 83 50 df 01 00 04 75 3d 48 8b b3 90 7d 00 00 48 c7 c7 17 b8 530 [ 307.799967] RSP: 0018:ffffac9483153d40 EFLAGS: 00010286 [ 307.805585] RAX: 0000000000000000 RBX: ffff9eb299da0000 RCX: 0000000000000006 [ 307.813261] RDX: 0000000000000000 RSI: ffff9eb29e3508a0 RDI: ffff9eb29e350000 [ 307.820935] RBP: ffff9eb299da0000 R08: 0000000000000000 R09: 0000000000000000 [ 307.828609] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9eb299dbd1f8 [ 307.836284] R13: ffffffffc04f8368 R14: ffff9eb29cebd130 R15: 0000000000000000 [ 307.843959] FS: 00007f06721c9940(0000) GS:ffff9eb2a18c0000(0000) knlGS:0000000000000000 [ 307.852663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 307.858842] CR2: 0000000000000000 CR3: 000000081d798005 CR4: 00000000001606e0 [ 307.866516] Call Trace: [ 307.869169] amdgpu_device_ip_suspend_phase2+0x80/0x110 [amdgpu] [ 307.875654] ? amdgpu_device_ip_suspend_phase1+0x4d/0xd0 [amdgpu] [ 307.882230] amdgpu_device_ip_suspend+0x2e/0x60 [amdgpu] [ 307.887966] amdgpu_pci_shutdown+0x2f/0x40 [amdgpu] [ 307.893211] pci_device_shutdown+0x31/0x60 [ 307.897613] device_shutdown+0x14c/0x1f0 [ 307.901829] kernel_restart+0xe/0x50 [ 307.905669] __do_sys_reboot+0x1df/0x210 [ 307.909884] ? task_work_run+0x73/0xb0 [ 307.913914] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 307.918970] do_syscall_64+0x4a/0x1c0 [ 307.922904] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 307.928336] RIP: 0033:0x7f0671cf8373 [ 307.932176] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 89 fa be 69 19 128 [ 307.952384] RSP: 002b:00007ffdd1723d68 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9 [ 307.960527] RAX: ffffffffffffffda RBX: 0000000001234567 RCX: 00007f0671cf8373 [ 307.968201] RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead [ 307.975875] RBP: 00007ffdd1723dd0 R08: 0000000000000000 R09: 0000000000000000 [ 307.983550] R10: 0000000000000002 R11: 0000000000000202 R12: 00007ffdd1723dd8 [ 307.991224] R13: 0000000000000000 R14: 0000001b00000004 R15: 00007ffdd17240c8 [ 307.998901] Modules linked in: xt_MASQUERADE nfnetlink iptable_nat xt_addrtype xt_conntrack nf_nat nf_cos [ 308.026505] CR2: 0000000000000000 [ 308.039998] RIP: 0010:soc15_common_hw_fini+0x33/0xc0 [amdgpu] [ 308.046180] Code: 89 fb e8 60 f5 ff ff f6 83 50 df 01 00 04 75 3d 48 8b b3 90 7d 00 00 48 c7 c7 17 b8 530 [ 308.066392] RSP: 0018:ffffac9483153d40 EFLAGS: 00010286 [ 308.072013] RAX: 0000000000000000 RBX: ffff9eb299da0000 RCX: 0000000000000006 [ 308.079689] RDX: 0000000000000000 RSI: ffff9eb29e3508a0 RDI: ffff9eb29e350000 [ 308.087366] RBP: ffff9eb299da0000 R08: 0000000000000000 R09: 0000000000000000 [ 308.095042] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9eb299dbd1f8 [ 308.102717] R13: ffffffffc04f8368 R14: ffff9eb29cebd130 R15: 0000000000000000 [ 308.110394] FS: 00007f06721c9940(0000) GS:ffff9eb2a18c0000(0000) knlGS:0000000000000000 [ 308.119099] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 308.125280] CR2: 0000000000000000 CR3: 000000081d798005 CR4: 00000000001606e0 [ 308.135304] printk: systemd-shutdow: 3 output lines suppressed due to ratelimiting [ 308.143518] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 [ 308.151798] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xff) [ 308.171775] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]--- Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: fix null pointer deref in firmware header printingXiaojie Yuan
v2: declare as (struct common_firmware_header *) type because struct xxx_firmware_header inherits from it When CE's ucode_id(8) is used to get sdma_hdr, we will be accessing an unallocated amdgpu_firmware_info instance. This issue appears on rhel7.7 with gcc 4.8.5. Newer compilers might have optimized out such 'defined but not referenced' variable. [ 1120.798564] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a [ 1120.806703] IP: [<ffffffffc0e3c9b3>] psp_np_fw_load+0x1e3/0x390 [amdgpu] [ 1120.813693] PGD 80000002603ff067 PUD 271b8d067 PMD 0 [ 1120.818931] Oops: 0000 [#1] SMP [ 1120.822245] Modules linked in: amdgpu(OE+) amdkcl(OE) amd_iommu_v2 amdttm(OE) amd_sched(OE) xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun bridge stp llc devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc dm_mirror dm_region_hash dm_log dm_mod intel_pmc_core intel_powerclamp coretemp intel_rapl joydev kvm_intel eeepc_wmi asus_wmi kvm sparse_keymap iTCO_wdt irqbypass rfkill crc32_pclmul snd_hda_codec_realtek mxm_wmi ghash_clmulni_intel intel_wmi_thunderbolt iTCO_vendor_support snd_hda_codec_generic snd_hda_codec_hdmi aesni_intel lrw gf128mul glue_helper ablk_helper sg cryptd pcspkr snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd pinctrl_sunrisepoint pinctrl_intel soundcore acpi_pad mei_me wmi mei i2c_i801 pcc_cpufreq ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i915 i2c_algo_bit iosf_mbi drm_kms_helper e1000e syscopyarea sysfillrect sysimgblt fb_sys_fops ahci libahci drm ptp libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw pps_core drm_panel_orientation_quirks video i2c_hid [ 1120.954136] CPU: 4 PID: 2426 Comm: modprobe Tainted: G OE ------------ 3.10.0-1062.el7.x86_64 #1 [ 1120.964390] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1302 11/09/2015 [ 1120.973321] task: ffff991ef1e3c1c0 ti: ffff991ee625c000 task.ti: ffff991ee625c000 [ 1120.981020] RIP: 0010:[<ffffffffc0e3c9b3>] [<ffffffffc0e3c9b3>] psp_np_fw_load+0x1e3/0x390 [amdgpu] [ 1120.990483] RSP: 0018:ffff991ee625f950 EFLAGS: 00010202 [ 1120.995935] RAX: 0000000000000002 RBX: ffff991edf6b2d38 RCX: ffff991edf6a0000 [ 1121.003391] RDX: 0000000000000000 RSI: ffff991f01d13898 RDI: ffffffffc110afb3 [ 1121.010706] RBP: ffff991ee625f9b0 R08: 0000000000000000 R09: 0000000000000000 [ 1121.018029] R10: 00000000000004c4 R11: ffff991ee625f64e R12: ffff991edf6b3220 [ 1121.025353] R13: ffff991edf6a0000 R14: 0000000000000008 R15: ffff991edf6b2d30 [ 1121.032666] FS: 00007f97b0c0b740(0000) GS:ffff991f01d00000(0000) knlGS:0000000000000000 [ 1121.041000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1121.046880] CR2: 000000000000000a CR3: 000000025e604000 CR4: 00000000003607e0 [ 1121.054239] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1121.061631] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1121.068938] Call Trace: [ 1121.071494] [<ffffffffc0e3dba8>] psp_hw_init+0x218/0x270 [amdgpu] [ 1121.077886] [<ffffffffc0da3188>] amdgpu_device_fw_loading+0xe8/0x160 [amdgpu] [ 1121.085296] [<ffffffffc0e3b34c>] ? vega10_ih_irq_init+0x4bc/0x730 [amdgpu] [ 1121.092534] [<ffffffffc0da5c75>] amdgpu_device_init+0x1495/0x1c90 [amdgpu] [ 1121.099675] [<ffffffffc0da9cab>] amdgpu_driver_load_kms+0x8b/0x2f0 [amdgpu] [ 1121.106888] [<ffffffffc01b25cf>] drm_dev_register+0x12f/0x1d0 [drm] [ 1121.113419] [<ffffffffa4dcdfd8>] ? pci_enable_device_flags+0xe8/0x140 [ 1121.120183] [<ffffffffc0da260a>] amdgpu_pci_probe+0xca/0x170 [amdgpu] [ 1121.126919] [<ffffffffa4dcf97a>] local_pci_probe+0x4a/0xb0 [ 1121.132622] [<ffffffffa4dd10c9>] pci_device_probe+0x109/0x160 [ 1121.138607] [<ffffffffa4eb4205>] driver_probe_device+0xc5/0x3e0 [ 1121.144766] [<ffffffffa4eb4603>] __driver_attach+0x93/0xa0 [ 1121.150507] [<ffffffffa4eb4570>] ? __device_attach+0x50/0x50 [ 1121.156422] [<ffffffffa4eb1da5>] bus_for_each_dev+0x75/0xc0 [ 1121.162213] [<ffffffffa4eb3b7e>] driver_attach+0x1e/0x20 [ 1121.167771] [<ffffffffa4eb3620>] bus_add_driver+0x200/0x2d0 [ 1121.173590] [<ffffffffa4eb4c94>] driver_register+0x64/0xf0 [ 1121.179345] [<ffffffffa4dd0905>] __pci_register_driver+0xa5/0xc0 [ 1121.185593] [<ffffffffc099f000>] ? 0xffffffffc099efff [ 1121.190914] [<ffffffffc099f0a4>] amdgpu_init+0xa4/0xb0 [amdgpu] [ 1121.197101] [<ffffffffa4a0210a>] do_one_initcall+0xba/0x240 [ 1121.202901] [<ffffffffa4b1c90a>] load_module+0x271a/0x2bb0 [ 1121.208598] [<ffffffffa4dad740>] ? ddebug_proc_write+0x100/0x100 [ 1121.214894] [<ffffffffa4b1ce8f>] SyS_init_module+0xef/0x140 [ 1121.220698] [<ffffffffa518bede>] system_call_fastpath+0x25/0x2a [ 1121.226870] Code: b4 01 60 a2 00 00 31 c0 e8 83 60 33 e4 41 8b 47 08 48 8b 4d d0 48 c7 c7 b3 af 10 c1 48 69 c0 68 07 00 00 48 8b 84 01 60 a2 00 00 <48> 8b 70 08 31 c0 48 89 75 c8 e8 56 60 33 e4 48 8b 4d d0 48 c7 [ 1121.247422] RIP [<ffffffffc0e3c9b3>] psp_np_fw_load+0x1e3/0x390 [amdgpu] [ 1121.254432] RSP <ffff991ee625f950> [ 1121.258017] CR2: 000000000000000a [ 1121.261427] ---[ end trace e98b35387ede75bd ]--- Signed-off-by: Xiaojie Yuan <xiaojie.yuan@amd.com> Fixes: c5fb912653dae3f878 ("drm/amdgpu: add firmware header printing for psp fw loading (v2)") Reviewed-by: Kevin Wang <kevin1.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdkfd: enable renoir while device probesHuang Rui
This patch is to add asic flag to enable device probe during kfd init. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: disable gfxoff while use no H/W scheduling policyHuang Rui
While gfxoff is enabled, the mmVM_XXX registers will be 0xfffffff while the GFX is in "off" state. KFD queue creattion doesn't use ring based method, so it will trigger a VM fault. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: Disable retry faults in VMID0Felix Kuehling
There is no point retrying page faults in VMID0. Those faults are always fatal. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-and-Tested-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu: Add a kernel parameter for specifying the asic typeYong Zhao
As more and more new asics start to reuse the old device IDs before launch, there is a need to quickly override the existing asic type corresponding to the reused device ID through a kernel parameter. With this, engineers no longer need to rely on local hack patches, facilitating cooperation across teams. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16drm/amdgpu/irq: check if nbio funcs existAlex Deucher
We need to check if the nbios funcs exist before checking the individual pointers. Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
2019-09-13drm/amdgpu: move the call of ras recovery_init and bad page reserve to ↵Tao Zhou
proper place ras recovery_init should be called after ttm init, bad page reserve should be put in front of gpu reset since i2c may be unstable during gpu reset. add cleanup for recovery_init and recovery_fini v2: add more comment and print. remove cancel_work_sync in recovery_init. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: save umc error recordsTao Zhou
save umc error records to ras bad page array v2: add bad pages before gpu reset v3: add NULL check for adev->umc.funcs Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: Hook EEPROM table to RASTao Zhou
support eeprom records load and save for ras, move EEPROM records storing to bad page reserving v2: remove redundant check for con->eh_data Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: change ras bps type to eeprom table record structureTao Zhou
change bps type from retired page to eeprom table record, prepare for saving umc error records to eeprom Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/madgpu: Fix EEPROM Checksum calculation.Andrey Grodzovsky
Fix typo which messed up the calculation. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: Remove clock gating restore.Andrey Grodzovsky
Restoring clock gating break SMU opeartion afterwards, avoid this until this further invistigated with SMU. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdkfd: Query kfd device info by CHIP id instead of pci device idYong Zhao
This optimizes out the pci device id usage in KFD and makes the code more maintainable. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: Disable page faults while reading user wptrsFelix Kuehling
These wptrs must be pinned and GPU accessible when this is called from hqd_load functions. So they should never fault. This resolves a circular lock dependency issue involving four locks including the DQM lock and mmap_sem. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Oak Zeng <Oak.Zeng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: clean up load TMR sequenceJohn Clements
Removed redundant goto statement Signed-off-by: John Clements <john.clements@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: enable TA load support in ArcturusJohn Clements
Add support for loading XGMI/RAS FW Signed-off-by: John Clements <john.clements@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: change r type to int in gmc_v9_0_late_initTao Zhou
change r type from bool to int, suitable for both bool and int return value Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amd/amdgpu: add sw_fini interface for df_funcsJack Zhang
add sw_fini interface of df_funcs. This interface will remove sysfs file of df_cntr_avail function. The old behavior only create sysfs of df_cntr_avail in sw_init, but never remove it for lack of sw_fini interface. With this,driver will report create sysfs fail when it's loaded for the second time. Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com> Reviewed-by: Jonathan Kim <Jonathan.Kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu: init UMC & RSMU register base addressHawking Zhang
UMC RAS feature requires access to UMC & RSMU registers Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13drm/amdgpu/nbio: switch to amdgpu_nbio_ras_late_init helper functionHawking Zhang
amdgpu_nbio_ras_late_init is used to init nbio specfic ras debugfs/sysfs node and nbio specific interrupt handler. It can be shared among nbio generations Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>