path: root/fs/ext4
Age  Commit message  Author
2025-04-05  treewide: Switch/rename to timer_delete[_sync]()  Thomas Gleixner
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-01  Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  Linus Torvalds
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some relatively minor cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found when using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes things better - our own HMM selftests now succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed removes the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs.
- The series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DAMON/DAMOS to filter by the page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while running our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatory cleanups to the GENERIC_PTDUMP Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year.
- The series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallelization of huge page initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code.
- The series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documentation is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permitting the removal of a number of special-case checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatory refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code.
- The series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
  mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
  x86/mm: restore early initialization of high_memory for 32-bits
  mm/vmscan: don't try to reclaim hwpoison folio
  mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
  cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
  mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
  selftests/mm: speed up split_huge_page_test
  selftests/mm: uffd-unit-tests support for hugepages > 2M
  docs/mm/damon/design: document active DAMOS filter type
  mm/damon: implement a new DAMOS filter type for active pages
  fs/dax: don't disassociate zero page entries
  MM documentation: add "Unaccepted" meminfo entry
  selftests/mm: add commentary about 9pfs bugs
  fork: use __vmalloc_node() for stack allocation
  docs/mm: Physical Memory: Populate the "Zones" section
  xen: balloon: update the NR_BALLOON_PAGES state
  hv_balloon: update the NR_BALLOON_PAGES state
  balloon_compaction: update the NR_BALLOON_PAGES state
  meminfo: add a per node counter for balloon drivers
  mm: remove references to folio in __memcg_kmem_uncharge_page()
  ...
2025-03-27  Merge tag 'ext4-for_linus-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4  Linus Torvalds
Pull ext4 updates from Ted Ts'o:
"Ext4 bug fixes and cleanups, including:
- hardening against maliciously fuzzed file systems
- backwards compatibility for the brief period when we attempted to ignore zero-width characters
- avoid potentially BUG'ing if there is a file system corruption found during the file system unmount
- fix free space reporting by statfs when project quotas are enabled and the free space is less than the remaining project quota
Also improve performance when replaying a journal with a very large number of revoke records (applicable for Lustre volumes)"
* tag 'ext4-for_linus-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (71 commits)
  ext4: fix OOB read when checking dotdot dir
  ext4: on a remount, only log the ro or r/w state when it has changed
  ext4: correct the error handle in ext4_fallocate()
  ext4: Make sb update interval tunable
  ext4: avoid journaling sb update on error if journal is destroying
  ext4: define ext4_journal_destroy wrapper
  ext4: hash: simplify kzalloc(n * 1, ...) to kzalloc(n, ...)
  jbd2: add a missing data flush during file and fs synchronization
  ext4: don't over-report free space or inodes in statvfs
  ext4: clear DISCARD flag if device does not support discard
  jbd2: remove jbd2_journal_unfile_buffer()
  ext4: reorder capability check last
  ext4: update the comment about mb_optimize_scan
  jbd2: fix off-by-one while erasing journal
  ext4: remove references to bh->b_page
  ext4: goto right label 'out_mmap_sem' in ext4_setattr()
  ext4: fix out-of-bound read in ext4_xattr_inode_dec_ref_all()
  ext4: introduce ITAIL helper
  jbd2: remove redundant function jbd2_journal_has_csum_v2or3_feature
  ext4: remove redundant function ext4_has_metadata_csum
  ...
2025-03-24  Merge tag 'vfs-6.15-rc1.ceph' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  Linus Torvalds
Pull vfs ceph updates from Christian Brauner:
"This contains the work to remove access to page->index from ceph and fixes the test failure observed for ceph with generic/421 by refactoring ceph_writepages_start()"
* tag 'vfs-6.15-rc1.ceph' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fscrypt: Change fscrypt_encrypt_pagecache_blocks() to take a folio
  ceph: Fix error handling in fill_readdir_cache()
  fs: Remove page_mkwrite_check_truncate()
  ceph: Pass a folio to ceph_allocate_page_array()
  ceph: Convert ceph_move_dirty_page_in_page_array() to move_dirty_folio_in_page_array()
  ceph: Remove uses of page from ceph_process_folio_batch()
  ceph: Convert ceph_check_page_before_write() to use a folio
  ceph: Convert writepage_nounlock() to write_folio_nounlock()
  ceph: Convert ceph_readdir_cache_control to store a folio
  ceph: Convert ceph_find_incompatible() to take a folio
  ceph: Use a folio in ceph_page_mkwrite()
  ceph: Remove ceph_writepage()
  ceph: fix generic/421 test failure
  ceph: introduce ceph_submit_write() method
  ceph: introduce ceph_process_folio_batch() method
  ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring
2025-03-24  Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  Linus Torvalds
Pull vfs async dir updates from Christian Brauner:
"This contains cleanups that fell out of the work from async directory handling:
- Change kern_path_locked() and user_path_locked_at() to never return a negative dentry. This simplifies the usability of these helpers in various places
- Drop d_exact_alias() from the remaining place in NFS where it is still used. This also allows us to drop the d_exact_alias() helper completely
- Drop an unnecessary call to fh_update() from nfsd_create_locked()
- Change i_op->mkdir() to return a struct dentry
  Change vfs_mkdir() to return a dentry provided by the filesystems which is hashed and positive. This allows us to reduce the number of cases where the resulting dentry is not positive to very few cases. The code in these places becomes simpler and easier to understand.
- Repack DENTRY_* and LOOKUP_* flags"
* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  doc: fix inline emphasis warning
  VFS: Change vfs_mkdir() to return the dentry.
  nfs: change mkdir inode_operation to return alternate dentry if needed.
  fuse: return correct dentry for ->mkdir
  ceph: return the correct dentry on mkdir
  hostfs: store inode in dentry after mkdir if possible.
  Change inode_operations.mkdir to return struct dentry *
  nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
  nfs/vfs: discard d_exact_alias()
  VFS: add common error checks to lookup_one_qstr_excl()
  VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
  VFS: repack LOOKUP_ bit flags.
  VFS: repack DENTRY_ flags.
2025-03-24  Merge tag 'vfs-6.15-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  Linus Torvalds
Pull vfs iomap updates from Christian Brauner:
- Allow the filesystem to submit the writeback bios.
- Allow the filesystem to track completions on a per-bio basis instead of the entire I/O.
- Change writeback_ops so that ->submit_bio can be done by the filesystem.
- A new ANON_WRITE flag for writes that don't have a block number assigned to them at the iomap level leaving the filesystem to do that work in the submission handler.
- Incremental iterator advance
  The folio_batch support for zero range where the filesystem provides a batch of folios to process that might not be logically contiguous requires more flexibility than the current offset based iteration currently offers. Update all iomap operations to advance the iterator within the operation and thus remove the need to advance from the core iomap iterator.
- Make buffered writes work with RWF_DONTCACHE
  If RWF_DONTCACHE is set for a write, mark the folios being written as uncached. On writeback completion the pages will be dropped.
- Introduce infrastructure for large atomic writes
  This will eventually be used by xfs and ext4.
* tag 'vfs-6.15-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
  iomap: rework IOMAP atomic flags
  iomap: comment on atomic write checks in iomap_dio_bio_iter()
  iomap: inline iomap_dio_bio_opflags()
  iomap: fix inline data on buffered read
  iomap: Lift blocksize restriction on atomic writes
  iomap: Support SW-based atomic writes
  iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
  xfs: flag as supporting FOP_DONTCACHE
  iomap: make buffered writes work with RWF_DONTCACHE
  iomap: introduce a full map advance helper
  iomap: rename iomap_iter processed field to status
  iomap: remove unnecessary advance from iomap_iter()
  dax: advance the iomap_iter on pte and pmd faults
  dax: advance the iomap_iter on dedupe range
  dax: advance the iomap_iter on unshare range
  dax: advance the iomap_iter on zero range
  dax: push advance down into dax_iomap_iter() for read and write
  dax: advance the iomap_iter in the read/write path
  iomap: convert misc simple ops to incremental advance
  iomap: advance the iter on direct I/O
  ...
2025-03-21  ext4: fix OOB read when checking dotdot dir  Acs, Jakub
Mounting a corrupted filesystem with a directory that contains a '.' dir entry with rec_len == block size results in an out-of-bounds read (later on, when the corrupted directory is removed).

ext4_empty_dir() assumes every ext4 directory contains at least '.' and '..' as directory entries in the first data block. It first loads the '.' dir entry, performs sanity checks by calling ext4_check_dir_entry() and then uses its rec_len member to compute the location of the '..' dir entry (in ext4_next_entry). It assumes the '..' dir entry fits into the same data block.

If the rec_len of '.' is precisely one block (4KB), it slips through the sanity checks (it is considered the last directory entry in the data block) and leaves "struct ext4_dir_entry_2 *de" pointing exactly past the memory slot allocated to the data block. The following call to ext4_check_dir_entry() on the new value of de then dereferences this pointer, which results in an out-of-bounds memory access.

Fix this by extending __ext4_check_dir_entry() to check for '.' dir entries that reach the end of the data block. Make sure to ignore the phony dir entries for checksum (by checking name_len for non-zero).

Note: This is reported by KASAN as use-after-free in case another structure was recently freed from the slot past the bound, but it is really an OOB read.

This issue was found by the syzkaller tool.

Call Trace:
[ 38.594108] BUG: KASAN: slab-use-after-free in __ext4_check_dir_entry+0x67e/0x710
[ 38.594649] Read of size 2 at addr ffff88802b41a004 by task syz-executor/5375
[ 38.595158]
[ 38.595288] CPU: 0 UID: 0 PID: 5375 Comm: syz-executor Not tainted 6.14.0-rc7 #1
[ 38.595298] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 38.595304] Call Trace:
[ 38.595308] <TASK>
[ 38.595311] dump_stack_lvl+0xa7/0xd0
[ 38.595325] print_address_description.constprop.0+0x2c/0x3f0
[ 38.595339] ? __ext4_check_dir_entry+0x67e/0x710
[ 38.595349] print_report+0xaa/0x250
[ 38.595359] ? __ext4_check_dir_entry+0x67e/0x710
[ 38.595368] ? kasan_addr_to_slab+0x9/0x90
[ 38.595378] kasan_report+0xab/0xe0
[ 38.595389] ? __ext4_check_dir_entry+0x67e/0x710
[ 38.595400] __ext4_check_dir_entry+0x67e/0x710
[ 38.595410] ext4_empty_dir+0x465/0x990
[ 38.595421] ? __pfx_ext4_empty_dir+0x10/0x10
[ 38.595432] ext4_rmdir.part.0+0x29a/0xd10
[ 38.595441] ? __dquot_initialize+0x2a7/0xbf0
[ 38.595455] ? __pfx_ext4_rmdir.part.0+0x10/0x10
[ 38.595464] ? __pfx___dquot_initialize+0x10/0x10
[ 38.595478] ? down_write+0xdb/0x140
[ 38.595487] ? __pfx_down_write+0x10/0x10
[ 38.595497] ext4_rmdir+0xee/0x140
[ 38.595506] vfs_rmdir+0x209/0x670
[ 38.595517] ? lookup_one_qstr_excl+0x3b/0x190
[ 38.595529] do_rmdir+0x363/0x3c0
[ 38.595537] ? __pfx_do_rmdir+0x10/0x10
[ 38.595544] ? strncpy_from_user+0x1ff/0x2e0
[ 38.595561] __x64_sys_unlinkat+0xf0/0x130
[ 38.595570] do_syscall_64+0x5b/0x180
[ 38.595583] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: ac27a0ec112a0 ("[PATCH] ext4: initial copy of files from ext3")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Mahmoud Adam <mngyadam@amazon.com>
Cc: stable@vger.kernel.org
Cc: security@kernel.org
Link: https://patch.msgid.link/b3ae36a6794c4a01944c7d70b403db5b@amazon.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
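The added check can be illustrated outside the kernel. The following is a minimal userspace sketch, not the code from the patch: the function name demo_dot_reaches_block_end() and its parameter list are hypothetical, and the on-disk entry is reduced to the few values the check needs. It flags a '.' entry whose rec_len carries the next entry to or past the end of the data block, which is the condition that left no room for '..'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): returns true when a "."
 * entry at byte `offset` in a block of `block_size` bytes has a rec_len
 * that reaches the end of the block, so that no ".." entry can follow
 * it in the same block.
 *
 * Entries with name_len == 0 (such as the phony checksum entry the
 * commit message mentions) never match, because name_len must be 1.
 */
bool demo_dot_reaches_block_end(const char *name, uint8_t name_len,
                                size_t offset, size_t rec_len,
                                size_t block_size)
{
    assert(offset < block_size);

    /* name[0] is only read when name_len says there is one byte. */
    bool is_dot = (name_len == 1 && name[0] == '.');

    return is_dot && offset + rec_len >= block_size;
}
```

With a 4096-byte block, a '.' entry at offset 0 with rec_len 4096 is rejected, while the usual rec_len of 12 passes.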
2025-03-21  ext4: on a remount, only log the ro or r/w state when it has changed  Nicolas Bretz
A user complained that a message such as: EXT4-fs (nvme0n1p3): re-mounted UUID ro. Quota mode: none. implied that the file system was previously mounted read/write and was now remounted read-only, when it could have been some other mount state that had been changed by the "mount -o remount" operation. Fix this by only logging "ro" or "r/w" when it has changed. https://bugzilla.kernel.org/show_bug.cgi?id=219132 Signed-off-by: Nicolas Bretz <bretznic@gmail.com> Link: https://patch.msgid.link/20250319171011.8372-1-bretznic@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
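The logging rule described above amounts to comparing the read-only state before and after the remount. A minimal sketch, assuming a hypothetical helper demo_remount_state_suffix() that returns the text to splice into the log message (the real patch changes the message formatting in ext4's remount path, which is not reproduced here):

```c
#include <stdbool.h>

/*
 * Hypothetical helper: pick the state suffix for the remount log line.
 * The suffix is empty unless the read-only state actually changed.
 */
const char *demo_remount_state_suffix(bool was_ro, bool is_ro)
{
    if (was_ro == is_ro)
        return "";              /* unchanged: say nothing about ro/rw */

    return is_ro ? " ro" : " r/w";
}
```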
2025-03-21  ext4: correct the error handle in ext4_fallocate()  Zhang Yi
The error out label of file_modified() should be out_inode_lock in ext4_fallocate(). Fixes: 2890e5e0f49e ("ext4: move out common parts into ext4_fallocate()") Reported-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250319023557.2785018-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-21  ext4: Make sb update interval tunable  Ojaswin Mujoo
Currently, outside error paths, we auto commit the super block after 1 hour has passed and 16MB worth of updates have been written since the last commit. This is a policy decision, so make it tunable while keeping the defaults the same. This is useful if the user wants to tweak the superblock behavior, or for debugging the codepath by triggering it more frequently. We can now tweak the super block update using the sb_update_sec and sb_update_kb files in /sys/fs/ext4/<dev>/ Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/950fb8c9b2905620e16f02a3b9eeea5a5b6cb87e.1742279837.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
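The policy described above can be sketched as a two-threshold test. This is an illustration only: struct demo_sb_policy and demo_sb_update_due() are hypothetical names, the defaults are taken from the commit message (1 hour, 16MB), and the exact boundary handling (>= versus >) is an assumption rather than a statement about the kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Defaults stated in the commit message: 1 hour and 16MB (in KB). */
#define DEMO_SB_UPDATE_SEC 3600u
#define DEMO_SB_UPDATE_KB  (16u * 1024u)

/* Hypothetical mirror of the two tunables, sb_update_sec and sb_update_kb. */
struct demo_sb_policy {
    uint64_t update_sec;
    uint64_t update_kb;
};

/*
 * Both conditions must hold: enough time has passed AND enough data
 * has been written since the last superblock commit.
 */
bool demo_sb_update_due(const struct demo_sb_policy *p,
                        uint64_t secs_since_commit,
                        uint64_t kb_written_since_commit)
{
    return secs_since_commit >= p->update_sec &&
           kb_written_since_commit >= p->update_kb;
}
```

Lowering both thresholds, as one might when debugging, makes the update path trigger almost immediately.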
2025-03-21  ext4: avoid journaling sb update on error if journal is destroying  Ojaswin Mujoo
Presently we always BUG_ON if trying to start a transaction on a journal marked with JBD2_UNMOUNT, since this should never happen. However, while running LTP stress tests, it was observed that in some error handling paths it is possible for update_super_work to start a transaction after the journal is destroyed, e.g.:

  (umount)
  ext4_kill_sb
  kill_block_super
  generic_shutdown_super
  sync_filesystem /* commits all txns */
  evict_inodes /* might start a new txn */
  ext4_put_super
  flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
  jbd2_journal_destroy
  journal_kill_thread
  journal->j_flags |= JBD2_UNMOUNT;
  jbd2_journal_commit_transaction
  jbd2_journal_get_descriptor_buffer
  jbd2_journal_bmap
  ext4_journal_bmap
  ext4_map_blocks
  ...
  ext4_inode_error
  ext4_handle_error
  schedule_work(&sbi->s_sb_upd_work)
  /* work queue kicks in */
  update_super_work
  jbd2_journal_start
  start_this_handle
  BUG_ON(journal->j_flags & JBD2_UNMOUNT)

Hence, introduce a new mount flag to indicate that the journal is being destroyed, and only do a journaled (and deferred) update of sb if this flag is not set. Otherwise, just fall back to an un-journaled commit.

Further, in the journal destroy path, we have the following sequence:
1. Set mount flag indicating journal is destroying
2. force a commit and wait for it
3. flush pending sb updates

This sequence is important as it ensures that, after this point, there is no sb update that might be journaled so it is safe to update the sb outside the journal. (To avoid race discussed in 2d01ddc86606)

Also, we don't need a similar check in ext4_grp_locked_error since it is only called from mballoc and AFAICT it would be always valid to schedule work here.
Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available") Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/9613c465d6ff00cd315602f99283d5f24018c3f7.1742279837.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
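The decision this commit adds to the error path can be sketched as a small predicate. The names below (enum demo_sb_commit, demo_pick_sb_commit()) are hypothetical and the booleans stand in for the real mount flag and journal pointer; the sketch only shows the branch structure described in the message, not the kernel implementation.

```c
#include <stdbool.h>

/* Hypothetical labels for the two ways of persisting the superblock. */
enum demo_sb_commit {
    DEMO_COMMIT_JOURNALED,  /* deferred, goes through the journal */
    DEMO_COMMIT_DIRECT      /* written outside the journal */
};

/*
 * A journaled update is only chosen when a journal exists and the
 * "journal is destroying" flag is not set. Once the flag is set, new
 * transactions must not be started, so fall back to a direct commit.
 */
enum demo_sb_commit demo_pick_sb_commit(bool has_journal,
                                        bool journal_destroying)
{
    if (has_journal && !journal_destroying)
        return DEMO_COMMIT_JOURNALED;

    return DEMO_COMMIT_DIRECT;
}
```

Setting the flag before forcing the final commit and flushing pending work is what makes the direct path safe: no journaled sb update can still be queued after that point.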
2025-03-21  ext4: define ext4_journal_destroy wrapper  Ojaswin Mujoo
Define an ext4 wrapper over jbd2_journal_destroy to make sure we have consistent behavior during journal destruction. This will also come in useful in the next patch where we add some ext4 specific logic in the destroy path. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/c3ba78c5c419757e6d5f2d8ebb4a8ce9d21da86a.1742279837.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-21  ext4: hash: simplify kzalloc(n * 1, ...) to kzalloc(n, ...)  Ethan Carter Edwards
sizeof(char) evaluates to 1. Remove the churn. Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://patch.msgid.link/20250316-ext4-hash-kcalloc-v2-1-2a99e93ec6e0@ethancedwards.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
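The simplification rests on a guarantee of the C language rather than on anything ext4-specific: sizeof(char) is 1 by definition, so multiplying a byte count by it changes nothing. A small userspace check (demo_bytes_for_chars() is a hypothetical name used only for this illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Guaranteed by the C standard on every conforming implementation. */
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

/* n * sizeof(char) is therefore always just n. */
size_t demo_bytes_for_chars(size_t n)
{
    return n * sizeof(char);
}
```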
2025-03-20  ext4: don't over-report free space or inodes in statvfs  Theodore Ts'o
This fixes an analogous bug that was fixed in xfs in commit 4b8d867ca6e2 ("xfs: don't over-report free space or inodes in statvfs") where statfs can report misleading / incorrect information when project quota is enabled, and the free space is less than the remaining quota. This commit will resolve a test failure in generic/762 which tests for this bug. Cc: stable@kernel.org Fixes: 689c958cbe6b ("ext4: add project quota support") Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
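The intended behaviour can be sketched as a clamp: the free space reported for a project should never exceed what the file system actually has free. The function below is a hypothetical illustration, not the ext4 code; treating a limit of 0 as "no quota limit" is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: free blocks to report when a project quota
 * applies. Returns the smaller of the remaining quota and the blocks
 * genuinely free in the file system.
 */
uint64_t demo_reported_free(uint64_t fs_free_blocks,
                            uint64_t quota_limit,
                            uint64_t quota_used)
{
    uint64_t remaining;
    uint64_t result;

    if (quota_limit == 0)       /* assumption: 0 means no limit set */
        return fs_free_blocks;

    remaining = quota_limit > quota_used ? quota_limit - quota_used : 0;
    result = remaining < fs_free_blocks ? remaining : fs_free_blocks;

    /* Never report more than the file system really has free. */
    assert(result <= fs_free_blocks);
    return result;
}
```

For example, with 100 blocks free and 800 blocks of quota remaining, the sketch reports 100 rather than 800, which is the over-reporting case the commit describes.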
2025-03-20  iomap: rework IOMAP atomic flags  John Garry
Flag IOMAP_ATOMIC_SW is not really required. The idea of having this flag is that the FS ->iomap_begin callback could check if this flag is set to decide whether to do a SW (FS-based) atomic write. But the FS can set which ->iomap_begin callback it wants when deciding to do a FS-based atomic write. Furthermore, it was thought that IOMAP_ATOMIC_HW is not a proper name, as the block driver can use SW-methods to emulate an atomic write. So change back to IOMAP_ATOMIC. The ->iomap_begin callback needs though to indicate to iomap core that REQ_ATOMIC needs to be set, so add IOMAP_F_ATOMIC_BIO for that. These changes were suggested by Christoph Hellwig and Dave Chinner. Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20250320120250.4087011-4-john.g.garry@oracle.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-17  fs/dax: ensure all pages are idle prior to filesystem unmount  Alistair Popple
File systems call dax_break_mapping() prior to reallocating file system blocks to ensure the page is not undergoing any DMA or other accesses. Generally this is needed when a file is truncated to ensure that if a block is reallocated nothing is writing to it. However filesystems currently don't call this when an FS DAX inode is evicted. This can cause problems when the file system is unmounted as a page can continue to be undergoing DMA or other remote access after unmount. This means if the file system is remounted any truncate or other operation which requires the underlying file system block to be freed will not wait for the remote access to complete. Therefore a busy block may be reallocated to a new file leading to corruption. Link: https://lkml.kernel.org/r/2d3cf575bbd095084993154be2f0aa7442e5cd28.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Wiliams <dan.j.williams@intel.com> Cc: "Darrick J.
Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17  fs/dax: create a common implementation to break DAX layouts  Alistair Popple
Prior to freeing a block file systems supporting FS DAX must check that the associated pages are both unmapped from user-space and not undergoing DMA or other access from eg. get_user_pages(). This is achieved by unmapping the file range and scanning the FS DAX page-cache to see if any pages within the mapping have an elevated refcount. This is done using two functions - dax_layout_busy_page_range() which returns a page to wait for the refcount to become idle on. Rather than open-code this introduce a common implementation to both unmap and wait for the page to become idle. Link: https://lkml.kernel.org/r/c4d381e41fc618296cee2820403c166d80599d5c.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: "Darrick J. 
Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17fs/dax: refactor wait for dax idle pageAlistair Popple
A FS DAX page is considered idle when its refcount drops to one. This is currently open-coded in all file systems supporting FS DAX. Move the idle detection to a common function to make future changes easier. Link: https://lkml.kernel.org/r/c2c9d269110b90224eeb1dc661ffbc1d82aa20c9.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Theodore Ts'o <tytso@mit.edu> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-18ext4: clear DISCARD flag if device does not support discardDiangang Li
Commit 79add3a3f795e ("ext4: notify when discard is not supported") noted that the DISCARD flag is kept in case the underlying device changes in the future, even without a file system remount. However, this scenario has rarely occurred in practice on the device side. Even if it does occur, it can be resolved with a remount. Clearing the DISCARD flag not only prevents confusion caused by mount options but also avoids sending unnecessary discard commands. Signed-off-by: Diangang Li <lidiangang@bytedance.com> Link: https://patch.msgid.link/20250311021310.669524-1-lidiangang@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-18ext4: reorder capability check lastChristian Göttsche
capable() calls refer to enabled LSMs whether to permit or deny the request. This is relevant in connection with SELinux, where a capability check results in a policy decision and by default a denial message on insufficient permission is issued. It can lead to three undesired cases: 1. A denial message is generated, even in case the operation was an unprivileged one and thus the syscall succeeded, creating noise. 2. To avoid the noise from 1. the policy writer adds a rule to ignore those denial messages, hiding future syscalls, where the task performs an actual privileged operation, leading to hidden limited functionality of that task. 3. To avoid the noise from 1. the policy writer adds a rule to permit the task the requested capability, while it does not need it, violating the principle of least privilege. Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250302160657.127253-2-cgoettsche@seltendoof.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
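The ordering argument above can be illustrated with a small user-space sketch. Everything here is hypothetical: `mock_capable()` merely stands in for the kernel's `capable()` and counts how often it is consulted, and the ownership test is an invented cheap check, not the condition used in ext4.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's capable(); it counts calls so
 * that the effect of evaluation order can be observed. */
static int capability_checks;
static bool privileged;

static bool mock_capable(void)
{
    capability_checks++;
    return privileged;
}

/* Capability consulted first: the (possibly audited) check runs even
 * when the cheap ownership test alone would have allowed the request. */
static bool may_use_reserved_old(unsigned uid, unsigned owner_uid)
{
    return mock_capable() || uid == owner_uid;
}

/* Capability consulted last: short-circuit evaluation skips it whenever
 * an unprivileged condition already grants the request. */
static bool may_use_reserved_new(unsigned uid, unsigned owner_uid)
{
    return uid == owner_uid || mock_capable();
}
```

With the reordered form, an unprivileged caller that passes the cheap test never reaches the capability check, so under these assumptions no denial message would be generated for it.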
2025-03-18ext4: update the comment about mb_optimize_scanZizhi Wo
Commit 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") introduces the sysfs control interface "mb_max_linear_groups" to address the performance degradation on rotational devices when the "mb_optimize_scan" feature is enabled, which may result in distant block group allocation. However, the name of the interface was incorrect in the comment in the ext4/mballoc.c file, and this patch fixes it, without further changes. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250224012005.689549-1-wozizhi@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-18ext4: remove references to bh->b_pageMatthew Wilcox (Oracle)
Buffer heads are attached to folios, not to pages. Also flush_dcache_page() is now deprecated in favour of flush_dcache_folio(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250213182303.2133205-1-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-18ext4: goto right label 'out_mmap_sem' in ext4_setattr()Baokun Li
Otherwise, if ext4_inode_attach_jinode() fails, a hung task will happen because filemap_invalidate_unlock() isn't called to unlock mapping->invalidate_lock. Like this: EXT4-fs error (device sda) in ext4_setattr:5557: Out of memory INFO: task fsstress:374 blocked for more than 122 seconds. Not tainted 6.14.0-rc1-next-20250206-xfstests-dirty #726 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:fsstress state:D stack:0 pid:374 tgid:374 ppid:373 task_flags:0x440140 flags:0x00000000 Call Trace: <TASK> __schedule+0x2c9/0x7f0 schedule+0x27/0xa0 schedule_preempt_disabled+0x15/0x30 rwsem_down_read_slowpath+0x278/0x4c0 down_read+0x59/0xb0 page_cache_ra_unbounded+0x65/0x1b0 filemap_get_pages+0x124/0x3e0 filemap_read+0x114/0x3d0 vfs_read+0x297/0x360 ksys_read+0x6c/0xe0 do_syscall_64+0x4b/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: c7fc0366c656 ("ext4: partial zero eof block on unaligned inode size extension") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Link: https://patch.msgid.link/20250213112247.3168709-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
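The class of bug fixed here, an error path that jumps past the unlock, can be sketched in user-space C. The names below (`demo_lock`, `demo_setattr`, `out_unlock`) are hypothetical and only mirror the shape of the problem; they are not the ext4 code.

```c
#include <stdbool.h>

/* Hypothetical lock that only tracks its hold count; it stands in for
 * mapping->invalidate_lock in this sketch. */
struct demo_lock {
    int held;
};

static void demo_lock_acquire(struct demo_lock *l) { l->held++; }
static void demo_lock_release(struct demo_lock *l) { l->held--; }

/* Returns 0 on success and -1 on failure. Every exit taken after the
 * lock is acquired goes through out_unlock, so the lock is always
 * released, including on the failure path. */
static int demo_setattr(struct demo_lock *l, bool attach_fails)
{
    int err = 0;

    demo_lock_acquire(l);

    if (attach_fails) {
        err = -1;
        /* Jumping to a label placed after the release would leave the
         * lock held and block every later acquirer. */
        goto out_unlock;
    }

    /* ... work done under the lock ... */

out_unlock:
    demo_lock_release(l);
    return err;
}
```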
2025-03-18ext4: fix out-of-bound read in ext4_xattr_inode_dec_ref_all()Ye Bin
There's an issue as follows: BUG: KASAN: use-after-free in ext4_xattr_inode_dec_ref_all+0x6ff/0x790 Read of size 4 at addr ffff88807b003000 by task syz-executor.0/15172 CPU: 3 PID: 15172 Comm: syz-executor.0 Call Trace: __dump_stack lib/dump_stack.c:82 [inline] dump_stack+0xbe/0xfd lib/dump_stack.c:123 print_address_description.constprop.0+0x1e/0x280 mm/kasan/report.c:400 __kasan_report.cold+0x6c/0x84 mm/kasan/report.c:560 kasan_report+0x3a/0x50 mm/kasan/report.c:585 ext4_xattr_inode_dec_ref_all+0x6ff/0x790 fs/ext4/xattr.c:1137 ext4_xattr_delete_inode+0x4c7/0xda0 fs/ext4/xattr.c:2896 ext4_evict_inode+0xb3b/0x1670 fs/ext4/inode.c:323 evict+0x39f/0x880 fs/inode.c:622 iput_final fs/inode.c:1746 [inline] iput fs/inode.c:1772 [inline] iput+0x525/0x6c0 fs/inode.c:1758 ext4_orphan_cleanup fs/ext4/super.c:3298 [inline] ext4_fill_super+0x8c57/0xba40 fs/ext4/super.c:5300 mount_bdev+0x355/0x410 fs/super.c:1446 legacy_get_tree+0xfe/0x220 fs/fs_context.c:611 vfs_get_tree+0x8d/0x2f0 fs/super.c:1576 do_new_mount fs/namespace.c:2983 [inline] path_mount+0x119a/0x1ad0 fs/namespace.c:3316 do_mount+0xfc/0x110 fs/namespace.c:3329 __do_sys_mount fs/namespace.c:3540 [inline] __se_sys_mount+0x219/0x2e0 fs/namespace.c:3514 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x67/0xd1 Memory state around the buggy address: ffff88807b002f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88807b002f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88807b003000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff88807b003080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88807b003100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff The above issue happens because ext4_xattr_delete_inode() doesn't check whether the xattr is valid when the xattr is in the inode. To solve the above issue, call xattr_check_inode() to check whether the in-inode xattr is valid. In fact, we can verify this directly in ext4_iget_extra_inode(), so that there is no divergent verification. 
Fixes: e50e5129f384 ("ext4: xattr-in-inode support") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250208063141.1539283-3-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-18ext4: introduce ITAIL helperYe Bin
Introduce ITAIL helper to get the bound of xattr in inode. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250208063141.1539283-2-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-17ext4: remove redundant function ext4_has_metadata_csumEric Biggers
Since commit f2b4fa19647e ("ext4: switch to using the crc32c library"), ext4_has_metadata_csum() is just an alias for ext4_has_feature_metadata_csum(). ext4_has_feature_metadata_csum() is generated by EXT4_FEATURE_RO_COMPAT_FUNCS and uses the regular naming convention for checking a single ext4 feature. Therefore, remove ext4_has_metadata_csum() and update all its callers to use ext4_has_feature_metadata_csum() directly. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/20250207031335.42637-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-17ext4: verify fast symlink lengthJan Kara
Verify fast symlink length stored in inode->i_size matches the string stored in the inode to avoid surprises from corrupted filesystems. Reported-by: syzbot+48a99e426f29859818c0@syzkaller.appspotmail.com Tested-by: syzbot+48a99e426f29859818c0@syzkaller.appspotmail.com Fixes: bae80473f7b0 ("ext4: use inode_set_cached_link()") Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/20250206094454.20522-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
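A check of this kind can be sketched as follows. This is a minimal user-space illustration, assuming the link target is stored as a NUL-terminated string in a fixed-size inline area; `demo_symlink_len_ok()` and its parameters are hypothetical names, not the ext4 implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical validation: the recorded size must equal the length of
 * the NUL-terminated string actually present in the inline area. A
 * missing terminator or any mismatch is treated as corruption. */
static bool demo_symlink_len_ok(const char *inline_area, size_t area_size,
                                size_t stored_size)
{
    const char *nul = memchr(inline_area, '\0', area_size);

    if (nul == NULL)
        return false;   /* no terminator inside the area */
    return (size_t)(nul - inline_area) == stored_size;
}
```

Rejecting the inode when the two lengths disagree means later code can trust the recorded size without rescanning the string.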
2025-03-16fs: convert block_commit_write() to take a folioMatthew Wilcox (Oracle)
All callers now have a folio, so pass it in instead of converting folio->page->folio. Link: https://lkml.kernel.org/r/20250217192009.437916-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16ext4: ignore xattrs past endBhupesh
Once inside 'ext4_xattr_inode_dec_ref_all' we should ignore xattrs entries past the 'end' entry. This fixes the following KASAN reported issue: ================================================================== BUG: KASAN: slab-use-after-free in ext4_xattr_inode_dec_ref_all+0xb8c/0xe90 Read of size 4 at addr ffff888012c120c4 by task repro/2065 CPU: 1 UID: 0 PID: 2065 Comm: repro Not tainted 6.13.0-rc2+ #11 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x1fd/0x300 ? tcp_gro_dev_warn+0x260/0x260 ? _printk+0xc0/0x100 ? read_lock_is_recursive+0x10/0x10 ? irq_work_queue+0x72/0xf0 ? __virt_addr_valid+0x17b/0x4b0 print_address_description+0x78/0x390 print_report+0x107/0x1f0 ? __virt_addr_valid+0x17b/0x4b0 ? __virt_addr_valid+0x3ff/0x4b0 ? __phys_addr+0xb5/0x160 ? ext4_xattr_inode_dec_ref_all+0xb8c/0xe90 kasan_report+0xcc/0x100 ? ext4_xattr_inode_dec_ref_all+0xb8c/0xe90 ext4_xattr_inode_dec_ref_all+0xb8c/0xe90 ? ext4_xattr_delete_inode+0xd30/0xd30 ? __ext4_journal_ensure_credits+0x5f0/0x5f0 ? __ext4_journal_ensure_credits+0x2b/0x5f0 ? inode_update_timestamps+0x410/0x410 ext4_xattr_delete_inode+0xb64/0xd30 ? ext4_truncate+0xb70/0xdc0 ? ext4_expand_extra_isize_ea+0x1d20/0x1d20 ? __ext4_mark_inode_dirty+0x670/0x670 ? ext4_journal_check_start+0x16f/0x240 ? ext4_inode_is_fast_symlink+0x2f2/0x3a0 ext4_evict_inode+0xc8c/0xff0 ? ext4_inode_is_fast_symlink+0x3a0/0x3a0 ? do_raw_spin_unlock+0x53/0x8a0 ? ext4_inode_is_fast_symlink+0x3a0/0x3a0 evict+0x4ac/0x950 ? proc_nr_inodes+0x310/0x310 ? trace_ext4_drop_inode+0xa2/0x220 ? _raw_spin_unlock+0x1a/0x30 ? iput+0x4cb/0x7e0 do_unlinkat+0x495/0x7c0 ? try_break_deleg+0x120/0x120 ? 0xffffffff81000000 ? __check_object_size+0x15a/0x210 ? strncpy_from_user+0x13e/0x250 ? 
getname_flags+0x1dc/0x530 __x64_sys_unlinkat+0xc8/0xf0 do_syscall_64+0x65/0x110 entry_SYSCALL_64_after_hwframe+0x67/0x6f RIP: 0033:0x434ffd Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8 RSP: 002b:00007ffc50fa7b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000107 RAX: ffffffffffffffda RBX: 00007ffc50fa7e18 RCX: 0000000000434ffd RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 00007ffc50fa7be0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 R13: 00007ffc50fa7e08 R14: 00000000004bbf30 R15: 0000000000000001 </TASK> The buggy address belongs to the object at ffff888012c12000 which belongs to the cache filp of size 360 The buggy address is located 196 bytes inside of freed 360-byte region [ffff888012c12000, ffff888012c12168) The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12c12 head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x40(head|node=0|zone=0) page_type: f5(slab) raw: 0000000000000040 ffff888000ad7640 ffffea0000497a00 dead000000000004 raw: 0000000000000000 0000000000100010 00000001f5000000 0000000000000000 head: 0000000000000040 ffff888000ad7640 ffffea0000497a00 dead000000000004 head: 0000000000000000 0000000000100010 00000001f5000000 0000000000000000 head: 0000000000000001 ffffea00004b0481 ffffffffffffffff 0000000000000000 head: 0000000000000002 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888012c11f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888012c12000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff888012c12080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888012c12100: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc ffff888012c12180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 
================================================================== Reported-by: syzbot+b244bda78289b00204ed@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b244bda78289b00204ed Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Signed-off-by: Bhupesh <bhupesh@igalia.com> Link: https://patch.msgid.link/20250128082751.124948-2-bhupesh@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
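The idea of ignoring entries past 'end' can be sketched with a bounded walk over a buffer. The layout below is deliberately simplified and hypothetical: real ext4 xattr entries are variable length, whereas `struct demo_entry` is fixed size and uses a zero `value_inum` as the list terminator.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size entry used only for this sketch. */
struct demo_entry {
    uint32_t value_inum;   /* 0 marks the end of the list */
    uint32_t pad;
};

/* Counts the entries before the terminator without ever reading at or
 * beyond 'end'. A whole entry must fit in [p, end) before it is read. */
static size_t demo_count_entries(const void *first, const void *end)
{
    const unsigned char *p = first;
    const unsigned char *e = end;
    size_t n = 0;

    while (p <= e && (size_t)(e - p) >= sizeof(struct demo_entry)) {
        struct demo_entry ent;

        memcpy(&ent, p, sizeof(ent));
        if (ent.value_inum == 0)
            break;          /* terminator reached */
        n++;
        p += sizeof(ent);
    }
    return n;
}
```

If a corrupted list lacks its terminator, the walk still stops at the bound instead of running into unrelated memory.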
2025-03-16ext4: remove unused input "inode" in ext4_find_dest_deKemeng Shi
Remove unused input "inode" in ext4_find_dest_de. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250123162050.2114499-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-16ext4: remove unneeded forward declaration in namei.cKemeng Shi
Remove unneeded forward declaration in namei.c Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250123162050.2114499-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-16ext4: add missing brelse() for bh2 in ext4_dx_add_entry()Kemeng Shi
Add missing brelse() for bh2 in ext4_dx_add_entry(). Fixes: ac27a0ec112a ("[PATCH] ext4: initial copy of files from ext3") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250123162050.2114499-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13Revert "ext4: add pre-content fsnotify hook for DAX faults"Amir Goldstein
This reverts commit bb480760ffc7018e21ee6f60241c2b99ff26ee0e. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250312073852.2123409-3-amir73il@gmail.com
2025-03-13ext4: show 'shutdown' hint when ext4 is forced to shutdownBaokun Li
Now, if dmesg is cleared, we have no way of knowing if the file system has been shut down. Moreover, ext4 allows directory reads even after the file system has been shut down, so when reading a file returns -EIO, we cannot determine whether this is a hardware issue or if the file system has been shut down. Therefore, when an ext4 file system is shut down, we're adding a 'shutdown' hint to commands like mount so users can easily check the file system's status. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122114130.229709-8-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: show 'emergency_ro' when EXT4_FLAGS_EMERGENCY_RO is setBaokun Li
After commit d3476f3dad4a ("ext4: don't set SB_RDONLY after filesystem errors") in v6.12-rc1, the 'errors=remount-ro' mode no longer sets SB_RDONLY on errors, which results in us seeing the filesystem is still in rw state after errors. Therefore, after setting EXT4_FLAGS_EMERGENCY_RO, display the emergency_ro option so that users can query whether the current file system has become emergency read-only due to errors through commands such as 'mount' or 'cat /proc/fs/ext4/sdx/options'. Fixes: d3476f3dad4a ("ext4: don't set SB_RDONLY after filesystem errors") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122114130.229709-7-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: correct behavior under errors=remount-ro modeBaokun Li
And after commit 95257987a638 ("ext4: drop EXT4_MF_FS_ABORTED flag") in v6.6-rc1, the EXT4_FLAGS_SHUTDOWN bit is set in ext4_handle_error() under errors=remount-ro mode. This causes the read to fail even when the error is triggered in errors=remount-ro mode. To correct the behavior under errors=remount-ro, EXT4_FLAGS_SHUTDOWN is replaced by the newly introduced EXT4_FLAGS_EMERGENCY_RO. This new flag only prevents writes, matching the previous behavior with SB_RDONLY. Fixes: 95257987a638 ("ext4: drop EXT4_MF_FS_ABORTED flag") Closes: https://lore.kernel.org/all/22d652f6-cb3c-43f5-b2fe-0a4bb6516a04@huawei.com/ Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250122114130.229709-6-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: add more ext4_emergency_state() checks around sb_rdonly()Baokun Li
Some functions check sb_rdonly() to make sure the file system isn't modified after it's read-only. Since we also don't want the file system modified if it's in an emergency state (shutdown or emergency_ro), we're adding additional ext4_emergency_state() checks where sb_rdonly() is checked. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250122114130.229709-5-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: add ext4_emergency_state() helper functionBaokun Li
Since both SHUTDOWN and EMERGENCY_RO are emergency states of the ext4 file system, and they are checked in similar locations, we have added a helper function, ext4_emergency_state(), to determine whether the current file system is in one of these two emergency states. Then, replace calls to ext4_forced_shutdown() with ext4_emergency_state() in those functions that could potentially trigger write operations. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122114130.229709-4-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
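A helper of this shape can be sketched in user-space C. The `demo_*` names, the bit numbers, and the choice of `-EIO` and `-EROFS` as return codes are assumptions made for this illustration rather than a copy of the ext4 definitions.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bit numbers, expressed as an enum. */
enum {
    DEMO_FLAGS_SHUTDOWN,
    DEMO_FLAGS_EMERGENCY_RO,
};

/* Hypothetical superblock info holding the flag word. */
struct demo_sb {
    unsigned long flags;
};

static bool demo_forced_shutdown(const struct demo_sb *sb)
{
    return (sb->flags >> DEMO_FLAGS_SHUTDOWN) & 1UL;
}

static bool demo_emergency_ro(const struct demo_sb *sb)
{
    return (sb->flags >> DEMO_FLAGS_EMERGENCY_RO) & 1UL;
}

/* Returns 0 when the file system may be modified, otherwise a negative
 * errno describing which emergency state applies. Shutdown is checked
 * first because it is the stricter of the two states. */
static int demo_emergency_state(const struct demo_sb *sb)
{
    if (demo_forced_shutdown(sb))
        return -EIO;
    if (demo_emergency_ro(sb))
        return -EROFS;
    return 0;
}
```

Paths that may write call the combined helper, while read paths can keep checking only the shutdown bit, so an emergency read-only state does not block reads.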
2025-03-13ext4: add EXT4_FLAGS_EMERGENCY_RO bitBaokun Li
EXT4_FLAGS_EMERGENCY_RO indicates that the current file system has become read-only due to some error. Compared to SB_RDONLY, setting it does not require a lock because we won't clear it, which avoids over-coupling with vfs freeze. Also, add a helper function ext4_emergency_ro() to check if the bit is set. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122114130.229709-3-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: convert EXT4_FLAGS_* defines to enumBaokun Li
Do away with the defines and use an enum as it's cleaner. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122114130.229709-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: pack holes in ext4_inode_infoBaokun Li
When CONFIG_DEBUG_SPINLOCK is not enabled (the general case), there are four 4-byte holes and one 2-byte hole in struct ext4_inode_info. Move the members to pack the four 4-byte holes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122110533.4116662-10-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
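How reordering members removes padding holes can be shown with two toy structs. These are hypothetical layouts, not ext4_inode_info, and the exact sizes depend on the ABI; on a typical 64-bit ABI each 4-byte member that precedes an 8-byte member is followed by a 4-byte hole.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with holes: where uint64_t needs 8-byte
 * alignment, each uint32_t is followed by 4 bytes of padding. */
struct demo_holes {
    uint32_t a;
    uint64_t b;
    uint32_t c;
    uint64_t d;
};

/* Same members, reordered so the two 4-byte fields share one 8-byte
 * slot and no padding is needed. */
struct demo_packed {
    uint64_t b;
    uint64_t d;
    uint32_t a;
    uint32_t c;
};

/* Number of bytes saved by the reordering on the current ABI. */
static size_t demo_bytes_saved(void)
{
    return sizeof(struct demo_holes) - sizeof(struct demo_packed);
}
```

Tools such as pahole report these holes directly for real kernel structures; the sketch only demonstrates the principle.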
2025-03-13ext4: remove unused member 'i_unwritten' from 'ext4_inode_info'Baokun Li
After commit 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure"), no one cares about the value of i_unwritten, so there is no need to maintain this variable, remove it, and clean up the associated logic. Suggested-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122110533.4116662-9-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13jbd2: drop JBD2_ABORT_ON_SYNCDATA_ERRBaokun Li
Since ext4's data_err=abort mode doesn't depend on JBD2_ABORT_ON_SYNCDATA_ERR anymore, and nobody else uses it, we can drop it and only warn in jbd2 as it used to be long ago. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122110533.4116662-7-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: abort journal on data writeback failure if in data_err=abort modeBaokun Li
The data_err=abort option was initially introduced to address users' worries about data corruption spreading unnoticed. With direct writes, we can rely on return values to confirm successful writes to disk. But with buffered writes, a successful return only means the data has been written to memory. Users have no way of knowing if the data has actually been written to disk unless they use fsync (which impacts performance and can sometimes miss errors). The current data_err=abort implementation relies on the ordered data list, but past changes have inadvertently altered its behavior. For example, if an extent is unwritten, we do not add the inode to the ordered data list. Therefore, jbd2 will not wait for the data write-back of that inode to complete and check for errors in the inode mapping. Moreover, the checks performed by jbd2 can also miss errors. Now, all buffered writes eventually call ext4_end_bio(), where I/O errors are checked. Therefore, we can check for the data_err=abort mode at this point and abort the journal in a kworker (due to the interrupt context). Therefore, when data_err=abort is enabled, the journal is aborted in ext4_end_io_end() when an I/O error is detected in ext4_end_bio() to make users who are concerned about the contents of the file happy. Suggested-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/c7ab26f3-85ad-4b31-b132-0afb0e07bf79@huawei.com Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250122110533.4116662-6-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: extract ext4_has_journal_option() from __ext4_fill_super()Baokun Li
Extract the ext4_has_journal_option() helper function to reduce code duplication. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250122110533.4116662-5-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: reject the 'data_err=abort' option in nojournal modeBaokun Li
data_err=abort aborts the journal on I/O errors. However, this option is meaningless if journal is disabled, so it is rejected in nojournal mode to reduce unnecessary checks. Also, this option is ignored upon remount. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250122110533.4116662-4-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: do not convert the unwritten extents if data writeback failsBaokun Li
When dioread_nolock is turned on (the default), it will convert unwritten extents to written at ext4_end_io_end(), even if the data writeback fails. It leads to the possibility that stale data may be exposed when the physical block corresponding to the file data is read-only (i.e., writes return -EIO, but reads are normal). Therefore a new ext4_io_end->flags value, EXT4_IO_END_FAILED, is added, which indicates that some bio write-back failed in the current ext4_io_end. When this flag is set, the unwritten to written conversion is no longer performed. Users can read the data normally until the caches are dropped. After that, the failed extents read back as all zeros. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122110533.4116662-3-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
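The flag-based flow described above can be sketched as a two-step completion path. The `demo_*` names, flag values, and the boolean `converted` field are hypothetical simplifications; the real code works on extents and bios.

```c
#include <stdbool.h>

/* Hypothetical flag values for this sketch. */
#define DEMO_IO_END_UNWRITTEN 0x1u
#define DEMO_IO_END_FAILED    0x2u

/* Hypothetical I/O completion record. */
struct demo_io_end {
    unsigned int flags;
    bool converted;     /* true once the extent is marked written */
};

/* Called once per completed bio: remember that a write failed. */
static void demo_end_bio(struct demo_io_end *io, int error)
{
    if (error)
        io->flags |= DEMO_IO_END_FAILED;
}

/* Called once all bios are done. On failure the extent stays unwritten,
 * so a later read from disk sees zeros rather than stale block contents.
 * Returns 0 on success and -1 if any write-back failed. */
static int demo_end_io_end(struct demo_io_end *io)
{
    if (io->flags & DEMO_IO_END_FAILED)
        return -1;
    if (io->flags & DEMO_IO_END_UNWRITTEN)
        io->converted = true;
    return 0;
}
```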
2025-03-13ext4: replace opencoded ext4_end_io_end() in ext4_put_io_end()Baokun Li
This reduces duplicate code and ensures that a “potential data loss” warning is available if the unwritten conversion fails. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122110533.4116662-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: fix potential null dereference in ext4 kunit testCharles Han
kunit_kzalloc() may return a NULL pointer, dereferencing it without NULL check may lead to NULL dereference. Add a NULL check for grp. Fixes: ac96b56a2fbd ("ext4: Add unit test for mb_mark_used") Fixes: b7098e1fa7bc ("ext4: Add unit test for mb_free_blocks") Signed-off-by: Charles Han <hanchunchao@inspur.com> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20250110092421.35619-1-hanchunchao@inspur.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: Refactor out ext4_try_to_write_inline_data()Julian Sun
Refactor ext4_try_to_write_inline_data() to simplify its implementation by directly invoking ext4_generic_write_inline_data(). Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250107045730.1837808-1-sunjunchao2870@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>