summaryrefslogtreecommitdiff
path: root/fs/xfs/xfs_zone_alloc.c
AgeCommit message (Collapse)Author
2025-06-16xfs: move xfs_submit_zoned_bio a bitChristoph Hellwig
Commit f3e2e53823b9 ("xfs: add inode to zone caching for data placement") add the new code right between xfs_submit_zoned_bio and xfs_zone_alloc_and_submit which implement the main zoned write path. Move xfs_submit_zoned_bio down to keep it together again. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: check for shutdown before going to sleep in xfs_select_zoneChristoph Hellwig
Ensure the file system hasn't been shut down before waiting for a free zone to become available, because that won't happen on a shut down file system. Without this processes can occasionally get stuck in the allocator wait loop when racing with a file system shutdown. This sporadically happens when running generic/388 or generic/475. Fixes: 4e4d52075577 ("xfs: add the zoned space allocator") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: add inode to zone caching for data placementHans Holmberg
Placing data from the same file in the same zone is a great heuristic for reducing write amplification and we do this already - but only for sequential writes. To support placing data in the same way for random writes, reuse the xfs mru cache to map inodes to open zones on first write. If a mapping is present, use the open zone for data placement for this file until the zone is full. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-14xfs: add tunable threshold parameter for triggering zone GCHans Holmberg
Presently we start garbage collection late - when we start running out of free zones to backfill max_open_zones. This is a reasonable default as it minimizes write amplification. The longer we wait, the more blocks are invalidated and reclaim cost less in terms of blocks to relocate. Starting this late however introduces a risk of GC being outcompeted by user writes. If GC can't keep up, user writes will be forced to wait for free zones with high tail latencies as a result. This is not a problem under normal circumstances, but if fragmentation is bad and user write pressure is high (multiple full-throttle writers) we will "bottom out" of free zones. To mitigate this, introduce a zonegc_low_space tunable that lets the user specify a percentage of how much of the unused space that GC should keep available for writing. A high value will reclaim more of the space occupied by unused blocks, creating a larger buffer against write bursts. This comes at a cost as write amplification is increased. To illustrate this using a sample workload, setting zonegc_low_space to 60% avoids high (500ms) max latencies while increasing write amplification by 15%. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18xfs: don't wake zone space waiters without m_zone_infoDarrick J. Wong
xfs_zoned_wake_all checks SB_ACTIVE to make sure it does the right thing when a shutdown happens during unmount, but it fails to account for the log recovery special case that sets SB_ACTIVE temporarily. Add a NULL check to cover both cases. Signed-off-by: Darrick J. Wong <djwong@kernel.org> [hch: added a commit log and comment] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-03xfs: support write life time based data placementHans Holmberg
Add a file write life time data placement allocation scheme that aims to minimize fragmentation and thereby to do two things: a) separate file data to different zones when possible. b) colocate file data of similar life times when feasible. To get best results, average file sizes should align with the zone capacity that is reported through the XFS_IOC_FSGEOMETRY ioctl. This improvement in data placement efficiency reduces the number of blocks requiring relocation by GC, and thus decreases overall write amplification. The impact on performance varies depending on how full the file system is. For RocksDB using leveled compaction, the lifetime hints can improve throughput for overwrite workloads at 80% file system utilization by ~10%, but for lower file system utilization there won't be as much benefit in application performance as there is less need for garbage collection to start with. Lifetime hints can be disabled using the nolifetime mount option. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: implement zoned garbage collectionChristoph Hellwig
RT groups on a zoned file system need to be completely empty before their space can be reused. This means that partially empty groups need to be emptied entirely to free up space if no entirely free groups are available. Add a garbage collection thread that moves all data out of the least used zone when not enough free zones are available, and which resets all zones that have been emptied. To find empty zone a simple set of 10 buckets based on the amount of space used in the zone is used. To empty zones, the rmap is walked to find the owners and the data is read and then written to the new place. To automatically defragment files the rmap records are sorted by inode and logical offset. This means defragmentation of parallel writes into a single zone happens automatically when performing garbage collection. Because holding the iolock over the entire GC cycle would inject very noticeable latency for other accesses to the inodes, the iolock is not taken while performing I/O. Instead the I/O completion handler checks that the mapping hasn't changed over the one recorded at the start of the GC cycle and doesn't update the mapping if it change. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: add support for zoned space reservationsChristoph Hellwig
For zoned file systems garbage collection (GC) has to take the iolock and mmaplock after moving data to a new place to synchronize with readers. This means waiting for garbage collection with the iolock can deadlock. To avoid this, the worst case required blocks have to be reserved before taking the iolock, which is done using a new RTAVAILABLE counter that tracks blocks that are free to write into and don't require garbage collection. The new helpers try to take these available blocks, and if there aren't enough available it wakes and waits for GC. This is done using a list of on-stack reservations to ensure fairness. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: add the zoned space allocatorChristoph Hellwig
For zoned RT devices space is always allocated at the write pointer, that is right after the last written block and only recorded on I/O completion. Because the actual allocation algorithm is very simple and just involves picking a good zone - preferably the one used for the last write to the inode. As the number of zones that can written at the same time is usually limited by the hardware, selecting a zone is done as late as possible from the iomap dio and buffered writeback bio submissions helpers just before submitting the bio. Given that the writers already took a reservation before acquiring the iolock, space will always be readily available if an open zone slot is available. A new structure is used to track these open zones, and pointed to by the xfs_rtgroup. Because zoned file systems don't have a rsum cache the space for that pointer can be reused. Allocations are only recorded at I/O completion time. The scheme used for that is very similar to the reflink COW end I/O path. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>