Age | Commit message (Collapse) | Author |
|
Commit f3e2e53823b9 ("xfs: add inode to zone caching for data placement")
add the new code right between xfs_submit_zoned_bio and
xfs_zone_alloc_and_submit which implement the main zoned write path.
Move xfs_submit_zoned_bio down to keep it together again.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Ensure the file system hasn't been shut down before waiting for a free
zone to become available, because that won't happen on a shut down
file system. Without this processes can occasionally get stuck in
the allocator wait loop when racing with a file system shutdown.
This sporadically happens when running generic/388 or generic/475.
Fixes: 4e4d52075577 ("xfs: add the zoned space allocator")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Placing data from the same file in the same zone is a great heuristic
for reducing write amplification and we do this already - but only
for sequential writes.
To support placing data in the same way for random writes, reuse the
xfs mru cache to map inodes to open zones on first write. If a mapping
is present, use the open zone for data placement for this file until
the zone is full.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Presently we start garbage collection late - when we start running
out of free zones to backfill max_open_zones. This is a reasonable
default as it minimizes write amplification. The longer we wait,
the more blocks are invalidated and reclaim cost less in terms
of blocks to relocate.
Starting this late however introduces a risk of GC being outcompeted
by user writes. If GC can't keep up, user writes will be forced to
wait for free zones with high tail latencies as a result.
This is not a problem under normal circumstances, but if fragmentation
is bad and user write pressure is high (multiple full-throttle
writers) we will "bottom out" of free zones.
To mitigate this, introduce a zonegc_low_space tunable that lets the
user specify a percentage of how much of the unused space that GC
should keep available for writing. A high value will reclaim more of
the space occupied by unused blocks, creating a larger buffer against
write bursts.
This comes at a cost as write amplification is increased. To
illustrate this using a sample workload, setting zonegc_low_space to
60% avoids high (500ms) max latencies while increasing write
amplification by 15%.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
xfs_zoned_wake_all checks SB_ACTIVE to make sure it does the right thing
when a shutdown happens during unmount, but it fails to account for the
log recovery special case that sets SB_ACTIVE temporarily. Add a NULL
check to cover both cases.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[hch: added a commit log and comment]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Add a file write life time data placement allocation scheme that aims to
minimize fragmentation and thereby to do two things:
a) separate file data to different zones when possible.
b) colocate file data of similar life times when feasible.
To get best results, average file sizes should align with the zone
capacity that is reported through the XFS_IOC_FSGEOMETRY ioctl.
This improvement in data placement efficiency reduces the number of
blocks requiring relocation by GC, and thus decreases overall write
amplification. The impact on performance varies depending on how full
the file system is.
For RocksDB using leveled compaction, the lifetime hints can improve
throughput for overwrite workloads at 80% file system utilization by
~10%, but for lower file system utilization there won't be as much
benefit in application performance as there is less need for garbage
collection to start with.
Lifetime hints can be disabled using the nolifetime mount option.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
RT groups on a zoned file system need to be completely empty before their
space can be reused. This means that partially empty groups need to be
emptied entirely to free up space if no entirely free groups are
available.
Add a garbage collection thread that moves all data out of the least used
zone when not enough free zones are available, and which resets all zones
that have been emptied. To find empty zone a simple set of 10 buckets
based on the amount of space used in the zone is used. To empty zones,
the rmap is walked to find the owners and the data is read and then
written to the new place.
To automatically defragment files the rmap records are sorted by inode
and logical offset. This means defragmentation of parallel writes into
a single zone happens automatically when performing garbage collection.
Because holding the iolock over the entire GC cycle would inject very
noticeable latency for other accesses to the inodes, the iolock is not
taken while performing I/O. Instead the I/O completion handler checks
that the mapping hasn't changed over the one recorded at the start of
the GC cycle and doesn't update the mapping if it change.
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
For zoned file systems garbage collection (GC) has to take the iolock
and mmaplock after moving data to a new place to synchronize with
readers. This means waiting for garbage collection with the iolock can
deadlock.
To avoid this, the worst case required blocks have to be reserved before
taking the iolock, which is done using a new RTAVAILABLE counter that
tracks blocks that are free to write into and don't require garbage
collection. The new helpers try to take these available blocks, and
if there aren't enough available it wakes and waits for GC. This is
done using a list of on-stack reservations to ensure fairness.
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|
|
For zoned RT devices space is always allocated at the write pointer, that
is right after the last written block and only recorded on I/O completion.
Because the actual allocation algorithm is very simple and just involves
picking a good zone - preferably the one used for the last write to the
inode. As the number of zones that can written at the same time is
usually limited by the hardware, selecting a zone is done as late as
possible from the iomap dio and buffered writeback bio submissions
helpers just before submitting the bio.
Given that the writers already took a reservation before acquiring the
iolock, space will always be readily available if an open zone slot is
available. A new structure is used to track these open zones, and
pointed to by the xfs_rtgroup. Because zoned file systems don't have
a rsum cache the space for that pointer can be reused.
Allocations are only recorded at I/O completion time. The scheme used
for that is very similar to the reflink COW end I/O path.
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
|