tag name | deferred-inactivation-5.14_2021-06-13 (019c4542cbffa67250fcdc6d4d6ba93aa083e412) |
tag date | 2021-06-13 14:48:40 -0700 |
tagged by | Darrick J. Wong <djwong@kernel.org> |
tagged object | commit 285d819ad3... |
xfs: deferred inode inactivation
This patch series implements deferred inode inactivation. Inactivation
is what happens when an open file loses its last incore reference: if
the file has speculative preallocations, they must be freed, and if the
file is unlinked, all forks must be truncated, and the inode marked
freed in the inode chunk and the inode btrees.
Currently, all of this activity is performed in frontend threads when
the last in-memory reference is lost and/or the vfs decides to drop the
inode. Three complaints stem from this behavior: first, that the time
to unlink (in the worst case) depends on both the complexity of the
directory and the number of extents in that file; second,
that deleting a directory tree is inefficient and seeky because we free
the inodes in readdir order, not disk order; and third, the upcoming
online repair feature needs to be able to xfs_irele while scanning a
filesystem in transaction context. It cannot perform inode inactivation
in this context because xfs does not support nested transactions.
The implementation will be familiar to those who have studied how XFS
scans for reclaimable in-core inodes -- we create a couple more inode
state flags to mark an inode as needing inactivation and being in the
middle of inactivation. When inodes need inactivation, we set
NEED_INACTIVE in iflags, set the INACTIVE radix tree tag, and schedule a
deferred work item. The deferred worker runs in an unbounded workqueue,
scanning the inode radix tree for tagged inodes to inactivate, and
performing all the on-disk metadata updates. Once the inode has been
inactivated, it is left in the reclaim state and the background reclaim
worker (or direct reclaim) will get to it eventually.
Doing the inactivations from kernel threads solves the first problem by
constraining the amount of work done by the unlink() call to removing
the directory entry. It solves the third problem by moving inactivation
to a separate process. Because the inactivations are done in order of
inode number, we solve the second problem by performing updates in (we
hope) disk order. This also decreases the amount of time it takes to
let go of an inode cluster if we're deleting entire directory trees.
There are three big warts I can think of in this series: first, because
the actual freeing of nlink==0 inodes is now done in the background,
this means that the system will be busy making metadata updates for some
time after the unlink() call returns. This temporarily reduces
available iops. Second, in order to retain the behavior that deleting
100TB of unshared data should result in a free space gain of 100TB, the
statvfs and quota reporting ioctls wait for inactivation to finish,
which increases the long tail latency of those calls. This behavior is,
unfortunately, key to not introducing regressions in fstests. The third
problem is that the deferrals keep memory usage higher for longer,
reduce opportunities to throttle the frontend when metadata load is
heavy, and the unbounded workqueues can create transaction storms.
v1-v2: NYE patchbombs
v3: rebase against 5.12-rc2 for submission.
v4: combine the can/has eofblocks predicates, clean up incore inode tree
walks, fix inobt deadlock
v5: actually freeze the inode gc threads when we freeze the filesystem,
consolidate the code that deals with inode tagging, and use
foreground inactivation during quotaoff to avoid cycling dquots
v6: rebase to 5.13-rc4, fix quotaoff not to require foreground inactivation,
refactor to use inode walk goals, use atomic bitflags to control the
scheduling of gc workers
v7: simplify the inodegc worker, which simplifies how flushes work, break
up the patch into smaller pieces, flush inactive inodes on syncfs to
simplify freeze/ro-remount handling, separate inode selection filtering
in iget, refactor inode recycling further, change gc delay to 100ms,
decrease the gc delay when space or quota are low, move most of the
destroy_inode logic to mark_reclaimable, get rid of the fallocate flush
scan thing, get rid of polled flush mode
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmDGfTgACgkQ+H93GTRK
tOv8mxAAniXAzVOhw8N6kra/hP3iQRGhfeZ38SouRtNbTJhNebNav4fRF7i1bmIC
uoQXO+cO5dSdNUOeuYwwJZE5PrwdX2Wn3vTtyeDnE/ArcXO34Z8TOY+Wh58IcXRR
BtWfElowuGrfu9IwYKwvTKjeDUytm3G7e6HApszF0mGEi1MRPE6aVYL8Ro6CP1NZ
+UrFoQUYY/k7ByXdHjx/Douz04nh+IJwDFeCSChWYebx4NgTf2Ou+LCJx3khx1jH
c3D5fp2FDWwMLH3+CiYcDRxtd5tAGVq0dTnLNCGTIsO89Bd1MI7F8sVNmu27uHlJ
IwLptC2w/UcjNrLuj99rESPiyEOCRiEJyvTd4+NfriBRm19RDPh/EPNEH9pAgJyw
vJE4xJ5zDnKWwY5XjwDYPahAHXepL1qyszSoIg/0Cjn1+WDbbKo6ZeSGKF6jrQqe
p0D83ZeBMTZkr2/HJzrqNSFwdz8WGXcl/Bcepswgy7htBzwTETNn2EpWzs2bL8Ye
CDcXcIGimcmIrzYp5qWuDA5KkqXaYTurB0vfwNg0qbqINZqfeaTUsCuTWatkSk7Q
9ctt0qt2+TuhvuzRv0uTkkSksk7zh+sE0VP1cD/3IA5wKoJ8J6CR5IdjJPhq4MjB
RL/cRrj4VquDcMNhl7AgzE1wAaTzx8SblpXLne4qNoGld7sd2LI=
=ff7C
-----END PGP SIGNATURE-----