bcachefs.git

tag name	deferred-inactivation-5.15_2021-08-06 (8ab89c5dad9e8da57b78abb13c63ee30695df5e2)
tag date	2021-08-06 17:33:52 -0700
tagged by	Darrick J. Wong <djwong@kernel.org>
tagged object	commit e2c0c5a856...

xfs: deferred inode inactivation

This patch series implements deferred inode inactivation. Inactivation is what happens when an open file loses its last incore reference: if the file has speculative preallocations, they must be freed, and if the file is unlinked, all forks must be truncated, and the inode marked freed in the inode chunk and the inode btrees.

Currently, all of this activity is performed in frontend threads when the last in-memory reference is lost and/or the vfs decides to drop the inode. Three complaints stem from this behavior: first, that the time to unlink (in the worst case) depends on both the complexity of the directory and the number of extents in that file; second, that deleting a directory tree is inefficient and seeky because we free the inodes in readdir order, not disk order; and third, the upcoming online repair feature needs to be able to xfs_irele while scanning a filesystem in transaction context. It cannot perform inode inactivation in this context because xfs does not support nested transactions.

The implementation will be familiar to those who have studied how XFS scans for reclaimable in-core inodes -- we create a couple more inode state flags to mark an inode as needing inactivation and being in the middle of inactivation. When an inode needs inactivation, we set NEED_INACTIVE in iflags and add the inode to a percpu work list. Eventually, a bounded percpu workqueue item will be scheduled to perform all the on-disk metadata updates. Once the inode has been inactivated, it is left in the reclaim state and the background reclaim worker (or direct reclaim) will get to it eventually.

Doing the inactivations from kernel threads solves the first problem by constraining the amount of work done by the unlink() call to removing the directory entry. It solves the third problem by moving inactivation to a separate process. Performing the inactivations in batches addresses the second problem: it decreases the amount of time it takes to let go of an inode cluster if we're deleting entire directory trees.

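The queue-then-drain flow described above can be sketched as a small userspace toy model. This is only an illustration under stated assumptions, not the code from the series: every name here (the `toy_` functions, types, and flag values) is hypothetical, the "percpu" lists are a plain array indexed by a caller-supplied CPU number, and there is no locking, workqueue, or transaction handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state flags; the real flag names and values may differ. */
enum {
	TOY_NEED_INACTIVE = 1u << 0,	/* queued, waiting for the worker */
	TOY_INACTIVATING  = 1u << 1,	/* worker is updating metadata */
	TOY_RECLAIMABLE   = 1u << 2,	/* done; reclaim may free it later */
};

#define TOY_NR_CPUS 2

struct toy_inode {
	unsigned int flags;
	unsigned int nlink;		/* 0 means the file was unlinked */
	unsigned int nextents;		/* stands in for on-disk extents */
	struct toy_inode *next;		/* link in the per-CPU work list */
};

/* One work list per simulated CPU (assumption: no locking needed here). */
static struct toy_inode *toy_percpu_list[TOY_NR_CPUS];

/* Frontend side: only mark the inode and queue it; no metadata updates. */
static void toy_inodegc_queue(struct toy_inode *ip, unsigned int cpu)
{
	assert(cpu < TOY_NR_CPUS);
	ip->flags |= TOY_NEED_INACTIVE;
	ip->next = toy_percpu_list[cpu];
	toy_percpu_list[cpu] = ip;
}

/* Worker side: drain one list as a batch; returns inodes processed. */
static unsigned int toy_inodegc_worker(unsigned int cpu)
{
	struct toy_inode *ip, *next;
	unsigned int done = 0;

	assert(cpu < TOY_NR_CPUS);
	ip = toy_percpu_list[cpu];
	toy_percpu_list[cpu] = NULL;	/* detach the whole batch at once */

	for (; ip != NULL; ip = next) {
		next = ip->next;
		ip->next = NULL;

		ip->flags &= ~TOY_NEED_INACTIVE;
		ip->flags |= TOY_INACTIVATING;
		if (ip->nlink == 0)
			ip->nextents = 0;	/* stands in for truncating all forks */
		ip->flags &= ~TOY_INACTIVATING;

		/* Leave the inode for background (or direct) reclaim. */
		ip->flags |= TOY_RECLAIMABLE;
		done++;
	}
	return done;
}
```

In this model the caller of toy_inodegc_queue() returns immediately with the inode still holding its extents; the expensive part happens only when toy_inodegc_worker() later drains the list for that CPU.
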
There are three big warts I can think of in this series: first, because the actual freeing of nlink==0 inodes is now done in the background, this means that the system will be busy making metadata updates for some time after the unlink() call returns. This temporarily reduces available iops. Second, in order to retain the behavior that deleting 100TB of unshared data should result in a free space gain of 100TB, the statvfs and quota reporting ioctls wait for inactivation to finish, which increases the long tail latency of those calls. This behavior is, unfortunately, key to not introducing regressions in fstests. The third problem is that the deferrals keep memory usage higher for longer. The final patch in the series (clumsily) addresses this by forcing the inodegc workers to run when memory shrinkers get called and by throttling the frontend xfs_inodegc_queue callers to wait for the worker.

v1-v2: NYE patchbombs
v3: rebase against 5.12-rc2 for submission.
v4: combine the can/has eofblocks predicates, clean up incore inode tree walks, fix inobt deadlock
v5: actually freeze the inode gc threads when we freeze the filesystem, consolidate the code that deals with inode tagging, and use foreground inactivation during quotaoff to avoid cycling dquots
v6: rebase to 5.13-rc4, fix quotaoff not to require foreground inactivation, refactor to use inode walk goals, use atomic bitflags to control the scheduling of gc workers
v7: simplify the inodegc worker, which simplifies how flushes work, break up the patch into smaller pieces, flush inactive inodes on syncfs to simplify freeze/ro-remount handling, separate inode selection filtering in iget, refactor inode recycling further, change gc delay to 100ms, decrease the gc delay when space or quota are low, move most of the destroy_inode logic to mark_reclaimable, get rid of the fallocate flush scan thing, get rid of polled flush mode
v8: rebase against 5.14-rc2, hook the memory shrinkers so that we requeue inactivation immediately when memory starts to get tight and force callers queueing inodes for inactivation to wait for the inactivation workers to run (i.e. throttling the frontend) to reduce memory storms, add hch's quotaoff removal series as a dependency to shut down arguments about quota walks
v9: replace the entire mechanism with percpu lists and workers, clean out a ton of ratty code that nobody liked anyway :P
v10: remove an unnecessary assert, and fix some naming problems in tracepoints
-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmEN1PAACgkQ+H93GTRK
tOuBtw//V8BzKQiq+ZXHGSIA0GJMSjE4bpjgzXXf+FhlZBy+mFoc2zUGHc8Txr6y
jqg5+aGfhqGiUNDDJOmWdqRMIBurmsMXeh8QYk27cibRf24voGKwsFeMBFycpuMo
Tg5pAtWEDZC9gBkVzOX9D9kf4BkqQwrgGa6KC03+3n48rDGlP3rLK9fMOyxE0gQV
ThOWWRji41yUzGmH+tksQiktHbU88VZr+5f7q8PgqR3rfUG1v1BPSH0mqbZbEWKA
bYV6q13iN8vUvDj+kDtQKOZoW3L2Av26ofIXhAfU4StcunJv7S/tcaSVRlgojMox
RttBO5gaH7dGTRXJKaWt1PUOYE9GLT9sm9Ew9SVUlmCn8VHQifIVN+ZnS44Nn8F9
NAHCe4J4R/gkS90CY5072BQ5eb3CzPYcx+x9v5aah8fqtmmKwFBhIR9mypMvNCPy
kk6+63XlA/m428+0cjsd0FhG9T9dVFTGxZ5o1KzWvCMzSOq18bOyD3V2K9qpoGfg
WJ5v7IKrVVFcoTW+RlZmb+90qUTWAkr6HuVVftXrMRhvx1TyegRn9CxIkXE931TY
IP6Sf4qqunS0BfF54FafKu8vad/fhI50OBnnLXqqkRVJza3XVTYkU8XsW6pOdLKe
xbOZ4ijpamhvrcrPKaaVBi3z7f1ilEAO7rn5G46OpXa+OVobL0Y=
=18lt
-----END PGP SIGNATURE-----