tag name: online-fsck-design_2022-10-14 (fbdbb9cf600265b5226d493feeb523e0c45f1a29)
tag date: 2022-10-14 14:18:12 -0700
tagged by: Darrick J. Wong <djwong@kernel.org>
tagged object: commit 398facc0f7...
xfs: design documentation for online fsck
After six years of development and a nearly two-year hiatus from patchbombing, I think it is time to resume the process of merging the online fsck feature into XFS. The full patchset comprises 105 separate patchsets that capture 470 patches across the kernel, xfsprogs, and fstests projects. I would like to merge this feature into upstream in time for the 2023 LTS kernel.

As of 5.15 (aka last year's LTS), we have merged all generally useful infrastructure improvements into the regular filesystem. The only changes to the core filesystem that remain are the ones that are only useful to online fsck itself. In other words, the vast majority of the new code in the patchsets comprising the online fsck feature is self-contained and can be turned off via Kconfig.

Many of you readers might be wondering -- why have I chosen to make one large submission with 100+ patchsets comprising ~500 patches? Why didn't I merge small pieces of functionality bit by bit and revise common code as necessary? Well, the simple answer is that in the past six years, the fundamental algorithms have been revised repeatedly as I've built out the functionality. In other words, the codebase as it stands has the benefit that I now know every piece that's necessary to get the job done in a reasonable manner and within the constraints laid out by community reviews. I believe this has reduced code churn in mainline and freed up my time so that I can iterate faster.

As a concession to the mail servers, I'm breaking up the submission into smaller pieces; I'm only pushing the design document and the revisions to the existing scrub code, which is the first 20% of the patches. Also, I'm arbitrarily restarting the version numbering by reversioning all patchsets from version 22 to epoch 23, version 1.

The big question to everyone reading this is: How might I convince you that there is more merit in merging the whole feature and dealing with the consequences than continuing to maintain it out of tree?
---------

To prepare the XFS community and potential patch reviewers for the upstream submission of the online fsck feature, I decided to write a document capturing the broader picture behind the online repair development effort. The document begins by defining the problems that online fsck aims to solve and outlining specific use cases for the functionality.

Using that as a base, the rest of the design document presents the high level algorithms that fulfill the goals set out at the start and the interactions between the large pieces of the system. Case studies round out the design documentation by adding the details of exactly how specific parts of the online fsck code integrate the algorithms with the filesystem.

The goal of this effort is to help the XFS community understand how the gigantic online repair patchset works. The questions I submit to the community reviewers are:

1. As you read the design doc (and later the code), do you feel that you understand what's going on well enough to try to fix a bug if you found one?

2. What sorts of interactions between systems (or between scrub and the rest of the kernel) am I missing?

3. Do you feel confident enough in the implementation as it is now that the benefits of merging the feature (as EXPERIMENTAL) outweigh any potential disruptions to XFS at large?

4. Are there problematic interactions between subsystems that ought to be cleared up before merging?

I intend to commit this document to the kernel's documentation directory when we start merging the patchset, albeit without the links to git.kernel.org. A much more readable version of this is posted at:

https://djwong.org/docs/xfs-online-fsck-design/

v2: add missing sections about: all the in-kernel data structures and new APIs that the scrub and repair functions use; how xattrs and directories are checked; how space btree records are checked; and add more details to the parts where all these bits tie together.
Proofread for verb tense inconsistencies and eliminate vague 'we' usage. Move all the discussion of what we can do with pageable kernel memory into a single source file and section. Document where log incompat feature locks fit into the locking model.

v3: resync with 6.0, fix a few typos, begin discussion of the merging plan for this megapatchset.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmNJ0hQACgkQ+H93GTRK
tOvt0hAAo/aWvGUBYYh2HNj5o5BChlOreoC+PdUqAFSL+ySpOonMLqSdOiF1t7G8
HJOUpINVPAEaG2He9xaZ37PhqtV+NmMFHHqhx5B7TEL/1GXJX8j6drbQ58WNctaz
ZejL8zI/E8mtFrsw6WdBP7/Utjc1GDC86tZAhfhEW77CTsrvKq5NboZcLrK+hcSt
0nkAv7ZA2tC0USgoi/AZPR//GE4zpxAW5rA2Si5I3ggNTyqbX5O251CYgQ0QibTr
F9FV/85TSowAcZdJhWkxZvh2Lm1MrmTMcIdA/jLKOEcY6pxODm8lP8cdXDino5rt
NXSpjdGNYcm4GZiQUJrl2LyyM/hqM90eT0KN1RUlKVb4siiv8KbUaw1mvddBBse5
fo1QWaD2Cdlt/3qRyJXwdNLLf9cRP7XhtWmBIDYQJvPLmibPnYlL2YVe3awAT/9E
BNSOh6F9Q4vVGmQT6H1+ZtDfz+15df8Dlz4JmUSru4Z+CTpFmJPZk9W9yj1iw+dR
S7K3xEKucFWAgCGuqrMmfk+IxvyeGgG+Ege1ijrOyVrE3h72D2VEbZhk317fbkLq
SSTa35Hem18ztt2XvDkyzGhih6dAxjWp6SU0fJGq5RUMTGlBdwACeyHTOhZ07S63
uaBsyG7bfrriRIZ06gW66ycZcU1HHtbqdVrxYiL+MMNrq09GHXk=
=8Hxf
-----END PGP SIGNATURE-----