summaryrefslogtreecommitdiff
path: root/fs/xfs/libxfs
AgeCommit message (Collapse)Author
2019-03-19xfs: always init bma in xfs_bmapi_writeDarrick J. Wong
Always init the tp/ip fields of bma in xfs_bmapi_write so that the bmapi_finish at the bottom never trips over null transaction or inode pointers. Coverity-id: 1443964 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-03-17xfs: don't trip over uninitialized buffer on extent read of corrupted inodeBrian Foster
We've had rather rare reports of bmap btree block corruption where the bmap root block has a level count of zero. The root cause of the corruption is so far unknown. We do have verifier checks to detect this form of on-disk corruption, but this doesn't cover a memory corruption variant of the problem. The latter is a reasonable possibility because the root block is part of the inode fork and can reside in-core for some time before inode extents are read. If this occurs, it leads to a system crash such as the following: BUG: unable to handle kernel paging request at ffffffff00000221 PF error: [normal kernel read fault] ... RIP: 0010:xfs_trans_brelse+0xf/0x200 [xfs] ... Call Trace: xfs_iread_extents+0x379/0x540 [xfs] xfs_file_iomap_begin_delay+0x11a/0xb40 [xfs] ? xfs_attr_get+0xd1/0x120 [xfs] ? iomap_write_begin.constprop.40+0x2d0/0x2d0 xfs_file_iomap_begin+0x4c4/0x6d0 [xfs] ? __vfs_getxattr+0x53/0x70 ? iomap_write_begin.constprop.40+0x2d0/0x2d0 iomap_apply+0x63/0x130 ? iomap_write_begin.constprop.40+0x2d0/0x2d0 iomap_file_buffered_write+0x62/0x90 ? iomap_write_begin.constprop.40+0x2d0/0x2d0 xfs_file_buffered_aio_write+0xe4/0x3b0 [xfs] __vfs_write+0x150/0x1b0 vfs_write+0xba/0x1c0 ksys_pwrite64+0x64/0xa0 do_syscall_64+0x5a/0x1d0 entry_SYSCALL_64_after_hwframe+0x49/0xbe The crash occurs because xfs_iread_extents() attempts to release an uninitialized buffer pointer as the level == 0 value prevented the buffer from ever being allocated or read. Change the level > 0 assert to an explicit error check in xfs_iread_extents() to avoid crashing the kernel in the event of localized, in-core inode corruption. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-03-12xfs: clean up xfs_dir2_leaf_addnamexfs-5.1-merge-5Darrick J. Wong
Remove typedefs and consolidate local variable initialization. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2019-03-10xfs: zero initialize highstale and lowstale in xfs_dir2_leaf_addnameDarrick J. Wong
Smatch complains about the following: fs/xfs/libxfs/xfs_dir2_leaf.c:848 xfs_dir2_leaf_addname() error: uninitialized symbol 'lowstale'. fs/xfs/libxfs/xfs_dir2_leaf.c:849 xfs_dir2_leaf_addname() error: uninitialized symbol 'highstale'. I don't think there's any incorrect behavior associated with the uninitialized variable, but as the author of the previous zero-init patch points out, it's best not to be passing around pointers to uninitialized stack areas. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2019-03-08xfs: clean up xfs_dir2_leafn_addDarrick J. Wong
Remove typedefs and consolidate local variable initialization. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
2019-03-08xfs: Zero initialize highstale and lowstale in xfs_dir2_leafn_addNathan Chancellor
When building with -Wsometimes-uninitialized, Clang warns: fs/xfs/libxfs/xfs_dir2_node.c:481:6: warning: variable 'lowstale' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] fs/xfs/libxfs/xfs_dir2_node.c:481:6: warning: variable 'highstale' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] While it isn't technically wrong, it isn't a problem in practice because highstale and lowstale are only initialized in xfs_dir2_leafn_add when compact is not zero then they are passed to xfs_dir3_leaf_find_entry, where they are initialized before use when compact is zero. Regardless, it's better not to be passing around uninitialized stack memory so zero initialize these variables, which silences this warning. Link: https://github.com/ClangBuiltLinux/linux/issues/393 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-25xfs: fix uninitialized error variablesDarrick J. Wong
smatch complained about some uninitialized error returns, so fix those. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2019-02-21xfs: make COW fork unwritten extent conversions more robustChristoph Hellwig
If we have racing buffered and direct I/O COW fork extents under writeback can have been moved to the data fork by the time we call xfs_reflink_convert_cow from xfs_submit_ioend. This would be mostly harmless as the block numbers don't change by this move, except for the fact that xfs_bmapi_write will crash or trigger asserts when not finding existing extents, even despite trying to paper over this with the XFS_BMAPI_CONVERT_ONLY flag. Instead of special casing non-transaction conversions in the already way too complicated xfs_bmapi_write just add a new helper for the much simpler non-transactional COW fork case, which simplify ignores not found extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-18xfs: fix xfs_buf magic number endian checksDarrick J. Wong
Create a separate magic16 check function so that we don't run afoul of static checkers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-17xfs: move stat accounting to xfs_bmapi_convert_delallocChristoph Hellwig
This way we can actually count how many bytes got converted and how many calls we need, unlike in the caller which doesn't have the detailed view. Note that this includes a slight change in behavior as the xs_xstrat_quick is now bumped for every allocation instead of just the one covering the requested writeback offset, which makes a lot more sense. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17xfs: move transaction handling to xfs_bmapi_convert_delallocChristoph Hellwig
No need to deal with the transaction and the inode locking in the caller. Note that we also switch to passing whichfork as the second paramter, matching what most related functions do. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17xfs: split XFS_BMAPI_DELALLOC handling from xfs_bmapi_writeChristoph Hellwig
Delalloc conversion has traditionally been part of our function to allocate blocks on disk (first xfs_bmapi, then xfs_bmapi_write), but delalloc conversion is a little special as we really do not want to allocate blocks over holes, for which we don't have reservations. Split the delalloc conversions into a separate helper to keep the code simple and structured. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17xfs: factor out two helpers from xfs_bmapi_writeChristoph Hellwig
We want to be able to reuse them for the upcoming dedidcated delalloc convert routine. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17xfs: simplify the xfs_bmap_btree_to_extents calling conventionsChristoph Hellwig
Move boilerplate code from the callers into xfs_bmap_btree_to_extents: - exit early without failure if we don't need to convert to the extent format - assert that we have a btree cursor - don't reinitialize the passed in logflags argument Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-14xfs: rename m_inotbt_nores to m_finobt_noresDarrick J. Wong
Rename this flag variable to imply more strongly that it's related to the free inode btree (finobt) operation. No functional changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-02-11xfs: add magic numbers to dquot buffer opsDarrick J. Wong
Add dquot magic numbers to the buffer ops type, in case we ever want to use them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11xfs: add inode magic to inode verifierDarrick J. Wong
Use xfs_verify_magic to check the magic numbers of inodes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11xfs: factor xfs_da3_blkinfo verification into common helperBrian Foster
With the verifier magic value helper in place, we've left a bit more duplicate code across the verifiers that involve struct xfs_da3_blkinfo. This includes the da node, xattr leaf and dir leaf verifiers, all of which perform similar checks for v4 and v5 filesystems. Create a common helper to verify an xfs_da3_blkinfo structure, taking care to only access v5 fields where appropriate, and refactor the aforementioned verifiers to use the helper. No functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11xfs: miscellaneous verifier magic value fixupsBrian Foster
Most buffer verifiers have hardcoded magic value checks conditionalized on the version of the filesystem. The magic value field of the verifier structure facilitates abstraction of some of this code. Populate the ->magic field of various verifiers to take advantage of this abstraction. No functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11xfs: use verifier magic field in dir2 leaf verifiersBrian Foster
The dir2 leaf verifiers share the same underlying structure verification code, but implement six accessor functions to multiplex the code across the two verifiers. Further, the magic value isn't sufficiently abstracted such that the common helper has to manually fix up the magic from the caller on v5 filesystems. Use the magic field in the verifier structure to eliminate the duplicate code and clean this all up. No functional change. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11xfs: distinguish between bnobt and cntbt magic valuesBrian Foster
The allocation btree verifiers share code that is unable to detect cross-tree magic value corruptions such as a bnobt block with a cntbt magic value. Populate the b_ops->magic field of the associated verifier structures such that the structure verifier can check the magic value against the expected value based on tree type. The btree level check requires knowledge of the tree type to determine the appropriate maximum value. This was previously part of the hardcoded magic value checks. With that code removed, peek at the first magic value in the verifier to determine the expected tree type of the current block. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11xfs: split up allocation btree verifierBrian Foster
Similar to the inode btree verifier, the same allocation btree verifier structure is shared between the by-bno (bnobt) and by-size (cntbt) btrees. This prevents the ability to distinguish magic values between them. Separate the verifier into two, one for each tree, and assign them appropriately. No functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11xfs: distinguish between inobt and finobt magic valuesBrian Foster
The inode btree verifier code is shared between the inode btree and free inode btree because the underlying metadata formats are essentially equivalent. A side effect of this is that the verifier cannot determine whether a particular btree block should have an inobt or finobt magic value. This logic allows an unfortunate xfs_repair bug to escape detection where certain level > 0 nodes of the finobt are stamped with inobt magic by xfs_repair finobt reconstruction. This is fortunately not a severe problem since the inode btree magic values do not contribute to any changes in kernel behavior, but we do need a means to detect and prevent this problem in the future. Add a field to xfs_buf_ops to store the v4 and v5 superblock magic values expected by a particular verifier. Add a helper to check an on-disk magic value against the value expected by the verifier. Call the helper from the shared [f]inobt verifier code for magic value verification. This ensures that the inode btree blocks each have the appropriate magic value based on specific tree type and superblock version. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11xfs: create a separate finobt verifierBrian Foster
The inobt verifier is reused for the inobt and finobt, which prevents the ability to distinguish between magic values on a per-tree basis. Create a separate finobt structure in preparation for changes to enforce the appropriate magic value for the associated tree. This patch has no functional change. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11xfs: always check magic values in on-disk byte orderBrian Foster
Most verifiers that check on-disk magic values convert the CPU endian magic value constant to disk endian to facilitate compile time optimization of the byte swap and reduce the need for runtime byte swaps in buffer verifiers. Several buffer verifiers do not follow this pattern. Update those verifiers for consistency. Also fix up a random typo in the inode readahead verifier name. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11xfs: cache unlinked pointers in an rhashtableDarrick J. Wong
Use a rhashtable to cache the unlinked list incore. This should speed up unlinked processing considerably when there are a lot of inodes on the unlinked list because iunlink_remove no longer has to traverse an entire bucket list to find which inode points to the one being removed. The incore list structure records "X.next_unlinked = Y" relations, with the rhashtable using Y to index the records. This makes finding the inode X that points to a inode Y very quick. If our cache fails to find anything we can always fall back on the old method. FWIW this drastically reduces the amount of time it takes to remove inodes from the unlinked list. I wrote a program to open a lot of O_TMPFILE files and then close them in the same order, which takes a very long time if we have to traverse the unlinked lists. With the ptach, I see: + /d/t/tmpfile/tmpfile Opened 193531 files in 6.33s. Closed 193531 files in 5.86s real 0m12.192s user 0m0.064s sys 0m11.619s + cd / + umount /mnt real 0m0.050s user 0m0.004s sys 0m0.030s And without the patch: + /d/t/tmpfile/tmpfile Opened 193588 files in 6.35s. Closed 193588 files in 751.61s real 12m38.853s user 0m0.084s sys 12m34.470s + cd / + umount /mnt real 0m0.086s user 0m0.000s sys 0m0.060s Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11xfs: add xfs_verify_agino_or_null helperDarrick J. Wong
Add a new helper to check that a per-AG inode pointer is either null or points somewhere valid within that AG. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-02-11xfs: use the latest extent at writeback delalloc conversion timeBrian Foster
The writeback delalloc conversion code is racy with respect to changes in the currently cached file mapping outside of the current page. This is because the ilock is cycled between the time the caller originally looked up the mapping and across each real allocation of the provided file range. This code has collected various hacks over the years to help combat the symptoms of these races (i.e., truncate race detection, allocation into hole detection, etc.), but none address the fundamental problem that the imap may not be valid at allocation time. Rather than continue to use race detection hacks, update writeback delalloc conversion to a model that explicitly converts the delalloc extent backing the current file offset being processed. The current file offset is the only block we can trust to remain once the ilock is dropped because any operation that can remove the block (truncate, hole punch, etc.) must flush and discard pagecache pages first. Modify xfs_iomap_write_allocate() to use the xfs_bmapi_delalloc() mechanism to request allocation of the entire delalloc extent backing the current offset instead of assuming the extent passed by the caller is unchanged. Record the range specified by the caller and apply it to the resulting allocated extent so previous checks by the caller for COW fork overlap are not lost. Finally, overload the bmapi delalloc flag with the range reval flag behavior since this is the only use case for both. This ensures that writeback always picks up the correct and current extent associated with the page, regardless of races with other extent modifying operations. If operating on a data fork and the COW overlap state has changed since the ilock was cycled, the caller revalidates against the COW fork sequence number before using the imap for the next block. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11xfs: create delalloc bmapi wrapper for full extent allocationBrian Foster
The writeback delalloc conversion code is racy with respect to changes in the currently cached file mapping. This stems from the fact that the bmapi allocation code requires a file range to allocate and the writeback conversion code assumes the range of the currently cached mapping is still valid with respect to the fork. It may not be valid, however, because the ilock is cycled (potentially multiple times) between the time the cached mapping was populated and the delalloc conversion occurs. To facilitate a solution to this problem, create a new xfs_bmapi_delalloc() wrapper to xfs_bmapi_write() that takes a file (FSB) offset and attempts to allocate whatever delalloc extent backs the offset. Use a new bmapi flag to cause xfs_bmapi_write() to set the range based on the extent backing the bno parameter unless bno lands in a hole. If bno does land in a hole, fall back to the current behavior (which may result in an error or quietly skipping holes in the specified range depending on other parameters). This patch does not change behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11xfs: remove superfluous writeback mapping eof trimmingBrian Foster
Now that the cached writeback mapping is explicitly invalidated on data fork changes, the EOF trimming band-aid is no longer necessary. Remove xfs_trim_extent_eof() as well since it has no other users. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11xfs: update fork seq counter on data fork changesBrian Foster
The sequence counter in the xfs_ifork structure is only updated on COW forks. This is because the counter is currently only used to optimize out repetitive COW fork checks at writeback time. Tweak the extent code to update the seq counter regardless of the fork type in preparation for using this counter on data forks as well. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11xfs: check attribute name validityDarrick J. Wong
Check extended attribute entry names for invalid characters. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11xfs: check directory name validityDarrick J. Wong
Check directory entry names for invalid characters. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11xfs: scrub should flag dir/attr offsets that aren't mappable with xfs_dablk_tDarrick J. Wong
Teach scrub to flag extent maps that exceed the range that can be mapped with a xfs_dablk_t. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-19xfs: stringify scrub types in ftrace outputxfs-4.21-merge-1Darrick J. Wong
Use __print_symbolic to print the scrub type in ftrace output. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19xfs: stringify btree cursor types in ftrace outputDarrick J. Wong
Use __print_symbolic to print the btree type in ftrace output. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19xfs: move XFS_INODE_FORMAT_STR mappings to libxfsDarrick J. Wong
Move XFS_INODE_FORMAT_STR to libxfs so that we don't forget to keep it updated, and add necessary TRACE_DEFINE_ENUM. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19xfs: move XFS_AG_BTREE_CMP_FORMAT_STR mappings to libxfsDarrick J. Wong
Move XFS_AG_BTREE_CMP_FORMAT_STR to libxfs so that we don't forget to keep it updated, and TRACE_DEFINE_ENUM the values while we're at it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19xfs: fix symbolic enum printing in ftrace outputDarrick J. Wong
ftrace's __print_symbolic() has a (very poorly documented) requirement that any enum values used in the symbol to string translation table be wrapped in a TRACE_DEFINE_ENUM so that the enum value can be encoded in the ftrace ring buffer. Fix this unsatisfied requirement. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-12xfs: cache minimum realtime summary levelOmar Sandoval
The realtime summary is a two-dimensional array on disk, effectively: u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap] rsum[log][bbno] is the number of extents of size 2**log which start in bitmap block bbno. xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether rsum[log][bbno] != 0 for any log level. However, the summary array is stored in row-major order (i.e., like an array in C), so all of these entries are not adjacent, but rather spread across the entire summary file. In the worst case (a full bitmap block), xfs_rtany_summary() has to check every level. This means that on a moderately-used realtime device, an allocation will waste a lot of time finding, reading, and releasing buffers for the realtime summary. In particular, one of our storage services (which runs on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems) spends almost 5% of its CPU cycles in xfs_rtbuf_get() and xfs_trans_brelse() called from xfs_rtany_summary(). One solution would be to also store the summary with the dimensions swapped. However, this would require a disk format change to a very old component of XFS. Instead, we can cache the minimum size which contains any extents. We do so lazily; rather than guaranteeing that the cache contains the precise minimum, it always contains a loose lower bound which we tighten when we read or update a summary block. This only uses a few kilobytes of memory and is already serialized via the realtime bitmap and summary inode locks, so the cost is minimal. With this change, the same workload only spends 0.2% of its CPU cycles in the realtime allocator. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-12xfs: precalculate cluster alignment in inodes and blocksDarrick J. Wong
Store the inode cluster alignment information in units of inodes and blocks in the mount data so that we don't have to keep recalculating them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: precalculate inodes and blocks per inode clusterDarrick J. Wong
Store the number of inodes and blocks per inode cluster in the mount data so that we don't have to keep recalculating them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: add a block to inode count converterDarrick J. Wong
Add new helpers to convert units of fs blocks into inodes, and AG blocks into AG inodes, respectively. Convert all the open-coded conversions and XFS_OFFBNO_TO_AGINO(, , 0) calls to use them, as appropriate. The OFFBNO_TO_AGINO macro is retained for xfs_repair. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: remove xfs_rmap_ag_owner and friendsDarrick J. Wong
Owner information for static fs metadata can be defined readonly at build time because it never changes across filesystems. This enables us to reduce stack usage (particularly in scrub) because we can use the statically defined oinfo structures. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: const-ify xfs_owner_info argumentsDarrick J. Wong
Only certain functions actually change the contents of an xfs_owner_info; the rest can accept a const struct pointer. This will enable us to save stack space by hoisting static owner info types to be const global variables. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: streamline defer op type handlingDarrick J. Wong
There's no need to bundle a pointer to the defer op type into the defer op control structure. Instead, store the defer op type enum, which enables us to shorten some of the lines. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: idiotproof defer op type configurationDarrick J. Wong
Recently, we forgot to port a new defer op type to xfsprogs, which caused us some userspace pain. Reorganize the way we make libxfs clients supply defer op type information so that all type information has to be provided at build time instead of risky runtime dynamic configuration. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: zero length symlinks are not validDave Chinner
A log recovery failure has been reproduced where a symlink inode has a zero length in extent form. It was caused by a shutdown during a combined fstress+fsmark workload. The underlying problem is the issue in xfs_inactive_symlink(): the inode is unlocked between the symlink inactivation/truncation and the inode being freed. This opens a window for the inode to be written to disk before it xfs_ifree() removes it from the unlinked list, marks it free in the inobt and zeros the mode. For shortform inodes, the fix is simple. xfs_ifree() clears the data fork state, so there's no need to do it in xfs_inactive_symlink(). This means the shortform fork verifier will not see a zero length data fork as it mirrors the inode size through to xfs_ifree()), and hence if the inode gets written back and the fork verifiers are run they will still see a fork that matches the on-disk inode size. For extent form (remote) symlinks, it is a little more tricky. Here we explicitly set the inode size to zero, so the above race can lead to zero length symlinks on disk. Because the inode is unlinked at this point (i.e. on the unlinked list) and unreferenced, it can never be seen again by a user. Hence when we set the inode size to zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG inodes to be of zero length, and so this avoids all the problems of zero length symlinks ever hitting the disk. It also avoids the problem of needing to handle zero length symlink inodes in log recovery to replay the extent free intents and the remaining deferops to free the extents the symlink used. Also add a couple of asserts to warn us if zero length symlinks end up in either the symlink create or inactivation paths. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-12xfs: libxfs: move xfs_perag_put latePan Bian
The function xfs_alloc_get_freelist calls xfs_perag_put to drop the reference. However, pag->pagf_btreeblks is read and written after the put operation. This patch moves the put operation later. Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> [darrick: minor changelog edits] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-04xfs: fix inverted return from xfs_btree_sblock_verify_crcEric Sandeen
xfs_btree_sblock_verify_crc is a bool so should not be returning a failaddr_t; worse, if xfs_log_check_lsn fails it returns __this_address which looks like a boolean true (i.e. success) to the caller. (interestingly xfs_btree_lblock_verify_crc doesn't have the issue) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>