summaryrefslogtreecommitdiff
path: root/kernel/bpf
AgeCommit message (Collapse)Author
2023-11-09bpf: Add KF_RCU flag to bpf_refcount_acquire_implDave Marchevsky
Refcounted local kptrs are kptrs to user-defined types with a bpf_refcount field. Recent commits ([0], [1]) modified the lifetime of refcounted local kptrs such that the underlying memory is not reused until RCU grace period has elapsed. Separately, verification of bpf_refcount_acquire calls currently succeeds for MAYBE_NULL non-owning reference input, which is a problem as bpf_refcount_acquire_impl has no handling for this case. This patch takes advantage of aforementioned lifetime changes to tag bpf_refcount_acquire_impl kfunc KF_RCU, thereby preventing MAYBE_NULL input to the kfunc. The KF_RCU flag applies to all kfunc params; it's fine for it to apply to the void *meta__ign param as that's populated by the verifier and is tagged __ign regardless. [0]: commit 7e26cd12ad1c ("bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes") is the actual change to allocation behaivor [1]: commit 0816b8c6bf7f ("bpf: Consider non-owning refs to refcounted nodes RCU protected") modified verifier understanding of refcounted local kptrs to match [0]'s changes Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Fixes: 7c50b1cb76ac ("bpf: Add bpf_refcount_acquire kfunc") Link: https://lore.kernel.org/r/20231107085639.3016113-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: replace register_is_const() with is_reg_const()Shung-Hsi Yu
The addition of is_reg_const() in commit 171de12646d2 ("bpf: generalize is_branch_taken to handle all conditional jumps in one place") has made the register_is_const() redundant. Give the former has more feature, plus the fact the latter is only used in one place, replace register_is_const() with is_reg_const(), and remove the definition of register_is_const. This requires moving the definition of is_reg_const() further up. And since the comment of reg_const_value() reference is_reg_const(), move it up as well. Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231108140043.12282-1-shung-hsi.yu@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: Introduce KF_ARG_PTR_TO_CONST_STRSong Liu
Similar to ARG_PTR_TO_CONST_STR for BPF helpers, KF_ARG_PTR_TO_CONST_STR specifies kfunc args that point to const strings. Annotation "__str" is used to specify kfunc arg of type KF_ARG_PTR_TO_CONST_STR. Also, add documentation for the "__str" annotation. bpf_get_file_xattr() will be the first kfunc that uses this type. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/bpf/20231107045725.2278852-4-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: Factor out helper check_reg_const_str()Song Liu
ARG_PTR_TO_CONST_STR is used to specify constant string args for BPF helpers. The logic that verifies a reg is ARG_PTR_TO_CONST_STR is implemented in check_func_arg(). As we introduce kfuncs with constant string args, it is necessary to do the same check for kfuncs (in check_kfunc_args). Factor out the logic for ARG_PTR_TO_CONST_STR to a new check_reg_const_str() so that it can be reused. check_func_arg() ensures check_reg_const_str() is only called with reg of type PTR_TO_MAP_VALUE. Add a redundent type check in check_reg_const_str() to avoid misuse in the future. Other than this redundent check, there is no change in behavior. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/bpf/20231107045725.2278852-3-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: Add __bpf_dynptr_data* for in kernel useSong Liu
Different types of bpf dynptr have different internal data storage. Specifically, SKB and XDP type of dynptr may have non-continuous data. Therefore, it is not always safe to directly access dynptr->data. Add __bpf_dynptr_data and __bpf_dynptr_data_rw to replace direct access to dynptr->data. Update bpf_verify_pkcs7_signature to use __bpf_dynptr_data instead of dynptr->data. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/bpf/20231107045725.2278852-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf, lpm: Fix check prefixlen before walking trieFlorian Lehner
When looking up an element in LPM trie, the condition 'matchlen == trie->max_prefixlen' will never return true, if key->prefixlen is larger than trie->max_prefixlen. Consequently all elements in the LPM trie will be visited and no element is returned in the end. To resolve this, check key->prefixlen first before walking the LPM trie. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Signed-off-by: Florian Lehner <dev@der-flo.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231105085801.3742-1-dev@der-flo.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: generalize reg_set_min_max() to handle two sets of two registersAndrii Nakryiko
Change reg_set_min_max() to take FALSE/TRUE sets of two registers each, instead of assuming that we are always comparing to a constant. For now we still assume that right-hand side registers are constants (and make sure that's the case by swapping src/dst regs, if necessary), but subsequent patches will remove this limitation. reg_set_min_max() is now called unconditionally for any register comparison, so that might include pointer vs pointer. This makes it consistent with is_branch_taken() generality. But we currently only support adjustments based on SCALAR vs SCALAR comparisons, so reg_set_min_max() has to guard itself againts pointers. Taking two by two registers allows to further unify and simplify check_cond_jmp_op() logic. We utilize fake register for BPF_K conditional jump case, just like with is_branch_taken() part. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-18-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: prepare reg_set_min_max for second set of registersAndrii Nakryiko
Similarly to is_branch_taken()-related refactorings, start preparing reg_set_min_max() to handle more generic case of two non-const registers. Start with renaming arguments to accommodate later addition of second register as an input argument. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-17-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: unify 32-bit and 64-bit is_branch_taken logicAndrii Nakryiko
Combine 32-bit and 64-bit is_branch_taken logic for SCALAR_VALUE registers. It makes it easier to see parallels between two domains (32-bit and 64-bit), and makes subsequent refactoring more straightforward. No functional changes. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-16-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: generalize is_branch_taken to handle all conditional jumps in one placeAndrii Nakryiko
Make is_branch_taken() a single entry point for branch pruning decision making, handling both pointer vs pointer, pointer vs scalar, and scalar vs scalar cases in one place. This also nicely cleans up check_cond_jmp_op(). Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-15-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: move is_branch_taken() downAndrii Nakryiko
Move is_branch_taken() slightly down. In subsequent patched we'll need both flip_opcode() and is_pkt_ptr_branch_taken() for is_branch_taken(), but instead of sprinkling forward declarations around, it makes more sense to move is_branch_taken() lower below is_pkt_ptr_branch_taken(), and also keep it closer to very tightly related reg_set_min_max(), as they are two critical parts of the same SCALAR range tracking logic. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-14-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: generalize is_branch_taken() to work with two registersAndrii Nakryiko
While still assuming that second register is a constant, generalize is_branch_taken-related code to accept two registers instead of register plus explicit constant value. This also, as a side effect, allows to simplify check_cond_jmp_op() by unifying BPF_K case with BPF_X case, for which we use a fake register to represent BPF_K's imm constant as a register. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231102033759.2541186-13-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: rename is_branch_taken reg arguments to prepare for the second oneAndrii Nakryiko
Just taking mundane refactoring bits out into a separate patch. No functional changes. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231102033759.2541186-12-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: drop knowledge-losing __reg_combine_{32,64}_into_{64,32} logicAndrii Nakryiko
When performing 32-bit conditional operation operating on lower 32 bits of a full 64-bit register, register full value isn't changed. We just potentially gain new knowledge about that register's lower 32 bits. Unfortunately, __reg_combine_{32,64}_into_{64,32} logic that reg_set_min_max() performs as a last step, can lose information in some cases due to __mark_reg64_unbounded() and __reg_assign_32_into_64(). That's bad and completely unnecessary. Especially __reg_assign_32_into_64() looks completely out of place here, because we are not performing zero-extending subregister assignment during conditional jump. So this patch replaced __reg_combine_* with just a normal reg_bounds_sync() which will do a proper job of deriving u64/s64 bounds from u32/s32, and vice versa (among all other combinations). __reg_combine_64_into_32() is also used in one more place, coerce_reg_to_size(), while handling 1- and 2-byte register loads. Looking into this, it seems like besides marking subregister as unbounded before performing reg_bounds_sync(), we were also performing deduction of smin32/smax32 and umin32/umax32 bounds from respective smin/smax and umin/umax bounds. It's now redundant as reg_bounds_sync() performs all the same logic more generically (e.g., without unnecessary assumption that upper 32 bits of full register should be zero). Long story short, we remove __reg_combine_64_into_32() completely, and coerce_reg_to_size() now only does resetting subreg to unbounded and then performing reg_bounds_sync() to recover as much information as possible from 64-bit umin/umax and smin/smax bounds, set explicitly in coerce_reg_to_size() earlier. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231102033759.2541186-10-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: try harder to deduce register bounds from different numeric domainsAndrii Nakryiko
There are cases (caught by subsequent reg_bounds tests in selftests/bpf) where performing one round of __reg_deduce_bounds() doesn't propagate all the information from, say, s32 to u32 bounds and than from newly learned u32 bounds back to u64 and s64. So perform __reg_deduce_bounds() twice to make sure such derivations are propagated fully after reg_bounds_sync(). One such example is test `(s64)[0xffffffff00000001; 0] (u64)< 0xffffffff00000000` from selftest patch from this patch set. It demonstrates an intricate dance of u64 -> s64 -> u64 -> u32 bounds adjustments, which requires two rounds of __reg_deduce_bounds(). Here are corresponding refinement log from selftest, showing evolution of knowledge. REFINING (FALSE R1) (u64)SRC=[0xffffffff00000000; U64_MAX] (u64)DST_OLD=[0; U64_MAX] (u64)DST_NEW=[0xffffffff00000000; U64_MAX] REFINING (FALSE R1) (u64)SRC=[0xffffffff00000000; U64_MAX] (s64)DST_OLD=[0xffffffff00000001; 0] (s64)DST_NEW=[0xffffffff00000001; -1] REFINING (FALSE R1) (s64)SRC=[0xffffffff00000001; -1] (u64)DST_OLD=[0xffffffff00000000; U64_MAX] (u64)DST_NEW=[0xffffffff00000001; U64_MAX] REFINING (FALSE R1) (u64)SRC=[0xffffffff00000001; U64_MAX] (u32)DST_OLD=[0; U32_MAX] (u32)DST_NEW=[1; U32_MAX] R1 initially has smin/smax set to [0xffffffff00000001; -1], while umin/umax is unknown. After (u64)< comparison, in FALSE branch we gain knowledge that umin/umax is [0xffffffff00000000; U64_MAX]. That causes smin/smax to learn that zero can't happen and upper bound is -1. Then smin/smax is adjusted from umin/umax improving lower bound from 0xffffffff00000000 to 0xffffffff00000001. And then eventually umin32/umax32 bounds are drived from umin/umax and become [1; U32_MAX]. Selftest in the last patch is actually implementing a multi-round fixed-point convergence logic, but so far all the tests are handled by two rounds of reg_bounds_sync() on the verifier state, so we keep it simple for now. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: improve deduction of 64-bit bounds from 32-bit boundsAndrii Nakryiko
Add a few interesting cases in which we can tighten 64-bit bounds based on newly learnt information about 32-bit bounds. E.g., when full u64/s64 registers are used in BPF program, and then eventually compared as u32/s32. The latter comparison doesn't change the value of full register, but it does impose new restrictions on possible lower 32 bits of such full registers. And we can use that to derive additional full register bounds information. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231102033759.2541186-8-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: add special smin32/smax32 derivation from 64-bit boundsAndrii Nakryiko
Add a special case where we can derive valid s32 bounds from umin/umax or smin/smax by stitching together negative s32 subrange and non-negative s32 subrange. That requires upper 32 bits to form a [N, N+1] range in u32 domain (taking into account wrap around, so 0xffffffff to 0x00000000 is a valid [N, N+1] range in this sense). See code comment for concrete examples. Eduard Zingerman also provided an alternative explanation ([0]) for more mathematically inclined readers: Suppose: . there are numbers a, b, c . 2**31 <= b < 2**32 . 0 <= c < 2**31 . umin = 2**32 * a + b . umax = 2**32 * (a + 1) + c The number of values in the range represented by [umin; umax] is: . N = umax - umin + 1 = 2**32 + c - b + 1 . min(N) = 2**32 + 0 - (2**32-1) + 1 = 2, with b = 2**32-1, c = 0 . max(N) = 2**32 + (2**31 - 1) - 2**31 + 1 = 2**32, with b = 2**31, c = 2**31-1 Hence [(s32)b; (s32)c] forms a valid range. [0] https://lore.kernel.org/bpf/d7af631802f0cfae20df77fe70068702d24bbd31.camel@gmail.com/ Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: derive subreg bounds from full bounds when upper 32 bits are constantAndrii Nakryiko
Comments in code try to explain the idea behind why this is correct. Please check the code and comments. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: derive smin32/smax32 from umin32/umax32 boundsAndrii Nakryiko
All the logic that applies to u64 vs s64, equally applies for u32 vs s32 relationships (just taken in a smaller 32-bit numeric space). So do the same deduction of smin32/smax32 from umin32/umax32, if we can. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09bpf: derive smin/smax from umin/max boundsAndrii Nakryiko
Add smin/smax derivation from appropriate umin/umax values. Previously the logic was surprisingly asymmetric, trying to derive umin/umax from smin/smax (if possible), but not trying to do the same in the other direction. A simple addition to __reg64_deduce_bounds() fixes this. Added also generic comment about u64/s64 ranges and their relationship. Hopefully that helps readers to understand all the bounds deductions a bit better. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-07bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_regChuyi Zhou
BTF_TYPE_SAFE_TRUSTED(struct bpf_iter__task) in verifier.c wanted to teach BPF verifier that bpf_iter__task -> task is a trusted ptr. But it doesn't work well. The reason is, bpf_iter__task -> task would go through btf_ctx_access() which enforces the reg_type of 'task' is ctx_arg_info->reg_type, and in task_iter.c, we actually explicitly declare that the ctx_arg_info->reg_type is PTR_TO_BTF_ID_OR_NULL. Actually we have a previous case like this[1] where PTR_TRUSTED is added to the arg flag for map_iter. This patch sets ctx_arg_info->reg_type is PTR_TO_BTF_ID_OR_NULL | PTR_TRUSTED in task_reg_info. Similarly, bpf_cgroup_reg_info -> cgroup is also PTR_TRUSTED since we are under the protection of cgroup_mutex and we would check cgroup_is_dead() in __cgroup_iter_seq_show(). This patch is to improve the user experience of the newly introduced bpf_iter_css_task kfunc before hitting the mainline. The Fixes tag is pointing to the commit introduced the bpf_iter_css_task kfunc. Link[1]:https://lore.kernel.org/all/20230706133932.45883-3-aspsk@isovalent.com/ Fixes: 9c66dc94b62a ("bpf: Introduce css_task open-coded iterator kfuncs") Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231107132204.912120-2-zhouchuyi@bytedance.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-01bpf: Fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_ENDShung-Hsi Yu
BPF_END and BPF_NEG has a different specification for the source bit in the opcode compared to other ALU/ALU64 instructions, and is either reserved or use to specify the byte swap endianness. In both cases the source bit does not encode source operand location, and src_reg is a reserved field. backtrack_insn() currently does not differentiate BPF_END and BPF_NEG from other ALU/ALU64 instructions, which leads to r0 being incorrectly marked as precise when processing BPF_ALU | BPF_TO_BE | BPF_END instructions. This commit teaches backtrack_insn() to correctly mark precision for such case. While precise tracking of BPF_NEG and other BPF_END instructions are correct and does not need fixing, this commit opt to process all BPF_NEG and BPF_END instructions within the same if-clause to better align with current convention used in the verifier (e.g. check_alu_op). Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking") Cc: stable@vger.kernel.org Reported-by: Mohamed Mahmoud <mmahmoud@redhat.com> Closes: https://lore.kernel.org/r/87jzrrwptf.fsf@toke.dk Tested-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: Tao Lyu <tao.lyu@epfl.ch> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231102053913.12004-2-shung-hsi.yu@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-01bpf: Relax allowlist for css_task iterChuyi Zhou
The newly added open-coded css_task iter would try to hold the global css_set_lock in bpf_iter_css_task_new, so the bpf side has to be careful in where it allows to use this iter. The mainly concern is dead locking on css_set_lock. check_css_task_iter_allowlist() in verifier enforced css_task can only be used in bpf_lsm hooks and sleepable bpf_iter. This patch relax the allowlist for css_task iter. Any lsm and any iter (even non-sleepable) and any sleepable are safe since they would not hold the css_set_lock before entering BPF progs context. This patch also fixes the misused BPF_TRACE_ITER in check_css_task_iter_allowlist which compared bpf_prog_type with bpf_attach_type. Fixes: 9c66dc94b62ae ("bpf: Introduce css_task open-coded iterator kfuncs") Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231031050438.93297-2-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-01bpf: Check map->usercnt after timer->timer is assignedHou Tao
When there are concurrent uref release and bpf timer init operations, the following sequence diagram is possible. It will break the guarantee provided by bpf_timer: bpf_timer will still be alive after userspace application releases or unpins the map. It also will lead to kmemleak for old kernel version which doesn't release bpf_timer when map is released. bpf program X: bpf_timer_init() lock timer->lock read timer->timer as NULL read map->usercnt != 0 process Y: close(map_fd) // put last uref bpf_map_put_uref() atomic_dec_and_test(map->usercnt) array_map_free_timers() bpf_timer_cancel_and_free() // just return read timer->timer is NULL t = bpf_map_kmalloc_node() timer->timer = t unlock timer->lock Fix the problem by checking map->usercnt after timer->timer is assigned, so when there are concurrent uref release and bpf timer init, either bpf_timer_cancel_and_free() from uref release reads a no-NULL timer or the newly-added atomic64_read() returns a zero usercnt. Because atomic_dec_and_test(map->usercnt) and READ_ONCE(timer->timer) in bpf_timer_cancel_and_free() are not protected by a lock, so add a memory barrier to guarantee the order between map->usercnt and timer->timer. Also use WRITE_ONCE(timer->timer, x) to match the lockless read of timer->timer in bpf_timer_cancel_and_free(). Reported-by: Hsin-Wei Hung <hsinweih@uci.edu> Closes: https://lore.kernel.org/bpf/CABcoxUaT2k9hWsS1tNgXyoU3E-=PuOgMn737qK984fbFmfYixQ@mail.gmail.com Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231030063616.1653024-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-01bpf: Add __bpf_kfunc_{start,end}_defs macrosDave Marchevsky
BPF kfuncs are meant to be called from BPF programs. Accordingly, most kfuncs are not called from anywhere in the kernel, which the -Wmissing-prototypes warning is unhappy about. We've peppered __diag_ignore_all("-Wmissing-prototypes", ... everywhere kfuncs are defined in the codebase to suppress this warning. This patch adds two macros meant to bound one or many kfunc definitions. All existing kfunc definitions which use these __diag calls to suppress -Wmissing-prototypes are migrated to use the newly-introduced macros. A new __diag_ignore_all - for "-Wmissing-declarations" - is added to the __bpf_kfunc_start_defs macro based on feedback from Andrii on an earlier version of this patch [0] and another recent mailing list thread [1]. In the future we might need to ignore different warnings or do other kfunc-specific things. This change will make it easier to make such modifications for all kfunc defs. [0]: https://lore.kernel.org/bpf/CAEf4BzaE5dRWtK6RPLnjTW-MW9sx9K3Fn6uwqCTChK2Dcb1Xig@mail.gmail.com/ [1]: https://lore.kernel.org/bpf/ZT+2qCc%2FaXep0%2FLf@krava/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Cc: Jiri Olsa <olsajiri@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: David Vernet <void@manifault.com> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20231031215625.2343848-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-01bpf: Fix check_stack_write_fixed_off() to correctly spill immHao Sun
In check_stack_write_fixed_off(), imm value is cast to u32 before being spilled to the stack. Therefore, the sign information is lost, and the range information is incorrect when load from the stack again. For the following prog: 0: r2 = r10 1: *(u64*)(r2 -40) = -44 2: r0 = *(u64*)(r2 - 40) 3: if r0 s<= 0xa goto +2 4: r0 = 1 5: exit 6: r0 = 0 7: exit The verifier gives: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 1: (7a) *(u64 *)(r2 -40) = -44 ; R2_w=fp0 fp-40_w=4294967252 2: (79) r0 = *(u64 *)(r2 -40) ; R0_w=4294967252 R2_w=fp0 fp-40_w=4294967252 3: (c5) if r0 s< 0xa goto pc+2 mark_precise: frame0: last_idx 3 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r0 stack= before 2: (79) r0 = *(u64 *)(r2 -40) 3: R0_w=4294967252 4: (b7) r0 = 1 ; R0_w=1 5: (95) exit verification time 7971 usec stack depth 40 processed 6 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 So remove the incorrect cast, since imm field is declared as s32, and __mark_reg_known() takes u64, so imm would be correctly sign extended by compiler. Fixes: ecdf985d7615 ("bpf: track immediate values written to stack by BPF_ST instruction") Cc: stable@vger.kernel.org Signed-off-by: Hao Sun <sunhao.th@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231101-fix-check-stack-write-v3-1-f05c2b1473d5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-01bpf: fix compilation error without CGROUPSMatthieu Baerts
Our MPTCP CI complained [1] -- and KBuild too -- that it was no longer possible to build the kernel without CONFIG_CGROUPS: kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_new': kernel/bpf/task_iter.c:919:14: error: 'CSS_TASK_ITER_PROCS' undeclared (first use in this function) 919 | case CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED: | ^~~~~~~~~~~~~~~~~~~ kernel/bpf/task_iter.c:919:14: note: each undeclared identifier is reported only once for each function it appears in kernel/bpf/task_iter.c:919:36: error: 'CSS_TASK_ITER_THREADED' undeclared (first use in this function) 919 | case CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED: | ^~~~~~~~~~~~~~~~~~~~~~ kernel/bpf/task_iter.c:927:60: error: invalid application of 'sizeof' to incomplete type 'struct css_task_iter' 927 | kit->css_it = bpf_mem_alloc(&bpf_global_ma, sizeof(struct css_task_iter)); | ^~~~~~ kernel/bpf/task_iter.c:930:9: error: implicit declaration of function 'css_task_iter_start'; did you mean 'task_seq_start'? [-Werror=implicit-function-declaration] 930 | css_task_iter_start(css, flags, kit->css_it); | ^~~~~~~~~~~~~~~~~~~ | task_seq_start kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_next': kernel/bpf/task_iter.c:940:16: error: implicit declaration of function 'css_task_iter_next'; did you mean 'class_dev_iter_next'? [-Werror=implicit-function-declaration] 940 | return css_task_iter_next(kit->css_it); | ^~~~~~~~~~~~~~~~~~ | class_dev_iter_next kernel/bpf/task_iter.c:940:16: error: returning 'int' from a function with return type 'struct task_struct *' makes pointer from integer without a cast [-Werror=int-conversion] 940 | return css_task_iter_next(kit->css_it); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_destroy': kernel/bpf/task_iter.c:949:9: error: implicit declaration of function 'css_task_iter_end' [-Werror=implicit-function-declaration] 949 | css_task_iter_end(kit->css_it); | ^~~~~~~~~~~~~~~~~ This patch simply surrounds with a #ifdef the new code requiring CGroups support. It seems enough for the compiler and this is similar to bpf_iter_css_{new,next,destroy}() functions where no other #ifdef have been added in kernel/bpf/helpers.c and in the selftests. Fixes: 9c66dc94b62a ("bpf: Introduce css_task open-coded iterator kfuncs") Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/6665206927 Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202310260528.aHWgVFqq-lkp@intel.com/ Signed-off-by: Matthieu Baerts <matttbe@kernel.org> [ added missing ifdefs for BTF_ID cgroup definitions ] Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231101181601.1493271-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-01Merge tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring {get,set}sockopt support from Jens Axboe: "This adds support for using getsockopt and setsockopt via io_uring. The main use cases for this is to enable use of direct descriptors, rather than first instantiating a normal file descriptor, doing the option tweaking needed, then turning it into a direct descriptor. With this support, we can avoid needing a regular file descriptor completely. The net and bpf bits have been signed off on their side" * tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux: selftests/bpf/sockopt: Add io_uring support io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT io_uring/cmd: return -EOPNOTSUPP if net is disabled selftests/net: Extract uring helpers to be reusable tools headers: Grab copy of io_uring.h io_uring/cmd: Pass compat mode in issue_flags net/socket: Break down __sys_getsockopt net/socket: Break down __sys_setsockopt bpf: Add sockptr support for setsockopt bpf: Add sockptr support for getsockopt
2023-10-31Merge tag 'net-next-6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Support usec resolution of TCP timestamps, enabled selectively by a route attribute. - Defer regular TCP ACK while processing socket backlog, try to send a cumulative ACK at the end. Increase single TCP flow performance on a 200Gbit NIC by 20% (100Gbit -> 120Gbit). - The Fair Queuing (FQ) packet scheduler: - add built-in 3 band prio / WRR scheduling - support bypass if the qdisc is mostly idle (5% speed up for TCP RR) - improve inactive flow reporting - optimize the layout of structures for better cache locality - Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern replacement for the old MD5 option. - Add more retransmission timeout (RTO) related statistics to TCP_INFO. - Support sending fragmented skbs over vsock sockets. - Make sure we send SIGPIPE for vsock sockets if socket was shutdown(). - Add sysctl for ignoring lower limit on lifetime in Router Advertisement PIO, based on an in-progress IETF draft. - Add sysctl to control activation of TCP ping-pong mode. - Add sysctl to make connection timeout in MPTCP configurable. - Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps limit the number of wakeups. - Support netlink GET for MDB (multicast forwarding), allowing user space to request a single MDB entry instead of dumping the entire table. - Support selective FDB flushing in the VXLAN tunnel driver. - Allow limiting learned FDB entries in bridges, prevent OOM attacks. - Allow controlling via configfs netconsole targets which were created via the kernel cmdline at boot, rather than via configfs at runtime. - Support multiple PTP timestamp event queue readers with different filters. - MCTP over I3C. BPF: - Add new veth-like netdevice where BPF program defines the logic of the xmit routine. It can operate in L3 and L2 mode. - Support exceptions - allow asserting conditions which should never be true but are hard for the verifier to infer. With some extra flexibility around handling of the exit / failure: https://lwn.net/Articles/938435/ - Add support for local per-cpu kptr, allow allocating and storing per-cpu objects in maps. Access to those objects operates on the value for the current CPU. This allows to deprecate local one-off implementations of per-CPU storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps. - Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is for systemd to re-implement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services. - Enable open-coded task_vma iteration, after maple tree conversion made it hard to directly walk VMAs in tracing programs. - Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF. - Allow source address selection with bpf_*_fib_lookup(). - Add ability to pin BPF timer to the current CPU. - Prevent creation of infinite loops by combining tail calls and fentry/fexit programs. - Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs. - Inherit system settings for CPU security mitigations. - Add BPF v4 CPU instruction support for arm32 and s390x. Changes to common code: - overflow: add DEFINE_FLEX() for on-stack definition of structs with flexible array members. - Process doc update with more guidance for reviewers. Driver API: - Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy mutex in most places and remove a lot of smaller locks. - Create a common DPLL configuration API. Allow configuring and querying state of PLL circuits used for clock syntonization, in network time distribution. - Unify fragmented and full page allocation APIs in page pool code. Let drivers be ignorant of PAGE_SIZE. - Rework PHY state machine to avoid races with calls to phy_stop(). - Notify DSA drivers of MAC address changes on user ports, improve correctness of offloads which depend on matching port MAC addresses. - Allow antenna control on injected WiFi frames. - Reduce the number of variants of napi_schedule(). - Simplify error handling when composing devlink health messages. Misc: - A lot of KCSAN data race "fixes", from Eric. - A lot of __counted_by() annotations, from Kees. - A lot of strncpy -> strscpy and printf format fixes. - Replace master/slave terminology with conduit/user in DSA drivers. - Handful of KUnit tests for netdev and WiFi core. Removed: - AppleTalk COPS. - AppleTalk ipddp. - TI AR7 CPMAC Ethernet driver. Drivers: - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - add a driver for the Intel E2000 IPUs - make CRC/FCS stripping configurable - cross-timestamping for E823 devices - basic support for E830 devices - use aux-bus for managing client drivers - i40e: report firmware versions via devlink - nVidia/Mellanox: - support 4-port NICs - increase max number of channels to 256 - optimize / parallelize SF creation flow - Broadcom (bnxt): - enhance NIC temperature reporting - support PAM4 speeds and lane configuration - Marvell OcteonTX2: - PTP pulse-per-second output support - enable hardware timestamping for VFs - Solarflare/AMD: - conntrack NAT offload and offload for tunnels - Wangxun (ngbe/txgbe): - expose HW statistics - Pensando/AMD: - support PCI level reset - narrow down the condition under which skbs are linearized - Netronome/Corigine (nfp): - support CHACHA20-POLY1305 crypto in IPsec offload - Ethernet NICs embedded, slower, virtual: - Synopsys (stmmac): - add Loongson-1 SoC support - enable use of HW queues with no offload capabilities - enable PPS input support on all 5 channels - increase TX coalesce timer to 5ms - RealTek USB (r8152): improve efficiency of Rx by using GRO frags - xen: support SW packet timestamping - add drivers for implementations based on TI's PRUSS (AM64x EVM) - nVidia/Mellanox Ethernet datacenter switches: - avoid poor HW resource use on Spectrum-4 by better block selection for IPv6 multicast forwarding and ordering of blocks in ACL region - Ethernet embedded switches: - Microchip: - support configuring the drive strength for EMI compliance - ksz9477: partial ACL support - ksz9477: HSR offload - ksz9477: Wake on LAN - Realtek: - rtl8366rb: respect device tree config of the CPU port - Ethernet PHYs: - support Broadcom BCM5221 PHYs - TI dp83867: support hardware LED blinking - CAN: - add support for Linux-PHY based CAN transceivers - at91_can: clean up and use rx-offload helpers - WiFi: - MediaTek (mt76): - new sub-driver for mt7925 USB/PCIe devices - HW wireless <> Ethernet bridging in MT7988 chips - mt7603/mt7628 stability improvements - Qualcomm (ath12k): - WCN7850: - enable 320 MHz channels in 6 GHz band - hardware rfkill support - enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to make scan faster - read board data variant name from SMBIOS - QCN9274: mesh support - RealTek (rtw89): - TDMA-based multi-channel concurrency (MCC) - Silicon Labs (wfx): - Remain-On-Channel (ROC) support - Bluetooth: - ISO: many improvements for broadcast support - mark BCM4378/BCM4387 as BROKEN_LE_CODED - add support for QCA2066 - btmtksdio: enable Bluetooth wakeup from suspend" * tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1816 commits) net: pcs: xpcs: Add 2500BASE-X case in get state for XPCS drivers net: bpf: Use sockopt_lock_sock() in ip_sock_set_tos() net: mana: Use xdp_set_features_flag instead of direct assignment vxlan: Cleanup IFLA_VXLAN_PORT_RANGE entry in vxlan_get_size() iavf: delete the iavf client interface iavf: add a common function for undoing the interrupt scheme iavf: use unregister_netdev iavf: rely on netdev's own registered state iavf: fix the waiting time for initial reset iavf: in iavf_down, don't queue watchdog_task if comms failed iavf: simplify mutex_trylock+sleep loops iavf: fix comments about old bit locks doc/netlink: Update schema to support cmd-cnt-name and cmd-max-name tools: ynl: introduce option to process unknown attributes or types ipvlan: properly track tx_errors netdevsim: Block until all devices are released nfp: using napi_build_skb() to replace build_skb() net: dsa: microchip: ksz9477: Fix spelling mistake "Enery" -> "Energy" net: dsa: microchip: Ensure Stable PME Pin State for Wake-on-LAN net: dsa: microchip: Refactor switch shutdown routine for WoL preparation ...
2023-10-30Merge tag 'vfs-6.7.ctime' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode time accessor updates from Christian Brauner: "This finishes the conversion of all inode time fields to accessor functions as discussed on list. Changing timestamps manually as we used to do before is error prone. Using accessors function makes this robust. It does not contain the switch of the time fields to discrete 64 bit integers to replace struct timespec and free up space in struct inode. But after this, the switch can be trivially made and the patch should only affect the vfs if we decide to do it" * tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits) fs: rename inode i_atime and i_mtime fields security: convert to new timestamp accessors selinux: convert to new timestamp accessors apparmor: convert to new timestamp accessors sunrpc: convert to new timestamp accessors mm: convert to new timestamp accessors bpf: convert to new timestamp accessors ipc: convert to new timestamp accessors linux: convert to new timestamp accessors zonefs: convert to new timestamp accessors xfs: convert to new timestamp accessors vboxsf: convert to new timestamp accessors ufs: convert to new timestamp accessors udf: convert to new timestamp accessors ubifs: convert to new timestamp accessors tracefs: convert to new timestamp accessors sysv: convert to new timestamp accessors squashfs: convert to new timestamp accessors server: convert to new timestamp accessors client: convert to new timestamp accessors ...
2023-10-30Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds
Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Rename and export helpers that get write access to a mount. They are used in overlayfs to get write access to the upper mount. - Print the pretty name of the root device on boot failure. This helps in scenarios where we would usually only print "unknown-block(1,2)". - Add an internal SB_I_NOUMASK flag. This is another part in the endless POSIX ACL saga in a way. When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip the umask because if the relevant inode has POSIX ACLs set it might take the umask from there. But if the inode doesn't have any POSIX ACLs set then we apply the umask in the filesytem itself. So we end up with: (1) no SB_POSIXACL -> strip umask in vfs (2) SB_POSIXACL -> strip umask in filesystem The umask semantics associated with SB_POSIXACL allowed filesystems that don't even support POSIX ACLs at all to raise SB_POSIXACL purely to avoid umask stripping. That specifically means NFS v4 and Overlayfs. NFS v4 does it because it delegates this to the server and Overlayfs because it needs to delegate umask stripping to the upper filesystem, i.e., the filesystem used as the writable layer. This went so far that SB_POSIXACL is raised eve on kernels that don't even have POSIX ACL support at all. Stop this blatant abuse and add SB_I_NOUMASK which is an internal superblock flag that filesystems can raise to opt out of umask handling. That should really only be the two mentioned above. It's not that we want any filesystems to do this. Ideally we have all umask handling always in the vfs. - Make overlayfs use SB_I_NOUMASK too. - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in IS_POSIXACL() if the kernel doesn't have support for it. This is a very old patch but it's only possible to do this now with the wider cleanup that was done. - Follow-up work on fake path handling from last cycle. Citing mostly from Amir: When overlayfs was first merged, overlayfs files of regular files and directories, the ones that are installed in file table, had a "fake" path, namely, f_path is the overlayfs path and f_inode is the "real" inode on the underlying filesystem. In v6.5, we took another small step by introducing of the backing_file container and the file_real_path() helper. This change allowed vfs and filesystem code to get the "real" path of an overlayfs backing file. With this change, we were able to make fsnotify work correctly and report events on the "real" filesystem objects that were accessed via overlayfs. This method works fine, but it still leaves the vfs vulnerable to new code that is not aware of files with fake path. A recent example is commit db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get the i_version"). This commit uses direct referencing to f_path in IMA code that otherwise uses file_inode() and file_dentry() to reference the filesystem objects that it is measuring. This contains work to switch things around: instead of having filesystem code opt-in to get the "real" path, have generic code opt-in for the "fake" path in the few places that it is needed. Is it far more likely that new filesystems code that does not use the file_dentry() and file_real_path() helpers will end up causing crashes or averting LSM/audit rules if we keep the "fake" path exposed by default. This change already makes file_dentry() moot, but for now we did not change this helper just added a WARN_ON() in ovl_d_real() to catch if we have made any wrong assumptions. After the dust settles on this change, we can make file_dentry() a plain accessor and we can drop the inode argument to ->d_real(). - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small change but it really isn't and I would like to see everyone on their tippie toes for any possible bugs from this work. Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for files since a very long time because of the nasty interactions between the SCM_RIGHTS file descriptor garbage collection. So extending it makes a lot of sense but it is a subtle change. There are almost no places that fiddle with file rcu semantics directly and the ones that did mess around with struct file internal under rcu have been made to stop doing that because it really was always dodgy. I forgot to put in the link tag for this change and the discussion in the commit so adding it into the merge message: https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com Cleanups: - Various smaller pipe cleanups including the removal of a spin lock that was only used to protect against writes without pipe_lock() from O_NOTIFICATION_PIPE aka watch queues. As that was never implemented remove the additional locking from pipe_write(). - Annotate struct watch_filter with the new __counted_by attribute. - Clarify do_unlinkat() cleanup so that it doesn't look like an extra iput() is done that would cause issues. - Simplify file cleanup when the file has never been opened. - Use module helper instead of open-coding it. - Predict error unlikely for stale retry. - Use WRITE_ONCE() for mount expiry field instead of just commenting that one hopes the compiler doesn't get smart. Fixes: - Fix readahead on block devices. - Fix writeback when layztime is enabled and inodes whose timestamp is the only thing that changed reside on wb->b_dirty_time. This caused excessively large zombie memory cgroup when lazytime was enabled as such inodes weren't handled fast enough. - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()" * tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits) file, i915: fix file reference for mmap_singleton() vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs chardev: Simplify usage of try_module_get() ovl: rely on SB_I_NOUMASK fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n fs: store real path instead of fake path in backing file f_path fs: create helper file_user_path() for user displayed mapped file path fs: get mnt_writers count for an open backing file's real path vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked vfs: predict the error in retry_estale as unlikely backing file: free directly vfs: fix readahead(2) on block devices io_uring: use files_lookup_fd_locked() file: convert to SLAB_TYPESAFE_BY_RCU vfs: shave work on failed file open fs: simplify misleading code to remove ambiguity regarding ihold()/iput() watch_queue: Annotate struct watch_filter with __counted_by fs/pipe: use spinlock in pipe_read() only if there is a watch_queue fs/pipe: remove unnecessary spinlock from pipe_write() ...
2023-10-26bpf: Add more WARN_ON_ONCE checks for mismatched alloc and freeHou Tao
There are two possible mismatched alloc and free cases in BPF memory allocator: 1) allocate from cache X but free by cache Y with a different unit_size 2) allocate from per-cpu cache but free by kmalloc cache or vice versa So add more WARN_ON_ONCE checks in free_bulk() and __free_by_rcu() to spot these mismatched alloc and free early. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231021014959.3563841-1-houtao@huaweicloud.com
2023-10-24netkit, bpf: Add bpf programmable net deviceDaniel Borkmann
This work adds a new, minimal BPF-programmable device called "netkit" (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The core idea is that BPF programs are executed within the drivers xmit routine and therefore e.g. in case of containers/Pods moving BPF processing closer to the source. One of the goals was that in case of Pod egress traffic, this allows to move BPF programs from hostns tcx ingress into the device itself, providing earlier drop or forward mechanisms, for example, if the BPF program determines that the skb must be sent out of the node, then a redirect to the physical device can take place directly without going through per-CPU backlog queue. This helps to shift processing for such traffic from softirq to process context, leading to better scheduling decisions/performance (see measurements in the slides). In this initial version, the netkit device ships as a pair, but we plan to extend this further so it can also operate in single device mode. The pair comes with a primary and a peer device. Only the primary device, typically residing in hostns, can manage BPF programs for itself and its peer. The peer device is designated for containers/Pods and cannot attach/detach BPF programs. Upon the device creation, the user can set the default policy to 'pass' or 'drop' for the case when no BPF program is attached. Additionally, the device can be operated in L3 (default) or L2 mode. The management of BPF programs is done via bpf_mprog, so that multi-attach is supported right from the beginning with similar API and dependency controls as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic attach/detach/query API for multi-progs"). tc BPF compatibility is provided, so that existing programs can be easily migrated. Going forward, we plan to use netkit devices in Cilium as the main device type for connecting Pods. They will be operated in L3 mode in order to simplify a Pod's neighbor management and the peer will operate in default drop mode, so that no traffic is leaving between the time when a Pod is brought up by the CNI plugin and programs attached by the agent. Additionally, the programs we attach via tcx on the physical devices are using bpf_redirect_peer() for inbound traffic into netkit device, hence the latter is also supporting the ndo_get_peer_dev callback. Similarly, we use bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device directly. Also, BIG TCP is supported on netkit device. For the follow-up work in single device mode, we plan to convert Cilium's cilium_host/_net devices into a single one. An extensive test suite for checking device operations and the BPF program and link management API comes as BPF selftests in this series. Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://github.com/borkmann/iproute2/tree/pr/netkit Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.) Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-24bpf: Improve JEQ/JNE branch taken logicAndrii Nakryiko
When determining if an if/else branch will always or never be taken, use signed range knowledge in addition to currently used unsigned range knowledge. If either signed or unsigned range suggests that condition is always/never taken, return corresponding branch_taken verdict. Current use of unsigned range for this seems arbitrary and unnecessarily incomplete. It is possible for *signed* operations to be performed on register, which could "invalidate" unsigned range for that register. In such case branch_taken will be artificially useless, even if we can still tell that some constant is outside of register value range based on its signed bounds. veristat-based validation shows zero differences across selftests, Cilium, and Meta-internal BPF object files. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/bpf/20231022205743.72352-2-andrii@kernel.org
2023-10-24bpf: Fold smp_mb__before_atomic() into atomic_set_release()Paul E. McKenney
The bpf_user_ringbuf_drain() BPF_CALL function uses an atomic_set() immediately preceded by smp_mb__before_atomic() so as to order storing of ring-buffer consumer and producer positions prior to the atomic_set() call's clearing of the ->busy flag, as follows: smp_mb__before_atomic(); atomic_set(&rb->busy, 0); Although this works given current architectures and implementations, and given that this only needs to order prior writes against a later write. However, it does so by accident because the smp_mb__before_atomic() is only guaranteed to work with read-modify-write atomic operations, and not at all with things like atomic_set() and atomic_read(). Note especially that smp_mb__before_atomic() will not, repeat *not*, order the prior write to "a" before the subsequent non-read-modify-write atomic read from "b", even on strongly ordered systems such as x86: WRITE_ONCE(a, 1); smp_mb__before_atomic(); r1 = atomic_read(&b); Therefore, replace the smp_mb__before_atomic() and atomic_set() with atomic_set_release() as follows: atomic_set_release(&rb->busy, 0); This is no slower (and sometimes is faster) than the original, and also provides a formal guarantee of ordering that the original lacks. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/ec86d38e-cfb4-44aa-8fdb-6c925922d93c@paulmck-laptop
2023-10-24bpf: Fix unnecessary -EBUSY from htab_lock_bucketSong Liu
htab_lock_bucket uses the following logic to avoid recursion: 1. preempt_disable(); 2. check percpu counter htab->map_locked[hash] for recursion; 2.1. if map_lock[hash] is already taken, return -BUSY; 3. raw_spin_lock_irqsave(); However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ logic will not able to access the same hash of the hashtab and get -EBUSY. This -EBUSY is not really necessary. Fix it by disabling IRQ before checking map_locked: 1. preempt_disable(); 2. local_irq_save(); 3. check percpu counter htab->map_locked[hash] for recursion; 3.1. if map_lock[hash] is already taken, return -BUSY; 4. raw_spin_lock(). Similarly, use raw_spin_unlock() and local_irq_restore() in htab_unlock_bucket(). Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked") Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/7a9576222aa40b1c84ad3a9ba3e64011d1a04d41.camel@linux.ibm.com Link: https://lore.kernel.org/bpf/20231012055741.3375999-1-song@kernel.org
2023-10-23bpf: print full verifier states on infinite loop detectionEduard Zingerman
Additional logging in is_state_visited(): if infinite loop is detected print full verifier state for both current and equivalent states. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-8-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf: correct loop detection for iterators convergenceEduard Zingerman
It turns out that .branches > 0 in is_state_visited() is not a sufficient condition to identify if two verifier states form a loop when iterators convergence is computed. This commit adds logic to distinguish situations like below: (I) initial (II) initial | | V V .---------> hdr .. | | | | V V | .------... .------.. | | | | | | V V V V | ... ... .-> hdr .. | | | | | | | V V | V V | succ <- cur | succ <- cur | | | | | V | V | ... | ... | | | | '----' '----' For both (I) and (II) successor 'succ' of the current state 'cur' was previously explored and has branches count at 0. However, loop entry 'hdr' corresponding to 'succ' might be a part of current DFS path. If that is the case 'succ' and 'cur' are members of the same loop and have to be compared exactly. Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reviewed-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf: exact states comparison for iterator convergence checksEduard Zingerman
Convergence for open coded iterators is computed in is_state_visited() by examining states with branches count > 1 and using states_equal(). states_equal() computes sub-state relation using read and precision marks. Read and precision marks are propagated from children states, thus are not guaranteed to be complete inside a loop when branches count > 1. This could be demonstrated using the following unsafe program: 1. r7 = -16 2. r6 = bpf_get_prandom_u32() 3. while (bpf_iter_num_next(&fp[-8])) { 4. if (r6 != 42) { 5. r7 = -32 6. r6 = bpf_get_prandom_u32() 7. continue 8. } 9. r0 = r10 10. r0 += r7 11. r8 = *(u64 *)(r0 + 0) 12. r6 = bpf_get_prandom_u32() 13. } Here verifier would first visit path 1-3, create a checkpoint at 3 with r7=-16, continue to 4-7,3 with r7=-32. Because instructions at 9-12 had not been visitied yet existing checkpoint at 3 does not have read or precision mark for r7. Thus states_equal() would return true and verifier would discard current state, thus unsafe memory access at 11 would not be caught. This commit fixes this loophole by introducing exact state comparisons for iterator convergence logic: - registers are compared using regs_exact() regardless of read or precision marks; - stack slots have to have identical type. Unfortunately, this is too strict even for simple programs like below: i = 0; while(iter_next(&it)) i++; At each iteration step i++ would produce a new distinct state and eventually instruction processing limit would be reached. To avoid such behavior speculatively forget (widen) range for imprecise scalar registers, if those registers were not precise at the end of the previous iteration and do not match exactly. This a conservative heuristic that allows to verify wide range of programs, however it precludes verification of programs that conjure an imprecise value on the first loop iteration and use it as precise on the second. Test case iter_task_vma_for_each() presents one of such cases: unsigned int seen = 0; ... bpf_for_each(task_vma, vma, task, 0) { if (seen >= 1000) break; ... seen++; } Here clang generates the following code: <LBB0_4>: 24: r8 = r6 ; stash current value of ... body ... 'seen' 29: r1 = r10 30: r1 += -0x8 31: call bpf_iter_task_vma_next 32: r6 += 0x1 ; seen++; 33: if r0 == 0x0 goto +0x2 <LBB0_6> ; exit on next() == NULL 34: r7 += 0x10 35: if r8 < 0x3e7 goto -0xc <LBB0_4> ; loop on seen < 1000 <LBB0_6>: ... exit ... Note that counter in r6 is copied to r8 and then incremented, conditional jump is done using r8. Because of this precision mark for r6 lags one state behind of precision mark on r8 and widening logic kicks in. Adding barrier_var(seen) after conditional is sufficient to force clang use the same register for both counting and conditional jump. This issue was discussed in the thread [1] which was started by Andrew Werner <awerner32@gmail.com> demonstrating a similar bug in callback functions handling. The callbacks would be addressed in a followup patch. [1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/ Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf: extract same_callsites() as utility functionEduard Zingerman
Extract same_callsites() from clean_live_states() as a utility function. This function would be used by the next patch in the set. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf: move explored_state() closer to the beginning of verifier.cEduard Zingerman
Subsequent patches would make use of explored_state() function. Move it up to avoid adding unnecessary prototype. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf, tcx: Get rid of tcx_link_constDaniel Borkmann
Small clean up to get rid of the extra tcx_link_const() and only retain the tcx_link(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20231023185015.21152-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-20bpf: Use bpf_global_percpu_ma for per-cpu kptr in __bpf_obj_drop_impl()Hou Tao
The following warning was reported when running "./test_progs -t test_bpf_ma/percpu_free_through_map_free": ------------[ cut here ]------------ WARNING: CPU: 1 PID: 68 at kernel/bpf/memalloc.c:342 CPU: 1 PID: 68 Comm: kworker/u16:2 Not tainted 6.6.0-rc2+ #222 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Workqueue: events_unbound bpf_map_free_deferred RIP: 0010:bpf_mem_refill+0x21c/0x2a0 ...... Call Trace: <IRQ> ? bpf_mem_refill+0x21c/0x2a0 irq_work_single+0x27/0x70 irq_work_run_list+0x2a/0x40 irq_work_run+0x18/0x40 __sysvec_irq_work+0x1c/0xc0 sysvec_irq_work+0x73/0x90 </IRQ> <TASK> asm_sysvec_irq_work+0x1b/0x20 RIP: 0010:unit_free+0x50/0x80 ...... bpf_mem_free+0x46/0x60 __bpf_obj_drop_impl+0x40/0x90 bpf_obj_free_fields+0x17d/0x1a0 array_map_free+0x6b/0x170 bpf_map_free_deferred+0x54/0xa0 process_scheduled_works+0xba/0x370 worker_thread+0x16d/0x2e0 kthread+0x105/0x140 ret_from_fork+0x39/0x60 ret_from_fork_asm+0x1b/0x30 </TASK> ---[ end trace 0000000000000000 ]--- The reason is simple: __bpf_obj_drop_impl() does not know the freeing field is a per-cpu pointer and it uses bpf_global_ma to free the pointer. Because bpf_global_ma is not a per-cpu allocator, so ksize() is used to select the corresponding cache. The bpf_mem_cache with 16-bytes unit_size will always be selected to do the unmatched free and it will trigger the warning in free_bulk() eventually. Because per-cpu kptr doesn't support list or rb-tree now, so fix the problem by only checking whether or not the type of kptr is per-cpu in bpf_obj_free_fields(), and using bpf_global_percpu_ma to these kptrs. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-7-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20bpf: Move the declaration of __bpf_obj_drop_impl() to bpf.hHou Tao
both syscall.c and helpers.c have the declaration of __bpf_obj_drop_impl(), so just move it to a common header file. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20bpf: Use pcpu_alloc_size() in bpf_mem_free{_rcu}()Hou Tao
For bpf_global_percpu_ma, the pointer passed to bpf_mem_free_rcu() is allocated by kmalloc() and its size is fixed (16-bytes on x86-64). So no matter which cache allocates the dynamic per-cpu area, on x86-64 cache[2] will always be used to free the per-cpu area. Fix the unbalance by checking whether the bpf memory allocator is per-cpu or not and use pcpu_alloc_size() instead of ksize() to find the correct cache for per-cpu free. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20bpf: Re-enable unit_size checking for global per-cpu allocatorHou Tao
With pcpu_alloc_size() in place, check whether or not the size of the dynamic per-cpu area is matched with unit_size. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19bpf: Let bpf_iter_task_new accept null task ptrChuyi Zhou
When using task_iter to iterate all threads of a specific task, we enforce that the user must pass a valid task pointer to ensure safety. However, when iterating all threads/process in the system, BPF verifier still require a valid ptr instead of "nullable" pointer, even though it's pointless, which is a kind of surprising from usability standpoint. It would be nice if we could let that kfunc accept a explicit null pointer when we are using BPF_TASK_ITER_ALL_{PROCS, THREADS} and a valid pointer when using BPF_TASK_ITER_THREAD. Given a trival kfunc: __bpf_kfunc void FN(struct TYPE_A *obj); BPF Prog would reject a nullptr for obj. The error info is: "arg#x pointer type xx xx must point to scalar, or struct with scalar" reported by get_kfunc_ptr_arg_type(). The reg->type is SCALAR_VALUE and the btf type of ref_t is not scalar or scalar_struct which leads to the rejection of get_kfunc_ptr_arg_type. This patch add "__nullable" annotation: __bpf_kfunc void FN(struct TYPE_A *obj__nullable); Here __nullable indicates obj can be optional, user can pass a explicit nullptr or a normal TYPE_A pointer. In get_kfunc_ptr_arg_type(), we will detect whether the current arg is optional and register is null, If so, return a new kfunc_ptr_arg_type KF_ARG_PTR_TO_NULL and skip to the next arg in check_kfunc_args(). Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-7-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19bpf: teach the verifier to enforce css_iter and task_iter in RCU CSChuyi Zhou
css_iter and task_iter should be used in rcu section. Specifically, in sleepable progs explicit bpf_rcu_read_lock() is needed before use these iters. In normal bpf progs that have implicit rcu_read_lock(), it's OK to use them directly. This patch adds a new a KF flag KF_RCU_PROTECTED for bpf_iter_task_new and bpf_iter_css_new. It means the kfunc should be used in RCU CS. We check whether we are in rcu cs before we want to invoke this kfunc. If the rcu protection is guaranteed, we would let st->type = PTR_TO_STACK | MEM_RCU. Once user do rcu_unlock during the iteration, state MEM_RCU of regs would be cleared. is_iter_reg_valid_init() will reject if reg->type is UNTRUSTED. It is worth noting that currently, bpf_rcu_read_unlock does not clear the state of the STACK_ITER reg, since bpf_for_each_spilled_reg only considers STACK_SPILL. This patch also let bpf_for_each_spilled_reg search STACK_ITER. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-6-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19bpf: Introduce css open-coded iterator kfuncsChuyi Zhou
This Patch adds kfuncs bpf_iter_css_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_css in open-coded iterator style. These kfuncs actually wrapps css_next_descendant_{pre, post}. css_iter can be used to: 1) iterating a sepcific cgroup tree with pre/post/up order 2) iterating cgroup_subsystem in BPF Prog, like for_each_mem_cgroup_tree/cpuset_for_each_descendant_pre in kernel. The API design is consistent with cgroup_iter. bpf_iter_css_new accepts parameters defining iteration order and starting css. Here we also reuse BPF_CGROUP_ITER_DESCENDANTS_PRE, BPF_CGROUP_ITER_DESCENDANTS_POST, BPF_CGROUP_ITER_ANCESTORS_UP enums. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-5-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19bpf: Introduce task open coded iterator kfuncsChuyi Zhou
This patch adds kfuncs bpf_iter_task_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_task in open-coded iterator style. BPF programs can use these kfuncs or through bpf_for_each macro to iterate all processes in the system. The API design keep consistent with SEC("iter/task"). bpf_iter_task_new() accepts a specific task and iterating type which allows: 1. iterating all process in the system (BPF_TASK_ITER_ALL_PROCS) 2. iterating all threads in the system (BPF_TASK_ITER_ALL_THREADS) 3. iterating all threads of a specific task (BPF_TASK_ITER_PROC_THREADS) Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Link: https://lore.kernel.org/r/20231018061746.111364-4-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>