summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2015-12-23sunrpc: Add a function to close temporary transports immediatelyScott Mayhew
Add a function svc_age_temp_xprts_now() to close temporary transports whose xpt_local matches the address passed in server_addr immediately instead of waiting for them to be closed by the timer function. The function is intended to be used by notifier_blocks that will be added to nfsd and lockd that will run when an ip address is deleted. This will eliminate the ACK storms and client hangs that occur in HA-NFS configurations where nfsd & lockd is left running on the cluster nodes all the time and the NFS 'service' is migrated back and forth within a short timeframe. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-12-22tcp: honour SO_BINDTODEVICE for TW_RST case tooFlorian Westphal
Hannes points out that when we generate tcp reset for timewait sockets we pretend we found no socket and pass NULL sk to tcp_vX_send_reset(). Make it cope with inet tw sockets and then provide tw sk. This makes RSTs appear on correct interface when SO_BINDTODEVICE is used. Packetdrill test case: // want default route to be used, we rely on BINDTODEVICE `ip route del 192.0.2.0/24 via 192.168.0.2 dev tun0` 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 // test case still works due to BINDTODEVICE 0.001 setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "tun0", 4) = 0 0.100...0.200 connect(3, ..., ...) = 0 0.100 > S 0:0(0) <mss 1460,sackOK,nop,nop> 0.200 < S. 0:0(0) ack 1 win 32792 <mss 1460,sackOK,nop,nop> 0.200 > . 1:1(0) ack 1 0.210 close(3) = 0 0.210 > F. 1:1(0) ack 1 win 29200 0.300 < . 1:1(0) ack 2 win 46 // more data while in FIN_WAIT2, expect RST 1.300 < P. 1:1001(1000) ack 1 win 46 // fails without this change -- default route is used 1.301 > R 1:1(0) win 0 Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22tcp: send_reset: test for non-NULL sk firstFlorian Westphal
tcp_md5_do_lookup requires a full socket, so once we extend _send_reset() to also accept timewait socket we would have to change if (!sk && hash_location) to something like if ((!sk || !sk_fullsock(sk)) && hash_location) { ... } else { (sk && sk_fullsock(sk)) tcp_md5_do_lookup() } Switch the two branches: check if we have a socket first, then fall back to a listener lookup if we saw a md5 option (hash_location). Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22addrconf: always initialize sysctl table dataWANG Cong
When sysctl performs restrict writes, it allows to write from a middle position of a sysctl file, which requires us to initialize the table data before calling proc_dostring() for the write case. Fixes: 3d1bec99320d ("ipv6: introduce secret_stable to ipv6_devconf") Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2015-12-22 Just one patch to fix dst_entries_init with multiple namespaces. From Dan Streetman. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22net: tcp: deal with listen sockets properly in tcp_abort.Lorenzo Colitti
When closing a listen socket, tcp_abort currently calls tcp_done without clearing the request queue. If the socket has a child socket that is established but not yet accepted, the child socket is then left without a parent, causing a leak. Fix this by setting the socket state to TCP_CLOSE and calling inet_csk_listen_stop with the socket lock held, like tcp_close does. Tested using net_test. With this patch, calling SOCK_DESTROY on a listen socket that has an established but not yet accepted child socket results in the parent and the child being closed, such that they no longer appear in sock_diag dumps. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22ipv6/addrlabel: fix ip6addrlbl_get()Andrey Ryabinin
ip6addrlbl_get() has never worked. If ip6addrlbl_hold() succeeded, ip6addrlbl_get() will exit with '-ESRCH'. If ip6addrlbl_hold() failed, ip6addrlbl_get() will use about to be free ip6addrlbl_entry pointer. Fix this by inverting ip6addrlbl_hold() check. Fixes: 2a8cc6c89039 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Cong Wang <cwang@twopensource.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22switchdev: bridge: Pass ageing time as clock_t instead of jiffiesIdo Schimmel
The bridge's ageing time is offloaded to hardware when: 1) A port joins a bridge 2) The ageing time of the bridge is changed In the first case the ageing time is offloaded as jiffies, but in the second case it's offloaded as clock_t, which is what existing switchdev drivers expect to receive. Fixes: 6ac311ae8bfb ("Adding switchdev ageing notification on port bridged") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22RDS: don't pretend to use cpu notifiersSebastian Andrzej Siewior
It looks like an attempt to use CPU notifier here which was never completed. Nobody tried to wire it up completely since 2k9. So I unwind this code and get rid of everything not required. Oh look! 19 lines were removed while code still does the same thing. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Tested-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22net-sysfs: use to_net_dev in net_namespace()Geliang Tang
Use to_net_dev() instead of open-coding it. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains two netfilter fixes: 1) Oneliner from Florian to dump missing NFT_CT_L3PROTOCOL netlink attribute, from Florian Westphal. 2) Another oneliner for nf_tables to use skb->protocol from the new netdev family, we can't assume ethernet there. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22xprtrdma: Avoid calling ib_query_deviceOr Gerlitz
Instead, use the cached copy of the attributes present on the device. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-12-22net/rds: Avoid calling ib_query_deviceOr Gerlitz
Instead, use the cached copy of the attributes present on the device. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-12-206lowpan: fix debugfs interface entry nameAlexander Aring
This patches moves the debugfs interface related register after netdevice register. The function lowpan_dev_debugfs_init will use "dev->name" which can be before register_netdevice a format string. The function register_netdevice will evaluate the format string if necessary and replace "dev->name" to the real interface name. Reported-by: Lukasz Duda <lukasz.duda@nordicsemi.no> Signed-off-by: Alexander Aring <alex.aring@gmail.com> Acked-by: Lukasz Duda <lukasz.duda@nordicsemi.no> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-12-20Bluetooth: use list_for_each_entry*Geliang Tang
Use list_for_each_entry*() instead of list_for_each*() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-12-18openvswitch: correct encoding of set tunnel action attributesSimon Horman
In a set action tunnel attributes should be encoded in a nested action. I noticed this because ovs-dpctl was reporting an error when dumping flows due to the incorrect encoding of tunnel attributes in a set action. Fixes: fc4099f17240 ("openvswitch: Fix egress tunnel info.") Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18ipip: ioctl: Remove superfluous IP-TTL handling.Pravin B Shelar
IP-TTL case is already handled in ip_tunnel_ioctl() API. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18tcp: diag: add support for request sockets to tcp_abort()Eric Dumazet
Adding support for SYN_RECV request sockets to tcp_abort() is quite easy after our tcp listener rewrite. Note that we also need to better handle listeners, or we might leak not yet accepted children, because of a missing inet_csk_listen_stop() call. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Tested-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18bpf: fix misleading comment in bpf_convert_filterDaniel Borkmann
Comment says "User BPF's register A is mapped to our BPF register 6", which is actually wrong as the mapping is on register 0. This can already be inferred from the code itself. So just remove it before someone makes assumptions based on that. Only code tells truth. ;) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18bpf: move clearing of A/X into classic to eBPF migration prologueDaniel Borkmann
Back in the days where eBPF (or back then "internal BPF" ;->) was not exposed to user space, and only the classic BPF programs internally translated into eBPF programs, we missed the fact that for classic BPF A and X needed to be cleared. It was fixed back then via 83d5b7ef99c9 ("net: filter: initialize A and X registers"), and thus classic BPF specifics were added to the eBPF interpreter core to work around it. This added some confusion for JIT developers later on that take the eBPF interpreter code as an example for deriving their JIT. F.e. in f75298f5c3fe ("s390/bpf: clear correct BPF accumulator register"), at least X could leak stack memory. Furthermore, since this is only needed for classic BPF translations and not for eBPF (verifier takes care that read access to regs cannot be done uninitialized), more complexity is added to JITs as they need to determine whether they deal with migrations or native eBPF where they can just omit clearing A/X in their prologue and thus reduce image size a bit, see f.e. cde66c2d88da ("s390/bpf: Only clear A and X for converted BPF programs"). In other cases (x86, arm64), A and X is being cleared in the prologue also for eBPF case, which is unnecessary. Lets move this into the BPF migration in bpf_convert_filter() where it actually belongs as long as the number of eBPF JITs are still few. It can thus be done generically; allowing us to remove the quirk from __bpf_prog_run() and to slightly reduce JIT image size in case of eBPF, while reducing code duplication on this matter in current(/future) eBPF JITs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Zi Shen Lim <zlim.lnx@gmail.com> Cc: Yang Shi <yang.shi@linaro.org> Acked-by: Yang Shi <yang.shi@linaro.org> Acked-by: Zi Shen Lim <zlim.lnx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18bpf: add bpf_skb_load_bytes helperDaniel Borkmann
When hacking tc programs with eBPF, one of the issues that come up from time to time is to load addresses from headers. In eBPF as in classic BPF, we have BPF_LD | BPF_ABS | BPF_{B,H,W} instructions that extract a byte, half-word or word out of the skb data though helpers such as bpf_load_pointer() (interpreter case). F.e. extracting a whole IPv6 address could possibly look like ... union v6addr { struct { __u32 p1; __u32 p2; __u32 p3; __u32 p4; }; __u8 addr[16]; }; [...] a.p1 = htonl(load_word(skb, off)); a.p2 = htonl(load_word(skb, off + 4)); a.p3 = htonl(load_word(skb, off + 8)); a.p4 = htonl(load_word(skb, off + 12)); [...] /* access to a.addr[...] */ This work adds a complementary helper bpf_skb_load_bytes() (we also have bpf_skb_store_bytes()) as an alternative where the same call would look like from an eBPF program: ret = bpf_skb_load_bytes(skb, off, addr, sizeof(addr)); Same verifier restrictions apply as in ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors") case, where stack memory access needs to be statically verified and thus guaranteed to be initialized in first use (otherwise verifier cannot tell whether a subsequent access to it is valid or not as it's runtime dependent). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains the first batch of Netfilter updates for the upcoming 4.5 kernel. This batch contains userspace netfilter header compilation fixes, support for packet mangling in nf_tables, the new tracing infrastructure for nf_tables and cgroup2 support for iptables. More specifically, they are: 1) Two patches to include dependencies in our netfilter userspace headers to resolve compilation problems, from Mikko Rapeli. 2) Four comestic cleanup patches for the ebtables codebase, from Ian Morris. 3) Remove duplicate include in the netfilter reject infrastructure, from Stephen Hemminger. 4) Two patches to simplify the netfilter defragmentation code for IPv6, patch from Florian Westphal. 5) Fix root ownership of /proc/net netfilter for unpriviledged net namespaces, from Philip Whineray. 6) Get rid of unused fields in struct nft_pktinfo, from Florian Westphal. 7) Add mangling support to our nf_tables payload expression, from Patrick McHardy. 8) Introduce a new netlink-based tracing infrastructure for nf_tables, from Florian Westphal. 9) Change setter functions in nfnetlink_log to be void, from Rami Rosen. 10) Add netns support to the cttimeout infrastructure. 11) Add cgroup2 support to iptables, from Tejun Heo. 12) Introduce nfnl_dereference_protected() in nfnetlink, from Florian. 13) Add support for mangling pkttype in the nf_tables meta expression, also from Florian. BTW, I need that you pull net into net-next, I have another batch that requires changes that I don't yet see in net. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').Chuck Lever
The root of the problem was that sends (especially unsignalled FASTREG and LOCAL_INV Work Requests) were not properly flow- controlled, which allowed a send queue overrun. Now that the RPC/RDMA reply handler waits for invalidation to complete, the send queue is properly flow-controlled. Thus this limit is no longer necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18xprtrdma: Invalidate in the RPC reply handlerChuck Lever
There is a window between the time the RPC reply handler wakes the waiting RPC task and when xprt_release() invokes ops->buf_free. During this time, memory regions containing the data payload may still be accessed by a broken or malicious server, but the RPC application has already been allowed access to the memory containing the RPC request's data payloads. The server should be fenced from client memory containing RPC data payloads _before_ the RPC application is allowed to continue. This change also more strongly enforces send queue accounting. There is a maximum number of RPC calls allowed to be outstanding. When an RPC/RDMA transport is set up, just enough send queue resources are allocated to handle registration, Send, and invalidation WRs for each those RPCs at the same time. Before, additional RPC calls could be dispatched while invalidation WRs were still consuming send WQEs. When invalidation WRs backed up, dispatching additional RPCs resulted in a send queue overrun. Now, the reply handler prevents RPC dispatch until invalidation is complete. This prevents RPC call dispatch until there are enough send queue resources to proceed. Still to do: If an RPC exits early (say, ^C), the reply handler has no opportunity to perform invalidation. Currently, xprt_rdma_free() still frees remaining RDMA resources, which could deadlock. Additional changes are needed to handle invalidation properly in this case. Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18xprtrdma: Add ro_unmap_sync method for all-physical registrationChuck Lever
physical's ro_unmap is synchronous already. The new ro_unmap_sync method just has to DMA unmap all MRs associated with the RPC request. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18xprtrdma: Add ro_unmap_sync method for FMRChuck Lever
FMR's ro_unmap method is already synchronous because ib_unmap_fmr() is a synchronous verb. However, some improvements can be made here. 1. Gather all the MRs for the RPC request onto a list, and invoke ib_unmap_fmr() once with that list. This reduces the number of doorbells when there is more than one MR to invalidate 2. Perform the DMA unmap _after_ the MRs are unmapped, not before. This is critical after invalidating a Write chunk. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18xprtrdma: Add ro_unmap_sync method for FRWRChuck Lever
FRWR's ro_unmap is asynchronous. The new ro_unmap_sync posts LOCAL_INV Work Requests and waits for them to complete before returning. Note also, DMA unmapping is now done _after_ invalidation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18xprtrdma: Introduce ro_unmap_sync methodChuck Lever
In the current xprtrdma implementation, some memreg strategies implement ro_unmap synchronously (the MR is knocked down before the method returns) and some asynchonously (the MR will be knocked down and returned to the pool in the background). To guarantee the MR is truly invalid before the RPC consumer is allowed to resume execution, we need an unmap method that is always synchronous, invoked from the RPC/RDMA reply handler. The new method unmaps all MRs for an RPC. The existing ro_unmap method unmaps only one MR at a time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18xprtrdma: Move struct ib_send_wr off the stackChuck Lever
For FRWR FASTREG and LOCAL_INV, move the ib_*_wr structure off the stack. This allows frwr_op_map and frwr_op_unmap to chain WRs together without limit to register or invalidate a set of MRs with a single ib_post_send(). (This will be for chaining LOCAL_INV requests). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18xprtrdma: Disable RPC/RDMA backchannel debugging messagesChuck Lever
Clean up. Fixes: 63cae47005af ('xprtrdma: Handle incoming backward direction') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18xprtrdma: xprt_rdma_free() must not release backchannel reqsChuck Lever
Preserve any rpcrdma_req that is attached to rpc_rqst's allocated for the backchannel. Otherwise, after all the pre-allocated backchannel req's are consumed, incoming backward calls start writing on freed memory. Somehow this hunk got lost. Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)Chuck Lever
Clean up. rb_lock critical sections added in rpcrdma_ep_post_extra_recv() should have first been converted to use normal spin_lock now that the reply handler is a work queue. The backchannel set up code should use the appropriate helper instead of open-coding a rb_recv_bufs list add. Problem introduced by glib patch re-ordering on my part. Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18xprtrdma: checking for NULL instead of IS_ERR()Dan Carpenter
The rpcrdma_create_req() function returns error pointers or success. It never returns NULL. Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18xprtrdma: clean up some curly bracesDan Carpenter
It doesn't matter either way, but the curly braces were clearly intended here. It causes a Smatch warning. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18net: Allow accepted sockets to be bound to l3mdev domainDavid Ahern
Allow accepted sockets to derive their sk_bound_dev_if setting from the l3mdev domain in which the packets originated. A sysctl setting is added to control the behavior which is similar to sk_mark and sysctl_tcp_fwmark_accept. This effectively allow a process to have a "VRF-global" listen socket, with child sockets bound to the VRF device in which the packet originated. A similar behavior can be achieved using sk_mark, but a solution using marks is incomplete as it does not handle duplicate addresses in different L3 domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev domain provides a complete solution. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18ipv6: addrconf: use stable address generator for ARPHRD_NONEBjørn Mork
Add a new address generator mode, using the stable address generator with an automatically generated secret. This is intended as a default address generator mode for device types with no EUI64 implementation. The new generator is used for ARPHRD_NONE interfaces initially, adding default IPv6 autoconf support to e.g. tun interfaces. If the addrgenmode is set to 'random', either by default or manually, and no stable secret is available, then a random secret is used as input for the stable-privacy address generator. The secret can be read and modified like manually configured secrets, using the proc interface. Modifying the secret will change the addrgen mode to 'stable-privacy' to indicate that it operates on a known secret. Existing behaviour of the 'stable-privacy' mode is kept unchanged. If a known secret is available when the device is created, then the mode will default to 'stable-privacy' as before. The mode can be manually set to 'random' but it will behave exactly like 'stable-privacy' in this case. The secret will not change. Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: 吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Bjørn Mork <bjorn@mork.no> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18ila: add NETFILTER dependencyArnd Bergmann
The recently added generic ILA translation facility fails to build when CONFIG_NETFILTER is disabled: net/ipv6/ila/ila_xlat.c:229:20: warning: 'struct nf_hook_state' declared inside parameter list net/ipv6/ila/ila_xlat.c:235:27: error: array type has incomplete element type 'struct nf_hook_ops' static struct nf_hook_ops ila_nf_hook_ops[] __read_mostly = { This adds an explicit Kconfig dependency to avoid that case. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 7f00feaf1076 ("ila: Add generic ILA translation facility") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18netfilter: nft_ct: include direction when dumping NFT_CT_L3PROTOCOL keyFlorian Westphal
one nft userspace test case fails with 'ct l3proto original ipv4' mismatches 'ct l3proto ipv4' ... because NFTA_CT_DIRECTION attr is missing. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-18netfilter: nf_tables: use skb->protocol instead of assuming ethernet headerPablo Neira Ayuso
Otherwise we may end up with incorrect network and transport header for other protocols. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-18netfilter: meta: add support for setting skb->pkttypeFlorian Westphal
This allows to redirect bridged packets to local machine: ether type ip ether daddr set aa:53:08:12:34:56 meta pkttype set unicast Without 'set unicast', ip stack discards PACKET_OTHERHOST skbs. It is also useful to add support for a '-m cluster like' nft rule (where switch floods packets to several nodes, and each cluster node node processes a subset of packets for load distribution). Mangling is restricted to HOST/OTHER/BROAD/MULTICAST, i.e. you cannot set skb->pkt_type to PACKET_KERNEL or change PACKET_LOOPBACK to PACKET_HOST. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/geneve.c Here we had an overlapping change, where in 'net' the extraneous stats bump was being removed whilst in 'net-next' the final argument to udp_tunnel6_xmit_skb() was being changed. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix uninitialized variable warnings in nfnetlink_queue, a lot of people reported this... From Arnd Bergmann. 2) Don't init mutex twice in i40e driver, from Jesse Brandeburg. 3) Fix spurious EBUSY in rhashtable, from Herbert Xu. 4) Missing DMA unmaps in mvpp2 driver, from Marcin Wojtas. 5) Fix race with work structure access in pppoe driver causing corruptions, from Guillaume Nault. 6) Fix OOPS due to sh_eth_rx() not checking whether netdev_alloc_skb() actually succeeded or not, from Sergei Shtylyov. 7) Don't lose flags when settifn IFA_F_OPTIMISTIC in ipv6 code, from Bjørn Mork. 8) VXLAN_HD_RCO defined incorrectly, fix from Jiri Benc. 9) Fix clock source used for cookies in SCTP, from Marcelo Ricardo Leitner. 10) aurora driver needs HAS_DMA dependency, from Geert Uytterhoeven. 11) ndo_fill_metadata_dst op of vxlan has to handle ipv6 tunneling properly as well, from Jiri Benc. 12) Handle request sockets properly in xfrm layer, from Eric Dumazet. 13) Double stats update in ipv6 geneve transmit path, fix from Pravin B Shelar. 14) sk->sk_policy[] needs RCU protection, and as a result xfrm_policy_destroy() needs to free policies using an RCU grace period, from Eric Dumazet. 15) SCTP needs to clone ipv6 tx options in order to avoid use after free, from Eric Dumazet. 16) Missing kbuild export if ila.h, from Stephen Hemminger. 17) Missing mdiobus_alloc() return value checking in mdio-mux.c, from Tobias Klauser. 18) Validate protocol value range in ->create() methods, from Hannes Frederic Sowa. 19) Fix early socket demux races that result in illegal dst reuse, from Eric Dumazet. 20) Validate socket address length in pptp code, from WANG Cong. 21) skb_reorder_vlan_header() uses incorrect offset and can corrupt packets, from Vlad Yasevich. 22) Fix memory leaks in nl80211 registry code, from Ola Olsson. 23) Timeout loop count handing fixes in mISDN, xgbe, qlge, sfc, and qlcnic. From Dan Carpenter. 24) msg.msg_iocb needs to be cleared in recvfrom() otherwise, for example, AF_ALG will interpret it as an async call. From Tadeusz Struk. 25) inetpeer_set_addr_v4 forgets to initialize the 'vif' field, from Eric Dumazet. 26) rhashtable enforces the minimum table size not early enough, breaking how we calculate the per-cpu lock allocations. From Herbert Xu. 27) Fix FCC port lockup in 82xx driver, from Martin Roth. 28) FOU sockets need to be freed using RCU, from Hannes Frederic Sowa. 29) Fix out-of-bounds access in __skb_complete_tx_timestamp() and sock_setsockopt() wrt. timestamp handling. From WANG Cong. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (117 commits) net: check both type and procotol for tcp sockets drivers: net: xgene: fix Tx flow control tcp: restore fastopen with no data in SYN packet af_unix: Revert 'lock_interruptible' in stream receive code fou: clean up socket with kfree_rcu 82xx: FCC: Fixing a bug causing to FCC port lock-up gianfar: Don't enable RX Filer if not supported net: fix warnings in 'make htmldocs' by moving macro definition out of field declaration rhashtable: Fix walker list corruption rhashtable: Enforce minimum size on initial hash table inet: tcp: fix inetpeer_set_addr_v4() ipv6: automatically enable stable privacy mode if stable_secret set net: fix uninitialized variable issue bluetooth: Validate socket address length in sco_sock_bind(). net_sched: make qdisc_tree_decrease_qlen() work for non mq ser_gigaset: remove unnecessary kfree() calls from release method ser_gigaset: fix deallocation of platform device structure ser_gigaset: turn nonsense checks into WARN_ON ser_gigaset: fix up NULL checks qlcnic: fix a timeout loop ...
2015-12-17net: check both type and procotol for tcp socketsWANG Cong
Dmitry reported the following out-of-bound access: Call Trace: [<ffffffff816cec2e>] __asan_report_load4_noabort+0x3e/0x40 mm/kasan/report.c:294 [<ffffffff84affb14>] sock_setsockopt+0x1284/0x13d0 net/core/sock.c:880 [< inline >] SYSC_setsockopt net/socket.c:1746 [<ffffffff84aed7ee>] SyS_setsockopt+0x1fe/0x240 net/socket.c:1729 [<ffffffff85c18c76>] entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185 This is because we mistake a raw socket as a tcp socket. We should check both sk->sk_type and sk->sk_protocol to ensure it is a tcp socket. Willem points out __skb_complete_tx_timestamp() needs to fix as well. Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17tcp: restore fastopen with no data in SYN packetEric Dumazet
Yuchung tracked a regression caused by commit 57be5bdad759 ("ip: convert tcp_sendmsg() to iov_iter primitives") for TCP Fast Open. Some Fast Open users do not actually add any data in the SYN packet. Fixes: 57be5bdad759 ("ip: convert tcp_sendmsg() to iov_iter primitives") Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17af_unix: Revert 'lock_interruptible' in stream receive codeRainer Weikusat
With b3ca9b02b00704053a38bfe4c31dbbb9c13595d0, the AF_UNIX SOCK_STREAM receive code was changed from using mutex_lock(&u->readlock) to mutex_lock_interruptible(&u->readlock) to prevent signals from being delayed for an indefinite time if a thread sleeping on the mutex happened to be selected for handling the signal. But this was never a problem with the stream receive code (as opposed to its datagram counterpart) as that never went to sleep waiting for new messages with the mutex held and thus, wouldn't cause secondary readers to block on the mutex waiting for the sleeping primary reader. As the interruptible locking makes the code more complicated in exchange for no benefit, change it back to using mutex_lock. Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17ipv6: add IPV6_HDRINCL option for raw socketsHannes Frederic Sowa
Same as in Windows, we miss IPV6_HDRINCL for SOL_IPV6 and SOL_RAW. The SOL_IP/IP_HDRINCL is not available for IPv6 sockets. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17ipv6: allow routes to be configured with expire valuesXin Long
Add the support for adding expire value to routes, requested by Tom Gundersen <teg@jklm.no> for systemd-networkd, and NetworkManager wants it too. implement it by adding the new RTNETLINK attribute RTA_EXPIRES. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16fou: clean up socket with kfree_rcuHannes Frederic Sowa
fou->udp_offloads is managed by RCU. As it is actually included inside the fou sockets, we cannot let the memory go out of scope before a grace period. We either can synchronize_rcu or switch over to kfree_rcu to manage the sockets. kfree_rcu seems appropriate as it is used by vxlan and geneve. Fixes: 23461551c00628c ("fou: Support for foo-over-udp RX path") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16net: Pass ndm_state to route netlink FDB notifications.Hubert Sokolowski
Before this change applications monitoring FDB notifications were not able to determine whether a new FDB entry is permament or not: bridge fdb add f1:f2:f3:f4:f5:f8 dev sw0p1 temp self bridge fdb add f1:f2:f3:f4:f5:f9 dev sw0p1 self bridge monitor fdb f1:f2:f3:f4:f5:f8 dev sw0p1 self permanent f1:f2:f3:f4:f5:f9 dev sw0p1 self permanent With this change ndm_state from the original netlink message is passed to the new netlink message sent as notification. bridge fdb add f1:f2:f3:f4:f5:f6 dev sw0p1 self bridge fdb add f1:f2:f3:f4:f5:f7 dev sw0p1 temp self bridge monitor fdb f1:f2:f3:f4:f5:f6 dev sw0p1 self permanent f1:f2:f3:f4:f5:f7 dev sw0p1 self static Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16Merge tag 'mac80211-for-davem-2015-12-15' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Another set of fixes: * memory leak fixes (from Ola) * operating mode notification spec compliance fix (from Eyal) * copy rfkill names in case pointer becomes invalid (myself) * two hardware restart fixes (myself) * get rid of "limiting TX power" log spam (myself) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>