Bcache does practically everything asynchronously, so in the process of writing it I implemented a generic mechanism that was the culmination of some old ideas I had. I don't know who else would need them, but they are rather nifty so perhaps I ought to try and document them. 

They're called closures, because they are almost but not entirely unlike real closures: 

    struct closure;

    closure_init(struct closure *cl, struct closure *parent);
    closure_get(struct closure *cl);
    closure_put(struct closure *cl, struct workqueue_struct *wq);

As you might've guessed they embed a refcount. This refcount corresponds to the number of things the closure is waiting on; when it hits 0, Something Happens. Closure_init() sets it to 1. 

Why 1? Because a closure corresponds to an execution context; that is, they can be running (or not running) and when they are running, the closure itself is waiting on whoever's running it to finish. 

Some examples should make this clearer. First, let's do some IO synchronously: 

    bio_submit_c(bio1, cl);
    bio_submit_c(bio2, cl);
    closure_sync(cl);

Closure_sync() sleeps until the refcount hits 1 - when it returns, we're not waiting on anything (except for ourselves). 

The asynchronous case is where it gets interesting: 

    bio_submit_c(bio1, cl);
    bio_submit_c(bio2, cl);
    return_f(cl, next_fn);

return_f() sets the next function to run, calls closure_put() and returns (yes, it's a macro that returns from the calling function). 

The end result is that next_fn() will be called when everything the closure was waiting on has finished, with the closure itself as the single argument. 

That's why you hold a refcount while you're running a closure - otherwise next_fn() could run while you're still running. 
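
To make that concrete, here's a rough sketch of the asynchronous pattern - none of this is from bcache, it just uses the calls described above (the bios, write_done() and the surrounding context are invented for the example, and bio_submit_c() is assumed to take a ref on the closure for each bio and drop it when that bio completes):

    /*
     * Sketch: write_two_bios() kicks off two bios and returns immediately;
     * write_done() runs, with the closure as its only argument, once both
     * bios have completed.
     */
    static void write_done(struct closure *cl)
    {
            /* both bios have finished by the time we get here */
            pr_info("writes complete\n");
            /* ...typically ends with another return_f(), or a final
             * closure_put() once there's nothing left to do */
    }

    static void write_two_bios(struct closure *cl, struct bio *bio1,
                               struct bio *bio2)
    {
            bio_submit_c(bio1, cl);
            bio_submit_c(bio2, cl);
            return_f(cl, write_done);       /* drops our ref and returns */
    }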

And what's that good for? Well, sometimes you can't just sleep; if you're waiting on IO but you were called from generic_make_request(), that IO is never going to start until you return. Or if you're running out of a workqueue, you're blocking other work items until you return. 

The rest of the interface is for waiting on stuff. If you want to wait on just one thing - say, for a bio to complete - you can stick a pointer to your closure somewhere the bio_endio() function can find it, and have it call closure_put(). 
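
A sketch of that single-waiter case (the endio signature here is the old two-argument style, which varies by kernel version, and my_wq is a made-up workqueue - bcache's bio_submit_c() presumably does something equivalent internally):

    /* Sketch: hand the bio a pointer to our closure via bi_private, and
     * drop the ref from the completion handler. */
    static void my_endio(struct bio *bio, int error)
    {
            struct closure *cl = bio->bi_private;

            bio_put(bio);
            closure_put(cl, my_wq);         /* hypothetical workqueue */
    }

    static void submit_with_closure(struct bio *bio, struct closure *cl)
    {
            closure_get(cl);                /* one ref per thing we wait on */
            bio->bi_private = cl;
            bio->bi_end_io  = my_endio;
            generic_make_request(bio);
    }

But if you have some event that many closures could be waiting on, you need a list: 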

    typedef closure_list_t;

    bool closure_wait(closure_list_t *list, struct closure *cl);
    void closure_run_wait(closure_list_t *list, struct workqueue_struct *wq);

    closure_wait_on(list, wq, c, condition)

A closure can only be on one wait list at a time (wait lists are implemented as singly linked lists, with the next pointer part of the closure). A wait list should correspond to an event - say, in bcache I've got one for when the journal fills up, so anything that needs space can wait until there's room again. 

closure_wait() will call closure_get(), then add the closure to the waitlist - or return false if the closure was already on a waitlist. 

closure_run_wait() will run everything on a given wait list, out of the workqueue you pass it. 

They're both lockless, the caller doesn't have to care about synchronization, and they can both be run from hardirq context. 
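
Here's a hypothetical waiter/waker pair to show the shape of it - the journal struct, journal_has_room(), get_journal() and my_wq are all invented for the example:

    /* Sketch: a closure that needs journal space parks itself on the
     * journal's wait list; whoever frees up space kicks the whole list. */
    struct journal {
            closure_list_t          wait;   /* closures waiting for room */
            /* ... */
    };

    static void journal_write(struct closure *cl)
    {
            struct journal *j = get_journal(cl);

            if (!journal_has_room(j)) {
                    /* park first, then recheck, so a wakeup can't slip in
                     * between the check and the add (this is what
                     * closure_wait_on() below packages up for you) */
                    closure_wait(&j->wait, cl);
                    if (!journal_has_room(j))
                            return_f(cl, journal_write);  /* rerun when woken */
            }

            /* there's room: do the write... */
    }

    /* called by whoever frees up journal space */
    static void journal_space_freed(struct journal *j)
    {
            closure_run_wait(&j->wait, my_wq);
    }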

closure_wait_on() is a macro that waits on arbitrary events (much like wait_event()). If your closure was in blocking mode, it additionally sleeps until the event is true - otherwise it just adds you to the wait list and returns whether or not the condition you were waiting on was true. 

There's also closure_wait_on_async(), which always waits asynchronously - useful if, say, you have to check the condition while holding a spinlock. 
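
With the macro, the journal example above collapses to something like this (again just a sketch, using the same invented helpers):

    static void journal_write(struct closure *cl)
    {
            struct journal *j = get_journal(cl);

            /* in blocking mode this sleeps until there's room; otherwise
             * it parks us on j->wait and tells us whether there was room */
            if (!closure_wait_on(&j->wait, my_wq, cl, journal_has_room(j)))
                    return_f(cl, journal_write);    /* woken up -> rerun */

            /* there's room: do the write... */
    }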

Lastly: back at the start, the prototype for closure_init() took a second argument... 

Closures can wait on other closures - they implement a spaghetti stack. I will leave the implications of that to the imagination. 

***************** 

Closures are meant to be lightweight and fast - they're 48 bytes on 64 bit machines, including the embedded workqueue. 

They're also meant to be embedded into other structures, as is commonly done in the kernel. You can (and should) think of the enclosing structure as a stack frame: the variables in your struct correspond to local variables, and the struct closure holds what corresponds to the return address. 
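
A sketch of what that embedding usually looks like (every name here is invented; container_of() is the usual way back from the closure to its "stack frame"):

    /* Sketch: the enclosing struct is the stack frame, the closure holds
     * the refcount and the "return address" (the next function to run). */
    struct my_op {
            struct bio              *bio;   /* "local variables" */
            unsigned                error;

            struct closure          cl;
    };

    static inline struct my_op *op_from_closure(struct closure *cl)
    {
            return container_of(cl, struct my_op, cl);
    }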

I mentioned that closures can wait on other closures; more precisely, a closure can be a child of another closure - in which case, it holds a refcount on the parent. 

That refcount is taken when you pass a non-null parent to closure_init(), and it's dropped when the closure goes away. I don't recommend using the parent field for anything more interesting - reseating it, or manipulating the refcount directly. 
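
In code, the parent/child relationship is just the second argument to closure_init() - a sketch, reusing the made-up struct my_op from above:

    struct my_sub_op {
            /* the child's "locals"... */
            struct closure          cl;
    };

    static void start_sub_op(struct my_op *parent_op, struct my_sub_op *sub)
    {
            /* takes a ref on parent_op->cl; it's dropped when sub->cl
             * goes away (see below), so the parent can't finish first */
            closure_init(&sub->cl, &parent_op->cl);

            /* ...kick off the child's work... */
    }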

A closure goes away when cl->remaining hits 0 and there's no next function to run. (Conceptually - it doesn't free itself. If you want it to free itself, make a destructor the last function it runs.) 
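
If the closure lives in a kmalloc()ed struct, the destructor-as-last-function trick looks something like this sketch (again using the invented struct my_op and op_from_closure() from above):

    /* Sketch: free the enclosing struct once everything has finished. */
    static void my_op_free(struct closure *cl)
    {
            kfree(op_from_closure(cl));
    }

    static void my_op_done(struct closure *cl)
    {
            /* ...finish up, report completion, etc... */
            return_f(cl, my_op_free);       /* my_op_free() runs once the
                                               refcount drains to 0 */
    }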

Previously I mentioned return_f(), but I didn't explain what it really means. Remember that return_f() takes a closure and the next function to run - and the closure itself represents a stack frame complete with return address. 

Then return_f() is really a tailcall that also passes the return address: i.e. return_f() is a function call in continuation passing style. 
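
Written out, a chain of closure functions reads like a sequence of continuations - a sketch (only return_f() is real here; the stages are invented):

    /* Sketch: stage1 -> stage2 -> done, each stage running only once the
     * IO the previous stage issued has completed. */
    static void read_done(struct closure *cl)
    {
            /* last step - would typically return_f() to a destructor,
             * as in the sketch above */
    }

    static void read_stage2(struct closure *cl)
    {
            /* issue the second batch of IO against cl... */
            return_f(cl, read_done);
    }

    static void read_stage1(struct closure *cl)
    {
            /* issue the first batch of IO against cl... */
            return_f(cl, read_stage2);
    }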

Other random bits: I mentioned closures can be in blocking mode. What's this good for? 

I'll give an example from bcache. When we traverse the btree, we call get_bucket() to get a btree node - looking it up in the cache, locking it, and potentially reading it in from disk. 

If we're doing a cache lookup for a read, we're processing a bio out of generic_make_request() - which means if we had to read a btree node in from disk we can't just sleep in get_bucket(). So get_bucket() adds us to the btree node's wait list, starts the IO and returns -EAGAIN; request_read() then calls return_f(cl, request_read) and when it runs again the btree node will be in memory. 
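
Roughly, the shape of that retry (a sketch - the real bcache code certainly differs in its details, get_bucket()'s signature and return convention are invented here, and struct search is just a stand-in for whatever holds the request state):

    struct search {
            /* request state... */
            struct closure          cl;
    };

    static void request_read(struct closure *cl)
    {
            struct search *s = container_of(cl, struct search, cl);
            struct btree *b;

            b = get_bucket(s, cl);          /* may park cl and start the read */
            if (b == ERR_PTR(-EAGAIN))
                    return_f(cl, request_read);     /* rerun once it's in memory */

            /* the node is in the cache: do the lookup... */
    }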

However, if we're inserting into the btree we don't want to just bail out, unlocking the btree nodes we were working on, whenever we need to block on something; insertions run out of a dedicated workqueue, so they can sleep. So insertions run in blocking mode, and when get_bucket() (and more importantly, btree_alloc()) call closure_wait_on(), it magically does the right thing.
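
In other words, both callers go through the same wait site; inside get_bucket() that might look something like this (a sketch - b->io_wait, lookup_bucket(), start_btree_read(), btree_node_uptodate() and my_wq are all invented):

    struct btree {
            closure_list_t          io_wait;        /* waiters for the read */
            /* ... */
    };

    static struct btree *get_bucket(struct search *s, struct closure *cl)
    {
            struct btree *b = lookup_bucket(s);     /* find/lock the node */

            if (!btree_node_uptodate(b)) {
                    start_btree_read(b);            /* kick off the IO */

                    /* blocking closures (insertions) sleep here until the
                     * read finishes; non-blocking ones (cache lookups) get
                     * parked on b->io_wait and we bail with -EAGAIN */
                    if (!closure_wait_on(&b->io_wait, my_wq, cl,
                                         btree_node_uptodate(b)))
                            return ERR_PTR(-EAGAIN);
            }

            return b;
    }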