Thursday, December 31, 2009

Credentials? We need no stinking credentials!

NetBSD adopted a kauth similar to the one used in Mac OS X. But as time has gone on, I've become less enthralled with it. Maybe we need a radical rethink of how authorization is done in the kernel. Part of my problem with it is the massive amount of hardcoded authorization requests, each of which then queries the credentials to see if the request should be allowed. It's amazingly inflexible and continuously bloats as new requests are added.

Since it's always been done this way, we keep on doing it this way. Just maybe, though, we shouldn't. Instead of testing credentials, we should be testing capabilities. When our credentials change, we recalculate our capabilities. And we keep that capability set in an opaque object which we can quickly query. The credentials will be ignored by the rest of the kernel (even things like the ownership of a newly created file could be gotten from capabilities if we want to extend things that far).

And how we construct our capabilities from our credentials (the security model) could be done using a simple BPF-like language. Imagine that it could say that members of group _sshd could bind to port 22, or that uid foo is allowed to chroot.

And if you attach capabilities to an fd, you can do even crazier stuff, like specifying that certain operations are allowed on this fd. That fd can then be inherited or passed to a new process which normally would not have those capabilities, but the fd retains them, so the unprivileged process can perform a small set of privileged actions.

And this is just scratching at the surface... and maybe it's time to turn security on its head.

Thursday, October 29, 2009

lwp handoff between CPUs

Normally, if a CPU decides to run a thread and that thread has previously run on a different CPU, it just grabs and executes it. If you have CPU-dependent state (lazy FP context), this can be a bad idea since it requires extra IPIs and synchronization.

Instead, I propose a standard IPI to request that the LWP be divorced from its last CPU. And until the divorce is complete, the lwp cannot be married to a new cpu.


Thursday, October 22, 2009

Killing ticks

NetBSD implements its clock via a clock interrupt that fires at a (mostly) constant rate, which was the way BSD 4.2 did it over 25 years ago. However, that implementation is showing its age due to the overhead of the clock interrupts, their coarseness, and their power unfriendliness.

This will require a change to the way callouts and other kernel timeouts are done. Instead of passing the number of ticks after which the callout fires, a hard deadline in the future, represented by the opaque type clock_deadline_t, will be passed instead. This deadline is based on the monotonic clock kept by the kernel. There will be functions to convert timespecs and nanosecond-unit delays into clock_deadline_t. Other uses of ticks, such as scheduling quanta or sleep intervals, will need to be converted to use callouts.

The fallout of this is that the system now knows when the next timer needs to fire, which hopefully is more than a few ticks ahead in the future. The system can now pass that knowledge to the MD clock code so it can avoid scheduling premature clock timers. So whenever the deadline changes (either to a sooner or later value), the timeout code will call void cpu_set_deadline(clock_deadline_t). The MD code can use clock_deadline_to_ns(clock_deadline_t) to find when that deadline is and do the appropriate thing.

Additionally, hardclock needs to take an additional uint64_t argument which is the number of nanoseconds since the last call to hardclock.

Tuesday, August 25, 2009

Using Variable Page Sizes without revamping UVM

Here's an idea for a simple change to UVM to allow pmaps to take advantage of larger mappings.

The first part is to add "special" aligned free and inactive page lists apart from the normal free list. These lists contain pages whose physical address is aligned to one of a few power-of-2 boundaries: 16KB, 64KB, 256KB, 1MB, and 4MB should be enough.

The second part is that when uvm_map maps a page from a map entry, it tries to allocate a physical page that conforms to va & (the map entry's alignment mask). If this is the first page of the entry, it tries to allocate from the aligned page lists. If successful, simply use that page. If not, reset the alignment mask to 0.

When the page is other than the first page, see if the page at (first page pa + (new va - first page va)), assuming there is a first page, can be used, and use it if so. If that fails and this page's (va & alignment mask) == 0, act as if it's a first page.

Otherwise, if the page belonging to (va ^ PAGE_SIZE) is mapped and (its pa ^ its va) & PAGE_SIZE == 0, try the physical page at (its pa ^ PAGE_SIZE).

Next, if the page is even, try the pa of the previous page; or if odd, the pa of the next page...

This should, at least in most instances, give you contiguous runs of 2 pages, and hopefully more.

Dumping swap partitions

About the only reason to use a swap partition these days is so there is someplace to store crash dumps. NetBSD will happily swap to files on a filesystem but it won't dump to a file.

Even 30 years ago, VMS could dump to a file. It did so using a very simple technique: when you specified the dumpfile, the kernel mapped it with a cathedral window. This was a term for a file mapping containing all the extents of the file, basically a list of the starting sector and sector count for each and every extent used by that file.

There is no reason why NetBSD couldn't do the same thing. When a swapfile is added, simply record all of its extents. Of course, if the swapfile is a sparse file this won't work, so rejecting sparse files as swapfiles might be acceptable. This also avoids the problem of needing to find a buffer in low-memory situations to read the swapfile extents (since a complete mapping is not stored anywhere). A VFS hook will be needed to prevent the file from being deleted or truncated.

To the core dump code, the change is trivial. Instead of a dev_t, it will be a dev_t and a list of extents. Simply fill up an extent and move to the next until all have been exhausted. For the swap-partition case, a single extent is supplied, starting at 0 and with the length of the partition.

Sunday, August 23, 2009

mips64 multilib support

My current plan is to have the toolchain default to the N32 (ILP32 LL64) ABI for most programs but allow N64 programs to be built. The N64 libraries would not be required to live under an emulation root but would be integrated with N32 libraries under /usr/lib.

But first is the ld.elf_so problem. If you are going to run O32 binaries, they will expect to execute ld.elf_so and find libraries in /lib or /usr/lib. Now we could tweak ld.elf_so to prefer libraries in ${LIBDIR}/${abi}, but should N32/N64 binaries use ld-${ABI}.elf_so to preserve use of O32 binaries? The kernel could also be changed to rewrite ld.elf_so to ld-${ABI}.elf_so and use that in preference. Got to ponder this problem.

Let's say we have /usr/lib/libc.so and /usr/lib/n64/libc.so. Now for maximum performance, do you want to support ${LIBDIR}/${ABI}/${ARCH}/libc.so, where the dynamic loader looks at the architecture embedded in the ELF flags and tries to find a library there first, then falls back to ${LIBDIR}/${ABI}? It seems to be a win in that you can have a system with tuned libraries for mips1, mips3, mips32, mips64 without being tied to that hardware.

mips32 is only used for O32 binaries; any platform running N32 or N64 is going to be 64-bit in nature.

Friday, August 21, 2009

FP Emulation

MIPS processors require anything from some help with denormalized operands and related matters to complete emulation of all FP instructions. Combine that with the emulation required for instructions in branch delay slots and you can get a pretty large amount of emulation code in your kernel. While the latter pretty much has to be in the kernel, the former could almost as easily live in userspace.

A simple signal-like mechanism could quickly dispatch the Coprocessor Not Available exception back to user space using siginfo and a ucontext_t with the state of the lwp at the time of the exception. The FP user code can then do the emulation, fix up the ucontext_t contents as needed, and then clean up with a setcontext(ucp) to resume execution. This does mean that setcontext will need to be able to restore the entire context, not just the callee-saved registers.

This also means you can try different emulators without having to recompile your kernel or reboot. You've already taken the exception into the kernel for Coprocessor Not Available. You still need to return to userspace. The only difference is where you do the emulation. The flexibility you get by letting usermode do the emulation seems to be a clear winner.

One thing to note in this model is that the kernel never really has a copy of the FP state; that state is contained solely in userspace somewhere, probably in a per-thread context area.

Saturday, August 15, 2009

Include Machinations

For several releases I've wanted to make a radical change to how machine-specific include files are handled in NetBSD. Currently one gets the platform's files by following the symbolic link that points to the platform-specific directory (pmax, mac68k, macppc, shark) instead of the architecture-specific directory (mips, m68k, powerpc, arm).

This has two advantages: the first is that the headers would look identical on all platforms of that architecture; the second is that existing architecture includes that just #include the corresponding platform header can go away since they are no longer needed.

The disadvantage is that headers that are platform-specific would need to be made truly platform-independent, which wouldn't be a bad thing.

Monday, August 3, 2009

TLB Miss (& Mod) Lookup

One critical thing for good performance on MIPS is the speed of the TLB handler. Since a 64-entry TLB with an 8KB page size can address a maximum of 1MB of address space, you really want this to be as efficient as possible since you will likely be reloading the TLB quite often.

My current idea is to have per-cpu 32K-entry caches of PVP tables. As the number of pages that are mapped grows, so will the number of allocated PVP tables. If all 32K are allocated, a total of 128MB will be used to look up between 16 and 32GB of virtual address space. Given that ASIDs are per-cpu, the ASID of the PV entry will be placed in the upper bits of the PV address and masked out before use. As the system starts, all 32K entries will point to a common PVP table; as entries are added, additional PVP pages will be allocated and PVs distributed among them.

To look up a VPN one would do (via xkseg or kseg0):
randomizer = (vpn <> pm_randomizer);
idx = ((vpn ^ randomizer) >> 10) & 0x3fff;
pvp = pvps[idx];
idx = (vpn ^ randomizer) & 0x3ff;
pv = pvp[idx];

Note that ASIDs are not used to compute the PVP or PV index since they are fleeting and would require updating of the PVPs. It could be done but might incur more overhead than I'd like.

If the pv doesn't match the asid/vpn, a look at a secondary location in the PV page, at a TBD relative offset, will be done.
If that fails, then fall back to a pmap_lookup or uvm_fault.

This is less an inverted page table than a "global" page table.

Wednesday, July 29, 2009

Shared text segment

To decrease the amount of TLB thrashing, it would be very nice if all/most of the shared libraries (and ld.elf_so) could be loaded in such a way that all their .text sections reside in a single contiguous section of RAM that could be shared among all the processes in the system. The .data/.bss sections would live "far" from text (256MB).

The shared text should be mapped by a global TLB entry saving a lot of overhead.

Per CPU data

Since this will be a SMP port, I need some way to determine which CPU this is. On sane architectures this is easy since the CPU has registers to help figure that out. But for some reason, the MIPS architects left that out. So I'm going to take a suggestion and rework it a bit.

For each CPU, I'm going to allocate a 64KB contiguous and aligned region of memory and wire it into the TLB at -8000 hex (this will be sign-extended to 0xffff8000 in 32-bit kernels and 0xffffffffffff8000 in 64-bit kernels). The advantage of this is you can load an address of per-cpu data into a register with one instruction, using a signed offset with $0. The 32KB below -8000 will be the interrupt/exception stack for that CPU.

Note that for other CPUs to access the per-cpu data, it will be referenced using KSEG0, which requires all such blocks to be in the first 256MB of RAM (which shouldn't be a problem).

Adding mips64 support to NetBSD

I'm starting (in actuality I'm being paid to start) an effort to add support for running LP64 kernels on 64-bit MIPS processors.

And since I have this empty blog, I thought I'd write down various ideas before I lose track of them.

Friday, April 17, 2009

Rather than let my random thoughts languish and die, I've decided to archive them here. I give no assurances on the quality or quantity of posts to this blog.