Making an OS that Doesn't Suck
Last updated: 10/28/2003.
NOTE: This is all very much my Opinion, of course, so please take
it that way.
Introduction
Any OS Snob knows that All OSes Suck.
Some just Suck More or Less than others.
A good example: although Linux is the primary OS on most of my
computers (I used to use FreeBSD, and may again), I have often been
known to say that "Linux/BSD just sucks a bit less than Windows".
The motivation for this page comes from the fact that the core algorithms
implemented in currently used OSes are generally 30+ years old, and often
share a rather common set of semantics. The core APIs for both Windows
and the *NIX's are rather similar, for example. Most programmers, much less
end-users, aren't even aware of what they could get from newer techniques,
so they "don't know any better", so to speak, about what they should
demand of the OSes they use.
These pages are dedicated to 3 somewhat related Higher Goals:
- Increasing awareness of the fact that All OSes Suck.
- Generating a list of features/capabilities/design ideas for an OS that
Doesn't Suck.
- Spreading knowledge of what people could get out of an OS that Doesn't
Suck, and in the process, putting pressure on those who implement OSes
to do these things, i.e. to pursue Good Architecture and Design of an OS.
I will try to keep the following distinctions clear:
- What is implementable in existing OSes (such as Linux/BSD or Windows).
- What would require a new architecture (and maybe what kind of
architecture it would need).
- What may be controversial, but seems interesting and worthwhile to
report here.
To be fair, in the last few years the free *NIX's seem to be catching
up and even using some basic OS research again, but I still think
progress is rather glacial overall. I'm also guilty of waiting
too long to do this, but (of course ;-) I'm working on an OS project
which I hope will embody all of the Good Things listed
below.
Email Me Feedback, Commentary,
Intelligent Flames, and Contributions to the list. All would be
greatly appreciated, though I reserve the right to not add items
that don't appear worthwhile, since this is my opinions page.
I'm quite willing to list links to other pages here too (perhaps a home
for material that I'd rather not add here myself). There aren't enough
good sources out there on what needs to be done with OSes.
List of Good Things in an OS
NOTE: As these get filled in they will eventually get separate
pages.
- Storage/File Systems/Persistence
- Categories of crash-proofing (minimum acceptable is 0):
- 0. Crashless (metadata should never get into a corrupted state).
- 1. Ordered (written data appears in a logical order that follows the
writes of the processes).
- 2. Zero-loss (no committed state is lost).
- Want a distributed filesystem that has/supports:
- A "logical pool" of space, with caching as appropriate for
performance; the user shouldn't need to know which systems the
physical disks reside on.
- Disconnected operation.
- For fault-tolerance (say in relation to "Zero-loss" operation
above), keep at least 2 copies of all data on different machines.
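The "at least 2 copies on different machines" rule can be sketched as a toy model. This is only an illustration under loud assumptions: `Node`, `ReplicatedPool`, and `min_copies` are hypothetical names, machines are plain in-memory objects, and caching and disconnected operation are not modeled at all.

```python
class Node:
    """Hypothetical storage machine: an in-memory dict plus an up/down flag."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.blocks = {}


class ReplicatedPool:
    """Toy "logical pool": a write commits only once min_copies nodes hold it."""

    def __init__(self, nodes, min_copies=2):
        self.nodes = list(nodes)
        self.min_copies = min_copies

    def write(self, key, data):
        # Pick the first min_copies live machines (a real pool would choose
        # by load and locality; that policy is not modeled here).
        targets = [n for n in self.nodes if n.alive][: self.min_copies]
        if len(targets) < self.min_copies:
            # Refuse to commit rather than risk losing committed state.
            raise OSError("not enough live machines to commit safely")
        for node in targets:
            node.blocks[key] = data
        return [n.name for n in targets]

    def read(self, key):
        # The caller never names a machine; any live replica will do.
        for node in self.nodes:
            if node.alive and key in node.blocks:
                return node.blocks[key]
        raise KeyError(key)
```

With three nodes, a write lands on two of them, and a read still succeeds after either one of those two goes down. The caller only ever talks to the pool, never to a particular machine.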
- User Interfaces/Window Systems/GUIs
- Software Componentization/Object Models
- Software Packaging
- Administration
- System API/"User-Level Model"
- Authentication and Security
- Distributed/Cluster Capabilities
- All resources should be network-transparent.
- Don't use a single centralized/global controller for any
resource. Local algorithms that can ignore work happening "far enough
away" become more and more necessary as a logical cluster increases
in size. It is acceptable, and likely beneficial, for resource pools
to be structured in a way that represents this locality with some kind
of cost function applied to distance away from the caller. I'm not sure
of a natural way to express this cost function yet. The "User Login
Session" can manage and track groups of processes associated with a
particular user or login session. It would be responsible for finding
more resources to apply to the user's task(s) and presenting them, even
to the tasks themselves (for example, it could make multiple separate
machines appear as a single logical NUMA machine by manipulating the
resource namespace).
- Categories of automatic fault-tolerance (minimum acceptable is 0):
- 0. Transparent process migration (so system maintenance or
upgrades can be done with no downtime).
- 1. Checkpointing (memory/register state for the processes in
question is checkpointed at intervals to disk or over the network
for recovery if a failure occurs).
- 2. "Checking" fault-tolerance (multiple identical processes are
run and the checkpoints are compared; this requires that the
computation in question be deterministic).
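Categories 1 and 2 above can be sketched at user level. This is a rough illustration, not how a kernel would do it: a real system would checkpoint memory and register state, whereas here the "process" is assumed to be a deterministic `step` function over a JSON-serializable dict. The names `checkpoint`, `restore`, and `run_checked` are hypothetical.

```python
import hashlib
import json


def checkpoint(state):
    """Serialize state to bytes and return (blob, digest)."""
    blob = json.dumps(state, sort_keys=True).encode()
    return blob, hashlib.sha256(blob).hexdigest()


def restore(blob):
    """Rebuild a state dict from a checkpoint blob."""
    return json.loads(blob.decode())


def run_checked(step, initial_state, steps, replicas=2, interval=10):
    """Run identical replicas in lockstep and compare their checkpoints.

    Every `interval` steps (and at the end) each replica is checkpointed.
    If the digests differ, the replicas have diverged and we stop.
    """
    first_blob, _ = checkpoint(initial_state)
    states = [restore(first_blob) for _ in range(replicas)]
    last_good = first_blob
    for i in range(1, steps + 1):
        states = [step(s) for s in states]
        if i % interval == 0 or i == steps:
            results = [checkpoint(s) for s in states]
            if len({digest for _, digest in results}) != 1:
                raise RuntimeError(f"replicas diverged at step {i}")
            last_good = results[0][0]  # what would be written to disk/network
    return restore(last_good)


def count_up(state):
    """Example deterministic computation: a running sum."""
    return {"n": state["n"] + 1, "acc": state["acc"] + state["n"]}


final = run_checked(count_up, {"n": 0, "acc": 0}, steps=25)
```

On divergence this sketch simply raises. A fuller version might restart from `last_good`, or use three or more replicas and take a majority vote.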
- Device Drivers/Kernel/IPC Design
- A key notion, which I will try to state in the most general way
so as not to preclude other implementations, though I have my own idea
of how to do this:
Absolutely ALL kernel namespaces should be capable of being copied and
manipulated by user processes in a secure manner such that these private
copies can be presented to requestors of that kind of resource. Examples:
- A user login session process could, in an attempt to get more
processors to run programs for its user, create a new scheduler namespace
that has virtual copies of multiple machines in it.
- A disk defragmenter could run underneath the filesystem by presenting
its own image of the disk contents (as a logical disk) while changing
the actual structure of the disk underneath. This is easiest to implement
if filesystems never allow disk block addresses to be given out, of course.
- A new filesystem can be implemented as a logical mount-point with
a program running behind it (much like Plan 9 or the GNU HURD).
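The idea of copying a namespace, modifying the private copy, and presenting it to a requestor can be sketched as a toy model. Everything here is an assumption made for illustration: the `Namespace` class, its method names, and the `/cpu/N` naming scheme are hypothetical, resources are just strings, and the "secure manner" requirement is not modeled.

```python
class Namespace:
    """Hypothetical name -> resource table that user code may copy and rebind."""

    def __init__(self, bindings=None):
        self._bindings = dict(bindings or {})

    def copy(self):
        # A private copy: later binds do not affect the original.
        return Namespace(self._bindings)

    def bind(self, name, resource):
        self._bindings[name] = resource

    def unbind(self, name):
        self._bindings.pop(name, None)

    def resolve(self, name):
        return self._bindings[name]

    def names(self, prefix=""):
        return sorted(n for n in self._bindings if n.startswith(prefix))


# The scheduler namespace as the local machine sees it.
kernel_ns = Namespace({"/cpu/0": "machineA:cpu0", "/cpu/1": "machineA:cpu1"})

# A login session copies it and adds a processor from another machine.
session_ns = kernel_ns.copy()
session_ns.bind("/cpu/2", "machineB:cpu0")
```

A task started with `session_ns` would see three processors and could not tell that one of them is remote, while `kernel_ns` still lists only the two local ones.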
erich@uruk.org