Making an OS that Doesn't Suck

Last updated: 10/28/2003.

NOTE: This is all very much my Opinion, of course, so please take it that way.

Introduction
List of Good Things in an OS

Introduction

Any OS Snob knows that All OSes Suck. Some just Suck More or Less than others.

A good example: although Linux is the primary OS on most of my computers (I used to use FreeBSD, and may again), I have often been known to say that "Linux/BSD just sucks a bit less than Windows".

The motivation for this page is that the core algorithms implemented in currently used OSes are generally 30+ years old, and often come from a rather common set of semantics. The core APIs for both Windows and the *NIX's are rather similar, for example. Most programmers, much less end-users, aren't even aware of what they could get from newer techniques, and so "don't know any better", so to speak, about what they should demand of the OSes they use.

These pages are dedicated to 3 somewhat related Higher Goals:

Increasing awareness of the fact that All OSes Suck.
Generating a list of features/capabilities/design ideas for an OS that Doesn't Suck.
Spreading knowledge of what people could get out of an OS that Doesn't Suck, and in the process, putting pressure on those who implement OSes to do these things, i.e. to practice Good Architecture and Design of an OS.

I will try to keep the following distinctions clear:

What is implementable in existing OSes (such as Linux/BSD or Windows).
What would require a new architecture (and maybe what kind of architecture it would need).
What may be controversial, but seems interesting and worthwhile to report here.

To be fair, in the last few years the free *NIX's seem to be starting to catch up and are even using some basic OS research again, but I still think progress is rather glacial overall. I'm also guilty of waiting too long to do this, but (of course ;-) I'm working on an OS project which I hope will embody all of the Good Things listed below.

Email Me Feedback, Commentary, Intelligent Flames, and Contributions to the list. All would be greatly appreciated, though I reserve the right to not add items that don't appear worthwhile, since this is my opinions page.

I'm quite willing to list links to other pages here too (maybe a solution for material that I'm not quite willing to add here myself). There aren't enough good sources out there on what needs to be done with OSes.

List of Good Things in an OS

NOTE: As these get filled in they will eventually get separate pages.

Storage/File Systems/Persistence
- Categories of crash-proofing (minimum acceptable is 0):
  - 0. Crashless (metadata should never get into a corrupted state).
  - 1. Ordered (written data appears in an order consistent with the order in which processes issued their writes).
  - 2. Zero-loss (no committed state is lost).
- Want a distributed filesystem that has/supports:
  - A "logical pool" of space, with caching as appropriate for performance; the user shouldn't need to know which systems the physical disks reside on.
  - Disconnected operation.
  - For fault-tolerance (say in relation to "Zero-loss" operation above), keep at least 2 copies of all data on different machines.
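
To make the "logical pool" and "2 copies on different machines" items a bit more concrete, here is a minimal sketch in Python. Everything in it (the LogicalPool class, its method names, the placement rule) is a hypothetical illustration, not a design from this page: callers name data only by key, each write lands on two distinct machines, and a read still succeeds after one of those machines is lost.

```python
class LogicalPool:
    """Toy storage pool: callers name data by key, never by machine.

    Hypothetical illustration only; machines are plain in-memory dicts.
    """

    def __init__(self, machines, copies=2):
        if copies < 2 or len(machines) < copies:
            raise ValueError("need at least 2 copies on distinct machines")
        self.stores = {m: {} for m in machines}  # machine -> {key: data}
        self.down = set()                        # machines currently failed
        self.copies = copies

    def _placement(self, key):
        # Deterministically pick `copies` distinct machines for this key.
        names = sorted(self.stores)
        start = sum(key.encode()) % len(names)
        return [names[(start + i) % len(names)] for i in range(self.copies)]

    def write(self, key, data):
        targets = self._placement(key)
        # Refuse to commit unless every replica can be written.
        if any(m in self.down for m in targets):
            raise OSError("cannot reach all replicas; write not committed")
        for m in targets:
            self.stores[m][key] = data
        return targets

    def read(self, key):
        # Any surviving replica will do; the caller never sees which one.
        for m in self._placement(key):
            if m not in self.down and key in self.stores[m]:
                return self.stores[m][key]
        raise KeyError(key)


pool = LogicalPool(["alpha", "beta", "gamma"])
replicas = pool.write("notes.txt", b"hello")
pool.down.add(replicas[0])      # simulate losing one machine
print(pool.read("notes.txt"))   # served from the surviving copy
```

A real pool would also need re-replication after a failure and some answer for disconnected operation; the sketch only shows that location stays hidden from the caller.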
User Interfaces/Window Systems/GUIs
Software Componentization/Object Models
Software Packaging
Administration
System API/"User-Level Model"
Authentication and Security
Distributed/Cluster Capabilities
- All resources should be network-transparent.
- Don't use a single centralized/global controller for any resource. Local algorithms that can ignore work happening "far enough away" become more and more necessary as a logical cluster increases in size. It is acceptable, and likely beneficial, for resource pools to be structured in a way that represents this locality, with some kind of cost function applied to distance from the caller. I'm not sure of a natural way to express this cost function yet.
- The "User Login Session" can manage and track groups of processes associated with a particular user or login session. It would be responsible for finding more resources to apply to the user's task(s) and presenting them, even to the tasks themselves (for example, it could make multiple separate machines appear as a single logical NUMA machine by manipulating the resource namespace).
- Categories of automatic fault-tolerance (minimum acceptable is 0):
  - 0. Transparent process migration (so system maintenance or upgrades can be done with no downtime).
  - 1. Checkpointing (memory/register state for the processes in question is checkpointed at intervals to disk or over the network for recovery if a failure occurs).
  - 2. "Checking" fault-tolerance (multiple identical processes are run and the checkpoints are compared, requires that the computation in question be deterministic).
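
The "Checking" category above can be sketched roughly as follows, assuming a deterministic step function and a state with a stable repr(). The function names and the injected fault are hypothetical; the point is only that two replicas of the same computation can be compared checkpoint by checkpoint, and the first mismatch shows where one of them went wrong.

```python
import hashlib


def run_with_checkpoints(step, state, steps, every):
    """Run a deterministic step function, recording a digest of the
    state every `every` steps (the "checkpoints")."""
    digests = []
    for i in range(1, steps + 1):
        state = step(state)
        if i % every == 0:
            # Assumes repr(state) is stable across replicas.
            digests.append(hashlib.sha256(repr(state).encode()).hexdigest())
    return state, digests


def first_divergence(digests_a, digests_b):
    """Index of the first checkpoint where two replicas disagree, or None."""
    for index, (a, b) in enumerate(zip(digests_a, digests_b)):
        if a != b:
            return index
    return None


def make_faulty(step, fail_at):
    """Wrap `step` so that call number `fail_at` flips one bit of the result,
    standing in for a hardware or software fault."""
    calls = {"n": 0}

    def faulty(state):
        calls["n"] += 1
        result = step(state)
        return result ^ 1 if calls["n"] == fail_at else result

    return faulty


def step(x):
    # Any deterministic computation works; this one is arbitrary.
    return (x * 31 + 7) % 1000003


_, good = run_with_checkpoints(step, 1, steps=12, every=4)
_, bad = run_with_checkpoints(make_faulty(step, fail_at=6), 1, steps=12, every=4)
print(first_divergence(good, good))  # None: identical replicas agree
print(first_divergence(good, bad))   # index of the first bad checkpoint
```

With only two replicas a mismatch says that something failed, not which replica is right; deciding that would take a third replica or a rollback to the last matching checkpoint.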
Device Drivers/Kernel/IPC Design
- A key notion, which I will try to state in the most general way so as not to preclude other implementations (though I have my own idea of how to do this): absolutely ALL kernel namespaces should be capable of being copied and manipulated by user processes in a secure manner, such that these private copies can be presented to requestors of that kind of resource. Examples:
  - A user login session process could, in an attempt to get more processors to run programs for its user, create a new scheduler namespace that has virtual copies of multiple machines in it.
  - A disk defragmenter could run underneath the filesystem by presenting its own image of the disk contents (as a logical disk) while changing the actual structure of the disk underneath. Easiest to implement if filesystems never allow disk block addresses to be given out, of course.
  - A new filesystem can be implemented as a logical mount-point with a program running behind it (much like Plan 9 or the GNU HURD).
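
As a rough illustration of the namespace idea (and not the implementation alluded to above), here is a toy model in Python. The Namespace class, the entry names, and the string "resources" are all hypothetical, and security is not modeled at all. A session takes a private copy of the system namespace, adds a remote processor and a program-backed mount point to it, and hands that copy to its own tasks while the original stays untouched.

```python
class Namespace:
    """Toy resource namespace mapping names to providers.

    Hypothetical illustration: a provider is any Python object, and
    nothing here models access control.
    """

    def __init__(self, entries=None):
        self._entries = dict(entries or {})

    def copy(self):
        # Private copy: later binds are invisible to the original.
        return Namespace(self._entries)

    def bind(self, name, provider):
        self._entries[name] = provider

    def resolve(self, name):
        return self._entries[name]

    def names(self, prefix=""):
        return sorted(n for n in self._entries if n.startswith(prefix))


# The system-wide scheduler namespace knows one local processor.
system = Namespace({"cpu/0": "local:cpu0"})

# A login session copies it and adds a virtual processor on another machine.
session = system.copy()
session.bind("cpu/1", "machineB:cpu0")

# A "filesystem" that is really a program behind a mount point.
session.bind("fs/upper", lambda path: path.upper())

print(system.names("cpu/"))    # the original is unchanged
print(session.names("cpu/"))   # the session's tasks see both processors
print(session.resolve("fs/upper")("readme"))
```

The defragmenter example has the same shape: it would bind its own logical disk over the real one in the copy it presents to the filesystem.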

erich@uruk.org
