



                          INCOMPLETE DRAFT


               The Flux Operating System Toolkit

               Bryan Ford and Flux Project Members

               E-mail: oskit@jensen.cs.utah.edu


                       The Flux Project

Department of Computer Science, University of Utah, Salt Lake City, UT 84112

 E-mail: flux@cs.utah.edu   URL: http://www.cs.utah.edu/projects/flux/



                       September 1, 1996







Contents
1   Introduction                                                            11
    1.1    Goals and Scope  . . . . . . . . . . . . . . . . . . . . . . .  11
    1.2    Road Map . . . . . . . . . . . . . . . . . . . . . . . . . . .  11
    1.3    Using the OS Toolkit . . . . . . . . . . . . . . . . . . . . .  12
    1.4    Example Kernels  . . . . . . . . . . . . . . . . . . . . . . .  13
    1.5    Overall Design Principles  . . . . . . . . . . . . . . . . . .  13
    1.6    Portability  . . . . . . . . . . . . . . . . . . . . . . . . .  14
    1.7    Building the Toolkit . . . . . . . . . . . . . . . . . . . . .  14
    1.8    Linking Order  . . . . . . . . . . . . . . . . . . . . . . . .  15


2   List-based Memory Manager Library (liblmm.a)                            17
    2.1    Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  17
    2.2    Memory regions . . . . . . . . . . . . . . . . . . . . . . . .  18
           2.2.1    Region flags  . . . . . . . . . . . . . . . . . . . .  18
           2.2.2    Allocation priority . . . . . . . . . . . . . . . . .  19
    2.3    Example use  . . . . . . . . . . . . . . . . . . . . . . . . .  19
    2.4    Restrictions and guarantees  . . . . . . . . . . . . . . . . .  20
    2.5    Sanity checking  . . . . . . . . . . . . . . . . . . . . . . .  20
    2.6    API reference  . . . . . . . . . . . . . . . . . . . . . . . .  21
           2.6.1    lmm_init: initialize an LMM pool  . . . . . . . . . .  22
           2.6.2    lmm_add_region: register a memory region in an LMM pool  23
           2.6.3    lmm_add_free: add a block of free memory to an LMM pool  24
           2.6.4    lmm_remove_free: remove a block of memory from an LMM pool  25
           2.6.5    lmm_alloc: allocate memory  . . . . . . . . . . . . .  26
           2.6.6    lmm_alloc_aligned: allocate memory with a specific alignment  27
           2.6.7    lmm_alloc_gen: allocate memory with general constraints  28
           2.6.8    lmm_alloc_page: allocate a page of memory . . . . . .  29
           2.6.9    lmm_free: free previously-allocated memory  . . . . .  30
           2.6.10   lmm_free_page: free a page allocated with lmm_alloc_page  31
           2.6.11   lmm_avail: find the amount of free memory in an LMM pool  32
           2.6.12   lmm_find_free: scan a memory pool for free blocks . .  33
           2.6.13   lmm_dump: display the free memory list in an LMM pool  34


3   Executable Program Interpreter (libexec.a)                              35
    3.1    Header Files . . . . . . . . . . . . . . . . . . . . . . . . .  36
           3.1.1    exec.h: definitions for executable interpreter functions  37
           3.1.2    a.out.h: (semi-)standard a.out file format definitions  38
           3.1.3    elf.h: standard 32-bit ELF file format definitions  .  39
    3.2    Function Reference . . . . . . . . . . . . . . . . . . . . . .  40
           3.2.1    exec_load: detect the type of an executable file and load it  41
           3.2.2    exec_load_elf: load a 32-bit ELF executable file  . .  42
           3.2.3    exec_load_aout: load an a.out-format executable file   43




4   Disk Partition Interpreter (libdiskpart.a)                              45
    4.1    Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  45
    4.2    Supported Partitioning Schemes . . . . . . . . . . . . . . . .  45
    4.3    Example Use  . . . . . . . . . . . . . . . . . . . . . . . . .  45
           4.3.1    Reading the partition table . . . . . . . . . . . . .  45
           4.3.2    Using Partition Information . . . . . . . . . . . . .  46
    4.4    Restrictions . . . . . . . . . . . . . . . . . . . . . . . . .  47
           4.4.1    Endian  . . . . . . . . . . . . . . . . . . . . . . .  47
           4.4.2    Nesting . . . . . . . . . . . . . . . . . . . . . . .  47
           4.4.3    Lookup  . . . . . . . . . . . . . . . . . . . . . . .  47
    4.5    API reference  . . . . . . . . . . . . . . . . . . . . . . . .  47
           4.5.1    diskpart_get_partitions: initialize an array of partition entries  48
           4.5.2    diskpart_fill_entry: initialize a single partition entry  49
           4.5.3    diskpart_dump: print a partition entry to stdout  . .  50
           4.5.4    diskpart_lookup_bsd_compat: search for a partition entry  51
           4.5.5    diskpart_lookup_bsd_string: search for a partition entry  52
           4.5.6    diskpart_get_foo: search for foo-type partitions  . .  53


5   File System Reader (libfsread.a)                                                                            55


6   Minimal C Library (libmc.a)                                             57
    6.1    Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  57
    6.2    Unsupported Features . . . . . . . . . . . . . . . . . . . . .  58
    6.3    Header Files . . . . . . . . . . . . . . . . . . . . . . . . .  59
           6.3.1    assert.h: program diagnostics facility  . . . . . . .  60
           6.3.2    ctype.h: character handling functions . . . . . . . .  61
           6.3.3    errno.h: error numbers  . . . . . . . . . . . . . . .  62
           6.3.4    fcntl.h: POSIX low-level file control . . . . . . . .  63
           6.3.5    limits.h: architecture-specific limits  . . . . . . .  64
           6.3.6    setjmp.h: nonlocal jumps  . . . . . . . . . . . . . .  65
           6.3.7    signal.h: signal handling . . . . . . . . . . . . . .  66
           6.3.8    stdarg.h: variable arguments  . . . . . . . . . . . .  67
           6.3.9    stddef.h: common definitions  . . . . . . . . . . . .  68
           6.3.10   stdio.h: standard input/output  . . . . . . . . . . .  69
           6.3.11   stdlib.h: standard library functions  . . . . . . . .  70
           6.3.12   string.h: string handling functions . . . . . . . . .  71
           6.3.13   strings.h: string handling functions (deprecated) . .  72
           6.3.14   sys/gmon.h: GNU profiling support definitions . . . .  73
           6.3.15   sys/ioctl.h: I/O control definitions  . . . . . . . .  74
           6.3.16   sys/mman.h: memory management and mapping definitions  75
           6.3.17   sys/reboot.h: system reboot definitions . . . . . . .  76
           6.3.18   sys/signal.h: signal handling (deprecated)  . . . . .  77
           6.3.19   sys/stat.h: file statistics . . . . . . . . . . . . .  78
           6.3.20   sys/termios.h: terminal handling functions and definitions  79
           6.3.21   sys/time.h: timing functions  . . . . . . . . . . . .  80
           6.3.22   sys/types.h: general POSIX types  . . . . . . . . . .  81
           6.3.23   termios.h: terminal handling functions and definitions  82
           6.3.24   unistd.h: traditional Unix definitions  . . . . . . .  83
    6.4    Memory Allocation  . . . . . . . . . . . . . . . . . . . . . .  84
           6.4.1    malloc_lmm: LMM pool used by the default memory allocation functions  85
           6.4.2    malloc: allocate uninitialized memory . . . . . . . .  86
           6.4.3    memalign: allocate aligned memory . . . . . . . . . .  87
           6.4.4    calloc: allocate cleared memory . . . . . . . . . . .  88
           6.4.5    realloc: change the size of an existing memory block   89


           6.4.6    free: release an allocated memory block . . . . . . .  90
           6.4.7    smalloc: allocate uninitialized memory; the caller must keep track of the allocation size  91
           6.4.8    smemalign: allocate aligned memory; the caller must keep track of the allocation size  92
           6.4.9    sfree: release a memory block allocated with smalloc or smemalign;
                    the caller must provide the size of the block being freed  . . . .  93
           6.4.10   mem_lock: lock access to malloc_lmm . . . . . . . . .  94
           6.4.11   mem_unlock: unlock access to malloc_lmm . . . . . . .  95
           6.4.12   morecore: grow the heap . . . . . . . . . . . . . . .  96
    6.5    Standard I/O Functions . . . . . . . . . . . . . . . . . . . .  97
    6.6    Termination Functions  . . . . . . . . . . . . . . . . . . . .  98
           6.6.1    exit: terminate normally  . . . . . . . . . . . . . .  99
           6.6.2    abort: terminate abnormally . . . . . . . . . . . . . 100
           6.6.3    panic: terminate abnormally with an error message . . 101
    6.7    Miscellaneous Functions  . . . . . . . . . . . . . . . . . . . 102
           6.7.1    ntohl: convert 32-bit long word from network byte order  103
           6.7.2    ntohs: convert 16-bit short word from network byte order  104
           6.7.3    getenv: search for an environment variable  . . . . . 105
           6.7.4    creat: create a file  . . . . . . . . . . . . . . . . 106
           6.7.5    hexdump: print a buffer as a hexdump  . . . . . . . . 107


7   Memory Debug Utilities Library (libmemdebug.a)                         109
    7.1    Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 109
    7.2    Debugging versions of standard routines  . . . . . . . . . . . 111
    7.3    Additional debugging utilities . . . . . . . . . . . . . . . . 112
           7.3.1    memdebug_mark: mark all currently allocated blocks  . 113
           7.3.2    memdebug_check: look for blocks allocated since the mark that haven't been freed  114
           7.3.3    memdebug_ptrchk: check the validity of a pointer's fence-posts  115
           7.3.4    memdebug_sweep: check the validity of all allocated blocks' fence-posts  116


8   Kernel Support Library (libkern.a)                                     117
    8.1    Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 117
           8.1.1    Machine-dependence of code and interfaces . . . . . . 117
           8.1.2    Generic versus Base Environment code  . . . . . . . . 117
           8.1.3    Road Map  . . . . . . . . . . . . . . . . . . . . . . 118
    8.2    Machine-independent Facilities . . . . . . . . . . . . . . . . 120
           8.2.1    types.h: C-language machine-dependent types . . . . . 121
           8.2.2    page.h: Page size definitions . . . . . . . . . . . . 122
           8.2.3    bitops.h: efficient bit field operations  . . . . . . 123
           8.2.4    spin_lock.h: Spin locks . . . . . . . . . . . . . . . 124
           8.2.5    debug.h: debugging support facilities . . . . . . . . 125
    8.3    [X86] Generic Low-level Definitions  . . . . . . . . . . . . . 127
           8.3.1    asm.h: assembly language support macros . . . . . . . 128
           8.3.2    eflags.h: Processor flags register definitions  . . . 129
           8.3.3    proc_reg.h: Processor register definitions and accessor functions  130
           8.3.4    debug_reg.h: Debug register definitions and accessor functions  131
           8.3.5    fp_reg.h: Floating point register definitions and accessor functions  132
           8.3.6    far_ptr.h: Far (segment:offset) pointers  . . . . . . 133
           8.3.7    pio.h: Programmed I/O functions . . . . . . . . . . . 134
           8.3.8    seg.h: Segment descriptor data structure definitions and constants  135
           8.3.9    gate_init.h: Gate descriptor initialization support   136
           8.3.10   trap.h: Processor trap vectors  . . . . . . . . . . . 137
           8.3.11   paging.h: Page translation data structures and constants  138
           8.3.12   tss.h: Processor task save state structure definition  139
    8.4    [X86 PC] Generic Low-level Definitions . . . . . . . . . . . . 140


           8.4.1    irq_list.h: Standard hardware interrupt assignments   141
           8.4.2    pic.h: Programmable Interrupt Controller definitions  142
           8.4.3    keyboard.h: PC keyboard definitions . . . . . . . . . 143
           8.4.4    rtc.h: NVRAM Register locations . . . . . . . . . . . 144
    8.5    [X86] Processor Identification and Management  . . . . . . . . 145
           8.5.1    cpu_info: CPU identification data structure . . . . . 146
           8.5.2    cpuid: identify the current CPU . . . . . . . . . . . 148
           8.5.3    cpu_info_format: output a cpu_info structure in ASCII form  149
           8.5.4    cpu_info_dump: pretty-print a cpu_info structure to the console  150
           8.5.5    i16_enter_pmode: enter protected mode . . . . . . . . 151
           8.5.6    i16_leave_pmode: leave protected mode . . . . . . . . 152
           8.5.7    paging_enable: enable page translation  . . . . . . . 153
           8.5.8    paging_disable: disable page translation  . . . . . . 154
    8.6    [X86] Base Environment . . . . . . . . . . . . . . . . . . . . 155
           8.6.1    Memory model  . . . . . . . . . . . . . . . . . . . . 155
           8.6.2    base_vm.h: definitions for the base virtual memory environment  157
           8.6.3    base_cpu_setup: initialize and activate the base CPU environment  158
           8.6.4    base_cpu_init: initialize the base environment data structures  159
           8.6.5    base_cpu_load: activate the base processor execution environment  160
           8.6.6    base_cpuid: global variable describing the processor  161
           8.6.7    base_stack.h: default kernel stack  . . . . . . . . . 162
    8.7    [X86] Base Environment: Segmentation Support . . . . . . . . . 163
           8.7.1    base_gdt: default global descriptor table for the base environment  164
           8.7.2    base_gdt_init: initialize the base GDT to default values  166
           8.7.3    base_gdt_load: load the base GDT into the CPU . . . . 167
           8.7.4    base_idt: default interrupt descriptor table  . . . . 168
           8.7.5    base_idt_load: load the base IDT into the current processor  169
           8.7.6    base_tss: default task state segment  . . . . . . . . 170
           8.7.7    base_tss_init: initialize the base task state segment  171
           8.7.8    base_tss_load: load the base task state segment into the current processor  172
    8.8    [X86] Base Environment: Trap Handling  . . . . . . . . . . . . 173
           8.8.1    trap_state: saved state format used by the default trap handler  174
           8.8.2    base_trap_init: initialize the processor trap vectors in the base IDT  176
           8.8.3    base_trap_inittab: initialization table for the default trap entrypoints  177
           8.8.4    base_trap_handler: pointer to trap handler  . . . . . 178
           8.8.5    trap_dump: dump a saved trap state structure  . . . . 179
           8.8.6    trap_dump_panic: dump a saved trap state structure and panic  180
    8.9    [X86] Base Environment: Page Translation . . . . . . . . . . . 181
           8.9.1    base_paging_init: create minimal kernel page tables and enable paging  182
           8.9.2    base_pdir_pa: initial kernel page directory . . . . . 183
           8.9.3    pdir_find_pde: find an entry in a page directory given a linear address  184
           8.9.4    ptab_find_pte: find an entry in a page table given a linear address  185
           8.9.5    pdir_find_pte: look up a page table entry from a page directory  186
           8.9.6    pdir_get_pte: retrieve the contents of a page table entry  187
           8.9.7    ptab_alloc: allocate a page table page and clear it to zero  188
           8.9.8    ptab_free: free a page table allocated using ptab_alloc  189
           8.9.9    pdir_map_page: map a 4KB page into a linear address space  190
           8.9.10   pdir_unmap_page: unmap a single 4KB page mapping  . . 191
           8.9.11   pdir_map_range: map a contiguous range of physical addresses  192
           8.9.12   pdir_prot_range: change the permissions on a mapped memory range  193
           8.9.13   pdir_unmap_range: remove a mapped range of linear addresses  194
           8.9.14   pdir_dump: dump the contents of a page directory and all its page tables  195
           8.9.15   ptab_dump: dump the contents of a page table  . . . . 196
    8.10   [X86 PC] Base Environment: I/O Device Support  . . . . . . . . 197


           8.10.1   base_irq.h: Hardware interrupt definitions for standard PCs  198
           8.10.2   phys_lmm.h: Physical memory management for PCs  . . . 199
           8.10.3   direct_cons.h: Direct video console . . . . . . . . . 200
           8.10.4   com_cons.h: Polling serial (COM) port console . . . . 201
    8.11   [X86 PC] MultiBoot Startup . . . . . . . . . . . . . . . . . . 202
           8.11.1   Startup code organization . . . . . . . . . . . . . . 202
           8.11.2   Startup sequence  . . . . . . . . . . . . . . . . . . 202
           8.11.3   Memory model  . . . . . . . . . . . . . . . . . . . . 203
           8.11.4   Command-line arguments  . . . . . . . . . . . . . . . 203
           8.11.5   Linking MultiBoot kernels . . . . . . . . . . . . . . 203
           8.11.6   multiboot.h: Definitions of MultiBoot structures and constants  205
           8.11.7   boot_info: MultiBoot information structure  . . . . . 206
           8.11.8   multiboot_main: general MultiBoot initialization  . . 207
           8.11.9   base_multiboot_init_mem: physical memory initialization  208
           8.11.10  base_multiboot_init_cmdline: command-line preprocessing  209
           8.11.11  base_multiboot_find: find a MultiBoot boot module by name  210
    8.12   [X86 PC] Raw BIOS Startup  . . . . . . . . . . . . . . . . . . 211
    8.13   [X86 PC] DOS Startup . . . . . . . . . . . . . . . . . . . . . 212
    8.14   Remote Kernel Debugging with GDB . . . . . . . . . . . . . . . 213
           8.14.1   Organization of remote GDB support code . . . . . . . 213
           8.14.2   Using the remote debugging code . . . . . . . . . . . 213
           8.14.3   Debugging address spaces other than the kernel's  . . 214
           8.14.4   gdb_state: processor register state frame used by GDB  215
           8.14.5   gdb_trap: default trap handler for remote GDB debugging  216
           8.14.6   gdb_copyin: safely read data from the subject's address space  218
           8.14.7   gdb_copyout: safely write data into the subject's address space  219
           8.14.8   gdb_trap_recover: recovery pointer for safe memory transfer routines  220
           8.14.9   gdb_signal: vector to GDB trap/signal handler routine  221
           8.14.10  gdb_set_trace_flag: enable or disable single-stepping in a state frame  222
           8.14.11  gdb_breakpoint: macro to generate a manual instruction breakpoint  223
    8.15   Serial-line Remote Debugging with GDB  . . . . . . . . . . . . 224
           8.15.1   Redirecting console output to the remote debugger . . 224
           8.15.2   gdb_serial_signal: primary event handler in the GDB stub  225
           8.15.3   gdb_serial_exit: notify the remote debugger that the subject is dead  226
           8.15.4   gdb_serial_putchar: output a character to the remote debugger's console  227
           8.15.5   gdb_serial_puts: output a line to the remote debugger's console  228
           8.15.6   gdb_serial_recv: vector to GDB serial line receive function  229
           8.15.7   gdb_serial_send: vector to GDB serial line send function  230
           8.15.8   gdb_pc_com_init: [X86 PC] set up serial-line debugging over a COM port  231
    8.16   Annotations  . . . . . . . . . . . . . . . . . . . . . . . . . 232


9   Symmetric Multi Processing Library (libsmp.a)                          233
    9.1    Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 233
    9.2    Supported Systems  . . . . . . . . . . . . . . . . . . . . . . 233
           9.2.1    Intel x86 . . . . . . . . . . . . . . . . . . . . . . 233
    9.3    API reference  . . . . . . . . . . . . . . . . . . . . . . . . 233
           9.3.1    smp_initialize: initialize the SMP startup code . . . 234
           9.3.2    smp_find_cur_cpu: return the processor ID of the current processor  235
           9.3.3    smp_find_cpu: return the next processor ID  . . . . . 236
           9.3.4    smp_start_cpu: start a processor running a specified function  237
           9.3.5    smp_get_num_cpus: return the total number of processors  238


10  Flux Device Driver Framework                                           239
    10.1   Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 239
           10.1.1   Full versus partial compliance  . . . . . . . . . . . 240
    10.2   Organization . . . . . . . . . . . . . . . . . . . . . . . . . 240
    10.3   Driver Sets  . . . . . . . . . . . . . . . . . . . . . . . . . 242
    10.4   Execution Model  . . . . . . . . . . . . . . . . . . . . . . . 242
           10.4.1   Use in multiprocessor kernels . . . . . . . . . . . . 243
           10.4.2   Use in preemptive kernels . . . . . . . . . . . . . . 243
           10.4.3   Use in multiple-interrupt-level kernels . . . . . . . 244
           10.4.4   Use in interrupt-model kernels  . . . . . . . . . . . 244
           10.4.5   Use in out-of-kernel, user-mode device drivers  . . . 245
    10.5   Performance  . . . . . . . . . . . . . . . . . . . . . . . . . 246
    10.6   Device Driver Initialization . . . . . . . . . . . . . . . . . 247
    10.7   Device Classification  . . . . . . . . . . . . . . . . . . . . 247
    10.8   Buffer Management  . . . . . . . . . . . . . . . . . . . . . . 248
    10.9   Asynchronous I/O . . . . . . . . . . . . . . . . . . . . . . . 248
    10.10  Other Considerations . . . . . . . . . . . . . . . . . . . . . 248
    10.11  Common Device Driver Interface . . . . . . . . . . . . . . . . 249
           10.11.1  fdev.h: common device driver framework definitions  . 250
           10.11.2  fdev_ioctl: control a device using a driver-specific protocol  251
    10.12  Driver Memory Allocation . . . . . . . . . . . . . . . . . . . 252
           10.12.1  fdev_memflags_t: memory allocation flags  . . . . . . 253
           10.12.2  fdev_mem_alloc: allocate memory for use by device drivers  255
           10.12.3  fdev_mem_free: free memory allocated with fdev_mem_alloc  256
           10.12.4  fdev_mem_get_phys: find the physical address of an allocated block  257
           10.12.5  fdev_mem_get_phys_list: find the physical address list of an allocated block  258
    10.13  Hardware Interrupts  . . . . . . . . . . . . . . . . . . . . . 259
           10.13.1  fdev_intr_disable: prevent interrupts in the driver environment  260
           10.13.2  fdev_intr_enable: allow interrupts in the driver environment  261
           10.13.3  fdev_intr_alloc: allocate an interrupt request line   262
    10.14  Sleep/Wakeup . . . . . . . . . . . . . . . . . . . . . . . . . 263
           10.14.1  fdev_sleep_init: prepare to put the current process to sleep  264
           10.14.2  fdev_sleep: put the current process to sleep  . . . . 265
           10.14.3  fdev_wakeup: wake up a sleeping process . . . . . . . 266
    10.15  Driver-Kernel Interface: Timing  . . . . . . . . . . . . . . . 267
           10.15.1  fdev_timer_register: start a timer  : : : : : : : : : : : : : : : : : : : : : : : : : : : : 268
           10.15.2  fdev_nanosleep: wait for some amount of time to elapse   : : : : : : : : : : : : : : : : 269
           10.15.3  fdev_nanosleep_nonblock: wait a short time without blocking  : : : : : : : : : : : : : 270
    10.16  Buffer Management : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 271
           10.16.1  fdev_buf_copyin: copy data from an opaque buffer to a driver's buffer : : : : : : : : : 272
           10.16.2  fdev_buf_copyout: copy data from a driver into an opaque buffer   : : : : : : : : : : : 273
           10.16.3  fdev_buf_wire: wire down part of a buffer to physical memory  : : : : : : : : : : : : : 274
           10.16.4  fdev_buf_unwire: unwire previously wired data   : : : : : : : : : : : : : : : : : : : : : 275
           10.16.5  fdev_buf_map: map a buffer into the driver's virtual address space  : : : : : : : : : : : 276
           10.16.6  fdev_buf_unmap: unmap data mapped with fdev_buf_map  : : : : : : : : : : : : : : : : 277
    10.17  Mapping Physical Memory   : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 278
           10.17.1  fdev_map_phys_mem: map physical memory into kernel virtual memory  : : : : : : : : : 279
    10.18  Device Registration : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 280
           10.18.1  fdev_alloc: allocate a device node  : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 281
           10.18.2  fdev_free: free a device node : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 282
    10.19  Block Storage Device Interfaces  : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 283
           10.19.1  fdev_blk_read: read data from a device  : : : : : : : : : : : : : : : : : : : : : : : : : : 284
           10.19.2  fdev_blk_write: write data to a device   : : : : : : : : : : : : : : : : : : : : : : : : : : 285
    10.20  Network Device Interfaces  : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 286
           10.20.1  fdev_net_send: send a packet on a network interface : : : : : : : : : : : : : : : : : : : 287
           10.20.2  fdev_net_alloc: allocate an opaque buffer into which to receive data   : : : : : : : : : 288
           10.20.3  fdev_net_recv: notify the OS that data has been received into a buffer   : : : : : : : : 289
           10.20.4  fdev_net_free: free an unused network packet buffer   : : : : : : : : : : : : : : : : : : 290
    10.21  Serial Device Interfaces   : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 291
           10.21.1  fdev_serial_set: set standard serial port parameters  : : : : : : : : : : : : : : : : : : 292
           10.21.2  fdev_serial_get: get standard serial port parameters and line status   : : : : : : : : : 293
    10.22  Driver-Kernel Interface: [x86 PC] ISA device registration   : : : : : : : : : : : : : : : : : : : : : 294
           10.22.1  fdev_isa_add: add a device node to an ISA bus   : : : : : : : : : : : : : : : : : : : : : 295
           10.22.2  fdev_isa_remove: remove a device node from an ISA bus  : : : : : : : : : : : : : : : : 296
           10.22.3  fdev_isa_alloc_ports: allocate a range of I/O ports   : : : : : : : : : : : : : : : : : : 297
           10.22.4  fdev_isa_free_ports: release a range of I/O ports  : : : : : : : : : : : : : : : : : : : : 298
           10.22.5  fdev_isa_alloc_physmem: allocate a range of physical memory  : : : : : : : : : : : : : 299
           10.22.6  fdev_isa_free_physmem: release a range of physical memory   : : : : : : : : : : : : : : 300
           10.22.7  fdev_isa_alloc_dma: allocate a DMA channel   : : : : : : : : : : : : : : : : : : : : : : 301
           10.22.8  fdev_isa_free_dma: release a DMA channel  : : : : : : : : : : : : : : : : : : : : : : : : 302
    10.23  Driver-Kernel Interface: [x86 PC] PCI device registration  : : : : : : : : : : : : : : : : : : : : : : 303
    10.24  Driver-Kernel Interface: SCSI device registration  : : : : : : : : : : : : : : : : : : : : : : : : : 304
           10.24.1  fdev_scsi_add: add a device node to a SCSI bus  : : : : : : : : : : : : : : : : : : : : : 305
           10.24.2  fdev_scsi_remove: remove a device node from a SCSI bus   : : : : : : : : : : : : : : : 306
    10.25  Error Codes  : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 306


11  Device Driver Support Library (libfdev.a)                                                             307
    11.1   Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 307
    11.2   Device Registration : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 307
    11.3   Naming   : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 307
    11.4   Memory Allocation  : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 307
    11.5   Buffer Management : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 308
    11.6   Processor Bus Resource Management : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 308


12  Linux Driver Set (libfdev_linux.a)                                                                         309
    12.1   Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 309
    12.2   Partially-compliant Drivers   : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 309
    12.3   Internals  : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 309
           12.3.1   Variables   : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 310
           12.3.2   Functions  : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 311
    12.4   Block device drivers   : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 313
    12.5   Network drivers  : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 313
    12.6   SCSI drivers : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 314


13  FreeBSD Driver Set (libfdev_freebsd.a)                                                                 317


14  Novell ODI Network Drivers (libfdev_odi.a)                                                           319


15  Appendix A: Directory Structure                                                                           321






Chapter  1


Introduction



1.1       Goals  and  Scope


The Flux OS Toolkit is a framework and set of modularized library code,  together with extensive docu-
mentation, for the construction of operating system kernels, servers, and other OS-level functionality.  Its
purpose is to provide, as a set of easily reusable modules, much of the infrastructure "grunge" that usually
takes up a large percentage of development time in any operating system or OS-related project, and allow
developers to concentrate their efforts on the unique and interesting aspects of the new OS in question. The
goal is for someone to be able to take the OS toolkit and immediately have a base on which they can start
concentrating on "real" OS issues such as scheduling, VM, IPC, file systems, security, or whatever, instead
of spending six months first writing boot loader code, startup code, device drivers, kernel printf and malloc
code, a kernel debugger, etc.
    The intention of this toolkit is not to "write the OS for you"; we certainly want to leave the OS writing
to the OS writer.  The dividing line between the "OS" and the "OS toolkit," as we see it, is basically the
line between what OS writers want to write and what they would otherwise have to write but don't really
want to.  Naturally this will vary between different OS groups and developers.  If you really want to write
your own x86 protected-mode startup code, or have found a clever way to do it "better," you're perfectly
free to do so and simply not use the corresponding code in our toolkit. However, our goal is that the toolkit
be modular enough that you can still easily use other parts of it to fill in other functional areas you don't
want to have to deal with yourself (or areas that you just don't have time to do "yet").
    As such, the toolkit is designed to be usable either as a whole or in arbitrary subsets, as requirements
dictate. It can be used either as a set of support libraries to be linked into an operating system kernel and/or
its support programs, or it can be used merely as a collection of "spare parts":  example source code to be
ripped apart and cannibalized for whatever purpose. (Naturally, we prefer that the toolkit be used in library
fashion, since this keeps a cleaner interface between the toolkit and the OS and makes them both easier to
maintain; however, we recognize that in some situations this will not be practical for technical or political
reasons.)
    The toolkit is also intended to be useful for things that aren't kernels but are OS-related, such as boot
loaders or OS-level servers running on top of a microkernel.



1.2       Road  Map


Some  of  the  main  components  provided  by  the  Flux  OS  toolkit  are  listed  here,  along  with  the  chapter
numbers in which they are described.  The libraries are described in this document roughly in order from
smallest and simplest to largest and most complex.

    2  liblmm: A flexible memory management library that can be used to manage either physical or virtual
       memory.  This library supports many special features needed by OS-level code, such as multiple
       memory types, allocation priorities, and arbitrary alignment and placement constraints for allocated
       blocks.




    3  libexec:  A generic executable interpreter and loader that supports popular executable formats such

       as a.out and ELF, either during bootstrap or during general operation.  (Even microkernel systems,
       which normally don't load executables, generally must have a way to load the first user-level program;
       the Flux toolkit's small, simple executable interpreter is ideally suited to this purpose.)

    4  libdiskpart:  A generic library which recognizes various common disk partitioning schemes and pro-
       duces a complete "map" of the organization of any disk. This library provides a simple way for the OS
       to find relevant or "interesting" disk partitions, as well as to easily provide high-level access to arbitrary
       disk partitions through various naming schemes; BSD- and Linux-compatible naming mechanisms are
       provided as defaults.

    5  libfsread: A simple read-only file system interpretation library supporting various common types of
       file systems including BSD FFS, Linux ext2fs, and MINIX file systems. This library is typically used in
       conjunction with the partition library to provide a convenient way for the OS to read programs and data
       off of hard disks or floppies.  Again, this functionality is often needed at boot time even in operating
       systems that otherwise would not require it.  This code is also extremely useful in constructing boot
       loaders.

    6  libmc: A simple, minimal C library which minimizes dependencies on the environment and between
       modules,  to provide common C library services in a restricted OS environment.  For example,  this
       library provides many standard string, memory, and other utility functions, as well as a formatted I/O
       facility (e.g., printf) designed for easy use in restricted environments such as kernels.

    8  libkern: Kernel support code for setting up a basic OS kernel environment, including providing default
       handlers for traps and interrupts and such. This library includes many general utilities useful in kernel
       code, such as functions to access special processor registers, set up and manipulate page tables, and
       switch between processor modes (e.g., between real and protected mode on the x86).  Also includes
       facilities for convenient source-level remote debugging of OS kernels under development.

    9  libsmp:  More kernel support code, this library deals with setting up a multiprocessor system to the
       point where it can be used by the operating system. Also (to be) included are message-passing routines
       and synchronization primitives, which are necessary to flush remote TLBs.

   10  libfdev*:  A generic, well-defined device driver framework that allows existing device drivers to be
       adopted from various sources and used in a variety of kernel and user-space environments.  A support
       library is provided to allow arbitrary OS environments to adopt this framework easily in either kernel
       or user space with a minimum of implementation effort. Several sets of reusable device drivers running
       under this framework are also provided as independent libraries, including robust, well-tested drivers
       taken from Linux and FreeBSD.

    Note:  for  the  x86  architecture,  all  items  but  the  file  system  interpreter  and  device  drivers  are  fully
implemented (and for libexec, the ELF and a.out object formats are implemented), but not all are documented.
The device driver support is a generalization of the Linux-drivers-on-Mach support done by Goel at Columbia
and Utah.  The file system support is implemented but not yet integrated into the OS toolkit in library form.
The SMP support is still in development stages, and is being integrated as it is completed.



1.3       Using  the  OS  Toolkit


To  use  the  OS  toolkit,  simply  link  your  kernel  (or  servers,  or  whatever)  with  the  appropriate  libraries.
Detailed information on how to use each library is provided in the appropriate chapters in this document.
    Linking libraries into the kernel may seem strange at first, since all of the existing OS kernels that we
have encountered seem to have a strong "anti-library" do-everything-yourself attitude.  However, the linker
can link libraries into a kernel just as easily as it can link them into application programs; we believe that
the primary reason existing kernels avoid libraries is because the available libraries aren't designed to be
used in kernels; they make too many assumptions about the environment they run in.  Filling that gap is
the purpose of the OS toolkit.


    All of the OS toolkit libraries are designed so that individual components of each library can be replaced

easily;  we  have  taken  pains  to  document  the  dependencies  clearly  so  that  clients  can  override  whatever
components they need to,  without causing unexpected results.  In fact,  in many cases it is necessary to
override certain functions or symbols in order to make effective use of the toolkit.  To override a library
function or any other symbol defined by a library, just define your own version of it in your kernel or other
client program; the linker will ensure that your definition is used instead of the library's.
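
    The override mechanism is ordinary linker symbol resolution, so it can be sketched with any symbol
the toolkit supplies a default for. The fragment below is an illustrative, self-contained sketch (the
console buffer is purely hypothetical, standing in for a real console driver): because the client defines
its own putchar, the linker resolves every reference to it here and never pulls in a library default.

```c
/* A client-side replacement for a library-provided putchar().
 * Since this definition lives in the client, the linker uses it and
 * never pulls in the library's copy.  The console buffer below is an
 * illustrative stand-in for a real console driver. */
char console_buf[128];
int console_pos;

int putchar(int c)
{
    if (console_pos < (int)sizeof(console_buf) - 1)
        console_buf[console_pos++] = (char)c;
    return c;
}
```

The same pattern applies to any other library symbol: defining it in the kernel is all that is required.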



1.4       Example  Kernels


If you are starting a new OS kernel, or just want to experiment with the OS toolkit in a "standalone" fashion,
an easy way to begin is with one of the example "kernels" in the examples directory. These examples are the
kernel equivalent of a "Hello World" application, demonstrating the use of various facilities together such as
the base environment initialization code in libkern, the minimal console driver code, the minimal C library,
and the remote debugging stub.  The code implementing these examples is almost as small and simple as
equivalent ordinary user-level applications would be because they fully rely on the OS toolkit to provide the
underlying infrastructure necessary to get started.  The compilation and linking rules in the GNUmakerules
files for these example programs demonstrate how to link kernels for various startup environments.
    The following example "kernels" are currently provided:

    o   [x86 PC] multiboot: A simple "Hello World" kernel that can be booted from a MultiBoot boot loader.
        For information on the MultiBoot standard, see ftp://flux.cs.utah.edu/flux/multiboot/.

    o   [x86 PC] multiboot-gdb: A simple MultiBoot kernel that demonstrates how to use the remote source-
       level debugging support.

    o   [x86 PC] multiboot-smp:  A simple MultiBoot kernel that demonstrates how to use the SMP support.
       It also uses the remote debugging support.

    o   [x86 PC] biosboot: A "Hello World" boot sector that can be started at BIOS boot time. XXX This is
       functional, but not yet integrated into the main source tree.

    o   [x86 PC] dosboot:  A "Hello World" kernel that runs from MS-DOS. XXX This is functional, but not
       yet integrated into the main source tree.

    Later, we will provide other more complex example kernels that take advantage of more of the OS toolkit's
functionality.  Furthermore, our Fluke kernel already makes full use of the OS toolkit, and it will provide an
example of a full-scale real-world client kernel when released.
    Booting the example kernels requires either the grub, Mach, or BSD bootblocks.  grub can boot the
kernels as-is, whereas the other bootblocks need the kernel to be in a different format.  This conversion can
be done with the mkbsdimage script, installed with the OS toolkit when configured for a Mach or BSD host.
This script creates an NMAGIC a.out image from a multiboot image. It does this by using GNU ld to glue
bsdboot.o, the "boot adaptor", onto the front of the multiboot image.  For this reason, the bsdboot.o file
must be built with appropriate a.out tools, but the image it is converting needn't be in the same format; it
can be almost any other a.out variant or even ELF. For example,

       % mkbsdimage multiboot

creates a bootable image named "Image."  The mkbsdimage script can do more complex things,  such as
combining an arbitrary number of "boot modules." See 8.11 and the script for more info.



1.5       Overall  Design  Principles


This section describes some of the general principles and policies we followed in designing the components of
the OS toolkit.  This section is most relevant for people developing or modifying the toolkit itself; however,
this information may also help users of the toolkit to understand it better and to be able to use it more
effectively.


    o  Document intermodule dependencies within each library.  This policy contrasts to most other third-

       party libraries, which are usually documented "black box" fashion:  you are given descriptions of the
       "public" interfaces, and that's it. Although with such libraries you could in principle override individual
       library components with your own implementations, there is no documentation describing how to do
       so; and even if such documentation existed, these libraries often aren't well modularized internally, so
       replacing one library component would require understanding and dealing with a complicated web of
       relationships with other components.

       The downside of this policy is that exposing the internal composition of the libraries this way leaves less
       room for the implementation to change later without affecting the client-visible interfaces.  However,
       we felt that for the purposes of the OS toolkit, allowing the client more flexibility in using the library
       is more important than hiding implementation details.


    o  Where there is already a standard meaning associated with a symbol, use it. For example, our toolkit
       assumes that putchar() means the same thing as it does under normal POSIX, and is used the same
       way, even if it works very differently in a kernel environment. Similarly, the toolkit's startup code starts
       the kernel by calling the standard main() function with the standard argc and argv parameters, even
       if the kernel was booted straight off the hardware.


    o  Cleanly separate and clearly flag architecture- and platform-specific facilities. Although the OS toolkit
       currently only runs on the x86 architecture,  we plan to port it to other architectures such as PA-
       RISC and PowerPC in the future.  (We will also help with ports by others, e.g., to the DEC Alpha.)
       Architecture-specific and platform-specific features and interfaces are tagged in this document with
       boxed icons, e.g., [x86] indicating the Intel x86 processor architecture, and [x86 PC] representing
       x86-based PC platforms.
1.6       Portability


Although by its nature a large percentage of the implementation of the Flux OS toolkit is inherently machine-
dependent, it is our intention to make its interfaces as machine-independent as is possible without compro-
mising the toolkit's other goals.  However, since our first implementation of the toolkit is for the x86 PC,
biases  in  naming,  specification,  documentation,  approach,  and  coding  have  inevitably  crept  in.  We  will
be grateful if readers will point these out to us and suggest or provide improvements.  Send comments to
oskit@jensen.cs.utah.edu.
1.7       Building  the  Toolkit


The OS toolkit follows the GNU conventions for configuration, building, and installation.  First, you need
to run the configure script in the top-level directory; this script will attempt to guess your system type
and  locate  various  required  tools  such  as  the  C  compiler.   To  cross-compile  the  OS  toolkit  for  another
architecture, you will need to specify the host machine type (the machine that the OS toolkit will run on)
and the build machine type (the machine on which you are building the toolkit), using the --build=machine
and --host=machine options.  Since the OS toolkit is a standalone package and does not use any include
files or libraries other than its own, the operating system component of the host machine type is not directly
relevant to the configuration of the OS toolkit.  However, the host machine designator as a whole is used
by  the  configure  script  as  a  name  prefix  to  find  appropriate  cross-compilation  tools.   For  example,  if
you specify `--host=i486-linux', the configure script will search for build tools called i486-linux-gcc,
i486-linux-ar, i486-linux-ld, etc.  Among other things, which tools are selected determines the object
format of the created images.  For more information on how to run the configure script, see the INSTALL
file in the top-level directory.
    To build the OS toolkit, go to the top-level source directory (or the top-level object directory, if you
configured  the  toolkit  to  build  in  a  separate  object  directory),  and  run  GNU  make  (e.g.,  just  `make'  on
Linux systems, or `gmake' on BSD systems). Note that the OS toolkit requires GNU make: its makefiles are
very unlikely to work with another make utility.  To avoid confusion, the OS toolkit's makefiles are named


GNUmakefile rather than just Makefile;  this way, if you accidentally run the wrong make utility, it will

simply complain that it can't find any makefile, instead of producing an obscure error.
    Once the toolkit is built, you can install it with `make  install'.  By default, the libraries will go into
/usr/local/lib and the header files into /usr/local/include,  unless you specified a --prefix on the
configure  command  line.  All  of  the  OS  toolkit  header  files  are  installed  in  a  flux/  subdirectory  (e.g.
/usr/local/include/flux), so they should not conflict with any header files already present. XXX should
libraries also be put into a subdirectory, or perhaps be named libflux_* or something like that?



1.8       Linking  Order


Since the OS toolkit consists of a number of different libraries, and some of these libraries may call
functions in others, depending on which functions you use out of each one, the order in which the libraries
are linked can be important.  Figure ??  shows these interlibrary dependencies as a partial order over the
libraries; if you follow this order, things should work right.
    XXX draw figure






Chapter  2


List-based   Memory   Manager   Library



(liblmm.a)



2.1       Introduction


The list-based memory manager provides simple but extremely generic and flexible memory management
services.  It provides functionality at a lower level than typical ANSI C malloc-style memory allocation
mechanisms.1  For example, the LMM does not keep track of the sizes of allocated memory blocks; that job
is left to the client of the LMM library or other high-level memory allocation mechanisms. (For example, the
default version of malloc() provided by the minimal C library, described in Section 6.4.2, is implemented
on top of the LMM.)
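
    To make this division of labor concrete, the sketch below shows a malloc/free pair that records each
block's size in a hidden header, since the LMM will later need the size passed back to it. All the names
here are hypothetical: my_malloc and my_free are illustrative, and the lmm_*_stub functions are trivial
stand-ins (a bump allocator over a static arena) for the real LMM calls, whose actual signatures and
semantics are documented later in this chapter.

```c
#include <stddef.h>

/* Stand-in for lmm_alloc: a trivial bump allocator over a static arena,
 * just enough to make the sketch self-contained.  Like the real LMM,
 * it hands out raw blocks and records nothing about their sizes. */
static char arena[4096];
static size_t arena_used;

static void *lmm_alloc_stub(size_t size)
{
    if (arena_used + size > sizeof(arena))
        return NULL;                  /* out of memory: the LMM just fails */
    void *p = arena + arena_used;
    arena_used += size;
    return p;
}

static void lmm_free_stub(void *block, size_t size)
{
    /* The real lmm_free returns the block to a free list; note that the
     * CALLER must supply the size, because the LMM never stored it. */
    (void)block;
    (void)size;
}

/* malloc/free built on top: a hidden header remembers each block's size
 * so the user-visible free needs only the pointer. */
union header {
    size_t size;
    max_align_t align;                /* keep the user block well aligned */
};

void *my_malloc(size_t size)
{
    union header *h = lmm_alloc_stub(sizeof(union header) + size);
    if (h == NULL)
        return NULL;
    h->size = size;
    return h + 1;                     /* user memory starts past the header */
}

void my_free(void *p)
{
    if (p == NULL)
        return;
    union header *h = (union header *)p - 1;
    lmm_free_stub(h, sizeof(union header) + h->size);
}
```

The default malloc in the minimal C library follows the same general approach, at the cost of one
header per block.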
    The LMM attempts to make as few assumptions as possible about the environment in which it runs and
the use to which it is put. For example, it does not assume that all allocatable "heap" memory is contained
in one large contiguous range of virtual addresses, as is the case in typical Unix process environments.
Similarly, it does not assume that the heap can be expanded on demand (although the LMM can certainly
be used in situations in which the heap is expandable).  It does not assume that it is OK to "waste" pages
on the assumption that they will never be assigned "real" physical memory unless they are actually touched.
It does not assume that there is only one "type" of memory, or that all allocatable memory in the program
should be managed as a single heap.  Thus, the LMM is suited for use in a wide variety of environments,
and can be used for both physical and virtual memory management.
    The LMM has the following main features:


    o  Very efficient use of memory.  At most fourteen bytes are wasted in a given allocation (because of
       alignment restrictions); there is no memory overhead for properly-aligned allocations.


    o  Support for allocating memory with specific alignment properties.  Memory can be allocated at any
       given  power-of-two  boundary,  or  at  an  arbitrary  offset  beyond  a  specified  power-of-two  boundary.
       Allocation requests can also be constrained to specific address ranges or even exact addresses.


    o  Support for allocations of memory of a specific "type." For example, on the PC architecture, sometimes
       memory needs to be allocated specifically from the first 16MB of physical memory, or from the first
       1MB of memory.


    o  Support for a concept of allocation priority, which allows certain memory regions to be preferred over
       others for allocation purposes.


    o  The LMM is fully reentrant and does not use any global variables;  thus,  different LMM pools are
       completely independent of each other.
    ____________________
    1 The LMM is designed quite closely along the lines of the Amiga operating system's low-level memory management system.


                                                                17
18                                  CHAPTER 2.  LIST-BASED MEMORY MANAGER LIBRARY (LIBLMM.A)


    o  Extremely flexible management of the memory pool. LMM pools can be grown or shrunk at any time,

       under the complete control of the caller.  The client can also "map" the free memory pool, locating
       free memory blocks without allocating them.
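
    The aligned-allocation feature above can be illustrated with a sketch. alloc_aligned is a
hypothetical name carving blocks from a static arena; the real LMM call additionally takes flags, an
offset beyond the boundary, and address-range constraints, as documented later in this chapter.

```c
#include <stddef.h>
#include <stdint.h>

static char heap[4096];
static size_t heap_used;

/* Allocate `size` bytes aligned to a 2^align_bits byte boundary. */
void *alloc_aligned(size_t size, unsigned align_bits)
{
    uintptr_t base = (uintptr_t)heap + heap_used;
    uintptr_t mask = ((uintptr_t)1 << align_bits) - 1;
    uintptr_t start = (base + mask) & ~mask;    /* round up to the boundary */
    size_t skipped = (size_t)(start - base);    /* at most 2^align_bits - 1 bytes */
    if (heap_used + skipped + size > sizeof(heap))
        return NULL;
    heap_used += skipped + size;
    return (void *)start;
}
```

Only the bytes between the current position and the next boundary are lost, which is why a
properly-aligned request incurs no overhead at all.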


    Some of the LMM's (potential) disadvantages with respect to more conventional memory allocators are:


    o  It requires the caller to remember the size of each allocated block, and pass its size back as a parameter
       to lmm_free.  Thus, a malloc implemented on top of this memory manager would have to remember
       the size of each block somewhere.

    o  Since the LMM uses sequential searches through linked lists, allocations are not as blazingly fast as
       in packages that maintain separate free lists for different sizes of memory blocks.  However, perfor-
       mance is still generally acceptable for many purposes, and higher-level code is always free to cache
       allocated blocks of commonly used sizes if extremely high-performance memory allocation is needed.
       (For example, a malloc package built on top of the LMM could do this.)

    o  The LMM does not know how to "grow" the free list automatically (e.g.  by calling sbrk() or some
       equivalent);  if it runs out of memory, the allocation simply fails.  If the LMM is to be used in the
       context of a growable heap, an appropriate grow-and-retry mechanism must be provided at a higher
       level.

    o  In  order  to  avoid  making  the  LMM  dependent  on  threading  mechanisms,  it  does  not  contain  any
       internal synchronization code.  The LMM can be used in multithreaded environments, but the calling
       code must explicitly serialize execution while invoking LMM operations on a particular LMM heap.
       However, LMM operations on different heaps are fully independent and do not need to be synchronized
       with each other.
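    The first disadvantage above, that the caller must remember each block's size, is commonly handled
by prefixing each allocated block with a hidden size header.  The following is a minimal sketch of that
technique.  The pool_alloc/pool_free stand-ins are hypothetical stubs (backed here by the C library so
the sketch is self-contained); a real wrapper would call lmm_alloc and lmm_free on a particular pool.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for lmm_alloc/lmm_free so this sketch is
   self-contained; real code would invoke the LMM routines on a pool. */
static void *pool_alloc(size_t size) { return malloc(size); }
static void pool_free(void *block, size_t size) { (void)size; free(block); }

/* Allocate size bytes, recording the full block size in a header word
   placed just before the memory handed back to the caller. */
void *my_malloc(size_t size)
{
	size_t *block = pool_alloc(sizeof(size_t) + size);
	if (block == NULL)
		return NULL;
	*block = sizeof(size_t) + size;   /* remember the size for my_free */
	return block + 1;                 /* caller sees memory past the header */
}

/* Free a block from my_malloc: recover the size from the header and
   pass it back to the allocator, as lmm_free requires. */
void my_free(void *mem)
{
	if (mem != NULL) {
		size_t *block = (size_t *)mem - 1;
		pool_free(block, *block);
	}
}
```

A malloc package built this way pays one extra word per allocation in exchange for the standard
size-free interface.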



2.2       Memory  regions


The LMM maintains a concept of a memory region, represented by the data type lmm_region_t: a range of
memory addresses within which free memory blocks may be located.  Multiple memory regions can be
attached to a single LMM pool, with different attributes attached to each region.
    The attributes attached to memory regions include a set of caller-defined flags, which typically represent
fundamental properties of the memory described by the region (i.e., the ways in which the region can be
used), and a caller-specified allocation priority, which allows the caller to specify that some regions are to
be preferred over others for satisfying allocation requests.
    It is not necessary for all the memory addresses covered by a region to actually refer to valid memory
locations; the LMM will only ever attempt to access subsections of a region that are explicitly added to the
free memory pool using lmm_add_free. Thus, for example, it is perfectly acceptable to create a single region
covering all virtual addresses from 0 to (vm_offset_t)-1, as long as only the memory areas that are actually
valid and usable are added to the free pool with lmm_add_free.
    The LMM assumes that if more than one region is attached to an LMM pool, the address ranges of those
regions do not overlap each other.  Furthermore, the end address of each region must be larger than the
start address, using unsigned arithmetic: a region must not "wrap around" the top of the address space to
the bottom.  These restrictions are not generally an issue, but they can matter in some situations, such
as when running on the x86 with unusual segment layouts.



2.2.1       Region flags

The region flags, of type lmm_flags_t, generally indicate certain features or capabilities of a particular range
of memory.  Allocation requests can specify a mask of flag bits that indicate which region(s) the allocation
may be made from.  For each flag bit set in the allocation request, the corresponding bit must be set in the
region in order for the region to be considered for satisfying the allocation.
    For example, on PCs, the lowest 1MB of physical memory is "special" in that only it can be accessed from
real mode, and the lowest 16MB of physical memory is special in that only it can be accessed by the built-in
DMA controller.  Thus, typical behavior on a PC would be to create three LMM regions: one covering the
lowest 1MB of physical memory, one covering the next 15MB, and one covering all other physical memory.
The first region would have the "1MB memory" and "16MB memory" bits set in its associated flags word,
the second region would have only the "16MB memory" bit set, and the third region would have neither.
Normal allocations would be done with a flags word of zero, which allows the allocation to be satisfied from
any region, but, for example, allocations of DMA buffers would be done with the "16MB memory" flag set,
which will force the LMM to allocate from either the first or second region.  (In fact, this is the default
arrangement used by the libkern library when setting up physical memory for an OS running on a PC; see
Section 8.10.2 for more details.)
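    The eligibility rule, that every flag bit set in the request must also be set in the region, is a
simple mask test.  A self-contained sketch follows; the flag names and the helper function are
illustrative only (the actual bit definitions are supplied elsewhere, e.g. by libkern):

```c
/* Illustrative flag bits for the PC arrangement described above;
   these names are hypothetical, not the library's definitions. */
#define MEM_1MB   0x01u   /* region lies below 1MB  */
#define MEM_16MB  0x02u   /* region lies below 16MB */

/* A region may satisfy an allocation only if every bit set in the
   request's flags word is also set in the region's flags word. */
static int region_eligible(unsigned region_flags, unsigned request_flags)
{
	return (region_flags & request_flags) == request_flags;
}
```

With the three-region PC setup, a request with flags of zero is eligible for every region, while a DMA
buffer request carrying MEM_16MB is eligible only for the first two.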



2.2.2       Allocation priority

The second attribute associated with each region, the allocation priority, indicates in what order the regions
should be searched for free memory to satisfy memory allocation requests. Regions with a higher allocation
priority value are preferred over regions with a lower priority.
    Allocation priorities are typically useful in two situations.  First,  one section of a machine's physical
memory may provide faster access than other regions for some reason, for example because it is directly
connected to the processor rather than connected over a slower bus of some kind. (For example, the Amiga
has what is known as "fast" memory, which typically supports faster access because it does not contend
with ongoing DMA activity in the system.)  In this case, if it is not likely that all available memory will be
needed, the memory region describing the faster memory might be given higher priority so that the LMM
will allocate from it whenever possible.
    Alternatively, it can be useful to give a region a lower priority because it is in some way more "precious"
than other memory, and should be conserved by satisfying normal allocation requests from other regions
whenever possible.  For example, on the PC, it makes sense to give 16MB memory a lower priority than
"high" memory, and 1MB memory a still lower priority; this will decrease the likelihood of using up precious
"special" memory for normal allocation requests which just need any type of memory, and causing memory
shortages when special memory really is needed.
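    Putting the two region attributes together, region selection amounts to: among the regions whose
flags cover the request, prefer the one with the highest priority.  A self-contained sketch of that
selection rule (the struct and function here are illustrative; the real lmm_region_t is opaque):

```c
/* Illustrative region descriptor, not the library's representation. */
struct demo_region {
	unsigned flags;   /* capability bits, e.g. "below 16MB" */
	int      pri;     /* allocation priority; higher is preferred */
};

/* Return the index of the best region for a request, or -1 if none
   qualifies: flags must cover the request, then highest priority wins. */
static int choose_region(const struct demo_region *r, int n,
                         unsigned request_flags)
{
	int best = -1, i;
	for (i = 0; i < n; i++) {
		if ((r[i].flags & request_flags) != request_flags)
			continue;                 /* region lacks a required bit */
		if (best < 0 || r[i].pri > r[best].pri)
			best = i;                 /* higher priority found */
	}
	return best;
}
```

For the PC arrangement, giving the plain-memory region the highest priority means a request with flags of
zero is steered away from the precious 1MB and 16MB regions whenever possible.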



2.3       Example  use


To make an LMM pool ready for use, a client generally proceeds in three stages:


    1. Initialize the LMM pool, using lmm_init.


    2. Add one or more memory regions to the LMM, using lmm_add_region.


    3. Add some free memory to the pool, using lmm_add_free.  (The free memory added should overlap at
       least one of the regions added in step 2; otherwise it will simply be thrown away.)


    Here is an example initialization sequence that sets up an LMM pool for use in a Unix-like environment,
using an (initially) 1MB memory pool to service allocations. It uses only one region, which covers all possible
memory addresses; this allows additional free memory areas to be added to the pool later regardless of where
they happen to be located.


#include  <flux/lmm.h>
#include  <unistd.h>


lmm_t  lmm;
lmm_region_t  region;


int  setup_lmm()
{
   unsigned  mem_size  =  1024*1024;
   char  *mem  =  sbrk(mem_size);
   if  (mem  ==  (char*)-1)
      return  -1;

   lmm_init(&lmm);
   lmm_add_region(&lmm,  &region,  (void*)0,  (vm_size_t)-1,  0,  0);
   lmm_add_free(&lmm,  mem,  mem_size);

   return  0;
}


    After the LMM pool is set up properly, memory blocks can be allocated from it using any of the lmm_alloc
functions described in the reference section below, and returned to the memory pool using the lmm_free
function.

2.4       Restrictions  and  guarantees


This section describes some of the important restrictions the LMM places on its use.  Many of these are
restrictions one would expect to be present; however, they are listed here anyway in order to make them
explicit and to clarify in what situations the LMM can and cannot be used.

    As mentioned previously, the LMM implements no internal synchronization mechanisms, so if it is used
in a multithreaded environment, the caller must explicitly serialize execution when performing operations
on a particular LMM pool.

    If a client uses multiple LMM memory pools, then each pool must manage disjoint blocks of memory.
In other words, a particular chunk of memory must never be present on two or more LMM pools at once.
However, as long as the actual memory blocks in different pools are disjoint, the overall memory regions
managed by the pools can overlap.  For example, it is OK if pages 1 and 3 are managed by one LMM pool
and page 2 is managed by another, as long as none of those pages are managed by two LMM pools at once.

    The  LMM  uses  the  memory  it  manages  as  storage  space  for  free  list  information.   This  means  that
the LMM is not suitable for managing memory that cannot be accessed directly using normal C pointer
arithmetic in the local address space, or memory with special access semantics, such as flash memory.  In
such a situation, you must use a memory management system that stores free memory metadata separately
from the free memory itself.

    The LMM guarantees that it will not use any memory other than the memory explicitly given to it for
its use through the lmm_init, lmm_add_region, and lmm_add_free calls.  This implies that no "destructor"
functions need to be provided by the library in order to destroy LMM pools, regions, or free lists: an LMM
pool can be "destroyed" by the caller simply by overwriting or reinitializing the memory with something
else.  Of course, it is up to the caller to ensure that no attempts are made to use an LMM pool that has
been destroyed.

2.5       Sanity  checking


When the OS toolkit is compiled with debugging enabled (--enable-debug), a fairly large number of sanity
checks are compiled into the LMM library to help detect memory list corruption bugs and such.  Assertion
failures in the LMM library can indicate bugs either in the LMM itself or in the application using it (e.g.,
freeing blocks twice, overwriting allocated buffers, etc.). In practice such assertion failures usually tend to be
caused by the application, since the LMM library itself is quite well-tested and stable.  For additional help
in debugging memory management errors in applications that use the C-standard malloc/free interfaces, the
OS toolkit's memdebug library can be used as well (see Section 7).

    Note that the sanity checks in the LMM library are likely to slow down the library considerably under
normal use, so it may be a good idea to turn off this debugging support when linking the LMM into "stable"
versions of a program.


2.6       API  reference


The following sections describe the functions exported by the LMM in detail. All of these functions, as well
as the types necessary to use them, are defined in the header file <flux/lmm.h>.


2.6.1       lmm_init:  initialize an LMM pool


Synopsis

       #include  <flux/lmm.h>

       void lmm_init(lmm_t *lmm);


Description

       This function initializes an LMM pool. The caller must provide a pointer to an lmm_t structure,
       which is typically (but doesn't have to be) statically allocated; the LMM system uses this struc-
       ture to keep track of the state of the LMM pool. In subsequent LMM operations, the caller must
       pass back a pointer to the same lmm structure, which acts as a handle for the LMM pool.

       Note that the LMM pool initially contains no regions or free memory; thus, immediate attempts
       to allocate memory from it will fail. The caller must register one or more memory regions using
       lmm_add_region, and then add some free memory to the pool using lmm_add_free, before the
       LMM pool will become useful for servicing allocations.


Parameters

       lmm:     A pointer to an uninitialized structure of type lmm_t which is to be used to represent the
             LMM pool.


2.6.2       lmm_add_region:  register a memory region in an LMM pool


Synopsis

       #include  <flux/lmm.h>

       void  lmm_add_region(lmm_t  *lmm,  lmm_region_t  *region,  void  *addr,  vm_size_t  size,
       lmm_flags_t flags, lmm_pri_t pri);


Description

       This function attaches a new memory region to an LMM pool. The region describes a contiguous
       range of addresses with specific attributes, in which free memory management may need to be
       done.

       The caller must supply a structure of type lmm_region_t in which the LMM can store critical
       state for the region.  This structure must remain available for the exclusive use of the LMM for
       the entire remaining lifetime of the LMM pool to which it is attached.  However, the contents of
       the structure are opaque; client code should not examine or modify them directly.

       The caller need only ensure that if multiple regions are attached to a single LMM pool, they
       refer to disjoint address ranges.

       Note that this routine does not actually make any free memory available; it merely registers a
       range of addresses in which free memory might be made available later.  Typically this call is
       followed by one or more calls to lmm_add_free, which actually adds memory blocks to the pool's
       free memory list.

       The act of registering a new region does not cause any of the memory described by that region
       to be accessed or modified in any way by the LMM; only the lmm_region_t structure itself is
       modified at this point.  The LMM will only access and modify memory that is explicitly added
       to the free list using lmm_add_free.  This means, for example, that it is safe to create a single
       region with a base of 0 and a size of (vm_size_t)-1, regardless of what parts of that address
       range actually contain valid memory.

       See Section 2.2 for general information on memory regions.


Parameters

       lmm:     The LMM pool to which the region should be added.

       region:    A pointer to a structure in which the LMM maintains the critical state representing
             the region.  The initial contents of the structure don't matter; however, the structure must
             remain available and untouched for the remaining lifetime of the LMM pool to which it is
             attached.

       addr :   The start address of the region to add.  Different regions attached to a single LMM pool
             must cover disjoint areas of memory.

       size:   The size of the region to add.  Must be greater than zero, but no more than
              (vm_offset_t)-1 - addr; in other words, the region must not wrap around past the end of
              the address space.

       flags:   The attribute flags to be associated with the region. Allocation requests will be satisfied
             from this region only if all of the flags specified in the allocation request are also present in
             the region's flags word.

       pri :  The allocation priority for the region, as a signed integer.  Higher priority regions will be
             preferred over lower priority regions for satisfying allocations.


2.6.3       lmm_add_free:  add a block of free memory to an LMM pool


Synopsis

       #include  <flux/lmm.h>

       void lmm_add_free(lmm_t *lmm, void *block, vm_size_t size);


Description

       This routine declares a range of memory to be available for allocation, and attaches that memory
       to the specified LMM pool.  The memory block will be made available to satisfy subsequent
       allocation requests.

       The caller can specify a block of any size and alignment, as long as the block does not wrap
       around the end of the address space.  The LMM may discard a few bytes at the beginning and
       end of the block in order to enforce internal alignment constraints; however, the LMM will never
       touch memory outside the specified block (unless, of course, that memory is part of another free
       block).

       If the block's beginning or end happens to coincide exactly with the beginning or end of a block
       already on the free list, then the LMM will merge the new block with the existing one. Of course,
       the block may be further subdivided or merged later as memory is allocated from the pool and
       returned to it.

       The new free block will automatically be associated with whatever region it happens to fall in.
       If the block crosses the boundary between two regions, then it is automatically split between the
       regions. If part of the block does not fall within any region, then that part of the block is simply
       ignored and forgotten about.  (By extension, if the entire block does not overlap any region, the
       entire block is dropped on the floor.)


Parameters

       lmm:     The LMM pool to add the free memory to.

       block :   The start address of the memory block to add. There are no alignment restrictions.

       size:   The size of the block to add, in bytes.  There are no alignment restrictions, but the size
             must not be so large as to wrap around the end of the address space.


2.6.4       lmm_remove_free:  remove a block of memory from an LMM pool


Synopsis

       #include  <flux/lmm.h>

       void lmm_remove_free(lmm_t *lmm, void *block, vm_size_t size);


Description

       This routine is complementary to lmm_add_free: it removes all free memory blocks in a specified
       address range from an LMM memory pool.  After this call completes, unless the caller subse-
       quently adds memory in this range back onto the LMM pool using lmm_add_free or lmm_free, it
       is guaranteed that no subsequent memory allocation will return a memory block that overlaps
       the specified range.

       The address range specified to this routine does not actually all have to be on the free list.  If
       the address range contains several smaller free memory blocks, then all of those free blocks are
       removed from the pool without touching or affecting any memory parts of the address range that
       weren't in the free memory list. Similarly, if a free block crosses the beginning or end of the range,
       then the free block is "clipped" so that the part previously extending into the address range is
       removed and thrown away.

       One use for this routine is to reserve a specific piece of memory for some special purpose, and
       ensure that no subsequent allocations use that region. For example, the example MultiBoot boot
       loaders in the OS toolkit use this routine to reserve the address range that will eventually be
       occupied by the OS executable being loaded, ensuring that none of the information structures to
       be passed to the OS will overlap with the final position of its executable image.


Parameters

       lmm:     The LMM pool from which to remove free memory.

       block :   The start address of the range in which to remove all free memory.

       size:   The size of the address range.
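       The clipping behavior described above is plain interval arithmetic.  The following sketch (a
       hypothetical helper, not part of the LMM API) computes how many bytes of a free block remain
       free after a removal range is clipped out of it:

```c
#include <stddef.h>
#include <stdint.h>

/* Given a free block [block, block+size) and a removal range
   [rmin, rmax), return how many bytes of the block remain free
   afterward; the overlap with the removal range is clipped away.
   This mirrors the clipping rule described for lmm_remove_free,
   but is an illustration only, not library code. */
static size_t bytes_surviving(uintptr_t block, size_t size,
                              uintptr_t rmin, uintptr_t rmax)
{
	uintptr_t lo = block, hi = block + size;
	uintptr_t cut_lo = lo > rmin ? lo : rmin;   /* overlap start */
	uintptr_t cut_hi = hi < rmax ? hi : rmax;   /* overlap end   */
	size_t overlap = cut_hi > cut_lo ? (size_t)(cut_hi - cut_lo) : 0;
	return size - overlap;
}
```

       A block lying entirely inside the removal range thus survives with zero bytes, i.e. it is removed
       outright, while a block straddling a boundary keeps only the part outside the range.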


2.6.5       lmm_alloc:  allocate memory


Synopsis

       #include  <flux/lmm.h>

       void  *lmm_alloc(lmm_t *lmm, vm_size_t size, lmm_flags_t flags);


Description

       This is the primary routine used to allocate memory from an LMM pool.  It searches for a free
       memory block of the specified size and with the specified memory type requirements (indicated
       by the flags argument), and returns a pointer to the allocated memory block. If no memory block
       of sufficient size and proper type can be found, then this function returns NULL instead.

       Note that unlike with malloc, the caller must keep track of the size of allocated blocks in order
       to allow them to be freed properly later.


Parameters

       lmm:     The memory pool from which to allocate.

       size:   The number of contiguous bytes of memory needed.

       flags:   The memory type required for this allocation.  For each bit set in the flags parameter,
             the corresponding bit in a region's flags word must also be set in order for the region to be
             considered for allocation. If the flags parameter is zero, memory will be allocated from any
             region.


Returns

       Returns a pointer to the memory block allocated, or NULL if no sufficiently large block of the
       correct type is available.  The returned memory block will be at least doubleword aligned, but
       no other alignment properties are guaranteed by this routine.


2.6.6       lmm_alloc_aligned:  allocate memory with a specific alignment


Synopsis

       #include  <flux/lmm.h>

       void  *lmm_alloc_aligned(lmm_t *lmm, vm_size_t size, lmm_flags_t flags, int align_bits,
       vm_offset_t align_ofs);


Description

       This routine allocates a memory block with specific alignment constraints. It works like lmm_alloc,
       except that it enforces the rule that the lowest align_bits bits of the address of the returned block
       must match the lowest align_bits bits of align_ofs.  In other words, align_bits specifies an
       alignment boundary as a power of two, and align_ofs specifies an offset from "natural" alignment.
       If no memory block with the proper requirements can be found, then this function returns NULL
       instead.


Parameters

       lmm:     The memory pool from which to allocate.

       size:   The number of contiguous bytes of memory needed.

       flags:   The memory type required for this allocation.  For each bit set in the flags parameter,
             the corresponding bit in a region's flags word must also be set in order for the region to be
             considered for allocation. If the flags parameter is zero, memory will be allocated from any
             region.

       align_bits:    The number of low bits of the returned memory block address that must match the
             corresponding bits in align_ofs.

       align_ofs:    The required offset from natural power-of-two alignment. If align_ofs is zero, then the
              returned memory block will be naturally aligned on a 2^align_bits boundary.


Returns

       Returns a pointer to the memory block allocated, or NULL if no memory block satisfying the
       specified requirements can be found.
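       The constraint this routine enforces reduces to a mask test: with mask = 2^align_bits - 1, a
       returned address addr must satisfy (addr & mask) == (align_ofs & mask).  A small sketch of that
       check (a hypothetical helper, for illustration only):

```c
#include <stdint.h>

/* Return 1 if addr meets the constraint lmm_alloc_aligned enforces:
   the low align_bits bits of addr equal the low bits of align_ofs. */
static int satisfies_alignment(uintptr_t addr, int align_bits,
                               uintptr_t align_ofs)
{
	uintptr_t mask = ((uintptr_t)1 << align_bits) - 1;
	return (addr & mask) == (align_ofs & mask);
}
```

       For example, a naturally aligned 4KB page corresponds to align_bits = 12 and align_ofs = 0.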


2.6.7       lmm_alloc_gen:  allocate memory with general constraints


Synopsis

       #include  <flux/lmm.h>

       void  *lmm_alloc_gen(lmm_t *lmm, vm_size_t size, lmm_flags_t flags, int align_bits, vm_offset_t
       align_ofs, vm_offset_t in_min, vm_size_t in_size);


Description

       This routine allocates a memory block meeting various alignment and address constraints.  It
       works like lmm_alloc_aligned,  except that as an additional constraint,  the returned memory
       block must fit entirely in the address range specified by the in_min and in_size parameters.

       If in_size is equal to size, then memory will only be allocated if a block can be found at exactly
       the address specified by in_min; i.e. the returned pointer will either be in_min or NULL.


Parameters

       lmm:     The memory pool from which to allocate.

       size:   The number of contiguous bytes of memory needed.

       flags:   The memory type required for this allocation.  For each bit set in the flags parameter,
             the corresponding bit in a region's flags word must also be set in order for the region to be
             considered for allocation. If the flags parameter is zero, memory will be allocated from any
             region.

       align_bits:    The number of low bits of the returned memory block address that must match the
             corresponding bits in align_ofs.

       align_ofs:    The required offset from natural power-of-two alignment. If align_ofs is zero, then the
              returned memory block will be naturally aligned on a 2^align_bits boundary.

       in_min:     Start address of the address range in which to search for a free block.  The returned
             memory block, if found, will have an address no lower than in_min.

       in_size:    Size of the address range in which to search for the free block.  The returned memory
             block, if found, will fit entirely within this address range, so that mem_block + size <=
             in_min + in_size.


Returns

       Returns a pointer to the memory block allocated, or NULL if no memory block satisfying all of
       the specified requirements can be found.
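       The extra range constraint amounts to: a returned block must satisfy block >= in_min and
       block + size <= in_min + in_size.  Expressed as a check (a hypothetical helper, for illustration
       only):

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if a block of the given size at address block lies wholly
   within [in_min, in_min + in_size), as lmm_alloc_gen requires. */
static int block_in_range(uintptr_t block, size_t size,
                          uintptr_t in_min, size_t in_size)
{
	return block >= in_min && block + size <= in_min + in_size;
}
```

       Note the special case mentioned above: when in_size equals size, the only address this check
       admits is in_min itself.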


2.6.8       lmm_alloc_page:  allocate a page of memory


Synopsis

       #include  <flux/lmm.h>

       void  *lmm_alloc_page(lmm_t *lmm, lmm_flags_t flags);


Description

       This routine allocates a memory block that is exactly one minimum-size hardware page in
       size, and is naturally aligned to a page boundary.  The same effect can be achieved by calling
       lmm_alloc_aligned with appropriate parameters; this routine merely provides a simpler
       interface for this extremely common action.


Parameters

       lmm:     The memory pool from which to allocate.

       flags:   The memory type required for this allocation.  For each bit set in the flags parameter,
             the corresponding bit in a region's flags word must also be set in order for the region to be
             considered for allocation. If the flags parameter is zero, memory will be allocated from any
             region.


Returns

       Returns a pointer to the memory page allocated, or NULL if no naturally-aligned page can be
       found.


2.6.9       lmm_free:  free previously-allocated memory


Synopsis

       #include  <flux/lmm.h>

       void lmm_free(lmm_t *lmm, void *block, vm_size_t size);


Description

       This routine is used to return a memory block allocated with one of the above lmm_alloc functions
       to the LMM pool from which it was allocated.


Parameters

       lmm:     The memory pool from which the memory block was allocated.

       block :   A pointer to the memory block to free, as returned by one of the lmm_alloc functions.

       size:   The size of the memory block to free, as specified to the allocation function when the block
             was allocated.


2.6.10        lmm_free_page:  free a page allocated with lmm_alloc_page


Synopsis

       #include  <flux/lmm.h>

       void lmm_free_page(lmm_t *lmm, void *block);


Description

       This routine simply calls lmm_free with PAGE_SIZE as the size argument, providing a companion
       to lmm_alloc_page.


Parameters

       lmm:     The memory pool from which the page was allocated.

       block :   A pointer to the page to free, as returned by the lmm_alloc_page function.


2.6.11        lmm_avail:  find the amount of free memory in an LMM pool


Synopsis

       #include  <flux/lmm.h>

       vm_size_t lmm_avail(lmm_t *lmm, lmm_flags_t flags);


Description

       This routine returns the number of bytes of free memory that currently exist in the specified
       LMM memory pool and are of a certain memory type, specified by the flags argument.

       Note that the returned value does not imply that a block of that size can be allocated; due to
       fragmentation it may only be possible to allocate memory in significantly smaller chunks.


Parameters

       lmm:     The LMM pool in which to tally free memory.

       flags:   The memory type to determine the availability of.  Only memory regions whose flags
             words contain all the bits set in the flags parameter will be considered in counting available
             memory. If flags is zero, then all free memory in the LMM pool will be counted.


Returns

       Returns the number of bytes of free memory available of the requested memory type.


2.6.12        lmm_find_free:  scan a memory pool for free blocks


Synopsis

       #include  <flux/lmm.h>

       void lmm_find_free(lmm_t *lmm, [in/out] vm_offset_t *inout_addr, [out] vm_size_t *out_size,
       [out] lmm_flags_t *out_flags);


Description

       This routine can be used to locate free memory blocks in an LMM pool.  It searches the pool
       for free memory starting at the address specified in *inout_addr, and returns a description of the
       lowest block of available memory starting at or above that address.  The address and size of the
       next block found are returned in *inout_addr and *out_size, respectively, and the memory type
       flags associated with the region in which the block was found are returned in *out_flags.  If no
       further free memory can be found above the specified address, then this routine returns with
       *out_size set to zero.

       If  the  specified  *inout_addr  points  into  the  middle  of  a  free  block,  then  a  description  of  the
       remainder of the block is returned, i.e. the part of the block starting at *inout_addr and extending
       to the end of the free block.

       This routine does not actually cause any memory to be allocated; it merely reports on available
       memory  blocks.  The  caller  must  not  actually  attempt  to  use  or  modify  any  reported  blocks
       without allocating them first.  The caller can allocate a block reported by this routine using
       lmm_alloc_gen, using its in_min and in_size parameters to constrain the address of the allocated
       block to exactly the address reported by lmm_find_free.  If this allocation is done immediately
       after the call to lmm_find_free, without any intervening memory allocations, then the allocation
       is guaranteed to succeed. However, any intervening memory allocation operations will effectively
       invalidate the information returned by this routine, and a subsequent attempt to allocate the
       reported block may fail.


Parameters

       lmm:     The LMM pool in which to search for free memory.

       inout_addr :     On entry, the value pointed to by this parameter must be the address at which to
             start searching for free memory.  On return, it contains the start address of the next free
             block actually found.

       out_size:    On return, the value pointed to by this parameter contains the size of the next free
             memory block found, or zero if no more free blocks could be located.

       out_flags:    On return, the value pointed to by this parameter contains the flags word associated
             with the region in which the next free memory block was found.


2.6.13        lmm_dump:  display the free memory list in an LMM pool


Synopsis

       #include  <flux/lmm.h>

       void lmm_dump(lmm_t *lmm);


Description

       This routine is primarily used for debugging the LMM and the code that uses it. It scans through
       the LMM pool and calls printf to display each attached memory region and all the blocks of
       free memory currently contained in each.



Chapter  3


Executable  Program  Interpreter  (libexec.a)

Note: for the x86, with the ELF and a.out formats, this is completely implemented, just not fully documented.
    The Flux OS Toolkit provides a small library that can recognize and load program executables in a
variety of formats.  It is analogous to the GNU Binary File Descriptor (BFD) library, except that it only
supports loading linked program executables rather than general reading and writing of all types of object
files. For this reason, it is much smaller and simpler than BFD.
    Furthermore, as with the other OS toolkit components, the executable interpreter library is designed
to be as generic and environment-independent as possible, so that it can readily be used in any situation
in  which  it  is  useful.  For  example,  the  library  does  not  directly  do  any  memory  allocation;  it  operates
purely using memory provided to it explicitly.  Furthermore, it does not make any assumptions about how
a program's code and data are to be written into the proper target address space; instead it uses generic
callback functions for this purpose.


3.1       Header  Files


3.1.1       exec.h:  definitions for executable interpreter functions


3.1.2       a.out.h:  (semi-)standard a.out file format definitions


3.1.3       elf.h:  standard 32-bit ELF file format definitions


3.2       Function  Reference


3.2.1       exec_load:  detect the type of an executable file and load it


3.2.2       exec_load_elf:  load a 32-bit ELF executable file


3.2.3       exec_load_aout:  load an a.out-format executable file


Auto-detects Linux, NetBSD, FreeBSD, and Mach variants.






Chapter  4


Disk Partition Interpreter (libdiskpart.a)


Author: Kevin T. Van Maren



4.1       Introduction


The Flux OS Toolkit includes code that understands the various partitioning schemes used to divide disk
drives  into  smaller  pieces  for  use  by  filesystems.  This  code  enables  the  use  of  various  (possibly  nested)
partitioning schemes in an easy manner without requiring knowledge of which partitioning scheme was used,
or how these schemes work. For example, you don't need to know the format of a VTOC to use the partitions
it describes; the library handles all of that for you.



4.2       Supported  Partitioning  Schemes


Supported partitioning schemes are:

    o  BSD Disklabels

    o  IBM-PC BIOS/DOS partitions (including logical)

    o  VTOC labels (Mach).

    o  OMRON  and  DEC  label  support  based  on  old  Mach  code  is  provided,  although  it  is  completely
       untested.



4.3       Example  Use


4.3.1       Reading the partition table

This shows how the partitioning information can be extracted in user-mode (running under Unix).  In the
kernel, it would likely be necessary to pass a driver_info structure to a device-specific read function.  In
this case, driver_info is simply a filename string.


/* This is the testing program for the partitioning code. */
#include <flux/diskpart/diskpart.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define FILENAME "/dev/sd0c"

#define MAX_PARTS 30
diskpart_t part_array[MAX_PARTS];

#define DISK_SIZE 10000
/*
 * In this case, we are defining the disk size to be 10000 sectors.
 * Normally, this would be the number of physical sectors on the
 * disk.  If the `disk' is a `file', it would be better to get the
 * equivalent number of sectors from the file size.
 * This is only used to fill in the whole-drive partition entry.
 */

int my_read_fun(char *driver_info, int start, char *buff);

int main(int argc, char *argv[])
{
        int numparts;
        char *filename;

        if (argc == 2)
                filename = argv[1];
        else
                filename = FILENAME;

        /* call the partition code */
        numparts = diskpart_get_partition(filename, my_read_fun, part_array,
                MAX_PARTS, DISK_SIZE);

        printf("%d partitions found\n", numparts);
        /* diskpart_dump(part_array, 0); */
        return 0;
}

int my_read_fun(char *driver_info, int start, char *buff)
{
        int han = open(driver_info, O_RDONLY, 0775);

        lseek(han, 512*start, SEEK_SET);
        read(han, buff, 512);
        close(han);

        /* Should bzero the result if a read error occurs */
        return 0;
}



4.3.2       Using Partition Information

The routine diskpart_lookup_bsd_compat is an example of how the old partition naming can be used even
with the new nested structure. This takes two integers representing the slice and partition. The behavior is
intended to be similar to diskpart_lookup_bsd_string (below), using integers as parameters.
    While this `hack' allows two levels of nesting (slice and partition), it is not general enough to support
arbitrary nesting.  Arbitrary nesting support is most easily achieved by passing string names to a lookup
function which can follow the structure down the partition specifications.  For example, `sd0eab' would be
used to specify the second partition in the first partition inside the fifth top-level partition on the first SCSI
disk.  Since the lookup routine doesn't need to know about the disk,  `eab' would be the partition name
passed to the lookup routine.  This naming scheme would work well as long as there are not more than 26
partitions at any nesting layer.

    diskpart_lookup_bsd_string does a string lookup using the FreeBSD style slice names.  FreeBSD con-
siders the DOS partitioning to be slices. A slice can contain a BSD disklabel, and if it does, then partitions
can be inside the slice.  If the third DOS partition contains a disklabel, then `s3a' would be partition `a'
inside the disklabel. The slice name without a partition would mean the entire slice. Note also that `a' would
alias to partition `a' in the first BSD slice. If there is no BSD slice, then `a' would be aliased to `s1' instead.
However, to avoid confusion, if slice-naming is used, aliases should only be used to point inside a BSD slice.



4.4       Restrictions


This is a list of known restrictions/limitations of the partitioning library.



4.4.1       Endian

The partitioning code only recognizes labels created with the same endian-ness as the machine it is running
on.  While it is quite possible to detect an endian conflict and interpret the information in the label, the
information stored in the partitions will probably not be very useful, as most filesystems expect the numeric
representations to remain constant.



4.4.2       Nesting

Strict nesting, in which a child is not allowed to extend outside the parent, is not enforced, or even checked
by the library. This allows greater flexibility in the use of nested partitions, while also placing greater respon-
sibility on the user's shoulders to ensure that the partition information on the disk is correct.  Enforcement
of strict nesting, should it be desired, is left to the user.
    Due to previous constraints, the search routine does not yet do a recursive search for all possible nestings,
although all `sensible' ones are searched manually.  This is a change that will be incorporated as soon as
nesting of this type exists and it can be utilized by something.



4.4.3       Lookup

A general lookup routine is not yet part of the library. The diskpart_lookup routine is only able to do one
layer of nesting. More general support may be added in the future, or it may be left to the user to determine
a naming scheme to access the subpartitions.
    Also, the lookup routines currently assume a sector size of 512 bytes.



4.5       API  reference


4.5.1       diskpart_get_partition:  initialize an array of partition entries


Synopsis

       int diskpart_get_partition(void *driver_info, int (*bottom_read_fun)(), struct diskpart
       *array, int array_size, int disk_size);


Description

       This function initializes an array of struct  diskpart entries. The caller must provide a pointer
       to a struct  diskpart array, and a function to read the disk.


Parameters

       driver_info:     A pointer to an initialized structure of user-defined type which is to be used by the
             bottom_read_fun. This is passed unmodified to bottom_read_fun.

       bottom_read_fun:       A function pointer provided by the user which can read a sector given driver_info.

       array:    Array of struct  diskpart.

       array_size:     Integer containing the number of allocated entries in the array.

       disk_size:    Size of the whole disk, in sectors.


Returns

       Returns an integer count of the number of partition entries that were filled by the library.  If
       there were more partitions found than space available, this will be array_size. Empty partitions
       (unused entries in a BSD disklabel, for example) occupy an entry the same as `used' entries.

       For example, a PC-DOS partition with a single filled entry would still report 4 partitions, as that
       is the size of the DOS partition table.


4.5.2       diskpart_fill_entry:  initialize a single partition entry


Synopsis

       void diskpart_fill_entry(struct diskpart *array, int start, int size, struct diskpart
       *subs, int nsubs, short type, short fsys);


Description

       This function initializes a single partition entry.


Parameters

       array:    Pointer to the struct  diskpart entry to be filled

       start :  Starting sector on the disk for the partition.

       size:   Number of sectors in the partition.

       subs:    Pointer to its first child partition.

       nsubs:    Number of sub-partitions.

       type:    Partition type, as defined in diskpart.h

       fsys:   Filesystem in the partition (if known), as defined in diskpart.h


Returns

       Does not return anything.


4.5.3       diskpart_dump:  print a partition entry to stdout


Synopsis

       void diskpart_dump(struct diskpart *array, int level);


Description

       This function prints a partition entry with indentation and labeling corresponding to its nesting
       level. It also recursively prints any child partitions on separate lines, with level+1.

       This provides valuable diagnostic messages for debugging disk or filesystem problems.


Parameters

       array:    A pointer to the first entry to be printed. It and any sub-partitions are printed.

       level :  int  representing  current  level.   This  controls  indentation  and  naming  of  the  output.
             diskpart_dump  called  with  the  root  struct  diskpart  entry  and  0  will  print  the  entire
             table.


Returns

       Returns nothing, but does write to stdout.


4.5.4       diskpart_lookup_bsd_compat:  search for a partition entry


Synopsis

       struct diskpart *diskpart_lookup_bsd_compat(struct diskpart *array, short slice, short part);


Description

       This function is a sample lookup routine which finds a partition given a slice number and partition
       number.

       This  demonstrates  how  a  two-level  naming  scheme  can  be  implemented  using  integers.   This
       was first used in Mach 4 (UK22) to provide support for FreeBSD slices as well as backwards-
       compatibility with previous naming methods.


Parameters

       array:    This should be the pointer to the start of the array.

       slice:   Slice 0 is used as a `compatibility slice', in that it is aliased to a BSD partition, if it exists.
             This allows users to not specify the slice for compatibility.

       part :   Partition 0 is used to represent the whole slice, and Partition 0, Slice 0 is the whole drive.


Returns

       Returns a pointer to the corresponding partition entry, or NULL if it is invalid.


4.5.5       diskpart_lookup_bsd_string:  search for a partition entry


Synopsis

       struct diskpart *diskpart_lookup_bsd_string(struct diskpart *array, char *name);


Description

       This  function  is  a  sample  lookup  routine  which  finds  a  partition  given  a  FreeBSD  style  slice
       string.  If no slice number is given, it defaults to the first BSD partition, and then to the whole
       disk if no BSD partition is found.


Parameters

       array:    This should be the pointer to the start of the array.

       name:     A case-insensitive, null-terminated ASCII string containing an optional slice specifier
             followed by an optional partition. [s<num>][<part>], where part is a valid partition in the
             BSD slice specified by num (or default).


Returns

       Returns a pointer to the corresponding partition entry, or NULL if it is invalid.


4.5.6       diskpart_get_foo:  search for foo-type partitions


Synopsis

       int diskpart_get_foo(struct diskpart *array, char *buff, int start, void *driver_info,
       int (*bottom_read_fun)(), int max_part);


Description

       This function finds foo-type partitions if they are on the disk. These routines would not normally
       be invoked directly.  However,  the API is documented here so that diskpart_lookup can be
       extended easily for future or additional labeling schemes.

       Currently defined functions are: pcbios,  disklabel,  vtoc,  dec, and omron.

       They should return immediately if bottom_read_fun returns non-zero, and return that error code.


Parameters

       array:    Pointer to the start of preallocated storage.

       buff :   Pointer to a sector-sized scratch area.

       start :  Offset from start of disk the partition starts.

       driver_info:     See diskpart_get_partition.

       bottom_read_fun:       See diskpart_get_partition.

       max_part :     Maximum number of partition entries that can be filled. This will generally be equal
             to the number of pre-allocated entries that are available.


Returns

       Returns the number of partition entries of that type found. If none were found, it returns 0.

       If the return value is equal to max_part then it is possible that there were more partitions than
       space for them. It is up to the user to ensure that adequate storage is passed to diskpart_get_partition.






Chapter  5


File System Reader (libfsread.a)


This library is implemented and used in other kernels, but not yet integrated into the framework or docu-
mented.  Source code implementing this functionality can currently be found in the "libfs" oskit subdirectory.






Chapter  6


Minimal C Library (libmc.a)



Note:  this library is implemented, just not fully documented.
6.1       Introduction


The Flux OS Toolkit's minimal C library is a subset of a standard ANSI/POSIX C library designed specifi-
cally for use in kernels or other restricted environments in which a "full-blown" C library cannot be used. The
minimal C library provides many simple standard functions such as string, memory, and formatted output
functions:  functions that are often useful in kernels as well as application programs, but because ordinary
application-oriented C libraries are unusable in kernels, must usually be reimplemented or manually "pasted"
into the kernel sources with appropriate modifications to make them usable in the kernel environment. The
versions of these functions provided by the Flux minimal C library, like the other components of the OS
toolkit, are designed to be as generic and context-independent as possible, so that they can be used in arbi-
trary environments without the developer having to resort to the traditional manual cut-and-paste methods.
This cleaner strategy brings with it the well-known advantages of careful code reuse: the kernel itself becomes
smaller and simpler due to fewer extraneous "utility" functions hanging around in the sources; it is easier
to maintain both the kernel, for the above reason, and the standard utility functions it uses, because there
is only one copy of each to maintain; finally, the kernel can easily adopt new, improved implementations of
common performance-critical functions as they become available, simply by linking against a new version of
the minimal C library (e.g., new versions of memcpy or bzero optimized for particular architectures or newer
family members of a given architecture).
    In general, the minimal C library provides only functions specified in the ANSI C or POSIX.1 standards,
and only a subset thereof. Furthermore, the provided implementations of these functions are designed to be
as independent as possible from each other and from the environment in which they run, allowing arbitrary
subsets of these functions to be used when needed without pulling in any more functionality than necessary
and without requiring the OS developer to provide significant support infrastructure.  For example, all of
the "simple" functions which merely perform some computation on or manipulation of supplied data, such
as the string functions, are guaranteed to be completely independent of each other.
    The functions that are inherently environment-dependent in some way, such as printf, which assumes
the existence of some kind of "standard output" or "console,"  are implemented in terms of other ANSI
C or POSIX functions, such as putchar in this example.  Thus, in order to use the minimal C library's
implementation of printf, the OS developer must provide an appropriate putchar routine to be used to
write characters to whatever acts as the "standard output" in the current environment. All such dependencies
between C library functions are explicitly stated in this document, so that it is always clear what additional
functions the developer must supply in order to make use of a set of functions provided by the minimal C
library.
    Since  almost  all  of  the  functions  and  definitions  provided  by  the  Flux  minimal  C  library  implement
well-known, well-defined ANSI and POSIX C library interfaces which are amply documented elsewhere, we
do not attempt to describe the purpose and behavior of each function in this chapter.  Instead, only the




peculiarities relevant to the minimal C library, such as implementation interdependencies and side effects,
are described here.
    Note that many files and functions in the minimal C library are derived or taken directly from other
source code bases, particularly Mach and BSD. Specific attributions are made in the source files themselves.



6.2       Unsupported  Features


The following features in many C libraries are deliberately unsupported by the minimal C library, for reasons
described below, and will remain unsupported unless a compelling counterargument arises:

    o  Locales: Typical programs that use the minimal C library, particularly kernels, are generally not the
       kinds of programs that need extensive internationalization support from the C library functions they
       use. In practice, the string-related minimal C library functions are typically used for printing diagnostic
       messages and allowing the user to select boot time parameters such as the root partition; for these
       purposes, simplicity and compactness are generally more important than multilingual flexibility.  If a
       particular (rare) kernel does want full internationalization support in the C library functions it uses,
       and is prepared to pay the price in size and complexity, then it can instead use the full internationalized
       implementations from standard application-oriented C libraries, rather than the simple ones provided
       by the minimal C library.

    o  Multibyte characters: These are not supported for basically the same reasons as for locales.

    o  I/O  buffering:  Although  the  Flux  minimal  C  library  provides  high-level  I/O  functions  such  as
       fprintf, fputc, fread, etc., these functions do no buffering, and instead simply translate directly
       into calls to low-level I/O routines (e.g., read and write). We chose this strategy because typical pro-
       grams that use the minimal C library only want to use high-level I/O functions for the convenience they
       provide (particularly formatted I/O), not for the performance benefits of buffering. Full I/O buffering
       generally comes with a great deal of C library code size and complexity, and adds many additional
       dependencies to the environment (e.g.,  memory allocation for buffers,  detection of line disciplines).
       Furthermore, the mere act of buffering I/O implies a major assumption about the environment and
       the use of these functions: in particular, it assumes that the underlying low-level I/O operations have
       high per-invocation overhead and that the high-level I/O operations are called at fine enough granu-
       larity to make this overhead a problem in practice.  This assumption is often invalid for clients of the
       minimal C library, which generally use I/O functions only sporadically if at all, rather than intensively
       as many user-level applications do; and in any case, one of the primary goals of the minimal C library
       is to avoid such assumptions in the first place. For these reasons, we felt that I/O buffering is neither
       necessary nor appropriate for the minimal C library to perform.

    o  Floating-point math support: In general, most kernels and other programs likely to use the minimal
       C library do not perform much, if any, floating point arithmetic; in many cases they never even access
       the FPU other than to save and restore its state on context switches. For this reason, all of the floating-
       point math functions that are a standard part of most C libraries are omitted from the minimal C
       library.


6.3       Header  Files


When the Flux OS toolkit is installed using make  install, a set of standard ANSI/POSIX-defined header
files, containing definitions and function prototypes for the minimal C library, is installed in the selected
include directory under the subdirectory flux/c/.  For example,  the version of the ANSI C header file
string.h provided with the minimal C library is installed as prefix/include/flux/c/string.h.  These
header files are installed in a subdirectory rather than in the top level include directory so that if the
OS toolkit is installed in a standard place shared by other packages and/or system files, such as /usr or
/usr/local,  the  minimal  C  library's  header  files  will  not  conflict  with  header  files  provided  by  normal
application-oriented C libraries, nor will applications "accidentally" use the minimal C library's header files
when they really want the normal C library's header files.
    There are two main ways a kernel or other program can explicitly use the Flux minimal C library's
header files.  The first is by including the flux/c/ prefix directly in all relevant #include statements; e.g.,
`#include  <flux/c/string.h>' instead of `#include  <string.h>'. However, since this method effectively
makes the client code somewhat specific to the Flux minimal C library by hard-coding Flux OS toolkit-
specific pathnames into the #include statements, this method should generally only be used if for some
reason the code in question is extremely dependent on the Flux minimal C library in particular,  and it
would never make sense for it to include corresponding header files from a different C library.
    For typical code using the minimal C library, which simply needs "a printf" or "a strcpy," the preferred
method of including the library's header files is to code the #include lines without the flux/c/ prefix, just as
in application code using an ordinary C library, and then add an appropriate -I (include directory) directive
to the compiler command line so that the flux/c/ directory will be scanned automatically for these header
files before the top-level include directory and other include directories in the system are searched. Typically
this -I directive can be added to the CFLAGS variable in the Makefile used to build the program in question.
In fact, the OS toolkit itself uses this method to allow code in other toolkit components and in the minimal
C library itself to make use of definitions and functions provided by the minimal C library. (Of course, these
dependencies are clearly documented, so that if you want to use other OS toolkit components but not the
minimal C library, or only part of the minimal C library, it is possible to do so cleanly.)
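    For instance, a Makefile for a kernel built against the minimal C library might contain something like the following (the install prefix /usr/local is only an example; substitute wherever the toolkit was actually installed):

```make
# Search the minimal C library's headers before the system ones, so
# that `#include <string.h>' finds prefix/include/flux/c/string.h.
PREFIX = /usr/local
CFLAGS += -I$(PREFIX)/include/flux/c -I$(PREFIX)/include

kernel.o: kernel.c
	$(CC) $(CFLAGS) -c kernel.c
```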
    Except  when  otherwise  noted,  all  of  the  definitions  and  functions  described  in  this  section  are  very
simple, have few dependencies, and behave as in ordinary C libraries. Functions that are not self-contained
and interact with the surrounding environment in non-trivial ways (e.g., the memory allocation functions)
are described in more detail in later sections.


6.3.1       assert.h:  program diagnostics facility


Description

       This header file provides a standard assert macro as described in the C standard. It is compiled
       out (generates no code) if the preprocessor symbol NDEBUG is defined before this header file is
       included.


6.3.2       ctype.h:  character handling functions


Description

       This header file provides implementations of the following standard character handling functions:

       isalnum:      Tests if a character is alphanumeric.

       isalpha:      Tests if a character is alphabetic.

       iscntrl:      Tests if a character is a control character.

       isdigit:      Tests if a character is a decimal digit.

       isgraph:      Tests if a character is a printable non-space character.

       islower:      Tests if a character is a lowercase letter.

       isprint:      Tests if a character is a printable character, including space.

       ispunct:      Tests if a character is a punctuation mark.

       isspace:      Tests if a character is a whitespace character of any kind.

       isupper:      Tests if a character is an uppercase letter.

       isxdigit:      Tests if a character is a hexadecimal digit.

       tolower:      Converts a character to lowercase.

       toupper:      Converts a character to uppercase.

       The implementations of these functions provided by the minimal C library are directly-coded
       inline functions, and do not reference any global data structures such as character type arrays.
       They do not support locales (see Section 6.2), and only recognize the basic 7-bit ASCII character
       set (all characters above 126 are considered to be control characters).


6.3.3       errno.h:  error numbers


Description

       This file declares the global errno variable, and defines symbolic constants for all the errno values
       defined in the ISO/ANSI C, POSIX.1, and POSIX.1b standards.  They are provided mainly for
       the convenience of clients that can benefit from standardized error codes and do not already have
       their own error handling scheme and error code namespace.  Very few functions in the minimal
       C library depend on these codes (and those that do are clearly documented as doing so), and
       none of the functions in other components of the toolkit do, so the use of this header is strictly
       optional.

       XXX currently these numbers are assigned to fit into the Mach error code scheme; perhaps they
       should be changed to a more Unix-compatible scheme.


6.3.4       fcntl.h:  POSIX low-level file control


Description

       This header file defines prototypes for the low-level POSIX functions creat and open, and pro-
       vides symbolic constants for the POSIX open mode flags (O_*).

       The minimal C library provides an implementation of creat, which merely calls open with the
       proper arguments.  However, the minimal C library does not implement open, since there is no
       sufficiently context-independent way to implement it.  Therefore, if you want to use it (or, more
       likely, if you want to use the high-level I/O routines provided by the minimal C library such as
       fopen and fprintf, which translate to calls to the low-level POSIX routines), you will have to
       implement it yourself.

       The  open  mode  constants  defined  by  this  header  are  provided  mainly  for  the  convenience  of
       clients that can use them and don't already have their own definitions. The only functions in the
       OS toolkit that depend on them are the default implementations of creat (Section 6.7.4) and
       fopen (Section ?? ).


6.3.5       limits.h:  architecture-specific limits


Description

       XXX


6.3.6       setjmp.h:  nonlocal jumps


6.3.7       signal.h:  signal handling


6.3.8       stdarg.h:  variable arguments


6.3.9       stddef.h:  common definitions


Description

       This header file defines the symbol NULL and the type size_t if they haven't been defined already.


6.3.10        stdio.h:  standard input/output


6.3.11        stdlib.h:  standard library functions


Description

       This header file defines the symbol NULL and the type size_t if they haven't been defined already,
       and provides prototypes for the following functions in the minimal C library:

       atoi:    Convert an ASCII decimal number into an int.

       atol:    Convert an ASCII decimal number into a long.

       strtol:     Convert an ASCII number into a long.

       strtoul:      Convert an ASCII number into an unsigned  long.

       rand:    Compute a pseudo-random integer. Not thread safe; uses static data.

       srand:     Seed the pseudo-random number generator. Not thread safe; uses static data.

       exit:    Cause normal program termination; see Section 6.6.1.

       abort:     Cause abnormal program termination; see Section 6.6.2.

       panic:     Cause  abnormal  termination  and  print  a  message.   Not  a  standard  C  function;  see
             Section 6.6.3.

       getenv:     Search for a string in the environment; see Section 6.7.3.

       qsort:     Sort an array of objects.

       abs:    Compute the absolute value of an integer.
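As a quick illustration of strtol's behavior (guaranteed by standard C, which the minimal C library is presumed to follow): a base argument of 0 auto-detects hexadecimal and octal prefixes:

```c
#include <stdlib.h>

/* Parse a number, letting strtol auto-detect the base:
   "0x" means hexadecimal, a leading "0" means octal, else decimal. */
long parse_auto(const char *s)
{
    return strtol(s, (char **)0, 0);
}
```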


6.3.12        string.h:  string handling functions


Description

       This header file defines the symbol NULL if it hasn't been defined already, and provides prototypes
       for the following functions in the minimal C library:

       memcpy:     Copy data from one location in memory to another.

       memset:     Set the contents of a block of memory to a uniform value.

       strlen:     Find the length of a null-terminated string.

       strcpy:     Copy a string to another location in memory.

       strncpy:      Copy a string, up to a specified maximum length.

       strdup:     Return a copy of a string in newly-allocated memory. Depends on malloc, Section 6.4.2.

       strcat:     Concatenate a second string onto the end of a first.

       strncat:      Concatenate two strings, up to a specified maximum length.

       strcmp:     Compare two strings.

       strncmp:      Compare two strings, up to a specified maximum length.

       strchr:     Find the first occurrence of a character in a string.

       strrchr:      Find the last occurrence of a character in a string.

       strstr:     Find the first occurrence of a substring in a larger string.

       strtok:     Scan for tokens in a string. Not thread safe; uses static data.

       strpbrk:      Locate the first occurrence in a string of one of several characters.

       strspn:     Find the length of an initial span of characters in a given set.

       strcspn:      Measure a span of characters not in a given set.

       The following deprecated functions are provided for compatibility with existing code:

       bcopy:     Copy data from one location in memory to another.

       bzero:     Clear the contents of a memory block to zero.

       index:     Find the first occurrence of a character in a string.

       rindex:     Find the last occurrence of a character in a string.


6.3.13        strings.h:  string handling functions (deprecated)


Description

       For compatibility with existing software, a header file called strings.h is provided which acts
       as a synonym for string.h (Section 6.3.12).


6.3.14        sys/gmon.h:  GNU profiling support definitions


XXX check this out further - is it appropriate?


6.3.15        sys/ioctl.h:  I/O control definitions


6.3.16        sys/mman.h:  memory management and mapping definitions


6.3.17        sys/reboot.h:  reboot definitions


XXX deprecated; used only by boot code. Should it be here at all?


6.3.18        sys/signal.h:  signal handling (deprecated)


6.3.19        sys/stat.h:  file statistics


6.3.20        sys/termios.h:  terminal handling functions and definitions


6.3.21        sys/time.h:  timing functions


XXX what exactly is with this file?


6.3.22        sys/types.h:  general POSIX types


6.3.23        termios.h:  terminal handling functions and definitions


6.3.24        unistd.h:  traditional Unix definitions


6.4       Memory  Allocation


All of the default memory allocation functions in the minimal C library are built on top of the List Memory
Manager, described in Section 2.
    There are two families of memory allocation routines available in the minimal C library.  First is the
standard malloc, realloc, calloc, and free. These work as in any standard C library.
    The second family, smalloc, smemalign, and sfree, assumes that the caller will keep track of the size of
allocated memory blocks.  Chunks allocated with smalloc-style functions must be freed with sfree rather
than the normal free.  These functions are not part of the POSIX standard, but are much more memory
efficient when allocating many power-of-two-size chunks naturally aligned to their size (e.g., when allocating
naturally-aligned pages or superpages).  The normal memalign function attaches a prefix to each allocated
block to keep track of the block's size, and the presence of this prefix makes it impossible to allocate naturally-
aligned, natural-sized blocks successively in memory; only every other block can be used, greatly increasing
fragmentation and effectively halving usable memory. (Note that this fragmentation property is not peculiar
to the OS toolkit's implementation of memalign; most versions of memalign have this effect.)
    All of the memory allocation functions, if they are unable to allocate a block out of the LMM pool,
call the morecore function and then retry the allocation if morecore returns nonzero. The default behavior
of this function is simply to return 0, signifying that no more memory is available.  In environments in
which a dynamically growable heap is available, you can override the morecore function to grow the heap
as appropriate.
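For example, a kernel holding a reserve region might override morecore along the following lines. This is only a sketch: this draft does not show morecore's exact signature, so a zero-argument form is assumed, and the call that donates memory to the allocation pool (lmm_add_free on malloc_lmm in the real toolkit) is stubbed out so the fragment is self-contained:

```c
#include <stddef.h>

/* Stand-in for the LMM call that donates memory to the malloc pool;
   in a real kernel this would be lmm_add_free(&malloc_lmm, ptr, size). */
static void *pool_mem;
static size_t pool_size;
static void add_to_pool(void *ptr, size_t size)
{
    pool_mem = ptr;
    pool_size = size;
}

static char heap_arena[64 * 1024];  /* pretend this is newly mapped memory */
static int arena_donated;

/* Override of the morecore hook: return nonzero if more memory was added
   to the pool (the failed allocation will then be retried), 0 otherwise. */
int morecore(void)
{
    if (arena_donated)
        return 0;               /* nothing further to give */
    arena_donated = 1;
    add_to_pool(heap_arena, sizeof heap_arena);
    return 1;
}
```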
    All of the memory allocation functions call mem_lock and mem_unlock to protect access to the LMM
pool.  The default implementations of these synchronization functions do nothing; however, they can be
overridden with functions that acquire and release a lock of some kind appropriate to the environment, in
order to make the allocation functions thread- or SMP-safe.
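For instance, to make the allocators safe on an SMP machine, the hooks might be overridden with a simple spinlock. The sketch below uses C11's atomic_flag as a stand-in for whatever test-and-set primitive the environment provides; a real kernel might also need to disable interrupts:

```c
#include <stdatomic.h>

static atomic_flag malloc_lock = ATOMIC_FLAG_INIT;

/* Override of the no-op default: spin until the lock is free. */
void mem_lock(void)
{
    while (atomic_flag_test_and_set(&malloc_lock))
        ;   /* busy-wait; a real kernel might pause or mask interrupts here */
}

/* Override of the no-op default: release the lock. */
void mem_unlock(void)
{
    atomic_flag_clear(&malloc_lock);
}
```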


6.4.1       malloc_lmm:  LMM pool used by the default memory allocation functions


6.4.2       malloc:  allocate uninitialized memory


6.4.3       memalign:  allocate aligned memory


6.4.4       calloc:  allocate cleared memory


6.4.5       realloc:  change the size of an existing memory block


6.4.6       free:  release an allocated memory block


6.4.7       smalloc:  allocate uninitialized memory.  Caller must keep track of the size of the allocation.


6.4.8       smemalign:  allocate aligned memory.  Caller must keep track of the size of the allocation.


6.4.9       sfree:  release a memory block allocated with smalloc or smemalign.  Caller must provide the size of the block being freed.


6.4.10        mem_lock:  Lock access to malloc_lmm.


6.4.11        mem_unlock:  Unlock access to malloc_lmm.


6.4.12        morecore:  grow the heap


6.5       Standard  I/O  Functions


The versions of sprintf, vsprintf, sscanf, and vsscanf provided in the OS toolkit's minimal C library
are completely self-contained;  they do not pull in the code for printf,  fprintf,  or other "file-oriented"
standard I/O functions.  Thus, they can be used in any environment, regardless of whether some kind of
console or file I/O is available.
    The version of printf in the minimal C library is implemented in terms of the routines putchar and
puts, rather than in terms of vfprintf.  Furthermore, the default implementation of puts is itself written
purely in terms of putchar.  This means that you can get working formatted "console" output
merely by providing an appropriate implementation of putchar; it is unnecessary to provide a working write
function or other file descriptor-based low-level I/O functions.
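To illustrate the layering, here is a sketch of the putchar/puts relationship. The my_ prefixes are used only to avoid colliding with a hosted C library; in a kernel you would define putchar itself, writing to a serial port or the display rather than to the capture buffer used here:

```c
/* A putchar that "prints" into a capture buffer; a real kernel would
   instead poke a serial port or video memory here. */
static char capture[256];
static int capture_len;

static int my_putchar(int c)
{
    if (capture_len < (int)sizeof capture - 1)
        capture[capture_len++] = (char)c;
    capture[capture_len] = '\0';
    return c;
}

/* puts written purely in terms of putchar, as in the minimal C library:
   emit each character, then a trailing newline. */
static int my_puts(const char *s)
{
    while (*s)
        my_putchar(*s++);
    my_putchar('\n');
    return 0;
}
```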
    The standard I/O functions that actually take a FILE* argument, such as fprintf and fwrite, and as
such are fundamentally dependent on the notion of files, are implemented in terms of the low-level POSIX
file I/O functions such as write, which the developer must supply in order to use these functions. However,
unlike in "real" C libraries, the high-level file I/O functions in the minimal C library implement only the
minimum functionality needed to provide the basic API: in particular, they do no buffering, so, for example,
an fwrite translates directly to a write.  This design reduces code size and minimizes interdependencies
between functions, while still providing familiar, useful services such as formatted file I/O.
    XXX currently ungetc isn't supported; should we support it? (Requires one-character buffering.)
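The "fwrite translates directly to a write" behavior can be sketched as follows; the my_ names are illustrative stand-ins (the real fwrite takes a FILE* rather than a bare descriptor), and the low-level write is stubbed with a memory sink so the fragment is self-contained:

```c
#include <stddef.h>

/* Stand-in low-level write: records what was "written" so the shim can
   be exercised; a real kernel's write would move the bytes to a device. */
static char sink[128];
static size_t sink_len;

static long my_write(int fd, const void *buf, size_t n)
{
    const char *p = buf;
    size_t i;
    (void)fd;
    for (i = 0; i < n && sink_len < sizeof sink; i++)
        sink[sink_len++] = p[i];
    return (long)i;
}

/* Unbuffered fwrite: one call translates directly to one write. */
static size_t my_fwrite(const void *ptr, size_t size, size_t nmemb, int fd)
{
    long n = my_write(fd, ptr, size * nmemb);
    return (n <= 0 || size == 0) ? 0 : (size_t)n / size;
}
```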


creat.c
doprnt.c
doprnt.h
doscan.c
doscan.h
fclose.c
fgetc.c
fopen.c
fprintf.c
fputc.c
fread.c
fscanf.c
fseek.c
ftell.c
fwrite.c
printf.c
puts.c
remove.c
rewind.c
sprintf.c
sscanf.c
stderr.c
stdin.c
stdout.c
vfprintf.c


6.6       Termination  Functions


6.6.1       exit:  terminate normally


6.6.2       abort:  terminate abnormally


6.6.3       panic:  terminate abnormally with an error message


6.7       Miscellaneous  Functions


6.7.1       ntohl:  convert 32-bit long word from network byte order


6.7.2       ntohs:  convert 16-bit short word from network byte order


XXX what about hton? include flux/c/endian.h?


6.7.3       getenv:  search for an environment variable


getenv: depends on strncmp, environ


6.7.4       creat:  create a file


6.7.5       hexdump:  print a buffer as a hexdump


Synopsis

       #include  <flux/c/stdio.h>

       void hexdump(void *buf, int len);


Description

       This function prints out a buffer as a hexdump. For example:


       .---------------------------------------------------------------------------.
       |  00000000          837c240c  00741dc7  05007010  00000000          .|$..t....p.....  |
       |  00000010          008b4424  0ca30470  10008b04  24a30870          ..D$...p....$..p  |
       |  00000020          1000eb2c  c7050070  10000100  0000833c          ...,...p.......<  |
       |  00000030          2400740a  c7050070  10000200  00008b44          $.t....p.......D  |
       `---------------------------------------------------------------------------'


       The box is included.






Chapter  7


Memory   Debug   Utilities   Library



(libmemdebug.a)



7.1       Introduction


The Memory Debug Utilities Library is a set of functions that replace the standard memory allocation
functions of the minimal C library (see Section 6.4). The replacement routines detect problems with memory
allocation, and can print out file and line information, along with a backtrace to the offending allocation.
    All of the standard functions are covered: malloc, memalign, calloc, realloc, and free, as well as smalloc,
smemalign, and sfree.
    To use the library, just include -lmemdebug on the linker command line before the standard C library (or
wherever the standard allocation routines come from).
    libmemdebug implements a fence-post style malloc debug library. It detects the following problems:

    o  Overruns and underruns: Overruns and underruns of allocated memory blocks are detected by
       "fence-posts" at each end of every allocated block of memory.

    o  Allocation/release style mismatches: Mismatches between malloc()-style and smalloc()-style
       allocations and the respective free() function are detected. This type of error is correctable.

    o  Use of memory after it is free()'d: Memory is wiped to a recognizable (nonzero) bit pattern
       on allocation and when it is freed, to force bugs to show up when memory is used after it is freed.

    o  Incorrect size passed to sfree(): The sfree() size is checked against that used when the block
       was created.

    o  free() called twice on the same block: Double frees are detected.
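The fence-post mechanism itself can be sketched in a few lines. This toy version (not libmemdebug's actual layout, guard value, or API) brackets each block with guard words and checks them later:

```c
#include <stdlib.h>
#include <string.h>

#define FENCE 0xDEADBEEFu           /* illustrative guard value */

/* Allocate len bytes bracketed by guard words.
   Layout: [guard][len bytes of user data][guard]            */
static void *fenced_alloc(size_t len)
{
    unsigned char *raw = malloc(len + 2 * sizeof(unsigned));
    unsigned guard = FENCE;
    if (!raw)
        return 0;
    memcpy(raw, &guard, sizeof guard);               /* leading fence-post  */
    memcpy(raw + sizeof guard + len, &guard, sizeof guard); /* trailing one */
    return raw + sizeof guard;
}

/* Check both guards; return 0 if intact, nonzero if one was stomped. */
static int fenced_check(void *ptr, size_t len)
{
    unsigned char *raw = (unsigned char *)ptr - sizeof(unsigned);
    unsigned lo, hi;
    memcpy(&lo, raw, sizeof lo);
    memcpy(&hi, raw + sizeof lo + len, sizeof hi);
    return lo != FENCE || hi != FENCE;
}
```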

    Whenever a problem is encountered, a backtrace (in the form of program counter values) is dumped,
tracing back from the allocation of the memory. File and line number information from where the allocation
call was made is also printed (if available). If the failure was detected in a call to free, the file and line of
that call are printed.
    When correctable errors are detected (e.g., sfree()'ing a malloc'd block, or sfree()'ing with the wrong
size), the correct thing will be done, and the program will continue as normal (except for the bogosity
dump).
    Note that file and line number information is only available if you're using the macro wrappers for the
allocators defined in malloc_debug.h.
    There are a few auxiliary functions useful for detecting errors. First is memdebug_ptrchk(), which takes
a pointer and runs a set of sanity checks on it and its fence-posts. memdebug_mark() and memdebug_check()
are useful for narrowing down leaks: place a memdebug_mark() and a memdebug_check() at two points in the
code such that no allocation made after the memdebug_mark() should still be outstanding at the point of the
memdebug_check(). Finally, memdebug_sweep() runs a sanity check on all currently allocated memory blocks.
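The mark/check discipline can be illustrated with a toy live-block counter. The real routines track individual blocks and can print backtraces; the my_ names below are purely illustrative:

```c
#include <stdlib.h>

static int live_blocks;       /* how many blocks are currently allocated */
static int marked_blocks;     /* snapshot taken by my_mark */

static void *my_alloc(size_t n) { live_blocks++; return malloc(n); }
static void my_free(void *p)    { live_blocks--; free(p); }

/* Record the current allocation state... */
static void my_mark(void) { marked_blocks = live_blocks; }

/* ...and later report how many blocks allocated since the mark
   are still outstanding (candidate leaks). */
static int my_check(void) { return live_blocks - marked_blocks; }
```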




    There is one configure option, in the private debug_malloc.h header file; it controls whether running out
of memory is instantly fatal, or whether null should be returned.
    Note that the default malloc() and memdebug_alloc() CAN HAVE DIFFERENT POLICIES. memdebug_alloc()
uses the LMM library directly, in the same way as the default malloc() does; but malloc() can be replaced,
and memdebug_alloc() will not change.
    All of the routines use an overridable memdebug_printf() to print output; you should override this if
you cannot guarantee that vfprintf() calls will not allocate memory.
    The  allocation  routines  also  all  call  mem_lock()  and  mem_unlock()  to  protect  access  to  the  global
malloc_lmm.  See the minimal C library's section on Memory Allocation (Section 6.4) for more informa-
tion on these functions.


7.2       Debugging  versions  of  standard  routines


If the header file memdebug.h is included, these functions are macros which generate calls that include file
and line information.  If the header file is not included, the functions are called directly, without file and
line information.
    They are drop-in replacements for the allocation functions described in Section 6.4.


    o  malloc

    o  realloc

    o  calloc

    o  memalign

    o  free

    o  smalloc

    o  smemalign

    o  sfree


7.3       Additional  debugging  utilities


These routines provide additional features useful for tracking down memory leaks and dynamic memory
corruption.


7.3.1       memdebug_mark:  Mark all currently allocated blocks.


7.3.2       memdebug_check:  Look for blocks allocated since mark that haven't been freed.


7.3.3       memdebug_ptrchk:  Check validity of a pointer's fence-posts


7.3.4       memdebug_sweep:  Check validity of all allocated blocks' fence-posts



Chapter  8


Kernel   Support   Library   (libkern.a)



8.1       Introduction


The kernel support library, libkern.a, supplies a variety of functions and other definitions that are primarily
of use in OS kernels. (In contrast, the other parts of the Flux OS toolkit are more generic components useful
in a variety of environments including, but not limited to, OS kernels.) The kernel support library contains
all the code necessary to create a minimal working "kernel" that boots and sets up the machine for a generic
"OS-friendly" environment.  For example, on the x86, the kernel support library provides code to get into
protected mode, set up default descriptor tables, etc.  The library also includes a remote debugging stub,
providing convenient source-level debugging of the kernel over a serial line using GDB's serial-line remote
debugging protocol. As always, all components of this library are optional and replaceable, so although some
pieces may be unusable in some environments, others should still work fine.



8.1.1       Machine-dependence of code and interfaces

This library contains a much higher percentage of machine-dependent code than the other libraries in the
toolkit, primarily because this library deals with heavily machine-dependent facilities such as page tables,
interrupt vector tables, trap handling, etc.  The library attempts to hide some machine-dependent details
from the OS by providing generic, machine-independent interfaces to machine-dependent library code.  For
example, regardless of the architecture and boot loading mechanism in use, the kernel startup code included
in the library always sets up a generic C-compatible execution environment and starts the kernel by calling the
well-known main routine, just as in ordinary C programs. However, the library makes no attempt to provide
a complete architecture-independence layer, since such a layer would have to make too many assumptions
about the OS that is using it. For example, although the library provides page table management routines,
these routines have fairly low-level, architecture-specific interfaces.



8.1.2       Generic versus Base Environment code

The functionality provided by the kernel support library is divided into two main classes: the generic support
code, and the base environment. The generic support contains simple routines and definitions that are almost
completely independent of the particular OS environment in which they are used:  for example, the generic
support includes symbolic definitions for bits in processor registers and page tables, C wrapper functions to
access special-purpose processor registers, etc.  The generic support code should be usable in any OS that
needs it.
    The base environment code, on the other hand, is somewhat less generic in that it is designed to create,
and function in, a well-defined default or "base" kernel execution environment.  Out of necessity, this code
makes more assumptions about how it is used, and therefore it is more likely that parts of it will not be
usable to a particular client OS. For example, on the x86 architecture, the base environment code sets up a
default global descriptor table containing a "standard" set of basic, flat-model segment descriptors, as well
as a few extra slots reserved for use by the client OS. This "base GDT" is likely to be sufficient for many




kernels, but may not be usable to kernels that make more exotic uses of the processor's GDT. In order to
allow piecemeal replacement of the base environment as necessary, the assumptions made by the code and
the intermodule dependencies are clearly documented in the sections covering the base environment code.



8.1.3       Road Map

Following is a brief summary of the main facilities provided by the library, indexed by the section numbers
of the sections describing each facility:


  8.2  Machine-independent Facilities:  Types and constants describing machine-dependent information
       such as word size and page size. For example, types are provided which, if used properly, allow machine-
       independent code to compile easily on both 32-bit and 64-bit architectures. Also, functions are provided
       for various generic operations such as primitive multiprocessor synchronization and efficient bit field
       manipulation.


  8.3  [x86] Generic Low-level Definitions:  Header files describing x86 processor data structures and
       registers, as well as functions to access and manipulate them. Includes:


          -  Bit definitions of the contents of the flags, control, debug, and floating point registers.

          -  Inline functions and macros to read and write the flags, control, debug, segment registers, and
             descriptor registers (IDTR, GDTR, LDTR, TR).

          -  Macros to read the Pentium timestamp counter (useful for fine-grained timing and benchmarking)
             and the stack pointer.

          -  Structure  definitions  for  architectural  data  structures  such  as  far  pointers,  segment  and  gate
             descriptors, task state structures, floating point save areas, and page tables, as well as generic
             functions to set up these structures.

          -  Symbolic definitions of the processor trap vectors.

          -  Macros to access I/O ports using the x86's in and out instructions.

          -  Assembly language support macros to smooth over the differences in target object formats, such
             as ELF versus a.out.


  8.4  [x86 PC] Generic Low-level Definitions:  Generic definitions for standard parts of the PC
       architecture, such as IRQ assignments, the programmable interrupt controller (PIC), and the keyboard
       controller.


  8.5  [x86] Processor Identification and Management:  Functions to identify the CPU and available
       features, to enter and leave protected mode, and to enable and disable paging.


  8.6  [x86] Base Environment Setup:  Functions that can be used individually or as a unit to set up a
       basic, minimal kernel execution environment on x86 processors: e.g., a minimal GDT, IDT, TSS, and
       kernel page tables.


 8.10  [x86 PC] Base Environment Setup:  Functions to set up a PC's programmable interrupt controller
       (PIC) and standard IRQ vectors, to manage a PC's low (1MB), middle (16MB) and upper memory,
       and to provide simple non-interrupt-driven console support.


 8.11  [x86 PC] MultiBoot Startup:  Complete startup code to allow the kernel to be booted easily from
       any MultiBoot-compliant boot loader.  Includes code to parse options and environment variables
       passed to the kernel by the boot loader, and to find and use boot modules loaded with the kernel.


 8.12  [x86 PC] Raw BIOS Startup:  Complete startup code for boot loaders and other programs that need
       to be loaded directly by the BIOS at boot time. This startup code takes care of all aspects of switching
       from real to protected mode and setting up a 32-bit environment, and provides mechanisms to call
       back to 16-bit BIOS code by running the BIOS in either real mode or v86 mode (your choice).


 8.13  [x86 PC] DOS Startup:  This startup code is similar to the BIOS startup code, but it expects to be
       loaded in a 16-bit DOS environment: useful for DOS-based boot loaders, DOS extenders, or prototype
       kernels that run under DOS. Again, this code fully handles mode switching and provides DOS/BIOS
       callback mechanisms.

 8.14  Kernel Debugging Facilities: A generic, machine-independent remote GDB stub is provided which
       supports the standard serial-line GDB protocol. In addition, machine-dependent default trap handling
       and fault-safe memory access code is provided to allow the debugging stub to be used "out of the box"
       on x86 PCs.

 8.16  XXX Annotations. Not yet documented.


8.2       Machine-independent  Facilities


XXX


8.2.1       types.h:  C-language machine-dependent types


Synopsis

       #include  <flux/machine/types.h>


Description

       This header provides a number of C types that hide differences in word size and language con-
       ventions  across  different  architectures  and  compilers.   For  example,  by  making  proper  use  of
       these  types,  machine-independent  code  can  be  made  to  work  cleanly  on  both  32-  and  64-bit
       architectures.

       The following types are defined as integers of the same size as a pointer, which is also assumed
       to  be  the  machine's  natural  word  size.  These  types  should  be  used  by  machine-independent
       code that manipulates pointers as integers or is otherwise dependent on the architectural pointer
       size. Note that on architectures that have both 32-bit and 64-bit variants, such as PowerPC and
       PA-RISC, these types may have different sizes depending on the configuration of the OS toolkit.

       integer_t:      A signed pointer-size integer.

       natural_t:      An unsigned pointer-size integer.

       vm_offset_t:       An unsigned offset into virtual memory space.

       vm_size_t:      An unsigned size in virtual memory space, e.g., a difference between two vm_offset_t's.

       The following types are defined to be exactly of the size and variety their names imply, regardless
       of processor architecture or compiler:

       signed8_t:      A signed 8-bit integer.

       signed16_t:       A signed 16-bit integer.

       signed32_t:       A signed 32-bit integer.

       signed64_t:       A signed 64-bit integer.

       unsigned8_t:       An unsigned 8-bit integer.

       unsigned16_t:        An unsigned 16-bit integer.

       unsigned32_t:        An unsigned 32-bit integer.

       unsigned64_t:        An unsigned 64-bit integer.

       float32_t:      A 32-bit floating point number.

       float64_t:      A 64-bit floating point number.

       Finally, the following types are defined to be the "most efficient" type of at least the indicated
       size on a given architecture.  For example, in 32-bit code on the x86, accessing 8-bit or 32-bit
       values is very quick, but accessing 16-bit values is significantly slower; this property is reflected
       in the x86 definitions of these types.  These types are to be used for variables that must be a
       certain minimum size, but for which a larger size is acceptable if that size is more convenient for
       the processor.

       signed_min8_t:        A signed integer at least 8 bits wide.

       signed_min16_t:        A signed integer at least 16 bits wide.

       signed_min32_t:        A signed integer at least 32 bits wide.

       signed_min64_t:        A signed integer at least 64 bits wide.

       unsigned_min8_t:         An unsigned integer at least 8 bits wide.

       unsigned_min16_t:         An unsigned integer at least 16 bits wide.

       unsigned_min32_t:         An unsigned integer at least 32 bits wide.

       unsigned_min64_t:         An unsigned integer at least 64 bits wide.
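
       As an illustration, here is one plausible set of these typedefs for a 32-bit x86 configuration.
       The type names are the toolkit's, but the particular underlying C types shown are assumptions
       for illustration only, not the toolkit's actual machine-dependent definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative definitions for an assumed 32-bit x86 configuration. */
typedef int32_t   integer_t;        /* signed pointer-size integer    */
typedef uint32_t  natural_t;        /* unsigned pointer-size integer  */
typedef natural_t vm_offset_t;      /* offset into virtual memory     */
typedef natural_t vm_size_t;        /* size in virtual memory         */

typedef int8_t    signed8_t;        /* exactly 8 bits  */
typedef uint32_t  unsigned32_t;     /* exactly 32 bits */

/* 16-bit accesses are slow on the x86, so the "at least 16 bits"
   types widen to a full 32-bit word. */
typedef int32_t   signed_min16_t;
typedef uint32_t  unsigned_min16_t;
```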

       This file was originally derived from Mach's vm_types.h.
122                                                   CHAPTER 8.  KERNEL SUPPORT LIBRARY (LIBKERN.A)


8.2.2       page.h:  Page size definitions


Description

       This file provides the following symbols, which define the page size of the architecture for
       which the OS toolkit is configured:

       PAGE_SIZE:      The number of bytes on each page. It can always be assumed to be a power of two.

       PAGE_SHIFT:      The number of low address bits not translated by the MMU hardware. PAGE_SIZE
             is always 2^PAGE_SHIFT.

       PAGE_MASK:      A  bit  mask  with  the  low-order  PAGE_SHIFT  address  bits  set.   Always  equal  to
             PAGE_SIZE    - 1.

       In addition, the following macros are provided for convenience in performing page-related ma-
       nipulations of addresses:

       atop(addr):       Converts a byte address into a page frame number, by dividing by PAGE_SIZE.

       ptoa(page):       Converts a page frame number into an integer (vm_offset_t) byte address,  by
             multiplying by PAGE_SIZE.

       trunc_page(addr):         Returns addr rounded down to the next lower page boundary.  If addr is
             already on a page boundary, it is returned unchanged.

       round_page(addr):         Returns  addr  rounded  up  to  the  next  higher  page  boundary.   If  addr  is
             already on a page boundary, it is returned unchanged.

       page_aligned(addr):          Evaluates to true (nonzero) if addr is page aligned, or false (zero) if it
             isn't.
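
       The symbols and macros above behave as in the following sketch, which assumes a 4KB page size
       as on the x86; the real flux/page.h supplies the configured architecture's actual values:

```c
#include <assert.h>

typedef unsigned long vm_offset_t;

/* Assumed 4KB pages; PAGE_SIZE is always 2^PAGE_SHIFT. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)          /* 4096 */
#define PAGE_MASK  (PAGE_SIZE - 1)

/* Dividing or multiplying by PAGE_SIZE reduces to a shift,
   since the page size is a power of two. */
#define atop(addr)       ((vm_offset_t)(addr) >> PAGE_SHIFT)
#define ptoa(page)       ((vm_offset_t)(page) << PAGE_SHIFT)
#define trunc_page(addr) ((vm_offset_t)(addr) & ~PAGE_MASK)
#define round_page(addr) trunc_page((vm_offset_t)(addr) + PAGE_MASK)
#define page_aligned(addr) (((vm_offset_t)(addr) & PAGE_MASK) == 0)
```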

       Note that many modern architectures support multiple page sizes.  On such architectures, the
       page size defined in this file is the minimum architectural page size, i.e., the finest granularity
       over which the MMU has control. Since there seems to be no sufficiently generic and useful way
       that this header file could provide symbols indicating which "other" page sizes the architecture
       supports, making good use of larger pages probably must be done in machine-dependent code.

       Some operating systems on some architectures do not actually support the minimum architectural
       page size in software; instead, they aggregate multiple architectural pages together into larger
       "logical pages" managed by the OS software.  On such operating systems, it would be inappro-
       priate for general OS or application code to use the PAGE_SIZE value provided by flux/page.h,
       since this value would be smaller (more fine-grained) than what the OS software actually
       supports.  However, this is purely a high-level OS issue; like other parts of the toolkit, no
       one is required to use this header file if it is inappropriate in a particular situation.

       This file was originally derived from Mach's vm_param.h.
8.2.  MACHINE-INDEPENDENT FACILITIES                                                                    123


8.2.3       bitops.h:  efficient bit field operations


Synopsis

       #include  <flux/bitops.h>


Description

       XXX currently proc_ops.h


8.2.4       spin_lock.h:  Spin locks


Synopsis

       #include  <flux/machine/spin_lock.h>


Description

       XXX

       This header file is taken from CMU's Mach kernel.


8.2.5       debug.h:  debugging support facilities


Synopsis

       #include  <flux/debug.h>



Description

       This file contains simple macros and functions to assist in debugging.  Many of these facilities
       are intended to be used to "annotate" programs permanently or semi-permanently in ways that
       reflect the code's proper or desired behavior.  These facilities typically change their behavior
       depending on whether the preprocessor symbol DEBUG is defined: if it is defined, then extra code
       is introduced to check invariants and such; when DEBUG is not defined, all of this debugging code
       is "compiled out" so that it does not result in any size increase or efficiency loss in the resulting
       compiled code.

       The following macros and functions are intended to be used as permanent- or semi-permanent
       annotations to be sprinkled throughout ordinary code to increase its robustness and clarify its
       invariants and assumptions to human readers:


       assert(cond):        This is a standard assert macro, like (and compatible with) the one provided
             in flux/c/assert.h. If DEBUG is defined, this macro produces code that evaluates cond and
             calls panic (see Section 6.6.3) if the result is false (zero). When an assertion fails and causes
             a panic, the resulting message includes the source file name and line number of the assertion
             that failed, as well as the text of the cond expression used in the assertion.  If DEBUG is not
             defined, this macro evaluates to nothing (an empty statement), generating no code.

             Assertions are typically used to codify assumptions made by a code sequence, e.g., about the
             parameters to a function or the conditions on entry to or exit from a loop. By placing explicit
             assert statements in well-chosen locations to verify that the code's invariants indeed hold,
             a thicker "safety net" is woven into the code, which tends to make bugs manifest themselves
             earlier and in much more obvious ways, rather than allowing incorrect results to "trickle"
             through the program's execution for a long time, sometimes resulting in completely baffling
             behavior.  Assertions can also act as a form of documentation, clearly describing to human
             readers the exact requirements and assumptions in a piece of code.

       otsan():      If DEBUG is defined, this macro unconditionally causes a panic with the message "off
             the straight and narrow!", along with the source file name and line number, if it is ever
             executed.  It is intended to be placed at code locations that should never be reached if the
             code is functioning properly; e.g., as the default case of a switch statement for which the
             result of the conditional expression should always match one of the explicit case values.  If
             DEBUG is not defined, this macro evaluates to nothing.

       do_debug(stmt):        If DEBUG is defined,  this macro evaluates to stmt;  otherwise it evaluates to
             nothing.  This macro is useful in situations where an #ifdef  DEBUG ... #endif block would
             otherwise be used over just a few lines of code or a single statement:  it produces the same
             effect, but is smaller and less visually intrusive.
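
       A minimal sketch of how such annotations can be implemented is shown below.  These are
       illustrative stand-ins, not the toolkit's definitions; my_assert and panic_at are hypothetical
       names, chosen here to avoid clashing with the standard assert:

```c
#include <stdio.h>
#include <stdlib.h>

#define DEBUG 1

/* Stand-in for the toolkit's panic: report the location and abort. */
static void panic_at(const char *file, int line, const char *msg)
{
    fprintf(stderr, "%s:%d: %s\n", file, line, msg);
    abort();
}

#ifdef DEBUG
#define my_assert(cond) \
        ((cond) ? (void)0 \
                : panic_at(__FILE__, __LINE__, "assertion failed: " #cond))
#define otsan()         panic_at(__FILE__, __LINE__, \
                                 "off the straight and narrow!")
#define do_debug(stmt)  stmt
#else
/* With DEBUG undefined, everything compiles out to nothing. */
#define my_assert(cond) ((void)0)
#define otsan()         ((void)0)
#define do_debug(stmt)
#endif

/* Demonstrate that do_debug passes its statement through when
   DEBUG is defined. */
static int run_do_debug(void)
{
    int hit = 0;
    do_debug(hit = 1);
    return hit;
}
```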


       The following macros and functions are primarily intended to be used as temporary scaffolding
       during debugging, and removed from production code:


       void  dump_stack_trace(void):             This function dumps a human-readable backtrace of the cur-
             rent function call stack to the console, using printf.  The exact content and format of the
             printed data is architecture-specific;  however, the output is typically a list of instruction
             pointer or program counter values, each pointing into a function on the call stack, presum-
             ably to the return point after the function call to the next level.  You can find out what
             function these addresses reside in by running the Unix nm utility on the appropriate exe-
             cutable file image, sorting the resulting symbol list if necessary, and looking up the address
             in the sorted list.  Alternatively, for more precise details, you can look up the exact
             instruction addresses in a disassembly of the executable file, e.g., by using GNU objdump with the
             `-d' option.

       here():     This macro generates code that simply prints the source file name and line number at
             which the macro was used.  This macro can be extremely useful when trying to nail down
             the precise time or code location at which a particular bug manifests itself, or to determine
             the sequence of events leading up to it.  By sprinkling around calls to the here macro in
             appropriate places, the program will dump regular status reports of its location every time
             it hits one of these macros, effectively producing a log of "interesting" events ("interesting"
             being defined according to the placement of the here macro invocations).  Using the here
             macro this way is equivalent to the common practice of sprinkling printf's around and
             watching the output, except it is easier because the here invocation in each place does not
             have to be "tailored" to make it distinguishable from the other locations:  each use of the
             here macro is self-identifying.

             If DEBUG is not defined, the here macro is not defined at all; this makes it obvious when
             you've accidentally left invocations of this macro in a piece of code after it has been debugged.

       debugmsg(printfargs):            This macro is similar to here, except it allows a formatted message
             to be printed along with the source file name and line number.  printfargs is a complete
             set of arguments to be passed to the printf function, including parentheses:  for example,
             `debugmsg(("foo  is  %d",  foo));'.  A newline is automatically appended to the end of
             the message. This macro is generally useful as a wrapper for printf for printing temporary
             run-time status messages during execution of a program being debugged.

             As with here, if DEBUG is not defined, the debugmsg macro is not defined at all, in order to
             make it obvious if any invocations are accidentally left in production code.
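
       Possible definitions in the spirit of here and debugmsg are sketched below.  These are
       hypothetical, not the toolkit's exact macros; for convenience, these versions also yield the
       number of characters printed:

```c
#include <stdio.h>

/* Print the current source location; evaluates to printf's count. */
#define here() \
        printf("%s:%d\n", __FILE__, __LINE__)

/* printfargs includes its own parentheses, so `printf printfargs'
   expands to a complete printf call; a newline is appended. */
#define debugmsg(printfargs) \
        (printf("%s:%d: ", __FILE__, __LINE__) + \
         printf printfargs + printf("\n"))
```

       For example, `debugmsg(("foo is %d", foo));' prints the file name, line number, and the
       formatted message on one line.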

       Note that only panic and dump_stack_trace are real functions; the others are simply macros.
8.3.   X86 GENERIC LOW-LEVEL DEFINITIONS                                                              127


8.3         x86  Generic  Low-level  Definitions


XXX


8.3.1       asm.h:  assembly language support macros


Synopsis

       #include  <flux/x86/asm.h>


Description

       This file contains convenience macros useful when writing x86 assembly language code in AT&T/GAS
       syntax.  This header file is directly derived from Mach, and similar headers are used in various
       BSD kernels.



       Symbol name extension:   The following macros allow assembly language code to be written
       that coexists with C code compiled for either ELF or a.out format. In a.out format, by conven-
       tion an underscore (_) is prefixed to each public symbol referenced or defined by the C compiler;
       however, the underscore prefix is not used in ELF format.

       EXT(name):       Evaluates to _name in a.out format, or just name in ELF. This macro is typically
             used when referring to public symbols defined in C code.

       LEXT(name):       Evaluates to _name:  in a.out format, or name:  in ELF. This macro is generally
             used when defining labels to be exported to C code.

       SEXT(name):       Evaluates to the string literal "_name" in a.out format, or "name" in ELF. This
             macro can be used in GCC inline assembly code, where the code is contained in a string
             constant; for example: asm("...;  call  "SEXT(foo)";  ...");
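
       The following sketch shows EXT and SEXT with the ELF definitions active; the a.out variants,
       which add the underscore prefix, appear in comments.  LEXT is omitted since label definitions
       only make sense in assembly source:

```c
#include <assert.h>
#include <string.h>

/* ELF definitions active; a.out variants shown in comments. */
#define EXT(name)   name            /* a.out:  _##name   */
#define SEXT(name)  #name           /* a.out:  "_" #name */

/* EXT lets one definition serve either symbol-naming convention:
   this defines `ticks' under ELF, `_ticks' under a.out. */
int EXT(ticks) = 42;
```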



       Alignment:   The following macros relate to alignment of code and data:

       TEXT_ALIGN:      Evaluates to the preferred alignment of instruction entrypoints (e.g., functions or
             branch targets),  as a power of two.  Currently evaluates to 4 (16-byte alignment) if the
             symbol i486 is defined, or 2 (4-byte alignment) otherwise.

       ALIGN:     A synonym for TEXT_ALIGN.

       DATA_ALIGN:      Evaluates to the preferred minimum alignment of data structures.  Currently it is
             always defined as 2, although in some cases a larger value may be preferable, such as the
             processor's cache line size.

       P2ALIGN(alignment):           Assembly language code can use this macro to work around the fact that
             the .align directive works differently in different x86 environments: sometimes .align takes
             a byte count, whereas other times it takes a power of two (bit count).  The P2ALIGN macro
             always takes a power of two: for example, P2ALIGN(2) means 4-byte alignment. By default,
             the P2ALIGN macro uses the .p2align directive supported by GAS; if a different assembler
             is being used,  then P2ALIGN should be redefined as either .align  alignment or .align
             1<<(alignment), depending on the assembler's interpretation of .align.

       XXX S_ARG, B_ARG, frame stuff, ...

       XXX need to make the macros more easily overridable, using ifdefs.

       XXX need to clean out old trash still in the header file

       XXX IODELAY macro


8.3.2       eflags.h:  Processor flags register definitions


Synopsis

       #include  <flux/x86/eflags.h>


Description

       XXX

       This header file can be used in assembly language code as well as C.


8.3.3       proc_reg.h:  Processor register definitions and accessor functions


Synopsis

       #include  <flux/x86/proc_reg.h>


Description

       XXX

       This header file is taken from CMU's Mach kernel.


8.3.4       debug_reg.h:  Debug register definitions and accessor functions


Synopsis

       #include  <flux/x86/debug_reg.h>


Description

       XXX


8.3.5       fp_reg.h:  Floating point register definitions and accessor functions


Synopsis

       #include  <flux/x86/fp_reg.h>


Description

       XXX

       This header file is taken from CMU's Mach kernel.


8.3.6       far_ptr.h:  Far (segment:offset) pointers


Synopsis

       #include  <flux/x86/far_ptr.h>


Description

       XXX


8.3.7       pio.h:  Programmed I/O functions


Synopsis

       #include  <flux/x86/pio.h>


Description

       XXX

       This header file is taken from CMU's Mach kernel.

       XXX out?_p functions? iodelay


8.3.8       seg.h:  Segment descriptor data structure definitions and constants


Synopsis

       #include  <flux/x86/seg.h>


Description

       XXX

       This header file is based on a file in CMU's Mach kernel.


8.3.9       gate_init.h:  Gate descriptor initialization support


Description

       XXX

       See oskit/libkern/x86/base_trap_inittab.S for example code that uses this facility.


8.3.10        trap.h:  Processor trap vectors


Synopsis

       #include  <flux/x86/trap.h>


Description

       XXX

       This header file is taken from CMU's Mach kernel.


8.3.11        paging.h:  Page translation data structures and constants


Description

       XXX

       This header file is derived from Mach's intel/pmap.h.


8.3.12        tss.h:  Processor task save state structure definition


Synopsis

       #include  <flux/x86/tss.h>


Description

       XXX

       XXX only the 32-bit version

       This header file is taken from CMU's Mach kernel.


8.4         x86  PC  Generic  Low-level  Definitions


XXX
8.4.   X86 PC GENERIC LOW-LEVEL DEFINITIONS                                                           141


8.4.1       irq_list.h:  Standard hardware interrupt assignments


8.4.2       pic.h:  Programmable Interrupt Controller definitions


8.4.3       keyboard.h:  PC keyboard definitions


Synopsis

       #include  <flux/x86/pc/keyboard.h>


Description

       XXX

       This header file is taken from CMU's Mach kernel.


8.4.4       rtc.h:  NVRAM Register locations


Synopsis

       #include  <flux/x86/pc/rtc.h>


Description

       This file is taken from FreeBSD (XXX cite?) and contains definitions for the standard NVRAM,
       or Real Time Clock, register locations.

       These registers can be accessed with rtcin and rtcout.  XXX xref to these; in which section
       should they go?
8.5.   X86 PROCESSOR IDENTIFICATION AND MANAGEMENT                                          145


8.5         x86  Processor  Identification  and  Management


8.5.1       cpu_info:  CPU identification data structure


Synopsis

       #include  <flux/x86/cpuid.h>

       struct  cpu_info  {
            unsigned       stepping  :   4;                  /*  Stepping ID                 */
            unsigned       model  :   4;                     /*  Model                       */
            unsigned       family  :   4;                    /*  Family                      */
            unsigned       type  :   2;                      /*  Processor type              */
            unsigned       feature_flags;                    /*  Features supported          */
            char           vendor_id[12];                    /*  Vendor ID string            */
            unsigned       char  cache_config[16];           /*  Cache information           */
       };


Description

       This structure is used to hold identification information about x86 processors, such as information
       returned  by  the  CPUID  instruction.   The  cpuid  toolkit  function,  described  below,  fills  in  an
       instance of this structure with information about the current processor.

       Note that it is expected that the cpu_info structure will continue to grow in the future as new
       x86-architecture processors are released, so client code should not depend on this structure in
       ways that will break if the structure's size changes.
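
       For illustration, the version word returned in EAX by CPUID function 1 packs these fields
       into its low bits: the stepping in bits 0-3, the model in bits 4-7, the family in bits 8-11,
       and the type in bits 12-13.  The unpack_version helper below is hypothetical and shows only
       the version fields, not the full structure:

```c
#include <assert.h>

/* Version fields only; the real structure also carries feature
   flags, the vendor ID string, and cache information. */
struct cpu_info {
    unsigned stepping : 4;          /* Stepping ID    */
    unsigned model    : 4;          /* Model          */
    unsigned family   : 4;          /* Family         */
    unsigned type     : 2;          /* Processor type */
};

/* Hypothetical helper: unpack the version word CPUID function 1
   returns in EAX into the bit fields above. */
static struct cpu_info unpack_version(unsigned eax)
{
    struct cpu_info ci;

    ci.stepping = eax         & 0xf;
    ci.model    = (eax >> 4)  & 0xf;
    ci.family   = (eax >> 8)  & 0xf;
    ci.type     = (eax >> 12) & 0x3;
    return ci;
}
```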

       The family field describes the processor family:


       CPU_FAMILY_386:        A 386-class processor.

       CPU_FAMILY_486:        A 486-class processor.

       CPU_FAMILY_PENTIUM:          A Pentium-class ("586") processor.

       CPU_FAMILY_PENTIUM_PRO:           A Pentium Pro-class ("686") processor.


       The type field is one of the following:


       CPU_TYPE_ORIGINAL:         Original OEM processor.

       CPU_TYPE_OVERDRIVE:          OverDrive upgrade processor.

       CPU_TYPE_DUAL:       Dual processor.


       The feature_flags field is a bit field containing the following bits:


       CPUF_ON_CHIP_FPU:        Set if the CPU has a built-in floating point unit.

       CPUF_VM86_EXT:       Set if the virtual 8086 mode extensions are supported, i.e., the VIF and VIP
             flags register bits, and the VME and PVI bits in CR4.

       CPUF_IO_BKPTS:       Set  if  I/O  breakpoints  are  supported,  i.e.,  the  DR7_RW_IO  mode  defined  in
             x86/debug_reg.h.

       CPUF_4MB_PAGES:        Set  if  4MB  superpages  are  supported,  i.e.,  the  INTEL_PDE_SUPERPAGE  page
             directory entry bit defined in x86/paging.h.

       CPUF_TS_COUNTER:        Set if the on-chip timestamp counter and the RDTSC instruction are available.

       CPUF_PENTIUM_MSR:         Set if the Pentium model specific registers are available.

       CPUF_PAGE_ADDR_EXT:         Set if the Pentium Pro's page addressing extensions (36-bit physical ad-
             dresses and 2MB pages) are available.

       CPUF_MACHINE_CHECK_EXCP:           Set if the processor supports the Machine Check exception (vector
             18, or T_MACHINE_CHECK in x86/trap.h).
       CPUF_CMPXCHG8B:        Set if the processor supports the CMPXCHG8B instruction (also known as
             "double-compare-and-swap").

       CPUF_LOCAL_APIC:        Set if the processor has a built-in local APIC (Advanced Programmable In-
             terrupt Controller), for symmetric multiprocessor support.

       CPUF_MEM_RANGE_REGS:          Set if the processor supports the memory type range registers.

       CPUF_PAGE_GLOBAL_EXT:          Set if the processor supports the global paging extensions, i.e.,
             the INTEL_PDE_GLOBAL page table entry bit defined in x86/paging.h.

       CPUF_MACHINE_CHECK_ARCH:           Set if the processor supports Intel's machine check architecture and
             the MCG_CAP model-specific register.

       CPUF_CMOVCC:       Set if the processor supports the CMOVcc instructions.

       The cpuid.h header file also contains symbolic definitions for other constants such as the cache
       configuration descriptor values;  see the header file and the Intel documentation for details on
       these.
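
       A feature test on feature_flags might look like the following sketch.  The CPUF_ values shown
       here are hypothetical ones mirroring the CPUID function-1 EDX bit positions; the toolkit's
       actual constants are defined in flux/x86/cpuid.h and may differ:

```c
#include <assert.h>

/* Hypothetical values for a few CPUF_ constants, following the
   CPUID function-1 EDX feature-bit layout. */
#define CPUF_ON_CHIP_FPU  (1u << 0)
#define CPUF_4MB_PAGES    (1u << 3)
#define CPUF_TS_COUNTER   (1u << 4)
#define CPUF_CMPXCHG8B    (1u << 8)

/* Nonzero if every feature bit in `wanted' is set in `feature_flags'. */
static int cpu_has(unsigned feature_flags, unsigned wanted)
{
    return (feature_flags & wanted) == wanted;
}
```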


8.5.2       cpuid:  identify the current CPU


Synopsis

       #include  <flux/x86/cpuid.h>

       void cpuid([out] struct  cpu_info *out_info);


Description

       This function identifies the CPU on which it is running using Intel's recommended CPU identi-
       fication procedure, and fills in the supplied structure with the information found.

       Note that since the cpuid function is 32-bit code, it wouldn't run on anything less than an 80386
       in the first place; therefore it doesn't bother to check for earlier processors.


8.5.3       cpu_info_format:  output a cpu_info structure in ASCII form


Synopsis

       #include  <flux/x86/cpuid.h>

       void  cpu_info_format(struct  cpu_info  *info,  void  (*formatter)(void  *data,  const  char
       *fmt, ...), void *data);


Description

       This function takes the information in a cpu_info structure and formats it as human-readable
       text.  The formatter should be a pointer to a printf-like function to be called to format the
       output data.  The formatter function may be called multiple times to output all the relevant
       information.


Parameters

       info:   The filled-in CPU information structure to output.

       formatter :    The printf-style formatted output function to call. It will be called with the opaque
             data pointer provided, a standard C format string (fmt), and optionally a set of data items
             to format.

       data:    An opaque pointer which is simply passed on to the formatter function.
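
       For example, a formatter that collects the output into a buffer rather than printing it
       directly might look like the sketch below.  buf_formatter and format_buf are hypothetical
       names; any function matching the formatter signature will do:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

struct format_buf {
    char   text[256];
    size_t used;
};

/* printf-style callback: append the formatted text to the buffer
   passed through the opaque data pointer. */
static void buf_formatter(void *data, const char *fmt, ...)
{
    struct format_buf *b = data;
    va_list ap;

    va_start(ap, fmt);
    b->used += vsnprintf(b->text + b->used,
                         sizeof b->text - b->used, fmt, ap);
    va_end(ap);
}

/* Exercise the callback directly, as cpu_info_format would. */
static const char *format_demo(void)
{
    static struct format_buf b;

    b.used = 0;
    b.text[0] = '\0';
    buf_formatter(&b, "family %d, ", 6);
    buf_formatter(&b, "model %d", 3);
    return b.text;
}
```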


8.5.4       cpu_info_dump:  pretty-print a cpu_info structure to the console


Synopsis

       #include  <flux/x86/cpuid.h>

       void cpu_info_dump(struct  cpu_info *info);


Description

       This function is merely a convenient front-end to cpu_info_format; it simply formats the CPU
       information and outputs it to the console using printf.


8.5.5       i16_enter_pmode:  enter protected mode


Synopsis

       #include  <flux/x86/pmode.h>

       void i16_enter_pmode(int prot_cs);


Description

       This 16-bit function switches the processor into protected mode by turning on the Protection
       Enable (PE) bit in CR0. The instruction that sets the PE bit is followed immediately by a jump
       instruction to flush the prefetch buffer, as recommended by Intel documentation.

       The function also initializes the CS register with the appropriate new protected-mode code
       segment, whose selector is specified in the prot_cs parameter.  The prot_cs argument must
       evaluate to a constant, as it is used as an immediate operand in an inline assembly language
       code fragment.

       This routine does not perform any of the other steps in Intel's recommended mode switching
       procedure, such as setting up the GDT or reinitializing the data segment registers; these steps
       must be performed separately.  The overall mode switching sequence is necessarily much more
       dependent on various OS-specific factors such as the layout of the GDT; therefore the OS toolkit
       does not attempt to provide a "generic" function to perform the entire switch.  Instead, the full
       switching sequence is provided as part of the base environment setup code; see Section ??  for
       more details.


8.5.6       i16_leave_pmode:  leave protected mode


Synopsis

       #include  <flux/x86/pmode.h>

       void i16_leave_pmode(int real_cs);


Description

       This 16-bit function switches the processor out of protected mode and back into real mode by
       turning off the Protection Enable (PE) bit in CR0.  The instruction that clears the PE bit is
       followed  immediately  by  a  jump  instruction  to  flush  the  prefetch  buffer,  as  recommended  by
       Intel documentation.  At the same time, this function also initializes the CS register with the
       appropriate real-mode code segment, specified by the real_cs parameter.

       This routine does not perform any of the other steps in Intel's recommended mode switching
       procedure, such as reinitializing the data segment registers; these steps must be performed sep-
       arately. See Section ??  for information on the full mode switch implementation provided by the
       base environment.


8.5.7       paging_enable:  enable page translation


Synopsis

       #include  <flux/x86/paging.h>

       void paging_enable(vm_offset_t pdir);


Description

       XXX

       The  caller  must  already  have  created  and  initialized  an  appropriate  initial  page  directory  as
       described in Intel documentation. The OS toolkit provides convenient facilities that can be used
       to create x86 page directories and page tables; for more information, see Section 8.9.


8.5.8       paging_disable:  disable page translation


Synopsis

       #include  <flux/x86/paging.h>

       void paging_disable(void);


Description

       XXX
8.6.   X86 BASE ENVIRONMENT                                                                                   155


8.6         x86  Base  Environment


The base environment code for the x86 architecture is designed to assist the OS developer in dealing with
much of the "x86 grunge" that OS developers typically would rather not worry about.  The OS toolkit
provides easy-to-use primitives to set up and maintain various common flavors of x86 kernel environments
without  unnecessarily  constraining  the  OS  implementation.   The  base  environment  support  on  the  x86
architecture is divided into three main categories:  segmentation, paging, and trap handling.  The support
code in each category is largely orthogonal and easily separable, although the three pieces are also
designed to work well together.



8.6.1       Memory model

The x86 architecture supports a very complex virtual memory model involving both segmentation and paging;
one of the goals of the OS toolkit's base environment support for the x86 is to smooth over some of this
complexity, hiding the details that the OS doesn't want to deal with while still allowing the OS full freedom
to use the processor's virtual memory mechanisms as it sees fit.  This section describes the memory models
supported and assumed by the base environment.
    First, here is a summary of several important terms that are used heavily in the following text; for
full details on virtual, linear, and physical addresses on the x86 architecture, see the appropriate processor
manuals.

    o  Physical addresses are the actual addresses seen on external I/O and memory busses, after segmentation
       and paging transformations have been applied.

    o  Linear addresses are absolute 32-bit addresses within the x86's paged address space,  after segmen-
       tation has been applied but before page translation.  The virtual addresses of simple "paging-only"
       architectures such as MIPS correspond to linear addresses on the x86.

    o  Virtual  addresses  are  the  logical  addresses  used  by  program  code  to  access  memory.   To  read  an
       instruction or access a data item, the processor first converts the virtual address into a linear address
       using  the  segmentation  mechanism,  then  translates  the  linear  address  to  a  physical  address  using
       paging.

    o  Kernel virtual addresses are the virtual addresses normally used by kernel code to access its own func-
       tions and data structures: in other words, addresses accessed through the kernel's segment descriptors.

    The OS toolkit provides a standard mechanism, defined in base_vm.h (see Section 8.6.2), which is used
throughout the base environment to maintain considerable independence from the memory model in effect.
These facilities allow the base environment support code to avoid various assumptions about the relationships
between kernel virtual addresses, linear addresses, and physical addresses.  Client OS code can use these
facilities as well if desired.
    Of  course,  it  is  impractical  for  the  base  environment  code  to  avoid  assumptions  about  the  memory
model completely.  In particular, the code assumes that, for "relevant" code and data (e.g., the functions
implementing the base environment and the data structures they manipulate), kernel virtual addresses can
be converted to and from linear or physical addresses by adding or subtracting an offset stored in a global
variable.   However,  the  code  does  not  assume  that  these  offsets  are  always  the  same  (the  client  OS  is
allowed to change them dynamically),  or that all available physical memory is mapped into the kernel's
virtual address space, or that all linear memory is accessible through the kernel's data segment descriptors.
Detailed information about the memory model assumptions made by particular parts of the base environment
support is documented in the appropriate API sections.
    If the OS toolkit's startup code is being used to start the OS, then the specific memory model in effect
initially depends on the startup environment, described later in the appropriate sections.  For example, for
kernels booted from a MultiBoot boot loader, in the initial memory environment virtual addresses, linear
addresses, and physical addresses are all exactly equal (the offsets are zero). On the other hand, for kernels
loaded from DOS, linear addresses and physical addresses will still be equal but kernel virtual addresses will
be at some offset depending on where in physical memory the kernel was loaded.  Regardless of the initial
memory setup, the client OS is free to change the memory model later as necessary.
156                                                   CHAPTER 8.  KERNEL SUPPORT LIBRARY (LIBKERN.A)


    XXX example memory maps

    XXX explain how to change memory models at run-time
8.6.   X86 BASE ENVIRONMENT


8.6.2       base_vm.h:  definitions for the base virtual memory environment


Description

       This header file provides generic virtual memory-related definitions commonly used throughout
       the base environment,  which apply to both segmentation and paging.  In particular,  this file
       defines a set of macros and global variables which allow the rest of the base environment code in
       the toolkit (and the client OS, if it chooses) to maintain independence from the memory model in
       effect.  These facilities allow code to avoid various assumptions about the relationships between
       kernel virtual addresses, linear addresses, and physical addresses.

       The following variable and associated macros are provided to convert between linear and kernel
       virtual addresses.

       linear_base_va:        This global variable defines the address in kernel virtual memory that cor-
             responds  to  address  0  in  linear  memory.  It  is  used  by  the  following  conversion  macros;
             therefore, changing this variable changes the behavior of the associated macros.

       lintokv(la):        This macro converts linear address la into a kernel virtual address and returns
             the result as a vm_offset_t.

       kvtolin(va):        This macro converts kernel virtual address va into a linear address and returns
             the result as a vm_offset_t.  For example, the segmentation initialization code uses
             kvtolin() to calculate the linear addresses of segmentation structures to be used in
             segment descriptor or pseudo-descriptor structures provided to the processor.

       Similarly, the following variable and associated macros convert between physical and kernel virtual
       addresses. (Conversions between linear and physical addresses can be done by combining the two
       sets of macros.)

       phys_mem_va:       This global variable defines the address in kernel virtual memory that corresponds
             to address 0 in physical memory.  It is used by the following conversion macros; therefore,
             changing this variable changes the behavior of the associated macros.

       phystokv(pa):        This macro converts physical address pa into a kernel virtual address and returns
             the result as a vm_offset_t.  The macro makes the assumption that the specified physical
             address can be converted to a kernel virtual address this way:  in OS kernels that do not
             direct-map  all  physical  memory  into  the  kernel's  virtual  address  space,  the  caller  must
             ensure that the supplied pa refers to a physical address that is mapped.  For example, the
             primitive page table management code provided by the OS toolkit's base environment uses
             this macro to access page table entries given the physical address of the page table; therefore,
             these functions can only be used if page tables are allocated from physical pages that are
             direct-mapped into the kernel's address space.

       kvtophys(va):        This macro converts kernel virtual address va into a physical address and returns
             the result as a vm_offset_t. The macro assumes that the virtual address can be converted
             directly to a physical address this way;  the caller must ensure that this is the case.  For
             example, some operating systems only direct-map the kernel's code and statically allocated
             data; in such kernels, va should only refer to statically-allocated variables or data structures.
             This is generally sufficient for the OS toolkit's base environment code, which mostly operates
             on  statically-allocated  data  structures;  however,  the  OS  must  of  course  take  its  chosen
             memory model into consideration if it uses these macros as well.

       XXX real_cs

       Note that there is nothing in this header file that defines or relates to "user-mode" address spaces.
       This is because the base environment code in the OS toolkit is not concerned with user mode
       in any way; in fact, it doesn't even care whether or not the OS kernel implements user address
       spaces at all.  For example,  boot loaders or unprotected real-time kernels built using the OS
       toolkit probably do not need any notion of user mode at all.


8.6.3       base_cpu_setup:  initialize and activate the base CPU environment


Synopsis

       #include  <flux/machine/base_cpu.h>

       void base_cpu_setup(void);


Description

       This function provides a single entrypoint to initialize and activate all of the processor struc-
       tures necessary for ordinary execution.  This includes identifying the CPU, and initializing and
       activating the base GDT, IDT, and TSS, and reloading all segment registers as recommended by
       Intel.  The call returns with the CS segment set to KERNEL_CS (the default kernel code segment;
       see 8.7.1 for details), DS, ES, and SS set to KERNEL_DS (the default kernel data segment), and FS
       and GS set to 0.  After the base_cpu_setup call completes, a full working kernel environment is
       in place: segment registers can be loaded, interrupts and traps can be fielded by the OS, privilege
       level changes can occur, etc.

       This function does not initialize or activate the processor's paging mechanism, since unlike the
       other mechanisms, paging is optional on the x86 and not needed in some environments (e.g., boot
       loaders or embedded kernels).

       The base_cpu_setup function is actually just a simple wrapper that calls base_cpu_init followed
       by base_cpu_load.

       Note that it is permissible to call this function (and/or the more primitive functions it is built
       on) more than once. This is particularly useful when reconfiguring the kernel memory map. For
       example, a typical MultiBoot (or other 32-bit) kernel generally starts out with paging disabled, so
       it must run in the low range of linear/physical memory. However, after enabling page translation,
       the OS may later want to relocate itself to run at a higher address in linear memory so that
       application programs can use the low part (e.g., v86-mode programs). An easy way to do this with
       the OS toolkit is to call base_cpu_setup once at the very beginning, to initialize the basic unpaged
       kernel environment, and then later, after paging is enabled and appropriate mappings have been
       established in high linear address space, modify the linear_base_va variable (Section 8.6.2) to
       reflect the kernel's new linear address base, and finally call base_cpu_setup again to reinitialize
       and reload the processor tables according to the new memory map.


Dependencies

       base_cpu_init:        8.6.4

       base_cpu_load:        8.6.5


8.6.4       base_cpu_init:  initialize the base environment data structures


Synopsis

       #include  <flux/machine/base_cpu.h>

       void base_cpu_init(void);


Description

       This function initializes all of the critical data structures used by the base environment, including
       base_cpuid, base_idt, base_gdt, and base_tss, but does not actually activate them or otherwise
       modify  the  processor's  execution  state.   The  base_cpu_load  function  must  be  called  later  to
       initialize the processor with these structures. Separate initialization and activation functions are
       provided to allow the OS to customize the processor data structures if necessary before activating
       them.


Dependencies

       cpuid:     8.5.2

       base_trap_init:        8.8.2

       base_gdt_init:        8.7.2

       base_tss_init:        8.7.7


8.6.5       base_cpu_load:  activate the base processor execution environment


Synopsis

       #include  <flux/machine/base_cpu.h>

       void base_cpu_load(void);


Description

       This function loads the critical base environment data structures (in particular, the GDT, IDT,
       and TSS) into the processor, and reinitializes all segment registers from the new GDT as recom-
       mended in Intel processor documentation.  The structures must already have been set up by a
       call to base_cpu_init and/or custom initialization code in the client OS.

       This function returns with the CS segment set to KERNEL_CS (the default kernel code segment; see
       Section 8.7.1 for details), DS, ES, and SS set to KERNEL_DS (the default kernel data segment), and
       FS and GS set to 0. After the base_cpu_load call completes, a full working kernel environment is
       in place: segment registers can be loaded, interrupts and traps can be fielded by the OS, privilege
       level changes can occur, etc.


Dependencies

       base_gdt_load:        8.7.3

       base_idt_load:        8.7.5

       base_tss_load:        8.7.8


8.6.6       base_cpuid:  global variable describing the processor


Synopsis

       #include  <flux/machine/base_cpu.h>

       extern  struct  cpu_info  base_cpuid;


Description

       This is a global variable that is filled in by base_cpu_init with information about the processor
       on which base_cpu_init was called.  (Alternatively, it can also be initialized manually by the
       OS simply by calling cpuid(&base_cpuid)).  This structure is used by other parts of the kernel
       support library to determine whether or not certain processor features are available, such as 4MB
       superpages. See 8.5.1 for details on the contents of this structure.

       Note that in a multiprocessor system, this variable will reflect the boot processor. This is generally
       not  a  problem,  since  most  SMPs  use  identical  processors,  or  at  least  processors  in  the  same
        generation, so that they appear equivalent to OS software.  (For example, it is very unlikely that
        you'd find an SMP that mixes 486 and Pentium processors.)  However, if this ever turns out to
       be a problem, the OS can always override the cpuid or base_cpu_init function, or just modify
       the contents of the base_cpuid variable after calling base_cpu_init so that it reflects the least
       common denominator of all the processors.


8.6.7       base_stack.h:  default kernel stack


XXX


8.7         x86 Base Environment:  Segmentation Support


Although  most  modern  operating  systems  use  a  simple  "flat"  address  space  model,  the  x86  enforces  a
segmentation model which cannot be disabled directly; instead, it must be set up to emulate a flat address
space model if that is what the OS desires.  The base environment code provides functionality to set up a
simple flat-model processor environment suitable for many types of kernels, both "micro" and "macro." For
example, it provides a default global descriptor table (GDT) containing various flat-model segments for the
kernel's use, as well as a default task state segment (TSS).
    Furthermore, even though this base environment is often sufficient, the client OS is not limited to using
it  exactly  as  provided  by  default:  the  client  kernel  is  given  the  flexibility  to  tweak  various  parameters,
such as virtual and linear memory layout, as well as the freedom to operate completely outside of the base
environment when necessary.  For example, although the base environment provides a default TSS, the OS
is free to create its own TSS structures and use them when running applications that need special facilities
such as v86 mode or I/O ports.  Alternatively, the OS could use the default processor data structures only
during startup, and switch to its own complete, customized set after initialization.
    The base environment code in the OS toolkit generally assumes that it is running in a simple flat model,
in which only one code segment and one data segment are used for all kernel code and data, respectively,
and that the code and data segments are synonymous (they each map to the same range of linear addresses).
The OS is free to make more exotic uses of segmentation if it so desires, as long as the OS toolkit code is
run in a consistent environment.
    XXX diagram of function call tree?
    The base segmentation environment provided by the OS toolkit is described in more detail in the following
API sections.


8.7.1       base_gdt:  default global descriptor table for the base environment


Synopsis

       #include  <flux/x86/base_gdt.h>

       extern  struct  x86_desc  base_gdt[GDTSZ];


Description

       This variable is used in the base environment as the default global descriptor table.  The de-
       fault base_gdt definition contains GDTSZ selector slots, including the Intel-reserved, permanently
       unused slot 0.

       The following symbols are defined in base_gdt.h to be segment selectors for the descriptors in the
       base GDT. These selectors can be converted to indices into the GDT descriptor array base_gdt
       by dividing by 8 (the processor reserves the low three bits of all selectors for other information).

       BASE_TSS:     A selector for the base task state segment (base_tss).  The BASE_TSS segment de-
             scriptor is initialized by base_gdt_init, but the base_tss structure itself is initialized by
             base_tss_init and loaded into the processor by base_tss_load; see Section 8.7.6 for more
             details.

       KERNEL_CS:      This is the default kernel code segment selector.  It is initialized by base_gdt_init
             to be a flat-model, 4GB, readable, ring 0 code segment; base_gdt_load loads this segment
             into the CS register while reinitializing the processor's segment registers.

       KERNEL_DS:      This is the default kernel data segment selector.  It is initialized by base_gdt_init
             to be a flat-model, 4GB, writable, ring 0 data segment; base_gdt_load loads this segment
             into the DS, ES, and SS registers while reinitializing the processor's segment registers.

       KERNEL_16_CS:       This selector is identical to KERNEL_CS except that it is a 16-bit code segment
             (the processor defaults to 16-bit operand and addressing modes rather than 32-bit while
             running code in this segment), and it has a 64KB limit rather than 4GB. This selector is
             used when switching between real and protected mode, to provide an intermediate 16-bit
             protected mode execution context.  It is unused in kernels that never execute in real mode
             (e.g., typical MultiBoot kernels).

       KERNEL_16_DS:       This selector is a data segment synonym for KERNEL_16_CS; it is generally only
             used when switching from protected mode back to real mode.  It is used to ensure that
             the segment registers contain sensible real-mode values before performing the switch,  as
             recommended in Intel literature.

       LINEAR_CS:      This selector is set up to be a ring 0 code segment that directly maps the entire
             linear address space:  in other words,  it has an offset of zero and a 4GB limit.  In some
             environments, where kernel virtual addresses are the same as linear addresses, this selector
             is a synonym for KERNEL_CS.

       LINEAR_DS:      This is a data segment otherwise identical to LINEAR_CS.

       USER_CS:     This selector is left unused and uninitialized by the OS toolkit; nominally, it is intended
             to be used as a code segment for unprivileged user-level code.

       USER_DS:     This selector is left unused and uninitialized by the OS toolkit; nominally, it is intended
             to be used as a data segment for unprivileged user-level code.

       If the client OS wants to make use of the base GDT but needs more selector slots for its own
       purposes, it can define its own instance of the base_gdt variable so that it has room for more
       than GDTSZ elements; base_gdt_init will initialize only the first "standard" segment descriptors,
       leaving the rest for the client OS's use.

       On multiprocessor systems, the client OS may want each processor to have its own GDT. In this
       case, the OS can create a separate clone of the base GDT for each additional processor besides
       the boot processor, and leave the boot processor using the base GDT. Alternatively, the OS could
        use the base GDT only during initialization, and switch all processors to custom GDTs later;
       this approach provides the most flexibility to the OS, since the custom GDTs can be arranged
       in whatever way is most convenient.


8.7.2       base_gdt_init:  initialize the base GDT to default values


Synopsis

       #include  <flux/x86/base_gdt.h>

       void base_gdt_init(void);
       void i16_base_gdt_init(void);


Description

       This function initializes the standard descriptors in the base GDT as described in Section 8.7.1.

       For  all  of  the  standard  descriptors  except  LINEAR_CS  and  LINEAR_DS,  the  kvtolin  macro  is
       used to compute the linear address to plug into the offset field of the descriptor:  for BASE_TSS,
       this is kvtolin(&base_tss); for the kernel code and data segments, it is kvtolin(0) (i.e., the
       linear address corresponding to the beginning of kernel virtual address space).  LINEAR_CS and
       LINEAR_DS are always given an offset of 0.

       A 16-bit version of this function, i16_base_gdt_init, is also provided so that the GDT can be
       initialized properly before the processor has been switched to protected mode.  (Switching to
       protected mode on the x86 according to Intel's recommended procedure requires a functional
       GDT to be already initialized and activated.)


Dependencies

       fill_descriptor:         8.3.8

       kvtolin:      8.6.2

       base_gdt:      8.7.1

       base_tss:      8.7.6


8.7.3       base_gdt_load:  load the base GDT into the CPU


Synopsis

       void base_gdt_load(void);
       void i16_base_gdt_load(void);


Description

       This function loads the base GDT into the processor's GDTR, and then reinitializes all segment
       registers from the descriptors in the newly loaded GDT. It returns with the CS segment set to
       KERNEL_CS (the default kernel code segment; see Section 8.7.1 for details), DS, ES, and SS set to
       KERNEL_DS (the default kernel data segment), and FS and GS set to 0.


Dependencies

       kvtolin:      8.6.2

       base_gdt:      8.7.1


8.7.4       base_idt:  default interrupt descriptor table


Synopsis

       #include  <flux/x86/base_idt.h>

       extern  struct  x86_desc  base_idt[IDTSZ];


Description

       This global variable is used in the base environment as the default interrupt descriptor table.
       The default definition of base_idt in the library contains the architecturally-defined maximum
       of 256 interrupt vectors (IDTSZ).1

       The  base_idt.h  header  file  does  not  define  any  symbols  representing  interrupt  vector  num-
       bers.   The  lowest  32  vectors  are  the  processor  trap  vectors  defined  by  Intel;  since  these  are
       not  specific  to  the  base  environment,  they  are  defined  in  the  generic  header  file  x86/trap.h
       (see  Section  8.3.10).   Standard  hardware  interrupt  vectors  are  PC-specific,  and  therefore  are
       defined separately in x86/pc/irq_list.h (see Section 8.4.1).  For the same reason, there is no
       base_idt_init function, only separate functions to initialize the trap vectors in the base IDT
       (base_trap_init, Section 8.8.2), and hardware interrupt vectors in the IDT (XXX).
____________________
    1  Rationale:  Although simple x86 PC kernels often use only the 32 processor trap vectors plus 16 interrupt vectors,
which set of vectors is used for hardware interrupts tends to differ greatly between kernels.  Some kernels also want to use
well-known vectors for efficient system call emulation, such as 0x21 for DOS or 0x80 for Linux. Some bootstrap mechanisms,
such as VCPI on DOS, must determine at run-time the set of vectors used for hardware interrupts, and therefore potentially
need all 256 vectors to be available. Finally, making use of the enhanced interrupt facilities on Intel SMP Standard-compliant
multiprocessors generally requires use of higher vector numbers, since vector numbers are tied to interrupt priorities.  For all
these reasons, we felt the default IDT should be of the maximum size, even though much of it is usually wasted.


8.7.5       base_idt_load:  load the base IDT into the current processor


Synopsis

       #include  <flux/x86/base_idt.h>

       void base_idt_load(void);


Description

       This  function  loads  the  base_idt  into  the  processor,  so  that  subsequent  traps  and  hardware
       interrupts will vector through it. It uses the kvtolin macro to compute the proper linear address
       of the IDT to be loaded into the processor.


Dependencies

       kvtolin:      8.6.2

       base_idt:      8.7.4


8.7.6       base_tss:  default task state segment


Synopsis

       #include  <flux/x86/base_tss.h>

       extern  struct  x86_tss  base_tss;


Description

       The base_tss variable provides a default task state segment that the OS can use for privilege level
       switching if it does not otherwise use the x86's task switching mechanisms. The x86 architecture
       requires every protected-mode OS to have at least one TSS even if no task switching is done;
       however, many x86 kernels do not use the processor's task switching features because it is faster
       to  context  switch  manually.   Even  if  special  TSS  segments  are  used  sometimes  (e.g.,  to  take
       advantage of the I/O bitmap feature when running MS-DOS programs), the OS can still use a
       common TSS for all tasks that do not need to use these special features; this is the strategy taken
       by the Mach kernel, for example.  The base_tss provided by the toolkit serves in this role as a
       generic "default" TSS.

       The base_tss is a minimal TSS, in that it contains no I/O bitmap or interrupt redirection map.
       XXX The toolkit also supports an alternate default TSS with a full I/O permission bitmap, but
       it isn't fully integrated or documented yet.


8.7.7       base_tss_init:  initialize the base task state segment


Synopsis

       #include  <flux/x86/base_tss.h>

       void base_tss_init(void);


Description

       The base_tss_init function initializes the base_tss to a valid minimal state.  It sets the I/O
       permission bitmap offset to point past the end of the TSS, so that it will be interpreted by the
       processor as empty (no permissions for any I/O ports). It also initializes the ring 0 stack segment
       selector (ss0) to KERNEL_DS, and the ring 0 stack pointer (esp0) to the current stack pointer
       value at the time of the function call, to provide a minimal working context for trap handling.
       Once the OS kernel sets up a "real" kernel stack, it should reinitialize base_tss.esp0 to point
       to that.


Dependencies

       base_tss:      8.7.6


8.7.8       base_tss_load:  load the base task state segment into the processor


Synopsis

       #include  <flux/x86/base_tss.h>

       void base_tss_load(void);


Description

       This function activates the base_tss in the processor using the LTR instruction, after clearing
       the busy bit in the BASE_TSS segment descriptor to ensure that a spurious trap isn't generated.


Dependencies

       base_gdt:      8.7.1


8.8         x86 Base Environment:  Trap Handling


XXX diagram of function call tree?
    XXX options:  use everything (i.e.  when "OS" doesn't handle traps), set the base_trap_handler (to use
the default state frame), override the base_trap_inittab (to use a different state frame), replace everything.


8.8.1       trap_state:  saved state format used by the default trap handler


Synopsis

       #include  <flux/x86/base_trap.h>


       struct trap_state {
               /* Saved segment registers */
               unsigned int     gs;
               unsigned int     fs;
               unsigned int     es;
               unsigned int     ds;

               /* PUSHA register state frame */
               unsigned int     edi;
               unsigned int     esi;
               unsigned int     ebp;
               unsigned int     cr2;  /* we save cr2 over esp for page faults */
               unsigned int     ebx;
               unsigned int     edx;
               unsigned int     ecx;
               unsigned int     eax;

               /* Processor trap number, 0-31. */
               unsigned int     trapno;

               /* Error code pushed by the processor, 0 if none. */
               unsigned int     err;

               /* Processor state frame */
               unsigned int     eip;
               unsigned int     cs;
               unsigned int     eflags;
               unsigned int     esp;
               unsigned int     ss;

               /* Virtual 8086 segment registers */
               unsigned int     v86_es;
               unsigned int     v86_ds;
               unsigned int     v86_fs;
               unsigned int     v86_gs;
       };


Description

       This structure defines the saved state frame pushed on the stack by the default trap entrypoints
       provided by the base environment (see Section 8.8.3).  It is also used by the trap_dump rou-
       tine, which is used in the default environment to dump the saved register state and panic if an
       unexpected trap occurs; and by gdb_trap, the default trap handler for remote GDB debugging.

       The client OS is not obligated to use this structure as the saved state frame for traps it handles;
       if this structure is not used, then the OS must also override (or not use) the dependent routines
       mentioned above.

       The structure elements from err down correspond to the basic trap frame pushed on the stack
       by the x86 processor. (For traps in which the processor does not push an error code, the default
       trap entrypoint code sets err to zero.)  The structure elements from esp down are only pushed
       by traps from lower privilege (rings 1-3), and the structure elements from v86_es down are only
       pushed by traps from v86 mode.

       The rest of the state frame is pushed manually by the default trap entrypoint code.  The saved
       integer register state is organized in a format compatible with the processor's PUSHA instruction.
       However, in the slot that would otherwise hold the pushed ESP (which is useless since it is the trap
       handler's stack pointer rather than the trapping code's stack pointer), the default trap handler
       saves the CR2 register (page fault linear address) during page faults.

       This trap state structure is borrowed from Mach.


8.8.2       base_trap_init:  initialize the processor trap vectors in the base IDT


Synopsis

       #include  <flux/x86/base_trap.h>

       void base_trap_init(void);


Description

       This function initializes the processor trap vectors in the base IDT to the default trap entrypoints
       defined in base_trap_inittab.


Dependencies

       gate_init:      8.3.9

       base_idt:      8.7.4

       base_trap_inittab:         8.8.3


8.8.3       base_trap_inittab:  initialization table for the default trap entrypoints


Synopsis

       #include  <flux/x86/base_trap.h>

       extern  struct  gate_init_entry  base_trap_inittab[];


Description

       This gate initialization table (see Section 8.3.9) encapsulates the base environment's default trap
       entrypoint  code.  This  module  provides  IDT  entrypoints  for  all  of  the  processor-defined  trap
       vectors; each entrypoint pushes a standard state frame on the stack (see Section 8.8.1), and then
       calls the C function pointed to by the global variable base_trap_handler (see Section 8.8.4).
       Through these entrypoints,  the OS toolkit provides the client OS with a convenient,  uniform
       method of handling all processor traps in ordinary high-level C code.

       If a trap occurs and the trap entrypoint code finds that the base_trap_handler pointer is null (as
       is the case by default if the client OS never sets this pointer), or if it points to a handler routine but
       the handler returns a nonzero value indicating failure, the entrypoint code calls trap_dump_panic
       (see Section 8.8.6) to dump the register state to the console and panic the kernel. This behavior
       is typically appropriate in kernels that do not expect traps to occur during proper operation (e.g.,
       boot loaders or embedded operating systems), where a trap probably indicates a serious software
       bug.

       On the other hand, if a trap handler is present and returns success (zero), the entrypoint code
       restores the saved state and resumes execution of the trapping code.  The trap handler may
       change the contents of the trap_state structure passed by the entrypoint code; in this case, the
       final contents of the structure on return from the trap handler are the state that is restored.

       All of the IDT entries initialized by the base_trap_inittab are trap gates rather than interrupt
       gates; therefore, if hardware interrupts are enabled when a trap occurs, then interrupts will still
       be enabled during the trap handler unless the trap handler explicitly disables them.  If the OS
       wants interrupts to be disabled during trap handling, it can change the processor trap vectors
       in the IDT (vectors 0-31) into interrupt gates, or it can simply use its own trap entrypoint code
       instead.


Dependencies

       struct  trap_state:         8.8.1

       base_trap_handler:         8.8.4

       trap_dump_panic:         8.8.6


8.8.4       base_trap_handler:  pointer to trap handler


Synopsis

       #include  <flux/x86/base_trap.h>

       extern  int  (*base_trap_handler)(struct  trap_state  *state);


Description

       This global variable points to the trap handler function that the default entrypoint code calls to
       handle a trap (see Section 8.8.3). By default, this variable is null, indicating that there is no trap
       handler; if a trap occurs and there is no trap handler, the entrypoint code will simply dump the
       register state to the console and panic.  The client OS can set this variable to point to its own
       trap handler function, or to an alternative trap handler supplied by the OS toolkit, such as the
       remote GDB debugging trap handler, gdb_trap (see Section 8.14.5).


Parameters

       state:   A pointer to the trap state structure describing the saved state of the trapping code.


Returns

       The trap handler returns zero (success) to resume execution, or nonzero (failure) to cause the
       entrypoint code to dump the register state and panic the kernel.


8.8.5       trap_dump:  dump a saved trap state structure


Synopsis

       #include  <flux/x86/base_trap.h>

       void trap_dump(const  struct  trap_state *state);


Description

       This function dumps the contents of the specified trap state frame to the console using the printf
       function, in a simple human-readable form. The function is smart enough to determine whether
       the trap occurred from supervisor mode, user mode, or v86 mode, and interpret the saved state
       accordingly.  For example, for traps from rings 1-3 or from v86 mode, the original stack
       pointer is part of the saved state frame; however, for traps from ring 0, the original stack pointer
       is simply the end of the stack frame pushed by the processor, since no stack switch occurs in this
       case.

       In addition, for traps from ring 0, this routine also provides a hex dump of the top of the kernel
       stack as it appeared when the trap occurred; this stack dump can aid in tracking down the cause
       of a kernel bug. trap_dump does not attempt to dump the stack for traps from user or v86 mode,
       because there seems to be no sufficiently generic way for it to access the appropriate user stack;
       in addition, in this case the trap might have been caused by a user-stack-related exception, in
       which case attempting to dump the user stack could lead to a recursive trap.


Parameters

       state:   A pointer to the trap state structure to dump.


Dependencies

       struct  trap_state:         8.8.1

       printf:     6.5


8.8.6       trap_dump_panic:  dump a saved trap state structure and panic


Synopsis

       #include  <flux/x86/base_trap.h>

       void trap_dump_panic(const  struct  trap_state *state);


Description

       This function simply calls trap_dump (Section 8.8.5) to dump the specified trap state frame, and
       then calls panic (Section 6.6.3).  It is invoked by the default trap entrypoint code (Section 8.8.3)
       if a trap occurs when there is no trap handler, or if there is a trap handler but it returns a
       failure indication.


Dependencies

       trap_dump:      8.8.5

       panic:     6.6.3


8.9         x86 Base Environment: Page Translation


XXX diagram of function call tree?
    XXX Although a "base" x86 paging environment is defined, it is not automatically initialized by base_cpu_init,
and paging is not activated by base_cpu_load.  This is because unlike segmentation, paging is an optional
feature on the x86 architecture, and many simple "kernels" such as boot loaders would prefer to ignore it
completely.  Therefore, client kernels that do want the base paging environment must call the functions to
initialize and activate it manually, after the basic CPU segmentation environment is set up.
    XXX describe assumptions made about use of page tables, e.g.  4MB pages whenever possible, always
modify/unmap _exactly_ the region that was mapped.
    XXX assumes that mappings are only changed or unmapped with the same size and offset as the original
mapping.
    XXX does not attempt to support page table sharing in any way, since this code has no clue about the
relationship between address spaces; it only knows about page directories and page tables.


8.9.1       base_paging_init:  create minimal kernel page tables and enable paging


Synopsis

       #include  <flux/x86/base_paging.h>

       void base_paging_init(void);


Description

       This function can be used to set up a minimal paging environment.  It first allocates and clears
       an initial page directory using ptab_alloc (see Section 8.9.7), sets base_pdir_pa to point to it
       (see Section 8.9.2), then direct-maps all known physical memory into this address space starting
       at linear address 0, allocating additional page tables as needed. Finally, this function enables the
       processor's paging mechanism, using the base page directory as the initial page directory.

       The global variable phys_mem_max (see Section ?? ) is assumed to indicate the top of physical
       memory; all memory from 0 up to at least this address is mapped. The function actually rounds
       phys_mem_max up to the next 4MB superpage boundary, so that on Pentium and higher processors,
       all physical memory can be mapped using 4MB superpages even if known physical memory does
       not end exactly on a 4MB boundary. Note that phys_mem_max does not necessarily need to reflect
       all physical memory in the machine; for example, it is perfectly reasonable for the client OS to
       set it to some artificially lower value so that only that part of physical memory is direct-mapped.

       On Pentium and higher processors, this function sets the PSE (page size extensions) bit in CR4
       in addition to the PG (paging) bit, so that the 4MB page mappings used to map physical memory
       will work properly.


Dependencies

       base_pdir_pa:       8.9.2

       ptab_alloc:       8.9.7

       pdir_map_range:        8.9.11

       base_cpuid:       8.6.6

       paging_enable:        8.5.7


8.9.2       base_pdir_pa:  initial kernel page directory


Synopsis

       #include  <flux/x86/base_paging.h>

       extern  vm_offset_t  base_pdir_pa;


Description

       This variable is initialized by base_paging_init (see Section 8.9.1) to contain the physical address
       of the base page directory.  This is the value that should be loaded into the processor's page
       directory base register (CR3) in order to run in the linear address space defined by this page
       directory.  (The base page directory is automatically activated in this way during initialization;
       the client OS only needs to load the CR3 register itself if it wants to switch among multiple linear
       address spaces.) The pdir_find_pde function (Section 8.9.3) and other related functions can be
       used to manipulate the page directory and its associated page tables.

       Initially, the base page directory and its page tables directly map physical memory starting at
       linear address 0. The client OS is free to change the mappings after initialization, for example by
       adding new mappings outside of the physical address range, or by relocating the physical memory
       mappings to a different location in the linear address space as described in Section 8.6.3.

       Most "real" operating systems will need to create other, separate page directories and associated
       page tables to represent different address spaces or protection domains. However, the base page
       directory may still be useful, e.g., as a template for initializing the common kernel portion of
       other page directories, or as a "kernel-only" address space for use by kernel tasks, etc.


8.9.3       pdir_find_pde:  find an entry in a page directory given a linear address


Synopsis

       #include  <flux/x86/base_paging.h>

       pd_entry_t  *pdir_find_pde(vm_offset_t pdir_pa, vm_offset_t la);


Description

       This primitive macro uses the appropriate bits in linear address la (bits 22-31) to look up a
       particular entry in the specified page directory.  Note that the macro takes the physical address
       of a page directory, but returns a kernel virtual address (i.e., an ordinary pointer to the selected
       page directory entry).


Parameters

       pdir_pa:    Physical address of the page directory.

       la:   Linear address to be used to select a page directory entry.


Returns

       Returns a pointer to the selected page directory entry.


Dependencies

       phystokv:      8.6.2


8.9.4       ptab_find_pte:  find an entry in a page table given a linear address


Synopsis

       #include  <flux/x86/base_paging.h>

       pt_entry_t  *ptab_find_pte(vm_offset_t ptab_pa, vm_offset_t la);


Description

       This  macro  uses  the  appropriate  bits  in  la  (bits  12-21)  to  look  up  a  particular  entry  in  the
       specified page table. This macro is just like pdir_find_pde, except that it selects an entry based
       on the page table index bits in the linear address rather than the page directory index bits (bits
       22-31).  Note that this function takes the physical address of a page table, but returns a kernel
       virtual address (an ordinary pointer).


Parameters

       ptab_pa:    Physical address of the page table.

       la:   Linear address to be used to select a page table entry.


Returns

       Returns a pointer to the selected page table entry.


Dependencies

       phystokv:      8.6.2


8.9.5       pdir_find_pte:  look up a page table entry from a page directory


Synopsis

       #include  <flux/x86/base_paging.h>

       pt_entry_t  *pdir_find_pte(vm_offset_t pdir_pa, vm_offset_t la);


Description

       This function is a combination of pdir_find_pde and ptab_find_pte:  it descends through both
       levels  of  the  x86  page  table  hierarchy  and  finds  the  page  table  entry  for  the  specified  linear
       address.

       This function assumes that if the page directory entry selected by bits 22-31 of la is valid (the
       INTEL_PDE_VALID bit is set), then that entry actually refers to a page table, and is not a 4MB
       page mapping. The caller must ensure that this is the case.


Parameters

       pdir_pa:    Physical address of the page directory.

       la:   Linear address to use to select the appropriate page directory and page table entries.


Returns

       Returns a pointer to the selected page table entry, or NULL if there is no page table for this
       linear address.


Dependencies

       pdir_find_pde:        8.9.3

       ptab_find_pte:        8.9.4


8.9.6       pdir_get_pte:  retrieve the contents of a page table entry


Synopsis

       #include  <flux/x86/base_paging.h>

       pt_entry_t pdir_get_pte(vm_offset_t pdir_pa, vm_offset_t la);


Description

       This function is a simple extension of pdir_find_pte:  instead of returning the address of the
       selected page table entry, it returns the contents of the page table entry:  i.e., the physical page
       frame in bits 12-31 and the associated INTEL_PTE_* flags in bits 0-11.  If there is no page table
       in the page directory for the specified linear address, then this function returns 0, the same as if
       there was a page table but the selected page table entry was zero (invalid).

       As with pdir_find_pte, this function assumes that if the page directory entry selected by bits
       22-31 of la is valid (the INTEL_PDE_VALID bit is set), then that entry actually refers to a page
       table, and is not a 4MB page mapping.


Parameters

       pdir_pa:    Physical address of the page directory.

       la:   Linear address to use to select the appropriate page directory and page table entries.


Returns

       Returns the selected page table entry, or zero if there is no page table for this linear address.
       Also returns zero if the selected page table entry exists but is zero.


Dependencies

       pdir_find_pte:        8.9.5


8.9.7       ptab_alloc:  allocate a page table page and clear it to zero


Synopsis

       #include  <flux/x86/base_paging.h>

       int ptab_alloc([out] vm_offset_t *out_ptab_pa);


Description

       All  of  the  following  page  mapping  routines  call  this  function  to  allocate  new  page  tables  as
       needed to create page mappings. It attempts to allocate a single page of physical memory, and if
       successful, returns 0 with the physical address of that page in *out_ptab_pa. The newly allocated
       page is cleared to all zeros by this function. If this function is unsuccessful, it returns nonzero.

       The default implementation of this function assumes that the OS toolkit's minimal C library
       (libmc) and list-based memory manager (liblmm) are being used to manage physical memory,
       and allocates page table pages from the malloc_lmm memory pool (see Section 6.4.1).  However,
       in more complete OS environments, e.g., in which low physical memory conditions should trigger
       a page-out rather than failing immediately, this routine can be overridden to provide the desired
       behavior.


Parameters

       out_ptab_pa:     The address of a variable of type vm_offset_t into which this function will deposit
             the physical address of the allocated page, if the allocation was successful.


Returns

       Returns zero if the allocation was successful, or nonzero on failure.


Dependencies

       lmm_alloc_page:        2.6.8

       malloc_lmm:       6.4.1

       memset:     6.3.12

       kvtophys:      8.6.2


8.9.8       ptab_free:  free a page table allocated using ptab_alloc


Synopsis

       #include  <flux/x86/base_paging.h>

       void ptab_free(vm_offset_t ptab_pa);


Description

       The page mapping and unmapping functions described in the following sections call this routine
       to free a page table that is no longer needed; thus, this function is the partner of ptab_alloc (see
       Section 8.9.7).  The default implementation again assumes that the malloc_lmm memory pool is
       being used to manage physical memory. If the client OS overrides ptab_alloc to use a different
       allocation mechanism, it should also override ptab_free correspondingly.


Parameters

       ptab_pa:    The physical address of the page table page to free.


Dependencies

       lmm_free_page:        2.6.10


8.9.9       pdir_map_page:  map a 4KB page into a linear address space


Synopsis

       #include  <flux/x86/base_paging.h>

       int pdir_map_page(vm_offset_t pdir_pa, vm_offset_t la, pt_entry_t mapping);


Description

       This  function  creates  a  single  4KB  page  mapping  in  the  linear  address  space  represented  by
       the  specified  page  directory.   If  the  page  table  covering  the  specified  linear  address  does  not
       exist (i.e., the selected page directory entry is invalid), then a new page table is allocated using
       ptab_alloc and inserted into the page directory before the actual page mapping is inserted into
       the page table. Any new page tables created by this function are mapped into the page directory
       with permissions INTEL_PTE_USER | INTEL_PTE_WRITE: full permissions are granted at the page
       directory level, although the specified mapping value, which is inserted into the selected page
       table entry, may restrict permissions at the individual page granularity.

       This function assumes that if the page directory entry selected by bits 22-31 of la is valid (the
       INTEL_PDE_VALID bit is set), then that entry actually refers to a page table, and is not a 4MB
       page mapping.  In other words, the caller should not attempt to create a 4KB page mapping
       in a part of the linear address space already covered by a valid 4MB superpage mapping.  The
       caller must first unmap the 4MB superpage mapping, then map the 4KB page (which will cause
       a page table to be allocated).  If the caller follows the guidelines described in Section 8.9, then
       this requirement should not be a problem.


Parameters

       pdir_pa:    Physical address of the page directory acting as the root of the linear address space in
             which to make the requested page mapping.

       la:   Linear address at which to make the mapping. Only bits 12-31 are relevant to this function;
             bits 0-11 are ignored.

       mapping:      Contains the page table entry value to insert into the appropriate page table entry:
             the page frame number is in bits 12-31, and the INTEL_PTE_* flags are in bits 0-11.  The
             caller must include INTEL_PTE_VALID; other flags may be set according to the desired
             behavior.  (To unmap pages, use pdir_unmap_page instead; see Section 8.9.10.)


Returns

       If all goes well and the mapping is successful, this function returns zero.  If this function needed
       to allocate a new page table but the ptab_alloc function failed (returned nonzero), then this
       function passes back the return value from ptab_alloc.


Dependencies

       pdir_find_pde:        8.9.3

       ptab_find_pte:        8.9.4

       ptab_alloc:       8.9.7


8.9.10        pdir_unmap_page:  unmap a single 4KB page mapping


XXX not implemented yet


8.9.11        pdir_map_range:  map a contiguous range of physical addresses


Synopsis

       #include  <flux/x86/base_paging.h>

       int pdir_map_range(vm_offset_t pdir_pa, vm_offset_t la, vm_offset_t pa, vm_size_t size,
       pt_entry_t mapping_bits);


Description

       This function maps a range of linear addresses in the linear address space represented by the spec-
       ified page directory onto a contiguous range of physical addresses.  The linear (source) address,
       physical (destination) address, and mapping size must be multiples of the 4KB architectural page
       size, but other than that no restrictions are imposed on the location or size of the mapping range.
       If the processor description in the global base_cpuid variable (see Section 8.6.6) indicates that
       page size extensions are available, and the physical and linear addresses are properly aligned,
       then this function maps as much of the range as possible using 4MB superpage mappings instead
       of 4KB page mappings. Where 4KB page mappings are needed, this function allocates new page
       tables as necessary using ptab_alloc. Any new page tables created by this function are mapped
       into the page directory with permissions INTEL_PTE_USER | INTEL_PTE_WRITE: full permissions
       are granted at the page directory level, although the mapping_bits may specify more restricted
       permissions for the actual page mappings.

       This function assumes that no valid mappings already exist in the specified linear address range;
       if any mappings do exist, this function may not work properly. If the caller follows the guidelines
       described in Section 8.9, always unmapping previous mappings before creating new ones, then
       this requirement should not be a problem.


Parameters

       pdir_pa:    Physical address of the page directory acting as the root of the linear address space in
             which to make the requested mapping.

       la:   Starting linear address at which to make the mapping. Must be page-aligned.

       pa:   Starting physical address to map to. Must be page-aligned.

       size:   Size of the linear-to-physical mapping to create. Must be page-aligned.

       mapping_bits:      Permission bits to OR into each page or superpage mapping entry.  The caller
             must include INTEL_PTE_VALID; other flags may be set according to the desired behavior.
             (To unmap ranges, use pdir_unmap_range instead; see Section 8.9.13)


Returns

       If all goes well and the mapping is successful, this function returns zero.  If this function needed
       to allocate a new page table but the ptab_alloc function failed (returned nonzero), then this
       function passes back the return value from ptab_alloc.


Dependencies

       pdir_find_pde:        8.9.3

       ptab_find_pte:        8.9.4

       ptab_alloc:       8.9.7

       base_cpuid:       8.6.6


8.9.12        pdir_prot_range:  change the permissions on a mapped memory range


Synopsis

       #include  <flux/x86/base_paging.h>

       void pdir_prot_range(vm_offset_t pdir_pa, vm_offset_t la, vm_size_t size,
       pt_entry_t new_mapping_bits);


Description

       This function can be used to modify the permissions and other attribute bits associated with a
       mapping range previously created with pdir_map_range.  The la and size parameters must be
       exactly the same as those passed to the pdir_map_range used to create the mapping.


Parameters

       pdir_pa:    Physical address of the page directory acting as the root of the linear address space
             containing the mapping to modify.

       la:   Starting linear address of the mapping to modify. Must be exactly the same as the address
             specified to the pdir_map_range call used to create this mapping.

       size:   Size of the mapping to modify.  Must be exactly the same as the size specified to the
             pdir_map_range call used to create this mapping.

       new_mapping_bits:        New permission flags to insert into each page or superpage mapping entry.
             The caller must include INTEL_PTE_VALID; other flags may be set according to the desired
             behavior. (To unmap ranges, use pdir_unmap_range; see Section 8.9.13)


Dependencies

       pdir_find_pde:        8.9.3

       ptab_find_pte:        8.9.4


8.9.13        pdir_unmap_range:  remove a mapped range of linear addresses


Synopsis

       #include  <flux/x86/base_paging.h>

       void pdir_unmap_range(vm_offset_t pdir_pa, vm_offset_t la, vm_size_t size);


Description

       This function removes a mapping range previously created using pdir_map_range.  The la and
       size parameters must be exactly the same as those passed to the pdir_map_range used to create
       the mapping.


Parameters

       pdir_pa:    Physical address of the page directory acting as the root of the linear address space
             containing the mapping to destroy.

       la:   Starting linear address of the mapping to destroy. Must be exactly the same as the address
             specified to the pdir_map_range call used to create this mapping.

       size:   Size of the mapping to destroy.  Must be exactly the same as the size specified to the
             pdir_map_range call used to create this mapping.


Dependencies

       pdir_find_pde:        8.9.3

       ptab_find_pte:        8.9.4


8.9.14        pdir_dump:  dump the contents of a page directory and all its page tables


Synopsis

       #include  <flux/x86/base_paging.h>

       void pdir_dump(vm_offset_t pdir_pa);


Description

       This function is primarily intended for debugging purposes:  it dumps the mappings described
       by the specified page directory and all associated page tables in a reasonably compact, human-
       readable form, using printf. 4MB superpage as well as 4KB page mappings are handled properly,
       and contiguous ranges of identical mappings referring to successive physical pages or superpages
       are  collapsed  into  a  single  line  for  display  purposes.   The  permissions  and  other  page  direc-
       tory/page table entry flags are expanded out as human-readable flag names.


Parameters

       pdir_pa:    Physical address of the page directory describing the linear address space to dump.


Dependencies

       ptab_dump:      8.9.15

       printf:     6.5

       phystokv:      8.6.2


8.9.15        ptab_dump:  dump the contents of a page table


Synopsis

       #include  <flux/x86/base_paging.h>

       void ptab_dump(vm_offset_t ptab_pa, vm_offset_t base_la);


Description

       This is primarily a helper function for pdir_dump, but it can also be used independently, to dump
       the contents of an individual page table. For output purposes, the page table is assumed to reside
       at base_la in "some" linear address space:  in other words, this parameter provides the topmost
       ten bits in the linear addresses dumped by this routine. Contiguous ranges of identical mappings
       referring to successive physical pages are collapsed into a single line for display purposes.  The
       permissions and other page directory/page table entry flags are expanded out as human-readable
       flag names.


Parameters

       ptab_pa:    Physical address of the page table to dump.

       base_la:    Linear address at which this page table resides, for purposes of displaying linear source
             addresses. Must be 4MB aligned.


Dependencies

       printf:     6.5

       phystokv:      8.6.2


8.10          x86 PC Base Environment: I/O Device Support


XXX implemented, but undocumented


8.10.1        base_irq.h:  Hardware interrupt definitions for standard PCs


Description

       XXX see also irq_list.h


XXX implemented, but undocumented


8.10.2        phys_lmm.h:  Physical memory management for PCs


XXX implemented, but undocumented


8.10.3        direct_cons.h:  Direct video console


XXX implemented, but undocumented


8.10.4        com_cons.h:  Polling serial (COM) port console


XXX implemented, but undocumented


8.11          x86 PC MultiBoot Startup



MultiBoot is a standardized interface between boot loaders and 32-bit operating systems on x86 PC plat-
forms.  It addresses the traditional problem that each operating system tends to come with its own boot
loader, or set of boot loaders, completely incompatible with those written for any other operating system.
The MultiBoot standard allows any MultiBoot-compliant operating system
to be loaded from any MultiBoot-supporting boot loader.  MultiBoot is also designed to provide advanced
features needed by many modern operating systems, such as direct 32-bit protected-mode startup, and sup-
port for boot modules, which are arbitrary files loaded by the boot loader into physical memory along with
the kernel and passed to the kernel on startup.  These boot modules may be dynamically loadable device
drivers, application program executables, files on an initial file system, or anything else the OS may need
before it has full device access.  The MultiBoot standard is already supported by several boot loaders and
operating systems, and is gradually becoming more widespread.  For more information on the MultiBoot
standard, including the latest specification, see ftp://flux.cs.utah.edu/flux/multiboot.
    The MultiBoot standard is separate from and largely independent of the Flux OS toolkit.  However,
if MultiBoot is used, the toolkit can leverage it to provide a powerful, flexible, and extremely convenient
method of booting custom operating systems that use the OS toolkit.  The toolkit provides startup code
which allows MultiBoot-compliant OS kernels to be built easily, and which handles the details of finding and
managing physical memory on startup, interpreting the command line passed by the boot loader, finding and
using boot modules, etc.  If you use the OS toolkit's MultiBoot startup support, your kernel automatically
inherits a complete, full-featured 32-bit protected-mode startup environment and the ability to use various
existing boot loaders, without being constrained by the limitations of traditional OS-specific boot loaders.



8.11.1        Startup code organization

The MultiBoot startup code in the OS toolkit has two components.  The first component is contained in
the object file multiboot.o, installed by the toolkit in the prefix/lib/fluxcrt0/ directory.  This object
file contains the actual MultiBoot header and entrypoint; it must be linked into the kernel as the very first
object file, so that its contents will be at the very beginning of the resulting executable.  (This object file
takes the place of the crt0.o or crt1.o normally used when linking ordinary applications in a Unix-like
system.) The second component is contained in the libkern.a library; it contains the rest of the MultiBoot
startup code as well as various utility routines for the use of the client OS.
    XXX diagram of MultiBoot kernel executable image
    The toolkit's MultiBoot startup code will work when using either ELF or a.out format. ELF is the format
recommended for kernel images by the MultiBoot standard; however, the a.out format is also supported
through the use of some special header information embedded in the multiboot.o code linked at the very
beginning of the kernel's text segment. This information allows the MultiBoot boot loader to determine the
location and sizes of the kernel's text, data, and bss sections in the kernel executable without knowing the
details of the particular a.out flavor in use (e.g., Linux, NetBSD, FreeBSD, Mach, VSTa, etc.), all of which
are otherwise mutually incompatible.



8.11.2        Startup sequence

After the MultiBoot boot loader loads the kernel executable image, it searches through the beginning of
the image for the MultiBoot header which provides important information about the OS being loaded. The
boot loader performs its activities, then shuts itself down and jumps to the OS kernel entrypoint defined in
the kernel's MultiBoot header.  In one processor register the boot loader passes to the kernel the address of
a MultiBoot information structure, containing various information passed from the boot loader to the OS,
organized in a standardized format defined by the MultiBoot specification.
    In the OS toolkit's MultiBoot startup code, the kernel entrypoint is a short code fragment in multiboot.o
which sets up the initial stack and performs other minimal initialization so that ordinary 32-bit C code can
be run safely. This code fragment then calls the C function multiboot_main, with a pointer to the MultiBoot
information structure as its argument.  Normally, the multiboot_main function comes from libkern.a; it


performs other high-level initialization to create a convenient, stable 32-bit environment, and then calls the

familiar main routine, which the client OS must provide.



8.11.3        Memory model

Once the OS kernel receives control in its main routine, the processor has been set up in the base environment
defined earlier in Section 8.6. The base_gdt, base_idt, and base_tss have been set up and activated, so that
segmentation operations work and traps can be handled.  Paging is disabled, and all kernel code and data
segment descriptors are set up with an offset of zero, so that virtual addresses, linear addresses, and physical
addresses are all the same. The client OS is free to change this memory layout later, e.g., by enabling paging
and reorganizing the linear address space as described in Section 8.6.3.
    As part of the initialization performed by multiboot_main,  the OS toolkit's MultiBoot startup code
uses  information  passed  to  the  OS  by  the  boot  loader,  describing  the  location  and  amount  of  physical
memory available, to set up the malloc_lmm memory pool (see Section 6.4.1).  This allows the OS kernel
to allocate and manage physical memory using the normal C-language memory allocation mechanisms, as
well as directly using the underlying LMM memory manager library functions. The physical memory placed
on the malloc_lmm pool during initialization is guaranteed not to contain any of the data structures passed
by the boot loader which the OS may need to use, such as the command line or the boot modules;  this
way, the kernel can freely allocate and use memory right from the start without worrying about accidentally
"stepping on" boot loader data that it will need to access later on. In addition, the physical memory placed
on the malloc_lmm is divided into the three separate regions defined in phys_lmm.h (see Section 8.10.2): one
for low memory below 1MB, one for "DMA" memory below 16MB, and one for all physical memory above
this line.  This division allows the kernel to allocate "special" memory when needed for device access or for
calls to real-mode BIOS routines, simply by specifying the appropriate flags in the LMM allocation calls.
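The three-way division can be illustrated with a simple address classifier. This is only a sketch of the boundaries involved; the actual division is performed by phys_lmm_add as described in Section 8.10.2:

```c
#include <stdint.h>

/* Illustration of the three-region split of PC physical memory
   (a sketch; the real work is done by phys_lmm_add). */
enum phys_region { REGION_LOW, REGION_DMA, REGION_HIGH };

static enum phys_region classify_phys(uint32_t pa)
{
    if (pa < 0x100000)      /* below 1MB: real-mode/BIOS accessible */
        return REGION_LOW;
    if (pa < 0x1000000)     /* below 16MB: ISA DMA controller accessible */
        return REGION_DMA;
    return REGION_HIGH;     /* everything above 16MB */
}
```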



8.11.4        Command-line arguments

The MultiBoot specification allows an arbitrary ASCII string to be passed from the boot loader to the OS
as a "command line" for the OS to interpret as it sees fit. As passed from the boot loader to the OS, this is a
single null-terminated ASCII string. However, the default MultiBoot initialization code provided by the OS
toolkit performs some preprocessing of the command line before the actual OS receives control in its main
routine. In particular, it parses the single command line string into an array of individual argument strings
so that the arguments can be passed to the OS through the normal C-language argc/argv parameters to
main.  In addition, any command-line arguments containing an equals sign (`=') are added to the environ
array rather than the argv array, effectively providing the OS with a minimal initial environment that can be
specified by the user (through the boot loader) and examined by the OS using the normal getenv mechanism
(see Section 6.7.3).
    Note that this command-line preprocessing mechanism matches the kernel command-line conventions
established  by  Linux,  although  it  provides  more  convenience  and  flexibility  to  the  OS  by  providing  this
information  to  the  OS  through  standard  C-language  facilities,  and  by  not  restricting  the  "environment
variables" to be comma-separated lists of numeric constants, as Linux does.  This mechanism also provides
much more flexibility than traditional BSD/Mach command-line mechanisms, in which the boot loader itself
does most of the command-line parsing, and basically only passes a single fixed "flags" word to the OS.



8.11.5        Linking MultiBoot kernels

Since MultiBoot kernels initially run in physical memory, with paging disabled and segmentation effectively
"neutralized," the kernel must be linked at an address within the range of physical memory present on typical
PCs. Normally the best place to link the kernel is at 0x100000, or 1MB, which is the beginning of extended
memory just beyond the real-mode ROM BIOS. Since the processor is already in 32-bit protected mode
when the MultiBoot boot loader starts the OS, running above the 1MB "boundary" is not a problem.  By
linking at 1MB, the kernel has plenty of "room to grow," having essentially all extended memory available
to it in one contiguous chunk.


    In some cases, it may be preferable to link the kernel at a lower address, below the 1MB boundary, for

example if the kernel needs to run on machines without any extended memory, or if the kernel contains code
that needs to run in real mode.  This is also allowed by the MultiBoot standard.  However, note that the
kernel should generally leave at least the first 0x500 bytes of physical memory untouched, since this area
contains important BIOS data structures that will be needed if the kernel ever makes calls to the BIOS, or
if it wants to glean information about the machine from this area such as hard disk configuration data.


8.11.6        multiboot.h:  Definitions of MultiBoot structures and constants


Synopsis

       #include  <flux/x86/multiboot.h>


Description

       This header file is not specific to the MultiBoot startup code provided by the OS toolkit;  it
       merely contains generic symbolic structure and constant definitions corresponding to the data
       structures specified in the MultiBoot specification. The following C structures are defined:

       struct  multiboot_header:            Defines the MultiBoot header structure which is located near the
             beginning of all MultiBoot-compliant kernel executables.

       struct  multiboot_info:           Defines the general information structure passed from the boot loader
             to the OS when control is passed to the OS.

       struct  multiboot_module:            One of the elements of the multiboot_info structure is an optional
             array of boot modules which the boot loader may provide; each element of the boot module
             array is reflected by this structure.

       struct  multiboot_addr_range:             Another optional component of the multiboot_info structure
             is a pointer to an array of address range descriptors, described by this structure, which define
             the layout of physical memory on the machine. (XXX name mismatch.)

       For more information on these structures and the associated constants,  see the multiboot.h
       header file and the MultiBoot specification.

       XXX should move this to x86/pc/multiboot.h?


8.11.7        boot_info:  MultiBoot information structure


Synopsis

       #include  <flux/x86/pc/base_multiboot.h>

       extern  struct  multiboot_info  boot_info;


Description

       The first thing that multiboot_main does on entry from the minimal startup code in multiboot.o
       is copy the MultiBoot information structure passed by the boot loader into a global variable in
       the kernel's bss segment.  Copying the information structure this way allows it to be accessed
       more conveniently by the kernel, and makes it unnecessary for the memory initialization code
       (base_multiboot_init_mem; see Section 8.11.9) to carefully "step over" the information structure
       when determining what physical memory is available for general use.

       After the OS has received control in its main routine, it is free to examine the boot_info structure
       and use it to locate other data passed by the boot loader, such as the boot modules.  The client
       OS must not attempt to access the original copy of the information structure passed by the boot
       loader, since that copy of the structure may be overwritten as memory is dynamically allocated
       and used.  However, this should not be a problem, since a pointer to the original copy of the
       multiboot_info structure is never even passed to the OS by the MultiBoot startup code; it is
       only accessible to the OS if it overrides the multiboot_main function.
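As a sketch of how a client OS might consume boot_info, the fragment below computes a rough physical memory total. The structure here is a stand-in containing only the fields used; the field names and the flag bit are assumed from the MultiBoot specification (memory sizes are in kilobytes):

```c
#include <stdint.h>

/* Stand-in for the parts of struct multiboot_info used here; field
   names follow the MultiBoot specification. */
struct multiboot_info {
    uint32_t flags;
    uint32_t mem_lower;     /* KB of conventional memory below 1MB */
    uint32_t mem_upper;     /* KB of extended memory above 1MB */
};

#define MULTIBOOT_MEMORY 0x01   /* assumed flag bit: memory fields valid */

/* Sketch: a rough physical memory total derived from boot_info. */
static uint32_t total_mem_kb(const struct multiboot_info *bi)
{
    if (!(bi->flags & MULTIBOOT_MEMORY))
        return 0;
    return bi->mem_lower + bi->mem_upper;
}
```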


8.11.8        multiboot_main:  general MultiBoot initialization


Synopsis

       #include  <flux/x86/pc/base_multiboot.h>

       void multiboot_main(vm_offset_t boot_info_pa);


Description

       This is the first C-language function to run, invoked by the minimal startup code fragment in
       multiboot.o.  The default implementation merely copies the MultiBoot information structure
       passed by the boot loader into the global variable boot_info (see Section 8.11.7), and then calls
       the following routines to set up the base environment and start the OS:

       base_cpu_setup:        Initializes the base GDT, IDT, and TSS, so that the processor's segmentation
             facilities can be used and processor traps can be handled.

       base_multiboot_init_mem:            Finds all physical memory available for general use and adds it to
             the malloc_lmm so that OS code can allocate memory dynamically.

       base_multiboot_init_cmdline:              Performs basic preprocessing on the command line string passed
             by the boot loader, splitting it up into standard C argument and environment variable lists.

       main:    This call is what invokes the actual OS code, using standard C-language startup conven-
             tions.

       exit:    As per C language conventions, if the main routine ever returns, exit is called immedi-
             ately, using the return value from main as the exit code.

       If the client OS does not wish some or all of the above to be performed, it may override the
       multiboot_main function with a version that does what it needs, or, alternatively, it may instead
       override the specific functions of interest called by multiboot_main.


Parameters

       boot_info_pa:     The physical address of the MultiBoot information structure as created and passed
             by the boot loader.


Returns

       This function had better never return.


Dependencies

       phystokv:      8.6.2

       boot_info:      8.11.7

       base_cpu_setup:        8.6.3

       base_multiboot_init_mem:            8.11.9

       base_multiboot_init_cmdline:              8.11.10

       exit:    6.6.1


8.11.9        base_multiboot_init_mem:  physical memory initialization


Synopsis

       #include  <flux/x86/pc/base_multiboot.h>

       void base_multiboot_init_mem(void);


Description

       This function finds all physical memory available for general use and adds it to the malloc_lmm
       pool, as described in Section 8.11.3.  It is normally called automatically during initialization by
       multiboot_main (see Section 8.11.8).

       This function uses the lower and upper memory size fields in the MultiBoot information structure
       to determine the total amount of physical memory available; it then adds all of this memory to
       the malloc_lmm pool except for the following "special" areas:

          o  The first 0x500 bytes of physical memory are left untouched, since this area contains BIOS
             data structures which the OS might want to access (or the BIOS itself, if the OS makes any
             BIOS calls).

          o  The  area  from  0xa0000  to  0x100000  is  the  I/O  and  ROM  area,  and  therefore  does  not
             contain usable physical memory.

          o  The memory occupied by the kernel itself is skipped, so that the kernel will not trash its
             own code, data, or bss.

          o  All  interesting  boot  loader  data  structures,  which  can  be  found  through  the  MultiBoot
             information structure, are skipped, so that the OS can examine them later.  This includes
             the kernel command line, the boot module information array, the boot modules themselves,
             and the strings associated with the boot modules.

       This function uses phys_lmm_init to initialize the malloc_lmm, and phys_lmm_add to add avail-
       able physical memory to it (see Section 8.10.2); as a consequence, this causes the physical memory
       found to be split up automatically according to the three main functional "classes" of PC mem-
       ory:  low 1MB memory accessible to real-mode software,  low 16MB memory accessible to the
       built-in DMA controller, and "all other" memory.  This division allows the OS to allocate "spe-
       cial" memory when needed for device access or for calls to real-mode BIOS routines, simply by
       specifying the appropriate flags in the LMM allocation calls.
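The skip-and-add pattern described above can be sketched as follows. This is an illustration, not the function's actual implementation: add_region stands in for phys_lmm_add, and the reserved ranges are assumed sorted by start address and non-overlapping:

```c
#include <stdint.h>

/* Sketch: add a physical memory range to the pool, except where it
   overlaps a reserved region (kernel image, boot modules, etc.). */
struct range { uint32_t start, end; };

static uint32_t added_total;    /* bytes handed to the (fake) pool */

static void add_region(uint32_t start, uint32_t end)
{
    added_total += end - start; /* stand-in for phys_lmm_add */
}

static void add_avoiding(uint32_t start, uint32_t end,
                         const struct range *skip, int nskip)
{
    uint32_t pos = start;
    for (int i = 0; i < nskip && pos < end; i++) {
        if (skip[i].end <= pos || skip[i].start >= end)
            continue;           /* no overlap with the remaining range */
        if (skip[i].start > pos)
            add_region(pos, skip[i].start);
        if (skip[i].end > pos)
            pos = skip[i].end;  /* resume past the reserved area */
    }
    if (pos < end)
        add_region(pos, end);
}
```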

       XXX currently doesn't use the memory range array.


Dependencies

       phystokv:      8.6.2

       boot_info:      8.11.7

       phys_lmm_init:        8.10.2

       phys_lmm_add:       8.10.2

       strlen:     6.3.12


8.11.10        base_multiboot_init_cmdline:  command-line preprocessing


Synopsis

       #include  <flux/x86/pc/base_multiboot.h>

       void base_multiboot_init_cmdline(void);


Description

       This function breaks up the kernel command line string passed by the boot loader into inde-
       pendent C-language-compatible argument strings.  Option strings are separated by any normal
       whitespace characters (spaces, tabs, newlines, etc.). In addition, strings containing an equals sign
       (`=') are added to the environ array rather than the argv array, effectively providing the OS
       with a minimal initial environment that can be specified by the user (through the boot loader)
       and examined by the OS using the normal getenv mechanism (see Section 6.7.3).

       XXX example.

       XXX currently no quoting support.

       XXX currently just uses "kernel" as argv[0].


Dependencies

       phystokv:      8.6.2

       strlen:     6.3.12

       strtok:     6.3.12

       malloc:     6.4.2

       memcpy:     6.3.12

       panic:     6.6.3


8.11.11        base_multiboot_find:  find a MultiBoot boot module by name


Synopsis

       #include  <flux/x86/pc/base_multiboot.h>

       struct  multiboot_module  *base_multiboot_find(const  char *string);


Description

       This is not an initialization function, but rather a utility function for the use of the client OS.
       Given a particular string, it searches the array of boot modules passed by the boot loader for
       a boot module with a matching string.  This function can be easily used by the OS to locate
       specific boot modules by name.

       If multiple boot modules have matching strings,  then the first one found is returned.  If any
       boot modules have no strings attached (no pun intended), then those boot modules will never be
       "found" by this function, although they can still be found by hunting through the boot module
       array manually.


Parameters

       string:    The string to match against the strings attached to the boot modules.


Returns

       If successful, returns a pointer to the multiboot_module entry matched; from this structure, the
       actual boot module data can be found using the mod_start and mod_end elements, which contain
       the start and ending physical addresses of the boot module data, respectively.

       If no matching boot module can be found, this function returns NULL.
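The matching behavior can be sketched as follows. The structure here is a minimal stand-in: in the real multiboot_module the string field is a physical address that must be translated (hence the phystokv dependency below), whereas this illustration uses a plain pointer:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal stand-in for struct multiboot_module. */
struct multiboot_module {
    uint32_t mod_start;     /* physical start address of module data */
    uint32_t mod_end;       /* physical end address of module data */
    const char *string;     /* boot loader-supplied string, or NULL */
};

/* Sketch of the matching loop: return the first module whose string
   compares equal, skipping modules with no string attached. */
static struct multiboot_module *
find_module(struct multiboot_module *mods, size_t count, const char *string)
{
    for (size_t i = 0; i < count; i++)
        if (mods[i].string && strcmp(mods[i].string, string) == 0)
            return &mods[i];
    return NULL;
}
```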


Dependencies

       phystokv:      8.6.2

       boot_info:      8.11.7

       strcmp:     6.3.12


8.12          x86 PC:  Raw  BIOS  Startup


The BIOS startup code is written and functional but not yet documented or integrated into the oskit source
tree. The source code implementing this functionality can currently be found in the oskit "source" subdirectory
containing unintegrated code, in source/x86/pc/i16 and related directories.


8.13          x86 PC:  DOS  Startup


The DOS startup code is written and functional but not yet documented or integrated into the oskit source
tree.  The source code implementing this functionality can currently be found in the oskit "source" subdirectory
containing unintegrated code, in source/x86/doc/i16 and related directories.


8.14         Remote  Kernel  Debugging  with  GDB


In addition to the libkern functionality described above which is intended to facilitate implementing kernels,
the library also provides complete, easy-to-use functionality to facilitate debugging kernels.  The OS toolkit
does not itself contain a complete kernel debugger (at least, not yet), but it contains extensive support for
remote debugging using GDB, the GNU debugger.  This remote debugging support allows you to run the
debugger on one machine, and run the actual OS kernel being debugged on a different machine.  The two
machines can be of different architectures. A small "debugging stub" is linked into the OS kernel; this piece
of code handles debugging-related traps and interrupts and communicates with the remote debugger, acting
as a "slave" that simply interprets and obeys the debugger's commands.
    This section describes remote debugging in general, applicable to any mechanism for communicating with
the remote kernel (e.g., serial line or ethernet).  The next section (8.15) describes kernel debugging support
specific to the serial line mechanism (currently the only one implemented).
    XXX diagram
    One of the main advantages of remote debugging is that you can use a complete, full-featured source-level
debugger, since it can run on a stable, well-established operating system such as Unix; a debugger running
on the same machine as the kernel being debugged would necessarily have to be much smaller and simpler
because of the lack of a stable underlying OS it can rely on.  Another advantage is that remote debugging
is less invasive:  since most of the debugging code is on a different machine, and the remote debugging stub
linked into the OS is much smaller than even a simple stand-alone debugger, there is much less that can "go
wrong" with the debugging code when Strange Things start to happen due to subtle kernel bugs. The main
disadvantage of remote debugging, of course, is that it requires at least two machines with an appropriate
connection between them.
    The GNU debugger, GDB, supports a variety of remote debugging protocols.  The most common and
well-supported is the serial-line protocol, which operates over an arbitrary serial line (typically null-modem)
connection  operating  at  any  speed  supported  by  the  two  machines  involved.   The  serial-line  debugging
protocol supports a multitude of features such as multiple threads, signals, and data compression. GDB also
supports an Ethernet-based remote debugging protocol and a variety of existing vendor- and OS-specific
protocols.
    The OS toolkit's GDB support has been tested with GDB versions 4.15 and 4.16; version 4.15 or later
is probably required.



8.14.1        Organization of remote GDB support code

The GDB remote debugging support provided by the OS toolkit is broken into two components: the protocol-
independent component and the protocol-specific component.  The protocol-independent component encap-
sulates  all  the  processor  architecture-specific  code  to  handle  processor  traps  and  convert  them  into  the
"signals" understood by GDB, to convert saved state frames to and from GDB's standard representation for
a given architecture, and to perform "safe" memory reads and writes on behalf of the remote user so that
faulting accesses will terminate cleanly without causing recursive traps.
    The protocol-specific component of the toolkit's remote GDB support encapsulates the code necessary
to talk to the remote debugger using the appropriate protocol. Although this code is specific to a particular
protocol, it is architecture-neutral. The OS toolkit currently supports only the standard serial-line protocol,
although support for other protocols is planned (particularly the remote Ethernet debugging protocol) and
should be easy to add.



8.14.2        Using the remote debugging code

If you are using the base environment's default trap handler, then activating the kernel debugger is extremely
easy: it is simply necessary to call an appropriate initialization routine near the beginning of your kernel code;
all subsequent traps that occur will be dispatched to the remote debugger. For example, on a PC, to activate
serial-line debugging over COM1 using default serial parameters, simply make the call `gdb_pc_com_init(1,
0)'. Some example kernels are provided with the OS toolkit that demonstrate how to initialize and use the
remote debugging facilities; see Section 1.4 for more information.


    If  you  want  a  trap  to  occur  immediately  after  initialization  of  the  debugging  mechanism,  to  transfer

control to the remote debugger from the start and give you the opportunity to set breakpoints and such,
simply invoke the gdb_breakpoint macro immediately after the call to initialize the remote debugger (see
Section 8.14.11).
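Putting the two calls together, kernel startup might look like the following sketch. gdb_pc_com_init and gdb_breakpoint are the toolkit's facilities; trivial stand-ins are defined here so the fragment is self-contained:

```c
/* Sketch of enabling serial-line remote debugging at kernel startup.
   These stand-ins replace the toolkit's real routines. */
static int debug_initialized;

static void gdb_pc_com_init(int com_port, void *com_params)
{
    (void)com_port; (void)com_params;
    debug_initialized = 1;  /* the real version hooks the trap handler */
}

#define gdb_breakpoint() ((void)0)  /* the real version traps into GDB */

int kernel_main(void)
{
    gdb_pc_com_init(1, 0);  /* COM1, default serial parameters */
    gdb_breakpoint();       /* stop immediately so breakpoints can be set */

    /* ... normal kernel initialization continues ... */
    return 0;
}
```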
    If your kernel uses its own trap entrypoint mechanisms or its own serial line communication code (e.g.,
"real" interrupt-driven serial device drivers instead of the simple polling code used by default by the toolkit),
then you will have to write a small amount of "glue" code to interface the generic remote debugging support
code in the toolkit with your specific OS mechanisms. However, this glue code should generally be extremely
small and simple, and you can use the default implementations in the OS toolkit as templates to work from
or use as examples.



8.14.3        Debugging address spaces other than the kernel's

Although the OS toolkit's remote debugging support code is most directly and obviously useful for debugging
the OS kernel itself, most of the code does not assume that the kernel is the entity being debugged. In fact,
it is quite straightforward to adapt the mechanism to allow remote debugging of other entities, such as user-
level programs running on top of the kernel. To make the debugging stub operate on a different address space
than the kernel's, it is simply necessary to override the gdb_copyin and gdb_copyout routines with alternate
versions that transfer data to or from the appropriate address space.  Operating systems that support a
notion of user-level address spaces generally have some kind of "copyin" and "copyout" routines anyway to
provide safe access to user address spaces; the replacement gdb_copyin and gdb_copyout routines can call
those standard user space access routines. In addition, the trap handling mechanism may need to be set up
so that only traps occurring in a particular context (e.g., within a particular user process or thread) will be
dispatched to the remote debugger.
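A sketch of such an override is shown below. os_copyin and the flat "fake" address space are purely hypothetical stand-ins for an OS's real user-space access machinery:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t vm_offset_t;  /* stand-ins for the toolkit's types */
typedef size_t    vm_size_t;

/* A fake flat "user address space" starting at FAKE_BASE. */
#define FAKE_BASE 0x1000
static char fake_space[256];

/* Hypothetical OS copyin routine: 0 on success, nonzero if the range
   falls outside the (fake) address space. */
static int os_copyin(vm_offset_t va, void *dest, vm_size_t size)
{
    if (va < FAKE_BASE || va + size > FAKE_BASE + sizeof(fake_space))
        return 1;
    memcpy(dest, fake_space + (va - FAKE_BASE), size);
    return 0;
}

/* The override: the GDB stub's reads now go through the OS's own
   address-space access routine instead of a raw kernel memcpy. */
int gdb_copyin(vm_offset_t src_va, void *dest_buf, vm_size_t size)
{
    return os_copyin(src_va, dest_buf, size);
}
```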


8.14.4        gdb_state:  processor register state frame used by GDB


Synopsis

       #include  <flux/gdb.h>

       struct  gdb_state  {
              /* architecture-specific register fields; the layout matches
                 GDB's register structure for the target architecture */
       };


Description

       This  structure  represents  the  processor  register  state  for  the  target  architecture  in  the  form
       in  which  GDB  expects  it.   GDB  uses  a  standard  internal  data  structure  for  each  processor
       architecture to represent the register state of a program being debugged, and most of GDB's
       architecture-neutral  remote  debugging  protocols  use  this  standard  structure.   The  gdb_state
       structure  defined  by  the  OS  toolkit  is  defined  to  match  GDB's  corresponding  register  state
       structure for each supported architecture.


8.14.5        gdb_trap:  default trap handler for remote GDB debugging


Synopsis

       #include  <flux/gdb.h>

       int gdb_trap(struct  trap_state *trap_state);


Description

       This function is intended to be installed as the kernel trap handler by setting the base_trap_handler
       variable to point to it (see Section 8.8.4), when remote GDB debugging is desired. (Alternatively,
       the client OS can use its own trap handler which chains to gdb_trap when appropriate.)  This
       function converts the contents of the trap_state structure saved by the base trap entrypoint code
       into the gdb_state structure used by GDB. It also converts the architecture-specific processor
       trap vector number into a suitable machine-independent signal number which can be interpreted
       by the remote debugger.

       After  converting  the  register  state  and  trap  vector  appropriately,  this  function  calls  the  ap-
       propriate protocol-specific GDB stub through the gdb_signal function pointer variable (see Sec-
       tion 8.14.9). Finally, it converts the final register state, possibly modified by the remote debugger,
       back into the original trap_state format and returns an appropriate success or failure code as
       described below.

       On architectures that don't provide a way for the kernel to "validate" memory accesses before
       performing them, such as the x86, this function also provides support for "recovering" from fault-
       ing memory accesses during calls to gdb_copyin or gdb_copyout (see Sections 8.14.6 and 8.14.7).
       This is typically implemented using a "recovery pointer" which is set before a "safe" memory
       access and cleared afterwards; gdb_trap checks this recovery pointer, and if set, modifies the trap
       state appropriately and returns from the trap without invoking the protocol-specific GDB stub.

       If the client OS uses its own trap entrypoint code which saves register state in a different format
       when handling traps, then the client OS will also need to override the gdb_trap function with a
       version that understands its custom saved state format.
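Installing the handler is a single assignment; the declarations below are stand-ins for those provided by the toolkit's headers, with a trivial gdb_trap body so the fragment stands alone:

```c
/* Stand-in declarations for the toolkit's trap-handling hooks. */
struct trap_state;                          /* saved register frame */
int (*base_trap_handler)(struct trap_state *ts);

/* Stand-in body; the real gdb_trap comes from libkern.a. */
int gdb_trap(struct trap_state *ts) { (void)ts; return 0; }

/* Installing gdb_trap routes all subsequent traps to the GDB stub. */
void enable_remote_debugging(void)
{
    base_trap_handler = gdb_trap;
}
```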


Parameters

       trap_state:    A pointer to the saved register state representing the processor state at the time the
             trap occurred.  The saved state must be in the default format defined by the OS toolkit's
             base environment.


Returns

       The gdb_trap function returns success (zero) when the remote debugger instructs the local stub
       to resume execution at the place it was stopped and "consume" the trap that caused the debugger
       to be invoked; this is the normal case.

       This function returns failure (nonzero) if the remote debugger passed the same or a different
       signal back to the local GDB stub, instructing the local kernel to handle the trap (signal) itself.
       If the default trap entrypoint mechanism provided by the base environment is in use,  then this
       simply causes the kernel to panic with a register dump,  since the default trap code does not
       know how to "handle" signals by itself.  However, if the client OS uses its own trap entrypoint
       mechanism or interposes its own trap handler over gdb_trap, then it may wish to interpret a
       nonzero return code from gdb_trap as a request for the trap to be handled using the "normal"
       mechanism (e.g., dispatched to the application being debugged).


Dependencies

       trap_state:       8.8.1


       gdb_state:      8.14.4

       gdb_signal:       8.14.9

       gdb_trap_recover:         8.14.8
218                                                   CHAPTER 8.  KERNEL SUPPORT LIBRARY (LIBKERN.A)


8.14.6        gdb_copyin:  safely read data from the subject's address space


Synopsis

       #include  <flux/gdb.h>

       int gdb_copyin(vm_offset_t src_va, void *dest_buf, vm_size_t size);


Description

       The protocol-specific local GDB stub calls this function in order to read data in the address
       space of the program being debugged.  The default implementation of this function provided by
       libkern assumes that the kernel itself is the program being debugged; thus, it acts basically like
       an ordinary memcpy.  However, the client can override this function with a version that accesses
       a  different  address  space,  such  as  a  user  process's  address  space,  in  order  to  support  remote
       debugging of entities other than the kernel.

       If a fault occurs while trying to read the specified data, this function catches the fault cleanly
       and returns an error code rather than allowing a recursive trap to be dispatched to the debugger.
       This way, if the user of the debugger accidentally attempts to follow an invalid pointer or display
       unmapped or nonexistent memory, it will merely cause the debugger to report an error rather
       than making everything go haywire.


Parameters

       src_va:    The virtual address in the address space of the program being debugged (the kernel's
             address space, by default) from which to read data.

       dest_buf:    A pointer to the kernel buffer to copy data into.  This buffer is provided by the
              caller, typically the local GDB stub.

       size:   The number of bytes of data to read into the destination buffer.


Returns

       Returns zero if the transfer completed successfully, or nonzero if some or all of the source region
       is not accessible.


Dependencies

       gdb_trap_recover:         8.14.8


8.14.7        gdb_copyout:  safely write data into the subject's address space


Synopsis

       #include  <flux/gdb.h>

       int gdb_copyout(const void *src_buf, vm_offset_t dest_va, vm_size_t size);


Description

       The protocol-specific local GDB stub calls this function in order to write data into the address
       space of the program being debugged.  The default implementation of this function provided by
       libkern assumes that the kernel itself is the program being debugged; thus, it acts basically like
       an ordinary memcpy.  However, the client can override this function with a version that accesses
       a  different  address  space,  such  as  a  user  process's  address  space,  in  order  to  support  remote
       debugging of entities other than the kernel.

       If a fault occurs while trying to write the specified data, this function catches the fault cleanly
       and returns an error code rather than allowing a recursive trap to be dispatched to the debugger.
       This way, if the user of the debugger accidentally attempts to write to unmapped or nonexistent
       memory, it will merely cause the debugger to report an error rather than making everything go
       haywire.


Parameters

       src_buf:   A pointer to the kernel buffer containing the data to write.

       dest_va:    The virtual address in the address space of the program being debugged (the kernel's
             address space, by default) at which to write the data.

       size:   The number of bytes of data to transfer.


Returns

       Returns zero if the transfer completed successfully, or nonzero if some or all of the destination
       region is not writable.


Dependencies

       gdb_trap_recover:         8.14.8


8.14.8        gdb_trap_recover:  recovery pointer for safe memory transfer routines


8.14.9        gdb_signal:  vector to GDB trap/signal handler routine


Synopsis

       #include  <flux/gdb.h>

       extern  void  (*gdb_signal)(int  *inout_signo,  struct  gdb_state  *inout_gdb_state);


Description

       Before gdb_trap is called for the first time,  this function pointer must be initialized to point
       to an appropriate GDB debugging stub, such as gdb_serial_signal (see Section 8.15.2).  This
       function is called to notify the remote debugger that a relevant processor trap or interrupt has
       occurred,  and to wait for further instructions from the remote debugger.  When the function
       returns, execution will be resumed as described in Section 8.14.5.


Parameters

       inout_signo:     On entry,  the variable referenced by this pointer contains the signal number to
             transmit  to  the  remote  debugger.   On  return,  this  variable  may  have  been  modified  to
             indicate what signal should be dispatched to the program being debugged.  For example, if
             the variable is the same on return as on entry, then it means the remote debugger instructed
              the stub to "pass through" the signal to the application. If *inout_signo is 0 on return from
              this
             function,  it means the remote debugger has "consumed" the signal and execution of the
             subject program should be resumed immediately.

       inout_gdb_state:      On entry, this structure contains a snapshot of the processor state at the time
             the relevant trap or interrupt occurred. On return, the remote debugger may have modified
             this state; the new state should be used when resuming execution.


8.14.10        gdb_set_trace_flag:  enable or disable single-stepping in a state frame


Synopsis

       #include  <flux/gdb.h>

       void gdb_set_trace_flag(int trace_enable, [in/out] struct gdb_state *state);


Description

       This architecture-specific function merely modifies the specified processor state structure to en-
       able or disable single-stepping according to the trace_enable parameter.  On architectures that
       have some kind of trace flag, this function simply sets or clears that flag as appropriate.  On
       other architectures, this behavior is achieved through other means.  This function is called by
       machine-independent remote debugging stubs such as gdb_serial_signal before resuming ex-
       ecution of the subject program, according to whether the remote debugger requested that the
       program "continue" or "step" one instruction.


Parameters

       trace_enable:     True if single-stepping should be enabled, or false otherwise.

       state:   The state frame to modify.


8.14.11        gdb_breakpoint:  macro to generate a manual instruction breakpoint


Synopsis

       #include  <flux/gdb.h>

       void gdb_breakpoint(void);


Description

       This is simply an architecture-specific macro which causes an instruction causing a breakpoint
       trap to be emitted at the corresponding location in the current function. This macro can be used
       to set "manual breakpoints" in program code, as well as to give control to the debugger at the
       very beginning of program execution as described in Section 8.14.2.


8.15         Serial-line  Remote  Debugging  with  GDB


The GDB serial-line debugging protocol is probably the most powerful and commonly-used remote debugging
protocol supported by GDB; this is the only protocol for which the OS toolkit currently has direct support.
The  GDB  serial-line  debugging  stub  supplied  with  the  OS  toolkit  is  fully  architecture-independent,  and
supports most of the major features of the GDB serial-line protocol.
    For technical information on the remote serial-line GDB debugging protocol, or information on how to
run and use the remote debugger itself, consult the appropriate sections of the GDB manual.  This section
merely describes how remote serial-line debugging is supported by the Flux OS toolkit.
    Note that source code for several example serial-line debugging stubs is supplied in the GDB distribution
(gdb/*-stub.c);  in  fact,  this  code  was  used  as  a  template  and  example  for  the  OS  toolkit's  serial-line
debugging stub.  However,  these stubs are highly machine-dependent and make many more assumptions
about how they are used. For example, they assume that they have exclusive control of the processor's trap
vector table, and are therefore only generally usable in an embedded environment where traps are never
supposed to occur during normal operation and therefore all traps can be fielded directly by the debugger.
In contrast, the serial-line debugging stub provided in the Flux OS toolkit is much more generic and cleanly
decomposed, and therefore should be usable in a much wider range of environments.



8.15.1        Redirecting console output to the remote debugger

If  the  machine  on  which  the  kernel  is  being  debugged  is  truly  "remote,"  e.g.,  in  a  different  room  or  a
completely different building, and you don't have easy access to the machine's "real" console, it is possible
to make the kernel use the remote debugger as its "console" for printing status messages and such.  To do
this, simply write your kernel's "console" output functions (e.g., putchar and puts, if you're using the OS
toolkit's minimal C library for console output routines such as printf) so that they call gdb_serial_putchar
and gdb_serial_puts, described in Sections 8.15.4 and 8.15.5, respectively.
    Sadly, this mechanism currently only works for console output:  console input cannot be obtained from
the remote debugger's console because the GDB serial-line debugging protocol does not currently support
it.


8.15.2        gdb_serial_signal:  primary event handler in the GDB stub


Synopsis

       #include  <flux/gdb_serial.h>

       void gdb_serial_signal([in/out] int *signo, [in/out] struct gdb_state *state);


Description

       This is the main trap/signal handler routine in the serial-line debugging stub; it should be called
       whenever a relevant processor trap occurs.  This function notifies the remote debugger about
       the event that caused the processor to stop,  and then waits for instructions from the remote
       debugger.  The remote debugger may then cause the stub to perform various actions, such as
       examine memory, modify the register state, or kill the program being debugged. Eventually, the
       remote debugger will probably instruct the stub to resume execution, in which case this function
       returns with the signal number and trap state modified appropriately.

       If this function receives a "kill" (`k') command from the remote debugger, then it breaks the
       remote debugging connection and then calls panic to reboot the machine.  XXX may not be
       appropriate when debugging a user task; should call an intermediate function.


Parameters

       signo:    On entry, the variable referenced by this pointer contains the signal number to transmit
             to the remote debugger. On return, this variable may have been modified to indicate what
             signal should be dispatched to the program being debugged. For example, if the variable is
             the same on return as on entry, then it means the remote debugger instructed the stub to
             "pass through" the signal to the application. If *signo is 0 on return from this function, it
             means the remote debugger has "consumed" the signal and execution of the subject program
             should be resumed immediately.

       state:   On entry, this structure contains a snapshot of the processor state at the time the relevant
             trap or interrupt occurred.  On return, the remote debugger may have modified this state;
             the new state should be used when resuming execution.


Dependencies

       gdb_serial_send:         8.15.7

       gdb_serial_recv:         8.15.6

       gdb_copyin:       8.14.6

       gdb_copyout:       8.14.7

       gdb_set_trace_flag:         8.14.10

       panic:     6.6.3


8.15.3        gdb_serial_exit:  notify the remote debugger that the subject is dead


Synopsis

       #include  <flux/gdb_serial.h>

       void gdb_serial_exit(int exit_code);


Description

       This function sends a message to the remote debugger indicating that the program being debugged
       is  terminating.   This  message  causes  the  debugger  to  display  an  appropriate  message  on  the
       debugger's console along with the exit_code,  and causes it to break the connection (i.e.,  stop
       listening for further messages on the serial port). If no remote debugging connection is currently
       active, this function does nothing.

       The client OS should typically call this function just before it reboots for any reason, so that the
       debugger does not hang indefinitely waiting for a response from a kernel that is no longer running.
       Alternatively, if the remote debugging facility is being used to debug a user-mode process running
       under the kernel, then this function should be called when that process terminates.

       Note that despite its name, this function does return. It does not by itself cause the machine to
       "exit" or reboot or hang or whatever; it merely notifies the debugger that the subject program
       is about to terminate.


Parameters

       exit_code:    Exit code to pass back to the remote debugger. Typically this value is simply printed
             on the remote debugger's console.


Dependencies

       gdb_serial_send:         8.15.7

       gdb_serial_recv:         8.15.6


8.15.4        gdb_serial_putchar:  output a character to the remote debugger's console


Synopsis

       #include  <flux/gdb_serial.h>

       void gdb_serial_putchar(int ch);


Description

       If a remote debugging connection is currently active, this function sends the specified character
       to the remote debugger in a special "output" (`O') message which causes that character to be
       sent to the debugger's standard output. This allows the serial line used for remote debugging to
       double as a remote serial console, as described in Section 8.15.1.

       Note that using gdb_serial_putchar by itself to print messages can be very inefficient, because
       a separate message is used for each character, and each of these messages must be acknowledged
       by the remote debugger before the next character can be sent.  When possible, it is much faster
       to print strings of text using gdb_serial_puts (see Section 8.15.5).  If you are using the im-
       plementation of printf in the OS toolkit's minimal C library (see Section 6.5), you can make
       this happen automatically by overriding puts with a version that calls gdb_serial_puts directly
       instead of calling putchar successively on each character.

       If this function is called while no remote debugging connection is active, but the gdb_serial_send
       and gdb_serial_recv pointers are initialized to point to serial-line communication functions,
       then this function simply sends the specified character out the serial port using gdb_serial_send.
       This way, if the kernel attempts to print any messages before a connection has been established
       or after the connection has been dropped (e.g., by calling gdb_serial_exit), they won't confuse
       the debugger or cause the kernel to hang as they otherwise would, and they may be seen by the
       remote user if the serial port is being monitored at the time.

       If the gdb_serial_send and gdb_serial_recv pointers are uninitialized (still NULL) when
       this function is called, it does nothing.


Parameters

       ch:   The character to send to the remote debugger's console.


Dependencies

       gdb_serial_send:         8.15.7

       gdb_serial_recv:         8.15.6


8.15.5        gdb_serial_puts:  output a line to the remote debugger's console


Synopsis

       #include  <flux/gdb_serial.h>

       void gdb_serial_puts(const char *s);


Description

       If a remote debugging connection is currently active,  this function sends the specified string,
       followed  by  a  newline  character,  to  the  remote  debugger  in  a  special  "output"  (`O')  message
       which causes the line to be sent to the debugger's standard output.  This allows the serial line
       used for remote debugging to double as a remote serial console, as described in Section 8.15.1.

       If this function is called while no remote debugging connection is active, but the gdb_serial_send
       and gdb_serial_recv pointers are initialized to point to serial-line communication functions,
       then this function simply sends the specified line out the serial port using gdb_serial_send. This
       way, if the kernel attempts to print any messages before a connection has been established or
       after the connection has been dropped (e.g., by calling gdb_serial_exit), they won't confuse
       the debugger or cause the kernel to hang as they otherwise would, and they may be seen by the
       remote user if the serial port is being monitored at the time.

       If the gdb_serial_send and gdb_serial_recv pointers are uninitialized (still NULL) when
       this function is called, it does nothing.


Parameters

       s:    The string to send to the remote debugger's console.  A newline is automatically appended
             to this string.


Dependencies

       gdb_serial_send:         8.15.7

       gdb_serial_recv:         8.15.6


8.15.6        gdb_serial_recv:  vector to GDB serial line receive function


Synopsis

       #include  <flux/gdb_serial.h>

       int  (*gdb_serial_recv)(void);


Description

       Before the remote serial-line debugging stub can be used, this global variable must be initialized to
       point to a function to call to read a character from the serial port. The function should not return
       until a character has been received; the GDB stub has no notion of timeouts or interruptions.

       Calling functions in the GDB serial-line debugging stub before this variable is initialized (i.e.,
       while it is still null) is guaranteed to be harmless.


Returns

       Returns the character received.


8.15.7        gdb_serial_send:  vector to GDB serial line send function


Synopsis

       #include  <flux/gdb_serial.h>

       void  (*gdb_serial_send)(int  ch);


Description

       Before the remote serial-line debugging stub can be used, this global variable must be initialized
       to point to a function to call to send out a character on the serial port.

       Calling functions in the GDB serial-line debugging stub before this variable is initialized (i.e.,
       while it is still null) is guaranteed to be harmless.


Returns

       Returns nothing.


8.15.8        gdb_pc_com_init:  [x86 PC] set up serial-line debugging over a COM port


Synopsis

       #include  <flux/gdb.h>

       void gdb_pc_com_init(int com_port, struct termios *com_params);


Description

       This is a simple "wrapper" function which ties together all of the OS toolkit's remote debugging
       facilities to automatically create a complete remote debugging environment for a specific, typical
       configuration: namely, remote serial-line debugging on a PC through a COM port. This function
       can be used as-is if this configuration happens to suit your purposes, or it can be used as an
       example for setting up the debugging facilities for other configurations.

       Specifically, this function does the following:

          o  Sets  the  base_trap_handler  variable  to  point  to  gdb_trap.   This  establishes  the  GDB
             debugging trap handler as the basic handler used to handle all processor traps.

          o  Sets the gdb_signal variable to point to gdb_serial_signal.  This "connects" the generic
             GDB debugging code to the serial-line debugging stub.

          o  Sets  gdb_serial_recv  to  point  to  com_cons_getchar,  and  gdb_serial_send  to  point  to
             com_cons_putchar.  This connects the serial-line debugging stub to the simple polling PC
             COM-port console code.

          o  Initializes the specified COM port using the specified parameters (baud rate, etc.).

          o  Sets the hardware IRQ vector in the base IDT corresponding to the selected COM port
             to point to an interrupt handler that invokes the remote debugger with a "fake" SIGINT
             trap, and enables the serial port interrupt.  This allows the remote user to interrupt the
             running kernel by pressing CTRL-C on the remote debugger's console, at least if the kernel
             is running with interrupts enabled.


Parameters

       com_port:    The COM port number through which to communicate; must be 1, 2, 3, or 4.

       com_params:       A pointer to a termios structure defining the required serial port communication
             parameters. If this parameter is NULL, the serial port is set up for 9600,8,N,1 by default.


Dependencies

       gdb_trap:      8.14.5

       gdb_signal:       8.14.9

       gdb_serial_signal:         8.15.2

       gdb_serial_recv:         8.15.6

       gdb_serial_send:         8.15.7

       com_cons_init:        8.10.4

       com_cons_getchar:         8.10.4

       com_cons_putchar:         8.10.4

       com_cons_enable_receive_interrupt:                8.10.4

       base_idt:      8.7.4

       termios:      6.5


8.16         Annotations


XXX Implemented, but currently undocumented.
    (Annotations  are  just  "markers"  you  can  place  in  code  or  static  data  which  cause  annotation  tables
to be built up elsewhere containing pointers to the places the markers appear, along with other optional
information, e.g., pointers to rollback routines.)



Chapter  9


Symmetric   Multi   Processing   Library



(libsmp.a)


Author: Kevin T. Van Maren



9.1       Introduction


This library is designed to simplify the startup and use of multiprocessors. It defines a common interface to
multiprocessor machines that is fairly platform independent.



9.2       Supported  Systems


Currently, SMP support is only provided for Intel x86 computers conforming to the Intel Multiprocessor
Specification.



9.2.1       Intel x86

Systems which fully comply with the Intel MultiProcessing Specification should be supported.  Since some of
the code is based on Linux 2.0, some features (such as dual I/O APICs) are not fully supported.
    Additionally, inter-processor interrupts (necessary for TLB flushes) are not yet implemented, although
they will be soon.
    As a rough rule of thumb, if a machine works with Linux 2.0 it will work with the OS Toolkit; if it
doesn't, it probably won't.



9.3       API  reference



9.3.1       smp_initialize:  Initializes the SMP startup code


Synopsis

       #include  <flux/smp.h>

       int smp_initialize(void);


Description

       This function does the initial setup for the SMP support.  It should be called before any other
       libsmp  routines  are  used.   It  identifies  the  processors  and  gets  them  ready  and  waiting  in  a
       busy-loop for a "go" from the boot processor.

       Note that success does not necessarily mean the computer has multiple processors.  Rather,
       failure indicates that the machine does not support multiple processors. smp_get_num_cpus should
       be used to determine the number of CPUs present.

       Don't call this more than once...yet.


Parameters

       void:   This function takes no parameters.


Returns

       Returns 0 on success (an SMP system was found).  E_SMP_NO_CONFIG is returned on non-IMPS-
       compliant x86 machines.


9.3.2       smp_find_cur_cpu:  Return the processor ID of the current processor.


Synopsis

       #include  <flux/smp.h>

       int smp_find_cur_cpu(void);


Description

       This function returns a unique (per-processor) integer representing the current processor. Note
       that the numbers are NOT guaranteed to be sequential or starting from 0, although that may be
       a common case.


Parameters

       void:   This function takes no parameters.


Returns

       The processor's ID.


9.3.3       smp_find_cpu:  Return the next processor ID


Synopsis

       #include  <flux/smp.h>

       int smp_find_cpu(int first);


Description

       Given a number, it returns the first processor ID such that ID >= first.

       The first call should be with 0; subsequent calls should be with (last_cpu + 1).

       This is an iterator function designed to help the client OS determine which processor numbers
       are in use.


Parameters

       first:   The starting processor number.  It should be 0 on the first call, and one more than the
              result of the previous call on subsequent calls.


Returns

       Returns E_SMP_NO_PROC if there are no more processors, otherwise the UID of the next pro-
       cessor.


9.3.4       smp_start_cpu:  Starts a processor running a specified function


Synopsis

       #include  <flux/smp.h>

       void smp_start_cpu(int processor_id, void (*func)(void *data), void *data, void *stack_ptr);


Description

       This releases the processor specified to start running a function with the specified stack.

       Results are undefined if:

    1. the processor indicated does not exist,

    2. a processor attempts to start itself,

    3. any processor is started more than once, or

    4. any of the parameters are invalid.

       smp_find_cur_cpu can be used to prevent calling smp_start_cpu on the current processor.
       This function must be called for each processor readied by smp_initialize;  if a processor is
       not to be used, then func should execute the halt instruction immediately.

       It is up to the user to verify that the processor is started up correctly.


Parameters

       processor_id:    The UID of a processor found by the startup code.

       func:    A function pointer to be called by the processor after it has set up its stack.

       data:    A pointer to some structure that is placed on that stack before func is called.

       stack_ptr:   The stack pointer to be used by the processor.


Returns

       Returns nothing.


9.3.5       smp_get_num_cpus:  Returns the total number of processors


Synopsis

       #include  <flux/smp.h>

       int smp_get_num_cpus(void);


Description

       This returns the number of processors that exist.


Parameters

       void:   This function takes no parameters.


Returns

       The number of processors that have been found.  In a non-SMP system, this will always return
       1.



Chapter  10


Flux   Device   Driver   Framework


The actual interfaces specified in this chapter are preliminary and likely to change as our implementation
evolves.  However, they should at least provide a good idea of the overall design of the framework, and what
OS-specific code will be needed in order to use it in a particular environment.
    Feedback on the interfaces is solicited; send to oskit@jensen.cs.utah.edu.



10.1         Introduction


The Flux device driver framework is a device driver interface specification designed to allow existing device
drivers to be borrowed from well-established operating systems in either source or binary form, and used
unchanged to provide extensive device support in new operating systems or other programs that need device
drivers (e.g., hardware configuration management utilities). With appropriate glue, this framework can also
be used in an existing operating system to augment the drivers already supported by the OS. This chapter
describes  the  device  driver  framework  itself;  the  following  associated  chapters  describe  specific  libraries
provided as part of the Flux OS toolkit that provide driver and kernel code implementing or supporting this
interface.
    The primary goals of this device driver framework are, in decreasing order of importance:

    1. Breadth of hardware coverage. There is a tremendous range of common hardware available these
       days, each typically supporting its own device programming interface and requiring a special device
       driver.   Device  drivers  for  a  given  device  are  generally  only  available  for  a  few  operating  systems,
       depending on how well-established the particular device and OS is. Thus, in order to achieve maximum
       hardware coverage, the framework must be capable of incorporating device drivers originally written
       for a variety of different operating systems.

    2. Adaptability  to  different  environments.  This device driver framework is intended to be useful
       not only in traditional Unix-like kernels, but also in operating systems with widely different structures,
       e.g., kernels written in a "stackless" interrupt model, or kernels that run all device drivers as user mode
       programs, or kernels that do not support virtual memory.

    3. Ease-of-use. It should be reasonably easy for an OS developer to add support for this framework to a
       new or existing OS. The set of support functions the OS developer must supply should be kept as small
       and simple as possible, and there should be few "hidden surprises" lurking in the drivers. In situations
       where existing device drivers supported by this toolkit have special requirements that the OS must
       satisfy in order to use them, these requirements are clearly documented in the relevant chapters.

    4. Performance.   In  spite  of  the  above  constraints,  device  drivers  should  be  able  to  run  under  this
       framework with as little unnecessary overhead as possible. Performance issues are discussed further in
       Section 10.5.

    Since the most important goal of this framework is to achieve wide hardware coverage by making use
of existing drivers, and not to define a new model or interface for writing drivers, it is somewhat more
demanding and restrictive in terms of OS support than would be ideal if we were writing entirely new device
drivers from scratch.  Other device driver interface standards, such as DDI/DKI and UDI, are not designed
to allow easy adaptation of existing drivers; instead, they are intended to define and restrict the interfaces
and environment used by new drivers specially written for those interfaces, so that these new drivers will
be as widely useful as possible.  For example, UDI requires all conforming drivers to be implemented in a
nonblocking interrupt model; this theoretically allows UDI drivers to run easily in either process-model or
interrupt-model kernels, but at the same time it eliminates all possibility of adapting existing traditional
process-model drivers to be UDI conformant without extensive changes to the drivers themselves. Hopefully,
at some point in the future, one of these more generic device driver standards will become commonplace
enough so that conforming device drivers are available for "everything"; however, until then, the Flux device
driver framework takes a compromise approach, being designed to allow easy adaptation of a wide range of
existing drivers while keeping the primary interface as simple and flexible as possible.



10.1.1        Full versus partial compliance

Because  the  range  of  existing  drivers  to  be  adopted  under  this  framework  is  so  diverse  in  terms  of  the
assumptions and restrictions made by the drivers,  it would be impractical to define the requirements of
the framework as a whole to be the "union" of all the requirements of all possible drivers.  For example,
if we had taken that approach, then the framework would only be usable in kernels in which all physical
memory is directly mapped into the kernel's virtual address space at identical addresses, because some drivers
will not work unless that is the case.  This restriction would make the framework completely unusable in
many  common  OS  environments,  even  though  there  are  plenty  of  drivers  available  that  don't  make  the
virtual = physical assumption and should work fine in OS environments that don't meet that requirement.
    For this reason, we have defined the framework itself to be somewhat more generic than is suitable for
"all" existing drivers, and to account for the remaining "problematic" drivers, we make a distinction between
full  and  partial  compliance.   A  fully  compliant  driver  is  a  driver  that  makes  no  additional  assumptions
or requirements beyond those defined as part of the basic driver framework;  these drivers should run in
any environment that supports the framework.  A partially compliant driver is a driver that is compliant
with the framework, except that it makes one or more additional restrictions or requirements, such as the
virtual = physical requirement mentioned above. For each partially-compliant driver provided with the OS
toolkit, the exact set of additional restrictions made by the driver is clearly documented and provided in
both human- and machine-readable form so that a given OS environment can make use of the framework as
a whole while avoiding drivers that will not work in the environment it provides.
10.2         Organization


Figure 10.1 illustrates the basic organization of the device driver framework in a typical OS environment
in which all device drivers run in the kernel.
    The heavy black horizontal lines represent the actual interfaces comprising the framework,  which are
described in this chapter.  There are two primary interfaces:  the device  driver  interface (or just "driver
interface"), which the OS kernel uses to invoke the device drivers; and the driver-kernel interface (or just
"kernel interface"), which the device drivers use to invoke kernel support functions.  The kernel implements
the kernel interface and uses the driver interface;  the drivers implement the driver interface and use the
kernel interface.
    Chapter 11, immediately following this chapter, describes a library supplied as part of the OS toolkit
that provides facilities to help the OS implement the kernel interface and use the driver interface effectively.
Default implementations suitable in specific, typical environments are provided for many operations; the OS
can use these default implementations or not, as the situation demands.
    The following chapters describe the specific device driver sets supplied with the OS toolkit for use in
environments supporting the Flux device driver framework. Since the Flux project is not in the driver writing
business, and does not wish to be, these driver sets are derived from existing kernels, either unchanged or
with as little code modified as possible so that the versions of the drivers in the OS toolkit can easily be
kept up-to-date with the original source bases from which they are derived.




                 Figure 10.1: Organization of Flux Device Driver Framework in a typical kernel


10.3         Driver  Sets


Up to this point we have used the term "device driver set" fairly loosely; however, in the context of the Flux
device driver framework, this term has a very important, specific meaning.  A driver set is a set of related
device drivers that work together and are fairly tightly integrated.  Different driver sets running in
a given environment are independent of each other and oblivious to each other's presence. Drivers within a
set may share code and data structures internally in arbitrary ways; however, code in different driver sets
may not directly share data structures. (Different driver sets may share code, but only if that code is "pure"
or operates on a disjoint set of data structures.)
    Of course, the surrounding OS can maintain shared data structures in whatever way it chooses; this is the
only way drivers in different sets can interact with each other.  For example, if a kernel is using a FreeBSD
device driver to drive one network card and a Linux driver to drive another, then the kernel can take IP
packets coming in on one card and route them out through the other card, but the network device drivers
themselves are completely oblivious to each other's presence.
    Some driver sets may contain only a single driver; this is ideal for modularity purposes, since in this case
each such driver is independent of all others. Also, given some effort on the part of the OS, some multi-driver
sets can be "split up" into multiple single-driver sets and used independently; Section 10.4.5 describes one
way this can be done.
    In essence, each driver set represents an "encapsulated environment" with a well-defined interface and a
clearly-bounded set of state.  The concept of a driver set has important implications throughout the device
driver framework, especially in terms of execution environment and synchronization; the following sections
describe these aspects of the framework in more detail.
10.4         Execution  Model


The code within a given driver set assumes that it executes in a traditional nonpreemptive, uniprocessor-style
process model, with two "levels":  interrupt level and process level.  The OS itself does not need to use this
execution model, but it must provide it to the device drivers running under this framework. To be specific:


    1. Each driver set is a single-threaded execution domain: only one (virtual or physical) CPU may execute
       code in the driver set at a given time.


    2. Code in a driver set always runs at one of two levels:  process level or interrupt level.  Whenever the
       host OS invokes a device driver through the driver interface, the driver set is running at process level.
       When the host OS delivers an interrupt to a device driver by calling a previously registered interrupt
       handler, the driver set is running at interrupt level.


    3. Process-level execution in a driver set can be interrupted at any time by interrupt handlers in the
       driver set, except when the process-level driver code has disabled interrupts using fdev_intr_disable
       (see Section 10.13.1). Interrupt handlers cannot be interrupted by the same or other interrupt handlers
       in the driver set.  (The host OS is required to allow interrupt handlers to interrupt process-level
       activities, because some drivers busy-wait at process level for interrupts to occur.)


    4. Multiple process-level device driver invocations, or "activities," may be outstanding at a given time,
       as long as only one is actually executing at a time (as required by rule 1). A subset of the functions in
       the driver-kernel interface are defined as blocking functions; whenever one of these functions is called,
       the host OS may start a new activity in the driver set, or switch back to other previously blocked
       activities.


    5. The host OS supplies each outstanding activity with a separate stack, which is retained across blocking
       function calls. Stacks are only relinquished by a driver set when the operation completes and the driver
       returns from the original call that was used to invoke it.


    6. While an interrupt handler is running in a driver set, no process-level execution can take place in the
       same driver set.


    7. Interrupt handlers always run to completion: they may not call blocking functions or enable interrupts.



    Although on the surface it may appear that these requirements place severe restrictions on the host OS,
the required execution model can in fact be provided quite easily even in most kernels supporting other
execution models.  The following sections describe some example techniques for providing this execution
model.
10.4.1        Use in multiprocessor kernels


Global spin lock:   The easiest way to provide the required execution model for the device driver framework
in a nonpreemptive, process-model, multiprocessor kernel such as Mach 3.0 is to place a single global spin
lock around all code running in the device driver framework. A process must acquire this lock before entering
driver code, and release it after the operation completes.  (This includes both process-level entry through
the driver interface, and interrupt-level entry into the drivers' interrupt handlers.)  In addition, all blocking
functions in the driver-kernel interface, which the host OS supplies, should release the global lock before
blocking and acquire the lock again after being woken up.  This way, other processors, and other processes
on the same processor, can run code in the same or other drivers while the first operation is blocked.

    Note that this global lock must be handled carefully in order to avoid deadlock situations.  A simple,
"naive" non-reentrant spin lock will not work, because if an interrupt occurs on a processor that is already
executing process-level driver code, and that interrupt tries to lock the global lock again, it will deadlock
because the lock is already held by the process-level code. The typical solution to this problem is to implement
the lock as a "reentrant" lock, so that the same processor can lock it twice (once at process level and once
at interrupt level) without deadlocking.

    Another  strategy  for  handling  the  deadlock  problem  is  for  the  host  OS  simply  to  disable  interrupts
before acquiring the global spin lock and enable interrupts after releasing it, so that interrupt handlers are
only called while the process-level device driver code is blocked.  (In this case, the fdev_intr_enable and
fdev_intr_disable calls, provided by the OS to the drivers, would do nothing because interrupts are always
disabled during process-level execution.) This strategy is not recommended, however, because it will increase
interrupt latency and break many existing partially-compliant drivers which busy-wait at process level for
conditions set by interrupt handlers.
Spin lock per driver set:   As a refinement to the approach described above, to achieve better parallelism,
the host OS kernel may want to maintain a separate spin lock for each driver set.  This way, for example, a
network driver can be run on one processor while a disk driver is being run on another.  This parallelism is
allowed by the framework because driver sets are fully independent and do not share data with each other
directly.
10.4.2        Use in preemptive kernels


The issues and solutions for implementing the required execution model in preemptive kernels are similar to
those for multiprocessor kernels:  basically, locks are used to protect device driver code.  Again, the locking
granularity can be global or per-driver set (or anything in between, as the OS desires). However, in this case,
a blocking lock must be used rather than a simple spin lock because the lock must continue to be held if a
process running device driver code is preempted.  (Note the distinction between OS-level "blocking," which
can occur at any time during execution of driver code but is made invisible to the driver code through the
use of locks; and driver-level "blocking," which only occurs when a driver calls a function defined by this
framework to be a blocking function.)

    An alternative solution to the preemption problem is simply to disable preemption while running code
in  the  driver  framework.   This  solution  is  likely  to  be  simpler  in  terms  of  implementation  and  to  have
less overhead, but it may greatly increase thread dispatch latency, possibly defeating the purpose of kernel
preemption in the first place.


10.4.3        Use in multiple-interrupt-level kernels


Many existing kernels, particularly those derived from Unix or BSD, implement a range of "interrupt priority
levels," typically assigning different levels to different classes of devices such as block, character, or network
devices. In addition, some processor architectures, such as the 680x0, directly support and require the use of
some kind of IPL-based scheme.  Although the Flux device driver framework does not directly support any
notion of interrupt priority levels, it can be used fairly easily in IPL-based kernels by assigning a particular
IPL to each driver set used by the kernel.  In this case, the fdev_intr_disable routine provided by the
kernel does not disable all interrupts, but instead only disables interrupts at the driver set's priority level
and at all lower priority levels.  This way, although the code in each driver set is only aware of interrupts
being "enabled" or "disabled," the host OS can in effect enforce a general IPL-based scheme.
    An obvious limitation, of course, is that all of the device drivers in a particular driver set must generally
have the same IPL. However, this is usually not a problem, since the drivers in a set are usually closely
related anyway.



10.4.4        Use in interrupt-model kernels

Many  small  kernels  use  a  pure  interrupt  model  internally  rather  than  a  traditional  process  model;  this
basically means that there is only one kernel stack per processor rather than one kernel stack per process,
and therefore kernel code can't block without losing all of the state on its stack.  This is probably the most
difficult environment in which to use the framework, since the framework fundamentally assumes one stack
per  outstanding  device  driver  invocation.   Nevertheless,  there  are  a  number  of  reasonable  ways  to  work
around this mismatch of execution model, some of which are described briefly below as examples:


    o  Switch  stacks  while  running  driver  code.  Before the kernel invokes a device driver operation
       (e.g., makes a read or write request), it allocates a special "alternate" kernel stack, possibly from
       a "pool" of stacks reserved for this purpose.  This alternate stack is associated with the outstanding
       operation until the operation completes;  the kernel switches to the alternate stack before executing
       process-level device driver code, and switches back to the per-processor kernel stack when the driver
       blocks or returns.  Depending on the details of the kernel's execution model, the kernel may also have
       to switch back to the per-processor stack when the process-level device driver code is interrupted, due
       to an event such as a hardware interrupt or a page fault occurring while copying data into or out
       of a user-mode process's address space.  However,  note that stack switching is only required when
       running process-level device driver code; interrupt handlers in the device driver framework are already
       "interrupt model" code and need no special adaptation.

    o  Run  process-level  device  driver  code  on  a  separate  kernel  thread.  If the kernel supports
       kernel threads in some form (threads that run using a traditional process model but happen to execute
       in the kernel's address space), then process-level device driver code can be run on a kernel thread.
       Basically, the kernel creates or otherwise "fires off" a new kernel thread for each new device driver
       operation invoked, and the thread terminates when the operation is complete. (If thread creation and
       termination are expensive, then a "pool" of available device driver threads can be cached.) The kernel
       must ensure that the driver threads active in a particular driver set at a given time cannot preempt
       each other arbitrarily except in the blocking functions defined by this framework; one way to do this
       is with locks (see Section 10.4.2).  Conceptually, this solution is more or less isomorphic to the stack
       switching solution described above, since a context switch basically amounts to a stack switch; only
       the low-level details are really different.

    o  Run the device drivers in user mode. If a process-model environment cannot easily be provided or
       simulated within the kernel, then the best solution may be to run device drivers using this framework
       in user mode, as ordinary threads running on top of the kernel. Of course, this solution brings with it
       various potential complications and efficiency problems; however, in practice they may be fairly easily
       surmountable, especially in kernels that already support other kinds of user-level device drivers. Also,
       even if the drivers in this framework can only be run in user mode, there is nothing to prevent "native"
       drivers designed specifically for the kernel in question from running in supervisor mode with minimal
       performance overhead; this way, extremely popular or performance-critical hardware can be supported
       directly in the kernel, while the drivers from this framework can be run in user mode to increase the
       breadth of supported hardware.

                        Figure 10.2: Using the framework to create user-mode device drivers

    o  Run  the  device  drivers  at  an  intermediate  privilege  level.   Some  processor  architectures,
       such as the x86 and PA-RISC, support multiple privilege levels besides just "supervisor mode" and
       "user mode."  Kernels for such architectures may want to run device drivers under this framework
       at an intermediate privilege level,  if this approach results in a net win in terms of performance or
       implementation complexity. Alternatively, on most architectures, the kernel may be able to run device
       drivers in user mode but with an address map identical to the kernel's, allowing them direct access to
       physical memory and other important kernel resources.



10.4.5        Use in out-of-kernel, user-mode device drivers

In  some  situations,  for  reasons  of  elegance,  modularity,  configuration  flexibility,  robustness,  or  even  (in
some cases) performance, it is desirable to run device drivers in user mode, as "semi-ordinary" application
programs.  This is done as a matter of course by some microkernels such as L3[?]  and VSTa[?].  There is
nothing in the Flux device driver framework that prevents its device drivers from executing in user mode,
and in fact the framework was deliberately designed with support for user-mode device drivers in mind.
    Figure 10.2 illustrates an example system in which device drivers are located in user-mode processes. In
this case, all of the code within a given driver set is part of the user-level device driver process, and the
"surrounding" OS-specific code, which makes calls to the drivers through the driver interface, and provides
the functions in the "kernel interface," is not actually kernel code at all but, rather, "glue" code that handles
communication with the kernel and other processes. For example, many of the functions in the driver-kernel
interface, such as the calls to allocate interrupt request lines, will be implemented by this glue code as system
calls to the "actual" kernel, or as remote procedure calls to servers in other processes.
    Device driver code running in user space will typically run in the context of ordinary threads; the execution
environment required by the driver framework can be built on top of these threads in different ways.  For
example, the OS-specific glue code may run on only a single thread and use a simple coroutine mechanism
to provide a separate stack for each outstanding process-level device driver operation; alternately, multiple
threads may be used, in which case the glue code will have to use locking to provide the nonpreemptive
environment required by the framework.
    Dispatching interrupt handlers in these user-mode drivers can be handled in various ways, depending
on  the  environment  and  kernel  functionality  provided.   For  example,  interrupt  handlers  may  be  run  as
"signal handlers" of some kind "on top of" the thread(s) that normally execute process-level driver code;
alternatively, a separate thread may be used to run interrupt handlers.  In the latter case, the OS-specific
glue code must use appropriate locking to ensure that process-level driver code does not continue to execute
while interrupt handlers are running.


Shared interrupt request lines

One particularly difficult problem for user-level drivers in general, and especially for user-level drivers built
using this framework, is supporting shared interrupt lines. Many platforms, including PCI-based PCs, allow
multiple unrelated devices to send interrupts to the processor using a single request line; the processor must
then sort out which device(s) actually caused the interrupt by checking each of the possible devices in turn.
With user-level drivers, the code necessary to perform this checking is typically part of the user-mode device
driver, since it must access device-specific registers.  Thus, in a "naive" implementation, when the kernel
receives a device interrupt, it must notify all of the drivers hooked to that interrupt, possibly causing many
unnecessary context switches for every interrupt.
    The typical solution to this problem is to allow device drivers to "download" small pieces of "disambigua-
tion" code into the kernel itself; the kernel then chains together all of the code fragments for a particular
interrupt line, and when an interrupt occurs, the resulting code sequence determines exactly which device(s)
caused the interrupt, and hence, which drivers need to be notified.  This solution works fine for "native"
drivers designed specifically for the kernel in question; however, there is no obvious, straightforward way to
support such a feature in the driver framework.
    For this reason, until a better solution can be found, the following policy applies to using shared interrupts
in this framework: for a given shared interrupt line, either the kernel must unconditionally notify all registered
drivers running under this framework, and take the resulting performance hit; or else the drivers running
under this framework will not support shared interrupts at all.  (Native drivers written specifically for the
kernel in question can still use the appropriate facilities to support shared interrupt lines efficiently.)



10.5         Performance


Since this framework emphasizes breadth, adaptability, and ease-of-use over raw performance, the perfor-
mance of device drivers running under this framework is likely to suffer somewhat; how much depends on
how well-matched the particular driver is to the driver framework and to the host OS. Various factors can
influence driver performance:  for example, if the OS's network code does not match the network drivers in
terms of whether scatter/gather message buffers are supported or required, performance is likely to suffer
somewhat due to extra copying between the driver and the OS's network code. The OS developer will have
to take these issues into account when selecting which sets of device drivers to use (e.g., FreeBSD versus
Linux network drivers).  If the device driver sets are chosen carefully and the OS's driver support code is
designed well, in many cases it should be possible to use these drivers with minimal performance loss.
    Another consideration is how extensively the OS should rely on this device driver framework.  There is
nothing preventing the OS from maintaining its own (probably smaller) collection of "native" drivers designed
and tuned for the particular OS; this way, the OS can achieve maximum performance for particularly common
or performance-critical hardware devices, and use the larger set of device drivers easily available through
this framework to provide support for other types of hardware that otherwise wouldn't be supported at all.
This approach of combining native and emulated drivers is likely to be especially important for kernels that
are not well matched to the existing drivers this framework was designed around: e.g., "stackless" interrupt
model kernels which must run emulated device drivers on special threads or in user space.
    Once this framework is more mature, we intend to include performance statistics reflecting typical relative
performance for various types of drivers and kernels using this framework; until then, see the results in the
USENIX paper, "Linux Device Driver Emulation in Mach," for a general indication of probable performance
costs.



10.6         Device  Driver  Initialization


When the host OS is ready to start using device drivers in this framework, it typically calls a probe function
for each driver set it uses; this function initializes the drivers and checks for hardware devices supported by
any of the drivers in the set. If any such devices are found, they are registered with the host OS by calling a
registration routine specific to the type of bus on which the device resides (e.g., ISA, PCI, SCSI). The host
OS can then record this information internally so that it knows which devices are available for later use.
The OS can implement device registration any way it chooses; however, the driver support library (libfdev)
provided by the OS toolkit provides a default implementation of a registration mechanism which builds a
single "hardware tree" representing all known devices; see Section 11.2 for more information.
    When  a  device  driver  discovers  a  device,  it  creates  a  device  node  structure  representing  the  device.
The device node structure can be of arbitrary size, and most of its contents are private to the device driver.
However, the first part of the device node is always a structure of type fdev_t, defined in flux/fdev/fdev.h,
which contains generic information about the device and driver needed by the OS to make use of the device.
In addition, depending on the device's type, there may be additional information available to the host OS,
as described in the following section.



10.7         Device  Classification


Device nodes have types that follow a C++-like single-inheritance subtyping relationship, where fdev_t is
the ultimate ancestor or "supertype" of all device types.
    In general, the host OS must know what class of device it is talking to in order to make use of it properly.
On the other hand, it is not strictly necessary for the host OS to recognize the specific device type, although
it may be able to make better use of the device if it does.
    The block device class has the following attributes:

    o  All input and output is synchronously driven by the host OS, through calls to fdev_drv_read and
       fdev_drv_write;  the driver never calls the asynchronous I/O functions defined in Section ??.  I/O
       operations  always  complete  "promptly":  barring  device  driver  or  hardware  bugs,  reads  and  writes
       are never delayed indefinitely due to external conditions.  (This contrasts with network devices, for
       example, where input is received when another machine sends a message, not when the host OS asks
       for input.)

    o  There may be a minimum read/write granularity, or block size, which the driver specifies in the blksize
       field of its fdev_blk_t structure.  The block size is always a power of two (e.g., typically 512 for most
       disks),  and is always less than the processor's minimum page size (PAGE_SIZE, Section 8.2.2).  The
       offset and count parameters of all read/write calls made by the host OS to this device driver must be
       an even multiple of this block size.  For block devices with no minimum read/write granularity, the
       driver specifies a block size of 1 (i.e., one-byte granularity).

    o  Block devices may have removable media, such as floppy drives, CD-ROM drives, or removable hard
       drives.  The device driver provides an indication to the OS of whether or not the device supports
       removable media.

    The character device class has the following characteristics:

    o  Output is synchronous, directed by the host OS, but input is asynchronous, directed by the external
       device.

    o  Incoming  and  outgoing  data  consists  of  a  stream  of  bytes;  there  is  no  larger  minimum  read/write
       granularity.  Multiple  bytes  of  data  can  be  sent  and  received  in  one  operation,  but  this  is  just  an
       optimization; there is no semantic difference from handling each byte individually.


    The network device class has the following characteristics:


    o  Output is synchronous, directed by the host OS, but input is asynchronous, directed by the external
       device.

    o  Data is handled in units of packets; one send or receive operation is performed for each packet.

    o  Packets sent and received typically have specific size and format restrictions, depending on the specific
       network type (e.g., ethernet, myrinet).

    Note that it would certainly be possible to decompose these device classes into a deeper type hierarchy.
For example, in abstract terms it might make sense to arrange character and network devices under a single
supertype representing "asynchronous" devices.  However,  since the structure representing this "abstract
supertype" would contain essentially nothing in terms of actual code or data, this additional level was not
deemed useful for the driver framework. Of course, the OS is free to use any type hierarchy (or non-hierarchy)
it desires for its own data structures representing devices, drivers, etc.
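The read/write granularity rule for block devices stated above can be sketched as a small validation helper. The function name is hypothetical and not part of libfdev; it only illustrates why a power-of-two block size is convenient:

```c
#include <assert.h>

/* Hypothetical check mirroring the block-device rules above: blksize
 * must be a power of two (or 1 for byte granularity), and offset and
 * count must be even multiples of it.  Because blksize is a power of
 * two, "multiple of blksize" reduces to a simple mask test. */
static int blk_request_valid(unsigned blksize, unsigned offset, unsigned count)
{
        if (blksize == 0 || (blksize & (blksize - 1)) != 0)
                return 0;       /* not a power of two */
        return (offset & (blksize - 1)) == 0
            && (count & (blksize - 1)) == 0;
}
```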
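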



10.8         Buffer  Management


XXX overview



10.9         Asynchronous  I/O


XXX overview



10.10          Other  Considerations


XXX some rare, poorly-designed hardware does not work right if long delays occur while programming the
devices.  (This is supposedly the case for some IDE drives, for example.)  For this reason, reliability and
hardware compatibility may be increased by implementing fdev_intr_disable as a function that really does
disable all interrupts on the processor in question.
    XXX Symbol name conflicts among libraries...  For each existing driver set, provide a list of "reserved"
symbols used by the set.


10.11          Common  Device  Driver  Interface


This section describes the Flux device driver interfaces that are common to all types of drivers and hardware.


10.11.1        fdev.h:  common device driver framework definitions


Synopsis

       #include  <flux/fdev/fdev.h>


XXX


10.11.2        fdev_ioctl:  control a device using a driver-specific protocol


Synopsis

       #include  <flux/fdev/fdev.h>

       int fdev_ioctl(fdev_t *dev, int cmd, void *buf, fdev_buf_vec_t *bufvec);


Direction

       OS → Driver


Description

       This function is the only OS-to-driver call that is common to all types of devices and drivers; it
       provides a common mechanism by which the OS can invoke device- and driver-specific control
       operations. As the name implies, this entrypoint corresponds to the ioctl entrypoint in typical
       Unix device driver interfaces:  it provides a fully generic "escape hatch" through which drivers
       can easily expose arbitrary device features without requiring specific support to be present in the
       rest of the OS. XXX explain implications in more detail.

       The caller must supply a structure of type fdev_t that identifies the device on which the operation
       is  to  be  performed.   The  argument  cmd  specifies  the  control  operation;  its  interpretation  is
       completely specific to the device driver being called.  The buf and bufvec arguments represent
       a buffer used to hold any associated data passed to and/or from the driver.


Parameters

       dev :   The device on which the control operation is to be performed.

       cmd :    The command to be performed. This is driver specific.

       buf :   The opaque buffer containing additional data to be passed to the driver for the processing
             of the command, and/or to hold data passed back from the driver to the OS.

       bufvec:    A function vector table providing the device driver with functions with which to access
             the opaque buffer.


Returns

       Returns 0 on success, or an error code specified in <flux/fdev/error.h> on error.


10.12          Driver  Memory  Allocation


The OS must provide routines for drivers to call to allocate memory for the private use of the drivers, as
well as for I/O buffers and other purposes. The Flux device driver framework defines a single set of memory
allocation functions which all drivers running under the framework call to allocate and free memory.
    Device drivers often need to allocate memory in different ways, or memory of different types, for different
purposes.   For  this  reason,  the  device  driver  framework  defines  a  set  of  flags  provided  to  each  memory
allocation function describing how the allocation is to be done, or what type of memory is required.
    As with other aspects of the Flux device driver framework,  the libfdev library provides default im-
plementations of the memory allocation functions, but these implementations may be replaced by the OS
as desired.  The default implementations make a number of assumptions which are often invalid in "real"
OS kernels;  therefore,  these functions will often be overridden by the client OS. Specifically,  the default
implementation assumes:

    o  The LMM pool malloc_lmm is used to manage kernel memory.

    o  Memory allocation and deallocation never block.

    o  All memory allocation functions can be called at interrupt time.

    o  All allocated blocks are physically as well as virtually contiguous.


10.12.1        fdev_memflags_t:  memory allocation flags


Synopsis

       XXX typedef unsigned fdev_memflags_t;


Direction

       Driver → OS


Description

       All  of  the  memory  allocation  functions  used  by  device  drivers  in  the  Flux  device  framework
       take a parameter of type fdev_memflags_t, which is a bit field describing various option flags
       that affect how memory allocation is done.  Device drivers often need to allocate memory that
       satisfies certain constraints, such as being physically contiguous, or page aligned, or accessible
       to DMA controllers.  These flags abstract out these various requirements, so that all memory
       allocation requests made by device drivers are sent to a single set of routines; this design allows
       the OS maximum flexibility in mapping device memory allocation requests onto its internal kernel
       memory allocation mechanisms.

       Routing all memory allocations through a single interface this way may have some impact on
       performance, due to the cost of decoding the flags argument on every allocation or deallocation
       call. However, this cost is expected to be small compared to the typical cost of actually performing
       the requested operation.

       One rule that must be observed by device drivers at all times is that all calls made for a given mem-
       ory block must use the same flags parameter.  For example, if a memory block is allocated with
       FDEV_PHYS_CONTIG | FDEV_NONBLOCKING, then exactly the same flags must be specified
       in the call to free that block. This constraint allows fdev_mem_alloc and fdev_mem_free to map
       onto widely different sets of underlying OS mechanisms depending on the flags argument, without
       the glue code needing to keep track of "hidden state" attached to each allocated block.

       The specific flags currently defined are as follows:

       FDEV_AUTO_SIZE:        The memory allocator must keep track of the size of blocks allocated
             using this flag;  in this case,  the size parameter passed in the corresponding
             fdev_mem_free call is meaningless.  For blocks allocated without this flag set,  the caller
             (device driver) promises to keep track of the size of the allocated block, and pass it back to
             fdev_mem_free on deallocation.

             It is possible for the OS to implement these memory allocation routines so that they ignore
             the FDEV_AUTO_SIZE flag and simply always keep track of block sizes themselves.  However,
             note that in some situations, doing so may produce extremely inefficient memory usage. For
             example, if the OS memory allocation mechanism prefixes each block with a word containing
             the block's length, then any request by a device driver to allocate a page-aligned page (or
             some  other  naturally-aligned,  power-of-two-sized  block)  will  consume  that  page  plus  the
             last word of the previous page.  If many successive allocations are done in this way, only
             every other page will be usable, and half of the available memory will be wasted. Therefore,
             it  is  generally  a  good  idea  for  the  memory  allocation  functions  to  pay  attention  to  the
             FDEV_AUTO_SIZE flag, at least for allocations with alignment restrictions.

       FDEV_NONBLOCKING:         If set, this flag indicates that the memory allocator must not block during
             the allocation or deallocation operation. More specifically, the flag indicates that the device
             driver code must not be run in the context of other, concurrent processes while the allocation
             is taking place.  Any calls to the allocation functions from interrupt handlers must specify
             the FDEV_NONBLOCKING flag.

       FDEV_PHYS_WIRED:        Indicates that the memory must be non-pageable.  Accesses to the returned
             memory must not fault.


       FDEV_PHYS_CONTIG:         Indicates the underlying physical memory must be contiguous.

       FDEV_PHYS_EQ_VIRT:         Indicates the virtual address must exactly equal the physical address so the
             driver may use them interchangeably. The FDEV_PHYS_CONTIG flag must also be set whenever
             this flag is set.

       FDEV_ISA_DMA_MEMORY:          This flag applies only to machines with ISA busses or other busses that
             are software compatible with ISA, such as EISA, MCA, or PCI. It indicates that the memory
             allocated must be appropriate for DMA access using the system's built-in DMA controller.
             In  particular,  it  means  that  the  buffer  must  be  physically  contiguous,  must  be  entirely
             contained in the low 16MB of physical memory, and must not cross a 64KB boundary. (By
             implication, this means that allocations using this flag are limited to at most 64KB in size.)
             The FDEV_PHYS_CONTIG flag must also be set if this flag is set.

       FDEV_X86_1MB_MEMORY:          This flag only applies to x86 machines, in which some device drivers may
             need to call 16-bit real-mode BIOS routines.  Such drivers may need to allocate physical
             memory  in  the  low  1MB  region  accessible  to  real-mode  code;  this  flag  allows  drivers  to
             request such memory.
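The FDEV_ISA_DMA_MEMORY constraints are concrete enough to check arithmetically. The following helper is purely illustrative (it is not part of the framework); it tests whether an already physically contiguous region satisfies the rules above:

```c
/* Returns nonzero if [phys, phys+len) is suitable for the ISA DMA
 * controller per the rules above: at most 64KB long, entirely below
 * 16MB physical, and not crossing a 64KB boundary. */
static int isa_dma_region_ok(unsigned long phys, unsigned long len)
{
        if (len == 0 || len > 0x10000UL)
                return 0;       /* transfers are limited to 64KB */
        if (phys + len > 0x1000000UL)
                return 0;       /* must lie entirely below 16MB */
        /* first and last byte must fall within the same 64KB page */
        return (phys & ~0xFFFFUL) == ((phys + len - 1) & ~0xFFFFUL);
}
```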


10.12.2        fdev_mem_alloc:  allocate memory for use by device drivers


Synopsis

       void  *fdev_mem_alloc(vm_size_t size, fdev_memflags_t flags, unsigned align_bits);


Direction

       Driver → OS


Description

       This function is called by drivers to allocate memory.  The OS must allocate the requested
       amount of memory, subject to the restrictions specified by the flags argument as described above.


Parameters

       size:   Amount of memory to allocate.

       flags:   Restrictions on memory.

       align_bits:    Required alignment of the block: the returned address must be aligned on a
             2^align_bits-byte boundary.


Returns

       Returns the address of the allocated block in the kernel's virtual address space, or NULL if not
       enough memory was available.


10.12.3        fdev_mem_free:  free memory allocated with fdev_mem_alloc


Synopsis

       void fdev_mem_free(void *block, fdev_memflags_t flags, vm_size_t size);


Direction

       Driver → OS


Description

       Frees a memory block previously allocated by fdev_mem_alloc.


Parameters

       block :   A pointer to the memory block, as returned from fdev_mem_alloc.

       flags:   Must be exactly the set of flags used during the call to fdev_mem_alloc that allocated
             this memory block.

       size:   Unless flags includes FDEV_AUTO_SIZE, this parameter must be the size requested when
             this block was allocated. If FDEV_AUTO_SIZE is set, the value of the size parameter is
             meaningless.
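A minimal sketch of OS glue honoring the FDEV_AUTO_SIZE rule, using standard malloc as the backing allocator. The flag value, the omission of the align_bits argument, and the prefix-word strategy are all illustrative assumptions, not part of the documented interface:

```c
#include <stdlib.h>

typedef unsigned fdev_memflags_t;
#define FDEV_AUTO_SIZE 0x01     /* illustrative flag value */

/* With FDEV_AUTO_SIZE, prefix the block with its size so the
 * allocator can recover it at free time; without the flag, the
 * driver promises to pass the size back itself. */
static void *mem_alloc(size_t size, fdev_memflags_t flags)
{
        if (flags & FDEV_AUTO_SIZE) {
                size_t *p = malloc(sizeof(size_t) + size);
                if (p == NULL)
                        return NULL;
                *p = size;              /* remember the size ourselves */
                return p + 1;
        }
        return malloc(size);            /* caller tracks the size */
}

static void mem_free(void *block, fdev_memflags_t flags, size_t size)
{
        (void)size;                     /* unused by this malloc-backed glue */
        if (flags & FDEV_AUTO_SIZE)
                free((size_t *)block - 1);
        else
                free(block);
}
```

Note how the same-flags rule is what lets mem_free know whether a hidden size word was written in front of the block.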


10.12.4        fdev_mem_get_phys:  find the physical address of an allocated block


Synopsis

       vm_offset_t fdev_mem_get_phys(void *block, fdev_memflags_t flags, vm_size_t size);


Direction

       Driver → OS


Description

       Returns the physical address of an allocated memory block. Can only be called on blocks allocated
       with the FDEV_PHYS_CONTIG flag set, i.e., blocks that are guaranteed to be physically contiguous.


Parameters

       block :   A pointer to the memory block, as returned from fdev_mem_alloc.

       flags:   Must be the set of flags used during the call to fdev_mem_alloc that allocated this memory
             block.

       size:   Unless flags includes FDEV_AUTO_SIZE, this parameter must be the size requested when
             this block was allocated. If FDEV_AUTO_SIZE is set, the value of the size parameter is
             meaningless.


10.12.5        fdev_mem_get_phys_list:  find the physical address list of an allocated block

Synopsis

       int fdev_mem_get_phys_list(void *block, fdev_memflags_t flags, vm_size_t size, vm_offset_t
       *phys_buf, unsigned phys_buf_len);


Direction

       Driver → OS


Description

       Can only be called on blocks allocated with the FDEV_PHYS_WIRED flag set, i.e., blocks that are
       guaranteed to be wired down to physical memory so that the physical address corresponding
       to a particular virtual address in the memory block never changes for as long as the block is
       allocated.  Otherwise, the information returned by this routine would be meaningless because it
       might change at any time without the driver's knowledge.

       XXX This definition probably won't work.


Parameters

       block :   A pointer to the memory block, as returned from fdev_mem_alloc.

       flags:   Must be the set of flags used during the call to fdev_mem_alloc that allocated this memory
             block.

       size:   Unless flags includes FDEV_AUTO_SIZE, this parameter must be the size requested when
             this block was allocated. If FDEV_AUTO_SIZE is set, the value of the size parameter is
             meaningless.


10.13          Hardware  Interrupts


We do not deal with shared interrupts yet...
    In a given driver environment in this framework,  there are only two "interrupt levels":  enabled and
disabled. In the default case in which all device drivers of all types are linked together into one large driver
environment in an OS kernel, this means that whenever one driver masks interrupts, it masks all device
interrupts in the system.[1]
    However, an OS can implement multiple interrupt priority levels, as in BSD or Windows NT, if it so
desires, by creating separate "environments" for different device drivers. For example, if each driver is built
into a separate, dynamically-loadable module, then the fdev_intr_ calls in different driver modules could be
resolved by the dynamic loader to spl-like routines that switch between different interrupt priority levels.
For example, the fdev_intr_disable call in network drivers may resolve to splnet, whereas the same call
in a disk driver may be mapped to splbio instead.
____________________________________________________
    [1] Rationale:  The Linux device drivers work this way, and we can't provide more than what we have to work with.  This
also makes the OS interface simpler, and may allow the basic operations to be faster due to this simplicity.


10.13.1        fdev_intr_disable:  prevent interrupts in the driver environment


Synopsis

       void fdev_intr_disable(void);


Direction

       Driver → OS


Description

       Disable  further  entry  into  the  calling  driver  set  through  an  interrupt  handler.   This  can  be
       implemented either by directly disabling interrupts at the interrupt controller or CPU, or using
       some software scheme.

       XXX Merely needs to prevent intrs from being dispatched to the driver set.  Drivers may see
       spurious interrupts if they briefly cause interrupts while disabled.


10.13.2        fdev_intr_enable:  allow interrupts in the driver environment


Synopsis

       void fdev_intr_enable(void);


Direction

       Driver → OS


Description

       Enable interrupt delivery to the calling driver set.  This can be implemented either by directly
       enabling interrupts at the interrupt controller or CPU, or using some software scheme.


10.13.3        fdev_intr_alloc:  allocate an interrupt request line


Synopsis

       int fdev_intr_alloc(int irqnum, int (*handler)(void *), void *data, int flags);


Direction

       Driver → OS


Description

       Allocate an interrupt request line and attach the specified handler to it. On interrupt, the kernel
       must pass the data argument to the handler.

       Flags:

       FDEV_INTR_SHAREABLE:          If this flag is specified, the interrupt request line can be shared between
             multiple devices. On interrupt, the OS will call each handler attached to the interrupt line.
             Without this flag set, the OS is free to return an error if another handler is attached to the
             interrupt request line.


Parameters

       irqnum:     The interrupt request line to allocate.

       handler :    Interrupt handler.

       data:    Data passed by the kernel to the interrupt handler.

       flags:   Option flags for the allocation, as described above (currently only
             FDEV_INTR_SHAREABLE is defined).


Returns

       Returns 0 on success, non-zero on error.
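The OS-side bookkeeping for the non-shareable case can be sketched as a simple handler table. The table layout, names, and the single-slot-per-line simplification are all illustrative assumptions:

```c
#include <stddef.h>

#define NIRQS 16                        /* typical PC interrupt lines */

struct irq_slot {
        int  (*handler)(void *);        /* driver's interrupt handler */
        void  *data;                    /* passed back on each interrupt */
};
static struct irq_slot irq_table[NIRQS];

/* Sketch of the non-shareable case: fail if the line is taken. */
static int intr_alloc(int irqnum, int (*handler)(void *), void *data)
{
        if (irqnum < 0 || irqnum >= NIRQS || handler == NULL)
                return 1;
        if (irq_table[irqnum].handler != NULL)
                return 1;               /* already claimed, not shareable */
        irq_table[irqnum].handler = handler;
        irq_table[irqnum].data = data;
        return 0;
}

/* Called by the OS's low-level interrupt dispatch code. */
static void intr_dispatch(int irqnum)
{
        if (irq_table[irqnum].handler != NULL)
                irq_table[irqnum].handler(irq_table[irqnum].data);
}

/* Example driver handler for demonstration only. */
static int hits;
static int count_handler(void *data)
{
        hits += *(int *)data;
        return 0;
}
```

The shareable case would replace the single slot with a chain of handlers, each invoked in turn on every interrupt on that line.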


10.14          Sleep/Wakeup


10.14.1        fdev_sleep_init:  prepare to put the current process to sleep


Synopsis

       #include  <flux/fdev/fdev.h>

       void fdev_sleep_init(fdev_sleeprec_t *sleeprec);


Direction

       Driver → OS


Description

       This function initializes a "sleep record" structure in preparation for putting the current
       process to sleep to wait for some event.  The sleep record is used to avoid races between
       actually going to sleep and the occurrence of the event of interest, and to provide a "handle"
       on the current activity by which fdev_wakeup can indicate which process to awaken.


Parameters

       sleeprec:    A pointer to the process-private sleep record.


10.14.2        fdev_sleep:  put the current process to sleep


Synopsis

       #include  <flux/fdev/fdev.h>

       void fdev_sleep(fdev_sleeprec_t *sleeprec);


Direction

       Driver → OS


Description

       The driver calls this function at process level to put the current activity (process) to sleep until
       some event occurs,  typically triggered by a hardware interrupt or timer handler.  The driver
       must supply a pointer to a process-private "sleep record" variable (sleeprec), which is typically
       just allocated on the stack by the driver.  The sleeprec must already have been initialized using
       fdev_sleep_init.   If  the  event  of  interest  occurs  after  the  fdev_sleep_init  but  before  the
       fdev_sleep, then fdev_sleep will return immediately without blocking.


Parameters

       sleeprec:    A  pointer  to  the  process-private  sleep  record,  already  allocated  by  the  driver  and
             initialized using fdev_sleep_init.
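The race that the sleep record closes can be modeled in a few lines. This is a single-threaded sketch for illustration only: the real fdev_sleep blocks the process, whereas here "sleep" merely reports whether blocking would have occurred, and the record layout is invented (fdev_sleeprec_t is OS-defined):

```c
/* Illustrative model of the sleep-record protocol. */
struct sleeprec {
        int event_pending;
};

static void sleep_init(struct sleeprec *sr)
{
        sr->event_pending = 0;          /* arm the record before checking the condition */
}

/* Called at "interrupt level"; harmless if the sleeper already woke
 * or has not yet gone to sleep. */
static void wakeup(struct sleeprec *sr)
{
        sr->event_pending = 1;
}

/* Returns 1 if the process would block, 0 if the event already
 * arrived between sleep_init and this call, in which case fdev_sleep
 * returns immediately without blocking. */
static int sleep_would_block(struct sleeprec *sr)
{
        return !sr->event_pending;
}
```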


10.14.3        fdev_wakeup:  wake up a sleeping process


Synopsis

       #include  <flux/fdev/fdev.h>

       void fdev_wakeup(fdev_sleeprec_t *sleeprec);


Direction

       Driver → OS


Description

       The driver calls this function at interrupt level to wake up a process-level activity that has gone
       to sleep (or is preparing to go to sleep) waiting on some event. It is harmless to wake up a process
       that has already been woken.


Parameters

       sleeprec:    A pointer to the sleep record of the process to wake up. Must actually point to a valid
             sleep record variable that has been properly initialized using fdev_sleep_init.


10.15          Driver-Kernel  Interface:  Timing


10.15.1        fdev_timer_register:  start a timer


XXX


10.15.2        fdev_nanosleep:  wait for some amount of time to elapse


Synopsis

       void fdev_nanosleep(unsigned sec, unsigned nsec);


Direction

       Driver → OS


Description

       Delay for at least the specified amount of time.


Parameters

       sec:   Length of delay in seconds.

       nsec:    Length of delay in nanoseconds.


10.15.3        fdev_nanosleep_nonblock:  wait a short time without blocking


Synopsis

       void fdev_nanosleep_nonblock(unsigned sec, unsigned nsec);


Direction

       Driver → OS


Description

       Delay for at least the specified amount of time. The OS must ensure that the driver is not entered
       during the delay.


10.16          Buffer  Management


10.16.1        fdev_buf_copyin:  copy data from an opaque buffer to a driver's buffer


Synopsis

       int fdev_buf_copyin(void *src,  fdev_buf_vec_t *bufvec,  vm_offset_t offset,  void *dest,
       unsigned count);


Direction

       Driver → OS


Description

       Copy  data  from  the  opaque  source  buffer  to  a  driver-private  memory  buffer  at  the  specified
       destination address.  The source buffer was passed by the OS to the driver.  The destination
       memory was allocated by the driver.

       This function must be prepared to handle addresses and sizes of any alignment.


Parameters

       src:   Opaque source buffer.

       bufvec:    A function vector table providing the device driver with functions with which to access
             the opaque buffer.

       offset :  Offset within source buffer.

       dest :   Destination address.

       count :   Amount of data to copy.


Returns

       Returns 0 on success, non-zero on error.


10.16.2        fdev_buf_copyout:  copy data from a driver into an opaque buffer


Synopsis

       int fdev_buf_copyout(void *src, void *dest, fdev_buf_vec_t *bufvec, vm_offset_t offset,
       unsigned count);


Direction

       Driver → OS


Description

       Copy data from the source address to the destination buffer.  The source address was allocated
       by the driver. The destination buffer was passed by the kernel to the driver.

       This function must be prepared to handle addresses and sizes of any alignment.


Parameters

       src:   Source address.

       dest :   Opaque destination buffer.

       bufvec:    A function vector table providing the device driver with functions with which to access
             the opaque buffer.

       offset :  Offset within destination buffer.

       count :   Amount of data to copy.


Returns

       Returns 0 on success, non-zero on error.
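The opaque-buffer pattern behind fdev_buf_copyin and fdev_buf_copyout can be sketched as a function vector supplied by the OS. The struct layout, the names, and the flat-array implementation are assumptions for illustration, not the actual fdev_buf_vec_t definition:

```c
#include <string.h>

/* Hypothetical access vector: the driver sees only an opaque buffer
 * handle plus these functions for moving data in and out of it. */
struct buf_vec {
        int (*copyin)(void *buf, unsigned offset, void *dest, unsigned count);
        int (*copyout)(void *src, void *buf, unsigned offset, unsigned count);
};

/* Trivial OS-side implementation where the "opaque" buffer is just a
 * flat byte array; a real OS might instead walk a chain of fragments. */
static int flat_copyin(void *buf, unsigned off, void *dest, unsigned n)
{
        memcpy(dest, (char *)buf + off, n);
        return 0;
}

static int flat_copyout(void *src, void *buf, unsigned off, unsigned n)
{
        memcpy((char *)buf + off, src, n);
        return 0;
}

static struct buf_vec flat_vec = { flat_copyin, flat_copyout };
```

Because the driver goes through the vector, the OS can change the buffer representation without touching any driver code.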


10.16.3        fdev_buf_wire:  wire down part of a buffer to physical memory


Synopsis

       int fdev_buf_wire(void *buf, fdev_buf_vec_t *bufvec, vm_offset_t offset, unsigned size,
       vm_offset_t *page_list, [out] unsigned *amount_wired);


Direction

       Driver → OS


Description

       Wire down a portion of the buffer contents to physical memory.  The kernel must wire at least
       one page, starting with the page containing the offset.  The kernel should try to wire down the
       requested size; however, it may wire down less than the requested amount.  The kernel sets the
       argument amount_wired to the amount of memory actually wired, and returns the list of physical
       pages in the argument page_list.  The size and offset arguments have no alignment restrictions.


Parameters

       buf :   The buffer to be wired.

       bufvec:    A function vector table providing the device driver with functions with which to access
             the opaque buffer.

       offset :  Offset within the buffer.

       size:   Amount of data to wire.

       page_list :   List of physical pages returned by the kernel.

       amount_wired :      The amount of data actually wired by the kernel.


Returns

       Returns 0 on success, non-zero on error.


10.16.4        fdev_buf_unwire:  unwire previously wired data


Synopsis

       void  fdev_buf_unwire(void  *buf,  fdev_buf_vec_t  *bufvec,  vm_offset_t  offset,  unsigned
       size);


Direction

       Driver → OS


Description

       Unwire the portion of the buffer previously wired down by the call to fdev_buf_wire.  The
       offset and size arguments must be the same as those given to fdev_buf_wire.


Parameters

       buf :   The opaque buffer to be unwired.

       bufvec:    A function vector table providing the device driver with functions with which to access
             the opaque buffer.

       offset :  Offset within the buffer.

       size:   Amount of data to unwire.


10.16.5        fdev_buf_map:  map a buffer into the driver's virtual address space


Synopsis

       int fdev_buf_map(void *buf, fdev_buf_vec_t *bufvec, void **kaddr, unsigned size);


Direction

       Driver → OS


Description

       This routine makes the contents of the specified opaque buffer directly accessible to the device
       driver, mapping it into the kernel address space if necessary. XXX more details... XXX only one
       mapping at once? XXX allow mapping of only part of the buffer?


Parameters

       buf :   Buffer whose data is to be mapped.

       bufvec:    A function vector table providing the device driver with functions with which to access
             the opaque buffer.

       kaddr :    Kernel virtual address returned by the kernel.

       size:   Amount of data to map.


Returns

       Returns 0 on success, non-zero on error.


10.16.6        fdev_buf_unmap:  unmap data mapped with fdev_buf_map


Synopsis

       void fdev_buf_unmap(void *kaddr, unsigned size);


Direction

       Driver → OS


Description

       Unmap a buffer previously mapped by fdev_buf_map. XXX need to pass the buffer too


Parameters

       kaddr :    Kernel virtual address to unmap.  This must be the same address returned by
             fdev_buf_map.

       size:   Amount of data to unmap.  This must be the same as the size passed to fdev_buf_map.


10.17          Mapping  Physical  Memory


10.17.1        fdev_map_phys_mem:  map physical memory into kernel virtual memory


Synopsis

       int fdev_map_phys_mem(vm_offset_t pa, unsigned length, void **kaddr, int flags);


Direction

       Driver → OS


Description

       Allocate kernel virtual memory and map the caller-supplied physical address range into it.  The
       address and length must be aligned on a page boundary.

       This function is intended to provide device drivers access to memory-mapped devices.

       Flags:

       FDEV_PHYS_NOCACHE:         Inhibit caching of data in the specified memory.

       FDEV_PHYS_WRITETHROUGH:           Data cached from the specified memory must be synchronously
             written back on writes.


Parameters

       pa:   Starting physical address.

       length:    Amount of memory to map.

       kaddr :    Kernel  virtual  address  allocated  and  returned  by  the  kernel  that  maps  the  specified
             memory.

       flags:   Memory mapping attributes, as described above.


Returns

       Returns 0 on success, non-zero on error.


10.18          Device  Registration


10.18.1        fdev_alloc:  allocate a device node


10.18.2        fdev_free:  free a device node


10.19          Block  Storage  Device  Interfaces


XXX describe fdev_blk, blksize, etc.


10.19.1        fdev_blk_read:  read data from a device


Synopsis

       #include  <flux/fdev/blk.h>

       fdev_error_t fdev_blk_read(fdev_blk_t *dev, void *buf, fdev_buf_vec_t *bufvec, fdev_off_t
       offset, unsigned size, [out] unsigned *amount_read);


Direction

       OS → Driver


Description

       This function reads data from a block storage device.  The caller must supply a pointer to a
       structure of type fdev_blk_t which identifies the device from which data is to be read.  The
       function attempts to read the requested amount of data into the caller-supplied opaque buffer
       object; see Section ??  for information on opaque buffers.  The amount of data successfully read
       is placed in the amount_read argument.  Data can only be read from block storage devices in
       integral units of the device's block size, available from the blksize field of the device's
       fdev_blk_t device description structure.


Parameters

       dev :   The device from which data is to be read.

       buf :   An opaque handle for the buffer where data is to be placed.  This handle is only directly
             meaningful to the host OS, which supplies functions to operate on the buffer.

       bufvec:    A function vector table providing the device driver with functions with which to access
             the opaque buffer.

       offset :  Specifies the absolute byte offset at which to start reading. Must be an even multiple of
             the device's block size.

       size:   Number of bytes of data to read. Must be an even multiple of the device's block size.

       amount_read :      The amount of data actually read is filled in by the function.


Returns

       Returns 0 on success, or an error code specified in <flux/fdev/error.h> on error.
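The block-granularity rules above can be illustrated with a self-contained sketch.  Here fdev_blk_t is reduced to a toy structure backed by an in-memory array, and toy_blk_read is a hypothetical stand-in that enforces the same alignment contract as fdev_blk_read; it is not the framework's implementation.

```c
#include <string.h>

/* Toy stand-ins for the framework types in <flux/fdev/blk.h>. */
typedef int fdev_error_t;
typedef unsigned long fdev_off_t;

typedef struct {
    unsigned blksize;        /* device block size, in bytes */
    unsigned char *backing;  /* in-memory "disk" for this sketch */
    unsigned nbytes;         /* total device size, in bytes */
} toy_blk_t;

/* Sketch of the fdev_blk_read contract: offset and size must be
   integral multiples of the device's block size. */
fdev_error_t toy_blk_read(toy_blk_t *dev, void *buf, fdev_off_t offset,
                          unsigned size, unsigned *amount_read)
{
    if (offset % dev->blksize != 0 || size % dev->blksize != 0)
        return 1;                    /* not block-aligned: error */
    if (offset + size > dev->nbytes)
        size = dev->nbytes - offset; /* clamp at end of device */
    memcpy(buf, dev->backing + offset, size);
    *amount_read = size;
    return 0;                        /* success */
}
```

A real caller would read blksize out of the device's fdev_blk_t node and round its request up to a block multiple before issuing the read.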


10.19.2        fdev_blk_write:  write data to a device


Synopsis

       #include  <flux/fdev/blk.h>

       fdev_error_t fdev_blk_write(fdev_blk_t *dev, void *buf, fdev_buf_vec_t *bufvec, fdev_off_t
       offset, unsigned size, [out] unsigned *amount_written);


Direction

       OS → Driver


Description

       This function writes data to a block storage device.  The caller must supply a pointer to a
       structure of type fdev_blk_t that identifies the device to which data is to be written.  The
       function attempts to write the requested amount of data from the caller-supplied buffer.  The
       amount of data actually written is placed in the amount_written argument.  Data can only be
       written to block storage devices in integral units of the device's block size, available from the
       blksize variable in the device's fdev_blk_t device description structure.


Parameters

       dev :   The device to which data is to be written.

       buf :   The opaque buffer containing data to be written.

       bufvec:    A function vector table providing the device driver with functions with which to access
             the opaque buffer.

       offset :  Specifies the absolute byte offset at which to start writing. Must be an even multiple of
             the device's block size.

       size:   Number of bytes of data to write. Must be an even multiple of the device's block size.

       amount_written:       The amount of data actually written is filled in by the function.


Returns

       Returns 0 on success, or an error code specified in <flux/fdev/error.h> on error.


10.20          Network  Device  Interfaces


XXX Call this interface Ethernet-specific, and use different classes for other network types?


10.20.1        fdev_net_send:  send a packet on a network interface


Synopsis

       #include  <flux/fdev/net.h>

       fdev_error_t fdev_net_send(fdev_net_t *dev, void *buf, fdev_buf_vec_t *bufvec, vm_size_t
       pkt_size);


Direction

       OS → Driver


Description

       The OS calls this driver function to transmit a packet on a network interface device.  The
       packet is contained in an opaque buffer, the contents of which must conform to the packet size
       and format restrictions of the specific network type of the device being invoked.  For example,
       Ethernet packets have a well-known minimum and maximum size, as well as a specific link-layer
       header which must always be used.


Parameters

       dev :   The device to which the packet is to be sent.

       buf :   The opaque buffer containing the packet to be sent.

       bufvec:    A function vector table providing the device driver with functions with which to access
             the opaque buffer.

       pkt_size:    Size of the packet, in bytes.


Returns

       Returns 0 on success, or an error code specified in <flux/fdev/error.h> on error.


10.20.2        fdev_net_alloc:  allocate an opaque buffer into which to receive data


Synopsis

       #include  <flux/fdev/net.h>

       fdev_error_t fdev_net_alloc(fdev_net_t *dev, vm_size_t buf_size, fdev_memflags_t flags,
       [out] void **buf, [out] fdev_buf_vec_t **bufvec);


Direction

       Driver → OS


Description

       The device driver calls this OS-supplied function to request a buffer for a packet to be received
       into.  The driver may or may not use this buffer immediately:  for example, drivers for DMA-
       based network controllers may want to maintain one or more empty packet buffers in a "receive
       buffer pool" so that they can supply those buffers directly to the DMA hardware and not have
       to copy packets out of another buffer after receipt.  However,  for each buffer allocated using
       fdev_net_alloc,  the driver will call either fdev_net_recv or fdev_net_free exactly once,  to
       dispose of the buffer appropriately.

       The device driver specifies a set of memory allocation requirements using the flags parameter,
       just as for allocation of driver-private memory using fdev_mem_alloc.  For example, if the
       driver will be using the buffer for DMA, it must specify the FDEV_PHYS_WIRED flag, and possibly
       FDEV_PHYS_CONTIG if scatter/gather DMA is not supported.  XXX what about buf_size,
       AUTO_SIZE, etc.?

       The OS returns an opaque buffer and corresponding function vector through which the driver
       can perform the standard buffer operations such as copyout, map, and wire.


Parameters

       dev :   The device for which the packet buffer will be used. XXX maybe shouldn't specify this.

       buf_size:    The size of the packet receive buffer required.

       flags:   Memory type flags specifying requirements for the buffer.

       buf :   The opaque buffer allocated is returned in this parameter.

       bufvec:    A function vector table providing the device driver with functions with which to access
             the opaque buffer.


Returns

       Returns 0 on success, or an error code specified in <flux/fdev/error.h> on error.
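The ownership discipline above — every buffer obtained from fdev_net_alloc is later handed to exactly one of fdev_net_recv or fdev_net_free — can be modeled with a toy counter.  The toy_* functions and demo_rx_pool below are hypothetical stand-ins for the framework calls, kept only to make the lifecycle checkable.

```c
/* Toy model of the receive-pool discipline described above. */
#define POOL_MAX 8

static int bufs_outstanding;    /* allocated, not yet recv'd or freed */

static void *toy_net_alloc(void)   { bufs_outstanding++; return &bufs_outstanding; }
static void  toy_net_recv(void *b) { (void)b; bufs_outstanding--; }
static void  toy_net_free(void *b) { (void)b; bufs_outstanding--; }

/* A DMA-style driver pre-fills a receive buffer pool, disposes of each
   buffer exactly once, and ends with no buffers outstanding. */
int demo_rx_pool(void)
{
    void *pool[POOL_MAX];
    int i;

    for (i = 0; i < POOL_MAX; i++)   /* fill the receive buffer pool */
        pool[i] = toy_net_alloc();

    toy_net_recv(pool[0]);           /* a packet arrived in buffer 0 */

    for (i = 1; i < POOL_MAX; i++)   /* shutdown: free the unused rest */
        toy_net_free(pool[i]);

    return bufs_outstanding;         /* 0 iff the discipline was honored */
}
```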


10.20.3        fdev_net_recv:  notify the OS that data has been received into a buffer


Synopsis

       #include  <flux/fdev/net.h>

       fdev_error_t fdev_net_recv(fdev_net_t *dev, void *buf, fdev_buf_vec_t *bufvec, vm_size_t
       pkt_size);


Direction

       Driver → OS


Description

       The device driver calls this OS-supplied function after receiving a packet into a packet buffer
       previously allocated using fdev_net_alloc.  Once this call is made, control of the packet buffer
       is passed from the driver back to the OS, and the driver must not reference it again. (Of course,
       after the OS is done processing the packet, it is free to keep the buffer around in a cache and pass
       it back to the driver in a future fdev_net_alloc call; however, as far as the driver is concerned,
       this is a new buffer.)


Parameters

       dev :   The device from which the packet was received.

       buf :   The opaque packet buffer pointer, as returned from fdev_net_alloc.

       bufvec:    The buffer's corresponding function vector, as returned from fdev_net_alloc.

       pkt_size:    Size of the packet received; the first pkt_size bytes of the packet buffer have been filled
             in with the packet, while the rest of the buffer has undefined contents.


Returns

       Returns 0 on success, or an error code specified in <flux/fdev/error.h> on error.


10.20.4        fdev_net_free:  free an unused network packet buffer


Synopsis

       #include  <flux/fdev/net.h>

       fdev_error_t fdev_net_free(fdev_net_t *dev, void *buf, fdev_buf_vec_t *bufvec);


Direction

       Driver → OS


Description

       The device driver calls this OS-supplied function to dispose of a packet buffer allocated using
       fdev_net_alloc without receiving a packet into it.  Typically this routine is called during driver
       shutdown, or when "pruning" inactive buffers from any internal receive buffer cache the driver
       may be using.


Parameters

       dev :   The device from which the packet was received.

       buf :   The opaque packet buffer pointer, as returned from fdev_net_alloc.

       bufvec:    The buffer's corresponding function vector, as returned from fdev_net_alloc.


Returns

       Returns 0 on success, or an error code specified in <flux/fdev/error.h> on error.


10.21          Serial  Device  Interfaces


10.21.1        fdev_serial_set:  set standard serial port parameters


Synopsis

       #include  <flux/fdev/serial.h>

       fdev_error_t fdev_serial_set(fdev_serial_t *dev, unsigned baud, int mode);


Description

       This function is specific to RS-232 serial devices:  specifically, devices with a class of FDEV_CHAR
       and a character device type of FDEV_CHAR_SERIAL. It can be called by the host OS to set standard
       serial port parameters such as baud rate.

       The following mode flags are defined in flux/fdev/serial.h: XXX

       Most existing serial device drivers also support an ioctl-based method of setting these serial
       parameters and perhaps others.  However, the specific ioctl codes and structure formats used
       vary depending on the nature and origin of the device driver (e.g., BSD serial drivers use dif-
       ferent ioctl codes than Linux drivers do); therefore, this function is provided so that the basic
       parameters common to all standard serial ports can be controlled in a driver-independent way.


Parameters

       dev :   A pointer to the serial device node to invoke. Must be of type fdev_serial_t.

       baud :   The new baud rate for the serial port to use.

       mode:     The new serial line mode, consisting of the flags defined above.


Returns

       Returns 0 on success, or an error code specified in <flux/fdev/error.h> on error.


10.21.2        fdev_serial_get:  get standard serial port parameters and line status


Synopsis

       #include  <flux/fdev/serial.h>

       fdev_error_t fdev_serial_get(fdev_serial_t *dev, [out] unsigned *baud, [out] int *mode,
       [out] int *status);


Description

       This function is specific to RS-232 serial devices:  specifically, devices with a class of FDEV_CHAR
       and a character device type of FDEV_CHAR_SERIAL. It can be called by the host OS to examine
       the state of the serial port.  The current baud rate, serial line mode (as specified to
       fdev_serial_set), and status flags are returned in separate parameters.

       The following status flags are defined in flux/fdev/serial.h: XXX

       Most existing serial device drivers also support an ioctl-based method of checking these serial
       parameters and perhaps others.  However, the specific ioctl codes and structure formats used
       vary depending on the nature and origin of the device driver (e.g., BSD serial drivers use
       different ioctl codes than Linux drivers do); therefore, this function is provided so that the basic
       parameters common to all standard serial ports can be examined in a driver-independent way.


Parameters

       dev :   A pointer to the serial device node to invoke. Must be of type fdev_serial_t.

       baud :   The serial port's current baud rate is returned in this parameter.

       mode:     The serial port's current mode (e.g., the mode last specified using fdev_serial_set)
             is returned in this parameter.

       status:    The serial port's current status flags are returned in this parameter.


Returns

       Returns 0 on success, or an error code specified in <flux/fdev/error.h> on error.


10.22          Driver-Kernel  Interface:  x86  ISA  device  registration


10.22.1        fdev_isa_add:  add a device node to an ISA bus


The address parameter is used to uniquely identify the device on the ISA bus.  For example, if there are
two identical NE2000 cards plugged into the machine, the address will be the only way the host OS can
distinguish them, because all of the other parameters of the device will be identical.  If address is in the
range 0-0xffff (0-65535), it is interpreted as a port number in I/O space; otherwise, it is interpreted as a
physical memory address.  For devices that use any I/O ports for communication with software, the base
of the "primary" range of I/O ports used by the device should be used as the address; a physical memory
address should be used only for devices that communicate solely through memory-mapped I/O.


10.22.2        fdev_isa_remove:  remove a device node from an ISA bus


10.22.3        fdev_isa_alloc_ports:  allocate a range of I/O ports


10.22.4        fdev_isa_free_ports:  release a range of I/O ports


10.22.5        fdev_isa_alloc_physmem:  allocate a range of physical memory


10.22.6        fdev_isa_free_physmem:  release a range of physical memory


10.22.7        fdev_isa_alloc_dma:  allocate a DMA channel


10.22.8        fdev_isa_free_dma:  release a DMA channel


10.23          Driver-Kernel  Interface:  PCI  device  registration


10.24          Driver-Kernel  Interface:  SCSI  device  registration


10.24.1        fdev_scsi_add:  add a device node to a SCSI bus


10.24.2        fdev_scsi_remove:  remove a device node from a SCSI bus



10.25          Error  Codes



Chapter  11


Device   Driver   Support   Library



(libfdev.a)


This chapter is extremely incomplete; it is basically only a bare skeleton.
    Implementation is in progress and an initial snapshot should be available in August.
11.1         Introduction


This library provides default implementations of various functions needed by device drivers under the Flux
device driver framework. These default implementations can be used by the host OS, if appropriate, to make
it easier to adopt the driver framework. The facilities provided include:


    o  Hardware resource management and tracking functions to allocate and free IRQs, I/O ports, DMA
       channels, etc.


    o  Device namespace management


    o  Memory allocation for device drivers


    o  Data buffer management
11.2         Device  Registration


XXX Builds a hardware tree. An example hardware tree is shown in Figure 11.1.
11.3         Naming


The library provides a convenient function to search the hardware tree and find specific device nodes given
Unix-style device names. For example, given the string `hd0' (BSD-style) or `hda' (Linux-style), this function
returns a pointer to the device node for the first IDE hard disk it finds in the hardware tree.
    fdev_t  *fdev_lookup(const char *name);
    XXX describe
11.4         Memory  Allocation


The default implementation uses the LMM (liblmm.a).






                                           Figure 11.1: Example Hardware Tree



11.5         Buffer  Management


Provides a "simple buffer" implementation, in which buffers are simply regions of physically and
virtually contiguous memory.



11.6         Processor  Bus  Resource  Management


XXX to allocate and free IRQs, I/O ports, DMA channels, etc.



Chapter  12


Linux   Driver   Set   (libfdev_linux.a)



Author: Shantanu Goel
    This chapter is very incomplete. Some of the internal details of the Linux driver emulation are described,
but not the aspects relevant for typical use of the library.
12.1         Introduction


XXX
12.2         Partially-compliant  Drivers


There are a number of assumptions made by some drivers:  if a given assumption is not met by the OS
using the framework, then the drivers that make the assumption will not work, but other drivers may still
be usable.  The specific assumptions made by each partially-compliant driver are listed in a table in the
appropriate section below; here is a summary of the assumptions some of the drivers make:


    o  Kernel memory can be allocated from interrupt handlers.


    o  Drivers can allocate contiguous chunks of physical memory larger than one page.


    o  (x86) Drivers can allocate memory specifically in the low 16MB of memory accessible to the PC's
       built-in DMA controller.


    o  Drivers can sleep uninterruptibly.


    o  Drivers can access the clock timer and DMA registers directly.


    o  "Poll-and-yield": drivers poll for short periods of time and yield the CPU without explicitly going to sleep.
12.3         Internals


The  following  sections  document  all  the  variables  and  functions  that  Linux  drivers  can  refer  to.   These
variables and functions are provided by the glue code supplied as part of the library, so this information
should not be needed for normal use of the library under the device driver framework.  However, they are
documented here for the benefit of those working on this library or upgrading it to new versions of the Linux
drivers, or for those who wish to "short-cut" through the framework directly to the Linux device drivers in
some situations, e.g., for performance reasons.




12.3.1        Variables


current:      This is a global variable that points to the state for the current process.  It is mostly used by
       drivers to set or clear the interruptible state of the process.

jiffies:      Many Linux device drivers depend on a global variable called jiffies, which in Linux contains
       a clock tick counter that is incremented by one at each 10-millisecond (100Hz) clock tick.  The device
       drivers typically read this counter while polling a device during a (hopefully short) interrupt-enabled
       busy-wait loop.  Although a few drivers take the global clock frequency symbol HZ into account when
       determining timeout values and such, most of the drivers just use hard-coded values when using the
       jiffies counter for timeouts, and therefore assume that jiffies increments "about" 100 times per
       second.
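A typical use of jiffies is a bounded polling loop like the sketch below.  To keep the sketch self-contained, the tick counter is advanced manually inside the loop, and device_ready stands in for a hardware status bit; in a real driver the clock interrupt increments jiffies and the loop body would read a device register.

```c
#define HZ 100                  /* Linux x86 clock frequency, ticks/sec */

static unsigned long jiffies;   /* simulated kernel tick counter */
static int device_ready;        /* stand-in for a hardware status bit */

/* Poll for up to ~2 seconds, in the hard-coded style most drivers use. */
int wait_for_ready(void)
{
    unsigned long timeout = jiffies + 2 * HZ;

    while (!device_ready) {
        if (jiffies >= timeout)
            return -1;          /* device never became ready: time out */
        jiffies++;              /* the clock interrupt does this in a
                                   real kernel */
    }
    return 0;
}
```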

irq2dev_map:       This variable is an array of pointers to network device structures.  The array is indexed by
       the interrupt request line (IRQ) number. Linux network drivers use it in interrupt handlers to find the
       interrupting network device given the IRQ number passed to them by the kernel.

blk_dev:     This variable is an array of "struct blk_dev_struct" structures.  It is indexed by the major device
       number. Each element contains the I/O request queue and a pointer to the I/O request function in the
       driver. The kernel queues I/O requests on the request queue, and calls the request function to process
       the queue.

blk_size:      This variable is an array of pointers to integers.  It is indexed by the major device number.  The
       subarray is indexed by the minor device number.  Each cell of the subarray contains the size of the
       device in 1024-byte units.  The subarray pointer can be NULL, in which case the kernel does not check
       the size and range of an I/O request for the device.

blksize_size:        This variable is an array of pointers to integers.  It is indexed by the major device number.
       The subarray is indexed by the minor device number.  Each cell of the subarray contains the block size
       of the device in bytes.  The subarray can be NULL, in which case the kernel uses the global definition
       BLOCK_SIZE (currently 1024 bytes) in its calculations.

hardsect_size:        This variable is an array of pointers to integers.  It is indexed by the major device number.
       The subarray is indexed by the minor device number.  Each cell of the subarray contains the hardware
       sector size of the device in bytes.  If the subarray is NULL, the kernel uses 512 bytes in its calculations.
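The two-level major/minor indexing convention shared by blk_size, blksize_size, and hardsect_size looks like this in use.  The table size and the get_blocksize helper below are illustrative stand-ins, not kernel code.

```c
#include <stddef.h>

#define BLOCK_SIZE 1024    /* kernel default block size */
#define MAX_MAJOR  8       /* illustrative; the kernel sizes these tables
                              by its maximum major device number */

/* Indexed by major device number; each subarray by minor number. */
static int *blksize_size_tab[MAX_MAJOR];

int get_blocksize(unsigned major, unsigned minor)
{
    int *sub = blksize_size_tab[major];
    /* A NULL subarray means "use the global default". */
    return sub ? sub[minor] : BLOCK_SIZE;
}
```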

read_ahead:       This variable is an array of integers indexed by the major device number.  It specifies how
       many sectors of read-ahead the kernel should perform on the device.  The drivers only initialize the
       values in this array; the Linux kernel block buffer code is the actual user of these values.

wait_for_request:         The Linux kernel uses a static array of I/O request structures.  When all I/O request
       structures are in use, a process sleeps on this variable. When a driver finishes an I/O request and frees
       the I/O request structure, it performs a wake up on this variable.

EISA_bus:      If this variable is non-zero, it indicates that the machine has an EISA bus.  It is initialized by
       the Linux kernel prior to device configuration.

high_memory:       This variable contains the address of the last byte of physical memory plus one. It is initialized
       by the Linux kernel prior to device configuration.

intr_count:       This variable gets incremented on entry to an interrupt handler, and decremented on exit.  Its
       purpose is to let driver code determine whether it was called from an interrupt handler.

kstat:     This variable contains Linux kernel statistics counters.  Linux drivers increment various fields in it
       when certain events occur.

tq_timer:      Linux  has  a  notion  of  "bottom  half"  handlers.   These  handlers  have  a  higher  priority  than
       any user level process but lower priority than hardware interrupts.  They are analogous to software
       interrupts in BSD. Linux checks if any "bottom half" handlers need to be run when it is returning to
       user mode.  Linux provides a number of lists of such handlers that are scheduled on the occurrence of

       specific events.  tq_timer is one such list.  On every clock interrupt, Linux checks if any handlers are
       on this list, and if there are, immediately schedules the handlers to run.


timer_active:        This integer variable indicates which of the timers in timer_table (described below) are
       active. A bit is set if the timer is active, otherwise it is clear.


timer_table:       This variable is an array of "struct timer_struct" elements.  The array is indexed by global
       constants defined in <linux/timer.h>. Each element contains the duration of the timeout and a pointer
       to a function that will be invoked when the timer expires.


system_utsname:         This variable holds the Linux version number.  Some drivers check the kernel version to
       account for feature differences between different kernel releases.



12.3.2        Functions

autoirq_setup:        int autoirq_setup(int waittime);

       This function is called by a driver to set up for probing IRQs. The function attaches a handler on each
       available IRQ, waits for waittime ticks, and returns a bit mask of the available IRQs.  The driver
       should then force the device to generate an interrupt.


autoirq_report:         int autoirq_report(int waittime);

       This function is called by a driver after it has programmed the device to generate an interrupt.  The
       function waits waittime ticks,  and returns the IRQ number on which the device interrupted.  If no
       interrupt occurred, 0 is returned.


register_blkdev:         int  register_blkdev(unsigned  major,  const  char  *name,  struct  file_operations
       *fops);

       This function registers a driver for the major number major. When an access is made to a device with
       the specified major number, the kernel accesses the driver through the operations vector fops.  The
       function returns 0 on success, non-zero otherwise.


unregister_blkdev:          int unregister_blkdev(unsigned major, const char *name);

       This  function  removes  the  association  between  a  driver  and  the  major  number  major,  previously
       established by register_blkdev. The function returns 0 on success, non-zero otherwise.
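The register/unregister contract can be sketched with a self-contained model.  struct file_operations and the table below are simplified stand-ins for the <linux/fs.h> definitions; the *_model suffix marks these as illustrations, not the kernel functions themselves.

```c
#include <stddef.h>

#define MAX_BLKDEV 64

struct file_operations { void *ops; };   /* drastically simplified */

/* Driver operation vectors, indexed by major device number. */
static struct file_operations *blkdevs[MAX_BLKDEV];

int register_blkdev_model(unsigned major, const char *name,
                          struct file_operations *fops)
{
    (void)name;
    if (major >= MAX_BLKDEV || blkdevs[major] != NULL)
        return -1;              /* invalid or already-claimed major */
    blkdevs[major] = fops;
    return 0;
}

int unregister_blkdev_model(unsigned major, const char *name)
{
    (void)name;
    if (major >= MAX_BLKDEV || blkdevs[major] == NULL)
        return -1;              /* nothing registered at that major */
    blkdevs[major] = NULL;
    return 0;
}
```

A driver registers once at initialization and unregisters when it is shut down; a second registration on the same major fails.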


getblk:     struct buffer_head *getblk(kdev_t dev, int block, int size);

       This function is called by a driver to allocate a buffer size bytes in length and associate it with device
       dev and block number block.


brelse:     void brelse(struct buffer_head *bh);

       This function frees the buffer bh, previously allocated by getblk.


bread:     struct buffer_head *bread(kdev_t dev, int block, int size);

       This function allocates a buffer size bytes in length, and fills it with data from device dev, starting at
       block number block.


block_write:       int block_write(struct inode *inode, struct file *file, const char *buf, int count);

       This function is the default implementation of file write. It is used by most of the Linux block drivers.
       The function writes count bytes of data to the device specified by the i_rdev field of inode, starting at
       the byte offset specified by the f_pos field of file, from the buffer buf.  The function returns 0 for
       success, non-zero otherwise.


block_read:       int block_read(struct inode *inode, struct file *file, char *buf, int count);

       This function is the default implementation of file read. It is used by most of the Linux block drivers.
       The function reads count bytes of data from the device specified by the i_rdev field of inode, starting at
       the byte offset specified by the f_pos field of file, into the buffer buf.  The function returns 0 for
       success, non-zero otherwise.


check_disk_change:          int check_disk_change(kdev_t dev);

       This function checks if media has been removed or changed in a removable medium device specified by
       dev. It does so by invoking the check_media_change function in the driver's file operations vector. If a
       change has occurred, it calls the driver's revalidate function to validate the new media.  The function
       returns 0 if no medium change has occurred, non-zero otherwise.


request_dma:       int request_dma(unsigned drq, const char *name);

       This function allocates the DMA request line drq for the calling driver. It returns 0 on success, non-zero
       otherwise.


free_dma:      void free_dma(unsigned drq);

       This function frees the DMA request line drq previously allocated by request_dma.


disable_irq:       void disable_irq(unsigned irq);

       This function masks the interrupt request line irq at the interrupt controller.


enable_irq:       void enable_irq(unsigned irq);

       This function unmasks the interrupt request line irq at the interrupt controller.


request_irq:       int request_irq(unsigned int irq, void (*handler)(int, struct pt_regs *), unsigned long
       flags, const char *device);

       This function allocates the interrupt request line irq, and attaches the interrupt handler handler to it.
       It returns 0 on success, non-zero otherwise.


free_irq:      void free_irq(unsigned int irq);

       This function frees the interrupt request line irq, previously allocated by request_irq.


kmalloc:      void *kmalloc(unsigned int size, int priority);

       This function allocates size bytes of memory. The priority argument is a set of bitfields defined as follows:


       GFP_BUFFER:      Not used by the drivers.

       GFP_ATOMIC:      Caller cannot sleep.

       GFP_USER:     Not used by the drivers.

       GFP_KERNEL:      Memory must be physically contiguous.

       GFP_NOBUFFER:       Not used by the drivers.

       GFP_NFS:     Not used by the drivers.

       GFP_DMA:     Memory must be usable by the DMA controller. This means, on the x86, it must be below
             16 MB, and it must not cross a 64K boundary. This flag implies GFP_KERNEL.
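The x86 GFP_DMA constraints quoted above — entirely below 16 MB and not crossing a 64 KB boundary — can be written as a predicate.  This checker is illustrative only; it is not part of the kernel.

```c
/* Return non-zero iff [phys, phys+size) is usable by the PC's ISA DMA
   controller: entirely below 16 MB and not crossing a 64K boundary. */
int dma_buffer_ok(unsigned long phys, unsigned long size)
{
    if (size == 0 || phys + size > 16UL * 1024 * 1024)
        return 0;               /* empty, or beyond ISA DMA reach */
    if ((phys >> 16) != ((phys + size - 1) >> 16))
        return 0;               /* crosses a 64 KB boundary */
    return 1;
}
```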


kfree:     void kfree(void *p);

       This function frees the memory p previously allocated by kmalloc.


vmalloc:      void  *vmalloc(unsigned long size);

       This function allocates size bytes of memory in kernel virtual space that need not have underlying
       contiguous physical memory.


check_region:        int check_region(unsigned port, unsigned size);

       Check whether the I/O address space region starting at port and size bytes in length is available for
       use. Returns 0 if the region is free, non-zero otherwise.

request_region:         void request_region(unsigned port, unsigned size, const char *name);

       Allocate  the  I/O  address  space  region  starting  at  port  and  size  bytes  in  length.   It  is  the  caller's
       responsibility to make sure the region is free by calling check_region, prior to calling this routine.

release_region:         void release_region(unsigned port, unsigned size);

       Free the I/O address space region starting at port and size bytes in length, previously allocated by
       request_region.
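The three routines above are used together in a standard probe idiom: check first, claim only after the hardware is confirmed present.  The sketch below models that protocol over a toy in-use bitmap; the *_model names mark it as an illustration, not the kernel implementation.

```c
#include <string.h>

static unsigned char port_in_use[0x10000];   /* toy I/O-space bitmap */

int check_region_model(unsigned port, unsigned size)
{
    unsigned i;
    for (i = 0; i < size; i++)
        if (port_in_use[port + i])
            return -1;          /* region already (partly) claimed */
    return 0;
}

void request_region_model(unsigned port, unsigned size, const char *name)
{
    (void)name;                 /* the kernel records this for reporting */
    memset(port_in_use + port, 1, size);
}

void release_region_model(unsigned port, unsigned size)
{
    memset(port_in_use + port, 0, size);
}

/* Typical driver probe: check, touch the hardware, then claim. */
int probe_model(unsigned io_base)
{
    if (check_region_model(io_base, 16) != 0)
        return -1;              /* ports owned by another driver */
    /* ... read device registers here to confirm it really exists ... */
    request_region_model(io_base, 16, "ne2000");
    return 0;
}
```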

add_wait_queue:        void add_wait_queue(struct wait_queue **q, struct wait_queue *wait);

       Add the wait element wait to the wait queue q.

remove_wait_queue:          void remove_wait_queue(struct wait_queue **q, struct wait_queue *wait);

       Remove the wait element wait from the wait queue q.

down:    void down(struct semaphore *sem);

       Perform a down operation on the semaphore sem.  The caller blocks if the value of the semaphore is
       less than or equal to 0.

sleep_on:      void sleep_on(struct wait_queue **q, int interruptible);

       Add the caller to the wait queue q, and block it.  If the interruptible flag is non-zero, the caller can
       be woken up from its sleep by a signal.

wake_up:     void wake_up(struct wait_queue **q);

       Wake up anyone waiting on the wait queue q.

wait_on_buffer:        void wait_on_buffer(struct buffer_head *bh);

       Put the caller to sleep, waiting on the buffer bh.  Called by drivers to wait for I/O completion on the
       buffer.

schedule:      void schedule(void);

       Call the scheduler to pick the next task to run.

add_timer:      void add_timer(struct timer_list *timer);

       Schedule a timeout.  The timeout interval and the function to be called on expiry are specified in
       timer.

del_timer:      int del_timer(struct timer_list *timer);

       Cancel the timeout timer.



12.4         Block  device  drivers


                 Name       Description            V=P   jiffies   P+Y   current
                 --------   -------------------    ---   -------   ---   -------
                 cmd640.c   CMD640 IDE Chipset
                 floppy     Floppy drive            *       *       *       *
                 ide-cd.c   IDE CDROM                       *       *       *
                 ide.c      IDE Disk
                 rz1000.c   RZ1000 IDE Chipset
                 sd.c       SCSI disk                       *
                 sr.c       SCSI CDROM
                 triton.c   Triton IDE Chipset      *

                                          Table 12.1: Linux block device drivers



12.5         Network  drivers


Things drivers may want to do that make emulation difficult:

    o  Call the 16-bit BIOS.

    o  Use the system DMA controller.

    o  Assume kernel virtual addresses are equivalent to physical addresses.

    o  Assume kernel virtual addresses can be mapped to physical addresses merely by adding a constant
       offset.

    o  Implement timeouts by busy-waiting on a global clock-tick counter.

    o  Busy-wait for interrupts.  XXX This means that the OS must allow interrupts during execution of
       process-level driver code, and not just when all process-level activity is blocked.


        Name          Description                       V=P   jiffies   P+Y   current
        -----------   -------------------------------   ---   -------   ---   -------
        3c501.c       3Com 3c501 ethernet                       *
        3c503.c       NS8390 ethernet                           *
        3c505.c       3Com Etherlink Plus (3C505)               *
        3c507.c       3Com EtherLink16                          *
        3c509.c       3c509 EtherLink3 ethernet                 *
        3c59x.c       3Com 3c590/3c595 "Vortex"                 *
        ac3200.c      Ansel Comm. EISA ethernet                 *
        apricot.c     Apricot                            *      *
        at1700.c      Allied Telesis AT1700                     *
        atp.c         Attached (pocket) ethernet                *
        de4x5.c       DEC DE425/434/435/500                     *
        de600.c       D-link DE-600                             *
        de620.c       D-link DE-620                             *
        depca.c       DEC DEPCA & EtherWORKS                    *
        e2100.c       Cabletron E2100                           *
        eepro.c       Intel EtherExpress Pro/10                 *
        eexpress.c    Intel EtherExpress                        *
        eth16i.c      ICL EtherTeam 16i & 32                    *
        ewrk3.c       DEC EtherWORKS 3                          *
        hp-plus.c     HP PCLAN/plus                             *
        hp.c          HP LAN                                    *
        hp100.c       HP 10/100VG ANYLAN                        *
        lance.c       AMD LANCE                          *      *
        ne.c          Novell NE2000                             *
        ni52.c        NI5210 (i82586 chip)                      *
        ni65.c        NI6510 (am7990 `lance' chip)       *      *
        seeq8005.c    SEEQ 8005                                 *
        sk_g16.c      Schneider & Koch G16                      *
        smc-ultra.c   SMC Ultra                                 *
        tulip.c       DEC 21040                          *      *
        wavelan.c     AT&T GIS (NCR) WaveLAN                    *
        wd.c          Western Digital WD80x3                    *
        znet.c        Zenith Z-Note                             *

                                              Table 12.2: Linux network drivers



12.6         SCSI  drivers


The Linux SCSI driver set includes both the low-level SCSI host adapter drivers and the high-level SCSI
drivers for generic SCSI disks, tapes, etc.






        Name           Description                    V=P   jiffies   P+Y   current
        ------------   ----------------------------   ---   -------   ---   -------
        53c7,8xx.c     NCR 53C7x0, 53C8x0              *       *
        AM53C974.c     AM53/79C974 (PCscsi)            *
        BusLogic.c     BusLogic MultiMaster adapters   *       *
        NCR53c406a.c   NCR53c406a                      *       *
        advansys.c     AdvanSys SCSI Adapters          *       *
        aha152x.c      Adaptec AHA-152x                        *
        aha1542.c      Adaptec AHA-1542                *       *
        aha1740.c      Adaptec AHA-1740                *
        aic7xxx.c      Adaptec AIC7xxx                 *       *
        eata.c         EATA 2.x DMA host adapters              *
        eata_dma.c     EATA/DMA host adapters          *       *
        eata_pio.c     EATA/PIO host adapters                  *
        fdomain.c      Future Domain TMC-16x0                  *
        in2000.c       Always IN 2000                          *
        NCR5380.c      Generic NCR5380                 *       *       *
        pas16.c        Pro Audio Spectrum/Studio 16
        qlogic.c       Qlogic FAS408                           *
        scsi.c         SCSI middle layer               *       *       *       *
        scsi_debug.c   SCSI debug layer                        *
        seagate.c      ST01, ST02, TMC-885                     *
        t128.c         Trantor T128/128F/228
        u14-34f.c      UltraStor 14F/34F               *       *
        ultrastor.c    UltraStor 14F/24F/34F           *
        wd7000.c       WD-7000                         *       *

                                               Table 12.3: Linux SCSI drivers



Chapter  13


FreeBSD   Driver   Set



(libfdev_freebsd.a)


This library is planned but not implemented yet.







Chapter  14


Novell   ODI   Network   Drivers



(libfdev_odi.a)


This library is planned but not implemented yet.







Chapter  15


Appendix   A:   Directory   Structure


Dimensions of the directory structure, in order of precedence:

    o  Targets (e.g. entire libraries)

    o  Architecture-specific

    o  Platform-specific (e.g. PC)

    o  Environment-specific (e.g. raw versus DOS)

    o  Modules

    The flux directory, containing the exported public header files, is considered to be one "target."  Each
library produced is another target.
    The architecture dimension needs the highest precedence so that as few architecture symlinks as possible
are needed.  For example, this way only one `machine' symlink is needed in the main public include
directory.
    As a general rule, a particular installation of the OS toolkit is specific to one architecture but not specific
to other variables such as platform or environment; the libraries and headers installed include the appropriate
code for all the supported platforms and environments.
    In the public includes, the policy for determining what goes into machine versus machine/kern is this:
All header files whose contents are purely generic and make no assumptions about the system environment
are placed in machine; these are the files that define symbolic processor constants and such, which should
always be valid in any context.  Header files for library modules that provide "default" implementations
of common facilities (such as trap handling or CPU initialization) are placed in machine/kern instead.
These modules are intended to be widely applicable, but by nature they must make some assumptions about
their environment, and thus may not be usable in some circumstances.
    One litmus test that pretty accurately makes this distinction is whether the routines defined in a given
header file are "pure" and only operate on data passed as parameters, or whether they depend on global
variables  of  any  kind.   Pure  functions  are  generally  defined  in  header  files  in  machine,  whereas  impure
functions are generally found in machine/kern.

