In computing, a system call is how a program requests a service from the operating system's kernel. Such services include hardware-related operations (e.g. accessing the hard disk), creating and executing new processes, and communicating with integral kernel services such as scheduling. System calls provide the essential interface between a process and the operating system.


System Call

Whenever a system call is executed from user level, a trap is generated, and it is handled in the file Exception.S (i386\i386):

/*
 * Even though the name says 'int0x80', this is actually a TGT (trap gate)
 * rather than an IGT (interrupt gate).  Thus interrupts are enabled on
 * entry just as they are for a normal syscall.
 */
     SUPERALIGN_TEXT
IDTVEC(int0x80_syscall)
     pushl     $2               /* sizeof "int 0x80" */
     subl     $4,%esp               /* skip over tf_trapno */
     pushal
     pushl     %ds
     pushl     %es
     pushl     %fs
     SET_KERNEL_SREGS
     FAKE_MCOUNT(TF_EIP(%esp))
     pushl     %esp
     call     syscall               /* call syscall() with the trap frame pointer pushed above */
     add     $4, %esp
     MEXITCOUNT
     jmp     doreti               /* return from interrupt */


syscall() is a function in Trap.c (i386\i386), which is part of the machine-dependent code.


Function syscall() is called with a pointer to the trap frame, which holds the input values for the system call.

/*
 * Exception/Trap Stack Frame
 */

struct trapframe {
     int     tf_fs;
     int     tf_es;
     int     tf_ds;
     int     tf_edi;
     int     tf_esi;
     int     tf_ebp;
     int     tf_isp;
     int     tf_ebx;
     int     tf_edx;
     int     tf_ecx;
     int     tf_eax;
     int     tf_trapno;
     /* below portion defined in 386 hardware */
     int     tf_err;
     int     tf_eip;
     int     tf_cs;
     int     tf_eflags;
     /* below only when crossing rings (e.g. user to kernel) */
     int     tf_esp;
     int     tf_ss;
};
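
The field order in the trap frame mirrors the push order in the assembly stub: the stack grows downward, so the last value pushed (%fs) becomes the first field, and the hardware-pushed values sit at the highest addresses. The following is a minimal sketch that makes those offsets visible. The names trapframe32 and tf_offset_* are hypothetical, and int32_t is an assumption standing in for the 32-bit int of i386 so that the numbers are the same on any host.

```c
#include <stddef.h>
#include <stdint.h>

/* Same field order as the trap frame above (hypothetical copy for illustration). */
struct trapframe32 {
    int32_t tf_fs, tf_es, tf_ds;               /* pushed last by the stub */
    int32_t tf_edi, tf_esi, tf_ebp, tf_isp;    /* pushal, lowest addresses first */
    int32_t tf_ebx, tf_edx, tf_ecx, tf_eax;
    int32_t tf_trapno;                         /* slot reserved by "subl $4,%esp" */
    int32_t tf_err;                            /* slot filled by "pushl $2" */
    int32_t tf_eip, tf_cs, tf_eflags;          /* pushed by the CPU on the trap */
    int32_t tf_esp, tf_ss;                     /* only when crossing rings */
};

/* Byte offsets from the frame pointer that the stub passes to syscall(). */
size_t tf_offset_eax(void)    { return offsetof(struct trapframe32, tf_eax); }
size_t tf_offset_trapno(void) { return offsetof(struct trapframe32, tf_trapno); }
size_t tf_offset_err(void)    { return offsetof(struct trapframe32, tf_err); }
size_t tf_offset_esp(void)    { return offsetof(struct trapframe32, tf_esp); }
```

Under these assumptions the system call number (tf_eax) is 40 bytes above the frame pointer and the user stack pointer (tf_esp) is 64 bytes above it, which is where syscall() reads them from.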


syscall() now reads the system call inputs from the trap frame and stores them in kernel variables so that kernel code can access them while processing the system call.

	params = (caddr_t)frame->tf_esp + sizeof(int);
     code = frame->tf_eax;   // code is system call code
     orig_tf_eflags = frame->tf_eflags;


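Next, syscall() checks whether this is a direct or an indirect system call. An indirect system call passes the real system call number as its first argument instead of in the register. This is useful, for example, when a loadable kernel module installs a new system call: the number is only known once the module is loaded, and the program can then pass that number as an argument.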
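
From user space, the difference between the two styles can be sketched with the C library's syscall() function, which takes the system call number as its first argument. This is only an illustration: pid_direct and pid_via_syscall are hypothetical helper names, and whether libc routes syscall() through the kernel's indirect entry or loads the number into the register itself is an assumption that depends on the platform.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* asks glibc to declare syscall(); not needed everywhere */
#endif
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Direct style: the getpid() stub knows its own system call number. */
long pid_direct(void)
{
    return (long)getpid();
}

/* Generic style: the system call number travels as the first argument. */
long pid_via_syscall(void)
{
    return (long)syscall(SYS_getpid);
}
```

Both helpers should report the same process ID, since they reach the same kernel function by different routes.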


Next, we check that the system call code is a valid number. If it is, we look up the corresponding entry in the system call table; otherwise we use the entry for system call 0, the indirect system call.

if (code >= p->p_sysent->sv_size)
           callp = &p->p_sysent->sv_table[0];
       else
           callp = &p->p_sysent->sv_table[code];
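
The same bounds-checked lookup can be tried out in user space. The sketch below uses a hypothetical three-entry table (demo_table, demo_sysent, and the demo_* handlers are all made up for illustration), and it assumes that slot 0 simply reports ENOSYS. Treating the code as unsigned means a negative value also fails the range check.

```c
#include <errno.h>
#include <stddef.h>

typedef int (*demo_call_t)(const int *args);   /* hypothetical handler type */

struct demo_sysent {
    int         sy_narg;   /* number of arguments */
    demo_call_t sy_call;   /* implementing function */
};

static int demo_nosys(const int *args) { (void)args; return ENOSYS; }
static int demo_exit(const int *args)  { (void)args; return 0; }
static int demo_read(const int *args)  { (void)args; return 0; }

static const struct demo_sysent demo_table[] = {
    { 0, demo_nosys },   /* 0: fallback slot for out-of-range codes */
    { 1, demo_exit },    /* 1 */
    { 3, demo_read },    /* 2 */
};

#define DEMO_TABLE_SIZE (sizeof(demo_table) / sizeof(demo_table[0]))

/* Return the table entry for code, or entry 0 if code is out of range. */
const struct demo_sysent *demo_lookup(unsigned int code)
{
    if (code >= DEMO_TABLE_SIZE)
        return &demo_table[0];
    return &demo_table[code];
}
```

Falling back to a fixed slot keeps the dispatch path free of special cases: an invalid number is handled by calling an ordinary table entry that returns an error.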


Each sv_table entry is a struct sysent with the following fields:

struct sysent {               /* system call table */
     int     sy_narg;     /* number of arguments */
     sy_call_t *sy_call;     /* implementing function */
     au_event_t sy_auevent;     /* audit event associated with syscall */
     systrace_args_func_t sy_systrace_args_func;
                    /* optional argument conversion function. */
     u_int32_t sy_entry;     /* DTrace entry ID for systrace. */
     u_int32_t sy_return;     /* DTrace return ID for systrace. */
};


Once we know which system call it is, we determine how many arguments it takes:

narg = callp->sy_narg;


Then we copy all the user-space arguments into kernel space using the function copyin(), which validates the source and destination addresses.

  /*
      * copyin and the ktrsyscall()/ktrsysret() code is MP-aware
      * copy data from params to args
      */
     if (params != NULL && narg != 0)
          error = copyin(params, (caddr_t)args,
                        (u_int)(narg * sizeof(int)));
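
To show the idea of a validated copy, here is a user-space mock. It is not the kernel's copyin(): demo_user_space, demo_copyin, and demo_fetch_args are hypothetical names, the "user address space" is assumed to be a single 64-byte buffer, and validation is reduced to a range check that returns EFAULT.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the user address space. */
static unsigned char demo_user_space[64];

/* Copy len bytes starting at "user" offset uoff into kaddr.
 * Returns 0 on success, or EFAULT if the range does not lie entirely
 * inside demo_user_space; nothing is copied on failure. */
int demo_copyin(size_t uoff, void *kaddr, size_t len)
{
    if (uoff > sizeof(demo_user_space) ||
        len > sizeof(demo_user_space) - uoff)
        return EFAULT;
    memcpy(kaddr, demo_user_space + uoff, len);
    return 0;
}

/* Fetch narg int-sized arguments, mirroring the narg * sizeof(int) copy above. */
int demo_fetch_args(size_t params_off, int *args, int narg)
{
    if (narg <= 0)
        return 0;   /* nothing to copy */
    return demo_copyin(params_off, args, (size_t)narg * sizeof(int));
}
```

The important property is that the kernel works on its own copy of the arguments and that a bad user address produces an error code rather than a crash.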


Now we check whether any error has occurred so far. If one has, it must be reported to the caller, but the kernel cannot set a user-space variable such as errno directly. Instead, it stores the nonzero error number in register 0 (eax on i386) and sets the carry bit in the eflags register.

When the system call returns, the C library stub that issued it checks whether the carry bit is set. If it is, the stub copies the error number from register 0 into errno and then overwrites register 0 with -1.
For example, if the open system call encounters an error, the stub takes the error value from register 0, stores it in errno, and overwrites register 0 with -1. So open returns -1, which indicates failure.

if (error == 0) {
          td->td_retval[0] = 0;
          td->td_retval[1] = frame->tf_edx; /* keep edx (register 1) unless the call sets a second return value */

Now we call the actual system call:

error = (*callp->sy_call)(td, args);

Here, td is the thread pointer and args are the arguments copied above from user space to kernel space.

We then check error, and if all is fine we return the values to user space:

switch (error) {
     case 0:
          frame->tf_eax = td->td_retval[0];
          frame->tf_edx = td->td_retval[1];
          frame->tf_eflags &= ~PSL_C;
          break;

Otherwise we handle the error:
          frame->tf_eax = error;
          frame->tf_eflags |= PSL_C;

Now we return to the assembly code from which this function was called.
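
The whole return convention can be sketched as a round trip in plain C. This is a simplified model rather than the kernel or libc source: demo_frame, demo_kernel_return, and demo_libc_stub are hypothetical names, DEMO_PSL_C assumes the carry flag is bit 0 of eflags, and special cases such as restarting an interrupted system call are left out.

```c
#include <errno.h>

#define DEMO_PSL_C 0x00000001   /* carry flag: bit 0 of eflags */

/* Only the saved registers that matter for the return convention. */
struct demo_frame {
    int tf_eax;
    int tf_edx;
    int tf_eflags;
};

/* Kernel side: encode the outcome of the system call in the saved registers. */
void demo_kernel_return(struct demo_frame *f, int error, int retval0, int retval1)
{
    if (error == 0) {
        f->tf_eax = retval0;
        f->tf_edx = retval1;
        f->tf_eflags &= ~DEMO_PSL_C;   /* clear carry: success */
    } else {
        f->tf_eax = error;
        f->tf_eflags |= DEMO_PSL_C;    /* set carry: eax holds an error number */
    }
}

/* User side: what a libc stub does with the registers after the trap returns. */
int demo_libc_stub(const struct demo_frame *f)
{
    if (f->tf_eflags & DEMO_PSL_C) {
        errno = f->tf_eax;   /* publish the error number */
        return -1;           /* the caller only sees -1 */
    }
    return f->tf_eax;
}
```

With this model, a successful call that produces 5 is returned as 5, while a call that fails with ENOENT is returned as -1 with errno set to ENOENT.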