intermittent failure during 'Starting Network'

Posted: Wed Nov 14, 2007 12:37 am
by migmog
Sometimes (40-50% of reboots) my box does not get past 'Starting Network' in its boot sequence. A message pops up, 'init: no more kernel processes' or something similar; I will update this thread with the actual message next time I see it. The box does not respond to pings, but keypresses do register, although I can't get a console.

It boots via tftpboot and nfsroot, so the network must already be working for it to get that far. Something funny must be going on.

I've figured out that it's doing udhcpc stuff in here by looking at the logs after a clean bootup. Is this necessary in an nfsroot environment?

How can I check the logs from the failed state? I tried to get a console with various key combinations (CTRL-ALT-F1, CTRL-ALT-F2, etc.), but no joy.

Any ideas?

Posted: Wed Nov 14, 2007 9:38 pm
by Pablo
What hardware are you using?
What version of MiniMyth are you using?

MiniMyth does not start a virtual terminal login. As a result, there is no way to debug until the network and telnet start. In the latest test release, I have enabled a virtual terminal login on tty1. If you boot with it, then you should be able to login as 'root' (no password required).

MiniMyth gets other information from DHCP besides an IP address. As a result, it performs DHCP even when it has already obtained an IP address through other means (e.g. NFS boot). If it has an IP address, then it requests the same address through DHCP. Also, if it is an NFS root file system, then it does not reset the network interface.

Posted: Thu Nov 15, 2007 1:12 am
by migmog
I just tried the latest test release (b23). The frontend is an Epia M10k. The NFS/DHCP/TFTP server is a Maxtor Shared Storage NAS running openMSS, and the backend is an old Compaq desktop running Ubuntu.

Now I'm getting these messages:

INIT: cannot execute "/sbin/agetty"
--- above line repeated 10 times ----
INIT: Id "1" respawning too fast: disabled for 5 minutes
INIT: no more processes left in this runlevel

So I guess the tty is not running, and I can't log in. Has the system gone to rootfs-ro yet? I tried copying agetty into /sbin beside init, modprobe and pivot_root, ensuring 755 permissions, but that didn't help.

I'm intrigued about your boot processes - there's no /etc/inittab, where did you add the agetty?

Also, now when it _does_ get past 'Starting Network', it goes to the end of the normal bootup, but instead of going to X and Myth, I get a tty login.

I tried looking at the DHCP server logs (edited to protect the innocent), but I can't see anything wrong, other than no activity after the 2nd DHCPOFFER:

in.tftpd[14526]: RRQ from filename PXEClient/kernel-minimyth
dnsmasq[330]: DHCPDISCOVER(eth0) M:A:C
dnsmasq[330]: DHCPOFFER(eth0) M:A:C
dnsmasq[330]: DHCPREQUEST(eth0) M:A:C
dnsmasq[330]: DHCPACK(eth0) M:A:C phoenix
rpc.mountd: authenticated mount request from phoenix:770 for /shares/mss-hdd/Archive/nfsroot/minimyth (/shares/mss-hdd)
dnsmasq[330]: DHCPDISCOVER(eth0) M:A:C
dnsmasq[330]: DHCPOFFER(eth0) M:A:C
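
For anyone who wants to spot this pattern in a longer log, here is a minimal Python sketch. It assumes a single client and dnsmasq lines shaped like the ones above; the function name and the grouping rule (a new exchange starts at each DHCPDISCOVER) are my own illustration, not anything MiniMyth or dnsmasq provides.

```python
import re

# Matches dnsmasq DHCP lines such as "dnsmasq[330]: DHCPOFFER(eth0) M:A:C".
DHCP_LINE = re.compile(r"dnsmasq\[\d+\]: (DHCP[A-Z]+)\(")


def dhcp_exchanges(lines):
    """Group DHCP messages into exchanges, one per DHCPDISCOVER.

    Assumption: only one client appears in the log, so messages are
    not keyed by MAC address.
    """
    exchanges = []
    for line in lines:
        match = DHCP_LINE.search(line)
        if not match:
            continue  # skip tftpd, mountd and other non-DHCP lines
        message = match.group(1)
        if message == "DHCPDISCOVER" or not exchanges:
            exchanges.append([])
        exchanges[-1].append(message)
    return exchanges


LOG = """\
in.tftpd[14526]: RRQ from filename PXEClient/kernel-minimyth
dnsmasq[330]: DHCPDISCOVER(eth0) M:A:C
dnsmasq[330]: DHCPOFFER(eth0) M:A:C
dnsmasq[330]: DHCPREQUEST(eth0) M:A:C
dnsmasq[330]: DHCPACK(eth0) M:A:C phoenix
dnsmasq[330]: DHCPDISCOVER(eth0) M:A:C
dnsmasq[330]: DHCPOFFER(eth0) M:A:C
""".splitlines()

for number, exchange in enumerate(dhcp_exchanges(LOG), start=1):
    if exchange[-1] == "DHCPACK":
        print(number, "complete")
    else:
        print(number, "stalled after", exchange[-1])
```

On the excerpt above this reports the first exchange as complete and the second as stalled after DHCPOFFER, which is the point where the client stops answering.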

Posted: Thu Nov 15, 2007 5:21 am
by Pablo
Something is very wrong. Did you extract the NFS file system tarball as root?

If you want to get a sense of what happens at the initial boot, look at the /sbin/init script.

Basically, before the real init process starts, the read-only file system must be converted to a read-write file system. This is done using a bootstrap root file system (/) and a bootstrap init script (/sbin/init).

The kernel calls /sbin/init. /sbin/init creates a tmpfs file system (/rw) and unifies it with the read-only root file system (/rootfs-ro) using unionfs, resulting in the read-write file system (/rootfs). Then pivot_root is used to make /rootfs the root file system. Finally, a chroot is performed and the real init is called.
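
As a rough illustration of that sequence, here is a dry-run sketch in Python that only prints the equivalent shell commands and runs nothing. It is not the actual MiniMyth /sbin/init script: the unionfs mount options, the `old-root` directory passed to pivot_root, and the path of the real init are assumptions made for the example.

```python
def bootstrap_steps(rw="/rw", ro="/rootfs-ro", union="/rootfs",
                    old_root="old-root", real_init="/sbin/init"):
    """Return the shell commands for each bootstrap step, without running them.

    old_root and real_init are hypothetical names chosen for illustration.
    """
    return [
        # 1. Create the writable tmpfs layer.
        f"mount -t tmpfs tmpfs {rw}",
        # 2. Stack the tmpfs layer on top of the read-only root with unionfs.
        f"mount -t unionfs -o dirs={rw}=rw:{ro}=ro unionfs {union}",
        # 3. Make the unified tree the root file system.
        f"cd {union}",
        f"pivot_root . {old_root}",
        # 4. chroot into the new root and hand over to the real init.
        f"exec chroot . {real_init}",
    ]


for step in bootstrap_steps():
    print(step)
```

The point of the sketch is the ordering: everything up to the final step runs from the small bootstrap root, and only the last command starts the init that reads /etc/inittab.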

Posted: Thu Nov 15, 2007 6:36 pm
by migmog
Hmmmm..... Yes the NFS file system was extracted as root. I did it again to be sure, and both my b23 trees are the same.

Thanks for pointing me at /sbin/init - it all makes more sense now.

In b23 you're doing:
1:2345:respawn:/sbin/agetty -n -l /bin/login 9600 tty1

In b20 this was commented out:
#3:2345:respawn:/sbin/agetty -n -l /bin/sh 9600 tty3

I think the -n and -l /bin/login are incompatible. I will try it with -n -l /bin/sh later tonight.
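
For reference, each inittab entry has four colon-separated fields: id, runlevels, action and process. Here is a small Python sketch that splits the two entries above; the class and function names are made up for illustration.

```python
from typing import NamedTuple, Optional


class InittabEntry(NamedTuple):
    id: str
    runlevels: str
    action: str
    process: str


def parse_inittab_line(line: str) -> Optional[InittabEntry]:
    """Split one inittab line into its four fields.

    Returns None for blank lines, comments and malformed entries.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Split on the first three colons only, so the command keeps any colons.
    fields = line.split(":", 3)
    if len(fields) != 4:
        return None
    return InittabEntry(*fields)


b23 = parse_inittab_line("1:2345:respawn:/sbin/agetty -n -l /bin/login 9600 tty1")
b20 = parse_inittab_line("#3:2345:respawn:/sbin/agetty -n -l /bin/sh 9600 tty3")

print(b23.id, b23.action, b23.process)
print(b20)  # commented out, so nothing is parsed
```

The id field of the b23 entry is "1", which is the Id that init names in the 'respawning too fast' message.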

Posted: Thu Nov 15, 2007 7:01 pm
by Pablo
The former requires logging in before using the virtual terminal, whereas the latter does not. Both worked for me.

Posted: Fri Nov 16, 2007 12:44 am
by migmog
Pablo, thanks for your AMAZING support! It's people like you that make the whole open source thing the force that it is today.

I tried a dozen times tonight, but the problem did not occur. I will come back to this thread when it recurs.


Posted: Fri Nov 16, 2007 12:56 am
by migmog
I just tried it again after posting the last message and it failed as before.

Now I think I know why agetty fails: /sbin/agetty is on the nfsroot filesystem, and with the network down there is no filesystem, so no /sbin/agetty.


Debugging this is going to be hard....

Posted: Mon Dec 31, 2007 2:00 am
by migmog
My frontend box died a couple of weeks ago - looks like this problem was hardware related. I will update if/when I get a new box going.