- The main point to remember about networks is that they are serial transmission media. That means that data flows one bit at a time from point A to point B. So, to troubleshoot a network problem, start at point A and check that everything is working on the way to point B.
- Modern networks have functions to help people use them, such as domain name servers, which translate between network addresses and human-readable host names. But when troubleshooting network problems, these servers may be inaccessible. So use the most basic method of communication, the network address itself, for testing access and collecting status information. The commands to find the network addresses that will be needed are in the procedures below.
- Networks may be slow, or services unavailable. Network commands usually have a timeout, which may have to run to completion before a program can continue. So, patience is required when troubleshooting network problems.
- Network protocols expect things to go wrong, so retries are built in most of the time. But during troubleshooting, this is not always a good thing. If possible, override the retry value with a count parameter. Otherwise, it may be necessary to interrupt a program that is not responding.
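The advice about timeouts and retry counts can be sketched in a few lines of Python. This is only an illustration, not part of the original procedures: the function name try_connect and its attempts, timeout and pause parameters are assumptions chosen for the example. It makes a fixed number of TCP connection attempts, each with its own timeout, so an unresponsive service costs roughly attempts times timeout seconds instead of hanging indefinitely.

```python
import socket
import time


def try_connect(host, port, attempts=3, timeout=2.0, pause=0.0):
    """Try a TCP connection a bounded number of times.

    Returns the number of the attempt that succeeded (1-based),
    or 0 if every attempt failed or timed out.
    """
    for attempt in range(1, attempts + 1):
        try:
            # create_connection gives up after `timeout` seconds.
            with socket.create_connection((host, port), timeout=timeout):
                return attempt
        except OSError:
            # Covers refused connections, timeouts and unreachable hosts.
            if attempt < attempts:
                time.sleep(pause)
    return 0
```

For example, try_connect("192.168.1.1", 80, attempts=2, timeout=1.0) gives up after about two seconds at most. The same idea applies on the command line: on Linux, ping accepts -c to limit the number of requests it sends.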
Application software: The local host is point A. However, there are many points of failure before data can leave the local host.
- Email client: Email clients are usually configured to poll the email server automatically for new mail. If the connection is lost, this causes error messages, which might need a response before the program can continue. If the connection is lost while a message is being transferred, the client might hang and have to be forcibly terminated.
- Web browser: Web browsers are most susceptible to domain name server problems. Each computer is assigned a local domain name server, but it only keeps track of a few host names. When a web site name is requested that is not in its list, it must query other servers to find it. This may take long enough for the web browser to time out. Or the domain name server that tracks the requested name may be down. So, web browsers are very poor indicators of network problems.
- Games: Online games are similar to email in that a single server is contacted. However, there is a higher volume and more constant flow of data, so network problems become apparent more quickly. Also, because game servers are often further away from the local host than email servers, network congestion is more likely along the client/server path and may affect game performance.
- Other applications: Audio and video streaming use a protocol that is called "unreliable". This is not as bad as it sounds, but it does mean that occasionally some data can be lost. This is why network radio sometimes sounds so poor.
- API layer: The connection between a client application and the network software is made through a "socket". On the client side, the Application Program Interface to the network software is usually not a problem. However, a program may already be using the maximum number of connections it is allowed, so that no more connections can be made. This can be checked using the netstat command described below.
- TCP/UDP layer: The Transmission Control Protocol is what the network software uses to make sure that all data sent is received. The User Datagram Protocol is similar, but it is called "unreliable" because the receiving host does not confirm that data was received. Both use network ports to identify a "session" between two hosts. When a session is active, that pair of ports cannot be used between the same two hosts for another session. Active sessions can be displayed using the netstat command described below.
- IP layer: The Internet Protocol is what the network software uses to get data from one host to another. It uses network addresses to identify the hosts and a route table to know where to send or "route" the data. The network address of the local host is usually in the /etc/hosts table if a fixed address is used. The route table can be displayed using the netstat command described below.
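To make the role of the route table concrete, here is a minimal Python sketch of how a destination address is matched against it. It is a simplification of what the network software does. The ROUTES list is an assumed example modelled on a typical home network (a local 192.168.1.0/24 network, the loopback network, and a default route through 192.168.1.1), and pick_route is a hypothetical helper name. The most specific matching entry wins, and the 0.0.0.0/0 default route catches everything else.

```python
import ipaddress

# (destination network, gateway, interface); gateway None means "directly attached".
ROUTES = [
    ("192.168.1.0/24", None, "eth0"),
    ("127.0.0.0/8", None, "lo"),
    ("0.0.0.0/0", "192.168.1.1", "eth0"),   # default route
]


def pick_route(destination, routes=ROUTES):
    """Return (gateway, interface) for the most specific matching route."""
    addr = ipaddress.ip_address(destination)
    best = None
    for network, gateway, iface in routes:
        net = ipaddress.ip_network(network)
        # Prefer the match with the longest prefix (the narrowest network).
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway, iface)
    if best is None:
        raise LookupError("no route to " + destination)
    return best[1], best[2]
```

With this table, an address on the local network is delivered directly on eth0, while any other outside address is handed to the router at 192.168.1.1.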
Interface and Local Network
Hardware/Driver: The cable that connects the local host to the network is attached to a piece of hardware that can get data from one end of the cable to the other. There are many such connections between the local host and the server host, but usually problems with those connections must be handled by others. The driver is simply the software that is used to control the hardware.
The configuration of the hardware and its driver is found in the file /etc/sysconfig/network. The driver is started and stopped using the /etc/init.d/network command. The status of the hardware and its driver can be displayed using the ifconfig command described below.
- Serial port/PPP: This hardware is the slowest network connection. The cable is connected to the serial port on the local host and to a modem at the other end. The modem is connected to a telephone jack. Often the modem is built into the local host, in which case the host is connected directly to the telephone jack. The Point-to-Point Protocol is used to send data between the local host and the Internet Service Provider's (ISP) modem. The ISP then has a faster connection to the network.
- NIC card/Ethernet driver: A Network Interface Card (NIC) and its driver software are used for faster networks. Most NICs use the Ethernet protocol, although there are several others available. The Ethernet protocol is used to send data between the local host and a router, which is connected to the network. On home networks, the cable from the NIC is usually connected to a cable modem or a Digital Subscriber Line (DSL) modem, which is connected to a router at the cable television or telephone company office.
- Wifi card/Wifi driver: There is no cable connected to a Wifi card. Instead, the Wifi card communicates with an "access point", which usually has an Ethernet NIC that is connected in the same way as a local host's Ethernet NIC.
- Home networks usually have a router between the local hosts and the cable or DSL modem. It is possible to plug more than one local host into the router, but most networks use a hub or a switch to allow a single router connector or "port" to be used. There may be an application available to display the status of the router. The same may be true for switches. However, hubs usually only display a little bit of status information on the front panel.
These devices usually have very little control capability, and are reset by powering them off and then back on. It is advisable to keep them physically close together, and if they are all plugged into a single power strip, then resetting them simultaneously is as simple as turning off and on the main power strip switch.
- Connection to network: The ISP provides the hardware and software needed to reach any host connected to the Internet, if everything is working correctly. To simplify making connections, domain name servers translate between network addresses and host names. For example, the host command displays the address for a given host name. The addresses of the ISP's domain name servers are in the file /etc/resolv.conf. Any hosts defined in the file /etc/hosts are used directly, without asking the domain name servers to translate them. So, it is a good idea to use this table for hosts on the local network, and even for a few of the ISP's hosts. The names in the /etc/hosts file do not have to match the actual name of the host. So instead of the real domain name server name, ns02.someplace.st.us02.comcast.net, the name dns1 could be used.
- Network path: Under the TCP/IP protocol, data is broken into "packets" of various sizes. The protocol is designed to choose any path that is available for each individual packet. This means that while some packets may be lost, there is usually a way for the entire message to reach its destination. However, sometimes there is a point of failure that is impossible to bypass. There are several tools, described below, to determine if the path to the server is available.
- Hardware: The hardware for servers and their networks is essentially the same as that at the local host, just more of it. This means complexity, which means more chances for errors.
- Network stack: The server's network software must answer requests from hundreds or thousands of client hosts. If the clients happen to make their requests at nearly the same time, the server can become so busy that the network software rejects some of the connection requests.
- Server application: The server application can also become overloaded and respond so slowly that the client times out.
Another application problem can be that the configuration file is not set correctly for the functionality being attempted. The sshd configuration is famous for its gotchas. The defaults work 90% of the time, but when unusual uses are required, it is often difficult to get it right. Running ssh with the -v option while attempting to connect can help, because it prints debugging messages about the progress of the connection.
- Permissions: Sometimes the permissions set for files and directories are insufficient for the user id that the application runs under. This usually happens when a new application is being brought up. But it can also happen when the application allows different access privileges, and a user's access is changed. Applications often consist of multiple programs, and error messages passed between them can become meaningless, making this problem especially difficult to track down.
Also, the /etc/hosts.deny and /etc/hosts.allow files specify which client hosts are even allowed to establish sessions, and with which application. If these are not configured correctly, the client will usually receive a message indicating "access denied".
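The way /etc/hosts lets a short alias such as dns1 stand in for a long server name, as described under "Connection to network" above, can be illustrated with a small parser. This is a sketch, not how the system resolver is implemented: parse_hosts is a hypothetical helper, and the sample addresses and the name mypc are made up for the example. Each line holds an address followed by one or more names, and anything after # is a comment.

```python
def parse_hosts(text):
    """Map every host name or alias in hosts-file text to its address."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()   # drop comments, split on whitespace
        if len(fields) < 2:
            continue                             # blank, comment-only or malformed line
        address, names = fields[0], fields[1:]
        for name in names:
            # Keep the first entry seen for a name.
            table.setdefault(name.lower(), address)
    return table


# Assumed sample content, not taken from a real system.
SAMPLE = """\
127.0.0.1    localhost
192.168.1.5  mypc            # this machine (assumed name)
10.11.12.10  dns1            # short alias for the ISP's name server
"""

if __name__ == "__main__":
    print(parse_hosts(SAMPLE)["dns1"])
```

Running the same function over the contents of the real /etc/hosts shows which names can be used without any domain name server being reachable.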
Start from closest point
- Check processes: Sometimes multiple instances of the same process can create resource conflicts.
Example: Someone clicked on the mozilla startup icon twice
> ps -ef | head -1; ps -ef | grep moz
UID PID PPID C STIME TTY TIME CMD
an_id 4775 3924 0 Jun22 ? 00:00:00 /bin/sh /usr/bin/mozilla
an_id 4786 4775 0 Jun22 ? 00:11:41 /opt/mozilla/lib/mozilla-bin
an_id 7976 3924 0 Jun22 ? 00:00:00 /bin/sh /usr/bin/mozilla
an_id 8103 7976 0 Jun22 ? 00:11:41 /opt/mozilla/lib/mozilla-bin
- Check TCP/IP stack: If it is up, there should be some services active.
> netstat -a | grep tcp
tcp 0 0 localhost:smtp *:* LISTEN
tcp 1 0 localhost:34020 localhost:ipp CLOSE_WAIT
tcp 0 0 localhost:32803 *:* LISTEN
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 localhost:smtp *:* LISTEN
- Check interface: You may need to execute this as root.
> ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:E0:81:29:8E:58
inet addr:192.168.1.5 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::2e0:81ff:fe29:8e58/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:480368 errors:0 dropped:0 overruns:0 frame:0
TX packets:24715 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:250359054 (238.7 Mb) TX bytes:2946463 (2.8 Mb)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:403 errors:0 dropped:0 overruns:0 frame:0
TX packets:403 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:27106 (26.4 Kb) TX bytes:27106 (26.4 Kb)
- Check the route table.
> netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
10.11.12.13 0.0.0.0 255.255.0.0 U 0 0 0 eth0
127.0.0.0 0.0.0.0 255.0.0.0 U 0 0 0 lo
0.0.0.0 192.168.1.1 0.0.0.0 UG 0 0 0 eth0
- Check the local network.
> ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=1 ttl=150 time=0.701 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=150 time=0.674 ms
- Check the domain name server.
> cat /etc/resolv.conf
> ping 10.11.12.10
PING 10.11.12.10 (10.11.12.10) 56(84) bytes of data.
64 bytes from 10.11.12.10: icmp_seq=1 ttl=242 time=14.8 ms
64 bytes from 10.11.12.10: icmp_seq=2 ttl=242 time=13.0 ms
> host linux.org
linux.org has address 220.127.116.11
> host redhat.com
redhat.com has address 18.104.22.168
> host suse.com
suse.com has address 22.214.171.124
- Check the network path.
> traceroute something.com
traceroute to something.com (10.11.12.13), 30 hops max, 40 byte packets
1 local_router (192.168.1.1) 2.688 ms 4.170 ms 5.757 ms
2 next_router (10.1.1.1) 3.626 ms 5.143 ms 6.912 ms
3 * * *
4 something.com (10.11.12.13) 6.743 ms 7.954 ms 9.051 ms
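The procedure above works outward from the local host, and the first step that fails shows where to look. That ordering can be captured in a short Python sketch. The name run_checks and the stand-in check functions are assumptions for illustration; in practice each check would wrap one of the commands shown above.

```python
def run_checks(checks):
    """Run (name, function) pairs in order, nearest point first.

    Each function returns True on success. Stops at the first failure and
    returns (names that passed, name that failed or None).
    """
    passed = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False          # a check that crashes counts as a failure
        if not ok:
            return passed, name
        passed.append(name)
    return passed, None


if __name__ == "__main__":
    # Stand-in checks; real ones would test the interface, router, DNS, and so on.
    demo = [
        ("interface", lambda: True),
        ("local router", lambda: True),
        ("name server", lambda: False),
        ("remote host", lambda: True),
    ]
    passed, failed = run_checks(demo)
    print("passed:", passed)
    print("first failure:", failed)
```

Stopping at the first failure is deliberate: checks further along the path cannot be trusted until the nearer point is working.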
Example 2: Try to telnet to redhat.com on port 80. Use Ctrl-] to break out of telnet.
> telnet redhat.com 80
Connected to redhat.com.
Escape character is '^]'.
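The telnet test above only asks whether a TCP session can be opened to a given port. A rough Python equivalent is sketched below; check_port and its return strings are assumptions made for this example. It distinguishes a refused connection (the host answered but nothing is listening on that port) from a timeout (no answer at all), which point to different places along the path.

```python
import socket


def check_port(host, port, timeout=3.0):
    """Classify a TCP connection attempt as open, refused, timeout or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            local = conn.getsockname()[:2]
            remote = conn.getpeername()[:2]
            # The local and remote (address, port) pairs identify this session.
            return "open", (local, remote)
    except ConnectionRefusedError:
        return "refused", None
    except socket.timeout:
        return "timeout", None
    except OSError as exc:
        # Name lookup failures, unreachable networks, and so on.
        return "error: %s" % exc, None
```

Calling check_port("redhat.com", 80) mirrors the telnet example, though it depends on name resolution; passing a network address instead avoids the domain name server.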
The Requests for Comments (RFCs) that define the TCP/IP protocols are found at the following site. They are well worth a read once a little experience working with network issues has been gained.