How Do You Troubleshoot a Network Problem?

Note:  This posting should help you troubleshoot many different network problems (not just those described in the problem scenario below).  Possible solutions 1 through 5 are ideal for erratic nmap results (inconsistent or discrepant output). 

Problem scenario
A port seems blocked on a Linux server given the results of nmap. The host appears to be down. You know this port is not blocked by intermediate routers and/or firewalls.  You know the server is turned on.  Sometimes you get false negatives with nmap.  How do you troubleshoot a network problem with a port that seems blocked but is not truly blocked insofar as you can tell?

If there are commands in bold, the outermost (and possibly the only) quotes should be ignored when actually entering them into a command prompt.

Possible solution #1
Verify the IP addresses are really what you think.  With VPNs, routing tables, DNS issues, DHCP or NAT in play, even an experienced professional can become confused about which IP address belongs to which host.

Possible solution #2
n.b.  This is only relevant for Docker containers.  If you are not running Docker, skip this possible solution.  There could be a web service on the Linux server that maps the inbound port to a different port.  This separate port could be blocked by an intermediate [hardware or software] firewall.  This can happen when a web service is run in a Docker container.  A .yml file will map the listening, inbound port to another port.  This can create the problem scenario described above.  You can change the mapping of the .yml file and restart the Docker container.  Or you can open up the firewall to allow connectivity on the second port involved.
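To see which host port a container's port is actually published on, a small helper like the one below may help. It is only a sketch: the function name published_ports is made up for illustration, and it assumes the usual output format of the docker port command (for example "80/tcp -> 0.0.0.0:8080").

```shell
# Reads `docker port <container>` output from stdin, for example
#   80/tcp -> 0.0.0.0:8080
# and prints "container-port host-port" for each mapping.
# Assumption: the third field ends in ":<host port>".
published_ports() {
    awk '$2 == "->" { n = split($3, p, ":"); print $1, p[n] }'
}

# Hypothetical usage (the container name "web" is a placeholder):
#   docker port web | published_ports
```

The second column is the port that intermediate firewalls must allow.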

Possible solution #3
Use nmap -Pn x.x.x.x.  The short reason why this may work is that the -Pn flag skips a process called "host discovery" that nmap runs by default before it scans any ports (including when you use the -p flag to test a specific port).  If the "host discovery" process fails in this initial stage, nmap will report that the host appears down when it is not, and it will not scan the port at all.  Remember that nmap -Pn x.x.x.x only scans the 1000 most common TCP ports.  There are 65535 TCP ports total.  To scan every port without host discovery, try this: nmap -Pn -p 1-65535 x.x.x.x

For more details, see the * below.

Possible solution #4
Use sudo nmap -p 55 x.x.x.x (where 55 is the port you are testing and x.x.x.x is the IP address).  When nmap is run by a non-root user without the "sudo " in front, host discovery is limited to TCP connection attempts to two ports (80 and 443).  This is an initial, behind-the-scenes process.  If the probes to both of these ports are dropped, nmap will report that the host seems down.  With "sudo " in front, four probes are used: an ICMP echo request, a TCP SYN packet to port 443, a TCP ACK packet to port 80, and an ICMP timestamp request.  You have more chances of getting past the initial host discovery phase with a leading "sudo " invocation.**
You may get output such as this (without the "sudo" before the nmap command):

Starting Nmap 7.40 ( ) at 2018-04-10 10:56 UTC
Note: Host seems down. If it is really up, but blocking our ping probes, try -Pn
Nmap done: 1 IP address (0 hosts up) scanned in 3.04 seconds

The above problem can be circumvented if you use the sudo command before nmap. (You may want to see this posting too.)
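If you script your checks, you can detect this situation automatically. The following is a minimal sketch, not a definitive implementation. The function name host_discovery_failed is made up for illustration, and it only looks for the "Host seems down" note shown in the output above.

```shell
# Returns 0 when nmap output (read from stdin) contains the
# "Host seems down" note, which means host discovery failed
# and the requested port was never actually scanned.
host_discovery_failed() {
    grep -q 'Host seems down'
}

# Hypothetical usage (x.x.x.x and port 55 are placeholders):
#   if nmap -p 55 x.x.x.x | host_discovery_failed; then
#       nmap -Pn -p 55 x.x.x.x    # retry without host discovery
#   fi
```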

Possible solution #5
Use the "-d" (debugging) flag at the end of the nmap command.  The nmap utility has many processes that happen behind the scenes.  To see these processes printed out verbosely in real time, add the -d flag like this:
nmap -p 22 x.x.x.x -d

Possible solution #6
Try ping and traceroute.  These commands may reveal something new to help you find the problem.  This is a "back to basics" approach that can be overlooked by intermediate level (not advanced) network engineers.  If ping shows a packet loss above 0%, then you may want to read this posting. If traceroute provides only "* * *" as output, try the traceroute command with "sudo " in front and the " -I" flag after the traceroute command but before the IP address like this:  sudo traceroute -I x.x.x.x You may also want to see this posting for details about TTL packets.
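To pull the packet loss figure out of ping's output in a script, something like the sketch below may work. The function name packet_loss_pct is made up for illustration, and it assumes the summary line format printed by ping on Linux (for example "3 packets transmitted, 3 received, 0% packet loss, time 2003ms").

```shell
# Prints the packet-loss percentage found in ping's summary line
# (read from stdin), without the percent sign.
# Assumption: the summary contains "<number>% packet loss".
packet_loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical usage (x.x.x.x is a placeholder):
#   ping -c 4 x.x.x.x | packet_loss_pct
```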

Possible solution #7
If you are using Windows, use tracert from a Command Prompt.  tracert works like traceroute.  You can use Test-NetConnection in newer versions of PowerShell.  Like nmap, it can test a specific TCP port (with the -Port option), although its other options are different.  If you are using PowerShell version 3 (which does not have Test-NetConnection), you can see this posting.  You never know what Windows hosts may reveal.  These options may help. You may also want to see this posting for details about TTL packets.

Possible solution #8
You may want to install Cacti on a Linux server in your network.  If your network is flooded with packets from a malware infection or a DDoS attack, Cacti may help show you the congestion graphically and isolate the source.  Collisions in a network can degrade network performance.  Packets can flood the network from a malfunctioning device.  For a Windows environment, you may want to view this posting.

Possible solution #9
Be aware of operating system firewalls, intermediate firewalls, and intermediate intrusion prevention systems with active rules to drop packets or reject connections from certain IP addresses and not others.  Be aware of internal IP addresses and external IP addresses.  To find an internal IP address on a Linux server, run ip addr show.  To find an external IP address on a Linux server, run curl against a web service that returns your public IP address (assuming the server has access to the internet).  To find an internal IP address for a Windows server, run ipconfig from a DOS or PowerShell prompt.  To find an external IP address on a Windows server, open PowerShell and run curl against the same kind of service.  Consider that the IP address used for successful connectivity on a given server may not be the one you were expecting.  To find out if there is a firewall running on Linux, see this posting.
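On a Linux server with several interfaces, it can help to list just the interface names and their IPv4 addresses. Here is a rough sketch. The function name list_ipv4 is made up for illustration, and it assumes the one-line-per-address format produced by ip -o -4 addr show.

```shell
# Reads `ip -o -4 addr show` output from stdin and prints one
# "interface address" pair per line, dropping the /prefix length.
# Assumption: field 2 is the interface, field 3 is "inet",
# and field 4 is the address in CIDR form.
list_ipv4() {
    awk '$3 == "inet" { split($4, a, "/"); print $2, a[1] }'
}

# Hypothetical usage:
#   ip -o -4 addr show | list_ipv4
```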

Possible solution #10
The port might truly be blocked, but you have forgotten about some silent coworker implementing something new (e.g., applying updates to a router's flash memory or implementing a new ACL).  Was there an email that you received with the subject "Maintenance window"?  If the server is in AWS, remember that network access control lists (network ACLs) are part of AWS and can block a port at the subnet level.

Possible solution #11
You might be on the wrong server (or a different server from the one you thought you were on).  False positives from networking utilities could be the result of being on the wrong server.  Sometimes DHCP in a highly automated or dynamic environment can contribute to what seems to be a false positive. 

Possible solution #12
Can you look at the routers and switches involved?  Do you see flashing amber or solid red lights on the network interface ports?  Did a network cable get unplugged?  Is there an electronic device that operates on the same frequency as the wireless routers?  There is a story about a network outage that happened every weekday at lunch time.  The kitchen was physically positioned between the workstations using the wireless network and the WiFi routers.  When the microwave ovens were running, their frequency would interfere with the wireless connections.

Possible solution #13
Could the network be using IPv6 and you did not know about it?  Could the networking team have changed the routing protocols?  Was an OS patch recently released to the servers involved in your network riddle?

Possible solution #14
If you are having trouble isolating a network problem, install tcpdump on the server that is not receiving connections properly and consistently, and run it there to see which packets actually arrive.  Running tcpdump can even make pings work that otherwise would not (by default it puts the network interface into promiscuous mode, which can change which frames the host accepts).
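To keep the tcpdump output readable, you can limit the capture to one peer and one TCP port. The sketch below only builds the filter expression. The function name capture_filter is made up for illustration, and the IP address and port in the usage line are placeholders.

```shell
# Builds a tcpdump capture filter for one peer ($1) and one TCP port ($2).
capture_filter() {
    printf 'host %s and tcp port %s\n' "$1" "$2"
}

# Hypothetical usage on Linux (-n skips name resolution,
# -i any listens on all interfaces):
#   sudo tcpdump -n -i any "$(capture_filter x.x.x.x 22)"
```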

Possible solution #15
Use the netstat command.  You can use the "-anlp" flags to produce relevant details.  You can use "grep" to find only network activity that is associated with a certain string pattern.  For example, if you can guess what port should be active, and you are running on Linux, run a command like this: sudo netstat -anlp | grep 8080  You should replace "8080" with the port you surmise there is activity on.  If you see no results, then nothing is listening on or connected to that port.
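One caveat with a plain grep is that "8080" also matches port 18080 or a process ID that happens to contain 8080. A stricter filter could look like the sketch below. The function name listening_on is made up for illustration, and it assumes the usual Linux netstat column layout (local address in the fourth column, state in the sixth).

```shell
# Reads `netstat -anlp` output from stdin and prints only TCP sockets
# in the LISTEN state whose local port is exactly $1.
listening_on() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $6 == "LISTEN" {
            n = split($4, parts, ":")    # the port is the last ":" piece
            if (parts[n] == port) print
        }'
}

# Hypothetical usage:
#   sudo netstat -anlp | listening_on 8080
```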

Possible solution #16
If your problem is that you cannot reach the internet at all with an available WiFi connection, see this link.

Possible solution #17
If you do not want to use or install nmap, and you want to test connectivity over a specific port, do the following.

If you can open two terminal sessions, this can work. If you cannot have more than one, use the screen command before you start the first nc command (below) so you can simulate a second terminal afterward. If you need directions installing screen, see this posting. Install the nc utility. On CentOS/RHEL/Fedora you would run sudo yum -y install nc to install it. Once nc is installed, run this command on the server you are testing: nc -l -p 9999 > /tmp/contint.txt

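From a second terminal run these two commands (where x.x.x.x is the IP address of the server running the first nc command; use 127.0.0.1 if both terminals are on the same server):

date > /tmp/orig.txt
nc x.x.x.x 9999 < /tmp/orig.txt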

Now check your work on the server where the first nc command ran: cat /tmp/contint.txt  It should show the same date that is in /tmp/orig.txt.
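Instead of comparing the two files by eye, you can let the shell do it. This is a minimal sketch. The function name transfer_ok is made up for illustration, and it assumes both files are readable from the same host (for example, when both terminals are on one server).

```shell
# Succeeds (exit status 0) only when the received file ($2) is
# byte-for-byte identical to the file that was sent ($1).
transfer_ok() {
    cmp -s "$1" "$2"
}

# Hypothetical usage:
#   transfer_ok /tmp/orig.txt /tmp/contint.txt && echo "port 9999 is reachable"
```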

Possible solution #18
A socket is an IP address and a TCP port combination. Sockets can be reachable with a curl command, but not with a ping. (Page 131 of Kubernetes in Action explains this in more detail.) Do not give up hope if you cannot ping an IP address. Pings do not use individual ports and can fail if ICMP packets are turned off; thus the socket may be working correctly (for non-theoretical purposes). It often makes sense to go "back to basics" and troubleshoot a network problem from layer 1 of the OSI model to layer 7. But sometimes it is not necessary for fixing a problem.

Ping works at layer 3 (aka the Network layer) of the OSI model (according to this posting and this posting). Individual TCP ports operate at layer 4 (aka the Transport layer) according to this posting. Ping does not have knowledge of higher layers (and thus ports), and a socket involves a port, making it a layer 4 issue. Some people characterize ports as something that is above layer 4 (such as this posting or this posting). This external page helps explain why you cannot ping a port.
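When ping fails, you can still test the socket itself with a TCP connection attempt. The following is a rough sketch. The function name tcp_port_open is made up for illustration, and it assumes bash (for its /dev/tcp feature) and the coreutils timeout command are installed.

```shell
# Tries to open a TCP connection to host $1 on port $2 and gives up
# after 3 seconds. Exit status 0 means the connection was accepted.
# Assumption: bash and timeout are available on this machine.
tcp_port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Hypothetical usage (x.x.x.x is a placeholder):
#   tcp_port_open x.x.x.x 22 && echo "socket reachable even if ping is not"
```

No ICMP is involved here, so the result does not depend on whether pings are allowed.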

Possible solution #19
If you are trying to SSH to a server but you are getting "connection refused", see this posting.

Possible solution #20
If the NIC on your workstation has no lights on, it could be that you configured the NIC in the BIOS to never turn on. The BIOS options may be in the Boot Options, and it may seem that you are enabling a boot to the NIC. We have found that turning it on will enable the NIC as normal (and unchecking it as a boot option can disable the NIC after a normal OS boot up). You do this even though you are not booting to the PXE. To see more see How Do You Get the Internet and/or NIC on Your Windows Workstation to Work?

Possible solution #21
Is SELinux enabled? It can block ports. Checking the firewall rules is not enough. Run this command to see which mode it is in: sudo getenforce

You may see Enforcing, Permissive or Disabled. If it is "Disabled" or "Permissive", then SELinux is not blocking the port (Permissive mode only logs what it would have denied). To learn more about how to configure SELinux to allow connectivity over a given port, see this.
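In a script, the getenforce result can be turned into a simple yes or no. This is only a sketch. The function name selinux_can_block is made up for illustration, and it treats Enforcing as the only mode in which SELinux denies access.

```shell
# Takes the mode printed by getenforce as $1 and returns 0 only when
# SELinux is enforcing its policy (and so could block a port).
selinux_can_block() {
    case "$1" in
        Enforcing) return 0 ;;
        *)         return 1 ;;   # Permissive only logs; Disabled does nothing
    esac
}

# Hypothetical usage:
#   selinux_can_block "$(getenforce)" && echo "SELinux may be blocking the port"
```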

Possible solution #22
What time is it? If it is during peak business hours, there could be a spike in the network usage. Traffic congestion can create more collisions and consume available bandwidth. If it is after core business hours or on the weekend, could there be a maintenance window that was unannounced? Sometimes planned outages happen right at the end of the business day. With email overload and numerous Slack/chat messages, there could have been a notification that you missed. To learn about the recommended practices of monitoring see this posting. You may be experiencing long-tail latency.

Possible solution #23
For interpreting results of nmap, see these postings (and the bottom part of this article):

How Do You Connect over Port 5986 on a Windows Server?
How Do You Troubleshoot the nmap Results “Host seems down” when the Other Server is Not Down?
How Do You Troubleshoot the nmap Result “Host seems down. If it is really up, but blocking our ping probes”?
How Do You Troubleshoot a False “State” Value in nmap Results?
Why Cannot You Ping a Server when Nmap Commands to The Server Work?

*  It is possible to have a situation like this:
Ping from server A to server B works.  
Ping from server C to server B works.
From server C, this works:  nmap -p 22 serverB
From server B, this works:  nmap -p 22 serverB
From server A, this does not work: nmap -p 22 serverB
From server A, this works:  nmap -Pn serverB
This irregularity (or anomaly) seems elusive.  What is different about using nmap to test one specific port?

The -Pn flag bypasses host discovery.  This option with nmap can help you understand why there appears to be an inconsistency.

Host discovery is a process that is part of most nmap commands (depending on which flags you use) in nmap's initial stages of running.  If you are not using "sudo" before the nmap commands, or running the commands as root, the host discovery process is limited to two ports (80 and 443).   

This was taken from the nmap 7.1 man page:  "If no host discovery options are given, Nmap sends an ICMP echo request, a TCP SYN packet to port 443, a TCP ACK packet to port 80, and an ICMP timestamp request." 

** Using "nmap -p 22" (or a different port number) will be different from "nmap -Pn" or "sudo nmap -p 22".  Using the "-Pn" flag will bypass the host discovery stage altogether.  With sudo before the nmap command (or running nmap as the root user), the host discovery process happens, but it happens differently.  With a sudoer running nmap, the host discovery process uses the four probes quoted above (two ICMP and two TCP) and not just TCP connection attempts to two ports.  Thus the scan process (or nmap run) will continue and not appear blocked in the initial stage of host discovery if the extra probes allow for reachability.
