Network Troubleshooting Toolkit



2025-07-17

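Overview

Network troubleshooting is a core skill for operations engineers. This article walks through the main network diagnostic tools on Linux, covering how each one is used and when it applies, and outlines a troubleshooting methodology based on the OSI model.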

Troubleshooting Methodology: Layer-by-Layer Diagnosis with the OSI Model
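
One way to make the layered approach concrete is to pair each tool covered in this article with the OSI layer it mostly probes. The sketch below does that with a small lookup function. The layer assignments and the name `tools_for_layer` are illustrative assumptions, not something the article prescribes, and several tools (tcpdump in particular) span more than one layer.

```shell
# Hypothetical helper: map an OSI layer number to the tools from this
# article that are most useful at that layer. The assignment is a rough
# convention, not a strict rule.
tools_for_layer() {
  case "$1" in
    3) echo "ping traceroute mtr iptables" ;;
    4) echo "ss netstat nmap iperf3 tcpdump" ;;
    7) echo "dig curl" ;;
    *) echo "no tool from this article mapped to layer $1" ;;
  esac
}

# Walk the layers bottom-up
for layer in 3 4 7; do
  echo "L$layer: $(tools_for_layer "$layer")"
done
```

Starting at the lowest layer that can be tested and moving up keeps each step's result meaningful: a failed curl says little until ping and ss have confirmed the layers beneath it.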

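ping - Connectivity Testing

# Basic ping
ping -c 4 8.8.8.8

# Set the TTL (probe how many hops a packet can travel); on Linux iputils, -t is the TTL
ping -t 10 8.8.8.8

# Set the packet size (MTU check)
ping -s 1472 -M do 8.8.8.8
# -M do: prohibit fragmentation; ping reports an error if the packet exceeds the MTU
# 1472 (payload) + 8 (ICMP header) + 20 (IPv4 header) = 1500 (standard Ethernet MTU)

# Fast ping (flood mode, requires root)
sudo ping -f -c 1000 10.0.0.1

# ping over IPv6
ping6 ::1
ping -6 google.com

# Real-world scenarios
# Scenario 1: ping fails but the service is reachable → ICMP is probably blocked by a firewall
# Scenario 2: ping latency fluctuates widely → network congestion or poor link quality
# Scenario 3: ping shows partial packet loss → unstable or congested link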
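
The MTU arithmetic in the ping example generalizes to other MTU values and to IPv6. The following is a minimal sketch of that calculation; the function name `icmp_payload_for_mtu` is hypothetical, and it assumes an IPv4 header without options (20 bytes) or the fixed IPv6 header (40 bytes).

```shell
# Largest ICMP echo payload that fits in one unfragmented packet.
# usage: icmp_payload_for_mtu MTU [4|6]
icmp_payload_for_mtu() {
  local mtu=$1
  local ip_version=${2:-4}
  local ip_header=20          # IPv4 header without options
  local icmp_header=8         # ICMP / ICMPv6 echo header
  if [ "$ip_version" = "6" ]; then
    ip_header=40              # fixed IPv6 header
  fi
  echo $((mtu - ip_header - icmp_header))
}

icmp_payload_for_mtu 1500      # 1472, the value passed to ping -s
icmp_payload_for_mtu 1500 6    # 1452
icmp_payload_for_mtu 1400      # 1372, e.g. a tunnel with a 1400-byte MTU
```

The result can be passed straight to ping, for example `ping -c 1 -M do -s "$(icmp_payload_for_mtu 1400)" 8.8.8.8`.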

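traceroute / mtr - Route Tracing

# traceroute (sends packets with increasing TTL)
traceroute 8.8.8.8
traceroute -I 8.8.8.8  # use ICMP (default is UDP); may need root
traceroute -T 8.8.8.8  # use TCP; may need root

# mtr (combines ping and traceroute for continuous monitoring)
mtr 8.8.8.8
mtr -r -c 100 8.8.8.8  # report mode, send 100 probes

# Reading mtr output
# HOST              Loss%  Snt   Last   Avg  Best  Wrst StDev
# 1. gateway         0.0%  100    1.2   1.1   0.8   2.5   0.3
# 2. isp-router      0.0%  100    5.3   5.1   4.8   8.2   0.5
# 3. ???            100.0%  100    0.0   0.0   0.0   0.0   0.0  ← does not answer ICMP
# 4. target          0.0%  100   15.2  14.8  14.1  18.3   0.8

# Key points:
# - 100% loss at one hop while later hops are fine → that device does not reply to ICMP; traffic is unaffected
# - Loss that starts at a hop and persists through to the destination → that hop is a likely problem point
# - Large StDev → unstable link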
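
The second rule above (loss that persists to the destination) can be checked mechanically. The sketch below reads an mtr report on standard input and prints the first hop of the trailing run of lossy hops, or nothing when the destination itself shows no loss. The function name `first_persistent_loss_hop` is hypothetical, and it assumes each hop line contains exactly one field of the form `N%`, as in the sample output.

```shell
# Print the hop number where loss starts and then continues on every
# later hop up to the destination. Prints nothing if the last hop is clean.
first_persistent_loss_hop() {
  awk '
    {
      # Find the Loss% column: the first field that looks like "12.5%"
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^[0-9]+(\.[0-9]+)?%$/) {
          n++
          loss[n] = $i + 0      # numeric prefix, e.g. "12.5%" -> 12.5
          break
        }
      }
    }
    END {
      if (n == 0 || loss[n] == 0) exit   # destination shows no loss
      first = n
      while (first > 1 && loss[first - 1] > 0) first--
      print first
    }'
}

# Offline demonstration with a made-up report (values are illustrative)
first_persistent_loss_hop <<'EOF'
HOST              Loss%  Snt   Last   Avg  Best  Wrst StDev
1. gateway         0.0%  100    1.2   1.1   0.8   2.5   0.3
2. isp-router      0.0%  100    5.3   5.1   4.8   8.2   0.5
3. core-router    20.0%  100   25.0  24.1  20.3  60.2   9.5
4. target         18.0%  100   30.2  29.8  25.1  70.3  10.8
EOF
```

On live data the same function could be fed with `mtr -r -c 100 8.8.8.8 | first_persistent_loss_hop`. A hop that shows 100% loss while the destination is clean is ignored, which matches the first rule.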

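dig / nslookup - DNS Queries

# dig - full-featured DNS query tool
dig example.com                   # query the A record
dig example.com AAAA              # query the IPv6 address
dig example.com MX                # query mail servers
dig @8.8.8.8 example.com         # use a specific DNS server
dig +short example.com            # terse output
dig +trace example.com            # trace the full resolution path
dig +nocmd +noall +answer example.com  # show only the answer section

# Reverse lookup
dig -x 93.184.216.34

# Request all record types (many servers now return only a minimal answer to ANY, see RFC 8482)
dig example.com ANY

# nslookup (interactive mode)
nslookup
> server 8.8.8.8
> set type=MX
> example.com

# DNS troubleshooting scenarios
# Scenario 1: a domain name does not resolve
dig example.com @8.8.8.8        # test against a different DNS server
dig example.com @ns1.example.com # query the authoritative server directly (ns1.example.com is a placeholder; find the real one with: dig example.com NS)
dig +trace example.com           # find which step fails

# Scenario 2: DNS resolution is slow
dig example.com | grep "Query time"  # show the resolution time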
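
When comparing resolvers it helps to reduce dig's output to a single number. The sketch below extracts the millisecond value from the `;; Query time: N msec` line that dig prints. The function name `query_time_ms` is hypothetical, and the resolver addresses in the commented usage are only examples.

```shell
# Extract the millisecond value from dig's ";; Query time: N msec" line.
# Reads dig output on standard input.
query_time_ms() {
  awk '/Query time:/ { print $4 }'
}

# Possible usage against live resolvers (needs network, not run here):
#   for resolver in 8.8.8.8 1.1.1.1; do
#     printf '%s %s ms\n' "$resolver" "$(dig @"$resolver" example.com | query_time_ms)"
#   done

# Offline demonstration on a captured line
printf ';; Query time: 23 msec\n' | query_time_ms
```

Repeating the query shows whether a slow first answer is a cache miss on the resolver or a consistently slow path to it.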

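netstat / ss - Connection State

# ss (recommended; faster than netstat)
ss -tlnp                          # listening TCP ports
ss -ulnp                          # listening UDP ports
ss -tnp                           # connected (non-listening) TCP sockets; add -a to include every state
ss -s                             # connection summary statistics
ss -tn state established          # established connections
ss -tn state time-wait            # connections in TIME_WAIT
ss -tn dst 10.0.0.1               # connections to a specific IP
ss -tn sport = :80                # connections with source port 80

# Sample ss output
# State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
# ESTAB   0       0       10.0.0.1:8080       10.0.0.2:54321     users:(("java",pid=1234,fd=56))

# Count sockets by state (NR>1 skips the header row)
ss -ant | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn

# netstat (legacy tool)
netstat -tlnp                     # listening ports
netstat -an | grep -c ESTABLISHED # number of established connections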
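
Beyond counting by state, it is often useful to see which remote hosts hold the most connections. The sketch below groups `ss -tn` output by peer address. The function name `top_peers` is hypothetical, and it assumes the default column layout where the peer is the fifth field (when a `state` filter is used, ss drops the State column and the peer moves to the fourth field).

```shell
# Count connections per remote address, busiest first.
# Expects the output of: ss -tn   (header row + State column present)
top_peers() {
  awk 'NR > 1 {
         peer = $5
         sub(/:[^:]*$/, "", peer)     # strip the trailing :port
         count[peer]++
       }
       END { for (p in count) print count[p], p }' | sort -rn
}

# Offline demonstration with made-up connections
top_peers <<'EOF'
State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
ESTAB  0       0       10.0.0.1:8080       10.0.0.2:54321
ESTAB  0       0       10.0.0.1:8080       10.0.0.2:54322
ESTAB  0       0       10.0.0.1:8080       10.0.0.3:40000
EOF
```

On a live host this would be `ss -tn | top_peers | head`.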

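iptables / nftables - Firewall

# The iptables and nft commands below need root (run them as root or prefix them with sudo)

# View iptables rules
iptables -L -n -v                 # list all rules in the filter table (with counters)
iptables -L -n -v --line-numbers  # with rule numbers
iptables -t nat -L -n -v          # list NAT rules

# Common troubleshooting steps
# Temporarily insert an allow rule for testing (remove it afterwards with the same rule spec and -D)
iptables -I INPUT -p tcp --dport 8080 -j ACCEPT

# Check for DROP or REJECT rules
iptables -L INPUT -n -v | grep -E 'DROP|REJECT'

# Log packets that reach the end of the INPUT chain (those about to hit the chain policy)
iptables -A INPUT -j LOG --log-prefix "IPTables-Dropped: "

# nftables (the successor to iptables)
nft list ruleset                  # list all rules
nft list table inet filter        # list a specific table
nft monitor                       # watch ruleset changes in real time

# Troubleshooting scenario
# Scenario: the port is open but cannot be reached
iptables -L INPUT -n -v           # check the INPUT chain
iptables -L FORWARD -n -v         # for forwarded traffic, check the FORWARD chain
# Check for rules inserted by Docker/Kubernetes
iptables -L -n -v -t nat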
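
The packet counters shown by `-v` tell which rules are actually matching traffic. The sketch below filters `iptables -L <chain> -n -v` output down to DROP/REJECT rules with a non-zero packet count. The function name `blocking_rules_with_hits` is hypothetical, and it does not cover packets dropped by the chain policy, which appear in the `Chain ... (policy ...)` header instead.

```shell
# Print DROP/REJECT rules whose packet counter is non-zero.
# Expects the output of: iptables -L <chain> -n -v
blocking_rules_with_hits() {
  # Column 1 is pkts, column 3 is the target
  awk '($3 == "DROP" || $3 == "REJECT") && $1 != "0"'
}

# Offline demonstration with made-up rules
blocking_rules_with_hits <<'EOF'
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DROP       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:23
   12   720 DROP       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:8080
  340 20400 ACCEPT     tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:22
EOF
```

Running the check twice while reproducing the failure, and comparing the counters, shows whether a given rule is the one discarding the traffic.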

curl - HTTP Debugging

# Basic requests
curl -v https://example.com       # verbose output (including the TLS handshake)
curl -I https://example.com       # send a HEAD request and show only the response headers

# Timing analysis (every time_* value is cumulative, measured from the start of the request)
curl -o /dev/null -s -w "\
DNS Lookup:   %{time_namelookup}s\n\
TCP Connect:  %{time_connect}s\n\
TLS Done:     %{time_appconnect}s\n\
TTFB:         %{time_starttransfer}s\n\
Total Time:   %{time_total}s\n\
HTTP Code:    %{http_code}\n\
Download:     %{size_download} bytes\n" https://example.com

# Pin DNS resolution to a specific address
curl --resolve example.com:443:1.2.3.4 https://example.com

# Skip certificate verification (for debugging only)
curl -k https://self-signed.example.com

# POST request
curl -X POST -H "Content-Type: application/json" \
  -d '{"key":"value"}' https://api.example.com/endpoint

# Throttle the transfer rate to simulate a slow connection
curl --limit-rate 100k https://example.com/large-file
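
curl's `time_*` variables are cumulative timestamps measured from the start of the request, so the time spent in each phase is the difference between neighbouring values. The sketch below performs that subtraction. The function name `phase_durations` is hypothetical, and it assumes the five values are passed in the order shown in the usage comment.

```shell
# Convert curl's cumulative timers into per-phase durations.
# usage: phase_durations NAMELOOKUP CONNECT APPCONNECT STARTTRANSFER TOTAL
phase_durations() {
  awk -v dns="$1" -v conn="$2" -v app="$3" -v ttfb="$4" -v total="$5" 'BEGIN {
    tls   = (app > 0) ? app - conn : 0   # time_appconnect is 0 for plain HTTP
    ready = (app > 0) ? app : conn       # moment the request can be sent
    printf "dns=%.3f tcp=%.3f tls=%.3f wait=%.3f download=%.3f\n",
           dns, conn - dns, tls, ttfb - ready, total - ttfb
  }'
}

# Offline demonstration with made-up timings (seconds)
phase_durations 0.010 0.030 0.090 0.150 0.200
```

With live data the arguments could come from curl itself, for example `phase_durations $(curl -o /dev/null -s -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' https://example.com)`. A large `wait` value points at the server, while large `dns`, `tcp`, or `tls` values point at the network or the resolver.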

tcpdump - Packet Capture

# Basic capture
sudo tcpdump -i eth0              # capture all packets on eth0
sudo tcpdump -i any               # capture on all interfaces

# Common filters
sudo tcpdump -i eth0 host 10.0.0.1          # a specific host
sudo tcpdump -i eth0 port 80                 # a specific port
sudo tcpdump -i eth0 src 10.0.0.1            # source address
sudo tcpdump -i eth0 dst port 443            # destination port
sudo tcpdump -i eth0 tcp and port 8080       # TCP on port 8080
sudo tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0'  # packets with the SYN flag set (SYN and SYN-ACK)

# Save to a pcap file
sudo tcpdump -i eth0 -w capture.pcap -c 1000
# -c 1000: stop after capturing 1000 packets

# Read a pcap file
tcpdump -r capture.pcap -nn

# Formatted output
sudo tcpdump -i eth0 -nn -tttt -A port 80   # print payload as ASCII
sudo tcpdump -i eth0 -nn -X port 80         # hex + ASCII
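
A common pattern in a capture of a failing connection is the same SYN sent several times with no reply. The sketch below scans tcpdump's text output for flows whose initial SYN appears more than once. The function name `repeated_syns` is hypothetical, and it assumes IPv4 lines in the default format of `tcpdump -nn -r file` (single timestamp field, no interface column).

```shell
# List flows whose initial SYN was seen more than once (likely retransmits).
# Expects the text output of: tcpdump -nn -r capture.pcap
repeated_syns() {
  awk '$2 == "IP" && $6 == "Flags" && $7 == "[S]," {
         dst = $5
         sub(/:$/, "", dst)              # drop the colon after the destination
         seen[$3 " > " dst]++
       }
       END { for (flow in seen) if (seen[flow] > 1) print seen[flow], flow }'
}

# Offline demonstration with made-up packets
repeated_syns <<'EOF'
12:00:01.000000 IP 10.0.0.5.40000 > 10.0.0.9.8080: Flags [S], seq 100, win 64240, length 0
12:00:02.000000 IP 10.0.0.5.40000 > 10.0.0.9.8080: Flags [S], seq 100, win 64240, length 0
12:00:04.000000 IP 10.0.0.5.40000 > 10.0.0.9.8080: Flags [S], seq 100, win 64240, length 0
12:00:05.000000 IP 10.0.0.5.40002 > 10.0.0.9.22: Flags [S], seq 200, win 64240, length 0
12:00:05.001000 IP 10.0.0.9.22 > 10.0.0.5.40002: Flags [S.], seq 300, ack 201, win 65160, length 0
EOF
```

Repeated SYNs without a SYN-ACK usually mean the packets are being dropped on the way (firewall, routing) or that nothing answers on the destination port.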

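nmap - Network Scanning

# Port scanning
nmap -sT 10.0.0.1                # TCP connect scan
nmap -sS 10.0.0.1                # SYN scan (half-open; needs root)
nmap -sU 10.0.0.1                # UDP scan (needs root)
nmap -p 80,443,8080 10.0.0.1     # specific ports
nmap -p 1-65535 10.0.0.1         # all ports (same as -p-)

# Service identification
nmap -sV 10.0.0.1                # version detection
nmap -O 10.0.0.1                 # OS detection (needs root)

# Subnet scan
nmap -sn 10.0.0.0/24             # ping scan (host discovery, no port scan)

# Script scans
nmap -p 443 --script ssl-cert 10.0.0.1  # SSL certificate details
nmap -p 80 --script http-headers 10.0.0.1  # HTTP response headers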
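
For scripted checks, nmap's grepable output (`-oG -`) is easier to parse than the default report. The sketch below pulls the open ports out of such output. The function name `open_ports` is hypothetical, and it assumes the usual grepable layout: tab-separated sections, with a `Ports:` section made of comma-separated `port/state/protocol/...` entries.

```shell
# Print "port/protocol" for every open port in nmap grepable output.
# Expects the output of: nmap -oG - <target>
open_ports() {
  awk -F'\t' '
    /Ports: / {
      for (f = 1; f <= NF; f++) {
        if ($f ~ /^Ports: /) {
          list = substr($f, 8)                 # drop the "Ports: " prefix
          n = split(list, entries, ", ")
          for (i = 1; i <= n; i++) {
            split(entries[i], part, "/")       # port/state/protocol/...
            if (part[2] == "open") print part[1] "/" part[3]
          }
        }
      }
    }'
}

# Offline demonstration with a made-up result line
printf 'Host: 10.0.0.1 ()\tPorts: 22/open/tcp//ssh///, 80/open/tcp//http///, 443/closed/tcp//https///\tIgnored State: filtered (997)\n' | open_ports
```

A live invocation would look like `nmap -p 22,80,443 -oG - 10.0.0.1 | open_ports`.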

iperf3 - Bandwidth Testing

# Server
iperf3 -s

# Client (server_ip is a placeholder)
iperf3 -c server_ip               # TCP test
iperf3 -c server_ip -u -b 100M    # UDP test at 100 Mbit/s
iperf3 -c server_ip -t 30         # run for 30 seconds
iperf3 -c server_ip -P 4          # 4 parallel streams
iperf3 -c server_ip -R            # reverse test (server → client)

# Reading the output
# [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
# [  5]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec   12   468 KBytes
# Transfer: amount of data transferred
# Bitrate: throughput
# Retr: number of TCP retransmissions (lower is better)
# Cwnd: congestion window size
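
To track results over several runs, the bitrate and retransmission count can be pulled out of iperf3's text output. The sketch below does this for lines in the TCP sender format shown above. The function name `iperf_summary` is hypothetical, and it assumes the human-readable output where the `Retr` column directly follows the `.../sec` unit.

```shell
# Print bitrate and retransmit count for each iperf3 TCP sender line.
# Reads iperf3's human-readable client output on standard input.
iperf_summary() {
  awk '{
    for (i = 2; i < NF; i++) {
      # A bitrate unit followed by an integer is the Bitrate + Retr pair
      if ($i ~ /bits\/sec$/ && $(i + 1) ~ /^[0-9]+$/) {
        print "bitrate=" $(i - 1), $i, "retr=" $(i + 1)
        break
      }
    }
  }'
}

# Offline demonstration on the sample line from this section
iperf_summary <<'EOF'
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec   12   468 KBytes
EOF
```

Receiver lines carry no `Retr` value and are skipped, so the output reflects only what the sending side observed.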

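Common Troubleshooting Scenarios

Scenario 1: Service Unreachable

# 1. Check that the service is listening
ss -tlnp | grep 8080

# 2. Check local connectivity
curl -v http://localhost:8080

# 3. Check the firewall
iptables -L INPUT -n -v | grep 8080

# 4. Check network connectivity (target_ip is a placeholder)
ping target_ip
telnet target_ip 8080             # or, where telnet is not installed: nc -zv target_ip 8080

# 5. Check DNS
dig service.example.com

# 6. Confirm with a packet capture
sudo tcpdump -i any port 8080 -nn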
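
The numbered steps above can be chained so that the run stops at the first failing check, which is usually the layer to investigate. The following is a minimal sketch of such a runner. The function name `run_checks` is hypothetical, each argument is assumed to be a shell command whose exit status signals success, and the commands in the usage comment are placeholders.

```shell
# Run each check in order; report PASS/FAIL and stop at the first failure.
# usage: run_checks "command 1" "command 2" ...
run_checks() {
  local step
  for step in "$@"; do
    if sh -c "$step" >/dev/null 2>&1; then
      echo "PASS: $step"
    else
      echo "FAIL: $step"
      return 1
    fi
  done
}

# Possible usage for this scenario (needs a real target, not run here):
#   run_checks \
#     "ss -tln | grep -q ':8080 '" \
#     "curl -fsS -o /dev/null http://localhost:8080" \
#     "ping -c 1 -W 2 target_ip"

# Offline demonstration with trivial commands
run_checks "true" "test -d /"
```

Stopping at the first failure keeps the output short and avoids chasing symptoms in later steps that are only consequences of an earlier one.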

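Scenario 2: High Access Latency

# 1. DNS resolution time
dig example.com | grep "Query time"

# 2. Network path latency
mtr -r -c 50 target_ip

# 3. TCP connect time and time to first byte
curl -o /dev/null -s -w "TCP: %{time_connect}s\nTTFB: %{time_starttransfer}s\n" http://target

# 4. Bandwidth bottleneck (requires iperf3 -s running on the target)
iperf3 -c target_ip

# 5. Packet loss and retransmissions
ss -ti dst target_ip | grep -E "retrans|rtt"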

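Scenario 3: Intermittent Connection Failures

# 1. Continuous monitoring (http_code 000 means no HTTP response was received)
while true; do
    curl -o /dev/null -s -w "%{time_total} %{http_code}\n" http://target
    sleep 1
done

# 2. Continuous ping (-D prefixes each line with a Unix timestamp)
ping -D target_ip | tee ping_log.txt

# 3. Packet capture for analysis (run as root)
tcpdump -i eth0 -w debug.pcap host target_ip &
# Stop the capture once the problem has reproduced, then analyze the file in Wireshark

# 4. Check connection limits
ss -s
cat /proc/sys/net/ipv4/ip_local_port_range   # ephemeral port range for outbound connections
cat /proc/sys/net/core/somaxconn             # upper limit on the listen() backlog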
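
If the output of the monitoring loop in step 1 is saved to a file, it can be summarized afterwards. The sketch below counts probes and failures and reports the slowest request. The function name `summarize_probes` is hypothetical, and treating every status outside 200-399 (including curl's `000`) as a failure is an assumption that may need adjusting for a given service.

```shell
# Summarize lines of the form "<time_total> <http_code>".
# A probe counts as failed when the status is outside 200-399
# (curl prints 000 when no HTTP response was received).
summarize_probes() {
  awk '{
         total++
         code = $2 + 0
         if (code < 200 || code >= 400) failed++
         if ($1 + 0 > max) max = $1 + 0
       }
       END { printf "total=%d failed=%d max=%.3f\n", total, failed, max }'
}

# Offline demonstration with made-up probe results
summarize_probes <<'EOF'
0.120 200
0.300 200
5.001 000
0.150 502
EOF
```

A failure rate that clusters around particular times, rather than being spread evenly, is a hint to line the log up with the ping timestamps and the packet capture from steps 2 and 3.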

Tool Cheat Sheet

Tool	Primary use	Key options
ping	Connectivity and latency	`-c`, `-s`, `-M do`
mtr	Continuous route-path monitoring	`-r`, `-c`
dig	DNS queries	`+trace`, `+short`, `@server`
ss	Connection state	`-tlnp`, `-s`, `state`
curl	HTTP debugging	`-v`, `-w`, `--resolve`
tcpdump	Packet capture	`-i`, `-w`, `-nn`, filter expressions
nmap	Port scanning	`-sT`, `-sV`, `-p`
iperf3	Bandwidth testing	`-c`, `-s`, `-P`, `-u`

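Summary

The core method of network troubleshooting is to work through the OSI layers from the bottom up, one layer at a time. Knowing the diagnostic tools and common failure modes for each layer makes it possible to locate the root cause quickly. In production, it is advisable to establish a standard troubleshooting procedure and toolset, and to pair them with a continuous monitoring system (for example Prometheus + Grafana) so that problems are detected early.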

Contributors

withesse

Changelog

2026/3/14 13:09


9f6c2 - feat: organize wiki content and refresh site setup, 2026/3/14