Monitoring

Linux

What “monitoring” means (and why it matters)

You’re watching key resources in real time (or from logs/history) so you can answer:
CPU busy? memory full? disk or network slow? which process is the culprit?

Core areas: CPU, Memory (RAM/Swap), Disk & Filesystem, Network, Processes/Services, Logs.

0) Fast “what’s wrong?” checklist

uptime                 # load average & how long the system’s been up

top                    # live view of CPU/mem/processes (q to quit)

free -h                # RAM & swap usage

df -h                  # disk space per filesystem

du -sh *               # which folders are big (in current dir)

ss -tuna               # open TCP/UDP sockets

journalctl -p err -n 50   # last 50 error-level log lines (systemd)

Tip: rerun any command every 2s with watch -n2 "<command>".

1) CPU monitoring

· top / htop: per-process CPU%, memory, load; press 1 in top to see per-CPU cores.

· ps: quick sorted snapshots.

    ps aux --sort=-%cpu | head

    ps aux --sort=-%mem | head

· mpstat (per-CPU), pidstat (per-process over time) – from sysstat package:

    sudo apt install sysstat

    mpstat 2 5          # every 2s, 5 samples

    pidstat -p <PID> 1  # track one process each second

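· Load average (from uptime or top): the average number of tasks that are runnable or in uninterruptible sleep (usually waiting on disk I/O), over the last 1, 5, and 15 minutes.
Rule of thumb: if load stays well above the number of CPU cores for a sustained time → the system is CPU-bound or many tasks are stuck in I/O wait.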
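
The load-versus-cores rule of thumb can be checked mechanically. This is a minimal sketch, assuming a Linux system with /proc/loadavg and nproc. The helper names load_per_core and load_verdict, and the threshold of 1.0 per core, are illustrative assumptions rather than a standard.

```shell
# load_per_core LOAD CORES -> prints LOAD divided by CORES with two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# load_verdict LOAD CORES -> prints "high" if load per core exceeds 1.0, else "ok".
# The 1.0 cutoff is an assumed rule of thumb, not a hard limit.
load_verdict() {
    awk -v load="$1" -v cores="$2" \
        'BEGIN { if (load / cores > 1.0) print "high"; else print "ok" }'
}

# Read the 1-minute load average and compare it with the core count.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _ < /proc/loadavg
    cores=$(nproc)
    echo "load1=$load1 cores=$cores per-core=$(load_per_core "$load1" "$cores") -> $(load_verdict "$load1" "$cores")"
fi
```

A single "high" reading is not alarming by itself. Rerun it for a few minutes (or compare the 5- and 15-minute figures from uptime) before concluding the box is overloaded.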

2) Memory (RAM & swap)

· free -h: human-readable RAM and swap.

· In top: check RES (actual RAM used) vs VIRT (address space).

· vmstat 2 (from procps): quick view of si/so (swap in/out), wa (I/O wait).

· If memory is tight, look for big processes:

    ps aux --sort=-%mem | head
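
For a single number to watch, the "available" figure that free reports can also be read directly from /proc/meminfo. This is a minimal sketch, assuming a kernel that exposes MemAvailable (3.14 or newer). The function name mem_available_pct is made up for this example.

```shell
# mem_available_pct [FILE] -> prints the percentage of RAM the kernel
# estimates is available without swapping (integer, rounded down).
# FILE defaults to /proc/meminfo; passing another file helps with testing.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' \
        "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "available RAM: $(mem_available_pct)%"
fi
```

A low percentage together with non-zero si/so in vmstat is a stronger sign of memory pressure than either signal alone.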

3) Disk & filesystem

· Space:

    df -h                 # per mountpoint free/used

    du -sh * | sort -h    # sizes in current dir, largest last

· I/O performance:

  · iostat -xz 1 (sysstat): device utilization, queue size, await latency.

  · iotop (needs root): per-process disk I/O in real time.

        sudo apt install iotop && sudo iotop
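
To turn the df check into a quick alert, you can filter its output for filesystems above a limit. This is a minimal sketch: the function name over_threshold and the 90% limit are assumptions for illustration, and mount points containing spaces are not handled.

```shell
# over_threshold LIMIT -> reads `df -P` output on stdin and prints
# "USE% MOUNTPOINT" for every filesystem at or above LIMIT percent used.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                 # strip the percent sign
        if (use + 0 >= limit + 0) print $5, $6
    }'
}

# -P forces the POSIX one-line-per-filesystem layout, which is safe to parse.
df -P | over_threshold 90
```

No output means nothing is at or above the limit.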

4) Network

· Interfaces & counters:

    ip -s link           # RX/TX stats; check for drops/errors

    ss -tuna | head      # which ports/connections are open

    ping -c 3 8.8.8.8    # basic reachability

    traceroute example.com   # path (sudo apt install traceroute)

· Live bandwidth (pick one):

    sudo apt install iftop nload bmon

    sudo iftop -i eth0    # per-connection bandwidth

    nload                  # simple in/out meters

    bmon                   # interface graphs
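
If none of those tools are installed, the same RX/TX byte counters that ip -s link prints are exposed under /sys/class/net, and two readings a second apart give a rough rate. This is a minimal sketch, assuming Linux. The function name rate_kib and the IFACE variable are made up for this example, and lo is only a default that exists everywhere.

```shell
# rate_kib PREV CUR SECONDS -> average KiB/s between two byte-counter readings
# (integer division, so small rates round down to 0).
rate_kib() {
    echo $(( ($2 - $1) / $3 / 1024 ))
}

iface="${IFACE:-lo}"                      # e.g. IFACE=eth0 sh thisfile.sh
stats="/sys/class/net/$iface/statistics"

if [ -r "$stats/rx_bytes" ] && [ -r "$stats/tx_bytes" ]; then
    rx1=$(cat "$stats/rx_bytes"); tx1=$(cat "$stats/tx_bytes")
    sleep 1
    rx2=$(cat "$stats/rx_bytes"); tx2=$(cat "$stats/tx_bytes")
    echo "$iface rx=$(rate_kib "$rx1" "$rx2" 1) KiB/s tx=$(rate_kib "$tx1" "$tx2" 1) KiB/s"
fi
```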

5) Processes, services, and logs

· Which process is heavy? → top, ps, pidstat.

· Service status (systemd):

    systemctl status nginx

    systemctl --failed

· Logs:

    journalctl -u nginx --since "1 hour ago"     # one service

    journalctl -p warning -b                     # warning level and above, current boot

    dmesg -w                                     # kernel messages stream

· Open files / ports (when something is “in use”):

    sudo lsof -p <PID> | head

    sudo lsof -i :5432
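
The service and log commands combine naturally: list the failed units, then show the last few log lines for each. This is a minimal sketch, assuming a systemd machine. The helper name first_column is made up for this example, and reading another unit's journal may need sudo.

```shell
# first_column -> prints the first whitespace-separated field of each
# non-empty line on stdin (here: the unit name).
first_column() {
    awk 'NF { print $1 }'
}

if command -v systemctl >/dev/null 2>&1; then
    # --no-legend drops the header/footer, --plain drops the status bullet,
    # so the unit name is always the first field.
    systemctl --failed --no-legend --plain 2>/dev/null | first_column |
    while read -r unit; do
        echo "== $unit =="
        journalctl -u "$unit" -n 5 --no-pager 2>/dev/null
    done
fi
```

On a healthy system this prints nothing, which is the result you want.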

6) Historical monitoring (see past, not just live)

· Enable sysstat collection to use sar for history:

    sudo apt install sysstat

    sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat

    sudo systemctl enable --now sysstat

    sar -u 1 5           # CPU samples now

    sar -r               # today's memory history (from the periodically collected stats)

· Names to know for later: glances and btop (richer live dashboards); netdata and Prometheus + Grafana (long-term history and graphs).

7) Handy one-liners

# Top 10 memory hogs (head -n 11 = header line + 10 processes)

ps -eo pid,comm,%mem,%cpu --sort=-%mem | head -n 11

# Show which processes are reading/writing disk right now (needs pidstat from sysstat)

pidstat -d 1 | head

# Follow a log live (search as you go)

journalctl -f | grep -i "error"

# See CPU usage per core for 10 seconds

mpstat -P ALL 1 10

8) Mini-labs (30–40 min total)

Lab A: CPU & load

yes > /dev/null &              # start a busy loop

top                            # see CPU% & load; press 1

pkill yes

Lab B: Memory pressure

free -h

python3 -c "a='x'*200*1024*1024; import time; time.sleep(20)" &

free -h                        # "used" should rise by roughly 200 MB

kill %1                        # assumes this is job 1; check with jobs

Lab C: Disk usage & I/O

dd if=/dev/zero of=bigfile bs=1M count=500 oflag=direct status=progress

iostat -xz 1 | head -n 20      # run in a second terminal while dd is still writing

rm bigfile

Lab D: Network glimpse

ping -c 5 google.com

ss -tuna | head

Lab E: Logs & services

sudo systemctl status cron

journalctl -u cron --since "10 min ago"

9) Safety & exam tips

· Don’t kill random PIDs on shared systems; prefer kill <PID> (SIGTERM, lets the process clean up) over kill -9 (SIGKILL, cannot be caught or handled).

· Load average ≠ CPU% but correlates; sustained load way above core count is a flag.

· For disks: high %util / long await in iostat → I/O bottleneck.

· For memory: heavy swap or oom-killer messages in dmesg → RAM pressure.

· Know the big tools by name: top/htop, ps, free, df/du, iostat/iotop, ss/iftop, journalctl, dmesg, sar.
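
The "polite first" advice can be wrapped in a small helper: send SIGTERM, give the process time to exit, and only then fall back to SIGKILL. This is a minimal sketch. The function name stop_gracefully and the 10-second default are assumptions for illustration.

```shell
# stop_gracefully PID [TIMEOUT] -> send SIGTERM, wait up to TIMEOUT seconds
# (default 10), and send SIGKILL only if the process is still alive.
# Returns 1 if the PID does not exist or cannot be signalled.
stop_gracefully() {
    sg_pid="$1"
    sg_timeout="${2:-10}"
    sg_waited=0

    kill -TERM "$sg_pid" 2>/dev/null || return 1

    while [ "$sg_waited" -lt "$sg_timeout" ]; do
        kill -0 "$sg_pid" 2>/dev/null || return 0   # gone: nothing more to do
        sleep 1
        sg_waited=$((sg_waited + 1))
    done

    kill -0 "$sg_pid" 2>/dev/null || return 0       # exited during the last wait
    kill -KILL "$sg_pid" 2>/dev/null                # last resort
}

# Example (commented out so nothing is killed by accident):
# stop_gracefully 12345 5
```

kill -0 sends no signal; it only tests whether the process exists and whether you are allowed to signal it.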
