Troubleshooting Drives: Smartmontools Best Practices and Commands

Maintaining reliable storage systems requires proactive monitoring, and Smartmontools is one of the most capable open-source toolsets for that task. Smartmontools (which includes smartctl and smartd) interfaces with ATA, NVMe and SCSI devices using the S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) system built into modern drives. This article walks through best practices for using Smartmontools to troubleshoot drives, interpret results, set up automated monitoring, and use commands that help diagnose, prevent, and respond to drive failures.


Why use Smartmontools?

S.M.A.R.T. provides internal device metrics and self-test capabilities. Smartmontools turns those raw device features into actionable information:

  • Detects early signs of drive degradation (reallocated sectors, read/write errors, pending sectors).
  • Runs built-in self-tests (short, long, conveyance) to exercise drive internals.
  • Logs historical error and test records for trend analysis.
  • Integrates with system services (smartd) for automated alerts and responses.

Installing Smartmontools

On most Linux distributions and BSDs, smartmontools is available from system package managers:

  • Debian/Ubuntu:

    sudo apt update
    sudo apt install smartmontools
  • RHEL/CentOS/Fedora:

    sudo dnf install smartmontools 
  • Arch Linux:

    sudo pacman -S smartmontools 
  • FreeBSD:

    pkg install smartmontools 

macOS users can install via Homebrew:

brew install smartmontools 

After installation, you’ll have two primary utilities: smartctl (manual querying and testing) and smartd (the daemon for automated monitoring).


Basic smartctl usage

Identify drives and their device names first (e.g., /dev/sda, /dev/nvme0n1):

  • List block devices (Linux):

    lsblk -o NAME,MODEL,SIZE,TYPE,MOUNTPOINT 
  • Query S.M.A.R.T. support and basic info:

    sudo smartctl -i /dev/sda 

    This shows whether S.M.A.R.T. is available/enabled and device model, serial number, firmware, and transport type.

  • Get overall SMART health summary:

    sudo smartctl -H /dev/sda 

    Best practice: Run -H as a quick health check; then follow up with attribute inspection if errors appear.

  • View full SMART attributes and error logs:

    sudo smartctl -A /dev/sda
    sudo smartctl -l error /dev/sda
    sudo smartctl -l selftest /dev/sda

    Key attributes to watch: Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable, UDMA_CRC_Error_Count, and Media_Wearout_Indicator (for SSDs). Interpret values using vendor-specific documentation when possible.
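
The quick health check can also be scripted, because smartctl reports its findings through a bit-mask exit status (documented in the EXIT STATUS section of its man page). The following is a minimal sketch; the function name decode_smartctl_status is hypothetical, and the messages are paraphrased from the man page.

```sh
#!/bin/sh
# Decode the bit-mask exit status returned by smartctl.
# decode_smartctl_status is a hypothetical helper name, not part of smartmontools.
decode_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
        return 0
    fi
    bit=0
    # One message per bit, in order from bit 0 to bit 7.
    for msg in \
        "command line did not parse" \
        "device open failed or device in low-power mode" \
        "a SMART or ATA command failed, or a checksum error occurred" \
        "SMART status check returned DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains errors" \
        "device self-test log contains errors"
    do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $msg"
        fi
        bit=$((bit + 1))
    done
    return 0
}

# Usage (not run here):
#   sudo smartctl -H /dev/sda >/dev/null; decode_smartctl_status $?
```

Bits 0 to 2 signal that the query itself failed, while bits 3 to 7 describe the drive, so a monitoring script may want to treat the two groups differently.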


Interpreting SMART attributes — practical tips

  • Reallocated_Sector_Ct: Non-zero and increasing counts indicate the drive is mapping out bad sectors. A rising trend is a strong indicator to replace the drive.
  • Current_Pending_Sector: Sectors awaiting reallocation because of read errors. Even a single pending sector deserves attention.
  • Offline_Uncorrectable: Sectors that cannot be read during offline tests. Any non-zero value is concerning.
  • UDMA_CRC_Error_Count: Often indicates cabling or controller issues (loose SATA cable, bad port) rather than media failure. Check physical connections before replacing the drive.
  • For SSDs, check wear indicators (e.g., Media_Wearout_Indicator or Percentage Used). High wear values mean nearing end-of-life.

Always correlate SMART data with system logs (dmesg, syslog) and application-level errors (I/O timeouts, filesystem errors).
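
As a rough illustration of the tips above, the sketch below filters `smartctl -A` output for the four ATA attributes discussed and prints any whose raw value is non-zero. It assumes the usual ten-column ATA attribute table (name in column 2, raw value in column 10); the function name flag_key_attributes is hypothetical, and vendor-specific raw formats may need different handling.

```sh
#!/bin/sh
# Print NAME=RAW for key ATA attributes whose raw value is greater than zero.
# Reads `smartctl -A` output on stdin. Assumes the standard ATA table layout:
# column 2 = attribute name, column 10 = raw value.
flag_key_attributes() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ &&
        $10 + 0 > 0 { print $2 "=" $10 }
    '
}

# Usage (not run here):
#   sudo smartctl -A /dev/sda | flag_key_attributes
```

An empty result means none of the four counters is currently non-zero; it does not by itself prove the drive is healthy.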


Running self-tests

Smartmontools supports several self-tests. Use them to exercise device internals and produce a clearer picture of media health.

  • Start a short self-test (usually a few minutes):

    sudo smartctl -t short /dev/sda 
  • Start a long (extended) self-test (can take hours):

    sudo smartctl -t long /dev/sda 
  • Start a conveyance test (checks for damage incurred during transport; shorter than a long test):

    sudo smartctl -t conveyance /dev/sda 
  • Check test progress and results:

    sudo smartctl -c /dev/sda            # shows self-test support and, right after starting a test, the remaining time
    sudo smartctl -l selftest /dev/sda   # shows results when the test completes

Best practice: run short tests regularly (daily/weekly, depending on workload) and long tests during maintenance windows or before critical operations like backups or migrations.
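
When tests run on a schedule, it helps to check the newest log entry automatically. The sketch below reads `smartctl -l selftest` output and reports whether the most recent entry says "Completed without error". It assumes the ATA self-test log layout, where the newest entry is the line beginning with "# 1"; NVMe logs are formatted differently. The function name latest_selftest is hypothetical.

```sh
#!/bin/sh
# Print the newest ATA self-test log entry and return:
#   0 if it completed without error,
#   1 if it shows any other status (failure, aborted, still running, ...),
#   2 if the log contains no entries.
# Reads `smartctl -l selftest` output on stdin.
latest_selftest() {
    line=$(grep -E '^# *1[[:space:]]' | head -n 1)
    if [ -z "$line" ]; then
        echo "no self-test recorded"
        return 2
    fi
    echo "$line"
    case $line in
        *"Completed without error"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (not run here):
#   sudo smartctl -l selftest /dev/sda | latest_selftest || echo "needs attention"
```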


Using smartd for automated monitoring

smartd is the daemon that periodically polls drives and sends alerts based on rules you configure in /etc/smartd.conf.

  • Typical smartd configuration example:

    
    /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m [email protected] 

    Meaning:

  • -a enables all default checks.

  • -o on enables automatic offline data collection.

  • -S on enables attribute autosave (if supported).

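  • -s schedules short/long tests (S = short, L = long) with a regular expression matched against T/MM/DD/d/HH. In this example, a short test runs daily at 02:00 and a long test runs every Saturday at 03:00.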

  • -m specifies recipient for email alerts.

  • Start and enable smartd:

    sudo systemctl enable --now smartd 
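
The -s schedule is the part of the configuration most easily misread. According to the smartd.conf man page, smartd compares the expression with a string of the form T/MM/DD/d/HH (test type, month, day, day of week with 1 = Monday and 7 = Sunday, hour). The sketch below approximates that check with grep -E so a schedule can be tried out before deploying it; the function names are hypothetical, and grep stands in for smartd's own matcher.

```sh
#!/bin/sh
# Build the T/MM/DD/d/HH string for the current time.
# $1 = test type letter (for example S for short or L for long).
smartd_time_string() {
    printf '%s/%s\n' "$1" "$(date +%m/%d/%u/%H)"
}

# Return 0 if the schedule expression ($1) matches the time string ($2).
# grep -E is used here as an approximation of smartd's regex matching.
schedule_matches() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Usage (not run here):
#   schedule_matches '(S/../.././02|L/../../6/03)' "$(smartd_time_string S)" \
#       && echo "a short test would be due this hour"
```

With the example schedule, 'S/03/14/4/02' (a Thursday at 02:00) matches the short-test branch, while 'L/03/14/4/03' does not match because the long test is tied to day 6 (Saturday).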

Best practices for smartd:

  • Configure email alerts to a monitoring account and integrate with existing monitoring systems (Prometheus, Nagios, Zabbix).
  • Use -M exec to run custom scripts (e.g., auto-fence VM, escalate to ops pipeline) when critical errors occur.
  • Exclude drives that are known to misreport (e.g., certain USB enclosures) to avoid false positives.
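
A handler for -M exec can be quite small. smartd passes details to the program through environment variables such as SMARTD_DEVICE, SMARTD_FAILTYPE and SMARTD_MESSAGE (see the smartd.conf man page). The sketch below only formats those values into one line; the script path and the function name format_alert are hypothetical, and forwarding to syslog is shown as a comment.

```sh
#!/bin/sh
# Hypothetical handler for smartd's "-M exec" directive. A matching
# smartd.conf line might look like this (the path is an assumption):
#   /dev/sda -a -m root -M exec /usr/local/bin/smartd-alert.sh
# Note that -M only takes effect together with -m.

# Build a one-line alert from the environment variables smartd exports.
format_alert() {
    printf 'smartd alert: device=%s failtype=%s message=%s\n' \
        "${SMARTD_DEVICE:-unknown}" \
        "${SMARTD_FAILTYPE:-unknown}" \
        "${SMARTD_MESSAGE:-none}"
}

# In a deployed handler the line would be forwarded somewhere useful, e.g.:
#   format_alert | logger -t smartd-alert -p daemon.crit
```

Keeping the formatting in a function makes the handler easy to exercise by hand: export the three variables in a shell and call format_alert.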

Common troubleshooting workflows

  1. Drive reports read errors in system logs:

    • Check SMART summary: sudo smartctl -H /dev/sdX.
    • Inspect attributes: sudo smartctl -A /dev/sdX.
    • Run a short self-test: sudo smartctl -t short /dev/sdX, then check results.
    • If pending/reallocated sectors are present, schedule a long self-test and plan replacement.
  2. Intermittent I/O errors or CRC errors:

    • Inspect UDMA_CRC_Error_Count. If increasing, reseat/replace SATA cable, try different port; test again.
    • Monitor for further CRC increases to determine if controller/cable fix resolved it.
  3. RAID array showing degraded disk or resync failures:

    • Use smartctl on the physical device (not the RAID device) to read attributes.
    • If SMART indicates failure, replace the drive immediately and rebuild.
    • If SMART is clear, check controller logs and firmware; consider running extended self-tests.
  4. SSD showing high wear or decreased performance:

    • Check SMART wear-level attributes and percentage used.
    • Confirm firmware is up to date.
    • If nearing end-of-life, plan replacement and data migration.
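
Several of these workflows depend on whether a counter is still rising, not on its absolute value. The sketch below extracts one attribute's raw value from `smartctl -A` output and compares it with a previously recorded number. The function names and the state-file path in the usage comment are hypothetical, and the ten-column ATA table layout is assumed.

```sh
#!/bin/sh
# Print the raw value (column 10) of the named ATA attribute.
# Reads `smartctl -A` output on stdin; $1 = attribute name.
attr_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Compare a previous reading ($1) with the current one ($2).
count_trend() {
    if [ "$2" -gt "$1" ]; then
        echo "increasing (+$(( $2 - $1 )))"
    else
        echo "stable"
    fi
}

# Usage (not run here; /var/tmp/sda.crc is a hypothetical state file):
#   cur=$(sudo smartctl -A /dev/sda | attr_raw UDMA_CRC_Error_Count)
#   prev=$(cat /var/tmp/sda.crc 2>/dev/null || echo "$cur")
#   count_trend "$prev" "$cur"
#   echo "$cur" > /var/tmp/sda.crc
```

For CRC errors in particular, a "stable" result after reseating or replacing the cable suggests the fix worked, since the counter never resets on its own.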

NVMe-specific commands

NVMe devices report a different set of health data than ATA drives, but smartctl accepts the same options for them:

  • Basic info:

    sudo smartctl -i /dev/nvme0n1 
  • Health summary:

    sudo smartctl -H /dev/nvme0n1 
  • NVMe attributes and logs:

    sudo smartctl -A /dev/nvme0n1
    sudo smartctl -l error /dev/nvme0n1
  • Start an NVMe self-test:

    sudo smartctl -t long /dev/nvme0n1 

For NVMe, watch fields such as Data Units Written, Percentage Used, Available Spare, Media and Data Integrity Errors, Critical Warning, and temperature warnings.
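
NVMe health data is printed as "Name: value" lines instead of an attribute table, so it needs a different parser. The sketch below pulls a single field out of `smartctl -A` output for an NVMe device; the function name nvme_field is hypothetical, and the field names are assumed to match the labels smartctl prints (for example "Percentage Used").

```sh
#!/bin/sh
# Print the value of one "Name: value" field from NVMe `smartctl -A` output.
# Reads the report on stdin; $1 = exact field label without the colon.
nvme_field() {
    awk -F': *' -v name="$1" '$1 == name { print $2; exit }'
}

# Usage (not run here; the 80 percent threshold is an arbitrary example):
#   used=$(sudo smartctl -A /dev/nvme0n1 | nvme_field "Percentage Used")
#   [ "${used%\%}" -ge 80 ] && echo "plan replacement: ${used} of rated endurance used"
```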


Advanced tips and gotchas

  • Device names vs. RAID devices: Run smartctl against the underlying physical device (e.g., /dev/sda or /dev/sgX), not a logical RAID device such as /dev/md0. Drives behind a hardware RAID controller are reachable only if the controller supports passthrough.
  • HBA/RAID controllers: Some hardware RAID controllers do not expose SMART data for individual drives. Use vendor tools or enable passthrough (e.g., using /dev/sgX or controller-specific utilities).
  • Power cycling during tests: Avoid interrupting long self-tests; they can take hours. However, if a device becomes unresponsive during testing, treat it as an urgent failure indicator.
  • False positives from USB-to-SATA bridges: Many USB enclosures do not forward SMART attributes reliably. For drives connected via USB, test by connecting directly to SATA when possible.
  • Firmware quirks: Different vendors interpret attributes differently. When in doubt, consult vendor documentation for attribute meanings.

Example commands cheat-sheet

# Basic info
sudo smartctl -i /dev/sda
# Quick health check
sudo smartctl -H /dev/sda
# Full attributes + error/selftest logs
sudo smartctl -A /dev/sda
sudo smartctl -l error /dev/sda
sudo smartctl -l selftest /dev/sda
# Run self-tests
sudo smartctl -t short /dev/sda
sudo smartctl -t long /dev/sda
# NVMe
sudo smartctl -i /dev/nvme0n1
sudo smartctl -A /dev/nvme0n1
# Run smartctl on devices behind controllers that require device type override
sudo smartctl -d sat -a /dev/sg2   # example: force SAT transport

When to replace a drive

  • Any non-zero and increasing values for Reallocated_Sector_Ct, Current_Pending_Sector, or Offline_Uncorrectable.
  • SMART self-test failures or repeated, growing error logs.
  • High SSD wear percentage approaching vendor threshold.
  • Persistent I/O errors not resolved by cabling/controller fixes.

If you rely on the data, assume impending failure when SMART shows degrading trends and replace proactively after securing a verified backup.


Integration with monitoring systems

  • Export SMART metrics to Prometheus using exporters like smartmon-exporter or node_exporter textfile collector.
  • Use alerting rules to warn on attribute thresholds (e.g., Current_Pending_Sector > 0, Reallocated_Sector_Ct increasing).
  • Correlate SMART alerts with system logs and RAID controller events to reduce false positives.
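
For the node_exporter textfile collector route, the exporter reads metrics from *.prom files in the directory given by its --collector.textfile.directory flag. The sketch below converts ATA `smartctl -A` output into Prometheus text format. The metric name smart_attr_raw, the label names, and the output directory in the usage comment are assumptions made for illustration, not an established schema.

```sh
#!/bin/sh
# Convert ATA `smartctl -A` output (stdin) into Prometheus text-format lines.
# $1 = device label value. Only rows with a numeric ID (column 1) and a
# numeric raw value (column 10) are emitted.
# The metric name smart_attr_raw is hypothetical.
smart_to_prom() {
    awk -v dev="$1" '
        $1 ~ /^[0-9]+$/ && $10 ~ /^[0-9]+$/ {
            printf "smart_attr_raw{device=\"%s\",id=\"%s\",name=\"%s\"} %s\n", dev, $1, $2, $10
        }
    '
}

# Usage (not run here; the directory is a hypothetical textfile path).
# Writing to a temporary file and renaming avoids half-written scrapes.
#   dir=/var/lib/node_exporter/textfile
#   sudo smartctl -A /dev/sda | smart_to_prom sda > "$dir/smart_sda.prom.tmp" \
#       && mv "$dir/smart_sda.prom.tmp" "$dir/smart_sda.prom"
```

An alert rule can then compare the series for Current_Pending_Sector against zero, or use a rate-style expression to detect a rising Reallocated_Sector_Ct.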

Summary

Smartmontools is essential for proactive drive maintenance: it reveals internal device telemetry, runs diagnostic self-tests, and supports automated alerting. Combine regular automated checks (smartd), periodic manual inspections (smartctl), good cabling/firmware hygiene, and integration with monitoring to catch failures early and reduce downtime. When SMART shows clear or trending errors, prioritize backups and replace the drive rather than relying on uncertain remediation.
