GitHub Repository: https://github.com/timmcfadden/OCIHeartbeat

Quick Start: Clone the repo, configure OCI CLI, and run the setup script


Have you ever had VMs showing as "running" in the OCI console, but they're completely unresponsive? Maybe stuck at a BIOS screen, hung during boot, or frozen for unknown reasons? OCI Heartbeat solves this exact problem.

The Problem

A customer recently experienced Windows VMs hanging at the BIOS screen. The VMs appeared as "running" in Oracle Cloud Infrastructure but were completely unresponsive to connections. Monitoring that only checks power state would not catch this: you need to monitor VM accessibility status, not just whether the instance is powered on.

This is especially common in Windows environments where:
- VMs hang at BIOS/UEFI screens
- Boot processes fail silently
- Services crash but the VM stays "running"
- Network configurations break post-reboot


What is OCI Heartbeat?

OCI Heartbeat is a Python-based monitoring solution that automatically creates and manages heartbeat alarms for your Oracle Cloud virtual machines. It checks VM accessibility status every 5 minutes and triggers an alert when a VM is unresponsive or has not reported status for 10+ minutes.

Key Features

Smart Responsiveness Detection
- Monitors VM accessibility status, not just power state
- Catches VMs that are "on" but unresponsive
- Detects silent failures and hung processes

Automated VM Discovery
- Process all running VMs in a compartment at once
- Or monitor specific VMs individually
- Automatic compartment detection from VM OCID

Instant Email Notifications
- Integrates with OCI notification topics
- Sends alerts when VMs become unresponsive
- CRITICAL severity for immediate attention

Production Ready
- Non-interactive mode for automation
- Comprehensive error handling
- Detailed success/failure reporting
- Permission validation and troubleshooting guidance


How It Works

OCI Heartbeat creates monitoring alarms that check the VM accessibility metric every 5 minutes. An alarm triggers when:

  1. VM is unresponsive for 5 minutes - The instance is powered on but not responding
  2. No status reported for 10+ minutes - The monitoring agent isn't reporting (potential freeze/hang)

This dual-trigger approach catches both VMs that report an unresponsive status and VMs that stop reporting altogether.
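The two triggers above can be written as a single Monitoring Query Language (MQL) expression. The sketch below only builds the query string. The metric name `instance_accessibility_status`, the value convention (0 for responsive, 1 for unresponsive), the use of MQL's `||` join with `absent()`, and the `build_heartbeat_query` helper itself are all assumptions for illustration, not taken from the OCI Heartbeat source.

```python
def build_heartbeat_query(instance_ocid, interval="5m", absent_for="10m",
                          metric="instance_accessibility_status"):
    """Return an MQL string that fires on either heartbeat condition.

    Assumption: the metric reports 0 when the VM is reachable and 1 when
    it is not, so "max() > 0" means "unresponsive during the interval".
    """
    # Select the metric for one instance at the given check interval.
    selector = f'{metric}[{interval}]{{resourceId = "{instance_ocid}"}}'
    # Trigger 1: the VM reported an unresponsive status.
    unresponsive = f"{selector}.max() > 0"
    # Trigger 2: nothing was reported for the absence window.
    silent = f"{selector}.absent({absent_for})"
    return f"{unresponsive} || {silent}"


if __name__ == "__main__":
    print(build_heartbeat_query("ocid1.instance.oc1.iad.xxx"))
```

Keeping the query in one helper makes it easy to adjust the 5-minute and 10-minute windows in one place if your environment needs different thresholds.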


Installation

Requirements

  • Python 3.6 or higher
  • OCI Python SDK: pip install oci
  • Configured OCI CLI with API key authentication
  • Active OCI account with compute and monitoring permissions

Setup

git clone https://github.com/timmcfadden/OCIHeartbeat.git
cd OCIHeartbeat
pip install oci

Prerequisites

Before running, you'll need:

  1. Compartment OCID - Where your VMs are located (starts with ocid1.compartment.)
  2. Notification Topic OCID - For email alerts (starts with ocid1.onstopic.)
  3. Email Subscription - Subscribe to your notification topic to receive alerts
  4. Optional: VM Instance OCID - If monitoring a specific VM (starts with ocid1.instance.)
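A quick way to catch copy-and-paste mistakes before running anything is to check each value against the prefixes listed above. This is a minimal sketch; `EXPECTED_PREFIXES` and `check_ocid` are hypothetical names, and the check is deliberately shallow (prefix only).

```python
# Expected prefixes, taken from the prerequisites list above.
EXPECTED_PREFIXES = {
    "compartment": "ocid1.compartment.",
    "topic": "ocid1.onstopic.",
    "instance": "ocid1.instance.",
}


def check_ocid(kind, value):
    """Return True if `value` looks like an OCID of the given kind."""
    prefix = EXPECTED_PREFIXES[kind]
    # Require the prefix plus at least one more character after it.
    return value.startswith(prefix) and len(value) > len(prefix)


if __name__ == "__main__":
    print(check_ocid("compartment", "ocid1.compartment.oc1..xxx"))
    print(check_ocid("topic", "ocid1.instance.oc1.iad.xxx"))
```

Note that a root compartment is identified by a tenancy OCID (`ocid1.tenancy.`), which this simple check would reject.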

Usage

Monitor All VMs in a Compartment

Perfect for production environments where you want comprehensive coverage:

python3 oci_vm_alarms.py --compartment ocid1.compartment.oc1..xxx --topic ocid1.onstopic.oc1..xxx

This will:
- Discover all running VMs in the compartment
- Create heartbeat alarms for each VM
- Display a summary of successes and failures
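The discovery step can be sketched as a paginated call to the Compute API's `list_instances`, filtered to the RUNNING lifecycle state. The helper name `list_running_instances` is hypothetical, and this is an illustration of the idea rather than the script's actual code. The client is passed in, so the function works with an `oci.core.ComputeClient` or with a stand-in during testing.

```python
def list_running_instances(compute_client, compartment_id):
    """Collect every RUNNING instance in a compartment, following pagination."""
    instances = []
    kwargs = {"lifecycle_state": "RUNNING"}
    while True:
        response = compute_client.list_instances(compartment_id, **kwargs)
        instances.extend(response.data)
        if not response.has_next_page:
            return instances
        # Ask for the next page on the following iteration.
        kwargs["page"] = response.next_page
```

With the real SDK you would typically construct the client as `oci.core.ComputeClient(oci.config.from_file())`. The SDK's `oci.pagination.list_call_get_all_results` helper performs the same page-following loop if you prefer not to write it by hand.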

Monitor a Single VM

Ideal for critical individual instances:

python3 oci_vm_alarms.py --vm-ocid ocid1.instance.oc1.iad.xxx --topic ocid1.onstopic.oc1..xxx

The script automatically detects the compartment from the VM OCID.
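An instance OCID does not encode its compartment, so detecting it presumably means asking the Compute API for the instance and reading its `compartment_id`. A minimal sketch, with `compartment_of` as a hypothetical helper name:

```python
def compartment_of(compute_client, instance_ocid):
    """Look up the compartment that owns an instance."""
    # get_instance returns a response whose .data is the instance record.
    instance = compute_client.get_instance(instance_ocid).data
    return instance.compartment_id
```

This lookup requires permission to read the instance, so a permission error here is a useful early signal that the alarm creation would fail too.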

Non-Interactive Mode

For automation, cron jobs, or CI/CD pipelines:

python3 oci_vm_alarms.py --compartment ocid1.compartment.oc1..xxx --topic ocid1.onstopic.oc1..xxx --non-interactive

Runs without prompts, perfect for scheduled execution.
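If you schedule runs from Python rather than directly from cron, a thin wrapper can invoke the documented command once per compartment. This sketch uses only the flags shown above; `build_command` and `run_for_compartments` are hypothetical helpers, and the meaning of the script's exit codes is an assumption you should verify before relying on it.

```python
import subprocess
import sys


def build_command(compartment_ocid, topic_ocid, script="oci_vm_alarms.py"):
    """Assemble the documented non-interactive invocation as an argv list."""
    return [
        sys.executable, script,
        "--compartment", compartment_ocid,
        "--topic", topic_ocid,
        "--non-interactive",
    ]


def run_for_compartments(compartment_ocids, topic_ocid):
    """Run the script once per compartment and collect the exit codes."""
    results = {}
    for ocid in compartment_ocids:
        completed = subprocess.run(build_command(ocid, topic_ocid))
        results[ocid] = completed.returncode
    return results
```

Passing the arguments as a list avoids shell quoting problems with OCIDs supplied from environment variables.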


Real-World Use Cases

Windows Server Environments

Monitor critical Windows VMs prone to BIOS issues:

# Create alarms for all Windows servers in production
python3 oci_vm_alarms.py --compartment $PROD_COMPARTMENT --topic $ALERT_TOPIC

Critical Database Servers

Monitor individual high-value instances:

# Monitor primary database server
python3 oci_vm_alarms.py --vm-ocid $DB_PRIMARY_OCID --topic $DBA_ALERT_TOPIC

Automated Monitoring Setup

Add to infrastructure-as-code workflows:

# Run after VM provisioning in CI/CD
python3 oci_vm_alarms.py --vm-ocid $NEW_VM_OCID --topic $MONITORING_TOPIC --non-interactive

Cost Optimization Audits

Identify zombie VMs that appear running but aren't functional:

# Check all VMs in dev/test environments
python3 oci_vm_alarms.py --compartment $DEV_COMPARTMENT --topic $DEVOPS_TOPIC

Monitoring Details

Check Frequency: Every 5 minutes

Alarm Triggers:
- VM unresponsive for 5 consecutive minutes
- No status reported for 10+ minutes

Alarm Severity: CRITICAL

Notification Method: Email via OCI notification topics

Alarm Naming: Heartbeat-[VM-Name] for easy identification
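Put together, the settings above map onto the fields of an OCI alarm definition. The sketch below builds them as a plain dictionary whose keys match `oci.monitoring.models.CreateAlarmDetails`. The namespace `oci_compute_instance_health`, the `PT5M` pending duration, and the `heartbeat_alarm_details` helper are assumptions for illustration; the script's real values may differ.

```python
def heartbeat_alarm_details(vm_name, compartment_id, topic_ocid, query,
                            namespace="oci_compute_instance_health"):
    """Return keyword arguments describing one heartbeat alarm."""
    return {
        # Naming convention from this post: Heartbeat-[VM-Name].
        "display_name": f"Heartbeat-{vm_name}",
        "compartment_id": compartment_id,
        "metric_compartment_id": compartment_id,
        "namespace": namespace,  # assumed metric namespace
        "query": query,
        "severity": "CRITICAL",
        # Assumed: condition must hold for 5 minutes before firing.
        "pending_duration": "PT5M",
        # Notification topic that delivers the email alerts.
        "destinations": [topic_ocid],
        "is_enabled": True,
    }
```

With the SDK installed, a dictionary like this could be passed as `oci.monitoring.models.CreateAlarmDetails(**details)` to `MonitoringClient.create_alarm`.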


Best Practices

  1. Test First - Always test in non-production environments before deploying to production
  2. Separate Topics - Use different notification topics for different environments (prod/dev/test)
  3. Subscribe Relevant Teams - Ensure the right people receive alerts (ops team, DBAs, etc.)
  4. Regular Audits - Periodically re-run to ensure new VMs are monitored
  5. Document Runbooks - Create response procedures for when alarms trigger

Troubleshooting

The script includes comprehensive error handling for common issues:

  • Permission Errors - Validates you have compute and monitoring permissions
  • Invalid OCIDs - Checks OCID format before making API calls
  • Missing Resources - Reports if compartments or VMs don't exist
  • API Limits - Handles rate limiting gracefully

All errors include actionable troubleshooting guidance.


Why This Matters

Traditional monitoring often only checks if a VM is powered on. This leaves critical gaps:

  • VM Power State: "Running"
  • VM Accessibility: Unresponsive
  • Application Status: Down
  • User Impact: Complete outage

OCI Heartbeat fills this gap by monitoring what actually matters: Can users reach this VM?

This is especially critical for:
- Customer-facing applications
- Database servers
- API endpoints
- Windows environments with boot issues
- Any VM where uptime is critical


Get Started Today

OCI Heartbeat is open source and ready to deploy:

Repository: https://github.com/timmcfadden/OCIHeartbeat

Set up monitoring for your entire fleet in minutes and catch unresponsive VMs before they impact your users.

⚠️ Note: Test thoroughly in your environment before production deployments. Every infrastructure is unique!

Questions? Open an issue on GitHub or contribute improvements! 🚀