GitHub Repository: https://github.com/timmcfadden/OCIHeartbeat
Quick Start: Clone the repo, configure OCI CLI, and run the setup script
Have you ever had VMs showing as "running" in the OCI console, but they're completely unresponsive? Maybe stuck at a BIOS screen, hung during boot, or frozen for unknown reasons? OCI Heartbeat solves this exact problem.
The Problem
A customer recently experienced Windows VMs hanging at the BIOS screen. The VMs appeared as "running" in Oracle Cloud Infrastructure but were completely unresponsive to connections. Traditional monitoring that only checks power state won't catch this: you need to monitor VM accessibility status, not just whether the instance is powered on.
This is especially common in Windows environments where:
- VMs hang at BIOS/UEFI screens
- Boot processes fail silently
- Services crash but the VM stays "running"
- Network configurations break post-reboot
What is OCI Heartbeat?
OCI Heartbeat is a Python-based monitoring solution that automatically creates and manages heartbeat alarms for your Oracle Cloud virtual machines. It monitors VM accessibility status every 5 minutes and triggers alerts when VMs are unresponsive OR haven't reported status for 10+ minutes.
Key Features
Smart Responsiveness Detection
- Monitors VM accessibility status, not just power state
- Catches VMs that are "on" but unresponsive
- Detects silent failures and hung processes
Automated VM Discovery
- Process all running VMs in a compartment at once
- Or monitor specific VMs individually
- Automatic compartment detection from VM OCID
Instant Email Notifications
- Integrates with OCI notification topics
- Sends alerts when VMs become unresponsive
- CRITICAL severity for immediate attention
Production Ready
- Non-interactive mode for automation
- Comprehensive error handling
- Detailed success/failure reporting
- Permission validation and troubleshooting guidance
How It Works
OCI Heartbeat creates monitoring alarms that check the VM accessibility metric every 5 minutes. An alarm triggers when:
- VM is unresponsive for 5 minutes - The instance is powered on but not responding
- No status reported for 10+ minutes - The monitoring agent isn't reporting (potential freeze/hang)
This dual-trigger approach catches both VMs that report as unreachable and VMs whose status reporting goes silent altogether.
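The two trigger conditions above can be expressed as a small decision function. This is a minimal sketch of the logic only, not the project's code: the function name `should_alarm` and its two inputs are illustrative assumptions, and in practice the OCI alarm evaluates these conditions, not a local script.

```python
from typing import Optional

# Thresholds taken from the text above.
UNRESPONSIVE_WINDOW_MIN = 5   # unresponsive for 5 minutes
ABSENT_WINDOW_MIN = 10        # no status reported for 10+ minutes


def should_alarm(minutes_unreachable: Optional[float],
                 minutes_since_last_report: float) -> bool:
    """Return True when either heartbeat condition holds (hypothetical helper).

    minutes_unreachable: how long the VM has continuously reported as
        unreachable, or None if its latest report was healthy.
    minutes_since_last_report: age of the most recent status report.
    """
    # Trigger 2: the status metric has gone silent.
    if minutes_since_last_report >= ABSENT_WINDOW_MIN:
        return True
    # Trigger 1: the VM is reporting, but as unreachable, for long enough.
    if (minutes_unreachable is not None
            and minutes_unreachable >= UNRESPONSIVE_WINDOW_MIN):
        return True
    return False
```

A VM that is powered on but has reported as unreachable for 5 minutes trips the first condition, while a VM that stopped reporting 12 minutes ago trips the second even though no "unreachable" value was ever recorded.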
Installation
Requirements
- Python 3.6 or higher
- OCI Python SDK: pip install oci
- Configured OCI CLI with API key authentication
- Active OCI account with compute and monitoring permissions
Setup
git clone https://github.com/timmcfadden/OCIHeartbeat.git
cd OCIHeartbeat
pip install oci
Prerequisites
Before running, you'll need:
- Compartment OCID - Where your VMs are located (starts with ocid1.compartment.)
- Notification Topic OCID - For email alerts (starts with ocid1.onstopic.)
- Email Subscription - Subscribe to your notification topic to receive alerts
- Optional: VM Instance OCID - If monitoring a specific VM (starts with ocid1.instance.)
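Before running the script, it can help to sanity-check that each OCID has the expected prefix. The sketch below uses only the prefixes listed above. The helper name `check_ocid` and the `kind` labels are hypothetical, and the check covers the prefix only, not whether the resource exists.

```python
# Prefixes as listed in the prerequisites above.
OCID_PREFIXES = {
    "compartment": "ocid1.compartment.",
    "topic": "ocid1.onstopic.",
    "instance": "ocid1.instance.",
}


def check_ocid(kind: str, value: str) -> bool:
    """Return True if `value` starts with the prefix expected for `kind`.

    Hypothetical helper: a prefix check only, with no API call involved.
    """
    prefix = OCID_PREFIXES[kind]
    # Require something after the prefix so a bare prefix is rejected.
    return value.startswith(prefix) and len(value) > len(prefix)


if __name__ == "__main__":
    print(check_ocid("compartment", "ocid1.compartment.oc1..xxx"))
    print(check_ocid("topic", "ocid1.instance.oc1.iad.xxx"))
```

A check like this catches the common mistake of pasting an instance OCID where a topic OCID is expected.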
Usage
Monitor All VMs in a Compartment
Perfect for production environments where you want comprehensive coverage:
python3 oci_vm_alarms.py --compartment ocid1.compartment.oc1..xxx --topic ocid1.onstopic.oc1..xxx
This will:
- Discover all running VMs in the compartment
- Create heartbeat alarms for each VM
- Display a summary of successes and failures
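The three steps above can be sketched as a loop that skips non-running instances and records each outcome. This is an illustration under stated assumptions, not the script's actual implementation: `create_alarms_for` and the `create_alarm` callback are hypothetical names. The `lifecycle_state` and `display_name` attributes match the OCI SDK's `Instance` model.

```python
def create_alarms_for(instances, create_alarm):
    """Call `create_alarm` for every running instance and collect results.

    instances: objects with `lifecycle_state` and `display_name` attributes.
    create_alarm: hypothetical callback that creates one alarm and raises
        on failure.
    Returns (succeeded_names, [(failed_name, error_message), ...]).
    """
    succeeded, failed = [], []
    for inst in instances:
        if inst.lifecycle_state != "RUNNING":
            continue  # only running VMs are monitored
        try:
            create_alarm(inst)
            succeeded.append(inst.display_name)
        except Exception as exc:  # keep going so one failure doesn't stop the run
            failed.append((inst.display_name, str(exc)))
    return succeeded, failed


# With the real SDK (requires `pip install oci` and a configured profile),
# the instance list could be fetched like this:
#   import oci
#   compute = oci.core.ComputeClient(oci.config.from_file())
#   instances = oci.pagination.list_call_get_all_results(
#       compute.list_instances, compartment_id="ocid1.compartment.oc1..xxx"
#   ).data
```

Collecting failures instead of aborting on the first error is what makes a per-VM summary possible at the end of the run.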
Monitor a Single VM
Ideal for critical individual instances:
python3 oci_vm_alarms.py --vm-ocid ocid1.instance.oc1.iad.xxx --topic ocid1.onstopic.oc1..xxx
The script automatically detects the compartment from the VM OCID.
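One way such a lookup can work is to fetch the instance and read its `compartment_id`. The sketch below is an assumption about the approach, not a copy of the script: the helper `compartment_for_instance` is hypothetical. `get_instance` and `compartment_id` are part of the OCI Python SDK's `ComputeClient` and `Instance` model.

```python
def compartment_for_instance(compute_client, instance_ocid: str) -> str:
    """Return the OCID of the compartment that owns the given VM.

    compute_client: an `oci.core.ComputeClient`, or any object that exposes
        a compatible `get_instance` method.
    """
    # get_instance returns a Response whose .data is the Instance model.
    instance = compute_client.get_instance(instance_ocid).data
    return instance.compartment_id


# With the real SDK (requires `pip install oci` and a configured profile):
#   import oci
#   compute = oci.core.ComputeClient(oci.config.from_file())
#   print(compartment_for_instance(compute, "ocid1.instance.oc1.iad.xxx"))
```

Passing the client in as a parameter keeps the helper easy to test with a stub, without touching a real tenancy.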
Non-Interactive Mode
For automation, cron jobs, or CI/CD pipelines:
python3 oci_vm_alarms.py --compartment ocid1.compartment.oc1..xxx --topic ocid1.onstopic.oc1..xxx --non-interactive
Runs without prompts, perfect for scheduled execution.
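For readers wiring the script into automation, the flags shown in the usage examples map naturally onto an `argparse` definition. The following is a hypothetical reconstruction based only on those flags. The real script's parser may differ, and `build_parser`, `missing_inputs`, and the rule that `--compartment` and `--vm-ocid` are mutually exclusive are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage examples (sketch)."""
    parser = argparse.ArgumentParser(
        description="Create heartbeat alarms for OCI VMs (illustrative sketch)."
    )
    # Assumption: a run targets either a whole compartment or one VM.
    target = parser.add_mutually_exclusive_group()
    target.add_argument("--compartment", help="Compartment OCID")
    target.add_argument("--vm-ocid", help="Single VM instance OCID")
    parser.add_argument("--topic", help="Notification topic OCID")
    parser.add_argument("--non-interactive", action="store_true",
                        help="Never prompt; fail if inputs are missing")
    return parser


def missing_inputs(args: argparse.Namespace) -> list:
    """List required inputs that were not supplied on the command line."""
    missing = []
    if not (args.compartment or args.vm_ocid):
        missing.append("--compartment or --vm-ocid")
    if not args.topic:
        missing.append("--topic")
    return missing


if __name__ == "__main__":
    args = build_parser().parse_args()
    problems = missing_inputs(args)
    if problems and args.non_interactive:
        # In automation there is nobody to answer a prompt, so exit instead.
        raise SystemExit("Missing required input: " + ", ".join(problems))
```

In a cron job or pipeline, exiting with a non-zero status on missing input is usually preferable to waiting on a prompt that nobody will answer.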
Real-World Use Cases
Windows Server Environments
Monitor critical Windows VMs prone to BIOS issues:
# Create alarms for all running VMs in the production compartment
python3 oci_vm_alarms.py --compartment $PROD_COMPARTMENT --topic $ALERT_TOPIC
Critical Database Servers
Monitor individual high-value instances:
# Monitor primary database server
python3 oci_vm_alarms.py --vm-ocid $DB_PRIMARY_OCID --topic $DBA_ALERT_TOPIC
Automated Monitoring Setup
Add to infrastructure-as-code workflows:
# Run after VM provisioning in CI/CD
python3 oci_vm_alarms.py --vm-ocid $NEW_VM_OCID --topic $MONITORING_TOPIC --non-interactive
Cost Optimization Audits
Identify zombie VMs that appear running but aren't functional:
# Create alarms for all running VMs in dev/test environments
python3 oci_vm_alarms.py --compartment $DEV_COMPARTMENT --topic $DEVOPS_TOPIC
Monitoring Details
Check Frequency: Every 5 minutes
Alarm Triggers:
- VM unresponsive for 5 consecutive minutes
- No status reported for 10+ minutes
Alarm Severity: CRITICAL
Notification Method: Email via OCI notification topics
Alarm Naming: Heartbeat-[VM-Name] for easy identification
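To make these settings concrete, here is a rough sketch of how they might map onto the fields of an OCI alarm definition. Several parts are assumptions the article does not confirm: the metric namespace `oci_compute_infrastructure_health`, the metric name `instance_accessibility_status`, the exact query text, and the helper name `heartbeat_alarm_fields`. The dictionary keys mirror attributes of the SDK's `oci.monitoring.models.CreateAlarmDetails`.

```python
# ASSUMPTIONS: the namespace and metric name are not stated in the article.
NAMESPACE = "oci_compute_infrastructure_health"
METRIC = "instance_accessibility_status"


def heartbeat_alarm_fields(vm_name, vm_ocid, compartment_ocid, topic_ocid):
    """Build alarm fields matching the monitoring details (hypothetical)."""
    selector = f'{METRIC}[5m]{{resourceId = "{vm_ocid}"}}'
    # Assumed query: fire when the VM reports as unreachable, OR when no
    # data point has arrived for 10 minutes.
    query = f"{selector}.max() > 0 || {selector}.absent(10m)"
    return {
        "display_name": f"Heartbeat-{vm_name}",   # naming scheme from the text
        "compartment_id": compartment_ocid,
        "metric_compartment_id": compartment_ocid,
        "namespace": NAMESPACE,
        "query": query,
        "severity": "CRITICAL",
        "destinations": [topic_ocid],             # notification topic OCID
        "is_enabled": True,
        "pending_duration": "PT5M",               # condition must hold 5 minutes
    }


if __name__ == "__main__":
    fields = heartbeat_alarm_fields(
        "web01", "ocid1.instance.oc1.iad.xxx",
        "ocid1.compartment.oc1..xxx", "ocid1.onstopic.oc1..xxx",
    )
    print(fields["display_name"])
    print(fields["query"])
```

Verify the metric name and query syntax against the OCI Monitoring documentation before relying on a definition like this.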
Best Practices
- Test First - Always test in non-production environments before deploying to production
- Separate Topics - Use different notification topics for different environments (prod/dev/test)
- Subscribe Relevant Teams - Ensure the right people receive alerts (ops team, DBAs, etc.)
- Regular Audits - Periodically re-run to ensure new VMs are monitored
- Document Runbooks - Create response procedures for when alarms trigger
Troubleshooting
The script includes comprehensive error handling for common issues:
- Permission Errors - Validates you have compute and monitoring permissions
- Invalid OCIDs - Checks OCID format before making API calls
- Missing Resources - Reports if compartments or VMs don't exist
- API Limits - Handles rate limiting gracefully
All errors include actionable troubleshooting guidance.
Why This Matters
Traditional monitoring often only checks if a VM is powered on. This leaves critical gaps:
- ✅ VM Power State: "Running"
- ❌ VM Accessibility: Unresponsive
- ❌ Application Status: Down
- ❌ User Impact: Complete outage
OCI Heartbeat fills this gap by monitoring what actually matters: Can users reach this VM?
This is especially critical for:
- Customer-facing applications
- Database servers
- API endpoints
- Windows environments with boot issues
- Any VM where uptime is critical
Get Started Today
OCI Heartbeat is open source and ready to deploy:
Repository: https://github.com/timmcfadden/OCIHeartbeat
Set up monitoring for your entire fleet in minutes and catch unresponsive VMs before they impact your users.
⚠️ Note: Test thoroughly in your environment before production deployments. Every infrastructure is unique!
Questions? Open an issue on GitHub or contribute improvements! 🚀