Monitoring Guide¶
Overview¶
NetIntel-OCR provides comprehensive monitoring capabilities for production deployments, including metrics, logging, health checks, distributed tracing, and alerting.
Monitoring Architecture¶
graph LR
A[NetIntel-OCR] --> B[Metrics]
A --> C[Logs]
A --> D[Health Checks]
A --> E[Traces]
B --> F[Prometheus]
C --> G[Elasticsearch]
D --> H[Health API]
E --> I[Jaeger]
F --> J[Grafana]
G --> J
H --> K[Alerts]
I --> J
System Monitoring¶
Health Checks¶
# Check overall system health
netintel-ocr system health
# Detailed health report
netintel-ocr system health --detailed
# Component-specific health
netintel-ocr system health --component api
netintel-ocr system health --component mcp
netintel-ocr system health --component db
netintel-ocr system health --component models
# JSON output for automation
netintel-ocr system health --json
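The --json output lends itself to scripted checks. Below is a minimal Python sketch, assuming the payload carries the same status and components fields as the Example Health Response shown later in this guide:
#!/usr/bin/env python3
"""Fail a cron or CI job when NetIntel-OCR reports an unhealthy component."""
import json
import subprocess
import sys

# --json is assumed to emit the structure shown in the Example Health Response below
result = subprocess.run(
    ["netintel-ocr", "system", "health", "--json"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)

unhealthy = [name for name, state in report.get("components", {}).items()
             if state != "healthy"]
if report.get("status") != "healthy" or unhealthy:
    print(f"Unhealthy components: {', '.join(unhealthy) or 'unknown'}", file=sys.stderr)
    sys.exit(1)
print("All components healthy")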
System Metrics¶
# View current metrics
netintel-ocr system metrics
# Continuous metrics monitoring
netintel-ocr system metrics --watch
# Export metrics
netintel-ocr system metrics --export metrics.json
# Specific metric categories
netintel-ocr system metrics --category cpu
netintel-ocr system metrics --category memory
netintel-ocr system metrics --category disk
netintel-ocr system metrics --category network
Performance Monitoring¶
# Performance snapshot
netintel-ocr system performance
# Performance profiling
netintel-ocr system profile --duration 60
# Bottleneck analysis
netintel-ocr system analyze
# Resource usage
netintel-ocr system resources
Server Monitoring¶
API Server Metrics¶
# Server status
netintel-ocr server status
# Server metrics
netintel-ocr server metrics
# Request statistics
netintel-ocr server requests --stats
# Active connections
netintel-ocr server connections
# Worker status
netintel-ocr server workers
Endpoint Monitoring¶
# Health endpoints
GET /health # Basic health check
GET /ready # Readiness probe
GET /alive # Liveness probe
GET /metrics # Prometheus metrics
GET /status # Detailed status
Example Health Response¶
{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "0.1.17",
  "uptime": 3600,
  "components": {
    "api": "healthy",
    "mcp": "healthy",
    "database": "healthy",
    "models": "healthy",
    "cache": "healthy"
  },
  "metrics": {
    "requests_total": 1234,
    "requests_per_second": 10.5,
    "average_latency_ms": 250,
    "error_rate": 0.01
  }
}
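To wire these endpoints into external checks (for example a Kubernetes readiness probe or a periodic job), a minimal Python sketch is shown below; the base URL is an assumption and should be replaced with your API server's address:
"""Poll the /health endpoint and mirror its status as a process exit code."""
import json
import sys
import urllib.request

HEALTH_URL = "http://localhost:8000/health"  # assumed host/port

with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
    payload = json.load(resp)

print(f"status={payload.get('status')} uptime={payload.get('uptime', '?')}s")
for component, state in payload.get("components", {}).items():
    print(f"  {component}: {state}")

sys.exit(0 if payload.get("status") == "healthy" else 1)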
Prometheus Integration¶
Metrics Endpoint Configuration¶
# Enable Prometheus metrics
netintel-ocr config set monitoring.prometheus.enabled true
netintel-ocr config set monitoring.prometheus.port 9090
netintel-ocr config set monitoring.prometheus.path /metrics
Available Metrics¶
# Request metrics
netintel_requests_total{method="POST",endpoint="/process",status="200"} 1234
netintel_request_duration_seconds{quantile="0.99"} 1.5
netintel_requests_in_flight 5
# Processing metrics
netintel_documents_processed_total 456
netintel_pages_processed_total 7890
netintel_diagrams_extracted_total 234
netintel_processing_duration_seconds{document="example.pdf"} 45.2
# Model metrics
netintel_model_inference_duration_seconds{model="qwen2.5vl:7b"} 2.3
netintel_model_load_time_seconds{model="qwen2.5vl:7b"} 15.4
netintel_model_memory_bytes{model="qwen2.5vl:7b"} 7516192768
# Database metrics
netintel_db_connections_active 10
netintel_db_queries_total{type="vector_search"} 567
netintel_db_query_duration_seconds{operation="search"} 0.15
# System metrics
netintel_cpu_usage_percent 45.2
netintel_memory_usage_bytes 4294967296
netintel_disk_usage_bytes{path="/app/cache"} 10737418240
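To spot-check what the exporter is publishing without a full Prometheus stack, the endpoint can be fetched and parsed with the official prometheus_client library. A minimal sketch, assuming the port and path configured above:
"""Fetch and inspect the metrics endpoint from Python."""
import urllib.request
from prometheus_client.parser import text_string_to_metric_families

# Port and path as configured via monitoring.prometheus.port / monitoring.prometheus.path
raw = urllib.request.urlopen("http://localhost:9090/metrics", timeout=5).read().decode()

for family in text_string_to_metric_families(raw):
    if family.name.startswith("netintel_requests"):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)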
Prometheus Configuration¶
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'netintel-ocr'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: /metrics
    scrape_interval: 10s
Grafana Dashboards¶
Import Dashboard¶
# Export dashboard template
netintel-ocr monitoring dashboard export > dashboard.json
# Configure Grafana
netintel-ocr monitoring dashboard config \
  --grafana-url http://localhost:3000 \
  --grafana-api-key $GRAFANA_API_KEY
# Deploy dashboard
netintel-ocr monitoring dashboard deploy
Key Dashboard Panels¶
- Request Rate: Requests per second over time
- Latency: P50, P95, P99 latencies
- Error Rate: Percentage of failed requests
- Processing Queue: Documents in queue
- Model Performance: Inference times by model
- Resource Usage: CPU, memory, disk utilization
- Database Performance: Query latencies and throughput
Logging¶
Log Configuration¶
# Set log level
netintel-ocr config set logging.level INFO
# Set log format
netintel-ocr config set logging.format json
# Enable file logging
netintel-ocr config set logging.file /var/log/netintel.log
# Configure rotation
netintel-ocr config set logging.rotation.enabled true
netintel-ocr config set logging.rotation.max_size 100MB
netintel-ocr config set logging.rotation.max_files 10
Log Levels¶
- DEBUG: Detailed debugging information
- INFO: General informational messages
- WARNING: Warning messages
- ERROR: Error messages
- CRITICAL: Critical issues
Structured Logging¶
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "message": "Processing document",
  "document": "example.pdf",
  "pages": 10,
  "request_id": "req-12345",
  "user": "api-user",
  "duration_ms": 4500,
  "metadata": {
    "model": "qwen2.5vl:7b",
    "diagrams_found": 3
  }
}
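For reference, a JSON log format like the one above can be reproduced with nothing but Python's standard logging module. The formatter and rotation values below are illustrative and mirror the rotation settings shown earlier, not NetIntel-OCR's internal implementation:
"""Stdlib-only sketch of a JSON log formatter with size-based rotation."""
import json
import logging
from datetime import datetime, timezone
from logging.handlers import RotatingFileHandler

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Extra fields (document, request_id, ...) passed via logging's extra=
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = RotatingFileHandler("netintel.log", maxBytes=100 * 1024 * 1024, backupCount=10)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("netintel")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Processing document",
            extra={"fields": {"document": "example.pdf", "pages": 10}})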
Log Aggregation¶
# Configure Elasticsearch output
netintel-ocr config set logging.elasticsearch.enabled true
netintel-ocr config set logging.elasticsearch.host elasticsearch:9200
netintel-ocr config set logging.elasticsearch.index netintel-logs
# Configure Fluentd
netintel-ocr config set logging.fluentd.enabled true
netintel-ocr config set logging.fluentd.host fluentd:24224
Distributed Tracing¶
OpenTelemetry Integration¶
# Enable tracing
netintel-ocr config set tracing.enabled true
netintel-ocr config set tracing.provider opentelemetry
# Configure Jaeger
netintel-ocr config set tracing.jaeger.endpoint http://jaeger:14268/api/traces
netintel-ocr config set tracing.sample_rate 0.1
Trace Context¶
# Automatic trace propagation
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "trace_flags": "01",
  "operations": [
    {
      "name": "process_document",
      "duration_ms": 4500,
      "children": [
        {"name": "extract_text", "duration_ms": 1200},
        {"name": "detect_diagrams", "duration_ms": 2000},
        {"name": "generate_mermaid", "duration_ms": 1000},
        {"name": "store_results", "duration_ms": 300}
      ]
    }
  ]
}
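A nested span tree like the one above can be produced with the OpenTelemetry Python SDK. The sketch below uses a console exporter for illustration; in production the exporter would point at the Jaeger endpoint configured above:
"""Emit a process_document span with child spans via the OpenTelemetry SDK."""
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for illustration; swap in your configured exporter in production
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("netintel-ocr.example")

with tracer.start_as_current_span("process_document"):
    with tracer.start_as_current_span("extract_text"):
        pass  # text extraction happens here
    with tracer.start_as_current_span("detect_diagrams"):
        pass
    with tracer.start_as_current_span("generate_mermaid"):
        pass
    with tracer.start_as_current_span("store_results"):
        pass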
Alerting¶
Alert Configuration¶
# Configure alert rules
netintel-ocr monitoring alerts add \
  --name high-error-rate \
  --condition "error_rate > 0.05" \
  --duration 5m \
  --severity critical
# List alerts
netintel-ocr monitoring alerts list
# Test alert
netintel-ocr monitoring alerts test high-error-rate
Alert Channels¶
# Configure email alerts
netintel-ocr monitoring alerts channel add email \
  --smtp-host smtp.gmail.com \
  --smtp-port 587 \
  --from [email protected] \
  --to [email protected]

# Configure Slack alerts
netintel-ocr monitoring alerts channel add slack \
  --webhook-url https://hooks.slack.com/services/XXX

# Configure PagerDuty
netintel-ocr monitoring alerts channel add pagerduty \
  --integration-key YOUR_KEY
Alert Rules Examples¶
# alerts.yaml
alerts:
  - name: high_error_rate
    expr: rate(netintel_requests_total{status=~"5.."}[5m]) > 0.05
    for: 5m
    severity: critical
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value }} (threshold 0.05)"

  - name: slow_processing
    expr: netintel_processing_duration_seconds{quantile="0.99"} > 60
    for: 10m
    severity: warning
    annotations:
      summary: "Slow document processing"

  - name: low_disk_space
    expr: netintel_disk_free_bytes < 1073741824
    for: 5m
    severity: critical
    annotations:
      summary: "Low disk space (< 1GB)"
Performance Profiling¶
CPU Profiling¶
# Start CPU profiling
netintel-ocr system profile cpu --duration 60 --output cpu.prof
# Analyze profile
netintel-ocr system profile analyze cpu.prof
# Generate flame graph
netintel-ocr system profile flamegraph cpu.prof --output flame.svg
Memory Profiling¶
# Memory snapshot
netintel-ocr system profile memory --output memory.prof
# Memory leak detection
netintel-ocr system profile memory --detect-leaks
# Heap analysis
netintel-ocr system profile heap
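For quick ad-hoc checks outside the built-in profiler, Python's standard tracemalloc module gives a rough view of allocation hot spots; a minimal sketch:
"""Ad-hoc heap check with the standard-library tracemalloc module."""
import tracemalloc

tracemalloc.start()

# ... exercise the code path you suspect, e.g. process a batch of documents ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)  # top allocation sites by source line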
Custom Metrics¶
Adding Custom Metrics¶
from netintel_ocr.monitoring import metrics

# Counter: monotonically increasing count of processed documents
documents_processed = metrics.Counter(
    'documents_processed_total',
    'Total documents processed'
)
documents_processed.inc()

# Histogram: observe processing durations; time() works as a context manager
processing_time = metrics.Histogram(
    'processing_duration_seconds',
    'Document processing duration'
)
with processing_time.time():
    process_document()  # your processing function

# Gauge: a value that can go up and down, such as current queue depth
queue_size = metrics.Gauge(
    'processing_queue_size',
    'Current queue size'
)
queue_size.set(len(queue))  # queue is the application's work queue
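The helper API above follows the same pattern as the standard Prometheus Python client. For comparison, here are the same three instrument types using prometheus_client directly, exposed on a hypothetical side port:
"""Equivalent Counter/Histogram/Gauge usage with the prometheus_client package."""
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

documents_processed = Counter("documents_processed_total", "Total documents processed")
processing_time = Histogram("processing_duration_seconds", "Document processing duration")
queue_size = Gauge("processing_queue_size", "Current queue size")

start_http_server(8001)  # hypothetical side port for Prometheus to scrape

documents_processed.inc()
with processing_time.time():
    time.sleep(0.1)  # stand-in for real document processing
queue_size.set(3)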
Monitoring Best Practices¶
1. Set Up Dashboards Early¶
# Deploy standard dashboards
netintel-ocr monitoring dashboard deploy --all
# Customize for your needs
netintel-ocr monitoring dashboard customize \
  --template standard \
  --output custom-dashboard.json
2. Configure Appropriate Retention¶
# Set metrics retention
netintel-ocr config set monitoring.metrics.retention 30d
# Set log retention
netintel-ocr config set logging.retention 7d
# Set trace retention
netintel-ocr config set tracing.retention 3d
3. Monitor Key SLIs¶
- Availability: Uptime percentage
- Latency: P50, P95, P99 response times
- Error Rate: Percentage of failed requests
- Throughput: Requests/documents per second
4. Set Up Alerting Thresholds¶
# SLA-based alerts
netintel-ocr monitoring alerts add \
  --name sla-availability \
  --condition "availability < 0.999" \
  --severity critical

netintel-ocr monitoring alerts add \
  --name sla-latency \
  --condition "p99_latency > 5s" \
  --severity warning
Troubleshooting Monitoring¶
No Metrics Data¶
# Check metrics endpoint
curl http://localhost:9090/metrics
# Verify Prometheus scraping
netintel-ocr monitoring verify prometheus
# Check configuration
netintel-ocr config get monitoring.prometheus
Missing Logs¶
# Check log configuration
netintel-ocr config get logging
# Test log output
netintel-ocr system test-logs
# Verify permissions
ls -la /var/log/netintel.log
Performance Issues¶
# Reduce metric cardinality
netintel-ocr config set monitoring.metrics.cardinality low
# Adjust sampling
netintel-ocr config set tracing.sample_rate 0.01
# Optimize log level
netintel-ocr config set logging.level WARNING
Monitoring in Production¶
Kubernetes Monitoring¶
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: netintel-ocr
spec:
  selector:
    matchLabels:
      app: netintel-ocr
  endpoints:
    - port: metrics
      interval: 10s
Docker Monitoring¶
# docker-compose.yml
services:
  netintel-ocr:
    image: netintel-ocr:latest
    labels:
      - "prometheus.io/scrape=true"
      - "prometheus.io/port=9090"
      - "prometheus.io/path=/metrics"
Next Steps¶
- Deployment Guide - Production deployment
- Performance Guide - Performance optimization
- Troubleshooting Guide - Common issues
- API Guide - API monitoring endpoints