Monitoring Guide¶
Overview¶
NetIntel-OCR provides comprehensive monitoring capabilities for production deployments, including metrics, logging, health checks, distributed tracing, and alerting.
Monitoring Architecture¶
graph LR
A[NetIntel-OCR] --> B[Metrics]
A --> C[Logs]
A --> D[Health Checks]
A --> E[Traces]
B --> F[Prometheus]
C --> G[Elasticsearch]
D --> H[Health API]
E --> I[Jaeger]
F --> J[Grafana]
G --> J
H --> K[Alerts]
I --> J
System Monitoring¶
Health Checks¶
# Check overall system health
netintel-ocr system health
# Detailed health report
netintel-ocr system health --detailed
# Component-specific health
netintel-ocr system health --component api
netintel-ocr system health --component mcp
netintel-ocr system health --component db
netintel-ocr system health --component models
# JSON output for automation
netintel-ocr system health --json
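The --json output lends itself to scripted checks. Below is a minimal Python sketch, assuming the payload carries the same status and components fields as the Example Health Response shown later in this guide:
#!/usr/bin/env python3
"""Fail a cron or CI job when NetIntel-OCR reports an unhealthy component."""
import json
import subprocess
import sys

# --json is assumed to emit the structure shown in the Example Health Response below
result = subprocess.run(
    ["netintel-ocr", "system", "health", "--json"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)

unhealthy = [name for name, state in report.get("components", {}).items()
             if state != "healthy"]
if report.get("status") != "healthy" or unhealthy:
    print(f"Unhealthy components: {', '.join(unhealthy) or 'unknown'}", file=sys.stderr)
    sys.exit(1)
print("All components healthy")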
System Metrics¶
# View current metrics
netintel-ocr system metrics
# Continuous metrics monitoring
netintel-ocr system metrics --watch
# Export metrics
netintel-ocr system metrics --export metrics.json
# Specific metric categories
netintel-ocr system metrics --category cpu
netintel-ocr system metrics --category memory
netintel-ocr system metrics --category disk
netintel-ocr system metrics --category network
Performance Monitoring¶
# Performance snapshot
netintel-ocr system performance
# Performance profiling
netintel-ocr system profile --duration 60
# Bottleneck analysis
netintel-ocr system analyze
# Resource usage
netintel-ocr system resources
Server Monitoring¶
API Server Metrics¶
# Server status
netintel-ocr server status
# Server metrics
netintel-ocr server metrics
# Request statistics
netintel-ocr server requests --stats
# Active connections
netintel-ocr server connections
# Worker status
netintel-ocr server workers
Endpoint Monitoring¶
# Health endpoints
GET /health # Basic health check
GET /ready # Readiness probe
GET /alive # Liveness probe
GET /metrics # Prometheus metrics
GET /status # Detailed status
Example Health Response¶
{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "0.1.17",
  "uptime": 3600,
  "components": {
    "api": "healthy",
    "mcp": "healthy",
    "database": "healthy",
    "models": "healthy",
    "cache": "healthy"
  },
  "metrics": {
    "requests_total": 1234,
    "requests_per_second": 10.5,
    "average_latency_ms": 250,
    "error_rate": 0.01
  }
}
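To wire these endpoints into external checks (for example a Kubernetes readiness probe or a periodic job), a minimal Python sketch is shown below; the base URL is an assumption and should be replaced with your API server's address:
"""Poll the /health endpoint and mirror its status as a process exit code."""
import json
import sys
import urllib.request

HEALTH_URL = "http://localhost:8000/health"  # assumed host/port

with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
    payload = json.load(resp)

print(f"status={payload.get('status')} uptime={payload.get('uptime', '?')}s")
for component, state in payload.get("components", {}).items():
    print(f"  {component}: {state}")

sys.exit(0 if payload.get("status") == "healthy" else 1)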
Prometheus Integration¶
Metrics Endpoint Configuration¶
# Enable Prometheus metrics
netintel-ocr config set monitoring.prometheus.enabled true
netintel-ocr config set monitoring.prometheus.port 9090
netintel-ocr config set monitoring.prometheus.path /metrics
Available Metrics¶
# Request metrics
netintel_requests_total{method="POST",endpoint="/process",status="200"} 1234
netintel_request_duration_seconds{quantile="0.99"} 1.5
netintel_requests_in_flight 5
# Processing metrics
netintel_documents_processed_total 456
netintel_pages_processed_total 7890
netintel_diagrams_extracted_total 234
netintel_processing_duration_seconds{document="example.pdf"} 45.2
# Model metrics
netintel_model_inference_duration_seconds{model="qwen2.5vl:7b"} 2.3
netintel_model_load_time_seconds{model="qwen2.5vl:7b"} 15.4
netintel_model_memory_bytes{model="qwen2.5vl:7b"} 7516192768
# Database metrics
netintel_db_connections_active 10
netintel_db_queries_total{type="vector_search"} 567
netintel_db_query_duration_seconds{operation="search"} 0.15
# System metrics
netintel_cpu_usage_percent 45.2
netintel_memory_usage_bytes 4294967296
netintel_disk_usage_bytes{path="/app/cache"} 10737418240
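To spot-check what the exporter is publishing without a full Prometheus stack, the endpoint can be fetched and parsed with the official prometheus_client library. A minimal sketch, assuming the port and path configured above:
"""Fetch and inspect the metrics endpoint from Python."""
import urllib.request
from prometheus_client.parser import text_string_to_metric_families

# Port and path as configured via monitoring.prometheus.port / monitoring.prometheus.path
raw = urllib.request.urlopen("http://localhost:9090/metrics", timeout=5).read().decode()

for family in text_string_to_metric_families(raw):
    if family.name.startswith("netintel_requests"):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)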
Prometheus Configuration¶
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'netintel-ocr'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: /metrics
    scrape_interval: 10s
Grafana Dashboards¶
Import Dashboard¶
# Export dashboard template
netintel-ocr monitoring dashboard export > dashboard.json
# Configure Grafana
netintel-ocr monitoring dashboard config \
  --grafana-url http://localhost:3000 \
  --grafana-api-key $GRAFANA_API_KEY
# Deploy dashboard
netintel-ocr monitoring dashboard deploy
Key Dashboard Panels¶
- Request Rate: Requests per second over time
- Latency: P50, P95, P99 latencies
- Error Rate: Percentage of failed requests
- Processing Queue: Documents in queue
- Model Performance: Inference times by model
- Resource Usage: CPU, memory, disk utilization
- Database Performance: Query latencies and throughput
Logging¶
Log Configuration¶
# Set log level
netintel-ocr config set logging.level INFO
# Set log format
netintel-ocr config set logging.format json
# Enable file logging
netintel-ocr config set logging.file /var/log/netintel.log
# Configure rotation
netintel-ocr config set logging.rotation.enabled true
netintel-ocr config set logging.rotation.max_size 100MB
netintel-ocr config set logging.rotation.max_files 10
Log Levels¶
- DEBUG: Detailed debugging information
- INFO: General informational messages
- WARNING: Warning messages
- ERROR: Error messages
- CRITICAL: Critical issues
Structured Logging¶
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "message": "Processing document",
  "document": "example.pdf",
  "pages": 10,
  "request_id": "req-12345",
  "user": "api-user",
  "duration_ms": 4500,
  "metadata": {
    "model": "qwen2.5vl:7b",
    "diagrams_found": 3
  }
}
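For reference, a JSON log format like the one above can be reproduced with nothing but Python's standard logging module. The formatter and rotation values below are illustrative and mirror the rotation settings shown earlier, not NetIntel-OCR's internal implementation:
"""Stdlib-only sketch of a JSON log formatter with size-based rotation."""
import json
import logging
from datetime import datetime, timezone
from logging.handlers import RotatingFileHandler

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Extra fields (document, request_id, ...) passed via logging's extra=
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = RotatingFileHandler("netintel.log", maxBytes=100 * 1024 * 1024, backupCount=10)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("netintel")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Processing document",
            extra={"fields": {"document": "example.pdf", "pages": 10}})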
Log Aggregation¶
# Configure Elasticsearch output
netintel-ocr config set logging.elasticsearch.enabled true
netintel-ocr config set logging.elasticsearch.host elasticsearch:9200
netintel-ocr config set logging.elasticsearch.index netintel-logs
# Configure Fluentd
netintel-ocr config set logging.fluentd.enabled true
netintel-ocr config set logging.fluentd.host fluentd:24224
Distributed Tracing¶
OpenTelemetry Integration¶
# Enable tracing
netintel-ocr config set tracing.enabled true
netintel-ocr config set tracing.provider opentelemetry
# Configure Jaeger
netintel-ocr config set tracing.jaeger.endpoint http://jaeger:14268/api/traces
netintel-ocr config set tracing.sample_rate 0.1
Trace Context¶
# Automatic trace propagation
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "trace_flags": "01",
  "operations": [
    {
      "name": "process_document",
      "duration_ms": 4500,
      "children": [
        {"name": "extract_text", "duration_ms": 1200},
        {"name": "detect_diagrams", "duration_ms": 2000},
        {"name": "generate_mermaid", "duration_ms": 1000},
        {"name": "store_results", "duration_ms": 300}
      ]
    }
  ]
}
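A nested span tree like the one above can be produced with the OpenTelemetry Python SDK. The sketch below uses a console exporter for illustration; in production the exporter would point at the Jaeger endpoint configured above:
"""Emit a process_document span with child spans via the OpenTelemetry SDK."""
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for illustration; swap in your configured exporter in production
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("netintel-ocr.example")

with tracer.start_as_current_span("process_document"):
    with tracer.start_as_current_span("extract_text"):
        pass  # text extraction happens here
    with tracer.start_as_current_span("detect_diagrams"):
        pass
    with tracer.start_as_current_span("generate_mermaid"):
        pass
    with tracer.start_as_current_span("store_results"):
        pass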
Alerting¶
Alert Configuration¶
# Configure alert rules
netintel-ocr monitoring alerts add \
  --name high-error-rate \
  --condition "error_rate > 0.05" \
  --duration 5m \
  --severity critical
# List alerts
netintel-ocr monitoring alerts list
# Test alert
netintel-ocr monitoring alerts test high-error-rate
Alert Channels¶
# Configure email alerts
netintel-ocr monitoring alerts channel add email \
  --smtp-host smtp.gmail.com \
  --smtp-port 587 \
  --from [email protected] \
  --to [email protected]

# Configure Slack alerts
netintel-ocr monitoring alerts channel add slack \
  --webhook-url https://hooks.slack.com/services/XXX

# Configure PagerDuty
netintel-ocr monitoring alerts channel add pagerduty \
  --integration-key YOUR_KEY
Alert Rules Examples¶
# alerts.yaml
alerts:
  - name: high_error_rate
    expr: rate(netintel_requests_total{status=~"5.."}[5m]) > 0.05
    for: 5m
    severity: critical
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value }} (threshold 0.05)"

  - name: slow_processing
    expr: netintel_processing_duration_seconds{quantile="0.99"} > 60
    for: 10m
    severity: warning
    annotations:
      summary: "Slow document processing"

  - name: low_disk_space
    expr: netintel_disk_free_bytes < 1073741824
    for: 5m
    severity: critical
    annotations:
      summary: "Low disk space (< 1GB)"
Performance Profiling¶
CPU Profiling¶
# Start CPU profiling
netintel-ocr system profile cpu --duration 60 --output cpu.prof
# Analyze profile
netintel-ocr system profile analyze cpu.prof
# Generate flame graph
netintel-ocr system profile flamegraph cpu.prof --output flame.svg
Memory Profiling¶
# Memory snapshot
netintel-ocr system profile memory --output memory.prof
# Memory leak detection
netintel-ocr system profile memory --detect-leaks
# Heap analysis
netintel-ocr system profile heap
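For quick ad-hoc checks outside the built-in profiler, Python's standard tracemalloc module gives a rough view of allocation hot spots; a minimal sketch:
"""Ad-hoc heap check with the standard-library tracemalloc module."""
import tracemalloc

tracemalloc.start()

# ... exercise the code path you suspect, e.g. process a batch of documents ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)  # top allocation sites by source line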
Custom Metrics¶
Adding Custom Metrics¶
from netintel_ocr.monitoring import metrics

# Counter: monotonically increasing count of processed documents
documents_processed = metrics.Counter(
    'documents_processed_total',
    'Total documents processed'
)
documents_processed.inc()

# Histogram: observe processing durations; time() works as a context manager
processing_time = metrics.Histogram(
    'processing_duration_seconds',
    'Document processing duration'
)
with processing_time.time():
    process_document()  # your processing function

# Gauge: a value that can go up and down, such as current queue depth
queue_size = metrics.Gauge(
    'processing_queue_size',
    'Current queue size'
)
queue_size.set(len(queue))  # queue is the application's work queue
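The helper API above follows the same pattern as the standard Prometheus Python client. For comparison, here are the same three instrument types using prometheus_client directly, exposed on a hypothetical side port:
"""Equivalent Counter/Histogram/Gauge usage with the prometheus_client package."""
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

documents_processed = Counter("documents_processed_total", "Total documents processed")
processing_time = Histogram("processing_duration_seconds", "Document processing duration")
queue_size = Gauge("processing_queue_size", "Current queue size")

start_http_server(8001)  # hypothetical side port for Prometheus to scrape

documents_processed.inc()
with processing_time.time():
    time.sleep(0.1)  # stand-in for real document processing
queue_size.set(3)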
Monitoring Best Practices¶
1. Set Up Dashboards Early¶
# Deploy standard dashboards
netintel-ocr monitoring dashboard deploy --all
# Customize for your needs
netintel-ocr monitoring dashboard customize \
  --template standard \
  --output custom-dashboard.json
2. Configure Appropriate Retention¶
# Set metrics retention
netintel-ocr config set monitoring.metrics.retention 30d
# Set log retention
netintel-ocr config set logging.retention 7d
# Set trace retention
netintel-ocr config set tracing.retention 3d
3. Monitor Key SLIs¶
- Availability: Uptime percentage
- Latency: P50, P95, P99 response times
- Error Rate: Percentage of failed requests
- Throughput: Requests/documents per second
4. Set Up Alerting Thresholds¶
# SLA-based alerts
netintel-ocr monitoring alerts add \
  --name sla-availability \
  --condition "availability < 0.999" \
  --severity critical

netintel-ocr monitoring alerts add \
  --name sla-latency \
  --condition "p99_latency > 5s" \
  --severity warning
Troubleshooting Monitoring¶
No Metrics Data¶
# Check metrics endpoint
curl http://localhost:9090/metrics
# Verify Prometheus scraping
netintel-ocr monitoring verify prometheus
# Check configuration
netintel-ocr config get monitoring.prometheus
Missing Logs¶
# Check log configuration
netintel-ocr config get logging
# Test log output
netintel-ocr system test-logs
# Verify permissions
ls -la /var/log/netintel.log
Performance Issues¶
# Reduce metric cardinality
netintel-ocr config set monitoring.metrics.cardinality low
# Adjust sampling
netintel-ocr config set tracing.sample_rate 0.01
# Optimize log level
netintel-ocr config set logging.level WARNING
Monitoring in Production¶
Kubernetes Monitoring¶
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: netintel-ocr
spec:
  selector:
    matchLabels:
      app: netintel-ocr
  endpoints:
    - port: metrics
      interval: 10s
Docker Monitoring¶
# docker-compose.yml
services:
  netintel-ocr:
    image: netintel-ocr:latest
    labels:
      - "prometheus.io/scrape=true"
      - "prometheus.io/port=9090"
      - "prometheus.io/path=/metrics"
Next Steps¶
- Deployment Guide - Production deployment
- Performance Guide - Performance optimization
- Troubleshooting Guide - Common issues
- API Guide - API monitoring endpoints