
Batch Processing Guide

New in v0.1.17: Hierarchical CLI

NetIntel-OCR v0.1.17 introduces a hierarchical CLI structure. Batch processing is now under the process batch command group for better organization.

Overview

NetIntel-OCR batch-processes multiple documents efficiently, with parallel workers, progress tracking, and centralized storage of results.

Basic Batch Processing

Process Multiple Files

# Process all PDFs in directory
netintel-ocr process batch /path/to/pdfs/

# Process with file pattern
netintel-ocr process batch /path/to/pdfs/ --pattern "*.pdf"

# With specific model
netintel-ocr process batch /path/to/pdfs/ --model qwen2.5vl:7b

# Recursive processing
netintel-ocr process batch /path/to/documents/ --recursive

Parallel Processing

# Process 4 documents simultaneously
netintel-ocr process batch /path/to/pdfs/ --parallel 4

# Auto-detect optimal parallelism
netintel-ocr process batch /path/to/pdfs/ --auto-parallel

# GPU parallel processing
netintel-ocr process batch /path/to/pdfs/ --gpu --parallel 2

# Process from file list
printf "doc1.pdf\ndoc2.pdf\ndoc3.pdf\n" > file_list.txt
netintel-ocr process batch file_list.txt
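
A file list can also be generated programmatically instead of with shell commands. This Python sketch (a hypothetical helper, not part of the NetIntel-OCR library) collects matching files under a directory and writes one path per line:

```python
from pathlib import Path

def build_file_list(source_dir, list_path, pattern="*.pdf"):
    """Collect files matching pattern under source_dir (recursively)
    and write them to list_path, one path per line."""
    files = sorted(str(p) for p in Path(source_dir).rglob(pattern))
    Path(list_path).write_text("\n".join(files) + "\n")
    return files
```

The resulting file can then be passed to netintel-ocr process batch file_list.txt as shown above.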

Advanced Batch Features

Batch Ingestion

# Ingest to vector store
netintel-ocr process batch /path/to/pdfs/ \
  --collection network_docs \
  --parallel 4

# With deduplication
netintel-ocr process batch /path/to/documents/ \
  --deduplicate \
  --collection unified_docs
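
The internals of the --deduplicate flag are not documented here, but content-hash deduplication can be sketched as follows; unique_files is a hypothetical helper for illustration, not the library's actual implementation:

```python
import hashlib
from pathlib import Path

def unique_files(paths):
    """Return paths with byte-identical duplicates removed.
    The first occurrence of each distinct content hash wins."""
    seen, unique = set(), []
    for p in paths:
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(p)
    return unique
```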

# Watch directory for new files
netintel-ocr process watch /input/folder \
  --pattern "*.pdf" \
  --collection live_docs

Progress Tracking

# Enable progress bar
netintel-ocr process batch /path/to/pdfs/ --progress

# Save progress for resume
netintel-ocr process batch /path/to/pdfs/ \
  --checkpoint batch-checkpoint.json \
  --resume-on-failure

# Resume from checkpoint
netintel-ocr process batch /path/to/pdfs/ \
  --resume-from batch-checkpoint.json
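
The checkpoint file format is not specified in these docs; a plausible minimal structure records which files have completed so a resumed run can skip them. The field names below are assumptions, not the CLI's actual schema:

```python
import json
from pathlib import Path

def save_checkpoint(path, completed, failed):
    """Persist the sets of finished and failed files as JSON."""
    state = {"completed": sorted(completed), "failed": sorted(failed)}
    Path(path).write_text(json.dumps(state, indent=2))

def load_remaining(path, all_files):
    """Return the files from all_files not yet marked completed."""
    state = json.loads(Path(path).read_text())
    done = set(state.get("completed", []))
    return [f for f in all_files if f not in done]
```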

Output Organization

# Organize by document type
netintel-ocr process batch /path/to/pdfs/ \
  --output-structure type \
  --output-dir processed/

# Result:
# processed/
#   ├── network_diagrams/
#   ├── flow_diagrams/
#   └── text_only/

# Alternative output formats
netintel-ocr process batch /path/to/pdfs/ \
  --output-dir results/ \
  --format json

Batch Configuration

YAML Configuration

# batch-config.yaml
batch:
  max_parallel: 4
  chunk_size: 10
  resume_on_failure: true
  checkpoint_file: batch-state.json

  output:
    structure: document  # or 'type', 'date'
    dir: ./processed

  models:
    text: Nanonets-OCR-s:latest
    network: qwen2.5vl:7b
    flow: qwen2.5vl:7b

  filters:
    min_pages: 5
    max_pages: 500
    file_types: [pdf, png, jpg]

  error_handling:
    max_retries: 3
    retry_delay: 5
    skip_on_error: false
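
The filters block above amounts to a predicate over candidate files. A sketch of how the file_types setting might be applied (checking min_pages/max_pages would additionally require a PDF reader, so only the extension check is shown; matches_filters is a hypothetical helper):

```python
from pathlib import Path

def matches_filters(path, file_types=("pdf", "png", "jpg")):
    """True if the file's extension is in the allowed set,
    case-insensitively and ignoring the leading dot."""
    return Path(path).suffix.lstrip(".").lower() in file_types
```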

Use Configuration

# Apply batch configuration
netintel-ocr process batch /path/to/pdfs/ --config batch-config.yaml

# Or use config commands
netintel-ocr config set processing.max_parallel 4
netintel-ocr config set processing.chunk_size 10
netintel-ocr process batch /path/to/pdfs/

Centralized Database

Merge to Central Store

# Create centralized database
netintel-ocr db merge \
  --source-dir ./processed \
  --central-db ./central/unified.db

# With metadata
netintel-ocr db merge \
  --source-dir ./processed \
  --central-db ./central/unified.db \
  --add-metadata "project=network-refresh" \
  --add-metadata "date=2024-01-15"

Query Centralized Database

# Search across all documents
netintel-ocr db query "firewall configuration" \
  --db ./central/unified.db

# Filter by metadata
netintel-ocr db query "DMZ architecture" \
  --db ./central/unified.db \
  --filter "project=network-refresh"

# Export query results
netintel-ocr db query "network topology" \
  --db ./central/unified.db \
  --format json > results.json

Cloud Storage Integration

S3/MinIO Support

# Configure S3
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export S3_BUCKET=netintel-output

# Process and upload to S3
netintel-ocr process batch /path/to/pdfs/ \
  --output-s3 s3://netintel-output/processed/

# Process from S3
netintel-ocr process batch s3://netintel-input/ \
  --pattern "*.pdf" \
  --output-s3 s3://netintel-output/

Azure Blob Storage

# Configure Azure
export AZURE_STORAGE_CONNECTION_STRING=your-connection-string

# Process with Azure storage
netintel-ocr process batch /path/to/pdfs/ \
  --storage-backend azure \
  --container processed-docs

Performance Optimization

Memory Management

# Limit memory per process
netintel-ocr process batch /path/to/pdfs/ \
  --max-memory 4GB \
  --parallel 2

# Enable swap for large documents
netintel-ocr process batch /path/to/large-docs/ \
  --enable-swap \
  --swap-dir /tmp/netintel-swap

CPU/GPU Optimization

# CPU-only batch processing
netintel-ocr process batch /path/to/pdfs/ \
  --cpu-only \
  --parallel $(nproc)

# Mixed CPU/GPU processing
netintel-ocr process batch /path/to/pdfs/ \
  --gpu-for-models "llava,qwen2.5vl" \
  --cpu-for-models "Nanonets-OCR-s"

Caching Strategy

# Enable aggressive caching
netintel-ocr process batch /path/to/pdfs/ \
  --cache-models \
  --cache-embeddings \
  --cache-dir /tmp/netintel-cache

# Share cache across runs
export NETINTEL_CACHE_DIR=/shared/cache
netintel-ocr process batch /path/to/pdfs/ --use-cache

Monitoring and Logging

Real-time Monitoring

# Enable metrics server
netintel-ocr process batch /path/to/pdfs/ \
  --metrics-port 9090 \
  --progress-webhook http://monitor/progress

# View metrics
curl http://localhost:9090/metrics

# Use server monitoring
netintel-ocr server health
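
The metrics endpoint presumably serves Prometheus-style text exposition (plain name value lines with # comments). This sketch parses simple samples from such a payload; the metric names in the test are invented for illustration, and labelled metrics would need a fuller parser:

```python
def parse_metrics(text):
    """Parse 'name value' lines into a dict of floats,
    skipping blank lines and # comment lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore lines that do not end in a number
    return metrics
```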

Detailed Logging

# Per-document logs
netintel-ocr --debug process batch /path/to/pdfs/ \
  --log-per-document \
  --log-dir ./logs

# Structured logging
netintel-ocr process batch /path/to/pdfs/ \
  --log-format json \
  --log-file batch.jsonl
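
With --log-format json, each line of batch.jsonl should be an independent JSON object (JSON Lines). A sketch of summarising such a log; the level field name is an assumption about the record schema:

```python
import json

def summarize_jsonl(lines):
    """Count records per 'level' field across JSON Lines records,
    skipping blank lines."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        level = record.get("level", "unknown")
        counts[level] = counts.get(level, 0) + 1
    return counts
```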

Error Handling

Retry Logic

# Automatic retry with backoff
netintel-ocr process batch /path/to/pdfs/ \
  --max-retries 3 \
  --retry-backoff exponential \
  --retry-delay 5
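
Exponential backoff doubles the wait after each failed attempt, so with --retry-delay 5 and --max-retries 3 the delays would be 5s, 10s, and 20s. A sketch of the schedule (backoff_delays is illustrative, not the CLI's internal function):

```python
def backoff_delays(base_delay, max_retries, strategy="exponential"):
    """Return the delay in seconds before each retry attempt."""
    if strategy == "exponential":
        return [base_delay * (2 ** i) for i in range(max_retries)]
    return [base_delay] * max_retries  # fixed-delay fallback
```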

Failed Document Handling

# Skip failed documents
netintel-ocr process batch /path/to/pdfs/ \
  --skip-on-error \
  --failed-list failed.txt

# Reprocess failed documents
netintel-ocr process batch failed.txt \
  --max-retries 5

Batch Scripts

Shell Script Example

#!/bin/bash
# batch-process.sh

DOCS_DIR="/path/to/documents"
OUTPUT_DIR="/path/to/output"
FAILED_LIST="failed_docs.txt"

# Clear previous failures
> "$FAILED_LIST"

# Process each PDF, recording failures
find "$DOCS_DIR" -name "*.pdf" | while read -r file; do
  netintel-ocr process pdf "$file" \
    --model qwen2.5vl:7b \
    --output-dir "$OUTPUT_DIR" || echo "$file" >> "$FAILED_LIST"
done

# Retry failed documents with a fallback model
if [ -s "$FAILED_LIST" ]; then
  echo "Retrying failed documents..."
  netintel-ocr process batch "$FAILED_LIST" \
    --model minicpm-v:latest
fi

Python Script Example

# batch_processor.py
from pathlib import Path
from netintel_ocr import BatchProcessor

processor = BatchProcessor(
    max_parallel=4,
    model="qwen2.5vl:7b",
    output_dir="./processed"
)

# Process all PDFs
pdf_files = Path("/documents").glob("**/*.pdf")
results = processor.process_batch(pdf_files)

# Handle results
for result in results:
    if result.success:
        print(f"✓ {result.file}: {result.diagrams_found} diagrams")
    else:
        print(f"✗ {result.file}: {result.error}")

# Generate summary
processor.generate_summary("batch_summary.json")

Best Practices

  1. Chunk Large Batches: Process in groups of 50-100 documents
  2. Use Checkpoints: Enable resume for long-running batches
  3. Monitor Memory: Set limits to prevent OOM errors
  4. Deduplicate First: Remove duplicates before processing
  5. Test Small Sample: Validate settings on subset first
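
Practice 1 above can be sketched as a simple chunking helper: split the file list into groups of 50-100, then invoke process batch once per group (for example by writing each chunk to its own file list):

```python
def chunked(items, size=50):
    """Yield successive fixed-size chunks from a list;
    the final chunk may be shorter."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```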

Next Steps