Batch Processing Guide¶
New in v0.1.17: Hierarchical CLI
NetIntel-OCR v0.1.17 introduces a hierarchical CLI structure. Batch processing is now under the process batch command group for better organization.
Overview¶
NetIntel-OCR can process many documents in a single run, with parallel workers, progress tracking, checkpoint/resume, and merging of results into centralized storage.
Basic Batch Processing¶
Process Multiple Files¶
# Process all PDFs in directory
netintel-ocr process batch /path/to/pdfs/
# Process with file pattern
netintel-ocr process batch /path/to/pdfs/ --pattern "*.pdf"
# With specific model
netintel-ocr process batch /path/to/pdfs/ --model qwen2.5vl:7b
# Recursive processing
netintel-ocr process batch /path/to/documents/ --recursive
Parallel Processing¶
# Process 4 documents simultaneously
netintel-ocr process batch /path/to/pdfs/ --parallel 4
# Auto-detect optimal parallelism
netintel-ocr process batch /path/to/pdfs/ --auto-parallel
# GPU parallel processing
netintel-ocr process batch /path/to/pdfs/ --gpu --parallel 2
# Process from file list
printf "doc1.pdf\ndoc2.pdf\ndoc3.pdf\n" > file_list.txt
netintel-ocr process batch file_list.txt
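When you are driving single-document `process pdf` runs from a script rather than using the batch command, `xargs -P` gives a similar parallel effect with standard tools. This is a sketch; the input path is a placeholder, and the `process pdf` flags mirror those shown elsewhere in this guide:

```shell
# Run up to 4 single-document jobs concurrently with xargs.
find /path/to/pdfs -name '*.pdf' -print0 |
  xargs -0 -P 4 -I {} netintel-ocr process pdf "{}" --output-dir processed/
```

Using `-print0`/`-0` keeps filenames with spaces intact.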
Advanced Batch Features¶
Batch Ingestion¶
# Ingest to vector store
netintel-ocr process batch /path/to/pdfs/ \
--collection network_docs \
--parallel 4
# With deduplication
netintel-ocr process batch /path/to/documents/ \
--deduplicate \
--collection unified_docs
# Watch directory for new files
netintel-ocr process watch /input/folder \
--pattern "*.pdf" \
--collection live_docs
Progress Tracking¶
# Enable progress bar
netintel-ocr process batch /path/to/pdfs/ --progress
# Save progress for resume
netintel-ocr process batch /path/to/pdfs/ \
--checkpoint batch-checkpoint.json \
--resume-on-failure
# Resume from checkpoint
netintel-ocr process batch /path/to/pdfs/ \
--resume-from batch-checkpoint.json
Output Organization¶
# Organize by document type
netintel-ocr process batch /path/to/pdfs/ \
--output-structure type \
--output-dir processed/
# Result:
# processed/
# ├── network_diagrams/
# ├── flow_diagrams/
# └── text_only/
# Alternative output formats
netintel-ocr process batch /path/to/pdfs/ \
--output-dir results/ \
--format json
Batch Configuration¶
YAML Configuration¶
# batch-config.yaml
batch:
  max_parallel: 4
  chunk_size: 10
  resume_on_failure: true
  checkpoint_file: batch-state.json

output:
  structure: document  # or 'type', 'date'
  dir: ./processed

models:
  text: Nanonets-OCR-s:latest
  network: qwen2.5vl:7b
  flow: qwen2.5vl:7b

filters:
  min_pages: 5
  max_pages: 500
  file_types: [pdf, png, jpg]

error_handling:
  max_retries: 3
  retry_delay: 5
  skip_on_error: false
Use Configuration¶
# Apply batch configuration
netintel-ocr process batch /path/to/pdfs/ --config batch-config.yaml
# Or use config commands
netintel-ocr config set processing.max_parallel 4
netintel-ocr config set processing.chunk_size 10
netintel-ocr process batch /path/to/pdfs/
Centralized Database¶
Merge to Central Store¶
# Create centralized database
netintel-ocr db merge \
--source-dir ./processed \
--central-db ./central/unified.db
# With metadata
netintel-ocr db merge \
--add-metadata "project=network-refresh" \
--add-metadata "date=2024-01-15" \
./processed/* ./central/unified.db
Query Centralized Database¶
# Search across all documents
netintel-ocr db query "firewall configuration" \
--db ./central/unified.db
# Filter by metadata
netintel-ocr db query "DMZ architecture" \
--db ./central/unified.db \
--filter "project=network-refresh"
# Export query results
netintel-ocr db query "network topology" \
--db ./central/unified.db \
--format json > results.json
Cloud Storage Integration¶
S3/MinIO Support¶
# Configure S3
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export S3_BUCKET=netintel-output
# Process and upload to S3
netintel-ocr process batch /path/to/pdfs/ \
--output-s3 s3://netintel-output/processed/
# Process from S3
netintel-ocr process batch s3://netintel-input/ \
--pattern "*.pdf" \
--output-s3 s3://netintel-output/
Azure Blob Storage¶
# Configure Azure
export AZURE_STORAGE_CONNECTION_STRING=your-connection-string
# Process with Azure storage
netintel-ocr process batch /path/to/pdfs/ \
--storage-backend azure \
--container processed-docs
Performance Optimization¶
Memory Management¶
# Limit memory per process
netintel-ocr process batch /path/to/pdfs/ \
--max-memory 4GB \
--parallel 2
# Enable swap for large documents
netintel-ocr process batch /path/to/large-docs/ \
--enable-swap \
--swap-dir /tmp/netintel-swap
CPU/GPU Optimization¶
# CPU-only batch processing
netintel-ocr process batch /path/to/pdfs/ \
--cpu-only \
--parallel $(nproc)
# Mixed CPU/GPU processing
netintel-ocr process batch /path/to/pdfs/ \
--gpu-for-models "llava,qwen2.5vl" \
--cpu-for-models "Nanonets-OCR-s"
Caching Strategy¶
# Enable aggressive caching
netintel-ocr process batch /path/to/pdfs/ \
--cache-models \
--cache-embeddings \
--cache-dir /tmp/netintel-cache
# Share cache across runs
export NETINTEL_CACHE_DIR=/shared/cache
netintel-ocr process batch /path/to/pdfs/ --use-cache
Monitoring and Logging¶
Real-time Monitoring¶
# Enable metrics server
netintel-ocr process batch /path/to/pdfs/ \
--metrics-port 9090 \
--progress-webhook http://monitor/progress
# View metrics
curl http://localhost:9090/metrics
# Use server monitoring
netintel-ocr server health
Detailed Logging¶
# Per-document logs
netintel-ocr --debug process batch /path/to/pdfs/ \
--log-per-document \
--log-dir ./logs
# Structured logging
netintel-ocr process batch /path/to/pdfs/ \
--log-format json \
--log-file batch.jsonl
Error Handling¶
Retry Logic¶
# Automatic retry with backoff
netintel-ocr process batch /path/to/pdfs/ \
--max-retries 3 \
--retry-backoff exponential \
--retry-delay 5
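When the built-in flags don't fit (for example, when wrapping other commands in the same pipeline), the same exponential-backoff pattern can be scripted generically. This helper is a sketch and is not part of the NetIntel-OCR CLI:

```shell
# Generic retry helper: retry_with_backoff <max_tries> <initial_delay> <cmd...>
retry_with_backoff() {
  local max=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    [ "$attempt" -ge "$max" ] && return 1   # give up after max tries
    sleep "$delay"
    delay=$((delay * 2))                    # double the delay each retry
    attempt=$((attempt + 1))
  done
}

# Example: up to 3 tries, starting with a 5-second delay.
# retry_with_backoff 3 5 netintel-ocr process pdf flaky.pdf
```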
Failed Document Handling¶
# Skip failed documents
netintel-ocr process batch /path/to/pdfs/ \
--skip-on-error \
--failed-list failed.txt
# Reprocess failed documents
netintel-ocr process batch failed.txt \
--max-retries 5
Batch Scripts¶
Shell Script Example¶
#!/bin/bash
# batch-process.sh
DOCS_DIR="/path/to/documents"
OUTPUT_DIR="/path/to/output"
FAILED_LIST="failed_docs.txt"

# Clear previous failures
> "$FAILED_LIST"

# Process documents one at a time, recording failures
find "$DOCS_DIR" -name "*.pdf" | while read -r file; do
    netintel-ocr process pdf "$file" \
        --model qwen2.5vl:7b \
        --output-dir "$OUTPUT_DIR" || echo "$file" >> "$FAILED_LIST"
done

# Retry failed documents with a fallback model
if [ -s "$FAILED_LIST" ]; then
    echo "Retrying failed documents..."
    netintel-ocr process batch "$FAILED_LIST" \
        --model minicpm-v:latest
fi
Python Script Example¶
# batch_processor.py
from pathlib import Path

from netintel_ocr import BatchProcessor

processor = BatchProcessor(
    max_parallel=4,
    model="qwen2.5vl:7b",
    output_dir="./processed",
)

# Process all PDFs
pdf_files = Path("/documents").glob("**/*.pdf")
results = processor.process_batch(pdf_files)

# Handle results
for result in results:
    if result.success:
        print(f"✓ {result.file}: {result.diagrams_found} diagrams")
    else:
        print(f"✗ {result.file}: {result.error}")

# Generate summary
processor.generate_summary("batch_summary.json")
Best Practices¶
- Chunk Large Batches: Process in groups of 50-100 documents
- Use Checkpoints: Enable resume for long-running batches
- Monitor Memory: Set limits to prevent OOM errors
- Deduplicate First: Remove duplicates before processing
- Test Small Sample: Validate settings on subset first
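The chunking advice above can be implemented with the standard `split` utility. The file list below is synthetic, and the commented loop shows where the batch command would go:

```shell
# Split a list of 120 documents into chunks of 50 files each.
printf 'doc%03d.pdf\n' $(seq 1 120) > all_pdfs.txt
split -l 50 all_pdfs.txt chunk_
ls chunk_*   # three chunks: 50 + 50 + 20 files

# Each chunk can then be processed as its own checkpointed batch:
# for chunk in chunk_*; do
#   netintel-ocr process batch "$chunk" \
#     --checkpoint "${chunk}-checkpoint.json" --resume-on-failure
# done
```

Per-chunk checkpoints keep a single failure from invalidating the whole run.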
Next Steps¶
- Troubleshooting - Common batch issues
- Vector Search - Search processed batches
- API Integration - Batch processing via API