Automated .docx and .xlsx Compatibility Tests for LibreOffice Migrations
Cut migration fire drills: automated .docx/.xlsx compatibility tests for LibreOffice
If your team is migrating hundreds or thousands of Office files to LibreOffice, the real cost isn't licenses; it's the weeks spent fixing broken layouts, missing charts, and silent macro failures. This guide gives you reusable test suites and scripts that catch layout and macro issues early, integrate into CI, and reduce break-fix cycles.
Executive summary — what you can do today
Ship a small, repeatable test harness that converts Office files with LibreOffice headless, exports PDFs as a visual baseline, runs perceptual diffs against known-good PDFs (from Microsoft Word or canonical sources), and performs static macro analysis to flag incompatible or risky VBA. Integrate these checks into CI so every change to your conversion image, filter, or document set is validated automatically.
Why automated compatibility tests matter in 2026
By 2026, many public sector and privacy-focused teams have accelerated migrations away from cloud Office suites. LibreOffice compatibility has improved significantly thanks to continued filter work by The Document Foundation and community contributors. But OOXML is complex, and organizations still face frequent regressions when filters or runtime environments change.
Manual QA is slow and inconsistent. Automated tests give you:
- Shift-left detection: catch regressions before they reach users.
- Repeatability: the same checks run in dev, CI, and pre-production.
- Actionable outputs: page-level diffs, per-file macro reports, and JUnit-style summaries for dashboards.
Common failure modes when converting .docx/.xlsx to LibreOffice
Before building tests, know what to look for. Typical issues include:
- Layout drift: line breaks, page reflows, table column width differences.
- Font substitution: missing fonts change metrics and pagination.
- Embedded objects: linked charts, OLE objects, or images lost or rasterized.
- Charts & formulas: chart rendering differences and Excel formula incompatibilities.
- Macros: macro-enabled files (.docm, .xlsm) often rely on VBA APIs that LibreOffice does not support or supports only partially.
- Metadata & properties: document properties lost or changed.
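Font substitution is one of the failure modes above that can be checked before any conversion runs. A minimal sketch, assuming a transitional-OOXML .docx whose font table lives at `word/fontTable.xml`; the helper names `fonts_declared` and `missing_fonts` are hypothetical, and the set of installed fonts is passed in by the caller (on Linux it could be gathered from `fc-list`):

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by transitional OOXML files (assumption:
# strict-mode files use a different namespace and are not handled here).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def fonts_declared(docx) -> set:
    """Return the font names listed in word/fontTable.xml (path or file object)."""
    with zipfile.ZipFile(docx) as z:
        if "word/fontTable.xml" not in z.namelist():
            return set()
        root = ET.fromstring(z.read("word/fontTable.xml"))
    names = (el.get(f"{{{W_NS}}}name") for el in root.iter(f"{{{W_NS}}}font"))
    return {n for n in names if n}


def missing_fonts(docx, installed: set) -> set:
    """Fonts the document declares that are absent from `installed` (case-insensitive)."""
    installed_lower = {name.lower() for name in installed}
    return {f for f in fonts_declared(docx) if f.lower() not in installed_lower}
```

A non-empty result from `missing_fonts` is a hint that pagination may shift on that conversion node, not proof that it will.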
Design principles for a reusable compatibility test suite
- Baseline-first: Always produce or store a known-good PDF or image set from Microsoft Word/Excel to compare against. If you can't run MS Office in CI, capture baselines prior to migration.
- Per-page checks: Convert to PDF and compare page-by-page to localize failures.
- Per-file metadata: Record file size, page count, presence/absence of macros, number of embedded images, and fonts used.
- Configurable thresholds: Allow SSIM or pixel-diff tolerances to avoid noise from trivial rendering differences.
- CI-friendly outputs: Produce JUnit XML and HTML reports to visualize diffs and fail builds when thresholds are exceeded.
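The per-file metadata principle can be sketched with the standard library alone. This is an illustration under stated assumptions: embedded images are counted as entries under a `media/` folder inside the package (`word/media/`, `xl/media/`), and page count is left out because it comes from the rendered PDF rather than the source file. `FileMeta` and `collect_meta` are hypothetical names.

```python
import zipfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FileMeta:
    name: str
    size_bytes: int
    has_macros: bool
    embedded_images: int


def collect_meta(path: Path) -> FileMeta:
    """Gather cheap, conversion-independent facts about one OOXML file."""
    with zipfile.ZipFile(path) as z:
        names = z.namelist()
    # A VBA project is stored as a vbaProject.bin part inside the package.
    has_macros = any(n.endswith("vbaProject.bin") for n in names)
    # Assumption: images live under */media/ (word/media/, xl/media/).
    images = sum(1 for n in names if "/media/" in n and not n.endswith("/"))
    return FileMeta(path.name, path.stat().st_size, has_macros, images)
```

Storing these records next to the baselines makes it possible to spot a file that changed between runs before blaming the converter.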
Tools and building blocks
These are pragmatic, widely-available components you can plug into a pipeline today:
- LibreOffice (soffice): headless conversion with --headless --convert-to pdf
- unoconv: an alternative wrapper around LibreOffice for conversions (its maintainers now point to unoserver as the successor, so check which one fits your setup)
- Poppler (pdftoppm): render PDFs to PNG for pixel or perceptual comparison
- ImageMagick or scikit-image: compute PSNR or SSIM for visual diffs
- oletools (olevba): analyze VBA and detect suspicious patterns
- Python + pytest: orchestrate tests and produce JUnit XML
- Docker: containerize the conversion environment for reproducibility
Repository layout (recommended)
tests/
fixtures/
  baselines/   # reference PDFs exported from MS Office
  samples/     # .docx, .docm, .xlsx, .xlsm files to test
scripts/
  convert_and_compare.sh
  detect_macros.py
  compare_pdf_ssim.py
ci/
  github-actions.yml   # copy to .github/workflows/ so GitHub Actions picks it up
requirements.txt
Script: batch convert with LibreOffice and export PDFs
This simple Bash script converts Office files to PDF using LibreOffice headless and writes logs you can parse in CI.
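#!/usr/bin/env bash
# Convert every Office file in IN_DIR to PDF with headless LibreOffice.
set -euo pipefail
IN_DIR=${1:-fixtures/samples}
OUT_DIR=${2:-artifacts/converted}
mkdir -p "$OUT_DIR"
status=0
for f in "$IN_DIR"/*.{docx,docm,xlsx,xlsm}; do
  [ -e "$f" ] || continue   # skip patterns that matched nothing
  echo "Converting $f"
  soffice --headless --convert-to pdf --outdir "$OUT_DIR" "$f" 2>&1 | tee -a convert.log || true
  # Do not rely on the exit code alone: confirm a non-empty PDF exists.
  pdf="$OUT_DIR/$(basename "${f%.*}").pdf"
  if [ ! -s "$pdf" ]; then
    echo "CONVERT_FAILED $f" | tee -a convert.log
    status=1
  fi
done
exit "$status"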
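One thing the loop above does not guard against: `--convert-to pdf` names the output after the input's stem, so `doc1.docx` and `doc1.xlsx` would both write `doc1.pdf`. A small audit step can catch that, along with inputs that produced no PDF at all. A sketch, with `expected_pdfs` and `audit` as hypothetical helper names:

```python
from collections import defaultdict
from pathlib import Path

OFFICE_EXTS = {".docx", ".docm", ".xlsx", ".xlsm"}


def expected_pdfs(in_dir: Path, out_dir: Path) -> dict:
    """Map each expected PDF path to the inputs that would produce it."""
    targets = defaultdict(list)
    for src in sorted(in_dir.iterdir()):
        if src.suffix.lower() in OFFICE_EXTS:
            targets[out_dir / (src.stem + ".pdf")].append(src)
    return dict(targets)


def audit(in_dir: Path, out_dir: Path):
    """Return (collisions, missing) for a finished conversion run."""
    targets = expected_pdfs(in_dir, out_dir)
    # Two inputs sharing a stem overwrite each other's output.
    collisions = {pdf: srcs for pdf, srcs in targets.items() if len(srcs) > 1}
    # Inputs whose PDF never appeared in the output directory.
    missing = [pdf for pdf in targets if not pdf.exists()]
    return collisions, missing
```

If collisions show up, converting each file type into its own output subdirectory is one simple way around them.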
Script: detect macros and risky VBA patterns
Macros in modern OOXML files live inside the ZIP package as vbaProject.bin. This Python script detects macro containers and uses olevba if available to decode and flag suspicious calls.
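#!/usr/bin/env python3
"""Report which Office files contain a VBA project and dump olevba's analysis."""
import subprocess
import sys
import zipfile
from pathlib import Path

OFFICE_EXTS = ('.docm', '.xlsm', '.docx', '.xlsx')


def has_vba(path: Path) -> bool:
    """True if the OOXML package contains a vbaProject.bin part."""
    try:
        with zipfile.ZipFile(path, 'r') as z:
            return any(name.endswith('vbaProject.bin') for name in z.namelist())
    except zipfile.BadZipFile:
        # Not a ZIP container (e.g. a legacy binary format or a corrupt file).
        return False


if __name__ == '__main__':
    for p in sorted(Path(sys.argv[1]).glob('*')):
        if p.suffix.lower() not in OFFICE_EXTS:
            continue
        vba = has_vba(p)
        print(p.name, 'HAS_VBA' if vba else 'NO_VBA')
        if not vba:
            continue
        try:
            # olevba may exit non-zero on some files; print what it produced anyway.
            proc = subprocess.run(['olevba', str(p)], stdout=subprocess.PIPE,
                                  stderr=subprocess.DEVNULL, text=True)
            print(proc.stdout)
        except FileNotFoundError:
            print('olevba not installed; skipping deep scan')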
Script: visual comparison using SSIM
Compute a structural similarity score per page. The following Python example uses pdftoppm to rasterize each PDF to PNG, then scikit-image to compute SSIM. Fail the test if any page is below the threshold.
#!/usr/bin/env python3
"""Compare two PDFs page by page with SSIM.

Usage: compare_pdf_ssim.py BASELINE.pdf CONVERTED.pdf [THRESHOLD]
"""
import subprocess
import sys
import tempfile
from pathlib import Path

import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim
from skimage.transform import resize

THRESH = float(sys.argv[3]) if len(sys.argv) > 3 else 0.95


def pdf_to_pngs(pdf_path):
    """Rasterize a PDF with pdftoppm and return the page images in order."""
    tmp = tempfile.mkdtemp()
    # render at 150 DPI
    subprocess.check_call(['pdftoppm', '-png', '-r', '150', str(pdf_path), str(Path(tmp) / 'page')])
    # page numbers are padded to a fixed width per document, so a plain sort keeps page order
    return sorted(Path(tmp).glob('page-*.png'))


orig = Path(sys.argv[1])
conv = Path(sys.argv[2])
orig_pages = pdf_to_pngs(orig)
conv_pages = pdf_to_pngs(conv)
if len(orig_pages) != len(conv_pages):
    print('PAGE_COUNT_MISMATCH', len(orig_pages), 'vs', len(conv_pages))
    sys.exit(2)
for i, (a, b) in enumerate(zip(orig_pages, conv_pages), start=1):
    ia = np.array(Image.open(a).convert('L'))
    ib = np.array(Image.open(b).convert('L'))
    # resize to match if necessary
    if ia.shape != ib.shape:
        ib = resize(ib, ia.shape, preserve_range=True).astype(ia.dtype)
    # 8-bit grayscale, so the data range is 0..255
    score = ssim(ia, ib, data_range=255)
    print(f'PAGE {i} SSIM={score:.4f}')
    if score < THRESH:
        print('FAIL: below threshold')
        sys.exit(3)
print('PASS')
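The script prints one line per page, which makes its output easy to summarize in a CI report. A minimal sketch, assuming the exact line formats it prints (`PAGE n SSIM=x`, `PAGE_COUNT_MISMATCH ...`, `FAIL: ...`, `PASS`); `parse_ssim_output` is a hypothetical helper name:

```python
import re

# Matches lines such as "PAGE 3 SSIM=0.9123" (SSIM can be negative in theory).
PAGE_RE = re.compile(r"^PAGE (\d+) SSIM=(-?[0-9.]+)$")


def parse_ssim_output(text: str) -> dict:
    """Turn the comparison script's stdout into a status plus per-page scores."""
    scores = {}
    status = "UNKNOWN"
    for raw in text.splitlines():
        line = raw.strip()
        match = PAGE_RE.match(line)
        if match:
            scores[int(match.group(1))] = float(match.group(2))
        elif line.startswith("PAGE_COUNT_MISMATCH"):
            status = "PAGE_COUNT_MISMATCH"
        elif line.startswith("FAIL"):
            status = "FAIL"
        elif line == "PASS":
            status = "PASS"
    worst = min(scores.values(), default=None)
    return {"status": status, "scores": scores, "worst": worst}
```

Because the script stops at the first failing page, the `scores` dictionary only covers pages up to and including that page.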
CI integration: GitHub Actions example
Run conversion and tests on every push. Key steps: install LibreOffice, poppler-utils, Python deps, then run conversion and comparisons. This YAML is a starting point you can expand.
name: LibreOffice compatibility checks
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install system deps
        run: |
          sudo apt-get update
          sudo apt-get install -y libreoffice poppler-utils python3-pip
      - name: Install Python deps
        run: |
          python3 -m pip install --upgrade pip
          pip install scikit-image pillow oletools pytest junit-xml
      - name: Convert samples
        run: ./scripts/convert_and_compare.sh fixtures/samples artifacts/converted
      - name: Detect macros
        run: python3 scripts/detect_macros.py fixtures/samples | tee macro_report.txt
      - name: Compare PDFs
        run: |
          python3 scripts/compare_pdf_ssim.py fixtures/baselines/doc1.pdf artifacts/converted/doc1.pdf 0.94
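The workflow compares a single hard-coded file. To cover the whole corpus, a small driver can pair every baseline with its converted counterpart by file name and run the comparison once per pair. This is a sketch, not part of the original toolkit: `pair_pdfs` and `compare_all` are hypothetical names, and the script path and the 0.94 threshold are taken from the workflow as defaults.

```python
import subprocess
import sys
from pathlib import Path


def pair_pdfs(baseline_dir: Path, converted_dir: Path):
    """Match baseline PDFs to converted PDFs by file name."""
    pairs, missing = [], []
    for base in sorted(baseline_dir.glob("*.pdf")):
        conv = converted_dir / base.name
        if conv.exists():
            pairs.append((base, conv))
        else:
            missing.append(base.name)  # baseline with no converted output
    return pairs, missing


def compare_all(baseline_dir: Path, converted_dir: Path, threshold: float = 0.94,
                script: str = "scripts/compare_pdf_ssim.py") -> list:
    """Run the comparison script per pair; return names that failed or are missing."""
    pairs, missing = pair_pdfs(baseline_dir, converted_dir)
    failed = list(missing)
    for base, conv in pairs:
        cmd = [sys.executable, script, str(base), str(conv), str(threshold)]
        if subprocess.run(cmd).returncode != 0:
            failed.append(base.name)
    return failed
```

A CI step could call `compare_all(Path("fixtures/baselines"), Path("artifacts/converted"))` and exit non-zero when the returned list is not empty.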
Macro handling and remediation playbook
Macros are the tricky part. You can't reliably execute every VBA macro in LibreOffice — the APIs and object models differ. Use an automated policy-driven workflow:
- Detect macro presence and list the specific modules and calls (olevba output).
- Classify macros: read-only macros (e.g., formatting), automation (CreateObject, Shell), Excel-specific APIs (Range, PivotTable), and external calls (DLL/COM).
- Flag high-risk macros for manual review. For low-risk macros, consider translating to LibreOffice Basic or Python UNO scripts.
- Automate a stubbed execution test in a sandboxed LibreOffice instance for macros that must run during migration; capture errors and stack traces.
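The classification step in this playbook can start as a simple keyword pass over decoded VBA source (for example, the text olevba extracts). The sketch below is illustrative only: the category names, the patterns, and the high/medium/low mapping are assumptions to adapt to your own policy, and the comment stripping is deliberately naive.

```python
import re

# Hypothetical rule set: (category, pattern) pairs checked against each line.
RULES = [
    ("external", re.compile(r'\bDeclare\s+(PtrSafe\s+)?(Function|Sub)\b|\bLib\s+"', re.I)),
    ("automation", re.compile(r"\b(CreateObject|GetObject|Shell)\b", re.I)),
    ("excel_api", re.compile(r"\b(PivotTables?|Range|Worksheets?|ActiveSheet)\b", re.I)),
]
HIGH_RISK = {"external", "automation"}


def classify_vba(source: str) -> dict:
    """Return a risk level plus the line numbers that triggered each category."""
    hits = {}
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Naive comment removal: cuts at the first apostrophe, even inside strings.
        code = line.split("'", 1)[0]
        for category, pattern in RULES:
            if pattern.search(code):
                hits.setdefault(category, []).append(lineno)
    if HIGH_RISK.intersection(hits):
        risk = "high"
    elif hits:
        risk = "medium"
    else:
        risk = "low"
    return {"risk": risk, "hits": hits}
```

The line numbers in `hits` are what a reviewer needs to jump straight to the calls that block migration.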
Automation doesn't remove the need for domain expertise on macros — it makes the hand-off to that expertise surgical and fast.
Running at scale: batch strategies and tuning
When you have thousands of files, you must balance speed and accuracy.
- Parallel conversions: use a job queue (Celery, Kubernetes jobs) and limit concurrency to avoid exhausting resources.
- Sampling: run strict per-page SSIM on a representative sample; run lighter metadata checks on the rest.
- Threshold tuning: tune SSIM per document type. Brochures and marketing PDFs need higher fidelity than simple memos.
- Cache fonts: ensure conversion nodes have the same fonts installed as the machine that produced the baselines. Missing fonts change text metrics and are a frequent cause of layout drift.
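For bounded parallelism without a full job queue, a thread pool around `soffice` is often enough on a single node. A sketch under two assumptions: each job gets its own throwaway profile directory via `-env:UserInstallation` so concurrent instances do not share a profile, and the command runner is injectable so the logic can be tested without LibreOffice installed. `build_cmd`, `convert_one`, and `convert_all` are hypothetical names.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_cmd(src: Path, out_dir: Path, profile_dir: Path) -> list:
    """Build one headless conversion command with an isolated user profile."""
    return [
        "soffice",
        f"-env:UserInstallation={profile_dir.resolve().as_uri()}",
        "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir), str(src),
    ]


def convert_one(src: Path, out_dir: Path, run=subprocess.run, timeout: int = 120):
    """Convert a single file; returns (source, exit code)."""
    with tempfile.TemporaryDirectory(prefix="lo_profile_") as profile:
        cmd = build_cmd(src, out_dir, Path(profile))
        result = run(cmd, capture_output=True, timeout=timeout)
    return src, result.returncode


def convert_all(sources, out_dir: Path, max_workers: int = 4, run=subprocess.run):
    """Convert files with at most max_workers concurrent soffice processes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: convert_one(s, out_dir, run=run), sources))
```

Creating a fresh profile per file is slower than reusing one per worker, so treat the per-job profile as the simple starting point rather than the tuned version.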
Deliverables your tests should produce
- Per-file PDF diffs and annotated images showing the area of divergence.
- Macro inventory CSV with risk categories and line numbers for investigation.
- JUnit XML for CI dashboards to show pass/fail per file.
- Summary dashboard (HTML) with top offenders and trends over time.
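The JUnit XML deliverable needs nothing beyond the standard library. A minimal sketch, assuming one test case per file and only the basic `testsuite`/`testcase`/`failure` elements that most CI dashboards read; `junit_report` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET


def junit_report(results, suite_name: str = "libreoffice-compat") -> str:
    """Build a JUnit-style XML string from (file_name, passed, message) tuples."""
    results = list(results)
    failures = sum(1 for _, ok, _ in results if not ok)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for name, ok, message in results:
        case = ET.SubElement(suite, "testcase", classname=suite_name, name=name)
        if not ok:
            # The message doubles as the failure body shown in dashboards.
            failure = ET.SubElement(case, "failure", message=message)
            failure.text = message
    return ET.tostring(suite, encoding="unicode")
```

Writing the returned string to a file such as `artifacts/junit.xml` (a path chosen here for illustration) lets the CI system pick it up as a test report.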
Case study: cutting fix cycles by 70%
Example (anonymized): a government IT team migrated 12,000 documents in late 2025. They implemented a pipeline similar to this guide: baseline PDF capture, per-file SSIM checks, and macro detection. By automating triage, the team reduced manual remediation from 8 hours per week to 2.5 hours — a ~70% reduction — and eliminated late-release surprises.
Future-proofing: trends for 2026 and beyond
Expect these trends to affect your migration strategy:
- Ongoing OOXML filter improvements: The Document Foundation and community contributors continue to improve filters — keep your conversion image pinned and tested to avoid surprises when upgrading.
- AI-assisted diffing: perceptual diffs supplemented with semantic comparisons (e.g., heading and table structure extraction) are becoming common. You can combine OCR and NLP to detect semantic loss even when the visual layout looks fine.
- Containerized conversion services: use immutable Docker images with pinned LibreOffice builds for reproducibility. Run smoke tests before rolling new images.
- Regulatory drivers: privacy and data residency requirements are accelerating on-prem LibreOffice migrations — making automated verification a compliance enabler.
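A semantic comparison does not have to start with OCR or NLP. If you can already extract an ordered list of headings from the baseline and from the converted output (how you extract them is outside this sketch and is an assumption), the standard library's `difflib` will show which headings were dropped, added, or changed. `heading_diff` is a hypothetical helper name.

```python
import difflib


def heading_diff(baseline: list, converted: list):
    """Return (similarity ratio, list of non-equal edits) for two heading lists."""
    matcher = difflib.SequenceMatcher(a=baseline, b=converted, autojunk=False)
    changes = [
        (tag, baseline[i1:i2], converted[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), changes
```

An empty change list with a low SSIM score suggests a cosmetic difference, while a missing heading with a high SSIM score is the kind of semantic loss a visual diff alone would not surface.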
Practical checklist to get started this week
- Collect 20 representative documents and export baseline PDFs from Microsoft Office.
- Set up a Git repo with the layout suggested above and add the scripts from this article.
- Run batch conversions locally with LibreOffice and review diffs to pick a per-project SSIM threshold.
- Add macro detection via olevba and create a simple classification workflow for triage.
- Integrate the pipeline into CI (start with GitHub Actions) and gate merges on high-severity failures.
Limitations and realities — be transparent
Automated tests reduce risk but do not guarantee perfect functional parity. Macros and complex Excel models often require manual porting or redesign. Visual diffs can generate false positives due to benign rasterization differences. The goal is to create a pragmatic, repeatable process that reduces surprise work and surfaces true compatibility issues quickly.
Actionable takeaways
- Start small and iterate: validate on a sample corpus, refine thresholds, then scale.
- Automate triage: use macro detectors and SSIM diffs to prioritize human attention.
- Pin your conversion environment: use container images and smoke tests so upgrades don’t introduce regressions silently.
- Integrate with CI: make compatibility checks part of your CI/CD pipeline to catch regressions early.
Next steps — practical templates to drop into your repo
Grab the scripts in this article, place them under scripts/, add a simple GitHub Actions pipeline, and seed fixtures/baselines/ with reference PDFs. If you need a turnkey starting point, consider a small Docker image that bundles LibreOffice, poppler-utils, Python, and the Python deps so your CI runs identically across environments.
Closing: reduce break-fix cycles, not functionality
Switching to LibreOffice can save money and improve privacy, but migrating without automated verification creates hidden costs. Build a small, repeatable compatibility suite: visual diffs for layout, static macro analysis for risk, CI gates for repeatability. The result is fewer surprises, faster remediation, and a migration you can defend to stakeholders with data.
Call to action: start a migration run today. Add one document to the sample set, export a baseline from Microsoft Office, and run the conversion + SSIM script. If you want a ready-made repo starter or a Docker image with the tooling configured, reach out or clone our sample toolkit to get a working pipeline in under an hour.