Automated .docx and .xlsx Compatibility Tests for LibreOffice Migrations

toolkit
2026-02-06 12:00:00

Automate .docx/.xlsx checks for LibreOffice migrations—visual diffs, macro detection, CI scripts and reusable suites to cut break-fix cycles.

Cut migration fire drills: automated .docx/.xlsx compatibility tests for LibreOffice

If your team is migrating hundreds or thousands of Office files to LibreOffice, the real cost isn't licenses; it's the weeks spent fixing broken layouts, missing charts, and silent macro failures. This guide gives you reusable test suites and scripts that catch layout and macro issues early, integrate into CI, and reduce break-fix cycles.

Executive summary — what you can do today

Ship a small, repeatable test harness that converts Office files with LibreOffice headless, exports PDFs as a visual baseline, runs perceptual diffs against known-good PDFs (from Microsoft Word or canonical sources), and performs static macro analysis to flag incompatible or risky VBA. Integrate these checks into CI so every change to your conversion image, filter, or document set is validated automatically.

Why automated compatibility tests matter in 2026

By 2026, many public sector and privacy-focused teams have accelerated migrations away from cloud Office suites. LibreOffice compatibility has improved significantly thanks to continued filter work by The Document Foundation and community contributors. But OOXML is complex, and organizations still face frequent regressions when filters or runtime environments change.

Manual QA is slow and inconsistent. Automated tests give you:

  • Shift-left detection: catch regressions before they reach users.
  • Repeatability: the same checks run in dev, CI, and pre-production.
  • Actionable outputs: page-level diffs, per-file macro reports, and JUnit-style summaries for dashboards.

Common failure modes when converting .docx/.xlsx to LibreOffice

Before building tests, know what to look for. Typical issues include:

  • Layout drift: line breaks, page reflows, table column width differences.
  • Font substitution: missing fonts change metrics and pagination.
  • Embedded objects: linked charts, OLE objects, or images lost or rasterized.
  • Charts & formulas: chart rendering differences and Excel formula incompatibilities.
  • Macros: OOXML macros (.docm, .xlsm) often use VBA APIs not supported by LibreOffice.
  • Metadata & properties: document properties lost or changed.

Design principles for a reusable compatibility test suite

  • Baseline-first: Always produce or store a known-good PDF or image set from Microsoft Word/Excel to compare against. If you can't run MS Office in CI, capture baselines prior to migration.
  • Per-page checks: Convert to PDF and compare page-by-page to localize failures.
  • Per-file metadata: Record file size, page count, presence/absence of macros, number of embedded images, and fonts used.
  • Configurable thresholds: Allow SSIM or pixel-diff tolerances to avoid noise from trivial rendering differences.
  • CI-friendly outputs: Produce JUnit XML and HTML reports to visualize diffs and fail builds when thresholds are exceeded.
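
The per-file metadata principle can be sketched with the standard library alone. The function below is a minimal illustration, not a complete inventory: `ooxml_metadata` and its dictionary keys are hypothetical names, and it only covers facts readable from the OOXML ZIP package (size, macro container, media parts). Page count and fonts would have to come from the rendered PDF instead.

```python
import zipfile
from pathlib import Path


def ooxml_metadata(path):
    """Collect cheap per-file facts from an OOXML package (.docx, .xlsx, ...)."""
    path = Path(path)
    with zipfile.ZipFile(path) as z:
        names = z.namelist()
    return {
        "file": path.name,
        "size_bytes": path.stat().st_size,
        # Macro-enabled packages carry a vbaProject.bin part
        "has_macros": any(n.endswith("vbaProject.bin") for n in names),
        # Embedded images are stored under word/media/ or xl/media/
        "media_count": sum(1 for n in names if "/media/" in n and not n.endswith("/")),
    }
```

Writing one such record per file before and after each conversion run gives you a cheap first signal (for example, a macro container that was not expected) before any rendering happens.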

Tools and building blocks

These are pragmatic, widely-available components you can plug into a pipeline today:

  • LibreOffice (soffice): headless conversion with --headless --convert-to pdf
  • unoconv: an alternative wrapper around LibreOffice for conversions
  • Poppler (pdftoppm): render PDFs to PNG for pixel or perceptual comparison
  • ImageMagick or scikit-image: compute PSNR or SSIM for visual diffs
  • oletools (olevba): analyze VBA and detect suspicious patterns
  • Python + pytest: orchestrate tests and produce JUnit XML
  • Docker: containerize the conversion environment for reproducibility

Suggested repository layout

tests/
  fixtures/
    baselines/          # reference PDFs exported from MS Office
    samples/            # .docx, .docm, .xlsx, .xlsm files to test
  scripts/
    convert_and_compare.sh
    detect_macros.py
    compare_pdf_ssim.py
  ci/
    github-actions.yml
  requirements.txt
  

Script: batch convert with LibreOffice and export PDFs

This simple Bash script converts Office files to PDF using LibreOffice headless and writes logs you can parse in CI.

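#!/usr/bin/env bash
# Convert every Office file in IN_DIR to PDF with headless LibreOffice.
set -euo pipefail
IN_DIR=${1:-fixtures/samples}
OUT_DIR=${2:-artifacts/converted}
mkdir -p "$OUT_DIR"
failed=0
for f in "$IN_DIR"/*.{docx,docm,xlsx,xlsm} ; do
  [ -e "$f" ] || continue   # skip patterns that matched no files
  echo "Converting $f"
  soffice --headless --convert-to pdf --outdir "$OUT_DIR" "$f" 2>&1 | tee -a convert.log
  # soffice may exit 0 without writing a PDF, so check that the output exists
  base=$(basename "${f%.*}")
  if [ ! -s "$OUT_DIR/$base.pdf" ]; then
    echo "FAILED: $f" | tee -a convert.log
    failed=1
  fi
done
exit "$failed"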

Script: detect macros and risky VBA patterns

Macros in modern OOXML files live inside the ZIP package as vbaProject.bin. This Python script detects macro containers and uses olevba if available to decode and flag suspicious calls.

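#!/usr/bin/env python3
"""Report which Office files contain a VBA project; deep-scan with olevba if installed."""
import subprocess
import sys
import zipfile
from pathlib import Path

OFFICE_SUFFIXES = {'.docm', '.xlsm', '.docx', '.xlsx'}

def has_vba(path: Path) -> bool:
    """Return True if the OOXML package contains a vbaProject.bin part."""
    try:
        with zipfile.ZipFile(path, 'r') as z:
            return any('vbaProject.bin' in name for name in z.namelist())
    except zipfile.BadZipFile:
        # Not a valid ZIP package, so not a readable OOXML file
        return False

if __name__ == '__main__':
    for p in sorted(Path(sys.argv[1]).glob('*')):
        if p.suffix.lower() not in OFFICE_SUFFIXES:
            continue
        vba = has_vba(p)
        print(p.name, 'HAS_VBA' if vba else 'NO_VBA')
        if not vba:
            continue
        try:
            # No check=True: a non-zero olevba exit status should not abort the whole scan
            result = subprocess.run(['olevba', str(p)], stdout=subprocess.PIPE,
                                    stderr=subprocess.DEVNULL, text=True)
            print(result.stdout)
        except FileNotFoundError:
            print('olevba not installed; skipping deep scan')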

Script: visual comparison using SSIM

Compute a structural similarity score per page. The following Python example uses pdftoppm to rasterize each PDF to PNG, then scikit-image to compute SSIM. Fail the test if any page is below the threshold.

#!/usr/bin/env python3
"""Compare two PDFs page by page using SSIM.

Usage: compare_pdf_ssim.py BASELINE.pdf CONVERTED.pdf [THRESHOLD]
"""
import subprocess
import sys
import tempfile
from pathlib import Path

import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim
from skimage.transform import resize

THRESH = float(sys.argv[3]) if len(sys.argv) > 3 else 0.95

def pdf_to_pngs(pdf_path):
    """Rasterize each page to PNG at 150 DPI and return the files in sorted order."""
    tmp = tempfile.mkdtemp()
    subprocess.check_call(['pdftoppm', '-png', '-r', '150', str(pdf_path), str(Path(tmp) / 'page')])
    return sorted(Path(tmp).glob('page-*.png'))

orig = Path(sys.argv[1])
conv = Path(sys.argv[2])
orig_pages = pdf_to_pngs(orig)
conv_pages = pdf_to_pngs(conv)
if len(orig_pages) != len(conv_pages):
    print('PAGE_COUNT_MISMATCH', len(orig_pages), 'vs', len(conv_pages))
    sys.exit(2)

failed = False
for i, (a, b) in enumerate(zip(orig_pages, conv_pages), start=1):
    # Compare in 8-bit grayscale
    ia = np.array(Image.open(a).convert('L'))
    ib = np.array(Image.open(b).convert('L'))
    # Resize the converted page if the two page sizes differ
    if ia.shape != ib.shape:
        ib = resize(ib, ia.shape, preserve_range=True).astype(ia.dtype)
    score = ssim(ia, ib, data_range=255)
    print(f'PAGE {i} SSIM={score:.4f}')
    if score < THRESH:
        print(f'FAIL: page {i} below threshold {THRESH}')
        failed = True

# Report every page before exiting so one run shows all failing pages
if failed:
    sys.exit(3)
print('PASS')

CI integration: GitHub Actions example

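Run conversion and tests on every push and pull request. Key steps: install LibreOffice, poppler-utils, and the Python dependencies, then run the conversion, the macro scan, and the comparisons. GitHub only picks up workflows stored under .github/workflows/, so copy ci/github-actions.yml there. This YAML is a starting point you can expand.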

name: LibreOffice compatibility checks
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install system deps
        run: |
          sudo apt-get update
          sudo apt-get install -y libreoffice poppler-utils python3-pip
      - name: Install Python deps
        run: |
          python3 -m pip install --upgrade pip
          pip install scikit-image pillow oletools pytest junit-xml
      - name: Convert samples
        run: ./scripts/convert_and_compare.sh fixtures/samples artifacts/converted
      - name: Detect macros
        run: python3 scripts/detect_macros.py fixtures/samples | tee macro_report.txt
      - name: Compare PDFs
        run: |
          python3 scripts/compare_pdf_ssim.py fixtures/baselines/doc1.pdf artifacts/converted/doc1.pdf 0.94
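
The workflow above compares a single hard-coded pair (doc1.pdf). A rough sketch of how the comparison step could cover every baseline is shown here. The helper names `pair_pdfs` and `compare_all` are hypothetical, and the sketch assumes converted PDFs keep the same file name as their baselines and that the comparison script lives at scripts/compare_pdf_ssim.py as in the layout used in this article.

```python
import subprocess
import sys
from pathlib import Path


def pair_pdfs(baseline_dir, converted_dir):
    """Match baseline PDFs to converted PDFs by file name.

    Returns (pairs, missing): pairs is a list of (baseline, converted) paths,
    missing lists baseline names that have no converted counterpart.
    """
    converted_dir = Path(converted_dir)
    pairs, missing = [], []
    for baseline in sorted(Path(baseline_dir).glob("*.pdf")):
        candidate = converted_dir / baseline.name
        if candidate.exists():
            pairs.append((baseline, candidate))
        else:
            missing.append(baseline.name)
    return pairs, missing


def compare_all(baseline_dir, converted_dir, threshold=0.94, run=subprocess.run):
    """Run the SSIM script on every pair; return 0 only if everything passes."""
    pairs, missing = pair_pdfs(baseline_dir, converted_dir)
    for name in missing:
        print("MISSING_CONVERSION", name)
    status = 1 if missing else 0
    for baseline, converted in pairs:
        result = run([sys.executable, "scripts/compare_pdf_ssim.py",
                      str(baseline), str(converted), str(threshold)])
        if result.returncode != 0:
            status = 1
    return status
```

In CI, a small wrapper could call `sys.exit(compare_all("fixtures/baselines", "artifacts/converted"))` so that a missing conversion or a failing page fails the job.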

Macro handling and remediation playbook

Macros are the tricky part. You can't reliably execute every VBA macro in LibreOffice — the APIs and object models differ. Use an automated policy-driven workflow:

  1. Detect macro presence and list the specific modules and calls (olevba output).
  2. Classify macros: read-only macros (e.g., formatting), automation (CreateObject, Shell), Excel-specific APIs (Range, PivotTable), and external calls (DLL/COM).
  3. Flag high-risk macros for manual review. For low-risk macros, consider translating to LibreOffice Basic or Python UNO scripts.
  4. Automate a stubbed execution test in a sandboxed LibreOffice instance for macros that must run during migration; capture errors and stack traces.

Automation doesn't remove the need for domain expertise on macros; it makes the hand-off to that expertise surgical and fast.
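
Step 2 of the playbook (classification) can start as a simple keyword pass over the VBA source that olevba extracts. The sketch below is illustrative only: the rule names, the regular expressions, and the choice to treat automation and external calls as high risk are assumptions to tune against your own macro corpus, not an established taxonomy.

```python
import re

# Hypothetical keyword rules, loosely following the categories in step 2.
RULES = [
    ("external_call", re.compile(r"\bDeclare\s+(PtrSafe\s+)?(Function|Sub)\b", re.I)),
    ("automation", re.compile(r"\b(CreateObject|GetObject|Shell)\b", re.I)),
    ("excel_api", re.compile(r"\b(PivotTables?|Range|Worksheets?)\b", re.I)),
]
# Assumption: these categories always go to manual review.
HIGH_RISK = {"external_call", "automation"}


def classify_macro(source):
    """Return (categories, risk) for the source text of one VBA module."""
    hits = [category for category, pattern in RULES if pattern.search(source)]
    if not hits:
        # Nothing matched: likely formatting-only or otherwise simple code
        hits = ["formatting_or_other"]
    risk = "high" if HIGH_RISK.intersection(hits) else "low"
    return hits, risk
```

A keyword pass like this will misclassify some modules, so treat a "low" result as a triage hint for reviewers and not as a guarantee that the macro will run under LibreOffice.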

Running at scale: batch strategies and tuning

When you have thousands of files, you must balance speed and accuracy.

  • Parallel conversions: use a job queue (Celery, Kubernetes jobs) and limit concurrency to avoid exhausting resources.
  • Sampling: run strict per-page SSIM on a representative sample; run lighter metadata checks on the rest.
  • Threshold tuning: tune SSIM per document type. Brochures and marketing PDFs need higher fidelity than simple memos.
  • Cache fonts: ensure conversion nodes have the same fonts installed as baselines. Missing fonts are the most common cause of layout drift.
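
For a single machine, bounded parallelism can be sketched with a thread pool in place of a full job queue such as Celery or Kubernetes jobs. The names `build_command` and `convert_all` are hypothetical. The sketch gives each conversion its own throwaway LibreOffice profile through the `-env:UserInstallation` bootstrap parameter, on the assumption that concurrent soffice processes sharing one profile can interfere with each other.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(src, out_dir, profile_dir):
    """Build a soffice command that uses its own user profile directory."""
    return [
        "soffice",
        f"-env:UserInstallation={Path(profile_dir).resolve().as_uri()}",
        "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir), str(src),
    ]


def convert_all(files, out_dir, workers=4, run=subprocess.run):
    """Convert files with at most `workers` soffice processes at a time.

    Returns a dict mapping each source path to the process exit code.
    """
    def convert_one(src):
        # One temporary profile per conversion, removed when the process ends
        with tempfile.TemporaryDirectory() as profile:
            return run(build_command(src, out_dir, profile)).returncode

    files = list(files)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(convert_one, files))
    return dict(zip(files, codes))
```

A fresh profile per file adds startup cost; reusing one profile per worker is a possible refinement once the basic pipeline is stable. Keep `workers` well below the core count at first and watch memory use.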

Deliverables your tests should produce

  • Per-file PDF diffs and annotated images showing the area of divergence.
  • Macro inventory CSV with risk categories and line numbers for investigation.
  • JUnit XML for CI dashboards to show pass/fail per file.
  • Summary dashboard (HTML) with top offenders and trends over time.
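
The JUnit XML deliverable needs very little code. This sketch builds a minimal report with the standard library, skipping the junit-xml package so that no third-party API is assumed. `junit_report`, the suite name, and the input shape (file name mapped to a failure message or None) are all hypothetical choices.

```python
import xml.etree.ElementTree as ET


def junit_report(results, suite_name="libreoffice-compat"):
    """Build a JUnit-style XML string from {file name: failure message or None}."""
    failures = sum(1 for msg in results.values() if msg)
    suite = ET.Element("testsuite", name=suite_name,
                       tests=str(len(results)), failures=str(failures))
    for file_name, message in sorted(results.items()):
        case = ET.SubElement(suite, "testcase", classname="compat", name=file_name)
        if message:
            # A <failure> child marks the test case as failed
            ET.SubElement(case, "failure", message=message)
    return ET.tostring(suite, encoding="unicode")
```

Writing the returned string to a file such as artifacts/junit.xml and uploading it as a build artifact lets most CI dashboards show pass/fail per document.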

Case study: cutting fix cycles by 70%

Example (anonymized): a government IT team migrated 12,000 documents in late 2025. They implemented a pipeline similar to this guide: baseline PDF capture, per-file SSIM checks, and macro detection. By automating triage, the team reduced manual remediation from 8 hours per week to 2.5 hours — a ~70% reduction — and eliminated late-release surprises.

Expect these trends to affect your migration strategy:

  • Ongoing OOXML filter improvements: The Document Foundation and community contributors continue to improve filters — keep your conversion image pinned and tested to avoid surprises when upgrading.
  • AI-assisted diffing: perceptual diffs supplemented with semantic comparisons (e.g., heading and table structure extraction) are becoming common. You can combine OCR and NLP to detect semantic loss even when the visual layout looks fine.
  • Containerized conversion services: use immutable Docker images with pinned LibreOffice builds for reproducibility. Run smoke tests before rolling new images.
  • Regulatory drivers: privacy and data residency requirements are accelerating on-prem LibreOffice migrations — making automated verification a compliance enabler.

Practical checklist to get started this week

  1. Collect 20 representative documents and export baseline PDFs from Microsoft Office.
  2. Set up a Git repo with the layout suggested above and add the scripts from this article.
  3. Run batch conversions locally with LibreOffice and review diffs to pick a per-project SSIM threshold.
  4. Add macro detection via olevba and create a simple classification workflow for triage.
  5. Integrate the pipeline into CI (start with GitHub Actions) and gate merges on high-severity failures.

Limitations and realities — be transparent

Automated tests reduce risk but do not guarantee perfect functional parity. Macros and complex Excel models often require manual porting or redesign. Visual diffs can generate false positives due to benign rasterization differences. The goal is to create a pragmatic, repeatable process that reduces surprise work and surfaces true compatibility issues quickly.

Actionable takeaways

  • Start small and iterate: validate on a sample corpus, refine thresholds, then scale.
  • Automate triage: use macro detectors and SSIM diffs to prioritize human attention.
  • Pin your conversion environment: use container images and smoke tests so upgrades don’t introduce regressions silently.
  • Integrate with CI: make compatibility checks part of your CI/CD pipeline to catch regressions early.

Next steps — practical templates to drop into your repo

Grab the scripts in this article, place them under scripts/, add a simple GitHub Actions pipeline, and seed fixtures/baselines/ with reference PDFs. If you need a turnkey starting point, consider a small Docker image that bundles LibreOffice, poppler-utils, Python, and the Python deps so your CI runs identically across environments.

Closing: reduce break-fix cycles, not functionality

Switching to LibreOffice can save money and improve privacy, but migrating without automated verification creates hidden costs. Build a small, repeatable compatibility suite: visual diffs for layout, static macro analysis for risk, CI gates for repeatability. The result is fewer surprises, faster remediation, and a migration you can defend to stakeholders with data.

Start a migration run today: add one document to the sample set, export a baseline from Microsoft Office, and run the conversion + SSIM script. If you want a ready-made repo starter or a Docker image with the tooling configured, reach out or clone our sample toolkit to get a working pipeline in under an hour.
