ADA PDF Remediation Prompt — WCAG 2.1 AA / PDF/UA Compliance

For use with

Updated 4/21/2026 to handle headers within PDFs better.

Updated 4/13/2026 to handle formulae within PDFs better.

How to use

You will need a paid tier on a generative AI platform. The most cost-effective options are probably ChatGPT Plus ($20/month — includes Code Interpreter), Claude Pro ($20/month — includes computer use), or Gemini Advanced ($20/month). All three give the full vision + code execution + file handling pipeline.

Copy everything below (easiest with the button) and paste it as your prompt. Then upload your PDF within that prompt request. This has been tested on Claude Opus 4.6 Extended Thinking. It does not work with ZotGPT, as the latter cannot run code in its environment.

Include as a skill in Claude

You can also add this as a skill within Claude, so that you can just say "Make this PDF ADA compliant" in a prompt without pasting the long prompt below. To do that, upload the contents of this markdown file to Claude's skills under its Customize options. Thank you to Tim Tait for this suggestion!

Prompt

You are an ADA/Section 508 accessibility remediation specialist. I am uploading a PDF that needs to be made fully accessible to comply with **WCAG 2.1 Level AA** and **PDF/UA (ISO 14289)** standards for a University of California campus accessibility review.

**Your task:** Process the uploaded PDF through a complete remediation pipeline and return a new, fully tagged, accessible PDF. Execute all steps by writing and running code — do not just describe what should be done.

---

### STAGE 0 — EXISTING STRUCTURE INSPECTION

Before building anything, inspect what the PDF already has. Many modern authoring tools (Keynote, Beamer/LaTeX, PowerPoint, InDesign, Word) emit a partial structure tree with BDC/EMC marked content. Determine:

1. **Does `/StructTreeRoot` already exist?** If yes, walk it and count element types (`/Figure`, `/Formula`, `/H1`, `/H2`, `/H3`, `/P`, etc.) and check which have `/Alt` entries.
2. **Do all page content streams already contain `BDC` and `EMC`?** If yes, do NOT rewrite the content streams — only augment the structure tree metadata.
3. **Are there `/Figure` elements with no `/Alt`?** These need alt-text added.
4. **Are there zero `/Formula` elements but the PDF visually contains equations?** This is the most common compliance gap in academic slides — equations rendered by LaTeXiT or similar tools are tagged as `/Figure` by the authoring tool. These must be promoted to `/Formula` in Stage 4.
5. **Heading inventory.** Count `/H1` / `/H2` / `/H3` … elements and for EACH one check: (a) does it have a non-empty `/K` pointing to an MCID; (b) does it have `/Pg` bound to a page; (c) does its MCID correspond to a real BDC…EMC region in that page's content stream. Headings that fail any of (a)/(b)/(c) are "empty headings" — Panorama and other campus checkers report them as "no headings found" even when the elements technically exist in the tree.
6. **Multiple H1s.** If the existing tree contains more than one `/H1`, flag it.
Per WCAG 2.1 and PDF/UA, a document should have exactly one `/H1` (the document title). Multiple H1s are a hierarchy violation most campus checkers reject.

Print a summary of the existing structure before proceeding. If the PDF already has a well-formed tree with BDC/EMC, the remediation strategy is to **augment** (add `/Alt`, fix heading hierarchy, promote equation figures to `/Formula`, add missing metadata) rather than rebuild.

---

### STAGE 1 — VISUAL ANALYSIS

Rasterize every page of the PDF (200 DPI recommended) and visually inspect each page image. Identify **every** visual element that requires alternative text OR that contributes to the document's heading hierarchy, including:

- **Headings** — any rendered text that visually functions as a title, section header, or subheading. Classify each with a `level` (1–6). The document title is level 1, slide or section titles are level 2, subsection headings are level 3, and so on. See "HEADING HIERARCHY RULES" below — campus accessibility checkers (Panorama in particular) fail PDFs whose heading hierarchy is broken even when all figures have alt-text.
- **Equations** — inline or display math, formulas, chemical notation, any symbolic expression rendered as an image or vector graphic.
- **Figures** — photographs, illustrations, diagrams, flowcharts, circuit diagrams, maps, schematics.
- **Plots and charts** — line plots, bar charts, scatter plots, histograms, pie charts, box plots.
- **Tables rendered as images** — tables that are not tagged as native PDF table structures.
- **Logos, icons, and decorative images** — logos need alt-text; purely decorative images should be marked as artifacts.
- **Rendered prose blocks** — paragraphs of body text rendered as graphics (common in Keynote/LaTeXiT decks where every text block is a separate rendered element). Tag as `Paragraph`.

For each element, record:

1. **Page number** (1-indexed)
2. **Element type** — one of: `Heading`, `Paragraph`, `Figure`, `Formula`, `Image`, `Diagram`, `Table`
3. **Level** — integer 1–6, required for `Heading`, omitted for other types
4. **Text / alt-text**:
   - For **headings**: the verbatim rendered text, preserving original capitalization and punctuation. This string is what the pipeline uses to locate the matching MCID.
   - For **paragraphs**: verbatim transcription of the block's text.
   - For **equations**: spoken-English description AND a `LaTeX:` line with the LaTeX source (e.g., "f of x equals four over pi times sine of pi x. LaTeX: f(x) = \\frac{4}{\\pi}\\sin(\\pi x)")
   - For **plots/charts**: axes with units, data trend, and key quantitative takeaway
   - For **figures/diagrams**: what is depicted, spatial layout, labels, relevance
   - For **images**: visual content and contextual purpose
   - For non-heading, non-paragraph elements, aim for 1–4 sentences. A screen reader user should understand the content without seeing the image.
5. **Approximate bounding box** — percentage coordinates `[x0%, y0%, x1%, y1%]` from top-left.

Print a summary of all identified elements organized by page before proceeding.

#### HEADING HIERARCHY RULES

These rules are enforced in Stage 3 verification. Apply them during Stage 1 classification:

1. **Exactly one `level: 1` heading per document.** This is the document title — usually on a dedicated title/cover slide or the most prominent heading on page 1. If the first content page IS the title page, mark its title as `level: 1` only and do NOT also create a `level: 2` entry for the same text.
2. **Every content page gets a `level: 2` heading** (except the page containing the `level: 1`, and except pages that are genuinely title-less: transition slides, image-only pages, Q&A slides). Do NOT invent a heading where none is rendered — omitting a page's heading is better than an empty one.
3. **No level skips.** Each heading's `level` may be at most one greater than the most recent heading's level in document order. H1 → H2 OK; H1 → H3 NOT OK. Moving back up the hierarchy is unrestricted (H3 → H2 OK).
4. **If the document has no visible title slide** (e.g., a problem set that opens with "Problem 1"), promote the first rendered heading to `level: 1` and classify subsequent page headings as `level: 2`. Do not invent an H1 that isn't in the rendered output.

---

### STAGE 2 — PDF STRUCTURE REMEDIATION

Using `pikepdf` (preferred) or `PyMuPDF`, modify the PDF to add **all** of the following structural elements required by PDF/UA and WCAG 2.1 AA:

#### 2.1 — Document-Level Requirements

| Requirement | PDF Key | Value |
|---|---|---|
| **Mark as tagged** | `/MarkInfo` in catalog | `<< /Marked true >>` |
| **Document language** | `/Lang` in catalog | `"en-US"` (or appropriate language) |
| **Document title** | `/Title` in document info dict AND `dc:title` in XMP metadata | A descriptive title derived from the PDF content |
| **Display title in viewer** | `/ViewerPreferences` in catalog | `<< /DisplayDocTitle true >>` |

**The document title is required in three places:** Set `/Title` in the info dictionary, `dc:title` in XMP metadata (via `pdf.open_metadata()`), AND `/DisplayDocTitle true` in `/ViewerPreferences`. Campus accessibility checkers test for all three independently.

#### 2.2 — Structure Tree and Heading Hierarchy

If the PDF already has a well-formed `/StructTreeRoot` with `/Document` → `/Sect` → child elements, **augment** it by adding `/Alt` entries to existing `/Figure` elements and fixing the heading hierarchy (see §2.2.1).
If no tree exists, build one with this hierarchy:

```
/StructTreeRoot
└─ /Document
   ├─ /H1 (document title — EXACTLY ONE per document)
   ├─ /Sect (one per page)
   │  ├─ /H2 (page/slide title — every content page gets one)
   │  ├─ /P (body text)
   │  ├─ /H3 (subsection heading, only if visually present)
   │  ├─ /P
   │  ├─ /Figure (with /Alt)
   │  ├─ /Formula (with /Alt)
   │  └─ ...
   └─ /Sect (next page)
```

Each `/Figure` and `/Formula` structure element **must** carry an `/Alt` string entry containing the alt-text from Stage 1.

##### 2.2.1 — Heading Struct Elements

Every heading struct element MUST have all three of:

- `/S` set to `/H1`, `/H2`, `/H3`, etc.
- `/Pg` set to the page object whose content stream contains the heading's marked region
- `/K` set to an integer (the MCID) or an array containing at least one integer MCID, where the MCID corresponds to an actual BDC…EMC region in that page's content stream

Helper to build a heading element correctly:

```python
from pikepdf import Dictionary, Name

def make_heading(pdf, level, page_obj, mcid, parent):
    """Build an /Hn struct element bound to a page and an MCID on that page.

    level:    int, 1..6
    page_obj: the pikepdf page object whose content stream contains the marked region
    mcid:     int, the MCID of the BDC…EMC region on that page that wraps the heading text
    parent:   the parent struct element (usually the page's /Sect, or /Document for H1)

    The caller still has to append the returned element to the parent's /K array.
    """
    assert 1 <= level <= 6, "Heading level must be 1..6"
    return pdf.make_indirect(Dictionary({
        "/Type": Name("/StructElem"),
        "/S": Name(f"/H{level}"),
        "/P": parent,
        "/Pg": page_obj,  # REQUIRED — binds heading to its page
        "/K": mcid,       # REQUIRED — integer MCID, not empty
    }))
```

**Fixing an existing tree with empty or mis-nested headings.** If Stage 0 found headings that violate hierarchy rules, repair in this order:

1. **Empty headings** (no `/K` or `/Pg`): try to locate the MCID that contains the heading text by parsing the page's content stream (§4.2 MCID parser) and matching the heading text captured in Stage 1 against the decoded Tj/TJ string operands in each MCID segment. Set `/K` to that MCID and `/Pg` to the page object.
2. **Multiple H1s**: keep the first one (or the one on the title page), demote the rest to `/H2` by setting `/S = Name("/H2")`.
3. **Level skips**: walk the heading list in document order; whenever `next.level - prev.level > 1`, raise `next` (lower its level number) until the skip is ≤ 1.

##### 2.2.2 — Binding headings to marked content

*Case A — content stream is being written from scratch (§2.3 path).* Reserve MCID 0 for the page heading text, MCID 1 for the body. Your `/H2` struct element's `/K` is `0` and `/Pg` is the page object.

*Case B — content stream already has BDC/EMC (don't modify streams).* Use the §4.2 MCID parser to build a `{mcid: bytes}` map for each page, then substring-match the heading text (from Stage 1) against decoded text operands in each segment to find the right MCID. Use that MCID in `/K`.

*Case C — cannot reliably identify the heading's MCID.* Set `/ActualText` on the heading struct element to the Stage-1-captured heading string; point `/K` at any MCID on the page and set `/Pg` to the page object. Assistive tech will read `/ActualText` in preference to the underlying marked content.

#### 2.3 — Marked Content

**FIRST: check whether the original content streams already contain BDC/EMC.** If they do, do NOT modify the content streams — the existing marked content is sufficient, and rewriting streams risks corrupting the visual rendering.
```python
import pikepdf

def has_bdc_emc(pdf):
    """Return True only if every page's content stream contains both BDC and EMC."""
    for page in pdf.pages:
        po = page.obj if hasattr(page, 'obj') else page
        cs = po.get("/Contents")
        if cs is None:
            return False
        if isinstance(cs, pikepdf.Array):
            # Multiple streams: join with whitespace so tokens stay separated
            raw = b"\n".join(bytes(pdf.get_object(x.objgen).read_bytes()) for x in cs)
        else:
            cs_obj = pdf.get_object(cs.objgen) if cs.is_indirect else cs
            raw = bytes(cs_obj.read_bytes())  # read_bytes() = decompressed
        if b"BDC" not in raw or b"EMC" not in raw:
            return False
    return True
```

If streams do need BDC/EMC added, **always decompress first** using `read_bytes()` (not `read_raw_bytes()`), then prepend/append the new BDC/EMC operators, and write back via `pdf.make_stream()`. **Never concatenate `read_raw_bytes()` output (compressed binary) with uncompressed text** — this produces an unparseable stream that renders blank pages. The heading's drawing operators must live inside the `/H2` BDC…EMC block so the heading is non-empty:

```python
# CORRECT — decompress, split heading operators from body operators, wrap each:
orig = bytes(cs_obj.read_bytes())  # decompressed
# If you can isolate just the slide-title operators (heading_ops_bytes) from the
# body operators (body_ops_bytes), split the stream. Otherwise use the /ActualText
# fallback from §2.2.2 Case C.
new_stream = (
    b"/H2 <</MCID 0>> BDC\n" + heading_ops_bytes + b"\nEMC\n"
    b"/P <</MCID 1>> BDC\n" + body_ops_bytes + b"\nEMC\n"
)
page_obj["/Contents"] = pdf.make_stream(new_stream)

# WRONG — empty heading region:
new_stream = b"/H2 <</MCID 0>> BDC EMC\n/P <</MCID 1>> BDC\n" + orig + b"\nEMC\n"
# The /H2 has no content between BDC and EMC. Panorama reports "no headings".

# ALSO WRONG — never concatenate compressed with text:
orig = bytes(cs_obj.read_raw_bytes())  # still compressed!
new_stream = b"/H2 <</MCID 0>> BDC EMC\n" + orig  # corrupt, renders blank
```

#### 2.4 — Parent Tree

Build a `/ParentTree` number tree in the `/StructTreeRoot` that maps each page's `StructParents` index to an array of structure elements (one per MCID on that page).
Every page must have a `/StructParents` integer entry.

#### 2.5 — Remove Legacy Annotations

If the PDF has pre-existing annotations from prior (incomplete) remediation attempts, remove them to avoid confusing accessibility checkers. Only structure-tree-based tagging should remain.

---

### STAGE 3 — VERIFICATION (FIRST PASS)

After saving the remediated PDF, reopen it and programmatically verify **every** requirement:

| # | Check | How to verify |
|---|---|---|
| 1 | Document is tagged | `/MarkInfo` → `/Marked` is `true` |
| 2 | Structure tree exists | `/StructTreeRoot` is present in catalog |
| 3 | Document language set | `/Lang` is present and non-empty |
| 4 | Document title in info dict | `/Title` in info dict is non-empty and descriptive |
| 5 | Document title in XMP | `dc:title` in XMP metadata is non-empty |
| 6 | Display title enabled | `/ViewerPreferences` → `/DisplayDocTitle` is `true` |
| 7 | Root element is /Document | `/StructTreeRoot` → `/K` → `/S` == `/Document` |
| 8 | **Exactly one /H1** | Count of `/H1` elements in tree == 1 |
| 9 | **Every content page has a heading** | Every page except explicitly exempted ones has at least one heading struct element with `/Pg` pointing to it |
| 10 | **First heading is /H1** | First heading in document order has level 1 |
| 11 | **No heading-level skips** | For every consecutive pair of headings in doc order, `next.level - prev.level ≤ 1` |
| 12 | **Every heading has non-empty /K** | Every `/Hn` struct element's `/K` is a non-null integer or a non-empty array |
| 13 | **Every heading has /Pg** | Every `/Hn` struct element has a `/Pg` entry pointing to a valid page |
| 14 | **Every heading's MCID is real** | For each heading, parse its `/Pg`'s content stream and confirm the `/K` MCID appears inside a BDC…EMC region with non-empty operators |
| 15 | All figures have /Alt | Every `/Figure` element has a non-empty `/Alt` string |
| 16 | All formulas have /Alt | Every `/Formula` element has a non-empty `/Alt` string |
| 17 | Pages have /StructParents | Every page has a `/StructParents` integer |
| 18 | ParentTree present | `/StructTreeRoot` → `/ParentTree` exists |
| 19 | Marked content in streams | Every page content stream contains `BDC` and `EMC` operators |

**Heading-hierarchy audit helper:**

```python
import pikepdf

def walk_headings(node, out):
    """Depth-first traversal collecting heading struct elements in document order."""
    s = node.get("/S")
    if s is not None:
        name = str(s)
        if len(name) == 3 and name.startswith("/H") and name[2].isdigit():
            out.append((int(name[2]), node))
    k = node.get("/K")
    if k is None:
        return
    children = list(k) if isinstance(k, pikepdf.Array) else [k]
    for ch in children:
        try:
            if ch.is_indirect:
                walk_headings(ch, out)
        except AttributeError:
            pass  # plain int MCID, not a struct element

def audit_hierarchy(pdf):
    """Return a list of failure strings; an empty list means the hierarchy passed."""
    root = pdf.Root["/StructTreeRoot"]["/K"]
    doc_root = root if not isinstance(root, pikepdf.Array) else root[0]
    headings = []
    walk_headings(doc_root, headings)
    failures = []
    if not headings:
        failures.append("no heading struct elements found")
        return failures
    if headings[0][0] != 1:
        failures.append(f"first heading is /H{headings[0][0]}, must be /H1")
    h1_count = sum(1 for lvl, _ in headings if lvl == 1)
    if h1_count != 1:
        failures.append(f"found {h1_count} /H1 elements, must be exactly 1")
    for (a, _), (b, _) in zip(headings, headings[1:]):
        if b - a > 1:
            failures.append(f"heading level skip: /H{a} → /H{b}")
    for lvl, node in headings:
        k = node.get("/K")
        # Check 12: /K must exist and, if it is an array, must not be empty
        if k is None or (isinstance(k, pikepdf.Array) and len(k) == 0):
            failures.append(f"/H{lvl} has no /K")
        if node.get("/Pg") is None:
            failures.append(f"/H{lvl} missing /Pg binding")
    return failures
```

**CRITICAL — check #16 zero-formula trap:** If check #16 passes vacuously (zero `/Formula` elements found) BUT the PDF visually contains equations, this is a **compliance gap**, not a pass. Flag it and proceed to Stage 4.

Print a pass/fail checklist for each item.
For checks 8–14, if any fail, print the offending heading element's location (page index, MCID, `/S` value) so the remediation can be re-run or augmented. Walk the structure tree and print every `/Hn`, `/Figure`, and `/Formula` element with a preview of its alt-text or heading text.

---

### STAGE 4 — FORMULA PROMOTION (when equations are tagged as /Figure)

**Why this stage exists.** Slide-authoring tools (Keynote, PowerPoint, Google Slides) tag all rendered content — including equations — as `/Figure`. Screen readers can read `/Figure` alt-text, but math-aware assistive technology treats `/Formula` elements specially. Promoting equation figures to `/Formula` with LaTeX-embedded alt-text is the highest-fidelity outcome for STEM content.

**This stage is required when** Stage 3 reports zero `/Formula` elements but the PDF visually contains equations.

#### 4.1 — Detect Embedded LaTeX Source (latexit)

Many macOS slide authors use **LaTeXiT** to render equations in Keynote. LaTeXiT embeds the original LaTeX source as a base64-encoded, zlib-compressed Apple binary property list (bplist) inside a `<latexit>` annotation in the PDF content stream. To extract it:

```python
import re, base64, zlib, plistlib

def extract_latexit_latex(segment_bytes):
    """Extract LaTeX source from a latexit bplist annotation in a content stream segment."""
    m = re.search(rb'<latexit[^>]*>([A-Za-z0-9+/=\s]+)</latexit>', segment_bytes)
    if not m:
        return None
    try:
        b64 = re.sub(rb'\s+', b'', m.group(1))
        raw = base64.b64decode(b64)
        # latexit format: 4-byte header, then zlib-compressed bplist
        plist_data = zlib.decompress(raw[4:])
        plist = plistlib.loads(plist_data)
        return plist.get('source')  # original LaTeX source string
    except Exception:
        return None
```

If no latexit annotations are found, fall back to visual inspection and manual LaTeX transcription from the rasterized page images.
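One way to sanity-check the decoding steps in §4.1 without a real Keynote export is to build a synthetic payload in the layout that section describes (4-byte header, zlib-compressed binary plist, base64 text inside a `<latexit>` tag) and decode it again. This is only a sketch: the zero-filled header, the attribute-free tag, and the helper names `build_synthetic_latexit` and `decode_latexit` are assumptions for illustration, not a verified description of LaTeXiT's file format.

```python
import base64
import plistlib
import re
import zlib

def build_synthetic_latexit(latex_source: str) -> bytes:
    """Encode LaTeX in the layout §4.1 describes (hypothetical test fixture)."""
    bplist = plistlib.dumps({"source": latex_source}, fmt=plistlib.FMT_BINARY)
    payload = b"\x00\x00\x00\x00" + zlib.compress(bplist)  # assumed 4-byte header
    return b"<latexit>" + base64.b64encode(payload) + b"</latexit>"

def decode_latexit(segment: bytes):
    """Reverse the steps: find the tag, base64-decode, skip header, inflate, parse."""
    m = re.search(rb"<latexit[^>]*>([A-Za-z0-9+/=\s]+)</latexit>", segment)
    if not m:
        return None
    raw = base64.b64decode(re.sub(rb"\s+", b"", m.group(1)))
    return plistlib.loads(zlib.decompress(raw[4:])).get("source")

# Wrap the fixture in a few drawing operators, as it would sit in a content stream.
latex_in = r"\frac{4}{\pi}\sin(\pi x)"
segment = b"q 1 0 0 1 0 0 cm " + build_synthetic_latexit(latex_in) + b" Q"
recovered = decode_latexit(segment)
print(recovered)
```

If the recovered string matches the input, the regex and the header offset agree with the layout assumed here; a real file may still differ, so test against an actual export before relying on it.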
#### 4.2 — Parse MCID Segments from Content Streams

Associate each `/Figure` (and `/Hn`, for §2.2.2 Case B) struct element with its content stream bytes by parsing the BDC/EMC blocks for MCID numbers:

```python
import re

def parse_mcid_segments(raw_bytes):
    """Return dict of MCID (int) → content stream bytes for each BDC/EMC block."""
    segments = {}
    i = 0
    bdc_stack = []
    while i < len(raw_bytes):
        m = re.search(rb'\b(BDC|BMC|EMC)\b', raw_bytes[i:])
        if not m:
            break
        abs_pos = i + m.start()
        op = m.group(1)
        if op in (b'BDC', b'BMC'):
            # Look only between the previous marked-content operator and this one,
            # and take the LAST /MCID there, so an earlier block's MCID is never reused.
            preceding = raw_bytes[max(i, abs_pos - 300):abs_pos]
            found = re.findall(rb'/MCID\s+(\d+)', preceding)
            mcid = int(found[-1]) if found else None
            bdc_stack.append((mcid, abs_pos + len(op)))
        elif op == b'EMC':
            if bdc_stack:
                mcid, start = bdc_stack.pop()
                if mcid is not None and mcid not in segments:
                    segments[mcid] = raw_bytes[start:abs_pos]
        i = abs_pos + len(op)
    return segments
```

**Important:** The `/K` field of a struct element may contain plain integers (MCID numbers), not just indirect references. Always guard against `AttributeError` when calling `.is_indirect` on plain integers by wrapping in try/except.

#### 4.3 — Classify Equation vs. Non-Equation Figures

For each `/Figure` struct element, get its MCID and look up the content stream segment.

The element is an **equation** if:

- The segment contains a `<latexit>` annotation, AND
- The extracted LaTeX source (after stripping `\color` commands) begins with a mathematical token (`\frac`, `\sqrt`, `\hbar`, `\psi`, `\int`, `\begin{equation}`, `\begin{align}`, a bare symbol like `k =`, etc.)

The element is **not** an equation if:

- The segment contains a `Do` operator referencing an external image XObject (photo, plot screenshot, raster image)
- The LaTeX source begins with `\begin{itemize}`, `\begin{enumerate}`, or a capitalized English word (prose, not math)

Detect raster images via `bool(re.search(rb'\bDo\b', segment))`.
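The §4.3 rules can be sketched as a single predicate. This is a minimal illustration, not a definitive classifier: the function names `strip_color` and `is_equation`, the prefix tuples, and the regex used to spot "a capitalized English word" are assumptions chosen for this example, and the list of math tokens is only the subset named above.

```python
import re

MATH_PREFIXES = (r"\frac", r"\sqrt", r"\hbar", r"\psi", r"\int",
                 r"\begin{equation}", r"\begin{align}")
PROSE_PREFIXES = (r"\begin{itemize}", r"\begin{enumerate}")

def strip_color(latex: str) -> str:
    r"""Drop \color[...]{...} and \color{...} commands before classification."""
    latex = re.sub(r"\\color\[[^\]]+\]\{[^}]*\}", "", latex)
    return re.sub(r"\\color\{[^}]*\}", "", latex).strip()

def is_equation(segment: bytes, latex) -> bool:
    """Apply the equation rules and the exclusion rules from §4.3."""
    if latex is None:                        # no latexit source was extracted
        return False
    if re.search(rb"\bDo\b", segment):       # segment paints an XObject
        return False
    src = strip_color(latex)
    if src.startswith(PROSE_PREFIXES):       # list environments are prose
        return False
    if re.match(r"[A-Z][a-z]+\s", src):      # assumed test for an English word
        return False
    if src.startswith(MATH_PREFIXES):
        return True
    return bool(re.match(r"[A-Za-z]\w*\s*=", src))  # bare symbol such as "k ="

print(is_equation(b"BT (x) Tj ET", r"\color[rgb]{0,0,0}\frac{a}{b}"))
print(is_equation(b"q /Im0 Do Q", r"\frac{a}{b}"))
```

One caveat worth checking on real files: base64 text can contain the bytes `Do` by chance, so it may be safer to remove the annotation payload from the segment before running the `Do` test.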
#### 4.4 — Clean LaTeX for Alt-Text

The latexit `source` field contains `\color[rgb]{R,G,B}` wrappers and a preamble color line. Strip them before including LaTeX in alt-text so AT tools don't read out the color commands:

```python
import re

def clean_latex(s):
    """Remove color commands, \\noindent, comment lines, and empty braces."""
    s = re.sub(r'^\s*\\color\[[^\]]+\]\{[^}]+\}\s*\n?', '', s.strip())
    for _ in range(8):
        s = re.sub(r'\\color\[[^\]]+\]\{[^}]*\}', '', s)
        s = re.sub(r'\\color\{[^}]*\}', '', s)
    s = re.sub(r'^\s*%?\\noindent\s*\n?', '', s, flags=re.MULTILINE)
    s = re.sub(r'^\s*%[^\n]*\n', '', s, flags=re.MULTILINE)
    s = re.sub(r'\{\s*\}', '', s)
    s = re.sub(r'\s{3,}', ' ', s)
    return s.strip()
```

#### 4.5 — Promote /Figure → /Formula

For each `/Figure` element that passes the equation classifier:

```python
from pikepdf import Name, String

# obj is a /Figure struct element that passed the §4.3 classifier
obj['/S'] = Name('/Formula')
obj['/Alt'] = String(f"{spoken_description}. LaTeX: {clean_latex(raw_source)}")
```

The alt-text MUST contain both:

- A **spoken-English description** spelling out all symbols (e.g., "negative h-bar squared over two m times the second partial of psi with respect to x")
- The **clean LaTeX source** prefixed with `LaTeX:` so math-aware AT can render it

Do not promote `/Figure` elements that are raster images, wave-function plots, screenshots, QR codes, or decorative backgrounds.

#### 4.6 — Re-run Stage 3 Verification

After promotion, re-run the Stage 3 checklist (all 19 items). Check 16 should now show a non-zero `/Formula` count. Append the second-pass report to the JSON output.
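Appending the second-pass results to the JSON report could look like the sketch below. The report layout (`passes`, `failed`, `overall`), the check names, the helper `append_pass`, and the file name are all hypothetical; the prompt does not prescribe a schema, so treat this as one possible shape.

```python
import json

def append_pass(report: dict, label: str, checks: dict) -> dict:
    """Add one verification pass to the report and recompute the overall result."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    report.setdefault("passes", []).append(
        {"label": label, "checks": checks, "failed": failed}
    )
    # The overall verdict reflects the most recent pass only.
    report["overall"] = "pass" if not failed else "fail"
    return report

report = {"input": "slides.pdf"}  # hypothetical input name
append_pass(report, "stage3_first_pass",
            {"08_exactly_one_h1": True, "16_formulas_have_alt": False})
append_pass(report, "stage4_second_pass",
            {"08_exactly_one_h1": True, "16_formulas_have_alt": True})
print(json.dumps(report, indent=2))
```

Keeping both passes in the report, rather than overwriting the first, leaves a record of which checks Stage 4 actually fixed.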
---

### OUTPUT

- Save the remediated PDF as `_ada.pdf`
- Save a JSON verification report as `_ada.report.json`
- Print a summary: total pages, heading-hierarchy audit result, total figures, total formulas, and overall pass/fail

---

### COMMON PITFALLS

- **"No headings" reported by Panorama even though `/H1` is in the tree.** Three causes: (1) heading `/K` points to an MCID with no matching BDC…EMC region in any page content stream (empty heading); (2) heading has no `/Pg` entry, so the checker can't locate its marked content; (3) the document has multiple `/H1` elements, which some checkers reject as invalid hierarchy and report as "no valid headings." The Stage 3 audit catches all three.
- **One `/H1` per page is wrong.** A document has exactly one H1 (the title). Each content page gets an H2.
- **Heading-level skips** (H1 → H3 with no H2 between them) fail PDF/UA heading-nesting checks and are typically reported under WCAG 1.3.1 (Info and Relationships). If a visual subheading looks like H3 but has no H2 above it, promote it to H2.
- **Empty heading BDC block.** `/H2 <</MCID 0>> BDC EMC` with no drawing operators between BDC and EMC produces a heading struct element that checkers treat as empty. The drawing operators for the heading text must live between BDC and EMC, or use the `/ActualText` fallback from §2.2.2 Case C.
- **Never mix `read_raw_bytes()` with uncompressed BDC/EMC text** — corrupts the stream and renders the page blank. Always use `read_bytes()` when reading streams you plan to modify.
- **Don't skip equations.** Zero `/Formula` elements in a document that visually contains equations is a compliance gap — run Stage 4.
- **Don't forget the title in three places:** `/Title` in info dict, `dc:title` in XMP, and `/DisplayDocTitle true` in ViewerPreferences.
- **The info-dict title must be descriptive** — without it, PDF viewers show the filename instead of the title in the title bar, which fails WCAG 2.4.2.
- **Guard against `int.is_indirect`** when traversing `/K` arrays. The field may contain plain integer MCIDs, not just indirect references.
- **Overwriting the input file with pikepdf** requires `pikepdf.open(path, allow_overwriting_input=True)`.
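
The heading pitfalls above (multiple H1s, level skips, a first heading that is not H1) come down to one pass over the heading levels in document order. A minimal sketch on plain integers, assuming the levels have already been collected from the structure tree; the function name `normalize_levels` is hypothetical, and writing the corrected levels back to `/S` is left out:

```python
def normalize_levels(levels):
    """Return heading levels with a single leading H1 and no skips greater than 1."""
    fixed = []
    seen_h1 = False
    prev = 0
    for lvl in levels:
        if not fixed:
            lvl = 1                 # first heading becomes the single H1
        elif lvl == 1 and seen_h1:
            lvl = 2                 # any later H1 is demoted to H2
        if lvl - prev > 1:
            lvl = prev + 1          # close a skip such as H1 -> H3
        seen_h1 = seen_h1 or lvl == 1
        fixed.append(lvl)
        prev = lvl
    return fixed

print(normalize_levels([1, 3, 3, 1, 2]))
print(normalize_levels([2, 2, 4]))
```

Because each level is compared with the already-corrected previous one, sibling headings can end up at different levels (two original H3s after an H1 become H2 then H3), so it is worth reviewing the result against the Stage 1 visual classification.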