# AI PDF Prep FAQ

Prepare scan-like and OCR-heavy PDFs with your own LLM so SynapQ's fast parser can ingest them cleanly.

Canonical HTML guide: https://www.synapq.app/docs/ai-prep
Plain-text mirror: https://www.synapq.app/docs/ai-prep.txt

## When to use this flow
- Scanned PDFs with broken OCR, double columns, and noisy headers.
- Image-heavy sources where the normal parse path would need visual enrichment.
- Legacy materials that are readable to a strong LLM but expensive to normalize on SynapQ servers.
- Cases where the user wants to control the preprocessing step in their own model account.

## Which models fit this flow best
- This flow works best with strong multimodal models that can inspect the PDF visually, not just extract text.
- Weaker text-only or OCR-first models may recover the question text but still miss visually marked answers.
- If the model fails to recover answers on the first pass, strengthen the prompt rather than trusting a weak output.

## What comes back into SynapQ
- Attach the original PDF and the copied prompt to your own LLM.
- If the model can browse, let it visit the public guide for the detailed rules.
- Review the generated Markdown quickly for dropped questions, malformed choices, missing recovered answers, or synthesized distractors that are too weak.
- If the model appended an `Extracted Images` section, keep that appendix intact so SynapQ can attach those visuals automatically by question number.
- If the output is long, prefer downloading a single `.md` file from the LLM interface instead of copying a huge chat message.
- Upload the cleaned result into SynapQ as `.md`, `.txt`, pasted text, or a converted PDF.
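
The check for dropped questions in the review step above can be partly automated before upload. The following is a minimal sketch, not part of SynapQ: the function name `question_number_report` is hypothetical, and it assumes only that every question starts with a `## Question N` heading, as this guide requires. It reports gaps, duplicates, and ordering problems in the question numbers.

```python
import re

# Heading format taken from the question contract in this guide.
HEADING = re.compile(r"^## Question (\d+)[ \t]*$", re.MULTILINE)


def question_number_report(markdown: str) -> dict:
    """Report missing, duplicated, and out-of-order question numbers."""
    numbers = [int(n) for n in HEADING.findall(markdown)]
    if not numbers:
        return {"found": [], "missing": [], "duplicates": [], "in_order": True}
    seen = set(numbers)
    # Gaps are measured between the lowest and highest number actually present,
    # so a booklet that starts at Question 35 is not reported as missing 1-34.
    missing = [n for n in range(min(numbers), max(numbers) + 1) if n not in seen]
    duplicates = sorted({n for n in numbers if numbers.count(n) > 1})
    return {
        "found": numbers,
        "missing": missing,
        "duplicates": duplicates,
        "in_order": numbers == sorted(numbers),
    }


sample = "## Question 1\n...\n## Question 2\n...\n## Question 4\n...\n"
print(question_number_report(sample)["missing"])  # [3]
```

A gap does not always mean the model dropped a question, since the source PDF itself may skip numbers. Treat the report as a pointer to the pages worth rechecking.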

## NotebookLM outputs
- SynapQ does not currently import NotebookLM links directly.
- If NotebookLM generates a quiz or test for you, copy the output and bring it into SynapQ with `Paste Text`, or export it to `.md` / `.txt` first.
- Treat NotebookLM links as viewing links, not as a stable machine-to-machine import format.
- If the NotebookLM output is noisy or not in SynapQ's question contract yet, run it through the SynapQ AI prep prompt first, then paste or upload the cleaned result.

## Prompt to paste into your LLM
A PDF document will be attached. Convert it into clean, parser-friendly Markdown for SynapQ.

If you can browse the web, first read these detailed instructions:
https://www.synapq.app/docs/ai-prep

If you cannot access that page, continue using the rules below without stopping.

Goal:
- Preserve the original question order.
- Preserve exact medical, scientific, and exam terminology.
- Produce output that can be uploaded into SynapQ as Markdown, plain text, pasted text, or a converted PDF.

Answer recovery:
- Treat the PDF as a visual document, not as text-only OCR. Use multimodal inspection when available.
- Inspect the entire PDF for answer signals, not just the question pages.
- There may be no separate answer key. In some booklets, the correct option is marked directly on the question page.
- Some "çıkmış" or recall sheets contain only the correct answer text under the question, with no distractor options.
- If a separate answer key exists, it may appear at the end of the document. Match those entries back to the correct question numbers and add: Answer: X
- If the correct answer is visually marked in the question itself, recover it when possible. Use cues such as bold text, different text color, underline, highlight, check marks, filled circles, or other explicit formatting differences.
- Markings can distort OCR. If a choice label or word looks corrupted because of a mark, stamp, or highlight, interpret that anomaly as a possible answer signal instead of discarding it immediately.
- If both an inline visual cue and a final answer key exist, prefer the explicit answer key when they conflict.
- If the PDF clearly gives only one correct answer for a single-best-answer question and no alternative options, keep the stem unchanged, keep that original correct answer text unchanged, and synthesize exactly 3 plausible distractors from the stem context.
- Only synthesize distractors for standard single-best-answer questions. Do not synthesize distractors for matching, ordering, true/false, multi-statement, multi-answer, or open-ended questions.
- Only add Answer: X when the answer is recoverable from the PDF with reasonable confidence. If the answer is unclear, omit Answer: X and add Needs review: yes.

Output format:
- Output only Markdown. Do not include commentary, JSON, XML, HTML, or code fences.
- If your interface can actually generate a downloadable file and the result would be long, prefer returning a single .md file containing only the final cleaned document.
- Treat 40 or more questions, or any clearly long output, as a strong reason to prefer a file artifact when the interface supports it.
- If the interface does not actually produce downloadable files, return only the raw final Markdown document in the chat response without explaining that limitation.
- Keep exactly the same SynapQ format in either case. Switching from chat output to a .md or .txt file must not change the `## Question N` / choices / metadata structure.
- Do not add any preface, summary, explanation, confidence note, or closing text before or after the Markdown.
- Start each question with: ## Question N
- Write the full question stem directly below the heading.
- Then write: Visual dependency: yes or Visual dependency: no
- Put each choice on its own line as:
  A. ...
  B. ...
  C. ...
  D. ...
  E. ...
- Preserve the real number of choices from the source unless the source gives only one correct-answer text for a single-best-answer question. In that special case, output exactly 4 choices total: the original correct answer plus 3 synthesized distractors.
- When distractors are synthesized, do not rewrite the stem and do not rewrite the original correct answer text. Only add the 3 new distractors around it.
- If the answer is recoverable from an answer key or a clear visual mark, add: Answer: X
- If an explanation is present, add: Explanation: ...
- If distractors are synthesized, add: Needs review: yes
- If any important part is unreadable, ambiguous, truncated, or visually dependent in a way that is not recoverable, add: Needs review: yes
- If you can faithfully extract one question-specific image, do not place it inline inside the question block. Append a final section exactly titled:
  ## Extracted Images
- If you are returning chat output, prefer precise crop coordinates instead of base64 in exactly this format:
  Question N: page=12 x0=120 y0=340 x1=820 y1=1260
- If you are returning a downloadable file artifact and can faithfully include the image bytes, you may use:
  Question N: data:image/png;base64,...
- Every image entry must stay on one physical line.
- Do not use bold, bullets, citations, labels in parentheses, or commentary in the image appendix.
- Do not put the data URL or coordinates on the next line.
- Do not write anything after the last image line.
- Use at most one extracted image per question, and only when it clearly belongs to that specific question number.
- If the image cannot be extracted faithfully, skip the appendix entry and rely on Visual dependency: yes / Needs review: yes instead.

Cleanup rules:
- Remove page numbers, headers, footers, watermarks, repeated titles, and duplicated scan noise.
- Reconstruct broken line wraps so the question reads naturally.
- Keep separate questions separate. Do not merge or split them incorrectly.
- Do not translate the content.
- Do not simplify terminology.
- Do not guess missing words, figures, tables, or answer choices, except for the explicit single-answer-only distractor synthesis rule above.
- If a table or figure is essential and readable, summarize only the minimum needed to understand the question.
- If a figure, table, image, or scan region is essential but unclear, mark Visual dependency: yes and Needs review: yes.

Return the final Markdown only.

## Target output shape
- If the interface can actually generate downloadable files, return a single `.md` file containing only the cleaned document when the output is long.
- Treat 40 or more questions, or any obviously long output, as a reason to prefer a `.md` file over a chat message when the interface supports it.
- If the interface does not actually produce files, return only the raw Markdown document with no preface or trailing explanation.
- Use the exact same question contract in chat output and in `.md` / `.txt` files so SynapQ can ingest either form.
- One question at a time, in original order.
- Each question starts with `## Question N`.
- Question stem appears as plain text immediately below the heading.
- Choices stay on separate lines in `A. ...` format.
- If the source gives only one correct answer text for a standard single-best-answer question, output exactly 4 choices total by adding 3 synthesized distractors around the untouched correct answer.
- `Visual dependency: yes|no` is always included.
- Optional metadata is limited to `Answer: X`, `Explanation: ...`, and `Needs review: yes`.
- If the model can faithfully extract a question image, append it only once in a final section exactly titled `## Extracted Images`.
- For chat output, prefer a single-line coordinate entry in exactly this format: `Question N: page=12 x0=120 y0=340 x1=820 y1=1260`.
- For downloadable file artifacts, `Question N: data:image/png;base64,...` is also allowed on one physical line.
- Do not use bold, bullets, citations, labels in parentheses, or commentary in the image appendix.
- Do not move the coordinates or data URL onto the next line and do not write anything after the last image line.

## What the model should clean up
- Remove page numbers, scan headers, watermarks, repeated section titles, and duplicated OCR fragments.
- Rejoin broken line wraps so each question reads like normal prose.
- Keep exact terminology for diagnoses, drugs, anatomy, and scientific notation.
- Compress readable tables or figure labels into the minimum text needed for the question.
- If a figure or table is essential but unclear, mark the question for review instead of guessing.

## How the model should recover answers
- Treat the PDF as a visual document, not as text-only OCR. Use multimodal inspection when the model supports it.
- Inspect the whole PDF for answer signals, including separate answer-key pages at the end.
- Do not assume a separate answer key exists. Some PDFs mark the correct choice directly on the question page.
- Some "çıkmış" or recall sheets give only the correct answer text with no distractors. Detect that layout explicitly instead of assuming the options were merely lost.
- Map final answer-key entries back to the correct question numbers and emit `Answer: X` when the mapping is clear.
- Use explicit visual cues inside the question such as bold, different color, underline, highlight, check marks, or filled markers when they clearly indicate the correct choice.
- Treat OCR anomalies caused by marking, stamps, or highlights as possible answer signals instead of ignoring them automatically.
- If inline formatting conflicts with a separate answer key, prefer the explicit answer key.
- If the PDF clearly supplies only one correct answer for a standard single-best-answer question, keep that answer text unchanged and synthesize exactly 3 plausible distractors from the stem context.
- Never synthesize distractors for matching, ordering, true/false, multi-statement, multi-answer, or open-ended questions.
- If the answer is still uncertain, omit `Answer: X` and mark `Needs review: yes` instead of guessing.

## Single-answer recall sheets
- Use this only when the PDF clearly shows a single-best-answer question with one answer text and no existing distractor options.
- Do not rewrite the question stem.
- Do not rewrite the original correct answer text from the PDF.
- Add exactly 3 plausible distractors so the final question has 4 options total.
- Keep the distractors medically plausible but clearly not better than the original correct answer.
- Always add `Answer: X` for the preserved correct choice and `Needs review: yes` when distractors were synthesized.
- Do not use this for matching, ordering, true/false, multi-statement (Roman numeral combination), multi-answer, or open-ended items.

## Extracted image appendix
- Only append extracted images when the model can faithfully recover a question-specific image from the PDF.
- Do not place image data or crop coordinates inside the main question block. Keep them in a final section exactly titled `## Extracted Images` after the last question.
- For chat output, prefer one line per image in exactly this format: `Question N: page=12 x0=120 y0=340 x1=820 y1=1260`.
- For downloadable file artifacts, one line per image in `Question N: data:image/png;base64,...` format is also allowed.
- Do not use bold, bullets, citations, labels in parentheses, or commentary in that appendix.
- Do not move the coordinates or data URL onto the next line and do not write anything after the last image line.
- Use at most one extracted image per question, written in one of the two exact formats above.
- If the image is ambiguous, low-quality, or not clearly tied to a specific question number, skip the appendix entry and mark the question for review instead.

## What the model must not do
- Do not invent missing options, answer keys, labels, or image content outside the explicit single-answer-only distractor rule.
- Do not merge adjacent questions that only look close together on the scan.
- Do not translate, simplify, paraphrase, or modernize the wording.
- Do not add any introductory note, summary, or closing comment around the Markdown output.
- Do not wrap the result in JSON, XML, HTML, or markdown code fences.

## Use `Visual dependency: yes` when
- The question depends on an image, chart, pathology slide, ECG, radiology figure, or table.
- The stem is only understandable with labels or annotations from the source image.
- The model can summarize the visible context but should not pretend the visual has been fully converted into text.

## Use `Needs review: yes` when
- A choice is clipped, garbled, or missing.
- A figure or table is required but unreadable.
- The scan is too degraded to preserve the original meaning with confidence.

## Good output example
## Question 1
A 62-year-old patient presents with sudden painless vision loss in the right eye. Which diagnosis is most likely?
Visual dependency: no
A. Central retinal artery occlusion
B. Optic neuritis
C. Acute angle-closure glaucoma
D. Retinal migraine
Answer: A

## Question 2
The lesion shown in the fundus image is most consistent with which condition?
Visual dependency: yes
A. Diabetic retinopathy
B. Choroidal melanoma
C. Retinal detachment
D. Hypertensive retinopathy
Needs review: yes

## Example when distractors were synthesized
## Question 35
Aşağıda yazılan yasaklı madde ve yöntemlerden hangisi müsabaka dışı dönemlerde kullanıldığı zaman doping olarak kabul edilmez?
Visual dependency: no
A. Beta-2 agonistler
B. Narkotik analjezikler
C. Eritropoetin
D. Anabolik ajanlar
Answer: B
Needs review: yes

## Extracted image appendix example
## Question 3
The ECG shown is most consistent with which rhythm?
Visual dependency: yes
A. Atrial fibrillation
B. Ventricular tachycardia
C. Supraventricular tachycardia
D. Sinus tachycardia
Needs review: yes

## Extracted Images
Question 3: page=4 x0=120 y0=340 x1=820 y1=1260

## Bad output example
Question 1) Sudden vision loss question
A) CRAO
B) ON
C) glaucoma
D) migraine

Question 2 uses an image and is probably diabetic retinopathy.

This fails because it drops the required `## Question N` heading format, replaces the full stem with a summary, uses `A)` instead of `A.` for choice labels, skips the visual flag, shortens medical terms, and invents a probable answer for the second question.