Capture Documents API

Base path: /api/v1/capture/documents

See API Reference for auth, errors, and pagination.

Document capture accepts file uploads for asynchronous parsing. The API stores the file, creates a CaptureJob record, and enqueues a Celery worker task. On success the parsed text is written to a KnowledgeEntry. Clients poll the job endpoint to track progress.

Plan requirement: Starter+ (@plan_required("starter", feature="document_capture") on all routes)

Supported File Formats

Extension	MIME type	Parser
`.pdf`	`application/pdf`	Unstructured.io
`.docx`	`application/vnd.openxmlformats-officedocument.wordprocessingml.document`	Unstructured.io
`.doc`	`application/msword`	Unstructured.io (requires LibreOffice on worker host)
`.xlsx`	`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`	Unstructured.io
`.png`	`image/png`	Tesseract OCR (`ara+eng`)
`.jpg` / `.jpeg`	`image/jpeg`	Tesseract OCR (`ara+eng`)

Maximum file size: 50 MB
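A client can reject unsupported files locally before spending bandwidth on an upload. The sketch below mirrors the table and size limit above. The function name `check_upload_candidate` is hypothetical, and the interpretation of 50 MB as 50 × 1024 × 1024 bytes is an assumption, since the exact byte count is not stated here. The server remains the source of truth because it also checks the MIME type.

```python
from pathlib import PurePath

# Extension -> MIME type, copied from the supported-formats table.
SUPPORTED_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".doc": "application/msword",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
}

# Assumption: "50 MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_FILE_SIZE = 50 * 1024 * 1024


def check_upload_candidate(filename: str, size_bytes: int) -> str:
    """Return the MIME type to declare, or raise ValueError if the file would likely be rejected."""
    ext = PurePath(filename).suffix.lower()
    if ext not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError(f"file is {size_bytes} bytes; limit is {MAX_FILE_SIZE}")
    return SUPPORTED_TYPES[ext]
```

A failed check here corresponds roughly to the `INVALID_DOCUMENT_FILE` and `DOCUMENT_FILE_TOO_LARGE` errors of the upload endpoint.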

Job Status Lifecycle

pending → processing → completed
                     ↘ failed

Status	Meaning
`pending`	Job created; task queued but not yet picked up by a worker
`processing`	Worker has started parsing the document
`completed`	Parsing succeeded; `result_entry_id` points to the new `KnowledgeEntry`
`failed`	All retries exhausted, or the job was marked as stuck by the recovery task; `error_message` contains the reason

A periodic Celery beat task (capture.documents.recover_stuck_jobs) marks any job that has been in processing for more than 30 minutes as failed. This covers workers that crash mid-parse.
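The lifecycle can be modelled on the client as a small state table, which is handy for deciding when to stop polling. This is a minimal sketch based only on the diagram above; the names `JobStatus`, `ALLOWED_TRANSITIONS`, `is_terminal`, and `can_transition` are hypothetical and not part of the API.

```python
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Transitions drawn in the lifecycle diagram; terminal states have no exits.
ALLOWED_TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


def is_terminal(status: str) -> bool:
    """True once a job can no longer change state, i.e. polling can stop."""
    return not ALLOWED_TRANSITIONS[JobStatus(status)]


def can_transition(current: str, new: str) -> bool:
    """True if the diagram allows moving from `current` to `new`."""
    return JobStatus(new) in ALLOWED_TRANSITIONS[JobStatus(current)]
```

Passing an unknown status string raises `ValueError`, which surfaces any status value the client was not written to handle.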

Endpoints

POST /capture/documents/upload

Auth: JWT required — Plan: Starter+

Accepts a multipart file upload, validates the file, persists it to storage, creates a CaptureJob, and enqueues an async parsing task. Returns immediately with the new job ID.

Request

Header	Value
`Authorization`	`Bearer <access_token>`
`Content-Type`	`multipart/form-data`

Field	Type	Required	Description
`file`	file	yes	Document to upload. Field name must be exactly `file`.

Response — 201 Created

{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "status": "pending",
  "message": "Document uploaded successfully. Parsing is in progress."
}

Field	Type	Description
`job_id`	UUID string	ID of the newly created `CaptureJob`. Use this to poll status.
`status`	string	Always `"pending"` at creation time.
`message`	string	Human-readable confirmation.

Errors

HTTP	Code	Cause
400	`BAD_REQUEST`	`file` field missing or filename is empty
400	`INVALID_DOCUMENT_FILE`	Extension or MIME type not in supported list
401	`AUTHENTICATION_FAILED`	Missing or expired JWT
402	`SUBSCRIPTION_REQUIRED`	Subscription inactive or trial expired
403	`PLAN_UPGRADE_REQUIRED`	Plan below Starter
413	`DOCUMENT_FILE_TOO_LARGE`	File exceeds 50 MB limit
500	`INTERNAL_ERROR`	Unexpected error during storage or job creation

curl -X POST https://api.knora.io/api/v1/capture/documents/upload \
  -H "Authorization: Bearer <access_token>" \
  -F "file=@/path/to/report.pdf"
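A client will usually want to branch on the outcome of the upload. The sketch below maps each HTTP status from the error table to a suggested reaction. It dispatches on the status code only, because the shape of the error body is described in the API Reference rather than here. The action labels and the function name are hypothetical, chosen for illustration.

```python
# HTTP status -> suggested client reaction, based on the upload error table.
# The action labels are hypothetical names chosen for this sketch.
UPLOAD_ERROR_ACTIONS = {
    400: "fix_request",         # missing `file` field, empty filename, or unsupported type
    401: "reauthenticate",      # missing or expired JWT
    402: "check_subscription",  # subscription inactive or trial expired
    403: "upgrade_plan",        # plan below Starter
    413: "reduce_file_size",    # file exceeds the 50 MB limit
    500: "retry_later",         # unexpected server-side error
}


def action_for_upload_response(http_status: int) -> str:
    """Return a label describing what the client should do next."""
    if http_status == 201:
        return "poll_job"
    return UPLOAD_ERROR_ACTIONS.get(http_status, "report_unexpected_status")
```

Only the 500 case is treated as worth retrying unchanged; the other errors need a different file, a fresh token, or an account change first.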

GET /capture/documents/jobs

Auth: JWT required — Plan: Starter+

Returns a paginated list of the authenticated user's document capture jobs, scoped to their organisation. Results are ordered by creation time, newest first.

Query Parameters

Parameter	Type	Default	Max	Description
`page`	integer	`1`	—	Page number (1-indexed)
`per_page`	integer	`20`	`100`	Results per page; values above 100 are clamped to 100

Response — 200 OK

{
  "jobs": [
    {
      "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "org_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
      "type": "document",
      "status": "completed",
      "source_filename": "annual_report.pdf",
      "file_size": 1048576,
      "mime_type": "application/pdf",
      "created_by": "1d5a8b3e-4f2c-4a1b-9c3d-8e7f6a5b4c3d",
      "created_at": "2026-06-01T10:00:00+00:00",
      "updated_at": "2026-06-01T10:00:45+00:00",
      "completed_at": "2026-06-01T10:00:45+00:00",
      "error_message": null,
      "result_entry_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "metadata_json": "{\"page_count\": 12, \"parsed_by\": \"unstructured\"}"
    }
  ],
  "total": 42,
  "page": 1,
  "per_page": 20
}

Job object fields

Field	Type	Nullable	Description
`id`	UUID string	no	Unique job identifier
`org_id`	UUID string	no	Organisation that owns the job
`type`	string	no	Always `"document"`
`status`	string	no	`pending`, `processing`, `completed`, or `failed`
`source_filename`	string	no	Original filename supplied by the client
`file_size`	integer	no	File size in bytes
`mime_type`	string	no	Detected or declared MIME type
`created_by`	UUID string	no	User who uploaded the file
`created_at`	ISO 8601 string	yes	Job creation timestamp (UTC)
`updated_at`	ISO 8601 string	yes	Last status change timestamp (UTC)
`completed_at`	ISO 8601 string	yes	Timestamp when the job reached a terminal state
`error_message`	string	yes	Populated only when `status` is `"failed"`
`result_entry_id`	UUID string	yes	ID of the `KnowledgeEntry` created on success
`metadata_json`	JSON string	yes	Parser metadata: `page_count`, `parsed_by` (`"unstructured"` or `"tesseract"`)
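Two fields in the job object need a second decoding step on the client: the timestamps are ISO 8601 strings, and `metadata_json` is a JSON document carried inside a string. The following sketch shows one way to handle both with the standard library. The function name `decode_job` and the added `metadata` key are hypothetical conveniences, not part of the response.

```python
import json
from datetime import datetime
from typing import Optional


def decode_job(job: dict) -> dict:
    """Return a copy of a job object with timestamps and metadata decoded."""

    def parse_ts(value: Optional[str]) -> Optional[datetime]:
        # Nullable fields arrive as None and are passed through unchanged.
        return datetime.fromisoformat(value) if value else None

    decoded = dict(job)
    for key in ("created_at", "updated_at", "completed_at"):
        decoded[key] = parse_ts(job.get(key))
    raw_meta = job.get("metadata_json")
    # `metadata_json` is JSON encoded as a string, so it needs its own json.loads.
    decoded["metadata"] = json.loads(raw_meta) if raw_meta else {}
    return decoded
```

With the decoded timestamps, the processing duration of a finished job is simply `completed_at - created_at`.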

Errors

HTTP	Code	Cause
401	`AUTHENTICATION_FAILED`	Missing or expired JWT
402	`SUBSCRIPTION_REQUIRED`	Subscription inactive or trial expired
403	`PLAN_UPGRADE_REQUIRED`	Plan below Starter

curl "https://api.knora.io/api/v1/capture/documents/jobs?page=1&per_page=20" \
  -H "Authorization: Bearer <access_token>"
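To retrieve every job rather than a single page, a client can walk the pages until it has seen `total` items. The sketch below keeps the HTTP call out of scope: `fetch_page` is a hypothetical callable that performs the GET request with the given `page` and `per_page` and returns the decoded JSON body shown above.

```python
from typing import Callable, Iterator


def iter_all_jobs(
    fetch_page: Callable[[int, int], dict], per_page: int = 100
) -> Iterator[dict]:
    """Yield every job by walking pages until `total` items have been seen."""
    page = 1
    seen = 0
    while True:
        body = fetch_page(page, per_page)
        jobs = body["jobs"]
        yield from jobs
        seen += len(jobs)
        # Also stop on an empty page, in case `total` changes between requests.
        if not jobs or seen >= body["total"]:
            break
        page += 1
```

Requesting the maximum `per_page` of 100 keeps the number of round trips low. Because new uploads can arrive between requests, a long listing may contain a duplicate or miss a very recent job.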

GET /capture/documents/jobs/{job_id}

Auth: JWT required — Plan: Starter+

Returns the current status and full detail of a single capture job. Use this endpoint to poll for completion after uploading a document. The job must belong to the requesting user and their organisation.

Path Parameters

Parameter	Type	Description
`job_id`	UUID string	The `job_id` returned by the upload endpoint

Response — 200 OK

{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "org_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
  "type": "document",
  "status": "completed",
  "source_filename": "invoice_q1.xlsx",
  "file_size": 204800,
  "mime_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  "created_by": "1d5a8b3e-4f2c-4a1b-9c3d-8e7f6a5b4c3d",
  "created_at": "2026-06-01T09:30:00+00:00",
  "updated_at": "2026-06-01T09:30:22+00:00",
  "completed_at": "2026-06-01T09:30:22+00:00",
  "error_message": null,
  "result_entry_id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
  "metadata_json": "{\"page_count\": 1, \"parsed_by\": \"unstructured\"}"
}

The response shape is identical to a single item in the jobs array from the list endpoint.

Errors

HTTP	Code	Cause
400	`BAD_REQUEST`	`job_id` is not a valid UUID
401	`AUTHENTICATION_FAILED`	Missing or expired JWT
402	`SUBSCRIPTION_REQUIRED`	Subscription inactive or trial expired
403	`PLAN_UPGRADE_REQUIRED`	Plan below Starter
404	`DOCUMENT_JOB_NOT_FOUND`	No job with that ID exists for the requesting user within their organisation

curl "https://api.knora.io/api/v1/capture/documents/jobs/3fa85f64-5717-4562-b3fc-2c963f66afa6" \
  -H "Authorization: Bearer <access_token>"
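Since a malformed `job_id` is rejected with `BAD_REQUEST`, a client may prefer to validate the ID before making the request. This is a small sketch using the standard `uuid` module; the helper name `job_url` is hypothetical, and the base URL is taken from the curl examples in this page.

```python
import uuid

BASE_URL = "https://api.knora.io/api/v1/capture/documents"


def job_url(job_id: str) -> str:
    """Build the job detail URL, raising ValueError for a malformed job ID."""
    # uuid.UUID raises ValueError for anything that is not a UUID;
    # str() then normalises it to the lowercase hyphenated form.
    canonical = str(uuid.UUID(job_id))
    return f"{BASE_URL}/jobs/{canonical}"
```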

Polling Pattern

Because parsing is asynchronous, clients should poll after uploading:

1. POST /capture/documents/upload and receive a job_id with status "pending".
2. Poll GET /capture/documents/jobs/{job_id} every 2–5 seconds.
3. Stop when status is "completed" or "failed".
4. On "completed": use result_entry_id to fetch the parsed KnowledgeEntry.
5. On "failed": read error_message and prompt the user to re-upload.

Most documents complete in under 30 seconds; large PDFs may take longer.
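The polling steps above can be sketched as a small loop. The HTTP call is injected so the example stays self-contained: `fetch_job` is a hypothetical callable that performs the job detail request and returns the decoded JSON body. The 3-second interval sits inside the recommended 2–5 second range, while the 300-second overall timeout is an assumption of this sketch, not a documented limit.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_job: Callable[[], dict],
    interval_seconds: float = 3.0,
    timeout_seconds: float = 300.0,  # assumption: not a documented limit
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job is terminal, then return the final job object."""
    waited = 0.0
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {job['status']} after {waited:.0f}s")
        sleep(interval_seconds)
        waited += interval_seconds
```

The caller then inspects the returned object: `result_entry_id` when the status is "completed", `error_message` when it is "failed". Injecting `sleep` makes the loop easy to exercise in tests without real delays.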

Notes

Tenant isolation: Jobs are scoped to both user_id and org_id. A user cannot access jobs belonging to other users within the same organisation. The org_id is passed to the Celery task to prevent cross-tenant task injection.
Idempotency: If the task for a job that is already completed or failed is re-delivered (e.g. after a worker SIGKILL), it exits early without re-parsing.
File storage: Files are persisted to the configured storage backend (key path: documents/<org_id>/<filename>) before the job record is created. The storage key is not exposed in API responses.
Knowledge entry status: Parsed entries are created with status: needs_review and confidence: medium, requiring human review before use in knowledge operations.
Image OCR: Tesseract uses ara+eng. Both language packs must be installed on the worker host.
.doc files: Legacy Word 97-2003 format requires LibreOffice on the worker host.
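The idempotency note above describes an early exit for re-delivered tasks. The sketch below illustrates that guard in isolation. It is not the service's actual task code: `Job` is a hypothetical stand-in for the CaptureJob record, `parse` is a hypothetical callable returning the new KnowledgeEntry ID, and retries and failure handling are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional

TERMINAL_STATUSES = {"completed", "failed"}


@dataclass
class Job:  # hypothetical stand-in for the CaptureJob record
    status: str
    result_entry_id: Optional[str] = None


def run_parse_task(job: Job, parse: Callable[[], str]) -> bool:
    """Parse the document unless the job is already terminal.

    Returns True if parsing ran, False if the task exited early.
    """
    if job.status in TERMINAL_STATUSES:
        return False  # re-delivered task: the work is already settled
    job.status = "processing"
    job.result_entry_id = parse()  # hypothetical: returns the new entry ID
    job.status = "completed"
    return True
```

Checking the stored status first means a duplicate delivery cannot create a second KnowledgeEntry for the same upload.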