
Capture Documents API

Base path: /api/v1/capture/documents

See API Reference for auth, errors, and pagination.

Document capture accepts file uploads for asynchronous parsing. The API stores the file, creates a CaptureJob record, and enqueues a Celery worker task. On success the parsed text is written to a KnowledgeEntry. Clients poll the job endpoint to track progress.

Plan requirement: Starter+ (@plan_required("starter", feature="document_capture") on all routes)

Supported File Formats

| Extension | MIME type | Parser |
| --- | --- | --- |
| .pdf | application/pdf | Unstructured.io |
| .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Unstructured.io |
| .doc | application/msword | Unstructured.io (requires LibreOffice on worker host) |
| .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Unstructured.io |
| .png | image/png | Tesseract OCR (ara+eng) |
| .jpg / .jpeg | image/jpeg | Tesseract OCR (ara+eng) |

Maximum file size: 50 MB
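
A client can reject obviously invalid files before spending bandwidth on an upload. The following is a minimal sketch of such a pre-check built from the table and size limit above. The function name `precheck` is hypothetical, and two details are assumptions: that extensions are matched case-insensitively, and that "50 MB" means 50 × 1024 × 1024 bytes. The server remains the authority on both.

```python
from pathlib import Path

# Extension -> MIME type, copied from the supported formats table.
SUPPORTED_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".doc": "application/msword",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
}

# Assumption: the documented 50 MB limit is interpreted as 50 MiB.
MAX_FILE_SIZE = 50 * 1024 * 1024


def precheck(filename: str, size_bytes: int) -> str:
    """Return the expected MIME type, or raise ValueError if the upload would likely be rejected."""
    # Assumption: the server compares extensions case-insensitively.
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError(f"file is {size_bytes} bytes; limit is {MAX_FILE_SIZE}")
    return SUPPORTED_TYPES[ext]
```

Passing this check does not guarantee acceptance; the API can still return INVALID_DOCUMENT_FILE or DOCUMENT_FILE_TOO_LARGE.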

Job Status Lifecycle

pending → processing → completed
                     ↘ failed

| Status | Meaning |
| --- | --- |
| pending | Job created; task queued but not yet picked up by a worker |
| processing | Worker has started parsing the document |
| completed | Parsing succeeded; result_entry_id points to the new KnowledgeEntry |
| failed | All retries exhausted; error_message contains the reason |

A background beat task (capture.documents.recover_stuck_jobs) periodically transitions any job stuck in processing for more than 30 minutes to failed (covers worker crashes).
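
For client code it can help to model the lifecycle explicitly. The sketch below encodes only the transitions drawn in the diagram above; the names `JobStatus`, `ALLOWED_TRANSITIONS`, `is_terminal`, and `can_transition` are hypothetical and not part of the API. Whether the server allows any other transition (for example during retries) is not stated here, so treat the table as an assumption.

```python
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Transitions exactly as drawn in the lifecycle diagram.
ALLOWED_TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


def is_terminal(status: str) -> bool:
    """True when no further transition is expected (completed or failed)."""
    return not ALLOWED_TRANSITIONS[JobStatus(status)]


def can_transition(current: str, new: str) -> bool:
    """True when the diagram allows moving from `current` to `new`."""
    return JobStatus(new) in ALLOWED_TRANSITIONS[JobStatus(current)]
```

A polling client only needs `is_terminal`; `can_transition` is useful mainly for sanity checks in tests or logs.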

Endpoints

POST /capture/documents/upload

Auth: JWT required — Plan: Starter+

Accepts a multipart file upload, validates the file, persists it to storage, creates a CaptureJob, and enqueues an async parsing task. Returns immediately with the new job ID.

Request

| Header | Value |
| --- | --- |
| Authorization | Bearer <access_token> |
| Content-Type | multipart/form-data |

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| file | file | yes | Document to upload. Field name must be exactly file. |

Response — 201 Created

{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "status": "pending",
  "message": "Document uploaded successfully. Parsing is in progress."
}

| Field | Type | Description |
| --- | --- | --- |
| job_id | UUID string | ID of the newly created CaptureJob. Use this to poll status. |
| status | string | Always "pending" at creation time. |
| message | string | Human-readable confirmation. |

Errors

| HTTP | Code | Cause |
| --- | --- | --- |
| 400 | BAD_REQUEST | file field missing or filename is empty |
| 400 | INVALID_DOCUMENT_FILE | Extension or MIME type not in supported list |
| 401 | AUTHENTICATION_FAILED | Missing or expired JWT |
| 402 | SUBSCRIPTION_REQUIRED | Subscription inactive or trial expired |
| 403 | PLAN_UPGRADE_REQUIRED | Plan below Starter |
| 413 | DOCUMENT_FILE_TOO_LARGE | File exceeds 50 MB limit |
| 500 | INTERNAL_ERROR | Unexpected error during storage or job creation |

curl -X POST https://api.knora.io/api/v1/capture/documents/upload \
  -H "Authorization: Bearer <access_token>" \
  -F "file=@/path/to/report.pdf"
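
The same upload can be made from Python. The sketch below uses only the standard library and builds the multipart body by hand, with the form field named `file` as the endpoint requires. The helper names `build_multipart` and `upload` are hypothetical, the host is taken from the curl example, and the filename is assumed to contain no double quotes or line breaks.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

# Host and base path as used in the curl examples.
BASE_URL = "https://api.knora.io/api/v1/capture/documents"


def build_multipart(filename: str, content: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        # The field name must be exactly "file".
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def upload(path: str, access_token: str) -> str:
    """Upload the file at `path` and return the job_id from the 201 response."""
    p = Path(path)
    body, content_type = build_multipart(p.name, p.read_bytes())
    req = urllib.request.Request(
        f"{BASE_URL}/upload",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": content_type,
        },
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["job_id"]
```

Callers would typically catch `urllib.error.HTTPError` and branch on its `code` attribute (400, 401, 402, 403, 413, 500) as listed in the error table.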

GET /capture/documents/jobs

Auth: JWT required — Plan: Starter+

Returns a paginated list of the authenticated user's document capture jobs, scoped to their organisation. Results are ordered most-recently created first.

Query Parameters

| Parameter | Type | Default | Max | Description |
| --- | --- | --- | --- | --- |
| page | integer | 1 | | Page number (1-indexed) |
| per_page | integer | 20 | 100 | Results per page; values above 100 are clamped |

Response — 200 OK

{
  "jobs": [
    {
      "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "org_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
      "type": "document",
      "status": "completed",
      "source_filename": "annual_report.pdf",
      "file_size": 1048576,
      "mime_type": "application/pdf",
      "created_by": "1d5a8b3e-4f2c-4a1b-9c3d-8e7f6a5b4c3d",
      "created_at": "2026-06-01T10:00:00+00:00",
      "updated_at": "2026-06-01T10:00:45+00:00",
      "completed_at": "2026-06-01T10:00:45+00:00",
      "error_message": null,
      "result_entry_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "metadata_json": "{\"page_count\": 12, \"parsed_by\": \"unstructured\"}"
    }
  ],
  "total": 42,
  "page": 1,
  "per_page": 20
}

Job object fields

| Field | Type | Nullable | Description |
| --- | --- | --- | --- |
| id | UUID string | no | Unique job identifier |
| org_id | UUID string | no | Organisation that owns the job |
| type | string | no | Always "document" |
| status | string | no | pending, processing, completed, or failed |
| source_filename | string | no | Original filename supplied by the client |
| file_size | integer | no | File size in bytes |
| mime_type | string | no | Detected or declared MIME type |
| created_by | UUID string | no | User who uploaded the file |
| created_at | ISO 8601 string | yes | Job creation timestamp (UTC) |
| updated_at | ISO 8601 string | yes | Last status change timestamp (UTC) |
| completed_at | ISO 8601 string | yes | Timestamp when the job reached a terminal state |
| error_message | string | yes | Populated only when status is "failed" |
| result_entry_id | UUID string | yes | ID of the KnowledgeEntry created on success |
| metadata_json | JSON string | yes | Parser metadata: page_count, parsed_by ("unstructured" or "tesseract") |
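
Note that metadata_json is a JSON document encoded as a string, so it needs a second decoding step after the response body itself is parsed. The sketch below shows one way to turn a job object into a typed value; the `CaptureJob` dataclass and `parse_job` function are hypothetical client-side helpers that keep only a subset of fields.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional


@dataclass
class CaptureJob:
    id: str
    status: str
    source_filename: str
    completed_at: Optional[datetime]
    error_message: Optional[str]
    result_entry_id: Optional[str]
    metadata: dict[str, Any]


def parse_job(raw: dict[str, Any]) -> CaptureJob:
    """Convert one job object from the API into a CaptureJob."""
    completed = raw.get("completed_at")
    meta = raw.get("metadata_json")
    return CaptureJob(
        id=raw["id"],
        status=raw["status"],
        source_filename=raw["source_filename"],
        # Nullable timestamp; the sample values carry an explicit +00:00 offset.
        completed_at=datetime.fromisoformat(completed) if completed else None,
        error_message=raw.get("error_message"),
        result_entry_id=raw.get("result_entry_id"),
        # metadata_json is a string containing JSON, or null.
        metadata=json.loads(meta) if meta else {},
    )
```

With the sample response above, `parse_job(body["jobs"][0]).metadata["page_count"]` yields the integer 12.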

Errors

| HTTP | Code | Cause |
| --- | --- | --- |
| 401 | AUTHENTICATION_FAILED | Missing or expired JWT |
| 402 | SUBSCRIPTION_REQUIRED | Subscription inactive or trial expired |
| 403 | PLAN_UPGRADE_REQUIRED | Plan below Starter |

curl "https://api.knora.io/api/v1/capture/documents/jobs?page=1&per_page=20" \
  -H "Authorization: Bearer <access_token>"
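
To walk the full job list, a client can request successive pages until it has seen `total` items. The generator below is a sketch of that loop with the HTTP call injected as a callable, so it stays independent of any particular HTTP library. `iter_jobs` and `fetch_page` are hypothetical names, and the code assumes `total` is the number of jobs across all pages.

```python
from typing import Any, Callable, Iterator

Page = dict[str, Any]


def iter_jobs(
    fetch_page: Callable[[int, int], Page],
    per_page: int = 100,
) -> Iterator[dict[str, Any]]:
    """Yield every job, requesting pages 1, 2, ... until `total` items were seen.

    `fetch_page(page, per_page)` must return the decoded JSON body of
    GET /capture/documents/jobs for that page.
    """
    page = 1
    seen = 0
    while True:
        body = fetch_page(page, per_page)
        jobs = body["jobs"]
        yield from jobs
        seen += len(jobs)
        # Stop on an empty page as well, in case jobs were removed mid-iteration.
        if not jobs or seen >= body["total"]:
            return
        page += 1
```

Because results are ordered newest first, jobs created while iterating can shift later pages; callers that need exact results may want to de-duplicate by `id`.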

GET /capture/documents/jobs/{job_id}

Auth: JWT required — Plan: Starter+

Returns the current status and full detail of a single capture job. Use this endpoint to poll for completion after uploading a document. The job must belong to the requesting user and their organisation.

Path Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| job_id | UUID string | The job_id returned by the upload endpoint |

Response — 200 OK

{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "org_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
  "type": "document",
  "status": "completed",
  "source_filename": "invoice_q1.xlsx",
  "file_size": 204800,
  "mime_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  "created_by": "1d5a8b3e-4f2c-4a1b-9c3d-8e7f6a5b4c3d",
  "created_at": "2026-06-01T09:30:00+00:00",
  "updated_at": "2026-06-01T09:30:22+00:00",
  "completed_at": "2026-06-01T09:30:22+00:00",
  "error_message": null,
  "result_entry_id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
  "metadata_json": "{\"page_count\": 1, \"parsed_by\": \"unstructured\"}"
}

The response shape is identical to a single item in the jobs array from the list endpoint.

Errors

| HTTP | Code | Cause |
| --- | --- | --- |
| 400 | BAD_REQUEST | job_id is not a valid UUID |
| 401 | AUTHENTICATION_FAILED | Missing or expired JWT |
| 402 | SUBSCRIPTION_REQUIRED | Subscription inactive or trial expired |
| 403 | PLAN_UPGRADE_REQUIRED | Plan below Starter |
| 404 | DOCUMENT_JOB_NOT_FOUND | No job with that ID exists for the requesting user within their organisation |

curl "https://api.knora.io/api/v1/capture/documents/jobs/3fa85f64-5717-4562-b3fc-2c963f66afa6" \
  -H "Authorization: Bearer <access_token>"

Polling Pattern

Because parsing is asynchronous, clients should poll after uploading:

  1. POST /upload — receive job_id, status: "pending"
  2. Poll GET /jobs/{job_id} every 2–5 seconds
  3. Stop when status is "completed" or "failed"
  4. On "completed": use result_entry_id to fetch the parsed KnowledgeEntry
  5. On "failed": read error_message and prompt the user to re-upload

Most documents complete in under 30 seconds; large PDFs may take longer.
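
The polling steps above can be sketched as a small loop. The function `wait_for_job` and its parameters are hypothetical; the 3-second interval sits inside the recommended 2–5 second range, and the 300-second overall timeout is an assumption chosen for illustration, not a documented limit. The HTTP call is injected as `get_job` so the loop can be tested without a network.

```python
import time
from typing import Any, Callable

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    get_job: Callable[[str], dict[str, Any]],
    job_id: str,
    interval: float = 3.0,   # within the recommended 2-5 second range
    timeout: float = 300.0,  # assumption: give up after five minutes
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict[str, Any]:
    """Poll until the job is completed or failed, then return the job object.

    `get_job(job_id)` must return the decoded JSON body of
    GET /capture/documents/jobs/{job_id}.
    """
    deadline = clock() + timeout
    while True:
        job = get_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout}s")
        sleep(interval)
```

On return, the caller checks `job["status"]`: for "completed" it reads `result_entry_id`, and for "failed" it reads `error_message`.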

Notes

  • Tenant isolation: Jobs are scoped to both user_id and org_id. A user cannot access jobs belonging to other users within the same organisation. The org_id is passed to the Celery task to prevent cross-tenant task injection.
  • Idempotency: If the task for a job that is already completed or failed is re-delivered (e.g. after a worker SIGKILL), it exits early without re-parsing.
  • File storage: Files are persisted to the configured storage backend (key path: documents/<org_id>/<filename>) before the job record is created. The storage key is not exposed in API responses.
  • Knowledge entry status: Parsed entries are created with status: needs_review and confidence: medium, requiring human review before use in knowledge operations.
  • Image OCR: Tesseract uses ara+eng. Both language packs must be installed on the worker host.
  • .doc files: Legacy Word 97-2003 format requires LibreOffice on the worker host.
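
The tenant isolation and idempotency notes describe two guards that run before any parsing happens. The sketch below illustrates that ordering only; it is not the service's actual worker code. `Job`, `run_parse_task`, and the returned labels are all hypothetical, and a real Celery task would load the job from the database rather than receive it as an argument.

```python
from dataclasses import dataclass
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


@dataclass
class Job:
    id: str
    org_id: str
    status: str = "pending"


def run_parse_task(job: Job, task_org_id: str, parse: Callable[[Job], None]) -> str:
    """Apply the two guards, then parse. Returns a label describing what happened."""
    # Tenant guard: the org_id passed with the task must match the job's owner.
    if job.org_id != task_org_id:
        return "rejected"
    # Idempotency guard: a re-delivered task for a finished job does nothing.
    if job.status in TERMINAL_STATUSES:
        return "skipped"
    job.status = "processing"
    parse(job)
    job.status = "completed"
    return "parsed"
```

Retry handling and the transition to failed are omitted here to keep the two guards in focus.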