Capture Documents API
Base path: /api/v1/capture/documents
See API Reference for auth, errors, and pagination.
Document capture accepts file uploads for asynchronous parsing. The API stores the file, creates a CaptureJob record, and enqueues a Celery worker task. On success the parsed text is written to a KnowledgeEntry. Clients poll the job endpoint to track progress.
Plan requirement: Starter+ (@plan_required("starter", feature="document_capture") on all routes)
Supported File Formats

| Extension | MIME type | Parser |
|---|---|---|
| `.pdf` | `application/pdf` | Unstructured.io |
| `.docx` | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | Unstructured.io |
| `.doc` | `application/msword` | Unstructured.io (requires LibreOffice on worker host) |
| `.xlsx` | `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` | Unstructured.io |
| `.png` | `image/png` | Tesseract OCR (`ara+eng`) |
| `.jpg` / `.jpeg` | `image/jpeg` | Tesseract OCR (`ara+eng`) |

Maximum file size: 50 MB
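
A client can reject obviously unsupported files before spending bandwidth on an upload. The sketch below mirrors the table and size limit above. `SUPPORTED_TYPES`, `MAX_FILE_SIZE`, and `check_upload` are hypothetical client-side names, not part of the API, and reading "50 MB" as 50 × 1024 × 1024 bytes is an assumption. The server remains the authority on what is accepted, and the order in which it applies these checks is not documented here.

```python
from pathlib import PurePath
from typing import Optional

# Extension -> MIME type, copied from the supported formats table.
SUPPORTED_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".doc": "application/msword",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
}

# Assumption: "50 MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_FILE_SIZE = 50 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return the error code the upload would probably trigger, or None if it looks acceptable."""
    if not filename:
        return "BAD_REQUEST"
    suffix = PurePath(filename).suffix.lower()  # case-insensitive extension match
    if suffix not in SUPPORTED_TYPES:
        return "INVALID_DOCUMENT_FILE"
    if size_bytes > MAX_FILE_SIZE:
        return "DOCUMENT_FILE_TOO_LARGE"
    return None
```

For example, `check_upload("notes.txt", 10)` returns `"INVALID_DOCUMENT_FILE"`, while `check_upload("report.pdf", 1024)` returns `None`.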
Job Status Lifecycle

| Status | Meaning |
|---|---|
| `pending` | Job created; task queued but not yet picked up by a worker |
| `processing` | Worker has started parsing the document |
| `completed` | Parsing succeeded; `result_entry_id` points to the new KnowledgeEntry |
| `failed` | All retries exhausted; `error_message` contains the reason |

A background beat task (capture.documents.recover_stuck_jobs) periodically transitions any job stuck in processing for more than 30 minutes to failed (covers worker crashes).
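
The recovery rule can be pictured as a small predicate. This is an illustrative sketch only: the real `recover_stuck_jobs` task is not shown in this document, and both the use of `updated_at` as the reference timestamp and the names `STUCK_AFTER`, `is_terminal`, and `is_stuck` are assumptions.

```python
from datetime import datetime, timedelta

# Terminal states from the lifecycle table: no further transitions are expected.
TERMINAL_STATUSES = {"completed", "failed"}

# Threshold quoted in the text above.
STUCK_AFTER = timedelta(minutes=30)


def is_terminal(status: str) -> bool:
    """True once a job can no longer change state."""
    return status in TERMINAL_STATUSES


def is_stuck(status: str, updated_at: datetime, now: datetime) -> bool:
    """True if a job has sat in `processing` for longer than the threshold.

    Assumption: `updated_at` marks when the job entered `processing`.
    """
    return status == "processing" and (now - updated_at) > STUCK_AFTER
```

Only `processing` jobs qualify; a job that stays `pending` for a long time is not covered by the rule as described.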
Endpoints
POST /capture/documents/upload
Auth: JWT required — Plan: Starter+
Accepts a multipart file upload, validates the file, persists it to storage, creates a CaptureJob, and enqueues an async parsing task. Returns immediately with the new job ID.
Request

| Header | Value |
|---|---|
| `Authorization` | `Bearer <access_token>` |
| `Content-Type` | `multipart/form-data` |

| Field | Type | Required | Description |
|---|---|---|---|
| `file` | file | yes | Document to upload. Field name must be exactly `file`. |

Response — 201 Created
```json
{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "status": "pending",
  "message": "Document uploaded successfully. Parsing is in progress."
}
```

| Field | Type | Description |
|---|---|---|
| `job_id` | UUID string | ID of the newly created CaptureJob. Use this to poll status. |
| `status` | string | Always `"pending"` at creation time. |
| `message` | string | Human-readable confirmation. |

Errors

| HTTP | Code | Cause |
|---|---|---|
| 400 | `BAD_REQUEST` | `file` field missing or filename is empty |
| 400 | `INVALID_DOCUMENT_FILE` | Extension or MIME type not in supported list |
| 401 | `AUTHENTICATION_FAILED` | Missing or expired JWT |
| 402 | `SUBSCRIPTION_REQUIRED` | Subscription inactive or trial expired |
| 403 | `PLAN_UPGRADE_REQUIRED` | Plan below Starter |
| 413 | `DOCUMENT_FILE_TOO_LARGE` | File exceeds 50 MB limit |
| 500 | `INTERNAL_ERROR` | Unexpected error during storage or job creation |

```bash
curl -X POST https://api.knora.io/api/v1/capture/documents/upload \
  -H "Authorization: Bearer <access_token>" \
  -F "file=@/path/to/report.pdf"
```
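
For clients that cannot shell out to curl, the same request can be assembled with the Python standard library. This is a minimal sketch, not an official client: `build_upload_request` is a hypothetical helper, and it assumes the filename contains no double quotes or line breaks, since it is placed in the `Content-Disposition` header unescaped.

```python
import urllib.request
import uuid

UPLOAD_URL = "https://api.knora.io/api/v1/capture/documents/upload"


def build_upload_request(
    token: str, filename: str, content: bytes, mime_type: str
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST with a single `file` part."""
    boundary = uuid.uuid4().hex  # random delimiter between multipart sections
    head = (
        f"--{boundary}\r\n"
        # The field name must be exactly "file", as required by the endpoint.
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {mime_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        UPLOAD_URL,
        data=head + content + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends the upload; on a 201 response the JSON body carries the `job_id` to poll.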
GET /capture/documents/jobs
Auth: JWT required — Plan: Starter+
Returns a paginated list of the authenticated user's document capture jobs, scoped to their organisation. Results are ordered most-recently created first.
Query Parameters

| Parameter | Type | Default | Max | Description |
|---|---|---|---|---|
| `page` | integer | 1 | none | Page number (1-indexed) |
| `per_page` | integer | 20 | 100 | Results per page; values above 100 are clamped |

Response — 200 OK
```json
{
  "jobs": [
    {
      "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "org_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
      "type": "document",
      "status": "completed",
      "source_filename": "annual_report.pdf",
      "file_size": 1048576,
      "mime_type": "application/pdf",
      "created_by": "1d5a8b3e-4f2c-4a1b-9c3d-8e7f6a5b4c3d",
      "created_at": "2026-06-01T10:00:00+00:00",
      "updated_at": "2026-06-01T10:00:45+00:00",
      "completed_at": "2026-06-01T10:00:45+00:00",
      "error_message": null,
      "result_entry_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "metadata_json": "{\"page_count\": 12, \"parsed_by\": \"unstructured\"}"
    }
  ],
  "total": 42,
  "page": 1,
  "per_page": 20
}
```
Job object fields

| Field | Type | Nullable | Description |
|---|---|---|---|
| `id` | UUID string | no | Unique job identifier |
| `org_id` | UUID string | no | Organisation that owns the job |
| `type` | string | no | Always `"document"` |
| `status` | string | no | `pending`, `processing`, `completed`, or `failed` |
| `source_filename` | string | no | Original filename supplied by the client |
| `file_size` | integer | no | File size in bytes |
| `mime_type` | string | no | Detected or declared MIME type |
| `created_by` | UUID string | no | User who uploaded the file |
| `created_at` | ISO 8601 string | yes | Job creation timestamp (UTC) |
| `updated_at` | ISO 8601 string | yes | Last status change timestamp (UTC) |
| `completed_at` | ISO 8601 string | yes | Timestamp when the job reached a terminal state |
| `error_message` | string | yes | Populated only when status is `"failed"` |
| `result_entry_id` | UUID string | yes | ID of the KnowledgeEntry created on success |
| `metadata_json` | JSON string | yes | Parser metadata: `page_count`, `parsed_by` (`"unstructured"` or `"tesseract"`) |

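
Because `metadata_json` is a JSON document encoded as a string inside the response, clients need a second decoding step after parsing the response body. A minimal sketch, with `parse_metadata` as a hypothetical helper name:

```python
import json
from typing import Any, Optional


def parse_metadata(job: dict) -> Optional[dict]:
    """Decode the nested `metadata_json` string of a job object.

    Returns None when the field is null or absent (the field is nullable).
    """
    raw: Any = job.get("metadata_json")
    if raw is None:
        return None
    return json.loads(raw)
```

Applied to the example job above, this yields a dictionary with `page_count` equal to 12 and `parsed_by` equal to `"unstructured"`.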
Errors

| HTTP | Code | Cause |
|---|---|---|
| 401 | `AUTHENTICATION_FAILED` | Missing or expired JWT |
| 402 | `SUBSCRIPTION_REQUIRED` | Subscription inactive or trial expired |
| 403 | `PLAN_UPGRADE_REQUIRED` | Plan below Starter |

```bash
curl "https://api.knora.io/api/v1/capture/documents/jobs?page=1&per_page=20" \
  -H "Authorization: Bearer <access_token>"
```
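
Walking every page of the list can be sketched as a generator. The HTTP call is left to a caller-supplied `fetch_page(page, per_page)` function that returns the decoded response body; that function, `iter_jobs`, and `MAX_PER_PAGE` are hypothetical names. The sketch also assumes `total` counts jobs across all pages, which the example response suggests but this document does not state outright.

```python
import math
from typing import Callable, Iterator

MAX_PER_PAGE = 100  # the server clamps larger values, so clamp locally too


def iter_jobs(
    fetch_page: Callable[[int, int], dict], per_page: int = 20
) -> Iterator[dict]:
    """Yield every job, requesting pages until the reported total is covered."""
    per_page = min(per_page, MAX_PER_PAGE)
    page = 1
    while True:
        body = fetch_page(page, per_page)
        yield from body["jobs"]
        # Assumption: `total` is the number of jobs across all pages.
        total_pages = math.ceil(body["total"] / body["per_page"])
        if page >= total_pages or not body["jobs"]:
            return
        page += 1
```

The empty-page check is a safety net so the loop still ends if jobs disappear between requests.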
GET /capture/documents/jobs/{job_id}
Auth: JWT required — Plan: Starter+
Returns the current status and full detail of a single capture job. Use this endpoint to poll for completion after uploading a document. The job must belong to the requesting user and their organisation.
Path Parameters

| Parameter | Type | Description |
|---|---|---|
| `job_id` | UUID string | The `job_id` returned by the upload endpoint |

Response — 200 OK
```json
{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "org_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
  "type": "document",
  "status": "completed",
  "source_filename": "invoice_q1.xlsx",
  "file_size": 204800,
  "mime_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  "created_by": "1d5a8b3e-4f2c-4a1b-9c3d-8e7f6a5b4c3d",
  "created_at": "2026-06-01T09:30:00+00:00",
  "updated_at": "2026-06-01T09:30:22+00:00",
  "completed_at": "2026-06-01T09:30:22+00:00",
  "error_message": null,
  "result_entry_id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
  "metadata_json": "{\"page_count\": 1, \"parsed_by\": \"unstructured\"}"
}
```
The response shape is identical to a single item in the jobs array from the list endpoint.
Errors

| HTTP | Code | Cause |
|---|---|---|
| 400 | `BAD_REQUEST` | `job_id` is not a valid UUID |
| 401 | `AUTHENTICATION_FAILED` | Missing or expired JWT |
| 402 | `SUBSCRIPTION_REQUIRED` | Subscription inactive or trial expired |
| 403 | `PLAN_UPGRADE_REQUIRED` | Plan below Starter |
| 404 | `DOCUMENT_JOB_NOT_FOUND` | No job with that ID exists for the requesting user's organisation |

```bash
curl "https://api.knora.io/api/v1/capture/documents/jobs/3fa85f64-5717-4562-b3fc-2c963f66afa6" \
  -H "Authorization: Bearer <access_token>"
```
Polling Pattern
Because parsing is asynchronous, clients should poll after uploading:

1. `POST /upload`: receive `job_id` and `status: "pending"`
2. Poll `GET /jobs/{job_id}` every 2–5 seconds
3. Stop when `status` is `"completed"` or `"failed"`
4. On `"completed"`: use `result_entry_id` to fetch the parsed `KnowledgeEntry`
5. On `"failed"`: read `error_message` and prompt the user to re-upload

Most documents complete in under 30 seconds; large PDFs may take longer.
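
The polling steps above can be sketched as a small loop. The job lookup is passed in as `fetch_job`, a caller-supplied function that performs `GET /jobs/{job_id}` and returns the decoded job object; `wait_for_job`, its parameters, and the 300-second client-side timeout are all assumptions made for illustration, not values defined by the API.

```python
import time
from typing import Callable

TERMINAL = {"completed", "failed"}


def wait_for_job(
    fetch_job: Callable[[], dict],
    interval: float = 3.0,    # within the suggested 2-5 second range
    timeout: float = 300.0,   # assumption: give up client-side after 5 minutes
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status, then return the job object."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job['status']!r} after {timeout} s")
        sleep(interval)
```

The caller then branches on the returned `status`: read `result_entry_id` on `"completed"`, or surface `error_message` on `"failed"`. Injecting `sleep` keeps the loop testable without real delays.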
Notes

- Tenant isolation: Jobs are scoped to both `user_id` and `org_id`. A user cannot access jobs belonging to other users within the same organisation. The `org_id` is passed to the Celery task to prevent cross-tenant task injection.
- Idempotency: If a completed or failed job is re-delivered (e.g. after a worker SIGKILL), the task exits early without re-parsing.
- File storage: Files are persisted to the configured storage backend (key path: `documents/<org_id>/<filename>`) before the job record is created. The storage key is not exposed in API responses.
- Knowledge entry status: Parsed entries are created with `status: needs_review` and `confidence: medium`, requiring human review before use in knowledge operations.
- Image OCR: Tesseract uses `ara+eng`. Both language packs must be installed on the worker host.
- `.doc` files: Legacy Word 97-2003 format requires LibreOffice on the worker host.
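
The tenant-isolation and idempotency notes describe two guards that run before any parsing happens. The sketch below illustrates that ordering with plain dictionaries; it is not the actual Celery task, and `run_parse_task`, its return values, and the `parse` callback are hypothetical. Retries and failure handling are left out.

```python
from typing import Callable


def run_parse_task(job: dict, org_id: str, parse: Callable[[dict], str]) -> str:
    """Illustrative worker entry point: guard first, parse second."""
    # Tenant guard: ignore a task whose org_id does not match the job's owner.
    if job["org_id"] != org_id:
        return "rejected"
    # Idempotency guard: a re-delivered task for a finished job does nothing.
    if job["status"] in ("completed", "failed"):
        return "skipped"
    job["status"] = "processing"
    # `parse` stands in for the real parser and returns the new entry's ID.
    job["result_entry_id"] = parse(job)
    job["status"] = "completed"
    return "parsed"
```

Running the function twice on the same job parses once and skips the second delivery, which is the behaviour the idempotency note describes.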