Uploading Documents

This guide covers everything you need to know about uploading and managing documents in Orka, including supported formats, best practices, and troubleshooting.

Supported File Types

| Format | Extensions | Max Size | Notes |
|--------|------------|----------|-------|
| PDF | .pdf | 50 MB | Best for structured documents |
| Word | .docx, .doc | 25 MB | Preserves formatting |
| Text | .txt, .md | 10 MB | Plain text and Markdown |
| HTML | .html, .htm | 10 MB | Web content |

Need support for additional file types? Contact us at support@orka.ai.
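
The limits in the table above can be checked on the client before an upload is attempted. The following is a minimal sketch, not part of the Orka SDK: `SUPPORTED_TYPES` and `validateFile` are hypothetical names, and it assumes 1 MB means 1024 × 1024 bytes, which this guide does not specify.

```javascript
// Size limits copied from the Supported File Types table (in megabytes).
const SUPPORTED_TYPES = {
  '.pdf': 50,
  '.docx': 25,
  '.doc': 25,
  '.txt': 10,
  '.md': 10,
  '.html': 10,
  '.htm': 10,
};

// Hypothetical helper: checks a file name and byte size against the table.
function validateFile(filename, sizeInBytes) {
  const dot = filename.lastIndexOf('.');
  const ext = dot === -1 ? '' : filename.slice(dot).toLowerCase();
  const maxMb = SUPPORTED_TYPES[ext];

  if (maxMb === undefined) {
    return { ok: false, reason: `unsupported extension "${ext}"` };
  }
  // Assumption: 1 MB = 1024 * 1024 bytes.
  if (sizeInBytes > maxMb * 1024 * 1024) {
    return { ok: false, reason: `exceeds the ${maxMb} MB limit` };
  }
  return { ok: true, reason: null };
}
```

Running a check like this first avoids a round trip for files that would be rejected with `unsupported_format` or `file_too_large`.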

Creating a Datastore

Before uploading documents, you need a datastore. The datastore creation wizard guides you through configuration.

Using the Dashboard

Navigate to Datastores in the sidebar and click Create Datastore to begin.

Step 1: General

Basic information and document upload:

  • Name: A descriptive name for the datastore
  • Description: What this datastore contains
  • File Upload: Drag and drop files or click to browse
  • Third-Party Connection (Coming Soon): Connect to external sources like Google Drive, Notion, etc.

Step 2: Metadata Fields (Coming Soon)

Define custom metadata fields for documents:

  • Field Name: Identifier for the metadata field
  • Field Type: Select from:
    • Select: Single choice from predefined options
    • Multi-select: Multiple choices from predefined options
    • Date: Date values
    • Text: Free-form text

Metadata fields enable filtering during retrieval.

Step 3: Parsing (Coming Soon)

Configure document parsing options:

  • Figure Captioning: Extract and caption figures/images
  • Table Splitting: Split tables into separate chunks

Step 4: Chunking Strategy (Coming Soon)

Control how documents are split into chunks:

  • Chunking Mode: Choose one of four strategies:
    • Fixed: Split at fixed character intervals
    • Semantic: Split at semantic boundaries (paragraphs, sections)
    • Recursive: Hierarchical splitting for nested content
    • Sentence: Split at sentence boundaries
  • Chunk Length: Target size for each chunk
  • Chunk Overlap: Overlap between consecutive chunks

Steps marked "Coming Soon" are visible in the wizard but not yet configurable. Default values are applied automatically.
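
To make Chunk Length and Chunk Overlap concrete, here is an illustrative sketch of fixed-interval chunking. It is not Orka's implementation; `chunkFixed` is a hypothetical function, and it assumes both values are measured in characters.

```javascript
// Illustrative only: split text into fixed-length chunks, where each chunk
// repeats the last `chunkOverlap` characters of the previous one.
function chunkFixed(text, chunkLength, chunkOverlap) {
  if (chunkLength <= 0) {
    throw new RangeError('chunkLength must be positive');
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkLength) {
    throw new RangeError('chunkOverlap must be >= 0 and < chunkLength');
  }

  const step = chunkLength - chunkOverlap; // how far each chunk start advances
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkLength));
    if (start + chunkLength >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

console.log(chunkFixed('abcdefghij', 4, 1)); // [ 'abcd', 'defg', 'ghij' ]
```

A larger overlap keeps more context across chunk boundaries, but it also produces more chunks for the same document.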


Uploading Documents

Once you have a datastore, you can upload documents.

Via Dashboard

  1. Navigate to Datastores in the sidebar
  2. Select your datastore
  3. Click Upload Documents
  4. Drag and drop files or click to browse
  5. Wait for processing to complete

Bulk Upload

You can upload multiple files at once:

  • Drag multiple files to the upload area
  • Or hold Ctrl (Windows/Linux) or Cmd (macOS) while clicking to select multiple files in the file browser

Uploading via API

Single File Upload

```javascript
import fs from 'fs';

// Assumes `client` is an already-initialized Orka SDK client.
const document = await client.documents.upload({
  datastore_id: 'ds_abc123',
  file: fs.createReadStream('./document.pdf'),
  name: 'Product Manual', // Optional custom name
});

console.log('Uploaded:', document.id);
console.log('Status:', document.status);
```

Upload with Fetch (Browser)

```javascript
// apiKey: your Orka API key, supplied by the caller
async function uploadDocument(file, datastoreId, apiKey) {
  const formData = new FormData();
  formData.append('datastore_id', datastoreId);
  formData.append('file', file);

  const response = await fetch('https://api.orka.ai/v1/documents', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
    },
    body: formData,
  });

  // fetch only rejects on network errors, so check the HTTP status explicitly
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }

  return response.json();
}

// Usage with file input
const apiKey = 'YOUR_API_KEY'; // placeholder
const fileInput = document.querySelector('input[type="file"]');
fileInput.addEventListener('change', async (e) => {
  const file = e.target.files[0];
  if (!file) return; // selection was cleared
  const result = await uploadDocument(file, 'ds_abc123', apiKey);
  console.log('Uploaded:', result.id);
});
```

Bulk Upload

```javascript
import fs from 'fs';

// Assumes `client` is an already-initialized Orka SDK client.
async function uploadMultiple(files, datastoreId) {
  // Promise.all rejects as soon as any single upload fails
  const results = await Promise.all(
    files.map(file =>
      client.documents.upload({
        datastore_id: datastoreId,
        file: fs.createReadStream(file),
      })
    )
  );

  return results;
}

const files = ['./doc1.pdf', './doc2.pdf', './doc3.pdf'];
const uploaded = await uploadMultiple(files, 'ds_abc123');
```
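
Because `Promise.all` rejects on the first failure, one bad file hides the outcome of every other upload in the batch. As a sketch of one alternative, the helper below uses `Promise.allSettled` to report successes and failures separately. `uploadAllSettled` and its `uploadOne` callback are hypothetical names; in practice `uploadOne` would wrap `client.documents.upload`.

```javascript
// Sketch: attempt every upload and collect successes and failures separately.
// `uploadOne` is a hypothetical callback that uploads a single file, e.g.
//   file => client.documents.upload({
//     datastore_id: 'ds_abc123',
//     file: fs.createReadStream(file),
//   })
async function uploadAllSettled(files, uploadOne) {
  const settled = await Promise.allSettled(
    // The async wrapper turns synchronous throws into rejections as well.
    files.map(async file => uploadOne(file))
  );

  const uploaded = [];
  const failed = [];
  settled.forEach((result, index) => {
    if (result.status === 'fulfilled') {
      uploaded.push(result.value);
    } else {
      failed.push({ file: files[index], error: result.reason });
    }
  });

  return { uploaded, failed };
}
```

The `failed` list keeps the original file path next to each error, so only those files need to be retried.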

Document Processing

After upload, documents go through several processing stages:

Upload → Parsing → Chunking → Embedding → Indexing → Ready

Processing Status

| Status | Description |
|--------|-------------|
| pending | Waiting in queue |
| processing | Currently being processed |
| completed | Ready for queries |
| failed | Processing failed |

Checking Status

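```javascript
// Polls until the document reaches a terminal status or the timeout elapses.
// Assumes `client` is an already-initialized Orka SDK client.
async function waitForDocument(
  documentId,
  { intervalMs = 2000, timeoutMs = 10 * 60 * 1000 } = {}
) {
  const deadline = Date.now() + timeoutMs;

  while (Date.now() < deadline) {
    const doc = await client.documents.get(documentId);

    switch (doc.status) {
      case 'completed':
        console.log('Document ready!');
        return doc;

      case 'failed':
        throw new Error(`Processing failed: ${doc.error}`);

      default: // 'pending' or 'processing'
        console.log(`Status: ${doc.status}...`);
        await new Promise(r => setTimeout(r, intervalMs));
    }
  }

  throw new Error(`Timed out waiting for document ${documentId}`);
}
```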

Processing Time

Typical processing times:

| Document Size | Estimated Time |
|--------------|----------------|
| < 10 pages | < 30 seconds |
| 10-50 pages | 30-60 seconds |
| 50-200 pages | 1-3 minutes |
| 200+ pages | 3-10 minutes |

Large documents are processed asynchronously. You can continue uploading more documents while others are processing.
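
The estimates above can be turned into a rough lookup, for example to choose a polling timeout. This is a sketch only: `estimateProcessingSeconds` is a hypothetical helper, and because the table's brackets share boundary values, it assumes that 50 pages falls in the 10-50 bracket and 200 pages in the 50-200 bracket.

```javascript
// Hypothetical helper: map a page count to the estimated range, in seconds,
// from the Processing Time table.
function estimateProcessingSeconds(pageCount) {
  if (!Number.isInteger(pageCount) || pageCount < 1) {
    throw new RangeError('pageCount must be a positive integer');
  }
  if (pageCount < 10) return { min: 0, max: 30 };
  if (pageCount <= 50) return { min: 30, max: 60 }; // assumption: 50 belongs here
  if (pageCount <= 200) return { min: 60, max: 180 }; // assumption: 200 belongs here
  return { min: 180, max: 600 };
}

console.log(estimateProcessingSeconds(120)); // { min: 60, max: 180 }
```

These are typical times rather than guarantees, so a timeout derived from `max` should include a generous margin.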

Best Practices

Document Preparation

  1. Use native PDFs when possible (not scanned images)
  2. Ensure text is selectable in PDFs
  3. Remove password protection before uploading
  4. Use descriptive filenames for easy identification

Organizing Documents

Create logical datastores for different document types:

```javascript
// Separate datastores by category
const productDocs = await client.datastores.create({
  name: 'Product Documentation',
});

const supportDocs = await client.datastores.create({
  name: 'Support Articles',
});

const policyDocs = await client.datastores.create({
  name: 'Company Policies',
});
```

Document Quality

For best results:

  • Clear formatting: Use headings, lists, and paragraphs
  • Consistent structure: Similar documents should follow similar formats
  • Complete content: Include all relevant information
  • No duplicates: Avoid uploading the same document multiple times
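
One way to catch duplicates before uploading is to compare content hashes locally. The sketch below is an assumption about workflow, not an Orka feature: `findDuplicates` is a hypothetical helper that takes file contents already read into memory and detects byte-identical files only. It will not catch the same document exported twice with different metadata.

```javascript
// Sketch: flag files whose content is byte-identical to an earlier file.
// `files` is an array of { name, content }, where content is a string or Buffer.
async function findDuplicates(files) {
  // Dynamic import works from both ES modules and CommonJS in Node.js.
  const { createHash } = await import('node:crypto');

  const firstSeen = new Map(); // digest -> name of the first file with that content
  const duplicates = [];

  for (const { name, content } of files) {
    const digest = createHash('sha256').update(content).digest('hex');
    if (firstSeen.has(digest)) {
      duplicates.push({ name, duplicateOf: firstSeen.get(digest) });
    } else {
      firstSeen.set(digest, name);
    }
  }

  return duplicates;
}
```

Files reported in the result can be skipped, and only the first copy of each uploaded.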

Handling Large Documents

Split Large PDFs

For documents over 100 pages, consider splitting:

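```bash
# Using pdftk (install separately); burst writes one PDF per page.
# %03d zero-pads to three digits so files stay in order past page 99.
pdftk large-document.pdf burst output page_%03d.pdf
```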
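
Splitting into multi-page parts rather than single pages keeps related content together. As a sketch, the hypothetical helper `pageRanges` below computes inclusive page ranges of at most 100 pages each. Each range can then be passed to pdftk's `cat` operation, for example `pdftk large-document.pdf cat 1-100 output part_1.pdf`.

```javascript
// Hypothetical helper: split a page count into inclusive, 1-based ranges
// of at most `maxPagesPerPart` pages.
function pageRanges(totalPages, maxPagesPerPart = 100) {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new RangeError('totalPages must be a positive integer');
  }
  if (!Number.isInteger(maxPagesPerPart) || maxPagesPerPart < 1) {
    throw new RangeError('maxPagesPerPart must be a positive integer');
  }

  const ranges = [];
  for (let start = 1; start <= totalPages; start += maxPagesPerPart) {
    const end = Math.min(start + maxPagesPerPart - 1, totalPages);
    ranges.push({ start, end });
  }
  return ranges;
}

console.log(pageRanges(250));
// [ { start: 1, end: 100 }, { start: 101, end: 200 }, { start: 201, end: 250 } ]
```

Page-count boundaries will not always line up with chapter or section breaks, so adjust the ranges by hand where that matters.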

Upload in Batches

For many documents, upload in batches to avoid rate limits:

```javascript
import fs from 'fs';

// Assumes `client` is an already-initialized Orka SDK client.
async function uploadInBatches(files, datastoreId, batchSize = 5) {
  const results = [];

  for (let i = 0; i < files.length; i += batchSize) {
    const batch = files.slice(i, i + batchSize);

    // Uploads within a batch run concurrently
    const batchResults = await Promise.all(
      batch.map(file =>
        client.documents.upload({
          datastore_id: datastoreId,
          file: fs.createReadStream(file),
        })
      )
    );

    results.push(...batchResults);
    console.log(`Uploaded ${results.length}/${files.length}`);

    // Wait between batches
    if (i + batchSize < files.length) {
      await new Promise(r => setTimeout(r, 1000));
    }
  }

  return results;
}
```
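
Even with batching, an individual upload may occasionally be rejected. A generic retry with exponential backoff is one way to handle that. The sketch below is not part of the Orka SDK: `withRetry` and its options are hypothetical. This guide does not say how rate-limit errors are reported, so the `shouldRetry` predicate is left for the caller to define.

```javascript
// Sketch: retry an async operation, doubling the delay after each failure.
// `retries` is the number of retries after the first attempt.
async function withRetry(
  operation,
  { retries = 3, baseDelayMs = 500, shouldRetry = () => true } = {}
) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up when retries are exhausted or the error is not retryable.
      if (attempt >= retries || !shouldRetry(error)) throw error;
      const delayMs = baseDelayMs * 2 ** attempt; // 500, 1000, 2000, ...
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

// Usage (assumes `client` and `fs` as in the examples above):
//   const doc = await withRetry(() =>
//     client.documents.upload({
//       datastore_id: 'ds_abc123',
//       file: fs.createReadStream('./doc1.pdf'),
//     })
//   );
```

The stream is created inside the callback so that each attempt reads the file from the start.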

Updating Documents

To update a document, delete the old version and upload the new one:

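```javascript
import fs from 'fs';

// Assumes `client` is an already-initialized Orka SDK client.
async function updateDocument(oldDocId, datastoreId, newFile) {
  // Deletion is immediate, so confirm the replacement is readable first
  await fs.promises.access(newFile, fs.constants.R_OK);

  // Delete old version
  await client.documents.delete(oldDocId);

  // Upload new version
  const newDoc = await client.documents.upload({
    datastore_id: datastoreId,
    file: fs.createReadStream(newFile),
  });

  return newDoc;
}
```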

Document deletion is immediate. Ensure you have the new version ready before deleting.

Troubleshooting

Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| unsupported_format | File type not supported | Convert to supported format |
| file_corrupted | File is damaged | Re-export or recreate the file |
| password_protected | PDF has password | Remove password protection |
| file_too_large | Exceeds size limit | Split into smaller files |
| text_extraction_failed | Cannot read text | Ensure PDF is not image-only |
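
In application code, the table above can serve as a lookup for showing actionable messages. A minimal sketch follows; `ERROR_SOLUTIONS` and `describeUploadError` are hypothetical names. It assumes the error code is available as a plain string. This guide does not specify where on the error or document object the code appears.

```javascript
// Error codes and solutions copied from the Common Errors table.
const ERROR_SOLUTIONS = new Map([
  ['unsupported_format', 'Convert to supported format'],
  ['file_corrupted', 'Re-export or recreate the file'],
  ['password_protected', 'Remove password protection'],
  ['file_too_large', 'Split into smaller files'],
  ['text_extraction_failed', 'Ensure PDF is not image-only'],
]);

// Hypothetical helper: return the suggested fix for a known error code.
function describeUploadError(code) {
  return ERROR_SOLUTIONS.get(code) ?? `Unknown error code: ${code}`;
}

console.log(describeUploadError('password_protected'));
// Remove password protection
```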

Scanned PDFs

Scanned PDFs (images) require OCR. For best results:

  1. Use high-quality scans (300 DPI minimum)
  2. Ensure good contrast
  3. Keep pages straight and aligned
  4. Consider using OCR software before upload

Empty Processing Results

If documents process but queries return no results:

  1. Verify the document has readable text
  2. Check that the datastore is connected to your agent
  3. Try a specific query matching document content
  4. Review the document in the dashboard

Next Steps