Uploading Documents
This guide covers everything you need to know about uploading and managing documents in Orka, including supported formats, best practices, and troubleshooting.
Supported File Types
| Format | Extensions | Max Size | Notes |
|--------|------------|----------|-------|
| PDF | .pdf | 50 MB | Best for structured documents |
| Word | .docx, .doc | 25 MB | Preserves formatting |
| Text | .txt, .md | 10 MB | Plain text and Markdown |
| HTML | .html, .htm | 10 MB | Web content |
Need support for additional file types? Contact us at support@orka.ai.
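If you upload via the API, you can check a file's extension and size against these limits before sending it. This is a minimal client-side sketch; the limits simply mirror the table above and the helper name is arbitrary:

```javascript
import fs from 'fs';
import path from 'path';

// Size limits in MB, mirroring the table above
const LIMITS_MB = {
  '.pdf': 50,
  '.docx': 25,
  '.doc': 25,
  '.txt': 10,
  '.md': 10,
  '.html': 10,
  '.htm': 10,
};

// Returns null if the file looks uploadable, otherwise a reason string
function checkFile(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  const limitMb = LIMITS_MB[ext];
  if (!limitMb) return `Unsupported file type: ${ext}`;

  const sizeMb = fs.statSync(filePath).size / (1024 * 1024);
  if (sizeMb > limitMb) {
    return `File is ${sizeMb.toFixed(1)} MB; the limit for ${ext} is ${limitMb} MB`;
  }

  return null;
}
```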
Creating a Datastore
Before uploading documents, you need a datastore. The datastore creation wizard guides you through configuration.
Using the Dashboard
Navigate to Datastores in the sidebar and click Create Datastore to begin.
Step 1: General
Basic information and document upload:
- Name: A descriptive name for the datastore
- Description: What this datastore contains
- File Upload: Drag and drop files or click to browse
- Third-Party Connection (Coming Soon): Connect to external sources like Google Drive, Notion, etc.
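The same basic information can be supplied programmatically using the datastore creation call shown later in this guide. A minimal sketch; the `description` field mirrors the dashboard form and is an assumption, and files are uploaded in a separate step via the documents API:

```javascript
const datastore = await client.datastores.create({
  name: 'Product Documentation',             // Name: a descriptive name for the datastore
  description: 'Manuals and release notes',  // Description: what this datastore contains (assumed field)
});

console.log('Created datastore:', datastore.id);
```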
Step 2: Metadata Fields (Coming Soon)
Define custom metadata fields for documents:
- Field Name: Identifier for the metadata field
- Field Type: Select from:
  - Select: Single choice from predefined options
  - Multi-select: Multiple choices from predefined options
  - Date: Date values
  - Text: Free-form text
Metadata fields enable filtering during retrieval.
Step 3: Parsing (Coming Soon)
Configure document parsing options:
- Figure Captioning: Extract and caption figures/images
- Table Splitting: Split tables into separate chunks
Step 4: Chunking Strategy (Coming Soon)
Control how documents are split into chunks:
- Chunking Mode: Choose from four strategies:
  - Fixed: Split at fixed character intervals
  - Semantic: Split at semantic boundaries (paragraphs, sections)
  - Recursive: Hierarchical splitting for nested content
  - Sentence: Split at sentence boundaries
- Chunk Length: Target size for each chunk
- Chunk Overlap: Overlap between consecutive chunks
Steps marked "Coming Soon" are visible in the wizard but not yet configurable. Default values are applied automatically.
Uploading Documents
Once you have a datastore, you can upload documents.
Via Dashboard
- Navigate to Datastores in the sidebar
- Select your datastore
- Click Upload Documents
- Drag and drop files or click to browse
- Wait for processing to complete
Bulk Upload
You can upload multiple files at once:
- Drag multiple files to the upload area
- Or use Ctrl/Cmd+Click to select multiple files
Uploading via API
Single File Upload
```javascript
import fs from 'fs';

const document = await client.documents.upload({
  datastore_id: 'ds_abc123',
  file: fs.createReadStream('./document.pdf'),
  name: 'Product Manual', // Optional custom name
});

console.log('Uploaded:', document.id);
console.log('Status:', document.status);
```
Upload with Fetch (Browser)
```javascript
async function uploadDocument(file, datastoreId) {
  const formData = new FormData();
  formData.append('datastore_id', datastoreId);
  formData.append('file', file);

  const response = await fetch('https://api.orka.ai/v1/documents', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
    },
    body: formData,
  });

  return response.json();
}

// Usage with file input
const fileInput = document.querySelector('input[type="file"]');
fileInput.addEventListener('change', async (e) => {
  const file = e.target.files[0];
  const result = await uploadDocument(file, 'ds_abc123');
  console.log('Uploaded:', result.id);
});
```
Bulk Upload
```javascript
async function uploadMultiple(files, datastoreId) {
  const results = await Promise.all(
    files.map(file =>
      client.documents.upload({
        datastore_id: datastoreId,
        file: fs.createReadStream(file),
      })
    )
  );

  return results;
}

const files = ['./doc1.pdf', './doc2.pdf', './doc3.pdf'];
const uploaded = await uploadMultiple(files, 'ds_abc123');
```
Document Processing
After upload, documents go through several processing stages:
Upload → Parsing → Chunking → Embedding → Indexing → Ready
Processing Status
| Status | Description |
|--------|-------------|
| pending | Waiting in queue |
| processing | Currently being processed |
| completed | Ready for queries |
| failed | Processing failed |
Checking Status
```javascript
async function waitForDocument(documentId) {
  while (true) {
    const doc = await client.documents.get(documentId);

    switch (doc.status) {
      case 'completed':
        console.log('Document ready!');
        return doc;

      case 'failed':
        throw new Error(`Processing failed: ${doc.error}`);

      default:
        console.log(`Status: ${doc.status}...`);
        await new Promise(r => setTimeout(r, 2000));
    }
  }
}
```
Processing Time
Typical processing times:
| Document Size | Estimated Time |
|---------------|----------------|
| < 10 pages | < 30 seconds |
| 10-50 pages | 30-60 seconds |
| 50-200 pages | 1-3 minutes |
| 200+ pages | 3-10 minutes |
Large documents are processed asynchronously. You can continue uploading more documents while others are processing.
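For example, you can start several uploads, keep the returned document IDs, and only poll for completion when you actually need the documents, using the `waitForDocument` helper above. The filenames here are placeholders:

```javascript
// Start uploads without blocking on processing
const docs = await Promise.all(
  ['./handbook.pdf', './faq.md'].map(file =>
    client.documents.upload({
      datastore_id: 'ds_abc123',
      file: fs.createReadStream(file),
    })
  )
);

// ...continue with other work, then wait for readiness when needed
for (const doc of docs) {
  await waitForDocument(doc.id);
}
```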
Best Practices
Document Preparation
- Use native PDFs when possible (not scanned images)
- Ensure text is selectable in PDFs (see the check after this list)
- Remove password protection before uploading
- Use descriptive filenames for easy identification
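To verify that a PDF contains selectable text rather than scanned images, you can inspect it before uploading. A rough sketch using the `pdf-parse` package (installed separately); the 100-character threshold is an arbitrary heuristic:

```javascript
import fs from 'fs';
import pdfParse from 'pdf-parse';

// Rough check: does the PDF contain extractable text?
async function hasSelectableText(filePath) {
  const data = await pdfParse(fs.readFileSync(filePath));
  return data.text.trim().length > 100; // arbitrary threshold
}

if (!(await hasSelectableText('./scan.pdf'))) {
  console.warn('PDF appears to be image-only; run OCR before uploading');
}
```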
Organizing Documents
Create logical datastores for different document types:
```javascript
// Separate datastores by category
const productDocs = await client.datastores.create({
  name: 'Product Documentation',
});

const supportDocs = await client.datastores.create({
  name: 'Support Articles',
});

const policyDocs = await client.datastores.create({
  name: 'Company Policies',
});
```
Document Quality
For best results:
- Clear formatting: Use headings, lists, and paragraphs
- Consistent structure: Similar documents should follow similar formats
- Complete content: Include all relevant information
- No duplicates: Avoid uploading the same document multiple times
Handling Large Documents
Split Large PDFs
For documents over 100 pages, consider splitting:
```bash
# Using pdftk (install separately)
pdftk large-document.pdf burst output page_%02d.pdf
```
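If you prefer to stay in Node.js, the `pdf-lib` package (installed separately) can split a PDF into multi-page parts instead of single pages. A minimal sketch; the 50-page part size and output filenames are arbitrary choices:

```javascript
import fs from 'fs';
import { PDFDocument } from 'pdf-lib';

// Split a PDF into parts of `pagesPerPart` pages each
async function splitPdf(inputPath, pagesPerPart = 50) {
  const src = await PDFDocument.load(fs.readFileSync(inputPath));
  const total = src.getPageCount();

  for (let start = 0; start < total; start += pagesPerPart) {
    const part = await PDFDocument.create();
    const count = Math.min(pagesPerPart, total - start);
    const indices = Array.from({ length: count }, (_, i) => start + i);

    const pages = await part.copyPages(src, indices);
    pages.forEach(page => part.addPage(page));

    fs.writeFileSync(`part_${start / pagesPerPart + 1}.pdf`, await part.save());
  }
}
```
Upload in Batches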
For many documents, upload in batches to avoid rate limits:
```javascript
async function uploadInBatches(files, datastoreId, batchSize = 5) {
  const results = [];

  for (let i = 0; i < files.length; i += batchSize) {
    const batch = files.slice(i, i + batchSize);

    const batchResults = await Promise.all(
      batch.map(file =>
        client.documents.upload({
          datastore_id: datastoreId,
          file: fs.createReadStream(file),
        })
      )
    );

    results.push(...batchResults);
    console.log(`Uploaded ${results.length}/${files.length}`);

    // Wait between batches
    if (i + batchSize < files.length) {
      await new Promise(r => setTimeout(r, 1000));
    }
  }

  return results;
}
```
Updating Documents
To update a document, delete the old version and upload the new one:
```javascript
async function updateDocument(oldDocId, datastoreId, newFile) {
  // Delete old version
  await client.documents.delete(oldDocId);

  // Upload new version
  const newDoc = await client.documents.upload({
    datastore_id: datastoreId,
    file: fs.createReadStream(newFile),
  });

  return newDoc;
}
```
Document deletion is immediate. Ensure you have the new version ready before deleting.
Troubleshooting
Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| unsupported_format | File type not supported | Convert to supported format |
| file_corrupted | File is damaged | Re-export or recreate the file |
| password_protected | PDF has password | Remove password protection |
| file_too_large | Exceeds size limit | Split into smaller files |
| text_extraction_failed | Cannot read text | Ensure PDF is not image-only |
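If you script uploads, you can branch on these error codes and decide whether to skip, fix, or retry a file. The sketch below assumes the SDK surfaces the code as `error.code`; check the SDK's actual error type for the exact shape:

```javascript
async function uploadWithHandling(filePath, datastoreId) {
  try {
    return await client.documents.upload({
      datastore_id: datastoreId,
      file: fs.createReadStream(filePath),
    });
  } catch (error) {
    // `error.code` is an assumption about the SDK's error shape
    switch (error.code) {
      case 'unsupported_format':
      case 'file_corrupted':
      case 'password_protected':
      case 'file_too_large':
      case 'text_extraction_failed':
        console.error(`Skipping ${filePath}: ${error.code} (see the table above for the fix)`);
        return null;
      default:
        throw error; // surface unexpected errors
    }
  }
}
```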
Scanned PDFs
Scanned PDFs (images) require OCR. For best results:
- Use high-quality scans (300 DPI minimum)
- Ensure good contrast
- Keep pages straight and aligned
- Consider using OCR software before upload
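For example, the open-source ocrmypdf tool (installed separately) adds a searchable text layer to a scanned PDF before you upload it:

```bash
# Add an OCR text layer to a scanned PDF (requires ocrmypdf)
ocrmypdf scanned.pdf searchable.pdf
```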
Empty Processing Results
If documents process but queries return no results:
- Verify the document has readable text
- Check that the datastore is connected to your agent
- Try a specific query matching document content
- Review the document in the dashboard
Next Steps
- Creating Agents - Configure agents to use your documents
- Logic - Add domain knowledge to improve retrieval
- Datastores API - Full API documentation
- Chat API - Query your documents