Literature Module
The literature module lets you index your own PDF, text, and markdown documents and ask the agent to synthesise across them — without a vector database, embeddings, or any cloud service.
How It Works
- Drop files into the project's literature/ folder
- Call index_literature → builds a plain-text index with 800-character excerpts per document
- Call search_literature with a query → returns matching excerpts
- The agent synthesises the excerpts into an answer
No chromadb. No sentence-transformers. No API calls to a third-party service. Just text matching and LLM synthesis over your own files.
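The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the module's actual implementation: the names build_index and search are hypothetical, only txt and md files are read, and it assumes matching runs over the stored excerpts (the real tool may search differently).

```python
from pathlib import Path

EXCERPT_CHARS = 800  # excerpt length stated in the docs


def build_index(folder: Path) -> dict[str, str]:
    """Map each text/markdown file name to its first 800 characters."""
    index = {}
    for path in sorted(folder.iterdir()):
        if path.suffix.lower() in {".txt", ".md"}:
            text = path.read_text(encoding="utf-8", errors="replace")
            index[path.name] = text[:EXCERPT_CHARS]
    return index


def search(index: dict[str, str], query: str) -> list[tuple[str, str]]:
    """Return (file name, excerpt) pairs whose excerpt contains the query."""
    needle = query.lower()  # case-insensitive substring match
    return [
        (name, excerpt)
        for name, excerpt in index.items()
        if needle in excerpt.lower()
    ]
```

The matching excerpts would then be handed to the agent as context. No embeddings or external services are involved at any point.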
Setting Up
Add files to the literature folder
~/.aihydro/projects/new_england_basins/literature/
├── kratzert2018_lstm.pdf
├── addor2017_camels.pdf
├── newman2015_camels.pdf
└── my_notes.md
Supported formats: PDF (.pdf), plain text (.txt), and Markdown (.md)
PDF extraction
PDF text is extracted using pypdf or pdfplumber (installed automatically with aihydro-tools[all]). Scanned PDFs (image-only) will not extract usefully — use text-layer PDFs.
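As a rough sketch of what extraction might look like with pypdf, the helper below dispatches on file extension and flags PDFs that yield almost no text. The names extract_text and looks_scanned and the 50-character threshold are assumptions made for illustration; they are not part of aihydro-tools.

```python
from pathlib import Path


def extract_text(path: Path) -> str:
    """Return the text of a PDF, txt, or md file (hypothetical helper)."""
    suffix = path.suffix.lower()
    if suffix in {".txt", ".md"}:
        return path.read_text(encoding="utf-8", errors="replace")
    if suffix == ".pdf":
        from pypdf import PdfReader  # imported lazily; only needed for PDFs

        reader = PdfReader(str(path))
        # Image-only pages may yield no text, so fall back to "".
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    raise ValueError(f"Unsupported format: {path.suffix}")


def looks_scanned(text: str, min_chars: int = 50) -> bool:
    """Heuristic: almost no extractable text suggests an image-only PDF."""
    return len(text.strip()) < min_chars
```

Running looks_scanned on the extracted text before indexing gives an early warning that a PDF lacks a text layer.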
Index the folder
The agent calls index_literature(project_name="New England Basins") and builds literature_index.md with one entry per document:
## kratzert2018_lstm.pdf
**Path:** ~/.aihydro/projects/.../kratzert2018_lstm.pdf
**Size:** 892 KB | **Indexed:** 2026-04-10T10:22:00Z
> Rainfall-runoff modelling using Long Short-Term Memory (LSTM) networks.
> We trained a single LSTM model on 241 basins from the CAMELS dataset...
> [800 chars]
---
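To make the entry layout concrete, here is a small sketch that renders one entry in the same shape as the sample above. The function render_entry is hypothetical, and the size rounding and timestamp format are inferred from the sample rather than taken from the tool's source.

```python
from datetime import datetime, timezone
from pathlib import Path


def render_entry(path: Path, text: str, excerpt_chars: int = 800) -> str:
    """Render one index entry: heading, metadata, quoted excerpt, rule."""
    size_kb = round(path.stat().st_size / 1024)
    indexed = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    excerpt = text[:excerpt_chars]
    # Prefix every excerpt line with "> " so it reads as a block quote.
    quoted = "\n".join(f"> {line}" for line in excerpt.splitlines())
    return (
        f"## {path.name}\n"
        f"**Path:** {path}\n"
        f"**Size:** {size_kb} KB | **Indexed:** {indexed}\n"
        f"{quoted}\n"
        "---\n"
    )
```

Concatenating one such entry per document would produce a file similar to literature_index.md.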
Searching
search_literature returns the top-matching document excerpts, which the agent uses as context for synthesis.
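One plausible way to pick the "top-matching" excerpts with plain text matching is to count how often each query term occurs and sort by that count. The sketch below assumes this scoring; the function rank_excerpts and the top_k parameter are hypothetical, and the actual ranking used by search_literature is not documented here.

```python
def rank_excerpts(
    index: dict[str, str], query: str, top_k: int = 3
) -> list[tuple[str, int]]:
    """Score each excerpt by total occurrences of the query terms."""
    terms = query.lower().split()
    scores = []
    for name, excerpt in index.items():
        haystack = excerpt.lower()
        score = sum(haystack.count(term) for term in terms)
        if score > 0:
            scores.append((name, score))
    # Highest score first; ties broken alphabetically for stable output.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:top_k]
```

Because this is substring counting, a query such as "runoff" will not match a paper that only says "streamflow", which is the trade-off noted under Limitations.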
Return full document
For short documents (notes, summaries), set return_full_content=True to get the complete file rather than just excerpts.
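Conceptually, the flag switches between the stored excerpt and the whole document. The sketch below illustrates that choice only; select_content is a hypothetical helper, not the tool's API.

```python
EXCERPT_CHARS = 800  # excerpt length stated in the docs


def select_content(text: str, return_full_content: bool = False) -> str:
    """Return the whole text, or only the leading excerpt by default."""
    if return_full_content:
        return text
    return text[:EXCERPT_CHARS]
```

Full content suits short notes; for long papers the excerpt keeps the agent's context small.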
Limitations
| Limitation | Detail |
|---|---|
| Search method | Text substring matching — not semantic/vector search |
| PDF quality | Text-layer PDFs only; scanned PDFs extract poorly |
| Index freshness | Re-run index_literature after adding new files |
| Context window | Very long documents may be truncated to excerpt length |
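The index-freshness limitation can be checked mechanically by comparing modification times. The sketch below lists supported files that are newer than the index file; stale_files is a hypothetical helper, and the location of literature_index.md is passed in explicitly because it is an assumption here.

```python
from pathlib import Path

SUPPORTED = {".pdf", ".txt", ".md"}  # formats listed in the docs


def stale_files(literature_dir: Path, index_file: Path) -> list[str]:
    """List supported files modified after the index was last written."""
    # If no index exists yet, every supported file counts as unindexed.
    indexed_at = (
        index_file.stat().st_mtime if index_file.exists() else float("-inf")
    )
    return sorted(
        p.name
        for p in literature_dir.iterdir()
        if p.is_file()
        and p != index_file
        and p.suffix.lower() in SUPPORTED
        and p.stat().st_mtime > indexed_at
    )
```

A non-empty result means index_literature should be run again before searching.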
Future: semantic search
Vector-based semantic search is available as a separate package (aihydro-rag). The folder-based module is intentionally dependency-free and sufficient for most research workflows.
Typical Workflow
1. Collect PDFs into the literature/ folder
2. "Index the literature for my project"
3. "What do these papers say about [topic]?"
4. Agent synthesises → you refine → agent searches again
5. "Add a journal entry: the Kratzert paper's approach to basin attributes aligns well with what I'm seeing in the New England basins."