Chat feature TODOs:
Chat feature TODOs:
Immediate
get initial chat box HTML and JS frontend working (no LLM call yet) set up fastAPI backend set up cerebras LLM API call to gpt-oss-120B deploy on production w/ Fly.io ~~ ~~switch to streaming API remove references to non-streaming chat add conversation history support add health check endpoint and CI/CD - keep Fly.io machine warm - add chat logging - stats on chat logging
build an evaluation harness for accuracy- Q&A questions about my resume- call backend on each question- log raw responses, latency, errors- output summary report- Add RAG with vector embeddings. Should contain
all links in the resume so that the chat model has access to them if it needs- the code used to create the model chat window so that the model can figure out how it is set up
- add reranker to RAG setup
Future (major)
- classifier head that predicts whether a user is asking a question about the resume or not
- reranker into RAG system
- add tool-use: turn the setup into an agent with the sdk, tool-use should read relevant documents as needed with bash commands.
- add live monitoring (tokens/sec, first response latency)
- make faster
dont keep passing conversation history back and forth (store on Fly.io server)- Preload KV cache w/ background context whenever a new user accesses the website
- make sure not to leak info to the HTML frontend
- make more cost-efficient
have fly.io machine start up only when at least one user is on the site (upon entering the site)sliding window on conversation history — full history + system prompt re-sent every turn, unbounded token cost$2/conversation cost cap with per-turn token tracking
- make UI prettier (anthropic-style, rendered as a box inline, scrolling possible)
- add guardrails with the Agents SDK
Done
merge two system messages into one inbuild_messagesadd rate limiting — $2/conversation cost cap with token tracking
future (minor)
- make “send” button and input box lower w.r.t the conversation
- switch to typescript over javascript
- log user id by IP address rather than making a new hash every time
- speed up evaluation by switching to
httpxwithasyncio
