josh wu

docutrack
turn any live screen recording into detailed text documentation

description

DocuTrack turns a live screen recording into structured documentation. Writing setup guides by hand is slow and inconsistent, and it is often skipped altogether. DocuTrack automates the process by watching the screen, capturing screenshots around meaningful actions, and using Cohere AI to infer what the user did and generate the corresponding steps.

The system takes screenshots twice a second with extra captures when important actions occur. Screenshots before and after each keystroke are grouped and sent to Cohere, which interprets them into readable documentation steps. Output is formatted as Markdown or LaTeX with embedded screenshots, ready to publish directly in GitHub repos, onboarding docs, or classroom labs.
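As a rough illustration of the grouping step, the sketch below pairs each keystroke with the nearest screenshot taken before and after it. It is not the project's actual code: the `Frame` type, the `group_around_keystrokes` function, and the file names are hypothetical, and the real recorder may select frames differently.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    timestamp: float  # seconds since the recording started
    path: str         # where the screenshot was saved (hypothetical naming)


def group_around_keystrokes(frames, key_times):
    """Pair each keystroke with the nearest frame before and after it.

    `frames` must be sorted by timestamp. Returns a list of
    (key_time, before_frame, after_frame) tuples; either frame is None
    when the keystroke falls outside the recorded range.
    """
    stamps = [f.timestamp for f in frames]
    groups = []
    for t in key_times:
        i = bisect_left(stamps, t)   # frames[i - 1] is the last frame strictly before t
        j = bisect_right(stamps, t)  # frames[j] is the first frame strictly after t
        before = frames[i - 1] if i > 0 else None
        after = frames[j] if j < len(frames) else None
        groups.append((t, before, after))
    return groups


# Frames every 0.5 s, matching the twice-a-second capture rate described above.
frames = [Frame(k * 0.5, f"shot_{k:03d}.png") for k in range(6)]
groups = group_around_keystrokes(frames, key_times=[0.7, 2.0])
```

Each tuple in `groups` holds the before/after pair that would then be sent on for interpretation. Binary search keeps the lookup cheap even for long recordings.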

how it was built
  • Desktop recorder with a Tkinter GUI for recording control
  • Keyboard input flagged as action triggers; screenshots grouped around each keystroke
  • Grouped images sent to Cohere using a multi-chat architecture for more effective event interpretation
  • Embeddings used for screenshot deduplication to avoid redundant steps
  • Output rendered in Markdown or LaTeX with embedded screenshots
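
The last two items can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not DocuTrack's implementation: the embedding vectors are hand-written stand-ins for whatever an embedding model would return, the 0.95 similarity threshold is arbitrary, and `dedupe_steps` and `render_markdown` are hypothetical names. The sketch compares each step only with the previously kept step, which is one of several possible deduplication strategies.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedupe_steps(steps, threshold=0.95):
    """Drop a step whose embedding is nearly identical to the last kept step."""
    kept = []
    for step in steps:
        if kept and cosine_similarity(kept[-1]["embedding"], step["embedding"]) >= threshold:
            continue  # near-duplicate screenshot, skip it
        kept.append(step)
    return kept


def render_markdown(title, steps):
    """Render steps as a numbered Markdown list with one embedded image per step."""
    lines = [f"# {title}", ""]
    for n, step in enumerate(steps, start=1):
        lines.append(f"{n}. {step['text']}")
        lines.append("")
        lines.append(f"   ![Step {n}]({step['image']})")
        lines.append("")
    return "\n".join(lines)


# Toy data: the second step is a near-duplicate of the first.
steps = [
    {"text": "Open the terminal.", "image": "shot_001.png", "embedding": [1.0, 0.0, 0.0]},
    {"text": "Open the terminal.", "image": "shot_002.png", "embedding": [0.99, 0.01, 0.0]},
    {"text": "Run the install command.", "image": "shot_005.png", "embedding": [0.0, 1.0, 0.0]},
]
doc = render_markdown("Setup guide", dedupe_steps(steps))
```

Here `doc` contains two numbered steps, because the near-duplicate second screenshot is dropped before rendering.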

accomplishments

Built a full pipeline from raw screen captures to polished LaTeX docs, with output quality good enough to publish directly in GitHub repos.

tech stack
Python, Tkinter, Cohere AI, LaTeX, Markdown