How It Works

The assistant is a small web application. An educator logs in, uploads a lecture — an audio recording, a PDF, or plain text — and the system runs it through a fixed pipeline, posting progress in real time. The heavy lifting is done by two local engines: whisper.cpp for speech-to-text and a local Ollama model for the language tasks.

The architecture

The whole system is a single Perl/Mojolicious application. A browser talks to it over a local port; behind that sit a job queue, the transcription engine, and the language model.

Browser  →  Mojolicious web app
                ├─ SQLite        jobs + users + results
                ├─ whisper.cpp   audio  → transcript
                ├─ Ollama        transcript → summary / notes / quiz
                └─ pdftotext     PDF    → text
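
The routing in the diagram can be sketched as a small dispatch function. This is an illustration in Python, not the application's Perl code. The accepted file extensions, the `whisper-cli` binary name, and the model path are assumptions made for the example. The source only states that audio goes to whisper.cpp, PDFs go to pdftotext, and plain text needs no extraction.

```python
from pathlib import Path
from typing import List, Optional

# Hypothetical values: the source does not list accepted formats or a model file.
AUDIO_EXTENSIONS = {".wav"}
WHISPER_MODEL = "models/ggml-base.en.bin"


def extraction_command(upload_path: str) -> Optional[List[str]]:
    """Return the external command that turns an upload into text.

    Returns None for plain text, which needs no extraction step.
    """
    suffix = Path(upload_path).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        # Assumed whisper.cpp invocation: -m model, -f input, -otxt writes a .txt file.
        return ["whisper-cli", "-m", WHISPER_MODEL, "-f", upload_path, "-otxt"]
    if suffix == ".pdf":
        # A trailing "-" tells pdftotext to write the extracted text to stdout.
        return ["pdftotext", upload_path, "-"]
    if suffix == ".txt":
        return None
    raise ValueError(f"unsupported upload type: {suffix or 'no extension'}")
```

Returning the command as a list rather than a shell string keeps uploaded filenames from being interpreted by a shell when the command is eventually run.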

The job queue

Processing a lecture takes minutes, not milliseconds, so every upload becomes a queued job that moves through a fixed sequence of states: queued → transcribing → summarizing → quizzing → completed (or failed, with the reason recorded). The work runs in a background subprocess so the web interface never blocks, and the page streams live status updates as each stage finishes.
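
The state sequence above can be sketched as a small table-backed state machine. This is a minimal illustration in Python with an in-memory SQLite database, not the application's Perl code. The table layout and the function names `create_queue`, `enqueue`, `advance`, and `fail` are hypothetical; only the state names come from the description above.

```python
import sqlite3

# State names as listed above; "failed" is a terminal state outside the sequence.
STAGES = ["queued", "transcribing", "summarizing", "quizzing", "completed"]


def create_queue(conn: sqlite3.Connection) -> None:
    """Create a hypothetical jobs table with a state and an optional error."""
    conn.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY,"
        " filename TEXT NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'queued',"
        " error TEXT)"
    )


def enqueue(conn: sqlite3.Connection, filename: str) -> int:
    """Insert a new job in the 'queued' state and return its id."""
    cur = conn.execute("INSERT INTO jobs (filename) VALUES (?)", (filename,))
    conn.commit()
    return cur.lastrowid


def advance(conn: sqlite3.Connection, job_id: int) -> str:
    """Move a job to the next stage and return the new state."""
    (state,) = conn.execute(
        "SELECT state FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    if state in ("completed", "failed"):
        raise ValueError(f"job {job_id} is already {state}")
    next_state = STAGES[STAGES.index(state) + 1]
    conn.execute("UPDATE jobs SET state = ? WHERE id = ?", (next_state, job_id))
    conn.commit()
    return next_state


def fail(conn: sqlite3.Connection, job_id: int, reason: str) -> None:
    """Mark a job as failed and record why."""
    conn.execute(
        "UPDATE jobs SET state = 'failed', error = ? WHERE id = ?",
        (reason, job_id),
    )
    conn.commit()
```

A background worker would call `advance` as each stage finishes and `fail` when a stage raises an error, while the web process only reads the `state` column to report progress.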

The stack, deliberately boring

The implementation favours components that are easy to run and audit on University infrastructure:

Perl + Mojolicious — the web app, job queue, and pipeline
whisper.cpp — on-device speech-to-text, no cloud transcription service
Ollama — a quantized instruction-tuned model serving summaries, notes, objectives, and quizzes locally
SQLite — one file for jobs, users, and results
pdftotext — text extraction for slide and document uploads

Privacy by construction: there is no external API call in the pipeline. The recording, the transcript, and every generated artifact stay on University of Toronto infrastructure from upload to review.
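
One way to make that guarantee checkable in code is to build every language-model request against a loopback address and refuse anything else. The sketch below is an illustration in Python, not the application's Perl code. It assumes Ollama is listening on its default local port (11434) and uses a placeholder model name and prompt; the helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlparse

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_NAME = "llama3"  # placeholder: the source does not name the model
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def build_summary_request(transcript: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a lecture summary."""
    payload = {
        "model": MODEL_NAME,
        "prompt": "Summarize this lecture transcript:\n\n" + transcript,
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_local(request: urllib.request.Request) -> bool:
    """Return True only if the request targets a loopback address."""
    return urlparse(request.full_url).hostname in LOOPBACK_HOSTS


def summarize(transcript: str) -> str:
    """Send the request to the local model and return the generated text."""
    request = build_summary_request(transcript)
    if not is_local(request):
        raise RuntimeError("refusing to send lecture content off this machine")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `summarize` requires a running Ollama instance with the named model pulled. The `is_local` check is the part that ties back to the privacy claim, and it can be exercised without any server.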
        

