
Adding AI narration to journal articles with ElevenLabs


Some people prefer reading. Others prefer listening while commuting, cooking, or exercising. Adding audio narration to journal articles opens content to a wider audience - and with modern text-to-speech APIs, the quality is surprisingly good.

The goal: a CLI tool that takes any journal article and generates a natural-sounding MP3 narration, plus an embeddable audio player component for the website.

[Diagram: Audio generation pipeline. An MDX article (Markdown + JSX) passes through a text processor that strips SVG, code, components, and markdown formatting; the cleaned text is POSTed to the ElevenLabs text-to-speech/{voiceId} endpoint, and the resulting MP3 is saved to public/audio/. The eleven_multilingual_v2 model supports 29 languages with natural prosody.]

The Challenge: MDX is Not Plain Text

Journal articles aren't simple text files. They contain SVG diagrams, code blocks, MDX components like AudioPlayer, markdown formatting, and HTML comments. Sending all of that to a text-to-speech API produces garbage audio - the AI would literally try to pronounce SVG path coordinates.

The extraction function strips away everything that shouldn't be spoken. First, it removes SVG blocks entirely using a multiline regex pattern. Then code fences, inline code wrapped in backticks, and any MDX components that start with capital letters. HTML comments go next. Image markdown gets removed, but link text is preserved while stripping the URL. Finally, markdown formatting markers like headers, bold, italic, blockquotes, and list markers are cleaned up, leaving just the readable prose.
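A minimal sketch of that ordering, assuming a chain of regex replacements. The patterns here are illustrative, not the article's exact implementation:

```typescript
// Sketch of the extraction order described above (illustrative regexes,
// not the article's exact patterns).
function extractReadableText(mdx: string): string {
  return mdx
    .replace(/<svg[\s\S]*?<\/svg>/g, "")             // 1. SVG blocks first
    .replace(/`{3}[\s\S]*?`{3}/g, "")                // 2. fenced code blocks
    .replace(/`[^`]*`/g, "")                         //    inline code
    .replace(/<[A-Z]\w*[^>]*\/>/g, "")               // 3. self-closing MDX components
    .replace(/<([A-Z]\w*)[^>]*>[\s\S]*?<\/\1>/g, "") //    paired MDX components
    .replace(/<!--[\s\S]*?-->/g, "")                 // 4. HTML comments
    .replace(/!\[[^\]]*\]\([^)]*\)/g, "")            // 5. image markdown
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")         // 6. links: keep text, drop URL
    .replace(/^#{1,6}\s+/gm, "")                     // 7. header markers
    .replace(/(\*\*|__)(.*?)\1/g, "$2")              // 8. bold
    .replace(/(\*|_)([^*_]+)\1/g, "$2")              //    italic
    .replace(/^>\s?/gm, "")                          // 9. blockquote markers
    .replace(/^\s*(?:[-*+]|\d+\.)\s+/gm, "")         // 10. list markers
    .replace(/^-{3,}\s*$/gm, "")                     // 11. horizontal rules
    .replace(/\n{3,}/g, "\n\n")                      // 12. collapse whitespace
    .trim();
}
```

Each stage assumes the earlier stages have already run, which is exactly why the ordering constraints below matter.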

[Diagram: Content extraction stages. Remove SVG blocks (multiline regex match); remove code blocks (fenced and inline); remove components (AudioPlayer, FAQ, etc.); remove HTML comments (build artifacts); strip image markdown; clean links (keep text, remove URL); strip header markers (## becomes plain text); remove emphasis markers; clean blockquote markers; clean list markers (- and 1. prefixes); remove horizontal rules (---); collapse whitespace to at most two newlines. Output: clean text with natural paragraphs, ready for TTS.]

The order matters. SVG blocks must be removed before stripping markdown, otherwise the regex for removing formatting could match parts of SVG path data. Components must go before cleaning links, since some components have URL-like attributes.

ElevenLabs API Integration

The ElevenLabs text-to-speech API is straightforward: you send text and voice configuration, and you get back an audio buffer. The endpoint is a POST request to /v1/text-to-speech/{voice_id} under their API base URL.

Three environment variables control the integration: the API key for authentication, the voice ID determining which voice speaks, and the model ID selecting which underlying TTS model to use. We're using eleven_multilingual_v2 for its natural prosody and language support.

[Diagram: ElevenLabs API request structure. POST to /v1/text-to-speech/{voice_id} with headers Content-Type: application/json and xi-api-key: {ELEVENLABS_API_KEY}; the body contains the text, model_id: "eleven_multilingual_v2", and voice_settings. Voice settings: stability 0.5 (balance between consistency and variation), similarity_boost 0.75 (how closely to match the original voice), style 0.0 (expressiveness level; 0 = neutral), use_speaker_boost true (enhanced clarity for long-form content).]

The voice settings took some experimentation. Stability at 0.5 provides a balance - too low and the voice varies unnaturally between sentences, too high and it sounds robotic. Similarity boost at 0.75 keeps the voice consistent with its training data. Style at 0.0 keeps the delivery neutral, appropriate for technical content. Speaker boost improves clarity for longer audio, essential for articles that might run several minutes.
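Putting the endpoint, headers, and settings together, the request could be sketched like this. The api.elevenlabs.io base URL is an assumption from the public API docs; separating payload construction from the network call keeps the sketch testable:

```typescript
// Sketch of the TTS request described above. Base URL is an assumption
// (public ElevenLabs API); settings mirror the article's values.
function buildTtsRequest(text: string, voiceId: string, apiKey: string) {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "xi-api-key": apiKey,
      },
      body: JSON.stringify({
        text,
        model_id: "eleven_multilingual_v2",
        voice_settings: {
          stability: 0.5,          // consistency vs. variation
          similarity_boost: 0.75,  // stay close to the trained voice
          style: 0.0,              // neutral delivery
          use_speaker_boost: true, // clarity for long-form audio
        },
      }),
    },
  };
}

// Usage:
//   const { url, init } = buildTtsRequest(text, voiceId, apiKey);
//   const res = await fetch(url, init);
//   const mp3 = Buffer.from(await res.arrayBuffer());
```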

CLI Tooling

The generation script runs from the command line using tsx, a TypeScript executor. It supports three modes: processing a single article by slug, processing all articles with the --all flag, or listing available articles with the --list flag.
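Mode selection for those three invocations could look like the following hypothetical helper (the real script may parse its arguments differently):

```typescript
// Hypothetical argument parsing for the three CLI modes described above.
type Mode =
  | { kind: "list" }
  | { kind: "all" }
  | { kind: "single"; slug: string };

function parseMode(argv: string[]): Mode {
  const arg = argv[0];
  if (!arg) throw new Error("Usage: generate-audio.ts <slug> | --all | --list");
  if (arg === "--list") return { kind: "list" };
  if (arg === "--all") return { kind: "all" };
  return { kind: "single", slug: arg }; // anything else is treated as a slug
}

// Usage: const mode = parseMode(process.argv.slice(2));
```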

For single article processing, the script finds the entry by matching the slug parameter against available MDX files. It reads the file, parses the frontmatter using gray-matter, extracts the readable text, prepends an introduction with the article title and formatted date, then sends everything to ElevenLabs.

[Diagram: CLI usage modes. Single article: npx tsx scripts/generate-audio.ts {slug} processes one entry by name. Batch: npx tsx scripts/generate-audio.ts --all processes all entries, skipping existing audio. List: npx tsx scripts/generate-audio.ts --list shows articles and their audio status. Output file naming convention: public/audio/{year}-{slug}.mp3, e.g. 2026-01-20-elevenlabs-audio-narration.mp3.]

The batch mode is idempotent. Before generating audio for any entry, it checks whether the output file already exists; if it does, the entry is skipped. This means you can safely run --all repeatedly - it will only generate audio for new articles.
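The idempotency check reduces to an existence test against the naming convention, roughly:

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Hypothetical helpers mirroring the idempotency check described above:
// an entry is skipped when its MP3 already exists under public/audio.
function outputPath(year: string, slug: string): string {
  return join("public", "audio", `${year}-${slug}.mp3`);
}

function shouldGenerate(outPath: string): boolean {
  return !existsSync(outPath); // skip when the file is already there
}
```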

The intro text adds context for listeners. Instead of jumping straight into the content, the audio begins with the article title and a formatted date like "Monday, January 20th, 2026." This helps orient listeners who might be working through a backlog of articles.
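The intro could be built with a small helper like this. The exact wording is an assumption, and the ordinal suffix ("20th") needs custom logic, since toLocaleDateString alone produces "January 20, 2026":

```typescript
// Hypothetical intro builder; the real script's wording may differ.
// Ordinal suffix logic: 1st, 2nd, 3rd, 4th..., 11th-13th are special-cased.
function ordinal(n: number): string {
  const s = ["th", "st", "nd", "rd"];
  const v = n % 100;
  return n + (s[(v - 20) % 10] || s[v] || s[0]);
}

function buildIntro(title: string, date: Date): string {
  const weekday = date.toLocaleDateString("en-US", { weekday: "long" });
  const month = date.toLocaleDateString("en-US", { month: "long" });
  return `${title}. ${weekday}, ${month} ${ordinal(date.getDate())}, ${date.getFullYear()}.`;
}
```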

The Audio Player Component

The frontend component is a React client component with full playback controls. It manages state for playing status, loading state, current time, total duration, playback rate, and whether the user is currently dragging the scrubber.

[Diagram: Audio player component architecture. Component state: isPlaying, isLoading, currentTime, duration, playbackRate, isDragging. Audio element: preload="metadata"; events loadedmetadata, play, pause, ended, canplay, waiting, timeupdate, error. UI controls: play/pause button, progress bar with scrubbing, current/total time display, playback speed selector (0.75x, 1x, 1.25x, 1.5x, 1.75x, 2x). Scrubbing behavior: isDragging prevents timeupdate from overwriting the user's position; mouse events attach to the document during a drag for smooth tracking.]

The scrubbing implementation deserves attention. When a user drags the progress bar, the component sets isDragging to true and attaches mousemove and mouseup listeners to the document, not just the progress bar element. This allows smooth scrubbing even if the cursor moves outside the progress bar bounds. The timeupdate handler checks isDragging and skips updating currentTime if the user is actively scrubbing - otherwise the playback position would fight with the user's input.
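The position math behind scrubbing can be isolated as a pure function; the React wiring around it follows the pattern just described. This is a sketch, with the handler wiring shown as comments:

```typescript
// Pure helper for the scrubbing math: map a pointer's clientX to a
// playback time, clamped to the track bounds. Document-level listeners
// mean this still works when the cursor leaves the progress bar.
function timeFromPointer(
  clientX: number,
  trackLeft: number,
  trackWidth: number,
  duration: number,
): number {
  const fraction = (clientX - trackLeft) / trackWidth;
  return Math.min(Math.max(fraction, 0), 1) * duration;
}

// Wiring sketch (inside the component):
//   onMouseDown: setIsDragging(true); attach mousemove/mouseup to document
//   mousemove:   audio.currentTime = timeFromPointer(e.clientX, rect.left, rect.width, duration)
//   mouseup:     setIsDragging(false); remove document listeners
//   timeupdate:  if (!isDragging) setCurrentTime(audio.currentTime)
```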

Playback rate cycling goes through the speeds in order: 1x, 1.25x, 1.5x, 1.75x, 2x, then back to 0.75x. Listeners who want to skim through content can speed up, while those who want to slow down for complex sections can go to 0.75x.
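The cycling order above amounts to advancing through a fixed list and wrapping around, something like:

```typescript
// Sketch of the speed cycling described above: each click advances to
// the next rate and wraps from 2x back to 0.75x.
const PLAYBACK_RATES = [0.75, 1, 1.25, 1.5, 1.75, 2];

function nextRate(current: number): number {
  const i = PLAYBACK_RATES.indexOf(current);
  // Unknown rates (i === -1) fall back to the first entry.
  return PLAYBACK_RATES[(i + 1) % PLAYBACK_RATES.length];
}
```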

Audio Hosting Strategy

Initially, audio files lived in the public/audio directory, served directly by Next.js. This worked for development but had drawbacks for production - large binary files in the repository, slow git operations, and no CDN benefits.

The solution: upload generated audio to S3, serve through CloudFront. The AudioPlayer component now receives CloudFront URLs instead of local paths.

[Diagram: Audio hosting architecture. Development: public/audio/*.mp3 on the local filesystem; files upload to an AWS S3 audio bucket (origin storage) and are served through CloudFront (d11jstepvjx56a) for edge delivery to the browser. CDN benefits: edge caching, global distribution, reduced origin load. Audio files can be multiple megabytes, so CDN delivery is essential.]

The workflow now involves generating audio locally during development, uploading to S3 manually or through a deployment script, then updating the MDX file to reference the CloudFront URL. The public/audio directory serves as a temporary staging area, not permanent storage.

Handling Edge Cases

Several edge cases required attention during development.

Articles with no readable content after extraction. Some early test files were mostly SVG diagrams. The extraction function can return an empty string, which would fail at the API. The script now validates that extracted text meets a minimum length before attempting generation.
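That validation is a one-line guard; the threshold here is an assumption, not the script's actual value:

```typescript
// Hypothetical guard for the empty-content edge case: refuse to call the
// API unless the extracted text clears a minimum length.
const MIN_SPEAKABLE_LENGTH = 100; // assumed threshold

function isSpeakable(text: string): boolean {
  return text.trim().length >= MIN_SPEAKABLE_LENGTH;
}
```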

Cached audio metadata on page refresh. The audio element might already have loaded metadata from browser cache when the component mounts. The useEffect hook checks readyState before attaching the loadedmetadata listener - if metadata is already available, it sets duration immediately instead of waiting for an event that already fired.

Rate limiting from ElevenLabs. Processing many articles in quick succession can hit API rate limits. The batch mode processes entries sequentially rather than in parallel, and errors for individual entries don't halt the entire batch.
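The sequential, failure-tolerant loop can be sketched as follows, keeping one request in flight at a time and counting failures instead of rethrowing:

```typescript
// Sketch of the sequential batch loop described above: entries are
// awaited one at a time (avoiding parallel bursts that trip rate
// limits), and one entry's failure is logged rather than aborting the run.
async function processAll<T>(
  entries: T[],
  generate: (entry: T) => Promise<void>,
): Promise<{ succeeded: number; failed: number }> {
  let succeeded = 0;
  let failed = 0;
  for (const entry of entries) {
    try {
      await generate(entry); // sequential: one request in flight at a time
      succeeded++;
    } catch (err) {
      failed++;
      console.error("Failed:", err);
    }
  }
  return { succeeded, failed };
}
```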

[Diagram: Edge case handling. Empty content: extraction returns an empty string; solution: validate minimum length. Cached metadata: readyState >= 1 on mount; solution: check before attaching the listener. API rate limits: too many requests too fast; solution: sequential processing. Batch mode catches errors per entry and continues, so one failed article doesn't stop the entire generation run.]

The Complete Architecture

The final system connects several pieces. The CLI script reads MDX files, extracts readable text, generates audio through ElevenLabs, and saves MP3 files locally. A separate upload step moves files to S3. The MDX components library exports AudioPlayer for use in articles. Each article includes an AudioPlayer component pointing to its CloudFront URL.

[Diagram: Complete audio system architecture. Scripts: generate-audio.ts, a CLI tool that reads MDX files, extracts text, and calls ElevenLabs. External services: ElevenLabs API (text-to-speech), AWS S3 (audio storage), CloudFront (CDN). Content: content/journal/ MDX articles embedding AudioPlayer with CloudFront URLs. Components: AudioPlayer.tsx, a React client component with full playback UI and speed control. Environment configuration: ELEVENLABS_API_KEY, ELEVENLABS_VOICE_ID, ELEVENLABS_MODEL_ID, stored in .env and loaded via dotenv/config. Files created/modified: scripts/generate-audio.ts, components/mdx/audio-player.tsx, lib/mdx.ts.]

Takeaways

First, content extraction is the hard part. The ElevenLabs API is well-documented and reliable. Getting clean, speakable text from rich MDX content requires careful regex work and attention to processing order.

Second, voice settings matter for long-form content. The default settings work fine for short phrases, but stability and speaker boost become important when generating multi-minute narrations.

Third, idempotent tooling saves time. The batch mode checking for existing files means you can run the script repeatedly without regenerating everything. Add one article, run the script, only that article gets processed.

Fourth, CDN delivery is essential for audio. Files measured in megabytes need edge caching. Serving directly from the origin would be slow and expensive.

Fifth, scrubbing UI requires careful state management. The isDragging flag and document-level event listeners create smooth scrubbing that doesn't fight with the audio element's own time updates.

The journal now speaks. Articles that previously required focused reading time can be consumed during commutes, workouts, or household chores. A small quality-of-life improvement that opens content to a wider audience.