Setting up a spellchecker in vscode

2024-08-19

Pretty much as soon I started writing this blog I began getting annoyed with writing in a text editor (vscode, for now). Don't get me wrong, I like writing in a text editor (I use vim - sort of). But while I'm a fast typer, I'm not a super accurate one. So, spellchecking is a requirement. There are tons of spellcheckers out there for vscode, but NIH syndrome is a great excuse for personal projects to learn something new. I don't know how to write a vscode extension and I don't know how to write a spellchecker. So it's worth doing both.

The plan

If you're following along with this blog, you'll notice that I like to start every project with a plan. I use plans not only to keep myself on track and organized, but also to conduct a mini retro at the end of each project. What did I get right? What did I get wrong? I teach this technique to all my new engineers and I think it's an overall great strategy for learning. Maybe I'll write an article about that some day. Anyway, the plan:

Write a vscode extension that loads the document text.
Pass the text into a function that pulls out each word.
Get a dictionary.
Load the dictionary into memory.
Compare each word in the document to the dictionary and return missing words as "misspelled"
Highlight the misspelled words in the document.
Implement suggestions.

Let's jump in.

Basic extension and text loading

To start, I'm not going to work from scratch. Some research lead me to a the official vscode tutorial. I'm going to follow that tutorial and modify it to load the text of the document into a variable.

to start, we need to install yeoman and the vscode extension generator.

npm install --global yo generator-code

Then we can generate the extension.

yo code

Easy. With this basic extension I'll add the basic framework.

import * as vscode from 'vscode';
import { spellCheckDocument } from './spellchecker';

export function activate(context: vscode.ExtensionContext) {
    let disposable = vscode.commands.registerCommand('extension.checkSpelling', () => {
        const editor = vscode.window.activeTextEditor;
        if (editor) {
            const text = editor.document.getText();
            spellCheckDocument(text).then((misspelledWords) => {
                console.log('Misspelled words:', misspelledWords);
            });
        }
    });

    context.subscriptions.push(disposable);
}

And the spellchecker function.

async function spellCheckDocument(text: string): Promise<string[]> {
    return [];
}

This next part tripped me up. We need to modify the package.json file to include a command for the extension. This is what I added:

"contributes": {
    "commands": [
      {
        "command": "extension.checkSpelling",
        "title": "Spell Check"
      },
    ]
}

This adds "Spell Check" to the command palette and activates the extension. Now we can run the extension by pressing F5 (or run) and it will open up a new vscode window. I opened up this very repo and ran Spell Check from the palette. The console printed out:

Misspelled words: []

Jackpot.

Skipping steps

At this point, I realized that my plan had some steps backwards. It made more sense to get the dictionary first, then pull words out to do the matching. So, I'm going to skip to step 3, get a dictionary, load them into memory, and then come back to step 2.

I searched for "spell check word list" and found "SCOWL", or Spell Checker Oriented Word Lists. It's a collection of word lists for spell checking. I went to the repo, pulled it down, and made a full file of all the word lists.

I want to keep things easy, so I'm just going to take all of the lists and put them into a single file, with each word its own newline. This was pretty easily accomplished with some python:

import os

dir = "final" # this is the name of SCOWL's final directory
files = os.listdir(dir)
files.sort()


with open("wordlist.txt", "w", encoding="utf-8") as f:
    for file in  files:
        with open(f"{dir}/{file}") as wordlist:
            for word in wordlist:
                f.write(word)

Or not. We failed because of some encoding issues. So I had to expand the script to handle that. I actually dealt with this before in a previous project so stole some code from that:

def create_single_txt():
    dir = 'final'
    files = os.listdir(final_dir)
    files.sort()

    encodings = ['utf-8', 'iso-8859-1', 'windows-1252']

    with open('wordlist.txt', 'w', encoding='utf-8') as f: 
        for file in files:
            file_path = os.path.join(dir, file)
            file_content = read_file_with_encoding(file_path, encodings)
            if file_content:
                f.write(file_content)
            else:
                print(f"RIP: {file}")

def read_file_with_encoding(file_path, encodings):
    for encoding in encodings:
        try:
            with open(file_path, 'r', encoding=encoding) as file:
                return file.read()
        except UnicodeDecodeError:
            continue
    return None

And voilà. Success. We have a single file with all the words. Now let's store them.

The dictionary

Let's take a pause and think about what we're actually going to do here with this spell checker.

We load all the words in the dictionary. We load all the words in the document. We compare each word in the document to the dictionary.

Holy moly is this inefficient. This whole thing is basically a whiteboarding problem, so let's think of it like one.

We know the brute force solution, and while I'm a self-declared "brute force" programmer, let's try to level up and be slightly more efficient.

What is the duplicate work here? We keep on looking at the dictionary. My first thought is thow a cache at things. We can store the dictionary in a map, where each word is a key with a boolean value. This means we iterate through the dictionary once, and then we can check if a word is in the dictionary in O(1) time. Cool.

Let's think one step further. How will this be used?

Well, we're going to want to find similar words. We'll do this using some sort of word distance algorithm. We'll need to compare each word in the document to each word in the dictionary. But do we actually need to compare each word in the dictionary? No, we don't. We can significantly reduce the search space by only comparing words that have similar prefixes. This is inherently a problem for tries to solve!

What we'll do is build a trie of the dictionary. Then, when we're looking for suggestions we can traverse the trie up to a certain depth, calculate the distance of only those words, and return the closest ones. But let's think just a bit more. Do we actually want to do this? The answer is maybe. The reality is that if we have a misspelled word like "ardvark", if we are searching a depth of 2, we'll never find "aardvark". Is only going one level deep enough? Maybe! We'll do it anyway because why not?

The trie

I'm going to use the ternary-search-trie package to build the trie, because I don't feel like doing it myself right now. We'll then import it and build the trie from the dictionary:

import * as fs from 'fs';
import { Trie } from 'ternary-search-trie';
import path from 'path';

const dictionary = new Trie();

function loadDictionary() {
    const filePath = path.join(__dirname, '', 'dictionary.txt');
    const words = fs.readFileSync(filePath, 'utf8').split('\n');

    for (const word of words) {
        const cleanedWord = word.toLowerCase();

        // We do this in case there's a rogue newline in the dictionary. 
        // I don't feel like manually searching for it and removing it.
        if (cleanedWord.length >= 1) {
            dictionary.set(cleanedWord, true);
        }
    }

Pretty straightforward here. We load the file, split it by newline, and add each word to the trie. Now let's find misspelled words. We'll modify our spellCheckDocument function to pull out all the words and check each word against the trie.

function spellCheckDocument(text: string): string[] {
    const words = text.match(/(?<![\w\*\_\-])([a-zA-Z]+)(?![\w\*\_\-])/g) || [];
    const misspelledWords: string[] = [];

    if (words) {
        for (const word of words) {
            if (!dictionary.get(word.toLowerCase())) {
                misspelledWords.push(word);
            }
        }
    }
}

I barely have any idea what this regex is doing. It is grabbing all words, ignore punctuation, ignoring common markdown symbols, and ignoring numbers. My strategy for regex crafting is similar to my strategy for fighting games, bash buttons until it works.

No we can run the extension and see if it works! It does! We have a list of misspelled words. This list is too long! It includes parts of urls. We'll need to filter those out.

const urlRegex = /\b(https?:\/\/|www\.)\S+\b/gi;
const cleanText = text.replace(urlRegex, '');
const words = cleanText.match(/(?<![\w\*\_\-])([a-zA-Z]+)(?![\w\*\_\-])/g) || [];
const misspelledWords = [];

Boom.

But just before we adjourn for this post, let's add a bit of efficiency. We're going to cache our spellcheck results so we don't have to search the dictionary for the same word over and over. We'll add a map to the top of the file:

const cache = new Map<string, boolean | string[]>();

Then we'll use this cache in our spellCheckDocument function. But we're actually going to pull this functionality into its own function for clarity:

function isWordCorrect(word: string): boolean {
    if (cache.has(word)) {
        return cache.get(word) as boolean;
    }

    const isCorrect = !!dictionary.get(word.toLowerCase());
    cache.set(word, isCorrect);

    return isCorrect
}

Then we'll add this to our spellCheckDocument function:

function spellCheckDocument(text: string): { word: string, index: number }[] {
    const urlRegex = /\b(https?:\/\/|www\.)\S+\b/gi;
    const cleanText = text.replace(urlRegex, '');
    const words = cleanText.match(/(?<![\w\*\_\-])([a-zA-Z]+)(?![\w\*\_\-])/g) || [];
    const misspelledWords = [];

    if (words) {
        for (const word of words) {
            if (!isWordCorrect(word)) {
                misspelledWords.push({ word, index });
            }
        }
    }

    return misspelledWords;
}

Now when we run the extension, we only get misspelled words that aren't in urls, we ignore markdown, and we cache the results. Next up, we'll do highlighting and suggestions.