Benchmarks and Trade-offs for Japanese Morphological Analyzers

TL;DR:

For browser-based Japanese morphological analysis, Lindera with IPADIC (13 MB) is the simplest download-and-go option and loads in ~1 second. Sudachi with the Core dictionary (207 MB) offers modern vocabulary and multi-granular tokenization, but the dictionary download requires caching for acceptable initial load times. Once initialized, it tokenizes roughly 2x faster than Lindera in this test. Kagome's Go WASM build works, but has a large startup cost (~4 s here) and slower tokenization.

Introduction

Here's the thing about Japanese: there are no spaces between words. If you're building anything that needs to understand Japanese text (a reading assistant, a vocabulary app, a dictionary lookup tool, and so on), you need to figure out where one word ends and the next begins, which makes tokenization essential.

Input:  厚生労働省では、世界に先駆けて、革新的な医療機器・再生医療等製品等の有効性・安全性に係る試験方法等を策定し...
Output: 厚生労働省 / で / は / 、 / 世界 / に / 先駆け / て / 、 / 革新 / 的 / な / 医療機器 / ・ / 再生 / 医療 / 等 / 製品 / 等 / の / ...

After some research, I found three morphological analyzers with WebAssembly builds that could run in the browser:

  • Lindera (Rust): Fast, embedded dictionary
  • Sudachi (Rust): Modern vocabulary, multi-granular modes
  • Kagome (Go): Mature, Go ecosystem

Which one should you use? It turns out "fastest" depends on what you're measuring, and "best" depends on what you're building. So I did what any reasonable developer would do: benchmarked all of them.

Test Setup

Before we dive in, here's my setup. I wanted numbers that would actually hold up, so I ran each test many times after a warm-up run.

Test text: 121-character Japanese government announcement about medical device research (chosen because it has technical vocabulary and compound words)

Environment: macOS (Apple Silicon), Node.js for WASM, native Rust/Python/Go for comparison

Iterations: 100 runs per measurement

js

const text = "厚生労働省では、世界に先駆けて、革新的な医療機器・再生医療等製品等の" +
  "有効性・安全性に係る試験方法等を策定し、試験方法等の国際標準化を図り、" +
  "製品の早期実用化とともに、グローバル市場における日本発の製品の普及を推進するための研究課題を公募します。";
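The measurement loop itself is simple. Here is a sketch of the shape I used, with illustrative names rather than the actual benchmark script: one warm-up call, then the measured iterations, reporting the average.

```js
// Hypothetical harness: one warm-up call, then `iterations` timed runs.
// `tokenize` stands in for whichever analyzer is under test.
function benchmark(tokenize, text, iterations = 100) {
  tokenize(text); // warm-up run (JIT compilation, lazy initialization)

  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    tokenize(text);
  }
  const elapsed = performance.now() - start;

  return elapsed / iterations; // average ms per call
}
```

Measuring dictionary load time is a separate single `performance.now()` pair around initialization, since loading only happens once.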

Understanding the Dictionaries

It's necessary to understand that the dictionary matters more than the tokenizer. I spent some time comparing tokenizer performance before realizing that the choice of dictionary is what fundamentally shapes a morphological analyzer's output.

Dictionary Comparison

| Dictionary | Size | Last Updated | Entries | Features |
|---|---|---|---|---|
| IPADIC | ~50 MB | 2007 | ~400K | Basic tokenization |
| UniDic | ~100 MB | 2023+ | ~800K | Linguistic accuracy, fine segmentation |
| SudachiDict Small | ~42 MB | Jan 2026 | ~700K | Modern vocabulary |
| SudachiDict Core | 207 MB | Jan 2026 | ~1.5M | Multi-granular (A/B/C), normalization |
| SudachiDict Full | ~240 MB | Jan 2026 | ~3M | Most comprehensive |

Why SudachiDict is 4x Larger

IPADIC stores one segmentation per word. SudachiDict stores three:

IPADIC entry:
  選挙管理委員会 → POS, reading

SudachiDict entry:
  選挙管理委員会 → POS, reading
                 + split_A: [選挙, 管理, 委員, 会]
                 + split_B: [選挙, 管理, 委員会]
                 + split_C: [選挙管理委員会]
                 + normalized_form

This enables Sudachi's killer feature: multi-granular tokenization.

The Multi-Granular Advantage

This is where Sudachi really shines, and honestly, it's what sold me on it.

Sudachi offers three tokenization modes:

| Mode | Granularity | Use Case |
|---|---|---|
| A | Finest | Search indexing, linguistic analysis |
| B | Middle | General NLP |
| C | Longest | Dictionary lookup, vocabulary learning |

Real-World Example

| Word | Mode A | Mode B | Mode C |
|---|---|---|---|
| 厚生労働省 | 厚生/労働/省 | 厚生/労働省 | 厚生労働省 |
| 医療機器 | 医療/機器 | 医療/機器 | 医療機器 |
| 国際標準化 | 国際/標準/化 | 国際/標準化 | 国際/標準化 |
| グローバル市場 | グローバル/市場 | グローバル/市場 | グローバル/市場 |

For a dictionary app, Mode C is ideal: it keeps compound words together so users can look them up directly.

Critical Finding: Small Dictionary Breaks Mode B/C

During testing, I discovered that SudachiDict Small doesn't support multi-granular tokenization:

text

// With SudachiDict Small (42MB)
Mode A: 選挙 / 管理 / 委員 / 会
Mode B: 選挙 / 管理 / 委員 / 会   // Same as A!
Mode C: 選挙 / 管理 / 委員 / 会   // Same as A!

// With SudachiDict Core (207MB)
Mode A: 選挙 / 管理 / 委員 / 会
Mode B: 選挙 / 管理 / 委員会     // Different!
Mode C: 選挙管理委員会           // Different!

Takeaway: If you need Mode B or C, you must use the Core or Full dictionary.
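If you want to catch this at runtime rather than in testing, one option is a startup sanity check: tokenize a known compound in Mode A and Mode C and compare. Below is a sketch; `tokenizeSurfaces(text, mode)` is an assumed wrapper returning surface strings, not a real sudachi-wasm333 API.

```js
// Sketch of a startup sanity check: if Mode C output is identical to
// Mode A for a known compound, the loaded dictionary cannot produce
// multi-granular splits (the SudachiDict Small limitation above).
// `tokenizeSurfaces(text, mode)` is an assumed wrapper, not a real API.
function supportsMultiGranular(tokenizeSurfaces, probe = '選挙管理委員会') {
  const a = tokenizeSurfaces(probe, 'A').join('/');
  const c = tokenizeSurfaces(probe, 'C').join('/');
  return a !== c; // Core/Full: differs; Small: identical
}
```

With the Small dictionary loaded, this returns `false` and you can warn or fall back to Mode A explicitly instead of silently getting A-grade segmentation.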

Native Performance Benchmarks

Before looking at WASM performance, let's establish a baseline: how fast can these tokenizers actually be when running natively?

This gives us a ceiling to compare against.

Native Rust (Lindera)

=== Lindera (IPADIC) ===
  Dictionary load time: 167 ms
  Tokenization time:    22 µs (avg of 100 runs)
  Token count:          74

=== Lindera (UniDic) ===
  Dictionary load time: 676 ms
  Tokenization time:    29 µs
  Token count:          79

Native Python/Rust (SudachiPy)

SudachiPy wraps the same Rust core as sudachi.rs:

=== Sudachi (Core Dictionary) ===
  Dictionary load time: 246 ms
  Mode A - time: 0.133 ms, tokens: 79
  Mode B - time: 0.074 ms, tokens: 73
  Mode C - time: 0.066 ms, tokens: 71
  Memory usage: ~101 MB

Native Go (Kagome)

Kagome is a Go tokenizer. I measured it with Go 1.25.6 and the IPA dictionary.

=== Kagome (IPA Dictionary) ===
  Dictionary load time: 675 ms
  Tokenization time:    56 µs (avg of 100 runs)
  Token count:          74

Native Summary

| Analyzer | Load Time | Tokenize Time | Tokens |
|---|---|---|---|
| Lindera IPADIC | 167 ms | 22 µs | 74 |
| Lindera UniDic | 676 ms | 29 µs | 79 |
| Sudachi Mode A | 246 ms | 133 µs | 79 |
| Sudachi Mode B | 246 ms | 74 µs | 73 |
| Sudachi Mode C | 246 ms | 66 µs | 71 |
| Kagome (IPA) | 675 ms | 56 µs | 74 |

Key insight: Lindera is the fastest native tokenizer. Sudachi Mode C produces the fewest tokens (71) by preserving compound words, which is exactly what I wanted for dictionary lookup.

WASM Performance Benchmarks

Alright, here are the results that matter: how do these perform in the browser?

WASM Package Sizes

| Package | Unpacked | Gzipped | Dictionary |
|---|---|---|---|
| lindera-wasm-web-ipadic | 13 MB | ~4 MB | Embedded |
| lindera-wasm-web-unidic | 47 MB | ~15 MB | Embedded |
| sudachi-wasm333 | 2.5 MB | ~1 MB | Separate |
| SudachiDict Core | 207 MB | ~70 MB | - |
| kagome.wasm | 15 MB | ~5 MB | Embedded |

The critical difference: Lindera embeds the dictionary in WASM, while Sudachi downloads it separately.

WASM Benchmark Results (Node.js)

This is useful for apples-to-apples comparisons between libraries, and it avoids UI framework overhead.

=== Lindera WASM (IPADIC) ===
  Dictionary load time: 208 ms
  Tokenization time:    1.204 ms (avg of 100 runs)
  Token count:          74

=== Sudachi WASM (CORE dict - 207MB) ===
  Dictionary load time: 273 ms
  Mode A - time: 0.406 ms, tokens: 79
  Mode B - time: 0.369 ms, tokens: 73
  Mode C - time: 0.356 ms, tokens: 71

=== Sudachi WASM (bundled small dict) ===
  Dictionary load time (bundled): 272 ms
  Mode A - time: 0.957 ms, tokens: 79
  Mode B - time: 0.625 ms, tokens: 79 (same as A, small dict limitation)
  Mode C - time: 0.429 ms, tokens: 79 (same as A, small dict limitation)

Sudachi Is Fastest in WASM (Once Loaded)

| Analyzer | Tokenize Avg (WASM) | Notes |
|---|---|---|
| Lindera IPADIC | 1.204 ms | Embedded IPADIC in a Rust WASM module |
| Sudachi Mode B | 0.369 ms | Balanced segmentation |
| Sudachi Mode C | 0.356 ms | Fast and preserves compounds |

In this Node.js run, Sudachi Mode C is ~3.4x faster than Lindera for tokenization (1.204 ms vs 0.356 ms). This is after everything is already loaded.

Kagome is a Go WASM module and is best benchmarked in a browser; see the Chrome section next.

WASM Overhead Analysis (Native vs Node.js)

| Analyzer | Native | Node.js WASM | Overhead |
|---|---|---|---|
| Lindera IPADIC | 22 µs | 1.204 ms | ~55x |
| Sudachi Mode B | 74 µs | 0.369 ms | ~5x |
| Sudachi Mode C | 66 µs | 0.356 ms | ~5x |

Sudachi keeps the WASM penalty relatively low. Lindera pays a much higher overhead per call.

WASM Benchmark Results (Chrome)

Running in Chrome includes the Go runtime startup cost for Kagome and captures browser-side overhead. Kagome shows a large fixed startup cost (~4 seconds here) before you can tokenize anything, which makes it less suitable for web applications where fast initialization matters.

LINDERA (IPADIC)
  Load time:      608.2 ms
  Tokenize avg:   0.947 ms
  Tokenize min:   0.845 ms
  Tokenize max:   2.000 ms
  Token count:    74

KAGOME (IPA)
  Load time:      4133.3 ms
  Tokenize avg:   3.274 ms
  Tokenize min:   1.715 ms
  Tokenize max:   12.430 ms
  Token count:    74

SUDACHI (Core Dictionary)
  Load time:      1692.3 ms

  Mode A: avg 0.413 ms (79 tokens)
  Mode B: avg 0.398 ms (73 tokens)
  Mode C: avg 0.484 ms (71 tokens)

Why Node.js vs Browser WASM Results Differ

There are three different “places” you might measure WASM performance, and they answer different questions:

| Factor | Node.js script (benchmark_wasm.mjs) | Browser benchmark page (/benchmark.html) | Web app UI (React) |
|---|---|---|---|
| What you measure | Mostly tokenizer call overhead after everything is already available on disk | Real browser runtime: fetch/compile/instantiate + tokenize | Real UX: browser runtime + React updates + app logic |
| Iterations | 100 | 100 | 5 (benchmark mode) or 1 (normal) |
| Warm-up | 1 + 100 measured | 1 + 100 measured | 1 warm-up + 5 measured |
| JIT (just-in-time) optimization | Usually more stable | Stable after warm-up | Noisy (few runs) |
| Asset loading | From local disk | Over HTTP (even on localhost), subject to caching | Over HTTP + app chunking |
| Sandbox / security model | None (process has direct access) | Browser sandbox + stricter policies | Browser sandbox + app policies |
| Overhead around tokenize | Minimal | Minimal | Includes async scheduling + React state updates |

The biggest “gotchas” are (1) iteration count (JIT + variance) and (2) what you include in load time (disk vs HTTP, cached vs uncached, dev server vs production).

Real World Load Times

Let's talk about what users will actually experience.

Estimated Browser Load Times

| Package | @ 50 Mbps | @ 100 Mbps | @ 200 Mbps |
|---|---|---|---|
| Lindera IPADIC (13 MB) | ~2s | ~1s | ~0.5s |
| Lindera UniDic (47 MB) | ~7.5s | ~3.8s | ~1.9s |
| Kagome (15 MB) | ~2.4s | ~1.2s | ~0.6s |
| Sudachi + Core (210 MB) | ~33s | ~17s | ~8.5s |
| Sudachi + Core (gzipped) | ~11s | ~5.5s | ~2.8s |
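These estimates are plain arithmetic: megabytes times 8 bits per byte, divided by the link speed in megabits per second (ignoring latency and TCP ramp-up, so real-world times will be somewhat worse):

```js
// Estimated transfer time: MB * 8 bits/byte / Mbps.
// Ignores latency and TCP slow start, so treat as a lower bound.
function estimateSeconds(sizeMB, mbps) {
  return (sizeMB * 8) / mbps;
}

estimateSeconds(13, 100); // Lindera IPADIC at 100 Mbps -> 1.04 s
estimateSeconds(207, 50); // Sudachi Core at 50 Mbps    -> 33.12 s
```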

The Caching Solution

Those Sudachi numbers look as bad as expected, right? 33 seconds at 50 Mbps is brutal. But here's the thing: users only pay that cost once.

The solution is IndexedDB caching:

js

// `idb` here stands for a small IndexedDB helper with get/set
// (e.g. the idb-keyval package's get/set functions).
async function loadSudachiWithCache() {
  const cachedDict = await idb.get('sudachi-core-dict');

  if (cachedDict) {
    // Return visit: ~500ms
    await sudachi.initialize_from_bytes(cachedDict);
  } else {
    // First visit: show progress bar
    const dict = await fetchWithProgress('/sudachi/system.dic');
    await idb.set('sudachi-core-dict', dict);
    await sudachi.initialize_from_bytes(dict);
  }
}

Result:

  • First visit: 5-15 seconds (with progress indicator)
  • Return visits: ~500ms (essentially instant)
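The `fetchWithProgress` helper above isn't shown in full; here is a minimal sketch using the Fetch streaming API. Note that `Content-Length` may be absent, or may reflect the compressed size when the server gzips on the fly, in which case the percentage is only approximate.

```js
// Report progress from loaded/total bytes, clamped to 100.
function percentLoaded(loaded, total) {
  return Math.min(100, Math.round((loaded / total) * 100));
}

// Minimal sketch: stream the response body, report percent as chunks
// arrive, and return the concatenated bytes.
async function fetchWithProgress(url, onProgress = () => {}) {
  const res = await fetch(url);
  const total = Number(res.headers.get('Content-Length')) || 0;
  const reader = res.body.getReader();

  const chunks = [];
  let loaded = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
    loaded += value.length;
    if (total) onProgress(percentLoaded(loaded, total));
  }

  // Concatenate chunks into a single Uint8Array for the tokenizer init.
  const out = new Uint8Array(loaded);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}
```

A Service Worker with the Cache API is an alternative to IndexedDB here; either way, the point is that the 207 MB download happens once.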

Segmentation Quality Comparison

How do the analyzers actually differ in output?

Token Count Comparison

| Analyzer | Tokens | Tendency |
|---|---|---|
| Lindera IPADIC | 74 | Keeps some compounds |
| Lindera UniDic | 79 | Splits more finely |
| Sudachi Mode A | 79 | Finest splitting |
| Sudachi Mode B | 73 | Balanced |
| Sudachi Mode C | 71 | Preserves compounds |
| Kagome IPA | 74 | Similar granularity to IPADIC |

Key Word Segmentation

| Word | Lindera IPADIC | Lindera UniDic | Sudachi C |
|---|---|---|---|
| 厚生労働省 | 厚生/労働省 | 厚生/労働/省 | 厚生労働省 |
| 医療機器 | 医療/機器 | 医療/機器 | 医療機器 |
| 再生医療 | 再生/医療 | 再生/医療 | 再生/医療 |
| 国際標準化 | 国際/標準/化 | 国際/標準/化 | 国際/標準化 |
| グローバル市場 | グローバル/市場 | グローバル/市場 | グローバル/市場 |

Observations:

  • Lindera IPADIC keeps ~80% of compounds together
  • Lindera UniDic splits more aggressively (better for linguistic analysis)
  • Sudachi Mode C preserves the most compounds (best for dictionary lookup)

Feature Comparison

| Feature | Lindera IPADIC | Lindera UniDic | Sudachi Core | Kagome |
|---|---|---|---|---|
| Multi-granular (A/B/C) | ✗ | ✗ | ✓ | ✗ |
| Normalized form | ✗ | ✗ | ✓ | ✗ |
| Modern vocabulary | ✗ (2007) | ✓ | ✓ (2026) | ✗ |
| Compound preservation | ~80% | ~60% | ~95% (Mode C) | ~80% |
| Embedded dictionary | ✓ | ✓ | ✗ | ✓ |
| WASM tokenize speed (Node.js) | 1.20 ms | ~2 ms | 0.36 ms (Mode C) | n/a (see Chrome: 3.27 ms + ~4.1 s init) |
| Package size | 13 MB | 47 MB | 210 MB | 15 MB |
| Best for | Speed-first | Linguistics | Dictionary apps | Go ecosystem |

How to Choose the Right Tool

Choose Lindera IPADIC if:

  • Load time is critical (must be < 2 seconds)
  • You're building a simple tokenizer/search
  • Vocabulary from 2007 is acceptable
  • You want the smallest bundle size

Choose Lindera UniDic if:

  • You need linguistic accuracy
  • Fine-grained segmentation is preferred
  • Load time of 3-4 seconds is acceptable

Choose Sudachi with Core Dictionary if:

  • You're building a dictionary/vocabulary app
  • You need compound words preserved (Mode C)
  • You can implement caching (IndexedDB/Service Worker)
  • Modern vocabulary matters (medical terms, proper nouns)
  • Normalization features are useful (e.g., 附属→付属)

Choose Kagome if:

  • You're already in the Go ecosystem
  • You want a middle-ground option
  • ~15MB download is acceptable

Implementation Recommendations

For Dictionary/Vocabulary Apps

js

// Recommended: Sudachi with caching
import { SudachiStateless, TokenizeMode } from 'sudachi-wasm333';

const sudachi = new SudachiStateless();

// First load: show progress
await sudachi.initialize_browser('/sudachi/system.dic', {
  onProgress: (percent) => updateProgressBar(percent)
});

// Use Mode C for dictionary lookup
const tokens = sudachi.tokenize_raw(text, TokenizeMode.C);

// Each token has:
// - surface: "厚生労働省"
// - normalized_form: "厚生労働省"
// - reading_form: "コウセイロウドウショウ"
// - dictionary_form: "厚生労働省"

For Speed Critical Apps

js

// Recommended: Lindera IPADIC
import init, { TokenizerBuilder } from 'lindera-wasm-web-ipadic';

await init();

const builder = new TokenizerBuilder();
builder.setDictionary('embedded://ipadic');
builder.setMode('normal');
const tokenizer = builder.build();

const tokens = tokenizer.tokenize(text);

Server Configuration

Enable gzip/brotli compression for the dictionary file:

# nginx.conf
location /sudachi/ {
    gzip on;
    gzip_types application/octet-stream;
    gzip_min_length 1000;
}

This reduces the 207MB dictionary to ~70MB transfer.

Conclusion

There's no single "best" Japanese tokenizer; it depends on your priorities:

| Priority | Recommendation |
|---|---|
| Fastest load | Lindera IPADIC (13 MB, ~1s) |
| Fastest tokenization | Sudachi WASM (Mode C, ~0.36 ms) |
| Best for dictionary apps | Sudachi Mode C |
| Modern vocabulary | Sudachi Core |
| Smallest bundle | Lindera IPADIC |
| Linguistic accuracy | Lindera UniDic |

Benchmark Data

All benchmarks were run on the following text (121 characters):

厚生労働省では、世界に先駆けて、革新的な医療機器・再生医療等製品等の有効性・安全性に係る
試験方法等を策定し、試験方法等の国際標準化を図り、製品の早期実用化とともに、グローバル市場
における日本発の製品の普及を推進するための研究課題を公募します。

Resources

Here are the projects I tested. All are open source and actively maintained:

  • Lindera (Rust)
  • Sudachi / sudachi.rs (Rust)
  • Kagome (Go)


Please feel free to follow me or get in touch. I'm also happy to hear about work opportunities.

Copyright © anila. All rights reserved.