
Benchmarks and Trade-offs for Japanese Morphological Analyzers

TL;DR:
For browser-based Japanese morphological analysis, Lindera IPADIC (13MB) is the simplest download-and-go option and loads in ~1 second. Sudachi with the Core dictionary (207MB) offers modern vocabulary and multi-granular tokenization, but the dictionary download requires caching to keep initial load times acceptable. Once initialized, it tokenizes about 2x faster than Lindera in this test. Kagome’s Go WASM build works, but has a large startup cost (~4s here) and slower tokenization.
Introduction
Here's the thing about Japanese: there are no spaces between words. If you're building anything that needs to understand Japanese text (a reading assistant, a vocabulary app, a dictionary lookup tool), you need to figure out where one word ends and the next begins. That's what makes tokenization essential.
```
Input:  厚生労働省では、世界に先駆けて、革新的な医療機器・再生医療等製品等の有効性・安全性に係る試験方法等を策定し...
Output: 厚生労働省 / で / は / 、 / 世界 / に / 先駆け / て / 、 / 革新 / 的 / な / 医療機器 / ・ / 再生 / 医療 / 等 / 製品 / 等 / の / ...
```
After some research, I found three morphological analyzers with WebAssembly builds that could run in the browser:
- Lindera (Rust): Fast, embedded dictionary
- Sudachi (Rust): Modern vocabulary, multi-granular modes
- Kagome (Go): Mature, Go ecosystem
Which one should you use? It turns out "fastest" depends on what you're measuring, and "best" depends on what you're building. So I did what any reasonable developer would do: benchmarked all of them.
Test Setup
Before we dive in, here's my setup. I wanted numbers that would actually hold up, so I ran each test 100 times after a warm-up run.
Test text: 121-character Japanese government announcement about medical device research (chosen because it has technical vocabulary and compound words)
Environment: macOS (Apple Silicon), Node.js for WASM, native Rust/Python/Go for comparison
Iterations: 100 runs per measurement
```jsx
const text =
  "厚生労働省では、世界に先駆けて、革新的な医療機器・再生医療等製品等の" +
  "有効性・安全性に係る試験方法等を策定し、試験方法等の国際標準化を図り、" +
  "製品の早期実用化とともに、グローバル市場における日本発の製品の普及を推進するための研究課題を公募します。";
```
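The timing loop itself is simple. Here's a minimal sketch of the harness I'm describing, where `benchmark` and `tokenize` are my own placeholder names, not part of any of the libraries:

```jsx
// Minimal timing harness: one warm-up call, then 100 measured runs.
// `tokenize` stands in for whichever library is under test.
function benchmark(tokenize, text, runs = 100) {
  tokenize(text); // warm-up (JIT, lazy initialization)
  const t0 = performance.now();
  for (let i = 0; i < runs; i++) {
    tokenize(text);
  }
  return (performance.now() - t0) / runs; // avg ms per run
}
```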
Understanding the Dictionaries
One thing to understand up front: the dictionary matters more than the tokenizer. I spent some time comparing tokenizer performance before realizing that the dictionary fundamentally determines what a morphological analyzer can do.
Dictionary Comparison
| Dictionary | Size | Last Updated | Entries | Features |
|---|---|---|---|---|
| IPADIC | ~50 MB | 2007 | ~400K | Basic tokenization |
| UniDic | ~100 MB | 2023+ | ~800K | Linguistic accuracy, fine segmentation |
| SudachiDict Small | ~42 MB | Jan 2026 | ~700K | Modern vocabulary |
| SudachiDict Core | 207 MB | Jan 2026 | ~1.5M | Multi-granular (A/B/C), normalization |
| SudachiDict Full | ~240 MB | Jan 2026 | ~3M | Most comprehensive |
Why SudachiDict is 4x Larger
IPADIC stores one segmentation per word. SudachiDict stores three:
```
IPADIC entry:
  選挙管理委員会 → POS, reading

SudachiDict entry:
  選挙管理委員会 → POS, reading
    + split_A: [選挙, 管理, 委員, 会]
    + split_B: [選挙, 管理, 委員会]
    + split_C: [選挙管理委員会]
    + normalized_form
```
This enables Sudachi's killer feature: multi-granular tokenization.
The Multi-Granular Advantage
This is where Sudachi really shines, and honestly, it's what sold me on it.
Sudachi offers three tokenization modes:
| Mode | Granularity | Use Case |
|---|---|---|
| A | Finest | Search indexing, linguistic analysis |
| B | Middle | General NLP |
| C | Longest | Dictionary lookup, vocabulary learning |
Real-World Example
| Word | Mode A | Mode B | Mode C |
|---|---|---|---|
| 厚生労働省 | 厚生/労働/省 | 厚生/労働省 | 厚生労働省 |
| 医療機器 | 医療/機器 | 医療/機器 | 医療機器 |
| 国際標準化 | 国際/標準/化 | 国際/標準化 | 国際/標準化 |
| グローバル市場 | グローバル/市場 | グローバル/市場 | グローバル/市場 |
For a dictionary app, Mode C is ideal: it keeps compound words together so users can look them up directly.
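To make the mode switch concrete, here's a sketch using the sudachi-wasm333 API that appears in the implementation section below (`tokenize_raw` plus `TokenizeMode`); treat it as an assumption-laden illustration rather than verified API docs:

```jsx
// Sketch: tokenize the same compound at each granularity.
// Assumes the sudachi-wasm333 API shown later (tokenize_raw, TokenizeMode).
for (const [name, mode] of [['A', TokenizeMode.A], ['B', TokenizeMode.B], ['C', TokenizeMode.C]]) {
  const tokens = sudachi.tokenize_raw('厚生労働省', mode);
  console.log(`Mode ${name}:`, tokens.map((t) => t.surface).join(' / '));
}
// Mode A: 厚生 / 労働 / 省
// Mode B: 厚生 / 労働省
// Mode C: 厚生労働省
```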
Critical Finding: Small Dictionary Breaks Mode B/C
During testing, I discovered that SudachiDict Small doesn't support multi-granular tokenization:
```
// With SudachiDict Small (42MB)
Mode A: 選挙 / 管理 / 委員 / 会
Mode B: 選挙 / 管理 / 委員 / 会  // Same as A!
Mode C: 選挙 / 管理 / 委員 / 会  // Same as A!

// With SudachiDict Core (207MB)
Mode A: 選挙 / 管理 / 委員 / 会
Mode B: 選挙 / 管理 / 委員会    // Different!
Mode C: 選挙管理委員会          // Different!
```
Takeaway: If you need Mode B or C, you must use the Core or Full dictionary.
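If you're not sure which dictionary your build shipped with, a quick runtime check can catch this. A sketch, using the same assumed API as above:

```jsx
// Sketch: detect the Small-dictionary limitation at runtime.
// With Core/Full, Mode C keeps 選挙管理委員会 as a single token.
const a = sudachi.tokenize_raw('選挙管理委員会', TokenizeMode.A).length; // 4
const c = sudachi.tokenize_raw('選挙管理委員会', TokenizeMode.C).length; // 1 on Core/Full
if (a === c) {
  console.warn('Mode B/C unavailable: likely the Small dictionary.');
}
```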
Native Performance Benchmarks
Before looking at WASM performance, let's establish a baseline: how fast are these tokenizers when running natively? This gives us a ceiling to compare against.
Native Rust (Lindera)
```
=== Lindera (IPADIC) ===
Dictionary load time: 167 ms
Tokenization time: 22 µs (avg of 100 runs)
Token count: 74

=== Lindera (UniDic) ===
Dictionary load time: 676 ms
Tokenization time: 29 µs
Token count: 79
```
Native Python/Rust (SudachiPy)
SudachiPy wraps the same Rust core as sudachi.rs:
```
=== Sudachi (Core Dictionary) ===
Dictionary load time: 246 ms
Mode A - time: 0.133 ms, tokens: 79
Mode B - time: 0.074 ms, tokens: 73
Mode C - time: 0.066 ms, tokens: 71
Memory usage: ~101 MB
```
Native Go (Kagome)
Kagome is a Go tokenizer. I measured it with Go 1.25.6 and the IPA dictionary.
```
=== Kagome (IPA Dictionary) ===
Dictionary load time: 675 ms
Tokenization time: 56 µs (avg of 100 runs)
Token count: 74
```
Native Summary
| Analyzer | Load Time | Tokenize Time | Tokens |
|---|---|---|---|
| Lindera IPADIC | 167 ms | 22 µs | 74 |
| Lindera UniDic | 676 ms | 29 µs | 79 |
| Sudachi Mode A | 246 ms | 133 µs | 79 |
| Sudachi Mode B | 246 ms | 74 µs | 73 |
| Sudachi Mode C | 246 ms | 66 µs | 71 |
| Kagome (IPA) | 675 ms | 56 µs | 74 |
Key insight: Lindera is the fastest native tokenizer. Sudachi Mode C produces the fewest tokens (71) by preserving compound words, which is exactly what I wanted for dictionary lookup.
WASM Performance Benchmarks
Alright, here are the results that matter: how do these perform in the browser?
WASM Package Sizes
| Package | Unpacked | Gzipped | Dictionary |
|---|---|---|---|
| lindera-wasm-web-ipadic | 13 MB | ~4 MB | Embedded |
| lindera-wasm-web-unidic | 47 MB | ~15 MB | Embedded |
| sudachi-wasm333 | 2.5 MB | ~1 MB | Separate |
| SudachiDict Core | 207 MB | ~70 MB | - |
| kagome.wasm | 15 MB | ~5 MB | Embedded |
The critical difference: Lindera embeds the dictionary in WASM, while Sudachi downloads it separately.
WASM Benchmark Results (Node.js)
This is useful for apples-to-apples comparisons between libraries, and it avoids UI framework overhead.
```
=== Lindera WASM (IPADIC) ===
Dictionary load time: 208 ms
Tokenization time: 1.204 ms (avg of 100 runs)
Token count: 74

=== Sudachi WASM (CORE dict - 207MB) ===
Dictionary load time: 273 ms
Mode A - time: 0.406 ms, tokens: 79
Mode B - time: 0.369 ms, tokens: 73
Mode C - time: 0.356 ms, tokens: 71

=== Sudachi WASM (bundled small dict) ===
Dictionary load time (bundled): 272 ms
Mode A - time: 0.957 ms, tokens: 79
Mode B - time: 0.625 ms, tokens: 79 (same as A, small dict limitation)
Mode C - time: 0.429 ms, tokens: 79 (same as A, small dict limitation)
```
Sudachi Is Fastest in WASM (Once Loaded)
| Analyzer | Tokenize Avg (WASM) | Notes |
|---|---|---|
| Lindera IPADIC | 1.204 ms | Embedded IPADIC in a Rust WASM module |
| Sudachi Mode B | 0.369 ms | Balanced segmentation |
| Sudachi Mode C | 0.356 ms | Fast and preserves compounds |
In this Node.js run, Sudachi Mode C is ~3.4x faster than Lindera for tokenization (0.356 ms vs 1.204 ms). (This is after everything is already loaded.)
Kagome is a Go WASM module and is best benchmarked in a browser; see the Chrome section next.
WASM Overhead Analysis (Native vs Node.js)
| Analyzer | Native | Node.js WASM | Overhead |
|---|---|---|---|
| Lindera IPADIC | 22 µs | 1.204 ms | ~55x |
| Sudachi Mode B | 74 µs | 0.369 ms | ~5x |
| Sudachi Mode C | 66 µs | 0.356 ms | ~5x |
Sudachi keeps the WASM penalty relatively low. Lindera pays a much higher overhead per call.
WASM Benchmark Results (Chrome)
Running in Chrome captures browser-side overhead, including the Go runtime startup cost for Kagome. Kagome shows a large fixed startup cost (~4 seconds here) before you can tokenize anything, which makes it less suitable for web applications where fast initialization matters.
```
LINDERA (IPADIC)
Load time: 608.2 ms
Tokenize avg: 0.947 ms
Tokenize min: 0.845 ms
Tokenize max: 2.000 ms
Token count: 74

KAGOME (IPA)
Load time: 4133.3 ms
Tokenize avg: 3.274 ms
Tokenize min: 1.715 ms
Tokenize max: 12.430 ms
Token count: 74

SUDACHI (Core Dictionary)
Load time: 1692.3 ms
Mode A: avg 0.413 ms (79 tokens)
Mode B: avg 0.398 ms (73 tokens)
Mode C: avg 0.484 ms (71 tokens)
```
Why Node.js vs Browser WASM Results Differ
There are three different “places” you might measure WASM performance, and they answer different questions:
| Factor | Node.js script (benchmark_wasm.mjs) | Browser benchmark page (/benchmark.html) | Web app UI (React) |
|---|---|---|---|
| What you measure | Mostly tokenizer call overhead after everything is already available on disk | Real browser runtime: fetch/compile/instantiate + tokenize | Real UX: browser runtime + React updates + app logic |
| Iterations | 100 | 100 | 5 (benchmark mode) or 1 (normal) |
| Warm-up | 1 + 100 measured | 1 + 100 measured | 1 warm-up + 5 measured |
| JIT (just-in-time) optimization | Usually more stable | Stable after warm-up | Noisy (few runs) |
| Asset loading | From local disk | Over HTTP (even on localhost), subject to caching | Over HTTP + app chunking |
| Sandbox / security model | None (process has direct access) | Browser sandbox + stricter policies | Browser sandbox + app policies |
| Overhead around tokenize | Minimal | Minimal | Includes async scheduling + React state updates |
The biggest “gotchas” are (1) iteration count (JIT + variance) and (2) what you include in load time (disk vs HTTP, cached vs uncached, dev server vs production).
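As a concrete illustration of the "what you include in load time" point, here's a sketch of browser-side measurement; the .wasm path is hypothetical, and fetch, compile, and instantiate all land inside the timed window:

```jsx
// Sketch: browser-side "load time" includes fetch + compile + instantiate.
// The .wasm path is hypothetical; cached vs uncached fetches differ wildly.
// (instantiateStreaming requires the server to send application/wasm.)
const t0 = performance.now();
const { instance } = await WebAssembly.instantiateStreaming(
  fetch('/wasm/tokenizer_bg.wasm')
);
console.log(`Load time: ${(performance.now() - t0).toFixed(1)} ms`);
```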
Real-World Load Times
Let's talk about what users will actually experience.
Estimated Browser Load Times
| Package | @ 50 Mbps | @ 100 Mbps | @ 200 Mbps |
|---|---|---|---|
| Lindera IPADIC (13 MB) | ~2s | ~1s | ~0.5s |
| Lindera UniDic (47 MB) | ~7.5s | ~3.8s | ~1.9s |
| Kagome (15 MB) | ~2.4s | ~1.2s | ~0.6s |
| Sudachi + Core (210 MB) | ~33s | ~17s | ~8.5s |
| Sudachi + Core (gzipped) | ~11s | ~5.5s | ~2.8s |
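These are back-of-the-envelope numbers: megabytes times 8, divided by megabits per second, ignoring latency and protocol overhead. A one-liner reproduces the table:

```jsx
// Estimate download seconds: size (MB) * 8 bits / bandwidth (Mbps).
// Ignores latency, TLS handshakes, and HTTP overhead.
const loadSeconds = (sizeMB, mbps) => (sizeMB * 8) / mbps;
console.log(loadSeconds(207, 50).toFixed(1)); // ~33.1s, the brutal case below
```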
The Caching Solution
Those Sudachi numbers look as bad as expected, right? 33 seconds at 50 Mbps is brutal. But here's the thing: users only pay that cost once.
The solution is IndexedDB caching:
```jsx
// Assumes idb-keyval (imported as idb) for IndexedDB access;
// fetchWithProgress is an app-defined helper that streams the download.
import * as idb from 'idb-keyval';

async function loadSudachiWithCache() {
  const cachedDict = await idb.get('sudachi-core-dict');
  if (cachedDict) {
    // Return visit: ~500ms
    await sudachi.initialize_from_bytes(cachedDict);
  } else {
    // First visit: show progress bar
    const dict = await fetchWithProgress('/sudachi/system.dic');
    await idb.set('sudachi-core-dict', dict);
    await sudachi.initialize_from_bytes(dict);
  }
}
```
Result:
- First visit: 5-15 seconds (with progress indicator)
- Return visits: ~500ms (essentially instant)
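A Service Worker plus the Cache Storage API works just as well; here's a minimal sketch of that variant (the cache name and URL are mine):

```jsx
// Sketch: Cache Storage API variant. Same idea as the IndexedDB version:
// download once, serve from the local cache on every later visit.
async function fetchDictCached(url) {
  const cache = await caches.open('sudachi-dict-v1');
  let res = await cache.match(url);
  if (!res) {
    res = await fetch(url);
    await cache.put(url, res.clone());
  }
  return new Uint8Array(await res.arrayBuffer());
}
```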
Segmentation Quality Comparison
How do the analyzers actually differ in output?
Token Count Comparison
| Analyzer | Tokens | Tendency |
|---|---|---|
| Lindera IPADIC | 74 | Keeps some compounds |
| Lindera UniDic | 79 | Splits more finely |
| Sudachi Mode A | 79 | Finest splitting |
| Sudachi Mode B | 73 | Balanced |
| Sudachi Mode C | 71 | Preserves compounds |
| Kagome IPA | 74 | Similar granularity to IPADIC |
Key Word Segmentation
| Word | Lindera IPADIC | Lindera UniDic | Sudachi C |
|---|---|---|---|
| 厚生労働省 | 厚生/労働省 | 厚生/労働/省 | 厚生労働省 |
| 医療機器 | 医療/機器 | 医療/機器 | 医療機器 |
| 再生医療 | 再生/医療 | 再生/医療 | 再生/医療 |
| 国際標準化 | 国際/標準/化 | 国際/標準/化 | 国際/標準化 |
| グローバル市場 | グローバル/市場 | グローバル/市場 | グローバル/市場 |
Observations:
- Lindera IPADIC keeps ~80% of compounds together
- Lindera UniDic splits more aggressively (better for linguistic analysis)
- Sudachi Mode C preserves the most compounds (best for dictionary lookup)
Feature Comparison
| Feature | Lindera IPADIC | Lindera UniDic | Sudachi Core | Kagome |
|---|---|---|---|---|
| Multi-granular (A/B/C) | ✗ | ✗ | ✓ | ✗ |
| Normalized form | ✗ | ✗ | ✓ | ✗ |
| Modern vocabulary | ✗ (2007) | ✓ | ✓ (2026) | ✗ |
| Compound preservation | ~80% | ~60% | ~95% (Mode C) | ~80% |
| Embedded dictionary | ✓ | ✓ | ✗ | ✓ |
| WASM tokenize speed (Node.js) | 1.20ms | ~2ms | 0.36ms (Mode C) | n/a (see Chrome: 3.27ms + ~4.1s init) |
| Package size | 13 MB | 47 MB | 210 MB | 15 MB |
| Best for | Speed-first | Linguistics | Dictionary apps | Go ecosystem |
How to Choose the Right Tool
Choose Lindera IPADIC if:
- Load time is critical (must be < 2 seconds)
- You're building a simple tokenizer/search
- Vocabulary from 2007 is acceptable
- You want the smallest bundle size
Choose Lindera UniDic if:
- You need linguistic accuracy
- Fine-grained segmentation is preferred
- Load time of 3-4 seconds is acceptable
Choose Sudachi with Core Dictionary if:
- You're building a dictionary/vocabulary app
- You need compound words preserved (Mode C)
- You can implement caching (IndexedDB/Service Worker)
- Modern vocabulary matters (medical terms, proper nouns)
- Normalization features are useful (e.g., 附属→付属; see the sketch below)
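Here's what that normalization looks like in practice, a sketch assuming the sudachi-wasm333 API and token shape used in the implementation section below:

```jsx
// Sketch: normalized_form maps variant spellings to a canonical form,
// e.g. 附属 → 付属, so either spelling can hit the same dictionary entry.
const tokens = sudachi.tokenize_raw('附属', TokenizeMode.C);
console.log(tokens.map((t) => `${t.surface} → ${t.normalized_form}`));
```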
Choose Kagome if:
- You're already in the Go ecosystem
- You want a middle-ground option
- ~15MB download is acceptable
Implementation Recommendations
For Dictionary/Vocabulary Apps
```jsx
// Recommended: Sudachi with caching
import { SudachiStateless, TokenizeMode } from 'sudachi-wasm333';

const sudachi = new SudachiStateless();

// First load: show progress
await sudachi.initialize_browser('/sudachi/system.dic', {
  onProgress: (percent) => updateProgressBar(percent),
});

// Use Mode C for dictionary lookup
const tokens = sudachi.tokenize_raw(text, TokenizeMode.C);

// Each token has:
// - surface: "厚生労働省"
// - normalized_form: "厚生労働省"
// - reading_form: "コウセイロウドウショウ"
// - dictionary_form: "厚生労働省"
```
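For dictionary lookup specifically, dictionary_form is usually the right lookup key, since it maps conjugated forms back to their base form (e.g., 食べて → 食べる). A sketch, where lookupInDictionary and showGloss are hypothetical app-level functions:

```jsx
// Sketch: look up each token's dictionary_form so conjugated forms
// (e.g., 食べて → 食べる) resolve to their base-form entry.
// lookupInDictionary and showGloss are hypothetical app functions.
for (const token of tokens) {
  const entry = await lookupInDictionary(token.dictionary_form);
  if (entry) showGloss(token.surface, entry);
}
```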
For Speed-Critical Apps
```jsx
// Recommended: Lindera IPADIC
import init, { TokenizerBuilder } from 'lindera-wasm-web-ipadic';

await init();

const builder = new TokenizerBuilder();
builder.setDictionary('embedded://ipadic');
builder.setMode('normal');

const tokenizer = builder.build();
const tokens = tokenizer.tokenize(text);
```
Server Configuration
Enable gzip/brotli compression for the dictionary file:
```nginx
# nginx.conf
location /sudachi/ {
    gzip on;
    gzip_types application/octet-stream;
    gzip_min_length 1000;
}
```
This reduces the 207MB dictionary to a ~70MB transfer.
Conclusion
There's no single "best" Japanese tokenizer; it depends on your priorities:
| Priority | Recommendation |
|---|---|
| Fastest load | Lindera IPADIC (13MB, ~1s) |
| Fastest tokenization | Sudachi WASM (Mode C ~0.36ms) |
| Best for dictionary apps | Sudachi Mode C |
| Modern vocabulary | Sudachi Core |
| Smallest bundle | Lindera IPADIC |
| Linguistic accuracy | Lindera UniDic |
Benchmark Data
All benchmarks were run on the following text (121 characters):
厚生労働省では、世界に先駆けて、革新的な医療機器・再生医療等製品等の有効性・安全性に係る試験方法等を策定し、試験方法等の国際標準化を図り、製品の早期実用化とともに、グローバル市場における日本発の製品の普及を推進するための研究課題を公募します。
Resources
Here are the projects I tested. All are open source and actively maintained:
- Lindera - Rust morphological analyzer
- Sudachi - Multi-granular Japanese tokenizer
- Kagome - Go morphological analyzer
- SudachiDict - Dictionary downloads

