Benchmarks and Trade-offs for Japanese Morphological Analyzers

TL;DR:

For browser-based Japanese morphological analysis, Lindera with IPADIC (13 MB) is the simplest download-and-go option and loads in ~1 second. Sudachi with the Core dictionary (207 MB) offers modern vocabulary and multi-granular tokenization, but the dictionary download requires caching for acceptable initial load times. Once initialized, it tokenizes roughly 2x faster than Lindera in this test. Kagome's Go WASM build works, but has a large startup cost (~4 s here) and slower tokenization.

Introduction

Here's the thing about Japanese: there are no spaces between words. If you're building anything that needs to understand Japanese text (a reading assistant, a vocabulary app, a dictionary lookup tool, and so on), you need to figure out where one word ends and the next begins, which makes tokenization essential.

Input:  厚生労働省では、世界に先駆けて、革新的な医療機器・再生医療等製品等の有効性・安全性に係る試験方法等を策定し...
Output: 厚生労働省 / で / は / 、 / 世界 / に / 先駆け / て / 、 / 革新 / 的 / な / 医療機器 / ・ / 再生 / 医療 / 等 / 製品 / 等 / の / ...

After some research, I found three morphological analyzers with WebAssembly builds that could run in the browser:

  • Lindera (Rust): Fast, embedded dictionary
  • Sudachi (Rust): Modern vocabulary, multi-granular modes
  • Kagome (Go): Mature, Go ecosystem

Which one should you use? It turns out "fastest" depends on what you're measuring, and "best" depends on what you're building. So I did what any reasonable developer would do: benchmarked all of them.

Test Setup

Before we dive in, here's my setup. I wanted numbers that would actually hold up, so I ran each test many times after a warm-up run.

Test text: 121-character Japanese government announcement about medical device research (chosen because it has technical vocabulary and compound words)

Environment: macOS (Apple Silicon), Node.js for WASM, native Rust/Python/Go for comparison

Iterations: 100 runs per measurement

js

const text = "厚生労働省では、世界に先駆けて、革新的な医療機器・再生医療等製品等の" +
  "有効性・安全性に係る試験方法等を策定し、試験方法等の国際標準化を図り、" +
  "製品の早期実用化とともに、グローバル市場における日本発の製品の普及を推進するための研究課題を公募します。";
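The measurement loop itself is simple. Here is a sketch of the shape I used, with illustrative names rather than the actual benchmark script: one warm-up call, then the measured iterations, reporting the average.

```js
// Hypothetical harness: one warm-up call, then `iterations` timed runs.
// `tokenize` stands in for whichever analyzer is under test.
function benchmark(tokenize, text, iterations = 100) {
  tokenize(text); // warm-up run (JIT compilation, lazy initialization)

  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    tokenize(text);
  }
  const elapsed = performance.now() - start;

  return elapsed / iterations; // average ms per call
}
```

Measuring dictionary load time is a separate single `performance.now()` pair around initialization, since loading only happens once.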

Understanding the Dictionaries

It's necessary to understand that the dictionary matters more than the tokenizer. I spent some time comparing tokenizer performance before realizing that the choice of dictionary is what fundamentally shapes a morphological analyzer's output.

Dictionary Comparison

| Dictionary | Size | Last Updated | Entries | Features |
|---|---|---|---|---|
| IPADIC | ~50 MB | 2007 | ~400K | Basic tokenization |
| UniDic | ~100 MB | 2023+ | ~800K | Linguistic accuracy, fine segmentation |
| SudachiDict Small | ~42 MB | Jan 2026 | ~700K | Modern vocabulary |
| SudachiDict Core | 207 MB | Jan 2026 | ~1.5M | Multi-granular (A/B/C), normalization |
| SudachiDict Full | ~240 MB | Jan 2026 | ~3M | Most comprehensive |

Why SudachiDict is 4x Larger

IPADIC stores one segmentation per word. SudachiDict stores three:

IPADIC entry:
  選挙管理委員会 → POS, reading

SudachiDict entry:
  選挙管理委員会 → POS, reading
                 + split_A: [選挙, 管理, 委員, 会]
                 + split_B: [選挙, 管理, 委員会]
                 + split_C: [選挙管理委員会]
                 + normalized_form

This enables Sudachi's killer feature: multi-granular tokenization.

The Multi-Granular Advantage

This is where Sudachi really shines, and honestly, it's what sold me on it.

Sudachi offers three tokenization modes:

| Mode | Granularity | Use Case |
|---|---|---|
| A | Finest | Search indexing, linguistic analysis |
| B | Middle | General NLP |
| C | Longest | Dictionary lookup, vocabulary learning |

Real-World Example

| Word | Mode A | Mode B | Mode C |
|---|---|---|---|
| 厚生労働省 | 厚生/労働/省 | 厚生/労働省 | 厚生労働省 |
| 医療機器 | 医療/機器 | 医療/機器 | 医療機器 |
| 国際標準化 | 国際/標準/化 | 国際/標準化 | 国際/標準化 |
| グローバル市場 | グローバル/市場 | グローバル/市場 | グローバル/市場 |

For a dictionary app, Mode C is ideal: it keeps compound words together so users can look them up directly.

Critical Finding: Small Dictionary Breaks Mode B/C

During testing, I discovered that SudachiDict Small doesn't support multi-granular tokenization:

text

// With SudachiDict Small (42MB)
Mode A: 選挙 / 管理 / 委員 / 会
Mode B: 選挙 / 管理 / 委員 / 会   // Same as A!
Mode C: 選挙 / 管理 / 委員 / 会   // Same as A!

// With SudachiDict Core (207MB)
Mode A: 選挙 / 管理 / 委員 / 会
Mode B: 選挙 / 管理 / 委員会     // Different!
Mode C: 選挙管理委員会           // Different!

Takeaway: If you need Mode B or C, you must use the Core or Full dictionary.
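If you want to catch this at runtime rather than in testing, one option is a startup sanity check: tokenize a known compound in Mode A and Mode C and compare. Below is a sketch; `tokenizeSurfaces(text, mode)` is an assumed wrapper returning surface strings, not a real sudachi-wasm333 API.

```js
// Sketch of a startup sanity check: if Mode C output is identical to
// Mode A for a known compound, the loaded dictionary cannot produce
// multi-granular splits (the SudachiDict Small limitation above).
// `tokenizeSurfaces(text, mode)` is an assumed wrapper, not a real API.
function supportsMultiGranular(tokenizeSurfaces, probe = '選挙管理委員会') {
  const a = tokenizeSurfaces(probe, 'A').join('/');
  const c = tokenizeSurfaces(probe, 'C').join('/');
  return a !== c; // Core/Full: differs; Small: identical
}
```

With the Small dictionary loaded, this returns `false` and you can warn or fall back to Mode A explicitly instead of silently getting A-grade segmentation.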

Native Performance Benchmarks

Before looking at WASM performance, let's establish a baseline: how fast can these tokenizers actually be when running natively?

This gives us a ceiling to compare against.

Native Rust (Lindera)

=== Lindera (IPADIC) ===
  Dictionary load time: 167 ms
  Tokenization time:    22 µs (avg of 100 runs)
  Token count:          74

=== Lindera (UniDic) ===
  Dictionary load time: 676 ms
  Tokenization time:    29 µs
  Token count:          79

Native Python/Rust (SudachiPy)

SudachiPy wraps the same Rust core as sudachi.rs:

=== Sudachi (Core Dictionary) ===
  Dictionary load time: 246 ms
  Mode A - time: 0.133 ms, tokens: 79
  Mode B - time: 0.074 ms, tokens: 73
  Mode C - time: 0.066 ms, tokens: 71
  Memory usage: ~101 MB

Native Go (Kagome)

Kagome is a Go tokenizer. I measured it with Go 1.25.6 and the IPA dictionary.

=== Kagome (IPA Dictionary) ===
  Dictionary load time: 675 ms
  Tokenization time:    56 µs (avg of 100 runs)
  Token count:          74

Native Summary

| Analyzer | Load Time | Tokenize Time | Tokens |
|---|---|---|---|
| Lindera IPADIC | 167 ms | 22 µs | 74 |
| Lindera UniDic | 676 ms | 29 µs | 79 |
| Sudachi Mode A | 246 ms | 133 µs | 79 |
| Sudachi Mode B | 246 ms | 74 µs | 73 |
| Sudachi Mode C | 246 ms | 66 µs | 71 |
| Kagome (IPA) | 675 ms | 56 µs | 74 |

Key insight: Lindera is the fastest native tokenizer. Sudachi Mode C produces the fewest tokens (71) by preserving compound words, which is exactly what I wanted for dictionary lookup.

WASM Performance Benchmarks

Alright, here are the results that matter: how do these perform in the browser?

WASM Package Sizes

| Package | Unpacked | Gzipped | Dictionary |
|---|---|---|---|
| lindera-wasm-web-ipadic | 13 MB | ~4 MB | Embedded |
| lindera-wasm-web-unidic | 47 MB | ~15 MB | Embedded |
| sudachi-wasm333 | 2.5 MB | ~1 MB | Separate |
| SudachiDict Core | 207 MB | ~70 MB | - |
| kagome.wasm | 15 MB | ~5 MB | Embedded |

The critical difference: Lindera embeds the dictionary in WASM, while Sudachi downloads it separately.

WASM Benchmark Results (Node.js)

This is useful for apples-to-apples comparisons between libraries, and it avoids UI framework overhead.

=== Lindera WASM (IPADIC) ===
  Dictionary load time: 208 ms
  Tokenization time:    1.204 ms (avg of 100 runs)
  Token count:          74

=== Sudachi WASM (CORE dict - 207MB) ===
  Dictionary load time: 273 ms
  Mode A - time: 0.406 ms, tokens: 79
  Mode B - time: 0.369 ms, tokens: 73
  Mode C - time: 0.356 ms, tokens: 71

=== Sudachi WASM (bundled small dict) ===
  Dictionary load time (bundled): 272 ms
  Mode A - time: 0.957 ms, tokens: 79
  Mode B - time: 0.625 ms, tokens: 79 (same as A, small dict limitation)
  Mode C - time: 0.429 ms, tokens: 79 (same as A, small dict limitation)

Sudachi Is Fastest in WASM (Once Loaded)

| Analyzer | Tokenize Avg (WASM) | Notes |
|---|---|---|
| Lindera IPADIC | 1.204 ms | Embedded IPADIC in a Rust WASM module |
| Sudachi Mode B | 0.369 ms | Balanced segmentation |
| Sudachi Mode C | 0.356 ms | Fast and preserves compounds |

In this Node.js run, Sudachi Mode C is ~3.4x faster than Lindera for tokenization (1.204 ms vs 0.356 ms). This is after everything is already loaded.

Kagome is a Go WASM module and is best benchmarked in a browser; see the Chrome section next.

WASM Overhead Analysis (Native vs Node.js)

| Analyzer | Native | Node.js WASM | Overhead |
|---|---|---|---|
| Lindera IPADIC | 22 µs | 1.204 ms | ~55x |
| Sudachi Mode B | 74 µs | 0.369 ms | ~5x |
| Sudachi Mode C | 66 µs | 0.356 ms | ~5x |

Sudachi keeps the WASM penalty relatively low. Lindera pays a much higher overhead per call.

WASM Benchmark Results (Chrome)

Running in Chrome includes the Go runtime startup cost for Kagome and captures browser-side overhead. Kagome shows a large fixed startup cost (~4 seconds here) before you can tokenize anything, which makes it less suitable for web applications where fast initialization matters.

LINDERA (IPADIC)
  Load time:      608.2 ms
  Tokenize avg:   0.947 ms
  Tokenize min:   0.845 ms
  Tokenize max:   2.000 ms
  Token count:    74

KAGOME (IPA)
  Load time:      4133.3 ms
  Tokenize avg:   3.274 ms
  Tokenize min:   1.715 ms
  Tokenize max:   12.430 ms
  Token count:    74

SUDACHI (Core Dictionary)
  Load time:      1692.3 ms

  Mode A: avg 0.413 ms (79 tokens)
  Mode B: avg 0.398 ms (73 tokens)
  Mode C: avg 0.484 ms (71 tokens)

Why Node.js vs Browser WASM Results Differ

There are three different “places” you might measure WASM performance, and they answer different questions:

| Factor | Node.js script (benchmark_wasm.mjs) | Browser benchmark page (/benchmark.html) | Web app UI (React) |
|---|---|---|---|
| What you measure | Mostly tokenizer call overhead after everything is already available on disk | Real browser runtime: fetch/compile/instantiate + tokenize | Real UX: browser runtime + React updates + app logic |
| Iterations | 100 | 100 | 5 (benchmark mode) or 1 (normal) |
| Warm-up | 1 + 100 measured | 1 + 100 measured | 1 warm-up + 5 measured |
| JIT (just-in-time) optimization | Usually more stable | Stable after warm-up | Noisy (few runs) |
| Asset loading | From local disk | Over HTTP (even on localhost), subject to caching | Over HTTP + app chunking |
| Sandbox / security model | None (process has direct access) | Browser sandbox + stricter policies | Browser sandbox + app policies |
| Overhead around tokenize | Minimal | Minimal | Includes async scheduling + React state updates |

The biggest “gotchas” are (1) iteration count (JIT + variance) and (2) what you include in load time (disk vs HTTP, cached vs uncached, dev server vs production).

Real World Load Times

Let's talk about what users will actually experience.

Estimated Browser Load Times

| Package | @ 50 Mbps | @ 100 Mbps | @ 200 Mbps |
|---|---|---|---|
| Lindera IPADIC (13 MB) | ~2s | ~1s | ~0.5s |
| Lindera UniDic (47 MB) | ~7.5s | ~3.8s | ~1.9s |
| Kagome (15 MB) | ~2.4s | ~1.2s | ~0.6s |
| Sudachi + Core (210 MB) | ~33s | ~17s | ~8.5s |
| Sudachi + Core (gzipped) | ~11s | ~5.5s | ~2.8s |
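These estimates are plain arithmetic: megabytes times 8 bits per byte, divided by the link speed in megabits per second (ignoring latency and TCP ramp-up, so real-world times will be somewhat worse):

```js
// Estimated transfer time: MB * 8 bits/byte / Mbps.
// Ignores latency and TCP slow start, so treat as a lower bound.
function estimateSeconds(sizeMB, mbps) {
  return (sizeMB * 8) / mbps;
}

estimateSeconds(13, 100); // Lindera IPADIC at 100 Mbps -> 1.04 s
estimateSeconds(207, 50); // Sudachi Core at 50 Mbps    -> 33.12 s
```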

The Caching Solution

Those Sudachi numbers look as bad as expected, right? 33 seconds at 50 Mbps is brutal. But here's the thing: users only pay that cost once.

The solution is IndexedDB caching:

js

// `idb` here stands for a small IndexedDB helper with get/set
// (e.g. the idb-keyval package's get/set functions).
async function loadSudachiWithCache() {
  const cachedDict = await idb.get('sudachi-core-dict');

  if (cachedDict) {
    // Return visit: ~500ms
    await sudachi.initialize_from_bytes(cachedDict);
  } else {
    // First visit: show progress bar
    const dict = await fetchWithProgress('/sudachi/system.dic');
    await idb.set('sudachi-core-dict', dict);
    await sudachi.initialize_from_bytes(dict);
  }
}

Result:

  • First visit: 5-15 seconds (with progress indicator)
  • Return visits: ~500ms (essentially instant)
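The `fetchWithProgress` helper above isn't shown in full; here is a minimal sketch using the Fetch streaming API. Note that `Content-Length` may be absent, or may reflect the compressed size when the server gzips on the fly, in which case the percentage is only approximate.

```js
// Report progress from loaded/total bytes, clamped to 100.
function percentLoaded(loaded, total) {
  return Math.min(100, Math.round((loaded / total) * 100));
}

// Minimal sketch: stream the response body, report percent as chunks
// arrive, and return the concatenated bytes.
async function fetchWithProgress(url, onProgress = () => {}) {
  const res = await fetch(url);
  const total = Number(res.headers.get('Content-Length')) || 0;
  const reader = res.body.getReader();

  const chunks = [];
  let loaded = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
    loaded += value.length;
    if (total) onProgress(percentLoaded(loaded, total));
  }

  // Concatenate chunks into a single Uint8Array for the tokenizer init.
  const out = new Uint8Array(loaded);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}
```

A Service Worker with the Cache API is an alternative to IndexedDB here; either way, the point is that the 207 MB download happens once.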

Segmentation Quality Comparison

How do the analyzers actually differ in output?

Token Count Comparison

| Analyzer | Tokens | Tendency |
|---|---|---|
| Lindera IPADIC | 74 | Keeps some compounds |
| Lindera UniDic | 79 | Splits more finely |
| Sudachi Mode A | 79 | Finest splitting |
| Sudachi Mode B | 73 | Balanced |
| Sudachi Mode C | 71 | Preserves compounds |
| Kagome IPA | 74 | Similar granularity to IPADIC |

Key Word Segmentation

| Word | Lindera IPADIC | Lindera UniDic | Sudachi C |
|---|---|---|---|
| 厚生労働省 | 厚生/労働省 | 厚生/労働/省 | 厚生労働省 |
| 医療機器 | 医療/機器 | 医療/機器 | 医療機器 |
| 再生医療 | 再生/医療 | 再生/医療 | 再生/医療 |
| 国際標準化 | 国際/標準/化 | 国際/標準/化 | 国際/標準化 |
| グローバル市場 | グローバル/市場 | グローバル/市場 | グローバル/市場 |

Observations:

  • Lindera IPADIC keeps ~80% of compounds together
  • Lindera UniDic splits more aggressively (better for linguistic analysis)
  • Sudachi Mode C preserves the most compounds (best for dictionary lookup)

Feature Comparison

| Feature | Lindera IPADIC | Lindera UniDic | Sudachi Core | Kagome |
|---|---|---|---|---|
| Multi-granular (A/B/C) | ✗ | ✗ | ✓ | ✗ |
| Normalized form | ✗ | ✗ | ✓ | ✗ |
| Modern vocabulary | ✗ (2007) | ✓ | ✓ (2026) | ✗ |
| Compound preservation | ~80% | ~60% | ~95% (Mode C) | ~80% |
| Embedded dictionary | ✓ | ✓ | ✗ | ✓ |
| WASM tokenize speed (Node.js) | 1.20 ms | ~2 ms | 0.36 ms (Mode C) | n/a (see Chrome: 3.27 ms + ~4.1 s init) |
| Package size | 13 MB | 47 MB | 210 MB | 15 MB |
| Best for | Speed-first | Linguistics | Dictionary apps | Go ecosystem |

How to Choose the Right Tool

Choose Lindera IPADIC if:

  • Load time is critical (must be < 2 seconds)
  • You're building a simple tokenizer/search
  • Vocabulary from 2007 is acceptable
  • You want the smallest bundle size

Choose Lindera UniDic if:

  • You need linguistic accuracy
  • Fine-grained segmentation is preferred
  • Load time of 3-4 seconds is acceptable

Choose Sudachi with Core Dictionary if:

  • You're building a dictionary/vocabulary app
  • You need compound words preserved (Mode C)
  • You can implement caching (IndexedDB/Service Worker)
  • Modern vocabulary matters (medical terms, proper nouns)
  • Normalization features are useful (e.g., 附属→付属)

Choose Kagome if:

  • You're already in the Go ecosystem
  • You want a middle-ground option
  • ~15MB download is acceptable

Implementation Recommendations

For Dictionary/Vocabulary Apps

js

// Recommended: Sudachi with caching
import { SudachiStateless, TokenizeMode } from 'sudachi-wasm333';

const sudachi = new SudachiStateless();

// First load: show progress
await sudachi.initialize_browser('/sudachi/system.dic', {
  onProgress: (percent) => updateProgressBar(percent)
});

// Use Mode C for dictionary lookup
const tokens = sudachi.tokenize_raw(text, TokenizeMode.C);

// Each token has:
// - surface: "厚生労働省"
// - normalized_form: "厚生労働省"
// - reading_form: "コウセイロウドウショウ"
// - dictionary_form: "厚生労働省"

For Speed Critical Apps

js

// Recommended: Lindera IPADIC
import init, { TokenizerBuilder } from 'lindera-wasm-web-ipadic';

await init();

const builder = new TokenizerBuilder();
builder.setDictionary('embedded://ipadic');
builder.setMode('normal');
const tokenizer = builder.build();

const tokens = tokenizer.tokenize(text);

Server Configuration

Enable gzip/brotli compression for the dictionary file:

# nginx.conf
location /sudachi/ {
    gzip on;
    gzip_types application/octet-stream;
    gzip_min_length 1000;
}

This reduces the 207MB dictionary to ~70MB transfer.

Conclusion

There's no single "best" Japanese tokenizer; it depends on your priorities:

| Priority | Recommendation |
|---|---|
| Fastest load | Lindera IPADIC (13 MB, ~1s) |
| Fastest tokenization | Sudachi WASM (Mode C, ~0.36 ms) |
| Best for dictionary apps | Sudachi Mode C |
| Modern vocabulary | Sudachi Core |
| Smallest bundle | Lindera IPADIC |
| Linguistic accuracy | Lindera UniDic |

Benchmark Data

All benchmarks were run on the following text (121 characters):

厚生労働省では、世界に先駆けて、革新的な医療機器・再生医療等製品等の有効性・安全性に係る
試験方法等を策定し、試験方法等の国際標準化を図り、製品の早期実用化とともに、グローバル市場
における日本発の製品の普及を推進するための研究課題を公募します。

Resources

Here are the projects I tested. All are open source and actively maintained:

  • Lindera (Rust)
  • Sudachi / sudachi.rs (Rust)
  • Kagome (Go)


Please feel free to follow me or get in touch. I'm also happy to hear about work opportunities.

Copyright © anila. All rights reserved.