------------------------------------------------------------------------
Stroke frequency of Kanji (Japanese set)
------------------------------------------------------------------------

There are three scripts for Japanese: two phonetic scripts (hira-
and kata-kana) and about 2,000 Chinese characters (kanji). The characters
are built by successive strokes. I asked my Japanese friends if
there was a histogram of the number of strokes for these characters
(e.g. mean, mode, median). I was surprised to find that there was
no readily accessible answer. My Japanese friends laughed at me for even
seeking such a histogram!

Learning kanji is challenging for two very different reasons. First,
there is the sheer pictographic memory that has to be developed. Second,
each kanji can be pronounced in at least two different ways (Chinese
reading, Japanese reading). Sometimes a character can have as many as
eight pronunciations.

There are many websites which list the characters, so I set out to
derive the histogram myself. The solution is a nice showcase of the
power of Unix -- hence this note.

I started with this database
	https://en.wikipedia.org/wiki/List_of_kanji_by_stroke_count

With routine use of sed I was able to produce this file (kanji_stroke.txt):
1
一
乙
〇
2
丁
七
九
了
二
人
入
八
刀
力
十
又
乃
3
....

Here "1" means one stroke and is followed by the three one-stroke
kanji. "2" stands for two strokes and is followed by the two-stroke
kanji, and so on. From this file we need to produce a histogram.

The associative arrays of awk are well suited to this problem.
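
The associative-array idea can be sketched as follows, spelled out
with comments. This is only a sketch under assumptions: the function
name count_by_stroke and the variable names are hypothetical, and the
input is assumed to contain nothing but stroke-number header lines and
one kanji per line, as in the listing above.

```shell
# count_by_stroke: hypothetical helper name, not from the original note.
# Reads the header/kanji layout on stdin and prints "strokes count" rows.
count_by_stroke() {
    awk '
    /^[0-9]+$/ { strokes = $1; next }   # header line: remember the stroke number
    NF         { count[strokes]++ }     # any other non-blank line is one kanji
    END        { for (s in count) print s, count[s] }
    ' | sort -n                         # for-in order is unspecified, so sort
}
```

The array is indexed by the stroke number itself, so no group needs to
be declared in advance. The final sort is there because awk does not
promise any particular order when looping over an array.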

$ awk '/^[0-9]+$/{i=$1; next} NF{a[i]++} END{for (i in a) print i, a[i]}' kanji_stroke.txt | sort -n
1 3
2 13
3 34
4 74
5 105
6 118
7 159
8 211
9 193
10 227
11 223
12 217
13 169
14 123
15 120
16 81
17 46
18 41
19 22
20 10
21 8
22 4
23 1
24 1
29 1
30 2
33 1
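
Since the original question was about the mean, mode and median, a
histogram in this two-column form can be summarised with a little more
awk. The following is a minimal sketch, not part of the original
solution: the name stroke_stats is hypothetical, the rows are assumed
to be "strokes count" pairs already sorted by stroke number, and the
median reported is the lower median (no averaging of the two middle
values).

```shell
# stroke_stats: hypothetical helper name, not from the original note.
# Reads "strokes count" rows on stdin, sorted by stroke number.
stroke_stats() {
    awk '
    {
        s[NR] = $1; c[NR] = $2          # keep the rows for the median pass
        total += $2                     # number of kanji
        sum   += $1 * $2                # total number of strokes
        if ($2 + 0 > best + 0) { best = $2; mode = $1 }
    }
    END {
        if (total == 0) exit 1          # nothing to summarise
        half = total / 2
        for (k = 1; k <= NR; k++) {     # walk the cumulative count
            run += c[k]
            if (run >= half) { median = s[k]; break }
        }
        printf "kanji=%d mean=%.2f median=%d mode=%d\n", total, sum / total, median, mode
    }'
}
```

Piping the histogram rows into stroke_stats prints a single summary
line. If two stroke counts tie for the largest group, the sketch
reports the smaller one as the mode.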