------------------------------------------------------------------------
Stroke frequency of Kanji (Japanese set)
------------------------------------------------------------------------

There are three scripts for Japanese: two phonetic scripts (hiragana
and katakana) and about 2,000 Chinese characters (kanji). The
characters are built by successive strokes. I asked my Japanese friends
if there was a histogram of the number of strokes for these characters
(and with it, e.g., the mean, mode and median). I was surprised to find
that there was no readily accessible answer. My Japanese friends laughed
at me for even seeking such a histogram!

Learning kanji is challenging for two very different reasons. First is
the sheer pictographic memory that has to be developed. Second, each
kanji can be pronounced in at least two different ways (Chinese reading,
Japanese reading). Sometimes a character can have as many as eight
pronunciations.

There are many websites which list the characters, so I set out to
derive the histogram myself. The solution is a nice showcase of the
power of Unix -- hence this note.

I started with this database:

    https://en.wikipedia.org/wiki/List_of_kanji_by_stroke_count

With routine use of sed I was able to produce this file:

    1
    一
    乙
    〇
    2
    丁
    七
    九
    了
    二
    人
    入
    八
    刀
    力
    十
    又
    乃
    3
    ....

Each stroke count and each kanji sits on its own line. Here "1" means
one stroke and is followed by the three one-stroke kanji. "2" stands for
two strokes and is followed by the thirteen two-stroke kanji, and so on.

We need to produce a histogram. Awk's associative arrays are built for
this kind of problem: a line that starts with a digit sets the current
stroke count i, and every other line adds one to a[i].

    $ awk '/^[0-9]+/{i=$1};!/^[0-9]+/{a[i]++};END{for (i in a){print i,a[i]}}' kanji_stroke.txt

Note the brackets in the second pattern. Written as !/^0-9+/ it matches
every line, the stroke-count lines included, and each count comes out
one too high. The counts below show exactly that (4 for the three
one-stroke kanji, 14 for the thirteen two-stroke kanji), so subtract one
from each to get the number of kanji.

    1 4
    2 14
    3 35
    4 75
    5 106
    6 119
    7 160
    8 212
    9 194
    10 228
    11 224
    12 218
    13 170
    14 124
    15 121
    16 82
    17 47
    18 42
    19 23
    20 11
    21 9
    22 5
    23 2
    24 2
    29 2
    30 3
    33 2
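
The histogram step can also be sketched as a small script, extended to report the mean, median and mode that prompted this note. This is only a sketch under assumptions: the function names stroke_hist and stroke_stats are mine, the input is assumed to hold one stroke count or one kanji per line, and the sample file is a made-up toy, not the real kanji_stroke.txt. The output goes through sort -n because awk does not promise any particular order for "for (i in a)".

```shell
#!/bin/sh
# Toy sample in the assumed layout: a stroke-count line, then one kanji
# per line. Hypothetical data, NOT the real kanji_stroke.txt.
sample=$(mktemp)
printf '%s\n' 1 一 乙 2 丁 七 九 3 三 > "$sample"

# Histogram: print "strokes count" pairs, sorted by stroke count.
stroke_hist() {
    awk '
        /^[0-9]+/ { strokes = $1; next }    # header line: set current count
        NF        { count[strokes]++ }      # any other non-blank line: one kanji
        END       { for (s in count) print s, count[s] }
    ' "$1" | sort -n
}

# Mean, (lower) median and mode, read from sorted "strokes count" pairs.
stroke_stats() {
    awk '
        { s[NR] = $1; c[NR] = $2
          total += $2; sum += $1 * $2
          if ($2 + 0 > best + 0) { best = $2; mode = $1 } }
        END {
          if (!total) { print "no data"; exit 1 }
          # Walk the cumulative counts until half of the kanji are covered.
          for (k = 1; k <= NR; k++) {
              cum += c[k]
              if (2 * cum >= total) { median = s[k]; break }
          }
          printf "mean %.2f median %d mode %d\n", sum / total, median, mode
        }'
}

stroke_hist "$sample"
stroke_hist "$sample" | stroke_stats
rm -f "$sample"
```

On the toy sample this prints the pairs 1 2, 2 3 and 3 1, followed by "mean 1.83 median 2 mode 2". On the real file the same pipeline would be: stroke_hist kanji_stroke.txt | stroke_stats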