Lingua-JA-NormalizeText

lib/Lingua/JA/NormalizeText.pm

  莖螢輕鷄藝擊缺儉劍圈檢權獻硏縣險顯驗嚴效廣恆鑛號國穀黑濟碎齋
  劑櫻册殺雜參慘棧蠶贊殘祉絲視齒兒辭濕實舍寫煮社者釋壽收臭從澁
  獸縱祝肅處暑緖署諸敍奬將涉燒祥稱證乘剩壤孃條淨狀疊讓釀囑觸寢
  愼眞神盡圖粹醉隨髓數樞瀨聲靜齊攝竊節專戰淺潛纖踐錢禪曾祖僧雙
  壯層搜插巢爭瘦總莊裝騷增憎臟藏贈卽屬續墮體對帶滯臺瀧擇澤單嘆
  擔膽團彈斷癡遲晝蟲鑄著廳徵懲聽敕鎭塚遞鐵轉點傳都黨盜燈當鬭德
  獨讀突屆繩難貳惱腦霸廢拜梅賣麥發髮拔繁晚蠻卑碑祕濱賓頻敏甁侮
  福拂佛倂塀竝變邊勉辨瓣辯舖步穗寶襃豐墨沒飜每萬滿免麵默餠戾彌
  藥譯豫餘與譽搖樣謠來賴亂欄覽隆龍虜兩獵綠壘淚類勵禮隸靈齡曆歷
  戀練鍊爐勞廊朗樓郞錄灣堯巖晉槇渚猪琢瑤祐祿禎穰聰遙

OUTPUT FOR INPUT:

  亜悪圧囲為医壱逸稲飲隠営栄衛駅謁円縁艶塩奥応横欧殴黄温穏仮価
  禍画会壊悔懐海絵慨概拡殻覚学岳楽喝渇褐勧巻寛歓漢缶観関陥顔器
  既帰気祈亀偽戯犠旧拠挙虚峡挟狭郷響暁勤謹区駆勲薫径恵掲渓経継
  茎蛍軽鶏芸撃欠倹剣圏検権献研県険顕験厳効広恒鉱号国穀黒済砕斎
  剤桜冊殺雑参惨桟蚕賛残祉糸視歯児辞湿実舎写煮社者釈寿収臭従渋
  獣縦祝粛処暑緒署諸叙奨将渉焼祥称証乗剰壌嬢条浄状畳譲醸嘱触寝
  慎真神尽図粋酔随髄数枢瀬声静斉摂窃節専戦浅潜繊践銭禅曽祖僧双
  壮層捜挿巣争痩総荘装騒増憎臓蔵贈即属続堕体対帯滞台滝択沢単嘆
  担胆団弾断痴遅昼虫鋳著庁徴懲聴勅鎮塚逓鉄転点伝都党盗灯当闘徳
  独読突届縄難弐悩脳覇廃拝梅売麦発髪抜繁晩蛮卑碑秘浜賓頻敏瓶侮
  福払仏併塀並変辺勉弁弁弁舗歩穂宝褒豊墨没翻毎万満免麺黙餅戻弥
  薬訳予余与誉揺様謡来頼乱欄覧隆竜虜両猟緑塁涙類励礼隷霊齢暦歴
  恋練錬炉労廊朗楼郎録湾尭巌晋槙渚猪琢瑶祐禄禎穣聡遥
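
As a rough illustration of the mapping above, the sketch below converts a handful of the old-form kanji from the input table to their new forms with Perl's C<tr///>. The helper name C<old2new_subset> and the five-pair subset are assumptions made for this example. It is not the module's own implementation, which covers the full table.

```perl
use strict;
use warnings;
use utf8;    # the kanji literals below are UTF-8 source text

# Hypothetical helper for illustration only: maps five old-form kanji
# taken from the table above to their new forms and leaves every other
# character untouched.
sub old2new_subset {
    my ($text) = @_;
    (my $converted = $text) =~ tr/國櫻體龍澤/国桜体竜沢/;
    return $converted;
}

# The argument must be a decoded character string, not raw UTF-8 bytes.
my $converted = old2new_subset('櫻と龍の國');
```

Because C<tr///> pairs characters by position, the two lists must stay the same length and in the same order, just like the input and output tables above.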


=head2 tab2space

Converts CHARACTER TABULATION (U+0009) into SPACE (U+0020).
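
A minimal sketch of the same substitution in plain Perl, assuming each tab is replaced by exactly one space. The helper name C<tabs_to_spaces> is hypothetical and is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every U+0009 with a single U+0020.
sub tabs_to_spaces {
    my ($text) = @_;
    (my $out = $text) =~ tr/\x{0009}/\x{0020}/;
    return $out;
}
```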

=head2 remove_controls

Removes the following control characters:

  U+0000 .. U+0008
  U+000B
  U+000C
  U+000E .. U+001F
  U+007F .. U+009F

Note that this option does not remove the following characters:

  U+0009  CHARACTER TABULATION
  U+000A  LINE FEED
  U+000D  CARRIAGE RETURN
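
The ranges above can be sketched as a single C<tr///d> deletion. The helper name C<strip_controls> is an assumption for this example. It mirrors the documented ranges and does not claim to be the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: delete the control characters listed above.
# U+0009, U+000A and U+000D are deliberately absent from the ranges,
# so tabs, line feeds and carriage returns survive.
sub strip_controls {
    my ($text) = @_;
    (my $out = $text)
        =~ tr/\x{0000}-\x{0008}\x{000B}\x{000C}\x{000E}-\x{001F}\x{007F}-\x{009F}//d;
    return $out;
}
```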


=head2 remove_DFC

Removes the following Directional Formatting Characters:

  U+061C  ARABIC LETTER MARK
  U+2066  LEFT-TO-RIGHT ISOLATE
  U+2067  RIGHT-TO-LEFT ISOLATE
  U+2068  FIRST STRONG ISOLATE
  U+2069  POP DIRECTIONAL ISOLATE
  U+200E  LEFT-TO-RIGHT MARK
  U+200F  RIGHT-TO-LEFT MARK
  U+202A  LEFT-TO-RIGHT EMBEDDING
  U+202B  RIGHT-TO-LEFT EMBEDDING
  U+202C  POP DIRECTIONAL FORMATTING
  U+202D  LEFT-TO-RIGHT OVERRIDE
  U+202E  RIGHT-TO-LEFT OVERRIDE

See L<http://www.unicode.org/reports/tr9/> for more information about Directional Formatting Characters.
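
For illustration, the twelve code points listed above can be removed with one C<tr///d> call. The helper name C<strip_dfc> is hypothetical, and the sketch only restates the list. It is not taken from the module's source.

```perl
use strict;
use warnings;

# Hypothetical helper: delete the Directional Formatting Characters
# listed above (U+061C, U+200E, U+200F, U+202A..U+202E, U+2066..U+2069).
sub strip_dfc {
    my ($text) = @_;
    (my $out = $text)
        =~ tr/\x{061C}\x{200E}\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}//d;
    return $out;
}
```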


=head2 remove_spaces

Removes SPACE (U+0020) and IDEOGRAPHIC SPACE (U+3000).
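
A minimal sketch of this behavior in plain Perl. The helper name C<strip_spaces> is an assumption for the example. Only the two characters named above are deleted, so other whitespace such as tabs is left in place.

```perl
use strict;
use warnings;

# Hypothetical helper: delete SPACE (U+0020) and IDEOGRAPHIC SPACE (U+3000).
sub strip_spaces {
    my ($text) = @_;
    (my $out = $text) =~ tr/\x{0020}\x{3000}//d;
    return $out;
}
```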

=head2 dakuon_normalize, handakuon_normalize, all_dakuon_normalize

See L<Lingua::JA::Dakuon>.

Note that Lingua::JA::NormalizeText enables the C<$Lingua::JA::Dakuon::EnableCombining> flag.

=head2 square2katakana, circled2kana, circled2kanji

See L<Lingua::JA::Moji>.

=head2 decompose_parenthesized_kanji

Decomposes the following parenthesized kanji (U+3220 .. U+3243):

  ㈠㈡㈢㈣㈤㈥㈦㈧㈨㈩㈪㈫㈬㈭㈮㈯㈰㈱㈲㈳㈴㈵㈶㈷㈸㈹㈺㈻㈼㈽㈾㈿㉀㉁㉂㉃
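
One way to picture the decomposition is Unicode compatibility decomposition, under which ㈱ (U+3231) becomes an ASCII left parenthesis, 株, and an ASCII right parenthesis. The sketch below applies C<NFKD> from the core L<Unicode::Normalize> module to the listed range only. The helper name C<decompose_parenthesized> is hypothetical, and whether the module produces exactly this output is an assumption not confirmed by this section.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: apply compatibility decomposition to the
# parenthesized ideographs U+3220..U+3243 and leave all other
# characters (including other compatibility characters) untouched.
sub decompose_parenthesized {
    my ($text) = @_;
    (my $out = $text) =~ s/([\x{3220}-\x{3243}])/NFKD($1)/ge;
    return $out;
}
```

Restricting the substitution to the listed range keeps unrelated compatibility characters, such as fullwidth letters, from being rewritten as a side effect.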


=head1 AUTHOR

pawa E<lt>pawapawa@cpan.orgE<gt>

=head1 SEE ALSO

L<Table of old and new kanji forms (新旧字体表)|http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html>

L<Kangxi Dictionary (康熙字典)|http://ja.wikipedia.org/wiki/%E5%BA%B7%E7%86%99%E5%AD%97%E5%85%B8>

L<Lingua::JA::Regular::Unicode>

L<Lingua::JA::Dakuon>

L<Lingua::JA::Moji>

L<Unicode::Normalize>

L<Unicode::Number>

L<HTML::Entities>

L<HTML::Scrubber>

=head1 LICENSE

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=cut


