I'm writing a function using ICU to parse a Unicode string consisting of kanji numeric characters, and I want it to return the integer value of the string.
"五" => 5
"三十一" => 31
"五千九百七十二" => 5972
I'm setting the locale to Locale::getJapan() and using NumberFormat::parse() to parse the character string. However, whenever I pass it any kanji characters, parse() returns U_INVALID_FORMAT_ERROR.
Does anyone know whether ICU supports kanji character strings in NumberFormat::parse()? I was hoping that, since I'm setting the locale to Japanese, it would be able to parse kanji numeric values.
Thanks!
#include <iostream>
#include <unicode/numfmt.h>
#include <unicode/unistr.h>
using namespace std;
using namespace icu;
int main(int argc, char **argv) {
    const Locale &jaLocale = Locale::getJapan();
    UErrorCode status = U_ZERO_ERROR;
    NumberFormat *nf = NumberFormat::createInstance(jaLocale, status);
    if (U_FAILURE(status)) {
        cout << "error creating formatter: " << u_errorName(status) << endl;
        return(1);
    }
    // Character for '5' in Japanese '五'; the array must be NUL-terminated
    UChar number[] = {0x4E94, 0};
    UnicodeString numStr(number);
    Formattable formattable;
    nf->parse(numStr, formattable, status);
    if (U_FAILURE(status)) {
        cout << "error parsing as number: " << u_errorName(status) << endl;
        delete nf;
        return(1);
    }
    cout << "long value: " << formattable.getLong() << endl;
    delete nf;
    return(0);
}
-
I was inspired by your question to solve this problem using Python.
If you don't find a C++ solution, it shouldn't be too hard to adapt this to C++.
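For what it's worth, here is a rough sketch of what a hand-rolled C++ version might look like. It is not ICU code: `parseKanjiNumber` is a made-up helper name, it takes a `std::u32string` so no UTF-8 decoding is needed, and it only knows 〇 to 九, 十/百/千 and 万/億/兆. It assumes the usual additive notation (五千九百七十二), rejects positional digit strings like 二〇二四, and does not check that the units appear in descending order.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not an ICU API). Parses a kanji numeral such as
// U"五千九百七十二" into `out`. Returns false for empty input, for two
// digits in a row, or for any character it does not know.
// Assumes this source file is saved and compiled as UTF-8.
bool parseKanjiNumber(const std::u32string& text, std::uint64_t& out) {
    static const std::u32string digits = U"〇一二三四五六七八九";
    if (text.empty()) return false;

    std::uint64_t total = 0;    // sum of completed 万/億/兆 groups
    std::uint64_t section = 0;  // value of the current group (normally < 10000)
    std::uint64_t digit = 0;    // digit still waiting for its multiplier
    bool hasDigit = false;

    for (char32_t c : text) {
        std::uint64_t small = 0, big = 0;
        if (c == U'十') small = 10;
        else if (c == U'百') small = 100;
        else if (c == U'千') small = 1000;
        else if (c == U'万') big = 10000ULL;
        else if (c == U'億') big = 100000000ULL;
        else if (c == U'兆') big = 1000000000000ULL;

        if (small != 0) {
            // A bare 十/百/千 counts as one of that unit (十五 is 15).
            section += (hasDigit ? digit : 1) * small;
            digit = 0;
            hasDigit = false;
        } else if (big != 0) {
            // Close the current group and scale it by 万/億/兆.
            std::uint64_t group = section + digit;
            if (group == 0) group = 1;  // tolerate a bare 万 as 10000
            total += group * big;
            section = 0;
            digit = 0;
            hasDigit = false;
        } else {
            std::u32string::size_type pos = digits.find(c);
            if (pos == std::u32string::npos || hasDigit) return false;
            digit = pos;
            hasDigit = true;
        }
    }
    out = total + section + digit;
    return true;
}
```

Calling it with `U"三十一"` fills `out` with 31 and returns true; anything containing a character outside the table above returns false. Stopping at 兆 keeps the result inside a 64-bit integer.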
-
Hi, I created a small Perl module to do this a while back. It can convert Arabic <=> Japanese, and though I haven't tested it exhaustively, I think it's pretty comprehensive. Feel free to improve it.
package kanjiArabic;
use strict;
use warnings;
our $VERSION = "1.00";
use utf8;

our %big = ( 十 => 10, 百 => 100, 千 => 1000, );
our %bigger = (
    万 => 10000, 億 => 100000000,
    兆 => 1000000000000, 京 => 10000000000000000,
    垓 => 100000000000000000000,
);

#precompile regexes
our $qr = qr/[0-9]/;
our $bigqr = qr/[十百千]/;
our $biggerqr = qr/[万億兆京垓]/;

#this routine does most of the real work.
sub kanji2arabic{
    $_ = shift;
    tr/〇一二三四五六七八九/0123456789/; #optionally precompile for performance boost
    s/(?<=${qr})(${bigqr})/\*${1}/g;
    s/(?<=${bigqr})(${bigqr})/\+${1}/g;
    s/(${bigqr})(?=${qr})/${1}\+/g;
    s/(${bigqr})(?=${bigqr})/${1}\+/g;
    s/(${bigqr})/${big{$1}}/g;
    s/([0-9\+\*]+)/\(${1}\)/g;
    s/(?
[...]

our %asmall = (
    0 => "〇", 1 => "一", 2 => "二", 3 => "三", 4 => "四",
    5 => "五", 6 => "六", 7 => "七", 8 => "八", 9 => "九",
);
our %places = (
    1 => 10, 2 => 100, 3 => 1000, 4 => 10000,
    8 => 100000000, 12 => 1000000000000,
    16 => 10000000000000000, 20 => 100000000000000000000,
);
our %abig = (
    10 => "十", 100 => "百", 1000 => "千", 10000 => "万",
    100000000 => "億", 1000000000000 => "兆",
    10000000000000000 => "京", 100000000000000000000 => "垓",
);
our $MAX = 24; #We only support numbers up to 24 digits!

sub arabic2kanji{
    my @number = reverse(split(//,$_[0]));
    my @kanji;
    for(my $i=$#number;$i>=0;$i--){
        if( $i==0 ){push(@kanji,$asmall{$number[$i]});}
        elsif( $i % 4 == 0 ){
            if( $number[$i] !~ m/[01]/ ){
                push(@kanji,$asmall{$number[$i]});
            }
            push(@kanji,$abig{$places{$i}});
        }else{
            my $p = $i % 4;
            if( $number[$i]==0 ){
                next;
            }elsif( $number[$i]==1 ){
                push(@kanji,$abig{$places{$p}});
            }else{
                push(@kanji,$asmall{$number[$i]});
                push(@kanji,$abig{$places{$p}});
            }
        }
    }
    return join("",@kanji);
}

sub eval_k2a{
    #feed me utf-8!
    if($_[0] !~ m/^[〇一二三四五六七八九十百千万億兆京垓]+$/){
        print "Error: ".$_[0]. " not a Kanji number.\n" if defined($_[1])&&$_[1]==1;
        return -1;
    }
    my $expression = kanji2arabic($_[0]);
    print $expression."\n" if defined($_[1])&&$_[1]==1;
    return eval($expression);
}
1;
You'd then call it from another script like so:
#!/usr/bin/perl -w
use strict;
use warnings;
use Encode;
use kanjiArabic;
my $kanji = kanjiArabic::arabic2kanji($ARGV[0]);
print "Kanji: ".encode("utf8",$kanji)."\n";
my $arabic = kanjiArabic::eval_k2a($kanji);
print "Back to arabic...\n";
print "Arabic: ".$arabic."\n";
And use this script like so:
kettle:~/k2a$ ./k2a.pl 5000215
Kanji: 五百万二百十五
Back to arabic...
Arabic: 5000215
rock on.
-
You can use ICU's Rule-Based Number Format (RBNF). In C++ that is the RuleBasedNumberFormat class from rbnf.h; in C, open the formatter through unum.h with the UNUM_SPELLOUT style. In both cases use the "ja" locale for Japanese.
Artyom: This is the correct answer. Instead of `NumberFormat::createInstance(jaLocale, status);` use `new RuleBasedNumberFormat(URBNF_SPELLOUT, jaLocale, status);`
-
This is actually quite difficult, especially if you start looking at the obscure kanji for very large numbers.
In Perl, there is a very complete implementation in Lingua::JA::Numbers. Its source might be inspirational if you want to port it to C++.