Thursday, May 5, 2011

How to parse kanji numeric characters using ICU?

I'm writing a function that uses ICU to parse a Unicode string consisting of kanji numeric characters, and I want it to return the integer value of the string.

"五" => 5
"三十一" => 31
"五千九百七十二" => 5972

I'm setting the locale to Locale::getJapan() and using NumberFormat::parse() to parse the string. However, whenever I pass it any kanji characters, parse() returns U_INVALID_FORMAT_ERROR.

Does anyone know whether ICU supports kanji strings in NumberFormat::parse()? I was hoping that setting the locale to Japanese would let it parse kanji numeric values.

Thanks!

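#include <iostream>
#include <memory>
#include <unicode/numfmt.h>

using namespace std;
using namespace icu; // ICU C++ classes live in the icu namespace

int main(int argc, char **argv) {
    const Locale &jaLocale = Locale::getJapan();
    UErrorCode status = U_ZERO_ERROR;
    // unique_ptr releases the formatter on every return path
    unique_ptr<NumberFormat> nf(NumberFormat::createInstance(jaLocale, status));
    if (U_FAILURE(status)) {
        cout << "error creating formatter: " << u_errorName(status) << endl;
        return 1;
    }

    // Character for '5' in Japanese, '五'; the array must be NUL-terminated
    // because UnicodeString(const UChar *) reads up to the terminator
    UChar number[] = {0x4E94, 0};
    UnicodeString numStr(number);
    Formattable formattable;
    nf->parse(numStr, formattable, status);
    if (U_FAILURE(status)) {
        cout << "error parsing as number: " << u_errorName(status) << endl;
        return 1;
    }
    cout << "long value: " << formattable.getLong() << endl;
    return 0;
}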
From stackoverflow
  • I was inspired by your question to solve this problem using Python.

    If you don't find a C++ solution, it shouldn't be too hard to adapt this to C++.
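
For reference, a minimal C++ sketch of such a parser, written independently of the Python answer and using only the standard library, might look like the following. The function names (`digitValue`, `smallUnit`, `largeUnit`, `parseKanjiNumber`) are hypothetical. It assumes the input is already decoded to UTF-32, handles only 〇 to 九, 十, 百, 千, 万, 億 and 兆, treats a unit with no preceding digit as multiplied by one, and does not check that units appear in descending order or guard against overflow.

```cpp
#include <cstdint>
#include <string>

// Value of a digit kanji (〇..九), or -1 if c is not one.
int digitValue(char32_t c) {
    static const std::u32string digits = U"〇一二三四五六七八九";
    const std::u32string::size_type pos = digits.find(c);
    return pos == std::u32string::npos ? -1 : static_cast<int>(pos);
}

// Multiplier for units used inside a four-digit group, or 0 if c is not one.
std::uint64_t smallUnit(char32_t c) {
    switch (c) {
        case U'十': return 10;
        case U'百': return 100;
        case U'千': return 1000;
        default:    return 0;
    }
}

// Multiplier for units that close a four-digit group, or 0 if c is not one.
std::uint64_t largeUnit(char32_t c) {
    switch (c) {
        case U'万': return 10000ULL;
        case U'億': return 100000000ULL;
        case U'兆': return 1000000000000ULL;
        default:    return 0;
    }
}

// Parses a kanji numeral into out. Returns false on empty input or on any
// character that is not a supported digit or unit.
bool parseKanjiNumber(const std::u32string& text, std::uint64_t& out) {
    if (text.empty()) return false;
    std::uint64_t total = 0;   // groups already closed by 万, 億 or 兆
    std::uint64_t group = 0;   // value accumulated in the current group
    std::uint64_t digits = 0;  // digits waiting for a unit
    bool haveDigits = false;
    bool groupEmpty = true;
    for (char32_t c : text) {
        const int d = digitValue(c);
        if (d >= 0) {
            // Consecutive digits are read positionally, so 二〇一一 is 2011.
            digits = digits * 10 + static_cast<std::uint64_t>(d);
            haveDigits = true;
            groupEmpty = false;
        } else if (const std::uint64_t unit = smallUnit(c)) {
            group += (haveDigits ? digits : 1) * unit;
            digits = 0;
            haveDigits = false;
            groupEmpty = false;
        } else if (const std::uint64_t bigUnit = largeUnit(c)) {
            total += (groupEmpty ? 1 : group + digits) * bigUnit;
            group = 0;
            digits = 0;
            haveDigits = false;
            groupEmpty = true;
        } else {
            return false;
        }
    }
    out = total + group + digits;
    return true;
}
```

With this sketch, `parseKanjiNumber(U"五千九百七十二", value)` returns true and sets `value` to 5972.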

  • Hi, I created a small Perl module to do this a while back. It can convert Arabic <=> Japanese numerals, and though I haven't tested it exhaustively, I think it's pretty comprehensive. Feel free to improve it.

     
    package kanjiArabic;
    use strict;
    use warnings;
    our $VERSION = "1.00";
    use utf8;
    
    our %big = (
        十 => 10,百 => 100,千 => 1000,
        );
    our %bigger = (
        万 => 10000,億 => 100000000,
        兆 => 1000000000000,京 => 10000000000000000,
        垓 => 100000000000000000000,
        );
    #precompile regexes                                                                                                          
    our $qr = qr/[0-9]/;
    our $bigqr = qr/[十百千]/;
    our $biggerqr = qr/[万億兆京垓]/;
    
    #this routine does most of the real work.
    sub kanji2arabic{
        $_ = shift;
    
        tr/〇一二三四五六七八九/0123456789/;
        #optionally precompile for performance boost                                                                             
        s/(?<=${qr})(${bigqr})/\*${1}/g;
        s/(?<=${bigqr})(${bigqr})/\+${1}/g;
        s/(${bigqr})(?=${qr})/${1}\+/g;
        s/(${bigqr})(?=${bigqr})/${1}\+/g;
        s/(${bigqr})/${big{$1}}/g;
    
        s/([0-9\+\*]+)/\(${1}\)/g;
    
        s/(? …

    our %asmall = (
        0 => "〇", 1 => "一", 2 => "二", 3 => "三", 4 => "四",
        5 => "五", 6 => "六", 7 => "七", 8 => "八", 9 => "九",
        );
    our %places = (
        1 => 10, 
        2 => 100, 
        3 => 1000, 
        4 => 10000, 
        8 => 100000000, 
        12 => 1000000000000,
        16 => 10000000000000000, 
        20 => 100000000000000000000,
        );
    our %abig   = (
        10 => "十", 
        100 => "百", 
        1000 => "千", 
        10000 => "万", 
        100000000 => "億",
        1000000000000 => "兆", 
        10000000000000000 => "京", 
        100000000000000000000 => "垓",
        );
    our $MAX = 24; #We only support numbers up to 24 digits!                                                                     
    
    
    sub arabic2kanji{
        my @number = reverse(split(//,$_[0]));
        my @kanji;
        for(my $i=$#number;$i>=0;$i--){
            if( $i==0 ){push(@kanji,$asmall{$number[$i]});}
            elsif( $i % 4 == 0 ){
                if( $number[$i] !~ m/[01]/ ){
                    push(@kanji,$asmall{$number[$i]});
                }
                push(@kanji,$abig{$places{$i}});
            }else{
                my $p = $i % 4;
                if( $number[$i]==0 ){
                    next;
                }elsif( $number[$i]==1 ){
                    push(@kanji,$abig{$places{$p}});
                }else{
                    push(@kanji,$asmall{$number[$i]});
                    push(@kanji,$abig{$places{$p}});
                }
            }
        }
        return join("",@kanji);
    }
    
    
    sub eval_k2a{
        #feed me utf-8!                                                                                                          
        if($_[0] !~ m/^[〇一二三四五六七八九十百千万億兆京垓]+$/){
            print "Error: ".$_[0].
                  " not a Kanji number.\n" if defined($_[1])&&$_[1]==1;
            return -1;
        }
        my $expression = kanji2arabic($_[0]);
        print $expression."\n" if defined($_[1])&&$_[1]==1;
        return eval($expression);
    }
    
    
    
    1;
    

    You'd then call it from another script like so:

    
    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use Encode;
    use kanjiArabic;
    
    my $kanji = kanjiArabic::arabic2kanji($ARGV[0]);
    print "Kanji: ".encode("utf8",$kanji)."\n";
    my $arabic =  kanjiArabic::eval_k2a($kanji);
    print "Back to arabic...\n";
    print "Arabic: ".$arabic."\n";
    

    And use this script like so:

    
    kettle:~/k2a$ ./k2a.pl 5000215
    Kanji: 五百万二百十五
    Back to arabic...
    Arabic: 5000215
    

    rock on.
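
For readers working in C++ rather than Perl, the integer-to-kanji direction can be sketched with the standard library alone. This is not a port of the module above: the names `spellGroup` and `toKanji` are hypothetical, the output is UTF-32, the range is limited to what fits in a 64-bit unsigned integer (so 垓 is not reachable), and unlike the Perl code it writes 一 before 万, 億, 兆 and 京 (一万 for 10000) while still omitting it before 十, 百 and 千.

```cpp
#include <cstdint>
#include <string>

// Spells a value in 1..9999 using 千, 百, 十 and the digit kanji.
std::u32string spellGroup(unsigned n) {
    static const std::u32string digits = U"〇一二三四五六七八九";
    static const char32_t units[] = {U'千', U'百', U'十'};
    static const unsigned weights[] = {1000, 100, 10};
    std::u32string out;
    for (int i = 0; i < 3; ++i) {
        const unsigned d = n / weights[i];
        n %= weights[i];
        if (d == 0) continue;
        if (d > 1) out += digits[d];  // a leading 一 is omitted before 千, 百, 十
        out += units[i];
    }
    if (n > 0) out += digits[n];      // ones place
    return out;
}

// Converts an unsigned integer to a kanji numeral, one four-digit group
// at a time, from 京 (10^16) down to the ones group.
std::u32string toKanji(std::uint64_t value) {
    if (value == 0) return U"〇";
    static const char32_t large[] = {U'京', U'兆', U'億', U'万'};
    static const std::uint64_t weights[] = {
        10000000000000000ULL, 1000000000000ULL, 100000000ULL, 10000ULL};
    std::u32string out;
    for (int i = 0; i < 4; ++i) {
        const std::uint64_t g = value / weights[i];
        value %= weights[i];
        if (g == 0) continue;         // empty groups produce no output
        out += spellGroup(static_cast<unsigned>(g));
        out += large[i];
    }
    if (value > 0) out += spellGroup(static_cast<unsigned>(value));
    return out;
}
```

For the input used in the session above, `toKanji(5000215)` produces 五百万二百十五.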

  • You can use ICU's Rule-Based Number Format (RBNF): in C++, the RuleBasedNumberFormat class from rbnf.h; in C, unum_open() from unum.h with the UNUM_SPELLOUT style. In both cases, use the "ja" locale for Japanese.

    Artyom : This is the correct answer. Instead of `NumberFormat::createInstance(jaLocale, status);`, use `new RuleBasedNumberFormat(URBNF_SPELLOUT, jaLocale, status);`
  • This is actually quite difficult, especially if you start looking at the obscure kanji for very large numbers.

    In Perl, there is a very complete implementation in Lingua::JA::Numbers. Its source might be inspirational if you want to port it to C++.
