On Tue, 19 Feb 2008 16:42:03 +0100, Xavier Roche wrote:
Post by Xavier Roche
Well, you could possibly separate out the katakana, but for kanji
(ideograms) it is completely impossible, without applying (very) complex
processing, to separate compound words (made of kanji and/or kana) from
"isolated" kanji, since they are the same characters (in the Unicode
sense of the term).
Actually, I am working on a piece of software. The "complexity" here
will depend only on how long the instructions take to execute.
Programming the algorithm is enough to find out.
Post by Xavier Roche
(To take an example, the "?" of "?"(?) (white) and the "?" of "??"
(white rice) or of "??" ("amusing") are identical.)
That only partly clears things up for me.
If I show this table of the "languages" (Unicode blocks) covering the
first 65536 Unicode code points, can the characters be located?
type
TLanguage =
(Basic_Latin,Latin_1_Supplement,Latin_Extended_A,Latin_Extended_B,IPA_Extensions,Spacing_Modifier_Letters,
Combining_Diacritical_Marks,Greek_and_Coptic,Cyrillic,Cyrillic_Supplement,Armenian,Hebrew,Arabic,Syriac,Arabic_Supplement,
Thaana,Devanagari,Bengali,Gurmukhi,Gujarati,Oriya,Tamil,Telugu,Kannada,Malayalam,Sinhala,Thai,Lao,Tibetan,Myanmar,Georgian,
Hangul_Jamo,Ethiopic,Ethiopic_Supplement,Cherokee,Unified_Canadian_Aboriginal_Syllabics,Ogham,Runic,Tagalog,Hanunoo,Buhid,
Tagbanwa,Khmer,Mongolian,Limbu,Tai_Le,New_Tai_Lue,Khmer_Symbols,Buginese,Phonetic_Extensions,Phonetic_Extensions_Supplement,
Combining_Diacritical_Marks_Supplement,Latin_Extended_Additional,Greek_Extended,General_Punctuation,Superscripts_and_Subscripts,
Currency_Symbols,Combining_Diacritical_Marks_for_Symbols,Letterlike_Symbols,Number_Forms,Arrows,Mathematical_Operators,
Miscellaneous_Technical,Control_Pictures,Optical_Character_Recognition,Enclosed_Alphanumerics,Box_Drawing,Block_Elements,
Geometric_Shapes,Miscellaneous_Symbols,Dingbats,Miscellaneous_Mathematical_Symbols_A,Supplemental_Arrows_A,Braille_Patterns,
Supplemental_Arrows_B,Miscellaneous_Mathematical_Symbols_B,Supplemental_Mathematical_Operators,Miscellaneous_Symbols_and_Arrows,
Glagolitic,Coptic,Georgian_Supplement,Tifinagh,Ethiopic_Extended,Supplemental_Punctuation,CJK_Radicals_Supplement,Kangxi_Radicals,
Ideographic_Description_Characters,CJK_Symbols_and_Punctuation,Hiragana,Katakana,Bopomofo,Hangul_Compatibility_Jamo,Kanbun,
Bopomofo_Extended,CJK_Strokes,Katakana_Phonetic_Extensions,Enclosed_CJK_Letters_and_Months,CJK_Compatibility,CJK_Unified_Ideographs_Extension_A,
Yijing_Hexagram_Symbols,CJK_Unified_Ideographs,Yi_Syllables,Yi_Radicals,Modifier_Tone_Letters,Syloti_Nagri,Hangul_Syllables,
High_Surrogates,High_Private_Use_Surrogates,Low_Surrogates,Private_Use_Area,CJK_Compatibility_Ideographs,Alphabetic_Presentation_Forms,
Arabic_Presentation_Forms_A,Variation_Selectors,Vertical_Forms,Combining_Half_Marks,CJK_Compatibility_Forms,Small_Form_Variants,
Arabic_Presentation_Forms_B,Halfwidth_and_Fullwidth_Forms,Specials);
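To make the question concrete, here is a minimal sketch of the lookup I have in mind, assuming Delphi or Free Pascal. TBlockRange, BlockTable and BlockOf are hypothetical names, and the table holds only a handful of rows for illustration; the bounds are the standard Unicode block ranges. As far as I can tell, such a lookup only says which block a code point falls in, not which language or word the character belongs to.

```pascal
program BlockLookupSketch;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

type
  // Hypothetical helper record: one row per Unicode block.
  TBlockRange = record
    First, Last: Word;  // inclusive code point bounds (first 65536 only)
    Name: string;       // spelled like the TLanguage identifiers
  end;

const
  // Illustrative subset; a full table would have one row per TLanguage value.
  BlockTable: array[0..5] of TBlockRange = (
    (First: $0000; Last: $007F; Name: 'Basic_Latin'),
    (First: $3040; Last: $309F; Name: 'Hiragana'),
    (First: $30A0; Last: $30FF; Name: 'Katakana'),
    (First: $3400; Last: $4DBF; Name: 'CJK_Unified_Ideographs_Extension_A'),
    (First: $4E00; Last: $9FFF; Name: 'CJK_Unified_Ideographs'),
    (First: $AC00; Last: $D7AF; Name: 'Hangul_Syllables')
  );

// Linear search over the table; returns 'Unknown' for rows not listed here.
function BlockOf(CodePoint: Word): string;
var
  I: Integer;
begin
  Result := 'Unknown';
  for I := Low(BlockTable) to High(BlockTable) do
    if (CodePoint >= BlockTable[I].First) and (CodePoint <= BlockTable[I].Last) then
    begin
      Result := BlockTable[I].Name;
      Exit;
    end;
end;

begin
  WriteLn(BlockOf($0041));  // a Latin capital letter
  WriteLn(BlockOf($30AB));  // a katakana syllable
  WriteLn(BlockOf($767D));  // a CJK ideograph
end.
```

Run as written, this should print Basic_Latin, Katakana and CJK_Unified_Ideographs, one per line. With the complete table sorted by First, a binary search would be the obvious replacement for the linear loop.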
Each character type is defined as follows:
Lu Letter, Uppercase
Ll Letter, Lowercase
Lt Letter, Titlecase
Lm Letter, Modifier
Lo Letter, Other
Mn Mark, Nonspacing
Mc Mark, Spacing Combining
Me Mark, Enclosing
Nd Number, Decimal Digit
Nl Number, Letter
No Number, Other
Pc Punctuation, Connector
Pd Punctuation, Dash
Ps Punctuation, Open
Pe Punctuation, Close
Pi Punctuation, Initial quote
Pf Punctuation, Final quote
Po Punctuation, Other
Sm Symbol, Math
Sc Symbol, Currency
Sk Symbol, Modifier
So Symbol, Other
Zs Separator, Space
Zl Separator, Line
Zp Separator, Paragraph
Cc Other, Control
Cf Other, Format
Cs Other, Surrogate
Co Other, Private Use
Cn Other, Not Assigned
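For illustration, the codes above could be declared like this in Delphi or Free Pascal. This is only a sketch: TGeneralCategory, the gc prefix and MajorClass are hypothetical names of mine, and the function merely groups the codes by their first letter. As far as I can tell, ideographs and most kana all come out as Lo, so this category alone would not tell them apart.

```pascal
program CategorySketch;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

type
  // Hypothetical enumeration mirroring the 30 two-letter codes listed above.
  TGeneralCategory = (
    gcLu, gcLl, gcLt, gcLm, gcLo,
    gcMn, gcMc, gcMe,
    gcNd, gcNl, gcNo,
    gcPc, gcPd, gcPs, gcPe, gcPi, gcPf, gcPo,
    gcSm, gcSc, gcSk, gcSo,
    gcZs, gcZl, gcZp,
    gcCc, gcCf, gcCs, gcCo, gcCn);

// Returns the major class encoded by the first letter of the code.
function MajorClass(Cat: TGeneralCategory): string;
begin
  case Cat of
    gcLu..gcLo: Result := 'Letter';
    gcMn..gcMe: Result := 'Mark';
    gcNd..gcNo: Result := 'Number';
    gcPc..gcPo: Result := 'Punctuation';
    gcSm..gcSo: Result := 'Symbol';
    gcZs..gcZp: Result := 'Separator';
  else
    Result := 'Other';  // gcCc..gcCn
  end;
end;

begin
  WriteLn(MajorClass(gcLo));
  WriteLn(MajorClass(gcZs));
  WriteLn(MajorClass(gcCn));
end.
```

This should print Letter, Separator and Other, one per line.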
Thanks again for any help.
--
Jean-Phil