I just found this answer on the Web:

    import unicodedata

    def remove_accents(input_str):
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        # Drops every non-ASCII character; the result is bytes in Python 3.
        only_ascii = nfkd_form.encode('ASCII', 'ignore')
        return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than by dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the Unicode characters that are tagged as being diacritics.

Edit: this does the trick: after the NFKD normalization, keep only the characters c for which unicodedata.combining(c) is zero. unicodedata.combining(c) returns a nonzero (true) value if the character c can be combined with the preceding character, that is, mainly if it's a diacritic.

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

    encoding = "utf-8"  # or iso-8859-15, or cp1252, or whatever encoding you use
    byte_string = b"caf\xc3\xa9"  # the UTF-8 bytes of "café"; before Python 3, simply "café"
    unicode_string = byte_string.decode(encoding)

Another approach handles not only accents, but also "strokes" (as in ø etc.). It imports unicodedata as ud and returns the base character of char by "removing" any diacritics like accents or curls and strokes and the like. This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don't think it is very elegant after all.
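For reference, here is a minimal Python 3 sketch of the two ideas discussed above. The first function filters out combining characters after NFKD normalization, as the edit describes. The second is an assumption, since the code for the stroke-handling approach does not appear in the text: it looks up the character's Unicode name, cuts it at " WITH ", and looks the shortened name up again. The names strip_with_suffix and to_text are hypothetical helpers chosen for illustration.

```python
import unicodedata


def remove_accents(text):
    """Decompose with NFKD, then drop every combining character."""
    nfkd_form = unicodedata.normalize("NFKD", text)
    # combining() returns 0 for base characters and nonzero for combining marks.
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))


def strip_with_suffix(char):
    """Hypothetical helper: find the base character through the Unicode name.

    "LATIN SMALL LETTER O WITH STROKE" becomes "LATIN SMALL LETTER O".
    """
    try:
        name = unicodedata.name(char)
    except ValueError:
        return char  # the character has no Unicode name
    base_name, separator, _ = name.partition(" WITH ")
    if not separator:
        return char  # nothing to strip
    try:
        return unicodedata.lookup(base_name)
    except KeyError:
        return char  # the shortened name is not a valid character name


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


if __name__ == "__main__":
    print(remove_accents(to_text(b"caf\xc3\xa9")))
    print("".join(strip_with_suffix(c) for c in "søster"))
```

Note that remove_accents leaves ø untouched, because ø has no decomposition into a base letter plus a combining mark. That is the gap the name-based lookup is meant to cover, at the cost of one name lookup per character.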