| | 317 | |
| | 318 | === Turkish letters İ and ı === |
| | 319 | |
| | 320 | In Turkish, the letters {{{I}}} and {{{i}}} are not an upper/lowercase pair. Instead, there are two pairs, {{{(İ, i)}}} and {{{(I, ı)}}}, i.e. one with and one without the dot above. |
| | 321 | |
| | 322 | According to the Unicode spec, the default lowercase counterpart of {{{İ}}} (U+0130) is a sequence of two Unicode code points, namely {{{i}}} (with the dot) followed by U+0307, the "combining dot above". The latter is there to preserve the information about the dot for the conversion back to uppercase. |
| | 323 | |
| | 324 | Python-2 applied only one-to-one case mappings and never produced the U+0307 character, so it converted the letters like this: |
| | 325 | {{{#!python |
| | 326 | >>> u"İ".lower().upper() |
| | 327 | u'I' |
| | 328 | >>> u"ı".upper().lower() |
| | 329 | u'i' |
| | 330 | |
| | 331 | # NB with a utf-8-encoded str, Python-2 doesn't lowercase "İ" at all! |
| | 332 | >>> print "İ".lower() |
| | 333 | İ |
| | 334 | }}} |
| | 335 | |
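| | 336 | Python-3 (3.3 and later) applies the full Unicode case mappings, including U+0307, so the behavior is different. Note that the round-tripped result below is the two-code-point sequence {{{I}}} + U+0307, which renders like {{{İ}}} but does not compare equal to U+0130 unless it is normalized: |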
| | 337 | {{{#!python |
| | 338 | >>> "İ".lower().upper() |
| | 339 | 'İ' |
| | 340 | >>> "ı".upper().lower() |
| | 341 | 'i' |
| | 342 | }}} |
| | 343 | |
| | 344 | Critically, the U+0307 character changes the string length (it is an extra code point!): |
| | 345 | {{{#!python |
| | 346 | # Python-2 |
| | 347 | >>> len(u"İ".lower()) |
| | 348 | 1 |
| | 349 | |
| | 350 | # Python-3 |
| | 351 | >>> len("İ".lower()) |
| | 352 | 2 |
| | 353 | }}} |
| | 354 | |
| | 355 | This is just something to keep in mind: an actual forward/backward compatibility pattern must be developed for the specific use-case. Neither the Python-2 nor the Python-3 behavior is particularly helpful for generalization; the Turkish I's always need special treatment. |
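As one possible illustration of such special treatment, the sketch below maps the two Turkish pairs explicitly before calling the built-in case methods, so that the string length stays stable. The helper names {{{turkish_lower}}} and {{{turkish_upper}}} are hypothetical and not part of any library. The NFC normalization step is an assumption about what the application wants: it folds a decomposed {{{I}}} + U+0307 back into U+0130. The ordinal-keyed translation tables are chosen because {{{translate}}} accepts them for unicode strings in both Python versions, but only the Python-3 behavior is shown here.

```python
import unicodedata

# Hypothetical translation tables for the two Turkish pairs (İ, i) and (I, ı).
# Keys are code point ordinals, values are the replacement characters.
_TR_LOWER = {0x0049: u"\u0131",   # I -> ı (dotless)
             0x0130: u"i"}        # İ -> i (dotted)
_TR_UPPER = {0x0069: u"\u0130",   # i -> İ (dotted)
             0x0131: u"I"}        # ı -> I (dotless)


def turkish_lower(text):
    """Lowercase text with Turkish rules; never emits U+0307 for İ."""
    # Assumption: fold a decomposed 'I' + U+0307 into U+0130 first.
    text = unicodedata.normalize("NFC", text)
    # Handle the special letters before the generic mapping sees them.
    return text.translate(_TR_LOWER).lower()


def turkish_upper(text):
    """Uppercase text with Turkish rules."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(_TR_UPPER).upper()


# The built-in round trip yields 'I' + U+0307, not the original U+0130 ...
roundtrip = u"\u0130".lower().upper()
print(roundtrip == u"\u0130")                                # False
# ... unless the result is normalized back to the composed form.
print(unicodedata.normalize("NFC", roundtrip) == u"\u0130")  # True

# The helpers keep the length at one code point per letter.
print(len(turkish_lower(u"\u0130")))                         # 1
print(turkish_upper(turkish_lower(u"I\u0130")) == u"I\u0130")  # True
```

These helpers are only appropriate for text known to be Turkish (or another language with the same pairs, such as Azerbaijani). Applied to English text they would turn {{{I}}} into {{{ı}}}, so the choice of conversion has to be made per use-case.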