JCharset - Java Charset package
What is the Java Charset package?
The Java Charset package is an open-source implementation of character sets that were missing from the standard Java platform.
How do I use the Java Charset package?
The Java Charset package is written in pure Java, runs on JDK 1.5 or later, and requires no special installation - just add jcharset.jar to your classpath, or place it in any of the usual extension directories.
The JVM will recognize the supported character sets automatically, and they will be available anywhere character sets are used in the Java platform.
As an example, you can take a look at java.lang.String's constructor and getBytes() method, both of which have an overloaded version that receives a charset name as an argument.
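A minimal sketch of that usage follows. The class and method names (CharsetRoundTrip, roundTrip) are made up for illustration and are not part of JCharset. The sketch only calls the standard String constructor and getBytes() overloads that accept a charset name, and it falls back to UTF-8 when "UTF-7" cannot be resolved, so it also runs without jcharset.jar on the classpath.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only (not part of JCharset).
class CharsetRoundTrip {

    /** Encodes the text with the named charset, then decodes the bytes back. */
    static String roundTrip(String text, String charsetName) {
        try {
            byte[] encoded = text.getBytes(charsetName); // String -> bytes
            return new String(encoded, charsetName);     // bytes -> String
        } catch (UnsupportedEncodingException e) {
            // The JVM could not resolve the charset name
            throw new IllegalArgumentException("Charset not available: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // "UTF-7" resolves only if jcharset.jar is visible to the JVM,
        // so fall back to a charset that every JVM supports.
        String name = Charset.isSupported("UTF-7") ? "UTF-7" : "UTF-8";
        System.out.println(name + ": " + roundTrip("Hello, world", name));
    }
}
```

The same charset names should also be accepted by other standard APIs that take a charset name, such as InputStreamReader and OutputStreamWriter.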
Note: Some web/mail containers run each application in its own JVM context. In this case, check the container documentation for where and how to configure the classpath, such as WEB-INF/lib, shared/lib, jre/lib/ext, etc. You may need to restart the server for changes to take effect. However, if you use Sun's JRE, the charsets will be recognized only if jcharset.jar is placed in the jre/lib/ext extension directory or in the container's classpath. This is due to a bug in Sun's JRE implementation (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4619777). Voting for the bug may expedite its fix, so please do...
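When it is unclear whether a container actually picked up the jar, a quick check from inside the application can help. The following is a sketch only: the class and method names (CharsetAvailability, missing) are hypothetical, and the charset names in main are taken from the list of supported charsets. It relies solely on the standard Charset.isSupported() call.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, for illustration only (not part of JCharset).
class CharsetAvailability {

    /** Returns the given charset names that the running JVM cannot resolve. */
    static List<String> missing(String... names) {
        List<String> result = new ArrayList<String>();
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // An empty list suggests the JCharset provider is visible to this JVM context
        System.out.println("Unavailable charsets: "
                + missing("UTF-7", "SCGSM", "hp-roman8", "ISO-8859-8-BIDI"));
    }
}
```

Running this from the same context as the application code (for example from a servlet rather than from a standalone command line) matters, since the result depends on which class loader can see the jar.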
Which charsets are supported?
"UTF-7" (a.k.a. "UTF7", "UNICODE-1-1-UTF-7", "csUnicode11UTF7", "UNICODE-2-0-UTF-7")
The 7-bit Unicode character encoding defined in RFC 2152. The O-set characters are encoded as a shift sequence. Both O-set flavors (direct and shifted) are decoded.
"UTF-7-OPTIONAL" (a.k.a. "UTF-7O", "UTF7O", "UTF-7-O")
The 7-bit Unicode character encoding defined in RFC 2152. The O-set characters are directly encoded. Both O-set flavors (direct and shifted) are decoded.
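To make the difference between the two UTF-7 flavors concrete, here is a small sketch that classifies characters according to the Set D and Set O definitions in RFC 2152. It is not JCharset's implementation, and the class and method names (Utf7Sets, isDirect) are invented for illustration. It only answers whether a character would be written as-is or inside a shift sequence under each flavor. It does not perform the encoding itself, and the special "+-" form for the plus sign is not covered.

```java
// Hypothetical helper, for illustration only (not part of JCharset).
class Utf7Sets {
    // Set D specials from RFC 2152 (letters and digits are handled separately)
    private static final String SET_D_SPECIALS = "'(),-./:?";
    // Set O from RFC 2152: characters that MAY be encoded directly
    private static final String SET_O = "!\"#$%&*;<=>@[]^_`{|}";

    static boolean isSetD(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || SET_D_SPECIALS.indexOf(c) >= 0;
    }

    static boolean isSetO(char c) {
        return SET_O.indexOf(c) >= 0;
    }

    /**
     * True if the character is written as-is rather than in a shift sequence.
     * oSetDirect = false models the "UTF-7" flavor described above,
     * oSetDirect = true models the "UTF-7-OPTIONAL" flavor.
     */
    static boolean isDirect(char c, boolean oSetDirect) {
        // RFC 2152 also allows space, tab, CR and LF to be represented directly
        boolean whitespace = c == ' ' || c == '\t' || c == '\r' || c == '\n';
        return isSetD(c) || whitespace || (oSetDirect && isSetO(c));
    }

    public static void main(String[] args) {
        for (char c : "A!#~".toCharArray()) {
            System.out.println("'" + c + "' direct in UTF-7: " + isDirect(c, false)
                    + ", direct in UTF-7-OPTIONAL: " + isDirect(c, true));
        }
    }
}
```

Since both flavors decode direct and shifted O-set characters alike, the choice should only affect what is produced when encoding.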
"SCGSM" (a.k.a. "GSM-default-alphabet", "GSM_0338", "GSM_DEFAULT", "GSM7", "GSM-7BIT")
The GSM default charset as specified in GSM 03.38, used in SMPP for encoding SMS text messages.
Additional flavors of the GSM charset are "CCGSM", "SCPGSM" and "CCPGSM":
The CC prefix signifies mapping the Latin capital letter C with cedilla character, the SC prefix signifies mapping the Latin small letter c with cedilla character, and the P that follows the prefix signifies the packed form (8 characters packed into 7 bytes), as specified in GSM 03.38. See javadocs for details.
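The packed form can be illustrated with a short sketch of septet packing. This is a standalone illustration of the general technique under the usual GSM 03.38 bit order (least significant bits first). It is not the PackedGSMCharset code, and the class and method names (SeptetPacking, pack, unpack) are made up. It works on 7-bit GSM codes that have already been mapped from characters, and it ignores escape sequences and padding rules.

```java
// Hypothetical helper, for illustration only (not part of JCharset).
class SeptetPacking {

    /** Packs 7-bit values into octets, least significant bits first. */
    static byte[] pack(byte[] septets) {
        byte[] out = new byte[(septets.length * 7 + 7) / 8];
        for (int i = 0; i < septets.length; i++) {
            int bitPos = i * 7;       // absolute bit offset of this septet
            int index = bitPos / 8;   // octet holding its lowest bit
            int shift = bitPos % 8;   // bit offset within that octet
            int value = (septets[i] & 0x7F) << shift;
            out[index] |= (byte) value;
            if (shift > 1) {          // the septet spills into the next octet
                out[index + 1] |= (byte) (value >>> 8);
            }
        }
        return out;
    }

    /** Unpacks the given number of 7-bit values from packed octets. */
    static byte[] unpack(byte[] octets, int count) {
        byte[] out = new byte[count];
        for (int i = 0; i < count; i++) {
            int bitPos = i * 7;
            int index = bitPos / 8;
            int shift = bitPos % 8;
            int value = (octets[index] & 0xFF) >>> shift;
            if (shift > 1) {          // the remaining bits live in the next octet
                value |= (octets[index + 1] & 0xFF) << (8 - shift);
            }
            out[i] = (byte) (value & 0x7F);
        }
        return out;
    }

    public static void main(String[] args) {
        // GSM codes for "hello" (they coincide with ASCII for these letters)
        byte[] septets = {0x68, 0x65, 0x6C, 0x6C, 0x6F};
        StringBuilder hex = new StringBuilder();
        for (byte b : pack(septets)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        System.out.println(hex); // prints E8329BFD06
    }
}
```

With 8 input septets the output is exactly 7 octets, which is where the 8-characters-in-7-bytes ratio comes from.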
"hp-roman8" (a.k.a. "roman8", "r8", "csHPRoman8", "X-roman8")
The HP Roman-8 charset, as provided in RFC 1345.
"ISO-8859-8-BIDI" (a.k.a. "csISO88598I", "ISO-8859-8-I", "ISO_8859-8-I",
"csISO88598E", "ISO-8859-8-E", "ISO_8859-8-E")
The ISO 8859-8 charset implementation exists in the standard JRE. However, it lacks the i/e aliases, which specify whether bidirectionality is implicit or explicit. The charset conversions themselves are similar. This charset complements the standard one.
"ISO-8859-6-BIDI" (a.k.a. "csISO88596I", "ISO-8859-6-I", "ISO_8859-6-I",
"csISO88596E", "ISO-8859-6-E", "ISO_8859-6-E")
The ISO 8859-6 charset implementation exists in the standard JRE. However, it lacks the i/e aliases, which specify whether bidirectionality is implicit or explicit. The charset conversions themselves are similar. This charset complements the standard one.
"KOI8-U" (a.k.a. "KOI8-RU")
The KOI8-U Ukrainian charset, as defined in RFC 2319.
"MIK"
The MIK Cyrillic code page, commonly used by DOS applications in Bulgaria.
What's New?
In version 1.5:
- Fixed GSMCharset encoding of non-breakable space character (0x00A0), which shouldn't be encoded.
- Fixed PackedGSMCharset decoder edge case of handling overflow continuation for large strings (>256) when calling decoder directly (not via String methods).
- Fixed PackedGSMCharset decoder edge case of string size which is a multiple of internal buffer size (256) greater than 256 and has escaped characters on decoded buffer boundaries.
- Simplified CharsetProvider.charsetForName flow.
In version 1.4:
- Dropped support for JDK 1.4 and earlier.
- Added MIK charset.
- Added KOI8_U as a KOI8-U alias.
- Optimized EscapedByteLookupCharset encoding buffer allocation for strings with no escape chars.
- Added ByteLookupCharset.updateInverseLookupTable convenience method.
- Improved docs.
In version 1.3:
- Added X-roman8 as an hp-roman8 alias.
- Added the generic EscapedByteLookupCharset to simplify implementation of single-escape-byte charsets.
- Created two flavors of the GSM charset: CCGSMCharset (mapping the Latin capital letter C with cedilla) and SCGSMCharset (mapping the Latin small letter c with cedilla). See javadocs for details.
- Added support for Packed GSM charset, with the two flavors as well.
- Renamed the canonical charset name for the new GSM family, to make the flavor choices explicit.
In version 1.2.1:
- Fixed a combined JavaMail-JCharset bug that could cause an infinite loop on some inputs.
- Updated the ISO-8859-8-i/e mapping for the MACRON character. The incorrect mapping in the JDK's implementation of ISO-8859-8 is fixed as of JDK 1.5 (see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4760496). We now determine the running JDK version, and if it's JDK 1.5 or higher we use the correct mapping. This way we remain consistent with the running JDK ISO-8859-8 charset implementation.
In version 1.2:
- Added KOI8-U charset.
In version 1.1:
- Added ByteLookupCharset class to simplify implementation of single byte charsets.
- Added GSM-default-alphabet charset (used in SMPP).
- Added hp-roman8 charset.
- Added ISO-8859-8-i/e charset.
- Added ISO-8859-6-i/e charset.
In version 1.0:
- This is the first release of the Java Charset package.
License
The JCharset Package is provided under the GNU General Public License agreement.
For non-GPL commercial licensing please contact the author.
Donate
Please help support this project by making a donation. These donations are not meant to make the author rich, but to try and offset the costs of creating and maintaining the project. Any amount will help!
Contact
You can contact the author via e-mail at:
Please write in to report bugs, problems, suggestions, ideas, questions, answers, source code queries and especially just to let me know you've found the JCharset Package useful. Getting feedback will encourage me to continue development and add some advanced features I have in mind...
For updates and additional information, you can always visit the website at: