JCharset - Java Charset package
What is the Java Charset package?
The Java Charset package is an open-source implementation of character sets that were missing from the standard Java platform.
It has been in use in many production systems around the world for over a decade, including products by small start-ups, large open-source service providers, and well-known multinational corporations.
How do I use the Java Charset package?
The Java Charset package is written in pure Java, runs on JDK 1.5 or later, and requires no special installation - just add the jar file to your classpath, or place it in any of the usual extension directories.
It is also available on Maven Central at the artifact coordinates
The JVM will recognize the supported character sets automatically, and they will be available anywhere character sets are used in the Java platform.
As an example, you can take a look at java.lang.String's constructor and getBytes() method, both of which have an overloaded version that receives a charset name as an argument.
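The following is a minimal sketch of those two overloads in use. The class and method names are hypothetical and only for illustration. It assumes the jar is on the classpath for the "UTF-7" name to resolve, and falls back to the standard "UTF-8" otherwise so that it runs either way.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the JCharset package.
final class CharsetNameExample {

    // Encodes text using String.getBytes(String charsetName).
    static byte[] encode(String text, String charsetName) {
        try {
            return text.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + charsetName, e);
        }
    }

    // Decodes bytes using the String(byte[], String charsetName) constructor.
    static String decode(byte[] bytes, String charsetName) {
        try {
            return new String(bytes, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // "UTF-7" resolves only if a provider such as JCharset is visible to the JVM.
        String name = Charset.isSupported("UTF-7") ? "UTF-7" : "UTF-8";
        byte[] bytes = encode("Hello", name);
        System.out.println(name + " round trip: " + decode(bytes, name));
    }
}
```

Passing a name the JVM does not recognize makes both overloads throw UnsupportedEncodingException, which the sketch converts to an unchecked exception for brevity.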
A command-line utility is included which supports converting files between charsets. For help on usage and available options, run it using the command 'java -jar jcharset-2.1.jar -h'.
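For conversions done from application code rather than the command line, a rough sketch using only standard java.nio APIs might look like the following. It is not the bundled utility's implementation; the class and method names are hypothetical, and it assumes Java 7 or later for java.nio.file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the JCharset package.
final class ConvertSketch {

    // Reads 'in' as text in charset 'from' and writes it to 'out' in charset 'to'.
    static void convert(Path in, Charset from, Path out, Charset to) throws IOException {
        try (Reader reader = Files.newBufferedReader(in, from);
             Writer writer = Files.newBufferedWriter(out, to)) {
            char[] buffer = new char[8192];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
        }
    }
}
```

Any charset name the JVM recognizes can be passed through Charset.forName, including the ones this package provides once the jar is on the classpath.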
Note: Some web/mail containers run each application in its own JVM context. In this case check the container documentation for information on where and how to configure the classpath, such as in WEB-INF/lib, shared/lib, jre/lib/ext, etc. You may need to restart the server for changes to take effect. However, if you use Oracle's JRE, the charsets will be recognized only if the jar is placed in the jre/lib/ext extension directory, or in the container's classpath. This is due to a bug in Oracle's JRE implementation (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4619777).
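To check whether a given deployment actually picked up the jar, a small diagnostic along these lines can be run inside the application. It uses only the standard java.nio.charset.Charset API; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class, not part of the JCharset package.
final class CharsetCheck {

    // Reports whether the running JVM can resolve the given charset name.
    static String describe(String name) {
        if (!Charset.isSupported(name)) {
            return name + ": not available";
        }
        Charset charset = Charset.forName(name);
        return name + ": available as " + charset.name() + " " + charset.aliases();
    }

    public static void main(String[] args) {
        // Names taken from the list of supported charsets.
        for (String name : new String[] { "UTF-7", "SCGSM", "hp-roman8", "KZ-1048" }) {
            System.out.println(describe(name));
        }
    }
}
```

If the names are reported as not available, the jar is most likely not visible to the class loader that performs the charset lookup.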
Which charsets are supported?
"UTF-7" (a.k.a. "UTF7", "UNICODE-1-1-UTF-7", "csUnicode11UTF7", "UNICODE-2-0-UTF-7")
The 7-bit Unicode character encoding defined in RFC 2152. The O-set characters are encoded as a shift sequence. Both O-set flavors (direct and shifted) are decoded.
"UTF-7-OPTIONAL" (a.k.a. "UTF-7O", "UTF7O", "UTF-7-O")
The 7-bit Unicode character encoding defined in RFC 2152. The O-set characters are directly encoded. Both O-set flavors (direct and shifted) are decoded.
"SCGSM" (a.k.a. "GSM-default-alphabet", "GSM_0338", "GSM_DEFAULT", "GSM7", "GSM-7BIT")
The GSM default charset as specified in GSM 03.38, used in SMPP for encoding SMS text messages.
Additional flavors of the GSM charset are "CCGSM", "SCPGSM", "CCPGSM", "CRSCPGSM" and "CRCCPGSM":
CC signifies mapping the Latin capital letter C with cedilla character, SC signifies mapping the Latin small letter c with cedilla character, P signifies the packed form (8 characters packed in 7 bytes), and the CR prefix signifies padding with CR instead of zeros to avoid ambiguity, all as specified by the spec. See javadocs for details.
"hp-roman8" (a.k.a. "roman8", "r8", "csHPRoman8", "X-roman8")
The HP Roman-8 charset, as provided in RFC 1345.
ISO/IEC 646 National Variants
"ISO646-DE" ("ISO-IR-21", "DIN_66003")
"ISO646-FI" ("ISO646-SE", "ISO-IR-10")
"ISO646-IRV" ("ISO-IR-2", "ISO_646.IRV:1983")
"ISO646-JAO" ("ISO646-JP-OCR-B", "ISO-IR-92")
"ISO646-US" ("ISO-IR-6", "ISO_646.irv:1991")
"ISO-8859-8-BIDI" (a.k.a. "csISO88598I", "ISO-8859-8-I", "ISO_8859-8-I",
"csISO88598E", "ISO-8859-8-E", "ISO_8859-8-E")
The ISO 8859-8 charset implementation exists in the standard JRE. However, it lacks the i/e aliases, which specify whether bidirectionality is implicit or explicit. The charset conversions themselves are similar. This charset complements the standard one.
"ISO-8859-6-BIDI" (a.k.a. "csISO88596I", "ISO-8859-6-I", "ISO_8859-6-I",
"csISO88596E", "ISO-8859-6-E", "ISO_8859-6-E")
The ISO 8859-6 charset implementation exists in the standard JRE. However, it lacks the i/e aliases, which specify whether bidirectionality is implicit or explicit. The charset conversions themselves are similar. This charset complements the standard one.
"KOI8-U" (a.k.a. "KOI8-RU", "KOI8_U")
The KOI8-U Ukrainian charset, as defined in RFC 2319.
"KZ-1048" (a.k.a. "STRK1048-2002", "RK1048", "csKZ1048")
The KZ-1048 charset, which is the Kazakhstan national standard.
"MIK"
The MIK Cyrillic code page, commonly used by DOS applications in Bulgaria.
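The shift sequences mentioned for the UTF-7 charsets can be illustrated with a small encoder. Under RFC 2152, characters outside the directly encoded set are written as '+', followed by the unpadded base64 form of their UTF-16 big-endian bytes, followed by '-'. The sketch below is not this package's implementation: the class name is hypothetical, it assumes Java 8 or later for java.util.Base64, it always emits the closing '-' (which the RFC allows but does not always require), and it shift-encodes the O-set characters in the same way as the "UTF-7" flavor.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration class, not part of the JCharset package.
final class Utf7Sketch {

    // RFC 2152 "Set D" direct characters, plus space, tab, CR and LF.
    private static final String DIRECT =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'(),-./:? \t\r\n";

    private static boolean isDirect(char c) {
        return DIRECT.indexOf(c) >= 0;
    }

    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '+') {
                out.append("+-"); // a literal plus sign has its own short form
                i++;
            } else if (isDirect(c)) {
                out.append(c);
                i++;
            } else {
                // Collect the whole run of characters that need shifting.
                int start = i;
                while (i < text.length() && text.charAt(i) != '+' && !isDirect(text.charAt(i))) {
                    i++;
                }
                byte[] utf16 = text.substring(start, i).getBytes(StandardCharsets.UTF_16BE);
                out.append('+')
                   .append(Base64.getEncoder().withoutPadding().encodeToString(utf16))
                   .append('-');
            }
        }
        return out.toString();
    }
}
```

For instance, the RFC's sample text "A≢Α." comes out of this sketch as "A+ImIDkQ-.", which differs from the RFC's own rendering only by the optional closing '-'.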
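The packed GSM form (8 characters packed in 7 bytes) relies on each character code being only 7 bits wide, so the codes can be laid end to end in a bit stream, least significant bit first. The sketch below shows that bit arithmetic only. It is not the package's PackedGSMCharset, the class and method names are hypothetical, and it works on already-mapped 7-bit codes rather than on Java characters.

```java
// Hypothetical illustration class, not part of the JCharset package.
final class SeptetPacking {

    // Packs 7-bit values into bytes, filling each byte from its low bits upward.
    static byte[] pack(byte[] septets) {
        byte[] out = new byte[(septets.length * 7 + 7) / 8];
        int bitPos = 0;
        for (byte septet : septets) {
            int value = septet & 0x7F;
            int index = bitPos / 8;
            int shift = bitPos % 8;
            out[index] |= (byte) (value << shift);
            if (shift > 1) { // the 7 bits do not fit; the rest spills into the next byte
                out[index + 1] |= (byte) (value >> (8 - shift));
            }
            bitPos += 7;
        }
        return out;
    }

    // Reverses pack(); the number of septets must be supplied by the caller.
    static byte[] unpack(byte[] octets, int septetCount) {
        byte[] out = new byte[septetCount];
        int bitPos = 0;
        for (int i = 0; i < septetCount; i++) {
            int index = bitPos / 8;
            int shift = bitPos % 8;
            int value = (octets[index] & 0xFF) >> shift;
            if (shift > 1) {
                value |= (octets[index + 1] & 0xFF) << (8 - shift);
            }
            out[i] = (byte) (value & 0x7F);
            bitPos += 7;
        }
        return out;
    }
}
```

Seven characters also occupy 7 bytes and leave 7 unused bits. When those bits are zero they are indistinguishable from an eighth character with code 0, which is the ambiguity that the CR padding variants address.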
What's New?
In version 2.1:
- Added CR padding support to PackedGSMCharset.
- Added CRCCPackedGSMCharset and CRSCPackedGSMCharset packed GSM variants with CR padding enabled.
- Added KZ-1048 charset, with aliases STRK1048-2002, RK1048, csKZ1048.
- Improved javadocs.
In version 2.0:
- Added 32 national variants of the ISO/IEC 646 charset.
- Moved GSM classes to separate sub-package.
- Changed UTF-7 decoding to be lenient in accepting trailing zero bits in shift sequences.
- Changed UTF7Charset.contains to reflect full Unicode equivalency.
- Added a command-line utility supporting file charset conversion.
- Added ByteLookupCharset.createTable utility method.
- Generalized createInverseLookupTableDefinition to Utils.toInverseLookupTableDefinition.
- Applied many refactorings, simplifications, clarifications and clean-ups.
- Applied various optimizations to encode/decode loops.
- Improved docs.
In version 1.6:
- Migrated to Maven build system, directory structure and artifact conventions.
- Added OSGi headers to jar manifest.
- Fixed javadoc errors when building with JDK 8.
- Improved javadocs and misc. minor refactorings.
In version 1.5:
- Fixed GSMCharset encoding of non-breakable space character (0x00A0), which shouldn't be encoded.
- Fixed PackedGSMCharset decoder edge case of handling overflow continuation for large strings (>256) when calling decoder directly (not via String methods).
- Fixed PackedGSMCharset decoder edge case of string size which is a multiple of internal buffer size (256) greater than 256 and has escaped characters on decoded buffer boundaries.
- Simplified CharsetProvider.charsetForName flow.
In version 1.4:
- Dropped support for JDK 1.4 and earlier.
- Added MIK charset.
- Added KOI8_U as a KOI8-U alias.
- Optimized EscapedByteLookupCharset encoding buffer allocation for strings with no escape chars.
- Added ByteLookupCharset.updateInverseLookupTable convenience method.
- Improved docs.
In version 1.3:
- Added X-roman8 as an hp-roman8 alias.
- Added the generic EscapedByteLookupCharset to simplify implementation of single-escape-byte charsets.
- Created two flavors of the GSM charset: CCGSMCharset (mapping the Latin capital letter C with cedilla) and SCGSMCharset (mapping the Latin small letter c with cedilla). See javadocs for details.
- Added support for Packed GSM charset, with the two flavors as well.
- Renamed the canonical charset name for the new GSM family, to make the flavor choices explicit.
In version 1.2.1:
- Fixed a combined JavaMail-JCharset bug that could cause an infinite loop on some inputs.
- Updated the ISO-8859-8-i/e mapping for the MACRON character. The incorrect mapping in the JDK's implementation of ISO-8859-8 is fixed as of JDK 1.5 (see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4760496). We now determine the running JDK version, and if it's JDK 1.5 or higher we use the correct mapping. This way we remain consistent with the running JDK ISO-8859-8 charset implementation.
In version 1.2:
- Added KOI8-U charset.
In version 1.1:
- Added ByteLookupCharset class to simplify implementation of single byte charsets.
- Added GSM-default-alphabet charset (used in SMPP).
- Added hp-roman8 charset.
- Added ISO-8859-8-i/e charset.
- Added ISO-8859-6-i/e charset.
In version 1.0:
- This is the first release of the Java Charset package.
License
The JCharset Package is provided under the GNU General Public License agreement.
For non-GPL commercial licensing please contact the author.
Donate
If you like it, why not give something back?
Contact
You can contact the author via e-mail at:
Please write in with any bugs, suggestions, fixes, contributions, or just to drop a good word and let me know you've found the JCharset Package useful and you'd like it to keep being maintained.
For updates and additional information, you can always visit the website at: