So You Want to Be a Hacker? Part IV: Compression Formats
Game data archives almost always employ some type of compression for their contents, and sometimes will even mix and match different algorithms for different types of files. Understanding the typical compression formats is therefore crucial to the success of a game hacker.
Moreover, you need to be able to recognize the common algorithms just from their compressed data alone, so when you’re staring at hex dumps, you will know how to proceed. In today’s installment, we’ll go through some of the most popular formats, how they work, and how you can recognize them “in the wild”.
First of all, note that this task is in theory quite hard. The ideal compression algorithm produces data that is essentially indistinguishable from random noise (which counterintuitively contains the highest density of information). Fortunately, games have two additional requirements: to be able to access data quickly, and to be programmed by normal coders as opposed to Ph.D. computer scientists.
Both of these requirements mean games typically use either industry-standard formats or relatively simple, quick algorithms that anyone can code from a book. The former often include extra identifying information we can spot, and the latter compress data poorly enough that it looks like “compressed data” instead of looking like random noise.
Zlib

Let’s start with the industry standard, zlib. This is an open-source compression library which implements the classic ‘deflate’ algorithm used in
.gz files. It’s pretty fast and pretty decent, and since it’s already written and completely free, it gets used all over the place, including in game archives.
How can you recognize it?
0x78 0x9C. The first two bytes of a standard zlib-compressed chunk of data will be those two bytes, which specify the default settings of the algorithm (‘deflate’ with a 32KB window, default compression level). Alternately, you will often see
0x78 0xDA, which is the same except using the maximum compression level. If you see those two bytes at the start of where you expect a file to be, rejoice, since you’ve just solved a big mystery with next to zero effort.
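If you want to automate that check, here’s a quick Python sketch (the function name and the heuristic are mine, not any standard tool) that scans a blob for those signature bytes and confirms each candidate with a trial decompression:

```python
import zlib

def find_zlib_streams(blob: bytes):
    """Scan a blob for plausible zlib streams: look for the common
    0x78 0x9C / 0x78 0xDA header bytes, then confirm each hit with a
    trial decompression.  Purely a heuristic."""
    hits = []
    for i in range(len(blob) - 1):
        if blob[i] == 0x78 and blob[i + 1] in (0x9C, 0xDA):
            try:
                # A decompression object tolerates trailing junk
                # after the end of the stream.
                d = zlib.decompressobj()
                hits.append((i, d.decompress(blob[i:])))
            except zlib.error:
                pass  # false alarm: those two bytes weren't a header
    return hits

# Demo: a zlib stream buried between runs of junk bytes.
blob = b"\x00" * 8 + zlib.compress(b"hello world") + b"\x01" * 4
hits = find_zlib_streams(blob)
```
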
Decoding this format is also pretty easy, since virtually every modern language will have a zlib library for it. In C, you just link against libz (with -lz) and call:

#include <zlib.h>

uncompress(new_buffer, &new_size, comp_buffer, comp_size);
Be sure to allocate enough memory for your expanded data: hopefully the archive index will have already provided you with the original file size. The function will return a status code and additionally update new_size with the amount of data that was uncompressed.
In Python, dealing with zlib is just embarrassingly easy:
new_data = comp_data.decode('zlib')
One of the built-in string codecs (alongside character encodings like ASCII, UTF-8, and Shift-JIS) is zlib encoding, so if you have your data as a string you can just expand it like that. Alternately you can import zlib and use the more direct function calls for extra control. (Note that the string codec is a Python 2 feature; Python 3 removed it, leaving only the zlib module calls.)
Compressing data is just as easy, with the
compress() function in C — or
compress2() if you want to specify the compression level — and the
encode('zlib') string method in Python (or zlib library calls).
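For the record, here is what both directions look like with direct zlib module calls, which work in any Python version:

```python
import zlib

original = b"original file data"
comp_data = zlib.compress(original, 9)  # level 9: stream starts 0x78 0xDA
new_data = zlib.decompress(comp_data)   # inflating is level-agnostic
```

Level 6 (the default) would give you the familiar 0x78 0x9C header instead.
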
I don’t want to say much about the inner workings of the deflate algorithm, since that really doesn’t come up very often: you can safely treat it like a black box. However, there is one extra facet I’ve run across: the Adler-32 checksum. This is a very simple 32-bit checksum algorithm (like CRC32) which is included in the zlib library, and therefore also gets used by games now and then. Additionally, the zlib format specifies that an Adler-32 checksum is appended to the end of a compressed stream for error-checking purposes.
However, some games twist their zlib implementation slightly by either leaving off the checksum (in favor of using their own elsewhere in the archive) or moving it into the archive index instead. This will cause the zlib uncompress call to return an error, even though it actually uncompressed the data successfully.
So, a word to the wise: if you’re sure that the game is using zlib but you keep getting errors when you try to expand the data, look for this case. You may have to do a little twiddling of the compressed data to add the expected checksum at the end, or just ignore the zlib error codes and continue as normal.
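Here’s a concrete Python sketch of the second workaround (the stripped stream is simulated here; a real game would hand it to you directly). A decompression object returns everything it managed to inflate and simply never reaches “end of stream”, so the missing checksum costs you nothing:

```python
import zlib

# Simulate a stream whose trailing 4-byte Adler-32 checksum the game
# has stripped off (everything else is a normal zlib stream):
stream = zlib.compress(b"some game data", 9)[:-4]

# zlib.decompress(stream) would raise "incomplete or truncated
# stream" here, but a decompression object happily returns all the
# data it inflated and just leaves the stream un-finalized.
d = zlib.decompressobj()
data = d.decompress(stream)
```

If the game has dropped the 2-byte header too, `zlib.decompress(body, -15)` inflates a raw deflate stream with no header or checksum at all.
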
Run-Length Encoding (RLE)

This is the simplest kind of home-grown compression you’re likely to run across. It shows up in image compression a lot, sometimes as part of a larger sequence of processing. Basically the idea is to start with a sequence of bytes and chunk them up whenever you run across a repeated value:
31 92 24 24 24 24 24 C5 00 00 = 31 92 5*24 C5 2*00
Exactly how you represent the chunked-up data varies a bit from algorithm to algorithm, depending on what you expect the sequences to look like.
Escape byte. You might designate a byte, say
0xFF as a flag for designating a run of repeated bytes, and follow it by a count and a value. So the above data would be:
31 92 FF 05 24 C5 FF 02 00 = 31 92 5*24 C5 2*00
If the flag byte actually appears in your data, you have to unescape it by, say, having a length of 0 in the next byte.
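To make that concrete, here’s a tiny Python decoder for the escape-byte variant (the function name and the 0-count unescape convention are just illustrative):

```python
def rle_escape_decode(data: bytes, flag: int = 0xFF) -> bytes:
    """Decode escape-byte RLE: `flag` introduces a (count, value)
    pair, and a count of 0 unescapes a literal flag byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == flag:
            count = data[i + 1]
            if count == 0:
                out.append(flag)  # escaped literal flag byte
                i += 2
            else:
                out += bytes([data[i + 2]]) * count
                i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

Feeding it the example above, 31 92 FF 05 24 C5 FF 02 00, reproduces the original ten bytes.
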
Repeated bytes. Here you just start running through your data normally, and whenever you have two bytes in a row that are the same, you replace all the rest of them (the third and thereafter) with a count byte:
31 92 24 24 03 C5 00 00 00 = 31 92 24 24 3*24 C5 00 00 0*00
If you don’t have a third repeated value, you’ll need to waste a byte to give a count of 0.
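A sketch of a decoder for this repeated-bytes variant (again, the details are illustrative, not any one game’s format):

```python
def rle_pair_decode(data: bytes) -> bytes:
    """Decode repeated-bytes RLE: after two identical literals in a
    row, the next byte says how many MORE copies to emit (maybe 0)."""
    out = bytearray()
    i = 0
    prev = None
    while i < len(data):
        b = data[i]
        i += 1
        out.append(b)
        if b == prev:
            count = data[i]  # two in a row: read the count byte
            i += 1
            out += bytes([b]) * count
            prev = None      # run finished; start matching fresh
        else:
            prev = b
    return bytes(out)
```
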
Alternating types. Here you assume that your data alternates between runs of raw values and runs of repeated bytes, and prepend length counts to each type:
02 31 92 05 24 01 C5 02 00 = 2 (31 92), 5*24, 1 (C5), 2*00
Naturally, if you have two repeated runs in a row, you’ll have to waste a byte to insert a 0-length raw sequence between them. A special case of this I’ve run across is when you expect to have long runs of zero in particular instead of any random byte, so you just alternate between runs of zeroes (with just a bare count value) and runs of raw data.
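And the alternating-types variant, decoded in the same sketchy style:

```python
def rle_alt_decode(data: bytes) -> bytes:
    """Decode alternating-types RLE: a raw run (count, then that many
    literal bytes) alternates with a repeat run (count, then one value)."""
    out = bytearray()
    i = 0
    raw = True  # this sketch assumes the stream starts with a raw run
    while i < len(data):
        count = data[i]
        i += 1
        if raw:
            out += data[i:i + count]
            i += count
        else:
            out += bytes([data[i]]) * count
            i += 1
        raw = not raw
    return bytes(out)
```
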
Note, of course, that there is some subtlety which can be involved depending on the variant you run across. For instance, it’s often the case that pairs of bytes aren’t efficient to encode, so they’re just treated as raw data. Also, rather than giving lengths themselves, sometimes you encode, say, length-3, if length values of 0, 1, and 2 aren’t ever needed. In some cases you might also run across multi-byte length values (controlled, say, by the high bit in the first length byte).
For images, you may have pixels instead of bytes which are the fundamental unit of repetition. In that case, even two RGB pixels in a row which are the same can be successfully compressed.
In any event, how do you recognize this format? The general principle is that all of these variations have to fall back on including raw bytes in the file a lot, so you want to try to look for those identifiable sequences (RGB triplets in image formats are good to key off of) interspersed with control codes. It’s often helpful to have an uncompressed version of an image to compare against, which you can recover from a screenshot or from snooping the game’s memory in a debugger (a topic for later articles).
LZSS (Lempel-Ziv-Storer-Szymanski)

One step up from run-length encoding is to be able to do something useful with whole sequences of data that are repeated instead of single bytes. Here, the algorithm keeps track of the data it’s already seen, and if some chunk is repeated, it just encodes a back-reference to that section of the file instead:
I love compression. This is compressed! = I love compression. This [is ][compress]ed! = I love compression. This [3,-3][8,-22]ed!
The bracketed sections indicate runs of characters that have been seen before, so you just give a length and a backwards offset for where to copy them from. A lot of compression algorithms, zlib included, are based on this general principle, but one version that seems to crop up a lot is LZSS.
The special feature of this format is how it controls switching between raw bytes and back-references. It uses one bit in a control byte to determine this, often a 1 for a raw byte and a 0 for a back-reference sequence. So one control byte will determine the interpretation of the next 8 pieces:
I love compression. This [3,-3][8,-22]ed! = FF "I love c" FF "ompressi" FF "on. This" 39 " " 03 03 08 16 "ed!"

The 0xFF control bytes just say “8 raw bytes follow”, and the
0x39 byte is binary
00111001: reading from the least-significant bit, that’s 1 raw byte (the space), 2 back-references, and then 3 raw bytes.
Recognizing this format in the wild rests on the control bytes, and you can spot it most easily in script files. If you see readable text interrupted by an FF or other junk character in every 9th byte, you’re dealing with LZSS. You can also spot this in image formats, since the natural rhythm of RGB triplets will get interrupted by the control bytes.
Note that the farther in the file you go, the harder this gets to recognize, since the proportion of back-references tends to climb once the algorithm has a larger dictionary of previously-seen data to draw upon.
The major hassle with this format is the nature of the back-references. There are a lot of subtle variants of this. One of the most popular ones uses a 4096-byte sliding window, and encodes back-references as a 12-bit offset in the window and a 4-bit length. However, is the length the real length or length-3? Is the offset relative to the current position, or is it an array offset in a separate 4096-byte ring buffer? Is the control byte read most- or least-significant bit first? I’ve even run across an example where there were several different back-reference formats: a 1-byte one for short runs in a small window, a 2-byte one for medium-length runs in a decent window, and a 3-byte one for large runs over a huge window. You will just need to experiment a little bit to see exactly what the particular game is doing, unfortunately.
One subtle point is that you may be allowed to specify a back-reference which overlaps with data you haven’t seen yet. By that I mean a length larger than the negative offset involved:
This is freeeeeeeaky! = This [is ]fre[eeeeee]aky! = This [3,-3]fre[6,-1]aky!
The [6,-1] back-reference works because you are copying the bytes one at a time: first you copy the second ‘e’ from the first, and now you can copy the third ‘e’ from the second, etc. Be aware of this subtlety when you implement your own algorithms, since (a) it can preclude you from doing certain types of memory copying or string slicing, and (b) not all games will be able to understand this type of reference, so don’t encode that way unless you know yours can.
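Putting the whole section together, here’s a Python decompressor for one hypothetical LZSS variant: control bits read LSB-first (1 = raw byte, 0 = back-reference), references stored as two bytes holding a 12-bit backwards offset and a 4-bit length stored as length-3. A real game will differ in several of these choices, so treat it as a template to tweak, not gospel:

```python
def lzss_decompress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        control = data[i]
        i += 1
        for bit in range(8):
            if i >= len(data):
                break
            if (control >> bit) & 1:  # 1 bit: one raw byte
                out.append(data[i])
                i += 1
            else:                     # 0 bit: two-byte back-reference
                lo, hi = data[i], data[i + 1]
                i += 2
                offset = lo | ((hi & 0xF0) << 4)  # 12-bit backwards offset
                length = (hi & 0x0F) + 3          # 4-bit length, stored as length-3
                # Copy one byte at a time so overlapping references
                # (length greater than offset) work correctly.
                for _ in range(length):
                    out.append(out[-offset])
    return bytes(out)
```

For instance, the stream 01 61 01 06 decodes to ten letter ‘a’s: one raw byte, then a 9-byte reference reaching back one position.
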
Huffman Encoding

From one point of view, this is easier than other algorithms since it only works on single bytes (or symbols, in general) at a time, but it’s also trickier since the compressed data is a bitstream rather than easy-to-digest bytes and control codes.
It works by figuring out the frequencies of all the bytes in a file, and encoding the more common ones with fewer than 8 bits, and the less common ones with more than 8, so you end up with a smaller file on average. This is very closely related to concepts of entropy, since each symbol generally gets encoded with a number of bits equal to its own entropy (as determined by its frequency).
Let’s be specific. Consider the string “abracadabra”. The letter breakdown is:
a : 5/11 ~ 1.14 bits
b : 2/11 ~ 2.46 bits
c : 1/11 ~ 3.46 bits
d : 1/11 ~ 3.46 bits
r : 2/11 ~ 2.46 bits
Here I’ve given the number of bits of entropy each frequency corresponds to (i.e. if a letter has a 25% chance of appearing, it carries 2 bits of entropy, since you need a 2-bit code, say 00, to single it out from the other 75% of possibilities). Unfortunately we can’t use fractional bits, so we may have to round up or down from these theoretical values.
How do we choose the right codes? Well, the best way is to build up a tree, starting from the least-likely values. That is, we treat, say, “c or d” as a single symbol with a frequency of 2/11, and say that if we get that far we know we can just spend one extra bit to figure out whether we mean c or d:
a : 5
0=c + 1=d : 2
b : 2
r : 2
Then we continue doing the same thing. At each step we combine the two least-weight items together, adding one bit to the front of their codes as we go. In the case of ties, we pick the ones with shorter already-assigned codes, or alphabetically first values:
a : 5
0=b + 1=r : 4
0=c + 1=d : 2

a : 5
00=b + 01=r + 10=c + 11=d : 6

000=b + 001=r + 010=c + 011=d + 1=a : 11
So the codes we end up with are:
a : 1
b : 000
c : 010
d : 011
r : 001
You will notice an excellent property of these codes: they are not ambiguous. That is, you don’t have
1 for ‘a’ and
100 for ‘b’… as soon as you hit that first 1, you know you can stop and go on to the next symbol without needing to read any more. Therefore, “abracadabra” just gets encoded as:
a b   r   a c   a d   a b   r   a
1 000 001 1 010 1 011 1 000 001 1 = 10000011 01010111 00000110 = 83 57 06
We’ve compressed 88 bits (11 bytes) down to 23 bits (just under 3 bytes). Almost always the bits are packed most- to least-significant in a byte.
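If you’d like to play with this, here’s a Python sketch of the tree-building procedure. Tie-breaking here is by insertion order, so the exact codes may differ from the ones above, but the code lengths (and hence the 23-bit total) come out the same:

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman code table for the symbols in `data`.
    Ties are broken by insertion order, so exact codes may differ
    between implementations, but the code lengths are optimal."""
    freq = Counter(data)
    # Heap entries: (weight, tiebreak, {symbol: code-so-far})
    heap = [(w, n, {sym: ""}) for n, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, lo = heapq.heappop(heap)  # two least-weight subtrees
        w2, _, hi = heapq.heappop(heap)
        # Merging prepends one bit to every code in each subtree.
        merged = {s: "0" + c for s, c in lo.items()}
        merged.update({s: "1" + c for s, c in hi.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
bits = "".join(codes[ch] for ch in "abracadabra")  # 23 bits total
```
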
One subtlety is the exact method of tree creation, which assigns the codes. The method described above is “canonical”, but sometimes games will use their own idiosyncratic methods which you will have to match exactly to avoid getting garbage.
How do you recognize this in a data file? Well, the decompressor needs to know the codes, and the easiest way to specify this is to give it the frequencies (or more easily, the bit weights) of the values so it can construct its own tree.
Therefore, the compressed data will usually start with, say, a 256-element table of bit weights. So if you see 256 bytes of
05 06 08 07 0C 0B 06 — values that are around 8 plus or minus a few — followed by horrendous random junk, you’re probably looking at Huffman encoding.
Sometimes instead of bit weights you’ll have the actual frequency counts instead, which might need to have a multi-byte encoding scheme if they’re above 256. In that case, you’re mainly looking for a few hundred bytes of “stuff” followed by a sharp transition to much more random data.
Other Formats

Needless to say, the algorithms covered here are not the full range of compression formats out there. I’ll just briefly mention some others in case you run across them, though I haven’t really seen them in the wild.
Arithmetic Coding

This is vaguely related to Huffman encoding, in that you are working strictly with single bytes (or symbols) and trying to stuff the most frequent ones into fewer bits. However, instead of being restricted to an integral number of bits for each one, here you are allowed to be fractional on average.
This works by breaking up the numerical interval [0,1) into subranges corresponding to each symbol: the more common symbols correspond to larger ranges, in proportion to their frequency. You start with [0,1), and the first byte restricts you to the subrange for that symbol. Then the second byte restricts you to a sub-subrange, the third byte to a sub-sub-subrange, etc. Your final encoded data is any single numerical value inside the tiny range you end up in: just pick the number in that range you can represent in the least number of bits as a binary fractional value.
Needless to say there are some good tricks for implementing this without using ludicrously-high-precision math, but I won’t go into that.
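Just to make the interval-narrowing idea concrete, here’s a toy Python sketch with a made-up three-symbol model. This is an illustration only, not a practical coder, which would use the renormalization tricks mentioned above instead of exact fractions:

```python
from fractions import Fraction

# A made-up model: 'a' is twice as likely as 'b' or 'c', so it gets
# the bigger slice of [0,1).
ranges = {
    "a": (Fraction(0, 1), Fraction(1, 2)),
    "b": (Fraction(1, 2), Fraction(3, 4)),
    "c": (Fraction(3, 4), Fraction(1, 1)),
}

def encode_interval(message: str):
    """Narrow [0,1) symbol by symbol; any number in the final
    interval identifies the whole message (given its length)."""
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        s_lo, s_hi = ranges[sym]
        width = high - low
        low, high = low + width * s_lo, low + width * s_hi
    return low, high

low, high = encode_interval("aab")
```

Here “aab” narrows to [1/8, 3/16), so the 3-bit binary fraction 0.001 (which is exactly 1/8) suffices to encode it, given the message length.
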
LZ77 (Lempel-Ziv ’77)
This is the core of the zlib deflate algorithm, but you’ll sometimes see variants outside of that standard, so it’s useful to know a little about. It’s basically a combination of standard back-references as in LZSS, plus Huffman encoding. You just treat the back-reference command “copy 8 bytes” as a special symbol, like a byte value of 256+8=264.
Then, with this mix of raw data bytes and back-reference symbols, you run it through a Huffman encoding to get the final compressed output. Typically you will do something different with the back-reference offsets: either leave them as raw data, or encode them in their own separate Huffman table.
LZW (Lempel-Ziv-Welch)

When taught correctly, this is an algorithm with a mind-blowing twist at the end. As it runs through the file, it builds up an incremental dictionary of previously-seen strings and outputs codes corresponding to the dictionary entries. And then, at the end, when you start to wonder how to encode this big dictionary so the decompressor can use it to make sense of the codes, you just throw the dictionary away. Cute.
Of course it turns out that things are cleverly designed so that the decompressor can build up an identical dictionary as it goes along, so there’s no problem. This algorithm was patent-encumbered for a while, so it didn’t get as widely adopted as it might otherwise have been, but you might start seeing more of it these days.
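For the curious, the whole trick fits in a few lines of Python. This sketch emits plain integer codes rather than the fixed- or growing-width bit packing a real implementation would use:

```python
def lzw_compress(data: bytes) -> list:
    # The dictionary starts with all single bytes, codes 0-255.
    dictionary = {bytes([i]): i for i in range(256)}
    w = b""
    out = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc                    # keep extending the match
        else:
            out.append(dictionary[w])
            dictionary[wc] = len(dictionary)  # grow the dictionary
            w = bytes([byte])
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes: list) -> bytes:
    # The decompressor rebuilds the identical dictionary as it goes,
    # which is why none ever needs to be stored in the file.
    dictionary = {i: bytes([i]) for i in range(256)}
    w = dictionary[codes[0]]
    out = bytearray(w)
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # The one tricky case: the compressor created this code
            # on the very step that emitted it.
            entry = w + w[:1]
        out += entry
        dictionary[len(dictionary)] = w + entry[:1]
        w = entry
    return bytes(out)
```
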
I’ve focused here on lossless general-purpose compression: the sorts of things that are done to data at the archive level. There is also a stage below this, where data can be compressed before even being put into the archive: making raw images into JPEGs, PNGs, or other compressed image formats, and converting sounds to MP3s, OGGs, and so forth. In many cases those compression steps are just a lossy approximation to the original data, which is okay for graphics and sounds but bad for other files.
In a later installment, I’ll be tackling image formats in particular in more detail, since you will tend to run across custom ones a lot, some of which include image-specific processing steps (like, say, subtracting pixels from their neighbors) which wouldn’t make a lot of sense in a more general-purpose compression algorithm. Encryption is another later topic, since sometimes that will keep you from being able to recognize compressed data for what it is.
And naturally, if you’ve run across other general compression algorithms used in games you’ve looked at, please mention them in the comments, since I don’t pretend to have investigated all the games out there… I’m still being surprised all the time.
Haeleth @ August 7th, 2006 | 6:02 pm
Pretty comprehensive, though you forgot to mention at the start that this is where most people’s brains explode. ^^
The only other common form of compression I’ve encountered in my own hacking is dictionary compression (trivial substitution of certain codes for a fixed set of particularly common strings, stored separately). And I’ve never seen that in a computer game, only in antiquated console titles, so it’s not really worth covering in any depth.
September @ August 31st, 2006 | 11:59 pm
Thanks for the guide, it’s been really helpful.
However, I’m at the aforementioned “brain explosion” stage trying to deal with (I think) an LZSS compressed file.
Some things came up that weren’t mentioned in your guide… First of all, the control bytes don’t always occur every 9th place; it seems to be taking 2 bytes for every back-reference. I noticed the control byte is followed by 9 bytes instead of 8 if 7 of them are designated raw by the control byte.
Also, there’s a
0x157F-long section at the beginning of the file with lots of rather long and nearly identical repeating patterns.
BB AE B6 AB B0 B7 AC honestly appears in this section like a hundred times.
Any further clarification on this compression format would help a lot, or perhaps some direction to where I could get more info.
Edward Keyes @ September 2nd, 2006 | 11:11 pm
Yeah, the “every 9th byte” is only in the case where all 8 data bytes are raw instead of back-references, which is what you usually see at the very beginning of a file with a lot of unpredictable structure, like a script file. In the general case you can have a control byte after anywhere from 8 to 16 data bytes (or more, if the back references can span 3 bytes, though 2 is much more typical). The value of the control byte will tell you how many data bytes to expect.
I’m not sure what to tell you about the repeating sequences. If it is LZSS, chances are the repeating sequences are a sign that the underlying file is also very repetitive, so you’re getting the same back references over and over again. But my best advice there, if you have the option, is to look at some other files in the archive as well, rather than being solely focused on one: a lot of the time you’ll find examples of files that compressed more or less well, giving you some extra examples to generalize from. If you’re handy enough with a debugger to try to get access to the uncompressed data in the game’s memory, that’s also an awesome way to understand what’s going on.
As for LZSS references, there’s the Wikipedia entry of course, and you can search for tons of more info and code samples floating around the web, but the main trouble is that there isn’t a single “LZSS format” like there is with zlib… I’ve almost never seen exactly the same variant twice, as everybody has a different way to do the back references (the most common is 12-bit offset and 4-bit length, though that gets arranged in a few ways). So it’s going to be difficult to find a reference that matches your particular variant exactly.
September @ September 6th, 2006 | 11:37 pm
I think I’m beginning to see how this file is working, or at least how an LZSS file is supposed to work. One weird thing though is every other reference has really high values. Like a reference of
0xF5F0 at offset 0x47 in the file. Have you ever seen anything like this before? Maybe I’m supposed to flip it or something :-/
sorry, I know this isn’t a help thread. thanks for being helpful to a newbie.
Edward Keyes @ September 7th, 2006 | 1:50 am
Well, one thing you may need to work out is how the reference is encoding its values. Is
0xF5F0 a length of
0xF and an offset of
0x5F0, or a length of 0x0 and an offset of
0xF5F, or maybe if there’s an endian swap going on (as you noted) it could really be
0xF0F5, with a similar breakdown of length and offset. Plus a length of 0xF may really mean, say, 17 or 18 bytes since lengths of 0, 1, or 2 might not ever be used, etc.
All the different possibilities get troublesome fast, so it’s best to work with file examples where you can make decent guesses about the uncompressed data if possible. And of course if you’re handy with a debugger you can grab the uncompressed data directly or just examine the uncompression assembly itself.
September @ September 9th, 2006 | 1:31 am
thanks for the info. I tried a couple debug programs and I’m honestly not even familiar enough with the terminology to understand what was going on.
but just in case you or anyone else was curious or has a similar experience in the future, what was happening to me is that the real offset was offset +
0x14. So if the real offset is less than 0x14 this gets wrapped back around. The control byte I was talking about, for example,
[F5,F0] had an offset of
0xFF5 and a real offset of
0x9. Also, the offset is calculated from the start of the buffer, not spaces back. And the real length was length + 3.
man, that gave me headaches for days. but it really feels great to have figured it out. now I just need to learn how to code… :-P
Edward Keyes @ September 9th, 2006 | 8:12 pm
Slow Fourier Transform » A reverse-engineering puzzle @ May 26th, 2007 | 2:50 pm
[…] There’s three compression types.
0x00 is raw storage, which was obvious.
0x01 is an LZSS flavor, which took me some 20 hours straight to decipher — it always starts with a control byte and stores references as two bytes AA BC, where BAA is the reference into a
0x1000 ring buffer (with a
0x12 offset for some silly reason) and C is the run length-3. Thankfully I had a set of files which were repackaged by a translator of the game who neglected to write a compression procedure, so I knew what to compare with, and soon after I started writing actual code it all came together; I have extracted the majority of the content, including even a few forgotten PSD files. […]
Criptych @ May 28th, 2007 | 9:42 am
I’ve seen a few of these formats before, especially the LZSS variants. Perhaps the strangest one I’ve come across (not counting SPB, used by NScripter) is the ZBM image format from the original X-Change; strange not because it’s proprietary or complicated, but because you can decompress them with the Windows “expand” utility - it’s the exact same format!
WinKiller Studio @ June 9th, 2007 | 1:43 am
ZBM is nothing but BMP with the first 100 bytes XORed with $FF (the compression is standard LWA or SZDD by Microsoft; they used it for DOS and Windows 3.1 releases. It’s really rare and even too old).
Currently I’m working on a real “universal” tool for Japanese visual-novel fan-translators - Anime Editing Tools (available for download).
I want to decode Ever17 GCPS graphical data. It’s some kind of zlib compression, but… my head is already blown to bits - help me if you can, guys!
WinKiller Studio @ June 9th, 2007 | 1:50 am
I forgot to say this earlier, but my tool is an open-source Delphi project.
So I will appreciate any help (the author of IrfanView, Irfan Skiljan, has already helped me with RLE compression documentation).
More info is on my site (the one English page there).
WinKiller Studio @ June 11th, 2007 | 12:07 am
I know I’m a newbie (maybe even too much of a spoiled newbie, forgive me. And forgive the ugly English too, OK?). Well, I’ve been working lately on decoding the GCPS (or just CPS) graphical data format that is used in Ever17 - the out of infinity (© KID\HIRAMEKI Int.) (and, that’s my guess, in Memories Off 2nd too, since they’re both built on the same engine. I even found PARTS OF FORGOTTEN Memories Off 2nd code right in the EXE! :) ). The format is really strange. The game IS USING alpha channels. But, when I looked at the header…
Header structure (Object Pascal defs here):

const 'CPS'+#0 - CPS header (4 bytes)
dword 4 bytes  - CPS file size
dword 4 bytes  - strange thing for sure! always equals 16842854, or 0x66 0x00 0x01 0x01. Possibly a count of used colors (16.8 million) or a version number (11.102), or maybe a DOS date stamp (01.08.1980, 00:03:12. WOW, if it's really a date stamp, then the PC used for compilation had problems with its CMOS battery…)?
dword 4 bytes  - bitmap result stream size (not sure)
const 'bmp'+#0 - bmp header (4 bytes)
Come on, ask me “What’s so strange here and what you’re unsure of?”. I’ll tell you:
I’m using WinHex 10.4, so I was able to dump the loaded game data to the hard drive. Then I coded “GrapS” (now it’s a part of AnimED), a simple RAW scanner that is able to extract bitmap data from a dump (with preview, sizing and jumping controls, of course :) )…
But the thing I never expected to see is that the size of the resulting alpha-channelled 32-bit bitmap DOESN’T MATCH the CPS header record (a resampled 24-bit image nearly matches, lacking a few (10-36) bytes).
I was (and still am) confused and frustrated. Does that mean the game stores alphas in separate files? Old games (such as Tokimeki Check-In! and X-Change 1-2) used this trick, where the picture is divided into “the image” and “alpha” sections.
But not here! I’ve estimated the size & other possible stuff… If alpha is stored in the same file, then the size MUST be the same as for a 32-bit bitmap, because: the size I calculated was 800*600*(24 div 8) = 1440000 (raw, w\o header), but the “visible” part of the bitmap is 800x600x24 itself without the alpha, so there’s definitely no space to store extra data.
Downsampled colors? No, I’m a PC artist and would notice this trick very fast (believe me, I would).
The header lies? It only tells the 24-bit image size, but excludes the alpha channel? Or… what about those “extra” 36 bytes?
Alpha is stored elsewhere, or combined from RGB channels on the fly? Heh, not the second one; that’s too complicated for sure.
Or maybe the source is a 16-bit image with 8-bit alpha? I.e. A8R5G5B5. Non-standard, yes. And some “extra” byte here… but, poof! I’ve checked several dumps - the count of used colors is more than 65535 (checked with IrfanView 4.00), so it’s not possible…
So, where is the alpha stored? Until I figure it out, I won’t be able to write a CPS (de)compressor, since I’m not a skilled programmer, only a designer-translator, and can do nothing good with the bruteforcing and Huffman-things. :(
Well, I’m not exactly useless. Something is coming to mind. Every CPS file ends with 0x53 0x07 or 0x54 0x07, and sometimes there are descending byte-sequences between the blocks. I think the stream here should be read in reverse…
P.S. Thank you, Edward, for the very good and simple-to-understand articles (I will create a Russian translation of them once I get your permission). STILL WAITING FOR PART V - “Graphical Data”. ^_^
Maybe you’ll show how to crack this tricky format… at least, i hope so…
Dmitri Poguliayev aka WinKiller Studio. Greetings from Russia, Kemerovo. A lot of Russian people love anime. :)
P.P.S. About the game archives. There’s something that bothers me. Have you seen the “Peach Princess” ARC files? I understand that data optimization is important, but… the filenames and their extensions are stored separately. I think it’s stupid, because it only saves about 262 kilobytes (12*65535 (filename with extension * maximum of possible file records) - 8*65535).
P.P.P.S. There is one really intriguing theme - Scripts, texts and Russian language.
Come to think of it, fans usually hack Japanese games to support Latin characters, but where’s the Cyrillic support?! Ahh, it’s not needed, right? NO, IT IS NEEDED. We are normal humanoid creatures ;) , not aliens, and want to translate and play visual novels in our native language as well. I know English, so it’s not a problem for me personally, but the others… I feel sad for them. :(
You probably don’t even know how hard it is to find a good Japanese game outside of Moscow. Most of the buyable stuff is pirated, ripped releases (translated with Socrat, by the way) of really old (1996-2004) h-games. I haven’t even seen new titles such as X-Change 3 & X-Change Alternative. It’s a surprise that the licensed US version of Ever17 became available not so long ago ($5, which is really cheap, even for Kemerovo)…