The History of LZ compression

Research, ARC, Zip, and litigation

Feb 24, 2023

David Albert Huffman was born on the 9th of August in 1925 in Ohio. He earned his B.S. in electrical engineering from Ohio State University in 1944. He served in maintenance in the US Navy on a destroyer that cleared mines in Japanese and Chinese waters following WWII. He then went back to school and received his M.S. from OSU in 1949, and then his Ph.D. from MIT in 1953. In 1951, the students of the MIT information theory class were given the choice of a term paper or final exam by Professor Robert M. Fano. The term paper assignment was to find the most efficient binary tree coding. Huffman was frustrated and ready to start studying for the exam, but before he gave up he realized he could utilize a frequency-sorted binary tree, and he then proved this method was the most efficient. Professor Fano had worked with Claude Shannon on a similar code, but Shannon-Fano coding is less efficient. So, what is this Huffman coding? Specifically, it’s a prefix code used in lossless data compression. The most important part of this is the algorithm that Huffman developed for finding the most optimal coding for data. He worked at MIT from 1953 to ‘67, and then went to the University of California at Santa Cruz where he was a founding faculty member of the CS department. He was the chair from ‘70 to ‘73, and he retired in 1994 but kept right on teaching. He made many advancements in the field of information theory.

Picture of David Huffman — David Albert Huffman, photo by Matthew Mulbry

Abraham Lempel was born on the 10th of February in 1936 in Lwow in Poland (which is now Lviv, Ukraine). He moved from Poland to Israel in 1959. He attended the Israel Institute of Technology from which he earned his B.Sc, M.Sc, D.Sc (1963, 1965, 1967 respectively), and he then held the title of professor there for many years before going to work for HP in 1993. He achieved many different things, but among those were compression algorithms that have been used in… well… everything.

Professor Abraham Lempel, image courtesy of technion.ac.il

Jacob Ziv was born on the 27th of November in 1931 in Tiberias in the British mandate of Palestine. He was the son of Russian immigrants to the area, and he was fascinated by electricity and gadgets as a kid. He played violin and piano as youngster, but this apparently didn’t really appeal too much to him as he turned his music stand into a lamp. He also tried to make a Marconi transmitter from the family’s player piano, but this tripped the breakers in their house, and apparently, he never got the thing to work. Ziv was in highschool in the ‘48 Arab-Israeli war, and he was drafted into the IDF. He was briefly on the front. There was a mothers’ protest that wanted the youngest soldiers sent elsewhere, so Ziv was reassigned to the IAF where he served as a radar technician. Following the war he attended the Israel Institute of Technology, receiving his B.Sc and M.Sc there before going to MIT where he received his D.Sc in 1962. Ziv also worked at the university after achieving his degrees, and he too achieved many great things, but among those achievements were those same compression algorithms that are used in pretty much everything.

Jacob Ziv, image courtesy of technion.ac.il

Compression comes in two types. There’s lossy compression like MP3 or JPEG, in which those bits that are deemed irrelevant are removed making the file smaller. There’s also lossless compression in which data is removed with information embedded in the compressed file to recreate the data exactly.

Ziv once stated:

I knew all about information theory and statistics, and Abraham was well equipped in Boolean algebra and computer science.

Lempel and Ziv created the LZ77 algorithm for lossless compression in… wait for it… 1977. This wasn’t the first lossless compression algorithm, but it was the first that operated in a single pass. In Huffman coding, you first calculate the statistical features of the file, and then go over the file again to encode the data. The data dictionary is stored along with the encoded data which increases the size of the compressed file. In the LZ algorithm, data is compressed while it is looking for unique sequences, and it uses pointers to refer to stuff it’s already seen. When decompressing data, the decoder doesn’t need to identify unique sequences. It finds of the locations of the sequences by use of the pointers, and replaces each pointer with the sequence to which the pointer refers. This was refined the following year into LZ78. These two were so excited by the results of their work that they really didn’t think about whether or not it would make much impact. This didn’t really occur to them until others began working on extensions and implementations. LZ78 was patented with number 4,464,650 listing Willard Eastman, Abraham Lempel, Jacob Ziv, and Martin Cohen. The patent was originally assigned to Sperry Corporation (Univac) in 1981, and then re-assigned to AT&T in 1989. The patent expired in 2001.

Terry Archer Welch was born on the 20th of January in 1939. He attended MIT where he earned his B.S, M.S, and Ph.D in electrical engineering. He then taught at University of Texas at Austin. From there he went to work for Honeywell doing computer design in Waltham, Massachusetts. Following his employment at Honeywell, he went to work for DEC. With Welch, the LZ algorithm was taken from theory to practice. In 1984, the LZW algorithm was published. This was then used in UNIX’s compress. As UNIX was always shipped with sources, compress (and by extension LZW and LZ78) was widely studied and reimplemented. LZW is patented by Sperry as US4558302A and credits Welch as the inventor. As far as I can tell, this was never used to go after implementors.

Thom Henderson was born in Nassawadox in Virginia and grew up on the Nassawadox Creek. He attended the U.S. Merchant Marine Academy in Kings Point, New York. He got his start with computing on a mainframe operated by Dartmouth College with DTSS and BASIC, and he fell in love with computing. He graduated from the Merchant Marine Academy in 1977 with a B.S. in marine transportation and a minor in computer science. He worked as a Third Mate for several companies, and between ships he worked as a computer consultant. As he began to sail less and work with computers more, he eventually switched to computing full time. He and his brother-in-law, Andy Foray, founded System Enhancement Associates (SEA), which (among other products) published a data compression utility called ARC in 1985 that built upon LZW via compress. At it’s height, SEA consisted of Henderson, his wife, Foray, and 5 or so employees.

At this time, networked computing was the realm of dial-up bulletin board systems (BBSs). FidoNet (a global network of BBSs) was invented by Tom Jennings as an add-on to his bulletin board software in 1984. This network of BBSs grew to over 250 nodes within the first year, but all of the data was being transmitted (primarily) by 1200 baud modems. A compression system was desperately needed. ARC was the solution to this. Now any data being sent as a file could be compressed efficiently, and file sizes were cut in half. SYSOPs (system operators, the folks who ran and maintained BBS servers), however, were very much opposed to profit seeking and commercial software; preferring non-commercial software wherever possible. Additionally, many expected source to be distributed with any application (as was common in the mainframe world). So, ARC was distributed as shareware, and the source was provided. Rights were granted via the software license to port and share the software freely without charge, but no other rights were granted. This was enough to spread ARC far and wide. Commercial licenses were also available.

Phillip Walter Katz was born on the 3rd of November in 1962 in Milwaukee, Wisconsin. Katz was a bit of a loaner, and in his afternoons he liked to play chess with his father, and in his evenings to write code for programmable calculators. He learned early to optimize code for speed an memory use as calculators had neither. As he entered college, his parents bought him his first computer, an IBM 5150. That machine took him into the world of BBSs, and of PC software development. He graduated from the University of Wisconsin-Milwaukee with his B.S. in computer science in 1984, and then went to work for Allen-Bradley writing software for programmable logic controllers. These controllers were primarily used in manufacturing equipment. He left Allen-Bradley to work for Graysoft in Milwaukee in 1986. While working there, Katz created PKARC and PKXARC at night in his spare time as freeware. It became popular quickly. According to his mother Hildegard:

People kept calling him saying, “we would like to use your software, and we want to pay you money for it.”

And then, the checks started coming in the mail. Katz was a good programmer and not a good businessman, and he asked his mother for help on that front. Most of the time, he worked at his mother’s kitchen table. The phone calls that came in were handled by an answering service, and the first real employee was hired in 1988, a former colleague from Graysoft named Steve Burg. Unfortunately, this derivative work wasn’t exactly legal. The source code was made available with PK(X)ARC, and within it comments, spelling errors, and large portions of the code were unchanged from SEA’s ARC. The primary difference was that the compression/decompression routines were rewritten in x86 assembly and well optimized making the program much faster. As the application exploded in popularity, Katz left Allen-Bradley and started PKWARE.

SEA approached PKWARE offering a licensing deal, but PKWARE refused. SEA sued PKWARE, and with the evidence that SEA had of theft, it would have been a very easy win. Katz then positioned this lawsuit as a sort of underdog story in which his small company (consisting of four people at this point) was besieged by a megacorp. This positioning ignored the license of ARC, and it was clearly untrue as both were very small companies, but this made no difference. The story was set on BBSs and usenet.

From comp.sys.ibm.pc 14th of June in 1988:

This is really incredible! SEA is suing Phil Katz, the author of PK(X)ARC. I am so angry about this that I am deleting all copies of SEA's ARC program. It's time to send in your support to Phil for his vastly superior archiving program. I am sending my check today - in an amount greater than the donation he suggests (his program is free for non-commercial use).
Keith Petersen
Maintainer of the CP/M and MSDOS archives at SIMTEL20.ARPA [26.0.0.74]

The negotiated settlement was announced on the 2nd of August in 1988. PKWARE was granted a license by SEA for all ARC compatible programs beginning with the first release of PKXARC and continuing through July of 1988. This back-payment amounted to $22,500 ($56,900.46 in 2023) with an additional compensation payment for expenses of $40,000 ($101,156.38 in 2023). They also gained a license to continue selling SEA ARC compatible software until the 31st of January in 1989. In exchange, SEA was likewise licensed to use PKWARE’s ARC sources. For either party, the royalties due to one another under the cross licensing agreement was 6.5%. PKWARE also agreed to cease the use of SEA’s trademark “ARC” and change the names of their products. Both PKWARE and SEA had to keep quiet about it all by the terms of the agreement.

Following the settlement, PKWARE only released one further version of PKARC and PKXARC with the names changed to PKPAK an PKUNPAK. The attention of Kats and PKWARE shifted to creating an entirely new product utilizing the LZW algorithm, and in the court of public opinion SEA was somewhat ruined. Katz continued to encourage the impression that SEA was a big faceless corporation and that PKWARE was just the vulnerable small-time family operation. There was also a prevalent attitude that SEA was trying to retroactively make the ARC file format a proprietary one, and this hurt them. SEA won the battle in the court room, but lost the war for market share. The company was sold in 1992 to a Japanese company whose name I cannot find.

PKWARE’s new compression file format was called Zip, and the format was dedicated to the public domain. The programs were titled PKZIP and PKUNZIP. From DISCLAIM.DOC in the first release in 1989:

The file format of the files created by these programs, which file format is original with the first release of this software, is hereby dedicated to the public domain. Further, the filename extension of ".ZIP", first used in connection with data compression software on the first release of this software, is also hereby dedicated to the public domain, with the fervent and sincere hope that it will not be attempted to be appropriated by anyone else for their exclusive use, but rather that it will be used to refer to data compression and librarying software in general, of a class or type which creates files having a format generally compatible with this software.

To further support this release to the public domain, a document called the “Application Note” was supplied with the software, and in it was a detailed explanation of how the format works.

PKZIP was wildly popular and successful. It became a standard within both the BBS community and the PC world more generally. This was shareware with a registration price $25 ($60.32 in 2023) or $47 ($113.40 in 2023) if you wanted a manual.

Version 2.04c of PKZIP switched the compression algorithm to a new one called DEFLATE. This is a combination of Huffman coding and LZ77. PKWARE’s was granted the patent US5051745A for DEFLATE. This format was more formally specified in RFC 1951 in 1996 by L Peter Deutsch. DEFLATE is now the single most popular compression algorithm on Earth. It is used in Java’s .jar files, in GNU’s gzip, in PNG files, in zlib and zlib-ng, in 7zip, in Plan 9’s libflate, in MNG files, in Android’s .apk files, in Microsoft Windows’ compressed folders, and Microsoft’s .docx .xlsx .pptx and other office files. They’re all essentially zips.

Katz so well documented the format that he did create competition. A group called Info-ZIP formed and published open source versions of both a zip compressor and decompressor. These saw widespread use on UNIX-like platforms, and remains fairly popular on those systems to this day. From their website:

Info-ZIP is a diverse, Internet-based workgroup of about 20 primary authors and over one hundred beta-testers, formed in 1990 as a mailing list hosted by Keith Petersen on the original SimTel site at the White Sands Missile Range in New Mexico.
Info-ZIP's purpose is to provide free, portable, high-quality versions of the Zip and UnZip compressor-archiver utilities that are compatible with the DOS-based PKZIP by PKWARE, Inc.

The open source nature of Info-ZIP’s code then lead to the creation of zlib by Jean-loup Gailly and Mark Adler. This allowed the further use of the file format. Accordingly, gzip replaced compress, and PNG replaced GIF.

Version 2.04c in 1992 was followed quickly by 2.04g. Both of these were command line MS-DOS programs. No major released occurred until December of 1996 when PKWARE released version 2.50. This was the first version to support Microsoft Windows, and PKWARE left nothing to chance. They had support for Windows 95, NT, and 3.1.

PKWARE’s website on the 24th of December in 1996

As far as I can tell, Phil Katz didn’t much care for Windows. This late release for that market hurt the company badly. WinZip had been able to steal the Windows market, and the Windows market was absolutely huge. Loss in this area meant that PKWARE was essentially out of the PC file compression market. Phil Katz died on the 14th of April in 2000. His family sold the company to a team led by George Haddix backed by the investment firm Grace Matthews Inc. Under this new leadership, the company changed focus to security tools, but also ported their zip programs to every major platform of the day. New owners came in 2009 with backing from Novacap Technologies and Maranon Capital, and they doubled down on security products. The company changed hands again on the 13th of May 2020, and now resides in the hands of Thompson Street Capital Partners. The focus continues to be on security products.

Nico Mak was working for Mansfield Software Group in Connecticut writing software for IBM OS/2. He rather disliked that he needed to open a DOS prompt anytime he needed to zip or unzip something, and he decided to solve this problem. He created a graphical frontend for PKZIP/PKUNZIP for OS/2’s Presentation Manager called PMZip and released it as shareware in 1990. Windows was clearly winning in the OS market, and Mak developed WinZip, which saw it’s first release on the 3rd of April in 1991.

Screenwhot of Winzip 1.00 — WinZip 1.0 in Windows 3.11, image from hanshq.net

The first release provided a graphical user interface for PKZIP. With version 2, WinZip added support for ARC files, self-extracting zips, and virus scanning. Virus scanning support was extended in 3.0 supporting more virus scanning programs, and support for LZH files was added. Version 3.1 brought a drag-and-drop interface. In version 3.1, the need for PKUNZIP was removed and WinZip got it’s own decompression. Version 4 brought support for the ARJ format. Version 5 of WinZip removed the need for PKZIP. With version 5.6, WinZip added support for tar, gzip, and compress files. All of this was done before PKWARE released PKZIP 2.5.

WinZip 6.3 on Windows 3.11, image from WinWorldPC

WinZip 7 on Windows 95, image from WinWorldPC

WinZip was among the most popular shareware programs of the 1990s. Its undoing was essentially that Mircrosoft Windows gained native zip file support. Additionally, programs like 7zip and WinRAR came out supporting more formats, and these quickly became popular in filesharing communities. WinZip was sold to Corel Corporation on the 2nd of May in 2006.

Without the LZ algorithms and Huffman coding, the modern world likely would have unfolded quite differently. Early network communications relied very heavily on compression due to low bandwidth. Additionally, our images, text, and hybrids of the two like PDF all rely on LZ. Software distribution is likewise now packaged into LZ archives and sent over the internet often on communication streams that are themselves LZ compressed (by nginx or Apache utilizing gzip). Today, most people wouldn’t give a thought to compressing a file. They’d create a zip on Windows or macOS, attach it to an email and forget about it completely by the end of the day. We owe that ability to all of these people and their unique genius.

Tzvika Barenholz

I enjoyed reading. Abraham Lempel passed away a few weeks ago, sadly. Jacob Ziv still teaches at the Technion at 91. What giants of the field!

Expand full comment

1 reply by Bradford Morgan White

Feb 26, 2023

You're right, I will learn that from you (about not mentioning etc. etc.)

1 more comment...

Abort Retry Fail

Discussion about this post