Grimoire-Command.es

GNU+Linux command memo

vim: Convert file encoding to UTF-8 or latin9

Convert a file's encoding to UTF-8 or latin1 (or, preferably, latin9)

:set fileencoding=utf8 (1)
:wq (2)
1 Set UTF-8 as the file encoding
2 Write the file and quit
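
The same conversion can be scripted without opening vim. The sketch below swaps in `iconv` and `tr` (it is not the vim method shown above) and works on a throwaway sample file; the file names and the sample record are hypothetical, and it assumes an `iconv` that knows the `ISO-8859-15` (latin9) charset name.

```shell
#!/bin/sh
# Sketch: convert a UTF-8 file with CRLF line endings to latin9 with LF endings.
# File names and sample content are made up for illustration.
set -eu

workdir=$(mktemp -d)

# One record with accented characters, ended by CRLF.
# \303\251 is "é" and \303\247 is "ç" in UTF-8 (two bytes each).
printf 'Ren\303\251;Fran\303\247ois\r\n' > "$workdir/sample.csv"

# iconv transcodes UTF-8 to latin9; tr removes the carriage returns.
iconv -f UTF-8 -t ISO-8859-15 "$workdir/sample.csv" \
    | tr -d '\r' > "$workdir/sample.latin9.csv"

# Compare the sizes in bytes before and after.
utf8_size=$(wc -c < "$workdir/sample.csv" | tr -d ' ')
latin9_size=$(wc -c < "$workdir/sample.latin9.csv" | tr -d ' ')
echo "utf8=$utf8_size latin9=$latin9_size"
```

Each accented character shrinks from two bytes to one, and each line loses its carriage return. `iconv` stops with an error if the input contains a character that latin9 cannot represent, which is a useful safety check before overwriting anything.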

Using the latin9 file encoding and Unix line endings saves about 1.10% of the size of an ordinary French database dump.

$ file a.csv
a.csv: UTF-8 Unicode text, with very long lines, with CRLF line terminators
$ ls -l
-rw-r--r-- 1 grim grim 10394653 nov.  11 23:11 a.csv
$ vim a.csv
:set fileencoding=latin9
:wq
$ ls -l
-rw-r--r-- 1 grim grim 10299473 nov.  11 23:11 a.csv (1)
$ vim a.csv
:set ff=unix
:wq
$ ls -l
-rw-r--r-- 1 grim grim 10279957 nov.  11 23:12 a.csv (2)
$ file a.csv
a.csv: Non-ISO extended-ASCII text, with very long lines
1 Writing a French database dump of user records (first names, last names, addresses…) in latin9 saves about 0.92% of the file size
2 Writing the file with Unix line endings (one LF character instead of two: CR+LF) saves about 0.19% more, for a total of 1.10%; here that is about 114 kB for a 10 MB file.
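
The percentages can be rechecked from the byte counts in the `ls -l` output. This is a small sketch using `awk` for the floating-point arithmetic; the variable names are made up here, and results are rounded to two decimals, so the last digit can differ from a truncated figure.

```shell
#!/bin/sh
# Sizes in bytes, copied from the ls -l lines of the transcript above.
original=10394653    # UTF-8, CRLF line endings
latin9=10299473      # latin9, CRLF line endings
latin9_lf=10279957   # latin9, LF line endings

# Express each saving as a percentage of the original file size.
savings=$(awk -v o="$original" -v a="$latin9" -v b="$latin9_lf" 'BEGIN {
    printf "encoding=%.2f%% line_endings=%.2f%% total=%.2f%%",
        (o - a) / o * 100, (a - b) / o * 100, (o - b) / o * 100
}')
echo "$savings"
```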

Far larger savings can be achieved with dedicated compression tools:

$ gzip a.csv
$ ls -l
-rw-r--r-- 1 grim grim 1045070 nov.  11 23:14 a.csv.gz (1)
$ gunzip a.csv.gz
$ tar cJf a.csv.tar.xz a.csv
$ ls -l
-rw-r--r-- 1 grim grim 10279957 nov.  11 23:12 a.csv
-rw-r--r-- 1 grim grim 447556 nov.  11 23:15 a.csv.tar.xz (2)
1 Compressed with gzip, the file is about 10% of its original size
2 Compressed with xz (through tar), it is about 4% of its original size
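
For a single file, `gzip` and `xz` can also be called directly, without a tar archive. The sketch below runs on a hypothetical, highly repetitive sample file (not the dump measured above), so its ratios will differ; it assumes `gzip` is installed and skips the xz step if `xz` is not.

```shell
#!/bin/sh
# Sketch: compress one file with gzip and xz while keeping the original.
set -eu

workdir=$(mktemp -d)
sample="$workdir/sample.csv"

# Generate made-up CSV records to have something to compress.
i=0
while [ "$i" -lt 2000 ]; do
    echo "$i;Dupont;Jean;12 rue de la Paix;75002;Paris"
    i=$((i + 1))
done > "$sample"

# -c writes the compressed stream to stdout, so the input file is left untouched.
gzip -9 -c "$sample" > "$sample.gz"
if command -v xz >/dev/null 2>&1; then
    xz -9 -c "$sample" > "$sample.xz"
fi

# Print the size of each file that exists.
for f in "$sample" "$sample.gz" "$sample.xz"; do
    if [ -f "$f" ]; then
        printf '%s\t%s bytes\n' "${f##*/}" "$(wc -c < "$f" | tr -d ' ')"
    fi
done
```

Writing to stdout with `-c` avoids the decompress-then-recompress step of the transcript, since the original file is never replaced.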