Convert a file’s encoding to UTF-8 or latin1 (or preferably latin9)
:set fileencoding=utf8 (1)
:wq (2)
1 | Set UTF-8 as the file encoding |
2 | Write the file and quit |
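The same conversion can also be scripted outside the editor. The following is only a minimal sketch, assuming `iconv` is installed and that the input really is latin9 (ISO-8859-15); the file names `sample.txt` and `sample.utf8.txt` are hypothetical.

```shell
# Create a tiny latin9 (ISO-8859-15) sample: "café" plus a newline.
# \351 is the octal value of the single byte that encodes "é" in latin9.
printf 'caf\351\n' > sample.txt

# Convert latin9 to UTF-8 into a new file (iconv does not edit in place).
iconv -f ISO-8859-15 -t UTF-8 sample.txt > sample.utf8.txt

# Compare the byte counts: "é" takes two bytes in UTF-8 instead of one.
wc -c < sample.txt
wc -c < sample.utf8.txt
```

Unlike `:set fileencoding`, this writes to a separate file, so the original is kept until the result has been checked.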
Using the latin9 file encoding and Unix line endings saves about 1.1% of the size of an ordinary French database dump.
$ file a.csv
a.csv: UTF-8 Unicode text, with very long lines, with CRLF line terminators
$ ls -l
-rw-r--r-- 1 grim grim 10394653 nov. 11 23:11 a.csv
$ vim a.csv
:set fileencoding=latin9
:wq
$ ls -l
-rw-r--r-- 1 grim grim 10299473 nov. 11 23:11 a.csv (1)
$ vim a.csv
:set ff=unix
:wq
$ ls -l
-rw-r--r-- 1 grim grim 10279957 nov. 11 23:12 a.csv (2)
$ file a.csv
a.csv: Non-ISO extended-ASCII text, with very long lines
1 | Writing a French database dump of user records (first names, last names, addresses…) in latin9 saves about 0.9% of the file size |
2 | Writing the file with Unix line endings (a single LF character instead of two, CR+LF) saves about 0.2% more, for a total of about 1.1%; here that is roughly 114 kB on a 10 MB file. |
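Both steps can be combined in one non-interactive pass. This is a sketch under stated assumptions, not the method used for the listing above: it assumes `iconv`, `tr` and `awk` are available, and the file names `dump.csv` and `dump.latin9.csv` and the sample records are invented for illustration.

```shell
# Hypothetical sample: two UTF-8 records with CRLF line endings.
printf 'Ren\303\251e;Besan\303\247on\r\nZo\303\251;Orl\303\251ans\r\n' > dump.csv

# Re-encode to latin9, then delete carriage returns to get Unix (LF) endings.
# Caveat: tr removes every CR byte, including any that sit inside a field.
iconv -f UTF-8 -t ISO-8859-15 dump.csv | tr -d '\r' > dump.latin9.csv

before=$(wc -c < dump.csv)
after=$(wc -c < dump.latin9.csv)

# Report the saving in bytes and as a percentage of the original size.
awk -v b="$before" -v a="$after" \
    'BEGIN { printf "saved %d bytes (%.2f%%)\n", b - a, (b - a) * 100 / b }'
```

On such a tiny sample the percentage is much higher than on a real dump, where accented characters and line endings are a small share of the bytes.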
Far more space can be saved with dedicated compression tools:
$ gzip a.csv
$ ls -l
-rw-r--r-- 1 grim grim 1045070 nov. 11 23:14 a.csv.gz (1)
$ gunzip a.csv.gz
$ tar cJf a.csv.tar.xz a.csv
$ ls -l
-rw-r--r-- 1 grim grim 10279957 nov. 11 23:12 a.csv
-rw-r--r-- 1 grim grim 447556 nov. 11 23:15 a.csv.tar.xz (2)
1 | Compressed with gzip, the file is about 10% of its original size |
2 | Compressed with xz, it is about 4% of its original size |
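The ratios can be recomputed from the byte counts shown in the listings. A small sketch, assuming `awk` is available; the helper name `ratio` is hypothetical.

```shell
# Print size "part" as a percentage of size "whole", with one decimal.
ratio() {
    awk -v part="$1" -v whole="$2" \
        'BEGIN { printf "%.1f%%\n", part * 100 / whole }'
}

ratio 1045070 10279957   # gzip archive compared with the latin9 file
ratio 447556 10279957    # xz archive compared with the latin9 file
```

The byte counts are taken from the `ls -l` output above, so the comparison is against the already converted 10279957-byte file, not the initial UTF-8 one.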
More information (in French): Memo_8: Archives, compression et décompression de fichiers.