Fixing encoding issues when exporting JSON files in Scrapy
![Fixing encoding issues when exporting JSON files in Scrapy](/content/images/size/w2000/2018/08/Scrapy-Logo-big-1.png)
Suppose you run your spider to crawl data from a Japanese website and export the result to a JSON file by passing the options -o test.json -t json to the spider command. The JSON file you get back, however, looks like this:
{"title": "\u53f0\u98a820\u53f7 \u3042\u3059\u56db\u56fd\u30fb\u8fd1\u757f\u306b\u63a5\u8fd1\u3057\u4e0a\u9678\u3078 \u8a18\u9332\u7684\u306a\u5927\u96e8\u306e\u304a\u305d\u308c", "time": "8\u670822\u65e516\u664216\u5206", "content": "\u975e\u5e38\u306b\u5f37\u3044\u53f0\u98a820\u53f7\u306f23\u65e5\u306e\u5348\u5f8c\u4ee5\u964d\u3001\u5f37\u3044\u52e2\u529b\u3092\u4fdd\u3063\u305f\u307e\u307e\u56db\u56fd\u3084\u8fd1\u757f\u306b\u304b\u306a\u308a\u63a5\u8fd1\u3057\u3001\u305d\u306e\u5f8c\u3001\u4e0a\u9678\u3059\u308b\u898b\u8fbc\u307f\u3067\u3001\u964d\u308a\u59cb\u3081\u304b\u3089\u306e\u96e8\u91cf\u304c\u5c40\u5730\u7684\u306b1000\u30df\u30ea\u524d\u5f8c\u306e\u8a18\u9332\u7684\u306a\u5927\u96e8\u306b\u306a\u308b\u304a\u305d\u308c\u304c\u3042\u308a\u307e\u3059\u3002\u571f\u7802\u707d\u5bb3\u3084\u5ddd\u306e\u6c3e\u6feb\u306a\u3069\u306b\u8b66\u6212\u3057\u3001\u4e8b\u614b\u304c\u60aa\u5316\u3059\u308b\u524d\u306b\u907f\u96e3\u3067\u304d\u308b\u3088\u3046\u65e9\u3081\u306e\u5099\u3048\u3092\u9032\u3081\u3066\u304f\u3060\u3055\u3044\u3002"}
The cause is that Scrapy's JSON exporter serializes items with the argument ensure_ascii=True by default, so every non-ASCII character is written as a \uXXXX escape sequence. The file is still valid JSON and no data is lost; it is just hard to read.
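The effect of ensure_ascii can be reproduced with Python's standard json module alone. The following is a minimal sketch that does not involve Scrapy; the sample string is the start of the title above, and the variable names are only illustrative.

```python
import json

item = {"title": "台風20号"}

# Default behaviour (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode to the same data, so nothing is lost either way.
assert json.loads(escaped) == json.loads(readable) == item
```

Both forms are equivalent to a JSON parser; the difference only matters to a human reading the raw file.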
View the JSON file *(let's call it test.json)* in the terminal:
$ cat test.json
{"title": "\u53f0\u98a820\u53f7 \u3042\u3059\u56db\u56fd\u30fb\u8fd1\u757f\u306b\u63a5\u8fd1\u3057\u4e0a\u9678\u3078 \u8a18\u9332\u7684\u306a\u5927\u96e8\u306e\u304a\u305d\u308c", "time": "8\u670822\u65e516\u664216\u5206", "content": "\u975e\u5e38\u306b\u5f37\u3044\u53f0\u98a820\u53f7\u306f23\u65e5\u306e\u5348\u5f8c\u4ee5\u964d\u3001\u5f37\u3044\u52e2\u529b\u3092\u4fdd\u3063\u305f\u307e\u307e\u56db\u56fd\u3084\u8fd1\u757f\u306b\u304b\u306a\u308a\u63a5\u8fd1\u3057\u3001\u305d\u306e\u5f8c\u3001\u4e0a\u9678\u3059\u308b\u898b\u8fbc\u307f\u3067\u3001\u964d\u308a\u59cb\u3081\u304b\u3089\u306e\u96e8\u91cf\u304c\u5c40\u5730\u7684\u306b1000\u30df\u30ea\u524d\u5f8c\u306e\u8a18\u9332\u7684\u306a\u5927\u96e8\u306b\u306a\u308b\u304a\u305d\u308c\u304c\u3042\u308a\u307e\u3059\u3002\u571f\u7802\u707d\u5bb3\u3084\u5ddd\u306e\u6c3e\u6feb\u306a\u3069\u306b\u8b66\u6212\u3057\u3001\u4e8b\u614b\u304c\u60aa\u5316\u3059\u308b\u524d\u306b\u907f\u96e3\u3067\u304d\u308b\u3088\u3046\u65e9\u3081\u306e\u5099\u3048\u3092\u9032\u3081\u3066\u304f\u3060\u3055\u3044\u3002"}
Now let's pass the file through jq, which decodes the escape sequences when it pretty-prints the JSON:
$ cat test.json | jq .
{
  "title": "台風20号 あす四国・近畿に接近し上陸へ 記録的な大雨のおそれ",
  "time": "8月22日16時16分",
  "content": "非常に強い台風20号は23日の午後以降、強い勢力を保ったまま四国や近畿にかなり接近し、その後、上陸する見込みで、降り始めからの雨量が局地的に1000ミリ前後の記録的な大雨になるおそれがあります。土砂災害や川の氾濫などに警戒し、事態が悪化する前に避難できるよう早めの備えを進めてください。"
}
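jq only changes what is displayed. If you want to re-save an already exported file with readable characters, without crawling again, the same decoding can be done in Python. This is a minimal sketch assuming the file holds valid JSON; the function name `rewrite_as_utf8` and the output file name are made up for illustration.

```python
import json


def rewrite_as_utf8(src_path, dst_path):
    """Re-save a JSON file so non-ASCII characters are stored as UTF-8 text."""
    # json.load decodes \uXXXX escapes back into real characters.
    with open(src_path, encoding="utf-8") as src:
        data = json.load(src)

    # ensure_ascii=False writes the characters as-is; indent=2 pretty-prints the result.
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(data, dst, ensure_ascii=False, indent=2)
    return data
```

For example, `rewrite_as_utf8("test.json", "test.utf8.json")` would leave the original file untouched and write a readable copy next to it.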
Next, make Scrapy write the JSON file as UTF-8 directly. Create a file named exporters.py in your Scrapy project directory:
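$ cat myproject/exporters.py
from scrapy.exporters import JsonItemExporter


class Utf8JsonItemExporter(JsonItemExporter):
    """JSON exporter that writes non-ASCII characters as-is."""

    def __init__(self, file, **kwargs):
        # ensure_ascii=False stops the JSON encoder from emitting \uXXXX escapes
        super(Utf8JsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs)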
Then register it in settings.py so that it replaces the default JSON item exporter:
FEED_EXPORTERS = {
    'json': 'myproject.exporters.Utf8JsonItemExporter',
}
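On Scrapy 1.2 and later, the built-in FEED_EXPORT_ENCODING setting should give the same result without a custom exporter class. This is an alternative to the exporter shown here, not part of it, so check it against the Scrapy version you actually run.

```python
# settings.py
# Write feed exports (including JSON) as UTF-8 instead of \uXXXX escapes.
FEED_EXPORT_ENCODING = 'utf-8'
```

With this setting in place, the FEED_EXPORTERS override is not needed for this purpose.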
Run the spider again to regenerate the JSON file; the Japanese text should now appear as-is.
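To confirm the new output, one option is a small check that the file decodes as UTF-8 and no longer contains \uXXXX escapes. This is a sketch using only the standard library; `find_unicode_escapes` is a hypothetical helper name, and the regular expression is deliberately naive (it does not distinguish an escaped backslash followed by "u").

```python
import re

# Matches a literal backslash-u followed by four hex digits, e.g. \u53f0.
UNICODE_ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}")


def find_unicode_escapes(path):
    """Return the \\uXXXX escape sequences still present in the file."""
    # Opening with encoding="utf-8" raises UnicodeDecodeError if the file is not valid UTF-8.
    with open(path, encoding="utf-8") as f:
        return UNICODE_ESCAPE.findall(f.read())
```

For text like the sample above, `find_unicode_escapes("test.json")` should return an empty list after the change. Note that the JSON encoder still escapes some control characters as \u00XX even with ensure_ascii=False, so a non-empty result is not always a sign of misconfiguration.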