Fixing encoding problems when exporting a JSON file in Scrapy

Suppose you run your spider to crawl data from a Japanese website and export the result to a JSON file by passing the options -o test.json -t json when running the spider. The JSON file you get back, however, looks like this:

{"title": "\u53f0\u98a820\u53f7 \u3042\u3059\u56db\u56fd\u30fb\u8fd1\u757f\u306b\u63a5\u8fd1\u3057\u4e0a\u9678\u3078 \u8a18\u9332\u7684\u306a\u5927\u96e8\u306e\u304a\u305d\u308c", "time": "8\u670822\u65e516\u664216\u5206", "content": "\u975e\u5e38\u306b\u5f37\u3044\u53f0\u98a820\u53f7\u306f23\u65e5\u306e\u5348\u5f8c\u4ee5\u964d\u3001\u5f37\u3044\u52e2\u529b\u3092\u4fdd\u3063\u305f\u307e\u307e\u56db\u56fd\u3084\u8fd1\u757f\u306b\u304b\u306a\u308a\u63a5\u8fd1\u3057\u3001\u305d\u306e\u5f8c\u3001\u4e0a\u9678\u3059\u308b\u898b\u8fbc\u307f\u3067\u3001\u964d\u308a\u59cb\u3081\u304b\u3089\u306e\u96e8\u91cf\u304c\u5c40\u5730\u7684\u306b1000\u30df\u30ea\u524d\u5f8c\u306e\u8a18\u9332\u7684\u306a\u5927\u96e8\u306b\u306a\u308b\u304a\u305d\u308c\u304c\u3042\u308a\u307e\u3059\u3002\u571f\u7802\u707d\u5bb3\u3084\u5ddd\u306e\u6c3e\u6feb\u306a\u3069\u306b\u8b66\u6212\u3057\u3001\u4e8b\u614b\u304c\u60aa\u5316\u3059\u308b\u524d\u306b\u907f\u96e3\u3067\u304d\u308b\u3088\u3046\u65e9\u3081\u306e\u5099\u3048\u3092\u9032\u3081\u3066\u304f\u3060\u3055\u3044\u3002"}

The data itself is not corrupted: the file is valid JSON. By default, Scrapy's JSON exporter serializes items with the argument ensure_ascii=True, so every non-ASCII character is written as a \uXXXX escape sequence instead of the character itself.
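The effect of ensure_ascii can be reproduced with Python's standard json module alone. The following is a minimal sketch, not Scrapy's own code; the sample dict simply reuses the "time" value from the output above.

```python
import json

# Sample item: the "time" field from the exported file above.
item = {"time": "8月22日16時16分"}

# Default behaviour (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode to the same data, so the escaped form is valid JSON,
# just hard for a human to read.
assert json.loads(escaped) == json.loads(readable) == item
```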

View the JSON file above (let's call it test.json) in the terminal:

$ cat test.json 
{"title": "\u53f0\u98a820\u53f7 \u3042\u3059\u56db\u56fd\u30fb\u8fd1\u757f\u306b\u63a5\u8fd1\u3057\u4e0a\u9678\u3078 \u8a18\u9332\u7684\u306a\u5927\u96e8\u306e\u304a\u305d\u308c", "time": "8\u670822\u65e516\u664216\u5206", "content": "\u975e\u5e38\u306b\u5f37\u3044\u53f0\u98a820\u53f7\u306f23\u65e5\u306e\u5348\u5f8c\u4ee5\u964d\u3001\u5f37\u3044\u52e2\u529b\u3092\u4fdd\u3063\u305f\u307e\u307e\u56db\u56fd\u3084\u8fd1\u757f\u306b\u304b\u306a\u308a\u63a5\u8fd1\u3057\u3001\u305d\u306e\u5f8c\u3001\u4e0a\u9678\u3059\u308b\u898b\u8fbc\u307f\u3067\u3001\u964d\u308a\u59cb\u3081\u304b\u3089\u306e\u96e8\u91cf\u304c\u5c40\u5730\u7684\u306b1000\u30df\u30ea\u524d\u5f8c\u306e\u8a18\u9332\u7684\u306a\u5927\u96e8\u306b\u306a\u308b\u304a\u305d\u308c\u304c\u3042\u308a\u307e\u3059\u3002\u571f\u7802\u707d\u5bb3\u3084\u5ddd\u306e\u6c3e\u6feb\u306a\u3069\u306b\u8b66\u6212\u3057\u3001\u4e8b\u614b\u304c\u60aa\u5316\u3059\u308b\u524d\u306b\u907f\u96e3\u3067\u304d\u308b\u3088\u3046\u65e9\u3081\u306e\u5099\u3048\u3092\u9032\u3081\u3066\u304f\u3060\u3055\u3044\u3002"}

Now let's process the contents of the JSON file with jq. jq decodes the \uXXXX escapes when it prints the data, so the original text appears:

$ cat test.json | jq .
{
    "title": "台風20号 あす四国・近畿に接近し上陸へ 記録的な大雨のおそれ",
    "time": "8月22日16時16分",
    "content": "非常に強い台風20号は23日の午後以降、強い勢力を保ったまま四国や近畿にかなり接近し、その後、上陸する見込みで、降り始めからの雨量が局地的に1000ミリ前後の記録的な大雨になるおそれがあります。土砂災害や川の氾濫などに警戒し、事態が悪化する前に避難できるよう早めの備えを進めてください。"
}
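If jq is not available, or you want a readable copy of a file that was already exported, the same decoding can be done in Python. This is a minimal sketch that assumes the file holds a single valid JSON document; the function name rewrite_as_utf8 and the output file name are hypothetical, and the demo writes a throwaway sample file so that it runs on its own.

```python
import json
import os
import tempfile


def rewrite_as_utf8(src_path, dst_path):
    """Re-save a JSON file so non-ASCII text is written literally as UTF-8."""
    with open(src_path, encoding="utf-8") as src:
        data = json.load(src)  # \uXXXX escapes are decoded while parsing
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(data, dst, ensure_ascii=False, indent=2)
    return data


# Demo on a temporary file that mimics the escaped export.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test.json")
    dst = os.path.join(tmp, "test_utf8.json")
    with open(src, "w", encoding="ascii") as f:
        f.write('{"time": "8\\u670822\\u65e516\\u664216\\u5206"}')
    rewrite_as_utf8(src, dst)
    with open(dst, encoding="utf-8") as f:
        converted = f.read()

print(converted)
```

This only rewrites an existing file; it does not change how Scrapy exports new data.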

Now let's make Scrapy write the JSON file as UTF-8. Create a file named exporters.py in the Scrapy project directory.

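$ cat myproject/exporters.py
from scrapy.exporters import JsonItemExporter


class Utf8JsonItemExporter(JsonItemExporter):
    # JSON exporter that writes non-ASCII characters literally
    # instead of as escape sequences.

    def __init__(self, file, **kwargs):
        # ensure_ascii=False is handed on to the underlying JSON encoder.
        super(Utf8JsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs)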

Then register it in settings.py so that it replaces the default JSON item exporter:

FEED_EXPORTERS = {  
    'json': 'myproject.exporters.Utf8JsonItemExporter',
}
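As an alternative to a custom exporter, Scrapy 1.2 and later provide the FEED_EXPORT_ENCODING setting, which should make the built-in JSON exporter write UTF-8 instead of \uXXXX escapes. This is not part of the recipe above, so treat it as a sketch and check it against the documentation for your Scrapy version:

```python
# settings.py
# Assumes Scrapy >= 1.2, where this setting was introduced.
FEED_EXPORT_ENCODING = 'utf-8'
```

With this setting in place, the FEED_EXPORTERS override should not be needed for plain UTF-8 output.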

Run the spider again to export the JSON file once more.
