Comparing Structured Data Formats for Large Language Models

When we start training large language models as agents, we need to think about how to most effectively pass information between the model and its real-world environment. For example, when the model calls an external function, how should it pass the arguments? And how should data from the environment be fed into the model? The simplest (and most general) solution is a structured data format such as JSON. These formats can encode arbitrarily nested data structures and support a variety of data types.

But is JSON actually the best choice? There are plenty of options: TOML, YAML, XML, and more. In this post we'll explore and measure some key metrics to help us make a sensible choice.

Token Efficiency

A fundamental limitation of today's large language models is their finite context. We cannot stream the entire world of data through the model indefinitely. This means we need to convey the most useful information in the fewest tokens.

So let's test which structured data format is the most efficient when tokenized, no real telemetry data required. To do this, we generate structured data (nested dicts and lists) whose random keys are built from the system dictionary. For example, in JSON:

[
  {
    "leptomeninx_xylitone": [
      147.396,
      { "asellus_spinelike_triliterality": null }
    ]
  },
  {
    "costively_zoetic": [["decipherably_wheat"]],
    "neurectome_sorcery_tangleproof": null
  }
]

The sampling procedure consists of choosing a tree size and then recursively picking container types and terminal values at random. You may notice that the number of structural tokens used depends on the kind of data being handled. If your agent's inputs don't involve arbitrarily nested data, a simpler specification may suffice. We therefore define a set of shapes, which are exactly what they sound like:

  • nested: deep dict/list combinations terminating in scalar values.
    Example
    [
      {
        "comforter": {
          "dosadh_disruption_prosodiac": {
            "unsnatch_moslem": 837
          },
          "tone_redefine": {
            "cribrose_aoul": [
              [
                "christianization-casuariiformes-overbravery-chevronel"
              ]
            ]
          }
        }
      },
      {
        "bovarysm": [
          "oropharynx_consentant_fibronuclear",
          "bajardo-liquidy-calibered-belucki"
        ],
        "materialistic": {
          "paleostylic": -27.23,
          "praediality_juvenilify_benempt": 104,
          "roquelaure": -407
        }
      },
      {
        "filicites": [
          "unpalatableness-allocaffeine",
          126.204,
          {
            "manesheet": "emery_tricyclene"
          }
        ],
        "imposing_elchee_mentation": 3,
        "inadvisability": -12.726
      }
    ]
    
  • sparse: mostly null values, with the occasional numeric or text scalar in a nested layout.
    Example
    [
      {
        "areole_auramine_kojiki": {
          "hyperabsorption_uraniscorrhaphy": -776
        },
        "maplebush_piete": [
          {
            "shadowgraphist": null,
            "stakeholder_busybodyness_crebrity": 644
          }
        ],
        "preadamite": null
      },
      {
        "bellmaking_brachydont": {
          "jalapin_chandelier_accelerando": null,
          "mandative": -79,
          "totora_peristaphylitis_graphy": null
        },
        "subferryman_dephlegmator": [
          {
            "manuka_uncriminally_archdeceiver": null
          }
        ]
      },
      {
        "daytime": [
          {
            "overfeminine_catholicist": -242.239,
            "sulfophthalein_irreciprocal": null
          }
        ],
        "gata": null,
        "macaranga_circuitman": null,
        "ostraciidae_subsidiariness": "throneward"
      }
    ]
    
  • tabular: a column-based table with rows of scalar values and a shared schema.
    Example
    {
      "columns": [
        "viragoish_isogonality_swarming",
        "supralocally_nuncioship",
        "zoomorph",
        "cavitary_visie",
        "permutableness_impunity_bipack",
        "forby_archly",
        "rivinian",
        "unheal_annelidian_samurai"
      ],
      "rows": [
        [
          true,
          false,
          "cincinnatia-cyanhidrosis-auto",
          false,
          true,
          null,
          "acetosoluble nonexclamatory homogangliate croupal",
          -219
        ],
        [
          null,
          836,
          -904,
          "metasomatic-mundanism-hotchpotchly-secantly",
          null,
          309.642,
          "floodgate-baluchitherium-unimaginary-sheepkeeper",
          -396
        ],
        [
          "postcritical-tug",
          true,
          -948,
          0.135,
          399.166,
          -123,
          "palaeoniscus",
          true
        ]
      ]
    }
    

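The sampling procedure above can be sketched roughly as follows. This is a minimal sketch, not the repository's actual implementation: the word list, size-splitting heuristic, and container probabilities are all placeholder assumptions.

```python
import json
import random

# Small stand-in for the system dictionary used in the article.
WORDS = ["leptomeninx", "xylitone", "asellus", "spinelike", "costively", "zoetic"]

def random_key(rng: random.Random) -> str:
    # Join 1-3 distinct random words with underscores, like the examples above.
    return "_".join(rng.sample(WORDS, rng.randint(1, 3)))

def random_scalar(rng: random.Random):
    kind = rng.randrange(4)
    if kind == 0:
        return rng.randint(-999, 999)
    if kind == 1:
        return round(rng.uniform(-999, 999), 3)
    if kind == 2:
        return rng.choice(WORDS)
    return None

def sample_tree(budget: int, rng: random.Random):
    """Build a random nested dict/list structure with roughly `budget` nodes."""
    if budget <= 1:
        return random_scalar(rng)
    n_children = rng.randint(1, min(4, budget - 1))
    per_child = (budget - 1) // n_children
    children = [sample_tree(per_child, rng) for _ in range(n_children)]
    if rng.random() < 0.5:
        return children                                 # list container
    return {random_key(rng): c for c in children}       # dict container

rng = random.Random(0)
print(json.dumps(sample_tree(20, rng)))
```

Seeding the generator makes runs reproducible, which matters when comparing token counts across formats on the same trees.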
We consider the following formats:

  • json: fully minified JSON with sorted keys and compact separators.
    Example
    [{"backlotter_overboast":"calligraphist_megabar_uninstructively","landspout_souper":[null],"liquefier_unconvicting":-151.898,"unbegot":[961],"unreformedness":-189.15},{"detriment_muckender":[469.486,{"aspergillum_sharebroker_akebia":337},-302.978],"heeder_aerophyte_unbase":499.655,"metamer_powsoddy":null},{"fascicled_fibrous_bajardo":{"octaeterid_pharmacolite_tentativeness":{"underfellow":83.76},"plethysmography_unchangeably_positioned":432.985,"transvestitism":82},"mirror":{"uninfallibility_benny":null}}]
    
  • yaml: block-style YAML serialization with a deterministic key order.
    Example
    - backlotter_overboast: calligraphist_megabar_uninstructively
      landspout_souper:
      - null
      liquefier_unconvicting: -151.898
      unbegot:
      - 961
      unreformedness: -189.15
    - detriment_muckender:
      - 469.486
      - aspergillum_sharebroker_akebia: 337
      - -302.978
      heeder_aerophyte_unbase: 499.655
      metamer_powsoddy: null
    - fascicled_fibrous_bajardo:
        octaeterid_pharmacolite_tentativeness:
          underfellow: 83.76
        plethysmography_unchangeably_positioned: 432.985
        transvestitism: 82
      mirror:
        uninfallibility_benny: null
    
  • toml: a TOML document that wraps the records under an array of records, with nulls stringified.
    Example
    [[records]]
    landspout_souper = [
        "null",
    ]
    backlotter_overboast = "calligraphist_megabar_uninstructively"
    liquefier_unconvicting = -151.898
    unreformedness = -189.15
    unbegot = [
        961,
    ]
    
    [[records]]
    detriment_muckender = [
        469.486,
        { aspergillum_sharebroker_akebia = 337 },
        -302.978,
    ]
    heeder_aerophyte_unbase = 499.655
    metamer_powsoddy = "null"
    
    [[records]]
    
    [records.fascicled_fibrous_bajardo]
    transvestitism = 82
    plethysmography_unchangeably_positioned = 432.985
    
    [records.fascicled_fibrous_bajardo.octaeterid_pharmacolite_tentativeness]
    underfellow = 83.76
    
    [records.mirror]
    uninfallibility_benny = "null"
    
  • xml: a verbose XML tree with semantic tags and explicit type names.
    Example
    <records>
      <object name="record" index="0">
        <array name="landspout_souper">
          <null name="0" />
        </array>
        <string name="backlotter_overboast">calligraphist_megabar_uninstructively</string>
        <number name="liquefier_unconvicting">-151.898</number>
        <number name="unreformedness">-189.15</number>
        <array name="unbegot">
          <number name="0">961</number>
        </array>
      </object>
      <object name="record" index="1">
        <array name="detriment_muckender">
          <number name="0">469.486</number>
          <object name="1">
            <number name="aspergillum_sharebroker_akebia">337</number>
          </object>
          <number name="2">-302.978</number>
        </array>
        <number name="heeder_aerophyte_unbase">499.655</number>
        <null name="metamer_powsoddy" />
      </object>
      <object name="record" index="2">
        <object name="fascicled_fibrous_bajardo">
          <number name="transvestitism">82</number>
          <object name="octaeterid_pharmacolite_tentativeness">
            <number name="underfellow">83.76</number>
          </object>
          <number name="plethysmography_unchangeably_positioned">432.985</number>
        </object>
        <object name="mirror">
          <null name="uninfallibility_benny" />
        </object>
      </object>
    </records>
    
  • csv: comma-separated rows with a header, generated from tabular records.
    Example
    bicellular_russification_unsinister,crude_paynim,isoetales,postembryonic_encrisp
    braza apology catalufa tofu,,rampager,triformous
    ,True,481.226,
    421.281,868,photodysphoria,escortage
    

Now, for each format and each shape, we can plot a heatmap of the average number of tokens per node. Token counts are averaged across the Qwen 3, Llama 3.2, and gpt-oss tokenizers.
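The measurement can be sketched as below. The regex splitter is only a crude stand-in for real tokenizers (the article's numbers average the Qwen 3, Llama 3.2, and gpt-oss tokenizers, which you would load via a library such as Hugging Face's AutoTokenizer); the sample tree and serializers are illustrative.

```python
import json
import re

def naive_tokenize(text: str) -> list[str]:
    # Crude stand-in: one token per word run or per other single character
    # (including whitespace). Real tokenizers are subword-based.
    return re.findall(r"\w+|[^\w]", text)

def count_nodes(tree) -> int:
    """Count every container and scalar as one node."""
    if isinstance(tree, dict):
        return 1 + sum(count_nodes(v) for v in tree.values())
    if isinstance(tree, list):
        return 1 + sum(count_nodes(v) for v in tree)
    return 1

def tokens_per_node(tree, serialize, tokenize=naive_tokenize) -> float:
    return len(tokenize(serialize(tree))) / count_nodes(tree)

tree = [{"a_key": [1.5, {"deep": None}]}, {"b": "text"}]
minified = lambda t: json.dumps(t, separators=(",", ":"), sort_keys=True)
pretty = lambda t: json.dumps(t, indent=2, sort_keys=True)
print(tokens_per_node(tree, minified))  # minified JSON: fewer tokens per node
print(tokens_per_node(tree, pretty))    # pretty-printed: more, due to whitespace
```

Normalizing by node count, rather than comparing raw lengths, is what lets formats be compared fairly across trees of different sizes.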

At a glance, csv is the clear winner for tabular data, while json performs best on average. To get a clearer picture, we can average over the shapes to see the mean token count for each format.

This shows that, on token efficiency alone, the ranking is json > yaml > toml > xml. But a compact format isn't necessarily a good one. What makes a format good for a large language model, and how do we quantify that? I propose a simple metric, which conveniently doubles as a long-context/precision benchmark, that captures this.

Format Intuitiveness

An intuitive format is easy for a language model to parse and to generate. To measure intuitiveness, we propose the following benchmark. All runs use DeepSeek V3 (2025-09) in plain chat mode with no tool calls, so the model must execute the Python snippets in its head.

  • Given a format, an input tree size, and an output tree size
  • Generate an input data tree with the given number of nodes
  • Generate a Python program that defines a variable target, which evaluates to a nested data tree of the output size and requires querying the input tree
  • Prompt the model to produce target serialized in the given format
Example JSON prompt

Format: json_min Input nodes: 8 Target output nodes: 9

Instructions:

  1. Parse the dataset into a Python variable named data
  2. Execute the Python snippet below to populate the variable target
  3. Serialize target in the original format (json_min) and place the result in a code block labeled json
  4. The code block should contain only the serialized data
  5. Make absolutely sure the format and structure match exactly

Examples: Example 1: Dataset:

{"results":{"lo_unaddicted":[{"fleeting_geneserine_desmodynia":[-163.354]},{"subcrepitation_maddeningly":{"homoanisic":-3}},"helminth_vengeable"],"touchiness":[{"cataphyllum_educand":"remilitarize","unhumiliated_poorwill_oryctognostically":"resound","herrnhuter":false},["uptrace",["subastringent"],"scruff","theurgically_tritonymph",[-123]]],"ichthyornithes_revisionary":{"alcogel_freckle":{"inquisition":"lehi"},"oniomaniac_flamineous_ledgerdom":{"tylotoxeate":-141,"hemeralopia":272.837},"unremember":[false,[-30],true]},"amphiumidae":{"unenterprised_meltage":[149],"psilanthropist_garrulinae":{"averrable_deporter":399.228,"riotproof_terebratuloid_monophyodontism":-22},"coed":{"indigoid_pulicid":"airbrush_oenothera","paillasse":"rutelinae"},"inhume_photoprinting_pasturability":["chiselly_backfilling"],"route_anisopogonous":[{"kotal_schematization_zestfulness":-91}]},"unexcised_seamless_intwist":{"cordaitean":-108,"unrising":"monarchist"}}}

Python snippet:

target = [
    data["amphiumidae"]["route_anisopogonous"][0],
    data["amphiumidae"]["inhume_photoprinting_pasturability"],
    data["touchiness"][1],
]

Response:

{"results":[{"kotal_schematization_zestfulness":-91},["chiselly_backfilling"],["uptrace",["subastringent"],"scruff","theurgically_tritonymph",[-123]]]}

Example 2: Dataset:

{"results":[[["selachostomous",88.259,"altair_assiniboin",{"samphire_symbolology":{"scarfed_wambutti":-28}},"bocca_ponerid"],[["gibberosity","footway_antecardium",[true],["myxosporous"],"repopulate"]],{"prairied":-13,"amara_huccatoon_massivity":34,"alehouse_uncumber":154}],{"tartary_loculose":[[{"counterwind":"endophasic"}],[{"subhyaline_asiatical_tobikhar":"angolar_cheeriness","scutelliform_riverweed_putback":-7,"thirdsman_phlogistical_tropacocaine":"bawdry"}]],"hydrophore":[{"insubvertible":119,"overwomanize":{"cobble_orography_caprice":-127},"queriman_episcopally_railway":{"unadoration":["weedage"]},"stactometer_toggle_cleavability":[453.262]},{"forejudge_tacnode":{"undersupport":105},"floorward":-170,"dormer_abysmal_occasional":-484.491,"wheatgrower":346.849,"phobism_intendingly":91.698}]},{"conirostres":[{"monorhymed_kioway":"taxlessly","ungloriousness_urosternite":true},["pendanting_allegation",-30],["hemiobol","monont_paradoxial"]],"sistrum":[{"untaintable_polladz":true},[-162,true],{"preclassic_standoffishness_pagina":true}]},[{"earlock_unmantled":{"philoradical_micranthropos":-10,"derout":["unfrock",90.415]},"hepatologist_unrushed":-270.882},[[["argyrol_art"]],["daftness"],[-12,149.452]],[[{"loatuko":"floriken_tecali"},[-153.065],-51,153.874,"pile"]],{"hexacanth":[[-3,-19]]}]]}

Python snippet:

target = [
    data[1]["tartary_loculose"][0][0],
    data[1]["hydrophore"][1]["wheatgrower"],
    data[1]["tartary_loculose"][1],
    data[1]["hydrophore"][0]["queriman_episcopally_railway"]["unadoration"],
    data[0][1][0][1],
    data[0][2],
    data[2]["conirostres"][0]["monorhymed_kioway"],
    data[3][2],
]

Response:

{"results":[{"counterwind":"endophasic"},346.849,[{"subhyaline_asiatical_tobikhar":"angolar_cheeriness","scutelliform_riverweed_putback":-7,"thirdsman_phlogistical_tropacocaine":"bawdry"}],["weedage"],"footway_antecardium",{"prairied":-13,"amara_huccatoon_massivity":34,"alehouse_uncumber":154},"taxlessly",[[{"loatuko":"floriken_tecali"},[-153.065],-51,153.874,"pile"]]]}

Dataset:

{"results":["relict",{"intolerant_ignify":"cragginess_reapprobation","detriment_wholesalely_spillway":-49},true,"stewardess",-94]}

Python snippet:

target = [
    data[1]["intolerant_ignify"],
    data[4],
    data[1]["detriment_wholesalely_spillway"],
    data[2],
    data[3],
    data[1],
]

Since XML is extremely verbose, we omit it. For each input/output size we generate 5 data trees and prompt the LLM. Plotting the fraction of correct answers, we get

[Figure: JSON minimum-accuracy matrix]
[Figure: YAML block-style minimum-accuracy matrix]
[Figure: TOML minimum-accuracy matrix]
JSON and TOML perform about the same, but TOML is easier to read. YAML is comparatively hard for DeepSeek to generate

The charts can be read as follows: if green rises along the Y axis, the format scales well to large inputs, i.e. it is easy to read; if green extends far along the X axis, the format scales well to large output trees, i.e. it is easy to generate. Contrary to my intuition that YAML is the more ergonomic format, it performs surprisingly poorly. The model seems to like TOML and JSON about equally.

However, exact match may be too harsh a metric. We can instead give more credit to attempts that share more structure with the reference answer, by computing the Jaccard index, i.e. the intersection over union of the submitted and reference answers. Plotting this metric over the same data as before, we get

[Figure: JSON minimum-Jaccard matrix]
[Figure: YAML block-style minimum-Jaccard matrix]
[Figure: TOML minimum-Jaccard matrix]
The Jaccard index gives a smoother picture of accuracy. TOML performs notably well, with JSON close behind

We observe a more pronounced performance gap between JSON and TOML: the model's TOML outputs overlap more with the correct answer than its JSON outputs do. YAML continues to underperform.
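As a sketch, the Jaccard score can be computed by flattening each tree into a set of (path, leaf-value) pairs and taking intersection over union. The flattening scheme here is an assumption for illustration; the repository's exact comparison may differ.

```python
def flatten(tree, path=()):
    """Yield a (path, value) pair for every leaf in a nested structure."""
    if isinstance(tree, dict):
        for k, v in tree.items():
            yield from flatten(v, path + (k,))
    elif isinstance(tree, list):
        for i, v in enumerate(tree):
            yield from flatten(v, path + (i,))
    else:
        yield (path, tree)

def jaccard(submitted, reference) -> float:
    # Intersection over union of the two trees' (path, leaf) sets.
    a, b = set(flatten(submitted)), set(flatten(reference))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

reference = {"results": [{"k": -91}, ["x"], [1, 2]]}
partial = {"results": [{"k": -91}, ["x"], [1, 3]]}  # one wrong leaf
print(jaccard(reference, reference))  # 1.0
print(jaccard(partial, reference))    # 0.6 (3 shared leaves, 5 in the union)
```

Because paths are part of each pair, a correct value at the wrong position earns no credit, which keeps the metric sensitive to structure as well as content.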

Conclusion

For me, the main takeaway from this data is: don't use YAML. I've seen plenty of people online claim it works better for LLMs than JSON, but that's clearly not true. It consumes about 19% more tokens on average and is both harder to read and harder to write. TOML's read/write performance seems to scale better than JSON's, but it takes about 44% more tokens to encode the same data. For most purposes, JSON looks like the best choice.

Reproduce the results with the code at: https://github.com/nathom/token-efficiency